An ATF corpus can be turned into a TEI corpus; this document describes the TEI conventions used and discusses some of the issues in this conversion process.
The ATF processor turns ATF into XTF--a multi-stream XML output which separates the transliteration, the lemmatization and multi-word analysis of phrases, named entities, measures etc. The TEI implementation converts this to a single-stream representation conformant to the TEI P5 guidelines which can be validated using a schema generated by Roma.
In a few cases, the mapping of XTF to TEI is suboptimal as a result of the lack of tags with exactly appropriate semantics or the forced use of infelicitous constructs.
Each ATF text is turned into a TEI text within a teiCorpus. Support for the kinds of information that can go in the teiHeader is weak in ATF corpora; this should be corrected.
The individual documents are available from a link on the project page for each text, under the label `Analytic View'. The reason for this choice of label is that the TEI version of the transliterations integrates the results of running various kinds of content-analyzers on the texts with the text data itself. This makes it easy to colourize the various components identified by the analyzers.
The final version of the TEI corpus is the concatenation of all of the files in a project prefaced by a TEI header which includes elements derived from the project glossaries. Thus, the TEI corpus has the potential to represent all of the project's textual and glossary data in a single file. Further developments of the XTF to TEI conversion will aim to make the TEI corpus a complete representation of the project's glossaries, texts and metadata.
The schema is very nearly vanilla TEI P5 as generated by Roma--the full text can be browsed from the Resources section below.
The only additions which have been made by hand are the definition of a simple XLink attribute set (att.xlink.attributes), and the referencing of that definition as an optional part of the name, note and title elements. This allows a few key links to be implemented directly in the browsable XML (when viewed with Firefox, at least).
A very basic header is generated in order to meet the TEI minimum requirements.
TEI div elements are used for discourse blocks (body, witnesses, document-date and others). Blocks which come before the body are placed in the TEI front section; blocks which come after the body are placed in the TEI back section.
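The placement rule above can be sketched in a few lines of Python. This is only an illustration, not the converter's actual code: the function name place_blocks, the use of a type attribute on div, and the placeholder block contents are all assumptions.

```python
import xml.etree.ElementTree as ET

def place_blocks(blocks):
    """Distribute (block_type, content) pairs over front, body and back.

    Blocks seen before the 'body' block go into <front>; blocks seen
    after it go into <back>. Recording the block type in a 'type'
    attribute on <div> is an assumption made for this sketch.
    """
    text = ET.Element("text")
    front = ET.SubElement(text, "front")
    body = ET.SubElement(text, "body")
    back = ET.SubElement(text, "back")
    seen_body = False
    for block_type, content in blocks:
        if block_type == "body":
            parent, seen_body = body, True
        else:
            parent = back if seen_body else front
        div = ET.SubElement(parent, "div", {"type": block_type})
        div.text = content
    return text

# Hypothetical block order, chosen only to exercise all three sections.
text = place_blocks([
    ("document-date", "placeholder date block"),
    ("body", "placeholder body block"),
    ("witnesses", "placeholder witnesses block"),
])
print(ET.tostring(text, encoding="unicode"))
```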
XTF structural divisions are rendered with milestone tags. In the case of the outer structural division type `surface' we use the TEI milestone tag. For column and line breaks we use cb and lb respectively.
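As a rough sketch of this scheme, the following Python emits the three kinds of empty break markers. The helper name emit_breaks and the specific unit and n attribute values are assumptions for illustration; the source only states which elements are used.

```python
import xml.etree.ElementTree as ET

def emit_breaks(parent, surface=None, column=None, line=None):
    """Append empty break markers for whichever divisions start here."""
    if surface is not None:
        # Outer division: a generic TEI milestone (attribute values assumed).
        ET.SubElement(parent, "milestone", {"unit": "surface", "n": surface})
    if column is not None:
        ET.SubElement(parent, "cb", {"n": column})
    if line is not None:
        ET.SubElement(parent, "lb", {"n": line})

div = ET.Element("div")
emit_breaks(div, surface="obverse", column="1", line="1")  # start of a surface
emit_breaks(div, line="2")                                 # next line only
print(ET.tostring(div, encoding="unicode"))
```

Because all three markers are empty elements, they can cut across the word and phrase markup without forcing the text into a nested surface/column/line hierarchy.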
Almost all of the inline markup used by XTF (more precisely, by GDL) is handled well by TEI. A few exceptions are noted here.
There is no suppliedSpan, though it would be a natural addition since there is addSpan, delSpan and damageSpan. As a result, the equivalent to square-bracketed text is implemented using paired anchor tags (it is not possible to use an anchor ... ptr pair because ptr is not allowed in w).
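One way to picture the paired-anchor approach is the sketch below, which brackets the content of a w element with a start and an end anchor. The source does not say how the two anchors are identified or linked, so the xml:id naming pattern, the type value and the function name supplied_word are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Clark-notation name for xml:id; ElementTree serializes it with the xml prefix.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def supplied_word(form, span_id):
    """Return a <w> whose content sits between two paired <anchor> tags.

    The '.start'/'.end' id suffixes and type="supplied" are assumptions
    made for this sketch, not the converter's documented output.
    """
    w = ET.Element("w")
    start = ET.SubElement(
        w, "anchor", {XML_ID: f"{span_id}.start", "type": "supplied"}
    )
    start.tail = form  # text following the opening anchor
    ET.SubElement(w, "anchor", {XML_ID: f"{span_id}.end", "type": "supplied"})
    return w

w = supplied_word("a-na", "s1")
print(ET.tostring(w, encoding="unicode"))
```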
There is no direct TEI tagging for the Assyriological practice of indicating collations (in this case, collation as in `checking of tablet' rather than collation of manuscript folios as in TEI). The XTF/TEI implementation uses a conventional mapping of flagged graphemes to TEI tags based on the corr element. The values high/medium/low are defined to be the specific equivalents of ATF flag combinations as in the following list.
? = <corr cert="low">
*? = <corr cert="medium">
* = <corr cert="high">
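The flag mapping is small enough to express as a lookup table. The Python below is a minimal sketch of it; the names FLAG_TO_CERT and corr_for are invented here, and placing the grapheme as the text content of corr is an assumption.

```python
import xml.etree.ElementTree as ET

# ATF flag combination -> value of the cert attribute on <corr>.
FLAG_TO_CERT = {"?": "low", "*?": "medium", "*": "high"}

def corr_for(grapheme, flags):
    """Wrap a flagged grapheme in <corr> with the matching cert value."""
    cert = FLAG_TO_CERT.get(flags)
    if cert is None:
        raise ValueError(f"no corr mapping for flag combination {flags!r}")
    corr = ET.Element("corr", {"cert": cert})
    corr.text = grapheme  # assumption: the grapheme is the element content
    return corr

print(ET.tostring(corr_for("ba", "*?"), encoding="unicode"))
# -> <corr cert="medium">ba</corr>
```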
The lemmatization is partly integrated in the use of the w element. We push the definition of the @lemma attribute to include the full citation-form/guide-word/POS triple that is the standard referencing mechanism between XTF texts and their corresponding Corpus-Based Dictionaries. The additional annotation that is encoded in the forms structures in XTF files is not presently included. This will be rectified in a future release, probably by defining a TEI fs (feature-structure) item for each form and including it in the corpus preamble, then referencing it using the @ana attribute on w.
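To make the extended @lemma value concrete, here is a small Python sketch. It assumes the triple is serialized as CF[GW]POS; the text above only says that the attribute holds the triple, so that layout, the helper names and the sample word are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def lemma_ref(cf, gw, pos):
    """Join citation form, guide word and POS (assumed CF[GW]POS layout)."""
    return f"{cf}[{gw}]{pos}"

def word(form, cf, gw, pos):
    """Build a <w> holding the written form, with the triple in @lemma."""
    w = ET.Element("w", {"lemma": lemma_ref(cf, gw, pos)})
    w.text = form
    return w

# Sample values chosen for illustration only.
w = word("lugal-e", "lugal", "king", "N")
print(ET.tostring(w, encoding="unicode"))
```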
Handling of orthographic forms which contain more than one grammatical word is not discussed in TEI P5. The approach taken in the XTF/TEI conversion is to wrap the entire orthographic form in the first w tag, then to emit additional w elements with empty content as hosts for the subsequent lemmata.
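The following sketch shows the shape this produces for a form with two lemmata. The function name words_for_form and the placeholder form and lemma strings are hypothetical; only the pattern (one w with content, then empty w hosts) comes from the description above.

```python
import xml.etree.ElementTree as ET

def words_for_form(form, lemmata):
    """Return one <w> per lemma; only the first carries the written form."""
    if not lemmata:
        raise ValueError("at least one lemma is required")
    elems = []
    for index, lemma in enumerate(lemmata):
        w = ET.Element("w", {"lemma": lemma})
        if index == 0:
            w.text = form  # the whole orthographic form goes in the first <w>
        elems.append(w)    # later <w> elements stay empty, as lemma hosts
    return elems

# Placeholder values, not taken from any real text.
ws = words_for_form("orthographic-form", ["lemma-one", "lemma-two"])
for w in ws:
    print(ET.tostring(w, encoding="unicode"))
```

A consumer that wants the running text can read only the element content and ignore the empty hosts, while a lemma-based search still sees every grammatical word.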
Persons are handled in conformance with TEI P5. Two lists are generated from the names glossary, a listPerson and a listNym. These are then referenced from the forename tags in the body of texts. At present the export of data from the names glossary to listPerson/listNym is not complete, but the father, gfather and ancestor properties are emitted as relation elements in the listPerson.
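A minimal sketch of how such a listPerson might be assembled is given below. The input dictionary layout, the function name list_person, the use of persName, and the choice of which person is active and which passive in each relation are all assumptions; the source states only that the three properties become relation elements.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
KIN_PROPERTIES = ("father", "gfather", "ancestor")

def list_person(persons):
    """Build a <listPerson> from {id: {"name": ..., "father": id, ...}}."""
    lp = ET.Element("listPerson")
    for pid, info in persons.items():
        person = ET.SubElement(lp, "person", {XML_ID: pid})
        ET.SubElement(person, "persName").text = info["name"]
    for pid, info in persons.items():
        for prop in KIN_PROPERTIES:
            target = info.get(prop)
            if target:
                # Assumed direction: the relative is active, the person passive.
                ET.SubElement(lp, "relation", {
                    "name": prop,
                    "active": f"#{target}",
                    "passive": f"#{pid}",
                })
    return lp

# Hypothetical glossary entries.
lp = list_person({
    "p1": {"name": "Person One", "father": "p2"},
    "p2": {"name": "Person Two"},
})
print(ET.tostring(lp, encoding="unicode"))
```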
Not yet annotated.
Not yet annotated.
18 Dec 2019 Steve Tinney
Steve Tinney, 'Text Encoding Initiative (TEI) for Oracc', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2019 [http://oracc.museum.upenn.edu/doc/about/standards/tei/]