An ATF corpus can be turned into a TEI corpus; this document describes the TEI conventions used and discusses some of the issues in this conversion process.
The ATF processor turns ATF into XTF, a multi-stream XML output which separates the transliteration, the lemmatization, and the multi-word analysis of phrases, named entities, measures, etc. The TEI implementation converts this to a single-stream representation conformant to the TEI P5 guidelines, which can be validated using a schema generated by Roma.
In a few cases, the mapping of XTF to TEI is suboptimal as a result of the lack of tags with exactly appropriate semantics or the forced use of infelicitous constructs.
Each ATF text is turned into a TEI text within a
teiCorpus. Support for the kinds of information that can
go in the teiHeader is weak in ATF corpora; this should
be corrected.
The individual documents are available from a link on the project
page for each text, under the label `Analytic View'. The
reason for this choice for the label is that the TEI version of the
transliterations integrates the results of running various kinds of
content-analyzers on the texts with the text data itself. This makes
it easy to colourize the various components identified by the
analyzers.
The final version of the TEI corpus is the concatenation of all of the files in a project prefaced by a TEI header which includes elements derived from the project glossaries. Thus, the TEI corpus has the potential to represent all of the project's textual and glossary data in a single file. Further developments of the XTF to TEI conversion will aim to make the TEI corpus a complete representation of the project's glossaries, texts and metadata.
The schema is very nearly vanilla TEI P5 as generated by Roma; the full text can be browsed from the Resources section below.
The only additions which have been made by hand are the definition
of a simple XLink attribute set (att.xlink.attributes),
and the referencing of that definition as an optional part of the
name, note and title elements.
This allows a few key links to be implemented directly in the
browsable XML (when viewed with Firefox, at least).
A very basic header is generated in order to meet the TEI minimum requirements.
TEI div elements are used for discourse blocks (body,
witnesses, document-date and others). Blocks which come before the
body are placed in the TEI front section;
blocks which come after the body are placed in the TEI
back section.
XTF structural divisions are rendered with milestone tags. In the
case of the outer structural division type `surface' we
use the TEI milestone tag. For column and line breaks we
use cb and lb respectively.
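As a rough illustration of the mapping just described, the following Python sketch builds the three kinds of break element with the standard library. The function name, the value of the unit attribute and the labels placed in the n attribute are assumptions made for this example; they are not taken from the XTF to TEI converter itself.

```python
import xml.etree.ElementTree as ET

# TEI element used for each XTF structural division, per the text above.
BREAK_TAGS = {"surface": "milestone", "column": "cb", "line": "lb"}


def structural_break(kind, label):
    """Build an empty TEI break element for an XTF structural division.

    `kind` is one of "surface", "column" or "line"; `label` is a
    hypothetical value for the @n attribute.
    """
    if kind not in BREAK_TAGS:
        raise ValueError(f"unsupported structural division: {kind!r}")
    element = ET.Element(BREAK_TAGS[kind])
    if kind == "surface":
        # Assumption: the generic milestone records what it marks in @unit.
        element.set("unit", "surface")
    element.set("n", label)
    return element


if __name__ == "__main__":
    # Illustrative labels only.
    for kind, label in [("surface", "obverse"), ("column", "1"), ("line", "1")]:
        print(ET.tostring(structural_break(kind, label), encoding="unicode"))
```

Because all three elements are empty, they can be dropped into the token stream without disturbing the nesting of the surrounding div and w elements, which is presumably why milestones were chosen over container elements here.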
Almost all of the inline markup used by XTF (more precisely, by GDL) is handled well by TEI. A few exceptions are noted here.
TEI has no suppliedSpan element, although one would be a natural
counterpart to addSpan, delSpan and damageSpan. As a result, the
equivalent to square-bracketed text is implemented using paired
anchor tags (it is not possible to use an
anchor ... ptr pair because ptr
is not allowed in w).
There is no direct TEI tagging for the Assyriological
practice of indicating collations (in this case, collation as in
`checking of tablet' rather than collation of manuscript folios as in
TEI). The XTF/TEI implementation uses a conventional mapping of
flagged graphemes to TEI tags based on the corr element.
The values high/medium/low are defined to be the specific
equivalents of ATF flag combinations as in the following list.
? = <corr cert="low">
*? = <corr cert="medium">
* = <corr cert="high">
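The list above amounts to a three-entry lookup table. A minimal Python sketch of it follows; the function name and the sample grapheme are hypothetical, and only the three flag combinations listed above are accepted.

```python
import xml.etree.ElementTree as ET

# ATF flag combination -> value of @cert on <corr>, as listed above.
CERT_FOR_FLAGS = {"?": "low", "*?": "medium", "*": "high"}


def corr_element(grapheme, flags):
    """Wrap a flagged grapheme in a TEI <corr> element with the mapped @cert."""
    if flags not in CERT_FOR_FLAGS:
        raise ValueError(f"no TEI mapping for flag combination: {flags!r}")
    corr = ET.Element("corr", {"cert": CERT_FOR_FLAGS[flags]})
    corr.text = grapheme
    return corr


if __name__ == "__main__":
    # "ba" is an arbitrary sample grapheme.
    print(ET.tostring(corr_element("ba", "*?"), encoding="unicode"))
    # prints <corr cert="medium">ba</corr>
```

Rejecting unlisted combinations, rather than guessing a certainty value for them, keeps the sketch from asserting a mapping that the conventions do not define.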
The lemmatization is partly integrated through the use of the
w element. We extend the definition of the
@lemma attribute to include the full
citation-form/guide-word/POS triple that is the standard referencing
mechanism between XTF texts and their corresponding Corpus-Based
Dictionaries. The additional annotation that is encoded in the
forms structures in XTF files is not presently included.
This will be rectified in a future release, probably by defining a TEI
fs (feature-structure) item for each form and including
it in the corpus preamble, then referencing it using the
@ana attribute on w.
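To make the extended @lemma value concrete, here is a small Python sketch that packs a citation-form/guide-word/POS triple into a single attribute on w. It assumes the triple is serialised as CF[GW]POS; that serialisation, the function names and the sample word are assumptions of this example rather than a statement of what the converter emits.

```python
import xml.etree.ElementTree as ET


def lemma_reference(cf, gw, pos):
    """Join citation form, guide word and part of speech into one string.

    Assumption: the triple is written as CF[GW]POS.
    """
    return f"{cf}[{gw}]{pos}"


def word_element(form, cf, gw, pos):
    """Build a TEI <w> element whose @lemma carries the whole triple."""
    w = ET.Element("w", {"lemma": lemma_reference(cf, gw, pos)})
    w.text = form
    return w


if __name__ == "__main__":
    # Sample values chosen for illustration only.
    print(ET.tostring(word_element("lugal", "lugal", "king", "N"), encoding="unicode"))
    # prints <w lemma="lugal[king]N">lugal</w>
```

Keeping the triple in one attribute means a single string is enough to look the word up in the corresponding glossary, at the cost of using @lemma for more than a bare lemma.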
Handling of orthographic forms which contain more than one
grammatical word is not discussed in TEI P5. The approach taken in
the XTF/TEI conversion is to wrap the entire orthographic form in the
first w tag, then to emit additional w
elements with empty content as hosts for the subsequent lemmata.
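The approach described above can be sketched in Python as follows. The function name is hypothetical, and the form and lemma strings in the usage example are placeholders, not real corpus data.

```python
import xml.etree.ElementTree as ET


def words_for_form(form, lemmata):
    """Return one <w> per lemma for a single orthographic form.

    The first <w> carries the whole orthographic form as its text; each
    later <w> is left empty and serves only as a host for its lemma.
    """
    if not lemmata:
        raise ValueError("at least one lemma is required")
    words = []
    for index, lemma in enumerate(lemmata):
        w = ET.Element("w", {"lemma": lemma})
        if index == 0:
            w.text = form
        words.append(w)
    return words


if __name__ == "__main__":
    # Placeholder form and lemmata, for illustration only.
    for w in words_for_form("FORM", ["LEMMA-1", "LEMMA-2"]):
        print(ET.tostring(w, encoding="unicode"))
```

A consumer that wants the running text can ignore the empty w elements, while one that counts lemmata sees every grammatical word exactly once.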
Persons are handled in conformance with TEI P5. Two lists are
generated from the names glossary, a listPerson and a
listNym. These are then referenced from the
forename tags in the body of texts. At present the
export of data from the names glossary to
listPerson/listNym is not complete, but the
father, gfather and ancestor
properties are emitted as relations in the
listPerson.
Not yet annotated.
Not yet annotated.
18 Dec 2019 Steve Tinney
Steve Tinney, 'Text Encoding Initiative (TEI) for Oracc', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2019 [http://oracc.museum.upenn.edu/doc/about/standards/tei/]