This document provides an overview of language-specific annotation conventions for Sumerian used in Oracc. We focus here on the data-entry view of linguistic annotation, giving only enough additional technical background to ensure that ATF files can be annotated correctly.
Sumerian lemmatization works by drawing information from four places (unless a project has been configured in a non-standard manner):
00lib/sux.glo
This structure results in the phenomenon of clean files producing the occasional harvestable item. What is happening is that the lemmatizer finds a match in ePSD, and that satisfies it. But harvesting is done by reference only to the project glossary, so the harvester finds something new to give notice about.
Sumerian uses the same basic parts of speech as Akkadian, but with the addition of three important ones:
As a matter of principle, it is best to do as little as possible when lemmatizing Sumerian--let the machine do the work wherever possible. In the normal case, it is enough to lemmatize with a citation form and a sense--all of the other components will get filled in for you by either ePSD or the morphological analyzer.
In particular, ePSD's approach to adjectives in Sumerian is that almost all of them are simply verbs, whose nonfinite forms can be used in a variety of ways including modifiers and participles. We generally do not tag these using the Extended POS (EPOS), even where a word has more than one listed POS in ePSD. For example, you can lemmatize nir as nir[lordly] or nir[lord] and the lemmatizer will find the right POS/EPOS combination for you.
In the absence of proper documentation on how to do ePSD-style morphology, it is best simply to leave X for the morphology field for now.
For Sumerian two non-standard fields are used: the BASE and the CONT (continuation). Additionally, morphological analysis is implemented for Sumerian, and the lemmatizer automatically calls the morphological analyzer (SMA, Sumerian Morphological Analyzer) to fill in the BASE, CONT and MORPH fields if possible. This means that most Sumerian lemmatization is the same as for any other language.
For some languages part of the written instance gives a particular base form of the word, for example in Sumerian mu-un-du3 the base is du3. It is normally unnecessary to specify this unless the form being lemmatized uses a base which is new. When given as part of the lemmatization this must come after the closing square bracket and is introduced by a forward slash (/) character:
1. mu-un-du6
#lem: +du[build]/du6
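The slash convention above can be illustrated with a small sketch. This is not part of any Oracc tool: the function name `split_base` and the regular expression are hypothetical, and the leading plus sign seen in the example is accepted but not interpreted, since its meaning is not covered in this section. Strings that carry further fields after the base are rejected rather than guessed at.

```python
import re

# Hypothetical pattern for CF[SENSE] with an optional /BASE, as described above.
LEM_RE = re.compile(
    r"^\+?"                      # optional leading plus sign (accepted, not interpreted)
    r"(?P<cf>[^\[\]]+)"          # citation form
    r"\[(?P<sense>[^\]]*)\]"     # sense inside square brackets
    r"(?:/(?P<base>[^\s+=#]+))?" # optional base, introduced by a forward slash
    r"$"
)


def split_base(lem):
    """Return citation form, sense and base (or None) from a lemmatization string."""
    match = LEM_RE.match(lem.strip())
    if match is None:
        raise ValueError(f"not a plain CF[SENSE]/BASE string: {lem!r}")
    return {"cf": match["cf"], "sense": match["sense"], "base": match["base"]}


if __name__ == "__main__":
    print(split_base("+du[build]/du6"))
    print(split_base("nir[lord]"))
```

Because the base is optional, the same function accepts the normal case in which only a citation form and a sense are given.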
Two special conventions are implemented for bases which combine the writing of a non-base component with a base component, like the use of be2 for the preverbal b and the base e.
a-r·e
m°e
For Sumerian the continuation is the grapheme which encodes the final consonant of the base as its initial consonant. It is normally unnecessary to specify this unless the form being lemmatized is new. When specified, it must come immediately after the base, separated from it by a plus sign and then a hyphen (+-):
1. du-ga
#lem: dug[speak]/du+-ga=g+a
In Sumerian data-input notation, verbal prefixes are separated from the base by a colon (:). Sumerian morphology conventions are described further in separate documentation. A simple Sumerian example looks like this: mu.na.ni:~. A noun with a terminative case-marker looks like this: ~,esze (the character which introduces noun morphemes is the comma (,); subsequent noun morphemes are separated by periods).
Morphology may be included directly on the lemmatization, following any POS. In such cases, the separator is a hash character (#) with no surrounding spaces, and the morphology string follows directly afterwards: du[build]V#mu.na.ni:~. This is mainly needed for syllabic writings.
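As a rough illustration of the notation just described, the following sketch splits an inline lemmatization at the hash character and then breaks the morphology string into its parts. The function names are hypothetical. Two assumptions are made that the text only implies: that the tilde marks the slot of the base, and that verbal prefixes are separated from one another by periods, as in mu.na.ni:~. Verbal suffixes are not described here and are not handled.

```python
def parse_morph(morph):
    """Split a morphology string such as 'mu.na.ni:~' or '~,esze'."""
    prefixes = []
    rest = morph
    if ":" in rest:
        # Everything before the colon is the verbal prefix chain.
        head, rest = rest.split(":", 1)
        prefixes = head.split(".")
    if "," in rest:
        # The comma introduces noun morphemes; later ones are separated by periods.
        base_slot, tail = rest.split(",", 1)
        noun_morphemes = tail.split(".")
    else:
        base_slot, noun_morphemes = rest, []
    return {
        "prefixes": prefixes,
        "base_slot": base_slot,  # assumed to be the '~' placeholder
        "noun_morphemes": noun_morphemes,
    }


def split_inline(lem):
    """Split 'du[build]V#mu.na.ni:~' into the lemma part and parsed morphology."""
    head, sep, morph = lem.partition("#")
    return head, (parse_morph(morph) if sep else None)


if __name__ == "__main__":
    print(split_inline("du[build]V#mu.na.ni:~"))
    print(parse_morph("~,esze"))
```

A lemmatization without a hash character yields `None` for the morphology, which matches the normal case where the field is filled in automatically.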
Two common forms of annotation carried out manually are disambiguation and augmentation. Disambiguation is necessary when a form gives part of a morpheme but that part could be analyzed in more than one way. The three cases that are recognized in Sumerian are: Locative-Terminative vs. Ergative when the form ends in /e/; Locative vs. Genitive when a nominal form ends in /a/; and Nominalizer vs. Copula when a verbal form ends in /a/.
Augmentation is used when no part of the morpheme is preserved in the writing of the form; it is an easy way of adding unexpressed morphemes such as Sumerian /ak/ and other case-markers. Augmentation is discussed further below.
Disambiguation can be given as part of the sense immediately before the closing square bracket in a lemmatization string; these disambiguations refer to choices available in the lexicon. For Sumerian a common lexical disambiguation is the choice between intransitive and transitive in labile verbs or so-called causatives:
\i = select intransitive
\t = select transitive
We specify the unmarked case to be intransitive so that, e.g., gub[stand] needs no further annotation when intransitive; when transitive it should be annotated as gub[stand\t].
Further, if distinct words are used in the sense of a verb which has intransitive and transitive senses it is not necessary to add the transitivity or the POS. Thus, dadara[tied] and dadara[bind] would be taken as intransitive and transitive respectively without further annotation.
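The transitivity markers can be read off a lemmatization string along the following lines. This is a minimal sketch with hypothetical names, not the lemmatizer's own logic. It only inspects the marker inside the square brackets; it does not consult the lexicon, so it cannot detect cases such as dadara[bind], where the sense word itself selects the transitive reading.

```python
import re

# Sense, optionally followed by \i or \t immediately before the closing bracket.
SENSE_RE = re.compile(r"\[(?P<sense>[^\]\\]*)(?:\\(?P<flag>[it]))?\]")

LABELS = {"i": "intransitive", "t": "transitive"}


def transitivity(lem):
    """Return the sense, the transitivity reading, and whether it was marked explicitly."""
    match = SENSE_RE.search(lem)
    if match is None:
        raise ValueError(f"no [SENSE] found in {lem!r}")
    flag = match["flag"]
    return {
        "sense": match["sense"],
        # The unmarked case is treated as intransitive, as specified above.
        "transitivity": LABELS[flag or "i"],
        "explicit": flag is not None,
    }


if __name__ == "__main__":
    print(transitivity(r"gub[stand]"))
    print(transitivity(r"gub[stand\t]"))
```

The `explicit` field lets a caller distinguish a defaulted reading from a marked one before deciding whether a lexicon lookup is still needed.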
Ambiguous forms which are susceptible to multiple analyses even within the same CFGW can be disambiguated using the syntax \<DISAMBIGUATOR>. The particular disambiguators are language-specific; examples in Sumerian include:
\a = select locative form
\k = select genitive form
\l = select locative-terminative form (default)
\e = select ergative form
\a = nominalizer
\m = copula (am)
Note that this kind of disambiguation only needs to be carried out when a corpus is destined for syntactic analysis.
For examples see the next version of Gudea 1 below.
Augmentation consists of adding to morphological sequences. Augmentation is currently primitive and consists exclusively of the ability to append morphemes at the end of the morphology given in the forms file. This is probably only useful for the common case of adding unexpressed case-markers to Sumerian annotation; for more complex cases, the entire morphology string must be given inline as described under 'Inlining' above.
Augmentation is given after POS and the optional disambiguation, but before any morphology string; it is indicated using the plus sign (+).
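The ordering of fields described here (POS, then optional disambiguation, then augmentation, then any inline morphology) can be sketched as a single pattern. The names below are hypothetical and the pattern is an assumption based only on the conventions in this document: in particular, POS is assumed to be a run of capital letters, and the augmentation is kept as an opaque string rather than interpreted. The two test strings are taken from the Gudea 1 sample in this document.

```python
import re

# Hypothetical field order: CF[SENSE] POS \DISAMBIGUATOR +AUGMENTATION #MORPHOLOGY
FULL_RE = re.compile(
    r"^(?P<cf>[^\[\]]+)"
    r"\[(?P<sense>[^\]]*)\]"
    r"(?P<pos>[A-Z]+)?"            # POS, assumed to be capital letters only
    r"(?:\\(?P<disamb>[a-z]))?"    # optional disambiguator after a backslash
    r"(?:\+(?P<augment>[^#]+))?"   # optional augmentation after a plus sign
    r"(?:#(?P<morph>.+))?$"        # optional inline morphology after a hash
)


def parse_lem(lem):
    """Split one lemmatization into its fields; absent fields are None."""
    match = FULL_RE.match(lem.strip())
    if match is None:
        raise ValueError(f"unrecognized lemmatization: {lem!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_lem(r"An[]DN\k"))
    print(parse_lem("nin[lady]+.*ra"))
```

Keeping the augmentation as a raw string avoids committing to an interpretation of its internal syntax, which this document does not spell out.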
Given the Disambiguation and Augmentation conventions above, a sample text can be annotated:
&Q000887 = Gudea 1
@composite
1. {d}ba-u2
#lem: DN
2. dumu an-na
#lem: dumu[child]; An[]DN\k
3. nin-a-ni
#lem: nin[lady]+.*ra

18 Dec 2019
Steve Tinney
Steve Tinney, 'SUX: Oracc Linguistic Annotation for Sumerian', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2019 [http://oracc.museum.upenn.edu/doc/help/languages/sumerian/]