Glossaries in Oracc provide a place to collect information about words in a way that can be used by the lemmatizer to create an exhaustive index of words and forms in a corpus. The formal name of the glossary type in Oracc is the Corpus-Based Dictionary, and the abbreviation CBD is often used in Oracc to refer to this type.
An Oracc glossary consists of a brief header and a series of entries. Each entry consists of a series of tags, most of which have content that continues on the rest of the line. Each tag must begin a new line, and blank lines are not allowed within entries. When the content is too long to fit comfortably on one line, any tag may be continued onto one or more continuation lines. Continuation lines must begin with one or more space characters.
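The continuation rule can be illustrated with a minimal Python sketch. This is not part of the Oracc tooling: the function name join_continuations and the sample text are hypothetical, and the sketch assumes that a continuation is joined to the preceding line with a single space and that blank lines only occur between entries.

```python
def join_continuations(text: str) -> list[str]:
    """Merge continuation lines (lines starting with a space) into the tag line before them."""
    logical: list[str] = []
    for raw in text.splitlines():
        if not raw.strip():
            # Blank line: assumed to occur only between entries, so skip it.
            continue
        if raw.startswith(" ") and logical:
            # Continuation line: append its content to the previous logical line.
            logical[-1] = logical[-1] + " " + raw.strip()
        else:
            # A new tag line.
            logical.append(raw.rstrip())
    return logical


# Illustrative glossary fragment (hypothetical content).
sample = "@entry a [water] N\n@sense N water,\n  fluid\n@end entry\n"
lines = join_continuations(sample)
```

After this step each item in lines holds one complete tag with its content, which makes the later tags easier to process one at a time.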
Each glossary article begins with an @entry tag. This gives the Citation Form (CF), Guide Word (GW) and Part-of-Speech (POS) for the word, in the form CF [GW] POS, i.e., the GW is in square brackets, and there is a space before the open bracket and after the close bracket. The CF [GW] POS must be unique within a glossary.
The @entry tag may be preceded by characters from the ACD module, and may be followed by an exclamation mark, !, indicating priority, or an asterisk, *, indicating that the entry is a phrase consisting of a sequence of words which retain their own meanings. PSUs, sequences of words with a meaning distinct from the individual parts, are not marked with the asterisk.
Each article must end with @end entry written on its own line.
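As a rough illustration of the CF [GW] POS format and the uniqueness requirement, the following Python sketch parses @entry lines and reports duplicate keys. The names ENTRY_RE, EntryHeader, parse_entry and duplicate_keys are hypothetical, as are the sample entries. The sketch assumes that the ! or * flag directly follows the tag and that any ACD characters form a run of non-space characters before the @; neither detail is confirmed by the text above.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional

# Assumed shape: [ACD chars]@entry[!|*] CF [GW] POS
ENTRY_RE = re.compile(
    r"^(?P<prefix>[^@\s]*)@entry(?P<flag>[!*]?)\s+"
    r"(?P<cf>\S.*?) \[(?P<gw>[^\]]+)\] (?P<pos>\S+)\s*$"
)


class EntryHeader(NamedTuple):
    prefix: str  # characters before the tag (ACD module), possibly empty
    flag: str    # "!" for priority, "*" for a phrase entry, or ""
    cf: str      # Citation Form
    gw: str      # Guide Word
    pos: str     # Part-of-Speech


def parse_entry(line: str) -> Optional[EntryHeader]:
    """Return the parsed header, or None if the line is not a well-formed @entry."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    return EntryHeader(**m.groupdict())


def duplicate_keys(lines: list[str]) -> list[tuple[str, str, str]]:
    """List every CF [GW] POS key that occurs more than once."""
    counts = Counter(
        (h.cf, h.gw, h.pos) for h in map(parse_entry, lines) if h is not None
    )
    return [key for key, n in counts.items() if n > 1]


header = parse_entry("@entry a [water] N")
```

Because the pattern requires a space on both sides of the bracketed GW, a line such as @entry a[water]N is rejected rather than silently accepted.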
The @bases tag is used to list the bases for Sumerian words. Bases are separated by a semicolon (;), e.g., du; du₃. Bases may be specified as a primary base with alternate transliterations of the base in parentheses, in which case multiple alternate bases may be separated by a comma followed by a space (, ), e.g., sag₉ (sa₆, ša₆, šag₉).
Stems can be associated with bases by giving the stem, preceded by an asterisk and followed by a space, immediately before a primary base, e.g., *du du₇; *dudu du₇-du₇.
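The @bases conventions described above (semicolon-separated bases, parenthesised alternates, and asterisk-marked stems) can be sketched in Python as follows. The names Base, BASE_RE and parse_bases are hypothetical, the function operates on the content after the @bases tag, and it reflects only the rules stated here, not the behaviour of the actual Oracc glossary processor.

```python
import re
from typing import NamedTuple, Optional


class Base(NamedTuple):
    stem: Optional[str]    # stem given as *stem before the base, if any
    primary: str           # primary base
    alternates: list[str]  # alternate transliterations from the parentheses


# One base item: optional "*stem ", the primary base, optional " (alt, alt, ...)".
BASE_RE = re.compile(
    r"^(?:\*(?P<stem>\S+) )?(?P<primary>[^\s()]+)(?: \((?P<alts>[^()]*)\))?$"
)


def parse_bases(content: str) -> list[Base]:
    """Parse the content of a @bases tag into a list of Base records."""
    bases: list[Base] = []
    for item in content.split(";"):
        item = item.strip()
        if not item:
            continue
        m = BASE_RE.match(item)
        if m is None:
            raise ValueError(f"unrecognised base: {item!r}")
        alts = m.group("alts")
        bases.append(
            Base(m.group("stem"), m.group("primary"), alts.split(", ") if alts else [])
        )
    return bases


simple = parse_bases("du; du₃")
with_alts = parse_bases("sag₉ (sa₆, ša₆, šag₉)")
with_stems = parse_bases("*du du₇; *dudu du₇-du₇")
```

The three calls use the examples given in the text, so the resulting records can be compared directly with the prose description.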
The ACD module also provides facilities for remapping an existing base to a new base.
The @form tag links instances in the corpus to the entries in the glossary. The first element is always a written form as it might occur in the corpus but without any half, square, angle or round brackets. Following the written form, various elements may occur, depending in part on the language of the corpus. These elements are always marked by one or more characters which indicate the type of the element. No space is allowed between the type character and the data element.
… @bases part of the entry.
… -ma in inim-ma. The deconstruction of the grapheme gives the consonant which continues the grapheme followed by the vowel which is normally a morpheme or morpheme constituent.
… @form lines for COFs and PSUs, as described in the relevant pages.
The @form tag may be followed by an exclamation mark, !, indicating priority.
The @sense tag gives meanings for the word as well as a Part-of-Speech for the sense, which is the Effective Part-of-Speech (EPOS) in lemmatization.
The @sense tag may be preceded by characters from the ACD module, and may be followed by an exclamation mark, !, indicating priority.
An optional (in fact, rarely given) guideword specific to the sense, or SGW, may follow the @sense tag in square brackets: this is used by the lemmatizer to provide a strict match when testing the inline sense for a match to the glossary sense.
The first required element after the @sense tag is a valid Part-of-Speech. After the POS comes the meaning for the sense.
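The order of elements on a @sense line (optional SGW in square brackets, then the POS, then the meaning) can be sketched in Python as follows. The names SENSE_RE, Sense and parse_sense are hypothetical, as is the sample content. The sketch assumes that the ! flag directly follows the tag, and it leaves the set of valid POS values to the caller rather than guessing at the list Oracc accepts.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: [ACD chars]@sense[!] [[SGW]] POS meaning
SENSE_RE = re.compile(
    r"^(?P<prefix>[^@\s]*)@sense(?P<flag>!?)\s+"
    r"(?:\[(?P<sgw>[^\]]+)\]\s+)?(?P<pos>\S+)\s+(?P<meaning>.+)$"
)


class Sense(NamedTuple):
    prefix: str         # characters before the tag (ACD module), possibly empty
    flag: str           # "!" for priority, or ""
    sgw: Optional[str]  # sense-specific guideword, if given
    pos: str            # Part-of-Speech of the sense (the EPOS)
    meaning: str        # meaning text


def parse_sense(line: str, valid_pos: Optional[set[str]] = None) -> Optional[Sense]:
    """Parse a @sense line; optionally check the POS against a caller-supplied set."""
    m = SENSE_RE.match(line)
    if m is None:
        return None
    if valid_pos is not None and m.group("pos") not in valid_pos:
        raise ValueError(f"invalid POS in sense: {m.group('pos')!r}")
    return Sense(**m.groupdict())


plain = parse_sense("@sense N water")
with_sgw = parse_sense("@sense [fluid] N water, liquid", valid_pos={"N", "V"})
```

Passing valid_pos is optional; when it is omitted, any non-space token in the POS position is accepted.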