PSUs: Phrasal Semantic Units

This page describes how L2 handles Phrasal Semantic Units (PSUs), that is, glossary entries which consist of more than one word.

Introduction

PSUs are phrases that warrant their own dictionary entry, or a specific listing under one of the headwords. Such phrases may have their own meaning which is distinct from the sum of the parts, or they may simply be idiomatic usages which it is interesting to include in the glossary.

Many PSUs are written using COFs (Compound Orthographic Forms), and this overlap can be confusing, but the two are completely separate. As far as L2 is concerned, PSUs are simply sequences of lemmata which are checked to ensure that the individual components match known criteria. COFs, on the other hand, are purely a feature of the writing interface, and have nothing to do with the interpretation of lemmata.

Lemmatizing

PSUs are lemmatized simply by lemmatizing the individual components: a special layer of L2 is responsible for identifying the phrases and linking the words together. Note the following considerations:

When lemmatizing the words it is not generally necessary to give a SENSE. However, within the L2 system the constituent words of PSUs are always associated with some SENSE of the word. As a result, when the GW of a word does not match a SENSE, it is better to give some keywords of the SENSE when lemmatizing. Also, some compounds and idioms may use different SENSEs of a word; in this case, too, it is necessary to give a SENSE when lemmatizing.
Sometimes a sequence of words should not be treated as a PSU even though the sequence is listed in the glossary as such. To prevent the lemmatizer from treating the sequence as a PSU, put ! before the first lemma to show that it is NOT a PSU in this instance. For instance, if ana[to]PRP; ṭarsi[extent]N is in your glossary with the meaning "opposite", write !ana[to]PRP; ṭarsi[extent]N when you want this phrase to keep its literal meaning, "to an extent". The mnemonic is that ! is a common boolean operator for NOT: the ! tells L2 *not* to process the word as part of a PSU.
You can use - before a lemma to omit a word (usually a MOD or AV) from the middle of a PSU, e.g., libbašu[interior]N; -ul[not]MOD; iṭâb[be(come) good]V for libba ṭiābu [be(come) satisfied] V.
You can specify the sense of an idiom, in the GW of the first Akkadian element, like this: ŠA₃.HUL = +lumnu[evil+=eclipsed state]N$lumun&+libbu[interior]N$libbi, where lumun libbi has the GW "sorrow" but in some (mostly astronomical) contexts means "eclipsed state".
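
The ! and - prefixes described above can be illustrated with a small Python sketch. This is not part of L2: the function name parse_lemmata, the Lemma class, and the regular expression are all hypothetical, and the pattern only covers the plain word[GW]POS shape used in the examples on this page (not the + and $ notation of the last item).

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: covers only the simple word[GW]POS shape shown on
# this page, not the full lemmatization syntax.
LEMMA_RE = re.compile(r"^(?P<word>[^\[\]]+)\[(?P<gw>[^\]]*)\](?P<pos>\S*)$")


@dataclass
class Lemma:
    word: str
    gw: str
    pos: str
    no_psu: bool = False   # lemma was prefixed with '!'
    skipped: bool = False  # lemma was prefixed with '-'


def parse_lemmata(lem: str) -> list:
    """Split a ';'-separated lemmatization and record '!' and '-' prefixes."""
    result = []
    for raw in lem.split(";"):
        token = raw.strip()
        if not token:
            continue
        no_psu = token.startswith("!")
        skipped = token.startswith("-")
        if no_psu or skipped:
            token = token[1:]  # drop the one-character prefix
        match = LEMMA_RE.match(token)
        if match is None:
            raise ValueError(f"unsupported lemma syntax: {token!r}")
        result.append(
            Lemma(match["word"], match["gw"], match["pos"], no_psu, skipped)
        )
    return result
```

Calling parse_lemmata on the libba ṭiābu example yields three lemmata with only the middle one, ul, flagged as skipped; calling it on the !ana ... example flags only the first lemma as no_psu.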

Glossarizing

In the glossary, a PSU has its own @entry, and in addition each of its constituents must have its own @entry. Each constituent must have the same information, including proper @form lines, as any other word: the constituent entries are ignorant of the fact that they are later gathered into PSUs.

A PSU @entry has one additional line relative to other words: a @parts specification, which gives the sequence of constituents which makes up the PSU:

@entry ēkal māšarti [review palace] N
@parts ēkallu[palace]N māšartu[inspection]N
...
@sense N review palace
@end entry

As with lemmatization, the constituents do not need an explicit SENSE unless there are multiple PSUs with the same sequence of words which differ in the SENSE of one or more of the constituents, or unless the GW does not match any of the SENSEs of the constituent. See 'Diagnostics' below for examples.

Sometimes the constituents of a PSU may be written in more than one order. In such cases, the glossary simply needs multiple @parts lines:

@entry ina pān dagālu [wait for someone] V
@parts ina[in]PRP pānu[front]N dagālu[see]V
@parts dagālu[see]V ina[in]PRP pānu[front]N
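
As a rough illustration of how the @parts lines of a PSU entry could be read, the following Python sketch collects every @parts alternative as a list of (CF, GW, POS) triples. The function read_psu_parts and the pattern SIG_RE are hypothetical names invented for this example; the pattern assumes the CF[GW]POS shape shown above and is not the parser L2 actually uses.

```python
import re

# Hypothetical pattern for one CF[GW]POS constituent; the GW may contain
# spaces, so the line cannot simply be split on whitespace.
SIG_RE = re.compile(r"([^\s\[\]]+)\[([^\]]*)\](\S+)")


def read_psu_parts(entry_text):
    """Return (headword, alternatives) for the text of one PSU @entry.

    alternatives holds one list of (cf, gw, pos) triples per @parts line,
    in the order the lines appear.
    """
    headword = None
    alternatives = []
    for line in entry_text.splitlines():
        line = line.strip()
        if line.startswith("@entry "):
            headword = line[len("@entry "):]
        elif line.startswith("@parts "):
            alternatives.append(SIG_RE.findall(line[len("@parts "):]))
    return headword, alternatives
```

For the ina pān dagālu entry this gives two alternatives with the same three constituents in different orders.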

The @form lines of PSUs also have some special characteristics. One is that the first element of a @form, the written form, may contain multiple words: in this case they are joined by underscores. The other is that they may only contain NORM entries in addition to the written form, and each of the NORMs is prefixed by its own $-sign:

@form ina_pa-ni-šu₂-nu_a-da-gal $ina $pānišunu $adaggal
@form ina_pa-ni-šu₂-nu_i-da-gal $ina $pānišunu $idaggal
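
The two characteristics just described (underscore-joined written words, $-prefixed NORMs) can be sketched in Python as follows. The function split_psu_form is a hypothetical helper written for this page, not an L2 routine, and it assumes the fields of the line are separated by whitespace.

```python
def split_psu_form(line):
    """Split a PSU @form line into (written words, NORMs).

    Assumes whitespace-separated fields: '@form', the written form, then
    only $-prefixed NORM fields, as described for PSU @form lines.
    """
    fields = line.split()
    if len(fields) < 2 or fields[0] != "@form":
        raise ValueError("not a @form line")
    written = fields[1].split("_")  # words of the written form
    norms = []
    for field in fields[2:]:
        if not field.startswith("$"):
            raise ValueError(f"PSU @form lines may only add NORMs: {field!r}")
        norms.append(field[1:])  # strip the leading $-sign
    return written, norms
```

Applied to the first line above, the sketch returns three written words and three NORMs, one per constituent.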

COFs in the written forms of PSUs are straightforward when the entire PSU is written with a single COF:

@form im-muh-hi $ina $muhhi

When the writing mixes a COF with other constituents, however, it is necessary to tell L2 how many of the NORMs of the @form line are used up by the COF. This is done by adding the special sequence _0 (underscore followed by the digit zero) for each COF-constituent after the first:

@form {na₄}NIR₂.PA_0_iṣ-ṣu-ri $hulāl $kappi $iṣṣūri
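
One way to picture the _0 convention is as a pairing of written words with NORMs, where each 0 slot hands its NORM to the preceding written word. The Python sketch below works under that assumption; align_psu_form is a hypothetical name, and treating a lone written word as covering every NORM (the single-COF case above) is likewise an assumption of this example rather than documented L2 behaviour.

```python
def align_psu_form(line):
    """Pair each NORM of a PSU @form line with the written word covering it."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "@form":
        raise ValueError("expected '@form WRITTEN $NORM ...'")
    written = fields[1].split("_")
    norms = [field.lstrip("$") for field in fields[2:]]

    # Assumption: a single written word is one COF spanning the whole PSU.
    if len(written) == 1:
        return [(written[0], norm) for norm in norms]

    if len(written) != len(norms):
        raise ValueError("written slots and NORMs differ in number")

    pairs = []
    current = None  # most recent real written word
    for slot, norm in zip(written, norms):
        if slot == "0":
            # Assumption: a 0 slot belongs to the preceding COF.
            if current is None:
                raise ValueError("_0 cannot come first")
            pairs.append((current, norm))
        else:
            current = slot
            pairs.append((slot, norm))
    return pairs
```

With the {na₄}NIR₂.PA example, hulāl and kappi are both paired with the COF and iṣṣūri with iṣ-ṣu-ri.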

L2 Diagnostics

PSU component not found in glossary

This diagnostic is generated when processing a @form line which contains COF components indicated with parentheses. The diagnostic gives the line number of the COF @form which is being processed, and indicates the spelling and an expected normalization which has not been found. Since the COF handling is not tied to entries, but to spelling and normalization combinations, the diagnostic cannot tell you which @entry it expected to find the component in.

Here is an actual example:

00lib/akk-x-neoass.glo:3167: (g2a) PSU component #2 
         i-da-a-ti=dāt[behind]PRP$dāti not found in glossary

This tells you that at line 3167 in 00lib/akk-x-neoass.glo, a defective @form line is being processed in a PSU @entry.
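
If you need to collect many such messages, the fields can be pulled apart mechanically. The Python sketch below is only an illustration based on the single example shown here: the pattern DIAG_RE and the function parse_psu_diagnostic are hypothetical, and the assumption that every message of this kind has exactly this layout is untested.

```python
import re

# Hypothetical pattern derived from the one example message on this page.
DIAG_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+): \((?P<tag>\w+)\) "
    r"PSU component #(?P<number>\d+) "
    r"(?P<written>\S+)=(?P<cf>[^\[]+)\[(?P<gw>[^\]]*)\](?P<pos>[^$\s]*)"
    r"\$(?P<norm>\S+) not found in glossary$"
)


def parse_psu_diagnostic(message):
    """Return the fields of a 'PSU component not found' message as a dict,
    or None if the message does not have the assumed layout."""
    # The message may be wrapped over two lines; collapse all whitespace.
    flat = " ".join(message.split())
    match = DIAG_RE.match(flat)
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    fields["number"] = int(fields["number"])
    return fields
```

The returned dictionary gives the glossary file and line to visit, the component number, and the written form, signature, and normalization that L2 could not find.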

The defect may be in any of several places. When processing a @form line, L2 takes each PSU component in turn, and combines the written form and the normalization from the @form line with the signature data from the relevant word in the @parts line. The number given in the error message tells you the component of the form/parts lines it is currently working on.

If more than one @parts line is given in the PSU @entry, L2 tries all of the @parts lines to find a complete set of matches before it reports errors.
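
The checking procedure described above can be sketched in Python. Everything here is a simplified model rather than L2's implementation: the glossary is reduced to a hypothetical set of (written, CF, GW, POS, NORM) tuples, the function names are invented, and reporting the failure from the first @parts line when no alternative matches is an assumption of this example.

```python
def first_missing(pairs, parts, known):
    """Return (component number, description) for the first component whose
    written/NORM pair is not known for the corresponding @parts word,
    or None when every component matches."""
    if len(pairs) != len(parts):
        return 0, "component count differs from @parts line"
    numbered = enumerate(zip(pairs, parts), start=1)
    for number, ((written, norm), (cf, gw, pos)) in numbered:
        if (written, cf, gw, pos, norm) not in known:
            return number, f"{written}={cf}[{gw}]{pos}${norm}"
    return None


def check_psu_form(pairs, alternatives, known):
    """Try every @parts alternative; report a failure only if none matches."""
    first_failure = None
    for parts in alternatives:
        failure = first_missing(pairs, parts, known)
        if failure is None:
            return None  # a complete set of matches was found
        if first_failure is None:
            first_failure = failure  # assumption: report the first line's failure
    return first_failure


# Toy data modelled on the ina pān dagālu entry; the 'known' set stands in
# for the @form lines of the constituent entries.
pairs = [("ina", "ina"), ("pa-ni-šu₂-nu", "pānišunu"), ("a-da-gal", "adaggal")]
alternatives = [
    [("ina", "in", "PRP"), ("pānu", "front", "N"), ("dagālu", "see", "V")],
    [("dagālu", "see", "V"), ("ina", "in", "PRP"), ("pānu", "front", "N")],
]
known = {
    ("ina", "ina", "in", "PRP", "ina"),
    ("pa-ni-šu₂-nu", "pānu", "front", "N", "pānišunu"),
}
```

In this toy data the entry for dagālu lacks a matching form, so check_psu_form(pairs, alternatives, known) reports component 3; adding the tuple for a-da-gal to known makes the first alternative match completely and the function returns None.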

To debug this error, visit the offending line in the glossary and look at the contexts and the individual word entries. Some common causes of this error are:

the @form line has its components out of order
the component word does not have a matching @form line
the POS on the component in the @parts line is wrong
there is no matching SENSE in the entry for the component

In the example error, the current word is dāt[behind]PRP. Unless there is a bug in L2, you will find that the expected @form line is missing, in this case:

@form i-da-ti $(ina) $dāti

Assuming that there is no typo and that the @form entry really should be there, you can now fix it by simply editing the glossary to add the @form line.

18 Dec 2019 osc at oracc dot org

Steve Tinney

Steve Tinney, 'PSUs: Phrasal Semantic Units', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2019 [http://oracc.museum.upenn.edu/doc/help/glossaries/psus/]