This document briefly explains the rationale for, and implementation of, one of Oracc's three entity recognisers: the Person Specification Analyzer (PSA). It includes schemas for configuration data and output. This document is a work in progress; it will certainly change as the recognisers are implemented.
Persons are a subclass of Named Entity referring, in Mesopotamian culture at least, to humans, gods and beings in between. By convention, we write PN for Personal Name, with numeric subscripts to distinguish, e.g., PN1 from PN2.
A Person Specification (person-spec) is a series of words which identify a specific individual, though the specification may be indeterminate with respect to identity. Thus, in Ur III administrative texts a transaction function such as ŋiri = "via" may be followed not only by the name of a person, but also by a professional or functional designation referring to the transactor by office. Within certain contexts, then, a person-spec may be an unadorned functional term.
Besides names and functional designations, a person-spec may include information on the individual's relatives, most commonly in the form "PN1 son of PN2", but with the expected range of possible relational designators.
A person-spec with no relational designators is called a simple person-spec. A person-spec with relational designators is called a complex person-spec. Such complex person-specs are actually person-specs which contain person-specs; these container relationships may be recursive.
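The simple/complex distinction and the recursive containment can be pictured as a small data structure. The following is a minimal sketch only: the class name `PersonSpec`, its fields, and the relation label `son-of` are hypothetical illustrations, not part of the PSA implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PersonSpec:
    """Hypothetical in-memory model of a person-spec (not an Oracc type)."""
    head: str  # a name or, in some contexts, a bare functional term
    # Each relation pairs a relational designator with a nested person-spec.
    relations: List[Tuple[str, "PersonSpec"]] = field(default_factory=list)

    @property
    def is_simple(self) -> bool:
        """A simple person-spec has no relational designators."""
        return not self.relations

    def depth(self) -> int:
        """Nesting depth: 1 for a simple person-spec, more for nested ones."""
        return 1 + max((spec.depth() for _, spec in self.relations), default=0)


# "PN1 son of PN2 son of PN3": a complex person-spec that itself
# contains a complex person-spec.
pn3 = PersonSpec("PN3")
pn2 = PersonSpec("PN2", [("son-of", pn3)])
pn1 = PersonSpec("PN1", [("son-of", pn2)])
```

Here `pn3` is simple, while `pn1` and `pn2` are complex; `depth()` is one way to make the recursion visible.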
A person-spec may also include designations of the individual's profession or customary function, and/or of the individual's role in a given transaction. In such constructs as "PN1 son of PN2, the scribe" the attribution of the profession to the correct PN may be uncertain.
Texts frequently contain relationship information expressed by anaphora, and these anaphoric references are themselves frequently discontinuous with respect to the person-spec itself. Examples of this include constructs such as "PN1 ... PN2, his father". Identifying such references, even if incompletely, is a desideratum.
Various kinds of discourse context contain information relevant to person-specs. These include the syntactic/sentence-frame context (person-spec's role is buyer/seller, etc.) and object context (person-spec's actions include sealing document).
The implementation must be:
The input data must be:
Language configuration data must be:
Input data which does not meet these constraints can be preprocessed appropriately.
head of the person-spec.
properties. Person-spec properties include the name(s) of the person, relations such as 'son-of', 'daughter-of', and profession/role designations.
forward or backward. Operators that are applied forward contribute properties to the following person-spec; those which are applied backward contribute properties to the preceding person-spec.
outer, or top-level, and inner, or lower-level.
Consider the following examples from Sumerian:
(0) Dudu dubsar
(1) Dudu dumu Dada dubsar
(2) Dudu dumu Dada dumu Didi
dubsar, "scribe", is clear: it applies to the head of the single person-spec.
(1') (1 Dudu dumu (2 Dada))
dubsar could be applied either at the inner or the outer scope. Application at the inner scope yields:
(1'a) (1 Dudu dumu (2 Dada dubsar 2))
Application at the outer scope yields:
(1'b) (1 Dudu dumu (2 Dada) dubsar)
inner or outer property on the lemmatization of the instance.
dumu as inner and dubsar as outer yields:
(2') (1 Dudu dumu (2 Dada dumu (3 Didi) 2) dubsar)
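The effect of inner versus outer scope in (1'a), (1'b) and (2') can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PSA algorithm: the `Spec` class, the `OPERATORS` table, the property names `son-of` and `scribe`, and the reading of "inner" as "innermost open person-spec" are all hypothetical, and every token without an operator entry is treated as a PN.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Spec:
    """One person-spec: a head plus its properties and nested person-specs."""
    head: str
    props: List[str] = field(default_factory=list)
    children: List[Tuple[str, "Spec"]] = field(default_factory=list)


# Hypothetical operator table: lemma -> (arity, scope, property).
# The lemmata come from the examples; the property names are invented.
OPERATORS: Dict[str, Tuple[str, str, str]] = {
    "dumu": ("binary", "inner", "son-of"),
    "dubsar": ("unary", "outer", "scribe"),  # unary, applied backward
}


def analyse(tokens: List[str], operators=OPERATORS) -> Spec:
    """Build nested person-specs from a flat sequence of lemmata."""
    chain: List[Spec] = []  # chain[0] is the outer spec, chain[-1] the innermost
    pending: Optional[Tuple[str, Spec]] = None  # binary operator awaiting its PN
    for tok in tokens:
        if tok not in operators:  # assumption: anything else is a PN
            spec = Spec(tok)
            if pending is not None:
                relation, parent = pending
                parent.children.append((relation, spec))
                pending = None
            chain.append(spec)
            continue
        if not chain:
            raise ValueError(f"operator {tok!r} has no preceding person-spec")
        arity, scope, prop = operators[tok]
        # Scope selects which open person-spec the operator attaches to.
        target = chain[-1] if scope == "inner" else chain[0]
        if arity == "binary":
            pending = (prop, target)
        else:
            target.props.append(prop)
    if not chain:
        raise ValueError("no person-spec found")
    return chain[0]


outer = analyse("Dudu dumu Dada dubsar".split())  # cf. (1'b)
inner = analyse("Dudu dumu Dada dubsar".split(),
                {**OPERATORS, "dubsar": ("unary", "inner", "scribe")})  # cf. (1'a)
nested = analyse("Dudu dumu Dada dumu Didi dubsar".split())  # cf. (2')
```

With `dubsar` marked outer the property lands on Dudu; with inner it lands on Dada. In the three-name sequence, the inner scope of `dumu` is what nests Didi under Dada rather than under Dudu.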
The configuration data for the PSA consists of a collection of property specifications keyed to lemmata and, possibly, lemma-sequences or patterns. Each language has its own configuration block. The PSA itself is language-agnostic, relying only on the information which is attached to the lemmata as a result of applying the configuration data to the input.
N.B.: In the present schema, support for multi-word PSA tokens, such as Akkadian māru ša, is not yet defined. It is, however, a recognized requirement, and the intention is to address this requirement using the Ngram mechanism.
namespace psa = "http://oracc.org/ns/psa/1.0"

start = config

config = element psa:config { lang* }

lang = element psa:lang {
  attribute psa:target-lang { xsd:NMTOKEN },
  confdata*
}

confdata = lemma | ngram

# A lemma is either a binary operator (properties for the previous and
# next person-spec) or a unary operator (one property plus a direction).
lemma = element psa:lemma {
  cfgw,
  ((binary, prop-prev, prop-next)
   | (unary, prop-this, direction)),
  scope
}

ngram = element psa:ngram { text }

cfgw      = attribute psa:cfgw  { text }
binary    = attribute psa:arity { "binary" }
unary     = attribute psa:arity { "unary" }
direction = attribute psa:dir   { "backward" | "forward" }
scope     = attribute psa:scope { "inner" | "outer" }
prop-next = attribute psa:next  { text }
prop-prev = attribute psa:prev  { text }
prop-this = attribute psa:this  { text }
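To make the schema concrete, the sketch below shows what a configuration instance might look like and how it could be read into a lookup table keyed by language and lemma. Only the element names, attribute names and namespace URI come from the schema; the language code `sux`, the `cfgw` values and all property values are invented for illustration, and `load_config` is a hypothetical helper, not part of Oracc.

```python
import xml.etree.ElementTree as ET

PSA_NS = "http://oracc.org/ns/psa/1.0"  # namespace URI from the schema

# Hypothetical configuration instance; lemma and property values are invented.
CONFIG_XML = f"""
<psa:config xmlns:psa="{PSA_NS}">
  <psa:lang psa:target-lang="sux">
    <psa:lemma psa:cfgw="dumu[child]" psa:arity="binary"
               psa:prev="son-of" psa:next="father-of" psa:scope="inner"/>
    <psa:lemma psa:cfgw="dubsar[scribe]" psa:arity="unary"
               psa:this="scribe" psa:dir="backward" psa:scope="outer"/>
  </psa:lang>
</psa:config>
"""


def load_config(xml_text):
    """Return {language: {cfgw: attributes}} with the namespace stripped."""
    q = "{%s}" % PSA_NS  # ElementTree spells qualified names as {uri}local
    table = {}
    root = ET.fromstring(xml_text)
    for lang in root.findall(q + "lang"):
        entries = table.setdefault(lang.get(q + "target-lang"), {})
        for lemma in lang.findall(q + "lemma"):
            attrs = {k.replace(q, ""): v for k, v in lemma.attrib.items()}
            entries[attrs.pop("cfgw")] = attrs
    return table


config = load_config(CONFIG_XML)
```

Note that the schema places the attributes themselves in the `psa` namespace, so they carry the prefix in the instance. A language-agnostic analyser would then need only the attribute dictionaries, not the lemmata they came from.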
The results of running the PSA on an XTF text may appear in several places:
The PSA may result in the (partial) disambiguation of PN instances, resulting in elimination of unnecessary forms from ambiguity lists.
The PSA may result in increased specificity of PN identifications; for example, a PN marked as SENSE=0, i.e., unresolved, may be able to be recategorized as a specific individual.
In the case of person-specs whose head is a PN, unary properties attached to person-specs by the PSA are retrofitted into the XCL tree as properties on the instance forms.
Relations and other structures that have more than one member are entered as linksets in the Linkbase. From there they can be harvested and integrated into the Corpus-Based Dictionaries.
Steve Tinney
Steve Tinney, 'PSA: The Person Specification Analyser', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2019 [http://oracc.museum.upenn.edu/doc/help/lemmatising/psa/]