Oracc Grapheme Sorting

OSL is sorted according to the system-wide Oracc grapheme sorting algorithm, which is defined in the GDL library. The same sort is used by PCSL and for sorting transliteration fragments such as the grammatical bases in Sumerian glossary articles. A separate page gives the reference version of the OSL sort order [/osl/signlist/SortOrder/].

A sequence is one or more graphemes which are normally separated by delimiters. In the normal case, a sort-item is created for each grapheme or delimiter and added to the list of sort-items that represents the sequence.

The exception is that determinatives lack a delimiter on one side--after the determinative for preposed determinatives and before it for postposed determinatives. Following normal disciplinary practice, determinatives are ignored when sorting, so no sort-item is created for them; any preceding or following delimiters are still added to the list of sort-items.
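
As a rough illustration of the rule for determinatives, the following Python sketch turns a transliterated sequence into the list of strings that would receive sort-items. It assumes determinatives are written in curly braces as in ATF and recognizes only a small subset of delimiters; the names TOKEN_RE and sort_tokens are hypothetical and do not come from the GDL library.

```python
import re

# Hypothetical tokenizer: a determinative in braces, a single delimiter,
# or a run of characters forming one grapheme. Only a subset of the
# delimiters (space - . + :) is recognized in this sketch.
TOKEN_RE = re.compile(r"\{[^}]*\}|[\-.+: ]|[^\-.+: {}]+")

def sort_tokens(sequence):
    """Return graphemes and delimiters in order, dropping determinatives."""
    tokens = TOKEN_RE.findall(sequence)
    # No sort-item is created for a determinative; neighbouring
    # delimiters are kept as they are.
    return [t for t in tokens if not t.startswith("{")]

print(sort_tokens("{d}en-lil₂"))
print(sort_tokens("an-{d}inana"))
```

Because a determinative has no delimiter on the side facing its host grapheme, dropping it leaves a well-formed alternation of graphemes and delimiters.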

The algorithm works by splitting a sort-item into segments and comparing the segments in turn. The segments are:

TEXT
The original grapheme or delimiter.
TYPE
An integer which is 0 for regular graphemes and for delimiters; 1 for punctuation; 2 for numbers.
BASE
The TEXT stripped of subscript digits or modifiers and lowercased.
KEY
The BASE passed through Oracc's grapheme collation sequence; the characters in KEY are remapped to conform to the delimiter and grapheme character and letter orders.
MODS
A list of zero or more modifiers of the kind expressed with @ and ~ in ATF.
INDEX
The grapheme index, if any.
REPEAT
The repeat count for a number as an integer; -1 is a flag value for no repeater.
SUFFIX
For list-patterned graphemes only; see the description below.
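
The segments can be pictured as fields of a record. The Python sketch below is a minimal illustration for simple graphemes only: the names SortItem and segment and the regular expressions are hypothetical, KEY is left out because it depends on the collation sequence, and numbers and list-patterned graphemes are not handled. An INDEX of 0 is assumed when no subscript is present, and subscript x is given the value 1000.

```python
import re
from dataclasses import dataclass, field

MOD_RE = re.compile(r"[~@][a-z0-9]+")   # modifiers as written in ATF
SUB_RE = re.compile(r"[₀-₉]+|ₓ")        # subscript digits or subscript x
SUB_TO_ASCII = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

@dataclass
class SortItem:
    text: str                                  # TEXT
    type: int = 0                              # TYPE
    base: str = ""                             # BASE
    mods: list = field(default_factory=list)   # MODS
    index: int = 0                             # INDEX
    repeat: int = -1                           # REPEAT (-1: no repeater)
    suffix: str = ""                           # SUFFIX

def segment(text):
    """Fill the segments for a simple grapheme (illustrative only)."""
    item = SortItem(text=text)
    if re.fullmatch(r"P[₀-₉]+", text):
        item.type = 1                          # punctuation
    item.mods = MOD_RE.findall(text)
    core = MOD_RE.sub("", text)
    sub = SUB_RE.search(core)
    if sub:
        digits = sub.group()
        item.index = 1000 if digits == "ₓ" else int(digits.translate(SUB_TO_ASCII))
    item.base = SUB_RE.sub("", core).lower()
    return item

print(segment("GA₂@g"))
```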

Sorting is carried out with the C library qsort function which compares two items at a time. The comparison routine, gsort_cmp, is in Oracc II's lib/gdl/gsort.c [https://github.com/oracc/oracc2/blob/main/lib/gdl/gsort.c].

Comparison steps through the segments in order. As soon as two segments differ (including the case where one grapheme has the segment and the other does not), comparison stops and a result is returned. The steps are:

  1. Compare TYPE: the type values mean that in a list of signs, regular graphemes sort first, punctuation follows in its own block, and numbers are at the end in their own block.
  2. Compare KEY character by character.

    Delimiters sort before letters. The character order for delimiters puts the delimiters that separate juxtaposed signs first, and those which represent integrative relationships second: SPACE - . + : @ × & % . The algorithm also uses NULL delimiters to influence sort order in a few cases.

    The letter order for graphemes is: ʾ a b c d e f g ŋ h ḫ i j k l m n o p q r s ś š ṣ t ṭ u v w x y z.

  3. Compare INDEX: subscript digits are mapped to ASCII digits and the sequence is converted to an integer; subscript x has an INDEX value of 1000. Graphemes with no explicit index have INDEX=0 so that X and X₁ sort as expected.
  4. Compare SUFFIX: suffixes occur only on list names. If one item has a suffix and the other does not, the one without suffix comes before the one with suffix. If both items have suffixes they are subjected to a simple string compare.
  5. Compare MODS; each MOD is a single @ or ~ item. MODs with ~ sort before MODs with @ because the ~ indicates a glyph-variant whereas a @ indicates a new sign created by modifying another sign in the ways specified by the @-modifier.

    By definition of the data-type in Oracc, MODS are ASCII alphanumeric sequences matching [~@][a-z0-9]+. They are normally compared with a simple string comparison so they sort in simple alphabetic order, but an exception is made for the modifiers that consist entirely of digits 0..9 such as @90, etc. These are sorted in numeric sequence.

  6. Compare REPEAT, the integer repeat count.
  7. If this point is reached the items are the same when lowercased; if they differ in letter case then lower case is sorted before upper.
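
The steps above can be sketched as a comparison function in Python, used through functools.cmp_to_key in the same way that qsort takes a comparison routine. This is not the code in gsort.c: the names Item, mod_key, and compare are hypothetical, NULL delimiters and digits inside KEY are not handled, and two points are assumptions made for the sketch (an item lacking MODS sorts before one that has them, and all-digit modifiers sort before alphabetic ones).

```python
from functools import cmp_to_key
from typing import NamedTuple

DELIMS = " -.+:@×&%"
LETTERS = "ʾabcdefgŋhḫijklmnopqrsśšṣtṭuvwxyz"
RANK = {c: i for i, c in enumerate(DELIMS + LETTERS)}  # delimiters first

class Item(NamedTuple):
    text: str
    type: int = 0
    base: str = ""
    index: int = 0
    suffix: str = ""
    mods: tuple = ()
    repeat: int = -1

def mod_key(mod):
    """~ before @; all-digit bodies compare numerically (assumed before letters)."""
    sigil = 0 if mod[0] == "~" else 1
    body = mod[1:]
    if body.isdigit():
        return (sigil, 0, int(body), "")
    return (sigil, 1, 0, body)

def cmp(x, y):
    return (x > y) - (x < y)

def compare(a, b):
    steps = (
        lambda i: i.type,                         # 1. TYPE
        lambda i: [RANK[c] for c in i.base],      # 2. KEY (KeyError if unlisted)
        lambda i: i.index,                        # 3. INDEX
        lambda i: i.suffix,                       # 4. SUFFIX ("" sorts first)
        lambda i: [mod_key(m) for m in i.mods],   # 5. MODS
        lambda i: i.repeat,                       # 6. REPEAT
        lambda i: [c.isupper() for c in i.text],  # 7. lower case before upper
    )
    for step in steps:
        result = cmp(step(a), step(b))
        if result:
            return result                         # stop at first difference
    return 0

items = [
    Item("AN@g", base="an", mods=("@g",)),
    Item("P₁", type=1, base="p", index=1),
    Item("AN@180", base="an", mods=("@180",)),
    Item("AN₂", base="an", index=2),
    Item("AN~a", base="an", mods=("~a",)),
    Item("AN", base="an"),
    Item("AN@90", base="an", mods=("@90",)),
    Item("an", base="an"),
    Item("A", base="a"),
]
ordered = [i.text for i in sorted(items, key=cmp_to_key(compare))]
print(ordered)
```

Note that AN@90 precedes AN@180 here because all-digit modifiers are compared as integers; a plain string comparison would reverse them.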

Punctuation in Sorting

A sign consisting of P followed by subscript digits is TYPE 1, punctuation.

Numbers in Sorting

Three categories of sign have TYPE 2, number:

Because numbers are compared with BASE first and REPEAT last, all of the numbers with the same GRAPHEME are grouped together--1(N01), 2(N01) etc., then 1(N01@f) and so on.

See also the section on mixed number/non-number compounds below.

Qualified Graphemes

Qualified graphemes such as ašₓ(AB) are treated as sequences with the value and sign separated by a NULL delimiter.
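
A minimal Python sketch of this split follows. The function name split_qualified and the use of "\0" to stand for the NULL delimiter are assumptions for illustration, and the test that excludes number graphemes such as 1(N01) is deliberately naive (it only looks for an all-digit part before the parentheses).

```python
import re

NULL = "\0"  # stand-in for the NULL delimiter (assumed representation)

def split_qualified(text):
    """Split value(SIGN) into [value, NULL, SIGN]; leave other text alone."""
    m = re.fullmatch(r"([^()]+)\((.+)\)", text)
    if m is None or m.group(1).isdigit():
        # Not qualified, or a repeat count followed by a number sign.
        return [text]
    return [m.group(1), NULL, m.group(2)]

print(split_qualified("ašₓ(AB)"))
```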

Multiplier Graphemes

Multiplier graphemes have special treatment and are sorted as though they were REPEAT(GRAPHEME), like numbers but with TYPE=0. In addition, a NULL delimiter is added after the multiplier grapheme to ensure that it sorts after the base grapheme. This means that |3×AN| sorts immediately after AN, for example.
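
The treatment of multipliers can be pictured with the following Python sketch. The function name segment_multiplier, the dictionary layout, and the "\0" stand-in for the NULL delimiter are hypothetical; only the REPEAT(GRAPHEME) reading, TYPE=0, and the trailing NULL delimiter come from the description above.

```python
import re

NULL = "\0"  # stand-in for the NULL delimiter (assumed representation)

def segment_multiplier(text):
    """Read |N×GRAPHEME| as N(GRAPHEME) with TYPE 0, followed by NULL."""
    m = re.fullmatch(r"\|([0-9]+)×(.+)\|", text)
    if m is None:
        return None  # not a multiplier grapheme
    item = {"type": 0, "repeat": int(m.group(1)), "grapheme": m.group(2)}
    return [item, NULL]

print(segment_multiplier("|3×AN|"))
```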

Listy Graphemes

List numbers are segmented slightly differently from regular graphemes although they have the same TYPE (i.e., 0). The sort algorithm considers any grapheme that fails the number-pattern test but contains 1 or more ASCII digits to be a "list", or listy grapheme. The part before the digits becomes BASE; the digits are converted to an integer and the INDEX is set to 10000 + the integer result of the digit conversion. If anything follows the digits it becomes the SUFFIX segment. A list name could also potentially have MODS, e.g., LAK001a@c.
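
The segmentation of a listy grapheme might look like the Python sketch below. The function name segment_list and the regular expressions are hypothetical, the number-pattern test is assumed to have already failed before this function is called, and BASE is lowercased in line with the general definition of BASE.

```python
import re

MOD_RE = re.compile(r"[~@][a-z0-9]+")
LIST_RE = re.compile(r"([^0-9]*)([0-9]+)(.*)")

def segment_list(text):
    """Split a list name into BASE, INDEX, SUFFIX and MODS (illustrative)."""
    mods = MOD_RE.findall(text)
    core = MOD_RE.sub("", text)
    m = LIST_RE.fullmatch(core)
    if m is None:
        return None  # no ASCII digits, so not list-patterned
    return {
        "base": m.group(1).lower(),
        "index": 10000 + int(m.group(2)),  # list numbers sort after subscripts
        "suffix": m.group(3),
        "mods": mods,
    }

print(segment_list("LAK001a@c"))
```

Offsetting INDEX by 10000 keeps list numbers clear of ordinary subscript indices, including the value 1000 used for subscript x.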

Mixed Number/Non-Number Compounds

In Proto-Cuneiform there are compounds such as |1(N57).ŠAH₂| which are considered as non-numbers for the purposes of Unicode PC and which are sorted by the Oracc sort in the regular grapheme block rather than the number block.

This is done by coercing the type of the first sort-item to 0, regular grapheme, but leaving the rest of the segments the same. The BASE is then N57, which sorts before NA; the group of N57.X compounds is kept together in the sorted sequence because the number-like BASE-then-REPEAT sorting is still obeyed.
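
The type coercion can be sketched in Python as follows. The function names, the dictionary layout, and the naive split on "." are assumptions for illustration; delimiter sort-items are omitted for brevity, and BASE is lowercased as in its general definition.

```python
import re

NUMBER_RE = re.compile(r"([0-9]+)\((.+)\)")
SUBSCRIPT_RE = re.compile(r"[₀-₉ₓ]")

def segment_member(text):
    """Segment one member of a compound as a number or a regular grapheme."""
    m = NUMBER_RE.fullmatch(text)
    if m:
        return {"text": text, "type": 2,
                "base": m.group(2).lower(), "repeat": int(m.group(1))}
    return {"text": text, "type": 0,
            "base": SUBSCRIPT_RE.sub("", text).lower(), "repeat": -1}

def segment_compound(compound):
    """Segment |A.B| and coerce the first item to TYPE 0 if the compound is mixed."""
    members = compound.strip("|").split(".")
    items = [segment_member(m) for m in members]
    if {item["type"] for item in items} == {0, 2}:
        # Mixed number/non-number compound: sort it in the regular block,
        # leaving BASE and REPEAT untouched.
        items[0]["type"] = 0
    return items

print(segment_compound("|1(N57).ŠAH₂|"))
```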

 
 
CC BY-SA The OSL Project, 2014-
http://oracc.org/Sorting/