OSL is sorted according to the system-wide Oracc grapheme sorting algorithm, which is defined in the GDL library. The same sort is used by PCSL and in sorting transliteration fragments such as the grammatical bases in Sumerian glossary articles. A separate page gives the reference version of the OSL sort order [/osl/signlist/SortOrder/].
A sequence is one or more graphemes which are normally separated by delimiters. In the normal case, a sort-item is created for each grapheme or delimiter and added to the list of sort-items that represents the sequence.
The exception is that determinatives do not have a delimiter on one side: after the determinative for preposed determinatives and before the determinative for postposed determinatives. Following normal disciplinary practice, determinatives are ignored when sorting, so no sort-item is created for determinatives; any preceding or following delimiters are added to the list of sort-items.
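The construction of the sort-item list described above can be sketched in Python. This is an illustration only, not the GDL library's code: the function name `sort_items`, the brace notation for determinatives (as in `{d}`), and the small delimiter set used by the tokenizer are all assumptions made for the example.

```python
import re

# Assumptions for this sketch: determinatives are written in braces, e.g. {d},
# and delimiters are single characters from the set SPACE - . + :
TOKEN = re.compile(r"\{[^}]*\}|[ \-.+:]|[^{ \-.+:]+")

def sort_items(sequence):
    """Return the sort-items for a sequence, skipping determinatives."""
    items = []
    for token in TOKEN.findall(sequence):
        if token.startswith("{"):
            continue  # determinatives are ignored when sorting
        items.append(token)  # graphemes and delimiters each become a sort-item
    return items
```

With these assumptions, `sort_items("{d}en-lil₂")` yields the three sort-items `en`, `-`, and `lil₂`; the determinative contributes nothing.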
The algorithm works by splitting a sort-item into segments and comparing the segments in turn. The segments are:
The BASE passed through Oracc's grapheme collation sequence; the characters in KEY are remapped to conform to the delimiter and grapheme character and letter orders.
@ and ~ in ATF.
Sorting is carried out with the C library qsort function, which compares two items at a time. The comparison routine, gsort_cmp, is in Oracc II's lib/gdl/gsort.c [https://github.com/oracc/oracc2/blob/main/lib/gdl/gsort.c].
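The pairwise-comparison contract that qsort relies on can be illustrated in Python, where `functools.cmp_to_key` plays the role of handing a comparison routine to the sort. The names `compare_items` and `sort_pairwise` are hypothetical, and the tuples of segments are stand-ins chosen for the example; this is not a port of gsort_cmp.

```python
from functools import cmp_to_key

def compare_items(a, b):
    """Return a negative, zero, or positive value, qsort-style.

    Each item is a tuple of segments (an assumption of this sketch);
    segments are compared in turn and the first difference decides.
    """
    for seg_a, seg_b in zip(a, b):
        if seg_a != seg_b:
            return -1 if seg_a < seg_b else 1  # stop at the first difference
    return 0

def sort_pairwise(items):
    """Sort items using only the two-at-a-time comparison routine."""
    return sorted(items, key=cmp_to_key(compare_items))
```

The point of the design is that the sort itself knows nothing about graphemes; all of the ordering logic lives in the comparison routine.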
Comparing items steps through the segments; as soon as two segments differ (including where one grapheme has the segment and the other does not), comparison stops and a result is returned. The steps are:
Compare KEY character by character.
Delimiters sort before letters. The character order for delimiters puts the delimiters that separate juxtaposed signs first, and those which represent integrative relationships second: SPACE - . + : @ × & %. The algorithm also uses NULL delimiters to influence sort order in a few cases.
The letter order for graphemes is: ʾ a b c d e f g ŋ h ḫ i j k l m n o p q r s ś š ṣ t ṭ u v w x y z.
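The delimiter and letter orders given above can be turned into a character-by-character key, sketched here in Python. The name `collation_key`, the case folding, and the decision to rank unknown characters last are assumptions of the example, not documented behaviour of the Oracc collation; NULL delimiters are not modelled.

```python
# Orders copied from the text: delimiters first, then letters.
DELIMITER_ORDER = [" ", "-", ".", "+", ":", "@", "×", "&", "%"]
LETTER_ORDER = "ʾabcdefgŋhḫijklmnopqrsśšṣtṭuvwxyz"

# Rank of every known character; delimiters get the lowest ranks.
RANK = {ch: i for i, ch in enumerate(DELIMITER_ORDER + list(LETTER_ORDER))}

def collation_key(text):
    """Map each character to its rank.

    Assumptions: input is case-folded, and characters outside the two
    orders sort after everything else.
    """
    return [RANK.get(ch, len(RANK)) for ch in text.lower()]
```

Sorting with `key=collation_key` then places `ŋa` between `ga` and `ha`, and places `a-a` before `ab` because the delimiter outranks any letter.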
x has an INDEX value of 1000. Graphemes with no explicit index have INDEX=0 so that X and X₁ sort as expected.
Compare MODS; each MOD is a single @ or ~ item. MODs with ~ sort before MODs with @ because the ~ indicates a glyph-variant whereas a @ indicates a new sign created by modifying another sign in the ways specified by the @-modifier.
By definition of the data-type in Oracc, MODS are ASCII alphanumeric sequences matching [~@][a-z0-9]+. They are normally compared with a simple string comparison so they sort in simple alphabetic order, but an exception is made for the modifiers that consist entirely of digits 0..9 such as @90, etc. These are sorted in numeric sequence.
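The MOD rules above (~ before @, string comparison by default, numeric comparison for all-digit modifiers) can be sketched as a sort key in Python. The name `mod_key` is hypothetical, and placing all-digit modifiers ahead of alphabetic ones under the same marker is an assumption of the sketch, chosen because it matches plain ASCII ordering.

```python
import re

MOD_PATTERN = re.compile(r"[~@][a-z0-9]+")  # pattern quoted in the text

def mod_key(mod):
    """Build a sort key for a single MOD such as '~a', '@g' or '@90'."""
    if not MOD_PATTERN.fullmatch(mod):
        raise ValueError(f"not a MOD: {mod!r}")
    marker_rank = 0 if mod[0] == "~" else 1  # ~ sorts before @
    body = mod[1:]
    if body.isdigit():
        # All-digit modifiers compare numerically, so @90 precedes @180.
        return (marker_rank, 0, int(body), "")
    # Everything else compares as a plain string.
    return (marker_rank, 1, 0, body)
```

A plain string comparison would put `@180` before `@90`; the numeric branch is what restores the expected order.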
A sign consisting of P followed by subscript digits is TYPE 1, punctuation.
Three categories of sign have TYPE 2, number:
N/n followed by optional digits 0..9, e.g., n N N01. By convention, N-numbers smaller than 10 are padded with a leading zero so these can safely be compared as strings.
n, N, or a series of digits 0..9. By convention, the REPEATER is not padded with leading zeroes, so it must be compared as an integer in order for 1(N01) and 10(N01) to sort correctly.
Because numbers are compared with BASE first and REPEAT last, all of the numbers with the same GRAPHEME are grouped together: 1(N01), 2(N01) etc., then 1(N01@f) and so on.
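The BASE-first, REPEAT-last comparison of numbers can be sketched in Python as follows. The name `number_key` and the regular expression are assumptions of the example; in particular, MODS are compared here as one raw string for brevity, and a non-numeric repeat (n or N) is given the value 0, neither of which is stated in the text.

```python
import re

# REPEAT, then BASE with optional MODS in parentheses, e.g. 10(N01) or 1(N01@f).
NUMBER = re.compile(r"(n|N|[0-9]+)\(([Nn][0-9]*)((?:[~@][a-z0-9]+)*)\)")

def number_key(grapheme):
    """Sort key for a number grapheme: BASE, then MODS, then REPEAT."""
    m = NUMBER.fullmatch(grapheme)
    if m is None:
        raise ValueError(f"not a number grapheme: {grapheme!r}")
    repeat, base, mods = m.groups()
    # BASE is zero-padded by convention, so string comparison is safe.
    # REPEAT is not padded, so it is converted to an integer.
    repeat_value = int(repeat) if repeat.isdigit() else 0  # assumption for n/N
    return (base, mods, repeat_value)
```

Because REPEAT is the last element of the key, 10(N01) follows 2(N01) rather than 1(N01), and every N01 form precedes the N01@f forms.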
See also the section on mixed number/non-number compounds below.
Qualified graphemes such as ašₓ(AB) are treated as sequences with the value and sign separated by a NULL delimiter.
Multiplier graphemes have special treatment and are sorted as though they were REPEAT(GRAPHEME), like numbers but with TYPE=0. In addition, a NULL delimiter is added after the multiplier grapheme to ensure that it sorts after the base grapheme. This means that |3×AN| sorts immediately after AN, for example.
List numbers are segmented slightly differently from regular graphemes although they have the same TYPE (i.e., 0). The sort algorithm considers any grapheme that fails the number-pattern test but contains 1 or more ASCII digits to be a "list", or listy grapheme. The part before the digits becomes BASE; the digits are converted to an integer and the INDEX is set to 10000 + the integer result of the digit conversion. If anything follows the digits it becomes the SUFFIX segment. A list name could also potentially have MODS, e.g., LAK001a@c.
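The segmentation of a list number can be sketched in Python. The name `segment_list_number` and the regular expression are assumptions of the example, and the sketch does not perform the number-pattern test that the real algorithm applies first; it only shows how BASE, INDEX, SUFFIX, and MODS would fall out of a name such as LAK001a@c.

```python
import re

# BASE (no digits), digits, optional SUFFIX, optional MODS.
LISTY = re.compile(r"([^0-9~@]*)([0-9]+)([^~@]*)((?:[~@][a-z0-9]+)*)")

def segment_list_number(grapheme):
    """Split a listy grapheme into segments, or return None if it has no digits."""
    m = LISTY.fullmatch(grapheme)
    if m is None:
        return None
    base, digits, suffix, mods = m.groups()
    return {
        "BASE": base,
        "INDEX": 10000 + int(digits),  # offset given in the text
        "SUFFIX": suffix,
        "MODS": re.findall(r"[~@][a-z0-9]+", mods),
    }
```

For LAK001a@c this gives BASE `LAK`, INDEX 10001, SUFFIX `a`, and a single MOD `@c`; a grapheme without digits, such as AN, is not treated as listy.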
In Proto-Cuneiform there are compounds such as |1(N57).ŠAH₂| which are considered as non-numbers for the purposes of Unicode PC and which are sorted by the Oracc sort in the regular grapheme block rather than the number block.
This is done by coercing the type of the first sort-item to 0, regular grapheme, but leaving the rest of the segments the same. The BASE is then N57, which sorts before NA; the group of N57.X compounds is kept together in the sorted sequence because the number-like BASE-then-REPEAT sorting is still obeyed.