Collations and Linguistic Sorting

This chapter provides the basic concepts of national characters and linguistic sorting.

The default Unicode sorting order is provided in https://www.unicode.org/Public/UCA/latest/allkeys.txt.

This table (the Default Unicode Collation Element Table) provides a mapping from characters to collation elements for all the explicitly weighted characters.

Multilevel Comparisons

There are different levels of comparisons to pay attention to, such as case and accent sensitivity. From the Default Unicode Collation Element Table, referred to above, the following definition for the letter b is picked:

0062  ; [.0A29.0020.0002.0062] # LATIN SMALL LETTER B

 

Within square brackets there are four levels of comparison keys; the Primary level, the Secondary level, the Tertiary level and the Quaternary level.

Primary level:

Typically, this is used to denote differences between base characters (for example, a < b). It is the strongest difference. For example, dictionaries are divided into different sections by the base character. This is also called the level-1 strength.

Mimer SQL’s predefined level 1 collations have names ending with _1, e.g. ENGLISH_1.

Secondary level:

Accents in the characters are usually considered secondary differences (for example, ab < áb < ac). A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.

Mimer SQL’s predefined level 2 collations have names ending with _2, e.g. ENGLISH_2.

Note:In some languages (such as Icelandic), certain accented letters are considered to be separate base characters.

Tertiary level:

Upper and lower case differences in characters are distinguished at the tertiary level (for example, ab < Ab < áb). In addition, a variant of a letter differs from the base form on the tertiary level (such as a and ª). A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called level-3 strength.

Mimer SQL’s predefined level 3 collations have names ending with _3, e.g. ENGLISH_3.

Quaternary level:

When punctuation is ignored (such as space and hyphen) at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, ab < a c < a-c < ac). A quaternary difference is ignored when there is a primary, secondary, or tertiary difference. This is also called the level-4 strength.

Multilevel comparison means the following: Two strings are compared on the primary level. If the comparison for this level fails to establish a unique and determined sequence for the strings, the second level are taken into consideration. If this likewise fails to produce a unique sequence, the tertiary level is invoked, and after this the quaternary level is used. If still a unique sequence can not be established, the two strings are regarded as equivalent.

How far to go in this comparison chain is decided by the definition of the collation used on the data. See Tailorings.

Alternate Weighting

Alternate collation elements, i.e. punctuation, can be treated different depending on the weighting method used:

Non-ignorable

Alternate collation elements are treated as normal collation elements. This is the default option.

Shifted

Alternate collation elements are set to zero at the primary, secondary and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements, with a non-zero primary weight, will receive a fourth-level weight of 0xFFFF. If the primary weight is zero, the fourth-level weight is also zero.

Shift-trimmed

Alternate collation elements are set to zero at the primary, secondary and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements are set to zero. This will emulate POSIX behavior.

The following gives an example of the alternate weighting differences.

Non-ignorable

Shifted

Shift-trimmed

de luge

death

death

de Luge

de luge

deluge

de-luge

de-luge

de luge

de-Luge

deluge

de-luge

death

de Luge

deLuge

deluge

de-Luge

de Luge

deLuge

deLuge

de-Luge

demark

demark

demark

Tailorings

A tailoring is a set of rules and attributes that forms a so called collation delta string, which is used as the basis when creating a collation. When a new collation is to be created, the tailoring describes how to modify an existing collation definition to get the new one. A collation is created by the CREATE COLLATION statement, see CREATE COLLATION.

Attributes

When creating a collation, the tailoring string can include attribute settings for comparison level, accent order, which case that should be first in order and alternate weighting.

Attributes are optional.

Option

Values

Description

Level

[Level 1]

[Level 2]

[Level 3]

[Level 4]

Sort level for the collation.

[Level 3] is default.

Accent order

[AccentOrder Forward]

[AccentOrder Backward]

Secondary level ordering direction.

[AccentOrder Forward] is default.

Case first

[CaseFirst Lower]

[CaseFirst Upper]

Tertiary level case ordering.

[CaseFirst Lower] is default.

Alternate

[Alternate Non-ignorable]

[Alternate Shifted]

[Alternate Shift-trimmed]

Alternatives for punctuation.

[Alternate Non-ignorable] is default.

Hiragana

[Hiragana On]

[Hiragana Off]

Option for Japanese sorting.

Use
[Hiragana On][Level 4] for full Japans ordering.

[Hiragana Off] is default.

Numeric

[Numeric On]

[Numeric Off]

Option for numeric sorting.

[Numeric Off] is default.

Special sort rules

Language

Attribute

Description

Chinese

[CJK KangXi]

[CJK PinYin]

[CJK Stroke]

[CJK ZhuYin]

Use special sort rules for Chinese characters.

Japanese

[CJK Kanji]

Use the JIS X 4061-1996 collation rules. Gives proper ordering of PROLONGED SOUND MARK and ITERATION MARK.

Korean

[CJK Hanja]

Sort Hanja characters secondary different from the corresponding Hangul character.

Vietnamese

[CJK ChuNom]

Use syllable by syllable processing. In lexical ordering, differences in letters are treated as primary, differences in tone markings as secondary, and differences in case as tertiary differences. Ordering according to primary and secondary differences proceeds syllable by syllable. According to this principle, a dictionary lists “ban mai” before “bàn cát” because the secondary difference in the first syllable takes precedence over the primary difference in the second.

Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu

[Indic]

Use traditional collation rules for Indic languages, which provides for proper sorting of words ending with a dead consonant (without an inherent vowel).

Rules

The rules in a tailoring string defines how to change the underlying collation. Each rule contains a string of ordered characters that starts with a reset value.

Symbol

Example

Example description

&

&Z

Reset at this letter. Rules will be relative to this letter from here on.

<

a < b

Identifies a primary level difference between a and b.

<<

e << ê

Identifies a secondary level difference between e and ê.

<<<

s <<< S

Identifies a tertiary level difference between s and S.

=

i = y

Signifies no difference between i and y.

"

","

The quoted character , (comma).

#

#0141#

Hexadecimal representation of Polish L with stroke.

 

 

Note:; can be used to represent secondary relations and , to represent tertiary relations, instead of << and <<< respectively.

Example

The following is a Danish tailoring example:

[level 4]

[casefirst upper]

[alternate shifted]

& Y << ü <<< Ü

& Z < æ <<< Æ << ä <<< Ä < ø <<< Ø << ö <<< Ö < å <<< Å << aa <<< Aa <<< AA

 

Sorting Examples

Numerical data sorting

Here is an example on how to sort numerical data properly:

SQL>CREATE TABLE alphanum (codes VARCHAR(10));

SQL>INSERT INTO  alphanum VALUES('A123');

SQL>INSERT INTO  alphanum VALUES('A234');

SQL>INSERT INTO  alphanum VALUES('A23');

SQL>INSERT INTO  alphanum VALUES('A3');

SQL>INSERT INTO  alphanum VALUES('A1');

 

SQL>-- Regular order [Numeric Off]

SQL>SELECT * FROM alphanum ORDER BY codes;

 

CODES

==========

A1

A123

A23

A234

A3

 

SQL>-- Numeric order [Numeric On]

SQL>CREATE COLLATION numeric FROM eor USING '[Numeric On]';

SQL>SELECT * FROM alphanum ORDER BY codes COLLATE numeric;

 

CODES

==========

A1

A3

A23

A123

A234

 

Two column sorting

Here is an example on how to sort two fields properly; in this case 'last name', 'first name':

SQL>create table name(last varchar(32),first varchar(32));

SQL>insert into  name values('van Diesel','Peter');

SQL>insert into  name values('van Diesel','Thomas');

SQL>insert into  name values('vanDiesel','Peter');

SQL>insert into  name values('vanDiesel','Thomas');

SQL>insert into  name values('Van Diesel','Peter');

SQL>insert into  name values('Van Diesel','Thomas');

SQL>insert into  name values('Van','Stephan');

SQL>insert into  name values('Van','Buster');

SQL>create collation names from EOR_1

SQL&using '[level 4][alternate shifted]&9<","'; -- ',' before 'A'

SQL>select last || ', ' || first as fullname

SQL&from name order by fullname collate names;

 

FULLNAME

========

Van, Buster

Van, Stephan

van Diesel, Peter

vanDiesel, Peter

Van Diesel, Peter

van Diesel, Thomas

vanDiesel, Thomas

Van Diesel, Thomas

Name prefix handling

Example on how to treat different Mac prefixes as equal. Typical names are MacAlister, McAlister, McDonell, MacDougel and M'Dougel.

SQL>create collation mac_english_3 from english_3 using

SQL&'&MAC<<<mc<<<Mc<<<MC<<<m#27#<<<M#27#';

SQL>create collation mac_english_2 from mac_english_3 using '[level 2]';

SQL>create table macs (name varchar(32));

SQL>insert into macs values('M''Dougel');

SQL>insert into macs values('McDonell');

SQL>insert into macs values('MacAlister');

SQL>insert into macs values('McAlister');

SQL>insert into macs values('MacDougel');

SQL>select * from macs order by name collate english_3;

name

================================

M'Dougel

MacAlister

MacDougel

McAlister

McDonell

 

                 5 rows found

 

SQL>select * from macs order by name collate mac_english_3;

name

================================

MacAlister

McAlister

McDonell

MacDougel

M'Dougel

 

                 5 rows found

 

SQL>select * from macs where name = 'macalister' collate mac_english_2;

name

================================

MacAlister

McAlister

 

                 2 rows found

 

SQL>select * from macs where name = 'mcalister' collate mac_english_2;

name

================================

MacAlister

McAlister

 

                 2 rows found

 

SQL>select * from macs where name = 'm''alister' collate mac_english_2;

name

================================

MacAlister

McAlister

 

                 2 rows found

Collating Details

Expanding Characters

A single character can map to a sequence of collation elements. For instance, ß is equivalent to ss. In German Phonebook ä, ö and ü sort as though they were ae, oe and ue respectively.

Contracting Character Sequences

Many languages have digraphs, which actually counts as separate letters. In traditional Spanish, ch sorts between c and d, and ll sorts between l and m. Two characters are mapped into a single collation element that cause the combination to be ordered differently from either character individually.

Another example of contractions are lj and nj in Bosnian and Croatian, which sorts after l and n respectively.

Order without contraction

Order with contraction “nj” sorting after “n”

Na

Na

Ni

Ni

Nj

Nk

Nja

Nz

Njz

Nj

Nk

Nja

Nz

Njz

Oa

Oa

Backward Accent Ordering

Some languages, particularly French, require words to be ordered on the secondary level by comparing backwards from right to left.

Example

English ordering

French ordering

Cote

Cote

Coté

Côte

Côte

Coté

Côté

Côté

Indic

Attribute: [indic]
Function: Method for traditional Indic collation

The traditional Indic sort order is as follows:

1Vowel

2Vowelless consonant

3Vowelless consonant + Vowel

4Vowelless consonant + Vowelless consonant

5Vowelless consonant + Vowelless consonant + Vowel

6... and so on

As the consonant letters in Indic scripts includes an inherent vowel /a/, the following transformations are applied before sorting:

1Consonant + Virama => Vowelless consonant

2Consonant + Vowel-sign => Vowelless consonant + Vowel

3Consonant => Vowelless consonant + A

Transformation examples:

indic.png

 

The method for traditional Indic collation effectively works for the following scripts:

Devanagari (Hindi, Konkani, Marathi, Nepali and Sanskrit)

Bengali (Assamese, Bengali and Manipuri)

Gujarati

Oriya

Telugu

Kannada

Malayalam

The famous authoritative Monier-Williams: Sanskrit-English Dictionary is a good reference:

https://www.ibiblio.org/sripedia/ebooks/mw/

https://www.ibiblio.org/sripedia/ebooks/mw/0000/mw__0033.html

The [indic] attribute also works for Tamil, but with different rules as used in the authoritative University of Madras: Tamil Lexicon http://dsal.uchicago.edu/dictionaries/tamil-lex/

Punjabi does not need any tailoring, the default order follows the rules in the Punjabi University: Punjabi-English Dictionary ISBN:8173800960.

Without the [indic] attribute, a very large tailoring is needed for traditional collation. See the following example for Devanagari.

&#0903#<#0915##094D#

&#0915#<#0916##094D#

&#0916#<#0917##094D#

&#0917#<#0918##094D#

&#0918#<#0919##094D#

&#0919#<#091A##094D#

&#091A#<#091B##094D#

&#091B#<#091C##094D#

&#091C#<#091D##094D#

&#091D#<#091E##094D#

&#091E#<#091F##094D#

&#091F#<#0920##094D#

&#0920#<#0921##094D#

&#0921#<#0922##094D#

&#0922#<#0923##094D#

&#0923#<#0924##094D#

&#0924#<#0925##094D#

&#0925#<#0926##094D#

&#0926#<#0927##094D#

&#0927#<#0928##094D#

&#0928#<#092A##094D#

&#092A#<#092B##094D#

&#092B#<#092C##094D#

&#092C#<#092D##094D#

&#092D#<#092E##094D#

&#092E#<#092F##094D#

&#092F#<#0930##094D#

&#0930#<#0932##094D#

&#0932#<#0933##094D#

&#0933#<#0935##094D#

&#0935#<#0936##094D#

&#0936#<#0937##094D#

&#0937#<#0938##094D#

&#0938#<#0939##094D#

 

&#0915##094D##0905#=#0915#

&#0915##094D##0906#=#0915##093E#

&#0915##094D##0907#=#0915##093F#

&#0915##094D##0908#=#0915##0940#

&#0915##094D##0909#=#0915##0941#

&#0915##094D##090A#=#0915##0942#

&#0915##094D##090B#=#0915##0943#

&#0915##094D##0960#=#0915##0944#

&#0915##094D##090C#=#0915##0962#

&#0915##094D##0961#=#0915##0963#

&#0915##094D##090D#=#0915##0945#

&#0915##094D##090E#=#0915##0946#

&#0915##094D##090F#=#0915##0947#

&#0915##094D##0910#=#0915##0948#

&#0915##094D##0911#=#0915##0949#

&#0915##094D##0912#=#0915##094A#

&#0915##094D##0913#=#0915##094B#

&#0915##094D##0914#=#0915##094C#

 

...

same pattern for #0916#..#0938# (32)

...

 

&#0939##094D##0905#=#0939#

&#0939##094D##0906#=#0939##093E#

&#0939##094D##0907#=#0939##093F#

&#0939##094D##0908#=#0939##0940#

&#0939##094D##0909#=#0939##0941#

&#0939##094D##090A#=#0939##0942#

&#0939##094D##090B#=#0939##0943#

&#0939##094D##0960#=#0939##0944#

&#0939##094D##090C#=#0939##0962#

&#0939##094D##0961#=#0939##0963#

&#0939##094D##090D#=#0939##0945#

&#0939##094D##090E#=#0939##0946#

&#0939##094D##090F#=#0939##0947#

&#0939##094D##0910#=#0939##0948#

&#0939##094D##0911#=#0939##0949#

&#0939##094D##0912#=#0939##094A#

&#0939##094D##0913#=#0939##094B#

&#0939##094D##0914#=#0939##094C#

Japanese

Attribute: [CJK Kanji]
Function: JIS X 4061-1996 rules for SOUND/ITERATION MARKS

This attribute is an implementation of JIS X 4061-1996 and the collation rules are based on that standard.

The following criteria are considered in order until the collation order is determined. By default, Levels 1 to 4 are applied and Level 5 is ignored (as JIS does).

Level 1: alphabetic ordering

The character classes are sorted in the following order:

Space characters, Symbols and Punctuations, Digits,

Latin Letters, Greek Letters, Cyrillic Letters,

Hiragana/Katakana letters, Kanji ideographs.

 

In the class, alphabets are collated alphabetically; Kana letters are AIUEO-betically (in the Gozyuon order).

For Kanji, see Kanji Classes.

Other characters are collated as defined.

Characters not defined as a collation element are ignored and skipped on collation.

Level 2: diacritic ordering

In the Latin vowels, the order is as shown the following list.

One without diacritical mark, then with diacritical mark.

 

In Kana, the order is as shown the following list.

A voiceless kana, the voiced, then the semi-voiced

(if exists). (eg. Ka before Ga; Ha before Ba before Pa)

Level 3: case ordering

A small Latin character is less than the corresponding capital character.

In Kana, the order is as shown in the following list:

replaced PROLONGED SOUND MARK(U+30FC);

Small Kana;

replaced ITERATION MARK (U+309D, U+309E, U+30FD or U+30FE);

then normal kana

 

For example, Katakana A + PROLONGED SOUND MARK, Katakana A + Small Katakana A, Katakana A + ITERATION MARK, Katakana A + Katakana A.

Level 4: variant ordering

Hiragana is lesser than Katakana.

Level 5: width ordering

A character that belongs to the block Halfwidth and Fullwidth Forms is greater than the corresponding normal character.

Note:According to the JIS standard, the level 5 should be ignored.

Kanji Classes

There are three Kanji classes:

1The 'saisho' (minimum) Kanji class

It comprises five Kanji-like characters, i.e. U+3003, U+3005, U+4EDD, U+3006, U+3007. Any Kanji except U+4EDD are ignored on collation.

2The 'kihon' (basic) Kanji class

It comprises JIS levels 1 and 2 kanji in addition to the minimum Kanji class. Sorted in the JIS order. Any Kanji excepting those defined by JIS X 0208 are ignored on collation.

3The 'kakucho' (extended) Kanji class

All the CJK Unified Ideographs in addition to the minimum Kanji class. Sorted in the Unicode order.

Note:This is the implemented class.

Korean

Attribute: [CJK Hanja]
Function: Special sort table access

Hanja characters are sorted with secondary difference from the corresponding Hangul character.

Vietnamese

Attribute: [CJK ChuNom]
Function: Syllable by syllable processing

In lexical ordering, differences in letters are treated as primary, differences in tone markings as secondary, and differences in case as tertiary differences. Ordering according to primary and secondary differences proceeds syllable by syllable. According to this principle, a dictionary lists “ban mai” before “bàn cát” because the secondary difference in the first syllable takes precedence over the primary difference in the second.