Collations and Linguistic Sorting
This chapter provides the basic concepts of national characters and linguistic sorting.
The default Unicode sorting order is provided in https://www.unicode.org/Public/UCA/latest/allkeys.txt.
This table (the Default Unicode Collation Element Table) provides a mapping from characters to collation elements for all the explicitly weighted characters.
There are different levels of comparisons to pay attention to, such as case and accent sensitivity. From the Default Unicode Collation Element Table, referred to above, the following definition for the letter b is picked:
0062 ; [.0A29.0020.0002.0062] # LATIN SMALL LETTER B
Within square brackets there are four levels of comparison keys; the Primary level, the Secondary level, the Tertiary level and the Quaternary level.
Typically, this is used to denote differences between base characters (for example, a < b). It is the strongest difference. For example, dictionaries are divided into different sections by the base character. This is also called the level-1 strength.
Mimer SQL’s predefined level 1 collations have names ending with _1, e.g. ENGLISH_1.
Accents in the characters are usually considered secondary differences (for example, ab < áb < ac). A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.
Mimer SQL’s predefined level 2 collations have names ending with _2, e.g. ENGLISH_2.
Note:In some languages (such as Icelandic), certain accented letters are considered to be separate base characters.
Upper and lower case differences in characters are distinguished at the tertiary level (for example, ab < Ab < áb). In addition, a variant of a letter differs from the base form on the tertiary level (such as a and ª). A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called level-3 strength.
Mimer SQL’s predefined level 3 collations have names ending with _3, e.g. ENGLISH_3.
When punctuation is ignored (such as space and hyphen) at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, ab < a c < a-c < ac). A quaternary difference is ignored when there is a primary, secondary, or tertiary difference. This is also called the level-4 strength.
Multilevel comparison means the following: Two strings are compared on the primary level. If the comparison for this level fails to establish a unique and determined sequence for the strings, the second level are taken into consideration. If this likewise fails to produce a unique sequence, the tertiary level is invoked, and after this the quaternary level is used. If still a unique sequence can not be established, the two strings are regarded as equivalent.
How far to go in this comparison chain is decided by the definition of the collation used on the data. See Tailorings.
Alternate collation elements, i.e. punctuation, can be treated different depending on the weighting method used:
Non-ignorable
Alternate collation elements are treated as normal collation elements. This is the default option.
Shifted
Alternate collation elements are set to zero at the primary, secondary and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements, with a non-zero primary weight, will receive a fourth-level weight of 0xFFFF. If the primary weight is zero, the fourth-level weight is also zero.
Shift-trimmed
Alternate collation elements are set to zero at the primary, secondary and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements are set to zero. This will emulate POSIX behavior.
The following gives an example of the alternate weighting differences.
Non-ignorable |
Shifted |
Shift-trimmed |
de luge |
death |
death |
de Luge |
de luge |
deluge |
de-luge |
de-luge |
de luge |
de-Luge |
deluge |
de-luge |
death |
de Luge |
deLuge |
deluge |
de-Luge |
de Luge |
deLuge |
deLuge |
de-Luge |
demark |
demark |
demark |
A tailoring is a set of rules and attributes that forms a so called collation delta string, which is used as the basis when creating a collation. When a new collation is to be created, the tailoring describes how to modify an existing collation definition to get the new one. A collation is created by the CREATE COLLATION statement, see CREATE COLLATION.
Attributes
When creating a collation, the tailoring string can include attribute settings for comparison level, accent order, which case that should be first in order and alternate weighting.
Attributes are optional.
Option |
Values |
Description |
Level |
[Level 1] [Level 2] [Level 3] [Level 4] |
Sort level for the collation. [Level 3] is default. |
Accent order |
[AccentOrder Forward] [AccentOrder Backward] |
Secondary level ordering direction. [AccentOrder Forward] is default. |
Case first |
[CaseFirst Lower] [CaseFirst Upper] |
Tertiary level case ordering. [CaseFirst Lower] is default. |
Alternate |
[Alternate Non-ignorable] [Alternate Shifted] [Alternate Shift-trimmed] |
Alternatives for punctuation. [Alternate Non-ignorable] is default. |
Hiragana |
[Hiragana On] [Hiragana Off] |
Option for Japanese sorting. Use [Hiragana Off] is default. |
Numeric |
[Numeric On] [Numeric Off] |
Option for numeric sorting. [Numeric Off] is default. |
Special sort rules
Language |
Attribute |
Description |
Chinese |
[CJK KangXi] [CJK PinYin] [CJK Stroke] [CJK ZhuYin] |
Use special sort rules for Chinese characters. |
Japanese |
[CJK Kanji] |
Use the JIS X 4061-1996 collation rules. Gives proper ordering of PROLONGED SOUND MARK and ITERATION MARK. |
Korean |
[CJK Hanja] |
Sort Hanja characters secondary different from the corresponding Hangul character. |
Vietnamese |
[CJK ChuNom] |
Use syllable by syllable processing. In lexical ordering, differences in letters are treated as primary, differences in tone markings as secondary, and differences in case as tertiary differences. Ordering according to primary and secondary differences proceeds syllable by syllable. According to this principle, a dictionary lists “ban mai” before “bàn cát” because the secondary difference in the first syllable takes precedence over the primary difference in the second. |
Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu |
[Indic] |
Use traditional collation rules for Indic languages, which provides for proper sorting of words ending with a dead consonant (without an inherent vowel). |
Rules
The rules in a tailoring string defines how to change the underlying collation. Each rule contains a string of ordered characters that starts with a reset value.
Symbol |
Example |
Example description |
& |
&Z |
Reset at this letter. Rules will be relative to this letter from here on. |
< |
a < b |
Identifies a primary level difference between a and b. |
<< |
e << ê |
Identifies a secondary level difference between e and ê. |
<<< |
s <<< S |
Identifies a tertiary level difference between s and S. |
= |
i = y |
Signifies no difference between i and y. |
" |
"," |
The quoted character , (comma). |
# |
#0141# |
Hexadecimal representation of Polish L with stroke. |
Note:; can be used to represent secondary relations and , to represent tertiary relations, instead of << and <<< respectively.
Example
The following is a Danish tailoring example:
[level 4] [casefirst upper] [alternate shifted] & Y << ü <<< Ü & Z < æ <<< Æ << ä <<< Ä < ø <<< Ø << ö <<< Ö < å <<< Å << aa <<< Aa <<< AA |
Sorting Examples
Numerical data sorting
Here is an example on how to sort numerical data properly:
SQL>CREATE TABLE alphanum (codes VARCHAR(10));
SQL>INSERT INTO alphanum VALUES('A123');
SQL>INSERT INTO alphanum VALUES('A234');
SQL>INSERT INTO alphanum VALUES('A23');
SQL>INSERT INTO alphanum VALUES('A3');
SQL>INSERT INTO alphanum VALUES('A1');
SQL>-- Regular order [Numeric Off]
SQL>SELECT * FROM alphanum ORDER BY codes;
CODES
==========
A1
A123
A23
A234
A3
SQL>-- Numeric order [Numeric On]
SQL>CREATE COLLATION numeric FROM eor USING '[Numeric On]';
SQL>SELECT * FROM alphanum ORDER BY codes COLLATE numeric;
CODES
==========
A1
A3
A23
A123
A234
Two column sorting
Here is an example on how to sort two fields properly; in this case 'last name', 'first name':
SQL>create table name(last varchar(32),first varchar(32));
SQL>insert into name values('van Diesel','Peter');
SQL>insert into name values('van Diesel','Thomas');
SQL>insert into name values('vanDiesel','Peter');
SQL>insert into name values('vanDiesel','Thomas');
SQL>insert into name values('Van Diesel','Peter');
SQL>insert into name values('Van Diesel','Thomas');
SQL>insert into name values('Van','Stephan');
SQL>insert into name values('Van','Buster');
SQL>create collation names from EOR_1
SQL&using '[level 4][alternate shifted]&9<","'; -- ',' before 'A'
SQL>select last || ', ' || first as fullname
SQL&from name order by fullname collate names;
FULLNAME
========
Van, Buster
Van, Stephan
van Diesel, Peter
vanDiesel, Peter
Van Diesel, Peter
van Diesel, Thomas
vanDiesel, Thomas
Van Diesel, Thomas
Name prefix handling
Example on how to treat different Mac prefixes as equal. Typical names are MacAlister, McAlister, McDonell, MacDougel and M'Dougel.
SQL>create collation mac_english_3 from english_3 using
SQL&'&MAC<<<mc<<<Mc<<<MC<<<m#27#<<<M#27#';
SQL>create collation mac_english_2 from mac_english_3 using '[level 2]';
SQL>create table macs (name varchar(32));
SQL>insert into macs values('M''Dougel');
SQL>insert into macs values('McDonell');
SQL>insert into macs values('MacAlister');
SQL>insert into macs values('McAlister');
SQL>insert into macs values('MacDougel');
SQL>select * from macs order by name collate english_3;
name
================================
M'Dougel
MacAlister
MacDougel
McAlister
McDonell
5 rows found
SQL>select * from macs order by name collate mac_english_3;
name
================================
MacAlister
McAlister
McDonell
MacDougel
M'Dougel
5 rows found
SQL>select * from macs where name = 'macalister' collate mac_english_2;
name
================================
MacAlister
McAlister
2 rows found
SQL>select * from macs where name = 'mcalister' collate mac_english_2;
name
================================
MacAlister
McAlister
2 rows found
SQL>select * from macs where name = 'm''alister' collate mac_english_2;
name
================================
MacAlister
McAlister
2 rows found
Collating Details
Expanding Characters
A single character can map to a sequence of collation elements. For instance, ß is equivalent to ss. In German Phonebook ä, ö and ü sort as though they were ae, oe and ue respectively.
Contracting Character Sequences
Many languages have digraphs, which actually counts as separate letters. In traditional Spanish, ch sorts between c and d, and ll sorts between l and m. Two characters are mapped into a single collation element that cause the combination to be ordered differently from either character individually.
Another example of contractions are lj and nj in Bosnian and Croatian, which sorts after l and n respectively.
Order without contraction |
Order with contraction “nj” sorting after “n” |
Na |
Na |
Ni |
Ni |
Nj |
Nk |
Nja |
Nz |
Njz |
Nj |
Nk |
Nja |
Nz |
Njz |
Oa |
Oa |
Some languages, particularly French, require words to be ordered on the secondary level by comparing backwards from right to left.
Example
English ordering |
French ordering |
Cote |
Cote |
Coté |
Côte |
Côte |
Coté |
Côté |
Côté |
Attribute: [indic]
Function: Method for traditional Indic collation
The traditional Indic sort order is as follows:
1Vowel
2Vowelless consonant
3Vowelless consonant + Vowel
4Vowelless consonant + Vowelless consonant
5Vowelless consonant + Vowelless consonant + Vowel
6... and so on
As the consonant letters in Indic scripts includes an inherent vowel /a/, the following transformations are applied before sorting:
1Consonant + Virama => Vowelless consonant
2Consonant + Vowel-sign => Vowelless consonant + Vowel
3Consonant => Vowelless consonant + A
Transformation examples:
The method for traditional Indic collation effectively works for the following scripts:
•Devanagari (Hindi, Konkani, Marathi, Nepali and Sanskrit)
•Bengali (Assamese, Bengali and Manipuri)
•Gujarati
•Oriya
•Telugu
•Kannada
•Malayalam
The famous authoritative Monier-Williams: Sanskrit-English Dictionary is a good reference:
https://www.ibiblio.org/sripedia/ebooks/mw/
https://www.ibiblio.org/sripedia/ebooks/mw/0000/mw__0033.html
The [indic] attribute also works for Tamil, but with different rules as used in the authoritative University of Madras: Tamil Lexicon http://dsal.uchicago.edu/dictionaries/tamil-lex/
Punjabi does not need any tailoring, the default order follows the rules in the Punjabi University: Punjabi-English Dictionary ISBN:8173800960.
Without the [indic] attribute, a very large tailoring is needed for traditional collation. See the following example for Devanagari.
·#<#0915##094D#
Γ#<#0916##094D#
Δ#<#0917##094D#
Ε#<#0918##094D#
Ζ#<#0919##094D#
Η#<#091A##094D#
[A#<#091B##094D#
[B#<#091C##094D#
[C#<#091D##094D#
[D#<#091E##094D#
[E#<#091F##094D#
[F#<#0920##094D#
Θ#<#0921##094D#
Ι#<#0922##094D#
Κ#<#0923##094D#
Λ#<#0924##094D#
Μ#<#0925##094D#
Ν#<#0926##094D#
Ξ#<#0927##094D#
Ο#<#0928##094D#
Π#<#092A##094D#
\A#<#092B##094D#
\B#<#092C##094D#
\C#<#092D##094D#
\D#<#092E##094D#
\E#<#092F##094D#
\F#<#0930##094D#
΢#<#0932##094D#
Τ#<#0933##094D#
Υ#<#0935##094D#
Χ#<#0936##094D#
Ψ#<#0937##094D#
Ω#<#0938##094D#
Ϊ#<#0939##094D#
Γ##094D##0905#=#0915#
Γ##094D##0906#=#0915##093E#
Γ##094D##0907#=#0915##093F#
Γ##094D##0908#=#0915##0940#
Γ##094D##0909#=#0915##0941#
Γ##094D##090A#=#0915##0942#
Γ##094D##090B#=#0915##0943#
Γ##094D##0960#=#0915##0944#
Γ##094D##090C#=#0915##0962#
Γ##094D##0961#=#0915##0963#
Γ##094D##090D#=#0915##0945#
Γ##094D##090E#=#0915##0946#
Γ##094D##090F#=#0915##0947#
Γ##094D##0910#=#0915##0948#
Γ##094D##0911#=#0915##0949#
Γ##094D##0912#=#0915##094A#
Γ##094D##0913#=#0915##094B#
Γ##094D##0914#=#0915##094C#
...
same pattern for #0916#..#0938# (32)
...
Ϋ##094D##0905#=#0939#
Ϋ##094D##0906#=#0939##093E#
Ϋ##094D##0907#=#0939##093F#
Ϋ##094D##0908#=#0939##0940#
Ϋ##094D##0909#=#0939##0941#
Ϋ##094D##090A#=#0939##0942#
Ϋ##094D##090B#=#0939##0943#
Ϋ##094D##0960#=#0939##0944#
Ϋ##094D##090C#=#0939##0962#
Ϋ##094D##0961#=#0939##0963#
Ϋ##094D##090D#=#0939##0945#
Ϋ##094D##090E#=#0939##0946#
Ϋ##094D##090F#=#0939##0947#
Ϋ##094D##0910#=#0939##0948#
Ϋ##094D##0911#=#0939##0949#
Ϋ##094D##0912#=#0939##094A#
Ϋ##094D##0913#=#0939##094B#
Ϋ##094D##0914#=#0939##094C#
Attribute: [CJK Kanji]
Function: JIS X 4061-1996 rules for SOUND/ITERATION MARKS
This attribute is an implementation of JIS X 4061-1996 and the collation rules are based on that standard.
The following criteria are considered in order until the collation order is determined. By default, Levels 1 to 4 are applied and Level 5 is ignored (as JIS does).
The character classes are sorted in the following order:
Space characters, Symbols and Punctuations, Digits,
Latin Letters, Greek Letters, Cyrillic Letters,
Hiragana/Katakana letters, Kanji ideographs.
In the class, alphabets are collated alphabetically; Kana letters are AIUEO-betically (in the Gozyuon order).
For Kanji, see Kanji Classes.
Other characters are collated as defined.
Characters not defined as a collation element are ignored and skipped on collation.
In the Latin vowels, the order is as shown the following list.
One without diacritical mark, then with diacritical mark.
In Kana, the order is as shown the following list.
A voiceless kana, the voiced, then the semi-voiced
(if exists). (eg. Ka before Ga; Ha before Ba before Pa)
A small Latin character is less than the corresponding capital character.
In Kana, the order is as shown in the following list:
replaced PROLONGED SOUND MARK(U+30FC);
Small Kana;
replaced ITERATION MARK (U+309D, U+309E, U+30FD or U+30FE);
then normal kana
For example, Katakana A + PROLONGED SOUND MARK, Katakana A + Small Katakana A, Katakana A + ITERATION MARK, Katakana A + Katakana A.
Hiragana is lesser than Katakana.
A character that belongs to the block Halfwidth and Fullwidth Forms is greater than the corresponding normal character.
Note:According to the JIS standard, the level 5 should be ignored.
There are three Kanji classes:
1The 'saisho' (minimum) Kanji class
It comprises five Kanji-like characters, i.e. U+3003, U+3005, U+4EDD, U+3006, U+3007. Any Kanji except U+4EDD are ignored on collation.
2The 'kihon' (basic) Kanji class
It comprises JIS levels 1 and 2 kanji in addition to the minimum Kanji class. Sorted in the JIS order. Any Kanji excepting those defined by JIS X 0208 are ignored on collation.
3The 'kakucho' (extended) Kanji class
All the CJK Unified Ideographs in addition to the minimum Kanji class. Sorted in the Unicode order.
Note:This is the implemented class.
Attribute: [CJK Hanja]
Function: Special sort table access
Hanja characters are sorted with secondary difference from the corresponding Hangul character.
Attribute: [CJK ChuNom]
Function: Syllable by syllable processing
In lexical ordering, differences in letters are treated as primary, differences in tone markings as secondary, and differences in case as tertiary differences. Ordering according to primary and secondary differences proceeds syllable by syllable. According to this principle, a dictionary lists “ban mai” before “bàn cát” because the secondary difference in the first syllable takes precedence over the primary difference in the second.