The Curious Case of the Greek Final Sigma
by admin
For utf8rewind 1.2.0, one of my goals was to add Unicode case mapping to the library. For the most part I’ve succeeded; it’s now possible to convert any kind of UTF-8 encoded text to upper-, lower- and titlecase using functions in the library. But there were a couple of exceptions to the case mapping algorithm that I glossed over.
Implementing case mapping
When you’re looking to implement Unicode-compliant case mapping, the first resources you should grab are the ones from the Unicode Consortium itself. These are freely available from their FTP server. The most important resource is UnicodeData.txt, which is the database for every code point encoded by this version of the Unicode standard. The database is a pure ASCII-encoded text file where every line is a record and each record is split into fields with the ‘;’ character (U+003B SEMICOLON, to be exact). A record in the database looks like this:
```
0051;LATIN CAPITAL LETTER Q;Lu;0;L;;;;;N;;;;0071;
```
Each record has fifteen fields, but for case mapping we’re only interested in a couple of them:
- The first field, which denotes the code point’s value (in this case U+0051).
- The second field, which is the official name for the code point (LATIN CAPITAL LETTER Q)
- The thirteenth field, which is the uppercase mapping (because it’s empty, the value is the same as the code point, so U+0051)
- The fourteenth field, which is the lowercase mapping (U+0071 LATIN SMALL LETTER Q)
- The fifteenth field, which is the titlecase mapping (U+0051 again)
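To make the field layout concrete, here is a minimal sketch of how one record could be split on semicolons and the three case mapping fields extracted. The `SimpleCaseRecord` type and both function names are made up for this example and are not part of utf8rewind. An empty titlecase field falls back to the uppercase mapping here, which gives the same result as the code point itself for the record above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define UCD_FIELD_COUNT 15

/* Hypothetical record type for this sketch; not a utf8rewind type. */
typedef struct {
    uint32_t code_point;
    uint32_t uppercase;
    uint32_t lowercase;
    uint32_t titlecase;
} SimpleCaseRecord;

/* Parses a hexadecimal field; an empty or oversized field yields `fallback`. */
static uint32_t parse_hex_field(const char* start, size_t length, uint32_t fallback)
{
    char buffer[16];

    if (length == 0 || length >= sizeof(buffer)) {
        return fallback;
    }

    memcpy(buffer, start, length);
    buffer[length] = '\0';

    return (uint32_t)strtoul(buffer, NULL, 16);
}

/* Splits a single UnicodeData.txt line on ';' and extracts the simple
   case mappings. Returns 1 on success and 0 on a malformed record. */
int parse_simple_case_record(const char* line, SimpleCaseRecord* record)
{
    const char* fields[UCD_FIELD_COUNT];
    size_t lengths[UCD_FIELD_COUNT];
    size_t count = 0;
    const char* cursor = line;

    assert(line != NULL && record != NULL);

    for (;;) {
        const char* end = strchr(cursor, ';');

        if (count == UCD_FIELD_COUNT) {
            return 0; /* more than fifteen fields */
        }

        fields[count] = cursor;
        /* The last field runs to the end of the line. */
        lengths[count] = (end != NULL) ? (size_t)(end - cursor) : strcspn(cursor, "\r\n");
        count++;

        if (end == NULL) {
            break;
        }
        cursor = end + 1;
    }

    if (count != UCD_FIELD_COUNT || lengths[0] == 0) {
        return 0;
    }

    /* Fields are zero-based here: 12, 13 and 14 are the thirteenth,
       fourteenth and fifteenth fields described in the list above. */
    record->code_point = parse_hex_field(fields[0], lengths[0], 0);
    record->uppercase = parse_hex_field(fields[12], lengths[12], record->code_point);
    record->lowercase = parse_hex_field(fields[13], lengths[13], record->code_point);
    record->titlecase = parse_hex_field(fields[14], lengths[14], record->uppercase);

    return 1;
}
```

Feeding it the record for LATIN CAPITAL LETTER Q should give U+0051 for the uppercase and titlecase mappings and U+0071 for the lowercase mapping.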
Because Unicode is backwards-compatible with ASCII, the first 128 code points (U+0000 through U+007F) are exactly the same in both standards. This is how Unicode can get away with a pure ASCII database: the characters it needs to describe the standard are part of the standard itself!
Case mapping has not posed much of a problem so far. Each code point may or may not have a case mapping, and the converted result is always a single code point, just like the source. Or is it?
The Eszett problem
When we go beyond Basic Latin and venture into the land of Latin-1, we encounter this peculiar fellow:
```
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
```
You might think that this means that LATIN SMALL LETTER SHARP S does not have an uppercase version, but you’d be mistaken. You would end up disappointing a lot of German, Austrian, Danish and even Belgian users!
One of the limitations of the Unicode database is that it must be backwards-compatible with itself. The case mapping fields in UnicodeData.txt hold a single code point each, which ensures that each code point is only ever converted to a single other code point when doing case conversion. Unfortunately, this doesn’t work for German. Its spelling rules state that the Eszett, as they call it, should be written as SS when uppercased. Unicode does encode a capital version, U+1E9E LATIN CAPITAL LETTER SHARP S (added in Unicode 5.1), but it is not the default uppercase mapping for U+00DF. To describe mappings like this one without changing the format of the database, Unicode added a special text file called SpecialCasing.txt.
Here, we find an updated record for the LATIN SMALL LETTER SHARP S code point:
```
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
```
Combining this data with the record in the Unicode database gives us the complete story:
- The value for the code point is U+00DF
- Its official name is LATIN SMALL LETTER SHARP S
- The lowercase mapping is U+00DF
- The titlecase mapping is U+0053 U+0073 (Ss)
- The uppercase mapping is U+0053 U+0053 (SS)
Good, we got that sorted. Most code points map to a single code point, but some don’t. These exceptions can be found in SpecialCasing.txt.
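One way to combine the two sources is to check a table of special mappings first and only fall back to the single code point mapping when there is no entry. The sketch below is an illustration under assumptions, not how utf8rewind does it: the table holds only the Eszett entry, and `simple_uppercase()` is a hypothetical stand-in for a UnicodeData.txt lookup that handles nothing beyond a-z.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPPING_LENGTH 3

/* Hypothetical table entry for this sketch; not a utf8rewind type. */
typedef struct {
    uint32_t code_point;
    size_t length;
    uint32_t mapping[MAX_MAPPING_LENGTH];
} SpecialUppercase;

/* Unconditional uppercase mappings taken from SpecialCasing.txt.
   Only the Eszett is listed; the real file contains more entries. */
static const SpecialUppercase special_uppercase[] = {
    { 0x00DF, 2, { 0x0053, 0x0053 } } /* sharp s -> SS */
};

/* Stand-in for a UnicodeData.txt lookup: only handles a-z. */
static uint32_t simple_uppercase(uint32_t code_point)
{
    if (code_point >= 0x0061 && code_point <= 0x007A) {
        return code_point - 0x20;
    }

    return code_point;
}

/* Writes the uppercase mapping of `code_point` to `out` and returns the
   number of code points written, or 0 if `out` is too small. */
size_t uppercase_code_point(uint32_t code_point, uint32_t* out, size_t out_capacity)
{
    size_t i;
    size_t j;

    assert(out != NULL);

    /* Special mappings take precedence over the simple ones. */
    for (i = 0; i < sizeof(special_uppercase) / sizeof(special_uppercase[0]); ++i) {
        const SpecialUppercase* entry = &special_uppercase[i];

        if (entry->code_point == code_point) {
            if (out_capacity < entry->length) {
                return 0;
            }

            for (j = 0; j < entry->length; ++j) {
                out[j] = entry->mapping[j];
            }

            return entry->length;
        }
    }

    if (out_capacity < 1) {
        return 0;
    }

    out[0] = simple_uppercase(code_point);

    return 1;
}
```

The important consequence for callers is that the output can be longer than the input, so a destination buffer sized for one code point per input code point is not always enough.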
However, when we scroll down, we get to a section ominously titled “Conditional Mappings”.
Conditional mappings
While the code points specified by the Unicode standard are language- and locale-independent, some of their behavior, unfortunately, isn’t. This is where the conditional mappings come in. When I implemented case mapping for utf8rewind 1.2.0, I glossed over this section entirely. The language used was a bit too vague for my tastes and I would rather ship imperfect case mapping now than perfect case mapping later. There wasn’t an immediate backlash, but I did receive a pull request to fix my Turkish uppercase strings. This made me decide to take a closer look at these conditional mappings. Luckily, by now I had received my copy of Unicode Demystified, which explains the case mapping algorithm in greater detail.
The Sigma Exception
Which brings us to GREEK CAPITAL LETTER SIGMA. Here’s what the book has to say about this code point:
Here’s a classic example of the use of the special conditions:
```
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
# 03A3; 03C3; 03A3; 03A3; # GREEK CAPITAL LETTER SIGMA
```
These two entries are for the Greek capital letter sigma (Σ, U+03A3). This letter has two lowercase forms: one (ς, U+03C2) that appears only at the ends of words and another (σ, U+03C3) that appears at the beginning or in the middle of a word. The FINAL_SIGMA at the end of the first entry indicates that this mapping applies only at the end of a word (and when the letter isn’t the only letter in the word). The second entry, which maps to the initial/medial form of the lowercase sigma, is commented out (begins with a #) because this mapping is in the UnicodeData.txt file.
(Source: Richard Gillam, “Character Properties and the Unicode Character Database” in “Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard”, p. 169.)
Now the mapping makes sense! The exception is relatively easy to program against, but it applies regardless of the language used for the text. This means that if a text is in Dutch (and the system locale is set to Dutch), but it includes a capital letter sigma at the end of a word, it should still be lowercased to ς instead of σ! So now my lowercase implementation looks like this:
```c
while (state.src_size > 0)
{
    size_t converted;

    if (!casemapping_readcodepoint(&state))
    {
        UTF8_SET_ERROR(INVALID_DATA);

        return bytes_written;
    }

    /* U+03A3 GREEK CAPITAL LETTER SIGMA needs the Final_Sigma check */
    if (state.last_code_point == 0x03A3)
    {
        uint8_t should_convert;

        /* Read the code point that follows the sigma */
        state.last_code_point_size = codepoint_read(
            state.src,
            state.src_size,
            &state.last_code_point);

        /* Nothing left to read, so the sigma ends the input */
        should_convert = state.last_code_point_size == 0;

        if (!should_convert)
        {
            state.last_general_category = database_queryproperty(
                state.last_code_point,
                UnicodeProperty_GeneralCategory);

            /* The sigma is final if the next code point is not a letter */
            should_convert = (state.last_general_category & GeneralCategory_Letter) == 0;
        }

        /* U+03C2 is the final form, U+03C3 the initial/medial form */
        converted = codepoint_write(
            should_convert ? 0x03C2 : 0x03C3,
            &state.dst,
            &state.dst_size);
    }
    else
    {
        converted = casemapping_execute2(&state);
    }

    if (!converted)
    {
        UTF8_SET_ERROR(NOT_ENOUGH_SPACE);

        return bytes_written;
    }

    bytes_written += converted;
}
```
And branch prediction algorithms in processors everywhere wept.
These changes are currently part of the local-independent-case-mapping branch of utf8rewind and won’t appear in a stable release until utf8rewind 1.3.0, the next minor version.
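For readers who want to experiment with the rule outside the library, here is a small self-contained sketch that works on an array of code points instead of UTF-8. It is a simplification under assumptions: `is_letter()` is a hypothetical stand-in for a general category lookup that only knows ASCII and basic Greek letters, and it also applies the "not the only letter in the word" part of the quote above by looking at the preceding code point. It makes no attempt to skip over combining marks or punctuation inside a word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a General_Category lookup: recognizes only ASCII letters
   and the basic Greek letters U+0391..U+03C9 (U+03A2 is unassigned). */
static int is_letter(uint32_t code_point)
{
    return (code_point >= 0x0041 && code_point <= 0x005A) ||
           (code_point >= 0x0061 && code_point <= 0x007A) ||
           (code_point >= 0x0391 && code_point <= 0x03C9 && code_point != 0x03A2);
}

/* A sigma is treated as final when a letter comes directly before it
   and no letter comes directly after it. */
static int is_final_sigma(const uint32_t* text, size_t length, size_t index)
{
    int letter_before = index > 0 && is_letter(text[index - 1]);
    int letter_after = index + 1 < length && is_letter(text[index + 1]);

    return letter_before && !letter_after;
}

/* Lowercases every capital sigma in place, leaving all other code
   points untouched. No locale is consulted at any point. */
void lowercase_sigmas(uint32_t* text, size_t length)
{
    size_t i;

    assert(text != NULL || length == 0);

    for (i = 0; i < length; ++i) {
        if (text[i] == 0x03A3) {
            text[i] = is_final_sigma(text, length, i) ? 0x03C2 : 0x03C3;
        }
    }
}
```

With this sketch, the sigma at the end of ΟΔΟΣ becomes ς, while a sigma at the start of a word or standing on its own becomes σ.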
> branch prediction algorithms in processors everywhere wept
You can use compiler builtins like GCC’s __builtin_expect() (wrapped as unlikely() in the Linux kernel) to partly alleviate that.
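As a rough illustration of that suggestion, the hint could be wrapped in a macro like the one below. `UNLIKELY` and `count_capital_sigmas()` are names made up for this sketch. The builtin is specific to GCC and Clang, so other compilers get a plain fallback, and since the hint mainly influences how the compiler arranges the generated code, any benefit would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* __builtin_expect is a GCC/Clang builtin; elsewhere the macro is a no-op. */
#if defined(__GNUC__) || defined(__clang__)
    #define UNLIKELY(condition) __builtin_expect(!!(condition), 0)
#else
    #define UNLIKELY(condition) (condition)
#endif

/* Counts capital sigmas while telling the compiler they are rare. */
size_t count_capital_sigmas(const uint32_t* text, size_t length)
{
    size_t i;
    size_t count = 0;

    assert(text != NULL || length == 0);

    for (i = 0; i < length; ++i) {
        if (UNLIKELY(text[i] == 0x03A3)) {
            count++;
        }
    }

    return count;
}
```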