Abstract:
A data structure for storing the properties of Unicode characters.

Created by Peter Kankowski
Last changed
Filed under Algorithms

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Two-stage tables for storing Unicode character properties

When dealing with Unicode strings, you often need to get character properties: character category (letter, digit, mark, etc.), its lowercase and uppercase counterparts, shifts for Boyer-Moore algorithm, and so on. There are about one million characters in Unicode, so an array indexed by the character value would take too much space.

The property values often repeat themselves, for example, all Chinese characters have the same value of script and category. Using the specifically designed algorithm, we can compress the repetitions without sacrificing performance. However, many programmers failed to find a good solution to this task.

Suboptimal solutions

  • Chapter 5 in Beautiful code demonstrates using a long switch or series of if statements to check whether the Unicode character is a letter or a digit. Something like that:
  • switch(ch) {
        case 'a': return true;
        case 'b': return true;
        case 'c': return true;
        case 'd': return true;
        // ...
        case 'z': return true;
        case '{': return false;
        case '|': return false;
        case '}': return false;
        case '~': return false;
        // another 65 409 cases...
    }
    

    (As Ace witty noted, this function should belong to The Daily WTF, not to Beautiful code.) The author's final solution is a 64K table with character properties, which is bloated and just wrong, because Unicode has more than 65536 characters.

  • PCRE 7.7, the regular expression library used in Apache and PHP, locates the character property in a table using binary search; spans of similar characters are stored as one array item. This is a better solution, but still a suboptimal one — O(log N).

Multi-stage table

The optimal solution is O(1); it requires just 6 instructions on x86, without any branches or slow instructions. It's called a two-stage table.

Multistage table structure

Assume there is an array of character properties (32, 0, 32, 0, 0, 0, ..., -16). To compress them, you can separate the array into blocks. For performance reasons, the block size should be a power of 2 (in this example, it is four).

When creating the table, the program eliminates the repeating blocks and place only the unique blocks in the second stage table. The first stage table contains pointers to these unique blocks. Often, the first stage items point to the same block, so the data is effectively compressed. For example, all blocks with values (32, 0, 32, 0) are replaced by pointers to the block 0 containing these values.

In practice, you should store the number of a unique block, not the pointer, in the first stage table, because the former takes less space. The second stage table is implemented as an array divided into fixed-size blocks.

The code to look up the table takes just three lines:

unsigned get_property(unsigned ch)
{
    const unsigned BLOCK_SIZE = 256;
    unsigned block_offset = stage1[ch / BLOCK_SIZE] * BLOCK_SIZE;
    return stage2[block_offset + ch % BLOCK_SIZE];
}

A script to create the table

The Python script below creates the table. It avoids building an intermediate table with all characters; only the current block is kept in memory. The smallest possible data type for the table items is automatically chosen. The script builds the table for Unicode category, but it can be easily adapted to another property (just change __decode method and def_val constant in UnicodeDataExtractor).

Download the script for building multistage tables

Starting from version 7.8, PCRE includes a multi-stage table implementation based on this script.

A similar script in used in Python source code. Their script combines several character properties (category, combining class, and bidirectional info) into one table and automatically finds the value of block size yielding the smallest table. Python interpreter uses two-stage tables for the functions in unicodedata module and string.upper() / lower() functions.

Conclusion

Unicode standard explicitly recommends using two-stage tables (see chapter 5.1). So, the moral is to read the standard and follow it instead of trying to reinvent the wheel, as some programmers mentioned above did.

Peter Kankowski
Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP. He can be reached at kankowski@gmail.com.

Created by Peter Kankowski
Last changed

27 comments

Ten recent comments are shown below. Show all comments

CodeVisio,

Hi Peter,

I followed your explanation but probably I missed something because I didn't understand your multi-stage table.

The indexes of the array of characters properties you're talking about at the beginning of you article are the unicode code points?

Peter Kankowski,

Yes, and the array is compressed by replacing the repeating blocks with pointers to unique blocks. Please ask if anything is unclear.

CodeVisio,

Thanks for the answer!

I think I got it, let's resume without "optimization" tricks:

You build an array of character properties based on the .txt(s) from unicode org.

In a first instance the array length is 0x10FFFF. In your graph is called stage1.

After, you proceed subdividing the stage1 in fixed block size, 256 bytes, and since there are many

code point ranges with the same properties those blocks "perform" some sort of optimization.

Those blocks of properties go inside a second array called stage2.

Time by time you store a new block of properties inside stage2 you update stage1 with the relative pointers, better with the block number or you update stage1 with an existing block already computed.

After that you will have ranges of elements of stage1 repeated because the repetition properties of code points, and then you can compress the stage1 array and so reducing spaces.

All of this at "building" time, that is when you're going to build the two stage tables. Once finished you can simply store as normal array, stage1 and stage2, inside your code.

is it so?

If yes and if you have already done this what is the percentage of compression?

I meant if you're going to store the entire array of char properties for all code points and each element of the array is 4 bytes you will get about 4MB (0x10FFFF x 4) of spaces occupied while with your method?

Thanks 
Peter Kankowski,

Yes, you've got it :) There is a Python script attached to this article (see the link above). If you run it for Unicode 6.0 character properties, the tables will take 31 KB. Each element of the array is 1 byte, so it would be around 1 MB without the compression.

Igor Lubashev,

Ace, the branch in question is a test for the magic bit -- "is it a pointer to the next block, or is it the data itself?" This branch will usually resolve the same and be very predictable.

What you are talking, I believe, is the expense of an extra level. Branch prediction has little to offer here, of course. However, it is important to see how many level will be triversed in the common case. It is possible (I did not check this) if the common case would not have to triverse all levels and, instead, would be optimized into higher-level blocks.

Axel Rietschin,

The article is crystal-clear but the comments are very confusing: where is the "if" statement and tests for "magic bit" in

unsigned get_property(unsigned ch)
{
    return stage2[stage1[ch / BLOCK_SIZE] * BLOCK_SIZE + ch % BLOCK_SIZE];
}

???

Peter Kankowski,

Axel, see this comment. In short: Igor has a different idea, which involves the "if" statement.

Axel Rietschin,

Oh OK I did not realize that your site did not display all comments by default. Reading from the middle on did not make any sense. Anyway I like your simpler 2-stage method better provided one can live with a 27K table.

Anonymous,

The use of Unicode in general is a problem. Don't use it.

Peter Kankowski,

Why do you think that Unicode usage is a problem?

Your name:


Comment:


Please ignore this field: