Abstract:
A data structure for storing the properties of Unicode characters.

Created by Peter Kankowski
Last changed
Filed under Algorithms

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Two-stage tables for storing Unicode character properties

When dealing with Unicode strings, you often need to get character properties: character category (letter, digit, mark, etc.), its lowercase and uppercase counterparts, shifts for Boyer-Moore algorithm, and so on. There are about one million characters in Unicode, so an array indexed by the character value would take too much space.

The property values often repeat themselves, for example, all Chinese characters have the same value of script and category. Using the specifically designed algorithm, we can compress the repetitions without sacrificing performance. However, many programmers failed to find a good solution to this task.

Suboptimal solutions

  • Chapter 5 in Beautiful code demonstrates using a long switch or series of if statements to check whether the Unicode character is a letter or a digit. Something like that:
  • switch(ch) {
        case 'a': return true;
        case 'b': return true;
        case 'c': return true;
        case 'd': return true;
        // ...
        case 'z': return true;
        case '{': return false;
        case '|': return false;
        case '}': return false;
        case '~': return false;
        // another 65 409 cases...
    }
    

    (As Ace witty noted, this function should belong to The Daily WTF, not to Beautiful code.) The author's final solution is a 64K table with character properties, which is bloated and just wrong, because Unicode has more than 65536 characters.

  • PCRE 7.7, the regular expression library used in Apache and PHP, locates the character property in a table using binary search; spans of similar characters are stored as one array item. This is a better solution, but still a suboptimal one — O(log N).

Multi-stage table

The optimal solution is O(1); it requires just 6 instructions on x86, without any branches or slow instructions. It's called a two-stage table.

Multistage table structure

Assume there is an array of character properties (32, 0, 32, 0, 0, 0, ..., -16). To compress them, you can separate the array into blocks. For performance reasons, the block size should be a power of 2 (in this example, it is four).

When creating the table, the program eliminates the repeating blocks and place only the unique blocks in the second stage table. The first stage table contains pointers to these unique blocks. Often, the first stage items point to the same block, so the data is effectively compressed. For example, all blocks with values (32, 0, 32, 0) are replaced by pointers to the block 0 containing these values.

In practice, you should store the number of a unique block, not the pointer, in the first stage table, because the former takes less space. The second stage table is implemented as an array divided into fixed-size blocks.

The code to look up the table takes just three lines:

unsigned get_property(unsigned ch)
{
    const unsigned BLOCK_SIZE = 256;
    unsigned block_offset = stage1[ch / BLOCK_SIZE] * BLOCK_SIZE;
    return stage2[block_offset + ch % BLOCK_SIZE];
}

A script to create the table

The Python script below creates the table. It avoids building an intermediate table with all characters; only the current block is kept in memory. The smallest possible data type for the table items is automatically chosen. The script builds the table for Unicode category, but it can be easily adapted to another property (just change __decode method and def_val constant in UnicodeDataExtractor).

Download the script for building multistage tables

Starting from version 7.8, PCRE includes a multi-stage table implementation based on this script.

A similar script in used in Python source code. Their script combines several character properties (category, combining class, and bidirectional info) into one table and automatically finds the value of block size yielding the smallest table. Python interpreter uses two-stage tables for the functions in unicodedata module and string.upper() / lower() functions.

Conclusion

Unicode standard explicitly recommends using two-stage tables (see chapter 5.1). So, the moral is to read the standard and follow it instead of trying to reinvent the wheel, as some programmers mentioned above did.

Peter Kankowski
Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP.

Created by Peter Kankowski
Last changed

31 comments

Ten recent comments are shown below. Show all comments

Igor Lubashev,

Ace, the branch in question is a test for the magic bit -- "is it a pointer to the next block, or is it the data itself?" This branch will usually resolve the same and be very predictable.

What you are talking, I believe, is the expense of an extra level. Branch prediction has little to offer here, of course. However, it is important to see how many level will be triversed in the common case. It is possible (I did not check this) if the common case would not have to triverse all levels and, instead, would be optimized into higher-level blocks.

Axel Rietschin,

The article is crystal-clear but the comments are very confusing: where is the "if" statement and tests for "magic bit" in

unsigned get_property(unsigned ch)
{
    return stage2[stage1[ch / BLOCK_SIZE] * BLOCK_SIZE + ch % BLOCK_SIZE];
}

???

Peter Kankowski,

Axel, see this comment. In short: Igor has a different idea, which involves the "if" statement.

Axel Rietschin,

Oh OK I did not realize that your site did not display all comments by default. Reading from the middle on did not make any sense. Anyway I like your simpler 2-stage method better provided one can live with a 27K table.

Frank Earnest,

Two-stage tables are pretty "sub-optimal" for space in many use cases. Not every program needs the full set of properties. Some just need to know e.g. all alphanumeric codepoints or all non-spacing marks, etc. With binary search that can be done with a 2-3 KiB interval table.

Making the distinction between O(log N) and O(1) here is almost completely irrelevant, since there's a constant, known upper bound for N. If your (relatively) massive table causes even a single, extra cache-line fill vs. the smaller table you've already lost, despite log(N) being 14 or so.

Peter Kankowski,

For binary search, you have to run a loop to look up in the interval table. For a multi-stage table, your lookup code is completely branchless (see the three lines above). It's a noticeable difference for modern processors with branch prediction.

PCRE and Boost authors decided to use my script. You can always run a benchmark to see which approach wins in your particular case.

Frank Earnest,

> It's a noticeable difference for modern processors with branch prediction.

Most of the branches in a binary search loop *are* predictable though. It's a backwards branch forming a loop that's only going to break once -- either when it's exhausted the search or found a match. Saying something to the effect of "branch prediction bad" is not particularly insightful.

Also, a cacheline fill is an order of magnitude more expensive than a mispredicted branch. Unpredictable branches are certainly bad for performance, but it's gotten to the point where saying "hurrr branches bad" is just an overused meme without any sane quantification or analysis of what you're replacing it with.

Frank Earnest,

That was supposed to be "branch [mis]prediction bad".

Peter Kankowski,

Frank, please feel free to do a benchmark

Frank Earnest,

I'm not interested in doing artificial benchmarks on a whim, I just wanted to point out the insanity of this statement:

"The table will be smaller, but you will have a branch (if statement) in your code, so it will be slower, because unpredictable branches are expensive on modern processors."

A branch misprediction on "modern processors" costs about 16 cycles. In some situations that's a big deal -- but treating branches like some kind of unknown quantity, to be avoided at all costs, is a very weird phenomenon.

Especially when you're comparing it to something that adds tons of extra cache pressure, which won't even register in synthetic benchmarks but could easily destroy performance when you actually throw it into a real application.

Your name:


Comment:


Please ignore this field: