Abstract:
A data structure for storing the properties of Unicode characters.

Created by Peter Kankowski
Last changed
Filed under Algorithms

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Two-stage tables for storing Unicode character properties

When dealing with Unicode strings, you often need to get character properties: character category (letter, digit, mark, etc.), its lowercase and uppercase counterparts, shifts for Boyer-Moore algorithm, and so on. There are about one million characters in Unicode, so an array indexed by the character value would take too much space.

The property values often repeat themselves, for example, all Chinese characters have the same value of script and category. Using the specifically designed algorithm, we can compress the repetitions without sacrificing performance. However, many programmers failed to find a good solution to this task.

Suboptimal solutions

  • Chapter 5 in Beautiful code demonstrates using a long switch or series of if statements to check whether the Unicode character is a letter or a digit. Something like that:
  • switch(ch) {
        case 'a': return true;
        case 'b': return true;
        case 'c': return true;
        case 'd': return true;
        // ...
        case 'z': return true;
        case '{': return false;
        case '|': return false;
        case '}': return false;
        case '~': return false;
        // another 65 409 cases...
    }
    

    (As Ace witty noted, this function should belong to The Daily WTF, not to Beautiful code.) The author's final solution is a 64K table with character properties, which is bloated and just wrong, because Unicode has more than 65536 characters.

  • PCRE 7.7, the regular expression library used in Apache and PHP, locates the character property in a table using binary search; spans of similar characters are stored as one array item. This is a better solution, but still a suboptimal one — O(log N).

Multi-stage table

The optimal solution is O(1); it requires just 6 instructions on x86, without any branches or slow instructions. It's called a two-stage table.

Multistage table structure

Assume there is an array of character properties (32, 0, 32, 0, 0, 0, ..., -16). To compress them, you can separate the array into blocks. For performance reasons, the block size should be a power of 2 (in this example, it is four).

When creating the table, the program eliminates the repeating blocks and place only the unique blocks in the second stage table. The first stage table contains pointers to these unique blocks. Often, the first stage items point to the same block, so the data is effectively compressed. For example, all blocks with values (32, 0, 32, 0) are replaced by pointers to the block 0 containing these values.

In practice, you should store the number of a unique block, not the pointer, in the first stage table, because the former takes less space. The second stage table is implemented as an array divided into fixed-size blocks.

The code to look up the table takes just three lines:

unsigned get_property(unsigned ch)
{
    const unsigned BLOCK_SIZE = 256;
    unsigned block_offset = stage1[ch / BLOCK_SIZE] * BLOCK_SIZE;
    return stage2[block_offset + ch % BLOCK_SIZE];
}

A script to create the table

The Python script below creates the table. It avoids building an intermediate table with all characters; only the current block is kept in memory. The smallest possible data type for the table items is automatically chosen. The script builds the table for Unicode category, but it can be easily adapted to another property (just change __decode method and def_val constant in UnicodeDataExtractor).

Download the script for building multistage tables

Starting from version 7.8, PCRE includes a multi-stage table implementation based on this script.

A similar script in used in Python source code. Their script combines several character properties (category, combining class, and bidirectional info) into one table and automatically finds the value of block size yielding the smallest table. Python interpreter uses two-stage tables for the functions in unicodedata module and string.upper() / lower() functions.

Conclusion

Unicode standard explicitly recommends using two-stage tables (see chapter 5.1). So, the moral is to read the standard and follow it instead of trying to reinvent the wheel, as some programmers mentioned above did.

Peter Kankowski
Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP.

Created by Peter Kankowski
Last changed

25 comments

ace,
Excellent article!

I have an idea for which I don't expect that it can give better results in this case, but I've had an idea: imagine using a little smaller blocks, and such that the start of the block can be anywhere in the bigger table and not only at equally spaced points. To build the resulting tables, I believe that for searching for the subblock anywhere inside of the already formed bigger table some of the algorithms used in compression algorithms can be used. It would be interesting to try.

P.S. Congratulations -- your article is at the moment the most relevant on the web on the subject of two stage tables. I've tried to search for "two-stage table" and the biggest search site returns your article as the first match (I ho. Other don't know about it but give in matches or as sponsored links "Shop for a Table for two" and "Buy a Table for two at ..."
Won,
The problem with having variable-sized blocks is that you lose the O(1) lookup. On the other hand, you can potentially make things much more compact by overlapping blocks. Without thinking too deeply about it, you might be able to find all the overlapping substrings in a string by using a variant of Knuth-Morris-Pratt string matching (oddly enough, this came up in a conversation with a coworker recently).

In the cases where the codepoint attribute is a bool, somethine like a BDD (binary decision diagram) or a perfect hash might work without a table.
Peter Kankowski,
Thank you, I did not know about BDD.

Ace probably means not variable-sized blocks, but the blocks that can start at offsets not divisible by block size. In this case, you will have to store the full offset instead of the block number, so the first stage table will require a larger data type (e.g., unsigned short instead of unsigned char).

It's still an interesting idea. Thank you, Ace!

If you need to store several properties (e.g., script and category), you can save space by combining them into one table. Python creates the third table ("database record") storing all unique combinations of the properties; a two-stage table contains the offset in this table. Such table will be more compact, because all properties are more or less correlated.

I recently tried a three-stage table; it's much smaller, too.
ace,
> the blocks that can start at offsets not divisible by block size.

Right.

> so the first stage table will require a larger data type (e.g., unsigned short instead of unsigned char).

Yes, that's why I expressed doubt (albeit the whole message was confusing as the result of careless editing) if it would work, since increasing the size of the first table pays off only if there are significantly less new tables introduced compared to the simplest solution.

However it's still an interesting theme for experiments. In such a setup, in the cases where some existing subtable doesn't already overlap at the start with another it would allow adding prefix to it (inserting something inside of the resulting table) in order to minimize the resulting table.
Harold,
Thanks for another great article!

Speaking of Unicode, I was wondering what's the proper way of converting the hash functions you've been discussing to work with Unicode strings.
Won,
The simple answer is: nothing. Just treat the Unicode string as a bunch of bits. In that, it doesn't even really matter what the encoding format.

The subtlety is related to a problem you might have for comparing Unicode strings for equality, since Unicode has the notion of equivalent (and compatible) character sequences and normalization.

http://en.wikipedia.org/wiki/Unicode_normalization

There are lots of tricky things about Unicode and multi-byte encodings...
Peter Kankowski,
Thank you for the good question.

A multiplicative hash function will give you a better result if you process the string by characters (wchar_t), not by bytes. Small multipliers should still work nicely, because most users will use only the letters of their language, so the range of unique characters will be small. (Notable exceptions are ideographic and syllabic scripts, e.g., Chinese, Japanese, or Korean, which have a lot of unique characters.)

A more complex hash function such as Murmur or Paul Hsieh's function can be used without modification, as Won suggested.
Harold,
Thanks - good to know I wasn't doing something obviously stupid. :)
Peter Kankowski,
I've finished a patch to PCRE 7.7 that enables it to use a two-stage table instead of the binary search algorithm.

                     speed         size for Unicode 5.0
old algorithm    152 ticks per char    24 KB
2-stage            8 ticks per char    46 KB
3-stage           13 ticks per char    22 KB


Philip Hazel, the author of PCRE, chose the 2-stage table, which provides better performance at cost of twice larger memory usage. The patch will be included in the next version of PCRE.
ace,
> The patch will be included in the next version of PCRE.

Congratulations!
Igor Lubashev,

You can also employ the following very effective trick. There are many level-0 blocks (blocks containing the actual data) that contain all the same value (like Block 1 and Block 2 in your example). You do not have to store this entire block just so that you can later stuff pointers/indexes to it. Instead, you can store that repeated value in the higher-level stage(s) directly (if you can spare 1 bit to encode this pointer as "non-pointer").

This optimization can be used even more effectively with a higher-order staged tables, since the special "repeated" value can propogate even higher (for example, if level-1 stage block is full of the same "non-pointer" values, the block does not need to be created and, instead, the same "non-pointer" value can be stored in the stage above). The tables may detect "same-pointer" blocks in higher-order stages (not just leaf blocks), but this would require keeping more "extra" bits to distinguish "non-pointer" leaf data from "non-pointer" "skip-stage" pointers. So I did not do this in my attempts below.

My experience with a particular unicode subset ( U+000000 - U+10FFFF):

"original" 2-stage table. Size: 27648 bytes.

  • Stage 0. Block size: 256 bytes.
  • Stage 1. Block size: inf bytes.

"optimized" 2-stage table. Size: 26112 bytes.

  • Stage 0. Block size: 256 bytes.
  • Stage 1. Block size: inf bytes.

"optimized" 3-stage table. Size: 12112 bytes.

  • Stage 0. Block size: 16 bytes.
  • Stage 1. Block size: 64 shorts.
  • Stage 2. Block size: inf bytes.

"optimized" 4-stage table. Size: 9632 bytes.

  • Stage 0. Block size: 8 bytes.
  • Stage 1. Block size: 16 shorts.
  • Stage 2. Block size: 16 shorts.
  • Stage 3. Block size: inf bytes.

"optimized" 5-stage table. Size: 8208 bytes.

  • Stage 0. Block size: 4 bytes.
  • Stage 1. Block size: 4 shorts.
  • Stage 2. Block size: 16 shorts.
  • Stage 3. Block size: 16 bytes.
  • Stage 4. Block size: inf bytes.
Igor Lubashev,

Sorry for the botched wiki formatting in the previous comment. Different wikis have different formatting rules. If someone can clean this up, it would be great.

Peter Kankowski,

Igor, thank you very much; it's an interesting idea. The table will be smaller, but you will have a branch (if statement) in your code, so it will be slower, because unpredictable branches are expensive on modern processors.

So, it's a time-space compromise (smaller table or faster code). If you want a smaller table, you could also explore data compression algorithms, especially RLE. The table can be uncompressed once (when your program starts up), then queried many times.

Igor,
Actually, the branch should be very predictable, if the code is mostly dealing with the ASCII range. Also, smaller tables are more cache-friendly, so here is a potential performance boost. I suspect that only a true performance benchmark in a typical application would be able to tell for sure what dominates here -- an extra (mostly predicted) branch or a slightly larger cache footprint.
ace,
The current CPU's can't magically predict where the switch goes, they just do a good job for the equivalents of "ifs", and the prediction as it is implemented is actually just a way to remember the default if choices inside of the loop. When you have to jump somewhere else for each character, no CPU can cope with that.

On another side, I fully agree that no normal application is not doing just one small task and then the cache pressure often becomes more important than the speed of the segment that was benchmarked with the full caches.
Austin Hastings,

In Igor's case, the branch is very predictable. The jumps are short, forward jumps. Most of the time the branch target will be inside the current or next cache line. (Recall that the OP mentions 6 ops for a 2-stage table - this is *small* code.)

 

ace,

> the branch is very predictable

I'm very surprised to hear something like this. Can you please explain how would you predict what the next character to appear, not in hardware, but just as an algorithm or some statistical analysis? I'm still not able to imagine any predictor which would predict random jumps.

> Most of the time the branch target will be inside (...) cache line

That doesn't help if you have to wait the whole pipeline length because the target was not predicted. You'd lose 'length of the pipeline' T states. However I also think that if some code is used infrequently enough (so that the relevant cache entries are empty on the start) the "lighter on caches" algorithm has more chances to win.

CodeVisio,

Hi Peter,

I followed your explanation but probably I missed something because I didn't understand your multi-stage table.

The indexes of the array of characters properties you're talking about at the beginning of you article are the unicode code points?

Peter Kankowski,
Yes, and the array is compressed by replacing the repeating blocks with pointers to unique blocks. Please ask if anything is unclear.
CodeVisio,

Thanks for the answer!

I think I got it, let's resume without "optimization" tricks:

You build an array of character properties based on the .txt(s) from unicode org.

In a first instance the array length is 0x10FFFF. In your graph is called stage1.

After, you proceed subdividing the stage1 in fixed block size, 256 bytes, and since there are many

code point ranges with the same properties those blocks "perform" some sort of optimization.

Those blocks of properties go inside a second array called stage2.

Time by time you store a new block of properties inside stage2 you update stage1 with the relative pointers, better with the block number or you update stage1 with an existing block already computed.

After that you will have ranges of elements of stage1 repeated because the repetition properties of code points, and then you can compress the stage1 array and so reducing spaces.

All of this at "building" time, that is when you're going to build the two stage tables. Once finished you can simply store as normal array, stage1 and stage2, inside your code.

is it so?

If yes and if you have already done this what is the percentage of compression?

I meant if you're going to store the entire array of char properties for all code points and each element of the array is 4 bytes you will get about 4MB (0x10FFFF x 4) of spaces occupied while with your method?

Thanks 
Peter Kankowski,
Yes, you've got it :) There is a Python script attached to this article (see the link above). If you run it for Unicode 6.0 character properties, the tables will take 31 KB. Each element of the array is 1 byte, so it would be around 1 MB without the compression.
Igor Lubashev,

Ace, the branch in question is a test for the magic bit -- "is it a pointer to the next block, or is it the data itself?" This branch will usually resolve the same and be very predictable.

What you are talking, I believe, is the expense of an extra level. Branch prediction has little to offer here, of course. However, it is important to see how many level will be triversed in the common case. It is possible (I did not check this) if the common case would not have to triverse all levels and, instead, would be optimized into higher-level blocks.

Axel Rietschin,

The article is crystal-clear but the comments are very confusing: where is the "if" statement and tests for "magic bit" in

unsigned get_property(unsigned ch)
{
    return stage2[stage1[ch / BLOCK_SIZE] * BLOCK_SIZE + ch % BLOCK_SIZE];
}

???

Peter Kankowski,

Axel, see this comment. In short: Igor has a different idea, which involves the "if" statement.

Axel Rietschin,

Oh OK I did not realize that your site did not display all comments by default. Reading from the middle on did not make any sense. Anyway I like your simpler 2-stage method better provided one can live with a 27K table.

Your name:


Comment:


Please ignore this field: