Abstract:
A data structure for storing the properties of Unicode characters.

Created 2 years ago by Peter Kankowski
Last changed 2 years ago
Filed under Algorithms

Reddit Digg Delicious Buzz Facebook Twitter

Two-stage tables for storing Unicode character properties

When dealing with Unicode strings, you often need to get character properties: character category (letter, digit, mark, etc.), its lowercase and uppercase counterparts, shifts for Boyer-Moore algorithm, and so on. There are about one million characters in Unicode, so an array indexed by the character value would take too much space.

The property values often repeat themselves, for example, all Chinese characters have the same value of script and category. Using the specifically designed algorithm, we can compress the repetitions without sacrificing performance. However, many programmers failed to find a good solution to this task.

Suboptimal solutions

Multi-stage table

The optimal solution is O(1); it requires just 6 instructions on x86, without any branches or slow instructions. It's called a two-stage table.

Multistage table structure

Assume there is an array of character properties (32, 0, 32, 0, 0, 0, ..., -16). To compress them, you can separate the array into blocks. For performance reasons, the block size should be a power of 2 (in this example, it is four).

When creating the table, the program eliminates the repeating blocks and place only the unique blocks in the second stage table. The first stage table contains pointers to these unique blocks. Often, the first stage items point to the same block, so the data is effectively compressed. For example, all blocks with values (32, 0, 32, 0) are replaced by pointers to the block 0 containing these values.

In practice, you should store the number of a unique block, not the pointer, in the first stage table, because the former takes less space. The second stage table is implemented as an array divided into fixed-size blocks.

The code to look up the table takes just three lines:

unsigned get_property(unsigned ch)
{
    const unsigned BLOCK_SIZE = 256;
    unsigned block_offset = stage1[ch / BLOCK_SIZE] * BLOCK_SIZE;
    return stage2[block_offset + ch % BLOCK_SIZE];
}

A script to create the table

The Python script below creates the table. It avoids building an intermediate table with all characters; only the current block is kept in memory. The smallest possible data type for the table items is automatically chosen. The script builds the table for Unicode category, but it can be easily adapted to another property (just change __decode method and def_val constant in UnicodeDataExtractor).

Download the script for building multistage tables

Starting from version 7.8, PCRE includes a multi-stage table implementation based on this script.

A similar script in used in Python source code. Their script combines several character properties (category, combining class, and bidirectional info) into one table and automatically finds the value of block size yielding the smallest table. Python interpreter uses two-stage tables for the functions in unicodedata module and string.upper() / lower() functions.

Conclusion

Unicode standard explicitly recommends using two-stage tables (see chapter 5.1). So, the moral is to read the standard and follow it instead of trying to reinvent the wheel, as some programmers mentioned above did.

13 comments

Ten recent comments are shown below. Show all comments

ace, 2 years ago
> the blocks that can start at offsets not divisible by block size.

Right.

> so the first stage table will require a larger data type (e.g., unsigned short instead of unsigned char).

Yes, that's why I expressed doubt (albeit the whole message was confusing as the result of careless editing) if it would work, since increasing the size of the first table pays off only if there are significantly less new tables introduced compared to the simplest solution.

However it's still an interesting theme for experiments. In such a setup, in the cases where some existing subtable doesn't already overlap at the start with another it would allow adding prefix to it (inserting something inside of the resulting table) in order to minimize the resulting table.
Harold, 2 years ago
Thanks for another great article!

Speaking of Unicode, I was wondering what's the proper way of converting the hash functions you've been discussing to work with Unicode strings.
Won, 2 years ago
The simple answer is: nothing. Just treat the Unicode string as a bunch of bits. In that, it doesn't even really matter what the encoding format.

The subtlety is related to a problem you might have for comparing Unicode strings for equality, since Unicode has the notion of equivalent (and compatible) character sequences and normalization.

http://en.wikipedia.org/wiki/Unicode_normalization

There are lots of tricky things about Unicode and multi-byte encodings...
Peter Kankowski, 2 years ago
Thank you for the good question.

A multiplicative hash function will give you a better result if you process the string by characters (wchar_t), not by bytes. Small multipliers should still work nicely, because most users will use only the letters of their language, so the range of unique characters will be small. (Notable exceptions are ideographic and syllabic scripts, e.g., Chinese, Japanese, or Korean, which have a lot of unique characters.)

A more complex hash function such as Murmur or Paul Hsieh's function can be used without modification, as Won suggested.
Harold, 2 years ago
Thanks - good to know I wasn't doing something obviously stupid. :)
Peter Kankowski, 2 years ago
I've finished a patch to PCRE 7.7 that enables it to use a two-stage table instead of the binary search algorithm.

                     speed         size for Unicode 5.0
old algorithm    152 ticks per char    24 KB
2-stage            8 ticks per char    46 KB
3-stage           13 ticks per char    22 KB


Philip Hazel, the author of PCRE, chose the 2-stage table, which provides better performance at cost of twice larger memory usage. The patch will be included in the next version of PCRE.
ace, 2 years ago
> The patch will be included in the next version of PCRE.

Congratulations!
Igor Lubashev, 1 year ago

You can also employ the following very effective trick. There are many level-0 blocks (blocks containing the actual data) that contain all the same value (like Block 1 and Block 2 in your example). You do not have to store this entire block just so that you can later stuff pointers/indexes to it. Instead, you can store that repeated value in the higher-level stage(s) directly (if you can spare 1 bit to encode this pointer as "non-pointer").

This optimization can be used even more effectively with a higher-order staged tables, since the special "repeated" value can propogate even higher (for example, if level-1 stage block is full of the same "non-pointer" values, the block does not need to be created and, instead, the same "non-pointer" value can be stored in the stage above). The tables may detect "same-pointer" blocks in higher-order stages (not just leaf blocks), but this would require keeping more "extra" bits to distinguish "non-pointer" leaf data from "non-pointer" "skip-stage" pointers. So I did not do this in my attempts below.

My experience with a particular unicode subset ( U+000000 - U+10FFFF):

"original" 2-stage table. Size: 27648 bytes.

  • Stage 0. Block size: 256 bytes.
  • Stage 1. Block size: inf bytes.

"optimized" 2-stage table. Size: 26112 bytes.

  • Stage 0. Block size: 256 bytes.
  • Stage 1. Block size: inf bytes.

"optimized" 3-stage table. Size: 12112 bytes.

  • Stage 0. Block size: 16 bytes.
  • Stage 1. Block size: 64 shorts.
  • Stage 2. Block size: inf bytes.

"optimized" 4-stage table. Size: 9632 bytes.

  • Stage 0. Block size: 8 bytes.
  • Stage 1. Block size: 16 shorts.
  • Stage 2. Block size: 16 shorts.
  • Stage 3. Block size: inf bytes.

"optimized" 5-stage table. Size: 8208 bytes.

  • Stage 0. Block size: 4 bytes.
  • Stage 1. Block size: 4 shorts.
  • Stage 2. Block size: 16 shorts.
  • Stage 3. Block size: 16 bytes.
  • Stage 4. Block size: inf bytes.
Igor Lubashev, 1 year ago

Sorry for the botched wiki formatting in the previous comment. Different wikis have different formatting rules. If someone can clean this up, it would be great.

Peter Kankowski, 1 year ago

Igor, thank you very much; it's an interesting idea. The table will be smaller, but you will have a branch (if statement) in your code, so it will be slower, because unpredictable branches are expensive on modern processors.

So, it's a time-space compromise (smaller table or faster code). If you want a smaller table, you could also explore data compression algorithms, especially RLE. The table can be uncompressed once (when your program starts up), then queried many times.

Your name:
Comment: