Plural forms

Introduction

Messages like "%d file(s) found" are notoriously hard to localize. English has only 2 forms: 1 file (singular) and 2 or more files (plural), but some other languages use 3, 4, or more plural forms. For example, there are 3 forms in Polish:

    0 plików
    1 plik
  2-4 pliki
 5-21 plików
22-24 pliki
25-31 plików
      etc.

Other languages (French, Russian, Czech, etc.) also use rules different from English and from each other.

The gettext library extracts a rule for plural form selection from the localization file. The rule is a C language expression, which is evaluated for each message. It's a universal solution, but an expression evaluator is probably overkill for this task.
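To make the idea concrete, here is the plural expression commonly given for Polish in gettext catalogs, hard-coded as an ordinary C function. This is only a sketch of what such a rule computes: the function name PolishPluralForm is made up for this example, and gettext itself evaluates the expression at run time instead of compiling it.

```c
/* Hypothetical helper (not part of gettext): the plural expression
   commonly given for Polish, written as a C function.
   Returns 0 for "plik", 1 for "pliki", and 2 for "plików". */
static unsigned PolishPluralForm(unsigned long n) {
    if (n == 1)
        return 0;
    /* 2-4, 22-24, 32-34, ..., but not 12-14 (nor 112-114, etc.) */
    if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20))
        return 1;
    return 2;
}
```

Supporting every language this way would take one such function per language, or an evaluator that can parse the expression from the file.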

Simpler solution

Here are some observations about the languages mentioned on the gettext page:

All additional plural forms are used for some range of numbers, e.g., from 2 to 4 in Slovak and Czech.
The pattern is often repeated for each 10 or 100 items. In Russian, it sounds like "twenty-one file", not "twenty-one files", because the noun agrees with the last digit, "one". The same pattern repeats for 31, 41, etc.
The numbers from 10 to 19 are often an exception to the rules. Just like 16 is spelled differently from 26, 36, 46, etc. in English: "sixteen" vs. "twenty-six", "thirty-six", and "forty-six".
Zero is treated differently in some languages, e.g. Romanian.

So, the rule for each plural form will consist of these components:

range_start  range_end  modulo_for_repetition  skip_teens_flag

Here are some examples:

English
singular:  range_start = 1, range_end = 1
plural:    all other numbers

Polish
singular:  range_start = 1, range_end = 1
plural1:   range_start = 2, range_end = 4, modulo = 10, skip_teens = true
plural2:   all other numbers

Irish
singular:  range_start = 1, range_end = 1
plural1:   range_start = 2, range_end = 2
plural2:   all other numbers

Lithuanian
singular:  range_start = 1, range_end = 1, modulo = 10, skip_teens = true
plural1:   range_start = 2, range_end = 9, modulo = 10, skip_teens = true
plural2:   all other numbers (multiples of 10 and the numbers from 11 to 19)
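The examples above can be checked with a small matcher. The following is a minimal sketch, not the code from plurals.c: the PluralRule struct, the MatchPluralRule function, and the exact matching semantics (a zero modulo means "no repetition", and "teens" means the last two digits are 10 to 19) are assumptions made for illustration.

```c
/* Hypothetical rule layout; the fields follow the components listed above. */
typedef struct {
    unsigned range_start;
    unsigned range_end;
    unsigned modulo;      /* 0 = the range is not repeated */
    int      skip_teens;  /* nonzero = numbers ending in 10..19 never match */
} PluralRule;

/* Returns the index of the first rule that matches n,
   or rule_count for "all other numbers". */
static unsigned MatchPluralRule(const PluralRule *rules, unsigned rule_count,
                                unsigned n) {
    unsigned i;
    for (i = 0; i < rule_count; i++) {
        const PluralRule *r = &rules[i];
        unsigned m = r->modulo ? n % r->modulo : n;
        if (r->skip_teens && n % 100 >= 10 && n % 100 <= 19)
            continue;
        if (m >= r->range_start && m <= r->range_end)
            return i;
    }
    return rule_count;
}

/* The Polish and Lithuanian rules from the examples above. */
static const PluralRule polish_rules[] = {
    { 1, 1, 0, 0 },    /* singular */
    { 2, 4, 10, 1 },   /* plural1 */
};
static const PluralRule lithuanian_rules[] = {
    { 1, 1, 10, 1 },   /* singular */
    { 2, 9, 10, 1 },   /* plural1 */
};
```

With these tables, MatchPluralRule(polish_rules, 2, 22) selects plural1, while 12 and 25 fall through to the last form.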

The rules for each language can be written as a short string, which is stored in the language file (e.g., for Lithuanian, the string is "1 1 10 t; 2 9 10 t").
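Reading such a string takes only a few lines of C. The sketch below is not the actual PluralsReadCfg; it assumes a grammar inferred from the single example string: rules are separated by semicolons, each rule is "start end [modulo [t]]", and a missing end means a one-number range. The names ParsedRule, ParsePluralRules, and MAX_RULES are hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>

#define MAX_RULES 4   /* assumed upper limit for this sketch */

typedef struct {
    unsigned long range_start, range_end, modulo;
    int skip_teens;
} ParsedRule;

/* Parses rules of the assumed form "start end [modulo [t]]" separated by ';'.
   Returns the number of rules read (at most MAX_RULES). */
static unsigned ParsePluralRules(const char *s, ParsedRule rules[MAX_RULES]) {
    unsigned count = 0;
    while (count < MAX_RULES) {
        ParsedRule r = { 0, 0, 0, 0 };
        char *end;
        r.range_start = strtoul(s, &end, 10);
        if (end == s)
            break;                        /* no number: end of input or malformed */
        s = end;
        r.range_end = strtoul(s, &end, 10);
        if (end == s)
            r.range_end = r.range_start;  /* a single number is a one-element range */
        s = end;
        r.modulo = strtoul(s, &end, 10);  /* optional; stays 0 if absent */
        s = end;
        while (isspace((unsigned char)*s))
            s++;
        if (*s == 't') {                  /* optional skip_teens flag */
            r.skip_teens = 1;
            s++;
        }
        rules[count++] = r;
        while (isspace((unsigned char)*s))
            s++;
        if (*s != ';')
            break;                        /* no more rules */
        s++;
    }
    return count;
}
```

The last plural form ("all other numbers") needs no rule of its own, which is why the Lithuanian string describes only two of the three forms.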

Using the Code

Include plurals.h and plurals.c in your project. The interface consists of two functions. First, you call PluralsReadCfg to read the rules from the string. Next, you pass a number to PluralsGetForm. It returns the index of the correct plural form for this number, which you use to read the string from your language file:

PLURAL_INFO plurals;
PluralsReadCfg(&plurals, ReadFromLngFile("PluralRules"));

char lng_str_name[16], message[128];
sprintf(lng_str_name, "FilesFound%d", PluralsGetForm(&plurals, number));
sprintf(message, ReadFromLngFile(lng_str_name), number);

In the language file, you have strings for each plural form:

PluralRules = "1"
FilesFound0 = "%d file found"
FilesFound1 = "%d files found"

ReadFromLngFile is your own function. You could wrap the two sprintf calls in a higher-level function (and, of course, use a bounds-checking function instead of sprintf to protect your program from buffer overflows).
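Such a wrapper might look like the sketch below. It uses snprintf, which requires C99 (older MSVC versions offer _snprintf instead, with slightly different semantics). FormatPluralMessage and DemoReadFromLngFile are hypothetical names; the latter is a stand-in for your own language-file reader, and the form index is expected to come from PluralsGetForm.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for illustration only: a real program would call its own
   language-file reader here. */
static const char *DemoReadFromLngFile(const char *name) {
    if (strcmp(name, "FilesFound0") == 0) return "%d file found";
    if (strcmp(name, "FilesFound1") == 0) return "%d files found";
    return NULL;
}

/* Builds the string name from base_name and the form index, then formats
   the message into buf. Returns 0 on success, -1 if the string is missing
   or the result does not fit. */
static int FormatPluralMessage(char *buf, size_t buf_size, const char *base_name,
                               unsigned form, int number) {
    char name[32];
    const char *fmt;
    int n = snprintf(name, sizeof name, "%s%u", base_name, form);
    if (n < 0 || (size_t)n >= sizeof name)
        return -1;
    fmt = DemoReadFromLngFile(name);
    if (fmt == NULL)
        return -1;
    /* The format string comes from the language file, so the file must be
       trusted to contain exactly one %d in each of these strings. */
    n = snprintf(buf, buf_size, fmt, number);
    return (n < 0 || (size_t)n >= buf_size) ? -1 : 0;
}
```

A call then becomes a single line, e.g. FormatPluralMessage(message, sizeof message, "FilesFound", form, number).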

An even better solution is to implement a custom formatting function, so that you could write something like "%d %(file|files) found" in the language file. Scott Rippey devised this technique and implemented it in VB .NET.
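A rough C equivalent is sketched below. It is not Scott Rippey's implementation: the function name FormatWithPlurals is hypothetical, and the syntax handling (only "%d" and "%(a|b|...)", with the last alternative reused when the form index is too large) is an assumption based on the single example string above.

```c
#include <stdio.h>
#include <string.h>

/* Expands a template such as "%d %(file|files) found": "%d" is replaced
   by number, and "%(a|b|...)" by the alternative with index form.
   Returns 0 on success, -1 if the output does not fit or a group is
   not closed. */
static int FormatWithPlurals(char *out, size_t out_size, const char *tmpl,
                             unsigned form, int number) {
    size_t len = 0;
    while (*tmpl) {
        if (tmpl[0] == '%' && tmpl[1] == 'd') {
            int n = snprintf(out + len, out_size - len, "%d", number);
            if (n < 0 || (size_t)n >= out_size - len)
                return -1;
            len += (size_t)n;
            tmpl += 2;
        } else if (tmpl[0] == '%' && tmpl[1] == '(') {
            const char *alt = tmpl + 2;
            const char *close = strchr(alt, ')');
            unsigned i;
            if (close == NULL)
                return -1;
            /* Skip to the requested alternative, stopping at the last one. */
            for (i = 0; i < form; i++) {
                const char *bar = memchr(alt, '|', (size_t)(close - alt));
                if (bar == NULL)
                    break;
                alt = bar + 1;
            }
            /* Copy up to the next '|' or the closing parenthesis. */
            while (alt < close && *alt != '|') {
                if (len + 1 >= out_size)
                    return -1;
                out[len++] = *alt++;
            }
            tmpl = close + 1;
        } else {
            if (len + 1 >= out_size)
                return -1;
            out[len++] = *tmpl++;
        }
    }
    if (len >= out_size)
        return -1;
    out[len] = '\0';
    return 0;
}
```

With this approach, the language file needs one string per message instead of one per plural form, and the translator sees the whole sentence at once.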

Conclusion

Two functions, PluralsReadCfg and PluralsGetForm, take 500 bytes in your executable file when compiled with MSVC++. A small price to pay for spelling your messages correctly in any language.

Download the source code (25 KB, MSVC++)

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP.

Created 16 years ago by Peter Kankowski
Last changed 15 years ago

12 comments

Ten recent comments are shown below.

Peter Kankowski, 13 years ago

Thank you, storing plural forms in a table (represented as string) is a very good idea! I was able to find the formulas for all languages when using your method. Here is the C code:

#include <stdlib.h>
#include <string.h>

static const char * formulas[] = {
    "", // ja
    "101", // en
    "001", // fr
    "1022222222222222222220222222222222222222", // lv
    "2012", // gd
    "10111111111111111111211111111111111111112", // ro
  // 0123456789 123456789 123456789 123456789
    "2011111111222222222220111111112222222222", // lt
    "2011122222222222222220111222222222222222", // ru
    "2011122222", // cs
    "2011122222222222222222111222222222222222", // pl
    "30122333333333333333301223333333", // sl
};

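// Returns the plural form index for num, decoded from a lookup-table formula
unsigned DecodePluralForm(int num, const char * formula) {
    size_t len = strlen(formula);
    if (len == 0)
        return 0; // the language has only one form
    // Take the magnitude in unsigned arithmetic, so that INT_MIN does not overflow
    size_t index = num < 0 ? 0u - (unsigned)num : (unsigned)num;
    if (index > 19 && len > 20) {
        // Numbers from 20 on use the second pattern, repeated modulo 100
        index = index % 100 + 20;
        // If the length is divisible by 10, the pattern also repeats modulo 10
        if (index >= len && len % 10 == 0)
            index = index % 10 + 20;
    }
    if (index >= len)
        index = len - 1; // the last digit covers all larger numbers
    return (unsigned)(formula[index] - '0');
}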

DRON, 13 years ago

>I was able to find the formulas for all languages

I forgot to mention the automatic translation from a gettext expression to a Formula (Delphi sources and EXE):

http://rghost.ru/3884964

Peter Kankowski, 13 years ago

Wow, RPN! :) But for this ad hoc evaluator, a programming language with an eval function and syntax close to C would be a reasonable choice. For example, in JavaScript:

var s = 'n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2';
eval('var f = function (n) { return ' + s + '}');
var a = [];
for(var n = 0; n < 100000; n++)
	a[n] = f(n);

It's very fast (2 milliseconds in Firefox on my machine), because the expression is evaluated only once.

I will try to explain your algorithm as I understood it (if something is wrong, please correct me).

A) If the formula length is 20 digits or fewer, it's a simple lookup table for each number, e.g. for Irish it's "2012", which means "the 2nd form for zero, the 0th form for one, the 1st form for two, and the 2nd one for three or more". The last digit is used for all numbers that fall beyond the end of the formula. These rules cover Western European languages.

B) Romanian and Slovenian have a special pattern for the numbers from 0 to 99. Another pattern is used for numbers from 100 to 199, from 200 to 299, and so on (modulo 100). You encode the first pattern (exactly 20 characters), then the second pattern (its length must not be divisible by 10). The last digit is used for larger numbers, again.

C) Russian, Polish, and Baltic languages are similar to B), but the second pattern is also repeated from 20 to 29, from 30 to 39, from 40 to 49, etc. (modulo 10). The function applies this rule if the formula length is divisible by 10. It's easy to pad it to 40 characters with the repeating last digit.

Generally, it's a very clever hack. Thank you so much for sharing!
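The three cases can be seen in isolation by looking only at how a number is mapped to a position in the formula. The helper below is a sketch that restates the index computation of DecodePluralForm for non-negative numbers; the name FormulaIndex is made up for this example.

```c
#include <stddef.h>

/* Maps a non-negative number to a position in a formula of length len
   (len must be greater than 0), following cases A, B, and C above. */
static size_t FormulaIndex(unsigned n, size_t len) {
    size_t index = n;
    if (n > 19 && len > 20) {
        index = n % 100 + 20;          /* B: second pattern, repeated modulo 100 */
        if (index >= len && len % 10 == 0)
            index = index % 10 + 20;   /* C: second pattern, repeated modulo 10 */
    }
    if (index >= len)
        index = len - 1;               /* A: the last digit covers the rest */
    return index;
}
```

For the 4-character Irish formula, 7 maps to position 3 (case A). For the 41-character Romanian formula, 101 maps to position 21 and 25 is clamped to position 40 (case B). For a 40-character formula such as the Russian one, 21 maps to position 21 and 111 to position 31 (case C).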

DRON, 13 years ago

>But for this ad hoc evaluator, a programming language with eval function and syntax close to C would be a reasonable choice.

You are right, but the original reason for the development was a simple localization framework for Delphi (not JavaScript). I started from a classic expression parser and finished with Formula.

>It's very fast (2 milliseconds in Firefox on my machine)

Interesting that Firefox doesn't use this "dynamic evaluation" method in its plugin system (JavaScript-based), but instead has a lot of predefined functions for the various languages:

https://developer.mozilla.org/En/Localization_and_Plurals

DRON, 13 years ago

>I will try to explain your algorithm as I understood it

You are correct, but I'm not a linguist, and I did not have any particular languages, plural forms, and so on in mind. I started by thinking about 3 arrays (20 numbers for the first hundred, 10 for the decade loop, and exclusions for *10..*19) and flags (loop every 10, every 100, etc.); then I combined all the arrays into one string, used tail reduction, and embedded the flags in the string size.

By the way, your method (using a condition to test each plural form) is similar to the Plural Rules standard:

http://cldr.unicode.org/index/cldr-spec/plural-rules

Peter Kankowski, 13 years ago

Thank you for the links. Note that Unicode and Mozilla have different rules for some languages. I asked them which ones are correct.

It's impressive how you started from the classic expression evaluator (similar to gettext) and came up with the smartly reduced lookup tables. Your formulas remind me of the spellchecker by Doug McIlroy, who compressed 75 000 English words into 65 536 bytes of memory using a sparse hash table (described in Programming Pearls).

What license do you have for this code? May other people use it? Maybe you would like to provide your full name to be credited in the documentation / about box.

DRON, 13 years ago

>Thank you for the links. Note that Unicode and Mozilla have different rules for some languages.

When testing the formulas, I found the same weird things: the expressions for some languages are presented in many different forms (even if you use reputable sources like W3C). The winner is Arabic: six different expressions, and not only do the algorithms differ, but even the number of plural forms!

>What license do you have for this code?

I didn't think that ten lines of code could have any license ;-) But if you want some legal crap...

ANYONE IS FREE TO COPY, MODIFY, PUBLISH, USE, COMPILE, SELL, OR DISTRIBUTE THE ORIGINAL CODE, EITHER IN SOURCE CODE FORM OR AS A COMPILED BINARY, FOR ANY PURPOSE, COMMERCIAL OR NON-COMMERCIAL, AND BY ANY MEANS.

NRO, 13 years ago

Peter, you mentioned McIlroy's spellchecker. If you're interested, his original source code is here:

http://code.google.com/p/unix-spell/

(both methods mentioned in the paper - the bloom filter and the differential huffman hash)

Peter Kankowski, 13 years ago

NRO, thank you very much!
