Introduction
Messages like "%d file(s) found" are notoriously hard to localize. English has only 2 forms: singular (1 file) and plural (every other number of files), but other languages use up to 4 plural forms. For example, there are 3 forms in Polish:
0 plików
1 plik
2-4 pliki
5-21 plików
22-24 pliki
25-31 plików
etc.
Other languages (French, Russian, Czech, etc.) also use rules different from English and from each other.
The gettext library extracts a rule for plural form selection from the localization file. The rule is a C language expression, which is evaluated for each message. It's a universal solution, but an expression evaluator is probably overkill for this task.
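For illustration, the rule that the gettext manual lists for Polish is "nplurals=3; plural=n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;". The sketch below simply hard-codes that expression as a C function to show what the evaluator has to compute for every message. The function name is made up for this example, and gettext itself parses the expression at run time rather than compiling it.

```c
/* Hand-written equivalent of the Polish Plural-Forms expression from the
   gettext manual. Returns 0 (plik), 1 (pliki) or 2 (plików).
   gettext_polish_form is a hypothetical name, not part of gettext. */
int gettext_polish_form(unsigned long n)
{
    return n == 1 ? 0
         : (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20)) ? 1
         : 2;
}
```

For example, gettext_polish_form(22) selects "pliki", while gettext_polish_form(12) and gettext_polish_form(112) both select "plików".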
Simpler solution
Here are some observations about the languages mentioned on the gettext page:
- All additional plural forms are used for some range of numbers, e.g., from 2 to 4 in Slovak and Czech.
- The pattern is often repeated every 10 or 100 items. In Russian, it sounds like "twenty-one file", not "twenty-one files", because the noun agrees with the last digit, "one". The same pattern repeats in each following ten (the 30s, the 40s, etc.).
- The numbers from 10 to 19 are often an exception to the rules. Just like 16 is spelled differently from 26, 36, 46, etc. in English: "sixteen" vs. "twenty-six", "thirty-six", and "forty-six".
- Zero is treated differently in some languages, e.g. Romanian.
So, the rule for each plural form will consist of these components:
range_start range_end modulo_for_repetition skip_teens_flag
Here are some examples:
English
    singular: range_start = 1, range_end = 1
    plural: all other numbers
Polish
    singular: range_start = 1, range_end = 1
    plural1: range_start = 2, range_end = 4, modulo = 10, skip_teens = true
    plural2: all other numbers
Irish
    singular: range_start = 1, range_end = 1
    plural1: range_start = 2, range_end = 2
    plural2: all other numbers
Lithuanian
    singular: range_start = 1, range_end = 1, modulo = 10, skip_teens = true
    plural1: range_start = 2, range_end = 9, modulo = 10, skip_teens = true
    plural2: all other numbers (multiples of 10 and numbers ending in 11 to 19)
The rules for each language could be written to a short string, which is stored in the language file (e.g., for Lithuanian, the string is "1 1 10 t; 2 9 10 t").
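As a rough illustration of how such a rule string might be parsed and applied, here is a minimal sketch. It is not the code from plurals.c: the names RangeRule, RuleSet, ParseRangeRules and SelectRangeForm are hypothetical, and it assumes that omitted fields take defaults (range_end equals range_start, no modulo, teens not skipped) and that "teens" means the numbers whose last two digits are 10 to 19.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_RULES 8

typedef struct {
    unsigned start, end;  /* inclusive range */
    unsigned modulo;      /* 0 means "compare the number itself" */
    int skip_teens;       /* nonzero: never match when n % 100 is 10..19 */
} RangeRule;

typedef struct {
    RangeRule rules[MAX_RULES];
    int count;
} RuleSet;

/* Parses "start [end [modulo [t]]]; ..." and returns the number of rules read. */
int ParseRangeRules(RuleSet *set, const char *cfg)
{
    assert(set != NULL && cfg != NULL);
    set->count = 0;
    while (cfg != NULL && set->count < MAX_RULES) {
        RangeRule r = {0, 0, 0, 0};
        char flag = '\0';
        int fields = sscanf(cfg, "%u %u %u %c",
                            &r.start, &r.end, &r.modulo, &flag);
        if (fields < 1)
            break;                  /* empty or malformed rule */
        if (fields < 2)
            r.end = r.start;        /* assumed shorthand: "1" means "1 1" */
        r.skip_teens = (fields == 4 && flag == 't');
        set->rules[set->count++] = r;
        cfg = strchr(cfg, ';');     /* move on to the next rule, if any */
        if (cfg != NULL)
            cfg++;
    }
    return set->count;
}

/* Returns the index of the first matching rule, or set->count for
   "all other numbers". */
int SelectRangeForm(const RuleSet *set, unsigned n)
{
    unsigned last_two = n % 100;
    assert(set != NULL);
    for (int i = 0; i < set->count; i++) {
        const RangeRule *r = &set->rules[i];
        unsigned m = r->modulo ? n % r->modulo : n;
        if (r->skip_teens && last_two >= 10 && last_two <= 19)
            continue;
        if (m >= r->start && m <= r->end)
            return i;
    }
    return set->count;
}
```

With the Lithuanian string "1 1 10 t; 2 9 10 t", this sketch maps 21 to form 0, 22 to form 1, and 11, 20 and 111 to form 2. The catch-all form needs no rule of its own, which keeps the string short.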
Using the Code
Include plurals.h and plurals.c in your project. The interface consists of two functions. First, you call PluralsReadCfg to read the rules from the string. Next, you pass a number to PluralsGetForm. It returns the index of the correct plural form for this number, which you use to read the string from your language file:
PLURAL_INFO plurals;
PluralsReadCfg(&plurals, ReadFromLngFile("PluralRules"));
char lng_str_name[16], message[128];
sprintf(lng_str_name, "FilesFound%d", PluralsGetForm(&plurals, number));
sprintf(message, ReadFromLngFile(lng_str_name), number);
In the language file, you have strings for each plural form:
PluralRules = "1"
FilesFound0 = "%d file found"
FilesFound1 = "%d files found"
ReadFromLngFile is your own function. You could wrap two sprintfs in a higher-level function (and, of course, use a secure function instead of sprintf to protect your program from buffer overflow).
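One possible shape for such a wrapper is sketched below, using snprintf so that neither buffer can overflow. The names FormatPlural, LookupFn and DemoLookup are hypothetical. The lookup callback stands in for your own ReadFromLngFile, and DemoLookup only imitates a language file for the sake of the example. Note that the format string still comes from the language file, so the file has to be trusted to contain exactly one %d.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the application's own string lookup (e.g. ReadFromLngFile). */
typedef const char *(*LookupFn)(const char *key);

/* Builds the key "<base_key><form>", looks it up and formats the number.
   Returns the message length, or -1 if a buffer is too small or the key
   is missing. */
int FormatPlural(char *out, size_t out_size, LookupFn lookup,
                 const char *base_key, int form, int number)
{
    char key[64];
    const char *fmt;
    int len;

    assert(out != NULL && lookup != NULL && base_key != NULL);
    len = snprintf(key, sizeof key, "%s%d", base_key, form);
    if (len < 0 || (size_t)len >= sizeof key)
        return -1;                      /* key would be truncated */
    fmt = lookup(key);
    if (fmt == NULL)
        return -1;                      /* no string for this plural form */
    len = snprintf(out, out_size, fmt, number);
    if (len < 0 || (size_t)len >= out_size)
        return -1;                      /* message would be truncated */
    return len;
}

/* Hypothetical in-memory "language file" used only for this example. */
const char *DemoLookup(const char *key)
{
    if (strcmp(key, "FilesFound0") == 0) return "%d file found";
    if (strcmp(key, "FilesFound1") == 0) return "%d files found";
    return NULL;
}
```

A call such as FormatPlural(message, sizeof message, DemoLookup, "FilesFound", PluralsGetForm(&plurals, number), number) would then replace the two sprintf calls.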
An even better solution is to implement a custom formatting function, so that you could write something like "%d %(file|files) found" in the language file. Scott Rippey devised this technique and implemented it in VB .NET.
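To give an idea of what such a formatting function could look like in C, here is a small sketch. It is not Scott Rippey's implementation. ExpandPlural is a hypothetical name, and the sketch assumes a deliberately tiny pattern syntax: "%d" inserts the number, "%(a|b|c)" inserts the alternative whose index equals the plural form, and everything else is copied as is.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expands "%d" to the number and "%(a|b|c)" to the alternative selected by
   `form` (0-based). Returns 0 on success, -1 if the pattern is malformed,
   has too few alternatives, or the output buffer is too small. */
int ExpandPlural(char *out, size_t out_size, const char *pattern,
                 unsigned number, int form)
{
    size_t used = 0;   /* invariant: used < out_size, leaving room for NUL */

    assert(out != NULL && pattern != NULL && out_size > 0);
    while (*pattern != '\0') {
        if (pattern[0] == '%' && pattern[1] == 'd') {
            int n = snprintf(out + used, out_size - used, "%u", number);
            if (n < 0 || (size_t)n >= out_size - used)
                return -1;
            used += (size_t)n;
            pattern += 2;
        } else if (pattern[0] == '%' && pattern[1] == '(') {
            const char *alt = pattern + 2;
            const char *close = strchr(alt, ')');
            const char *end;
            size_t len;
            if (close == NULL)
                return -1;              /* missing ')' */
            for (int i = 0; i < form; i++) {
                const char *bar = memchr(alt, '|', (size_t)(close - alt));
                if (bar == NULL)
                    return -1;          /* fewer alternatives than forms */
                alt = bar + 1;
            }
            end = memchr(alt, '|', (size_t)(close - alt));
            if (end == NULL)
                end = close;
            len = (size_t)(end - alt);
            if (len >= out_size - used)
                return -1;
            memcpy(out + used, alt, len);
            used += len;
            pattern = close + 1;
        } else {
            if (used + 1 >= out_size)
                return -1;
            out[used++] = *pattern++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

The plural form index would still come from a function like PluralsGetForm. The advantage is that the language file needs one string per message instead of one per plural form.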
Conclusion
Two functions, PluralsReadCfg and PluralsGetForm, take 500 bytes in your executable file when compiled with MSVC++. A small price to pay for spelling your messages correctly in any language.
12 comments
Same idea, but simpler (no parsing at all) and faster (no loops):
Sorry for Pascal.
Formula for
English: 101
Polish: 2011122222222222222222111222222222222222
Lithuanian: 2011111111222222222220111111112222222222
Thank you, storing plural forms in a table (represented as string) is a very good idea! I was able to find the formulas for all languages when using your method. Here is the C code:
>I was able to find the formulas for all languages
I forgot to mention the automatic translation from a gettext expression to a Formula (Delphi sources and EXE):
http://rghost.ru/3884964
Wow, RPN! :) But for this ad hoc evaluator, a programming language with eval function and syntax close to C would be a reasonable choice. For example, in JavaScript:
It's very fast (2 milliseconds in Firefox on my machine), because the expression is evaluated only once.
I will try to explain your algorithm as I understood it (if something is wrong, please correct me).
A) If the formula length is 20 figures or less, it's a simple lookup table for each number, e.g. for Irish it's "2012", which means "the 2nd form for zero, the 0th form for one, the 1st form for two, and the 2nd one for three or more". The last digit is used for all numbers greater than the formula length. These rules cover Western-European languages.
B) Romanian and Slovenian have a special pattern for the numbers from 0 to 99. Another pattern is used for numbers from 100 to 199, from 200 to 299, and so on (modulo 100). You encode the first pattern (exactly 20 characters), then the second pattern (its length must not be divisible by 10). The last digit is used for larger numbers, again.
C) Russian, Polish, and Baltic languages are similar to B), but the second pattern is also repeated from 20 to 29, from 30 to 39, from 40 to 49, etc. (modulo 10). The function applies this rule if the formula length is divisible by 10. It's easy to pad it to 40 characters with the repeating last digit.
Generally, it's a very clever hack. Thank you so much for sharing!
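The lookup described in A) and C) can be sketched in C roughly as follows. This is a reconstruction from the explanation and from the Polish and Lithuanian formulas quoted earlier, not the original Pascal code, and the name TableLookupForm is made up. It assumes that a 40-character formula consists of three tables: 20 entries for the numbers 0 to 19, 10 entries selected by the last digit, and 10 entries for larger numbers whose last two digits are 10 to 19. Case B) (Romanian, Slovenian) is not covered.

```c
#include <assert.h>
#include <string.h>

/* Formulas quoted in the comments above; the long ones are split into
   groups of ten characters for readability. */
const char ENGLISH_FORMULA[] = "101";
const char IRISH_FORMULA[] = "2012";
const char POLISH_FORMULA[] =
    "2011122222" "2222222222" "2211122222" "2222222222";
const char LITHUANIAN_FORMULA[] =
    "2011111111" "2222222222" "2011111111" "2222222222";

/* Returns the plural form index for n.
   Short formulas: direct lookup, the last digit covers all larger numbers.
   40-character formulas (assumed layout): see the three tables below. */
int TableLookupForm(const char *formula, unsigned long n)
{
    size_t len, idx;

    assert(formula != NULL);
    len = strlen(formula);
    assert(len > 0);
    if (len == 40) {
        unsigned long last_two = n % 100;
        if (n < 20)
            idx = (size_t)n;                     /* table 1: 0..19 */
        else if (last_two >= 10 && last_two <= 19)
            idx = 30 + (size_t)(last_two - 10);  /* table 3: 110..119, ... */
        else
            idx = 20 + (size_t)(n % 10);         /* table 2: repeats every 10 */
    } else {
        idx = n < len ? (size_t)n : len - 1;     /* clamp to the last digit */
    }
    return formula[idx] - '0';
}
```

Under these assumptions, TableLookupForm(POLISH_FORMULA, 23) gives form 1 and TableLookupForm(POLISH_FORMULA, 113) gives form 2, which matches "23 pliki" and "113 plików".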
>But for this ad hoc evaluator, a programming language with eval function and syntax close to C would be a reasonable choice.
You are right, but the original goal was a simple localization framework for Delphi (not JavaScript). I started with a classic expression parser and ended up with the Formula.
>It's very fast (2 milliseconds in Firefox on my machine)
Interestingly, Firefox doesn't use this "dynamic evaluation" method in its plugin system (JavaScript-based), but rather a lot of predefined functions for the various languages:
https://developer.mozilla.org/En/Localization_and_Plurals
>I will try to explain your algorithm as I understood it
You are correct, but I'm not a linguist, and I wasn't thinking about languages, plural forms and so on. I started by thinking about 3 arrays (20 numbers from the first hundred, 10 for the decades loop, and exclusions for *10..*19) and flags (loop every 10, every 100, etc.). Then I combined all the arrays into one string, used tail reduction and embedded the flags in the string size.
By the way, your method (using a condition to test each plural form) is similar to the Plural rules standard:
http://cldr.unicode.org/index/cldr-spec/plural-rules
Thank you for the links. Note that Unicode and Mozilla have different rules for some languages. I asked them which ones are correct.
It's impressive how you started from the classic expression evaluator (similar to gettext) and came up with the smartly reduced lookup tables. Your formulas remind me of the spellchecker by Doug McIlroy, who compressed 75 000 English words into 65 536 bytes of memory using a sparse hash table (described in Programming Pearls).
What license do you have for this code? May other people use it? Maybe you would like to provide your full name to be credited in the documentation / about box.
>Thank you for the links. Note that Unicode and Mozilla have different rules for some languages.
When testing the formulas I found the same weird things: the expression for some languages is presented in many different forms (even if you use reputable sources like W3C). The winner is Arabic: six different expressions, and not only do the algorithms differ, but even the number of plural forms!
>What license do you have for this code?
I didn't think that the ten lines of code may have any license ;-) But if you want some legal crap...
ANYONE IS FREE TO COPY, MODIFY, PUBLISH, USE, COMPILE, SELL, OR DISTRIBUTE THE ORIGINAL CODE, EITHER IN SOURCE CODE FORM OR AS A COMPILED BINARY, FOR ANY PURPOSE, COMMERCIAL OR NON-COMMERCIAL, AND BY ANY MEANS.
Peter, you mentioned McIlroy's spellchecker. If you're interested, his original source code is here:
http://code.google.com/p/unix-spell/
(both methods mentioned in the paper: the Bloom filter and the differential Huffman hash)