Abstract:
How to make you code shorter and easier to maintain by using arrays.

Created by Peter Kankowski
Last changed
Filed under Low-level code optimization

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Table-driven approach

You can often avoid repeating code by placing distinct parts of it into an array:

// A text adventure game
if(strcmpi(command, "north") == 0) {
    if(cur_location->north)
        GoToLocation(cur_location->north);
    else
        Print("Cannot go there");
}
else if(strcmpi(command, "east") == 0) {
    if(cur_location->east)
        GoToLocation(cur_location->east);
    else
        Print("Cannot go there");
}
else if(strcmpi(command, "south") == 0) {
    if(cur_location->south)
        GoToLocation(cur_location->south);
    else
        Print("Cannot go there");
}
else if(strcmpi(command, "west") == 0) {
    if(cur_location->west)
        GoToLocation(cur_location->west);
    else
        Print("Cannot go there");
}

A shorter equivalent:

enum SIDE {SIDE_NORTH = 0, SIDE_EAST, SIDE_SOUTH, SIDE_WEST};
struct COMMAND {
   const char * name;
   SIDE side;
};
static const COMMAND commands[] = {
   {"north", SIDE_NORTH},
   {"east", SIDE_EAST},
   {"south", SIDE_SOUTH},
   {"west", SIDE_WEST},
};
for(int i = 0; i < NUM_OF(commands); i++)
    if(strcmpi(commands[i].name, command) == 0) {
        SIDE d = commands[i].side;
        if(cur_location->sides[d])
            GoToLocation(cur_location->sides[d]);
        else
            Print("Cannot go there");
    }

The second version is not only smaller, but also more abstract and hence easier to maintain. To implement short command names, you have to modify the code in just one place. If more commands will be added, you can easily replace the array with a hash table.

Such arrays are indispensable in many programs: from compilers and disassemblers to tax calculators and web applications. This article highlights some examples of their usage.

Price calculation

In Refactoring, Martin Fowler discusses the following code:

// calculating the price for renting a movie
double result = 0;
switch(movieType) {
   case Movie.REGULAR:
     result += 2;
     if(daysRented > 2)
        result += (daysRented - 2) * 1.5;
     break;
 
   case Movie.NEW_RELEASE:
     result += daysRented * 3;
     break;
 
   case Movie.CHILDRENS:
     result += 1.5;
     if(daysRented > 3)
        result += (daysRented - 3) * 1.5;
     break;
}

In order to improve it, he replaces the switch statement with inheritance (one class for calculating the regular price, one for new releases, and another one for children's movies).

A simpler solution uses arrays:

enum MovieType {Regular = 0, NewRelease = 1, Childrens = 2};
 
                             // Regular   NewRelease   Childrens
const double initialCharge[] = {2,             0,        1.5};
const double initialDays[] =   {2,             0,          3};
const double multiplier[] =    {1.5,           3,        1.5};
 
double price = initialCharge[movie_type];
if(daysRented > initialDays[movie_type])
    price += (daysRented - initialDays[movie_type]) * multiplier[movie_type];

It's evident that the same formula is used in all three cases, so you don't have to make a complicated class hierarchy for them. The table-driven code is faster, much shorter, and easier to maintain.

Generally, when you think in terms of the only paradigm (in this case, OOP), you will often miss a better solution. By learning more different approaches (including table-driven approach), you can extend your mental toolbox and solve more programming problems with less lines of code.

Shipping rates and progressive taxation

Another situation when the tables are useful is a simple or a piecewise linear function.

Imagine you are developing a web shop script. The total price shown to user should include a shipping rate, which depends on the weight of items and the shipping distance. For example, the rates of Russian postal service are:

DistancePrice for each 0.5 kg
less than 601 km27.60 rubles
601..2000 km35.00 rubles
2001..5000 km39.60 rubles
5001..8000 km46.35 rubles
more than 8000 km50.75 rubles

The dependence between distance and price cannot be represented with a formula, so the program has to search in an array to find the applicable rate:

def shipping_fee(weight, distance, value):
# weight in grams, distance in kilometers, declared value in rubles
	rates = [
		# max distance, price
		( 600, 27.60),
		(2000, 35.00),
		(5000, 39.60),
		(8000, 46.35)
	]
	# Find the applicable rate for this distance
	for r in rates:
		if distance <= r[0]:
			rate = r[1]
			break
	else:
		rate = 50.75
	
	# For each 500 grams of the parcel
	fee = rate * math.ceil(weight / 500)
	
	# For each ruble of the declared value
	fee += 0.03 * math.ceil(value)
	
	# Apply VAT (18%)
	return round(fee * 1.18, 2)

The array is usually small, so linear scanning works faster than binary search.

For example, sending The practice of programming from Moscow to Khabarovsk (300 g, more than 8000 km, 214 rubles) should cost 67.46 rubles (around $2.20).

Progressive tax calculation is a similar, but slightly more complex example:

def get_tax(income):
    if income < 0:
        return income
    MAX_INCOME = 0; RATE = 1 # array indexes
    tax_brackets = [
        (  8350,  0.10),
        ( 33950,  0.15),
        ( 82250,  0.25),
        (171550,  0.28),
        (372950,  0.33)
    ]
    tax = 0
    prev_bracket = 0
    for bracket in tax_brackets:
        tax += ( min(bracket[MAX_INCOME], income) - prev_bracket ) * bracket[RATE]
        if income <= bracket[0]:
            break
        prev_bracket = bracket[MAX_INCOME]
    else:
        tax += (income - prev_bracket) * 0.35
    return tax

Disclaimer: these functions are only meant to illustrate the table-driven technique; the author doesn't claim they calculate the shipping rate in Russia and the income tax in USA correctly. For the latter, please use the calculator from IRS site, which takes into account various deductions.

Assemblers, disassemblers, and code generators

Writing assembler or disassembler is a good exercise in table-driven design. Typically, you have arrays for one-byte and two-byte instructions. Mnemonic and operand types of each instruction are stored in these arrays. Gary Shute provides a small example of RISC assembler that uses this technique.

x86 code is harder to assembly because of its irregular and complicated structure. The best assemblers/disassemblers use additional tables to deal with opcode extensions (see, for example, CADT by Ms-Rem). The Table Driven Assembler by Jan Nikitěnko shows that it's possible to write a generalized assembler, which can be retargeted for any ISA.

Character properties

The ctype.h header file in C includes macros for character classification (isdigit, isupper, isalnum, etc.) Their comparison-based implementation is clumsy and ineffective:

int isalnum(int ch) {
    return 'a' <= ch && ch <= 'z' ||
           'A' <= ch && ch <= 'Z' ||
           '0' <= ch && ch <= '9';
}

If there are several character properties, you can assign a bit to each of them, and then make a 256-element array storing the properties for each character. Most C run-time libraries use this method for ctype.h macros:

static const unsigned char properties[] = {
      0,  0,  0,  0,  0,  0,  0,  0,  0, 16, 16, 16, 16, 16,  0,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
     16,160,160,160,160,160,160,160,160,160,160,160,160,160,160,160,
    204,204,204,204,204,204,204,204,204,204,160,160,160,160,160,160,
    160,202,202,202,202,202,202,138,138,138,138,138,138,138,138,138,
    138,138,138,138,138,138,138,138,138,138,138,160,160,160,160,160,
    160,201,201,201,201,201,201,137,137,137,137,137,137,137,137,137,
    137,137,137,137,137,137,137,137,137,137,137,160,160,160,160,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
};
#define islower(ch)  (properties[ch] & 0x01)
#define isupper(ch)  (properties[ch] & 0x02)
#define isdigit(ch)  (properties[ch] & 0x04)
#define isalnum(ch)  (properties[ch] & 0x08)
#define isspace(ch)  (properties[ch] & 0x10)
#define ispunct(ch)  (properties[ch] & 0x20)
#define isxdigit(ch) (properties[ch] & 0x40)
#define isgraph(ch)  (properties[ch] & 0x80)

If you have only one property to store, use a bit array, which is smaller (256 bits = 32 bytes), but needs extra operations to retrieve the value:

inline int isalnum(int ch) {
    static const unsigned int alnum[] = {
        0x0, 0x3ff0000, 0x7fffffe, 0x7fffffe, 0x0, 0x0, 0x0, 0x0,
    };
    return (alnum[ch >> 5]) & (1 << (ch & 31));
}

The alnum array has ones in the positions corresponding to alphanumeric characters, and zeros in other positions.

The range of matching characters is often limited. If you have a range of 32 or less characters, you can store the bits in an integer constant:

// True for a, e, i, o, u, and y
inline int isvowel(int ch) {
    int x = ch - 'a';
    return (0x1104111 >> x) & ((x & 0xffffffe0) == 0);
}

When x is negative or greater than 31, the result of shift by x is undefined in C language. An x86 processor uses the lower 5 bits of x; some other processors always return 0 for the big shift amounts. For portability, we have to do an extra check: (x & 0xffffffe0) == 0. Both MSVC++ and MinGW generate a branchless x86 code for this condition (using SBB or SETE instruction). The method is not as fast as the table-driven approach, but the code is smaller.

The arrays for these methods can be generated with a script (see below). For Unicode character properties, use a comparison should be more effective than a table lookup.

Download a Python script for generating character properties tables, 2 KB

Decision tables

A complex condition can often be replaced with a short decision table. Here is an example. A program counts the number of digraphs to determine the encoding and language of document (e.g., Windows Latin-1, German language). All characters are divided into 3 classes:

  • letters of the language (e.g., a to z, ä, ö, ü, and ß for German);
  • letters not used by the language (e.g., è or ô for German);
  • separators (space, period, comma, etc.)

If two adjacent separators are found, they should be ignored, because somebody may use spaces for indentation. A non-used character should not appear near a letter, so the expected frequency of this digraph is zero (e.g., if we found êt digraph, the text is probably in French, not German). A word consisting of non-used characters should be ignored: this will improve statistics for multilingual texts (e.g., English trademarks mentioned in an Arabic text).

A series of if statements for these rules would be long and hard to read (try to write it!). Such conditions are better to be represented with a decision table. Here is a fragment of code that builds the expected_frequencies matrix:

enum LETTER_CLASS { ST_NONUSED = 0, ST_LETTER, ST_SEPARATOR };
enum DECISION { DT_SETDONTCARE, DT_SETZERO, DT_LEAVEASIS };
const DECISION decision_table[3][3] = {
      // Non-used        Letter        Separator
	{DT_SETDONTCARE, DT_SETZERO,   DT_SETDONTCARE},  // Non-used
	{DT_SETZERO,     DT_LEAVEASIS, DT_LEAVEASIS},    // Letter
	{DT_SETDONTCARE, DT_LEAVEASIS, DT_SETDONTCARE}}; // Separator

// Read a sample text and fill the expected_frequencies matrix with digraph frequencies
...

// Correct the expected_frequency value depending on the letter class
switch (decision_table[letter_class1][letter_class2]) {
    case DT_SETDONTCARE:
        expected_frequencies[letter1][letter2] = DONTCARE;
        break;

    case DT_SETZERO:
        if (expected_frequencies[letter1][letter2] != 0)
            OutputMessage("Illegal letter combination %c%c\n",
                letter1, letter2);
        expected_frequencies[letter1][letter2] = 0;
        break;

    case DT_LEAVEASIS:
        // No action, use previously calculated expected_frequencies value
        break;
}

Further reading

Books
  • Code Complete by Steve McConnell, chapter 18 (Table-driven methods). Covers access by index and linear search in array. McConnell also criticizes the mechanical application of OOP.
  • Programming Pearls by Jon Bentley, chapter 3.1, 3.3 (A Survey Program, An Array of Examples). The chapter shows how to make a statistical program shorter by using arrays and provides more examples of where table-driven methods can be useful.
Web pages
Peter Kankowski
Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP.

Created by Peter Kankowski
Last changed

10 comments

ace,

My take:

Programming can be seen as creating new state machines. The programming languages are optimized representations of some intentionally limited kinds of state machines. E.g. the whole structured programming movement can be expressed as introducing explicit limits compared to all representable state machines when goto everywhere is allowed. Good programming courses should introduce and work with http://en.wikipedia.org/wiki/Automata_theory before they introduce specific languages, and then they should demonstrate how the automaton can be programmed with any of the languages.

It's extremely important approach and if you studied computer science and don't know enough about it you should ask for your money back (figuratively speaking). If you are a hobby programmer it's what you missed for not seriously learning the subjects.

Most of the languages are explicitly optimized for automata that have very few edges (paths, i.e. possible gotos). There are a lot of real life solutions that are more densely represented by tables simply because they don't adhere to the intentional limitations of the "flow" of the given programming language constructs.

Beginners and amateurs learn complex language aspects and then try to fit everything to the most complex thing they've learnt – they expect that when they use something that required from them a lot of effort to learn it must give them the most of benefits, even if they are not able to evaluate that. Bad book authors on their side give unrealistic examples to illustrate "advanced" techniques (where "advanced" ofen really means "better avoided" and not "it's always better to use it"). That's why we see "object oriented code" over more pages which can be replaced with a few lines without objects, or more pages of code in some languages which can be replaced with a few small tables and a few lines of code.

That's why we have a lot of "language fundamentalists" – people who spend a lot of time studying the texts of "language lawyers" and believe that simply because they know the special cases when e.g. "templated operator overloading for implicit conversion of a multiple inherited class" have to be illogically written they are good programmers. They are still amateurs.


(Style issues follow, all disclaimers apply)

You maybe know that I "look" at the code without actually reading it, in the first pass, as I do that much faster. I got stuck at lines:

        SIDE side = commands[i].side;
        if(cur_location->side[side])
            GoToLocation(cur_location->side[side]);
        else
            Print("Cannot go there");

Almost every occurrence of the same word "side" has another semantic in the first two lines! You know what you're doing when you write it, but the code should be readable later, and be readable fast. I'd suggest:

        SIDE d = commands[i].side;
        if ( !GoToLocation( curr_location->sides[ d ] )
            Print( "Cannot go there" );

Consider always using plurals to name arrays unless it's enough to use even shorter name as "a". Consider using one letter names for all variables to be used only in a few lines. Note also that your use of enum values as array indexes is hidden in the code. When I know at the start that the enum list is not a list of "some integers" but "the list integers where the first really must be 0 and the rest of the code depends of it" even if it results in the same generated code I write it as

enum SIDE { SIDE_NORTH=0, SIDE_EAST, SIDE_SOUTH, SIDE_WEST };

That helps me to not accidentally add something later at the start of the list because "it was just a list of values" and break the rest of the code without knowing.

Peter Kankowski,

Thank you, I applied your suggestions on coding style. Fully agreed about "language fundamentalists" and OOP overuse.

ace,

Here's one book which covers material I referred to. I haven't read it but it looks good.

Introduction to the Theory of Computation, Second Edition by Michael Sipser

Peter Kankowski,

Automata and Computability by Dexter Kozen also looks nice: it covers important topics and is not too hard to read.

ace,

I agree with you, Kozen's book has more directly useful examples, Sipser's is more "dry math."

Joe,

The decision tables are very similar to jump tables which are frequently used in parsers.

Command cmds[] = {
  [COMMAND_FOR] = &&cmdFor,
  [COMMAND_IF]  = &&cmdIf
};

COMMAND_FOR and COMMAND_IF can be implemented via an enum.

And use it like this:

  goto *(cmds[idx]);

cf. http://gcc.gnu.org/onlinedocs/gcc-3.4.1/gcc/Labels-as-Values.html

Peter Kankowski,
In most cases, it's better to use a switch instead of the jump table, because the switch is portable and shorter, and it will be compiled to a jump table anyway. (The GCC docs mention this.)
ace,

But yes, "designated initializers" could be a neat feature from C99 when you have to maintain the code with tables (insuring always proper initial placement of elements). Sadly it's not present in C++ compilers:

http://groups.google.com/group/comp.std.c++/browse_thread/thread/8b7331b0879045ad

G.Nitz,

In my opinion the word computer is always best translated with "a table machine". I create tables as often as possible and achive the best results in my programs. In speed and in size as well.

Sadly it's not widely known and used. Most young programmers don't ever hear of it.

Really a nice article to read!

olivia michael,

is there a disignated initializer of oops

Your name:


Comment: