A simple data structure for indexing text files.

Created by Peter Kankowski
Last changed
Filed under Algorithms

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Suffix array for full-text search

A reader asked by email if it's effective to store a full-text search index in a DAWG and how does it compare with using a trie.

DAWG is useful when you need a fuzzy search, e.g. for ignoring possible typos or spelling correction. If you don't need this, you can use a simpler data structure called suffix array.

The data structure

Let's break the text into words and place a pointer to the beginning of each word into an array. Then, sort the array alphabetically. If any two words are equal, compare the next words ("I am" will come before "I say").

When I say that I am in the habit of going to sea

I saygoing
thatI am
I amI say

The array contains not the words themselves, but only pointers to them, so it takes a little space.

When searching, use binary search to locate the matching word. The array will provide a pointer to the word in the original text, so you can show the whole sentence just like Google does. The words that start from the given characters will be returned by default. It's possible to search for multiple words, to match only the whole words, or to ignore case.

Refinements to the algorithm

I found a description of suffix array in Jon Bentley's book Programming Pearls. In the same book, there is a binary search algorithm that always finds the first of repeating elements in an array. I used this algorithm to locate the first matching word, then scanned the array starting from this word and recorded all matching words.

Now we have the list of matches, but it's ordered alphabetically, not by their location in the original text. So I copied the matches into a temporary array and sorted them by pointer value, which is their position in the text.

For allocating the memory for the index, I used the tricks described in my previous article on dynamic arrays.

The demo program builds the index in RAM. In a real program, you may wish to store the index in a file and rebuild it each time the text file is modified. In this case, you need offsets from the beginning of the text file instead of pointers. You can save some space if the text file is small by allocating less bits for the offset. For example, instead of using 32-bit integers, you can store the offset in 24 bits (3 bytes), which will be enough for text files smaller than 16 MB.

If you index several documents, you need to store both the number of the file and the offset inside it. If the file size does not change, it's possible to represent all files as one large document, so you can have only the offset in the index and determine the file name from the offset.

Demo program

Download the demo program in C (2 KB) or try an online JavaScript demo below (the search is case-sensitive).

Search for:

Call me Ishmael. Some years ago — never mind how long precisely — having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off — then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.

Peter Kankowski
Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP. He can be reached at kankowski@gmail.com.

Created by Peter Kankowski
Last changed

Leave your comment

Your name:


Please ignore this field: