Full Text Search (Part 1)

I want to add full text search (FTS) to my iKnowPeople app. (Have I mentioned that you can buy it on iTunes?) FTS is a pretty old idea, and most DBs support it (albeit usually through some sort of add-on). SQLite (the DB underlying Core Data on the iPhone) supports it, but, unfortunately, Core Data doesn’t let you access this feature.

This means I’m going to have to roll my own implementation. Which is what I’m going to be talking about for most of this week. (We’re still doing con games tomorrow, though.) Today I’m going to outline the basic approach and write a prototype in Python. Let’s get started.

Conceptual Design

FTS systems allow you (to a first approximation) to quickly search for any word in any ‘document’ in the system. They can become quite sophisticated (you could argue that Google is just FTS for the WWW); this Wikipedia article offers some idea of where things can go. In my particular case, the ‘documents’ I want to search are the various Core Data objects that users have created to record what they know about who they know, and my ambition is merely to provide a good and useful feature, not a theoretically cutting-edge one.

An FTS system can be thought of as a keyword index: the index lists, for each keyword that appears in any document, all the documents in which it appears. The relative sophistication of an FTS system depends largely on how this index is populated – i.e. which keywords are bound to which documents – and somewhat on how it is queried.
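
In Python terms, you can picture that index as nothing fancier than a dictionary mapping each keyword to the set of document IDs in which it appears. A toy sketch (with made-up IDs):

# A keyword index, conceptually: each keyword maps to the set of
# document IDs (made up for this example) in which it appears.
index = {
    u'charity': set([1, 3]),
    u'truth':   set([1, 2]),
    u'economy': set([2]),
}

# Querying is just a dictionary lookup.
index.get(u'truth', set())   # -> set([1, 2])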

You could, for example, generate a simple set of keywords for a document by splitting it on whitespace and treating the resulting array as a set of tokens. This would (sort of) work, but it would have several drawbacks (illustrated in the sketch after this list):

  • Many useless words (“and”, “or”, “but”, etc.) would be included in the index, consuming space and time
  • Words which differ only by capitalization (“Fish” vs. “fish”) or diacritics (“précis” vs. “precis”) would have separate entries in the index
  • Punctuation would be intermixed with the “proper” keywords (the string “fee, fie, foe, fum!” would produce the tokens “fee,” “fie,” “foe,” and “fum!” – not at all what you want)
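
To see these problems concretely, here’s a quick sketch of the naive approach (nothing but Python’s built-in split) applied to that last example:

# Naive tokenization: split on whitespace and keep whatever falls out.
doc = u'fee, fie, foe, fum!'
naive_tokens = set(doc.split())
# -> set([u'fee,', u'fie,', u'foe,', u'fum!']) -- punctuation and all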

A slightly more sophisticated approach is to first simplify the document (i.e. standardize on one case and strip diacritics) and then split on any sequence of characters not from a small “legal” set. Of the resulting tokens, any that are found in a set of “stop words” (such as the aforementioned “and”, “or”, “but”, etc.) are thrown away. The remaining tokens are used to populate the index. This is the approach I’m going to take in my app. (At least for now. I’m writing this as I’m developing my solution, so I reserve the right to be surprised.)
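
Before getting to the real implementation, here’s a minimal sketch of that pipeline (diacritic stripping is left out, and the three stop words are a stand-in for a proper list):

import re

# Sketch of the simplify -> split -> filter approach described above.
stopwords = set([u'and', u'or', u'but'])         # stand-in stop word list

def sketch_tokenize(s):
    s = s.lower()                                # standardize on lower case
    tokens = set(re.split(u"[^a-z0-9']+", s))    # split on "illegal" characters
    tokens.discard(u'')                          # drop the empty token, if any
    return tokens - stopwords                    # throw away stop words

# sketch_tokenize(u'Fee, fie, foe and fum!') -> set([u'fee', u'fie', u'foe', u'fum'])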

Test Corpus

The first thing I need is some test text. I’ve decided to start with Benedict XVI’s encyclical on “Charity in Truth”, because of the sophistication of its language and the variety of its notation, its length, its use of diacritics, and the fact that it’s conveniently sliced up into 79 sections. (Jumping ahead a bit, its plaintext is about 175K, and its index contains ~10000 references to ~3600 tokens.)

The encyclical is available on the web; there might be cleaner versions out there, but I started with the HTML document. Here’s some Python code to grab it:

def get_text():
    # Grab sourcecode of a page displaying the encyclical (as unicode string)
    fp = urllib2.urlopen('http://www.vatican.va/holy_father/benedict_xvi/encyclicals/documents/hf_ben-xvi_enc_20090629_caritas-in-veritate_en.html')
    s = unicode(fp.read(), 'iso-8859-1')
    fp.close()

    # Split it into sections
    a = re.split('<p align="left"><a name="\d+\.">', s)

    # Clip the last section
    a[-1] = a[-1][:a[-1].find('<hr>')]

    # Return all but the first section
    return a[1:]

Okay, a few points here:

  • I assume that the urllib2 and re modules are loaded
  • Note that I convert the data returned by fp.read into a unicode object with the iso-8859-1 encoding. This conversion is important for later cleanup; the encoding was determined by examining the page’s source code.
  • The page’s source code also reveals that each section begins with the same snippet of HTML: <p align="left"><a name="XX.">, where “XX” is some number. I use a regex based on this to split the encyclical into its component parts.
  • The end of the document isn’t relevant to this exercise; there’s an <hr> tag immediately after Benedict XVI’s signature that serves as a good place to cut the last section.
  • The part of the document preceding the first occurrence of the regex isn’t relevant either, so I don’t return it.
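
For what it’s worth, a quick sanity check of get_text() looks something like this (assuming the download succeeds and the page’s markup hasn’t changed):

# Quick sanity check: fetch the encyclical and count the sections.
sections = get_text()
print(len(sections))       # expect 79
print(sections[0][:80])    # start of section 1 -- still full of HTML at this point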

At this point I’ve got 79 ‘documents’ to work with, but they’re full of HTML entities and markup. I need to clean them up:

def clean_text(a):
    # LUT to replace (referenced) entities w/ Unicode characters
    codes   = set(re.findall('&#\d{4};', ''.join(a)))
    lut     = dict([(k, unichr(int(k[2:2+4]))) for k in codes])

    # Fxn to process each paragraph
    def clean(p):
        # Strip tags
        p = re.sub('<.*?>', '', p)

        # Replace entities
        for k in lut.keys(): p = p.replace(k, lut[k])

        # Return cleaned string
        return p

    # Process all paragraphs
    return [clean(p) for p in a]

This code is pretty straightforward:

  • It scans the documents for “&#XXXX;” style HTML entities, and replaces them with Unicode characters (calculated from the entity’s numeric code)
  • It strips all HTML tags

This code does assume that no “<” or “>” symbols appear outside of tags, and that no HTML entities other than those of the form “&#XXXX;” are used, but these assumptions seem valid.
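
To see what the entity replacement actually does, here’s clean_text() applied to a made-up snippet (the entities &#8220; and &#8221; are curly double quotes):

# clean_text() on a made-up snippet: the tags are stripped, and the
# numeric entities are replaced with the corresponding Unicode characters.
snippet = ['<p>&#8220;Caritas&#8221; means charity</p>']
print(clean_text(snippet))
# -> [u'\u201cCaritas\u201d means charity']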

Stop Words

The other thing I need is a list of stop words. I found this list of 544 English stop words but, again, they’re wrapped up in HTML. We’ll have to write some more code.

Before we get to the retrieval code, however, a word about internationalization (i18n). This FTS implementation is tailored towards English, and (as we’ll see) essentially unusable for non-Romanized languages. At a minimum, different stop word lists would have to be used for other languages, but this falls into the category of Things I’ll Deal With Later.

Okay, stop word retrieval. Here it is:

def get_mysql_stopwords():
    # Grab a page with a table of stopwords
    fp = urllib2.urlopen('http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html')
    s = fp.read()
    fp.close()

    # Snip out the relevant table
    b = s.find('default list of full-text')
    e = s.find('</table>', b)

    # Pull the stopwords
    return re.findall("<td>([a-z']+)</td>", s[b:e])

Again, this is pretty simple:

  • The stop words are stored in a table, one to a cell. This table is the first one that appears after the text (fragment): “default list of full-text”.
  • All the interesting cells are found between this “marker” text and the table end tag. A regex extracts them.
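
A quick check of the scrape (the exact contents depend on MySQL’s page, of course):

# Quick check of the stop word scrape.
stopwords = get_mysql_stopwords()
print(len(stopwords))    # should be in the neighborhood of 544
print(stopwords[:5])     # peek at the first few entries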

Tokenization

Now we have everything we need to make some tokens. Here’s the code to transform a block of text into a set of tokens:

def tokenize(s):
    # Diacritic removal:
    # * Code from: http://snipplr.com/view/7355/remove-diacritics/
    # * Diacritics are pulled out into their own ("combining") codepoints, and then discarded
    s = re.sub(u'[\u0300-\u036f\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]', '', unicodedata.normalize('NFD', s))

    # Convert to lower case
    s = s.lower()

    # Split
    tokens = set(re.split("[^a-z0-9']+", s))
    tokens.discard(u'')

    # Return
    return tokens

A few more notes:

  • This code uses the unicodedata module
  • The diacritic removal stuff is cute, and totally copied from the Internet. I suspect it’s not exhaustive, but will work for most European languages.
  • The split statement is the most egregious hack in this whole undertaking: any character that (after massaging) isn’t a lowercase letter between “a” and “z”, a digit, or a single-quote/apostrophe (and, err, ::shuffle-shuffle::, don’t ask me too many questions about that distinction right now) can’t appear in a keyword. That said, doing it this way seems to solve a number of immediate problems.
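
Here’s tokenize() run on a small made-up string, just to show the case folding, diacritic stripping, and splitting in action (stop words aren’t removed here; that happens in the index, below):

# tokenize() on a made-up string: case is folded, the accent on "précis"
# is stripped, and punctuation/whitespace split the tokens.
print(tokenize(u'A pr\xe9cis: Fee, fie, FOE, fum!'))
# -> set([u'a', u'precis', u'fee', u'fie', u'foe', u'fum'])  (order arbitrary)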

Index

With all that done, a little more code gets us to our FTS index:

class FTS_Index:
    def __init__(self, stopwords):
        self.stopwords  = set(stopwords)
        self.tokenToIds = {}
        self.idToTokens = {}

    def update(self, id, text):
        # Find (non-stopword) tokens in the text
        tokens = tokenize(text) - self.stopwords

        # Add new xrefs, as required, between (new) tokens and this id
        for t in tokens - self.idToTokens.get(id, set()):
            self.tokenToIds.setdefault(t, set()).add(id)

        # Remove old xrefs, as required
        for t in self.idToTokens.get(id, set()) - tokens:
            self.tokenToIds[t].remove(id)

        # Store the tokens associated with this id
        self.idToTokens[id] = tokens

This class works with document IDs, so you’ll need to have your own convention set up for that. Aside from that consideration, you just call update() to modify the index, and then perform lookups on tokenToIds. You can download the source code, which includes a Testbed class to search the encyclical.
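
Putting the pieces together against the encyclical looks roughly like this, using section numbers as document IDs (a sketch, not the Testbed class itself):

# Rough sketch of wiring everything together: the encyclical's section
# numbers serve as document IDs, and a query is a set lookup on tokenToIds.
index = FTS_Index(get_mysql_stopwords())
for i, text in enumerate(clean_text(get_text())):
    index.update(i + 1, text)    # sections are numbered 1..79

# Which sections mention "charity"?
print(index.tokenToIds.get(u'charity', set()))

A multi-word query is just the intersection of the sets returned for each token.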

The next step is porting this over to Objective-C, Core Data, and the iPhone. Check back on Wednesday, and we’ll see how things are going.
