Tokenization is a core concept of full text search. It covers the question of how texts (which, in the end, are nothing more than byte arrays) can be broken down into individually searchable parts, so-called tokens. The token is what is saved in the full text index and is thus the smallest unit that can be searched for. This article covers some strategies to cope with cases your search engine cannot handle out of the box.
Most search engines, such as Sphinx, Elasticsearch, or Solr (and Lucene, for that matter), provide means to tokenize texts efficiently. More precisely, token creation is often split into several steps, such as character filters (filtering out unwanted content such as HTML tags, or folding all non-ASCII characters to plain ASCII) and the actual separation of text units (defining separation criteria and fixed words such as C++, filtering out stop words like “the” or “a”, etc.). To simplify things, I will call the whole process “tokenization”.
There are many good introductions to filtering and tokenization: I definitely recommend Doug Turnbull’s and John Berryman’s great book “Relevant Search”. A good short introduction can be found in Daniel Tunkelang’s article on tokenization. Daniel writes a great many cool things on searching, and it is easy to get lost in the many articles he has published. More practically minded people might want to have a look at the actual tokenizers provided by Elasticsearch, for example. The documentation also covers creating custom filters and tokenizers.
While all these tools are really great and provide very robust means to cover many use cases, I have repeatedly run into their limits in projects over the last ten years. Virtually all of it had to do with user expectations:
- Users will expect exact complex phrases to be found and ranked at the top. This is true both for domain-specific phrases (like archival identifiers such as AV.2001.50005.pm_01) and for multi-word phrases (like “New York”).
- At the same time, users will expect partial phrases to render equally good results (like searching for “AV 50005” in the archival identifiers).
- Likewise, users will be confused by seemingly out-of-place search results. I experience this a lot when I use tokenizers that create base forms of words while indexing, often resulting in “weird” results during search.
- People will also query exploratively to find data they did not expect or know of in the first place (for more on this, see “Search: The Whole Story” by Daniel Tunkelang).
A lot of these struggles are akin to the good old precision vs. recall trade-off, but quite a few of them arose because search engines acted either too smart (such as creating base forms) or too dumb (such as breaking archival identifiers apart into several tokens). Creating custom tokenizers in most search engines is not exactly trivial, and it is often hard to fine-tune tokens to the level you sometimes need.
If you have had similar problems, here are some strategies that have helped me tremendously.
BYOT: Bring Your Own Tokens
To grapple with the problems mentioned above, I have started to create my own token fields. This is most useful when you have fields that are uniquely structured (archival identifiers again…) or contain data that is hard to tokenize by traditional means. I use this a lot in multilingual databases where fields contain English, German, Russian, and Hebrew texts, often at the same time. Likewise, it gives you very fine-grained control over token creation without having to struggle with domain-specific tokenization definitions in XML files. I am a programmer, so maybe I am biased…
To get in medias res, the first step is to do the tokenization yourself. How depends on your specific use case. Moreover, it only makes sense if traditional tokenizers struggle or fail to create relevant tokens.
Fixed String Tokens
A classic. How often have I had this? I work for archives a lot, such as the Memorial Archives. My Solr instance was set to save archival record fields, file name fields, etc. as text. Now, such fields contain strings like AGFl_K.22.0002.1.jpg. Both the string and its parts have a deeper meaning to the archive and often link a series of documents to digital copies and vice versa. What did the standard Solr text field do? Naturally, it split the string “AGFl_K.22.0002.1.jpg” into the tokens “agfl”, “k”, “22”, “0002”, “1”, and “jpg”. This made partial searches really easy, because querying for something like “agfl 0002” returned all the documents containing these tokens. But the users were very unhappy, because searching for “AGFl_K.22.0002.1.jpg” suddenly rendered weird results. The reason was that the precise query string was tokenized, too. And since tokenization filtered out “short” tokens such as “k” and “1”, the search became one for “agfl”, “22”, “0002”, and “jpg”. In a database containing millions of digital copies, quite a lot of results matched these criteria. Naturally, the users were unhappy.
Converting the field to a string token type solved the second issue but made partial searches impossible. Creating both a string and a text field proved more viable, but still produced many edge cases. For the Memorial Archives this was good enough, but in similar, smaller cases I have started to create custom tokens akin to n-grams (because users might also search for “agfl_k 1.jpg”, right?).
So our custom tokenizer would split up the term “AGFl_K.22.0002.1.jpg” into the following tokens:
- “AGFl_K.22.0002.1.jpg” (exact term)
- “agfl”, “k”, “22”, “0002”, “1”, “jpg” (term parts, completely exploded)
- “agfl_k”, “k.22”, “22.0002”, “0002.1”, “1.jpg” (2-grams)
Naturally, one might create higher-level n-grams, depending on the use case.
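The scheme above can be sketched as a small custom tokenizer in Python. This is a minimal sketch, not the actual implementation from the project; the separator set (dots and underscores) and the lower-casing rule are assumptions chosen to match the example identifier:

```python
import re

def tokenize_identifier(value: str) -> list[str]:
    """Split an identifier like 'AGFl_K.22.0002.1.jpg' into the exact
    term, its fully exploded parts, and 2-grams of adjacent parts."""
    lowered = value.lower()
    # Keep the separators by using a capture group in the split pattern.
    pieces = re.split(r"([._])", lowered)
    parts, seps = pieces[0::2], pieces[1::2]
    tokens = [value]                     # exact term, case preserved
    tokens += parts                      # term parts, completely exploded
    # 2-grams: adjacent parts rejoined with their original separator
    tokens += [a + sep + b for a, sep, b in zip(parts, seps, parts[1:])]
    return tokens
```

Calling `tokenize_identifier("AGFl_K.22.0002.1.jpg")` yields the exact term plus the exploded parts and 2-grams listed above.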
Domain Specific Tokens
Another common case I encounter is domain-specific expectations regarding search behavior. One example: quite a few years ago, I had to build an international postal code database. Users could enter either a location name (such as “New York”) or a postal code (such as “89919”). The application would show possible results in an autocomplete field and also fill out the rest of a form once the right location or postal code was entered or selected. Since it was international, users could also enter the postal code as “DE 89919” or “US 89919” (this was explained in the form), and English-speaking users (based on browser preferences) would be offered U.S. zip codes before others, while Polish users would see Polish postal codes sorted to the top. It was also expected that customers might enter partial place names or use place names in other languages (such as Munich for München). So a typical entry in the data set had the following fields:
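As a minimal sketch, an entry for München might look like this. All field names here are illustrative assumptions, not the actual production schema:

```python
# Hypothetical shape of one data-set entry: the place name, its
# country and postal code, and multilingual synonyms for the name.
entry = {
    "country": "DE",
    "postal_code": "80333",
    "name": "München",
    "synonyms": ["Munich", "Múnich", "Monaco di Baviera",
                 "Мюнхен", "Monachium"],
}
```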
We decided to ease search as much as possible. Entered synonyms could be in any language, so an Italian might enter “Munich”, “Monaco”, or “München” and still find the relevant postal code. Consequently, the name and all synonyms were folded into ASCII-based search terms:
- München → “munchen” and “muenchen” (because people might search for the umlaut ue, too!)
- Munich → “munich”
- Múnich → “munich” (a duplicate, so we removed it from the list)
- Monaco di Baviera → “monaco”, “monaco di”, “monaco di baviera” (to facilitate typeahead)
- Мюнхен → “miunkhen” (ASCII folding/transliteration)
- Monachium → “monachium”
Likewise, the postal code was compiled into two tokens:
- “80333” (literal postal code)
- “de 80333” (search using the country prefix)
So the final search token entries for the data set contained the terms: “80333”, “de 80333”, “munchen”, “muenchen”, “munich”, “monaco”, “monaco di”, “monaco di baviera”, “miunkhen”, “monachium”. In tests, the search proved to be pretty precise.
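The folding and phrase steps above can be sketched in Python. The umlaut mapping and the use of NFKD normalization are my assumptions about how such folding could be done; Cyrillic transliteration (“Мюнхен” → “miunkhen”) would need an extra library such as unidecode and is not shown here:

```python
import unicodedata

# German convention: umlauts can also be spelled out as two letters.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def fold(term: str) -> list[str]:
    """ASCII-folded variants of a term: plain folding (ü → u) plus the
    spelled-out umlaut variant (ü → ue) when umlauts are present."""
    lowered = term.lower()
    # Plain folding: strip diacritics via NFKD decomposition.
    plain = unicodedata.normalize("NFKD", lowered)
    plain = plain.encode("ascii", "ignore").decode("ascii")
    variants = {plain} if plain else set()
    if any(ch in UMLAUTS for ch in lowered):
        variants.add("".join(UMLAUTS.get(ch, ch) for ch in lowered))
    return sorted(variants)

def prefix_phrases(term: str) -> list[str]:
    """Word-prefix phrases to facilitate typeahead: 'monaco di baviera'
    yields 'monaco', 'monaco di', 'monaco di baviera'."""
    words = term.split()
    return [" ".join(words[: i + 1]) for i in range(len(words))]
```

For example, `fold("München")` returns both “muenchen” and “munchen”, matching the two variants listed above.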
Creating Word Edge N-grams
Sometimes, people struggle with a common full text search limitation: by default, many search engines find tokens, not word parts. This becomes apparent when using typeahead functions: you enter “n”, nothing; you enter “ne”, nothing; finally you enter “new” and get “New York”. This is how full text searches work internally (essentially being full word lookup libraries). Although search engines like Solr and Elasticsearch let you use partial searches, you will have to handle the logic yourself (translating “n” to “n*” internally, dealing with user input if someone enters “n*” themselves, and so on… this can become quite interesting if you want to enable or disable certain search logics in Solr and Elasticsearch).
So, sometimes it might be useful to create edge n-grams of your search words. Essentially, this means that a search term like “80333” is split into “8”, “80”, “803”, “8033”, and “80333”. For longer texts this can add up quickly, but there are certain advantages:
- No need to handle partial searches in the application.
- If you use a key-value store (either in memory or in an external service), you can actually predefine all possible search results and their sort order (the term being the search key, the results being the data yielded by a key query).
Saving edge n-grams in memory will be faster than any search engine, although at the cost of RAM, naturally. So this is an option for relatively small data sets (such as the postal code database detailed above) that really need speed.
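Generating the edge n-grams themselves takes only a few lines; here is a minimal Python sketch (the minimum length of 1 is an assumption and can be raised to suppress very short keys):

```python
def edge_ngrams(term: str, min_len: int = 1) -> list[str]:
    """Expand a term into its edge n-grams, from the shortest prefix
    up to the full term: '80333' -> '8', '80', '803', '8033', '80333'."""
    return [term[:i] for i in range(min_len, len(term) + 1)]
```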
Persisting Data and Searching
In the Search Engine
Now that we have created our tokens, what do we do with them? The easiest option is to simply hand them over to an existing search engine. This can be an extra indexed (but not stored) field that is used only for full text search. Your searches are more likely to return relevant results without changing too much in your application. This is especially useful if you do not want to fundamentally change your architecture and/or searching is combined with facets and filters.
There is one drawback to this method: you will lose the ability to highlight text fragments in your result set, at least those provided by the search engine. You can always create your own fragment generator, but for that you need to save the token positions in relation to the original text, retrieve these positions when a hit is generated, and do some string magic to create highlighted fragments. It is less work than it sounds, but depending on your use case, this might be a no-go.
Searching is also easy: once a user has entered a search term, you just process the search string with the same means to create search tokens and hand those over to your search engine. Edge n-grams are an exception: you will always want to search for the exact term to get results (or split two or more terms up and let the search engine handle the details).
In a Database
Once you have created your own tokens, the question might be: Why use a search engine?
If you do not plan to sort by relevance, this might be a feasible option. You can skip setting up an external search engine (cluster), synchronizing your data into it, keeping track of deleted records, and so on. Sounds nice, right?
In fact, I have started using this quite often in microservices: why not store the search terms with the data itself? If you use a NoSQL database like MongoDB, this is actually quite easy:
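As a rough sketch (collection and field names are illustrative), such a document could carry a precomputed terms array built from the postal code example above; the pymongo calls are left as comments because they require a running MongoDB instance:

```python
# Hypothetical document: the record's own data plus a "terms" array
# aggregated from several fields, used only for full text lookups.
doc = {
    "name": "München",
    "postal_code": "80333",
    "terms": ["80333", "de 80333", "munchen", "muenchen", "munich",
              "monaco", "monaco di", "monaco di baviera",
              "miunkhen", "monachium"],
}

# With pymongo (requires a running MongoDB instance):
#   collection.create_index("terms")        # multikey index over the array
#   collection.find({"terms": "muenchen"})  # fast exact-token lookup
```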
Just create an index on the field “terms” and your full-text search will be as fast as any search engine’s. You can also aggregate multiple fields into this field array.
Naturally, you will lose a few cool features that are available in full text search engines:
- There is no sorting by relevance (although you could simulate this in MongoDB by using the aggregation framework and sorting results somehow).
- No highlighting; you could implement it yourself, although it is a bit of work.
For smaller data sets, you could store all your data in memory and search it there. This is mostly a matter of size and of creating an index yourself. The biggest challenge is creating an index that can handle partial queries. This can be done using a key-value store or some algorithm that maintains linked phrase trees (there are several implementations of these). Still, programmatically, this will be the hardest to implement.
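A minimal in-memory variant can combine edge n-grams with a plain dictionary as the key-value store. This is a sketch under simplifying assumptions: record ids and terms are illustrative, and a real implementation would also fix the sort order of each result list up front:

```python
from collections import defaultdict

def edge_ngrams(term: str) -> list[str]:
    """'80333' -> '8', '80', '803', '8033', '80333'."""
    return [term[:i] for i in range(1, len(term) + 1)]

def build_index(records: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map every edge n-gram of every search term to the ids of the
    records containing it; a lookup is then a single dict access."""
    index = defaultdict(list)
    for record_id, terms in records.items():
        for term in terms:
            for gram in edge_ngrams(term):
                if record_id not in index[gram]:
                    index[gram].append(record_id)
    return index
```

For example, `build_index({1: ["munchen"], 2: ["munich"]})["mun"]` returns both record ids, while the full terms resolve to exactly one record each.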
As you can see, it is sometimes useful to create your own search tokens to make your search results even more relevant to users. Given enough interest in the matter, I will be happy to provide further code examples in future articles.