
Tokenization is a core concept of full text search. It addresses the question of how texts (which, in the end, are nothing more than byte arrays) can be broken down into individually searchable parts, so-called tokens. The token is what gets saved in the full text index and is thus the smallest unit one can search for. This article covers some strategies for coping with cases your search engine will not be able to handle.
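To make the idea concrete, here is a minimal sketch in Python (my illustration, not code from the article): a deliberately naive analyzer that splits texts into lowercase word tokens and stores them in a toy inverted index.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """Break a text into lowercase word tokens (a deliberately naive analyzer)."""
    return re.findall(r"\w+", text.lower())

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each token to the set of document ids containing it (an inverted index)."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {
    1: "Tokenization is a core concept.",
    2: "Tokens are the smallest searchable units.",
}
index = build_index(docs)
print(index["tokens"])  # {2} -- only documents containing this exact token match
```

Note that a query for "token" would miss document 2, because only the exact token "tokens" was indexed; this is precisely the kind of case where the tokenization strategy starts to matter.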

Most search engines such as Sphinx, Elasticsearch, or Solr (or Lucene, for that matter) provide means to tokenize texts efficiently. More precisely, token creation…
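Elasticsearch, for instance, exposes an `_analyze` endpoint that reveals which tokens a given analyzer produces for a text. A hedged sketch of calling it (host, port, and analyzer choice are assumptions on my part):

```python
import json
import urllib.request

# Ask a local Elasticsearch instance (assumed at localhost:9200) how its
# standard analyzer tokenizes a sentence.
payload = json.dumps({"analyzer": "standard", "text": "Tokenization is a core concept."})
req = urllib.request.Request(
    "http://localhost:9200/_analyze",
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Each returned entry carries the token text plus its position and offsets.
print([t["token"] for t in result["tokens"]])
# e.g. ['tokenization', 'is', 'a', 'core', 'concept']
```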


A solution to get rid of orphaned file uploads once and for all

Note: A fully working example is available on GitHub: https://github.com/mkalus/self-cleaning-blob-storage-example

When writing web software, one of the more annoying things is file uploads. I have always hated them, even in my early ASP, Perl, and PHP years around 2000. Back then, you had the file form tag, which allowed you to send binary data to the web server. In most cases, the web server would receive the data and save it into a temporary file for further processing. This was helpful but awkward, especially when form validation failed for some reason: send the user back to the…
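As a rough modern analogue of the flow described above, here is a sketch of the server side (assuming Flask; the route, form field name, and upload directory are mine, not from the article):

```python
from pathlib import Path
from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = Path("/tmp/uploads")  # assumed target directory for this sketch
UPLOAD_DIR.mkdir(parents=True, exist_ok=True)

@app.route("/upload", methods=["POST"])
def upload():
    # The browser sends the file as multipart/form-data; Werkzeug buffers it
    # (in memory or a temporary file) until we decide what to do with it.
    uploaded = request.files.get("file")
    if uploaded is None or uploaded.filename == "":
        # The awkward case from the text: validation fails and the user
        # has to be sent back to the form.
        return "No file submitted", 400
    uploaded.save(UPLOAD_DIR / secure_filename(uploaded.filename))
    return "Stored for further processing", 201
```

Note the failure branch: in the classic flow, this is the point where an already-stored file risks becoming exactly the kind of orphan the article sets out to eliminate.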

Max Kalus

Software Engineer, Economic Historian and Open Source Enthusiast
