Note: A fully working example is available on GitHub: https://github.com/mkalus/self-cleaning-blob-storage-example
When writing web software, one of the more annoying things is file uploads. I have always hated them, even in my early ASP, Perl and PHP years around 2000. Back then, you had the file form tag, which allowed you to send binary data to the web server. In most cases, the web server would receive the data and save it into a temporary file for further processing. This was helpful but awkward, especially when form validation failed for some reason: Send the user back to the…