History FullText Search Integrated, How Does It Work?
Since version 1.1.0, two things have really been on my mind. First off initially I thought WebKit could handle it if I fed your browser history to it in chunks, but it turns out it reads the whole page in one go then renders it. Which I can understand leads to a better experience on the fast networks of today (but I swear browsers used to render partial pages). So I had to rush out a patch, before your browsers started freezing, to add pagination to the history viewer, which should get to you anytime now.
Meanwhile before and after that I looked into integrating fulltext search for your browser history, which I’ve now got working very well. Except it doesn’t play well with the pagination yet, but I’ll fix that tomorrow. So look out for another blogpost then shortly followed by Odysseus 1.2.01.
The fulltext search is powered by the FTS5 SQLite extension, which despite being part of the SQLite codebase needed to be packaged with Odysseus. And it’s got lots of power-features, which I’ll document once 1.2.0 is out in the AppCenter.
How does SQLite FTS5 work?
There’s two (or three) aspects to how fulltext search works in SQLite. There’s the datastructures it stores on disk and the algorithms it uses to transform this datastructure into answers to your searches.
Data Representation
As always the cleverness of FTS5 is in how it represents data and less in how it processes it (even if CPUs only really understand the latter). Specifically FTS5 stores a list of all words used in a database table (ie your browser history) with some simple prefix compression. And for each of those it stores which records contain that word, which attributes in that record (those two could be swapped, but this optimizes the non-power features minusculy over the power features), and in turn at which text offset it occurs.
Meanwhile SQLite stores all it’s data in a “forest” of “BTrees”, which are in turn used to represent tables and their indices (indices are datastructures developers can add to give the databases more opportunities to speed things up). So some effort is required to compactly store FTS5’s data alongside those, which is done by encoding it as binary data in the rows of a single table/btree registered with SQLite.
Data Processing
Then when it comes to running query (and after figuring out what you mean), the first step is to look up all your terms in that index. That’s quite easy (a simple Binary Search given the terms are sorted), until you get to quoted sequences of tokens in a consecutive “phrase”2. For those the offset data is used to speed up the scanning of the candidate result records to determine if it really matches that phrase.
After that FTS5 goes searching for records which match all the rows rather than just some3. Again sorting the record identifiers helps a lot here, as it allows FTS5’s postprocessor to peek at which record ID comes next for each term and know that none of the records with smaller IDs satisfy all the terms. So for the others it scans the subsequent results until it finds that record ID or passes it. The process repeats until a record is found that matches all the requirements, at which point the result gets rendered by the embedding application (ie Odysseus).
“Tokenization” AKA How is text split into “terms”?
At the simplest the letters, numbers, etc (collectively referred to by programmers as “characters”) are split into two groups: those that can be part of a term and those that seperate them. This is very easy to implement, but can be quite fussy about which terms match each document.
So in Odysseus FTS5 is configured to use the Porter “stemming” algorithm. Essentially this strips, or occasionally replaces, suffixes off words in multiple predefined passes whilst ignoring, depending on the suffix, words that are too short by some measure. The tokens it outputs may not be proper words and, due simply to how non-formulaic human language is, it may not always be perfect. However does a an awefully good job at making search less fussy..
Final processing
It may be useful to run the tokenizers over each resulting record in order to format the results to indicate matches, or to rank those results. However neither of these are currently being used in Odysseus and I’m not missing it.
How is it used in Odysseus?
In Odysseus 1.2.0 the main thing you will notice is a new searchbox on odysseus:history. That is powered by SQLite FTS5, and once I got everything setup it was very easy to integrate into that page. I simply put a form on the page which sent your search query to SQLite and it’s FTS5 extension via the templating language I already defined, and took the result back from SQLite and rendered it into HTML via the same template. The main trick here was handling the case that no search was specified, which thanks to limited templating around SQL (purely for the performance and reliability of compiling the SQL alongside the template) had to be encoded via the OR operator.
As for setting up FTS5, I had a few false-starts there. First it appears that the FTS5 extension doesn’t ship with elementary OS/Ubuntu/Debian, so I needed to package it with the Odysseus install. But SQLite build system made it easy to extract FTS5 as a single C file with a corresponding header file which, with a little work, fits right in with my build system.
Then I tried loading the extension and initializing the fulltext search within the odysseus:history template. But because all the statements were parsed before any executed, that meant that I couldn’t use the fulltext search table from within the same template. So nevermind, I’ll just add it to the database schema4. And while I’m at it, I configured to share storage of the raw text with the normal history table and not bother with speeding up rankings since I’m instead sorting by date.
But this gets to the real drawback of SQLite’s full-text search as compared to databases like PostgreSQL. In SQLite you have normal tables and special “virtual” tables, including fulltext search tables, that behave differently. You can’t combine full-text search and normal database queries in a single table, whereas PostgreSQL allows you to attach a fulltext-search index to enhance a normal table.
- What, you don’t think having fancy fulltext search warrants a minor number increment?
- Or the NEAR poweruser feature.
- The poweruser features are actually simpler to implement in this case.
- And that’s the real reason I’m incrementing the minor version, it indicates a database upgrade.