Sunday, September 21, 2014

SQLite Full Text Search on Mobile devices (FTS3)

When we talk about Full Text Search, the libraries that come to mind are Solr, ElasticSearch etc. However, all of these are back end libraries and require the indexes of the files to be created first.

When we talk about searching inside the mobile apps (without hitting the server), either these libraries do not have the client side implementations or are too heavy to be included in an app.

This is where Full Text Search engine of SQLite (FTS3) comes to rescue. FTS3 provides a lot of methods to perform full text search on the database using SQL statements. Although, both iOS and Android support SQLite out of the box, however, even using the same with libraries like SQLCipher is quite easy and doesn't require embedding huge libraries into the app either. I am not going to cover the differences between FTS3 and FTS4 as all that information is available on the link I share below.

All the methods provided by FTS3 with examples are listed here http://www.sqlite.org/fts3.html
. I recommend that you do keep referring to the link if anything is not clear as I will not cover the definition in detail but only the basics of FTS3 and share a code snippet for Android.

The FTS3 and FTS4 extension modules allows users to create special tables with a built-in full-text index, "FTS tables". The full-text index allows the user to efficiently query the database for all rows that contain one or more words also known as "tokens", even if the table contains many large documents.

To create a virtual table, the following statement can be used.
    CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);
This makes the table eligible for Full text query

When the WHERE clause of the SELECT statement contains a sub-clause of the form " MATCH ?", FTS is able to use the built-in full-text index to restrict the search to those documents that match the full-text query string specified as the right-hand operand of the MATCH clause.

The fast full text query looks as follows:     SELECT * FROM mail WHERE subject match 'database';

FTS3 and FTS4 provides three special auxiliary functions that are very useful to the developers:"snippet", "offsets" and "matchinfo". As the SQLite portal states: "The purpose of the "snippet" and "offsets" functions is to allow the user to identify the location of queried terms in the returned documents. The "matchinfo" function provides the user with metrics that may be useful for filtering or sorting query results according to relevance."

In this post, I am going to talk about the Offsets method only as that is the one I found to be most effective if you have to perform a full text search and fetch an excerpt. Although, the snippets function seems to be the one to fetch excerpts, it actually doesn't work as per its name.
The offsets() function returns a text value containing a series of space-separated integers. For each term in each phrase match of the current row, there are four integers in the returned list. Each set of four integers is interpreted as follows:
Integer Interpretation
0 - The column number that the term instance occurs in (0 for the leftmost column of the FTS table, 1 for the next leftmost, etc.).
1 - The term number of the matching term within the full-text query expression. Terms within a query expression are numbered starting from 0 in the order that they occur.
2 - The *byte offset* of the matching term within the column.
3 - The *size* of the matching term in bytes.
Important thing to note here is that the offset is a byte offset and not a character offset.
More details on the same can be read here: http://www.sqlite.org/fts3.html#section_4_1

Let us see an example of Offsets function being used in Android.

First we will create a new virtual table:
database.execSQL("CREATE VIRTUAL TABLE mail USING fts3(subject, body);");
ContentValues contentValues = new ContentValues();
contentValues.put("subject", "Subject");
contentValues.put("body", textContent);
database.insert("mail", null, contentValues);

Now that the table is ready and has one record, we will write the FTS query as follows:
Cursor myCursor = database.rawQuery("SELECT offsets(mail), body FROM mail WHERE mail MATCH 'Republic';", null);

Here along with the offsets, I am also fetching the full text so that I can extract excerpts from the same.
myCursor.moveToFirst(); // Move cursor to first location
String[] offsets = myCursor.getString(0).split(" "); // Split to " " to read integers
String text = myCursor.getString(1); //Store complete body in a variable
byte[] textBytes = text.getBytes(); // Convert text to bytes
ByteArrayInputStream ba = new ByteArrayInputStream(textBytes);
int i = 0;
ArrayList results = new ArrayList();
int textLength = textBytes.length;
ExcerptFinder excerptFinder = new ExcerptFinder(ba); // Provide the stream containing text bytes to ExcerptFinder
while (i < offsets.length)
{
  //Term and column index are ignored because we've searched for a single term only.
  int startOffset = Integer.parseInt(offsets[i + 2]);// Find the start index of searched term
  int endOffset = startOffset + Integer.parseInt(offsets[i + 3]);// Find the end index of searched term

if(startOffset < 0)
{
    startOffset = 0;
}

if(endOffset >= textLength - 1)
{
     endOffset = textLength - 1;
}
  String excerpt = excerptFinder.readFullWords(startOffset, endOffset);
  results.add(excerpt);
  i += 4;
}

So, here we extract excerpts for the searched term by first identifying its start index and end index in bytes. The same process continues for all the results for the row. In this example, we are working with a single row of data, otherwise there would be one more loop for the rows. Since we need to run forwards and backwards in the ByteArray Stream, the ExcerptFinder class I created extends the RandomAccessStream class provided at: RandomAccessStream.java

The relevant code in the Excerpt Finder class goes as follows:
public final String readFullWords(int start, int end) throws IOException
{
  int startOffset = start;
   int endOffset = end;
  byte[] singleByte = new byte[1];
  int currentCharacter = 0;
  int i = 0;
  int spaceCount = 0;
  while(currentCharacter != RETURN_DELIMITER && spaceCount < NUMBER_OF_WORDS)
  {
    startOffset = start - i;
    seek(startOffset);
    if(startOffset <= 0)
    {
      startOffset = -1;
      break;
    }
    read(singleByte, 0, 1);
    currentCharacter = singleByte[0];
    if(currentCharacter == 32)
    {
      spaceCount++;
    }
    i++;
  }
  currentCharacter = 0;
  i = 0;
  spaceCount = 0;
  int readBytes = 0;
  while(currentCharacter != RETURN_DELIMITER && spaceCount < NUMBER_OF_WORDS)
  {
    endOffset = end + i;
    seek(endOffset);
    readBytes = read(singleByte, 0, 1);
    if(readBytes < 0)
    { //
      endOffset = endOffset - 1;
      break;
    }
  currentCharacter = singleByte[0];
  if(currentCharacter == 32)
  {
    spaceCount++;
  }
  i++;
    }
  seek(startOffset + 1);
  byte[] result = new byte[endOffset - startOffset - 1];
  readFully(result, result.length);
  return new String(result);
}

In the code above, the rule for an excerpt is defined by a RETURN delimiter or NUMBER_OF_WORDS before and after the term. The logic can be tweaked to anything that you want. The key here is to play with the bytes offsets returned by the FTS3 query. Although, we are performing operations at a byte level, the code executes much faster then a RegEx performed on plain text. The search can be made faster by simply fetching a predefined number of bytes before and after the search term without worrying about the number of words.

FTS can be very handy when it comes to providing offline search to mobile apps. You can create the DB at the backend and simply download it on the app to run the FTS query on it. This concludes the post and I look forward to the comments.

If you need the source for the Android app to help you get started, feel free to email me.

No comments:

Post a Comment