Vector search vs. Keyword search — Data structures and algorithms


This is part 2 of a 4-part blog post series on Vector vs. keyword search:

  1. War of the worlds vs. we come in peace

  2. Data structures and algorithms (this post)

Now let’s take a closer look at the inner workings of both keyword search and vector search. Only if we take the time to really understand the core principles of both approaches will we be able to identify where exactly the differences and similarities lie, and how we can utilize synergies and combine the technologies in a way that makes them complementary.

Keyword search

Linear search

Linear search is the most straightforward and most inefficient way to search. All documents are sequentially scanned for the occurrence of the query keyword. That means that for every single query we have to evaluate every single byte of all indexed documents. Query time doesn’t scale to large numbers of documents, especially if the index size exceeds the available RAM and we need to store the index on HDD or SSD.

The benefits of linear search are the simple implementation of both indexing and search, a high sequential writing speed, as well as a minimal index storage size.
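To make this concrete, here is a minimal sketch of linear keyword search in Python; the sample documents and the simple substring matching are of course simplifications:

```python
# Minimal sketch of linear keyword search: every document is scanned for every query.
documents = [
    "the quick brown fox",
    "vector search vs keyword search",
    "inverted index and posting lists",
]

def linear_search(query_keyword, docs):
    """Return the ids of all documents that contain the query keyword."""
    return [doc_id for doc_id, text in enumerate(docs) if query_keyword in text]

print(linear_search("search", documents))  # [1]
```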

Inverted index

The inverted index is a set of posting lists, one for each keyword. A posting list is an ordered list of document IDs that identifies all documents in which that specific keyword occurs. Sometimes it also contains positional information about where in the document the keyword occurs, to allow for phrase and proximity search.

The benefit of an inverted index is the high lookup speed, where a query is reduced to a single random access. That comes at the cost of increased storage size and, in the naive implementation, a slow indexing speed.
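A minimal sketch of an in-memory inverted index, assuming simple whitespace tokenization and toy documents:

```python
from collections import defaultdict

documents = [
    "vector search vs keyword search",
    "keyword search with an inverted index",
]

# Indexing: for each term, append the document id to its posting list.
inverted_index = defaultdict(list)
for doc_id, text in enumerate(documents):
    for term in set(text.split()):
        inverted_index[term].append(doc_id)

# Search: a query is reduced to a single lookup per keyword.
print(inverted_index["keyword"])  # posting list of document ids: [0, 1]
```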

Log-structured merge tree

The purpose of the Log-structured merge tree (LSM tree) is to increase indexing speed. When a document is indexed and parsed into single terms, those terms arrive in a quasi-random order. But the inverted index requires that each term and its associated document ID are stored in the posting lists, which are ordered by keyword. In the naive approach that would require a random access to a posting list for each term. As long as the index is small and fits into RAM, that is no problem. But if we want the index size to scale beyond RAM and need to write the index to HDD or SSD, then that becomes a challenge. Not only is the writing speed of an SSD slower than that of RAM, its random-access writing speed is also orders of magnitude slower than its sequential writing speed.

This is where the LSM data structure comes into play. It allows us to sequentially index the stream of keyword/docID pairs, and later transforms those data, over several intermediate steps, into random-access oriented posting lists. That allows us to index at sequential speed, but to search with random access.

And random access at query time is exactly what we want, as we don’t want to scan the whole index of several GB or TB for each query. The slower random-access speed doesn’t matter for a single access to the index, compared to billions of sequential accesses.
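The following is a highly simplified sketch of that idea, assuming an in-memory write buffer that is flushed as a sorted, immutable segment once a (deliberately tiny) threshold is reached; real LSM implementations add on-disk segment formats, bloom filters, and background merging:

```python
from collections import defaultdict

BUFFER_LIMIT = 4  # flush threshold, deliberately tiny for illustration

buffer = defaultdict(list)   # in-memory write buffer (memtable): term -> doc ids
segments = []                # flushed, immutable segments (on disk in a real system)

def index_term(term, doc_id):
    """Accept keyword/docID pairs in arrival order; only sequential appends."""
    buffer[term].append(doc_id)
    if sum(len(ids) for ids in buffer.values()) >= BUFFER_LIMIT:
        flush()

def flush():
    """Write the buffer out as one sorted segment with a single sequential write."""
    segments.append(dict(sorted(buffer.items())))
    buffer.clear()

for doc_id, text in enumerate(["fast keyword search", "fast vector search"]):
    for term in text.split():
        index_term(term, doc_id)

def search(term):
    """A query must consult the buffer and every segment (until compaction merges them)."""
    result = list(buffer.get(term, []))
    for segment in segments:
        result.extend(segment.get(term, []))
    return sorted(result)

print(search("fast"))  # [0, 1]
```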

The benefit of an inverted index with an LSM tree is the increased indexing speed, which allows us to utilize slow HDD or SSD storage to scale to large document numbers. It comes at the cost of a much more complex implementation, an increased index storage size, and write amplification that can reduce the lifetime of an SSD with its limited number of allowed writes.

Compaction

Function:

  • part of the transformation process that turns sequential writes into random-access oriented posting lists

Compaction types

  • Size-tiered compaction (see the sketch below)
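A minimal sketch of size-tiered compaction, assuming segments are plain dictionaries of sorted posting lists as in the LSM sketch above; several similarly sized small segments are merged into one larger segment:

```python
from collections import defaultdict

def merge_segments(segments):
    """Merge several sorted segments into one larger segment (size-tiered compaction)."""
    merged = defaultdict(list)
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term].extend(doc_ids)
    # Keep every posting list sorted so that later merges and lookups stay cheap.
    return {term: sorted(doc_ids) for term, doc_ids in sorted(merged.items())}

small_segments = [
    {"fast": [0, 1], "search": [0]},
    {"fast": [2], "vector": [2]},
]
print(merge_segments(small_segments))
# {'fast': [0, 1, 2], 'search': [0], 'vector': [2]}
```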

Index optimization

Vector search

In vector search both documents and queries are transformed into vectors. The document vectors are then stored in Nearest Neighbor indexes to allow for similarity search. There are different architectures of Nearest Neighbor indexes. All of them use k-nearest-neighbor (kNN) search to return the top-k documents whose vectors are closest to the query vector according to a specific similarity metric.

There is a very nice overview of vector search solutions by Dmitry Kan.
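To illustrate the underlying operation, here is a brute-force top-k nearest neighbor search with cosine similarity over random vectors; the dimensionality and document count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
doc_vectors = rng.normal(size=(1000, 768))   # 1000 documents, 768 dimensions
query_vector = rng.normal(size=768)

# Cosine similarity = dot product of L2-normalized vectors.
doc_norm = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
query_norm = query_vector / np.linalg.norm(query_vector)
similarities = doc_norm @ query_norm

k = 5
top_k = np.argsort(-similarities)[:k]        # ids of the k most similar documents
print(top_k, similarities[top_k])
```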

Turning documents into vectors

Although today the terms vector search and similarity search are used synonymously, we should remember that the embedding step and the resulting semantic similarity search have not been part of vector search from the beginning. At first, only vectors of word occurrences were compared using cosine similarity. That already enabled a kind of similarity search where not all query keywords were required to appear within the result documents. Only with the advent of machine learning was an extra step added that turned term vectors into vectors of meanings (dimensions/features) and enabled a whole new perspective in search: searching for semantic similarity.

Today artificial intelligence (AI) methods like machine learning (ML), specifically deep learning (DL), are used to first capture the meaning of words by performing statistical co-occurrence analysis of huge document repositories. This is done by tokenizing the document text and then calculating for each unique token a word embedding, which is a vector with a certain number of dimensions/features (e.g. 768). Then, based on the precalculated word embedding data, we can during indexing turn our documents of words into vectors of meanings. The whole process is usually very expensive in terms of time and processing power, often with quadratic scaling properties with regard to the number of documents.

It is done in two steps:

  1. capture (learn) the meaning of words by statistical co-occurrence analysis of huge document repositories
  2. turn documents of words into vectors of meanings, based on the precalculated word embedding data

Luckily we can skip the first step, as precalculated word embedding data is released as open source on the web, e.g. for Google BERT.
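One common and simple way to get from word embeddings to a document vector is to average the embeddings of all tokens in the document. The tiny embedding table below is made up purely for illustration; real word embeddings come from pretrained models such as BERT or Word2Vec:

```python
import numpy as np

# Hypothetical, tiny word embedding table (real embeddings have e.g. 300-768 dimensions).
word_embeddings = {
    "vector":  np.array([0.9, 0.1, 0.0]),
    "keyword": np.array([0.1, 0.9, 0.0]),
    "search":  np.array([0.5, 0.5, 0.2]),
}

def document_vector(text):
    """Turn a document of words into a single vector of meanings by averaging."""
    vectors = [word_embeddings[token] for token in text.split() if token in word_embeddings]
    return np.mean(vectors, axis=0)

print(document_vector("vector search"))  # [0.7 0.3 0.1]
```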

Tokenizing text with WordPiece Embeddings

  • Word2Vec
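As an example of the WordPiece tokenization mentioned above, the following sketch uses the Hugging Face transformers library with the pretrained bert-base-uncased vocabulary (assuming the library and model files are available locally):

```python
from transformers import BertTokenizer

# Loads the pretrained WordPiece vocabulary used by Google BERT.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Vector search uses embeddings")
print(tokens)
# Rare or unknown words are split into subword pieces prefixed with '##';
# the exact split depends on the vocabulary of the chosen model.
```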

Index architectures

Let’s have a look at the three most frequently used vector index methods:

  • Linear search
  • Hierarchical Navigable Small Worlds (HNSW)
  • Index partitioning/clustering

All three options are available in Facebook’s Faiss library for efficient similarity search and clustering of dense vectors. And there are some configuration tips here and here.

Linear search

Sometimes also called flat index, linear search is the most straightforward and most inefficient way to search. All document vectors are sequentially compared to the query vector.

The benefits of linear search are the straightforward implementation, a high sequential indexing speed, and perfect recall. The drawbacks are the limited scaling to large document numbers and the slow query speed.

Examples of flat indices with linear search are the FAISS.IndexFlatIP or FAISS.IndexFlatL2.
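A minimal Faiss example of a flat index with exhaustive search; the random vectors stand in for real document embeddings, and the dimensionality is arbitrary:

```python
import faiss
import numpy as np

d = 768                                    # vector dimensionality
doc_vectors = np.random.random((10000, d)).astype("float32")
query_vectors = np.random.random((5, d)).astype("float32")

index = faiss.IndexFlatL2(d)               # flat index, exact (exhaustive) L2 search
index.add(doc_vectors)                     # indexing is a plain sequential append

distances, ids = index.search(query_vectors, 10)   # top-10 nearest documents per query
print(ids[0], distances[0])
```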

Hierarchical Navigable Small Worlds (HNSW)

HNSW organizes the document vectors in a multi-layered proximity graph; a query greedily navigates this graph from an entry point in the sparse top layer down to the dense bottom layer, visiting only a small fraction of all vectors. We gain a much higher search speed at the cost of approximate search results with limited recall.

Examples of indices based on Hierarchical Navigable Small World graphs are the FAISS.IndexHNSWFlat, HNSW (nmslib), or N2.
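A minimal Faiss HNSW example; M, efConstruction and efSearch are the usual tuning knobs, and the values below are only illustrative:

```python
import faiss
import numpy as np

d = 768
doc_vectors = np.random.random((10000, d)).astype("float32")
query_vectors = np.random.random((5, d)).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)         # 32 = M, the number of graph neighbors per node
index.hnsw.efConstruction = 200            # build-time search depth: higher = better graph, slower build
index.add(doc_vectors)

index.hnsw.efSearch = 64                   # query-time search depth: higher = better recall, slower query
distances, ids = index.search(query_vectors, 10)
print(ids[0])
```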

Index partitioning/clustering

The index is partitioned into clusters of similar vectors. The lookup is divided into two steps: first we find the top-c clusters most similar to the query vector, and second we search only within those top-c clusters for the top-k documents most similar to the query vector.

The partitioning into clusters can be achieved by different methods, e.g. with Voronoi diagrams or locality-sensitive hashing, as described below.

With clustering, we again gain search speed at the cost of approximate search results with limited recall.

An example of index clustering with Voronoi diagrams (Dirichlet tessellation) is the FAISS.IndexIVFFlat (IVF = Inverted File Index).
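A minimal Faiss IVF example; nlist is the number of Voronoi cells and nprobe the number of cells visited per query, with both values chosen only for illustration:

```python
import faiss
import numpy as np

d = 768
doc_vectors = np.random.random((10000, d)).astype("float32")
query_vectors = np.random.random((5, d)).astype("float32")

nlist = 100                                  # number of clusters (Voronoi cells)
quantizer = faiss.IndexFlatL2(d)             # assigns vectors to their nearest cell centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(doc_vectors)                     # k-means training of the cell centroids
index.add(doc_vectors)

index.nprobe = 8                             # number of cells searched per query (the top-c clusters)
distances, ids = index.search(query_vectors, 10)
print(ids[0])
```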

An example of index clustering with Locality Sensitive Hashing is the FAISS.IndexLSH.

Other vector search approaches

Index optimization

  • document compression with zstandard
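As an illustration, a small example of compressing a stored document with the zstandard Python bindings; the compression level and the sample payload are arbitrary:

```python
import zstandard

document = b'{"id": 42, "text": "vector search vs keyword search"}' * 100

cctx = zstandard.ZstdCompressor(level=3)     # higher levels compress better but are slower
compressed = cctx.compress(document)

dctx = zstandard.ZstdDecompressor()
restored = dctx.decompress(compressed)

assert restored == document
print(f"{len(document)} bytes -> {len(compressed)} bytes")
```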

Speed vs. Accuracy

In IVF-based algorithms there is a trade-off between recall, index time, and query time:

  • for the same recall, we can trade off index time against query time
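To make this trade-off tangible, here is a small sketch that measures recall (against a flat index as ground truth) and query time of an IVF index for different nprobe values; the concrete numbers depend entirely on data, parameters, and hardware:

```python
import time

import faiss
import numpy as np

d, n, k = 128, 50000, 10
doc_vectors = np.random.random((n, d)).astype("float32")
query_vectors = np.random.random((100, d)).astype("float32")

# Exact ground truth from a flat index.
flat = faiss.IndexFlatL2(d)
flat.add(doc_vectors)
_, true_ids = flat.search(query_vectors, k)

# IVF index with 256 clusters.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256)
ivf.train(doc_vectors)
ivf.add(doc_vectors)

for nprobe in (1, 8, 64):
    ivf.nprobe = nprobe                      # more clusters searched: higher recall, slower queries
    start = time.perf_counter()
    _, ids = ivf.search(query_vectors, k)
    elapsed = time.perf_counter() - start
    recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(ids, true_ids)])
    print(f"nprobe={nprobe:3d}  recall@{k}={recall:.2f}  query time={elapsed * 1000:.1f} ms")
```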

Originally published at https://seekstorm.com on November 5, 2021.


