SeekStorm’s LSMT-IVF for Billion-Scale Approximate Nearest Neighbor Search


This is part 3 of a 4-part blog post series on Vector vs. keyword search:

  1. War of the worlds vs. we come in peace
  3. LSMT-IVF for Billion-Scale Approximate Nearest Neighbor Search (this post)

While the first part concentrated on the more philosophical aspects of evolving technology and choosing an appropriate architecture, and the second part took a closer look at the fundamentals of both vector search and keyword search, in this third part we want to talk about how SeekStorm, as a deep-tech start-up, is committed to putting its money where its mouth is and delivering on our ideas. As always, uncompromised performance and scaling are paramount for us.

I am convinced that you will never fully understand the inner workings of a complex system, its challenges and bottlenecks, and the compromises to be made unless you implement it yourself from scratch. Only then will you be able to improve on it, rather than being limited by the glass ceiling set by legacy solutions.

When I started working on approximate string search in large dictionaries, I was always pointed to Peter Norvig's Python code for spelling correction. It is very concise code, accompanied by a very educational post, but performance-wise I was somewhat disappointed. That was the inspiration for me to start SymSpell, an approximate string search and spelling correction library that is 1,000,000 times faster than Norvig's code for edit distance 3.

With the Pruning Radix Trie we have developed a novel data structure, derived from a radix trie — but 3 orders of magnitude faster.

So, while I really do appreciate and respect the work that went into popular solutions, I have learned there is always room for improvement and I’m convinced that the future of search has just begun.

How about vector search? I have been following the advent of vector search, deep learning, and embeddings for quite some time, and I am really excited about the opportunities it enables. But in contrast to others, I'm not ready to dispose of the proven concept of an inverted index altogether. I see benefits in both approaches. Inverted index search offers precise search with perfect recall and impressive performance and scaling properties, while vector search brings the very useful concept of semantic similarity to search. But currently, vector search is not yet competitive in terms of performance and scaling for billion-scale search and beyond.

That spurred my interest. Could we do better, and how? What are the challenges, and how can we address them? As always, I took the best state-of-the-art technology as a benchmark baseline for our own research and prototype implementation. Facebook's FAISS.IndexFlatIP, FAISS.IndexHNSWFlat, and FAISS.IndexLSH are a good starting point.

Challenges in vector search

What are the challenges in a vector search architecture?

  • scaling to large document counts: either the index size or the indexing time required for “construction” or “learning” grows quadratically

Apart from the part where the embeddings (e.g., BERT) are generated by deep learning, there is nothing very specific about vector search. It's all about the usual trade-offs one has to make when designing a high-performance search engine architecture (indexing performance vs. search performance, storage space vs. storage cost vs. storage speed) and all the usual suspects to deal with them: partitioning, hierarchies, levels, segments, LSM trees and tries, concurrency, locking (and lock-free structures), compression, hardware acceleration, RAM/SSD hybrid strategies, persistence, crash safety, recovery strategies, etc. We are right at home here. Being in full control of every bit of our architecture allowed us to integrate vector search as a first-class citizen on par with keyword search. SeekStorm adds vector search to its tool belt, not as an afterthought to keyword search, but as an equally genuine capability, with full performance and no restrictions.

SeekStorm's neural search combines AI, machine learning (specifically deep learning), embeddings, and high-performance, highly scalable vector search into a semantic search capability that understands meanings, concepts, similarity, and synonyms.

So let us introduce SeekStorm’s LSMT-IVF — a new data structure for Billion-Scale Approximate Nearest Neighbor Search.

SeekStorm's new vector search algorithm LSMT-IVF is a combination of a log-structured merge tree (LSMT) and an inverted file index (IVF).

In Facebook's FAISS IVF method, the vectors are partitioned into nlist clusters. At search time we first find the nprobe closest clusters, and then search only within those nprobe clusters for the topk closest vectors. By limiting the search to the nprobe closest clusters, the exhaustive search becomes a non-exhaustive, approximate nearest neighbor search with recall < 1.
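For reference, here is a minimal FAISS IVF sketch in Python; the data and the parameter values (d, nlist, nprobe) are illustrative placeholders, not the settings used in the benchmark further below.

```python
import numpy as np
import faiss

d = 128                               # vector dimensionality (Sift1M uses 128)
nlist = 1024                          # number of clusters (Voronoi cells)
quantizer = faiss.IndexFlatL2(d)      # coarse quantizer holding the cluster centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

xb = np.random.random((100_000, d)).astype('float32')   # placeholder vectors
index.train(xb)                       # k-means clustering ("learning") of the vectors
index.add(xb)                         # assign each vector to its nearest cluster

index.nprobe = 8                      # clusters scanned per query: the recall vs. speed knob
D, I = index.search(xb[:5], 10)       # distances and ids of the top-10 approximate neighbors
```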

Trade-offs in IVF-based algorithms:

  • for the same recall, we can trade off indexing speed against query speed

The log-structured merge tree (LSMT) data structure allows indexing to relatively slow storage (HDD, SSD) with a high insert volume. It exploits the fact that both HDDs and SSDs have a much higher sequential write rate than random-access write rate. The algorithm rewrites the data sequentially over several levels until it is segmented in a way that allows random access at query time. SeekStorm uses a sampling-based k-medoid clustering instead of the Lloyd's k-means centroid clustering (Voronoi iteration or relaxation) used in FAISS IVF.
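To illustrate the clustering step, here is a minimal sampling-based k-medoids sketch in Python. The function name, sample size, and iteration count are assumptions for illustration only and do not reflect SeekStorm's actual implementation.

```python
import numpy as np

def sampled_kmedoids(vectors, k, sample_size=4096, iters=10, seed=0):
    """Illustrative sampling-based k-medoids: cluster a random sample of the
    indexed vectors and return k medoids (actual data points, not centroids)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(vectors), size=min(sample_size, len(vectors)), replace=False)
    sample = np.asarray(vectors, dtype=np.float32)[idx]
    medoids = sample[rng.choice(len(sample), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared L2 distances sample -> medoids via the dot-product expansion
        d2 = ((sample ** 2).sum(1)[:, None]
              + (medoids ** 2).sum(1)[None, :]
              - 2.0 * sample @ medoids.T)
        assign = d2.argmin(axis=1)
        for c in range(k):
            members = sample[assign == c]
            if len(members) == 0:
                continue
            # new medoid = the member minimizing the summed squared distance to all other members
            md2 = ((members ** 2).sum(1)[:, None]
                   + (members ** 2).sum(1)[None, :]
                   - 2.0 * members @ members.T)
            medoids[c] = members[md2.sum(axis=1).argmin()]
    return medoids
```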

Combining LSMT and IVF allows for

  • incremental training/indexing (see the conceptual sketch below)
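To make the combination concrete, here is a conceptual Python sketch of an LSM-style stack of IVF segments: new vectors are buffered, flushed into a freshly trained segment, and queries merge the per-segment top-k results. The class name, parameter values, and the use of FAISS's built-in k-means training are assumptions for illustration; SeekStorm's actual LSMT-IVF implementation (in C#, with sampling-based k-medoids and level-wise segment merging) is not shown here.

```python
import numpy as np
import faiss

class LsmIvfSketch:
    """Conceptual sketch: an LSM-style stack of immutable IVF segments."""

    def __init__(self, dim, nlist=256, segment_size=100_000):
        self.dim, self.nlist, self.segment_size = dim, nlist, segment_size
        self.buffer = []      # not-yet-indexed vectors (searched only after a flush in this sketch)
        self.segments = []    # immutable IVF segments (LSM levels)
        self.next_id = 0      # globally unique vector ids across segments

    def add(self, vectors):
        self.buffer.extend(np.atleast_2d(np.asarray(vectors, dtype='float32')))
        if len(self.buffer) >= self.segment_size:
            self._flush()

    def _flush(self):
        xb = np.vstack(self.buffer)
        quantizer = faiss.IndexFlatL2(self.dim)
        seg = faiss.IndexIVFFlat(quantizer, self.dim, self.nlist)
        seg.train(xb)                                   # cluster only this segment's vectors
        ids = np.arange(self.next_id, self.next_id + len(xb), dtype='int64')
        seg.add_with_ids(xb, ids)
        self.next_id += len(xb)
        self.segments.append(seg)   # a real LSM tree would also merge segments level by level
        self.buffer = []

    def search(self, query, k=10, nprobe=8):
        q = np.asarray(query, dtype='float32').reshape(1, -1)
        candidates = []
        for seg in self.segments:
            seg.nprobe = nprobe
            D, I = seg.search(q, k)
            candidates.extend(zip(D[0], I[0]))
        # merge the per-segment top-k lists into a global top-k (smallest L2 distance first)
        return sorted(candidates)[:k]
```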

Benchmarking SeekStorm’s new vector search algorithm LSMT-IVF against FAISS.IVF for 1 million docs with the Sift1M dataset:

37x faster learning/indexing speed and 3x higher search speed (nlist=4096), OR
3x faster learning/indexing speed and 4x higher search speed (nlist=1024),
at the same recall (97%), on the same hardware (CPU, multithreaded).

FAISS is written in C++ and compiled ahead-of-time to machine code, while LSMT-IVF is written in C# and compiled just-in-time by the .NET CLR.

Properties of SeekStorm’s LSMT-IVF:

  • Hybrid RAM/SSD storage

SeekStorm's new vector index is currently experimental 🧪 and not yet in production.

Originally published at https://seekstorm.com on November 14, 2021.
