Vector search vs. Keyword search — Data structures and algorithms

  1. War of the worlds vs. we come in peace
  2. Data structures and algorithms
  3. LSMT-IVF for Billion-Scale Approximate Nearest Neighbor Search
  4. Benchmarking

Vector vs. keyword search 2: Data structures and algorithms

Keyword search

  • part of the transformation process to turn sequential writes into random-access oriented posting list
  • part of a periodically routine, that restores index consistency. In an LSM deletes and updates are not directly performed in the posting list as this would require slow random access. instead for each new delete and update a new entry is sequentially writen into the LSM, and during search only the latest entry is taken into account. This causes that updated and deleted documents occur multiple times within the index, clutter the index, take up space and prolonge the query latency. Compaction has the function to regularly consolidate the index, to remove all tombstones and to integrate all occurrences of the same document into one.
  • Size-tiered compaction
  • Leveled compaction
  • SeekStorms sharded compaction

Vector search

  1. capture (learn) the meaning of words by statistical co-occurrence analysis of huge document repositories
  2. turn our documents of words into vectors of meanings, based on the previously precalculated word embedding data
  • Word2Vec
  • GloVe
  • fastText
  • BERT
  • Word level embeddings from BERT
  • Sentence level embeddings from BERT
  • Linear search
  • Hierarchical Navigable Small Worlds (HNSW)
  • Index partitioning/clustering
  • document compression with zstandard
  • vector quantization (VQ)
  • Principal component analysis (PCA) is used for Dimensionality reduction to reduce vectors with large dimensions to smaller dimensions.
  • Product Quantization (PQ) is used for compressing and storing vectors of large dimensions
  • Optimized Product Quantization (OPQ)
  • CPU or GPU hardware acceleration of vector arithmetics

Speed vs. Accuracy

  • for the same recall, we can trade-off between indextime vs query time
  • additionally we can trade recall for both indexing speed and query speed
  • nlist, nprobe and different clustering methods are parameters to balance this

--

--

Founder SeekStorm (Search-as-a-Service), FAROO (P2P Search) https://seekstorm.com https://github.com/wolfgarbe https://www.quora.com/profile/Wolf-Garbe

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store