Fallacies of index compression in search

Photo by Sergio Jimenez

Posting list compression is an important building block of every search engine. It reduces index size and query latency, as it reduces the amount of data to be loaded from disk or ram (memory bandwidth). Compression algorithms like roaring bitmaps compress the document ID and are effective on large posting lists with many docid.

But two things are often overlooked:

  1. Word frequencies in a text or corpus are distributed according to Zipf’s law. That means they are long-tail with low-frequency terms are dominating the corpus. And those low-frequency terms with short posting lists with a single or only are few docid are barely compressible with traditional methods. Then suddenly the size of posting list metadata such as term, docid count, pointers, etc. becomes dominating, while being negligible for long posting lists. As short posting lists are dominating the index, efficient data structures for meta data are paramount and are heavily influencing the overall compression rate.
  2. The size of position data, required for phrase intersection is usually at least as large, as the docid list. That’s intuitive because for every docid we need to store at least one position, often more. Position lists are relatively short, as the number of positions is limited by the number of words within a document, while docid posting lists are limited by the number of documents in the corpus. Position lists being short result in their low compressibility. Posting list consists of both docids and positions per docid, where only docids are reasonably compressible, but position data being at least of the same size but much less compressible. Therefore the impressive compression rates for docid lists published in academic papers don’t apply to the posting list as a whole.

When looking at compression efficacy as opposed to compression rate we have to distinguish:

  1. The index size and its compressibility are dominated by the long tail distribution of all terms in the index.
  2. For the query latency only the distribution of terms within the query is decisive, with the most frequent term with the longest posting list in the query dominating the query latency

Originally published at https://seekstorm.com on March 15, 2022.

--

--

--

Founder SeekStorm (Search-as-a-Service), FAROO (P2P Search) https://seekstorm.com https://github.com/wolfgarbe https://www.quora.com/profile/Wolf-Garbe

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Wolf Garbe

Wolf Garbe

Founder SeekStorm (Search-as-a-Service), FAROO (P2P Search) https://seekstorm.com https://github.com/wolfgarbe https://www.quora.com/profile/Wolf-Garbe

More from Medium

Protestware, Politics, and Open Source Software

Databases to machine learning: the stage is set for mathematicians

Algorithms of Oppression: How Search Engines reinforce racism?

Clojure, specs and data correctness