Fallacies of index compression in search


Posting list compression is an important building block of every search engine. It reduces both index size and query latency, because less data has to be loaded from disk or RAM (memory bandwidth). Compression algorithms like Roaring bitmaps compress the document IDs and are effective on long posting lists containing many docids.
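Roaring bitmaps are one such algorithm; a simpler scheme, delta encoding followed by variable-byte (varint) compression, illustrates the same principle. The sketch below is illustrative, not the implementation of any particular engine:

```python
def varint_encode(n: int) -> bytes:
    """Encode a non-negative integer as LEB128-style variable-length bytes:
    7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def compress_posting_list(docids: list[int]) -> bytes:
    """Delta-encode a sorted docid list, then varint-encode each gap.
    Small gaps (dense posting lists) fit in a single byte."""
    out = bytearray()
    prev = 0
    for d in docids:
        out += varint_encode(d - prev)
        prev = d
    return bytes(out)

# 1000 consecutive docids: after delta encoding, almost every gap is 1
# and takes one byte, versus 4 bytes per raw 32-bit docid.
docids = list(range(100_000, 101_000))
compressed = compress_posting_list(docids)
print(len(compressed), "bytes vs", 4 * len(docids), "bytes uncompressed")
```

The win comes entirely from the list being long and dense; a posting list with a single docid gains essentially nothing from this scheme.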

But two things are often overlooked:

  1. Word frequencies in a text or corpus are distributed according to Zipf’s law. That means the distribution is long-tailed: low-frequency terms dominate the corpus. Those low-frequency terms have short posting lists with a single or only a few docids, and such lists are barely compressible with traditional methods. Suddenly the size of the posting list metadata, such as term, docid count, and pointers, becomes dominant, while it is negligible for long posting lists. As short posting lists dominate the index, efficient data structures for metadata are paramount and heavily influence the overall compression rate.
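A back-of-envelope calculation makes the point. The per-term metadata layout below is hypothetical (illustrative sizes, not from any specific engine), as is the assumed average of 1.5 bytes per compressed docid:

```python
# Hypothetical per-term metadata: term pointer (8 B) + docid count (4 B)
# + posting-list offset (8 B). Sizes are illustrative assumptions.
METADATA_BYTES = 8 + 4 + 8  # 20 bytes per term

def overhead_ratio(num_docids: int, bytes_per_docid: float = 1.5) -> float:
    """Fraction of a term's total footprint spent on metadata, assuming
    compressed docids average `bytes_per_docid` bytes each."""
    postings = num_docids * bytes_per_docid
    return METADATA_BYTES / (METADATA_BYTES + postings)

for n in (1, 10, 1000, 1_000_000):
    print(f"{n:>9} docids: {overhead_ratio(n):.1%} metadata overhead")
```

Under these assumptions a single-docid term spends over 90% of its footprint on metadata, while for a million-docid term the metadata is effectively free, which is why metadata layout dominates overall compression when short lists are the common case.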

When looking at compression efficacy as opposed to compression rate, we have to distinguish:

  1. The index size and its compressibility are dominated by the long-tail distribution of all terms in the index.
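A quick simulation shows how dominant the tail is. The Zipf exponent below is an illustrative choice, not a measurement of any particular corpus:

```python
import numpy as np

# Sample 1M term occurrences from a Zipf distribution; each draw is the
# rank of the term occurring at that position. The exponent 1.5 is an
# illustrative assumption (natural-language fits are typically near 1).
rng = np.random.default_rng(42)
tokens = rng.zipf(1.5, size=1_000_000)

# The posting-list length of a term equals its occurrence count.
terms, counts = np.unique(tokens, return_counts=True)
short = np.mean(counts <= 4)
print(f"{len(terms)} distinct terms; "
      f"{short:.0%} have posting lists of 4 or fewer entries")
```

Even though the few head terms account for most token occurrences, the overwhelming majority of distinct terms (and thus of posting lists) sit in the tail, so the tail dictates how the index as a whole compresses.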

Originally published at https://seekstorm.com on March 15, 2022.



Founder of SeekStorm (Search-as-a-Service) and FAROO (P2P Search) · https://seekstorm.com · https://github.com/wolfgarbe · https://www.quora.com/profile/Wolf-Garbe
