Fallacies of index compression in search

  1. Word frequencies in a text or corpus follow Zipf’s law. The distribution is long-tailed: low-frequency terms make up most of the distinct terms in the corpus. The short posting lists of those terms, holding a single docid or only a few, are barely compressible with traditional methods. For such lists the size of the posting list metadata (term, docid count, pointers, etc.) becomes dominant, whereas it is negligible for long posting lists. Because short posting lists dominate the index, efficient data structures for metadata are paramount and heavily influence the overall compression rate.
  2. The position data required for phrase intersection is usually at least as large as the docid list. That’s intuitive: for every docid we need to store at least one position, often more. Position lists are relatively short, because the number of positions is limited by the number of words in a document, while docid lists are limited by the number of documents in the corpus. Being short, position lists compress poorly. A posting list consists of both docids and the positions per docid; only the docids are reasonably compressible, while the position data is at least as large but much less compressible. Therefore the impressive compression rates for docid lists published in academic papers don’t apply to the posting list as a whole.
  1. The index size and its compressibility are dominated by the long-tail distribution of all terms in the index.
  2. Query latency depends only on the distribution of the terms within the query: the most frequent term, which has the longest posting list, dominates the query latency.
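
The metadata effect described for short posting lists can be made concrete with a small back-of-the-envelope sketch. It assumes delta plus variable-byte (varint) encoding for docids, 4-byte raw docids, and a hypothetical fixed 12-byte metadata record per term; none of these figures come from a real index, they are illustrative assumptions only.

```python
from typing import Dict, Sequence

# Assumptions for illustration only (not measured from any real index):
RAW_DOCID_BYTES = 4    # uncompressed docid size
METADATA_BYTES = 12    # hypothetical per-term record: term pointer + docid count


def varint_len(n: int) -> int:
    """Bytes needed to store n as a variable-byte integer (7 payload bits per byte)."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def compressed_docid_bytes(docids: Sequence[int]) -> int:
    """Size of a sorted docid list after delta encoding followed by varint encoding."""
    total = 0
    prev = 0
    for docid in docids:
        total += varint_len(docid - prev)  # store the gap, not the absolute docid
        prev = docid
    return total


def posting_list_stats(docids: Sequence[int]) -> Dict[str, float]:
    """Compare compression ratios with and without the per-term metadata."""
    raw = len(docids) * RAW_DOCID_BYTES
    packed = compressed_docid_bytes(docids)
    return {
        "raw": raw,
        "packed": packed,
        "ratio_docids_only": raw / packed,
        "ratio_with_metadata": (raw + METADATA_BYTES) / (packed + METADATA_BYTES),
        "metadata_share": METADATA_BYTES / (packed + METADATA_BYTES),
    }


# A rare term that occurs in a single document.
rare = posting_list_stats([731_204])
# A frequent term that occurs in every 10th document of a 1,000,000-document corpus.
common = posting_list_stats(range(0, 1_000_000, 10))

for name, stats in (("rare", rare), ("common", common)):
    print(
        f"{name:6s} ratio incl. metadata: {stats['ratio_with_metadata']:.2f}, "
        f"metadata share: {stats['metadata_share']:.4f}"
    )
```

Under these assumptions the frequent term compresses about 4:1 and its metadata is a rounding error, while the single-docid term ends up with roughly 80% of its stored bytes being metadata and almost no overall gain from compression. Since a Zipf-distributed vocabulary consists mostly of such rare terms, the layout of the metadata matters at least as much as the docid codec.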



Founder SeekStorm (Search-as-a-Service), FAROO (P2P Search) https://seekstorm.com https://github.com/wolfgarbe https://www.quora.com/profile/Wolf-Garbe
