Tail latencies and percentiles — what are they and why do they matter?
How fast is fast enough?
In the early days of the web, 1 second was considered the limit for a responsive website. But more recent research shows that any latency above 13 ms has an increasingly negative impact on human performance, and that even 100 milliseconds of latency can cause a significant decrease in sales, traffic, and user satisfaction.
Average vs. outliers
Performance or latency is often stated as an average. While this gives a first impression, it doesn’t show the full picture. An average latency of 300 milliseconds may be sufficient, but outliers of 3 or 30 seconds are not. Users don’t notice average performance, but they always notice exceedingly large tail latencies, even if those occur infrequently. Users often perform hundreds of actions during a single session, especially when instant search or auto-completion is involved. Even if only 1% of queries are delayed, on average every single user on every single visit to your site is affected by delays. The grave impact delays may have on customer satisfaction, sales, click-through rate, cancel rate, and revisit probability is outlined below:
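The claim that a 1% delay rate touches nearly every user follows directly from the math of repeated trials. A minimal sketch (the 1% delay rate and the session lengths are illustrative assumptions, and requests are assumed independent):

```python
# Probability that a session of n independent requests hits at least one
# delayed request, when a fraction p of all requests is delayed.
def p_session_affected(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# Illustrative numbers: 1% of queries delayed.
print(p_session_affected(0.01, 100))  # ~0.63: most 100-action sessions hit a delay
print(p_session_affected(0.01, 500))  # ~0.99: virtually every 500-action session
```

So even a seemingly small tail fraction means the typical session, not the rare one, experiences at least one delay.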
Importance of tail latencies
- “At Amazon, every 100 milliseconds of latency causes a 1% decrease in sales. And at Bing, a two-second slowdown was found to reduce revenue per user by 4.3%. At Shopzilla, reducing latency from seven seconds to two seconds increased page views by 25%, and revenue by 7%.” source
- For every 100 ms of latency, Google estimated a drop in search traffic by ~0.20% source
- “Keeping tail latencies under control doesn’t just make your users happy, but also significantly improves your service’s resilience while reducing operational costs.” source.
- “A 99th percentile latency of 30ms means that every 1 in 100 requests experiences 30ms of delay. For a high-traffic website like LinkedIn, this could mean that for a page with 1 million page views per day, 10,000 of those page views experience (noticeable) delay.” source
- The fastest rate at which humans can process incoming visual information is about 13 ms. Increasing latency above 13 ms has an increasingly negative impact on human performance for a given task source.
- Data from Akamai shows that a 100-millisecond delay in website load time can hurt conversion rates by 7 percent, a two-second delay in web page load time increases bounce rates by 103 percent, and within ~3 seconds, more than half (53%) will lose patience and leave the page. source
- At Walmart.com and Staples.com, every 1 second of load time improvement equals a 2% and a 10% increase in conversion rates, respectively. source (https://medium.com/@vikigreen/impact-of-slow-page-load-time-on-website-performance-40d5c9ce568a)
Now that we know why tail latencies are important, let’s look at how they are caused and how we can influence them.
Root causes of tail latencies
- Long tail queries are difficult to cache. The memory consumption would be either prohibitive or the hit rate negligible.
- Queries with frequent terms (e.g. “The Who”) require long posting lists to be loaded from disk and intersected.
- High-load peaks with many parallel queries may cause a bottleneck in processor load, IO, and memory bandwidth.
- Parallel indexing for real-time search may cause a bottleneck in processor load, IO, and memory bandwidth.
- Garbage collection may cause occasional delays.
- Commits and compaction may cause occasional delays.
Percentiles, the way tail latencies are measured
- Arithmetic mean latency: The sum of all latency measurements divided by the number of latency measurements.
- 50th percentile latency = median latency: The maximum latency for the fastest 50% of all requests. For example, if the 50th percentile latency is 0.5 seconds, then 50% of requests are processed in less than 0.5 seconds. The median is the value separating the higher half from the lower half of a data sample.
- 75th percentile latency: The maximum latency for the fastest 75% of requests.
- 95th percentile latency: The maximum latency for the fastest 95% of requests.
- 99th percentile latency: The maximum latency for the fastest 99% of requests.
- 99.9th percentile latency: The maximum latency for the fastest 99.9% of requests.
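The definitions above can be computed with a simple nearest-rank method. This sketch (the function name and sample numbers are mine, not from any benchmark tool) also shows how far the mean can drift from what users actually experience:

```python
import math

def percentile(latencies, p):
    """Nearest-rank percentile: the smallest latency such that at least
    p% of all requests are at or below it."""
    s = sorted(latencies)
    rank = math.ceil(p / 100 * len(s))  # 1-based rank
    return s[rank - 1]

# 1000 requests: 980 fast ones (100 ms) and 20 slow outliers (3000 ms).
latencies = [100] * 980 + [3000] * 20
mean = sum(latencies) / len(latencies)
print(mean)                       # 158.0 — looks harmless
print(percentile(latencies, 50))  # 100 — the typical request
print(percentile(latencies, 99))  # 3000 — the tail that users feel
```

The mean of 158 ms hides the fact that 2% of requests take 3 seconds; the 99th percentile exposes it immediately.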
A single user usually sends requests sequentially, starting a new request only once the previous one has finished, while multiple users send requests in parallel. Let N be the number of parallel users (or client threads). With N=1 the efficiency of processing a single query is tested; with a medium N the ability to fully utilize all processors and cores, and the efficiency of locking or lock-free mechanisms, is tested; with a high N the behavior under high load and stress is tested.
Multi-threading, i.e. the use of multiple processor cores for query processing, can both increase query throughput (QPS) and reduce latencies under high load. If the search engine supports it, the latency of a single query can also be improved by breaking its processing into parallel tasks (e.g. the intersection of long posting lists).
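As a sketch of that idea (not SeekStorm’s or Lucene’s actual implementation): the longer posting list can be split into doc-id ranges, and each range intersected with the matching slice of the other list on its own thread. Note that in pure Python the GIL prevents a real speedup for CPU-bound work; real engines do this with native threads, so this only illustrates the decomposition.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

def intersect(a, b):
    """Intersect two sorted posting lists (doc-id lists) with a merge walk."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def parallel_intersect(a, b, chunks=4):
    """Split list a into doc-id ranges and intersect each range with the
    matching slice of b on its own worker thread (illustrative only)."""
    if not a or not b:
        return []
    bounds = [len(a) * k // chunks for k in range(chunks + 1)]
    def work(k):
        lo, hi = bounds[k], bounds[k + 1]
        if hi <= lo:
            return []
        # restrict b to the doc-id range covered by this chunk of a
        blo = bisect.bisect_left(b, a[lo])
        bhi = bisect.bisect_right(b, a[hi - 1])
        return intersect(a[lo:hi], b[blo:bhi])
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        parts = pool.map(work, range(chunks))
    return [doc for part in parts for doc in part]
```

Because the chunks of `a` cover disjoint, ascending doc-id ranges, the partial results can simply be concatenated in order.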
Most real-life scenarios are multi-user and multi-threaded: systems that are perfectly fine with a single user become laggy or completely frozen and unresponsive as soon as many users use them in parallel.
That’s why using concurrency (all cores) to make all queries faster and to saturate CPU/IO even at low query loads is so important. It’s not just a matter of throwing more cores at the task; the search architecture has to be designed to fully utilize them:
- buffer reuse
- full processor utilization by saturating all cores
- efficient locking or lock-free architecture decides whether 100% usage of cores can be achieved
- scalability of memory consumption: if not effective, the system crashes under load
- memory bandwidth: if not effective, it becomes a bottleneck under load
- stability, throughput, tail latencies under load
- maximum load
- multiple indexes and multi-tenant
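A minimal load-test harness along these lines can make the effect of N visible (the query function here is a hypothetical stand-in for a real search call; for serious measurements you would use a dedicated benchmark client):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def timed_query():
    """Hypothetical stand-in for one search request; returns latency in ms."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.005))  # simulate query work
    return (time.perf_counter() - start) * 1000

def run_load(n_clients: int, n_requests: int):
    """Fire n_requests from n_clients parallel workers; return sorted latencies."""
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        futures = [pool.submit(timed_query) for _ in range(n_requests)]
        return sorted(f.result() for f in futures)

# Compare median and tail latency at N=1 vs. N=16 parallel clients.
for n in (1, 16):
    lat = run_load(n, 200)
    p50, p99 = lat[len(lat) // 2 - 1], lat[int(0.99 * len(lat)) - 1]
    print(f"N={n:2d}  p50={p50:.1f} ms  p99={p99:.1f} ms")
```

Against a real engine, rerunning this while sweeping N reveals exactly where the tail starts to grow: the point at which locking, memory bandwidth, or IO becomes the bottleneck.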
We have benchmarked the tail latencies of SeekStorm against Lucene, the open-source search engine library powering Solr, Elasticsearch and OpenSearch. The benchmark is based on the English Wikipedia corpus (5,032,105 documents, 8.28 GB decompressed) and queries derived from the AOL query dataset.
Taking into account the research cited above, reducing the tail latency at the 90th percentile from 85 to 5 milliseconds removes any visible delay, increasing the conversion rate by 7% and sales by 1% for the 10% of users previously affected by tail latencies.