Understanding Probabilistic Counting and HyperLogLog
When you first hear the term HyperLogLog, if you don't know it and you are like me, you might assume it's an open-source project that has something to do with log collection. You know how it is with naming those cool projects: everyone is always running out of names.
But of course, that's not the case. Look up HyperLogLog on Wikipedia and you'll find this definition:
HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the exact cardinality of the distinct elements of a multiset requires an amount of memory proportional to the cardinality.
Count vs Cardinality
They are almost the same thing. Counting is the process of determining how many things are in a group, while cardinality is the final number that counting arrives at: the total number of distinct objects.
Before digging into HyperLogLog, it helps to understand the idea of probabilistic counting: sacrificing some of the accuracy of exact counting in exchange for needing far less storage to get the cardinality.
Probabilistic techniques shine when dealing with massive datasets or situations where extreme precision isn't necessary.
Probabilistic Counting
Imagine you are a fan of some band, standing near the entry door of their concert, and you want to count (or at least estimate) the number of distinct people coming to this large concert. You have very limited resources: no phone, no paper, no pen. All you can see is the screen showing each ticket number as it gets scanned by the person checking tickets. The scanning system itself can, of course, keep an exact count of every ticket entered. But you can only watch the numbers go by. Even if you had pen and paper, tallying would work for tens or hundreds of people; for anything larger you would need a notebook, or more!
Given only the ticket numbers, it feels impossible to do anything useful. For simplicity, let's say each ticket number is 10 digits and the numbers are spread out roughly at random. We can watch for trailing zeros: the longest run of trailing zeros we ever see lets us estimate how many distinct tickets went by. Only about one ticket in ten ends in a 0, so if we see xx-xx-xx-xx-x0 we have probably seen on the order of tens of tickets. Only one in a hundred ends in 00, so spotting xx-xx-xx-xx-00 suggests we have seen on the order of hundreds.
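This back-of-the-envelope logic can be sketched in a few lines of Python. The ticket numbers here are simulated at random, just to show how the longest run of trailing zeros tracks the rough order of magnitude of the count:

```python
import random

def trailing_zeros(n: int) -> int:
    """Count how many decimal zeros a number ends with (e.g. 1300 -> 2)."""
    count = 0
    while n > 0 and n % 10 == 0:
        n //= 10
        count += 1
    return count

# Simulate watching 10,000 random 10-digit ticket numbers scroll by.
random.seed(42)
tickets = [random.randrange(10**9, 10**10) for _ in range(10_000)]

# The longest run of trailing zeros we ever saw...
max_zeros = max(trailing_zeros(t) for t in tickets)

# ...suggests we saw on the order of 10^max_zeros distinct tickets.
estimate = 10 ** max_zeros
print(max_zeros, estimate)
```

With ten thousand tickets, the longest run will typically be around four zeros, so the estimate lands in the right order of magnitude, though a single lucky ticket can easily throw it off by a factor of ten, which is exactly the weakness HyperLogLog addresses.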
What if the tickets have a more complex format?
While looking for leading zeros works with our simple ticket numbers, real-world data is messier. Ticket numbers might be way longer, or have letters mixed in. That's where hash functions save the day. These are like mathematical blenders that take complex data and spit out a shorter, seemingly random number. Importantly, the same ticket will always produce the same hashed value.
Let's pretend our hashing recipe just takes the last few digits of the ticket number. Now, instead of leading zeros, we might track how many times we see a hash ending in '00', '000', and so on. The same logic applies – the more attendees, the rarer it is to see a long string of trailing zeros in the hash values.
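Here is a hypothetical sketch of that idea in Python, using the standard library's hashlib instead of the toy "last few digits" recipe, and counting trailing zero bits rather than decimal zeros (a good hash is uniform in binary, so each extra zero bit halves the odds):

```python
import hashlib

def hash_ticket(ticket: str) -> int:
    """Hash an arbitrary ticket string down to a 64-bit integer."""
    digest = hashlib.sha256(ticket.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def trailing_zero_bits(h: int) -> int:
    """Count the trailing zero bits of a hash value."""
    count = 0
    while h > 0 and h & 1 == 0:
        h >>= 1
        count += 1
    return count

# Messy, non-numeric ticket formats hash to uniform-looking integers.
tickets = [f"GA-{i:05d}-XYZ" for i in range(50_000)]
max_zeros = max(trailing_zero_bits(hash_ticket(t)) for t in tickets)

# Each extra zero bit halves the odds, so the estimate is 2^max_zeros.
print(max_zeros, 2 ** max_zeros)
```

The same reasoning as before applies: the more distinct tickets hashed, the rarer it is to see a long run of trailing zeros in the hash values.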
HyperLogLog: Beyond Zero-Counting
HyperLogLog takes everything a step further. Just watching trailing zeros, even after hashing, is fragile: a single unlucky value, or a collision in the hashing function, can mess up our estimate quite a bit. HyperLogLog tries to make the whole thing smarter.
HyperLogLog creates a bunch of buckets. When a ticket number gets hashed, the algorithm doesn't just look at the trailing zeros; it first figures out which bucket the hash value lands in.
Each bucket is a little piece of memory that keeps track of the highest count of trailing zeros it has seen so far. So if bucket #3 has seen a hash ending in '0000', that's its current max.
HyperLogLog doesn't rely on any single bucket, though. It takes an average (a harmonic mean, specifically) across all the buckets' maximum counts of trailing zeros. Then, with a bit of fancy math, it turns that average into a much more accurate estimate of the distinct ticket count.
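The bucket-and-average idea can be sketched in a short Python class. This is a simplified illustration under the assumptions above, not a production implementation: real HyperLogLog also applies bias corrections for very small and very large counts, which are omitted here.

```python
import hashlib

class TinyHyperLogLog:
    """A stripped-down HyperLogLog sketch (no small/large-range corrections)."""

    def __init__(self, bucket_bits: int = 10):
        self.p = bucket_bits
        self.m = 1 << bucket_bits                   # number of buckets
        self.buckets = [0] * self.m
        # Standard correction constant for m >= 128 buckets.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h & (self.m - 1)        # low p bits choose the bucket
        rest = h >> self.p            # remaining bits feed the zero count
        rank = 1                      # position of the first 1-bit
        while rest & 1 == 0 and rank <= 64 - self.p:
            rest >>= 1
            rank += 1
        self.buckets[idx] = max(self.buckets[idx], rank)

    def count(self) -> float:
        # Harmonic mean across buckets, scaled: the "bit of fancy math".
        inv_sum = sum(2.0 ** -b for b in self.buckets)
        return self.alpha * self.m * self.m / inv_sum

hll = TinyHyperLogLog()
for i in range(100_000):
    hll.add(f"ticket-{i}")
print(round(hll.count()))
```

With 1024 buckets the estimate for 100,000 distinct items typically lands within a few percent of the true count, while the sketch itself stores only 1024 small integers.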
The Tradeoff
As mentioned, this approach gives up some accuracy in exchange for storage optimization. If you don't need to store the actual values you are counting, HyperLogLog can provide huge savings in storage while still giving very high accuracy.
If we have a huge dataset with billions of elements, HyperLogLog can estimate the number of distinct elements with an accuracy of around 98% while using an incredibly small amount of memory (only 1.5 kB):
The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory
Source: HyperLogLog the analysis of a near-optimal cardinality estimation algorithm https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
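The 2% figure follows from HyperLogLog's standard error of roughly 1.04/√m, where m is the number of buckets. As a quick sanity check (assuming 2048 registers of about 6 bits each, which is where the ~1.5 kB comes from):

```python
import math

m = 2048                      # registers; at ~6 bits each this is ~1.5 kB
std_error = 1.04 / math.sqrt(m)
print(f"{std_error:.1%}")     # roughly a 2% standard error
```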
Some things that use HyperLogLog, which I came across while researching the topic:
Google BigQuery: their APPROX_COUNT_DISTINCT function is powered by HyperLogLog, letting analysts work with enormous datasets (link).
Redis: uses HyperLogLog to efficiently estimate the cardinality of sets (link).
Unique IP counters: network monitoring, website visitor counting, etc.
HyperLogLog is great when you have a huge dataset, but keep in mind that if you can easily store the exact data, traditional counting methods are simpler. And if you really need exact accuracy, then clearly you can't use it either.
If you're curious about how HyperLogLog is used in real-world, large-scale systems and want a deeper dive, I highly recommend checking out this article by Facebook Engineering: https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/. It discusses how they've applied HyperLogLog at massive scale within their infrastructure and explains the underlying math of the algorithm.
