Close Menu

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Rivian was saved by software in 2025

    A surprise God of War prequel is out on the PS5 right now

    Ring cancels its partnership with Flock Safety after surveillance backlash

    Facebook X (Twitter) Instagram
    • Artificial Intelligence
    • Business Technology
    • Cryptocurrency
    • Gadgets
    • Gaming
    • Health
    • Software and Apps
    • Technology
    Facebook X (Twitter) Instagram Pinterest Vimeo
    Tech AI Verse
    • Home
    • Artificial Intelligence

      Read the extended transcript: President Donald Trump interviewed by ‘NBC Nightly News’ anchor Tom Llamas

      February 6, 2026

      Stocks and bitcoin sink as investors dump software company shares

      February 4, 2026

      AI, crypto and Trump super PACs stash millions to spend on the midterms

      February 2, 2026

      To avoid accusations of AI cheating, college students are turning to AI

      January 29, 2026

      ChatGPT can embrace authoritarian ideas after just one prompt, researchers say

      January 24, 2026
    • Business

      The HDD brand that brought you the 1.8-inch, 2.5-inch, and 3.5-inch hard drives is now back with a $19 pocket-sized personal cloud for your smartphones

      February 12, 2026

      New VoidLink malware framework targets Linux cloud servers

      January 14, 2026

      Nvidia Rubin’s rack-scale encryption signals a turning point for enterprise AI security

      January 13, 2026

      How KPMG is redefining the future of SAP consulting on a global scale

      January 10, 2026

      Top 10 cloud computing stories of 2025

      December 22, 2025
    • Crypto

      Berachain Jumps 150% as Strategic Pivot Lifts BERA

      February 12, 2026

      Tom Lee’s BitMine (BMNR) Stock Faces Cost-Basis Risk — Price Breakdown at 10%?

      February 12, 2026

      Why the US Jobs Data Makes a Worrying Case for Bitcoin

      February 12, 2026

      MYX Falls Below $5 as Short Sellers Take Control — 42% Decline Risk Emerges

      February 12, 2026

      Solana Pins Its $75 Support on Short-Term Buyers — Can Price Survive This Risky Setup?

      February 12, 2026
    • Technology

      A surprise God of War prequel is out on the PS5 right now

      February 13, 2026

      Ring cancels its partnership with Flock Safety after surveillance backlash

      February 13, 2026

      PlayStation State of Play February 2026: all the news and trailers

      February 13, 2026

      In one swoop, Trump kills US greenhouse gas regulations

      February 13, 2026

      Two powerful, OLED-equipped gaming laptops are hundreds off

      February 13, 2026
    • Others
      • Gadgets
      • Gaming
      • Health
      • Software and Apps
    Check BMI
    Tech AI Verse
    You are at:Home»Technology»Bloom Filters
    Technology

    Bloom Filters

    TechAiVerseBy TechAiVerseMay 2, 2025No Comments10 Mins Read0 Views
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr Email Reddit
    Bloom Filters
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp Email

    Bloom Filters

    The original motivation for the creation of Bloom filters is efficient set
    membership, using a probabilistic approach to significantly reduce the time and
    space required to reject items that are not members in a certain set.

    The data structure was proposed by Burton Bloom in a 1970 paper titled “Space/Time
    Trade-offs in Hash Coding with Allowable Errors”. It’s a good paper that’s
    worth reading.

    Suppose that we store some information on disk and want to check if a certain
    file contains a certain entry. Reading from disk is time consuming, so we want
    to minimize it as much as possible. A Bloom filter is a data structure that
    implements a cache with probabilistic properties:

    1. If the cache says the key is not present in a specific file, then it’s
      100% certain we should not be reading the file.
    2. If the cache says the key is present in the file, there’s a small chance
      this is a false positive (and in fact the key isn’t there). In this case
      we just read the file as usual.

    In a scenario where the majority of queries “is this key in that file?” have a
    negative answer, a Bloom filter can significantly speed up the system [1].
    Moreover, the probabilistic nature (the existence of false positives) allows
    Bloom filters to be extremely fast and occupy very little space. Here’s a quote
    from the Bloom paper:

    The new hash-coding methods to be introduced are
    suggested for applications in which the great majority of
    messages to be tested will not belong to the given set. For
    these applications, it is appropriate to consider as a unit of
    time (called reject time) the average time required to
    classify a test message as a nonmember of the given set.

    How a Bloom filter works

    A Bloom filter is a special kind of a hash table with open addressing.
    It’s an array of bits (the length is typically denoted m), and some fixed
    number (k) of hash functions. We’ll assume each hash function can take an
    arbitrary sequence of bytes and hash it into an integer in the inclusive range
    [0, m-1]. A Bloom filter supports two operations:

    Insert an item: the item is hashed using each of the k hash functions, and the
    appropriate bits in the underlying array are set to 1.

    Test if an item is a member: the item is hashed using each of the k hash
    functions. If any of the bits indicated by their results is 0, we return “false”
    with certainty. If all the bits are 1, we return “true” – and there’s a small
    chance of false positives.

    Here’s how the Bloom paper describes it:

    The hash area is considered as N individual addressable bits, with addresses 0
    through N – 1. It is assumed that all bits in the hash area are first set to 0.

    Next, each message in the set to be stored is hash coded into a number of
    distinct bit addresses, say a1, a2, . . . , ad. Finally, all d bits addressed by
    a1 through ad are set to 1.

    To test a new message a sequence of d bit addresses,
    say a’1, a’2, … a’d, is generated in the same manner as for storing a message.
    If all d bits are 1, the new message is accepted. If any of these bits is zero,
    the message is rejected.

    Hopefully it’s clear why this data structure is probabilistic in nature: it’s
    possible that different items hash to the same number, and therefore when
    we test some X, all its hashes point to bits turned on by the hashing of other
    data. Read the Math appendix for the math behind Bloom filters and how to
    calculate (and design for a specific) the false positive rate.

    Here’s an example:

    1. We start with an empty bloom filter with m=16 and k=3. All bits
      are initialized to 0.
    2. Insertion of “x”. The three hashes return indices 1, 6, 15, so these bits
      in the array are set to 1.
    3. Insertion of “y”. Hashing returns indices 6, 9 and 13, so these
      bits in the array are set to 1. Note that bit 6 is set for both “x” and “y”,
      and that’s fine.

    Next, let’s look at some membership tests:

    1. Test “x”. Hashing returns 1, 6, 15; all these bits are 1 in the
      array, so the answer is “true”. This is a true positive.
    2. Test “w”. Hashing returns 3, 9, 13. Since the bit at position 3 is 0, the
      answer is “false”.
    3. Test “v”. Hashing returns 9, 13, 15; all these bits are 1 in the array,
      so the answer is “true”. This is a false positive.

    Note that it’s trivial to prove (by the law of contraposition) that all “false”
    answers from a Bloom filter’s test operation are true negatives.

    Implementation

    Here’s a simple implementation of a Bloom filter in Go:

    // New creates a new BloomFilter with capacity m, using k hash functions.
    // You can calculate m and k from the number of elements you expect the
    // filter to hold and the desired error rate using CalculateParams.
    func New(m uint64, k uint64) *BloomFilter {
      return &BloomFilter{
        m:      m,
        k:      k,
        bitset: newBitset(m),
        seed1:  maphash.MakeSeed(),
        seed2:  maphash.MakeSeed(),
      }
    }
    
    type BloomFilter struct {
      m      uint64
      k      uint64
      bitset []uint64
    
      // seeds for the double hashing scheme.
      seed1, seed2 maphash.Seed
    }
    
    // Insert a data item into the bloom filter.
    func (bf *BloomFilter) Insert(data []byte) {
      h1 := maphash.Bytes(bf.seed1, data)
      h2 := maphash.Bytes(bf.seed2, data)
      for i := range bf.k {
        loc := (h1 + i*h2) % bf.m
        bitsetSet(bf.bitset, loc)
      }
    }
    
    // Test if the given data item is in the bloom filter. If Test returns false,
    // it's guaranteed that data was never added to the filter. If it returns true,
    // there's an eps probability of this being a false positive. eps depends on
    // the parameters the filter was created with (see CalculateParams).
    func (bf *BloomFilter) Test(data []byte) bool {
      h1 := maphash.Bytes(bf.seed1, data)
      h2 := maphash.Bytes(bf.seed2, data)
      for i := range bf.k {
        loc := (h1 + i*h2) % bf.m
        if !bitsetTest(bf.bitset, loc) {
          return false
        }
      }
      return true
    }
    

    The bitsetSet and bitsetTest functions can be seen in the
    full code repository.

    This implementation uses double hashing to
    generate k different hash functions from just two hashes.

    The code also mentions the CalculateParams function:

    // CalculateParams calculates optimal parameters for a Bloom filter that's
    // intended to contain n elements with error (false positive) rate eps.
    func CalculateParams(n uint64, eps float64) (m uint64, k uint64) {
      // The formulae we derived are:
      // (m/n) = -ln(eps)/(ln(2)*ln(2))
      // k = (m/n)ln(2)
    
      ln2 := math.Log(2)
      mdivn := -math.Log(eps) / (ln2 * ln2)
      m = uint64(math.Ceil(float64(n) * mdivn))
      k = uint64(math.Ceil(mdivn * ln2))
      return
    }
    

    You’ll have to read the Math appendix to understand how it works.

    Practicalities

    Let’s look at a practical example of a realistic Bloom filter and how it
    performs. Suppose we want to store about 1 billion items, and have a false
    positive rate of 1% (meaning that if the filter returns “true” for a test,
    there’s a 99% chance that the item was previously added to the filter).
    Using these requirements, we can invoke CalculateParams to get the Bloom
    filter parameters:

    CalculateParams(1000000000, 0.01) ===> (9585058378 7)
    

    This means m is about 9.6 billion (bits) and k is 7. In other words, our
    Bloom filter requires about 1.2 GB of space to cache the membership test of
    a billion items (that could be of arbitrary size). Moreover, the lookup is
    very fast – it’s just 7 applications of the hash function. The constant lookup
    cost is particularly attractive, as it doesn’t depend on the number of items
    actually inserted, or on any particular pattern in the data (i.e. there are
    no worst case scenarios with asymptotically higher cost).

    On my machine, benchmarking with the Go implementation shown above I get ~30
    nanoseconds per lookup. Mind you, this is the simplest Go implementation I
    could think of – nothing here is optimized; I’m sure this can be improved at
    least 2x by using a more speed-optimized hash implementation, for example.

    Now imagine how long it would take to ascertain if data is present in a file
    with 1 billion entries, even if the file contains proper indexing for fast
    lookups. Just asking the OS to read the file’s first few KiBs to get at the
    index would take orders of magnitude longer than 30 ns.

    Recall that Bloom filters are best suited for cases “in which the great majority
    of messages to be tested will not belong to the given set”. Moreover, even if
    the data exists in the file, false positives only happen 1% of the time.
    Therefore, the number of times we’ll have to go to the disk just to find the
    data is not there is a very small fraction of total accesses.

    Code

    The full code for this post, with tests, is available
    on GitHub.

    Appendix: the Math behind Bloom filters

    A reminder on notation:

    • m: size (in bits) of the set
    • n: how many keys were inserted into the filter
    • k: number of hash functions used to insert/test each key

    For a specific bit in the set, assuming our hash functions distribute
    the keys randomly, the probability of it not being set by a specific
    hash function is:

    And the probability it’s not set by either of our k hash functions is:

    The last formula is constructed to use an approximation of
    for a large enough m (see the appendix in this post) to write:

    After inserting n elements, the probability that it’s 0 is:

    Meaning that the probability of it being 1 is:

    Recap: this is the probability of any given bit being 1 after n bits
    were inserted into a set of size m with k different hash functions.

    Assuming independence between our hash functions (this is not super
    rigorous, but a reasonable assumption in practice), let’s calculate the
    false positive rate. Suppose we have a new key that’s not in the set,
    and we’re trying to check its membership by hashing it with our k hash
    functions. The false positive rate is the probability that all hashes
    land on a bit that’s already set to 1:

    This is also called the error rate of our filter, or
    . To get an optimal (minimal) false positive rate,
    let’s minimize . Since the logarithm function is
    monotonically increasing, it will be more convenient to minimize
    :

    We’ll calculate the derivative w.r.t. k and set it to 0:

    Substituting a variable and using some more
    calculus and algebra, we can find that:

    A numerical example: if we have a set with 1 million bits, and we expect
    to insert about 100,000 elements, the optimal number of hash functions
    is:

    However, it’s more useful to aim for a certain error rate, and set the
    filter parameters accordingly. Let’s assume we’ll be using this optimal
    value of k. Substituting into the
    equation for from above:

    If we use the numerical example from before with ,
    the error rate with an optimal k will be
    .

    What often happens is that we have an error rate in mind and we want to
    calculate how many bits per element we want to dedicate in our set.
    Let’s take the previous equation and try to isolate
    from it using a logarithm:

    Then:

    Final numerical example: suppose we want an error (false positive) rate
    of 1%. This means our set should have:

    … bits per element. So if we expect about 100,000 elements, the bit
    set used for our filter should have at least 958,000 bits. And, as
    calculated earlier, we should be using hash functions to
    achieve this optimal error rate.


    [1] For this reason, Bloom filters are very common in data storage systems.
    Here’s a discussion about Cassandra,
    but there are many others.
    Share. Facebook Twitter Pinterest LinkedIn Reddit WhatsApp Telegram Email
    Previous ArticleReflecting on a Year of Gamedev in Zig
    Next Article Zoho halts $700M semiconductor plan
    TechAiVerse
    • Website

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he deliver clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.

    Related Posts

    A surprise God of War prequel is out on the PS5 right now

    February 13, 2026

    Ring cancels its partnership with Flock Safety after surveillance backlash

    February 13, 2026

    PlayStation State of Play February 2026: all the news and trailers

    February 13, 2026
    Leave A Reply Cancel Reply

    Top Posts

    Ping, You’ve Got Whale: AI detection system alerts ships of whales in their path

    April 22, 2025668 Views

    Lumo vs. Duck AI: Which AI is Better for Your Privacy?

    July 31, 2025256 Views

    6.7 Cummins Lifter Failure: What Years Are Affected (And Possible Fixes)

    April 14, 2025153 Views

    6 Best MagSafe Phone Grips (2025), Tested and Reviewed

    April 6, 2025111 Views
    Don't Miss
    Software and Apps February 13, 2026

    Rivian was saved by software in 2025

    Rivian was saved by software in 2025 Rivian is, by every measure, a maker and…

    A surprise God of War prequel is out on the PS5 right now

    Ring cancels its partnership with Flock Safety after surveillance backlash

    PlayStation State of Play February 2026: all the news and trailers

    Stay In Touch
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    • YouTube
    • Vimeo

    Subscribe to Updates

    Get the latest creative news from SmartMag about art & design.

    About Us
    About Us

    Welcome to Tech AI Verse, your go-to destination for everything technology! We bring you the latest news, trends, and insights from the ever-evolving world of tech. Our coverage spans across global technology industry updates, artificial intelligence advancements, machine learning ethics, and automation innovations. Stay connected with us as we explore the limitless possibilities of technology!

    Facebook X (Twitter) Pinterest YouTube WhatsApp
    Our Picks

    Rivian was saved by software in 2025

    February 13, 20260 Views

    A surprise God of War prequel is out on the PS5 right now

    February 13, 20260 Views

    Ring cancels its partnership with Flock Safety after surveillance backlash

    February 13, 20260 Views
    Most Popular

    7 Best Kids Bikes (2025): Mountain, Balance, Pedal, Coaster

    March 13, 20250 Views

    VTOMAN FlashSpeed 1500: Plenty Of Power For All Your Gear

    March 13, 20250 Views

    This new Roomba finally solves the big problem I have with robot vacuums

    March 13, 20250 Views
    © 2026 TechAiVerse. Designed by Divya Tech.
    • Home
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms & Conditions

    Type above and press Enter to search. Press Esc to cancel.