Yet another compressed bitset with union, intersection and difference implementations

Walter Bauer, December 6, 2012

In the last post I introduced a data structure and algorithm to implement a concurrent, memory-efficient long set and long-to-long map. The long set can be regarded as a kind of compressed bitset. This post focuses on adding the classic bitset operations: union (or), intersection (and) and difference (and-not).

Two approaches are implemented:

  • OpNode (see class IDTrieOps): An expression-tree-like approach with iterators that work like a merge sort (think zip fastener) and return the next value of an operator's result. So with (2, 5, 10) and (4, 5, 7, 10), the implementation of the and operator walks through both lists until it finds the matching pairs (5, 10). As a performance optimization, it skips values in one list if the current value in the other list is ahead (for and and and-not). The result is calculated on the fly without creating any temporary results.
  • SetOps (see class IDTrieSetOps): A more classic bitset approach where the leaf values (64 bits stored in one long) are combined using bitwise and, or and and-not. Inner nodes are handled in a quite similar way. Each operation materializes its result as a new IDSetTrie.
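The zip-fastener idea behind OpNode can be sketched as follows. This is a minimal stand-alone illustration over two sorted iterators, not the actual IDTrieOps code; class and method names are made up for the example:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class MergeAnd {
    // Intersect two ascending sorted sequences merge-style:
    // advance whichever side is behind, emit values present in both.
    static List<Long> and(Iterator<Long> a, Iterator<Long> b) {
        List<Long> result = new ArrayList<>();
        if (!a.hasNext() || !b.hasNext()) return result;
        long x = a.next(), y = b.next();
        while (true) {
            if (x == y) {                    // match: part of the intersection
                result.add(x);
                if (!a.hasNext() || !b.hasNext()) break;
                x = a.next();
                y = b.next();
            } else if (x < y) {              // a is behind: skip ahead
                if (!a.hasNext()) break;
                x = a.next();
            } else {                         // b is behind: skip ahead
                if (!b.hasNext()) break;
                y = b.next();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> a = List.of(2L, 5L, 10L);
        List<Long> b = List.of(4L, 5L, 7L, 10L);
        System.out.println(and(a.iterator(), b.iterator())); // [5, 10]
    }
}
```

In the real implementation the "skip ahead" step is where the optimization mentioned above pays off: a trie iterator can jump over whole subtrees instead of advancing one value at a time.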

For details of the algorithms, have a look at the source code, which you can find here.
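To illustrate the SetOps idea in isolation: once 64 set members live in one long, each operator reduces to a single bitwise instruction per word. The flat array layout below is a simplification of the trie's actual leaf structure:

```java
import java.util.function.LongBinaryOperator;

public class WordOps {
    // SetOps-style leaf combination: 64 members per long, so each
    // operator is one bitwise instruction per word.
    static long[] or(long[] a, long[] b)     { return combine(a, b, (x, y) -> x | y); }
    static long[] and(long[] a, long[] b)    { return combine(a, b, (x, y) -> x & y); }
    static long[] andNot(long[] a, long[] b) { return combine(a, b, (x, y) -> x & ~y); }

    static long[] combine(long[] a, long[] b, LongBinaryOperator op) {
        long[] r = new long[a.length];            // materialized result
        for (int i = 0; i < a.length; i++) r[i] = op.applyAsLong(a[i], b[i]);
        return r;
    }

    public static void main(String[] args) {
        long[] a = { 0b1011L };   // members 0, 1, 3
        long[] b = { 0b0110L };   // members 1, 2
        System.out.println(Long.toBinaryString(and(a, b)[0]));    // 10   -> member 1
        System.out.println(Long.toBinaryString(or(a, b)[0]));     // 1111 -> members 0..3
        System.out.println(Long.toBinaryString(andNot(a, b)[0])); // 1001 -> members 0, 3
    }
}
```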

Now for the important part: performance. Of course without actual data the following measurements won't reflect a real use case, but at least an artificial test reveals some common patterns.

The test evaluates the expression "(a and b) or c" where each bitset has up to 10,000,000 entries, using increments of 1 (dense population), 100 (sparse population) and 1000 (even sparser population). All tests were performed on a MacBook Pro with a 2.3 GHz Intel Core i7 (4 cores / 8 threads). BitSet is the standard java.util.BitSet.
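For reference, the benchmarked expression is straightforward to write against java.util.BitSet (shown here on tiny sets rather than the 10,000,000-entry test data). One caveat: BitSet's and/or mutate the receiver, so the left operand has to be cloned first:

```java
import java.util.BitSet;

public class BitSetExpr {
    // Fill a bitset with every step-th value below limit,
    // mirroring the dense/sparse population scheme of the test.
    static BitSet populate(int limit, int step) {
        BitSet s = new BitSet();
        for (int i = 0; i < limit; i += step) s.set(i);
        return s;
    }

    public static void main(String[] args) {
        BitSet a = populate(100, 1);    // dense: 0..99
        BitSet b = populate(100, 5);    // multiples of 5
        BitSet c = populate(100, 7);    // multiples of 7

        // java.util.BitSet mutates in place, so copy before combining.
        BitSet result = (BitSet) a.clone();
        result.and(b);                  // a and b
        result.or(c);                   // (a and b) or c
        System.out.println(result.cardinality()); // 32
    }
}
```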

With dense population, Java BitSet is the fastest, but SetOps follows closely; OpNode falls far behind because iterating through the node values takes its time.

If the set is sparsely populated, OpNode's performance isn't that bad.

If the sets are even more sparsely populated, BitSet even falls behind OpNode, and SetOps shines.

So it's not that easy to choose the best approach. And it gets even worse when taking multiple parallel threads into account: then it's the combination of CPU performance, available CPU cache (L1/L2/L3) and especially memory bandwidth that influences the results:

Ouch! Without memory-efficient storage, BitSet saturates the memory bandwidth and doesn't scale. So you definitely don't want to use BitSet on your expensive eight-core server-class machine.

Overall, the best solution could be to combine OpNode and SetOps. A simple but effective approach is to use OpNode as the default, but replace it with SetOps if the left and right arguments are of type IDSetTrie and the expected result sizes are large.
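Such a dispatch rule might look like the following sketch. Apart from the IDSetTrie name, everything here (the stand-in types, the size threshold, the method names) is assumed for illustration and does not come from the actual source:

```java
public class HybridOps {
    // Stand-ins: Trie plays the role of a materialized IDSetTrie,
    // LazyOp the role of an OpNode expression result.
    interface LongSet { long size(); }
    record Trie(long size) implements LongSet {}
    record LazyOp(long size) implements LongSet {}

    // Hypothetical threshold above which bulk bitwise work wins.
    static final long LARGE = 1 << 16;

    // Default to the lazy OpNode path; switch to SetOps only when both
    // operands are materialized tries and the result is expected to be big.
    static String chooseAndImpl(LongSet left, LongSet right) {
        if (left instanceof Trie && right instanceof Trie
                && Math.min(left.size(), right.size()) >= LARGE) {
            return "SetOps";   // bulk bitwise combine, materializes result
        }
        return "OpNode";       // lazy merge iterator, no temporaries
    }

    public static void main(String[] args) {
        System.out.println(chooseAndImpl(new Trie(1 << 20), new Trie(1 << 20)));   // SetOps
        System.out.println(chooseAndImpl(new Trie(1 << 20), new LazyOp(1 << 20))); // OpNode
        System.out.println(chooseAndImpl(new Trie(100), new Trie(100)));           // OpNode
    }
}
```

Using the operands' own sizes as the estimate is a crude heuristic; for and, the intersection can of course be much smaller than either input.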

But your mileage may vary...

Read part I here: Efficient concurrent long set and map

Walter Bauer
Walter Bauer is one of the censhare AG founders and is a passionate software designer and engineer with a bee in his bonnet about performance optimization and refactoring.
