In the last post I introduced a data structure and algorithm to implement a concurrent memory efficient long set and long-to-long map. The long set can be regarded as some kind of compressed bitset. This post focuses on adding the classic bitset operations for union (or), intersection (and) and difference (and-not).

Two approaches are implemented:

OpNode (see class IDTrieOps): An expression tree like approach with iterators that work like a merge sort (think zip fastener) and return the next value for the result of an operator. So with (2, 5, 10) and (4, 5, 7, 10) the implementation of the and operator will walk through both lists until it finds matching pairs (5, 10). As a performance optimization it will skip values in one list, if the current value in the other list is ahead (and, and-not). The result is calculated on the fly without creating any temporary results.

SetOps (see class IDTrieSetOps): A more classic bitset approach where the leaf values (64 bits are stored in one long) are combined using bitwise and, or and and-not. Inner nodes are handled in a quite similar way. Each operation creates an effective result: a new IDSetTrie.

For details of the algorithms look at the source code that you can find here.

Now for the important part: performance. Of course without actual data the following measurements won't reflect a real use case, but at least an artificial test reveals some common patterns.

The test evaluates the expression "(a and b) or c" where each bitset has up to 10,000,000 entries using increments of 1 (dense population), 100 (sparse population) and 1000 (even more sparse population). All tests were performed on a MacBook Pro with a 2.3 GHz Intel i7 (4 cores / 8 threads). BitSet is the standard java.util.BitSet.

test with size: 10000000 (a and b) or c, increments: 1, 1, 1 OpNode: 301422us, count: 10000000, checksum: 50000005000000 SetOps: 51874us, count: 10000000, checksum: 50000005000000 BitSet: 50682us, count: 10000000, checksum: 50000005000000

Java BitSet is the best, but SetOps follows closely, OpNode falls far behind because iterating through the node values takes its time.

test with size: 10000000 (a and b) or c, increments: 100, 100, 100 OpNode: 5054us, count: 100000, checksum: 499995100000 SetOps: 2738us, count: 100000, checksum: 499995100000 BitSet: 1252us, count: 100000, checksum: 499995100000

If the set is sparsely populated, OpNode performance isn't that bad.

test with size: 10000000 (a and b) or c, increments: 1000, 1000, 1000 OpNode: 589us, count: 10000, checksum: 49995010000 SetOps: 390us, count: 10000, checksum: 49995010000 BitSet: 986us, count: 10000, checksum: 49995010000

If even more sparsely populated, BitSet even falls behind OpNode and SetOps shines.

So it's not that easy to choose the best approach. And it even gets worse taking into account multiple parallel threads. Then it's the combination of CPU performance, available CPU-Cache (L1/L2/L3) and especially memory bandwidth that influence the results:

test with size: 10000000 (a and b) or c, increments: 100, 100, 100 test with 4 threads OpNode: 5912us, count: 100000, checksum: 499995100000 SetOps: 8224us, count: 100000, checksum: 499995100000 BitSet: 3399us, count: 100000, checksum: 499995100000

test with size: 10000000 (a and b) or c, increments: 100, 100, 100 test with 8 threads OpNode: 11841us, count: 100000, checksum: 499995100000 SetOps: 18833us, count: 100000, checksum: 499995100000 BitSet: 8389us, count: 100000, checksum: 499995100000

test with size: 10000000 (a and b) or c, increments: 1000, 1000, 1000 test with 4 threads OpNode: 911us, count: 10000, checksum: 49995010000 SetOps: 712us, count: 10000, checksum: 49995010000 BitSet: 2945us, count: 10000, checksum: 49995010000

test with size: 10000000 (a and b) or c, increments: 1000, 1000, 1000 test with 8 threads OpNode: 740us, count: 10000, checksum: 49995010000 SetOps: 949us, count: 10000, checksum: 49995010000 BitSet: 6495us, count: 10000, checksum: 49995010000

Ouch! Without memory efficient storage BitSet saturates the memory bandwidth and doesn't scale. So you definitely don't want to use BitSet on your expensive eight-core server class machine.

Over all the best solution could be to combine OpNode and SetOps. A simple but well working solution is to use OpNode as default but replace it with SetOps if the left and right arguments are of type IDSetTrie and the expected result sizes are large.

But your mileage may vary...

Read part I here: Efficient concurrent long set and map