+++
Description = "Performance optimization of bit matrix counting"
date = "2021-06-29"
draft = false
menu = "main"
title = "Bit matrix count perf"
toc = true
type = "page"
+++

Introduction

Consider a set of examples. Each example consists of a pair of bit strings. We have a lot of examples, ~825 billion.

We want to use each example to mutate all words of an ~4MB accumulator.

In this post, we observe that the naive implementation has poor performance and poor scaling. We then apply cache locality optimization, SIMD, and a more complex cache locality optimization. By the end, we achieve a single-core performance improvement of a factor of ~11, and permit ~linear scaling with core count and input size.

Problem

As previously mentioned, each example consists of a pair of bit strings. For reasons, the target string is 256 bits, while the input bit string is somewhere from 512 to 2048 bits. For the time being, let us assume 1024 bits.

We wish to know, for each of the 256x1024 (target bit, input bit) pairs, in how many examples the two bits were in each of the four possible states.

We need four numbers for each entry in the matrix:

| input bit | target bit |
|-----------|------------|
| 1 | 1 |
| 1 | 0 |
| 0 | 1 |
| 0 | 0 |

However, given two additional quantities (the total number of examples, and the number of examples in which the target bit is set), we can compute the lower two rows from the upper two.

Even better, given only (1, 1), (1, _), and (_, 1), plus the total count, we can compute all four (see the sketch after the table below).

| bit pair | status |
|----------|--------|
| xx | known |
| 1x | known |
| x1 | known |
| 11 | known |
| 10 | 1x - 11 |
| 01 | x1 - 11 |
| 00 | xx - (1x + x1 - 11) |
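Concretely, recovering the four states from the three stored counts plus the total looks something like this (a sketch; u64 counts and the names are illustrative):

```rust
/// Recover all four (input, target) state counts for one matrix entry from the
/// total example count (xx) and the three stored counts (1x, x1, 11).
fn expand_counts(total: u64, input_set: u64, target_set: u64, both_set: u64) -> [u64; 4] {
    let c11 = both_set;                                    // 11
    let c10 = input_set - both_set;                        // 10 = 1x - 11
    let c01 = target_set - both_set;                       // 01 = x1 - 11
    let c00 = total - (input_set + target_set - both_set); // 00 = xx - (1x + x1 - 11)
    [c11, c10, c01, c00]
}
```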

Therefore, our matrix is of the shape [[[T; 1024]; 2]; 256].

What is T?

Since we have over 2^32 examples, we must use a u64 to count. 825 billion is somewhat less than 40 bits, and so we are in no danger of overflowing a u64.

Our accumulator will therefore be 256 * 2 * 1024 * 8 bytes = 4MB.

Naive implementation

We map over the bits of the target, use each target bit to select the set of counters we want, iterate over the input bits, and increment the counters accordingly.
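Roughly, in code (a sketch with assumed types and layout, not the actual implementation):

```rust
/// One counter per (target bit, target bit value, input bit): 256 x 2 x 1024 u64s, 4MB.
type Counters = Vec<[[u64; 1024]; 2]>;
/// 256-bit target, 1024-bit input.
type Example = ([u64; 4], [u64; 16]);

fn count_naive(counters: &mut Counters, examples: &[Example]) {
    for (target, input) in examples {
        for t in 0..256 {
            // The target bit selects which of the two counter rows we update.
            let t_bit = ((target[t / 64] >> (t % 64)) & 1) as usize;
            let row = &mut counters[t][t_bit];
            for i in 0..1024 {
                // Add 1 to counter i iff input bit i is set.
                row[i] += (input[i / 64] >> (i % 64)) & 1;
            }
        }
    }
}
```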

Consider the scaling with input size.

This implementation scales very badly. Why is this?

Multi core

We have 16 cores. Let us use them to make things run faster.

Fold-reduce operations are easy to parallelize; performance should scale linearly with the number of cores, so using more cores should make it run faster.
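For example, with rayon (a sketch reusing the assumed `Counters`, `Example`, and `count_naive` from the earlier snippet):

```rust
use rayon::prelude::*;

fn new_counters() -> Counters {
    vec![[[0u64; 1024]; 2]; 256]
}

/// Element-wise sum of two partial accumulators.
fn merge_counters(mut a: Counters, b: Counters) -> Counters {
    for (ta, tb) in a.iter_mut().zip(&b) {
        for (ra, rb) in ta.iter_mut().zip(tb) {
            for (ca, cb) in ra.iter_mut().zip(rb) {
                *ca += cb;
            }
        }
    }
    a
}

fn count_parallel(examples: &[Example]) -> Counters {
    examples
        .par_iter()
        // Each worker folds examples into its own private accumulator...
        .fold(new_counters, |mut acc, ex| {
            count_naive(&mut acc, std::slice::from_ref(ex));
            acc
        })
        // ...and the partial accumulators are summed pairwise at the end.
        .reduce(new_counters, merge_counters)
}
```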

On a 16 core machine, consider the scaling with number of threads:

This scaling is quite bad. There must be some manner of resource contention.

Chunking

Perhaps the work unit is sufficiently small that rayon overhead is significant? Let us make the work unit larger by using par_chunks().
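A sketch of the chunked variant (the chunk size here is an arbitrary illustrative value, and the helpers are those assumed in the earlier snippets):

```rust
fn count_chunked(examples: &[Example]) -> Counters {
    examples
        // Hand each worker a sizeable slice rather than individual examples.
        .par_chunks(16 * 1024)
        .map(|chunk| {
            let mut acc = new_counters();
            count_naive(&mut acc, chunk);
            acc
        })
        .reduce(new_counters, merge_counters)
}
```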

Now it scales somewhat less badly with the number of threads, but is still far from linear.

Let us try harder.

Simple cache locality

Consider cache sizes over the past 20 years. 20 years ago, L1 ranged from 32KB to 64KB while L2 ranged from 256KB to 2048KB. Now, L1 ranges from 32KB to 128KB and L2 ranges from 512KB to 4096KB. This is not much of an improvement, and many otherwise cheap and high-performance cores have only 32KB / 512KB.

This is physics; we cannot wait for vendors to give us larger core-local caches.

Consider: as long as our chunks contain fewer than 256 examples, our counters need only be 8 bits wide. This makes the chunk accumulator 8 times smaller than before.
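A sketch of the idea, again using the assumed types from the earlier snippets: accumulate each chunk of at most 255 examples into u8 counters, then flush into the wide u64 accumulator.

```rust
/// Small per-chunk accumulator: same shape, but u8 counters (512KB instead of 4MB).
type SmallCounters = Vec<[[u8; 1024]; 2]>;

fn count_u8(examples: &[Example]) -> Counters {
    let mut totals = new_counters();
    // At most 255 examples per sub-chunk, so a u8 counter cannot overflow.
    for chunk in examples.chunks(255) {
        let mut small: SmallCounters = vec![[[0u8; 1024]; 2]; 256];
        for (target, input) in chunk {
            for t in 0..256 {
                let t_bit = ((target[t / 64] >> (t % 64)) & 1) as usize;
                let row = &mut small[t][t_bit];
                for i in 0..1024 {
                    row[i] += ((input[i / 64] >> (i % 64)) & 1) as u8;
                }
            }
        }
        // Flush the u8 sub-accumulator into the u64 accumulator.
        for t in 0..256 {
            for s in 0..2 {
                for i in 0..1024 {
                    totals[t][s][i] += small[t][s][i] as u64;
                }
            }
        }
    }
    totals
}
```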

Now it runs faster and scales better with thread count and input size.

Let us try harder.

Complex cache locality

The accumulator is 2MB. This will not fit in L1; it will not even fit in L2.

As is traditional for high performance matrix multiplications, let us reshape the order so as to improve cache locality.

Currently: input word < input shape < output bits < data

Improved: input word < output bits < data < input shape

Input bit expansion caching

Expanding bits is non-trivially complex. We can expand each word of the input once, cache it, and amortize the cost over the 256 target bits.
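A scalar sketch of the idea (the function name and layout are illustrative): expand each 64-bit input word into one byte per bit, once, and reuse that expansion for all 256 target bits.

```rust
/// Expand one 64-bit input word into 64 bytes, each 0 or 1, so that the inner
/// per-target-bit loop is a plain byte-wise add with no shifting or masking.
fn expand_word(word: u64) -> [u8; 64] {
    let mut out = [0u8; 64];
    for bit in 0..64 {
        out[bit] = ((word >> bit) & 1) as u8;
    }
    out
}
```

The expansion costs 64 operations per word, but it is paid once per example word rather than once per (word, target bit) pair.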

Target bit expansion caching

Likewise, it is worth pre-expanding the target bits.

SIMD

Currently, the compiler is generating 32 repetitions, one for each bit. But it is a great shame to use a powerful CPU with 64-bit-wide registers to perform 8-bit additions one at a time. An AVX2 CPU has 256-bit-wide registers; it can handle a whole 32-bit word's worth of 8-bit additions in one operation with the awesome power of _mm256_add_epi8. ARM NEON is only 128 bits wide, but still a very nice improvement.
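A minimal sketch of that vector add (x86_64 with AVX2 assumed; pointer-based to keep it short, and not the actual inner loop):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Add 32 byte lanes (each holding 0 or 1) into 32 u8 counters with a single
/// _mm256_add_epi8.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_32_u8(counters: *mut u8, increments: *const u8) {
    let acc = _mm256_loadu_si256(counters as *const __m256i);
    let inc = _mm256_loadu_si256(increments as *const __m256i);
    _mm256_storeu_si256(counters as *mut __m256i, _mm256_add_epi8(acc, inc));
}
```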

Also, instead of unpacking the word of bits one bit at a time, we can unpack bits in parallel using PSHUFB.
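A sketch of the PSHUFB (`_mm256_shuffle_epi8`) unpack: expand one 32-bit word into 32 bytes of 0/1 without a per-bit loop. This is the usual broadcast-and-mask trick under an AVX2 assumption, not necessarily the exact code used here.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn expand_32_bits(word: u32) -> [u8; 32] {
    // Put the 4 input bytes in every dword, then shuffle so that lane i holds
    // byte i/8. Note: _mm256_shuffle_epi8 indexes within each 128-bit half,
    // which is why the same indices work for both halves here.
    let bytes = _mm256_set1_epi32(word as i32);
    let spread = _mm256_shuffle_epi8(
        bytes,
        _mm256_setr_epi8(
            0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
            2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
        ),
    );
    // Lane i tests bit i%8 of its byte and becomes a 0/1 counter increment.
    let select = _mm256_setr_epi8(
        1, 2, 4, 8, 16, 32, 64, -128, 1, 2, 4, 8, 16, 32, 64, -128,
        1, 2, 4, 8, 16, 32, 64, -128, 1, 2, 4, 8, 16, 32, 64, -128,
    );
    let hits = _mm256_cmpeq_epi8(_mm256_and_si256(spread, select), select);
    let ones = _mm256_and_si256(hits, _mm256_set1_epi8(1));
    let mut out = [0u8; 32];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, ones);
    out
}
```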

Different parallelism

Intermission

We have improved single threaded performance by a factor of x, and allowed flatter scaling.

It takes a whole x.y cycles to perform but a single _mm256_add_epi8. The CPU can do better. Supposedly, Skylake can do 3 per cycle.

Admittedly, the CPU is having to do some wonderfully complex things: loading and storing 256-bit-wide vectors, computing indices, etc. But still, it is disappointingly slow.

Is it cache bound? Is it other compute overhead? Will we ever know?

Going exponential

Let us try some exponential scaling. That will surely solve our performance problems.

Consider the basic interview question: “What is the time complexity of indexing into an array of Sized items?” Not a linked list, but a Vec of, let us say, 8 bit items.

The correct answer is that indexing into such an array is constant time.

Now consider: we have a large set of n b-bit strings, and we wish to know, for each of the b bits, how many of the strings have that bit set.

There are two stages: first, count the occurrences of each of the 2^b possible values in a histogram, which takes n increments; second, for each of the 2^b values and each of its b bits, add that value's histogram count to the bit's counter, which takes (2^b)*b additions.

Our time complexity is therefore n + (2^b)*b.
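A sketch of the two-stage approach for 16-bit words (the word width and names are illustrative):

```rust
/// For each of the 16 bit positions, count how many of the input words have
/// that bit set, via a 2^16-entry histogram.
fn per_bit_set_counts(words: &[u16]) -> [u64; 16] {
    // Stage 1: n increments into the histogram.
    let mut hist = vec![0u64; 1 << 16];
    for &w in words {
        hist[w as usize] += 1;
    }
    // Stage 2: (2^16) * 16 additions, independent of n.
    let mut counts = [0u64; 16];
    for (value, &n) in hist.iter().enumerate() {
        for bit in 0..16 {
            counts[bit] += n * ((value >> bit) & 1) as u64;
        }
    }
    counts
}
```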

Consider the more traditional way. Here, we increment a counter for each of the n*b bits.

Which is lesser: n + (2^b)*b, or n*b?

If n is many times larger than b, n + (2^b)*b is far better.

Consider, for example, b == 16, n == 2^32. Then n*b = 2^36 additions, while n + (2^b)*b = 2^32 + 2^20 ≈ 2^32.

Clearly, 36 is larger than ~32. We have reduced the number of additions by a factor of ~16. We can continue to scale n and get yet greater improvements.

Let us revisit the interview question, “What is the time complexity of indexing into an array of Sized items?”.

If the interviewee answered “Constant” with no further qualification, then while this is the correct answer, I would mark them down for it, for their answer betrays a lack of hardware sympathy.

In practice, repeatedly randomly indexing into a large array is not as fast as repeatedly indexing into a small array. If our array exceeds ~32KB, it gets evicted out of L1; if it exceeds ~512KB, it gets evicted out of L2; if it exceeds ~32MB, it gets evicted out of L3 to main memory; and beyond perhaps a few hundred GB on a large machine, it will get swapped out to disk, where there is weeping and gnashing of teeth.

If b==16, we need 65536 counters; if each counter is to be 32 bits (to avoid being overflowed by our 2^32 strings), our accumulator will be 256KB. This may fit in L2, but will certainly not fit in L1.

If b==32, we had better have a whole lot more than 2^32 items, and so we had better be using 64-bit counters. Our accumulator will therefore be 32GB.

It so happens that we have 825 billion examples. Let us round up to 2^40.

Seeing as our counters must not be overflowed by our examples, the required counter width grows with log2(n).

Space complexity is therefore (2^b)*log2(n).

Contrast this with the space complexity of our previous approach: b*log2(n).

It is pointless unless n is many times larger than 2^b. Let us assume that log2(n) is twice b.

| b (bits) | accumulator space |
|----------|-------------------|
| 4 | 16B |
| 8 | 512B |
| 16 | 256KB |
| 32 | 32GB |
| 64 | very big |

Consider further that our input strings are actually on the order of 1024 bits. We cannot allocate 2^1024 counters' worth of memory. Even if we could, we have not the n to amortize the fixed costs.

However, we can break the input into smaller words, each with its own accumulator (sketched in code after the table below).

The smaller we break it, the smaller the total accumulator size. If we break it into single-bit segments, we are back where we started, with a fixed two counters per bit. We must pick a good compromise in the space/compute trade-off space.

Compute is the average number of additions per 1024-bit input string.

| n | b | number of chunks | compute | acc space |
|---|---|------------------|---------|-----------|
| 2^8 | 4 | 256 | 320 | 4KB |
| 2^16 | 8 | 128 | 132 | 64KB |
| 2^32 | 16 | 64 | 64.0156 | 16MB |

For large n, the compute cost approaches the number of words.
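A sketch of the chunked version, for the middle row of the table (a 1024-bit input split into 128 8-bit words, each with its own 256-entry histogram). The code is illustrative; it uses u64 histogram counters for simplicity, whereas the table's 64KB figure assumes 16-bit counters.

```rust
const CHUNKS: usize = 128; // a 1024-bit input split into 128 8-bit words

/// Stage 1: one histogram per chunk; 128 increments per 1024-bit input string.
fn chunk_histograms(inputs: &[[u8; CHUNKS]]) -> Vec<[u64; 256]> {
    let mut hists = vec![[0u64; 256]; CHUNKS];
    for input in inputs {
        for (hist, &byte) in hists.iter_mut().zip(input.iter()) {
            hist[byte as usize] += 1;
        }
    }
    hists
}

/// Stage 2: expand each histogram into 8 per-bit counts; 128 * 256 * 8
/// additions in total, amortized over all n input strings.
fn per_bit_counts(hists: &[[u64; 256]]) -> Vec<[u64; 8]> {
    hists
        .iter()
        .map(|hist| {
            let mut counts = [0u64; 8];
            for (value, &n) in hist.iter().enumerate() {
                for bit in 0..8 {
                    counts[bit] += n * ((value >> bit) & 1) as u64;
                }
            }
            counts
        })
        .collect()
}
```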

Clearly, larger b and larger n give better performance. In theory. In practice, often the only point of doing theoretical performance calculations is so that we can laugh at them when they fail.

Let us empirically observe performance.

Future work

Long ago, people rode over the plains in their ox carts and, if the axle hole of the cart wheel was well formed, they rode in comfort, and if the axle hole of the wheel was badly formed, they experienced dukkha.

Now, I sit in my room, and, as the fans of my computer spin in their exquisitely machined bearings to move the air to dissipate the heat generated as my Ryzen turns upon the wheel of computation ~4 billion times per second, I experience dukkha because my computer, with all its 16 Zen cores and 128GB of RAM, takes on average a whole 478 ns to perform 2^18 8-bit integer additions in a very particular pattern.

I could practice mindfulness. Perhaps if someone said to me “Goat carts are outside the gate where you can play with them.” I would go outside and play with the goats, and feed them tasty leaves, and scratch their heads, and I and the goats would be free of the wheel. But, TFW no goat cart. :(

Seeing as I have no goat cart to help me, let us consider Johann Tetzel instead.

As the coin in the coffer rings

So the soul from purgatory springs

Perhaps if I bought a more powerful computer I could achieve happiness.

But what sort of computer? Should we buy a flock of RPi4s, or a Threadripper 3990X?

To return to wheels, is it better to use the Greater Vehicle, the Lesser Vehicle, or the Diamond Vehicle? Is it not said that computers are bicycles for the mind? Did not Albert Hofmann ride his bicycle home before he first achieved enlightenment? Surely, vehicles are critical to attaining success, but what sort of vehicle? Is it better to use a complex instruction set machine such as x86_64 or a reduced instruction set machine such as ARM or RISC-V? (Or perhaps even a diamond lattice FPGA?) Which sort of vehicle will help us to achieve enlightenment fastest, and, more importantly, cheapest?

In the summer, my electricity is free, but for others, it is not. If we are to perform a fold-reduce over all our 825GB of data in a reasonable time, we should use lots of compute in parallel. Our compute demands are spiky. Rather than buy, perhaps we should rent. Whose core-GHz-hours are the best price/performance?

This is an easy question to answer with some manual spot market scraping.

| type | spot price ($/hr) | vCPUs | RAM (GB) |
|------|-------------------|-------|----------|
| t3.nano | $0.0016 | 2 | 0.5 |
| t3.micro | $0.0031 | 2 | 1.0 |
| t3.small | $0.0069 | 2 | 2.0 |
| t3a.nano | $0.0023 | 2 | 0.5 |
| t3a.micro | $0.0030 | 2 | 1.0 |
| t3a.small | $0.0063 | 2 | 2.0 |
| t2.nano | ? | 1 | 0.5 |
| t2.micro | $0.0035 | 1 | 1.0 |
| t2.small | $0.0069 | 2 | 1 |
| a1.medium | $0.0084 | 1 | 2 |
| m6g.medium | $0.0179 | 1 | 2 |
| c5.large | $0.0335 | 2 | 4 |
| c4.large | $0.0312 | 2 | 3.75 |
| GCP n1-standard-1 | $0.01 | 1 | 3.75 |
| GCP n1-highcpu-2 | $0.014 | 2 | 1.80 |
| IBM classic transient | $0.019 | 2 | 4 |
| Azure A1 | $0.012 | 1 | 1.75 |
| Azure A1 v2 | $0.009 | 1 | 2 |
| Azure D2a v4 | $0.022 | 2 | 8 |
| Azure D1 | $0.015 | 1 | 3.5 |
| Azure F1 | $0.012 | 1 | 2 |
| Azure A0 | $0.018 | 1 | 0.75 |

Some spot instances are cheaper than others per core. But they also vary in terms of performance. What merit do we gain by paying half the price per hour if the job must run three times longer? The x86 cores are but hyper-threads. Will our co-tenants share L1/L2 cache nicely? The t* types are cheaper, but they are also shared core. a1 and m6g are ARM based. Do we care?

Consider a developer who needed to provision a VM, who said: “I will not use the VM until I know whether it was designed by an Israeli, a Korean, or an American.” He would say, “I won't use the CPU until I know whether it was laid out by a human or a NN… until I know what feature size, and how many layers… until I know whether the motherboard was made by Dell, SuperMicro, HP, or Lenovo… until I know whether it is using KVM or Xen.” The developer would die and those things would still remain unknown to them.

Next: Benchmarking public cloud spot.