+++
Description = "We cannot descend a mod3 space in a single pass"
date = "2021-08-17"
draft = false
menu = "main"
title = "Single pass descent of mod3 space"
toc = true
type = "page"
+++

Introduction

Consider a large bag of n-bit strings, where n is, let us say, 1024. Each bit string has a single-bit target. We wish to use the input bit string to predict the target bit.

Our bag is very large, ~825 billion examples. We wish to pass over it only once, accumulating a summary into an accumulator.

This could be achieved using PCA. (Is this actually true? Can we solve a no-hidden-layer problem in closed form?)

However, we wish to obtain the set of n trits which, when multiplied by the input string, summed, and thresholded at 0, will be positive when the target is unset and negative when it is set.

If the bits of the string were fully independent of each other, perhaps we could achieve it. However, the input bits often have significant covariance.

I assert that in practice, it is not possible to obtain even a good approximation when the dataset is ECC-expanded n-grams of text.

I hope to be shown to be mistaken in this assertion.

Problem

For any given n-trit string, there is a count of examples which it classifies correctly. Given this n-dimensional mod3 space, with a count at each corner, find the location with the largest count.

This could be trivially solved by having an accumulator of size 3^n.

However, great as our bag is, 3^1024 is far larger. Had our strings been smaller, such that 3^n < 825 billion, this approach would work well.

Solutions

Baseline

Consider a simple lookup table mapping the n input bytes to the distribution of the next byte. This is the information-theoretically perfect predictor. (Ignoring generalization issues.) What accuracy can this achieve on datasets ranging from 16KB to 32MB, using differently sized context windows?
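
A minimal sketch of such a baseline. The function name is mine, it scores the table against the very data it was built from, and it assumes the contexts and their histograms fit in memory (which is exactly what fails in the OOM note below):

```rust
use std::collections::HashMap;

// Build a table mapping each n-byte context to the histogram of the byte that
// followed it, then score the "predict the most frequent next byte" rule
// against the same data, i.e. training-set memorization accuracy.
fn memorization_accuracy(data: &[u8], n: usize) -> f64 {
    let mut table: HashMap<&[u8], [u64; 256]> = HashMap::new();
    for window in data.windows(n + 1) {
        let (context, next) = (&window[..n], window[n]);
        table.entry(context).or_insert([0u64; 256])[next as usize] += 1;
    }
    let mut correct = 0u64;
    let mut total = 0u64;
    for window in data.windows(n + 1) {
        let (context, next) = (&window[..n], window[n]);
        let counts = &table[context];
        let prediction = (0..256usize).max_by_key(|&b| counts[b]).unwrap() as u8;
        if prediction == next {
            correct += 1;
        }
        total += 1;
    }
    correct as f64 / total as f64
}
```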

Random bytes:

(Note that we were forced to skip some of the larger sizes because the HashMap OOMed on a 128GB machine.)

| N \ log2 bytes | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.6% | 0.6% | 0.5% | 0.5% | 0.5% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% |
| 1 | 8.6% | 6.0% | 3.8% | 2.6% | 1.8% | 1.4% | 1.0% | 0.8% | 0.7% | 0.6% | 0.5% | 0.5% | 0.5% | 0.4% |
| 2 | 97.0% | 93.7% | 88.6% | 78.8% | 63.4% | 43.6% | 25.3% | 14.0% | 8.7% | 6.0% | 3.8% | 2.6% | 1.9% | 1.4% |
| 3 | 100% | 100% | 100% | 99.9% | 99.8% | 99.6% | 99.2% | 98.4% | 96.9% | 94.1% | 88.5% | 78.8% | | |
| 4 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99.9% | | |
| 5 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | | |
| 6 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | | |
| 7 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | | |
| 8 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | | |

Quite quickly, for even fairly small context windows, it begins to perfectly memorize the training set.

This, we would like to note, will very much not generalize to non-training data, and as soon as the dataset exhausts the capacity of the key space, accuracy drops to ~random.

Complete Works of Charles Dickens

| N \ log2 bytes | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 12.6% | 14.8% | 15.7% | 15.8% | 15.9% | 16.2% | 16.4% | 16.4% | 16.5% | 16.5% | 16.6% | 17.0% | 17.2% | 17.3% |
| 1 | 30.0% | 28.5% | 28.9% | 29.0% | 29.1% | 29.4% | 29.8% | 29.9% | 30.0% | 29.9% | 29.6% | 29.6% | 29.4% | 29.3% |
| 2 | 51.2% | 46.5% | 45.1% | 44.0% | 42.9% | 42.7% | 42.4% | 42.5% | 42.6% | 42.5% | 41.9% | 41.6% | 41.2% | 40.8% |
| 3 | 78.2% | 70.3% | 64.9% | 61.1% | 58.2% | 56.9% | 55.4% | 54.8% | 54.4% | 54.1% | 53.4% | 52.7% | 52.1% | 51.3% |
| 4 | 89.7% | 84.9% | 79.1% | 74.3% | 70.2% | 67.1% | 64.8% | 63.5% | 62.7% | 62.0% | 61.1% | 60.2% | 59.5% | 58.6% |
| 5 | 93.1% | 90.9% | 86.3% | 82.4% | 78.3% | 74.7% | 71.6% | 69.6% | 68.2% | 67.0% | 65.7% | 64.4% | 63.5% | 62.6% |
| 6 | 95.2% | 94.2% | 90.7% | 87.7% | 84.3% | 80.7% | 77.5% | 75.0% | 72.9% | 71.2% | 69.4% | 67.7% | 66.5% | 65.5% |
| 7 | 96.7% | 96.3% | 94.3% | 91.8% | 89.0% | 85.6% | 82.6% | 79.9% | 77.5% | 75.4% | 73.2% | 71.0% | 69.5% | 68.2% |
| 8 | 97.8% | 97.5% | 96.2% | 94.4% | 92.5% | 89.8% | 87.0% | 84.4% | 81.9% | 79.6% | 77.1% | 74.7% | 72.8% | 71.3% |
| 9 | 98.5% | 98.4% | 97.5% | 96.2% | 94.8% | 92.7% | 90.5% | 88.2% | 85.8% | 83.5% | 81.0% | 78.5% | 76.4% | 74.7% |
| 10 | 98.8% | 98.9% | 98.3% | 97.3% | 96.4% | 94.8% | 93.1% | 91.2% | 89.1% | 86.9% | 84.5% | 82.2% | 79.9% | 78.2% |
| 11 | 99.0% | 99.1% | 98.9% | 98.1% | 97.6% | 96.3% | 95.1% | 93.6% | 91.7% | 89.8% | 87.7% | 85.6% | 83.4% | 81.7% |
| 12 | 99.1% | 99.4% | 99.2% | 98.6% | 98.2% | 97.3% | 96.4% | 95.3% | 93.8% | 92.2% | 90.5% | 88.6% | 86.7% | 85.0% |

While this is not measuring generalization ability, merely ability to memorize the training set, for a sufficiently large training set, the two are approximately equivalent.

We see that with an input of 0 bytes, the best we can do is predict the character ' ' and achieve ~17% accuracy. This is significantly better than random because the distribution of bytes in English text is highly uneven. With an input of size 1, we quickly plateau at 29%. An input of size 2 allows us to achieve ~40%. With much larger inputs, however, we begin to overfit even 4 million characters. By the time we use an input of size 10, we almost perfectly memorize the smallest datasets, and achieve 78% accuracy on 32 million characters. However, based on the accuracy trend, the accuracy on non-training data, or on a larger training set, would be somewhat lower.

Given a 4-character input, the best accuracy we can hope to achieve is probably ~57%.

ECC

Hadamard Matrix

| N \ B | 64 | 128 | 256 | 512 | 1024 |
| --- | --- | --- | --- | --- | --- |
| 2 | 64 | 128 | 256 | 512 | 1024 |
| 4 | 41 | 84 | 169 | 340 | 681 |
| 8 | 32 | 68 | 137 | 276 | 557 |
| 16 | 30 | 62 | 126 | 259 | 522 |
| 32 | 26 | 58 | 122 | 252 | 510 |
| 64 | 24 | 55 | 118 | 245 | 501 |
| 128 | 22 | 53 | 114 | 240 | 494 |
| 256 | 21 | 50 | 112 | 236 | 488 |
| 512 | 19 | 49 | 109 | 232 | 481 |
| 1024 | 18 | 47 | 106 | 228 | 476 |

We devised a scheme to embed 8 bits in maximally distant corners of a 256-bit space. It was sublimely elegant and provided us with a minimum distance of 70 bits.

fn embed_bit(x: u8, y: u8) -> bool {
  // Bit index y (0..=255) of the 256-bit embedding of the byte x.
  (y ^ x).count_ones() > 4
}
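
As a sketch (the function name is mine), the full 256-bit codeword for a byte is obtained by evaluating this predicate at every index:

```rust
// Expand one byte into its 256-bit codeword by evaluating the embedding
// predicate at every bit index.
fn embed_byte(x: u8) -> [bool; 256] {
    let mut code = [false; 256];
    for y in 0..=255u8 {
        code[y as usize] = embed_bit(x, y);
    }
    code
}
```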

ECC grid

(This is one type of Munching square)

A random bit matrix provides us with a minimum of ~97 bits of distance.
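
A quick way to sanity-check a figure like this is sketched below; it uses a toy xorshift PRNG rather than a proper random source, so the exact minimum will vary a little with the seed:

```rust
// Minimum pairwise Hamming distance of a random 8-bit -> 256-bit code:
// 256 random codewords of 256 bits each (stored as four u64 words).
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn min_random_code_distance(seed: u64) -> u32 {
    assert_ne!(seed, 0, "xorshift needs a nonzero seed");
    let mut state = seed;
    let codewords: Vec<[u64; 4]> = (0..256)
        .map(|_| {
            let mut w = [0u64; 4];
            for word in w.iter_mut() {
                *word = xorshift64(&mut state);
            }
            w
        })
        .collect();
    let mut min = u32::MAX;
    for i in 0..codewords.len() {
        for j in (i + 1)..codewords.len() {
            let dist: u32 = codewords[i]
                .iter()
                .zip(&codewords[j])
                .map(|(a, b)| (a ^ b).count_ones())
                .sum();
            min = min.min(dist);
        }
    }
    min
}
```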

Simple solution

Let us assume for the time being that the input bits are noisy but independent information sources. For some, they are more likely to be set when the target bit is set, for others, they are more likely to be unset, but they all provide the same amount of information about the target bit.

Both these assumptions are false for our dataset, but let us ignore that for the time being.

Given these assumptions (which, as previously mentioned, are false for our dataset), it is a simple matter to count bits to obtain the signs. Given 64-bit counters, our accumulator is of size 1024 × 256 × 8 bytes, or 2 MiB.
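
A sketch of that counting pass. I am assuming here that the 256 in the accumulator size comes from a 256-bit (ECC-expanded) target alongside the 1024 input bits; the struct and method names are illustrative:

```rust
// Co-occurrence counters: counts[i][j] is the number of examples in which
// input bit i and target bit j were both set. Together with the per-bit
// totals, this is enough to pick a sign for each (input bit, target bit) trit.
struct SignCounter {
    counts: Vec<[u64; 256]>,   // 1024 x 256 counters, 8 bytes each
    input_totals: [u64; 1024],
    target_totals: [u64; 256],
    examples: u64,
}

impl SignCounter {
    fn new() -> Self {
        SignCounter {
            counts: vec![[0u64; 256]; 1024],
            input_totals: [0; 1024],
            target_totals: [0; 256],
            examples: 0,
        }
    }

    // One example: update every counter it touches.
    fn observe(&mut self, input: &[bool; 1024], target: &[bool; 256]) {
        self.examples += 1;
        for j in 0..256 {
            if target[j] {
                self.target_totals[j] += 1;
            }
        }
        for i in 0..1024 {
            if input[i] {
                self.input_totals[i] += 1;
                for j in 0..256 {
                    if target[j] {
                        self.counts[i][j] += 1;
                    }
                }
            }
        }
    }

    // Sign for the trit connecting input bit i to target bit j: positive if the
    // two are positively correlated, negative otherwise (comparing the observed
    // co-occurrence with the count expected under independence).
    fn sign(&self, i: usize, j: usize) -> i8 {
        let observed = self.counts[i][j] as u128 * self.examples as u128;
        let expected = self.input_totals[i] as u128 * self.target_totals[j] as u128;
        if observed >= expected { 1 } else { -1 }
    }
}
```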

Problems

Why do we fail?

For multiple reasons.

Nonlinearity

We do not care how far to one side an activation is, just that it is on the correct side.

The gradient of the objective with respect to any single trit depends on all the other trits: if an example's activation is already far from the threshold, flipping one trit changes nothing, but if it sits near the threshold, the same flip can change the classification.

This information cannot be stored in a sub-exponential accumulator.

Unequal information

Different bits have different degrees of correlation with the target.

If a bit has no correlation, it can be ignored, but if it has some correlation, should we use it?

Covariance

Input bits are not independent. If we could PCA them, we could make them independent, but we cannot use PCA. To achieve optimal results, we must not treat covariant bits as independent; perhaps we could even exploit covariance to skillfully cancel out noise?

Attempted solutions

Count the number of

Optimal solution

There is a solution. We can achieve theoretically optimal results, and we can do it in a single pass over the data. Simply test every combination of mutations, and select the best. This is very nice; the only downside is that it requires memory exponential in n. Given that n is 1024, this is prohibitive.

Multi pass approximation

Consider if, instead, we selected a single mutation, tested its goodness, and, if it was an improvement, applied it. This would require only a single integer accumulator.
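
A sketch of that greedy loop, assuming a hypothetical `count_correct` that makes one full pass over the dataset and returns how many examples a given trit vector classifies correctly:

```rust
// Greedy multi-pass descent: one candidate single-trit mutation per pass,
// kept only if it improves the number of correctly classified examples.
// `count_correct` and `propose_mutation` are hypothetical stand-ins.
fn greedy_descent(
    mut trits: Vec<i8>,
    passes: usize,
    count_correct: impl Fn(&[i8]) -> u64,
    mut propose_mutation: impl FnMut(&[i8]) -> (usize, i8),
) -> Vec<i8> {
    let mut best = count_correct(&trits); // one pass over the data
    for _ in 0..passes {
        let (index, new_trit) = propose_mutation(&trits);
        let old_trit = trits[index];
        trits[index] = new_trit;
        let score = count_correct(&trits); // another full pass per candidate
        if score > best {
            best = score; // keep the improvement
        } else {
            trits[index] = old_trit; // revert
        }
    }
    trits
}
```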

However, it would not necessarily reach the global optimum.

Also, it would require many passes.

IO is expensive. It is often worth burning somewhat more compute to save on IO.

Activation distribution

Consider a vector of [trit; N], and a large set of ([bit; N], bit).

This can be transformed to a [[u64; N+1]; 2] representing the two distributions of Hamming distances.

Now we have a pair of distributions, one for the set target bit, one for the unset target bit. The bins of both distributions together sum to the number of examples in our training set.

If our trit vec is good, the distribution for the set bit will sit higher than the distribution for the unset bit, and the two will overlap but little. If it is not good, the two distributions will overlap significantly.

Given such a pair of distributions, it is a simple thing to select the threshold which minimizes the number of incorrectly classified examples.
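
A sketch of that threshold scan, using the [[u64; N+1]; 2] layout above with BINS standing in for N + 1, and assuming index 0 holds the unset-target distribution and index 1 the set-target one:

```rust
// Pick the threshold that misclassifies the fewest examples, under the
// convention that bins at or above the threshold are predicted "set".
fn best_threshold<const BINS: usize>(dist: &[[u64; BINS]; 2]) -> (usize, u64) {
    // Threshold 0 predicts "set" everywhere, so every unset-target example is an error.
    let mut errors: u64 = dist[0].iter().sum();
    let mut best = (0usize, errors);
    for t in 0..BINS {
        // Raising the threshold past bin t flips that bin's prediction to "unset":
        // unset-target examples there become correct, set-target ones become errors.
        errors = errors - dist[0][t] + dist[1][t];
        if errors < best.1 {
            best = (t + 1, errors);
        }
    }
    best // (threshold, number of misclassified examples)
}
```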

We can think of this as an API which takes, for each target bit, a set of trit vectors, and returns, for each of the provided trit vectors, the distributions.

Given that it is high overhead to query the whole dataset, the coordinator is incentivized to request activation distributions in batches.

We wish to minimize the number of batches. This is easily achieved: we simply request distributions for all combinations of mutations. This is 2^1024 distributions. This is a very big number. We should therefore also optimize for fewer total distributions.

Consider therefore, the following scheme:

More sophisticated schemes could no doubt be devised, but the worker cluster need not be aware of them.

Efficiently searching the parameter space

The SearchManager trait provides two methods.

trait SearchManager<W> {
    fn mutation_candidates(&self, n: usize) -> Vec<W>;
    fn update(&mut self, mutations: &Vec<(W, u64)>);
}

mutation_candidates provides a list of candidate points to measure.

update takes a list of points in parameter space and mutates internal state.
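
A sketch of how a coordinator might drive this, with `measure_accuracies` standing in (hypothetically) for one batched pass over the data by the worker cluster:

```rust
// Hypothetical coordinator loop: ask the SearchManager for a batch of candidate
// points, measure them in one pass over the data, and feed the scores back in.
fn drive<W, M: SearchManager<W>>(
    manager: &mut M,
    batches: usize,
    batch_size: usize,
    measure_accuracies: impl Fn(&[W]) -> Vec<u64>,
) {
    for _ in 0..batches {
        let candidates = manager.mutation_candidates(batch_size);
        let scores = measure_accuracies(&candidates);
        let results: Vec<(W, u64)> = candidates.into_iter().zip(scores).collect();
        manager.update(&results);
    }
}
```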

Given a list of single-dimension accuracy deltas, older values lag behind.

As the model drifts, they will tend to cause regression to the mean.

At the same time, the current state has improved, and hence the cached value will likely be too small.

These two effects are opposite; however, we do not know which is larger, so we pretend that they cancel.

Future directions

Can we design some hybrid that produces good results, uses a small accumulator, and requires few passes?

Consider the following:

  1. Count correlation with the target, thresholding wherever we wish, to generate the initial trits.
  2. Observe gradients for all single-trit mutations.
  3. Select the top k (where k ≈ 8) mutations, test all 2^k sets of mutations, and apply the best set.
  4. Repeat 2 and 3 a few times.

In this fashion, we can apply multiple mutations per pair of passes, while ensuring that all applied mutations are strict improvements, and performing some small degree of multi-mutation hopping.

Our accumulator is of size 2^k, where k can be set as we please, and, while we require multiple passes, the number needed to approach the global optimum should still be low.
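
As a sketch of step 3 (the names, and the convention that input bits contribute their trit only when set, are my assumptions), all 2^k mutation sets can be scored against a 2^k accumulator in a single pass:

```rust
// Score every one of the 2^k combinations of k candidate single-trit mutations
// in one pass, using a 2^k accumulator of correct-example counts, and return
// the best combination as a bitmask over the candidates.
fn best_mutation_set(
    trits: &[i8],                   // current trit vector, values in {-1, 0, 1}
    candidates: &[(usize, i8)],     // k candidate mutations: (index, new trit)
    examples: &[(Vec<bool>, bool)], // (input bits, target bit)
) -> usize {
    let k = candidates.len();
    let mut correct = vec![0u64; 1 << k];
    for (input, target) in examples {
        // Activation of the unmutated trit vector on this example.
        let base: i64 = trits
            .iter()
            .zip(input)
            .map(|(&t, &b)| if b { t as i64 } else { 0 })
            .sum();
        // How much each candidate mutation would shift this example's activation.
        let deltas: Vec<i64> = candidates
            .iter()
            .map(|&(i, new)| if input[i] { (new - trits[i]) as i64 } else { 0 })
            .collect();
        for mask in 0..(1usize << k) {
            let mut activation = base;
            for (bit, &d) in deltas.iter().enumerate() {
                if mask & (1 << bit) != 0 {
                    activation += d;
                }
            }
            // Positive activation predicts an unset target, as defined earlier.
            if (activation > 0) == !*target {
                correct[mask] += 1;
            }
        }
    }
    (0..(1usize << k)).max_by_key(|&mask| correct[mask]).unwrap()
}
```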

Omake

Daemonic Bean machine

Consider Galton’s bean machine.

In each layer, the falling bean is randomly bopped to one side or the other. Since this is random, the beans will form a normal distribution.

However, consider if, at each layer, a daemon is watching the beans and choosing to which side to bop each one. If the daemon has a preference for one side or the other, the distribution will be skewed.

Now consider if some of our beans are green while others are blue. We want the green beans to be on the left and the blue beans to be on the right.

Different daemons have different failings:

The daemons which always bop beans to one particular side, though useless, are harmless.

The blind daemons are contributing noise. To them we say, "You should retire and let beans fall through your layer unmoved."

To the chromatically confused daemons, we say “Bop the other way.”

Now the green beans will tend to pile on the left, and the blue beans pile on the right. We can select a threshold and separate green from blue.

However, many of the daemons, while better than random, are still quite noisy. If a daemon makes correct decisions for 51% of beans, do we permit it to bop beans, or have it retire? If we could tell daemons to bop less strongly, we could permit the 51%-accurate daemon to help, but unfortunately, our daemons are atomic; they cannot bop a bean less strongly. If daemons were independent, we could choose some threshold and reject all daemons below it, but the daemons are not independent. Often, particular sets of daemons tend to make the same sort of mistake.

Note the order of daemons is irrelevant.

How are we to select which sets of daemons to use, and by what sign to have them invert their actions?

As we add or remove daemons, or instruct them to flip their direction, the distribution of green beans and distribution of blue beans shift.

Note that we do not care how far to one side of the center a bean lands.

We win if a green bean lands to the left of center, and lose if it lands to the right. Win/loss is boolean; our score is the number of beans which land on the correct side, not how far on the correct side they landed.