+++
Description = "We cannot descend a mod3 space in a single pass"
date = "2021-08-17"
draft = false
menu = "main"
title = "Single pass descent of mod3 space"
toc = true
type = "page"
+++
## Introduction
Consider a large bag of n-bit strings, where n is, let us say, 1024. Each bit string has a single-bit target. We wish to use the input bit string to predict the target bit.
Our bag is very large, ~825 billion examples. We wish to pass over it only once, accumulating a summary into an accumulator.
This could be achieved using PCA. (Is this actually true? Can we solve a no-hidden-layer problem in closed form?)
However, we wish to obtain the set of n trits which, when multiplied by the input string, summed, and thresholded at 0, will be positive when the target is unset and negative when it is set.
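A minimal sketch of that predictor (the names, the 0/1 treatment of the input bits, and the tiny N are our illustrative choices, not from the post):

```rust
// Sketch of the predictor: a vector of trits in {-1, 0, +1} is dotted
// with the input bits (taken as 0/1) and thresholded at 0. The text uses
// n = 1024; a small N is used here purely for illustration.
const N: usize = 8;

/// Per the text, the activation is positive when the target bit is unset
/// and negative when it is set, so a negative sum predicts "set".
fn predict(trits: &[i8; N], input: &[bool; N]) -> bool {
    let sum: i32 = trits
        .iter()
        .zip(input.iter())
        .map(|(&t, &b)| if b { t as i32 } else { 0 })
        .sum();
    sum < 0
}

fn main() {
    let trits: [i8; N] = [1, -1, 0, 1, -1, 0, 1, -1];
    let input: [bool; N] = [false, true, false, false, true, false, false, true];
    // The set input bits line up with the -1 trits, so the sum is -3
    // and we predict that the target bit is set.
    assert!(predict(&trits, &input));
}
```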
If the bits of the string were fully independent of each other, perhaps we could achieve it. However, the input bits often have significant covariance.
I assert that in practice, it is not possible to obtain even a good approximation when the dataset is ECC-expanded n-grams of text.
I hope to be shown to be mistaken in this assertion.
## Problem
For any given n-trit string, there is a count of examples which it classifies correctly. Given this n-dimensional mod3 space, with a count at each corner, find the location with the largest count.
This could be trivially solved by having an accumulator of size 3^n.
However, great as our bag is, 3^1024 is far larger. Had our strings been smaller, such that 3^n < 825 billion, this approach would work well.
## Solutions
### Baseline
Consider a simple lookup table mapping the n input bytes to the distribution of the next byte. This is the information-theoretically perfect predictor. (Ignoring generalization issues.) What accuracy can this achieve on datasets ranging from 4KB to 32MB, using differently sized context windows?
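A hypothetical minimal version of such a lookup table might look like the following (names and structure are ours; the post's actual experiment code is not shown):

```rust
// A minimal sketch of the baseline: a HashMap from the previous n bytes
// to counts of the next byte, predicting the most frequent continuation.
use std::collections::HashMap;

fn train(data: &[u8], n: usize) -> HashMap<&[u8], [u64; 256]> {
    let mut table: HashMap<&[u8], [u64; 256]> = HashMap::new();
    for i in 0..data.len().saturating_sub(n) {
        // Context is the n bytes ending at position i + n; target is the next byte.
        let entry = table.entry(&data[i..i + n]).or_insert([0u64; 256]);
        entry[data[i + n] as usize] += 1;
    }
    table
}

/// Training-set accuracy of predicting the modal next byte per context.
fn train_accuracy(data: &[u8], n: usize) -> f64 {
    let table = train(data, n);
    let mut correct = 0u64;
    let mut total = 0u64;
    for i in 0..data.len().saturating_sub(n) {
        let counts = &table[&data[i..i + n]];
        let modal = (0..256).max_by_key(|&b| counts[b]).unwrap() as u8;
        if modal == data[i + n] {
            correct += 1;
        }
        total += 1;
    }
    correct as f64 / total as f64
}

fn main() {
    // With one byte of context, "ab" repeated is perfectly predictable.
    assert_eq!(train_accuracy(b"abababababababab", 1), 1.0);
}
```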
Random bytes:
(Note that we were forced to skip some of the larger sizes because the HashMap
OOMed on a 128GB machine.)
N \ log2 bytes | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.6% | 0.6% | 0.5% | 0.5% | 0.5% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% | 0.4% |
1 | 8.6% | 6.0% | 3.8% | 2.6% | 1.8% | 1.4% | 1.0% | 0.8% | 0.7% | 0.6% | 0.5% | 0.5% | 0.5% | 0.4% |
2 | 97.0% | 93.7% | 88.6% | 78.8% | 63.4% | 43.6% | 25.3% | 14.0% | 8.7% | 6.0% | 3.8% | 2.6% | 1.9% | 1.4% |
3 | 100% | 100% | 100% | 99.9% | 99.8% | 99.6% | 99.2% | 98.4% | 96.9% | 94.1% | 88.5% | 78.8% | | |
4 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99.9% | | |
5 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | | |
6 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | | |
7 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | | |
8 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | | |
Quite quickly, for even fairly small context windows, it begins to perfectly memorize the training set.
This, we would like to note, will very much not generalize to non-training data; and as soon as the dataset exhausts the capacity of the key space, accuracy drops to ~random.
Complete Works of Charles Dickens
N \ log2 bytes | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12.6% | 14.8% | 15.7% | 15.8% | 15.9% | 16.2% | 16.4% | 16.4% | 16.5% | 16.5% | 16.6% | 17.0% | 17.2% | 17.3% |
1 | 30.0% | 28.5% | 28.9% | 29.0% | 29.1% | 29.4% | 29.8% | 29.9% | 30.0% | 29.9% | 29.6% | 29.6% | 29.4% | 29.3% |
2 | 51.2% | 46.5% | 45.1% | 44.0% | 42.9% | 42.7% | 42.4% | 42.5% | 42.6% | 42.5% | 41.9% | 41.6% | 41.2% | 40.8% |
3 | 78.2% | 70.3% | 64.9% | 61.1% | 58.2% | 56.9% | 55.4% | 54.8% | 54.4% | 54.1% | 53.4% | 52.7% | 52.1% | 51.3% |
4 | 89.7% | 84.9% | 79.1% | 74.3% | 70.2% | 67.1% | 64.8% | 63.5% | 62.7% | 62.0% | 61.1% | 60.2% | 59.5% | 58.6% |
5 | 93.1% | 90.9% | 86.3% | 82.4% | 78.3% | 74.7% | 71.6% | 69.6% | 68.2% | 67.0% | 65.7% | 64.4% | 63.5% | 62.6% |
6 | 95.2% | 94.2% | 90.7% | 87.7% | 84.3% | 80.7% | 77.5% | 75.0% | 72.9% | 71.2% | 69.4% | 67.7% | 66.5% | 65.5% |
7 | 96.7% | 96.3% | 94.3% | 91.8% | 89.0% | 85.6% | 82.6% | 79.9% | 77.5% | 75.4% | 73.2% | 71.0% | 69.5% | 68.2% |
8 | 97.8% | 97.5% | 96.2% | 94.4% | 92.5% | 89.8% | 87.0% | 84.4% | 81.9% | 79.6% | 77.1% | 74.7% | 72.8% | 71.3% |
9 | 98.5% | 98.4% | 97.5% | 96.2% | 94.8% | 92.7% | 90.5% | 88.2% | 85.8% | 83.5% | 81.0% | 78.5% | 76.4% | 74.7% |
10 | 98.8% | 98.9% | 98.3% | 97.3% | 96.4% | 94.8% | 93.1% | 91.2% | 89.1% | 86.9% | 84.5% | 82.2% | 79.9% | 78.2% |
11 | 99.0% | 99.1% | 98.9% | 98.1% | 97.6% | 96.3% | 95.1% | 93.6% | 91.7% | 89.8% | 87.7% | 85.6% | 83.4% | 81.7% |
12 | 99.1% | 99.4% | 99.2% | 98.6% | 98.2% | 97.3% | 96.4% | 95.3% | 93.8% | 92.2% | 90.5% | 88.6% | 86.7% | 85.0% |
While this is not measuring generalization ability, merely the ability to memorize the training set, for a sufficiently large training set the two are approximately equivalent.
We see that with an input of 0 bytes, the best we can do is predict the character ' ' and achieve ~17% accuracy.
This is significantly better than random because the distribution of bytes in English text is highly uneven.
With an input of size 1, we quickly plateau at 29%.
An input of size 2 allows us to achieve ~40%.
Much larger inputs, however, and we begin to overfit to even 4 million characters.
By the time we use an input of size 10, we almost perfectly memorize the smallest 4KB dataset, and achieve accuracy of 78% on 32 million characters.
However, based on the accuracy trend, the accuracy on non-training-set data, or on a larger training set, would be somewhat lower.
Given a 4-character input, the best accuracy we can hope to achieve is probably ~57%.
### ECC
N\B | 64 | 128 | 256 | 512 | 1024 |
---|---|---|---|---|---|
2 | 64 | 128 | 256 | 512 | 1024 |
4 | 41 | 84 | 169 | 340 | 681 |
8 | 32 | 68 | 137 | 276 | 557 |
16 | 30 | 62 | 126 | 259 | 522 |
32 | 26 | 58 | 122 | 252 | 510 |
64 | 24 | 55 | 118 | 245 | 501 |
128 | 22 | 53 | 114 | 240 | 494 |
256 | 21 | 50 | 112 | 236 | 488 |
512 | 19 | 49 | 109 | 232 | 481 |
1024 | 18 | 47 | 106 | 228 | 476 |
We devised a scheme to embed 8 bits in maximally distant corners of a 256-bit space. It was sublimely elegant and provided us with a minimum of 70 bits of distance.
```rust
// Bit y of the 256-bit code for the byte x.
fn expand_bit(x: u8, y: u8) -> bool {
    (x ^ y).count_ones() > 4
}
```
(This is one type of Munching square)
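As a sanity check, the claimed minimum distance can be verified by brute force over all byte pairs (function names here are ours):

```rust
// Brute-force check of the claimed minimum distance: expand every byte
// to its 256-bit code and compare all pairs.
fn expand_bit(x: u8, y: u8) -> bool {
    (x ^ y).count_ones() > 4
}

/// Hamming distance between the 256-bit codes of bytes a and b.
fn distance(a: u8, b: u8) -> u32 {
    (0u16..256)
        .filter(|&y| expand_bit(a, y as u8) != expand_bit(b, y as u8))
        .count() as u32
}

/// Minimum pairwise distance over all 256 codewords.
fn min_distance() -> u32 {
    let mut min = 256;
    for a in 0u16..256 {
        for b in (a + 1)..256 {
            min = min.min(distance(a as u8, b as u8));
        }
    }
    min
}

fn main() {
    // Matches the "minimum of 70 bits of distance" claimed above.
    assert_eq!(min_distance(), 70);
}
```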
A random bit matrix provides us with a minimum of ~97 bits of distance.
### Simple solution
Let us assume for the time being that the input bits are noisy but independent information sources. For some, they are more likely to be set when the target bit is set, for others, they are more likely to be unset, but they all provide the same amount of information about the target bit.
- Equal information: All bits are equally valuable to us.
- Independence: Given some set of existing bits, any other bit will provide us with the same amount of additional information. In other words, they have flat covariance.
Both these assumptions are false for our dataset, but let us ignore that for the time being.
Given these assumptions, it is a simple matter to count bits to obtain the signs. Given 64-bit counters, our accumulator is of size 1024 * 256 * 8 bytes.
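That counting pass can be sketched as follows (a toy version: N is shrunk, the target is a single bit rather than a byte, and all names are ours):

```rust
// One pass of counting: for each input bit, how often does it agree with
// the target? Thresholding the counts at half yields the trit signs.
// N is 1024 in the text; shrunk here for illustration.
const N: usize = 4;

fn count_agreements(examples: &[([bool; N], bool)]) -> [u64; N] {
    let mut counts = [0u64; N];
    for (input, target) in examples {
        for i in 0..N {
            if input[i] == *target {
                counts[i] += 1;
            }
        }
    }
    counts
}

/// +1 if a bit mostly agrees with the target, -1 if it mostly disagrees,
/// 0 at the midpoint (no information, under the independence assumption).
fn signs(counts: &[u64; N], total: u64) -> [i8; N] {
    let mut trits = [0i8; N];
    for i in 0..N {
        if 2 * counts[i] > total {
            trits[i] = 1;
        } else if 2 * counts[i] < total {
            trits[i] = -1;
        }
    }
    trits
}

fn main() {
    // Bit 0 copies the target, bit 1 inverts it, bits 2 and 3 are uninformative.
    let examples = [
        ([true, false, true, false], true),
        ([false, true, true, false], false),
        ([true, false, false, true], true),
        ([false, true, false, true], false),
    ];
    let counts = count_agreements(&examples);
    assert_eq!(signs(&counts, examples.len() as u64), [1, -1, 0, 0]);
}
```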
## Problems
Why do we fail?
For multiple reasons.
### Nonlinearity
We do not care how far to one side an activation is, just that it is on the correct side.
The gradient of any single trit with respect to the activation depends on all the other bits.
This information cannot be stored in a sub-exponential accumulator.
### Unequal information
Different bits have different degrees of correlation with the target.
If a bit has no correlation, it can be ignored, but if it has some correlation, should we use it?
### Covariance
Input bits are not independent. If we could PCA them, we could make them independent, but we cannot use PCA. To achieve optimal results, we must not treat covariant bits as independent. Perhaps we could even exploit covariance to skillfully cancel out noise?
## Attempted solutions
Count the number of
### Optimal solution
There is a solution. We can achieve theoretically optimal results, and we can do it in a single pass over the data. Simply test every combination of mutations, and select the best. This is very nice; the only downside is that it requires memory exponential in n. Given that n is 1024 bits, this is prohibitive.
### Multi pass approximation
Consider if instead we selected a single mutation, tested its goodness, and, if it is an improvement, applied it. This would require only a single integer accumulator.
However, it would not necessarily reach the global optimum.
Also, it would require many passes.
IO is expensive. It is often worth burning somewhat more compute to save on IO.
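The loop just described might look like the following (a sketch under our own names; each `score` call stands for one streaming pass over the bag):

```rust
// Sketch of the multi-pass loop: each pass streams the data once and
// scores the current trits with one candidate mutation applied, keeping
// the mutation only if the score improves.
const N: usize = 4;

fn predict(trits: &[i8; N], input: &[bool; N]) -> bool {
    let sum: i32 = (0..N).map(|i| if input[i] { trits[i] as i32 } else { 0 }).sum();
    sum < 0 // negative activation predicts a set target bit
}

/// The single integer accumulator: count of correctly classified examples.
fn score(trits: &[i8; N], data: &[([bool; N], bool)]) -> u64 {
    data.iter().filter(|(x, t)| predict(trits, x) == *t).count() as u64
}

/// One pass: tentatively set trit `i` to `value`, revert unless it helps.
fn try_mutation(trits: &mut [i8; N], i: usize, value: i8, data: &[([bool; N], bool)]) {
    let before = score(trits, data);
    let old = trits[i];
    trits[i] = value;
    if score(trits, data) <= before {
        trits[i] = old; // not a strict improvement: revert
    }
}

fn main() {
    // Target is simply input bit 0.
    let data = [
        ([true, false, false, false], true),
        ([false, true, false, false], false),
        ([true, true, false, false], true),
        ([false, false, true, false], false),
    ];
    let mut trits = [0i8; N];
    try_mutation(&mut trits, 0, -1, &data); // improves: kept
    assert_eq!(trits, [-1, 0, 0, 0]);
    try_mutation(&mut trits, 1, 1, &data); // regresses: reverted
    assert_eq!(trits, [-1, 0, 0, 0]);
}
```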
### Activation distribution
Consider a vector of `[trit; N]`, and a large set of `([bit; N], bit)`. This can be transformed to a `[[u64; N+1]; 2]` representing the distribution of hamming distances.
Now we have a pair of distributions, one for the set target bit, one for the unset target bit. Together, the bins of the two distributions sum to the number of examples in our training set.
If our trit vec is good, the distribution for the set bit will sit higher than the distribution for the unset bit, and the two will overlap but little. If it is not good, the two distributions will overlap significantly.
Given such a distribution, it is a simple thing to select the threshold which minimizes the number of incorrectly classified examples.
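The transform and the threshold scan can be sketched as follows. The post does not spell out the exact distance used; "a 0 trit never disagrees" is our reading, N is shrunk, and all names are ours:

```rust
// Bucket each example's distance to the trit vector by target bit,
// then scan all thresholds for the one with fewest misclassifications.
const N: usize = 4;

/// Number of positions where the trit vector disagrees with the input.
fn distance(trits: &[i8; N], input: &[bool; N]) -> usize {
    (0..N)
        .filter(|&i| (trits[i] == 1 && !input[i]) || (trits[i] == -1 && input[i]))
        .count()
}

/// The [[u64; N+1]; 2] of the text: one histogram of distances for each
/// value of the target bit. All bins sum to the number of examples.
fn distributions(trits: &[i8; N], data: &[([bool; N], bool)]) -> [[u64; N + 1]; 2] {
    let mut dist = [[0u64; N + 1]; 2];
    for (input, target) in data {
        dist[*target as usize][distance(trits, input)] += 1;
    }
    dist
}

/// Threshold t minimizing misclassifications when distance >= t predicts
/// a set target bit; returns (threshold, error count).
fn best_threshold(dist: &[[u64; N + 1]; 2]) -> (usize, u64) {
    let mut best = (0, u64::MAX);
    for t in 0..=N + 1 {
        let errors = dist[1][..t].iter().sum::<u64>() + dist[0][t..].iter().sum::<u64>();
        if errors < best.1 {
            best = (t, errors);
        }
    }
    best
}

fn main() {
    let trits = [-1i8, 0, 0, 0];
    // Target is input bit 0, so set-target examples sit at distance 1 and
    // unset-target examples at distance 0: threshold 1 gives zero errors.
    let data = [
        ([true, false, false, false], true),
        ([true, true, false, false], true),
        ([false, false, false, false], false),
        ([false, true, false, false], false),
    ];
    assert_eq!(best_threshold(&distributions(&trits, &data)), (1, 0));
}
```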
We can think of this as an API which takes, for each target bit, a set of trit vectors, and returns, for each of the provided trit vectors, the distributions.
Given that it is high overhead to inquire of the whole dataset, the coordinator is incentivized to query activation distributions in batches.
We wish to minimize the number of batches. This is easily achieved: simply request distributions for all combinations of mutations. But that is 2^1024 distributions, which is a very big number, so we should also optimize for fewer total distributions.
Consider therefore, the following scheme:
More sophisticated schemes could no doubt be devised, but the worker cluster need not be aware of them.
## Efficiently searching the parameter space
The `SearchManger` trait provides two methods.
```rust
fn mutation_candidates(&self, n: usize) -> Vec<W>;
fn update(&mut self, mutations: &Vec<(W, u64)>);
```
`mutation_candidates` provides a list of candidate points to measure. `update` takes a list of points in parameter space and mutates internal state.
Given a list of single-dimension accuracy deltas, older values lag behind.
As the model drifts, they will tend to cause regression to the mean.
At the same time, the current state has improved, and hence the cached value will likely be too small.
These two effects are opposite; however, we do not know which is larger, so we pretend that they cancel.
## Future directions
Can we design some hybrid that produces good results, uses a small accumulator, and requires few passes?
Consider the following:
1. Count correlation with target, thresholding wherever we wish, to generate initial trits.
2. Observe gradients for all single-trit mutations.
3. Select the top k (where k ~= 8) mutations, test all 2^k sets of mutations, apply the best set.
4. Repeat steps 2 and 3 a few times.
In this fashion, we can apply multiple mutations per pair of passes, while ensuring that all mutations are strict improvements, and performing some small degree of multi-mutation hopping.
Our accumulator is of size 2^k, where k can be set as we please, and, while we require multiple passes, the number needed to approach the global optimum should still be low.
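The 2^k subset-testing step above can be sketched like so (all names are ours, and K is tiny here for illustration):

```rust
// Score all 2^K subsets of K candidate mutations in one streaming pass,
// using a 2^K-entry accumulator, then report the best subset.
const N: usize = 4;
const K: usize = 2;

fn predict(trits: &[i8; N], input: &[bool; N]) -> bool {
    let sum: i32 = (0..N).map(|i| if input[i] { trits[i] as i32 } else { 0 }).sum();
    sum < 0
}

/// Apply the subset of candidate (index, value) mutations selected by `mask`.
fn apply(trits: &[i8; N], candidates: &[(usize, i8); K], mask: usize) -> [i8; N] {
    let mut t = *trits;
    for (j, &(i, v)) in candidates.iter().enumerate() {
        if mask & (1 << j) != 0 {
            t[i] = v;
        }
    }
    t
}

/// One pass over the data scores every subset; returns the best mask.
fn best_subset(
    trits: &[i8; N],
    candidates: &[(usize, i8); K],
    data: &[([bool; N], bool)],
) -> usize {
    let mut scores = [0u64; 1 << K]; // the 2^k accumulator
    for (input, target) in data {
        for mask in 0..1 << K {
            if predict(&apply(trits, candidates, mask), input) == *target {
                scores[mask] += 1;
            }
        }
    }
    (0..1 << K).max_by_key(|&m| scores[m]).unwrap()
}

fn main() {
    // Target is input bit 0; the (0, -1) mutation alone classifies
    // everything correctly, so mask 0b01 wins.
    let candidates = [(0usize, -1i8), (1, 1)];
    let data = [
        ([true, false, false, false], true),
        ([false, true, false, false], false),
        ([true, true, false, false], true),
        ([false, false, true, false], false),
    ];
    assert_eq!(best_subset(&[0i8; N], &candidates, &data), 1);
}
```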
## Omake
### Daemonic Bean machine
Consider Galton’s bean machine.
In each layer, the falling bean is randomly bopped to one side or the other. Since this is random, the beans will form a normal distribution.
However consider if, at each layer, a daemon is watching the beans and choosing to which side to bop it. If the daemon has a preference for one side or the other, the distribution will be skewed.
Now consider if some of our beans are green while others are blue. We want the green beans to be on the left and the blue beans to be on the right.
Different daemons have different failings:
- Some always bop beans to a particular fixed side
- Some are essentially blind and hence bop the beans at random
- Some are confused as to which way is right and which way left
The daemons which always bop beans to one particular side, though useless, are harmless.
The blind daemons are contributing noise. To them we say, “You should retire and let beans fall through your layer unmoved.”
To the chromatically confused daemons, we say “Bop the other way.”
Now the green beans will tend to pile on the left, and the blue beans pile on the right. We can select a threshold and separate green from blue.
However, many of the daemons, while better than random, are still quite noisy. If a daemon makes correct decisions for 51% of beans, do we permit it to bop beans, or have it retire? If we could tell daemons to bop less strongly, we could permit the 51%-accurate daemon to help, but unfortunately, our daemons are atomic: they cannot bop a bean less strongly. If daemons were independent, we could choose some threshold and reject all daemons below it, but the daemons are not independent. Often, particular sets of daemons tend to make the same sort of mistake.
Note the order of daemons is irrelevant.
How are we to select which sets of daemons to use, and which of them to instruct to invert their actions?
As we add or remove daemons, or instruct them to flip their direction, the distribution of green beans and distribution of blue beans shift.
Note that we do not care how far to one side of the center a bean lands.
We win if a green bean lands on the left of center, and lose if it lands on the right. Win/loss is boolean; our score is the number of beans which land on the correct side, not how far on the correct side they landed.