Black box function optimization

Content note: I wrote this some time ago at a time when I was somewhat confused. While it contains some interesting ideas which may still be useful, please do not trust it.

Preface

In the past few years, various new model optimization techniques have been discovered, and old techniques thought obsolete have been revisited. While they may at first appear to be unrelated, many of them are in fact closely related. In this post, I present no new techniques. Rather, I present various explanations of existing techniques, and explain the similarities and important differences between them.

When optimizing the parameters of non trivial functions, the standard practice is to use gradient decent via backpropagation. This works well, however it requires that the function to be optimized be known and differentiable. Often, functions are not known. Even when they are known, they may not be easily differentiable. Much effort has gone into differentiating common functions, but this is sometimes difficult.

Backpropagation is not the only optimization technique however. There exist various black box optimization techniques which do not require that the function be known or differentiable. In addition to removing the need for function differentiation, certain of these black box optimization techniques have various advantages over back propagation in simplicity of implementation, and/or ease of parallelization.

In this post, I explain the use of Evolutionary Strategy (ES), empirical gradient decent (EGD), and hyper networks to optimize the parameters of continuous functions, argue that ES is but a crude form of gradient descent, and describe some minor alterations to gradient descent which allow us to exploit many of the advantages of ES while retaining some of the benefit of gradient descent. Lastly, I include various analogies which may or may not be helpful in understanding the techniques discussed.

I assume basic familiarity with ML and backpropagation, as I will not be explaining them in detail in this post. In particular, I highly recommend that you read How neural networks are trained before proceeding, as it is a far better explanation of gradient descent and many of its optimizations then this post will contain.

Although the concepts presented here may sound alien, they are only new presentations of idea which, as of this writing, have been well known to the ML community for many months. What small refinements I add would, I think, be obvious to anyone acquainted with the art.

After I posted this, Uber AI labs publish Measuring the Intrinsic Dimension of Objective Landscapes which covers much of the same content, and is far better done. You should probable read it.

For some of the original papers in which these idea were presented see:

Problem

Often we have a function which takes a list of parameters and returns a loss value drawn from some distribution. If we evaluate the function on the same parameters multiple times, we can gain more information about the distribution from which the values are drawn. The loss distribution may be altered by changing the parameters. We wish the average loss to be as low as possible. There exists a set of parameters which will produce the lowest average loss. We wish to find that set of parameters.

Loss Surfaces

Consider a function which takes a single parameter. In this case, we can visualize the loss of this function on a 2 dimension plot. We use the horizontal axis for the one parameter, and the vertical axis for the loss. As the loss is a continuous function of the parameter, it will produce a single continuous line. Since the function is not fully deterministic, the line will not be identical each time we evaluate it, but still, each time we calculate it, it is drawn from the same distribution.

Consider a more complex function, one which take two parameters. We must now use 3 dimensions to visualize the loss surface; the two horizontal dimensions for the two parameters, and the vertical dimension for the loss.

Consider a function which takes 3 parameters. Regrettable we do not have 4 dimensions at our disposal in which to visualize the 3 dimensional loss surfaces. One could perhaps visualize it as a cube of color gradients with x, y and z mapped to the three parameters, and loss mapped to color.

Of as many dimensions as our function has parameters, we have a loss surfaces. We can move about in this surface by altering the parameters. As we move in some direction, sometimes the loss decreases, sometimes it increases. We wish to move to the lowest point on this surfaces.

Finding Minima

While the function is relatively simple, it is often possible to solve for the points at which the surface is horizontal. This will give a finite number of candidate locations in parameter space which we can search by hand. However more complex functions are theoretically or practically insoluble. While we have but one or two parameters to optimize, it is perhaps possible to calculate the loss at a grid of points in parameter space, however if our function has many thousands of parameters or is computationally expensive, it becomes impracticable to compute such a grid with sufficient resolution as to be useful. Also, as mentioned before, the loss surface is actual a distribution of loss surfaces; to gain an accurate understanding of the average loss at any given point, we must evaluate the function many times. If we are to scale to functions of non trivial complexity and with significantly many parameters, we must search efficiently, measuring the loss as few times as possible, at as few points in parameter space is possible.

Gradient descent

To find the lowest point on a loss surface, one technique which comes to mind is to simply place a ball of the appropriate dimensionality somewhere on the surface and allow gravity to roll it down. The ball will come to rest at a low point on the surface.

A ball cannot observe the elevation of distant points on the surface, how does it know in which direction to travel? By observing the slope of the surface at its current location, and moving in the direction of the downward slope.

We can emulate the ball:

Start in the center of the loss surface.
Find the slope of the ground directly under us.
Move some distance in the direction of the downward slope.
Repeat many times.

This is gradient descent, and forms the foundation of much of modern machine learning.

There are however two major problems with gradient descent:

The loss surface often contains locations which are not the global minima, but are sounded by areas of greater loss. As we descent the loss gradient, we often fall into these local minima and are unable to climb out.
It is often complex or expensive to directly calculate gradients, if even possible.

The first of these has been well addressed elsewhere, and, as I will briefly explain, is often not as serious a problem as one might think. The second problem on the other hand has, I argue, not been sufficiently explored. It is this problem of simple and inexpensive gradient calculation which I will be discussing in main body of this post. But first I will briefly discuss the problem of local minima.

Local minima

Recall that a function of one parameter produces a loss surface of one dimension. If a function of one parameter is non trivially complex, its slope will change sign several times; it will have multiple local minima. Only one will be the global minima. If we allow a ball to roll down starting from a randomly chosen location it is likely to fall into one of these local minima and be unable to move to the global minima.

This is also true of function of more parameters.

We can take a vertical cross section of the loss surface of a 2 parameter function. This is equivalent to transforming a two parameter function into a one parameter function by replacing both its former parameters with linear functions of the new single parameter. Observe how the cross section of the 2 parameter loss surface has many more local minima.

Likewise we can take a vertical cross section of the loss surface of a 3 parameter function. This is equivalent to transforming a 3 parameter function into a 2 parameter function by replacing its three former parameters with linear functions of the two new parameters. Observe how this 3 dimensional space contains more local minimum then its higher dimension parent.

Removing a dimension from the loss surface tends to increase local minima. Conversely, adding a dimension tends to decrease the number of local minima.

This trend continues as we increase dimensionality.

Though dimensionality curses us with a vast search space, it blesses us with a reduction in local minima.

Thus, in functions of many thousands of parameters, there are rarely truly inescapable local minima. That said, there may still be valleys or saddle points which are at least slow to escape.

Also, we can use our noisy observations of the loss to our advantage. Each time we compute the function, we draw a different loss surface from the distribution. It is unlikely that any given local minima exists across the entire distribution of loss surfaces. It is harder to stay stuck in a hole when the ground is shifting.

For these two reasons, in most cases, local minima are not, ultimately, an insurmountable problem. https://arxiv.org/pdf/1412.0233v3.pdf

Gradient Calculation

A sphere rolling down a slope does not know the slope of the surface at all points. It only observes the slope at its current location before deciding the direction in which to roll next. So to, as we descend the loss gradient of our many dimensional parameter space, we need only compute the slope at our current location to know in which direction to move. If we are optimizing a function with n parameters, if we wish to descent the gradient of an n dimensional loss surface, we need to obtain a list of n numbers indicating, for each dimension of the loss surface, how much we should move. In other words, how much to adjust each of the n parameters of the function so as to move us to a location of lower loss.

The problem has now been transformed into one of calculating the slope of the loss surface at various points. The common technique for this is to differentiate the function. This works acceptably well for the class of functions which are differentiable, but is complex to implement and requires high precision calculation. If the function is not easily differentiable, or one wishes to use lower precision, it is desirable to use some other method of gradient calculation.

Given a one dimensional parameter space, one simple technique to obtain the slope at any given point is to simply measure the loss at two points near each other, and calculate the slope between the two losses. We can think of this as empirical gradient calculate. This same method applies to higher dimensions. To calculate the degree to which each of n dimensions contribute to the slope at the current location, one must calculate the loss at n + 1 points. It is generally considered that the backwards pass of back propagation is about twice as expensive as the foreword pass. Therefore, the cost of obtaining the n slopes of a point in n dimensional loss surface using back propagation is three times the cost of a foreword pass alone. In contrast, the cost of obtaining the n slopes of a an n dimensional loss surface using EGD, is n + 1 times the cost of a foreword pass. Only if our function has fewer then 2 parameters is pure EGD cheaper then back propagation. All else being equal, a naive implementation of EGD is not a viable alternative to back propagation.

Parallelism

The function is often noisy; each time we run it with a given set of parameters, we get some evidence about the true loss at that point, but to get an accurate measurement we must run it many times and average.

The function is often expensive; it may for example take 100 milliseconds to compute on a single CPU core.

In other words, each sample of the loss distribution at any given point in parameter space costs us 100 milliseconds of compute time, and we must make many observes to get accurate information.

Fortunately, we have many CPUs at our disposal. Let us split the problem among a large cluster of cheap slow cores. This requires that we transmit between each core the loss for each point in parameter space which we observe.

A single float32 loss requires only 4 bytes; it is relatively cheap to store and transmit.

A coordinate in parameter space is however expensive. Assuming 1000 parameters and 32 bits per parameter, it would take 4 kilobytes to represent the coordinates of single point in parameter space. This is expensive on bandwidth.

Let us instead noisily calculate the gradient, through back propagation or some other technique, update the parameters accordingly on each machine, and then, each generation, transmit only the new location in parameter space to a central server to be averaged together. This is an improvement, but still requires non trivial bandwidth and limits the number of nodes which can usefully participate.

There are various implementations of this and similar distributed gradient descent via back propagation optimization techniques such as Horvod, PaddlePaddle and Distributed TensorFlow, but they generally have difficulty scaling to many thousands of nodes on slow networks.

At this point, I would like to discuss a different topic. Please temporarily forget about loss surfaces and gradient descent for the duration of the following explanation.

Evolutionary Optimization

We have a function which takes a list of numbers, the genome, and returns its fitness. This fitness calculation function is structured in such a way that small changes to the input will usually produces relatively small change to fitness. We wish to obtain a genome which has a high fitness. Biological evolution is quite good at producing strings of DNA which have a high fitness in the current environment. We can emulate it.

Start with a genome of n random numbers.

Make s child copies.
Slightly perturb each child in different ways.
Measure the fitness of each perturbed child genome.
Keep the best child, and use it to replace the original.
Repeat many times.

This is Evolutionary Strategy (ES) optimization.

Note that the number of child genomes whose fitness is evaluated each generation is usually many orders of magnitude smaller then the length of the genome. For a genome of many hundreds of thousands of elements, it is common to evaluate perhaps 30 child genomes per generation. This will be important latter.

Seeds

When perturbing the genome, it is common practice to use a small amount of Gaussian noise. The noise does not need to be truly random, it can be pseudo-randomly generated from a seed. If we have 256 or fewer children per generation, we can use distinct 8 bit ints as the seeds for each child.

Therefore, any child at any generation is the sum of its parent and the noise generated from a particular seed. The parent in turn is the sum of the noise of a seed and its parent, and so on until we reach the original parent which is just random noise by itself. Therefore, to recreate a child at generation t, if we used one byte seeds each generation, we need only t bytes of information.

If we use seeds in this manner, we are not trying to find the child genome which is best, but rather the seed that, when used to pseudo-randomly generate noise, will most improve the parent genome.

We can think of this as a tree with one level for each generation, where each node has s branches. ES traverses this tree, each generation it must decide between s branches; it needs to obtain log2(s) bits of information.

For models optimized using ES with seeds, the amount of information needed to recreate a model scales linearly with the number of generations for which it was trained, and is completely independent of the size of the model.

Likewise, distributed training becomes trivial. Each generation, each of the thousands of training nodes need only send the seed which they found to be best. The central coordination server then need only pick the seed which the most training nodes considered to be best. We need transmit only log2(s) bits of information from each node per generation.

To summaries, ES searches through a tree of seeds, trying to find a series of seeds which will result in the genome with the greatest fitness.

ES as low dimensional, onehot, empirical gradient descent

Recall that in gradient descent we calculate the slope of the loss surface at our current location, and travel in the direction of the downward slope.

In ES we pick, at random, some small number of points around our current location, evaluate the loss at each one, and move to the point of the lowest loss. This can be though of as taking several one dimensional cross section of the n dimensional parameter space, evaluating the loss some random positive distance along each of these one dimensional cross sections, and then picking the cross section which happens to give the lowest loss. Viewed in these terms, it becomes more obvious that ES is a (stunningly inefficient) form of gradient decent. When compared to more traditional forms of gradient decent, the only redeeming aspects of ES are its amenability to parallelization across innumerable machines connected by slow networking, and the ease with which we can efficiently encoding models trained through it.

Can we perhaps alter ES to gain some of the advantages of gradient decent while retaining its native advantages?

One egregious inefficiency in ES which comes to mind is that while s points in the loss space are evaluated in full float32 accurate, only log2(s) bits of information are used. We can extract more information.

Rather then treat the s loss measurements as s one dimensional cross sections, let us instead treat it as one randomly selected s dimensional cross section of the parameter space. Along with the loss of the parent, we have enough information to precisely calculate the slope of this s dimensional cross section. We simply multiply each noise vector by the change in loss which it inflicted on the parent, and sum them to get the slope of the s dimensional cross section. Now we can descend this slope some distance, and repeat with a newly selected s dimensional cross section.

Each generation, we calculate, not the gradient of all dimensions of parameter space, but only the gradient for a randomly selected lower dimensional cross section of the parameter space. In practice, this is often sufficient.

We have now gained some of the performance of traditional gradient decent. Have we retained the parallelization benefits of ES?

We must now transmit s float32s from each training server per generation. While traditional ES requires log2(s) * generations bits to store a model, this altered ES requires s * 32 * generations bits to store a model. While far worse then pure ES, this is still better then gradient decent in many cases.

Hyper networks

It is relatively easy to construct a model capable, when given the correct parameters, of producing the output we want. However finding the correct parameters is hard. Wouldn’t it be nice if we could just train a neural net to produce good parameters for us? In a certain limited sense, this is what hyper networks do.

Take the example of a 1000 parameter primary model. Let’s use a very simple single layer NN for the hyper network. We start with 100 biases, matmul them with the [100, 1000], and get the 1000 parameters of the primary model. Now instead of needing to descend gradients in 1000 dimensional space, we need to descend gradients in 1000 * 100 + 100 dimensional space. This has not made our task easer. The bulk of the parameters of the hyper network are in the weights. Let’s just freeze them to random values. For space efficiency, since they are just random values, let’s pseudo-randomly generate the weights from seeds. We take 100 seeds, use each seed to generate 1000 random values, weight the values from each seeds with the bias, and then sum them. If we agree to use seeds 0 through 99, to transmit the model to someone else, we just need to give them our biases, they can reconstruct the weight of the hyper network, and use it to construct the parameters of the main network.

Conclusions

To summarize:

Gradient decent requires gradients.
- Gradient are complex, when not impossible, to compute via back propagation.
- Gradient are expensive to compute empirically.
- Standard gradient decent distributes badly.
ES is a very inefficient form of gradient decent.
- It distributes very well.
- But it is wasteful.
We can make ES more efficient by treating it as empirical gradient decent over a series of random lower dimensional cross sections of the parameter space.
- But we loose some of the cheap distributability we got with ES.
This low dimensionality empirically gradient decent is actually just a form of gradient decent on a lower dimensional cross section of the parameter space.
- It has the nice side effect of producing what is, in effect, a sparse net which saves on storage and bandwidth, although not, regrettable, on compute.
This is actually just a form of simplified hyperNEAT which we are exploiting to get an efficient encoding and perhaps somewhat faster training.

Or to put it differently:

I’m to lazy to different my model. (And my ML library doesn’t do it for me.)
Let’s just pick a bunch of seeds, turn them into noise, add them to the parameters and pick the best seed.
If we do this enough times we get a fairly good model!
But it takes a long time. :(
But I can use lots of cheap CPUs with slow networks.
Instead of picking the single best seed let’s use all the seeds, (just to different extents).
Even better, let’s forget about generation, and optimize a flat list of seeds all at once.
We are now back to searching a loss surface, (but now it’s lower dimensional).
We just invented a degenerate form of hyper network.

Future directions

In future posts I plan to present a simple Go library to demonstrate various types of ES and EGD. Some other questions perhaps worth pursuing:

How far can we reduce the dimensionality of a parameter space via a hyper network?
For how long is it possible to descend each dimension of the parameter space independently?

UPDATE:

For the time being I have abandoned floating point neural nets as being pointlessly high precision and unnecessarily expensive on CPU and RAM. The Go library for continues function optimization can be found here, however I have no plans to develop it further for the foreseeable future.

Omake

Somewhat less formal explanations of various model optimization problems. If an explanation does not make sense to you, skip it; not all analogous are useful to every person.

Descending hills

You find yourself on a hill in a complex landscape of many mountains and valleys. There is thick fog all around; you cannot see around yourself, There exists a genie who can tell you the altitudes at any list of positions in the landscape. However, you must pay one dollar for each point whose altitude you request. This landscape is actually a distribution of landscapes, each time you request a list of altitudes, the genie draws a single landscape from the distribution. If you want an accurate value, you must ask, and pay, the genie many times. The landscape does not have just two dimension, but rather, many thousands. You must get to the lowest point on the landscape while spending as few dollar as possible.

Finding green leaves on a tree

Imagine a tree which, each year, grows 256 new branches from the end of each of its previous branches, each tipped with a leaf grown from the genome of that branch. When the tree is n years old, it will have n^256 leaves. Each year, cosmic rays impact the cells of the growing tips of the tree and slightly corrupt the genomes thereof. For reasons which shall not be discussed at this time, the same indexed branch from different parents will always be corrupted in the same way. The branches growing from them will inherit the corrupted genome and, next year, further corrupt it. Leaves are generated from the genome of the growing tip. If the genome is a good one, the leaf will be green and beautiful. If it is a bad genome, the leaf will be brown and ugly.

If you look at a leaf, you can observe its color. But leaves appear to be different in different light. The longer you observe a leaf, the more accurate your beliefs about its color becomes.

You can climb up and down this tree. As you climb up, at each node, you must chosen among the 256 branches deriving from that node. Each additional year, the tree will have 256 times more leaves. You wish to find the path to the greenest leaf on the tree. This will take a long time.

Fortunately, you can hire a few thousand workers to help you search. Tell each worker to spend a short amount of time looking at each of the 256 leafs and to then report back to you the branch which they think has the most green leaf. You then pick the branch which the most workers considered to be greenest, and then tell all of them to climb that branch and look at the leaves there. This will be much faster.

Growing snake seeking a sunny place

Imagine an n dimensional space. Some parts of the space have more of some attribute, others less. For want of a better word let’s call this attribute “sunniness'. You are a snake. You want your head to be in an a very sunny place, but your tail is stuck at the center of the graph. Each day you grow a new segment of your body just behind your head, extending your head out in a straight line. The angle of this new segment is random and fixed, but you can chose how long to grow it. You can even grow it to a negative length if you want.

As an added refinement, imagine that you could change the length of any segment of your body any time you wanted. The angles are still fixed however.

What if you started with a fixed number of segments, 100 for example, at random fixed angles, but you could adjust the lengths as many times as you please? How would you find the most sunny spot in which to place your head in this many dimension space?

Blogroll