=explanation =programming =machine learning
Convolutional neural networks start with convolution: sliding a small "kernel" over your image. This produces a greyscale filtered image showing how well that kernel matches at each point. You do this for several kernels, each of which detects something simple like lines or corners.
This is done for many features, and you can think of the output for each feature as being like a color channel of the output image, except that there are more than 3 such "colors".
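As a rough illustration of that sliding step, here is a minimal sketch in plain Python. The function name `convolve2d`, the tiny 4x4 image, and the hand-written edge kernel are all made up for this example; in a real network the kernel values are learned, not written by hand. Like most neural network libraries, this doesn't flip the kernel, so strictly speaking it's cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (both lists of rows), with no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Dot product of the kernel with the patch whose top-left is (y, x).
            total = 0
            for dy in range(kh):
                for dx in range(kw):
                    total += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(total)
        output.append(row)
    return output


# Hypothetical image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

response = convolve2d(image, vertical_edge)
for row in response:
    print(row)  # each row is [0, 2, 0]: strongest where the edge is
```

Running several different kernels over the same image would give several such response maps, one per "color" of the output.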
The next step is usually max-pooling. You shrink the filtered image, and each pixel of the smaller image becomes the max value of the area it covers. Each pixel now indicates something like "there is probably a line somewhere in this area". Shrinking images like this not only reduces the amount of data to process at the next step, it also gives some looseness to the next stage of feature detection: you can now shift a feature by a pixel and the next stage will get basically the same input.
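Here's a small sketch of max-pooling and the looseness it gives, assuming non-overlapping 2x2 blocks (a common choice, but only an assumption here). The function name `max_pool` and both feature maps are invented for the example.

```python
def max_pool(feature_map, size=2):
    """Shrink a 2D feature map by taking the max of each size x size block."""
    pooled = []
    for y in range(0, len(feature_map) - size + 1, size):
        row = []
        for x in range(0, len(feature_map[0]) - size + 1, size):
            block = [
                feature_map[y + dy][x + dx]
                for dy in range(size)
                for dx in range(size)
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled


# Hypothetical feature map with two strong detections.
feature_map = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
]

# The same detections, each moved by one pixel within its 2x2 block.
shifted = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 3, 0],
]

pooled = max_pool(feature_map)
pooled_shifted = max_pool(shifted)
print(pooled)          # [[9, 0], [0, 3]]
print(pooled_shifted)  # [[9, 0], [0, 3]]
```

The looseness isn't perfect: a shift that carries a feature across a block boundary does change the pooled output, which is part of why the next stage only gets "basically" the same input.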
The same process is repeated, except now instead of detecting patterns of input pixels, you're detecting patterns of features. For example, maybe the next stage ends up detecting large approximate circles, by checking for circular patterns of line or corner features. After each stage, the image is again shrunk with max-pooling, providing a hierarchical looseness to detection of hierarchical features.
A common optimization in modern systems is "1x1 convolution" where at each pixel, the vector of features is multiplied by a matrix to reduce it to a smaller number of features. This works because there's often enough correlation between features that they can be combined into a smaller set of features.
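To make the 1x1 convolution concrete, here's a sketch where each pixel holds a vector of 4 features and a 2x4 matrix reduces it to 2. The name `conv1x1`, the feature values, and the weights are all hypothetical; real weights are learned, and these were chosen only so that pairs of features get added together.

```python
def conv1x1(feature_maps, weights):
    """Apply the same matrix to the feature vector at every pixel.

    feature_maps[y][x] is a list of input features for that pixel;
    weights has one row per output feature.
    """
    return [
        [
            [sum(w * f for w, f in zip(w_row, pixel)) for w_row in weights]
            for pixel in row
        ]
        for row in feature_maps
    ]


# Hypothetical 2x2 image with 4 features per pixel.
features = [
    [[1, 1, 0, 2], [0, 0, 3, 3]],
    [[2, 2, 1, 1], [1, 1, 1, 1]],
]

# Hypothetical weights: merge features 0 and 1, and merge features 2 and 3.
weights = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

reduced = conv1x1(features, weights)
print(reduced)  # [[[2, 2], [0, 6]], [[4, 2], [2, 2]]]
```

The image size is unchanged; only the number of features per pixel drops, which is what makes the following layers cheaper.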
You keep shrinking the image and increasing the number of features, and eventually you've shrunk things to 1 pixel. At this point, you're using a non-convolutional neural network, often called a "DNN" for "deep neural network" even if it's not very deep.
With a DNN, each set of features is a vector without any spatial meaning, and it's multiplied by a matrix to get the next feature set. But you could just combine those matrices into a single matrix and multiply by that, so no matter how many layers you have, you're still just doing a single matrix multiplication. To avoid this, people run each feature through a nonlinear function between the matrix multiplications. Traditionally this has been a sigmoidal function, but more recently people have mainly used "rectification" (ReLU), which is just setting the value to 0 if it's negative.
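The point about layers collapsing can be checked with a tiny example. The matrices `w1` and `w2` and the input `x` below are arbitrary values picked for illustration, and the helper functions are just minimal stand-ins for what a numerics library would provide.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def relu(vector):
    """Rectification: set negative values to 0."""
    return [max(0, v) for v in vector]


# Arbitrary example weights and input.
w1 = [[1, -1], [-1, 1]]
w2 = [[1, 1]]
x = [3, 1]

# Two linear layers in a row...
two_layers = matvec(w2, matvec(w1, x))

# ...give the same result as one layer using the combined matrix.
combined = matmul(w2, w1)
one_layer = matvec(combined, x)

# With rectification in between, the layers no longer collapse.
with_relu = matvec(w2, relu(matvec(w1, x)))

print(two_layers, one_layer, with_relu)  # [0] [0] [2]
```

Here the first layer outputs [2, -2], which the second layer sums to 0. Rectification zeroes out the -2 first, so the result becomes 2, something no single matrix applied to `x` would reproduce for all inputs.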
If you're trying to tell what's in an image, this sort of system works relatively well, but you shouldn't get overconfident about its capabilities. You're detecting certain patterns of patterns, but if you're detecting something like "rough and yellow in the middle, red on top" for a certain type of bird, the system could be confused by a tomato on top of a pineapple, where a human wouldn't be. If you put, say, stickers on things with features the neural network strongly recognizes, it can wreck the recognition.
I remember talking to a professor a few years back who told me that neural networks had surpassed human performance at detecting a type of cancer in pictures. I was skeptical, but I couldn't change his mind because there was a paper published in a journal he trusted. It turned out the paper in question had a neural network trained on human-labeled images with inaccurate labels, so when they asked doctors to look at the images, the doctors didn't match the (inaccurate) labels as well as a neural network trained on those labels.
One rule of neural network performance is that they're better relative to humans when things are smaller. For example, if you're detecting birds in a photo of a beach, where the birds are just a few pixels in the sky, then neural networks can do about as well as humans.
Similarly, neural networks are adequate for board games like Chess or Go because the number of squares and possible moves is relatively small. (But even then, while AlphaZero is impressive work, it loses to specialized Chess programs when processing power is limited and equal.) For larger state spaces, neural networks become less effective.