Machine learning and deep learning have, in just the last few years, tackled difficult problems such as object recognition, sentiment analysis, translation, and many more. These techniques have become so common to us that we often forget to ask, “What’s going on under the covers?”
A friend of mine, Gavin Quinn, posted a blog documenting his foray into the world of SAP’s Leonardo Machine Learning. It is a series of REST APIs that return some impressive results in the areas of object recognition, forecasting, classification, and translation. I thought it would be a good idea to pull back the covers of this particular API and make an educated guess about what is happening.
Gavin forayed first into the realm of object recognition. He uploaded an image of his Golden Doodle, Chloey, and while the service did not get terribly impressive results on the breed, it easily recognized that the image was of a dog.
Developers unfamiliar with machine and deep learning, and individuals who do not program, may ask, “How does it do that?” How in the world would a program know there is a dog in the photo, and how did it know not to say this is a picture of a monitor? After all, the distinctive lines and illuminated screen of a monitor are quite clearly visible in the photo.
When it comes to deep learning and image recognition, it is important to understand how you yourself see the world. Ask yourself: how do you know that a dog is the main component of this photo? You may harken back to grade school, when you were taught that the image you see is reflected on the back of your eye and then presented to the brain. This is, in much simplified form, the fundamental mechanics of vision. However, the big question remains: how do you know that the image sent to your brain is a dog?
First of all, our eyes do not actually work that way. We have a very small focal area called foveal vision. It is the only part of our vision with full visual acuity. When you take in an object, your foveal vision actually darts all over it until you recognize it. Your brain takes in small parts of an image, pieces them together, and then predicts what it is seeing. In rare documented cases of visual agnosia (read The Man Who Mistook His Wife for a Hat), the brain loses its ability to identify objects. It performs the functions necessary, but doesn’t put the pieces together correctly. If your brain took in a whole image and then ‘recognized’ it, there would be no such thing as visual agnosia. Take a look at this TED Talk if you are interested in more detail about how we perceive and identify objects.
What is fascinating is that this is quite similar to the way a convolutional neural network operates. A series of images and labels is fed into a network, and the network is trained. The network, through a series of layers, breaks down the image, scans it, and identifies features within it. The network then predicts, with a certain confidence, what the image represents. Like a child, in the beginning the network does poorly. A child who has just learned what a dog is may see a cat and say, “Look, a dog!” Then the parent corrects, “No dear, that is a cat.” This is equivalent to the neural network making an incorrect prediction, being corrected, and then tuning itself to be more precise the next time. After many iterations over images and labels, it begins to ‘learn’ what is in an image. Then, when you feed the network an image it has never seen, like Chloey, it knows that this is the image of a dog.
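To make the “breaks down the image, scans it, and identifies features” step concrete, here is a toy sketch in plain Python of the core operation that gives a ConvNet its name: sliding a small filter (kernel) across an image and recording how strongly it responds at each position. The image, kernel values, and function names below are illustrative inventions for this post, not anything SAP Leonardo or any real network actually uses; real networks learn their kernel values during training rather than having them hand-written.

```python
# A 6x6 grayscale "image": dark (0) on the left, bright (10) on the right,
# so there is a vertical edge running down the middle.
image = [
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
]

# A classic hand-made vertical-edge kernel: it responds strongly wherever
# brightness increases from left to right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve(img, k):
    """Slide the kernel over the image; record its response at each position."""
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            # Multiply the kernel against the patch under it and sum.
            row.append(sum(
                img[i + di][j + dj] * k[di][dj]
                for di in range(kh) for dj in range(kw)
            ))
        out.append(row)
    return out

feature_map = convolve(image, kernel)
for row in feature_map:
    print(row)  # large values mark where the kernel "found" its feature
```

Every row of the resulting feature map is `[0, 30, 30, 0]`: the filter stays quiet over flat regions and fires only where the vertical edge sits. A real ConvNet stacks many such learned filters in layers, so early layers find edges, later layers find textures and shapes, and the final layers find whole objects like “dog.”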
Obviously, this is a super simplified explanation of a Convolutional Neural Network (ConvNet), but it illustrates the power of neural networks for image recognition and how it closely resembles what we, as humans, actually do.
As a side note, why did SAP’s Leonardo Image Recognition not correctly identify the image as a Golden Doodle? It thought Chloey was most likely a Bouvier des Flandres. Now we get more nuanced. Pause and think about it a bit before you answer: why would it not get the breed right? I think there are two potential reasons. One, I personally don’t know the difference between the two breeds. The person(s) labelling the data may also not have known, so the labeled data may not be accurate at this level. In short, the network didn’t learn breeds properly. Or, Golden Doodle is not an AKC-recognized breed and therefore was not in the training set at all.
If there is interest, I would be happy to go on a much deeper technical dive of image recognition using ConvNets and Python and show you just how intuitive and easy it is to create image recognition programs.
As a curiosity, I ran Google’s pre-trained deep neural network, Inception, to see if it classified Chloey any better. You be the judge; the results are below.
48.66% : Irish water spaniel
16.15% : standard poodle
13.47% : miniature poodle
10.22% : Bouvier des Flandres
1.52% : toy poodle
1.18% : Kerry blue terrier
0.72% : briard
0.36% : giant schnauzer
0.30% : otterhound
0.18% : curly-coated retriever
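Confidence percentages like these are typically produced by a softmax layer, which turns the network’s raw scores (logits) into probabilities that sum to 100%. Here is a minimal sketch of that final step; the logit values below are made up for illustration and are not Inception’s actual outputs.

```python
import math

# Hypothetical raw scores (logits) for a few breed classes --
# invented values, not real Inception outputs.
logits = {
    "Irish water spaniel": 3.2,
    "standard poodle": 2.1,
    "miniature poodle": 1.9,
    "Bouvier des Flandres": 1.6,
}

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    exps = {name: math.exp(s) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

probs = softmax(logits)
for name, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{p:.2%} : {name}")
```

Note that softmax forces a ranked list over the classes the network was trained on. If “Golden Doodle” was never a class in the training set, the probability mass has to land somewhere, which is exactly why the network confidently spreads its guesses across the curly-coated breeds it does know.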