Can you tell the difference
Developing a mobile-based computer vision prototype for seed classification by Jarred Wynan and Martyn Upton, consultants in our Canberra office.
We were recently asked to work on a short project to develop a proof of concept using Microsoft’s cognitive services to identify species of seeds using a mobile phone and its built-in camera capabilities.
Daunting as this sounds, within a few short weeks we not only met that lofty goal but also found ways to take it further. We introduced a live video feed to assist with image focus and processed matching results in real time, making the prototype easy to use and practical for most real-world scenarios.
In addition, we were able to overcome some of the challenges of working with sub-millimetre subjects. Unless you’re a botanist, you probably couldn’t recognise (or tell apart) the seeds of many species; it’s an extremely specialised field of study.
Fig.1 Chilli and Eggplant – can you tell which is which?
Over the next few blog articles we’d like to demonstrate the power of this technology and walk you through how we addressed some of the challenges. First of all, let’s talk about classification in machine learning so it’s clear what we’re trying to achieve.
Classification and Vision
Classification is a form of pattern recognition: given an input, the aim is to identify the subject by assigning it to one of a set of known categories, which becomes the output.
In this case, we’re providing an image of something as the input, and using a machine learning model to identify the correct species, family or genus of a seed as the output.
For you and me, classification happens automatically all the time: we’re constantly seeing, recognising and recalling the things we look at. Our vision systems are hard-wired for pattern recognition and we’ll often see things that aren’t there (faces on the features of Mars, animals in a cloud formation, human figures in a patterned wallpaper).
For a computer, this form of visual pattern recognition is a foreign concept. A photo or video is simply a file containing individual bits of data (0s and 1s) which represent the light captured by photosensors on a CCD or CMOS array in a camera. Each block of data essentially holds colour and brightness information about a pixel, typically stored as red, green and blue intensities.
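To make that concrete, here is a minimal sketch in Python of a picture as the computer holds it: nothing but a grid of numbers. The 2x2 image and the `to_hsv` helper are made up for illustration; the conversion itself uses `colorsys` from the Python standard library.

```python
import colorsys

# A hypothetical 2x2 image: each pixel is (red, green, blue), 0-255 per channel.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (128, 128, 128)],
]

def to_hsv(pixel):
    """Convert one 8-bit RGB pixel to (hue, saturation, brightness), each 0..1."""
    r, g, b = (channel / 255 for channel in pixel)
    return colorsys.rgb_to_hsv(r, g, b)

for row in image:
    for pixel in row:
        h, s, v = to_hsv(pixel)
        print(pixel, "-> hue %.2f, saturation %.2f, brightness %.2f" % (h, s, v))
```

Whichever way the numbers are stored, there is no "seed" or "cat" anywhere in them; any meaning has to be inferred from patterns across many pixels.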
It actually isn’t that different from the photoreceptors packed into the retina. Red, green and blue “flavoured” cones are distributed in the eye and detect different frequencies of light, allowing our visual systems to blend the results and sense the beauty of the colour spectrum. Rods are activated in low light and are why you have a little grey scale vision at night-time.
The similarities don’t stop there: the human eye has a maximum resolution too, and it has been studied. It turns out that the resolving power of the human eye is approximately 1 arc minute (about 0.017°), which corresponds to about a third of a metre viewed at a distance of 1km.
Camera and display technology is improving all the time, with 4K becoming mainstream and 8K on the horizon. Better cameras in mobile devices mean we are able to capture more detail, and that’s something we’ll need to maximise if we’re trying to identify a seed that is 1mm across.
However, in the case of computer vision, it is not the resolution but the pattern recognition that’s the challenge: how do we get computers to “see” and recognise what they’re looking at in the real world?
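The arithmetic behind that figure is easy to check. The sketch below (the function name is ours, purely for illustration) converts one arc minute, which is 1/60 of a degree, into the size it spans at a given distance.

```python
import math

ARC_MINUTE_DEG = 1 / 60  # one arc minute expressed in degrees

def size_subtended(distance_m, angle_deg=ARC_MINUTE_DEG):
    """Size in metres spanned by angle_deg when viewed from distance_m."""
    return 2 * distance_m * math.tan(math.radians(angle_deg) / 2)

print(round(ARC_MINUTE_DEG, 4))          # degrees in one arc minute
print(round(size_subtended(1000), 2))    # metres spanned at 1 km
```

At 1km this works out to roughly 0.29 m, or a little under a third of a metre.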
Enter machine learning
Over the years, researchers have developed various statistical models, algorithms and Convolutional Neural Networks (CNNs) as a way to address visual pattern recognition in Computer Vision. The concept is that by modelling and mimicking the way neurons work in the human brain, computers can learn.
CNN systems work by being given a set of images, each of which is tagged with a label. Given sufficient images, the computer can learn the visual characteristics associated with a particular label and discriminate it from something that doesn’t have those characteristics.
A CNN has an input layer, multiple hidden layers and an output layer. The hidden layers are largely made up of convolutional neurons (essentially mathematical cross-correlations).
The first convolutional layer applies convolutions to the image to identify and extract key features and then passes these as inputs to the next layer, and so on.
Each convolutional neuron is associated with a specific receptive field and passes data to a different network of convolutional neurons in the next layer. Ultimately these processes connect to an output layer which produces the classification result.
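For the curious, the cross-correlation at the heart of a convolutional neuron can be sketched in a few lines of plain Python. This is an illustrative toy, not how any production library implements it: the image patch and the kernel values are made up, and in a real CNN the kernel weights are learned during training rather than chosen by hand.

```python
def cross_correlate(image, kernel):
    """Slide the kernel over the image (no padding), summing element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Each output value only "sees" a kh x kw window: its receptive field.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x4 greyscale patch: dark on the left, bright on the right.
patch = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = cross_correlate(patch, vertical_edge)
print(feature_map)
```

The resulting feature map is zero wherever the patch is flat and large only where the edge sits, which is the kind of extracted feature that gets passed on to the next layer.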
This layering and passing of data through a network is loosely similar to the way neurons in the human brain respond to visual stimuli.
A typical example used with machine learning is to train the model on photos of cats and dogs, of which there are plentiful sources on the Internet.
One might choose to gather 50 or more random photos of cats, upload them into the model and tag them as “CAT”, then do the same for “DOG”. As the training progresses, the model can be tested by giving it an image it hasn’t seen before and asking it to classify the subject.
As more images are added and tagged, the model can be retrained and tuned to produce more accurate results over time. Equally, incorrectly tagging images can reduce the effectiveness of the model.
Selection of the training images can have an effect on the results. If the model is trained with all-white cats and all-black dogs, then there’s a chance it may misclassify a black cat as a dog, particularly if the two have other visual characteristics in common.
Fig.2 Visual characteristics and similarities – can you see why it might be classified as a dog?
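The effect of a skewed training set can be shown with a deliberately simple stand-in for a real model. The sketch below is not a CNN and not how Custom Vision works: it is a toy nearest-centroid classifier, and every "photo" is reduced to two made-up features (fur brightness and snout length). It is only meant to illustrate the tag, train, test and retrain cycle described above.

```python
import math

def train(tagged_examples):
    """Build a toy model: the average feature vector (centroid) for each tag."""
    sums, counts = {}, {}
    for features, tag in tagged_examples:
        totals = sums.setdefault(tag, [0.0] * len(features))
        for i, value in enumerate(features):
            totals[i] += value
        counts[tag] = counts.get(tag, 0) + 1
    return {tag: [t / counts[tag] for t in totals] for tag, totals in sums.items()}

def classify(model, features):
    """Return the tag whose centroid lies closest to the given features."""
    return min(model, key=lambda tag: math.dist(model[tag], features))

# Made-up features per photo: (fur brightness, snout length), both 0..1.
white_cats = [((0.90, 0.20), "CAT"), ((0.95, 0.25), "CAT"), ((0.85, 0.15), "CAT")]
black_dogs = [((0.10, 0.80), "DOG"), ((0.05, 0.75), "DOG"), ((0.15, 0.85), "DOG")]
black_cats = [((0.10, 0.20), "CAT"), ((0.05, 0.25), "CAT"), ((0.15, 0.15), "CAT")]

unseen_black_cat = (0.12, 0.22)  # not part of any training set

# Trained only on white cats and black dogs, the model leans on fur colour.
biased_model = train(white_cats + black_dogs)
print(classify(biased_model, unseen_black_cat))

# Retrained with black cats included, the same photo is tagged correctly.
retrained_model = train(white_cats + black_cats + black_dogs)
print(classify(retrained_model, unseen_black_cat))
```

With the biased training set the unseen black cat comes back as "DOG"; after retraining with a more representative set it comes back as "CAT".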
It is now quite easy to train a model to reliably recognise breeds of cat or dog by separating out the images and classifying them by breed. Obviously, those breeds that look visually similar to other breeds (or are a bitsa) are likely to be misclassified, but even humans may have a hard time getting it right every time.
When we’re dealing with seeds, we’re frequently talking about objects smaller than a pea, and in many cases only a millimetre or two across. This raises a new set of challenges for us to solve in both the image acquisition and classification stages.
A recent model iPhone camera can take a picture of a seed without any magnifier attachment, but it is quite challenging to get the subject up close and in focus. Most of the time, the photos will appear as a background with a speck in the middle. Not exactly perfect for image classification!
Some of these seeds are tiny and we want to extract as many definitive features as possible so that the machine learning algorithm has a chance of getting it right.
Here’s where we needed to invest in a little magnification.
As luck would have it, magnifying lenses for mobile phones are cheap and can actually produce surprisingly good results. We’ll talk more about the specific devices and techniques we used in a later blog article.
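A back-of-the-envelope calculation shows why magnification matters. The numbers below are assumptions chosen for illustration, not measurements from our devices: an image 4000 pixels wide, a scene about 200mm wide without a lens, and about 20mm wide with one.

```python
def seed_width_in_pixels(seed_mm, frame_width_mm, image_width_px):
    """Pixels spanned by a seed when frame_width_mm of the scene fills the image width."""
    return seed_mm * image_width_px / frame_width_mm

IMAGE_WIDTH_PX = 4000  # assumed image width, not a measured value

# A 1mm seed with and without a magnifying lens (assumed frame widths).
unaided = seed_width_in_pixels(1, frame_width_mm=200, image_width_px=IMAGE_WIDTH_PX)
magnified = seed_width_in_pixels(1, frame_width_mm=20, image_width_px=IMAGE_WIDTH_PX)

print(unaided, magnified)
```

Under these assumptions the seed spans about 20 pixels unaided and about 200 pixels when magnified, which leaves far more surface detail for a model to learn from.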
Fig.3 Visual characteristics of Eggplant and Tomato seeds (~2mm diameter)
In the examples above, the Eggplant and Tomato seeds are roughly the same size (approx. 2-3mm in diameter). Up close there are striking differences and by magnifying these, we have a better chance of training our model to classify the seeds correctly.
However, we may also need to magnify the images when we are trying to classify a seed because the same characteristics need to be visible for a good match.
In some cases, scale can become a bit of a problem.
Take the Chives and Australian Grass Tree examples below. The chive seeds are approximately 1mm in length, whereas the Australian Grass Tree seeds are about 8mm in length. However, when you magnify the seeds to maximise the level of detail, you still need to frame the subject, and in doing so you lose any sense of scale. The two are visually quite similar and can therefore be misclassified.
Fig.4 Visual similarities of Chives (~1mm) and Australian Grass Tree (~8mm)
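The loss of scale comes down to simple proportions. In the sketch below the frame widths and the 4000-pixel image width are assumptions for illustration: each seed is framed so that it fills 80% of the picture.

```python
def framed_length_px(seed_mm, frame_width_mm, image_width_px=4000):
    """Length of the seed in pixels once frame_width_mm of scene fills the image."""
    return seed_mm * image_width_px / frame_width_mm

# Each seed is framed to fill 80% of the image width (assumed framing).
chive = framed_length_px(1, frame_width_mm=1.25)     # ~1mm seed, 1.25mm frame
grass_tree = framed_length_px(8, frame_width_mm=10)  # ~8mm seed, 10mm frame

print(chive, grass_tree)
```

Both seeds end up the same length in pixels, so nothing in the image itself tells the model that one is eight times larger than the other.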
Going smaller, some of the seeds we have started to look at are ½mm across, and it’s tricky to handle these without a) damaging them, or b) breathing them in when you get too close during photography.
These seeds have characteristics associated with wind dispersal and may have a pappus attached. These feathery parachutes break off easily, so our collection of training images may include some seeds with the pappus still attached and some without. This is important because the visual characteristics differ with and without it, leading to another potential cause of misclassification.
In the next article, we’ll talk about image acquisition and how we trained a Microsoft Custom Vision model to tackle the microscopic world.
About the Tools
Microsoft have a suite of cognitive services that include computer vision, speech analysis, semantics, language, handwriting and text recognition. In particular, we are using Custom Vision for Image Classification.
These tools are available to try for free and we were able to utilise the trial version to build our prototype.
Contact us to find out more about this application or how we could help you develop a Microsoft Custom Vision application for your organisation.