Smart algorithm lets investigators search through thousands of photographs faster

While investigating serious crimes, the police often confiscate data carriers such as mobile telephones, computers and hard disks in order to hunt for incriminating evidence. The quantities of data are often so vast that it is impossible to search through every data carrier by hand. To solve this problem, young NFI data scientists have developed a self-learning algorithm that makes it possible to quickly retrieve specific images from large volumes of data. For example, the algorithm can recognise weapons or drugs in photographs, but also text, such as number plates or account numbers on stolen bank cards.

‘Trawling through thousands of images by hand and writing down the account number in every photograph with a bank card is impossible.’

Almost all of us will have received a phishing message claiming that our bank card is about to expire. If you click on the link in the message, you are redirected to a fake website that solicits your details. Criminals can use those details to apply for a new bank card in your name. ‘Criminals often take photographs of these “stolen” cards. Such photographs of bank cards sometimes number well into the thousands,’ one of the data scientists explains. Both the police and the banks are keen to find out which customers fall victim to this type of fraud. ‘However, trawling through thousands of images by hand and physically writing down the account number and name in every photograph with a bank card is impossible.’

To solve this problem, data scientists have developed a self-learning algorithm to search through photographic material that may be of interest to forensic investigators. The idea arose a few years ago, when police investigating a drugs case seized numerous data carriers containing large numbers of images. Investigators had to go through the photographs one by one in the hope of finding images of shipping containers, because the criminals had taken photos of the containers to ‘prove’ to each other where the drugs had been hidden. Such cases can involve up to half a million images, which means sheer drudgery for the investigators involved.

Image of stacked shipping containers
Investigators no longer need to scan thousands of photographs for shipping containers by hand – FIRE speeds up the job considerably.

Needle in a haystack

There must be a more efficient way of doing this, thought the data scientists at the Netherlands Forensic Institute (NFI), and so they developed the Forensic Image Recognition Engine (FIRE) software library. Working on the basis of existing models, the experts created a machine learning algorithm that, after ‘training’, is able to find the proverbial needle in the haystack. Using artificial intelligence is not just interesting in drugs cases involving shipping containers but can also be valuable for many other police investigations. Above all, automating the search process saves massive amounts of time and frees up police officers for other tasks.

In recent years, the data scientists have focused mainly on gathering vast amounts of training data to teach the algorithm to recognise other images of interest to forensic investigators as well. Aside from shipping containers and bank cards, these include firearms, marijuana and hard drugs such as lines of coke. As mentioned, it is now also possible to recognise text in images, such as account numbers and personal details on driving licences. The system learns to recognise such images by being shown a large number of examples. The algorithm looks for specific external features. In the case of marijuana, for example, these are the distinctive green, furry pellets.
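How FIRE reads text from images is not described here, but the step that follows text recognition can be sketched. The minimal Python example below assumes the recognised text is already available as a string and that the account numbers are Dutch IBANs; the names `NL_IBAN`, `is_valid_iban` and `extract_ibans` are hypothetical and not part of FIRE. It keeps only candidates that pass the standard IBAN mod-97 checksum, which filters out many misread digits.

```python
import re

# Assumption: Dutch IBAN layout (NL, 2 check digits, 4-letter bank code,
# 10 digits), optionally printed in groups separated by spaces.
NL_IBAN = re.compile(r"NL\d{2}\s?[A-Z]{4}(?:\s?\d){10}")


def is_valid_iban(iban: str) -> bool:
    """Check the mod-97 checksum of an IBAN written without spaces."""
    # Move the country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # Letters map to numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def extract_ibans(recognised_text: str) -> list[str]:
    """Return the checksum-valid Dutch IBANs found in recognised text."""
    found = []
    for match in NL_IBAN.finditer(recognised_text.upper()):
        candidate = re.sub(r"\s", "", match.group())
        if is_valid_iban(candidate):
            found.append(candidate)
    return found
```

Calling `extract_ibans` on the text recognised in each photograph would give a list of account numbers for a person to verify, instead of writing each one down by hand.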

Marijuana in a plastic bag
‘A person recognises marijuana for what it is, while a computer learns to identify it by being shown a large number of examples.’

Guacamole

‘There are a couple of things you do need to take into account’, says one of the data scientists. ‘One good example is that, if the algorithm has never seen guacamole, it will always tell you that it’s marijuana. If you look through half-closed eyes, you can see that the green colour and the structure really are comparable.’ Similarly, after the experts had trained the system to recognise shipping containers, searches of large data volumes also returned images of fences above water. The system had wrongly recognised the striped pattern as a shipping container.
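The guacamole effect can be illustrated with a toy example. The sketch below is not how FIRE works: it is a simple nearest-example classifier with made-up feature values (greenness, graininess, stripes), and all names in it are hypothetical. It shows that a system which only knows a fixed set of categories has to assign every image to one of them.

```python
import math

# Hypothetical hand-made features per class: (greenness, graininess, stripes).
KNOWN_CLASSES = {
    "marijuana": (0.8, 0.9, 0.1),
    "shipping container": (0.2, 0.1, 0.9),
}


def classify(features, classes):
    """Return the label whose reference vector is closest to `features`."""
    return min(classes, key=lambda label: math.dist(features, classes[label]))


# A green, fairly grainy photo without stripes (made-up values).
guacamole_photo = (0.9, 0.6, 0.0)

# Without a guacamole class, the closest known label wins.
print(classify(guacamole_photo, KNOWN_CLASSES))  # marijuana

# Adding guacamole examples gives the model a better alternative.
extended = {**KNOWN_CLASSES, "guacamole": (0.9, 0.5, 0.0)}
print(classify(guacamole_photo, extended))  # guacamole
```

The same reasoning applies to the fences: striped patterns sit closest to the container examples until the system has also been shown fences.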

The difference between a computer and a person is that, when a person sees a photo of marijuana, for instance, they will immediately understand the context. In contrast, a computer does not learn to recognise marijuana by specifically understanding what it is but by being given a large number of examples. ‘If you provide poor examples, such as lots of photos of marijuana on a white table, the algorithm will regard the white table as part of the object’, the expert explains. A large amount of training data is the key to success. The more data, the better the system is able to recognise the differences. The experts obtain the data they use for training the algorithm from real criminal cases and from collections of images.

Hansken search engine

FIRE, the software library for images, has now been incorporated into the Hansken forensic search engine. Before this happened, the police could only search Hansken for text. Now, when investigators tick the box for firearms, they are shown a list of images that the programme thinks have the highest chance of showing a firearm. Police officers can conduct their own searches for specific image types in the search engine. The NFI data scientists just develop the technology, train the models and ensure that the various image categories are available in Hansken. They stress that all the methods and systems they develop are nothing more than useful tools and will never be able to replace human eyes entirely. ‘Human checks will remain key.’
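Ranking images by their likelihood of showing a firearm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `rank_images`, its parameters and the score values are hypothetical, and Hansken's actual interface is not described in this article. The sketch assumes a trained model has already assigned each image a score between 0 and 1.

```python
def rank_images(scores, top_k=3, min_score=0.0):
    """Return (image, score) pairs sorted by score, highest first."""
    kept = [(name, score) for name, score in scores.items() if score >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Made-up scores for the category "firearm".
firearm_scores = {
    "IMG_0001.jpg": 0.12,
    "IMG_0002.jpg": 0.97,
    "IMG_0003.jpg": 0.55,
    "IMG_0004.jpg": 0.81,
}

for name, score in rank_images(firearm_scores, top_k=3, min_score=0.5):
    print(f"{name}: {score:.2f}")
```

The list only determines the order in which a person looks at the images; the decision about what a photograph shows stays with the investigator.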

Reverse image searches

The NFI experts spend the greatest part of their time collecting data to teach the algorithm new things. With a sufficient amount of input data, the system can then be trained in the space of one day. However, the process often takes longer. ‘One thing you don’t want is to keep the police waiting for days before they can carry on with their investigation,’ the data scientist says. For this reason, the scientists are now working on a reverse image search feature.

‘For instance, this might involve uploading an image of a hypodermic needle to Hansken, after which FIRE searches the vast quantity of data for similar images.’ The reverse search feature is less effective than a search for hypodermic needles by a system specifically trained for that purpose. ‘On the plus side, the police can get to work straight away. In the meantime, we train the system to search for that particular type of image – hypodermic needles, for example.’ It is worth mentioning that Google and the Russian search engine Yandex also possess this technology. ‘However, it wouldn’t do to upload a photograph of a live investigation to their servers. This is a feature that we want to keep in-house at the NFI.’
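A common way to build a reverse image search is to represent every image as a vector of numbers and to return the images whose vectors point in the most similar direction. The Python sketch below assumes such vectors have already been computed by some model; whether FIRE works this way is not stated in this article, and the names `cosine_similarity`, `find_similar` and the example vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(query, index, top_k=2):
    """Return the names of the `top_k` images most similar to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored[:top_k]]


# Made-up feature vectors for three seized images.
index = {
    "img_a.jpg": (0.9, 0.1, 0.0),
    "img_b.jpg": (0.1, 0.8, 0.3),
    "img_c.jpg": (0.8, 0.2, 0.1),
}

# Vector of the uploaded example image (also made up).
query = (1.0, 0.1, 0.0)
print(find_similar(query, index))  # ['img_a.jpg', 'img_c.jpg']
```

Because no category-specific training is needed, a search like this can start as soon as the example image is uploaded, which matches the trade-off described above: immediate results, but less accurate than a purpose-trained model.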

At the request of the staff, they have not been named in this article.