Handwritten chemical scheme recognition system
Basic information
01/
Customer
Chemrar is a Russian pharmaceutical company.
Objective
Develop a program that recognises handwritten chemical schemes.
Chemrar employs many chemists, who:
regularly design new molecules
redraw molecules from scientific articles
Request from the scientists:
They want to digitise their handwritten sketches quickly.
Outcome
A chatbot that returns a digital representation of the drawn molecule.
The client was satisfied with the model's results.
How it works
02/
The chemist draws the molecule
Takes a photo of the drawing
Sends the photo to the Telegram bot
Receives the digital representation of the molecule
Format
03/
SMILES notation is a generally accepted specification of the composition and structure of a chemical substance.
A SMILES string makes it easy to restore the molecule unambiguously in a molecular editor.
Our program works on arbitrary molecules, both existing and newly invented, without relying on a database of known molecules.
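To make the format concrete, here is a minimal sketch with a few standard SMILES strings and a rough structural check. The example molecules are textbook ones, not project data, and the helper `looks_well_formed` is a hypothetical name introduced only for illustration. It is a simplified check, not a full SMILES parser.

```python
# Well-known SMILES strings (standard notation, not taken from the project data).
EXAMPLES = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
}


def looks_well_formed(smiles: str) -> bool:
    """Rough structural check: balanced branches and paired ring-closure digits.

    Simplified on purpose: the contents of bracket atoms such as [NH4+] are
    skipped, and two-digit ring closures (%10) are not supported.
    """
    depth = 0            # current branch nesting level
    open_rings = set()   # ring-closure digits still waiting for their partner
    in_bracket = False   # True while inside [...]
    for ch in smiles:
        if in_bracket:
            in_bracket = ch != "]"
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            # First occurrence opens a ring, second occurrence closes it.
            open_rings ^= {ch}
    return depth == 0 and not open_rings and not in_bracket
```

A check like this could reject obviously broken output before it is sent back to the chemist. A real system would more likely rely on a cheminformatics toolkit for validation.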
About numbers
04/
Year of development
Duration of development (months)
Model training dataset (molecules)
Against a target quality of 70%, we achieved 80%, despite having only a small amount of initial data.
Key result
Quality criterion
05/
A molecule counts as correct only if it is recognised completely correctly.
Development results:
Correctly assembled molecules
Almost correct
Wrong
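The criterion above (a molecule counts only when it is recognised completely) can be sketched as an exact-match rate. This is an illustrative sketch, not the project's evaluation code: `exact_match_rate` is a hypothetical helper, the sample strings are made up, and it assumes predicted and reference SMILES were canonicalised the same way, since one molecule can be written as several valid SMILES strings.

```python
def exact_match_rate(predicted, reference):
    """Share of molecules whose predicted SMILES equals the reference exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Made-up sample: three of four predictions match their reference exactly.
predicted = ["CCO", "c1ccccc1", "CC(=O)O", "CO"]
reference = ["CCO", "c1ccccc1", "CC(=O)O", "CC"]
rate = exact_match_rate(predicted, reference)
```

A single wrong atom or bond makes the whole molecule count as a miss, which is why this metric is stricter than per-atom or per-bond accuracy.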
Development phases
06/
Segmentation
Assembly of the molecule
Extraction of atoms
Extraction of bonds
Checking the connections between atoms
Construction
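The assembly phase can be illustrated with a minimal sketch that attaches each detected bond to the atoms nearest its endpoints. Everything here is an assumption made for illustration: the `Atom` and `Bond` types, the `assemble` function, the pixel coordinates and the 15-pixel distance threshold are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Atom:
    element: str
    x: float
    y: float


@dataclass
class Bond:
    # Endpoints of the detected bond segment, in image coordinates.
    start: tuple
    end: tuple
    order: int  # 1 = single, 2 = double, 3 = triple


def nearest_atom(point, atoms, max_distance):
    """Index of the atom closest to point, or None if none is near enough."""
    best_index, best_distance = None, max_distance
    for index, atom in enumerate(atoms):
        d = dist(point, (atom.x, atom.y))
        if d <= best_distance:
            best_index, best_distance = index, d
    return best_index


def assemble(atoms, bonds, max_distance=15.0):
    """Return edges (i, j, order), or None when a bond cannot be attached."""
    edges = []
    for bond in bonds:
        i = nearest_atom(bond.start, atoms, max_distance)
        j = nearest_atom(bond.end, atoms, max_distance)
        if i is None or j is None or i == j:
            return None  # a dangling bond: the molecule cannot be built
        edges.append((min(i, j), max(i, j), bond.order))
    return edges


# Hypothetical detections for a C-C-O chain drawn left to right.
atoms = [Atom("C", 0, 0), Atom("C", 50, 0), Atom("O", 100, 0)]
bonds = [Bond((3, 1), (48, -2), 1), Bond((52, 1), (97, 0), 1)]
edges = assemble(atoms, bonds)
```

The resulting edge list is a molecular graph from which a SMILES string could be generated. Returning `None` for a dangling bond reflects a strict design choice: a partially connected graph is treated as a failed recognition rather than a guess.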
The challenges we faced
07/
Sometimes the edge between two atoms is very short, so two neighbouring atoms can merge into one on the segmentation mask.
Benzene rings contain alternating single and double bonds. If the model misclassified even one bond type, the ring came out wrong, so we added post-processing to restore the benzene ring.
Different markers – different sets of classes.
If atoms or bonds are recognised incorrectly, the molecule cannot be built.
Some elements were too rare in the training sample, so they were poorly detected in testing.
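The benzene post-processing mentioned above could look roughly like the sketch below. It is not the project's actual rule: `restore_benzene_ring` and its `max_errors` threshold are hypothetical, and the sketch assumes the six-membered ring has already been found and its bond orders are listed in order around the ring.

```python
def restore_benzene_ring(bond_orders, max_errors=1):
    """Snap the six bond orders of a ring to an alternating 1/2 pattern.

    The list is returned unchanged when it is not close enough to either
    alternating pattern, so rings that are clearly not benzene are left alone.
    """
    if len(bond_orders) != 6:
        return list(bond_orders)
    patterns = ([1, 2, 1, 2, 1, 2], [2, 1, 2, 1, 2, 1])

    def mismatches(pattern):
        return sum(a != b for a, b in zip(pattern, bond_orders))

    best = min(patterns, key=mismatches)
    return list(best) if mismatches(best) <= max_errors else list(bond_orders)


# One double bond was misread as single; the alternation is restored.
fixed = restore_benzene_ring([1, 2, 1, 1, 1, 2])
```

The threshold is a trade-off. A ring such as cyclohexa-1,3-diene differs from benzene by exactly one bond, so a rule this simple would overcorrect it, and a production system would need more evidence before rewriting a ring.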
We presented this project at the Big Data Conference.