Handwritten chemical scheme recognition system
Chemrar is a Russian pharmaceutical company.
The goal was to develop a program that recognises handwritten chemical schemes.
Chemrar employs many chemists, who:
regularly develop new molecules
redraw molecules from scientific articles
Request from the scientists:
They want to digitise their handwritten sketches quickly.
A chatbot that returns the digital representation of a molecule.
The client was satisfied with the result of the model.
The chemist draws the molecule
Takes a picture of the drawing
Sends it to the Telegram bot
Gets the digital representation of the molecule
The SMILES representation is a generally accepted specification of the composition and structure of a chemical substance.
A SMILES string makes it easy to restore the molecule unambiguously in a molecule editor.
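For illustration, here are a few well-known molecules with their SMILES strings, written as a short Python snippet. These examples are ours and are not taken from the project's data; `count_heavy_atoms` is a hypothetical helper that only handles the simple single-letter atoms used here.

```python
# Illustrative SMILES strings for well-known molecules (not project data).
EXAMPLES = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}


def count_heavy_atoms(smiles: str) -> int:
    """Count non-hydrogen atoms in a simple SMILES string.

    Assumption: only single-letter atoms appear (C, N, O or aromatic c, n, o),
    with no bracket atoms and no two-letter elements such as Cl.
    """
    return sum(1 for ch in smiles if ch.isalpha())


for name, smiles in EXAMPLES.items():
    print(f"{name}: {smiles} ({count_heavy_atoms(smiles)} heavy atoms)")
```

Lowercase letters mark aromatic atoms, digits open and close rings, and parentheses mark branches, which is why a single line of text is enough to redraw the whole structure.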
Our program works on arbitrary molecules, both existing and newly invented. It is not tied to a database of known molecules.
Year of development
Duration of development
Model training dataset
Against a declared target quality of 70%, we achieved 80%, despite having only a small amount of initial data.
The feature of skill
Quality criterion
A molecule counts as correct only if it is recognised completely correctly.
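A criterion of this kind can be sketched as exact-match accuracy over a test set. The snippet below is a minimal illustration, not the project's evaluation code: the function name and the toy data are hypothetical, and it assumes that predicted and reference SMILES strings are already written in the same canonical form so that plain string equality is a fair comparison.

```python
def exact_match_accuracy(predicted, reference):
    """Share of molecules whose predicted SMILES equals the reference exactly.

    Assumption: both lists hold SMILES strings in the same canonical form.
    A failed recognition can be passed as None and counts as a miss.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(1 for p, r in zip(predicted, reference) if p is not None and p == r)
    return hits / len(reference)


# Toy data for illustration only, not project results.
reference = ["CCO", "CC(=O)O", "c1ccccc1", "CCN", "CO"]
predicted = ["CCO", "CC(=O)O", "c1ccccc1", "CCO", None]
print(exact_match_accuracy(predicted, reference))
```

Under this metric a molecule with a single wrong atom or bond scores zero, which makes it stricter than per-atom or per-bond accuracy.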
According to the results of development:
The correctly assembled molecules
The development phases
assembly of molecule
extraction of atoms
extraction of bonds
checking the connections between atoms
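The phases above can be sketched as a small assembly routine. This is a simplified illustration under our own assumptions, not the project's implementation: atoms are assumed to arrive as (element, x, y) detections, bonds as two endpoint coordinates plus an order, and the names `nearest_atom`, `assemble` and `max_dist` are hypothetical. Each bond endpoint is snapped to the closest atom, and a basic valence check rejects impossible molecules.

```python
import math

# Typical maximum valences for a few common elements (standard chemistry).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def nearest_atom(point, atoms, max_dist):
    """Return the index of the atom closest to `point`, or None if too far."""
    best, best_d = None, max_dist
    for i, (_, x, y) in enumerate(atoms):
        d = math.hypot(point[0] - x, point[1] - y)
        if d <= best_d:
            best, best_d = i, d
    return best


def assemble(atoms, bonds, max_dist=15.0):
    """Build a list of (i, j, order) edges from detections, or None on failure.

    atoms: list of (element, x, y)
    bonds: list of ((x1, y1), (x2, y2), order)
    """
    edges = []
    used = [0] * len(atoms)
    for p1, p2, order in bonds:
        i = nearest_atom(p1, atoms, max_dist)
        j = nearest_atom(p2, atoms, max_dist)
        if i is None or j is None or i == j:
            return None  # an endpoint does not reach two distinct atoms
        edges.append((i, j, order))
        used[i] += order
        used[j] += order
    # Check the connections: no atom may exceed its maximum valence.
    for (element, _, _), total in zip(atoms, used):
        limit = MAX_VALENCE.get(element)
        if limit is not None and total > limit:
            return None
    return edges


# Hypothetical detections for the C-C-O skeleton of ethanol.
atoms = [("C", 0, 0), ("C", 50, 0), ("O", 100, 0)]
bonds = [((3, 1), (48, -2), 1), ((52, 1), (97, 0), 1)]
print(assemble(atoms, bonds))
```

Returning None instead of a partial graph mirrors the all-or-nothing quality criterion: a molecule is either assembled completely or not at all.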
The challenges we faced
Sometimes the bond between two atoms is very short, so the two neighbouring atoms can merge into one on the segmentation mask.
Benzene rings contain alternating single and double bonds. If our model determined the type of even one bond incorrectly, the ring came out wrong, so we had to add post-processing to restore the benzene ring.
Different annotators used different sets of classes.
If the atoms or bonds are recognised incorrectly, the molecule cannot be built.
Some elements were too rare in the training sample, so they were poorly detected in testing.
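The benzene post-processing mentioned above could look roughly like the following. This is a minimal sketch under our own assumptions, not the project's actual rule: it assumes the six bonds of a candidate ring are already listed in order around the ring, and the function name and the `min_agreement` threshold are hypothetical. The predicted orders are snapped to the closer of the two alternating patterns, but only when at most one bond disagrees.

```python
def restore_benzene_ring(orders, min_agreement=5):
    """Snap six predicted bond orders to the closest alternating pattern.

    orders: bond orders (1 or 2) listed in order around a six-membered ring.
    Returns the corrected list, or a copy of the input when neither
    alternating pattern agrees with at least `min_agreement` bonds.
    """
    if len(orders) != 6:
        raise ValueError("expected exactly six ring bonds")
    patterns = ([1, 2, 1, 2, 1, 2], [2, 1, 2, 1, 2, 1])

    def agreement(pattern):
        return sum(a == b for a, b in zip(pattern, orders))

    best = max(patterns, key=agreement)
    return list(best) if agreement(best) >= min_agreement else list(orders)


# One double bond was misread as single: the ring is restored.
print(restore_benzene_ring([1, 2, 1, 1, 1, 2]))
# A ring with a single double bond is left unchanged.
print(restore_benzene_ring([2, 1, 1, 1, 1, 1]))
```

A heuristic like this cannot tell a benzene ring with one misread bond from a genuine ring with two double bonds, so in practice it would need additional evidence before overriding the model's prediction.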
We presented this project at the Big Data Conference.