Handwritten chemical scheme recognition system

Basic information

01/

Customer

Chemrar is a Russian pharmaceutical company.

Objective

The main goal is to develop software that recognises handwritten chemical schemes.

Chemrar employs many chemists who:

regularly design new molecules

redraw molecules from scientific articles

Request from the scientists:

They want to digitise their handwritten sketches quickly.

A chatbot that returns the digital representation of a molecule.

Outcome

The client was satisfied with the model's results.

How it works

02/

The chemist draws the molecule

Takes a picture of the drawing

Sends it to the Telegram bot

Gets the digital representation of the molecule

Format

03/

SMILES (Simplified Molecular Input Line Entry System) is a widely accepted notation for describing the composition and structure of a chemical substance.

A SMILES string makes it easy to restore the molecule unambiguously in a molecule editor.

Our program works on arbitrary molecules, both existing and newly invented. It does not rely on a database of known molecules.
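To illustrate how a SMILES string encodes atoms and bonds, here is a minimal sketch of a reader for a small subset of the notation. It is not the project's code. It assumes plain organic atoms, explicit single, double and triple bonds, branches, and single-digit ring closures; brackets, charges, aromatic lowercase atoms and stereochemistry are deliberately not supported. The name parse_smiles is hypothetical.

```python
import re

# Tokens of the supported subset (assumption: no brackets, charges, aromatic atoms, stereo).
TOKEN = re.compile(r"Cl|Br|[BCNOSPFI]|[-=#]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_smiles(smiles):
    """Return (atoms, bonds); bonds are (i, j, order) tuples over atom indices."""
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring digit -> (atom index, bond order given at opening)
    prev = None         # atom that the next atom will be bonded to
    pending = None      # explicit bond order waiting for its second atom
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported SMILES syntax at position {pos}")
        tok = match.group()
        pos = match.end()
        if tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        elif tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok.isdigit():
            if tok in open_rings:
                start, stored = open_rings.pop(tok)
                bonds.append((start, prev, pending or stored or 1))
            else:
                open_rings[tok] = (prev, pending)
            pending = None
        else:
            atoms.append(tok)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, pending or 1))
            prev, pending = idx, None
    return atoms, bonds
```

For example, parse_smiles("CC(=O)O") (acetic acid) yields four atoms and three bonds, one of them double.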

About numbers

04/

Year of development

Duration of development (months)

Model training dataset (molecules)

Against a declared target quality of 70%, we achieved 80%, despite having only a small amount of initial data.

The feature of skill

The quality criterion

05/

A molecule counts as correct only if it is recognised completely correctly.

According to the development results:

Correctly assembled molecules

Almost right

Wrong
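The strict criterion above, where only a fully correct molecule counts, can be sketched as an exact-match rate. This is an illustration, not the project's evaluation code. It assumes that both the predicted and the reference SMILES strings have already been brought to one canonical form by a cheminformatics toolkit, because the same molecule can be written as several different SMILES strings. The name exact_match_rate is hypothetical.

```python
def exact_match_rate(predicted, expected):
    """Share of molecules whose predicted SMILES equals the reference exactly.

    Assumption: both lists hold canonical SMILES strings in the same order.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)
```

A molecule with a single wrong atom or bond counts as a miss under this measure, which is what makes the criterion strict.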

The development phases

06/

segmentation

assembly of the molecule

extraction of atoms

extraction of bonds

checking the connections between atoms

construction
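The assembly phase can be sketched as follows, purely as an illustration. The sketch assumes that the extraction steps yield atom centres and bond endpoints in image coordinates; this data layout, the distance threshold max_dist, and the names attach_bonds and is_connected are hypothetical and do not come from the project. Each bond endpoint is attached to the nearest atom, and a breadth-first search then checks that all atoms form one connected molecule.

```python
import math
from collections import deque


def attach_bonds(atoms, bonds, max_dist=30.0):
    """Map each bond endpoint to the index of the nearest atom.

    atoms: list of (symbol, x, y); bonds: list of ((x1, y1), (x2, y2), order).
    Returns a list of (i, j, order) edges over atom indices.
    """
    def nearest(point):
        best = min(range(len(atoms)), key=lambda i: math.dist(point, atoms[i][1:]))
        if math.dist(point, atoms[best][1:]) > max_dist:
            raise ValueError(f"no atom close enough to bond endpoint {point}")
        return best

    edges = []
    for start, end, order in bonds:
        i, j = nearest(start), nearest(end)
        if i == j:
            raise ValueError("both ends of a bond map to the same atom")
        edges.append((i, j, order))
    return edges


def is_connected(n_atoms, edges):
    """Breadth-first check that every atom is reachable from atom 0."""
    if n_atoms == 0:
        return False
    neighbours = {i: [] for i in range(n_atoms)}
    for i, j, _ in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    seen, queue = {0}, deque([0])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == n_atoms
```

If an endpoint has no atom nearby or the graph falls apart into several pieces, the sketch refuses to build the molecule instead of guessing.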

The challenges we faced

07/

Sometimes the bond between two atoms is drawn very short, so the two neighbouring atoms can merge into one on the segmentation mask.

Benzene rings contain alternating single and double bonds. If the model misclassified even one bond in such a ring, the ring came out wrong, so we added a post-processing step that restores benzene rings.
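One possible form of such a post-processing step is sketched below. The actual rule used in the project is not described here, so this is only an assumption: given the bond orders detected around a six-membered ring that is believed to be benzene, the sketch picks whichever of the two alternating patterns agrees with the most detected bonds. The name restore_benzene is hypothetical.

```python
def restore_benzene(ring_orders):
    """Return the alternating single/double pattern closest to the detected one.

    ring_orders: six bond orders listed in order around the ring.
    Assumption: the caller has already decided that the ring should be benzene.
    """
    if len(ring_orders) != 6:
        raise ValueError("expected bond orders for a six-membered ring")
    candidates = ([1, 2, 1, 2, 1, 2], [2, 1, 2, 1, 2, 1])

    def agreement(pattern):
        # Number of positions where the pattern matches the detection.
        return sum(a == b for a, b in zip(pattern, ring_orders))

    return list(max(candidates, key=agreement))
```

With one misclassified bond, for example [1, 2, 1, 2, 1, 1], the first pattern matches five of the six positions and is chosen.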

Some elements were too rare in the training sample, so they were poorly detected in testing.

Different annotators used different sets of classes.

If the atoms or bonds are recognised incorrectly, the molecule cannot be built.

Interesting fact

We presented this project at the Big Data Conference.

Project team

08/

Emil Magerromov, Data scientist

Artem Kondyukov, Data scientist

Vlad Vinogradov, Data scientist

Where it may be used

09/

Drawing

Electrical circuitry

Area maps

Contact us