Chemical compounds are extracted from patent and non-patent full-text literature. In text, they are recognized by their IUPAC name, semi-trivial name, trade name, company code, INN (International Nonproprietary Name), CAS number (Chemical Abstracts Service) and other identifiers. Similarly, we also extract compounds from published images, chemical structure files and public compound databases. All extracted compounds with a defined chemical structure are moved into a master database of unique compounds, where uniqueness is validated against the InChI (International Chemical Identifier) and each compound is assigned a unique structure ID.
The result is a database of 70 million unique structures. Substances without a defined chemical structure but with a clear definition are stored in a second master substance database. A third database, of about 450 million known chemical terms, connects compound and substance identifiers with the terms as they are found in documents.
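The interplay between the master compound database and the name database can be illustrated with a minimal sketch. It is not the actual implementation: the class `MasterCompoundDB` and its methods are hypothetical names, storage is a pair of in-memory dictionaries, only compounds (not substances) are covered, and the InChI strings are supplied as inputs rather than computed from structures.

```python
from dataclasses import dataclass, field


@dataclass
class MasterCompoundDB:
    """Hypothetical in-memory stand-in for the master and name databases."""

    _id_by_inchi: dict = field(default_factory=dict)   # InChI -> structure ID
    _ids_by_term: dict = field(default_factory=dict)   # term -> set of IDs

    def register(self, inchi: str) -> int:
        """Return the structure ID for an InChI, allocating one if unseen."""
        if inchi not in self._id_by_inchi:
            self._id_by_inchi[inchi] = len(self._id_by_inchi) + 1
        return self._id_by_inchi[inchi]

    def add_term(self, term: str, inchi: str) -> int:
        """Link a term found in a document to the compound it denotes."""
        structure_id = self.register(inchi)
        self._ids_by_term.setdefault(term.casefold(), set()).add(structure_id)
        return structure_id

    def lookup(self, term: str) -> set:
        """Return all structure IDs known for a term (case-insensitive)."""
        return set(self._ids_by_term.get(term.casefold(), set()))


PYRIDINE = "InChI=1S/C5H5N/c1-2-4-6-5-3-1/h1-5H"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

db = MasterCompoundDB()
a = db.add_term("pyridine", PYRIDINE)
b = db.add_term("azabenzene", PYRIDINE)  # synonym: same InChI, same ID
c = db.add_term("benzene", BENZENE)      # different InChI, new ID
```

Because the InChI is the key, two names for the same structure collapse to one structure ID, while each name remains searchable on its own.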
A broad range of semantic natural language modules (cognitive processing) recognize different types of chemical terms, for example:
- A class/group/compound module analyzes whether a particular chemical term represents a class, a group or a defined compound, depending on the context of the term, e.g. “imidazole group” or “imidazole compounds”.
- Coordinated entities are typical of patents; for example, the phrase “2-chloro-, 2-bromo- and 2-iodo-pyridine” is resolved to three specific compounds (2-chloropyridine, 2-bromopyridine and 2-iodopyridine).
- Acronyms, abbreviations and labels are often used in scientific literature; we use a state-of-the-art module to resolve these types of chemical terms.
- Anaphoric expressions are typically underdetermined chemical class terms; for example, “these phenols” represents an ad hoc chemical class defined by the specific compounds named in a document.
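The resolution of coordinated entities in the list above can be sketched as follows. This is a toy illustration under strong assumptions, not the module described here: the function name `expand_coordination` is hypothetical, and the code only handles the simple pattern in which every conjunct but the last is a hyphen-terminated prefix and the shared parent name follows the final hyphen.

```python
from __future__ import annotations

import re


def expand_coordination(phrase: str) -> list[str]:
    """Expand 'X-, Y- and Z-parent' into ['Xparent', 'Yparent', 'Zparent'].

    Phrases that do not match this simple pattern are returned unchanged
    as a single-element list.
    """
    original = phrase.strip()
    # Split on commas and on the standalone words 'and' / 'or'.
    parts = [p.strip() for p in re.split(r",|\band\b|\bor\b", phrase)]
    parts = [p for p in parts if p]
    if len(parts) < 2:
        return [original]

    *prefixes, last = parts
    # Every leading conjunct must be a dangling prefix such as '2-chloro-'.
    if "-" not in last or not all(p.endswith("-") for p in prefixes):
        return [original]

    # The shared parent name is whatever follows the last hyphen.
    last_prefix, head = last.rsplit("-", 1)
    if not head:
        return [original]

    prefixes.append(last_prefix + "-")
    # Drop the dangling hyphen and attach the shared parent name.
    return [p[:-1] + head for p in prefixes]


names = expand_coordination("2-chloro-, 2-bromo- and 2-iodo-pyridine")
```

Real patent text needs far more than this, for example locant-aware parsing and validation of each generated name against known structures; the sketch only shows the basic idea of distributing a shared parent name over several prefixes.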
Together, these modules identify chemical terms with their correct meaning and with higher precision and recall than is possible in other patent search engines.
Note that genes, DNA and RNA, proteins and peptides are organized in a separate knowledge domain, which is annotated separately.