Archive for the ‘Cheminformatics’ Category

ChemSpider data cleanup

15 Dec 2023

In previous posts, we have discussed the automated workflow we use to check new incoming data for structure and synonym errors. These checks allow us to remove the most common types of errors before they are added to the site. However, these filters do not apply to data already in ChemSpider.

Manual curation is an important part of our work. We periodically review the data on our most accessed records, in addition to ad-hoc removal or correction of erroneous data that we or our users notice when using the site. However, there are far too many records and far too much data to clean up using manual curation alone.

Recently we have focused on bulk identification and removal of erroneous data. This work has covered mapping errors and other clearly incorrect values in our experimental property data, correction or removal of malformed synonyms, correction of incorrectly labelled synonyms, and resolution of structure/synonym clashes.

Experimental Properties

We retrieved all 6.3 million experimental properties, text properties, and associated annotations from the ChemSpider database. We then compared the original text of the property as it was written in the original file to how that text was parsed and mapped by our deposition system. This enabled us to identify and correct several types of errors affecting around 2% of the properties in our database:

35,774 experimental property values had been assigned the incorrect unit (e.g. g/L instead of g/mL, °C instead of °F)
2,591 boiling points measured under non-standard pressure did not have this pressure displayed
4,292 densities had their density and temperature values swapped
79,252 miscellaneous erroneous properties and associated annotations were deleted. For example, “white crystals” mapped as melting point, impossibly high melting points or densities, etc.
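
As a rough illustration of the comparison described above (not the actual ChemSpider code), the sketch below checks the original text of a property against the unit it was mapped to. The function name, the status strings and the short list of units are all assumptions made for this example.

```python
import re

# Units recognised by this sketch. Longer alternatives come first so that
# "g/mL" is not read as "g/L". The list is illustrative only.
UNIT_PATTERN = re.compile(r"g/mL|g/L|°C|°F")
NUMBER_PATTERN = re.compile(r"\d")


def check_property(original_text: str, mapped_unit: str) -> str:
    """Compare the source text of a property with the unit it was mapped to."""
    if not NUMBER_PATTERN.search(original_text):
        # e.g. "white crystals" mapped as a melting point
        return "no numeric value"
    found = UNIT_PATTERN.search(original_text)
    if found and found.group(0) != mapped_unit:
        # e.g. the text says °F but the record says °C
        return "unit mismatch"
    return "ok"


print(check_property("white crystals", "°C"))
print(check_property("mp 250 °F", "°C"))
print(check_property("1.05 g/mL", "g/mL"))
```

A real check would also need plausibility ranges for each property type, which this sketch leaves out.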

Synonyms

Synonyms, chemical names, and identifiers are the most abundant type of data on ChemSpider, with a total of more than 446 million synonyms. These synonyms have additional metadata including language labels and flags identifying what type of synonym they are (e.g. CAS number, UNII, INN, trade name).

Simple Checks

We ran a series of regular expression string searches to identify synonyms with incorrect metadata, as well as malformed or otherwise erroneous synonyms.

200,007 synonym type flags added, and 4,766 incorrect flags removed
9,170 synonyms with an incorrect language label identified.
631,697 erroneous synonyms identified, including scrambled characters, properties/units, molecular formulae as synonyms, purity information, or invalid CAS numbers or EC numbers (formerly called EINECS).
922,334 instances of these erroneous synonyms deleted from ChemSpider records.
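
One of these checks is easy to sketch: a CAS number carries a check digit, so a regular expression for the format plus a checksum catches many invalid values. The code below is a minimal example of that idea, not the expressions we actually ran; the function name is an assumption.

```python
import re

# CAS numbers have 2-7 digits, then 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string has CAS format and a correct check digit."""
    match = CAS_PATTERN.match(candidate)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # water
print(is_valid_cas("7732-18-4"))  # same digits, wrong check digit
```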

Structure/Synonym comparison

After identifying and removing these synonym-level errors, we then cross-checked ChemSpider records and their synonyms to identify mismatches. This work included amino acids, nucleic acids, and pharmaceutically acceptable salts.

As a first pass, we compared synonyms to molecular formulae to identify records missing key elements. Examples include synonyms describing a sodium salt when the molecular formula does not contain sodium, or describing an amino acid when the molecular formula contains no nitrogen. A total of 28,194 of these synonym/formula clashes were identified and removed.
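
The formula check can be sketched in a few lines. This is a simplified illustration under stated assumptions: the keyword table holds only two example rules, the function names are invented for this example, and the formula parser ignores charges and isotope labels.

```python
import re

# An element symbol is one capital letter plus an optional lowercase letter.
ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

# Hypothetical keyword table: synonym fragment -> element that must be present.
REQUIRED_ELEMENTS = {"sodium": "Na", "hydrochloride": "Cl"}


def elements_in_formula(formula: str) -> set:
    """Return the set of element symbols found in a molecular formula."""
    return {symbol for symbol, _count in ELEMENT_TOKEN.findall(formula)}


def clashing_synonyms(synonyms: list, formula: str) -> list:
    """Return synonyms that name an element missing from the formula."""
    present = elements_in_formula(formula)
    clashes = []
    for synonym in synonyms:
        lowered = synonym.lower()
        for keyword, element in REQUIRED_ELEMENTS.items():
            if keyword in lowered and element not in present:
                clashes.append(synonym)
                break
    return clashes


# Acetic acid (C2H4O2) wrongly carrying the name of its sodium salt.
print(clashing_synonyms(["Acetic acid", "Sodium acetate"], "C2H4O2"))
```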

For records that passed this initial molecular formula check, we performed a SMARTS comparison to identify chemical structures missing key structural features described in the synonym. These SMARTS strings were written broadly, with common substitutions allowed to prevent unnecessary removal of valid synonyms from derivative compounds.

In the following examples, the mismatched part of the synonym is highlighted in bold.

Structure	Removed synonym
	Sulfate ion
	Zolpidem tartrate
	Sodium S-sulfocysteine hydrate

After identifying these clashes, we manually spot-checked the output to weed out false positives and iterate the SMARTS filters. 101,257 synonym/structure clashes were identified and removed.

These checks included the following categories:

Amino acids and their derivatives: 6 formula clashes, 56 structure clashes
Nucleic acids, nucleosides, nucleotides: 977 formula clashes, 1,870 structure clashes
Halogens: 13,437 formula clashes, 1,256 structure clashes
Alkali and alkaline earth metals, and aluminium: 3,586 formula clashes, 56 structure clashes
Carboxylic acids and their derivatives: 5,002 formula clashes, 88,501 structure clashes
Other pharmaceutically acceptable acids: 3,534 formula clashes, 1,529 structure clashes
Amides and amines: 190 formula clashes, 304 structure clashes
Deuterates, hydrates, methylbromides: 1,462 formula clashes, 7,685 structure clashes

Get involved

You are the expert in your area of chemistry, so if you see something that doesn’t look quite right, please let us know. If the error is confined to a single ChemSpider record, click the “Comment On This Record” box at the top of the affected record and let us know what the problem is. All we need is a sentence describing the error, but the more information you can provide, the better.

For more systemic errors, or in cases where you want to attach supplementary information or corrected chemical structures, please get in touch via email (chemspider@rsc.org).


ChemSpider Pre-Deposition Filters

18 Sep 2018

Written by Mark Archibald.

In a previous post (Behind the Scenes at ChemSpider) we discussed some of the challenges in upholding data quality across one of the largest chemical databases in the world. We identified automated filtering as a key tool when dealing with far more records than a human could reasonably handle. In this post we’ll go into more detail about how that filtering works, what the challenges are, and the role played by human intervention.

To perform this filtering we use KNIME, an open-source data processing platform. The wide range of KNIME nodes developed by the active cheminformatics community allows us to ask chemistry-specific questions of the data we process. In simple terms, input chemical structures that match our criteria are passed on to the next node, while those that don’t are written out to an error file. After processing all structures, the result is a file of structures that have successfully passed through all the filters and several (usually smaller) files of structures rejected for various reasons.

It’s not possible to review all of the generated files in full, as this would eliminate the time-saving advantages of automated processing. However, output files of all types are spot checked for accuracy and to iteratively improve the filtering criteria. Certain output files have high potential for false positives and so we review them in full.

Formats and identifiers

Submitted files can be in one of several different formats. The most common is SDF (structure data file, a chemical structure format containing multiple structures with associated data fields). The advantage of this format is that it contains 2- or 3-dimensional structures, so we can immediately start processing the file without having to convert an identifier to a structure. This means that the final structure we deposit is more likely to exactly match the original. The disadvantage of the SDF format is that it is specialised – many users will be unfamiliar with it or won’t have software to create and display the files.

We also receive different spreadsheet formats (Excel, CSV, TSV) with structures encoded in text-based notation systems like SMILES or InChI. The advantage of these formats is that they don’t require specialised software (provided the submitter has SMILES or InChIs for the compounds). The disadvantage is that the structures require conversion to SDF before processing and deposition to ChemSpider. Additionally, these formats contain information about atoms and their connectivity but lack layout information. This can introduce errors as different structure drawing packages can parse these structures slightly differently, resulting in alterations to the final deposited structure.

Filtering criteria

The criteria by which we judge chemical structures are a mixture of definitive chemical rules and less well-defined ‘rules of thumb’ based on our experience and chemical knowledge. Examples of both follow.

Empty structures, query atoms and incorrect valences

The first filter is the simplest – ChemSpider is a structure-centric database, so it’s not possible to deposit any input entries that lack a structure.

Similarly, each ChemSpider record requires a single defined chemical structure, so we exclude anything using a query atom to represent a variable atom or attachment point.

Another simple filter is to exclude structures in which atoms have invalid valences.

Charge imbalance

In general, entries in ChemSpider should represent a real-world, isolable compound. This means that we filter out structures with a non-zero overall charge. However, we make exceptions for certain examples where a counterion is generally unimportant and it’s useful to consider the charged species alone, such as choline (ChemSpider record).
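
For structures supplied as SMILES, the overall charge can be estimated by summing the formal charges written inside bracket atoms. The sketch below is only an illustration of the idea: our real filter runs in KNIME on full structures, and the function name here is an assumption.

```python
import re

# A bracket atom such as [Na+], [O-] or [Fe+2].
BRACKET_ATOM = re.compile(r"\[([^\[\]]+)\]")
# A charge is a run of signs ("++") or one sign followed by a number ("+2").
CHARGE = re.compile(r"(\++|-+)(\d*)")


def net_charge(smiles: str) -> int:
    """Sum the formal charges written in the bracket atoms of a SMILES."""
    total = 0
    for body in BRACKET_ATOM.findall(smiles):
        body = body.split(":")[0]  # drop an atom class such as ":1"
        match = CHARGE.search(body)
        if match:
            signs, digits = match.groups()
            magnitude = int(digits) if digits else len(signs)
            total += magnitude if signs[0] == "+" else -magnitude
    return total


print(net_charge("[Na+].[Cl-]"))       # neutral salt
print(net_charge("C[N+](C)(C)CCO"))    # choline cation
```

A non-zero result would normally mean rejection, with a curated list of exceptions such as choline checked first.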

Structures containing undefined stereocentres

Undefined stereocentres alone don’t represent a chemical error. However, structures like that shown below (cholesterol without any defined stereocentres) occur frequently and, although chemically valid, it’s extremely unlikely that they represent the intended structure.

Cholesterol skeleton with no defined stereochemistry

As a result we have a rule of thumb that excludes structures containing more than two undefined stereocentres. This is not a hard-and-fast rule, but rather an attempt to strike a balance between excluding structures like the one above and including structures where the undefined stereocentres are intentional and correct.

The count of undefined stereocentres (as determined by examining the InChI) sometimes includes cases where it is conventional to exclude stereochemical wedges. Examples include nucleic acids with no wedges on the phosphate and adamantyl groups without explicit stereochemistry – it’s unusual to draw these compounds with wedges, and users will rarely use wedges in their search. These potential false positives are filtered out and reviewed manually. A curator can then decide whether to include them in the deposition, improving the overall accuracy of the filter.

Structures containing many components

This is another rule of thumb – there’s no upper limit on how many separate components a correctly depicted chemical substance can have. However, from experience we find that excluding structures with more than four separate components removes most obviously nonsensical entries (e.g. attempts to depict alloys) while retaining the majority of correct entries.

When applying this rule, pharmaceutical molecules represent a major source of false positives because they are often multiple hydrates and/or salts with multiple counterions (e.g. Irinotecan hydrochloride trihydrate). Excluded structures that are hydrates or contain common pharmaceutical salts are flagged for human review.
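
The component rule and its review exception can be sketched as follows, assuming the structure is available as a SMILES in which components are separated by dots. The names and the short set of “common” components are assumptions for this example, and the sketch ignores the rare case of ring-closure digits joining dot-separated fragments.

```python
MAX_COMPONENTS = 4

# Illustrative only: water and a few chloride/sodium forms.
REVIEW_COMPONENTS = {"O", "Cl", "[Cl-]", "[Na+]"}


def triage_by_components(smiles: str) -> str:
    """Accept, reject, or flag a structure based on its component count."""
    components = smiles.split(".")
    if len(components) <= MAX_COMPONENTS:
        return "accept"
    if any(component in REVIEW_COMPONENTS for component in components):
        return "review"  # likely a hydrate or salt, so a curator decides
    return "reject"


# A placeholder parent with one HCl and three waters: five components.
print(triage_by_components("CC(=O)O.Cl.O.O.O"))
```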

Synonym filter

This filter compares the synonyms assigned to a given structure with its molecular formula and performs some ‘common sense’ checks. For example, a relatively frequent error is associating the name of a salt form (e.g., mozavaptan hydrochloride) with the structure of the free base (mozavaptan). In this case, the filter removes synonyms containing ‘hydrochloride’ because the molecular formula does not contain Cl.

SMARTS

SMARTS (Wikipedia page) is a way of describing general chemical structures. It’s based on SMILES, but has additional features allowing the specification of variable chain lengths, number of bonds, number of hydrogens, variable bond orders, or more than one potential element at a site.

We use SMARTS to identify common erroneous features in a structure. These include:

Azides and diazo groups depicted with a pentavalent nitrogen
A ‘floating’ alkane unconnected to the main structure (probably caused by an accidental click in a drawing program)
Metal carboxylates depicted as a protonated carboxylic acid with an elemental metal atom
Hexafluorophosphates (and similar species) depicted as phosphorus pentafluoride and a separate fluoride ion

SMIRKS

SMIRKS is a further extension of SMILES to depict reactions. We don’t use it to represent real reactions, but to define structural transformations – allowing us to fix simple structural errors that can be resolved by breaking and creating bonds.

One example is connecting charge-separated Grignard reagents to give a more accurate depiction:

Reconnecting Grignards

Organometallics

The difficulties of encoding organometallic structures in machine-readable formats are well documented (J. Chem. Inf. Model. 51, 12, 3149-3157). There is an ongoing IUPAC project to extend the InChI’s functionality, but for now, the challenges remain.

Every ChemSpider record is fundamentally based on an InChI, and so we are bound by the current limitations. This means that we can’t depict coordination bonds or bonds with non-integer order – any bond drawn is interpreted as a standard covalent bond with one electron contributed by each atom.

Although we generally can’t represent organometallic structures in the manner a human chemist would prefer, we still attempt to choose the ‘least wrong’ structure from various possible compromises.

Ferrocene is a classic example of this problem and illustrates several of the issues we have to consider. A few common ways to draw ferrocene are shown below (there are many more).

Common depictions of ferrocene lose bonding information when converted to mol files

Converting ferrocene structures to mol format can introduce errors in molecular formula, bond orders or valence

Most of the structures shown take advantage of extended features of chemical drawing packages in order to represent ferrocene’s bonding in a way that’s attractive and easily understandable to a human chemist. Unfortunately, once transferred to the simplified but universal mol format, some of those features are lost, resulting in nonsensical structures. Although structure D is unchanged, this representation has other problems: incorrect valence on Fe and no representation of the aromaticity of the cyclopentadienyl ligands.

We have a limited number of ways in which we can depict ferrocene and related structures in ChemSpider, none of which give an accurate representation of the bonding or a view that would satisfy an inorganic chemist. However, we can choose the ‘least bad’ of the possible compromises and allow machine readability:

Our compromise

Although this structure (ChemSpider record) doesn’t capture the hapticity of ferrocene and the charge localisation on a single carbon is inaccurate, it retains correct overall charges and valences and doesn’t show the ligands as sigma-bonded.

More generally, we apply some rules and transformations to standardise representations of organometallic structures. Many of these rules involve choosing whether to depict a metal–carbon (or metal–heteroatom) bond as covalent or ionic, depending on the nature of the metal and the ligand. Again, compromises are necessary when working within the limitations of machine-readable structures, but we attempt to classify ‘more ionic’ and ‘more covalent’ bonds. Some examples follow:

Disconnect oxygen from group 1 and 2 metals
Connect oxygen to all other metals
Disconnect carbon from sodium, potassium and calcium
Connect carbon to group 11 and 12 metals, p-block metals and some metalloids

As expected, general rules like these fail in certain cases. Therefore we have additional, more specific rules to cover exceptions, which we iteratively refine.
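
The four example rules above amount to a small lookup, sketched below. The element sets are illustrative and incomplete, the metalloids are left out because the rule does not say which ones are covered, and the function name is an assumption. Anything the example rules do not cover comes back as “undecided”.

```python
# Illustrative element sets, not a complete periodic table.
GROUP_1_AND_2 = {"Li", "Na", "K", "Rb", "Cs", "Fr",
                 "Be", "Mg", "Ca", "Sr", "Ba", "Ra"}
GROUP_11_AND_12 = {"Cu", "Ag", "Au", "Zn", "Cd", "Hg"}
P_BLOCK_METALS = {"Al", "Ga", "In", "Tl", "Sn", "Pb", "Bi"}


def bond_action(metal: str, ligand_atom: str) -> str:
    """Return 'connect', 'disconnect' or 'undecided' for a metal-ligand bond."""
    if ligand_atom == "O":
        return "disconnect" if metal in GROUP_1_AND_2 else "connect"
    if ligand_atom == "C":
        if metal in {"Na", "K", "Ca"}:
            return "disconnect"
        if metal in GROUP_11_AND_12 | P_BLOCK_METALS:
            return "connect"
    return "undecided"  # left to the more specific rules


print(bond_action("Na", "O"))
print(bond_action("Zn", "C"))
```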

But these errors still appear in ChemSpider!

At present the filtering described only applies to new data coming into ChemSpider. The full ChemSpider database, built up over many years, certainly contains examples of every error described here. To fix these legacy errors, we intend to run the entire database through the same quality filters. This is a significant task with some specific challenges: the files requiring human review become orders of magnitude larger, the processing time and memory/CPU overhead is high, and the larger the data set the more likely we will run into false positives. In order to manage these challenges, we are taking the time to refine our processes on new depositions, and periodically checking our progress by running subsets of the full ChemSpider database through our filters. We know you need access to data you can trust, so we want to make sure we get this right. We’ll continue to update you as this project progresses, so stay tuned!


Adding RSC CIFS to ChemSpider

09 Dec 2013

Written by Aileen Day.

We are pleased to announce that we have just imported 1047 CIFs into ChemSpider. These are crystal structures that were previously reported in RSC papers (and are available as ESI for those papers). Each CIF has been added to the ChemSpider record for the relevant compound and linked back to the original article and to the CCDC’s webCSD, e.g. example compound with RSC article CIF (see the CIF infobox). Since each CIF that is uploaded into ChemSpider must be associated with a ChemSpider compound, the difficult part of this task was working out a 2D molecular structure (in .mol file format) for each 3D crystal structure (in .cif file format). This is particularly difficult because CIFs only contain information about each atomic position and not how the atoms are bonded to each other in the crystal or whether they are charged or not.

Ultimately we would like this CIF to mol conversion (and the whole upload) to be performed programmatically without human intervention. However, there is no reliable way to do that currently: although programs such as OpenBabel can be used to extract mols from each CIF, the reliability of this conversion isn’t 100%.

So, as one of our student intern projects at the University of Southampton this summer (in parallel with another student intern project at Southampton University to share thesis data in ChemSpider), we used OpenBabel (version 2.3.2, run from the command line with the options -i cif inputfilename.txt -o mol -m --unique -d --AddPolarH) to extract mols for all the CIFs in the RSC archive (over 43,000 files as of June 2013). We enlisted Julija Kezina (shown below) to review the results of these conversions, both to ensure that only good structure and CIF pairs would be deposited to ChemSpider and to better understand the problems in the conversion process with a view to fixing them. One problem became immediately apparent: the 2D structure obtained was just a projection of the 3D structure along the a cell axis, which is not always the orientation that shows the molecule most clearly, even when the chemical connections between the atoms were right. All mol structures were therefore run through OpenEye’s cleaning algorithm before being reviewed.

Julija Kezina – Southampton University intern who examined CIF to Mol conversion

Julija compared each structure in the output mol files with those in the original CIF files to judge whether the conversion was accurate or not. In addition, as an extra check, all of the output mol structures were submitted to the ChemSpider validation and standardisation platform to filter out molecules with structural problems (e.g. stereochemistry, valence or congestion issues).

Overall, approximately 30% of the CIF to mol conversions that Julija checked were good, with the right connectivity of atoms and ions (although approximately 30% of these needed their atoms to be repositioned to tidy up the structure, either manually or using ChemDraw’s cleaning functionality). The 1047 of these mols which contain only a single molecule (without solvent molecules or cocrystals etc.) are those which have been deposited into ChemSpider with their corresponding CIFs.

The journals which had the highest successful conversion percentage were Molecular BioSystems (57%), MedChemComm (51%), Organic and Biomolecular Chemistry (44%) and Green Chemistry (44%), which are the journals that in general cover small organic molecules.

Julija was working in the National Crystallography Service’s office at the University of Southampton, under the co-supervision of Professor Simon Coles, and we are grateful to them for their help and advice about the finer points of the CIF file format.

Unsuccessful CIF to mol conversions

Running and evaluating OpenBabel on such a large and varied set of structures has given us a useful opportunity to identify and categorise the most common problems encountered. Here we share these, with examples that should help to identify some easy fixes that might benefit the whole community, and that can be used as test cases when making those fixes. We will report these bugs to the OpenBabel forum and, because OpenBabel is open source, hope to resolve at least some of these issues in the future through collaboration with its other developers.

The following OpenBabel bugs look like they might be most straightforward to fix:

Details	Example
Category: BAD_NITRO Frequency: 233 Description: there are different ways of representing nitro groups in structure drawers – OpenBabel currently does so by producing a mol with a pentavalent nitrogen. In ChemSpider we choose to avoid this in favour of a format with a charge-separated nitro. Solution: Allow OpenBabel to have a different output option for nitro groups to output them as shown in corrected mol file.	CIF: CCDC 194360 ChemSpider: 10001804
Category: BAD_MULT Frequency: 434 Description: Duplicate (exactly identical, including stereochemistry) molecules are present in the resulting mol file despite running OpenBabel with the --unique option (which should filter out duplicate molecules based on their InChIs) Solution: Fix OpenBabel when run with the --unique option so that it works.	CIF: CCDC 229590 ChemSpider: 3915
Category: BAD_MISSINGPARTOFMOLECULE Frequency: 724 Description: Part of the molecule is missing Cause: OpenBabel doesn’t understand crystal symmetry – only the atoms in the CIF that are explicitly listed with positions are included in the resulting mol file, and those that are inferred by symmetry are not. Solution: Make OpenBabel generate the full molecule from the symmetry in the CIF file, or recommend that a script/program that can process a CIF to generate another CIF with all atoms is run before OpenBabel.	CIF: CCDC 185091 ChemSpider: 11917
Category: BAD_PARTIALOCCUPANCY Frequency: 432 Description: partial occupancy of multiple sites for a particular atom in the CIF file Cause: In CIF files sometimes positions of multiple sites are specified with occupancy less than one – OpenBabel doesn’t recognise this and assumes that the occupancy of all sites is one effectively, so that there are duplicates of some atoms or fragments in the mol file. Solution: Where the _atom_site_occupancy is less than one, group together atoms into those which are alternatives of each other (by type, proximity, and those which add up to a total occupancy of 1) and choose only one of them to include in the final mol file (that with the highest site occupancy, or if two have equal occupancies of e.g. 0.5 then pick one at random). Note that there needs to be consistency, so that if for example a C is discarded, then all of the adjoining H’s with partial occupancy are also discarded but those bonded to the C that is included are included (as in the attached example).	CIF: CCDC 854369 ChemSpider: 68005704
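
The solution suggested for BAD_PARTIALOCCUPANCY can be sketched roughly as below. This is a simplified illustration, not a patch for OpenBabel: the site dictionaries, the 1.0 Å cutoff and the function name are assumptions, ties are broken by input order rather than at random, and the sketch does not check that occupancies sum to 1 or keep hydrogens consistent with the carbon that was kept.

```python
from math import dist


def resolve_partial_occupancy(sites: list, cutoff: float = 1.0) -> list:
    """Keep one site from each group of nearby same-element alternatives.

    Each site is a dict with 'label', 'element', 'xyz' (Cartesian, in
    angstroms) and 'occupancy'.
    """
    kept = [site for site in sites if site["occupancy"] >= 1.0]
    # Visit partial sites from highest to lowest occupancy (stable sort).
    partial = sorted(
        (site for site in sites if site["occupancy"] < 1.0),
        key=lambda site: -site["occupancy"],
    )
    chosen = []
    for site in partial:
        is_alternative = any(
            site["element"] == other["element"]
            and dist(site["xyz"], other["xyz"]) < cutoff
            for other in chosen
        )
        if not is_alternative:
            chosen.append(site)
    return kept + chosen


sites = [
    {"label": "C1", "element": "C", "xyz": (3.0, 0.0, 0.0), "occupancy": 1.0},
    {"label": "C2A", "element": "C", "xyz": (0.0, 0.0, 0.0), "occupancy": 0.7},
    {"label": "C2B", "element": "C", "xyz": (0.5, 0.0, 0.0), "occupancy": 0.3},
]
print([site["label"] for site in resolve_partial_occupancy(sites)])
```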

Many of the problems were caused by idiosyncrasies or errors in the input CIFs. On the whole, OpenBabel did not handle these well (e.g. by writing an error message and terminating the program); in the majority of cases it went into an infinite loop and the program hung. Because of this, and because the OpenBabel conversions were part of a longer script, all OpenBabel jobs had to be run with an arbitrary timeout and killed if they were still running when it expired, which may have discarded some valid but long-running OpenBabel jobs. We will investigate whether there is a validation program that can be run automatically on CIFs to filter out those with these problems (similar to the CCDC’s EnCIFer but able to be run programmatically). However, it would be relatively straightforward to make OpenBabel more reliable by having it exit cleanly when it encounters these problems, so that pre-validation would not be necessary. These problems are listed in the table below:

Details	Example
Category: CIF_NOCOORDINATES Frequency: 378 Description: cif doesn’t contain any coordinates Cause: Some CIFs contain e.g. powder diffraction refinement data and don’t contain coordinates. Solution: OpenBabel already issues an error: “CIF Error: no atom found ! (in data block:XXX)” – simply abort the program if this is found (rather than trying to continue).
Category: CIF_MISSINGLOOP Frequency: 85 Description: CIF is missing a “loop_” line Solution: Do an initial check that there is at least one loop_ line in the expected place before attempting to do the conversion.	CIF: CCDC 753484
Category: CIF_COMMENTEDFIELD Frequency: 36 Description: if there is a CIF field name in a commented section of the CIF, OpenBabel doesn’t ignore it and goes into an infinite loop Solution: It would be trivial to make sure that OpenBabel ignores CIF field names which are commented out (between a pair of semicolons).	CIF: CCDC 840581
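
A pre-validation step along these lines could look like the sketch below, together with a wrapper that runs any command under a timeout. It is a minimal illustration, not the script we used: the function names are assumptions, the category labels are reused from the table only loosely, and the check looks for a loop_ line anywhere rather than “in the expected place”.

```python
import subprocess


def strip_text_fields(cif_text: str) -> list:
    """Drop semicolon-delimited text fields and '#' comment lines."""
    kept, in_text_field = [], False
    for line in cif_text.splitlines():
        if line.startswith(";"):
            in_text_field = not in_text_field  # opening or closing semicolon
            continue
        if not in_text_field and not line.lstrip().startswith("#"):
            kept.append(line)
    return kept


def precheck_cif(cif_text: str) -> str:
    """Return 'ok' or the name of the first problem found."""
    lines = [line.strip() for line in strip_text_fields(cif_text)]
    if not any(line.lower() == "loop_" for line in lines):
        return "CIF_MISSINGLOOP"
    coordinate_names = ("_atom_site_fract_x", "_atom_site_cartn_x")
    if not any(line.lower().startswith(coordinate_names) for line in lines):
        return "CIF_NOCOORDINATES"
    return "ok"


def run_with_timeout(command: list, seconds: float) -> str:
    """Run a command, killing it if it exceeds the timeout."""
    try:
        subprocess.run(command, timeout=seconds, check=True,
                       capture_output=True)
    except subprocess.TimeoutExpired:
        return "timeout"
    except subprocess.CalledProcessError:
        return "failed"
    return "ok"
```

Files that pass precheck_cif would then be handed to the converter through run_with_timeout, so that a hung conversion is recorded as a timeout rather than stalling the whole script.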

The following OpenBabel bugs were the most frequent in occurrence, but will be difficult to fix. They arise from the problem that the CIF format does not record charges on atoms/ions or the types of bond between them, so OpenBabel needs to work them out, which is hard to do correctly.

Details	Example
Category: BAD_CHARGEMISSING Frequency: 830 Description: One or more ions in the molecule have the wrong charge on them in the resulting mol file	CIF: CCDC 879075 ChemSpider: 68005707
Category: BAD_WRONGCOORDINATION Frequency: 747 Description: One or more atoms or ions in the molecule have the wrong coordination – problem observed in metal ions, S, P, Se and B	CIF: CCDC 218529 ChemSpider: 26579734
Category: BAD_BONDMISSING Frequency: 587 Description: One or more of the bonds in the molecule are of the wrong order e.g. a single bond instead of a double bond.	CIF: CCDC 926530 ChemSpider: 34226187
Category: BAD_WRONGBOND Frequency: 452 Description: Wrong sequence of single/double bonds.	CIF: CCDC 203663 ChemSpider: 238575
Category: BAD_NOCOORDL Frequency: 52 Description: no coordination to a ligand.	CIF: CCDC 218360 ChemSpider: 68005705
Category: BAD_MISSINGH Frequency: 18 Description: missing hydrogen.	CIF: CCDC 220380 ChemSpider: 21188989

There were also some problem mol files produced which either won’t be able to be fixed by OpenBabel (since they resulted from either errors or limitations of the input CIF files which cannot be fixed retrospectively) or are too difficult to fix and/or too infrequently occurring to be worth the effort:

- There were 237 cases where there were solvent molecules in the CIF (many of which have missing hydrogens, partial occupancy of the molecule or part of the molecule etc.) which give rise to spurious oxygens, fragments of molecules and radicals in the resulting mol file (see CIF: CCDC 213787 and ChemSpider record: 68005706). 148 of these cases are just water solvent molecules with either missing or detached hydrogen atoms. The poor definition of the solvent molecules is a limitation of CIF files from diffraction, so it is not possible for OpenBabel to better define them in the output mol that is derived from them. However, running OpenBabel with the -r option to remove all but the largest contiguous fragment was quite successful at removing these problem solvent molecules, so no further action is required to deal with this problem and we will use this option in the future.
- There were 81 cases where there was at least one missing hydrogen in the original CIF (or in 3 cases, all hydrogens missing) – see CCDC 259871.
- Some CIFs contain crystal structures which correspond to continuous networks rather than small molecules (e.g. polymers, MOFs, zeolites, POMs) which cannot meaningfully be captured in mol format – see CCDC 206593.
- There were a few (24) cases where the stereochemistry in the mol file obtained is incorrectly defined. However, because on the whole the stereochemistry was well interpreted by OpenBabel and these cases were relatively few, it probably isn’t worth upsetting the apple cart to investigate these further – see CCDC 238611 and ChemSpider 9419187.


More hexagons in the plane

25 Oct 2013

Written by Colin Batchelor.

Recently I heard of someone who cycled the 1400 km from John o’ Groats to Land’s End, with a headwind all the way, because it looked on the map as if it was downhill and hence easier. (I am grateful to Neil Swainston of the University of Manchester for this anecdote.)

You might think that “down” on the page is unlikely to be “down” in 3D space, but there is an interesting exception to this, at least for certain interpretations of “down”.

Some time ago I gave a teaser of my Sheffield talk, which is now online here and here. The mathematical meat of the talk was about redrawing sugar rings in small molecules so that they can be properly indexed by cheminformatics systems. The teaser showed a classification of hexagons so we can tell which rules to apply.

It turns out that for the hexagons we see most in practice, which are chair hexagons and Haworth hexagons, at least if the hexagon itself has its long axis roughly horizontal on the page, then if a bond points “down” on the page, when we redraw the hexagon as viewed from “above”, then the bond will still be pointing down and needs to be redrawn with a dashed bond. The same applies, mutatis mutandis, for the bonds pointing “up”.

So far, so distressingly simple. Sometimes tasks really are easier than they look. There are two more things to address, though. One is simple and involves the well-known rules for how many stereobonds you draw in any given structure (I’ve mentioned this before). The other one is tidying the molecule so that the layout algorithm doesn’t undo all your good work. This is a bit trickier and I need to look a bit more at what tools are already out there for doing this.


Hexagons in the Plane

17 Apr 2013

Written by Colin Batchelor.

I’ll be talking at the 6th Joint Sheffield Conference on Cheminformatics in July on Validation and Standardization of Molecular Structures in General and Sugars in Particular. This is a taster.

Sugars in Particular

One of the big problems with chemical structure algorithms is that they can’t, in general, cope with the ways that chemists are accustomed to drawing sugar molecules. They will lose the stereochemistry around the sugar ring, collapsing D-glucose, say, on to L-glucose, not to mention allose, altrose, gulose and all the others.

(ChemDraw, I should note, can interpret chair stereo properly, but it is very much an exception.)

The first step in determining correct stereochemistry for a chair atom is recognizing a chair hexagon. That is the subject of this post.

Have you ever been in the same car as a satnav (US readers: this is the same as a GPS)? Whereas a human navigator will give general instructions like “go straight over all of the roundabouts till we reach the Red Lion”, a satnav only ever gives single-step, local instructions. “At the roundabout, take the third exit.” “In 100 metres, turn left.” Machine structure perception is rather like this. Instead of apprehending in an instant that the hexagon is a chair or a boat like you or I would, the algorithm needs to step around the structure atom by atom, bond by bond.

The trick to identifying what kind of hexagon we are dealing with is to see whether, at each atom, we turn left or right. If we keep turning in the same direction all the way round, then we have a regularish hexagon. If we turn once in one direction, then twice in the other, then once in the first, then twice in the other, then we have a chair. There are six other sorts of hexagon you can draw, and they’re all depicted below alongside the corresponding sequences of turns.

Some of them are familiar, like the boat, the twist boat, and the envelope. Others, less so.
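To make the turn-by-turn idea concrete, here is a rough Python sketch. It assumes the six ring atoms are given as (x, y) points in ring order, with y increasing upwards, and it only distinguishes the two cases spelled out above (regularish and chair). The function names and the example coordinates are invented for illustration.

```python
def turn_sequence(points):
    """Return one 'L' or 'R' per vertex, walking the ring in the given order."""
    n = len(points)
    turns = []
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = points[i - 1], points[i], points[(i + 1) % n]
        # z-component of the cross product of the incoming and outgoing bonds
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross == 0:
            raise ValueError("collinear atoms: no turn at vertex %d" % i)
        turns.append("L" if cross > 0 else "R")
    return "".join(turns)


def classify_hexagon(points):
    """Classify a drawn hexagon as 'regular', 'chair' or 'other'."""
    seq = turn_sequence(points)
    if len(seq) != 6:
        raise ValueError("expected six ring atoms")
    if len(set(seq)) == 1:
        return "regular"  # same turn direction all the way round
    doubled = seq + seq   # contains every rotation of seq as a substring
    if "LRRLRR" in doubled or "RLLRLL" in doubled:
        return "chair"    # one turn one way, two the other, repeated
    return "other"


# Hypothetical chair drawing (centrosymmetric, long axis horizontal).
chair = [(0.99, 0.29), (0.34, 0.04), (-0.64, 0.47),
         (-0.99, -0.29), (-0.34, -0.04), (0.64, -0.47)]
print(turn_sequence(chair), classify_hexagon(chair))
```

Checking against the doubled string means it doesn’t matter which atom we start from. Swapping to y-down screen coordinates just exchanges L and R, which leaves the classification unchanged.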

What happens when we’ve identified the atoms in the chair? I’ll come to that in more detail soon, but in the meantime here are the slides from the ACS Spring meeting in New Orleans:

Wedges, hashes and a side order of Grice

09 Nov 2012

Written by Colin Batchelor.

(This is not a post about carbohydrates, despite the title!)

Dodgy stereochemistry is a persistent problem. Even if someone knows all of the stereocentres in a particular molecule, they might not necessarily draw them in a way that a machine, or even a person, can interpret. There are rules about whether the pointy end or the blunt end of a bond indicates the stereocentre, and it’s surprising how often you see them done wrongly.

Today I’m going to talk about a particular IUPAC recommendation for drawing stereocentres that might at first glance seem surprising, the rule that you may only have one stereobond at a given stereocentre. If you have a wedged bond attached to an atom, you can’t have a hashed bond attached to the same atom. And vice versa.
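A machine check for this rule is short. The sketch below assumes each bond is stored as a (centre, other, style) tuple, with the centre being the atom at the narrow end of any stereobond. That representation and the name `overmarked_centres` are assumptions for illustration rather than any particular toolkit’s API.

```python
from collections import defaultdict


def overmarked_centres(bonds):
    """Return the atoms that carry more than one stereobond.

    bonds: iterable of (centre_atom, other_atom, style), where style is
    "wedge", "hash" or "plain" and centre_atom sits at the narrow end
    of a stereobond (assumed convention).
    """
    styles = defaultdict(list)
    for centre, _other, style in bonds:
        if style in ("wedge", "hash"):
            styles[centre].append(style)
    # Keep only the atoms that break the one-stereobond rule.
    return {atom: found for atom, found in styles.items() if len(found) > 1}


bonds = [(1, 2, "wedge"), (1, 3, "hash"), (1, 4, "plain"), (5, 6, "wedge")]
print(overmarked_centres(bonds))
```

Atom 1 is flagged because it has both a wedge and a hash, while atom 5 with its single wedge passes.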

Why is this?

You might think that as you’re supplying more information, you’re making the diagram easier to interpret. However, you’re running directly counter to the normal principles of communication. You’re being more informative than required, and this sets off alarm bells in the reader. What are you trying to say? If you ask a passerby the time and they say “Well, it’s half past six Greenwich Mean Time” you’re entitled to wonder why they’re quoting the timezone. Maybe they’re trying to be funny.

Paul Grice thought about this whole problem in the 1970s and came up with a set of four principles, summarized in maxims, that listeners (or readers) assume that speakers are following. These are they:

Be Truthful. Do not say what you believe to be false. Do not say that for which you lack adequate evidence.

Let us hope that this one is implicit in any chemical drawing!

Make your contribution as informative as is required. Do not make your contribution more informative than required.

If you have two methyl groups coming off an atom, do not make one wedgy and one hashy. You are adding no new information!

Do not mark carbons with the letter C unless your target audience is schoolchildren.

Be relevant:

On the grand scale: do not illustrate an article with any old molecule—make sure the molecule mentioned is actually relevant.

On the scale of the drawing itself, however: If you have three bonds about an ordinary p-block atom, for example, make sure they’re at 120 degrees to each other. If they aren’t, for example if two of them are at right angles, the reader will infer that something odd is going on.

Be clear:

Make sure all your double bonds actually look like double bonds rather than a single bond parallel to another single bond. I suspect a lot of the success of ChemDraw is down to the fact that it produces attractive, clear chemical drawings.

Do people ever flout the maxims on purpose?

Oh yes. People often flout the maxims when trying to be funny, or in a political interview. Similarly there are all kinds of Gricean violations in the chemical drawings you see in patents: bonds which do not quite extend all the way to atoms, R groups labelled as Y (particularly dangerous as Y is yttrium!) or Q or W (also tungsten) or some other unusual letter and so forth. Exactly why this happens so much more often in patents than in journal articles is left as an exercise for the reader.

Putting sugar in perspective

16 Aug 2012

Written by Colin Batchelor.

You might not think so, but you’re very good at taking a two-dimensional drawing and converting it into a three-dimensional shape in your head. No, really, you are.

Fig. 1. Galactose in perspective

Take the drawing of galactose in Fig. 1. Even if you’re not a chemist, you can tell which bits of the ring are at the front and at the back, which bonds point up and which bonds point down. If you actually are a chemist, you’ve been trained to apply this geometrical intuition to work out what’s going on at each of the five stereocentres.

However, if you ask the InChI algorithm about the stereochemistry of this molecule, it’ll say that there is no stereochemistry in there and you’re looking at a stereoless description of which atom is attached to which. Since we use the InChI algorithm to say whether two records describe the same molecule, this puts us in a quandary, and there are thousands of entries in ChemSpider that come from just such a drawing and hence lack stereochemistry.
