Adding RSC CIFS to ChemSpider

09 Dec 2013

Written by Aileen Day.

We are pleased to announce that we have just imported 1047 CIFs into ChemSpider. These are crystal structures that were previously reported in RSC papers (and are available as ESI for those papers); each has been added to the relevant ChemSpider compound and linked back to the original article and to the CCDC’s webCSD (for an example, see the CIF infobox on a compound with an RSC article CIF). Since each CIF that is uploaded into ChemSpider must be associated with a ChemSpider compound, the difficult part of this task was working out a 2D molecular structure (in .mol file format) for each 3D crystal structure (in .cif file format). This is particularly difficult because CIFs only contain information about each atomic position, and not how the atoms are bonded to each other in the crystal or whether they are charged.
Ultimately we would like this CIF to mol conversion (and the whole upload) to be performed programmatically without human intervention. However, there is no reliable way to do that currently – although programs such as OpenBabel can be used to extract mols from each CIF, the reliability of this conversion isn’t 100%.
So as one of our student intern projects at the University of Southampton this summer (in parallel with another student intern project at Southampton University to share thesis data in ChemSpider) we used OpenBabel to extract mols for all the CIFs in the RSC archive (over 43,000 files as of June 2013). We used version 2.3.2, run from the command line with the options -i cif inputfilename.txt -o mol -m --unique -d --AddPolarH. We enlisted Julija Kezina (shown below) to review the results of these conversions, both to ensure that only good structure and CIF pairs would be deposited to ChemSpider and to better understand the problems in the conversion process with a view to fixing them. One problem that became immediately apparent was that the 2D structure obtained was just a projection of the 3D structure along the a cell axis, which is not always the orientation that shows the molecule most clearly, even when the structure had the right chemical connections between the atoms. All mol structures were therefore run through OpenEye’s cleaning algorithm before being reviewed.
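As a rough illustration of how the options quoted above might be assembled for a batch run, here is a minimal Python sketch. The executable name `babel`, the directory layout and the helper names are assumptions made for the example; only the option string itself comes from the text above.

```python
from pathlib import Path

# Options quoted in the post for OpenBabel 2.3.2.
POST_OPTIONS = ["-m", "--unique", "-d", "--AddPolarH"]


def build_babel_command(cif_path, executable="babel"):
    """Return the argument list for converting one CIF to mol file(s).

    `executable` is an assumption: the post does not name the binary.
    """
    return [executable, "-i", "cif", str(cif_path), "-o", "mol", *POST_OPTIONS]


def commands_for_archive(cif_dir, pattern="*.cif"):
    """Yield (path, command) pairs for every CIF in `cif_dir`.

    The flat directory layout and the file pattern are hypothetical.
    """
    for path in sorted(Path(cif_dir).glob(pattern)):
        yield path, build_babel_command(path)
```

Building the command as a list (rather than a single string) avoids shell quoting problems with unusual file names when the list is later passed to a process runner.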

Julija Kezina – Southampton University intern who examined CIF to Mol conversion

Julija compared each structure in the output mol files with those in the original CIF files to judge whether the conversion was accurate or not. In addition, as an extra check, all of the output mol structures were submitted to the ChemSpider validation and standardisation platform to filter out molecules with structural problems (e.g. stereochemistry, valence or congestion issues).
Overall, approximately 30% of the CIF to mol conversions that Julija checked were good, with the right connectivity of atoms and ions (although approximately 30% of these needed their atomic positions adjusted to clean or tidy up the structure, either manually or using ChemDraw’s cleaning functionality). The 1047 of these mols which contain only a single molecule (without solvent molecules or cocrystals etc.) are those which have been deposited into ChemSpider with their corresponding CIFs.
The journals which had the highest successful conversion percentage were Molecular BioSystems (57%), MedChemComm (51%), Organic and Biomolecular Chemistry (44%) and Green Chemistry (44%) – the journals which in general are about small organic molecules.
Julija was working in the National Crystallography Service’s office at the University of Southampton, under the co-supervision of Professor Simon Coles, and we are grateful to them for their help and advice about the finer points of the CIF file format.

Unsuccessful CIF to mol conversions

Running and evaluating OpenBabel on such a large and varied set of structures has given us a useful opportunity to identify and categorise the most common problems encountered. Here we share these, with examples, in the hope that they will help to identify some easy fixes in the pipeline that might benefit the whole community, and that the examples can be used as test cases when making those fixes. We will report these bugs to the OpenBabel forum and, because OpenBabel is open source, hope to resolve at least some of these issues in the future through collaboration with its other developers.

The following OpenBabel bugs look like they might be most straightforward to fix:

Details	Example
Category: BAD_NITRO Frequency: 233 Description: there are different ways of representing nitro groups in structure drawers – OpenBabel currently does so by producing a mol with a pentavalent nitrogen. In ChemSpider we choose to avoid this in favour of a format with a charge-separated nitro. Solution: Allow OpenBabel to have a different output option for nitro groups to output them as shown in corrected mol file.	CIF: CCDC 194360 ChemSpider: 10001804
Category: BAD_MULT Frequency: 434 Description: Duplicate (exactly identical, including stereochemistry) molecules are present in the resulting mol file despite running OpenBabel with the --unique option (which should filter out duplicate molecules based on their InChIs) Solution: Fix OpenBabel when run with the --unique option so that it works.	CIF: CCDC 229590 ChemSpider: 3915
Category: BAD_MISSINGPARTOFMOLECULE Frequency: 724 Description: Part of the molecule is missing Cause: OpenBabel doesn’t understand crystal symmetry – only the atoms in the CIF that are explicitly listed with positions are included in the resulting mol file, and those that are inferred by symmetry are not. Solution: Make OpenBabel generate the full molecule from the symmetry in the CIF file, or recommend that a script/program that can process a CIF to generate another CIF with all atoms is run before OpenBabel.	CIF: CCDC 185091 ChemSpider: 11917
Category: BAD_PARTIALOCCUPANCY Frequency: 432 Description: partial occupancy of multiple sites for a particular atom in the CIF file Cause: In CIF files sometimes positions of multiple sites are specified with occupancy less than one – OpenBabel doesn’t recognise this and assumes that the occupancy of all sites is one effectively, so that there are duplicates of some atoms or fragments in the mol file. Solution: Where the _atom_site_occupancy is less than one, group together atoms into those which are alternatives of each other (by type, proximity, and those which add up to a total occupancy of 1) and choose only one of them to include in the final mol file (that with the highest site occupancy, or if two have equal occupancies of e.g. 0.5 then pick one at random). Note that there needs to be consistency, so that if for example a C is discarded, then all of the adjoining H’s with partial occupancy are also discarded but those bonded to the C that is included are included (as in the attached example).	CIF: CCDC 854369 ChemSpider: 68005704
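
The selection rule suggested for BAD_PARTIALOCCUPANCY can be sketched roughly as follows. This is only an illustration under stated assumptions, not OpenBabel code: the `Site` record, the Cartesian coordinates (assumed to have been converted from the CIF beforehand) and the 1.0 Å proximity threshold are all hypothetical. It groups alternatives by element type and proximity only, and does not check that occupancies sum to 1 or keep hydrogens consistent with the carbon that is retained. Ties are broken by label rather than at random so that results are reproducible.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Site:
    label: str        # atom label, e.g. "C2A"
    element: str      # element symbol, e.g. "C"
    xyz: tuple        # Cartesian coordinates in angstroms (assumed pre-converted)
    occupancy: float  # value of _atom_site_occupancy


def resolve_partial_occupancy(sites, max_separation=1.0):
    """Keep one site from each group of alternative, partially occupied sites.

    Fully occupied sites are always kept. Partially occupied sites are visited
    from highest to lowest occupancy; a site is dropped if a site of the same
    element has already been chosen within `max_separation` angstroms.
    """
    kept = [s for s in sites if s.occupancy >= 1.0]
    partial = sorted(
        (s for s in sites if s.occupancy < 1.0),
        key=lambda s: (-s.occupancy, s.label),
    )
    winners = []
    for site in partial:
        is_alternative = any(
            w.element == site.element and dist(w.xyz, site.xyz) <= max_separation
            for w in winners
        )
        if not is_alternative:
            winners.append(site)
    return kept + winners
```

A fuller version would need the bonding information to discard the hydrogens attached to a discarded carbon, as described in the table row above.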

Many of the problems were caused by idiosyncrasies or errors in the input CIFs. On the whole OpenBabel did not handle these well: rather than writing an error message and terminating, in the majority of cases it went into an infinite loop and the program hung. Because of this, and because the OpenBabel conversions were part of a longer script, all OpenBabel jobs had to be run with an arbitrary timeout and killed if they were still running when it expired, which may have discarded some valid but long-running OpenBabel jobs. We will investigate whether there is a validation program that can be run automatically on CIFs to filter out ones with these problems (similar to the CCDC’s EnCIFer but which can be run programmatically). However, it would be relatively straightforward to make OpenBabel more reliable by having it exit cleanly when it encounters these problems, so that pre-validation would not be necessary. These problems are listed in the table below:

Details	Example
Category: CIF_NOCOORDINATES Frequency: 378 Description: cif doesn’t contain any coordinates Cause: Some CIFs contain e.g. powder diffraction refinement data and don’t contain coordinates. Solution: OpenBabel already issues an error: “CIF Error: no atom found ! (in data block:XXX)” – simply abort the program if this is found (rather than trying to continue).
Category: CIF_MISSINGLOOP Frequency: 85 Description: cif misses a “loop_” line Solution: Do an initial check that there is at least one loop_ line in the expected place before attempting to do the conversion.	CIF: CCDC 753484
Category: CIF_COMMENTEDFIELD Frequency: 36 Description: If there is a CIF field name in a commented section of the CIF, OpenBabel doesn’t ignore it and goes into an infinite loop Solution: It would be trivial to make sure that OpenBabel ignores CIF field names which are commented out (between a pair of semicolons).	CIF: CCDC 840581
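
Until such fixes exist, the checks in this table and the timeout described above could be approximated in a wrapper script. The sketch below is an assumption-laden illustration, not the script we used: the function names are hypothetical, the loop_ check only tests that a loop_ keyword appears anywhere (not "in the expected place"), and only whole-line # comments and semicolon-delimited sections are skipped.

```python
import subprocess

# Tags whose presence indicates that atomic coordinates are listed (compared in lower case).
COORD_TAGS = ("_atom_site_fract_x", "_atom_site_cartn_x")


def strip_text_fields_and_comments(cif_text):
    """Drop semicolon-delimited sections and whole-line '#' comments."""
    kept, in_text_field = [], False
    for line in cif_text.splitlines():
        if line.startswith(";"):
            # A ';' in the first column opens or closes a delimited section.
            in_text_field = not in_text_field
            continue
        if in_text_field or line.lstrip().startswith("#"):
            continue
        kept.append(line)
    return "\n".join(kept)


def precheck_cif(cif_text):
    """Return the problem categories detected before conversion is attempted."""
    tokens = strip_text_fields_and_comments(cif_text).lower().split()
    problems = []
    if "loop_" not in tokens:
        problems.append("CIF_MISSINGLOOP")
    if not any(tag in tokens for tag in COORD_TAGS):
        problems.append("CIF_NOCOORDINATES")
    return problems


def run_with_timeout(command, timeout_seconds):
    """Run a command; return its exit code, or None if it was killed on timeout."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return result.returncode
```

A CIF that returns a non-empty list from `precheck_cif` can be skipped without starting a conversion at all, so the timeout is only needed as a last resort for the remaining files.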

The following OpenBabel bugs were the most frequent in occurrence, but will be difficult to fix. They arise from the problem that the CIF format does not record charges on atoms/ions or the types of bond between them, so OpenBabel needs to work them out, which is hard to do correctly.

Details	Example
Category: BAD_CHARGEMISSING Frequency: 830 Description: One or more ions in the molecule have the wrong charge on them in the resulting mol file	CIF: CCDC 879075 ChemSpider: 68005707
Category: BAD_WRONGCOORDINATION Frequency: 747 Description: One or more atoms or ions in the molecule have the wrong coordination – problem observed in metal ions, S, P, Se and B	CIF: CCDC 218529 ChemSpider: 26579734
Category: BAD_BONDMISSING Frequency: 587 Description: One or more of the bonds in the molecule are of the wrong order e.g. a single bond instead of a double bond.	CIF: CCDC 926530 ChemSpider: 34226187
Category: BAD_WRONGBOND Frequency: 452 Description: Wrong sequence of single/double bonds.	CIF: CCDC 203663 ChemSpider: 238575
Category: BAD_NOCOORDL Frequency: 52 Description: no coordination to a ligand.	CIF: CCDC 218360 ChemSpider: 68005705
Category: BAD_MISSINGH Frequency: 18 Description: missing hydrogen.	CIF: CCDC 220380 ChemSpider: 21188989

There were also some problem mol files produced which either cannot be fixed in OpenBabel (since they resulted from either errors or limitations of the input CIF files which cannot be fixed retrospectively) or are too difficult to fix and/or too infrequently occurring to be worth the effort:

- There were 237 cases where there were solvent molecules in the CIF (many of which have missing hydrogens, partial occupancy of the molecule or part of the molecule etc.) which give rise to spurious oxygens, fragments of molecules and radicals in the resulting mol file (see CIF: CCDC 213787 and ChemSpider record: 68005706). 148 of these cases are just water solvent molecules with either missing or detached hydrogen atoms. The poor definition of the solvent molecules is a limitation of CIF files from diffraction, so it is not possible for OpenBabel to better define them in the output mol that is derived from them. However, running OpenBabel with the -r option to remove all but the largest contiguous fragment was quite successful at removing these problem solvent molecules, so no further action is required to deal with this problem and we will use this option in the future.
- There were 81 cases where there was at least one missing hydrogen in the original CIF (or in 3 cases, all hydrogens missing) – see CCDC 259871.
- Some CIFs contain crystal structures which correspond to continuous networks rather than small molecules (e.g. polymers, MOFs, zeolites, POMs) which cannot meaningfully be captured in mol format – see CCDC 206593.
- There were a few (24) cases where the stereochemistry in the mol file obtained is incorrectly defined. However, because on the whole the stereochemistry was well interpreted by OpenBabel and these cases were relatively few, it probably isn’t worth upsetting the apple cart to investigate these further – see CCDC 238611 and ChemSpider 9419187.
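
To make the idea of keeping only the largest contiguous fragment concrete, here is a small Python sketch that finds the largest connected group of atoms from a bond list. It only illustrates the concept and is not OpenBabel's implementation of -r; measuring "largest" by atom count, and keeping the first fragment found when there is a tie, are assumptions made for the example.

```python
from collections import defaultdict


def largest_fragment(atom_count, bonds):
    """Return the sorted atom indices of the largest connected fragment.

    atom_count: number of atoms, indexed 0 .. atom_count - 1
    bonds: iterable of (i, j) index pairs, one per bond
    """
    neighbours = defaultdict(set)
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    seen, best = set(), []
    for start in range(atom_count):
        if start in seen:
            continue
        # Depth-first walk over everything bonded (directly or not) to `start`.
        stack, fragment = [start], []
        seen.add(start)
        while stack:
            atom = stack.pop()
            fragment.append(atom)
            for nxt in neighbours[atom] - seen:
                seen.add(nxt)
                stack.append(nxt)
        if len(fragment) > len(best):
            best = fragment
    return sorted(best)
```

For a three-atom molecule plus a detached water oxygen and a two-atom solvent fragment, `largest_fragment(6, [(0, 1), (1, 2), (4, 5)])` keeps atoms 0, 1 and 2 and drops the rest.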

ChemSpider Blog
