ChemSpider data cleanup

In previous posts, we have discussed the automated workflow we use to check new incoming data for structure and synonym errors. These checks allow us to remove the most common types of errors before they are added to the site. However, these filters do not apply to data already in ChemSpider.

Manual curation is an important part of our work. We periodically review the data on our most accessed records, in addition to ad-hoc removal or correction of erroneous data that we or our users notice when using the site. However, there are far too many records and far too much data to clean up using manual curation alone.

Recently we have focused on bulk identification and removal of erroneous data. This work has covered mapping errors and other clearly incorrect values in our experimental property data, correction or removal of malformed synonyms, correction of incorrectly labelled synonyms, and resolution of structure/synonym clashes.

Experimental Properties

We retrieved all 6.3 million experimental properties, text properties, and associated annotations from the ChemSpider database. We then compared the original text of the property as it was written in the original file to how that text was parsed and mapped by our deposition system. This enabled us to identify and correct several types of errors affecting around 2% of the properties in our database:

  • 35,774 experimental property values had been assigned the incorrect unit (e.g. g/L instead of g/mL, °C instead of °F)
  • 2,591 boiling points measured under non-standard pressure did not have this pressure displayed
  • 4,292 densities had their density and temperature values swapped
  • 79,252 miscellaneous erroneous properties and associated annotations were deleted, for example “white crystals” mapped as a melting point, or impossibly high melting points or densities
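As a rough illustration of comparing the original text with the mapped value (not the code used for this cleanup), the unit check could look something like the sketch below. The function names, the short unit list and the regular expression are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately short list of unit spellings to look for.
UNIT_PATTERN = re.compile(r"g/mL|g/L|°C|°F")


def units_in_text(original_text):
    """Return every recognised unit found in the original property text."""
    return UNIT_PATTERN.findall(original_text)


def has_unit_mismatch(original_text, mapped_unit):
    """True when the text names units but the mapped unit is not one of them."""
    found = units_in_text(original_text)
    return bool(found) and mapped_unit not in found


def fahrenheit_to_celsius(value_f):
    """Convert a temperature so a value mislabelled as °C can be corrected."""
    return (value_f - 32.0) * 5.0 / 9.0


print(has_unit_mismatch("Density: 1.05 g/mL at 20 °C", "g/L"))
print(fahrenheit_to_celsius(212.0))
```

A check like this only flags candidates. Text with no recognisable unit is left alone and would need a separate rule.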

Synonyms

Synonyms, chemical names, and identifiers are the most abundant type of data on ChemSpider, with a total of more than 446 million synonyms. These synonyms have additional metadata including language labels and flags identifying what type of synonym they are (e.g. CAS number, UNII, INN, trade name).

Simple Checks

We ran a series of regular expression string searches to identify synonyms with incorrect metadata, as well as malformed or otherwise erroneous synonyms.

  • 200,007 synonym type flags added, and 4,766 incorrect flags removed
  • 9,170 synonyms with an incorrect language label identified.
  • 631,697 erroneous synonyms identified, including scrambled characters, properties/units, molecular formulae as synonyms, purity information, or invalid CAS numbers or EC numbers (formerly called EINECS).
  • 922,334 instances of these erroneous synonyms deleted from ChemSpider records.
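One of these checks, CAS number validation, is easy to sketch because CAS Registry Numbers carry a check digit. The example below is a minimal illustration rather than the expressions we actually ran, and the function name is an assumption. It combines a format check with the standard check-digit rule (digits weighted 1, 2, 3, ... from the right, summed modulo 10).

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(candidate):
    """Return True if the string is well formed and its check digit matches."""
    match = CAS_PATTERN.match(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # water
print(is_valid_cas("7732-18-4"))  # same digits, wrong check digit
```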

Structure/Synonym comparison

After identifying and removing these synonym-level errors, we then cross-checked ChemSpider records and their synonyms to identify mismatches. This work included amino acids, nucleic acids, and pharmaceutically acceptable salts.

As a first pass, we compared synonyms to molecular formulae to identify records missing key elements. Examples include synonyms describing a sodium salt when the molecular formula does not contain sodium, or describing an amino acid when the molecular formula contains no nitrogen. A total of 28,194 of these synonym/formula clashes were identified and removed.
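A formula check of this kind can be sketched in a few lines. This is an illustration only: the keyword table is a hypothetical four-entry stand-in for a much longer list, and the function names are assumptions.

```python
import re

ELEMENT_PATTERN = re.compile(r"([A-Z][a-z]?)(\d*)")

# Hypothetical keyword-to-element table; a real one would be far longer.
REQUIRED_ELEMENTS = {
    "sodium": "Na",
    "potassium": "K",
    "hydrochloride": "Cl",
    "bromide": "Br",
}


def elements_in_formula(formula):
    """Return the set of element symbols present in a molecular formula."""
    return {symbol for symbol, _count in ELEMENT_PATTERN.findall(formula)}


def clashing_keywords(synonym, formula):
    """List keywords in the synonym whose element is absent from the formula."""
    present = elements_in_formula(formula)
    lowered = synonym.lower()
    return [
        keyword
        for keyword, element in REQUIRED_ELEMENTS.items()
        if keyword in lowered and element not in present
    ]


# "Sodium acetate" attached to the formula of acetic acid is a clash.
print(clashing_keywords("Sodium acetate", "C2H4O2"))
print(clashing_keywords("Sodium acetate", "C2H3NaO2"))
```

Plain substring matching is crude, so any real keyword list needs spot-checking for false positives.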

For records that passed this initial molecular formula check, we performed a SMARTS comparison to identify chemical structures missing key structural features described in the synonym. These SMARTS strings were written broadly, with common substitutions allowed to prevent unnecessary removal of valid synonyms from derivative compounds.

In the following examples, the removed synonym describes a feature that is missing from the structure on the record.

Structure                  Removed synonym
Sulfur dioxide             Sulfate ion
Zolpidem                   Zolpidem tartrate
Sodium S-sulfocysteine     Sodium S-sulfocysteine hydrate

After identifying these clashes, we manually spot-checked the output to weed out false positives and refine the SMARTS filters. In total, 101,257 synonym/structure clashes were identified and removed.

These checks included the following categories:

  • Amino acids and their derivatives: 6 formula clashes, 56 structure clashes
  • Nucleic acids, nucleosides, nucleotides: 977 formula clashes, 1,870 structure clashes
  • Halogens: 13,437 formula clashes, 1,256 structure clashes
  • Alkali and alkaline earth metals, and aluminium: 3,586 formula clashes, 56 structure clashes
  • Carboxylic acids and their derivatives: 5,002 formula clashes, 88,501 structure clashes
  • Other pharmaceutically acceptable acids: 3,534 formula clashes, 1,529 structure clashes
  • Amides and amines: 190 formula clashes, 304 structure clashes
  • Deuterates, hydrates, methylbromides: 1,462 formula clashes, 7,685 structure clashes

Get involved

You are the expert in your area of chemistry, so if you see something that doesn’t look quite right please let us know. If the error is confined to a single ChemSpider record, click the “Comment On This Record” box at the top of the affected record and let us know what the problem is. All we need is a sentence describing the error, but the more information you can provide, the better.

For more systemic errors, or in cases where you want to attach supplementary information or corrected chemical structures, please get in touch via email (chemspider@rsc.org).

Webinar 3: Chemistry data: Challenges and opportunities. Watch the recording

We will explore ongoing and planned initiatives developing standards and tools, research infrastructures, and cultures to support FAIR chemistry data as well as its preparation, publication, and reuse.

Webinar 3: Challenges and opportunities

Webinar recorded on 7 December 2023 – watch the recording here 

Speakers

Sonja Herres-Pawlis
“How to initiate the cultural change towards digital chemistry” SLIDES
Sonja Herres-Pawlis
Chair of Bioinorganic Chemistry, RWTH Aachen

Samantha Kanza
“How can we combat heterogeneous, unfair and disparate data in digital chemistry?” SLIDES
Samantha Kanza
Senior Enterprise Fellow, University of Southampton
Pathfinder Lead, Physical Sciences Data Infrastructure (PSDI)

Guy Jones
“How data journals can support (chemistry) data sharing and discovery” SLIDES
Guy Jones
Chief Editor of Scientific Data, Springer Nature


Sponsored by Revvity

Revvity Signals Software, formerly PerkinElmer Informatics, has over three decades of experience providing support for scientific workflows.

Our powerful informatics solutions are used in R&D across disciplines from drug discovery to materials development. Now under our Signals Research Suite, our end-to-end SaaS solution integrates workflows to accelerate innovation and help scientists collaborate. In addition, our solution powered by TIBCO® Spotfire® can transform clinical trials.

From our flagship ChemDraw® and E-Notebook applications, to our Signals Research Suite, to our TIBCO® Spotfire® partnership for data analytics, Revvity Signals offers a powerful suite of scientific solutions.

About ChemSpider

Explore more than 128 million structures on the ChemSpider database. Including over 200 data sources, ChemSpider is a valuable source of information for chemical scientists working with data.

Freely accessible and comprehensive, this rich source of structure-based chemistry information is a fundamental resource for chemical scientists working with data everywhere.

Learn more about ChemSpider

Webinar 2: What does the future hold? Watch the recording

We will explore ongoing and planned initiatives developing standards and tools, research infrastructures, and cultures to support FAIR chemistry data as well as its preparation, publication, and reuse.

Webinar 2: What does the future hold?

Webinar recorded on 17 November 2023 – watch the recording here

Speakers

Lynn Kamerlin
“Data explosion in chemistry: what are we going to do with all the data, and what will it do to us?” SLIDES
Lynn Kamerlin
Professor and Georgia Research Alliance Vasser Woolley Chair in Molecular Design, Georgia Tech


“Will an AI win a chemistry Nobel Prize and replace us?” SLIDES
Simon Coles
Professor of Structural Chemistry, University of Southampton

Anna Rulka
“Data sharing at the RSC” SLIDES
May Copsey
Executive Editor, Chemical Science, RSC
Anna Rulka
Executive Editor, Digital Discovery, RSC



Webinar 1: Where are we with digital chemistry data? Watch the recording

 

 

These webinars will explore how digital chemistry data is enabling research – existing models, current challenges and exemplars, and what’s needed to evolve towards a better future using chemistry data.

Webinar 1: Where are we with digital chemistry data?

Webinar recorded on 17 October 2023 – watch the recording here.

Speakers

Leah McEwen
“WANTED: standard notation for reusable chemical data” SLIDES
Leah McEwen
Chemistry Librarian, Cornell University

Kevin Jablonka
Kevin Jablonka
Research Group Leader, University of Jena

Pierre Morieux

Pierre Morieux
Chemistry Product Marketing Manager, Revvity Signals

 



ChemSpider webinars – helping you embrace digital chemistry data with expert insights

How can you learn about chemistry data trends and best practices happening right now? Elevate your knowledge for future success with leading experts in our three-part webinar series.

The webinar series will focus on how data is enabling research: the current challenges and examples, and how a better future can be created using chemistry data. It will showcase current and planned initiatives to develop standards, tools, research infrastructures, and cultures that support Findable, Accessible, Interoperable, Reusable (FAIR) chemistry data preparation, publication and reuse.

Elevate your data practices  

Created as a free, three-part series for chemical scientists working with data, learn more about chemistry data today, what the future holds, and the current challenges and opportunities of digital chemistry data. Make the most of this opportunity to discover insights from the experts in the field – register for all three webinars.

 

Recordings of all webinars are available here

 


Webinar 1: Where are we with digital chemistry data?

Held on 17 October 2023.

Speakers

Leah McEwen
“Wanted – standard notation for reusable chemistry data” SLIDES
Leah McEwen
Chemistry Librarian, Cornell University

Kevin Jablonka
Kevin Jablonka
Research Group Leader, University of Jena

Pierre Morieux

Pierre Morieux
Chemistry Product Marketing Manager, Revvity Signals


Webinar 2: What does the future hold?

Held on 17 November 2023

Speakers

Lynn Kamerlin
“Data explosion in chemistry: what are we going to do with all the data, and what will it do to us?” SLIDES
Lynn Kamerlin
Professor and Georgia Research Alliance Vasser Woolley Chair in Molecular Design, Georgia Tech


“Will an AI win a chemistry Nobel Prize and replace us?” SLIDES
Simon Coles
Professor of Structural Chemistry, University of Southampton

Anna Rulka
“Data sharing at the RSC” SLIDES
May Copsey
Executive Editor, Chemical Science, RSC
Anna Rulka
Executive Editor, Digital Discovery, RSC


Webinar 3: Challenges and opportunities

Held on 7 December 2023

Speakers

Sonja Herres-Pawlis
“How to initiate the cultural change towards digital chemistry” SLIDES
Sonja Herres-Pawlis
Chair of Bioinorganic Chemistry, RWTH Aachen

Samantha Kanza
“How can we combat heterogeneous, unfair and disparate data in digital chemistry?” SLIDES
Samantha Kanza
Senior Enterprise Fellow, University of Southampton
Pathfinder Lead, Physical Sciences Data Infrastructure (PSDI)

Guy Jones

“How data journals can support (chemistry) data sharing and discovery” SLIDES
Guy Jones
Chief Editor of Scientific Data, Springer Nature

 



Tips and tricks: generating machine-readable structural data from a structure

Interested in making your article more discoverable and usable? As a reader, you have probably spent a lot of time re-drawing structures from an image in a PDF, or have struggled to find all relevant articles because your compound of interest is called by different names in different articles (IUPAC name, trivial name, registry number, drug development ID, generic name, brand name, revised trivial name, and so on).

If you’re already drawing a structure for an article you are preparing to submit, it only takes a few seconds to generate machine-readable mol files or structure identifiers like SMILES or InChI. Including these files or identifiers in your article or supplementary information helps make your article indexable and structure-searchable, and is a great way to make your article stand out.

Save as MOL file

 

All major structure drawing packages can save structures as MOL files. They generally follow the same steps:

Choose File > Save As from the top menu OR press Ctrl+Shift+S.

Select “MDL Molfile”, “MDL SDFile”, or “.mol” or “.sdf” in the dropdown.

Please note: There may be more than one molfile format listed in the dropdown. If there is more than one option, please be aware that V2000 mol format is more common and is supported by all cheminformatics software packages. The V3000 mol file has some extra features, but is not universally supported, so it is advised that you use V2000 mol format to ensure maximum interoperability.
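If you are not sure which version a saved file uses, the version stamp normally sits at the end of the fourth line (the counts line) of the mol file. The short sketch below reads it. The function name is an assumption, and very old files may omit the stamp, in which case the function returns None.

```python
def molfile_version(molfile_text):
    """Return 'V2000' or 'V3000' from the counts line, or None if absent."""
    lines = molfile_text.splitlines()
    if len(lines) < 4:
        return None
    counts_line = lines[3].rstrip()
    for version in ("V2000", "V3000"):
        if counts_line.endswith(version):
            return version
    return None


# A minimal, empty V2000 mol file used only to exercise the function.
EXAMPLE = "\n".join([
    "example",
    "  hypothetical-program",
    "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
])

print(molfile_version(EXAMPLE))
```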


Copy as SMILES or InChI

Start by selecting the structure you would like to copy as SMILES or InChI.

Avogadro

From the top menu, choose Edit > Copy As > SMILES or InChI

ChemDoodle

From the top menu, choose Edit > Copy As > Daylight SMILES or IUPAC InChI

OR

To copy as SMILES, press Ctrl+Alt+C

ChemDraw

From the top menu, choose Edit > Copy As > SMILES or InChI

OR

Right click, and choose Molecule > Copy As > SMILES or InChI

OR

To copy as SMILES, press Alt+Ctrl+C

ChemSketch

From the top menu, choose Tools > Generate > SMILES Notation or InChI for Structure

MarvinSketch

Press Ctrl+K, then select SMILES or InChI from the Copy As pop-up

OR

From the top menu, choose Edit > Copy As and select SMILES or InChI from the pop-up

OR

To copy as SMILES, press Ctrl+L

Finally, paste your SMILES or InChI into your document or spreadsheet.


The less time we have to spend re-drawing structures from PDFs, the more time we can devote to doing science. Luckily, it really couldn’t be quicker or easier to improve the discoverability and reusability of your article by including machine-readable structure files or identifiers. Let’s work together to make chemistry articles easier to find and use.

ChemSpider Mobile app

ChemSpider Mobile was an app developed by Molecular Materials Informatics Inc [1] on behalf of the Royal Society of Chemistry to allow users to explore the benefits of ChemSpider on mobile devices. Since its launch we have made improvements to ChemSpider.com, including responsive design elements to allow it to work better on smartphones and tablets [2] and upgrades to the ChemSpider web services [3] that power it. As a result of these developments we felt it was timely to review the community’s need for the app, and we have taken the decision to discontinue support for the services that power the app from 31st October. We would like to thank everyone who used and provided feedback on the app to aid its development, and we encourage you to switch to using ChemSpider.com for future mobile use.

1. http://molmatinf.com/

2. http://blogs.rsc.org/chemspider/2015/05/21/introduction-to-the-new-chemspider-website/

3. https://developer.rsc.org/

Chemical Validation and Standardization Platform (CVSP)

The Chemical Validation and Standardization Platform (CVSP) [1] was developed during the Open PHACTS IMI project [2] to process chemical structure files through tested validation and standardization protocols. The aim was to provide the community with rigorous analysis of their chemical structure files to ensure that data released into the public domain via online databases was pre-validated. The online CVSP site provided a useful means to test the rulesets and allow users to validate their structure files, but the standalone website was taken offline in November 2018. As a legacy, the codebase and ruleset have been developed further and applied to the ChemSpider deposition system at deposit.chemspider.com [3], and the community discussions around appropriate standardisation of chemical structure files continue. The original code is also available from GitHub [4].

  1. The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets, J. Cheminf., 2015, 7:30, https://doi.org/10.1186/s13321-015-0072-8
  2. https://www.openphacts.org
  3. https://deposit.chemspider.com/
  4. https://github.com/openphacts/ops-crs/tree/master/CVSP

ChemSpider Pre-Deposition Filters

Written by Mark Archibald.

In a previous post (Behind the Scenes at ChemSpider) we discussed some of the challenges in upholding data quality across one of the largest chemical databases in the world. We identified automated filtering as a key tool when dealing with far more records than a human could reasonably handle. In this post we’ll go into more detail about how that filtering works, what the challenges are, and the role played by human intervention.

To perform this filtering we use KNIME, an open-source data processing platform. The wide range of KNIME nodes developed by the active cheminformatics community allows us to ask chemistry-specific questions of the data we process. In simple terms, input chemical structures that match our criteria are passed on to the next node, while those that don’t are written out to an error file. After processing all structures, the result is a file of structures that have successfully passed through all the filters and several (usually smaller) files of structures rejected for various reasons.

Structures are filtered. Flagged structures are reviewed, and passed structures are added to ChemSpider.

It’s not possible to review all of the generated files in full, as this would eliminate the time-saving advantages of automated processing. However, output files of all types are spot checked for accuracy and to iteratively improve the filtering criteria. Certain output files have high potential for false positives and so we review them in full.
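The pass/reject routing described above can be sketched outside KNIME in a few lines of Python. This is only an illustration of the idea, not our workflow: the record layout, the two example checks and the bucket names are all hypothetical.

```python
def run_filters(records, filters):
    """Send each record to 'passed' or to the bucket of the first filter it fails."""
    passed = []
    rejected = {name: [] for name, _check in filters}
    for record in records:
        for name, check in filters:
            if not check(record):
                rejected[name].append(record)
                break
        else:
            # No filter rejected the record.
            passed.append(record)
    return passed, rejected


# Hypothetical checks on a record that stores its structure as SMILES.
def has_structure(record):
    return bool(record.get("smiles"))


def has_no_query_atom(record):
    return "*" not in record["smiles"]


FILTERS = [
    ("empty_structure", has_structure),
    ("query_atom", has_no_query_atom),
]

passed, rejected = run_filters(
    [{"smiles": "CCO"}, {"smiles": ""}, {"smiles": "C*"}], FILTERS
)
print(len(passed), {name: len(items) for name, items in rejected.items()})
```

Keeping one reject bucket per filter mirrors the separate error files, which makes it possible to spot-check some buckets and fully review others.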

Formats and identifiers

Submitted files can be in one of several different formats. The most common is SDF (structure data file, a chemical structure format containing multiple structures with associated data fields). The advantage of this format is that it contains 2- or 3-dimensional structures, so we can immediately start processing the file without having to convert an identifier to a structure. This means that the final structure we deposit is more likely to exactly match the original. The disadvantage of the SDF format is that it is specialised – many users will be unfamiliar with it or won’t have software to create and display the files.

We also receive different spreadsheet formats (Excel, CSV, TSV) with structures encoded in text-based notation systems like SMILES or InChI. The advantage of this format is that it doesn’t require specialised software (provided the submitter has SMILES or InChIs for the compounds). The disadvantage is that the structures require conversion to SDF before processing and deposition to ChemSpider. Additionally, these formats contain information about atoms and their connectivity but lack layout information. This can introduce errors as different structure drawing packages can parse these structures slightly differently, resulting in alterations to the final deposited structure.

Filtering criteria

The criteria by which we judge chemical structures are a mixture of definitive chemical rules and less well-defined ‘rules of thumb’ based on our experience and chemical knowledge. Examples of both follow.

Empty structures, query atoms and incorrect valences

The first filter is the simplest – ChemSpider is a structure-centric database, so it’s not possible to deposit any input entries that lack a structure.

Similarly, each ChemSpider record requires a single defined chemical structure, so we exclude anything using a query atom to represent a variable atom or attachment point.

Another simple filter is to exclude structures in which atoms have invalid valences.

Charge imbalance

In general, entries in ChemSpider should represent a real-world, isolable compound. This means that we filter out structures with a non-zero overall charge. However, we make exceptions for certain examples where a counterion is generally unimportant and it’s useful to consider the charged species alone, such as choline (ChemSpider record).
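One way to compute an overall charge, sketched here under stated assumptions, is to read it from a standard InChI: the /q layer lists component charges and the /p layer gives the net number of protons added or removed. The helper names are hypothetical, the parsing ignores non-standard InChI layers, and this is not necessarily how our filter does it.

```python
def _layer(inchi, prefix):
    """Return the body of the first InChI layer starting with the prefix."""
    for part in inchi.split("/")[1:]:
        if part.startswith(prefix):
            return part[len(prefix):]
    return None


def net_charge(inchi):
    """Sum the /q component charges and the /p proton balance."""
    total = 0
    charge_layer = _layer(inchi, "q")
    if charge_layer:
        for component in charge_layer.split(";"):
            if not component:
                continue  # neutral component
            count, _, charge = component.rpartition("*")
            total += int(count or 1) * int(charge)
    proton_layer = _layer(inchi, "p")
    if proton_layer:
        total += int(proton_layer)
    return total


CHOLINE = "InChI=1S/C5H14NO/c1-6(2,3)4-5-7/h7H,4-5H2,1-3H3/q+1"
SODIUM_ACETATE = "InChI=1S/C2H4O2.Na/c1-2(3)4;/h1H3,(H,3,4);/q;+1/p-1"

print(net_charge(CHOLINE), net_charge(SODIUM_ACETATE))
```

A non-zero result would mark the record for rejection unless it is on a list of accepted exceptions such as choline.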

Structures containing undefined stereocentres

Undefined stereocentres alone don’t represent a chemical error. However, structures like that shown below (cholesterol without any defined stereocentres) occur frequently and, although chemically valid, it’s extremely unlikely that they represent the intended structure.

Cholesterol skeleton with no defined stereochemistry

As a result we have a rule of thumb that excludes structures containing more than two undefined stereocentres. This is not a hard-and-fast rule, but rather an attempt to strike a balance between excluding structures like the one above and including structures where the undefined stereocentres are intentional and correct.

The count of undefined stereocentres (as determined by examining the InChI) sometimes includes cases where it is conventional to exclude stereochemical wedges. Examples include nucleic acids with no wedges on the phosphate and adamantyl groups without explicit stereochemistry – it’s unusual to draw these compounds with wedges, and users will rarely use wedges in their search. These potential false positives are filtered out and reviewed manually. A curator can then decide whether to include them in the deposition, improving the overall accuracy of the filter.

Structures containing many components

This is another rule of thumb – there’s no upper limit on how many separate components a correctly depicted chemical substance can have. However, from experience we find that excluding structures with more than four separate components removes most obviously nonsensical entries (e.g. attempts to depict alloys) while retaining the majority of correct entries.

When applying this rule, pharmaceutical molecules represent a major source of false positives because they are often multiple hydrates and/or salts with multiple counterions (e.g. Irinotecan hydrochloride trihydrate). Excluded structures that are hydrates or contain common pharmaceutical salts are flagged for human review.
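A minimal sketch of this rule, assuming the structure is available as SMILES (where a dot separates disconnected components), might look like the following. The set of reviewable components is a hypothetical placeholder for a proper list of water and common pharmaceutical counterions.

```python
MAX_COMPONENTS = 4

# Hypothetical SMILES for water and a few common counterions.
REVIEWABLE_COMPONENTS = {"O", "Cl", "[Cl-]", "[Na+]"}


def classify_by_components(smiles):
    """Return 'pass', 'review' or 'reject' based on the component count."""
    components = smiles.split(".")
    if len(components) <= MAX_COMPONENTS:
        return "pass"
    if any(component in REVIEWABLE_COMPONENTS for component in components):
        return "review"
    return "reject"


print(classify_by_components("CCN.Cl.O.O.O"))              # salt trihydrate
print(classify_by_components("[Fe].[Ni].[Cr].[Mo].[Mn]"))  # alloy-like entry
```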

Synonym filter

This filter compares the synonyms assigned to a given structure with its molecular formula and performs some ‘common sense’ checks. For example, a relatively frequent error is associating the name of a salt form (e.g., mozavaptan hydrochloride) with the structure of the free base (mozavaptan). In this case, the filter removes synonyms containing ‘hydrochloride’ because the molecular formula does not contain Cl.

SMARTS

SMARTS (Wikipedia page) is a way of describing general chemical structures. It’s based on SMILES, but has additional features allowing the specification of variable chain lengths, number of bonds, number of hydrogens, variable bond orders, or more than one potential element at a site.

We use SMARTS to identify common erroneous features in a structure. These include:

  • Azides and diazo groups depicted with a pentavalent nitrogen
  • A ‘floating’ alkane unconnected to the main structure (probably caused by an accidental click in a drawing program)
  • Metal carboxylates depicted as a protonated carboxylic acid with an elemental metal atom
  • Hexafluorophosphates (and similar species) depicted as phosphorus pentafluoride and a separate fluoride ion

SMIRKS

SMIRKS is a further extension of SMILES to depict reactions. We don’t use it to represent real reactions, but to define structural transformations – allowing us to fix simple structural errors that can be resolved by breaking and creating bonds.

One example is connecting charge-separated Grignard reagents to give a more accurate depiction:

Reconnecting disconnected Grignard reagents

Organometallics

The difficulties of encoding organometallic structures in machine-readable formats are well documented (J. Chem. Inf. Model. 51, 12, 3149-3157). There is an ongoing IUPAC project to extend the InChI’s functionality, but for now, the challenges remain.

Every ChemSpider record is fundamentally based on an InChI, and so we are bound by the current limitations. This means that we can’t depict coordination bonds or bonds with non-integer order – any bond drawn is interpreted as a standard covalent bond with one electron contributed by each atom.

Although we generally can’t represent organometallic structures in the manner a human chemist would prefer, we still attempt to choose the ‘least wrong’ structure from various possible compromises.

Ferrocene is a classic example of this problem and illustrates several of the issues we have to consider. A few common ways to draw ferrocene are shown below (there are many more).

Common depictions of ferrocene lose bonding information when converted to mol files

Converting ferrocene structures to mol format can introduce errors in molecular formula, bond orders or valence

 

Most of the structures shown take advantage of extended features of chemical drawing packages in order to represent ferrocene’s bonding in a way that’s attractive and easily understandable to a human chemist. Unfortunately, once transferred to the simplified but universal mol format, some of those features are lost, resulting in nonsensical structures. Although structure D is unchanged, this representation has other problems: incorrect valence on Fe and no representation of the aromaticity of the cyclopentadienyl ligands.

We have a limited number of ways in which we can depict ferrocene and related structures in ChemSpider, none of which give an accurate representation of the bonding or a view that would satisfy an inorganic chemist. However, we can choose the ‘least bad’ of the possible compromises and allow machine readability:

Fe2+ and (C5H5-)2

Our compromise

Although this structure (ChemSpider record) doesn’t capture the hapticity of ferrocene and the charge localisation on a single carbon is inaccurate, it retains correct overall charges and valences and doesn’t show the ligands as sigma-bonded.

More generally, we apply some rules and transformations to standardise representations of organometallic structures. Many of these rules involve choosing whether to depict a metal–carbon (or metal–heteroatom) bond as covalent or ionic, depending on the nature of the metal and the ligand. Again, compromises are necessary when working within the limitations of machine-readable structures, but we attempt to classify ‘more ionic’ and ‘more covalent’ bonds. Some examples follow:

  • Disconnect oxygen from group 1 and 2 metals
  • Connect oxygen to all other metals
  • Disconnect carbon from sodium, potassium and calcium
  • Connect carbon to group 11 and 12 metals, p-block metals and some metalloids

As expected, general rules like these fail in certain cases. Therefore we have additional, more specific rules to cover exceptions, which we iteratively refine.
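The four example rules above can be written as a small lookup. This is a sketch only: the membership of the p-block metal set is an assumption, the metalloids are left out because they are not listed here, and the more specific exception rules are not modelled.

```python
GROUP_1_AND_2 = {"Li", "Na", "K", "Rb", "Cs", "Fr",
                 "Be", "Mg", "Ca", "Sr", "Ba", "Ra"}
GROUP_11_AND_12 = {"Cu", "Ag", "Au", "Zn", "Cd", "Hg"}
# Assumed membership; which elements count as p-block metals is a choice.
P_BLOCK_METALS = {"Al", "Ga", "In", "Sn", "Tl", "Pb", "Bi"}


def bond_treatment(metal, ligand_atom):
    """Return 'disconnect', 'connect', or None if no example rule applies.

    The caller is assumed to pass a metal symbol as the first argument.
    """
    if ligand_atom == "O":
        return "disconnect" if metal in GROUP_1_AND_2 else "connect"
    if ligand_atom == "C":
        if metal in {"Na", "K", "Ca"}:
            return "disconnect"
        if metal in GROUP_11_AND_12 | P_BLOCK_METALS:
            return "connect"
    return None


print(bond_treatment("Na", "O"), bond_treatment("Zn", "C"))
```

Returning None for uncovered pairs leaves room for the more specific rules to decide those cases.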

But these errors still appear in ChemSpider!

At present the filtering described only applies to new data coming into ChemSpider. The full ChemSpider database, built up over many years, certainly contains examples of every error described here. To fix these legacy errors, we intend to run the entire database through the same quality filters. This is a significant task with some specific challenges: the files requiring human review become orders of magnitude larger, the processing time and memory/CPU overhead are high, and the larger the data set, the more likely we are to run into false positives. In order to manage these challenges, we are taking the time to refine our processes on new depositions, and periodically checking our progress by running subsets of the full ChemSpider database through our filters. We know you need access to data you can trust, so we want to make sure we get this right. We’ll continue to update you as this project progresses, so stay tuned!

Royal Society of Chemistry Renews Partnership with ACD/Labs to Continue Providing Industry-Leading Data to Worldwide Research Community

ACD/Labs algorithms will continue to equip ChemSpider with physicochemical property values and chemical nomenclature following ten-year milestone.

Toronto, CANADA (July 26, 2018) – ACD/Labs, an informatics company that develops and commercializes solutions in support of R&D, today announced its continued collaboration with ChemSpider, a leading chemical database owned by the Royal Society of Chemistry, to keep furnishing predicted physicochemical properties and chemical nomenclature to the ever-expanding platform. For over ten years, scientists have accessed this free, publicly available resource to gather information on chemical compounds in preparation for research or experimentation.

As the industry standard for physicochemical prediction software, ACD/Labs was chosen to generate property information including logP, logD (at various pHs), Lipinski rule-of-5 values, and boiling point, and to provide Name-to-structure (and vice-versa) capabilities. The renewal of the partnership further reflects the success of the platform and its continued importance as one of the most robust online chemical structure databases for the scientific community. As the platform advances, ChemSpider will continue to use ACD/Labs algorithms to provide quality insights to researchers.

“We set out with the mission of empowering researchers with a comprehensive view of chemical data to inform R&D initiatives,” said Richard Kidd, Publisher, Royal Society of Chemistry. “By working with ACD/Labs and utilizing its property information, we’ve been able to meet our users’ need for knowledge, which is reflected in our rapid growth since the Royal Society of Chemistry acquired ChemSpider ten years ago. To date, property information populated by ACD/Labs’ algorithms has been among the most accessed on ChemSpider, and remains a key driver in our service.”

While ChemSpider has doubled the size of its database, it has remained committed to maintaining high-quality data from selective sources. As the platform continues to grow, ChemSpider will use ACD/Percepta prediction algorithms and ACD/Name tools in a batch-wise fashion to populate the database and enhance publicly available chemical intelligence.

“Enabling the dissemination of chemical knowledge and providing solutions to accelerate R&D are among our top priorities at ACD/Labs,” said Gabriela Cimpan, Senior Director Sales, Europe, ACD/Labs. “ChemSpider is empowering knowledge throughout the chemical community and we feel privileged to be able to support learning worldwide.”

For more information on ACD/Percepta, visit https://www.acdlabs.com/percepta

For more information on ACD/Labs Chemical Nomenclature tools, visit https://www.acdlabs.com/name

For more information on ChemSpider, visit http://www.chemspider.com

About Advanced Chemistry Development, Inc.

ACD/Labs is a leading provider of scientific informatics technologies to R&D organizations that rely on analytical data and molecular information for decision-making, problem-solving, and product lifecycle control. Our software automates and accelerates molecular characterization, product development, and knowledge management. We integrate with existing informatics systems and undertake custom projects including enterprise-level automation.

ACD/Labs solutions are used globally in a variety of industries including pharma/biotech, chemicals, consumer goods, agrochemicals, petrochemicals, and academic/government institutions. We provide worldwide sales and support, and more than 20 years of experience and success helping organizations accelerate R&D and leverage corporate intelligence. For more information, please visit www.acdlabs.com. Follow us on Twitter @ACDLabs.

About the Royal Society of Chemistry

The Royal Society of Chemistry is the world’s leading chemistry community, advancing excellence in the chemical sciences. With over 50,000 members and a knowledge business that spans the globe, we are the UK’s professional body for chemical scientists; a not-for-profit organisation with 175 years of history and an international vision for the future. We promote, support and celebrate chemistry. We work to shape the future of the chemical sciences – for the benefit of science and humanity.