Royal Society of Chemistry Renews Partnership with ACD/Labs to Continue Providing Industry-Leading Data to Worldwide Research Community

ACD/Labs algorithms will continue to equip ChemSpider with physicochemical property values and chemical nomenclature following a ten-year milestone.

Toronto, CANADA (July 26, 2018) – ACD/Labs, an informatics company that develops and commercializes solutions in support of R&D, today announced its continued collaboration with ChemSpider, a leading chemical database owned by the Royal Society of Chemistry, to keep furnishing predicted physicochemical properties and chemical nomenclature to the ever-expanding platform. For over ten years, scientists have accessed this publicly available free resource to gather information on chemical compounds in preparation for research or experimentation.

As the industry standard for physicochemical prediction software, ACD/Labs was chosen to generate property information including logP, logD (at various pHs), Lipinski rule-of-5 values, and boiling point, and to provide Name-to-structure (and vice-versa) capabilities. The renewal of the partnership further reflects the success of the platform and its continued importance as one of the most robust online chemical structure databases for the scientific community. As the platform advances, ChemSpider will continue to use ACD/Labs algorithms to provide quality insights to researchers.

“We set out with the mission of empowering researchers with a comprehensive view of chemical data to inform R&D initiatives,” said Richard Kidd, Publisher, Royal Society of Chemistry. “By working with ACD/Labs and utilizing its property information, we’ve been able to meet our users’ need for knowledge, which is reflected in our rapid growth since the Royal Society of Chemistry acquired ChemSpider ten years ago. To date, property information populated by ACD/Labs’ algorithms has been among the most accessed on ChemSpider, and remains a key driver in our service.”

While ChemSpider has doubled the size of its database, it has remained committed to maintaining high quality data from selective sources. As the platform continues to grow, ChemSpider will use ACD/Percepta prediction algorithms and ACD/Name tools in a batch-wise fashion to populate the database and enhance publicly available chemical intelligence.

“Enabling the dissemination of chemical knowledge and providing solutions to accelerate R&D are among our top priorities at ACD/Labs,” said Gabriela Cimpan, Senior Director Sales, Europe, ACD/Labs. “ChemSpider is empowering knowledge throughout the chemical community and we feel privileged to be able to support learning worldwide.”

For more information on ACD/Percepta, visit https://www.acdlabs.com/percepta

For more information on ACD/Labs Chemical Nomenclature tools, visit https://www.acdlabs.com/name

For more information on ChemSpider, visit http://www.chemspider.com

About Advanced Chemistry Development, Inc.

ACD/Labs is a leading provider of scientific informatics technologies to R&D organizations that rely on analytical data and molecular information for decision-making, problem-solving, and product lifecycle control. Our software automates and accelerates molecular characterization, product development, and knowledge management. We integrate with existing informatics systems and undertake custom projects including enterprise-level automation.

ACD/Labs solutions are used globally in a variety of industries including pharma/biotech, chemicals, consumer goods, agrochemicals, petrochemicals, and academic/government institutions. We provide worldwide sales and support, and more than 20 years of experience and success helping organizations accelerate R&D and leverage corporate intelligence. For more information, please visit www.acdlabs.com. Follow us on Twitter @ACDLabs.

About the Royal Society of Chemistry

The Royal Society of Chemistry is the world’s leading chemistry community, advancing excellence in the chemical sciences. With over 50,000 members and a knowledge business that spans the globe, we are the UK’s professional body for chemical scientists; a not-for-profit organisation with 175 years of history and an international vision for the future. We promote, support and celebrate chemistry. We work to shape the future of the chemical sciences – for the benefit of science and humanity.

Behind the Scenes at ChemSpider

A peek at who we are, how we run the site, and how we manage data quality.

What is ChemSpider and who runs the service?

ChemSpider is one of the largest chemical databases in the world, containing data on over 65 million chemical structures. This data is freely available to the public at ChemSpider.com, a website published by the Royal Society of Chemistry.

How does the Royal Society of Chemistry support ChemSpider?

ChemSpider.com is an independent service that does not rely on direct or research grant funding. The Royal Society of Chemistry supports the website using the surplus generated by our publishing activities, allowing us to provide a sustainable and reliable service. We also generate revenue from advertising and by providing paid for web services, such as our APIs, for non-academic users. These activities help keep ChemSpider financially sustainable and help support our server costs, staff hours and development.

These services enable us to make the site available free to anyone in the world, and we reached over six million unique users in 2017. These users range from school students looking for help with their homework, to researchers working in academia and industry, to general users who want to keep their chemical knowledge up to date. They come from every continent except Antarctica, and just about every country on Earth.

What goes into ChemSpider?

ChemSpider data comes from the chemical sciences community itself – submitted by researchers, databases, publishers, chemical vendors and many more.

We have two main inclusion criteria for ChemSpider data:

  1. Machine readability – Depositors must provide structures in a machine-readable format, typically a .mol file that is interpretable by InChI – the open-source chemical structure representation algorithm. The .mol format describes how a compound is arranged, atom-by-atom and bond-by-bond. This means that it can only accurately depict small molecules with defined structures. For ChemSpider, “small” means structures up to 4000 daltons, including short peptides, oligonucleotides, and other structures. Large proteins, extended crystal lattices or long nucleotides are too big to describe sensibly in ChemSpider, but are available from other databases suited for larger molecules.

    We also only accept ‘defined structures’ – compounds with exact chain lengths, fully expressed functional groups, and integer bond orders – due to the requirement to describe every heavy atom in a molecule. This means we can only accept structures for which we can generate a valid InChI.

    Most ChemSpider structures are organic molecules. However, we do accept some inorganic and organometallic compounds, with specific methods for curating these.

  2. Real compounds – We do not accept virtual or prophetic compounds.

As far as possible, we only accept compounds that have been synthesised or isolated in physical form. This means we do not accept transition states, theoretically predicted compounds, virtual compounds from vendors or prophetic compounds from patents.

Who are our data sources?

We have received data from almost 250 unique data sources, including data from chemical vendors, specialist databases, individuals, research groups and publishers. These sources cross the breadth of the chemical sciences – including biochemistry, pharmacology and toxicology, natural products, spectroscopy and crystallography. Each ChemSpider record includes links to all of the data sources for the compound, enabling users to find and to check the provenance of the data.

Our data source list is continually changing, as we find new sources of data to add and remove outdated or low-quality data sources.

We no longer accept data from other data aggregators. We have taken this step to match our quality requirements with other databases and reduce the propagation of algorithmically generated errors that can arise from prophetic sources. One example of this is Chessboardane, which originated from an optical structure recognition program interpreting a data table contained within a patent as a chemical structure. The result was an 81-carbon grid structure, erroneously identified as a complex cyclic alkane, which was deposited in a public repository and shared between multiple aggregators.

Because of this, we only seek data directly from the original sources, where we have greater certainty about the data’s provenance and accuracy, and are working to curate legacy data still within ChemSpider.

Because of examples like Chessboardane, we are cautious about accepting data from text-and-data-mined sources that depositors have programmatically extracted from text or encoded images in patents or scientific literature. After review, we have added some of the highest quality data mined sources. We will continue to review potential new data-mined sources on a case-by-case basis to ensure that their data meet our quality standards.

Automated filters

A manual check of every one of the 65 million records in ChemSpider would take an individual more than 600 years to complete working round the clock – even if we only invested five minutes of curation time per record.
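As a rough check of that estimate, the arithmetic can be written out. This is only a back-of-the-envelope sketch: the 365.25-day year and the variable names are our own assumptions, not part of any ChemSpider tooling.

```python
# Back-of-the-envelope check of the manual curation estimate.
RECORDS = 65_000_000            # records in ChemSpider, as quoted above
MINUTES_PER_RECORD = 5          # curation time per record, as quoted above
MINUTES_PER_YEAR = 60 * 24 * 365.25  # assumption: round-the-clock work, 365.25-day year

years = RECORDS * MINUTES_PER_RECORD / MINUTES_PER_YEAR
print(f"{years:.0f} years")     # prints "618 years"
```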

Instead, we run each deposition through a series of automated filters to pick out unsuitable structures, such as those with incorrect valences, unbalanced charges, or missing stereochemistry. In addition to structure filters, we also apply basic name and synonym filtering and regularly review the processed files so that we can improve our filters.

We have provided a simplified overview of this process below, and will provide a more detailed description of our filters in a separate blog post:

Structures are run through filters in KNIME. Those that fail the filters are removed and reviewed. Passed structures are deposited to ChemSpider
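To make the flow in the overview concrete, here is a minimal sketch of a filter pass in Python. It is not the KNIME workflow itself: the Atom and Record classes, the tiny ALLOWED_VALENCES table and the two checks are hypothetical stand-ins chosen for illustration, and a real pipeline would work on full structures rather than on per-atom summaries.

```python
from dataclasses import dataclass

# Hypothetical, deliberately tiny valence table keyed by (element, formal charge).
# Atoms that are not in the table are flagged so that a person reviews them.
ALLOWED_VALENCES = {
    ("H", 0): {1},
    ("C", 0): {4},
    ("N", 0): {3},
    ("N", 1): {4},
    ("O", 0): {2},
    ("O", -1): {1},
}


@dataclass(frozen=True)
class Atom:
    element: str
    charge: int
    bond_order_sum: int  # total order of all bonds to this atom, hydrogens included


@dataclass(frozen=True)
class Record:
    name: str
    atoms: tuple


def bad_valence(record):
    """True if any atom has a valence the table does not allow."""
    return any(
        a.bond_order_sum not in ALLOWED_VALENCES.get((a.element, a.charge), set())
        for a in record.atoms
    )


def unbalanced_charge(record):
    """True if the formal charges do not sum to zero."""
    return sum(a.charge for a in record.atoms) != 0


FILTERS = (
    ("incorrect valence", bad_valence),
    ("unbalanced charge", unbalanced_charge),
)


def run_filters(records):
    """Split records into names that pass and a {name: reasons} map for review."""
    passed, for_review = [], {}
    for rec in records:
        reasons = [label for label, check in FILTERS if check(rec)]
        if reasons:
            for_review[rec.name] = reasons
        else:
            passed.append(rec.name)
    return passed, for_review


hydrogens = lambda n: (Atom("H", 0, 1),) * n
methane = Record("methane", (Atom("C", 0, 4),) + hydrogens(4))
ammonium = Record("ammonium", (Atom("N", 1, 4),) + hydrogens(4))
passed, for_review = run_filters([methane, ammonium])
```

In this sketch methane passes, while the lone ammonium ion is held back for review because its charges do not balance.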

Curation by ChemSpider staff

ChemSpider is run by a small team of full-time curators, who work to add new compounds, remove errors, and respond to user feedback. Our staff have extensive experience of both chemical data and practical chemistry, with backgrounds in fields such as organic synthesis and art conservation, and a wealth of experience working on other Royal Society of Chemistry databases, such as The Merck Index* Online and Analytical Abstracts.

Community curation

Because we cannot review every record ourselves, we really appreciate comments or corrections from our users. The easiest way to help us improve ChemSpider is to leave feedback or email us when you spot an error. We try to act on user feedback within a few days – sooner for simpler queries. Please let us know if you find an error by leaving a comment on the relevant ChemSpider record, or by emailing us (chemspider@rsc.org).

Users wishing to get more involved can directly deposit structures and curate synonyms related to their research or work, without having to email the ChemSpider team.

We are extremely grateful for all the contributions our community curators have made over the years.

Keep using and contributing to ChemSpider

To access information on over 65 million chemical structures, go to ChemSpider.com, which is fully searchable by structure, name, or advanced query, from any device, anywhere, for free.

To deposit data, tell us about an error, become a curator, or for any other query, please do not hesitate to email us at chemspider@rsc.org

*The name THE MERCK INDEX is owned by Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc., Whitehouse Station, N.J., U.S.A., and is licensed to The Royal Society of Chemistry for use in the U.S.A. and Canada.

Introduction to the new ChemSpider website

Blog post written by David Sharpe.

The ChemSpider team at the Royal Society of Chemistry is proud to announce that our new look ChemSpider website has been launched. As discussed in our last post, one of the key aims of this new design is to make ChemSpider work on as many devices as possible (from desktops to mobile phones).

ChemSpider home page

The ChemSpider homepage as it might appear on a desktop computer (left) and a mobile phone (right)

 

As the screenshots above illustrate, the difference in size, shape and the method of interacting with the page means the view of the website that you need is very different between devices. The nature of a responsive website design also means that some of the screenshots that we provide might be a little different from the view that you see when accessing the site; however, the variances should be clear. We hope this results in an experience where usability and readability are not sacrificed for functionality.

What has changed? … and what has stayed the same?

To start with the things that have stayed the same: ChemSpider is still based on the same quality data and provides mechanisms for users to supply and curate data. We also haven’t changed how the search queries work, so searches that you ran previously should still return the same results.

The key changes

 

1. The new page header

We’ve moved all of the old menu items into a bar at the very top of every page (1). We also display a search bar just above the main page content (2). On smaller displays you will see icons for the Quick search box, Sign In and Help items; all other options can be found under the ‘hamburger’ symbol (3).

Comparison of the ChemSpider page header on large and small screens

2. Shorter record pages

One of the biggest challenges of making ChemSpider work on a mobile is how to display all of the information that we have on a much smaller screen. I think that our solution will actually make ChemSpider better for everyone – regardless of how they view the site.

Previously, a ChemSpider record was one big long page that had basic details about the chemical structure at the top of the page, followed by a number of infoboxes that could be opened or closed and also re-ordered. This worked fine in most cases but led to situations where you had to do lots of scrolling up and down, and might not be able to spot the infobox that you were looking for. Now, we still show some information about the chemical structure at the top of the page (1) but, below that there is a single pane (2) which contains tabs (3) that allow you to select the section of the record that you wish to display. This means that it is always easy to look at some information and see the structure to which it relates.

ChemSpider record layout

The new page layout consists of a Compound header (1) and a Pane (2) displaying the contents of the infotabs (3)

3. No Java, No worries

Many browsers no longer support Java applets. Good Java-free versions of chemistry tools have really started to take off in the last 18 months and the time was right to start the switch over. This means that the site now incorporates JSmol – enabling 3D structure view, CIF viewer and NMR/IR/MS spectra display – as well as Ketcher and Elemental for structure input/editing.

 

4. Structure searches simplified

Previously, creating a structure search was a bit of a pain as you had to: open the structure editor in a pop-up, draw your structure, and then save it back into the searches pages – now our structure editors are embedded into the interface, cutting down the number of steps needed to get to your results and making it easier to tweak searches.

One particularly useful feature for anyone accessing the site on a tablet or mobile phone is the Convert Structure tab, which can be used to load in a complex structure as a basis for a search. For instance, using “dibenzylamine” in the structure conversion gives a structure that can be quickly elaborated to the Simpkins’ chiral base precursor amine shown in the screenshot.

What’s next?

Hold on a moment there! We’ve only just got all of these great features into the site! I’m joking, but we will be spending time tweaking and perfecting the new design. We will then be able to focus on further development. If I were to speculate, I’d suggest that we will look at more (non-Java) tools that can be incorporated into the site to give a better experience, and new methods of improving the quality of data in our records.

In the meantime, please explore the site and do email us at chemspider-at-rsc.org to let us know what you think of the new site.

What’s new with ChemSpider?

Blog post written by David Sharpe.

Subscribers to this blog might have noticed that we’ve been a bit quiet of late. I want to assure you that this doesn’t mean that we have been resting on our laurels. In fact we have been working on a whole host of improvements to ChemSpider – improving our infrastructure, developing ways to increase data quality and designing a new layout for our records.

We will discuss both the data quality work and the website redesign work in more detail in separate posts but ahead of the release of the new website design I want to provide some insight into what to expect when the changes go live.

Why are we changing the site now?

Well there are quite a few reasons:

  1. Primarily, we need to have a site that meets the standards of the modern internet. This means that the site needs to be usable not only on a desktop computer but also on a tablet or a mobile phone. This is often referred to as responsive web design.
  2. ChemSpider has always had records that are full of lots of rich and varied types of information – which poses a challenge when it comes to presenting that information so that it is discoverable and easy to understand once found. We hope that the new layout will present data in an intuitive and clear way that will provide a better experience for everyone.
  3. We need to move away from technologies that are not supported by the widely used browsers. Java-based tools have been an issue for users on certain platforms for a while and this is only going to get worse. For a long time we have provided non-Java structure editors alongside the Java tools (the current version of the site incorporates Elemental and Ketcher for structure drawing). This release will see the adoption of JSmol to enable 3D structure view and Spectra display widgets for devices that don’t support Java. At this time we are providing both Java and non-Java solutions but expect to phase out Java applets in the near future.
  4. We want to improve the integration of ChemSpider with the wider Royal Society of Chemistry web family.

 

Will there be any more changes to how the site works?

There will certainly be some changes to some aspects of the site in response to user feedback and bug fixes. We also want to look at how we can make more complex interfaces such as Advanced Search more usable, but we hope that there won’t be any major changes to the site.

Will all of the features that you use still be accessible?

In the main, the answer to this is: yes! It might be that they now appear slightly differently or are accessible through a different interface. There are 2 caveats:

  1. When accessing the site on mobile devices

    The layout of a page on the smaller screens and tablets often needs to be different – wherever possible this is achieved by rearranging the elements of the page and adding new controls. But for some parts of the ChemSpider interface we realised that there wasn’t a good way to display all of the data and the only solution was not to show that part of the page on these smaller screens.

  2. Removed features

    There are a couple of features (such as the Print button) which we felt were no longer relevant in the new design or need to be redesigned to make them more usable.

When will the new site be launched?

We hope that the new site will be ready to release within the next week.

How will the changes affect you?

We hope that the transition will be smooth for everyone. Once the new design goes live you might need to refresh/clear your Browser Cache. The new design does require a modern browser with a good support of the HTML 5 specification. We will try to ensure that the site is usable on as wide a range of browsers and platforms as possible but expect that the site will not work well in older browsers such as IE7.

Will it still be possible to access the site using the old interface?

Unfortunately, the old interface will not be available alongside the new one.

How will you be able to provide feedback on the new design?

The best way to provide feedback will be to email us at chemspider-at-rsc.org

Keep an eye out for the new design – when it is made live we will write a blog post about the changes.

Adding RSC CIFs to ChemSpider

Written by Aileen Day.

We are pleased to announce that we have just imported into ChemSpider 1047 CIFs of crystal structures that were previously reported in RSC papers (and are available as ESI for those papers), attached them to the relevant compounds, and linked them back to the original articles and to the CCDC’s webCSD, e.g. example compound with RSC article CIF (see the CIF infobox). Since each CIF that is uploaded into ChemSpider must be associated with a ChemSpider compound, the difficult part of this task was working out a 2D molecular structure (in .mol file format) for each 3D crystal structure (in .cif file format). This is particularly difficult because CIFs only contain information about each atomic position and not how the atoms are bonded to each other in the crystal or whether they are charged or not.
Ultimately we would like this CIF to mol conversion (and the whole upload) to be performed programmatically without human intervention. However, there is no reliable way to do that currently – although programs such as OpenBabel can be used to extract mols from each CIF, the reliability of this conversion isn’t 100%.
So as one of our student intern projects at the University of Southampton this summer (in parallel with another student intern project at Southampton University to share thesis data in ChemSpider) we used OpenBabel (version 2.3.2, run from the command line with the options -i cif inputfilename.txt -o mol -m --unique -d --AddPolarH) to extract mols for all the CIFs in the RSC archive (over 43,000 files as of June 2013). We enlisted Julija Kezina (shown below) to review the results of these conversions to ensure that only good structure and CIF pairs would be deposited to ChemSpider, and to better understand the problems in the conversion process with a view to fixing them. One problem that became immediately apparent was that the 2D structure obtained was just a projection of the 3D structure along the a cell axis, which is not always the orientation that shows the molecule most clearly, even when the structure had the right chemical connections between the atoms. All mol structures were therefore run through OpenEye’s cleaning algorithm before being reviewed.

Julija Kezina - Southampton University intern who examined CIF to Mol conversion


Julija compared each structure in the output mol files with those in the original CIF files to judge whether the conversion was accurate or not. In addition, as an extra check, all of the output mol structures were submitted to the ChemSpider validation and standardisation platform to filter out molecules with structural problems (e.g. stereochemistry, valence or congestion issues).
Overall, approximately 30% of the CIF to mol conversions that Julija checked were good, with the right connectivity of atoms and ions (although approximately 30% of these needed the atomic positions to be repositioned to clean or tidy up the structure, either manually or using ChemDraw’s cleaning functionality). The 1047 of these mols which contain only a single molecule (without solvent molecules or cocrystals etc.) are those which have been deposited into ChemSpider with their corresponding CIFs.
The journals which had the highest successful conversion percentage were Molecular BioSystems (57%), MedChemComm (51%), Organic and Biomolecular Chemistry (44%) and Green Chemistry (44%) – the journals which in general are about small organic molecules.
Julija was working in the National Crystallography Service’s office at the University of Southampton, under the co-supervision of Professor Simon Coles, and we are grateful to them for their help and advice about the finer points of the CIF file format.

Unsuccessful CIF to mol conversions

Running and evaluating OpenBabel on such a large and varied set of structures has given us a useful opportunity to identify and categorise the most common problems encountered. Here we share these and give examples that would enable the identification of some easy fixes in the pipeline that might benefit the whole community and be used as test cases when doing so. We will report these bugs to the OpenBabel forum and because OpenBabel is open source, hope to resolve at least some of these issues in the future through collaboration with its other developers.

The following OpenBabel bugs look like they might be most straightforward to fix:

Details Example
  • Category: BAD_NITRO
  • Frequency: 233
  • Description: there are different ways of representing nitro groups in structure drawers – OpenBabel currently does so by producing a mol with a pentavalent nitrogen. In ChemSpider we choose to avoid this in favour of a format with a charge-separated nitro.
  • Solution: Allow OpenBabel to have a different output option for nitro groups to output them as shown in corrected mol file.

  • Category: BAD_MULT
  • Frequency: 434
  • Description: Duplicate (exactly identical, including stereochemistry) molecules are present in the resulting mol file despite running OpenBabel with the --unique option (which should filter out duplicate molecules based on their InChIs)
  • Solution: Fix OpenBabel when run with the --unique option so that it works.

  • Category: BAD_MISSINGPARTOFMOLECULE
  • Frequency: 724
  • Description: Part of the molecule is missing
  • Cause: OpenBabel doesn’t understand crystal symmetry – only the atoms in the CIF that are explicitly listed with positions are included in the resulting mol file, and those that are inferred by symmetry are not.
  • Solution: Make OpenBabel generate the full molecule from the symmetry in the CIF file, or recommend that a script/program that can process a CIF to generate another CIF with all atoms is run before OpenBabel.

  • Category: BAD_PARTIALOCCUPANCY
  • Frequency: 432
  • Description: partial occupancy of multiple sites for a particular atom in the CIF file
  • Cause: In CIF files sometimes positions of multiple sites are specified with occupancy less than one – OpenBabel doesn’t recognise this and assumes that the occupancy of all sites is one effectively, so that there are duplicates of some atoms or fragments in the mol file.
  • Solution: Where the _atom_site_occupancy is less than one, group together atoms into those which are alternatives of each other (by type, proximity, and those which add up to a total occupancy of 1) and choose only one of them to include in the final mol file (that with the highest site occupancy, or if two have equal occupancies of e.g. 0.5 then pick one at random). Note that there needs to be consistency, so that if for example a C is discarded, then all of the adjoining H’s with partial occupancy are also discarded but those bonded to the C that is included are included (as in the attached example).
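The partial occupancy fix proposed above could look roughly like the following Python sketch. It is not OpenBabel code. It assumes the hard part (grouping sites into sets of mutually exclusive alternatives by type, proximity and summed occupancy) has already been done, and the Site class, its parent field and the resolve_disorder name are all hypothetical. Where the text suggests picking at random between equal occupancies, the sketch instead keeps the first site listed so that the result is reproducible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Site:
    label: str                    # atom label from the CIF, e.g. "C1A"
    element: str
    occupancy: float              # value of _atom_site_occupancy
    parent: Optional[str] = None  # hypothetical: label of the atom a hydrogen is bonded to


def resolve_disorder(sites, alternatives):
    """Keep one site per group of alternatives, plus everything attached to it.

    `alternatives` is a list of lists of labels; each inner list holds sites
    that are alternative positions of the same atom.
    """
    by_label = {s.label: s for s in sites}
    discarded = set()
    for group in alternatives:
        # max() returns the first maximal item, so ties go to the earliest label.
        keep = max(group, key=lambda label: by_label[label].occupancy)
        discarded.update(label for label in group if label != keep)

    kept = []
    for s in sites:
        if s.label in discarded:
            continue
        # Consistency: a hydrogen follows the fate of the atom it is bonded to.
        if s.parent is not None and s.parent in discarded:
            continue
        kept.append(s)
    return kept


sites = [
    Site("O1", "O", 1.0),
    Site("C1A", "C", 0.6),
    Site("C1B", "C", 0.4),
    Site("H1A", "H", 0.6, parent="C1A"),
    Site("H1B", "H", 0.4, parent="C1B"),
]
kept_labels = [s.label for s in resolve_disorder(sites, [["C1A", "C1B"]])]
```

Here the 0.6-occupancy carbon and its hydrogen survive, and the 0.4-occupancy carbon is dropped together with the hydrogen bonded to it.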

Many of the problems were caused by idiosyncrasies or errors in the input CIFs. On the whole OpenBabel did not handle these well (e.g. by writing an error message and terminating the program); in the majority of cases it went into an infinite loop and the program hung. Because of this, and because the OpenBabel conversions were part of a longer script, all OpenBabel jobs had to be run with an arbitrary timeout so that any job still running after the timeout was killed, which may have discarded some valid but long-running OpenBabel jobs. We will investigate whether there is a validation program that can be run automatically on CIFs to filter out ones with these problems (similar to the CCDC’s EnCIFer but which can be run programmatically), but it would be relatively straightforward to make OpenBabel more reliable by having it exit cleanly when it encounters these problems, so that pre-validation would not be necessary. These problems are listed in the table below:

Details Example
  • Category: CIF_NOCOORDINATES
  • Frequency: 378
  • Description: cif doesn’t contain any coordinates
  • Cause: Some CIFs contain e.g. powder diffraction refinement data and don’t contain coordinates.
  • Solution: OpenBabel already issues an error: “CIF Error: no atom found ! (in data block:XXX)” – simply abort the program if this is found (rather than trying to continue).

  • Category: CIF_MISSINGLOOP
  • Frequency: 85
  • Description: cif misses a “loop_” line
  • Solution: Do an initial check that there is at least one loop_ line in the expected place before attempting to do the conversion.

  • Category: CIF_COMMENTEDFIELD
  • Frequency: 36
  • Description: if there is a CIF field name in a commented section of the CIF, OpenBabel doesn’t ignore it and goes into an infinite loop
  • Solution: It would be trivial to make sure that OpenBabel ignores CIF field names which are commented out (between a pair of semicolons).
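Until fixes like these exist, a wrapper script can guard against the two commonest input problems and against hung jobs. The sketch below is only an illustration of that idea in Python, not the script we used: precheck_cif and run_with_timeout are hypothetical names, the pre-check is a crude text scan rather than a real CIF parser, and the command to run is left to the caller rather than spelling out an OpenBabel invocation.

```python
import subprocess

# Standard CIF data names for atom coordinates (compared case-insensitively).
COORDINATE_TAGS = ("_atom_site_fract_x", "_atom_site_cartn_x")


def precheck_cif(text):
    """Return the problem categories found by a crude scan of the CIF text."""
    lines = [line.strip().lower() for line in text.splitlines()]
    problems = []
    if not any(line.startswith("loop_") for line in lines):
        problems.append("CIF_MISSINGLOOP")
    if not any(line.startswith(COORDINATE_TAGS) for line in lines):
        problems.append("CIF_NOCOORDINATES")
    return problems


def run_with_timeout(cmd, timeout_s):
    """Run a command, killing it if it is still running after timeout_s seconds."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return "timeout"
    return "ok" if done.returncode == 0 else "failed"
```

A caller would skip any file for which precheck_cif returns a non-empty list and pass the conversion command for the remaining files to run_with_timeout, recording the "timeout" cases so that valid but slow jobs can be retried with a longer limit.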

The following OpenBabel bugs were the most frequent in occurrence, but will be difficult to fix. They arise from the problem that the CIF format does not record charges on atoms/ions or the types of bond between them, so OpenBabel needs to work them out, which is hard to do correctly.

Details Example
  • Category: BAD_CHARGEMISSING
  • Frequency: 830
  • Description: One or more ions in the molecule have the wrong charge on them in the resulting mol file

  • Category: BAD_WRONGCOORDINATION
  • Frequency: 747
  • Description: One or more atoms or ions in the molecule have the wrong coordination – problem observed in metal ions, S, P, Se and B

  • Category: BAD_BONDMISSING
  • Frequency: 587
  • Description: One or more of the bonds in the molecule are of the wrong order e.g. a single bond instead of a double bond.

  • Category: BAD_WRONGBOND
  • Frequency: 452
  • Description: Wrong sequence of single/double bonds.

  • Category: BAD_NOCOORDL
  • Frequency: 52
  • Description: no coordination to a ligand.

  • Category: BAD_MISSINGH
  • Frequency: 18
  • Description: missing hydrogen.

There were also some problem mol files produced which either won’t be able to be fixed by OpenBabel (since they resulted from either errors or limitations of the input CIF files which cannot be fixed retrospectively) or are too difficult to fix and/or too infrequently occurring to be worth the effort:

    • There were 237 cases where there were solvent molecules in the CIF (many of which have missing hydrogens, partial occupancy of the molecule or part of the molecule etc.) which give rise to spurious oxygens, fragments of molecules and radicals in the resulting mol file (see CIF: CCDC 213787 and ChemSpider record: 68005706). 148 of these cases are just water solvent molecules either with missing or detached hydrogen atoms. The poor definition of the solvent molecules is a limitation of CIF files from diffraction, so it is not possible for OpenBabel to better define them in the output mol that is derived from them. However, running OpenBabel with the -r option to remove all but the largest contiguous fragment was quite successful at removing these problem solvent molecules, so no further action is required to deal with this problem and we will use this option in the future.
    • There were 81 cases where there was at least one missing hydrogen in the original CIF (or in 3 cases, all hydrogens missing) – see CCDC 259871.
    • Some CIFs contain crystal structures which correspond to continuous networks rather than small molecules (e.g. polymers, MOFs, zeolites, POMs) which cannot meaningfully be captured in mol format – see CCDC 206593.
    • There were a few (24) cases where the stereochemistry in the mol file obtained is incorrectly defined. However, because on the whole the stereochemistry was well interpreted by OpenBabel and these cases were relatively few, it probably isn’t worth upsetting the apple cart to investigate these further – see CCDC 238611 and ChemSpider 9419187.

More hexagons in the plane

Written by Colin Batchelor.

Recently I heard of someone who cycled the 1400 km from John O’Groats to Land’s End, with a headwind all the way, because it looked on the map as if it was downhill and hence easier. (I am grateful to Neil Swainston of the University of Manchester for this anecdote.)

You might think that “down” on the page is unlikely to be “down” in 3D space, but there is an interesting exception to this, at least for certain interpretations of “down”.

Some time ago I gave a teaser of my Sheffield talk, which is now online here and here. The mathematical meat of the talk was about redrawing sugar rings in small molecules so that they can be properly indexed by cheminformatics systems. The teaser showed a classification of hexagons so we can tell which rules to apply.

It turns out that the rule is simple for the hexagons we see most in practice, which are chair hexagons and Haworth hexagons, at least when the hexagon itself has its long axis roughly horizontal on the page. If a bond points “down” on the page, then when we redraw the hexagon as viewed from “above”, the bond will still be pointing down and needs to be redrawn with a dashed bond. The same applies, mutatis mutandis, for the bonds pointing “up”.
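The rule above is simple enough to sketch in a few lines of Python. This is only an illustration, not the code behind the talk: the function name and the tolerance are made up for the example, the coordinates are assumed to have y increasing up the page (as in mol files), and reading “up” as a solid wedge is my interpretation of the mutatis mutandis. It also assumes that the hexagon has already been classified as a chair or Haworth hexagon with a roughly horizontal long axis.

```python
def stereo_bond_for_substituent(ring_atom_xy, substituent_xy, tolerance=1e-6):
    """Pick the bond style for a ring substituent when the ring is redrawn flat.

    Assumes y increases up the page. A substituent drawn below its ring atom
    gets a dashed bond; one drawn above it gets a wedge (interpretation).
    """
    dy = substituent_xy[1] - ring_atom_xy[1]
    if dy > tolerance:
        return "wedge"   # points "up" on the page
    if dy < -tolerance:
        return "dash"    # points "down" on the page
    return "plain"       # horizontal: this rule cannot decide


# Hypothetical coordinates: one substituent drawn below, one drawn above.
print(stereo_bond_for_substituent((0.0, 0.0), (0.1, -0.8)))
print(stereo_bond_for_substituent((1.0, 0.5), (1.0, 1.3)))
```

A horizontal bond is deliberately left undecided here, since the rule as stated only covers bonds that point up or down.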

So far, so distressingly simple. Sometimes tasks really are easier than they look. There are two more things to address, though. One is simple and involves the well-known rules for how many stereobonds you draw in any given structure (I’ve mentioned this before). The other one is tidying the molecule so that the layout algorithm doesn’t undo all your good work. This is a bit trickier and I need to look a bit more at what tools are already out there for doing this.

Southampton University internships to transfer thesis data into LabTrove and ChemSpider

Written by Aileen Day.

This summer there have been a number of students from the University of Southampton doing internships on joint projects between the university and the Royal Society of Chemistry and ChemSpider. Three of these students have been sifting through theses from past members of Richard Whitby’s research group in order to extract the compound, spectra and reaction data in them (and in linked lab notebooks and archive spectra files) and share these in LabTrove, ChemSpider, and CSSP. The students – Alex Hartke, Yet Wai Lee and Josh Whittam (all 2nd year undergraduates) – are shown below together with the boxes of thesis data, lab notebooks and spectra printouts that they digitised.

Southampton University interns

Between them they digitised seven theses, by A. Henderson, L. Sayer, D. Owen, D. Macfarlane, F. Giustiniano, G. Saluste and J. Stec, which resulted in 1035 LabTrove pages being published to the Whitby Group’s LabTrove blog.

The theses were a rich source of compound information – including compound structures, names, properties and spectra, all of which were also deposited into ChemSpider resulting in 208 new compound pages, and about 600 spectra.

For this project the students manually deposited the compound information into LabTrove and then deposited the compounds and spectra to ChemSpider. However, we are currently developing a range of ChemSpider jQuery widgets which can be integrated into web-based ELNs such as LabTrove. These will make it easier to enter compound information from ChemSpider into experiments, and also to publish compound and reaction data from the ELNs to ChemSpider, CSSP and ChemSpider Reactions. This will follow on from the initial proof of concept to retrieve ChemSpider information and enter it into LabTrove pages.

With this long-term aim in view, the LabTrove pages in which the interns stored the compound and reaction data were structured using LabTrove templates, and this structuring will make it easier for publishing widgets to understand the data and process it correctly. In this way, the project was partly a test to ensure that the templates were suitable for storing compound data in LabTrove. As well as the ChemSpider compound and associated data template (with corresponding help page), templates were also written to store reaction data in a formatted way, since the theses were primarily focused on the synthesis of compounds. At its simplest, basic reaction data can be stored in LabTrove using the ChemSpider Reactions template (and corresponding help page), and eventually posts written in this format will be easily publishable to ChemSpider Reactions. More detailed reaction data can be stored using the ChemSpider SyntheticPages style reaction template (and corresponding help page). The initial aim was to deposit all of this reaction data into ChemSpider SyntheticPages, but it became clear that it was difficult for anyone other than the researcher who conducted the reaction, or their supervisor, to supply the level of detail needed for CSSP submissions, and that this level of detail could not easily be reached by retrospectively abstracting theses. As a result, only a handful of reactions were submitted to CSSP, and the majority (over 500) were stored in LabTrove for future submission to ChemSpider Reactions.

If reactions can be published easily from ELNs to ChemSpider Reactions, and ChemSpider Reactions can in turn be queried easily by other researchers and their applications when performing new reactions, this will be a major step towards the aims of Dial-a-Molecule (an EPSRC Grand Challenge network). An important part of the reaction data which needs to be captured is the stoichiometry table of substances used and produced in a reaction. However, these stoichiometry tables are too complicated to incorporate into a LabTrove template, so the LabTrove reaction templates will be used in conjunction with a new ChemSpider jQuery widget which will construct them, and which is currently in the process of being integrated with LabTrove (more details to follow on this blog shortly!). The widget performs ChemSpider lookups to retrieve compound information, and will calculate equivalents, thereby saving the researcher time when working out the amounts of reactants needed or yields of products obtained. An example of a reaction post which was initially created using the ChemSpider Reactions template and then supplemented by adding a stoichiometry table to it using the ChemSpider Edit Stoichiometry Table widget is shown here.
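To give a feel for the arithmetic behind a stoichiometry table, here is a minimal Python sketch. It is not the widget's code: the function names and row layout are invented for illustration, the molecular weights are typed in by hand rather than looked up from ChemSpider, and the yield calculation assumes one mole of product per mole of limiting reagent.

```python
def stoichiometry_table(rows, limiting):
    """Add millimoles and equivalents to each row.

    rows     -- list of dicts with 'name', 'mass_g' and 'mw' (g/mol)
    limiting -- name of the limiting reagent, which defines 1.0 equivalent
    """
    moles = {row["name"]: row["mass_g"] / row["mw"] for row in rows}
    reference = moles[limiting]
    return [
        {
            **row,
            "mmol": 1000.0 * moles[row["name"]],
            "equiv": moles[row["name"]] / reference,
        }
        for row in rows
    ]


def percent_yield(product_mass_g, product_mw, limiting_mmol):
    """Percent yield, assuming 1:1 stoichiometry with the limiting reagent."""
    product_mmol = 1000.0 * product_mass_g / product_mw
    return 100.0 * product_mmol / limiting_mmol


# Illustrative amounts only; molecular weights entered by hand.
rows = [
    {"name": "phenol", "mass_g": 0.9411, "mw": 94.11},
    {"name": "bromine", "mass_g": 4.7943, "mw": 159.81},
]
for row in stoichiometry_table(rows, limiting="phenol"):
    print(f'{row["name"]:8s} {row["mmol"]:7.2f} mmol  {row["equiv"]:.2f} equiv')
```

In a real ELN the molecular weights would come from the compound lookup, which is exactly the step the widget is meant to save the researcher from doing by hand.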

If you are a LabTrove user and wish to use the ChemSpider templates, their source is available via their links above, and instructions for using templates in LabTrove are documented here.

Recent Improvements to ChemSpider Search (part 3)

In part one of this series we talked about searching by molecular formula ranges, and combining substructure searches with other types of searches. Part two covered how to search by supplementary information like bioactivity, appearance or melting point. This time we will demonstrate how you can use a search combining these new features to help answer a question you might encounter in the lab.

After performing a bromination reaction on phenol you isolate a product with a melting point of 90-93°C. If you start a search with just three pieces of information – your product is a derivative of phenol, it should contain at least one bromine, and your melting point is 90-93°C – you can construct a search on the Advanced Search page to help you get started in identifying your product.

Since you can now combine substructure searches with other searches, you start by looking for a compound containing phenol (Search by SubStructure). To restrict your results to brominated phenols, you add a molecular formula range search for C6H(1-5)O1Br(1-5) (Search by Properties). Lastly, you search for compounds with a melting point of 90-93°C (Search by Supplementary Information).

Your search turns up one result – 2,4,6-Tribromophenol. Although you need more information to conclusively confirm the identification, this gives you a lead in your analysis/elucidation.

Taking a look at the record, you may notice it has an interactive IR spectrum from NIST. If you check the Data Sources section, you will find that there are a lot of data sources for the record.

To make it simpler to identify useful information you can browse the tabs to look for specific types of information: for instance the “Spectral Data” tab provides links to data in the MassBank and NMRShiftDB databases, which will hopefully help you confirm whether the product is 2,4,6-Tribromophenol.

This is just one example of how you can combine different searches on the Advanced Search page. Advanced searches are a great way to narrow down your results to help you find exactly what you are looking for, and there are many options we haven’t covered here, so have a look around and see what combinations might work for you.

Recent Improvements to ChemSpider Search (part 2)

Last time we told you about a number of improvements we have added to ChemSpider in the recent site updates, including combined substructure and properties search and searching by molecular formula ranges. As promised, this time we will cover how to search by properties like melting point or appearance.

Searching by Supplementary Information

Until now, although you could view properties when you were already on a record, there was no way to search by melting point, refractive index, appearance or bioactivity. This update has implemented a new search interface which allows you to search this data. You can now find compounds that are reported as being isolated from yeast, or compounds with a melting point of 32-35 °C.

There are 2 main parts to our Supplementary search interface.

Text Properties Search

Text properties include appearance, chemical class, drug status, or safety data. You can search any of these properties by using key words. When you start typing, a number of suggested search terms will appear, which can help you narrow down what search term to use.

You can also use wild cards by entering *, which can give you a little more flexibility in your search term – so if your unknown is a blue, crystalline material a search for “Blue crystal*” will turn up all records which mention the word “blue”, as well as any word beginning with “crystal” (such as crystals or crystalline).
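If you are curious how a trailing wildcard like this can behave, the Python sketch below mimics the description above. It is an assumption about the behaviour, not ChemSpider's actual search code: every term must appear as a whole word, a term ending in * matches any word that starts with it, and matching ignores case. The function names are invented for the example.

```python
import re


def term_to_regex(term):
    """Compile one search term; a trailing * means 'any word starting with'."""
    if term.endswith("*"):
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


def text_matches(query, text):
    """True if every term in the query is found somewhere in the text."""
    return all(term_to_regex(term).search(text) for term in query.split())


# Hypothetical appearance descriptions.
for description in ["Dark blue crystalline powder", "Blue crystals", "Blue powder"]:
    print(description, "->", text_matches("Blue crystal*", description))
```

Only a wildcard at the end of a term is handled here; wildcards elsewhere in a term would need a different translation.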

 

Numeric Properties Search

Numeric properties include physical properties like experimental or predicted boiling point, optical rotation, or LogP. Since we draw data from a wide range of data sources, not all of this information is sent to us in the same format or with the units depicted the same way. In order to make it possible for you to search across all the properties in our database no matter how they were supplied to us, we have done a lot of background work on tidying up and standardizing this data.

All numeric properties can be searched using min/max or with a +/- range and the search term can be entered in a variety of units – e.g. Fahrenheit or Celsius for temperature, or psi or mmHg for pressure. Because the boiling point of a material depends on the pressure at which the measurement is made, and not all boiling points are measured at atmospheric pressure, we have created a feature that attempts to compensate for this. It uses the Clausius-Clapeyron equation to create estimated (standardised) boiling points for searching. Please remember this when looking at your results.
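As a rough illustration of this kind of standardisation, the Python sketch below applies the integrated Clausius-Clapeyron equation, ln(P2/P1) = -(ΔHvap/R)(1/T2 - 1/T1), to estimate the boiling point at 760 mmHg from a measurement at another pressure. It is not our production code: the function names are invented, and it assumes you supply an enthalpy of vaporisation that is constant over the temperature range, which is itself an approximation.

```python
import math

R = 8.314462618  # gas constant, J/(mol K)

MMHG_PER_UNIT = {"mmHg": 1.0, "atm": 760.0, "psi": 51.7149}


def to_kelvin(value, unit):
    """Convert a temperature in 'C', 'F' or 'K' to kelvin."""
    if unit == "C":
        return value + 273.15
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    if unit == "K":
        return value
    raise ValueError(f"unknown temperature unit: {unit!r}")


def to_mmhg(value, unit):
    """Convert a pressure in 'mmHg', 'atm' or 'psi' to mmHg."""
    return value * MMHG_PER_UNIT[unit]


def standardised_boiling_point(t_measured_k, p_measured_mmhg,
                               dh_vap_j_per_mol, p_ref_mmhg=760.0):
    """Estimate the boiling point (K) at p_ref from one measured point.

    Assumes dh_vap is constant between the two temperatures.
    """
    inv_t_ref = (1.0 / t_measured_k
                 - R * math.log(p_ref_mmhg / p_measured_mmhg) / dh_vap_j_per_mol)
    return 1.0 / inv_t_ref


# Water boils at about 80 °C under roughly 355 mmHg; its enthalpy of
# vaporisation is taken here as 40.65 kJ/mol.
t_std = standardised_boiling_point(to_kelvin(80.0, "C"), 355.0, 40650.0)
print(f"estimated boiling point at 760 mmHg: {t_std - 273.15:.1f} °C")
```

The estimate is only as good as the enthalpy of vaporisation fed into it, which is one reason standardised values should be treated as approximate when you look at search results.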

 

As you can see, you are able to search on a wide variety of experimental properties, including boiling point, LogP, melting point, specific gravity and solubility. Please note that although many of the more common compounds have some properties, these properties are only available on a subset of our records – so if you do not get a result on a property search, it might be that we haven’t added that information yet.

Hopefully this gives you a good idea of the improvements we’ve made to ChemSpider search, and how these new features make it easier than ever to find what you are looking for. See the following post for a case study that showcases several of the new features covered in these posts.

Recent Improvements to ChemSpider Search (part 1)

We recently published an update to the ChemSpider website which, in addition to fixing a number of bugs, has added some useful new features. Three of these features are highlighted in this post – one which you might have noticed already, and two which you may not have discovered yet.

Auto-Complete

We have reinstated the auto-complete feature on the ChemSpider homepage. Now, when you begin typing in the search box, ChemSpider makes suggestions based on what you have typed. This makes it easier than ever to find what you are looking for – even if you aren’t quite sure how to spell it.

Autocomplete on the ChemSpider homepage

 

Combined Structure/Property Searches

People frequently ask if there is a way to search substructure and other properties like molecular weight or molecular formula at the same time. This update now makes it possible to perform this kind of combined search from our improved Advanced Search page.

For example, if you are interested in finding compounds which are structurally similar to Valium, you can enter a benzodiazepinone substructure and restrict it to compounds with a molecular weight of 275-325.


This search then returns Valium along with other similar drugs like clonazepam, nitrazepam and lorazepam.

There are many other search options that can be combined with a substructure/similarity search so look at the Advanced Search page and have a play.

Molecular Formula Range Searching

You can also search a range of molecular formulae at once. To specify the range for a given element, put the range in parentheses after the element. For example, C7H(10-12)O(0-1) would return all compounds containing exactly 7 carbons and between 10 and 12 hydrogens, and which may or may not contain an oxygen. This type of search can be performed from the Simple Search page, as part of an Advanced Search or from the ChemSpider homepage.

Best of all, this can be combined with any of the other search parameters on the Advanced Search page including the substructure search. For example, if you wanted to find polychlorinated biphenyls containing at least three chlorines you could perform a substructure search for a biphenyl combined with a molecular formula range of C12H(0-7)Cl(3-10).
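For readers who like to see the logic spelled out, here is a small Python sketch of how a formula range such as C7H(10-12)O(0-1) could be parsed and tested against a molecular formula. It is an illustration only, not how ChemSpider implements the search. The function names are invented, and the sketch assumes that any element not named in the query must be absent from the compound.

```python
import re

# One token: element symbol, then "(min-max)" or a plain count (default 1).
TOKEN = re.compile(r"([A-Z][a-z]?)(?:\((\d+)-(\d+)\)|(\d+))?")


def parse_range_query(query):
    """Turn 'C7H(10-12)O(0-1)' into {'C': (7, 7), 'H': (10, 12), 'O': (0, 1)}."""
    ranges = {}
    pos = 0
    while pos < len(query):
        match = TOKEN.match(query, pos)
        if match is None:
            raise ValueError(f"cannot parse {query!r} at position {pos}")
        element, low, high, exact = match.groups()
        if low is not None:
            ranges[element] = (int(low), int(high))
        else:
            count = int(exact) if exact else 1
            ranges[element] = (count, count)
        pos = match.end()
    return ranges


def parse_formula(formula):
    """Turn 'C6H3Br3O' into {'C': 6, 'H': 3, 'Br': 3, 'O': 1}."""
    counts = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(count) if count else 1)
    return counts


def formula_in_range(formula, query):
    counts = parse_formula(formula)
    ranges = parse_range_query(query)
    # Assumption: elements not listed in the query are not allowed.
    if any(element not in ranges for element in counts):
        return False
    return all(low <= counts.get(element, 0) <= high
               for element, (low, high) in ranges.items())


print(formula_in_range("C7H12O", "C7H(10-12)O(0-1)"))
print(formula_in_range("C12H8Cl2", "C12H(0-7)Cl(3-10)"))
```

The first call checks a formula that sits inside the range from the example above, and the second checks a dichlorobiphenyl formula against the polychlorinated biphenyl range, where it has too few chlorines to qualify.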


In our next post, we will cover some new ways you can search by properties that are stored in our records such as melting point, density, etc.