
Digital Discovery Webinar: Artificial Intelligence and Data in Drug Discovery and Development

Digital Discovery invites you to this webinar on opportunities, challenges and techniques in the use of AI and data in drug discovery and development.

A banner with the webinar title and pictures of the speakers

Featuring Maximilian Jakobs (DeepMirror), Andreas Bender (University of Cambridge) and Nessa Carson (AstraZeneca), this 90-minute seminar will explore key ideas and case studies, challenges in achieving tangible process improvements, and approaches to interfacing AI, data and robotic systems with pharmaceutical R&D.

Register to join us live on Wednesday, 30 October 2024 at 1400 GMT, or receive the on-demand version.

Register now!

Program

1400 GMT – Welcome
1405 GMT – Introduction to Digital Discovery, Anna Rulka (Executive Editor, Digital Discovery)
1410 GMT – What is AI, and Why Does It Matter?, Maximilian Jakobs (DeepMirror)
1435 GMT – Aspects of Life Science Data and Translation, Andreas Bender (University of Cambridge)
1500 GMT – AI and data in the process development space, Nessa Carson (AstraZeneca)
1525 GMT – Final questions and close

This webinar is free to attend wherever you are, and can be watched either live or on-demand at a time that’s convenient to you. We hope you can join us!

Guest post: The evolving roles of data and citations in journal articles

The evolving roles of data and citations in journal articles

Henry S. Rzepa

Emeritus Professor of Computational Chemistry, Department of Chemistry, Imperial College London.

A portrait photo of Henry Rzepa

Background

The last thirty years have seen enormous changes in the so-called scientific journal model, first introduced some 350 years ago as a paper-based medium. The typical journal article in, say, the chemical sciences has evolved to contain a traditional narrative structure comprising an introduction or background to the topic, the presentation of results and data, conclusions drawn from the data, experimental procedures to enable replication and a bibliographic section where relationships to other work can be cited. Such a serial narrative format has itself come under scrutiny, for example in a recent publishing experiment that dissects it into eight smaller units of publication, potentially with their own structures and authorship, each of which can stand on its own merits but which can also be assembled to reconstitute an overarching synoptic journal article.1 The electronic journal era of the last 30 years has also brought with it experiments in how the various constituents of the traditional journal article might be digitally exploited. An example2 dating from the start of the e-journal period showed how selected articles in the journal Chemical Communications could be enhanced with “pop-up” interactive molecular models based on 3D coordinate data provided by the authors, thus augmenting the static views provided by conventional figures.

In the present commentary, the focus will be on two other ways of digitally exploiting the medium of the journal, both driven by the extraordinary recent attention given to artificial intelligence or machine learning and by questions such as whether the current publishing models need to be prepared for this new era. The first is how the availability, discovery and properties of data associated with journal articles are being improved; the second is citation enhancement. Both are facets of the publication process, and they turn out to be closely inter-related.

Journals and Data

For much of the history of publishing in fields such as chemistry, the data behind a research article has been integrated into the article in the form of tables of numerical results and/or figures derived from these data, along with graphical schemes illustrating other aspects such as molecular structures and associated reactions and mechanisms. Isolated numerical data could often be simply integrated into the text-based narrative. This became impractical when the tables of numerical data swelled in size, an example being crystallographic information from the 1950s onwards. Procedures for printing this information and then depositing the print copy in a national library or other central resource were introduced and this became more common for a short period during the 1970s.3 To re-use such data, an interested reader would have to re-type the numerical information into, say, a computer for analysis, and then spend a fair bit of time checking that no errors had been introduced by this process. From the mid 1990s, this paper-based form thankfully started being replaced by “electronic printing” into the PDF format, when it became known as ESI or electronic supporting information, a mechanism that still dominates to this day. Over the last decade, however, it has been increasingly recognised4 that ESI is not an optimal medium for use in areas such as artificial intelligence and machine learning (abbreviated AI/ML here), for which specifically structured and semantically rich information is essential or at least greatly helpful.5

Journals and Citations

It is appropriate at this point to interleave citations into the discussion. These have their own fascinating history! In the 19th and early 20th centuries, citations in an article were often sparse and cryptic, with journal references heavily abbreviated, possibly to save type-setting effort. I cannot resist citing6 this article by Niels Bohr dating from 1922 as an extreme example. Probably one of the most influential articles of that century, leading to a Nobel prize no less, it contains no citations either as footnotes or endnotes; instead, individuals contributing to the area are acknowledged throughout the text. Nonetheless, by the second half of the 20th century, most research articles had fully separated citations into a discrete list at the end of the article. Arguably, these lists were often mis-used by inclusion of text-based footnotes extending the discussion of the main body of the article. Individual numbered citations could themselves contain sub-lists of journal references associated by an inferred common theme and of hoped-for relevance to the discussion. Such lists started suffering from the same issues as ESI, in other words an apparent lack of the formal structures and declared semantics so helpful for AI/ML. These will be referred to as unstructured citations for reasons that will shortly become apparent.

Journals and Metadata

It is time to introduce the unifying concept of metadata, this being structured and controlled descriptions of a body of data or of a narrative and including simple components such as authorship, article titles, abstracts, affiliations and provenance and publication dates. These formal structures now allow metadata to be more easily processed and analysed using AI/ML methods and provide infrastructures for obtaining for example metrics relating to research impacts. Whereas the commercial models that many publishers used in the past in the era before open-access would result in access to the digital journal article itself being paywall-protected in some manner, the metadata associated with that article was not so protected and was made readily available for use by anyone. In 2000, the Crossref organisation7 was set up by a consortium of publishers, libraries, research institutions and funders to accept, store, curate and disseminate this metadata, and Crossref issued what is known as a persistent identifier (the DOI is a specific example of such a PID) to identify the metadata records.

Initially, Crossref metadata did not include the citations from an article, but these were added from 2004 onwards8 as a discrete component in the form of structured citations. Initial uptake by publishers was slow, but nowadays it is almost universal.9 These structured citations of books and journal articles included conventional information such as the author and journal name and the volume and page numbers, but in time they evolved to also include the article DOI, which allows facile and programmatic access to the metadata record for each citation. At this stage a record is introduced for one specific article,10 together with its access point in a form suitable for AI/ML applications:

https://api.crossref.org/works/10.1039/D3DD00246B/transform/application/vnd.crossref.unixsd+xml
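A minimal Python sketch of how an access point of this form might be assembled for any article DOI. The helper name `crossref_unixsd_url` is hypothetical, the URL pattern is simply the one quoted above, and no network request is made.

```python
CROSSREF_WORKS = "https://api.crossref.org/works/"
UNIXSD_XML = "application/vnd.crossref.unixsd+xml"


def crossref_unixsd_url(doi):
    """Return the Crossref metadata access point for a DOI (hypothetical helper)."""
    doi = doi.strip()
    # Accept a bare DOI as well as common "link" spellings of it.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return f"{CROSSREF_WORKS}{doi}/transform/{UNIXSD_XML}"


print(crossref_unixsd_url("10.1039/D3DD00246B"))
```

The resulting string can then be handed to whichever HTTP client is preferred in order to retrieve the XML record.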

An example of a structured citation from this record (as of mid 2024) is shown below:

<citation key="D3DD00246B/cit25/1">
  <journal_title>J. Chem. Phys.</journal_title>
  <author>Scalmani</author>
  <cYear>2010</cYear>
  <first_page>114110</first_page>
  <doi>10.1063/1.3359469</doi>
</citation>
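As an illustration of why this structure is convenient for software, the fragment above can be read with Python's standard XML parser. This is only a sketch: the function name `citation_to_dict` is hypothetical, and the fragment is treated as namespace-free exactly as printed here, whereas a full metadata record may declare XML namespaces that would need to be handled.

```python
import xml.etree.ElementTree as ET

# The structured citation quoted above, written with plain ASCII quotes.
CITATION_XML = """
<citation key="D3DD00246B/cit25/1">
  <journal_title>J. Chem. Phys.</journal_title>
  <author>Scalmani</author>
  <cYear>2010</cYear>
  <first_page>114110</first_page>
  <doi>10.1063/1.3359469</doi>
</citation>
"""


def citation_to_dict(xml_text):
    """Flatten one <citation> element into a dictionary (hypothetical helper)."""
    element = ET.fromstring(xml_text.strip())
    record = {"key": element.get("key")}
    for child in element:
        # Each child tag (author, cYear, doi, ...) becomes a dictionary entry.
        record[child.tag] = (child.text or "").strip()
    return record


record = citation_to_dict(CITATION_XML)
print(record["author"], record["cYear"], record["doi"])
```

Because the DOI arrives as its own field, it can be passed straight on to a further metadata lookup without any text mining.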

If you explore the metadata further, you will soon encounter a slightly different form, which is designated an unstructured citation, arising by virtue of inclusion of a component containing free-text comments. This is how all those citation footnotes, comments and other annotations so beloved by some authors are currently included. In this example, the article DOI itself is also noted, thus rendering the unstructured component somewhat redundant, but this is not always the case!

<citation key="D3DD00246B/cit10/1">
  <volume_title>ChemRxiv</volume_title>
  <author>Braddock</author>
  <cYear>2024</cYear>
  <doi>10.26434/chemrxiv-2023-vcmcl</doi>
  <unstructured_citation>For a preprint, see, D. C.Braddock, S.Lee and H. S.Rzepa, SWERN Oxidation. transition structure Theory is OK, ChemRxiv, 2023, preprint, 10.26434/chemrxiv-2023-vcmcl</unstructured_citation>
</citation>

A third variation in the citation format can also be identified.

<citation key="D3DD00246B/cit19/1">
  <volume_title>Imperial College Research Data Repository</volume_title>
  <author>Braddock</author>
  <cYear>2023</cYear>
  <doi>10.14469/hpc/13108</doi>
  <unstructured_citation>D. C.Braddock, H. S.Rzepa and S.Lee, Imperial College Research Data Repository, 2023, 10.14469/hpc/13108</unstructured_citation>
</citation>

Here one might infer from the volume title that this is now about data. This is a suitable entry point for the discussion here to rejoin the theme introduced above regarding data and ESI. However, instead of referring to data inside an ancillary PDF file associated with the article, a data DOI is now cited. As implied above for article DOIs, this form also has an associated metadata record, which is stored, curated and disseminated by DataCite,11 an organisation set up some ten years after Crossref but acting in parallel to allow the citation of data. Unlike data contained in relatively unstructured (or parochially structured) ESI documents, this form of data has associated formal descriptors in the metadata record describing the properties of the data. DataCite also allows access to this record, albeit using a slightly different form to that used by Crossref:

https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/13108
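The difference in form is small: in the DataCite address the media type comes before the DOI rather than after it. A short sketch, assuming only the URL pattern quoted above (the helper name `datacite_xml_url` is hypothetical and nothing is fetched):

```python
DATACITE_RESOLVER = "https://data.datacite.org/"
DATACITE_XML = "application/vnd.datacite.datacite+xml"


def datacite_xml_url(doi):
    """Return the DataCite XML metadata access point for a data DOI (hypothetical helper)."""
    # DataCite form: <resolver>/<media type>/<DOI>
    return f"{DATACITE_RESOLVER}{DATACITE_XML}/{doi.strip()}"


print(datacite_xml_url("10.14469/hpc/13108"))
```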

The properties as described by such a metadata record constitute information about how Findable, Accessible, Interoperable and Re-usable the data is – properties that became known by the acronym FAIR12 around 2016 and are important for the application of AI/ML. Note however that again in the citation example shown above, an unstructured component is also included containing the free-text assertion that the data is held in an institutional research data repository. Formally therefore, data is only implied by this form of citation, but at least the metadata record associated with the provided DOI can be used to confirm this. At this stage it is worth noting that around half of all the citations associated with this specific article10 are of this type, an unusually high proportion. When an assertion is made in the narrative of this article, it can now be supported with a data citation as appropriate. Such multiple and in-context data citation can be contrasted with the conventional data availability statement nowadays found in most journal articles, introduced around 2017 and which often simply points to the single and largely context-free supporting information document listed on the article landing page.

The expectation13 is that Crossref will very shortly modify the unstructured aspect of data citation by a small extension to their schema in the form shown below, hence adding the ability to formalise the citation of data in an article.

<citation type="dataset" key="D3DD00246B/cit19/1">
  <volume_title>Imperial College Research Data Repository</volume_title>
  <author>Braddock</author>
  <cYear>2023</cYear>
  <doi>10.14469/hpc/13108</doi>
</citation>
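To indicate what such a declared type would mean for software, here is a sketch that sorts a citation into one of the three forms discussed in this commentary. The function name `citation_kind` and its return labels are hypothetical, and the `type` attribute is taken from the proposal as quoted here rather than from a released schema.

```python
import xml.etree.ElementTree as ET


def citation_kind(xml_text):
    """Classify a <citation> fragment (hypothetical helper and labels)."""
    element = ET.fromstring(xml_text.strip())
    declared = element.get("type")
    if declared:
        # Proposed form: the citation states its own type, e.g. "dataset".
        return declared
    if element.find("unstructured_citation") is not None:
        # Current form for data: the nature of the item is only implied by free text.
        return "unstructured"
    return "structured"


PROPOSED = """<citation type="dataset" key="D3DD00246B/cit19/1">
  <volume_title>Imperial College Research Data Repository</volume_title>
  <doi>10.14469/hpc/13108</doi>
</citation>"""

CURRENT = """<citation key="D3DD00246B/cit19/1">
  <doi>10.14469/hpc/13108</doi>
  <unstructured_citation>D. C.Braddock, H. S.Rzepa and S.Lee, Imperial College Research Data Repository, 2023, 10.14469/hpc/13108</unstructured_citation>
</citation>"""

ARTICLE = """<citation key="D3DD00246B/cit25/1">
  <journal_title>J. Chem. Phys.</journal_title>
  <doi>10.1063/1.3359469</doi>
</citation>"""

for fragment in (PROPOSED, CURRENT, ARTICLE):
    print(citation_kind(fragment))
```

With a declared type, a program no longer has to guess from a volume title or a free-text note that the cited item is a dataset.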

Formalisation is also proposed by Crossref of the data availability statement alluded to above. In most current articles in this and other journals it appears in the generic form of a Data availability section, where the authors can list how their data can be obtained in the form of e.g. URLs or DOIs. However, this information does NOT currently appear in the Crossref metadata record unless the authors have also included it as an unstructured citation. The proposal is to add it to the metadata record in the form of

<statement type="data availability">Data Availability Statement … … </statement>

The content of this statement is still unstructured free-text, but at least it is available for parsing and analysis in ways that might be useful.
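One simple kind of parsing that such free text permits is picking out any DOIs it mentions. The sketch below makes several assumptions: the wording of the sample statement is invented for illustration (only the DOIs are taken from this commentary), the function name `find_dois` is hypothetical, and the pattern is a loose approximation that will not suit every DOI.

```python
import re

# Loose DOI pattern: "10.", a numeric registrant code, "/", then a suffix
# running up to the next whitespace, quote or angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(free_text):
    """Return the distinct DOIs found in free text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.findall(free_text):
        doi = match.rstrip(".,;)")  # drop trailing sentence punctuation
        if doi not in found:
            found.append(doi)
    return found


# Hypothetical statement text, written only for this example.
STATEMENT = (
    "Data for this article are available at DOI: 10.14469/hpc/13108. "
    "A discovery search is recorded at https://doi.org/10.14469/hpc/14510, "
    "see also 10.14469/hpc/13108."
)

print(find_dois(STATEMENT))
```

Stripping trailing punctuation is a heuristic and could in principle truncate a DOI that genuinely ends in such a character, which is one reason a structured form remains preferable to text mining.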

At this stage, the assertion above that the two facets of data and citations are in fact closely associated can be summarised as:

  • Key information about a journal article is now made freely available via its metadata record, a structured and semantically rich format that allows AI/ML processing.
  • The relationships the article has with other articles are now also present in the form of structured citations in the Crossref metadata record.
  • Such structured citations should include persistent identifiers such as DOIs with an indication of the type of the citation, such as to a dataset.
  • The inclusion of persistent identifiers in turn allows AI/ML access to metadata records describing data referred to in the article.

Primary vs processed data

This section discusses two forms in which data can be expressed in an article: firstly the conventional Tables/Figures/Schemes contained in the body of the article, and secondly citations allowing specific access to more complete, or at least less lossy, primary data. The broad distinction made here is that the former representations constitute processed and interpreted data, whereas ideally the latter constitute the more complete data from which the former are derived, such as that obtained from an instrument or output by a computational procedure. Specific examples illustrate the difference between the two.

  1. A form of processed data could be an NMR or frequency domain spectrum presented in association with a chemical structure representation. The combination of the two can be used to confirm the identity of e.g. the product of a chemical synthesis.
  2. The corresponding primary or raw form would be the time-domain data as produced directly from an NMR instrument, to be converted by e.g. a Fourier Transform operation to a frequency domain presentation that is more readily analysed. The process of converting the primary data to the processed form is of course lossy; some information at least is lost by this conversion.
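The NMR example can be made concrete with a toy calculation. The sketch below is illustrative only: the "free induction decay" is a synthetic decaying cosine with invented parameters, the transform is a naive discrete Fourier transform written out in pure Python, and discarding the phase is just one simple way in which processing can lose information.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform, O(n^2), adequate for a short example."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform, returning the real part of each time-domain point."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


# Synthetic time-domain signal (all values are assumptions for illustration).
N = 64          # number of time-domain points
FREQ_BIN = 5    # signal frequency, in units of spectral bins
DECAY = 0.05    # exponential decay rate per point
fid = [math.exp(-DECAY * t) * math.cos(2 * math.pi * FREQ_BIN * t / N) for t in range(N)]

# "Processing": transform to the frequency domain and locate the peak.
spectrum = dft(fid)
magnitudes = [abs(value) for value in spectrum]
half = magnitudes[: N // 2]
peak_bin = max(range(len(half)), key=half.__getitem__)

# The full complex spectrum still allows the time-domain data to be recovered...
max_error = max(abs(a - b) for a, b in zip(fid, idft(spectrum)))
# ...whereas a magnitude-only spectrum, as typically plotted, does not.
lossy_error = max(abs(a - b) for a, b in zip(fid, idft(magnitudes)))

print(peak_bin, max_error < 1e-9, lossy_error > 1e-3)
```

The peak position is what a reader takes from the plotted spectrum, while anything that depended on the discarded phase can only be recovered by returning to the primary data.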

A second example derives from computational modelling.

  1. A form of processed data could be a two-dimensional representation or figure corresponding to the highest occupied molecular orbital or the HOMO of a molecule of interest.
  2. The corresponding primary data would be a file containing the full wavefunction calculated for the molecule using a specific model for solution of the Schrödinger equation and presented as loss-free data in the form of a formatted checkpoint or rawbinaryarray file[14] resulting from e.g. a Gaussian calculation. These forms would allow not only an alternative three-dimensional representation of the HOMO to be generated, but indeed that of any other desired orbital or other property computable from the wavefunction.

The final example is found in the article cited above10 and relates to the calculation of kinetic isotope effects.

  1. The processed data derives from application of the Bigeleisen model to kinetic isotope effects for deuterium substitution at a specified temperature and for specified atoms, using computer code specified again by a suitable DOI-based citation. It can be presented as numerical values in a table.
  2. The primary data derives from the final calculation checkpoint files, which as well as containing the wavefunction also contain the second derivative force constant matrix, allowing other isotopic substitutions to be made at any location in the molecule and which can be evaluated at any required temperature.

The purpose of including these examples of forms of data is to show that both can be useful! Processed data, in the form of visualisable figures and tables, are particularly helpful for the type of perception of complex concepts that humans traditionally excel at. Primary data are useful for access to alternative forms of visualisation, for re-use in a context different to that presented in the body of the article, or for the application of alternative models to those presented by the original authors, such as might be derived by AI/ML methods. The journal experiment noted above2 combined these by accessing the primary data (molecular coordinates) and converting this on the fly to a pop-up visual representation for humans (an interactive 3D model). Even at the simplest level, access to primary data might allow replication of the results quoted in the original article. In the article cited10 such replication was not always possible because of the lack of such primary data associated with the original report.15

Data Discovery

The examples above illustrate how the various components of a scientific article can be prepared for AI/ML analysis by adding predictable structures to both the citations and the data implicit in the article. There is another important benefit of data citation which is illustrated next, that of data discovery. Finding something in a conventional ESI document is largely limited to searching the free text for appropriate string patterns. The scope of such a pattern search does not extend beyond that document. However, metadata records associated with a dataset are automatically aggregated by the metadata registration agency, being either Crossref or DataCite. Both offer rich, structured and federated searches of the metadata across all registered entries, not just of a single ESI document. To illustrate this aspect, the data availability statement in the article discussed above10 has been modified to include both data availability and discovery. An extended version of the example cited there is shown below:16

https://commons.datacite.org/?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+media.media_type:text/plain+AND+(titles.title:*Endo*+OR+descriptions.description:*Endo*+OR+titles.title:*Exo*+OR+descriptions.description:*Exo*)

If this syntax looks rather long and unwieldy, it is because it follows the query syntax of an API (application programming interface) of the kind used by AI/ML applications (the specific API form of the above is https://api.datacite.org/dois/?query= ). It reveals all datasets derived from using the Gaussian quantum chemical application, as restricted by the presence of an additional file containing further information (here the kinetic isotope effects) and by specified title or description keywords, the search being within the global corpus of registered metadata. This extends the scope of the discovery well beyond that of a single ESI document. A way of constraining the search to a particular specified property, namely kinetic isotope effects, would require future community agreement18 on the vocabulary term and/or scheme to be used for that property. Here a possible such term is invoked by appending +AND+subjects.subjectScheme:*KIE*+AND+subjects.subject:1H/2H to the above search, which constrains the property to KIE and its value to 1H/2H (a hydrogen-deuterium isotope effect).17 The searches themselves can even be assigned16,17 a persistent identifier to facilitate discovery by e.g. AI/ML software. The community is here challenged to enable enrichment of the descriptive and relational publication metadata by agreeing wider vocabularies or search terms, thus enabling data discovery to be made ever more specific and accurate.18
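The search shown above can also be assembled programmatically from its parts, which makes the logic of the query easier to see. In this sketch the helper names (`field`, `any_of`, `all_of`) are hypothetical; the field names, values and the two base addresses are those quoted in this commentary, and no request is sent.

```python
COMMONS_SEARCH = "https://commons.datacite.org/?query="
API_SEARCH = "https://api.datacite.org/dois/?query="


def field(name, value):
    """One 'field:value' clause of the query."""
    return f"{name}:{value}"


def any_of(*clauses):
    """Clauses joined by OR inside parentheses, with '+' standing for a space."""
    return "(" + "+OR+".join(clauses) + ")"


def all_of(*clauses):
    """Clauses joined by AND, with '+' standing for a space."""
    return "+AND+".join(clauses)


# Datasets containing Gaussian log or checkpoint files...
gaussian_files = any_of(
    field("media.media_type", "chemical/x-gaussian-log"),
    field("media.media_type", "chemical/x-gaussian-checkpoint"),
)
# ...that also hold an additional plain-text file...
extra_text_file = field("media.media_type", "text/plain")
# ...and whose title or description mentions Endo or Exo.
keywords = any_of(
    field("titles.title", "*Endo*"),
    field("descriptions.description", "*Endo*"),
    field("titles.title", "*Exo*"),
    field("descriptions.description", "*Exo*"),
)
query = all_of(gaussian_files, extra_text_file, keywords)

# The subject-constrained variant described in the text.
kie_query = all_of(
    query,
    field("subjects.subjectScheme", "*KIE*"),
    field("subjects.subject", "1H/2H"),
)

print(COMMONS_SEARCH + query)
print(API_SEARCH + kie_query)
```

Building the query from named pieces also makes it straightforward to swap in a different file type or keyword once the community has agreed the relevant vocabulary.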

The future

The examples used to illustrate the concepts described above show how a journal article10 can be very usefully adapted to ensure it is more AI/ML-friendly, with relatively little extra effort required by its authors. It can be anticipated that many more innovations associated with both data and citations will follow and that the 350+ year evolution of scientific publishing will continue apace!

Note added after publication

Sara El-Gebali from DataCite has also published a blog post on 20th August 2024 entitled “Connecting the Dots with DataCite DOI Metadata”, which usefully expands upon the discussion in this commentary. This gives a wider range of metadata types that can be used for discovery. See DOI: 10.5438/k81t-zq43

A citable version of this blog post is available on ChemRxiv, at DOI: 10.26434/chemrxiv-2024-dz2dv

References:

1 The Octopus publishing project, https://www.octopus.ac/about

2 D. James, B. J. Whitaker, C. Hildyard, H. S. Rzepa, O. Casher, J. M. Goodman, D. Riddick and P. Murray-Rust, The case for content integrity in electronic chemistry journals: The CLIC project, New Review of Information Networking, 1995, 1, 61–69, DOI: 10.1080/13614579509516846

3 H. S. Rzepa, The Long and Winding Road towards FAIR Data as an Integral Component of the Computational Modelling and Dissemination of Chemistry, Isr. J. Chem., 2022, 62, e202100034, DOI: 10.1002/ijch.202100034

4 J. Downing, P. Murray-Rust, A. P. Tonge, P. Morgan, H. S. Rzepa, F. Cotterill, N. Day and M. J. Harvey, SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories, J. Chem. Inf. Model., 2008, 48, 1571–1581, DOI: 10.1021/ci7004737

5 P. Murray-Rust and H. S. Rzepa, Chemical markup Language and XML Part I. Basic principles, J. Chem. Inf. Comput. Sci., 1999, 39, 928, DOI: 10.1021/ci990052b

6 N. Bohr, Der Bau der Atome und die physikalischen und chemischen Eigenschaften der Elemente, Zeitschrift für Physik, 1922, 9, 1–67, DOI: 10.1007/BF01326955

7 The Formation of Crossref: A Short History, https://www.crossref.org/pdfs/CrossRef10Years.pdf

8 See Crossref Schema 2.0.5, 2004, https://web.archive.org/web/20040202113642/http://www.crossref.org/02publishers/forward_linking_howto.html

9 D. Shotton, Publishing: Open citations, Nature, 2013, 502, 295–297, DOI: 10.1038/502295a

10 D. C. Braddock, S. Lee and H. S. Rzepa, Modelling kinetic isotope effects for Swern oxidation using DFT-based transition state theory, Digital Discovery, 2024, 3, 1496–1508, DOI: 10.1039/D3DD00246B

11 J. Neumann and J. Brase, DataCite and DOI names for research data, J. Comput.-Aided Mol. Des., 2014, 28, 1035–1041, DOI: 10.1007/s10822-014-9776-5

12 M. Wilkinson, M. Dumontier, I. Aalbersberg, et al., The FAIR Guiding Principles for scientific data management and stewardship, Sci. Data, 2016, 3, 160018, DOI: 10.1038/sdata.2016.18

13 Crossref Metadata updates (for public comment) July 2024, https://docs.google.com/document/d/1VPXhTPMZzfvAPmTOlNp-bZf9cTLkw0dPZFTuDtDIPls/

14 H. S. Rzepa, Quantum chemistry interoperability (library): another step towards FAIR data, 2022, https://www.ch.imperial.ac.uk/rzepa/blog/?p=24543, DOI: 10.59350/mzs83-g6218

15 T. Giagou and M. P. Meyer, Mechanism of the Swern Oxidation: Significant Deviations from Transition State Theory, J. Org. Chem., 2010, 75, 8088–8099, DOI: 10.1021/jo101636w

16 H. S. Rzepa, Example of a discovery search procedure, 2024, DOI: 10.14469/hpc/14510

17 H. S. Rzepa, Example of a discovery search procedure using a subject-constrained search, 2024, DOI: 10.14469/hpc/14517

18 This is currently being done for e.g. NMR Spectroscopy; R. M. Hanson, D. Jeannerat, M. Archibald, I. Bruno, S. Chalk, A. N. Davies, R. J. Lancashire, J. Lang and H. S. Rzepa, IUPAC specification for the FAIR management of spectroscopic data in chemistry (IUPAC FAIRSpec) – guiding principles, Pure and Applied Chemistry, 2022, 94, 623–636, DOI: 10.1515/pac-2021-2009

New themed collection in collaboration with Accelerate Conference 2022

Portraits of the three Guest Editors

We’re pleased to announce that a new themed collection from Digital Discovery has now been published online.

Read the collection

This new themed collection represents a collaboration between the editors of Digital Discovery and the Acceleration Consortium, organisers of the Accelerate Conference. The goal of the conference was to explore the power of self-driving labs (SDLs), which combine AI, automation, and advanced computing to accelerate materials and molecular discovery.

This themed collection, Guest Edited by Prof. Keith A. Brown (Boston University, USA), Prof. Fedwa El Mellouhi (Hamad Bin Khalifa University, Qatar), and Prof. Claudiane Ouellet-Plamondon (École de technologie supérieure, Canada), features contributions that cover various aspects of this process, whether specifically presented at the conference or not.

Examples include the realization of new SDLs; fundamental studies of the operation of SDLs; and sustainable, resilient, low-carbon materials and chemical discoveries made using SDLs.

A list of the articles has been provided below. All articles in Digital Discovery are open access and free to read.

We hope you enjoy this new themed collection from Digital Discovery.

A new collection to feature contributors to Accelerate Conference 2023 and Accelerate Conference 2024 is currently in preparation – watch this space for more information!

 

Editorial

Introduction to “Accelerate Conference 2022”
Keith A. Brown, Fedwa El Mellouhi and Claudiane Ouellet-Plamondon
Digital Discovery, 2024, 3, DOI: 10.1039/D4DD90036G

 

Perspectives

The laboratory of Babel: highlighting community needs for integrated materials data management
Brenden G. Pelkie and Lilo D. Pozzo
Digital Discovery, 2023, 2, 544–556, DOI: 10.1039/D3DD00022B

What is missing in autonomous discovery: open challenges for the community
Phillip M. Maffettone, Pascal Friederich, Sterling G. Baird, Ben Blaiszik, Keith A. Brown, Stuart I. Campbell, Orion A. Cohen, Rebecca L. Davis, Ian T. Foster, Navid Haghmoradi, Mark Hereld, Howie Joress, Nicole Jung, Ha-Kyung Kwon, Gabriella Pizzuto, Jacob Rintamaki, Casper Steinmann, Luca Torresi and Shijing Sun
Digital Discovery, 2023, 2, 1644–1659, DOI: 10.1039/D3DD00143A

Autonomous cementitious materials formulation platform for critical infrastructure repair
Howie Joress, Rachel Cook, Austin McDannald, Mark Kozdras, Jason Hattrick-Simpers, Aron Newman and Scott Jones
Digital Discovery, 2024, 3, 231–237, DOI: 10.1039/D3DD00211J

 

Papers

A fully automated platform for photoinitiated RAFT polymerization
Jules Lee, Prajakatta Mulay, Matthew J. Tamasi, Jonathan Yeow, Molly M. Stevens and Adam J. Gormley
Digital Discovery, 2023, 2, 219–233, DOI: 10.1039/D2DD00100D

A high-throughput workflow for the synthesis of CdSe nanocrystals using a sonochemical materials acceleration platform
Maria Politi, Fabio Baum, Kiran Vaddi, Edwin Antonio, Joshua Vasquez, Brittany P. Bishop, Nadya Peek, Vincent C. Holmberg and Lilo D. Pozzo
Digital Discovery, 2023, 2, 1042–1057, DOI: 10.1039/D3DD00033H

Neural networks trained on synthetically generated crystals can extract structural information from ICSD powder X-ray diffractograms
Henrik Schopmans, Patrick Reiser and Pascal Friederich
Digital Discovery, 2023, 2, 1414–1424, DOI: 10.1039/D3DD00071K

Driving school for self-driving labs
Kelsey L. Snapp and Keith A. Brown
Digital Discovery, 2023, 2, 1620–1629, DOI: 10.1039/D3DD00150D

Robotically automated 3D printing and testing of thermoplastic material specimens
Miguel Hernández-del-Valle, Christina Schenk, Lucía Echevarría-Pastrana, Burcu Ozdemir, Enrique Dios-Lázaro, Jorge Ilarraza-Zuazo, De-Yi Wang and Maciej Haranczyk
Digital Discovery, 2023, 2, 1969–1979, DOI: 10.1039/D3DD00141E

Towards a modular architecture for science factories
Rafael Vescovi, Tobias Ginsburg, Kyle Hippe, Doga Ozgulbas, Casey Stone, Abraham Stroka, Rory Butler, Ben Blaiszik, Tom Brettin, Kyle Chard, Mark Hereld, Arvind Ramanathan, Rick Stevens, Aikaterini Vriza, Jie Xu, Qingteng Zhang and Ian Foster
Digital Discovery, 2023, 2, 1980–1998, DOI: 10.1039/D3DD00142C

A human-in-the-loop approach for visual clustering of overlapping materials science data
Satyanarayana Bonakala, Michael Aupetit, Halima Bensmail and Fedwa El-Mellouhi
Digital Discovery, 2024, 3, 502–513, DOI: 10.1039/D3DD00179B

New themed collection with the NeurIPS AI4Mat 2023 workshop

The AI for Materials Design logo

We’re pleased to announce that a new themed collection from Digital Discovery has now been published online.

Read the collection

The AI for Accelerated Materials Design (AI4Mat) workshop at NeurIPS 2023 featured many of the ongoing major research themes in materials design, synthesis, and characterization by bringing together an international interdisciplinary community of researchers and enthusiasts. The AI4Mat 2023 organizing committee and the editors of Digital Discovery have curated a selection of research papers drawn from some of the most exciting and high-quality paper submissions from the workshop. We are pleased to share these papers, and a perspective on the workshop as a whole, in this themed collection.

You can find the line-up of the collection below. All articles in Digital Discovery are open access and free to read.

Editorial

Perspective on AI for Accelerated Materials Design at the AI4Mat-2023 Workshop at NeurIPS 2023
Santiago Miret, N. M. Anoop Krishnan, Benjamin Sanchez-Lengeling, Marta Skreta, Vineeth Venugopal and Jennifer N. Wei
Digital Discovery, 2024, 3, DOI: 10.1039/D4DD90010C

Communications

Discovery of novel reticular materials for carbon dioxide capture using GFlowNets
Flaviu Cipcigan, Jonathan Booth, Rodrigo Neumann Barros Ferreira, Carine Ribeiro dos Santos and Mathias Steiner
Digital Discovery, 2024, 3, 449–455, DOI: 10.1039/D4DD00020J

A message passing neural network for predicting dipole moment dependent core electron excitation spectra
Kiyou Shibata and Teruyasu Mizoguchi
Digital Discovery, 2024, 3, 649–653, DOI: 10.1039/D4DD00021H

Papers

Connectivity optimized nested line graph networks for crystal structures
Robin Ruff, Patrick Reiser, Jan Stühmer and Pascal Friederich
Digital Discovery, 2024, 3, 594–601, DOI: 10.1039/D4DD00018H

Learning conditional policies for crystal design using offline reinforcement learning
Prashant Govindarajan, Santiago Miret, Jarrid Rector-Brooks, Mariano Phielipp, Janarthanan Rajendran and Sarath Chandar
Digital Discovery, 2024, 3, 769–785, DOI: 10.1039/D4DD00024B

EGraFFBench: evaluation of equivariant graph neural network force fields for atomistic simulations
Vaibhav Bihani, Sajid Mannan, Utkarsh Pratiush, Tao Du, Zhimin Chen, Santiago Miret, Matthieu Micoulaut, Morten M. Smedskjaer, Sayan Ranu and N. M. Anoop Krishnan
Digital Discovery, 2024, 3, 759–768, DOI: 10.1039/D4DD00027G

Gotta be SAFE: a new framework for molecular design
Emmanuel Noutahi, Cristian Gabellini, Michael Craig, Jonathan S. C. Lim and Prudencio Tossou
Digital Discovery, 2024, 3, 796–804, DOI: 10.1039/D4DD00019F

Reconstructing the materials tetrahedron: challenges in materials information extraction
Kausik Hira, Mohd Zaki, Dhruvil Sheth, Mausam and N. M. Anoop Krishnan
Digital Discovery, 2024, 3, 1021–1037, DOI: 10.1039/D4DD00032C

Towards equilibrium molecular conformation generation with GFlowNets
Alexandra Volokhova, Michał Koziarski, Alex Hernández-García, Cheng-Hao Liu, Santiago Miret, Pablo Lemos, Luca Thiede, Zichao Yan, Alán Aspuru-Guzik and Yoshua Bengio
Digital Discovery, 2024, 3, 1038–1047, DOI: 10.1039/D4DD00023D

CoDBench: a critical evaluation of data-driven models for continuous dynamical systems
Priyanshu Burark, Karn Tiwari, Meer Mehran Rashid, Prathosh A. P. and N. M. Anoop Krishnan
Digital Discovery, 2024, 3, DOI: 10.1039/D4DD00028E

We hope you enjoy this new themed collection from Digital Discovery.

Research infographic – Robotically automated 3D printing and testing of thermoplastic material specimens

We’re pleased to share this new infographic on research from Haranczyk et al.

An infographic summarising the linked article.

Read the article here:

Robotically automated 3D printing and testing of thermoplastic material specimens

“Formalizing chemical physics using the Lean theorem prover” featured on Breaking Math

Josephson et al.’s paper “Formalizing chemical physics using the Lean theorem prover” is featured on a new episode of the Breaking Math podcast! Find it at the links below or in your favourite podcatcher.

Apple Podcasts

Spotify podcasts

Read the open access article here.

Research infographic – Digitisation of a modular plug and play 3D printed continuous flow system for chemical synthesis

Our new infographic highlights work from Hilton et al. on a 3D-printed, modular system for classical and photochemical synthesis:

An infographic summarising the linked article.

Read their paper below to find out more:

Digitisation of a modular plug and play 3D printed continuous flow system for chemical synthesis

Mireia Benito Montaner, Matthew R. Penny and Stephen T. Hilton, Digital Discovery, 2023, 2, 1797–1805

Research infographic – Evaluating the roughness of structure–property relationships using pretrained molecular representations

Work by Coley et al. features in the next Digital Discovery infographic, which introduces a reformulation of the roughness index (ROGI) to help understand the roughness of QSPR surfaces created by new models.

An infographic summarising the linked article.

Get the whole story in their article, available open access:

Evaluating the roughness of structure–property relationships using pretrained molecular representations

David E. Graff, Edward O. Pyzer-Knapp, Kirk E. Jordan, Eugene I. Shakhnovich and Connor W. Coley, Digital Discovery, 2023, 2, 1452–1460

Research infographic – Driving school for self-driving labs

Our latest research infographic shares Snapp and Brown’s heuristic framework for defining the operation of self-driving labs.

An infographic summarising the linked article

Find out more in their open access article here:

Driving school for self-driving labs

Kelsey L. Snapp and Keith A. Brown, Digital Discovery, 2023, 2, 1620–1629, DOI: 10.1039/D3DD00150D

Research infographic – Feature selection in molecular graph neural networks based on quantum chemical approaches

Discover new research on feature selection for molecular systems in this new infographic:

An infographic summarising the linked article

Read the open access full article at the link below:

Feature selection in molecular graph neural networks based on quantum chemical approaches

Daisuke Yokogawa and Kayo Suda, Digital Discovery, 2023, 2, 1089–1097, DOI: 10.1039/D3DD00010A