The evolving roles of data and citations in journal articles
Henry S. Rzepa
Emeritus Professor of Computational Chemistry, Department of Chemistry, Imperial College London.
Background
The last thirty years have seen enormous changes in the so-called scientific journal model, first introduced some 350 years ago as a paper-based medium. The typical journal article in, say, the chemical sciences has evolved during this period to contain a traditional narrative structure: an introduction or background to the topic, the presentation of results and data, conclusions drawn from the data, experimental procedures to enable replication and a bibliographic section where relationships to other work can be cited. Such a serial narrative format has itself come under scrutiny. One recent publishing experiment, for example, dissects it into eight smaller units of publication, potentially with their own structures and authorship, each of which could stand on its own merits but which can also be assembled to reconstitute an overarching synoptic journal article [1]. The electronic journal era of the last 30 years has also brought with it experiments in how the various constituents of the traditional journal article might be digitally exploited. An example [2] dating from the start of the e-journal period showed how selected articles in the journal Chemical Communications could be enhanced with “pop-up” interactive molecular models based on 3D coordinate data provided by the authors, thus augmenting the static views provided by conventional figures.
In the present commentary, the focus will be on two other ways of digitally exploiting the medium of the journal, both driven by the extraordinary recent attention given to artificial intelligence or machine learning and by questions such as whether current publishing models need to be prepared for this new era. The first is how the availability, discovery and properties of data associated with journal articles are being improved; the second is citation enhancement. Both are facets of the publication process and turn out to be closely inter-related.
Journals and Data
For much of the history of publishing in, for example, chemistry, the data behind a research article have been integrated into the article in the form of tables of numerical results and/or figures derived from these data, along with graphical schemes illustrating other aspects such as molecular structures and associated reactions and mechanisms. Isolated numerical data could often be simply integrated into the text-based narrative. This became impractical when the tables of numerical data swelled in size, an example being crystallographic information from the 1950s onwards. Procedures for printing this information and then depositing the print copy in a national library or other central resource were introduced, and this became more common for a short period during the 1970s [3]. In order to re-use such data, an interested reader would have to re-type the numerical information into, say, a computer for analysis, and then spend a fair bit of time trying to ensure that no errors had been introduced by this process. From the mid 1990s, this paper-based form thankfully started being replaced by “electronic printing” into the PDF format, when it became known as ESI or electronic supporting information, a mechanism that still dominates to this day. Over the last decade however, it has been increasingly recognised [4] that ESI is not an optimal medium for use in areas such as artificial intelligence and machine learning (abbreviated AI/ML here), for which specifically structured and semantically rich information is essential or at least greatly helpful [5].
Journals and Citations
It is appropriate at this point to interleave citations into the discussion. These have their own fascinating history! In the 19th and early 20th century, citations in an article were often sparse and cryptic, with journal references heavily abbreviated, possibly to save type-setting effort. I cannot resist citing [6] this article by Niels Bohr dating from 1922 as an extreme example. Probably one of the most influential articles of that century (leading to a Nobel prize no less), it contains no citations either as footnotes or endnotes; instead, individuals contributing to the area are acknowledged throughout the text. Nonetheless, by the second half of the 20th century, most research articles had fully separated citations into a discrete list at the end of the article. Arguably, these lists were often mis-used by inclusion of text-based footnotes extending the discussion of the main body of the article. Individual numbered citations could themselves contain sub-lists of journal references associated by an inferred common theme and of hoped-for relevance to the discussion. Such lists started suffering from the same issues as ESI, in other words an apparent lack of the formal structures and declared semantics so helpful for AI/ML. These will be referred to as unstructured citations for reasons that will shortly become apparent.
Journals and Metadata
It is time to introduce the unifying concept of metadata, this being structured and controlled descriptions of a body of data or of a narrative, including simple components such as authorship, article titles, abstracts, affiliations, provenance and publication dates. These formal structures now allow metadata to be more easily processed and analysed using AI/ML methods and provide infrastructures for obtaining, for example, metrics relating to research impacts. Whereas the commercial models that many publishers used in the era before open access would result in access to the digital journal article itself being paywall-protected in some manner, the metadata associated with that article was not so protected and was made readily available for use by anyone. In 2000, the Crossref organisation [7] was set up by a consortium of publishers, libraries, research institutions and funders to accept, store, curate and disseminate this metadata, and Crossref issued what is known as a persistent identifier (the DOI is a specific example of such a PID) to identify the metadata records.
Initially, Crossref metadata did not include the citations from an article, but from 2004 [8] these were added as a discrete component in the form of structured citations. Initial uptake by publishers was slow, but nowadays it is almost universal [9]. These structured citations of books and journal articles included conventional information such as the author and journal name and the volume and page numbers, but in time they evolved to also include the article DOI, which allows facile and programmatic access to the metadata record for each citation. At this stage a record is introduced for one specific article [10], together with its access point in a form suitable for AI/ML applications:
https://api.crossref.org/works/10.1039/D3DD00246B/transform/application/vnd.crossref.unixsd+xml
An example of a structured citation from this record (as of mid 2024) is shown below:
<citation key="D3DD00246B/cit25/1">
<journal_title>J. Chem. Phys.</journal_title>
<author>Scalmani</author>
<cYear>2010</cYear>
<first_page>114110</first_page>
<doi>10.1063/1.3359469</doi>
</citation>
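As a minimal sketch of what "programmatic access" can mean here, the Python fragment below builds the access point quoted above for a given article DOI and reads the fields of the structured citation just shown. The function names are hypothetical and chosen only for illustration, and no network request is made. It also assumes the citation element is parsed on its own; the full Crossref record may declare XML namespaces, in which case the tag names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# Citation element copied from the record above, with plain ASCII quotes.
CITATION_XML = """
<citation key="D3DD00246B/cit25/1">
  <journal_title>J. Chem. Phys.</journal_title>
  <author>Scalmani</author>
  <cYear>2010</cYear>
  <first_page>114110</first_page>
  <doi>10.1063/1.3359469</doi>
</citation>
"""


def crossref_record_url(doi):
    """Build the access point quoted in the text for an article DOI."""
    return ("https://api.crossref.org/works/" + doi
            + "/transform/application/vnd.crossref.unixsd+xml")


def citation_fields(xml_text):
    """Return the citation key and the text of each child element as a dict."""
    element = ET.fromstring(xml_text.strip())
    fields = {"key": element.get("key")}
    for child in element:
        fields[child.tag] = (child.text or "").strip()
    return fields


record_url = crossref_record_url("10.1039/D3DD00246B")
fields = citation_fields(CITATION_XML)
print(record_url)
print(fields["author"], fields["cYear"], fields["doi"])
```

Because every field sits in its own named element, nothing has to be guessed from punctuation or word order, which is the practical advantage of a structured citation.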
If you explore the metadata further, you will soon encounter a slightly different form, which is designated an unstructured citation, arising by virtue of inclusion of a component containing free-text comments. This is how all those citation footnotes, comments and other annotations so beloved by some authors are currently included. In this example, the article DOI itself is also noted, thus rendering the unstructured component somewhat redundant, but this is not always the case!
<citation key="D3DD00246B/cit10/1">
<volume_title>ChemRxiv</volume_title>
<author>Braddock</author>
<cYear>2024</cYear>
<doi>10.26434/chemrxiv-2023-vcmcl</doi>
<unstructured_citation>For a preprint, see, D. C.Braddock, S.Lee and H. S.Rzepa, SWERN Oxidation.
transition structure Theory is OK, ChemRxiv, 2023, preprint, 10.26434/chemrxiv-2023-vcmcl
</unstructured_citation>
</citation>
A third variation in the citation format can also be identified.
<citation key="D3DD00246B/cit19/1">
<volume_title>Imperial College Research Data Repository</volume_title>
<author>Braddock</author>
<cYear>2023</cYear>
<doi>10.14469/hpc/13108</doi>
<unstructured_citation>
D. C.Braddock, H. S.Rzepa and S.Lee, Imperial College Research Data Repository, 2023,
10.14469/hpc/13108</unstructured_citation>
</citation>
Here one might infer from the volume title that this is now about data. This is a suitable entry point for the discussion to rejoin the theme introduced above regarding data and ESI. However, instead of referring to data inside an ancillary PDF file associated with the article, a data DOI is now cited. As implied above for article DOIs, this form also has an associated metadata record, which is stored, curated and disseminated by DataCite [11], an organisation set up some ten years after Crossref but acting in parallel to allow the citation of data. Unlike data contained in relatively unstructured (or parochially structured) ESI documents, this form of data has associated formal descriptors in the metadata record describing the properties of the data. DataCite also allows access to this record, albeit using a slightly different form to that used by Crossref:
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/13108
The properties as described by such a metadata record constitute information about how Findable, Accessible, Interoperable and Re-usable the data are, properties that became known by the acronym FAIR [12] around 2016 and are important for the application of AI/ML. Note however that again in the citation example shown above, an unstructured component is also included containing the free-text assertion that the data are held in an institutional research data repository. Formally therefore, data is only implied by this form of citation, but at least the metadata record associated with the provided DOI can be used to confirm this. At this stage it is worth noting that around half of all the citations associated with this specific article [10] are of this type, an unusually high proportion. When an assertion is made in the narrative of this article, it can now be supported with a data citation as appropriate. Such multiple and in-context data citation can be contrasted with the conventional data availability statement nowadays found in most journal articles, introduced around 2017, which often simply points to the single and largely context-free supporting information document listed on the article landing page.
Very shortly the expectation is [13] that Crossref will modify the unstructured aspect of data citation by a small extension to their schema in the form shown below, hence adding the ability to formalise the citation of data in an article.
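The distinction between the citation forms, and the step of following a cited DOI to its DataCite record, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive procedure: the labels and function names are hypothetical, the free text of the unstructured component is replaced by a placeholder, and the code does not itself determine which agency registered a DOI (it simply applies the DataCite URL form quoted above).

```python
import xml.etree.ElementTree as ET

# The data citation from the record above; the free text is a placeholder.
DATA_CITATION_XML = """
<citation key="D3DD00246B/cit19/1">
  <volume_title>Imperial College Research Data Repository</volume_title>
  <author>Braddock</author>
  <cYear>2023</cYear>
  <doi>10.14469/hpc/13108</doi>
  <unstructured_citation>free text omitted in this sketch</unstructured_citation>
</citation>
"""


def classify_citation(element):
    """Label a citation by the child elements it carries (hypothetical labels)."""
    has_doi = element.find("doi") is not None
    has_free_text = element.find("unstructured_citation") is not None
    if has_doi and has_free_text:
        return "DOI plus free text"
    if has_free_text:
        return "free text only"
    return "structured"


def datacite_record_url(doi):
    """Build the DataCite access point quoted in the text for a data DOI."""
    return "https://data.datacite.org/application/vnd.datacite.datacite+xml/" + doi


citation = ET.fromstring(DATA_CITATION_XML.strip())
label = classify_citation(citation)
record_url = datacite_record_url(citation.findtext("doi"))
print(label)
print(record_url)
```

Retrieving the record at that URL is what would confirm that the cited object really is a dataset, since the citation itself only implies it.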
<citation type="dataset" key="D3DD00246B/cit19/1">
<volume_title>Imperial College Research Data Repository</volume_title>
<author>Braddock</author>
<cYear>2023</cYear>
<doi>10.14469/hpc/13108</doi>
</citation>
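If the proposed type attribute is adopted, selecting the data citations in a record becomes a simple filter. The sketch below assumes exactly the attribute name and value shown in the proposal above, which may change before release. The enclosing citation_list wrapper and the function name are included only to make the fragment self-contained.

```python
import xml.etree.ElementTree as ET

# Two citations from the examples above, trimmed to the fields used here.
# The type="dataset" attribute follows the proposal and is not yet guaranteed.
CITATION_LIST_XML = """
<citation_list>
  <citation key="D3DD00246B/cit25/1">
    <journal_title>J. Chem. Phys.</journal_title>
    <doi>10.1063/1.3359469</doi>
  </citation>
  <citation type="dataset" key="D3DD00246B/cit19/1">
    <volume_title>Imperial College Research Data Repository</volume_title>
    <doi>10.14469/hpc/13108</doi>
  </citation>
</citation_list>
"""


def dataset_dois(xml_text):
    """Return the DOIs of all citations explicitly typed as datasets."""
    root = ET.fromstring(xml_text.strip())
    return [citation.findtext("doi")
            for citation in root.iter("citation")
            if citation.get("type") == "dataset"]


data_dois = dataset_dois(CITATION_LIST_XML)
print(data_dois)
```

No inference from a volume title or from free text is needed once the type is declared.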
Crossref also proposes to formalise the data availability statement alluded to above. In most current articles in this and other journals it appears in the generic form of a Data availability section, where the authors can list how their data can be obtained in the form of e.g. URLs or DOIs. However, this information does not currently appear in the Crossref metadata record unless the authors have also included it as an unstructured citation. The proposal is to add it to the metadata record in the form:
<statement type="data availability">Data Availability Statement … … </statement>
The content of this statement is still unstructured free-text, but at least it is available for parsing and analysis in ways that might be useful.
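One example of such parsing is to recover any DOIs mentioned in the free text. The sketch below assumes the proposed statement element shown above and uses an invented statement text purely for illustration. The regular expression is a common heuristic for DOI-like strings, not an official DOI grammar, and may miss or over-match unusual identifiers.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical statement text; only the element form comes from the proposal.
STATEMENT_XML = ('<statement type="data availability">'
                 'Data for this article are available at '
                 'https://doi.org/10.14469/hpc/13108.'
                 '</statement>')

# Heuristic: "10.", a 4 to 9 digit prefix, a slash, then non-space characters.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def dois_in_statement(xml_text):
    """Extract DOI-like strings from the free text of a statement element."""
    element = ET.fromstring(xml_text)
    text = element.text or ""
    # Trailing sentence punctuation is not part of the identifier.
    return [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]


found = dois_in_statement(STATEMENT_XML)
print(found)
```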
At this stage, the assertion above that the two facets of data and citations are in fact closely associated can be summarised as:
- Key information about a journal article is now made freely available via its metadata record, a structured and semantically rich format that allows AI/ML processing.
- The relationships the article has with other articles are now also present in the form of structured citations in the Crossref metadata record.
- Such structured citations should include persistent identifiers such as DOIs with an indication of the type of the citation, such as to a dataset.
- The inclusion of persistent identifiers in turn allows AI/ML access to metadata records describing data referred to in the article.
Primary vs processed data
This section discusses two ways in which data can be expressed in an article: firstly the conventional Tables/Figures/Schemes contained in the body of the article, and secondly citations allowing specific access to more complete, or at least less lossy, primary data. The broad distinction made here is that the former representations might constitute processed and interpreted data, whereas ideally the latter would constitute the more complete data from which the former are derived, such as that obtained from an instrument or output by a computational procedure. Specific examples illustrate the difference between the two.
- A form of processed data could be an NMR or frequency domain spectrum presented in association with a chemical structure representation. The combination of the two can be used to confirm the identity of e.g. the product of a chemical synthesis.
- The corresponding primary or raw form would be the time-domain data as produced directly from an NMR instrument, to be converted by e.g. a Fourier Transform operation to a frequency domain presentation that is more readily analysed. The process of converting the primary data to the processed form is of course lossy; some information at least is lost by this conversion.
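The relationship between the two forms can be illustrated with a toy calculation. The sketch below generates a synthetic time-domain signal (a single decaying oscillation standing in for instrument data) and converts it to the frequency domain with a plain discrete Fourier transform. All parameter values are arbitrary assumptions, and real NMR processing involves further steps not shown here. Keeping only the magnitude of each frequency point, as a plotted spectrum typically does, discards the phase, which is one simple way in which a processed form carries less information than the primary data.

```python
import cmath
import math


def simulated_fid(n_points, freq_bin, decay):
    """Synthetic time-domain signal: one exponentially decaying oscillation."""
    return [cmath.exp(2j * math.pi * freq_bin * t / n_points) * math.exp(-decay * t)
            for t in range(n_points)]


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), adequate for a small example."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


# "Primary" data: 64 complex points oscillating at frequency bin 10.
fid = simulated_fid(64, 10, 0.05)

# "Processed" data: magnitudes only, as might be drawn in a figure.
spectrum = dft(fid)
magnitude = [abs(value) for value in spectrum]
peak_bin = max(range(len(magnitude)), key=magnitude.__getitem__)
print(peak_bin)
```

The peak position is recovered from the magnitudes, but the original time-domain points cannot be regenerated from them alone, whereas the primary data can always be reprocessed in a different way.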
A second example derives from computational modelling.
- A form of processed data could be a two-dimensional representation or figure corresponding to the highest occupied molecular orbital or the HOMO of a molecule of interest.
- The corresponding primary data would be a file containing the full wavefunction calculated for the molecule using a specific model for solution of the Schrödinger equation and presented as loss-free data in the form of a formatted checkpoint or rawbinaryarray file [14] resulting from e.g. a Gaussian calculation. These forms would allow not only an alternative three-dimensional representation of the HOMO to be generated, but indeed that of any other desired orbital or other property computable from the wavefunction.
The final example is found in the article cited above [10] and relates to the calculation of kinetic isotope effects.
- The processed data derives from application of the Bigeleisen model to kinetic isotope effects for deuterium substitution at a specified temperature and for specified atoms, using computer code specified again by a suitable DOI-based citation. It can be presented as numerical values in a table.
- The primary data derives from the final calculation checkpoint files, which as well as containing the wavefunction also contain the second derivative force constant matrix, allowing other isotopic substitutions to be made at any location in the molecule and which can be evaluated at any required temperature.
The purpose of including these examples of forms of data is to show that both can be useful! Processed data, in the form of visualisable figures and tables, are particularly helpful for the type of perception of complex concepts that humans traditionally excel at. Primary data are useful for access to alternative forms of visualisation, for re-use in a context different to that presented in the body of the article, or for application of alternative models to those presented by the original authors, such as might be derived by AI/ML methods. The journal experiment noted above [2] combined these by accessing the primary data (molecular coordinates) and converting this on the fly to a pop-up visual representation for humans (an interactive 3D model). Even at the simplest level, access to primary data might allow replication of the results quoted in the original article. In the article cited [10], such replication was not always possible because of the lack of such primary data associated with the original report [15].
Data Discovery
The examples above illustrate how the various components of a scientific article can be prepared for AI/ML analysis by adding predictable structures to both the citations and the data implicit in the article. There is another important benefit of data citation which is illustrated next, that of data discovery. Finding something in a conventional ESI document is largely limited to searching the free text for appropriate string patterns. The scope of such a pattern search does not extend beyond that document. However, metadata records associated with a dataset are automatically aggregated by the metadata registration agency, being either Crossref or DataCite. Both offer rich, structured and federated searches of the metadata across all registered entries, not just of a single ESI document. To illustrate this aspect, the data availability statement in the article discussed above [10] has been modified to include both data availability and discovery. An extended version of the example cited there is shown below [16]:
https://commons.datacite.org/?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+media.media_type:text/plain+AND+(titles.title:*Endo*+OR+descriptions.description:*Endo*+OR+titles.title:*Exo*+OR+descriptions.description:*Exo*)
If this syntax looks rather long and unwieldy, it is because it follows the query form of an API (application programming interface) of the kind used by AI/ML applications (the specific API form of the above is https://api.datacite.org/dois/?query= ). It reveals all datasets derived from using the Gaussian quantum chemical application, as restricted by the presence of an additional file containing further information (here the kinetic isotope effects) and by specified title or description keywords, the search being within the global corpus of registered metadata. This extends the scope of the discovery well beyond that of a single ESI document. Constraining the search to a particular specified property, namely kinetic isotope effects, would require future community agreement [18] on the vocabulary term and/or scheme to be used for that property. Here a possible such term is invoked by appending +AND+subjects.subjectScheme:*KIE*+AND+subjects.subject:1H/2H to the above search, which constrains the property to KIE and its value to 1H/2H (a hydrogen-deuterium isotope effect) [17]. The searches themselves can even be assigned [16,17] a persistent identifier to facilitate discovery by e.g. AI/ML software. The community is here challenged to enable enrichment of the descriptive and relational publication metadata by agreeing wider vocabularies or search terms, thus enabling data discovery to be made ever more specific and accurate [18].
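Because the query is just a structured string, software can assemble it from its parts. The following sketch rebuilds the search shown above from lists of media types and keywords; the helper names are hypothetical, the field names are those appearing in the query itself, and spaces are written as + to mirror the URL in the text (no other encoding is applied and no request is sent).

```python
API_BASE = "https://api.datacite.org/dois/?query="


def any_of(terms):
    """Join alternative terms into one parenthesised OR clause."""
    return "(" + " OR ".join(terms) + ")"


def keyword_terms(keywords):
    """Search each keyword as a wildcard in both titles and descriptions."""
    terms = []
    for word in keywords:
        terms.append(f"titles.title:*{word}*")
        terms.append(f"descriptions.description:*{word}*")
    return terms


def build_query(output_types, extra_type, keywords):
    """Combine media-type and keyword constraints into one AND-ed query."""
    clauses = [
        any_of([f"media.media_type:{media_type}" for media_type in output_types]),
        f"media.media_type:{extra_type}",
        any_of(keyword_terms(keywords)),
    ]
    return " AND ".join(clauses).replace(" ", "+")


query = build_query(
    ["chemical/x-gaussian-log", "chemical/x-gaussian-checkpoint"],
    "text/plain",
    ["Endo", "Exo"],
)
print(API_BASE + query)
```

Changing the keyword list or the media types then yields a new discovery search without hand-editing the long URL.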
The future
The examples used to illustrate the concepts described above show how a journal article [10] can be very usefully adapted to make it more AI/ML-friendly, with relatively little extra effort required of its authors. Many more innovations associated with both data and citations can be anticipated, and the 350+ year evolution of scientific publishing will continue apace!
Note added after publication
Sara El-Gebali from DataCite has also published a blog post on 20th August 2024 entitled “Connecting the Dots with DataCite DOI Metadata”, which usefully expands upon the discussion in this commentary. This gives a wider range of metadata types that can be used for discovery. See DOI: 10.5438/k81t-zq43
A citable version of this blog post is available on ChemRxiv, at DOI: 10.26434/chemrxiv-2024-dz2dv
References:
1 The Octopus publishing project, https://www.octopus.ac/about
2 D. James, B. J. Whitaker, C. Hildyard, H. S. Rzepa, O. Casher, J. M. Goodman, D. Riddick and P. Murray-Rust, The case for content integrity in electronic chemistry journals: The CLIC project, New Review of Information Networking, 1995, 1, 61–69, DOI: 10.1080/13614579509516846
3 H. S. Rzepa, The Long and Winding Road towards FAIR Data as an Integral Component of the Computational Modelling and Dissemination of Chemistry, Isr. J. Chem. 2022, 62, e202100034, DOI: 10.1002/ijch.202100034
4 J. Downing, P. Murray-Rust, A. P. Tonge, P. Morgan, H. S. Rzepa, F. Cotterill, N. Day and M. J. Harvey, SPECTRa : The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories, J. Chem. Inf. Model., 2008, 48, 1571–1581, DOI: 10.1021/ci7004737
5 P. Murray-Rust and H. S. Rzepa, Chemical Markup Language and XML Part I. Basic principles, J. Chem. Inf. Comput. Sci., 1999, 39, 928, DOI: 10.1021/ci990052b
6 N. Bohr, Der Bau der Atome und die physikalischen und chemischen Eigenschaften der Elemente. Zeitschrift für Physik, 1922, 9, 1–67, DOI: 10.1007/BF01326955
7 The Formation of Crossref: A Short History, https://www.crossref.org/pdfs/CrossRef10Years.pdf
8 See Crossref Schema 2.0.5, 2004, https://web.archive.org/web/20040202113642/http://www.crossref.org/02publishers/forward_linking_howto.html
9 D. Shotton, Publishing: Open citations. Nature, 2013, 502, 295–297, DOI: 10.1038/502295a
10 D. C. Braddock, S. Lee and H. S. Rzepa, Modelling kinetic isotope effects for Swern oxidation using DFT-based transition state theory, Digital Discovery, 2024, 3, 1496–1508, DOI: 10.1039/D3DD00246B
11 J. Neumann and J. Brase, DataCite and DOI names for research data, J. Comput.-Aided Mol. Des., 2014, 28, 1035–1041, DOI: 10.1007/s10822-014-9776-5
12 M. Wilkinson, M. Dumontier, I. Aalbersberg, et al., The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data., 2016, 3, 160018, DOI: 10.1038/sdata.2016.18
13 Crossref Metadata updates (for public comment) July 2024, https://docs.google.com/document/d/1VPXhTPMZzfvAPmTOlNp-bZf9cTLkw0dPZFTuDtDIPls/
14 H. S. Rzepa, Quantum chemistry interoperability (library): another step towards FAIR data, 2022, https://www.ch.imperial.ac.uk/rzepa/blog/?p=24543, DOI: 10.59350/mzs83-g6218
15 T. Giagou and M. P. Meyer, Mechanism of the Swern Oxidation: Significant Deviations from Transition State Theory, J. Org. Chem., 2010, 75, 8088–8099, DOI: 10.1021/jo101636w
16 H. S. Rzepa, Example of a discovery search procedure, 2024, DOI: 10.14469/hpc/14510
17 H. S. Rzepa, Example of a discovery search procedure using a subject-constrained search, 2024, DOI: 10.14469/hpc/14517
18 This is currently being done for e.g. NMR Spectroscopy; R. M. Hanson, D. Jeannerat, M. Archibald, I. Bruno, S. Chalk, A. N. Davies, R. J. Lancashire, J. Lang and H. S. Rzepa, IUPAC specification for the FAIR management of spectroscopic data in chemistry (IUPAC FAIRSpec) – guiding principles, Pure and Applied Chemistry, 2022, 94, 623–636, DOI: 10.1515/pac-2021-2009