PDFs enhanced with XMP

Our readers still read most of our articles on the web as PDFs rather than HTML, so we thought we’d experiment with making some of our award-winning Prospect markup available through PDFs as well as through HTML.

Our first experiment is with XMP, a format which has hitherto mainly been used for metadata in photographs. We’re including compound data as InChIs, specifically pointers to the RSC InChI resolver, and incorporating other entities of interest with reference to OBO and RSC ontologies.

Examples, and instructions for how to see what we’ve included with an ordinary PDF viewer, are available here: http://www.rsc.org/Publishing/Journals/ProjectProspect/Examples.asp

The XMP packets are not really intended to be read directly by human beings; we anticipate that they will be picked up and indexed by search engines or desktop search, and that people will use Adobe’s SDK to extract the data into a triplestore where it can be reasoned over.
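
As a rough illustration of that extraction step, here is a minimal Python sketch that uses only the standard library rather than Adobe’s SDK. It assumes the XMP packet is stored uncompressed in the PDF, so that it can be found by scanning the file’s bytes for the `<?xpacket begin=...?>` and `<?xpacket end=...?>` markers. The function names `find_xmp_packets` and `simple_triples` are our own, and the second function only handles flat properties (plain literals and `rdf:resource` links), so it is a simplification and not a full RDF/XML parser.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# An XMP packet sits between an "xpacket begin" and an "xpacket end"
# processing instruction. Assumption: the packet is not compressed.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def find_xmp_packets(pdf_bytes):
    """Return the raw XML body of every XMP packet found in the bytes."""
    return [m.group(1).strip() for m in PACKET_RE.finditer(pdf_bytes)]


def simple_triples(packet_xml):
    """Yield (subject, predicate, object) for flat properties only.

    Nested structures (rdf:Alt, rdf:Bag, rdf:Seq, blank nodes) are
    skipped; a real RDF parser would be needed to handle those.
    """
    root = ET.fromstring(packet_xml)
    for desc in root.iter("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS, "")
        for prop in desc:
            # ElementTree writes tags as "{namespace}local"; join them
            # into a single predicate URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            resource = prop.get("{%s}resource" % RDF_NS)
            if resource is not None:
                yield (subject, predicate, resource)
            elif len(prop) == 0 and prop.text and prop.text.strip():
                yield (subject, predicate, prop.text.strip())
```

To try it, read a PDF with `open(path, "rb").read()`, pass the bytes to `find_xmp_packets`, and feed each packet to `simple_triples`. The resulting tuples could then be loaded into whichever triplestore you prefer.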

We should also acknowledge that Omer Casher and Henry Rzepa at Imperial College London were experimenting with XMP back in 2006, and that NPG’s Tony Hammond has been blogging extensively on this subject on the CrossTech blog.

More experiments soon, but do let us know what you think in the comments below!

8 Responses to “PDFs enhanced with XMP”

  1. I am happy to hear about this. It reminds me of the OTMI experiment held by Nature a few years back.

    Now, to me, the real issue, accepting that you cannot do everything right within 4096 characters (certainly not full ontological content descriptors or full InChIs), is whether or not the RSC is going to state that the use of this data falls under fair use and, as such, that the data will be placed in the public domain (e.g. with CCZero), independent of whether the paper itself is OA.

  2. Tony Hammond says:

    Hi Colin:

    Took a very quick peek at these and they look great. I have a couple comments:

    1. For reading the XMP data there only seems to be an “Additional Metadata” button in Adobe Acrobat Standard and not in Adobe Reader. Not that I could see anyway. I tried Reader on both Mac (9.3.3) and PC (9.0.0).

    2. The title property isn’t coming through. It seems to be using the filename. I compared with our XMP packets and the only difference I can immediately see is the xml:lang attribute, where you have “en” and we use “x-default”.

    3. There’s very little descriptive metadata, e.g. PRISM elements, but then you know that. :)

    4. The links to the InChI resolver etc. seem only to return an HTML page. I did try to send through a couple of different “Accept” headers hoping to get some RDF/XML back but that didn’t seem to work. I guess the Linked Data trail stops here.

    5. I notice that both PDFs are not optimized. Was that a consequence of adding in the packets or is that how you normally serve PDFs?

    Cheers,

    Tony

  3. [...] This post was mentioned on Twitter by matthew llewellin, Tony Hammond. Tony Hammond said: More XMP. RSC experimenting with adding markup to PDFs for Prospect: http://bit.ly/91JtZM #xmp [...]

  4. Dave says:

    It’s been a long while since I added XMP stuff to our PDFs, but I have vague recollections that the only way I could get Reader to show the descriptive metadata was to fashion the XMP like that output by Acrobat.

    http://journals.iucr.org/a/issues/2003/01/00/ay0015/ay0015.pdf

    Dave

  5. Henry Rzepa says:

    Metadata is useful in bulk, when it can be aggregated, mined, and repurposed.

    Whilst copyright issues regarding the main article are (more or less clearly) published by publishers, what rights does the reader have over the metadata? Is copyright over that also claimed by the publisher? Can readers freely use the metadata harvested/mined from an article (by automated means using XMP or other), or are there restrictions?

  6. Colin Batchelor, Senior Informatics Analyst says:

    Hello Henry (and Egon)!

    Thanks for your questions about licences—Richard will be answering these in the next few days.

  7. Colin Batchelor, Senior Informatics Analyst says:

    And hello Tony and Dave!

    Thanks for your feedback. Yes, Adobe needs the RDF in XMP to be just so. It’s really fiddly.

    Now, as for Tony’s numbered points: (1) That’s a shame. (2) We’ll make sure this is fixed. (3) Yes! (4) The landing pages on the InChI resolver are an obvious place to put RDF and we’re looking at that. (5) Yes, that was a consequence of adding the packets. We’d make sure that we reoptimize them in practice.

    Best wishes,
    Colin.

  8. Ramunas says:

    @Colin Batchelor
    I can’t find any information in this blog about the licence. Or is this the wrong place to look?
