The National Archives Labs

Linked data and PRONOM

PRONOM is The National Archives’ technical registry. We plan to release the data it holds in a linked open data format, making it easier to reuse.

The PRONOM registry contains information about file formats, compression techniques and encoding types. Linked data is about linking up related data on the web: it uses URIs and RDF to help expose, share and connect data, information and knowledge.

Initially we will concentrate on modelling and publishing the file format data already stored in PRONOM, using linked data standards. This is the largest body of data within PRONOM, and our first step will be to convert the existing records to RDF that describes the features of each format. The new version of PRONOM will be extensible, so at a later stage we will enhance the data model to improve other areas of information in the database.
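
As a rough illustration of what converting an existing record to RDF could look like, the sketch below turns one format record into N-Triples lines. It is not PRONOM's actual data model: the `example.org` namespace, the property names and the sample record (including the placeholder identifier `fmt/000`) are all assumptions made up for this example.

```python
# Hypothetical namespaces; the real PRONOM vocabulary may look quite different.
BASE = "http://example.org/pronom/"
VOCAB = "http://example.org/pronom/vocab#"


def literal(value):
    """Escape a Python string as a plain N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def format_to_ntriples(record):
    """Turn one format record (a dict) into a list of N-Triples lines.

    The 'puid' key becomes the subject URI; every other key becomes
    a predicate in the assumed vocabulary with a literal object.
    """
    subject = f"<{BASE}{record['puid']}>"
    lines = []
    for key, value in record.items():
        if key == "puid":
            continue
        predicate = f"<{VOCAB}{key}>"
        lines.append(f"{subject} {predicate} {literal(value)} .")
    return lines


# Made-up sample record, for illustration only.
sample = {
    "puid": "fmt/000",
    "name": "Example Format",
    "version": "1.0",
    "extension": "exm",
}

for line in format_to_ntriples(sample):
    print(line)
```

Each printed line is one subject, predicate, object statement about the format, which is the shape of data a triplestore holds.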

Eventually we hope to be able to use linked data to populate PRONOM from other external data sources, transparently showing where the information came from, and in doing so develop a more comprehensive technical registry.

We want the new version of PRONOM to be an open source system with a completely open code base.

We’d like to hear your comments on our plans, or suggestions for improving the PRONOM database, below – your input will inform its development.

Comments (11)

  • Tom

    Suggestions: Use Perl or Ruby, and create a web application (à la LAMP). Save the data in a SQL database, preferably PostgreSQL, with the option for all tools to use SQLite, especially tools which will be installed locally by users outside the National Archives. In addition to the usual serialization formats, publish data in SQLite. SQLite is portable to all platforms, is pre-installed on most platforms, and requires no outside software for parsing and exporting data. XML and even CSV require an external parser and lack the rigor and flexibility of SQL.

    Thanks,
    Tom

    The National Archives reply:

    Hi Tom,
    The new version of PRONOM will use a triplestore to store the data, rather than an SQL database, as we think this gives the data greater flexibility and lets us expand the range of data that PRONOM provides more easily and quickly. Users of PRONOM won’t need to install any external software to use the new version, and the data will be published in an open format, RDF, so it will be possible to export the data to a variety of formats. We hope this will give better adaptability than at present.

  • Euan Cochrane

    Hi David,

    I have many comments about the structure of PRONOM, but I’ll keep this blog comment short.
    The main thing would be a much greater emphasis on the creating application. The “formats” should be identified by a combination of the standard that the file is attempting to match its formatting to and the creating application for the file, rather than just the standard.

    Thanks,

    Euan

    The National Archives reply:

    Hi Euan, thanks for your thoughts on this. We will be using a triplestore to store existing PRONOM data, and this will entail structuring existing data in a way that makes the whole PRONOM data model more flexible, allowing us to add new types of information about the format further down the line. We will retain the PRONOM Unique Identifier (PUID) associated with each format, but will also be improving how we demonstrate the source of information about the format. So, both the structure and identifying features of a format in PRONOM should improve fairly early on.

  • Tom

    The RDF triplestore is something we can cope with. It would help us re-purpose the data if there were an existing RDF-to-SQL parser. I’m an evangelist for publishing all data in SQLite due to the flexibility and ease of use. (Beyond SQLite or SQL statements, I try to publish data in several formats so the customer can choose.) Fundamentally, my suggestion is that this remarkable resource be published in a form that is easy for programmers to work with. If the only download format will be RDF, it would be delightful to have Perl and Ruby examples of queries. SQL queries are easier than RDF calls to an API.
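
The RDF-to-SQL idea raised here can be sketched with a single three-column table in SQLite. The example below is in Python rather than Perl or Ruby, and it is only a minimal sketch: the regular expression handles just the simplest N-Triples lines (URI or plain literal objects, with no datatypes, language tags, blank nodes or unescaping), and the `example.org` URIs and sample data are made up for illustration. A real conversion would more likely use a full RDF parser.

```python
import re
import sqlite3

# Matches only simple lines of the form: <s> <p> <o> .  or  <s> <p> "literal" .
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)


def load_ntriples(conn, text):
    """Load simple N-Triples text into a 'triples' table; return rows inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples "
        "(subject TEXT, predicate TEXT, object TEXT)"
    )
    rows = []
    for line in text.splitlines():
        match = TRIPLE.match(line.strip())
        if match:
            subject, predicate, uri, lit = match.groups()
            # The object is either a URI or a literal; store whichever matched.
            rows.append((subject, predicate, uri if uri is not None else lit))
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    return len(rows)


# Hypothetical data; these are not real PRONOM URIs or records.
SAMPLE = """\
<http://example.org/pronom/fmt/000> <http://example.org/pronom/vocab#name> "Example Format" .
<http://example.org/pronom/fmt/000> <http://example.org/pronom/vocab#extension> "exm" .
<http://example.org/pronom/fmt/001> <http://example.org/pronom/vocab#name> "Another Format" .
"""

conn = sqlite3.connect(":memory:")
count = load_ntriples(conn, SAMPLE)

# An ordinary SQL query over the triples: list every format name.
names = conn.execute(
    "SELECT subject, object FROM triples "
    "WHERE predicate LIKE '%#name' ORDER BY subject"
).fetchall()
for subject, name in names:
    print(subject, name)
```

Because every statement lands in the same table, new kinds of information need no schema change, at the cost of queries that filter on the predicate column.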

  • Tom

    In terms of improvements, is it practical for DROID to optionally return the same information, in the same format, as the Linux “file” command? FITS generally returns a conflict status because DROID, JHOVE and others return complex descriptions, even when the question is simple. .doc files may have a ream of detailed information, but the MIME type is often enough. I have noticed that during ingest of files, we want two pieces of information: the common name of a file type and detailed file type information. I think PRONOM helps with the detailed info, but it would be nice to have a standard heuristic that provides the common name (or simplified identity).

    The National Archives reply:

    Hi Tom, well, DROID and PRONOM are different systems; DROID only uses a small subset of the PRONOM data and links back to PRONOM to offer more guidance to users. Linked data will make all output from PRONOM more flexible: there will be a lot users can do with it once it is released, because we will make available a queryable endpoint and multiple representations such as JSON, RDF, and more. Our primary objective is to recreate a system that fulfils The National Archives’ and the community’s requirements while opening up that flexibility at the same time. A new version of DROID is being developed at the moment; we do try to understand the needs of the community when developing this tool and hope that it provides a balance for everyone. The current tool is open source and available on SourceForge: http://sourceforge.net/projects/droid/ – as such it is always possible for developers to access the code, develop the functionality they require and feed that development back to the rest of the community. Before that, however, we recommend having a look at DROID 5.0 to see whether a combination of the CSV output and filtering options can help reduce the complexity you describe.
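
As a sketch of the kind of post-processing suggested in this reply, the example below reduces a CSV export to just the file path and MIME type. The column names (`FILE_PATH`, `PUID`, `MIME_TYPE` and so on) and the sample rows are assumptions for illustration; check the header row of an actual DROID export before relying on them.

```python
import csv
import io


def simplify(csv_text, keep=("FILE_PATH", "MIME_TYPE")):
    """Keep only the named columns from a CSV export.

    Columns missing from the export come back as empty strings.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: row.get(col, "") for col in keep} for row in reader]


# Made-up export; column names and values are assumptions, not real DROID output.
SAMPLE_CSV = (
    "FILE_PATH,PUID,MIME_TYPE,FORMAT_NAME,FORMAT_VERSION\n"
    "/ingest/report.doc,fmt/000,application/msword,Example Word Format,1.0\n"
    "/ingest/scan.tif,fmt/001,image/tiff,Example Image Format,6.0\n"
)

for row in simplify(SAMPLE_CSV):
    print(row["FILE_PATH"], row["MIME_TYPE"])
```

This keeps the detailed identification available in the full export while giving ingest workflows the short answer they usually need.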

  • Keith

    Hi David,

    Are you able to give any information on what ontology or ontologies, if any, you are using to model and create relationships for your data in the RDF triplestore? This would be useful for others who might envisage creating, or aligning, similar data along similar lines. Are you using some form of Government standards from data.gov.uk for this?

    Thanks
    Keith

    The National Archives reply:

    Keith,
    The vocabulary we use for PRONOM will necessarily be restricted by the current PRONOM database schema. It will be normalized where appropriate and expanded in areas where there is an identified advantage or a business case. Provenance handling is one area where we are likely to improve the data model. We appreciate there will be an interest in the vocabulary we adopt, so this will be made available through various channels while we work on it. However, a more public release will only be available much closer to the time this version of PRONOM goes live.

  • Dulanjali Adhikari

    I would like to know which archival file formats PRONOM uses for particular types of content. For example, what is the archival format for Word documents, spreadsheets, images, audio, video and databases?
