Modernise data publishing and reuse (draft)


Modernising information publishing

‘In the twenty-first century, information is the force powering our democracy and our economy. Both the private and the public sector increasingly rely on information and knowledge, and create value through their ability to manage these valuable assets. Successful societies and economies in the future will depend on how well they enable information to be appropriately shared’

Sir Gus O’Donnell Cabinet Secretary in ‘Information matters: building government’s capability in managing knowledge and information’

Websites have changed a great deal in recent years. Successful sites have become data systems that deliver a service to the customer in many different places by allowing reuse of information. The government’s use of the web is about more than the application of a set of communication tools such as blogs and wikis. The web has an architecture based on resources and links, which makes it a highly effective platform for data. Some of the most successful online tools work well because they are designed and engineered in keeping with this architecture. Examples include the photo-sharing website Flickr and the social networking service Twitter. These services separate data from presentation and provide separate APIs, which make the service more useful and help drive traffic to the site.
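The separation of data from presentation described above can be illustrated with a minimal sketch: one canonical record (the data) rendered in two ways (the presentation). The record fields and function names here are invented for illustration and do not correspond to any real API.

```python
import json

# One canonical record: this is the data, held once.
record = {
    "title": "Road closures this week",
    "published": "2009-01-26",
    "region": "Kingston",
}

def as_json(rec):
    """Machine-readable view: what an API endpoint might return."""
    return json.dumps(rec)

def as_html(rec):
    """Human-readable view: what the website itself would show."""
    return "<article><h1>{title}</h1><p>{region}, {published}</p></article>".format(**rec)

# Both views are generated from the same underlying data.
print(as_json(record))
print(as_html(record))
```

Because the two representations are derived from one source, the presentation can be redesigned, or entirely new presentations built by third parties, without touching the data itself.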

Generalising this, a person may be looking at a company’s product information on the company’s own website, or seeing it embedded in a widget in someone else’s site or blog. For example, a person might have a community website containing feeds of information from, say, the BBC for traffic reports for that area, a widget from a bookstore offering books relevant to that area, or a feed of planning applications from their local authority. The information from the bookstore, the BBC or the local authority would be the same if you went to their own sites; it is simply re-presented automatically in a third-party location. The more sites the information appears on, the more people will see it.
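As a sketch of the feed scenario above, the snippet below parses a small, made-up RSS 2.0 fragment (standing in for, say, a local authority’s planning-applications feed) and re-presents its items as HTML list markup, as a third-party community site might. The feed content and URLs are invented for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical RSS 2.0 fragment, as a local authority might publish it.
RSS = """<rss version="2.0"><channel>
  <title>Planning applications</title>
  <item><title>Extension at 12 High St</title><link>http://example.gov.uk/apps/1</link></item>
  <item><title>New shopfront, Market Sq</title><link>http://example.gov.uk/apps/2</link></item>
</channel></rss>"""

def items(feed_xml):
    """Extract (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]

# The third-party site re-presents the same data in its own design.
for title, link in items(RSS):
    print("<li><a href='{0}'>{1}</a></li>".format(link, title))
```

The consuming site controls the presentation entirely; the authority controls the data, and the same feed can be re-presented by any number of sites.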

The government web estate needs to move far closer to conforming with “The Architecture of the World Wide Web” (2004) or Tom Coates’s nine-point plan in “Native to a Web of Data” (2006). The world has moved from a controlled world, with a relatively small number of publishers selecting who and what gets published, to a world of massively democratised and decentralised publishing on the web. Web 2.0 tools such as blogs, wikis and Twitter sit at the far end of this trend. Anyone can say anything about anything, at relatively little or no cost.

These developments have led to different information structures for websites that provide and receive information. The Office for National Statistics is consulting on the use of a new model for access to 2011 census outputs, involving an interface that allows reusers to get at the underlying aggregated data rather than having to go through the ONS’s own top-level website (see consultation here). These new structures enable easy reuse of information by third parties. The taskforce discussed on its blog a new information model for public sector websites that designs in reuse of information.

Designing in reuse

This issue is discussed in detail on the Taskforce blog here.

Diagram 1 – the ‘traditional approach’

The 'traditional' architecture

The emphasis of much web development to date has been on the presentation of the data to the public.

The assumption was that a particular website would be the unique interface to a particular set of data.

This meant that little or no thought might have been given to how anyone else would use the data set in question.

Sometimes the data and any analysis of it could be unpicked from such a site but in many instances this would be extremely difficult.

Diagram 2 – a Power of Information model

A Power of Information Architecture

Thinking has moved on over recent years with a developing understanding of the importance of separating data from its presentation. If nothing else, this allows for simpler changes to the presentation layer as, for example, websites are redesigned.

PRESENTATION LAYER – the public-facing front end, typically a set of web pages

ACCESS LAYER – all the information needed to access the data, including technical, legal and commercial aspects

ANALYSIS LAYER – any form of interpretation of the raw data, typically for summary presentation

ACCESS LAYER – all the information needed to access the data, including technical, legal and commercial aspects

DATA LAYER – the raw data sets

The Taskforce judges that realising the power of much public information requires a different approach to the way public data sets are treated when published on the web. Several access layers to the data are needed. These layers must address all the issues necessary to enable use of the data: typically technical issues such as file formats, intellectual property issues such as copyright, and, where applicable, commercial issues such as pricing. The access layer is discussed in more detail here. Access to data allows many other actors to create their own analyses of it. A further access layer could allow reuse of the output of that analysis activity, again addressing any technical, intellectual property and commercial issues. With the access layers in place there is scope for multiple web presentations of the data, and additional value can be generated through the ability to interact with a community around the data.
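One minimal sketch of how the layers might fit together in code, using invented names and data (nothing here is drawn from a published specification):

```python
import json

# DATA LAYER: the raw data sets.
DATA = [
    {"council": "A", "applications": 120},
    {"council": "B", "applications": 80},
]

def analyse(records):
    """ANALYSIS LAYER: an interpretation of the raw data."""
    return {"total_applications": sum(r["applications"] for r in records)}

def access(payload, licence="Crown Copyright"):
    """ACCESS LAYER: a technical format, with legal terms travelling with the data."""
    return json.dumps({"licence": licence, "body": payload})

def present(doc_json):
    """PRESENTATION LAYER: one of many possible public-facing front ends."""
    doc = json.loads(doc_json)
    return "<p>Total: {0} ({1})</p>".format(doc["body"]["total_applications"], doc["licence"])

print(present(access(analyse(DATA))))
```

The point of the sketch is that each layer exposes an interface the layer above consumes, so a third party can plug in at the access layer (raw or analysed) and build its own presentation, while the licence label stays attached to the data.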

The power of the information is fully realised when all layers are in place and the architecture is designed to offer opportunities for interaction.

Recommendation

As the internet changes, so should the way information is published. The taskforce has developed with stakeholders a model to inform online publishing. This breaks information out into several layers, with external interfaces at each layer, allowing reuse both of the raw data and of the intervening software interfaces. OPSI should develop and further test the model, and publish it with a delivery mechanism, implementation plan and explanatory material by end June 2009. It should become the standard to which new systems, or re-implemented versions of existing systems, are built, from a date determined by the CIO Council.



17 Responses to “Modernising information publishing”

  1. I don’t think it’s a coincidence that the ‘web 2.0’ sites which have emerged as no. 1 in their fields have been those which opened their APIs.

    Don’t downplay the application of blogs and wikis though. Tools like WordPress offer remarkable RSS functionality, which could be seen as a primitive API. Indeed, I’ve worked with one private sector client who adopted WordPress solely because of its ability to generate data feeds: they don’t offer the ‘blog’ for public view.

    I’m pleased to see ONS consulting on an API for Census data. However, in the past anyway, their consultation has concentrated on existing contacts – who have historic processes for dealing with ‘old fashioned’ output. By definition, an API will open them up to a whole new audience they don’t yet talk to.

    It would be a dramatic gesture; but perhaps ONS shouldn’t actually build a Census front end at all… just an API?

  2. John Darlington says:

    This topic goes bang straight into the technical detail. I feel it needs a little more intro. The key for me is a change of mindset: from practice that tries to restrict where data gets to (licensing) to one of asking how far we can make this data spread while still letting the public know this is government information. I agree with a comment elsewhere by John that the Crown Copyright label is key. The only vital licence term is a viral one: the label must be maintained no matter how the data is distributed and aggregated.

  3. Mo says:

    Simon Dickson’s hit it on the nose here: it’s all about accessing the _data_ without having to employ nasty tricks to segregate it from presentation (and if it’s a report, then it needs to be in some semantically-useful form, like DocBook or semantic HTML; if it’s regularly updated, there need to be feeds for it, and so on).

    In the first instance, the Government webmasters should be worrying less about providing direct access to prettified versions of publicly-accessible information and more about letting others make actual [re]use of it.

    Take Hansard, for example: an absolute mine of information, all critical to the foundation of our democratic process, and yet… utterly inaccessible through any means beyond that horrible website. Why can I not get data feeds for particular MPs or Lords—a list of debates they took part in, made available via some relatively straightforward mark-up, referenced via a publicly-defined identifier scheme?

    Somebody clearly has to put that stuff up in some form or another, and they have to differentiate already between the various things which could be semantically represented, and it makes far more sense—in terms of both longevity and usefulness—to publish it first and foremost in terms of a “source” format (and if it’s XML-driven, a couple of XSL stylesheets would give you what you currently have from it).

    Instead we have the likes of this:

    Oral Answers to Questions

    …which happens to be semantically useless, invalid HTML (inline elements placed within a container which should only contain block-level elements), and littered with presentational artefacts.

    Hansard, continuing with this particular example, should be the utter pinnacle of “public connectivity” (to coin a phrase). Instead, from a technical standpoint, it’s embarrassing. Sadly, this is pretty representative of the public sector as a whole.

  4. Tony Hirst says:

    Here’s a portable version of this post that can be embedded elsewhere in a brandable widget; no effort involved on my part… http://grazr.com/gzpanel.html?pl=ou&exp=1&file=http://poit.cabinetoffice.gov.uk/poit/2009/01/modernising-information-publishing/?feed=rss2&withoutcomments=1

  5. Tony Hirst says:

    I’m not sure the layered model is the best graphic? You want something more like a ring (pie) chart, that shows how the same stuff (e.g. public data, Hansard records etc) can be accessed just as easily from several different directions/by several different audiences (public, lobby, government agencies, developers etc)? Some of these different access routes may themselves be layered, though?

  6. Simon Field says:

    “Office for National Statistics” please (not “of”).

    ONS is developing a data explorer that will itself be founded on an API which I hope will be published. It will be capable of operating across all ONS outputs, and so is not limited to our plans for the next Census (we hope to have it out there, and through a few releases before we reach Census outputs).

  7. Steph Gray says:

    This is a great idea – putting Government in the role of ‘wholesaler, not retailer’ of information as a colleague once put it. To ensure that a wide range of data sets are published in this way, it is important the model is simple and not unduly onerous on publishers: even publishing ‘raw’ data as spreadsheets, plain text or RSS feeds alongside PDFs or HTML pages should be encouraged and somehow incentivised.

  8. Paul Walk says:

    As an erstwhile software developer, I like a layered model as much as the next man.

    They can work well in the way that they help to convey basic concepts, like ‘separation of concerns’. But then they need to be put to one side before they become a fetish. I have first-hand experience of how this can happen with the JISC Information Environment – I touched on this issue here:

    http://blog.paulwalk.net/2008/08/20/all-models-are-wrong-but-some-are-useful/

    Diagram 2 should be a model for understanding, not a blueprint for development – it is already too restrictive. For example, do I really have to go through an analysis layer for every use of data?

  9. Tony Hirst says:

    “For example a person might have a community website containing feeds of information from say the BBC for traffic reports for that area or a widget from a bookstore offering books relevant to that area or a feed of planning applications from their local authority.”

    This paragraph makes an important distinction – the provision of content feeds, that a developer will transform in order to present it as they and their site design requires (subject to license terms, maybe), and the provision of widgets, where the content container and the presentation of the content itself are controlled by the original widget publisher (possibly with some minor re-skinning opportunities for the end user).

    In the case of feeds, the ultimate end-user may not know where the content came from if the developer/publisher does not acknowledge its source. In the case of widgets, the originating publisher can fix the content and widget container chrome so that the end user knows exactly where it came from.

  10. Mo says:

    Following on from Tony Hirst: from a long-term usefulness perspective, widgets—while ensuring “branding” and the like—are primarily useful to end-users using information in a very specific way (the way the widget author intended). Raw data feeds, on the other hand, are inherently flexible if implemented properly, and allow developers to make use of the information in ways the publisher may not even have thought possible to begin with, which is arguably the point of information-sharing and the very heart of innovation.

  11. Paul Walk says:

    Echoing Mo’s comment: Rufus Pollock is credited with the phrase:

    “The coolest thing to do with your data will be thought of by someone else”

    I don’t have an exact reference for this but see:

    http://blog.paulwalk.net/2007/07/23/“the-coolest-thing-to-do-with-your-data-will-be-thought-of-by-someone-else”/

    This is, at one level, unlikely to be true – but it points the way to a useful attitude to take with regard to the exposure of public data.

  12. John Darlington says:

    I totally agree with the re-use statements of Mo and Paul, but people need to know the provenance of the data they are looking at in the reuser’s application. Making sure the data is labelled is important. I want to know where the data came from so that I can understand the level of trust I should place in it. Perhaps Crown Copyright does sound restricting, but it is the current mechanism for stating the ownership of the data.

  13. Mo says:

    John — I do agree with you regarding data labelling, but that’s why we have ‘terms of use’ and suchlike, I think. I firmly believe it’s important that the data format and structure shouldn’t be made deliberately limiting because of this (it’s solving the problem in the wrong place, as it were).

  14. Barry Tennison says:

    I was almost alarmed at the sentence “The Office for National Statistics is consulting on the use of an new model for access to the 2011 census data involving an interface to allow reusers to get at the underlying data”. I think what is meant is more like “The Office for National Statistics is consulting on the use of an new model for access to the 2011 census outputs involving an interface to allow reusers to get at the underlying summarised (OR aggregated OR analysed) data”. It would be best not to “frighten the horses” by even hinting that individuals’ census returns might become available to reusers.

  15. Barry Tennison says:

    I have two connected presentational points about this section.
    (1) Its content is very closely connected with the section (strangely) titled “Embedding best practice”, which is also about publishing PSI. I think some redrafting would help, combining these sections or rebalancing the material.
    (2) In contrast to surrounding sections, this one rather rapidly gets quite technical, with the layer diagrams and so on. While quite techie people like me might like this, I’m not sure that it’s best designed for the intended audience. I’d recommend considering placing some of the more technical things (maybe expanding them) in an Appendix, and concentrating here on the main message(s), tailored for the most important intended readers. (Reading it three times does help, but I’m not sure most people will do that).

  16. Andy Mabbett says:

    Microformats, an HTML mark-up technique, allow users to download things like contact addresses and event details directly into calendar/ address book apps; and allow other sites to “mash up” such data. Govt sites should emit microformats where possible.

  17. Richard Quarrell says:

    “In the twenty-first century, information is the force powering our democracy and our economy… and … the future will depend on how well they enable information to be appropriately shared…”

    This quotation trumpets the importance of the information we are discussing – particularly in economic terms – and this is certainly (and maybe painfully) true. I think it’s pertinent to insist we remember that (1) we’re not just talking about the web and (2) the market for all this is global and our attitude should reflect this. I’m not sure the report does so.