Modernise data publishing and reuse (draft)

 

Finding public information for reuse

Large scale publishing of public information

Public information distributed across thousands of websites is expensive or time-consuming to gather for reuse. The cost can be so high that little or no reuse occurs. The Show Us a Better Way competition revealed this to be a problem when people seek information about complex public service choices. One of the winning entries, School Guru, demonstrates the scale of the challenge when choosing a school. Taskforce members with experience of building large mashups identified high search and acquisition costs as a major barrier to innovation in the reuse of data.

Information presented in one place is much easier to reuse. The District of Columbia in the USA provides a vivid example of aggregating data for reuse in its data catalogue. The District’s Chief Technology Officer (the DC CTO) has pulled together all of the District’s major data sets onto one web page and provided the data for free as a choice of feeds and downloads. This makes it very easy for people to use information in a way that suits them. Using modern techniques and storage, it is relatively easy and inexpensive for government to aggregate performance and other data as it is produced, and then to make it freely available for reuse in virtual or physical data repositories.

Professor Nigel Shadbolt of the University of Southampton referred the Taskforce to the use of data repositories in the academic sector to aggregate resources for research. The Open Knowledge Foundation held a useful workshop with the Taskforce on finding and reusing information. The workshop discussed the use of data catalogues, which point people to where information can be found, such as the Comprehensive Knowledge Archive Network (CKAN). The workshop demonstrated that finding public sector information is not straightforward and requires a detailed knowledge of how government works.

The challenge of ensuring that information is discoverable and remains available over time will be met by a combination of catalogues and physical data repositories. Examples of each already exist across the public sector in the information management strategies of individual organisations. Some initiatives aim to bring consistency, such as the Information Asset Register overseen by the Office of Public Sector Information (OPSI), part of the National Archives. Further information on information asset registers can be found in a paper produced for the ePSIplus network. In spite of these efforts, potential reusers, who may not have detailed knowledge of the structures of government, still face significant challenges in finding and understanding relevant and useful information sources.

The Taskforce recommends that the Government build on this existing work by establishing a public sector information repository and catalogue function based around the Office of Public Sector Information, part of the National Archives. OPSI has the expertise in modern information publishing and, as part of the National Archives, can take a long-term view of custodianship. We understand that officials in OPSI have already sketched out the architecture to deliver such a service at minimal expense.

The Taskforce is pleased that the Pre-Budget Report contains a commitment from Communities and Local Government (CLG) to move forward in publishing its performance data obtained for the Comprehensive Performance Assessment (CPA). If this performance data were published in a well-structured way, it should be possible to produce a map of public services to help inform people’s choices.

Recommendation

The Government should ensure that public information data sets are easy to find and use. The Government should create a place or places online where public information can be stored and maintained (a ‘repository’) or where its location and characteristics can be listed (an online catalogue). Prototypes should be running in 2009.



9 Responses to “Finding public information for reuse”

  1. Jeni Tennison says:

    ED: Please expand the acronyms ‘CLG’ and ‘CSA’ in the penultimate paragraph. The paragraph as a whole doesn’t seem to fit well with this section.

    MODERATOR NOTE – thanks, agree these should be described in full. Will be corrected in next edit.

    MODERATOR NOTE – this has also revealed a typo so double thanks – ‘CSA’ should have read ‘CPA’ for Comprehensive Performance Assessment. ‘CLG’ is Communities and Local Government, a central govt department.

  2. Jeni Tennison says:

    I think you need a range of strategies for aiding the discovery of public information. Focusing on centralised strategies, such as building large unified repositories or catalogs, risks neglecting strategies that are more likely to work in a highly distributed and complex ecosystem like the web.

    Learn from what works for the wider web. Google forms the single point of entry for a lot of users, but (a) it locates information automatically, without requiring sites to register or provide metadata directly, and (b) it links to pages rather than attempting to unify the data into a query-able repository (in fact Google’s efforts in the latter direction aren’t exactly successful).

    Meanwhile, third parties who are interested in particular types of information set up their own pages of links for themselves and others to use. And sites like del.icio.us support that process.

    The idea of getting all public sector information in one place may seem attractive, but I think it’s likely to be impractical, expensive and ultimately fail to deliver the discoverability that is the real goal of the exercise. The danger is that the catalog ends up being incomplete and inaccurate while giving users the impression that it is complete and up-to-date, and that this actually leads to a worse experience for potential reusers.

    I think there needs to be more focus on providing guidance for websites on how to make their information discoverable generally, on using crowd sourcing to populate “catalogs”, and on any actual repositories being small and focused.

  3. Tony Hirst says:

    What about an effective custom search engine across different info sources?

    What about map based search tools that display search results on a geographical basis (Google maps used to be called Google *Local* search…)?

  4. Tony Hirst says:

    “The Government should ensure that public information data sets are easy to find and use”

    Should Government look to partnerships with developers of applications such as data visualisation tools (for example IBM Many Eyes) or with other shared data sites such as Swivel, Dabble DB, etc.?

    Should data be mirrored/synched in ‘official channels’ on these third party sites?

  5. Tom Steinberg has suggested that something like FoI, but for data sets, be introduced:

    A person asks if a data set is available. If it is, and if it can be liberated reasonably cheaply (say <£10k), do so. There should be a central fund to pay for requests.

    Apparently, if all of OPSI’s requests for data sets cost £10k to liberate, it would still only cost £400k or so — and could potentially generate a great deal more value than that. The number of data sets that could potentially be asked for is so much smaller than the number of potential FoI requests that it’s possible to spend a lot more money on individual requests.

    I’m not sure if that is the best way to do it, but it does cost money to make datasets available, so it would be good for the report to indicate that some money should be made available to pay for it.

  6. Clare McGinn says:

    “The cost can be so high that little no reuse occurs” – there is a missing “or” in this sentence – should be “that little or no reuse occurs”

    MODERATOR NOTE – thanks for spotting this error which has now been corrected.

  7. Barry Tennison says:

    Small points:
    “DC CTO”? I can guess (possibly wrongly) Chief Technical Officer; and the use of even the abbreviation DC is unnecessary in the context – something like “Their Chief Technical Officer has…” would be fine.
    Something called the (misspelled) “Open Kowledge Foundation” has crept in.

  8. Barry Tennison says:

    I’m afraid that I agree with Jeni that the knee-jerk of “create a central database” is likely to be inadequate to the discovery challenge. In my experience, keeping such a database accurate and up-to-date is quite impractical. Instead one needs organic, automated ways of enriching the flow of metadata – good search engines and crowd sourcing are two such resources, but almost certainly this is an area where continued observation, experiment and learning will be needed, and new ways will (if encouraged) develop.

  9. Richard Quarrell says:

    “… a public sector information repository and catalogue function…”

    This is an important and essential concept – the issues around finding/describing this class of data are key to this whole initiative – the POI is lost if the information cannot be found. So let’s remember that you can’t build a useful repository without the content to put into it – so first you need lots of consistent IARs, and to get that you need the holders to embrace the willingness culture: they’ve got to be willing to create them and then willing to share them.