Digitisation scoping study


Background  

A short scoping study was undertaken in July and August 2006 to clarify some elements of a bid under the JISC Digitisation programme (Phase 2). This study received £6,239 in funding from the JISC Digitisation programme. Several partners to the bid also made contributions in kind.

Aims

The study had three main aims, to:

  • inform the revision of the project proposal for Stage 2 submission, in particular the methodology and budget
  • provide information for a detailed Project Plan should the bid be successful
  • provide information of wider use, e.g. for similar or related projects

Activities

The scoping activities took five main forms:

  1. Surveys into the format and condition of the pamphlets and the extent of duplication between collections
  2. Visits to the seven contributing libraries to examine collections and discuss workflow issues
  3. Tests of scanning and OCRing 19th century pamphlets and analogous material
  4. Discussions among partners and with external parties (e.g. over workflow, technical specifications, transportation of original materials and data, and potential for linking with related projects)
  5. Desk research (e.g. into technical specifications)

Deliverables

The initial plan for this scoping study proposed the following deliverables:

  • Selection criteria
  • Digital capture standards
  • OCR benchmarks
  • Metadata specifications
  • De-duplication strategy
  • Workflow diagram
  • Gantt chart showing how collection consignments might be staged
  • Short narrative report covering any additional findings

Some of these deliverables have already been included in the revised project proposal. They are all included here for completeness and for those who do not have access to that document.

The University of Bristol and other contributors license all the deliverables of this study to the JISC non-exclusively and in perpetuity.

Structure of this report

The results of the scoping study are presented in three main sections:

  • Pamphlet collections: provides profiles of the collections along with strategies for addressing selection, IPR issues, de-duplication, and the scheduling of collections for scanning
  • Digital dataset: provides details of digital capture, OCR, metadata and quality assurance
  • Project workflow: describes the overall workflow proposed for the project, including a workflow diagram, work plan, and description of work packages

Conclusions and recommendations are also made.

Outline of 19th Century Pamphlets Online project

The proposal is for Phase 1 of a larger 19th Century Pamphlets Online project. That larger project has the vision of providing researchers, learners and teachers with online access to the most significant pamphlet collections held in UK research libraries. Phase 1 aims to digitise a proportion of the pamphlets that are held within CURL libraries and searchable via Copac. This first phase concentrates on collections with a strong political, economic and social focus, and is subtitled: Pamphlets as a Guide to the Parliamentary Debates of the 19th Century.

The project’s Primary Partners include: seven collection holders (University of Bristol, Durham University, University of Liverpool, LSE, University of Manchester, University of Newcastle, UCL); a digital production unit, BOPCRIS (based at the University of Southampton); a US not-for-profit archiving and delivery service, JSTOR; and a metadata and resource discovery provider, MIMAS. The project is led by the University of Southampton on behalf of CURL. In addition to the Primary Partners, the project has nineteen Associate Partners: other research libraries that would be willing to contribute collections to later phases of the project.

The project intends to:

  • Provide users with a wide selection of approximately 23,000 digitised 19th century pamphlets (amounting to approximately 1 million pages) that focus on political, social and economic issues
  • Implement a consortial scanning operation at BOPCRIS to create a dataset of digital images, metadata and OCR text
  • Provide a financially and technologically sustainable delivery and preservation operation through the partnership with JSTOR
  • Deliver the digitised pamphlets to users as a special JSTOR collection, which will be free to UK FE, HE and secondary schools for at least 25 years and will enable reuse and repurposing within local learning and research environments and national learning repositories such as JORUM
  • Provide a sophisticated distributed resource discovery service, enabling users to access the collection from: (a) the JISC/JSTOR collection; (b) an item-level search on Copac and on the OPACs of holding libraries; (c) a collection-based search from the online Guide to 19th Century Pamphlets; and (d) searches on other resource discovery tools, in particular Google and Google Scholar, and via the CrossRef service
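
The headline figures above (approximately 23,000 pamphlets and approximately 1 million pages) imply an average pamphlet length, and a fixed page total means that any rise in that average reduces the number of pamphlets that can be digitised. The following is a minimal sketch of that arithmetic only. The function names are illustrative, and the revised average of 50 pages is a hypothetical value chosen for the example, not a figure from the study.

```python
def average_pages(total_pages: int, pamphlets: int) -> float:
    """Average pages per pamphlet implied by the two headline totals."""
    return total_pages / pamphlets


def pamphlets_within_budget(page_budget: int, avg_pages: float) -> int:
    """Whole number of pamphlets that fit in a fixed page budget."""
    return int(page_budget // avg_pages)


# Figures quoted in the project outline (both approximate).
implied_average = average_pages(1_000_000, 23_000)
print(round(implied_average, 1))  # 43.5

# Hypothetical revised average, for illustration only.
print(pamphlets_within_budget(1_000_000, 50))  # 20000
```

With the page total held constant, the pamphlet count falls in inverse proportion to the average length.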

Summary of scoping study findings

This scoping study has confirmed that the proposed project is viable, can be managed within the allotted timescale (two years, from January 2007 to December 2008), and would be extensible. Findings from this study have informed the revision of the bid for Stage 2, particularly its methodology (e.g. standards and work packages) and its budget (which has risen). Much of the content of this report has found a place within the revised bid.

The deliverables of this scoping study have taken a very concrete form (criteria, standards, strategies, diagrams, etc). All the deliverables promised in the scoping study plan are included in this report, along with several others. Here is a full list, along with a brief summary of their findings or approach:

  1. Results of a survey into the format and condition of the pamphlets. The survey found that many pamphlets would pose challenges for scanning. It also found a great deal of variation among the collections. An important finding was that previously estimated page averages were too low, requiring a reduction in the estimated number of pamphlets to be digitised within this phase.
  2. Results of a survey into the extent of duplication across the collections. The survey found significant duplication across some of the collections, requiring some means of de-duplication, and resulting in a reduction in the number of pamphlets expected from some collections.
  3. Profiles of seven 19th century pamphlet collections. These identify key issues likely to affect the scheduling and scan-time for the collections.
  4. Selection strategy and criteria. As five whole collections have been pre-selected for inclusion in the project, de-selection criteria are outlined for these, along with selection criteria for the two remaining collections.
  5. Copyright strategy and workflow. The proposed strategy requires library partners to take primary responsibility for identifying and dealing with any copyright concerns, but the project will provide support.
  6. De-duplication strategy and workflow. This strategy proposes de-duplicating at the libraries, before any pamphlets are sent. It proposes a database to manage the de-duplication workflow.
  7. Gantt chart showing scheduling of collections. This chart shows one possible schedule for selecting and digitising the seven collections, based on their characteristics and the capacity of BOPCRIS.
  8. Image capture standards. A mix of bitonal, grey and colour capture is proposed in order to balance the need to provide easy access to intellectual content with that of representing the pamphlets as historical objects. The chosen formats are standard and conform to JISC IE and Minerva technical guidelines.
  9. OCR benchmarks. Based on tests, the project expects to achieve an average accuracy, per character, of 97-98%. Additional specialist software would be used to maintain high accuracy levels across more difficult material.
  10. Metadata standards. A suite of established and new XML-based metadata standards is proposed for the archival and delivery datasets. The chosen standards conform to IE and Minerva guidelines.
  11. Description of production workflow and quality assurance (QA). Presents an overview of the production workflow, indicating how QA fits into the workflow.
  12. Diagram showing overall project workflow. Presents a graphical representation of the flow of original pamphlets and digitised datasets, indicating how these relate to other workflows and work packages.
  13. Gantt chart showing work plan. Presents a graphical view of work packages and associated activities.
  14. Description of work packages. Presents a narrative explanation of work packages.
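
Item 9 quotes accuracy per character. This summary does not say how that figure was measured, so the following is only a sketch of one common approach, not the study's method: count the character-level edits (insertions, deletions, substitutions) needed to turn the OCR output into a hand-checked reference transcription, and divide by the reference length. The function names and the sample strings are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Per-character accuracy: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


# One misread character ("l" read as "I") in an eight-character word.
print(char_accuracy("pamphlet", "pamphIet"))  # 0.875
```

On this measure, an accuracy of 97-98% corresponds to roughly two to three character errors per hundred characters of reference text.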

Apart from these tangible deliverables, important outcomes of the scoping study were the good working relationships established between project partners (e.g. libraries and the project team, BOPCRIS and JSTOR). The partners worked very successfully together in completing the work required for this study in a short timeframe. Partners were actively involved in the definition and negotiation of the approaches chosen for the project. The combination of a determined methodology and established relationships would be a powerful enabler in any extension of this work or development of similar joint projects.

While this whole report could be regarded as a set of recommendations for a specific project, three more general recommendations are made, that:

  1. scoping studies such as this are undertaken for all large digitisation projects of this nature, especially where multiple partners are involved
  2. the 19th Century Pamphlets Online project, should it proceed, evaluate the findings and approaches chosen in this study in light of the practical reality of the project, and disseminate its findings
  3. this report has a wider application than this current project and should be made available to others undertaking similar work