Results of the collaboration of BHL-Europe and IMPACT

Optical character recognition (OCR) is an essential step towards taxonomically intelligent search within the BHL-Europe Portal. For this search to work, page images from the scanner are first converted to full text with OCR tools. TaxonFinder, developed in the uBio project, is then applied to the full text to identify the scientific names of animals and plants. However, OCR errors are a major problem for these taxonomic intelligence technologies, because the data used for enrichment are derived from the OCR text of the digitised page images. Improving OCR will therefore improve name finding and, in turn, the search for taxonomic information in the digitised biodiversity heritage literature.
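
The effect of OCR errors on name finding can be illustrated with a deliberately simple sketch. It is not TaxonFinder and does not reflect its internals; the name list, the sample sentences and the `find_names` helper are all hypothetical and exist only to show how a single misread character can hide a scientific name from an exact-match lookup.

```python
# Toy illustration only: a tiny hand-made name list and exact matching.
# This is NOT how TaxonFinder works internally; it only shows why
# character-level OCR errors can hide a scientific name from a search index.

KNOWN_NAMES = {"Eosentomon", "Passeres", "Parus major"}  # hypothetical name list


def find_names(text: str, known_names: set[str]) -> set[str]:
    """Return every known name that occurs verbatim in the text."""
    return {name for name in known_names if name in text}


clean_text = "The genus Eosentomon was described in detail."
ocr_text = "The genus Eosentoinon was described in detail."  # 'm' misread as 'in'

print(find_names(clean_text, KNOWN_NAMES))  # the name is found
print(find_names(ocr_text, KNOWN_NAMES))    # the name is missed
```

A real name finder is more tolerant than this exact lookup, but the underlying dependency is the same: the cleaner the OCR text, the more names can be recognised.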

Improving OCR technologies is a real challenge for BHL-Europe, so we collaborated with the EU-funded IMPACT project (Improving Access to Text). The results of this collaboration are available as a separate report and in a blog post. We also think that the data produced during the process are valuable enough to be published.

Below is a list of the documents with the corresponding source data (TIF master files) and ground truth data (XML full-text versions). A description of the XML format can be found here. The full-text documents have an accuracy of around 99.95%.
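
To give a sense of what an accuracy figure like this means, the following sketch computes a character-level accuracy from the edit distance between a reference text and a candidate text. It is an assumption that the 99.95% figure is measured per character; the page does not say how it was computed, and the function names and sample strings below are illustrative only. Under that assumption, 99.95% corresponds to roughly 5 wrong characters per 10,000.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via two-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, candidate: str) -> float:
    """1 - (edit distance / reference length); assumes a non-empty reference."""
    return 1.0 - edit_distance(reference, candidate) / len(reference)


reference = "Piscium querelae et vindiciae"  # treated as the correct text
candidate = "Piscium querclae et vindiciae"  # one substituted character

print(edit_distance(reference, candidate))
print(round(character_accuracy(reference, candidate), 4))
```

The same measure can be applied in the other direction, comparing raw OCR output against the ground truth files, to quantify how far an OCR engine is from the reference transcription.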

In total, 2,418 page image files were processed, amounting to 13.4 GB. All files are taken from the BHL corpus, and links to the corresponding items in the BHL portal are provided.

| Description of the document set (full title) | BHL title/item URL (online presentation of the document) | Master files (TIF) | Ground truth files (XML) |
|---|---|---|---|
| Das Chitinskelett von Eosentomon, ein Beitrag zur Morphologie des Insektenkörpers, von Heinrich Prell. Mit 6 Tafeln. | BHL item URL | TIF zip file | XML zip file |
| Piscium querelae et vindiciae / expositæ à Johanne Jacobo Scheuchzero | BHL item URL | TIF zip file | XML zip file |
| Trudy Russkago entomologicheskago obshchestva. Horae Societatis entomologicae rossicae, variis sermonibus in Russia usitatis editae; t. 16 1881 | BHL URL Biblio, BHL URL Item | TIF zip file | XML zip file |
| Birds of Great Britain and Ireland, by Arthur G. Butler; illustrated by H. Grönvold and F.W. Frohawk. Order Passeres; volume 2 | BHL URL Biblio, BHL URL Item | TIF zip file | XML zip file |
| Conchologia iconica, or, Illustrations of the shells of molluscous animals / by Lovell Augustus Reeve; volume 5 | BHL URL Biblio, BHL URL Item | TIF zip file | XML zip file |
| Histoire naturelle des poissons, par M. le B.on Cuvier ... et par M. Valenciennes; t. 10 | BHL URL Biblio, BHL URL Item | TIF zip file | XML zip file |
| Random set of 100 single pages taken from different books to illustrate specific OCR challenges posed by the BHL corpus | XLS file containing descriptive information and URLs to the pages: see attachment below | TIF zip file | XML zip file |

 

Creative Commons License
BHL-Europe - IMPACT collaboration results by BHL-Europe is licensed under a Creative Commons Attribution 3.0 Unported License.

Attachment: bhl_random_set.xlsx (19.15 KB)