First, library catalogues tend to include large portions of manually entered information. Biases, inaccuracies and gaps in data collection or quality may severely hinder productive research use of library catalogues. Differing notations and languages pose further challenges for data integration. Hence, raw bibliographic records have to be systematically harmonized and validated. Ideally, such efforts are fully transparent both in terms of the original raw data being processed and in terms of the source code of the harmonization and analysis algorithms. Whereas most research algorithms are nowadays open source, many of the most comprehensive library catalogues are not yet generally available as open data, and may be difficult to obtain even for research purposes. This lack of open data forms a major bottleneck for the transparent and collaborative development of bibliographic data science, and for the innovative integration and reuse of the available data and software resources. Finally, whereas large portions of data analysis can be automated, efficient and reliable research use requires collaboration between traditionally distinct disciplines, such as history, informatics, and data science, and finding the right combination and balance of expertise may prove challenging in practice.
To meet these challenges and to facilitate research use of library catalogues, we have implemented an open data analytical ecosystem for systematic and scalable quantitative analysis of library catalogues. Moreover, we demonstrate how the nationalistic emphasis of the individual catalogues can be overcome by integrative analysis. Our work is based on four library catalogues that we have acquired for research use. These include the Finnish National Bibliography Fennica (FNB), the Swedish National Bibliography Kungliga (SNB), the English Short-Title Catalog (ESTC), and the Heritage of the Printed Book Database (HPBD). The HPBD is a compilation of 45 smaller, mostly national, bibliographies (https://www.cerl.org/resources/hpb/content). Altogether, these bibliographies cover over 5 million entries on print products printed in Europe and elsewhere between c. 1470 and 1950. The original MARC files of these catalogues include ... entries (Table XXX).
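As a concrete illustration of the integrative setup, the following minimal R sketch stacks harmonized dumps of the four catalogues into a single table with a source identifier, enabling comparisons across the national collections. The file names and column names (e.g. publication_year) are hypothetical and only serve to illustrate the idea; they are not part of the actual pipeline.

```r
# Illustrative sketch of combining harmonized catalogue dumps for
# integrative analysis (file and column names are hypothetical).
library(dplyr)
library(readr)

catalogues <- c(fnb  = "fnb_harmonized.csv",
                snb  = "snb_harmonized.csv",
                estc = "estc_harmonized.csv",
                hpbd = "hpbd_harmonized.csv")

combined <- bind_rows(lapply(names(catalogues), function(id) {
  read_csv(catalogues[[id]], show_col_types = FALSE) %>%
    mutate(catalogue = id)   # keep track of the source catalogue
}))

# Example of an integrative summary across catalogues: titles per decade.
combined %>%
  mutate(decade = 10 * floor(publication_year / 10)) %>%
  count(catalogue, decade)
```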
The data harmonization that we have implemented follows the same principles, and largely identical algorithms, across all catalogues. As the catalogue sizes in this study reach 6 million [CHECK] entries in the HPBD, automation and scalability are critical. In this work, we focus on a few selected fields, namely publication year and place, language, and physical dimensions. In summary, the data harmonization includes the removal of spelling errors, term disambiguation, standardization and quality control, and custom algorithms such as page count estimation from the raw MARC notation [REFS], implemented in the associated bibliographica R package.
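The page count estimation can be illustrated with a deliberately simplified sketch. This is not the bibliographica implementation; it merely shows the idea of extracting the arabic numerals from a MARC 300$a extent statement and summing them, whereas the full algorithm also has to handle roman numerals, plate statements, volume counts and other notational variants.

```r
# Simplified illustration of page count estimation from a MARC 300$a
# extent string; not the bibliographica algorithm.
estimate_pages <- function(extent) {
  # e.g. "[2], 58, [1] p." -> 2 + 58 + 1 = 61
  tokens <- regmatches(extent, gregexpr("[0-9]+", extent))[[1]]
  if (length(tokens) == 0) return(NA_integer_)
  sum(as.integer(tokens))
}

estimate_pages("[2], 58, [1] p.")   # 61
estimate_pages("viii, 240 p.")      # 240 (roman numeral sections ignored)
```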
We have augmented missing values and added derivative fields, such as print area, which quantifies the overall number of sheets used for the documents printed in a given period, and thus the overall breadth of printing activity. The print area complements the mere title count, and differs from paper consumption, which would additionally require print run estimates. An overview of the harmonized data set contents and full algorithmic details of the harmonization process are available via the Helsinki Computational History Group website (https://comhis.github.io/2019_CCQ).
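To make the print area notion concrete, the following sketch converts page counts into approximate sheet counts using the standard relation between gatherings format and pages per sheet (a sheet folded to folio, quarto, octavo or duodecimo carries 4, 8, 16 or 24 pages, respectively). This is a simplified illustration under those assumptions, not the project's exact algorithm.

```r
# Simplified print area sketch: approximate sheets per document from the
# harmonized page count and gatherings format (illustrative only).
sheets <- function(pages, gatherings) {
  pages_per_sheet <- c("2to" = 4, "4to" = 8, "8vo" = 16, "12mo" = 24)
  pages / pages_per_sheet[gatherings]
}

docs <- data.frame(
  pages      = c(480, 96, 32),
  gatherings = c("2to", "8vo", "4to")
)

# Print area of this toy collection: total sheets across the documents.
sum(sheets(docs$pages, docs$gatherings))   # 120 + 6 + 4 = 130
```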
Reliable research use relies on sufficiently accurate, unbiased and comprehensive information, which is ideally verifiable against external sources. We have used external data sources, for instance on geographical place names, to further complement and verify the information that is available in the original library catalogues. We monitor data processing quality with automated unit tests, manual curation, and cross-linking with external databases, incorporating best practices and tools from data science. Because the processing is automated, any shortcomings that are detected can be fixed, and the complete data collection can subsequently be updated.
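As an illustration of such automated checks, the following sketch uses the testthat package to assert that harmonized publication years fall within the covered period, that page counts are positive, and that place names resolve against an external reference list. The field names and the toy gazetteer are hypothetical; this is not the project's actual test suite.

```r
# Illustrative quality checks on harmonized fields (hypothetical field
# names and reference data; not the project's actual test suite).
library(testthat)

harmonized <- data.frame(
  publication_year  = c(1642, 1750, 1810),
  publication_place = c("London", "Stockholm", "Turku"),
  pages             = c(96, 480, 32)
)

# Hypothetical external reference list of accepted place names (gazetteer).
known_places <- c("London", "Stockholm", "Turku", "Helsinki", "Edinburgh")

test_that("harmonized fields pass basic sanity checks", {
  expect_true(all(harmonized$publication_year >= 1470 &
                  harmonized$publication_year <= 1950, na.rm = TRUE))
  expect_true(all(harmonized$pages > 0, na.rm = TRUE))
  expect_true(all(harmonized$publication_place %in% known_places))
})
```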