We have investigated four different types of bibliographic metadata collections (FNB, SNB, ESTC, and HPBD), providing a fully transparent, detailed, and replicable account of data harmonization and analysis so that the details of the data processing can be independently investigated and verified. Our current harmonization strategies are based on manually implemented rules for data processing. Future developments could take increasing advantage of adaptive machine learning techniques that learn such rules and their exceptions from training examples, thereby reducing the need for human input and improving the overall scalability of automated data harmonization. Furthermore, we have used external sources of metadata, for instance on authors, publishers, and places, to enrich and verify the information. When combined with proper quality control, such data analytical ecosystems have potential for wider implementation in related studies and other bibliographies, as many of the data analytical problems we encountered are common in digital humanities. Moreover, ecology and related fields, which have well-established methods for spatio-temporal data analysis, provide a variety of statistical techniques for the analysis of such data collections. Open availability of the raw data as well as the analysis methods is central for efficient, collaborative, and cumulative research use of bibliographic collections in modern society. Our work has greatly benefited from open source methods, and it contributes to the growing body of algorithms and harmonized data collections in this research area.
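As a rough illustration of what a manually implemented harmonization rule can look like, the following Python sketch maps variant publication-place spellings to a single harmonized name and extracts a publication year from a free-text date field. It is not the code used in this work: the synonym table, the function names `harmonize_place` and `harmonize_year`, and the accepted year range are all assumptions chosen for illustration only.

```python
import re
from typing import Optional

# Hypothetical, hand-written synonym table: variant spelling -> harmonized name.
# The entries are illustrative and are not taken from the actual pipeline.
PLACE_SYNONYMS = {
    "londini": "London",
    "london": "London",
    "holmiae": "Stockholm",
    "stockholm": "Stockholm",
    "aboae": "Turku",
    "åbo": "Turku",
    "turku": "Turku",
}


def harmonize_place(raw: str) -> Optional[str]:
    """Map a raw publication-place string to a harmonized name, or None if no rule applies."""
    # Strip brackets and punctuation that often surround uncertain or inferred
    # values, then trim whitespace and lowercase before the table lookup.
    key = re.sub(r"[\[\]\(\)\?\.,:;]", "", raw).strip().lower()
    return PLACE_SYNONYMS.get(key)


def harmonize_year(raw: str) -> Optional[int]:
    """Return the first standalone four-digit year between 1400 and 1999, or None."""
    # The lookbehind and lookahead reject four-digit runs inside longer numbers.
    match = re.search(r"(?<!\d)(1[4-9]\d{2})(?!\d)", raw)
    return int(match.group(1)) if match else None
```

Entries that match no rule return `None`, so they can be flagged for manual review. A learned model could in principle replace or extend a hand-written table of this kind.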
Whereas our current work is based on the analysis of national bibliographies, it helps to challenge the nationalistic view of individual catalogues and paves the way towards large-scale data integration. A number of key challenges in enhancing data quality remain to be overcome. We have nevertheless demonstrated that significant historical trends, such as the rate of change in language use or book sizes, are often overwhelmingly clear and visible across multiple independently collected catalogues. Integrative analysis can thus help to verify the information and provide complementary views on the universally observed historical trends. Our systematic approach provides a starting point, guidelines, and a set of practically tested algorithms for more extensive analysis and integration.
Conclusion
We have conceptualized a new approach and technologies to expand the research potential of bibliographic cataloguing and classification, which we call bibliographic data science. Whereas national bibliographies can provide comprehensive quantitative insights into the overall historical dynamics of the evolving publishing landscape across time and geography, we have encountered specific and largely overlooked challenges in using bibliographic catalogues for historical research. Drawing valid conclusions depends critically on efficient and reliable harmonization and augmentation of the raw entries, and biases, gaps, and inaccuracies in data collection may considerably hinder productive research use of the bibliographies. Here, we have overcome some of these challenges with specifically tailored open data analytical ecosystems that facilitate robust statistical research use of bibliographic metadata collections. This approach has potential for wider implementation in related studies and other bibliographies, and it provides guidelines for more extensive integration of national catalogues. It thus helps to replace a purely national view of the past with a more precise view of print culture that extends beyond the confines of national bibliographies.
Supplementary Material
All analysis source code for data cleaning and harmonization, and the reproducible Rmarkdown documents for generating the figures and tables in this document, are available through the Helsinki Computational History Group (COMHIS) website at https://comhis.github.io. The specific versions used in this work have been included as supplementary material. We have also included the harmonized version of Fennica, the Finnish national bibliography, whose original MARC data entries are openly available from the National Library of Finland. The harmonized version has been prepared for this manuscript, is openly available, and can be freely used. We are committed to maintaining and further improving the data harmonization, and future versions of this data release will also be available via the indicated research website.
Acknowledgements
This work was supported by the Academy of Finland under Grant 293316. We are grateful to the National Library of Finland, the National Library of Sweden, the British Library, and CERL for providing the bibliographies for use in this research, and to the members of the Helsinki Computational History Group for supporting this work.
Cover letter
Dear Editor,
We kindly ask you to consider the attached manuscript for publication in the special issue on "The Role and Function of National Bibliographies for Research in Different Academic Disciplines" in Cataloging & Classification Quarterly. The work is original; it has not been published elsewhere or submitted simultaneously for publication elsewhere.
We present an analysis of the overall publishing landscape in the period 1500-1800 based on comprehensive harmonization and joint analysis of four large bibliographic catalogs. This has allowed us to assess publishing activity beyond what is accessible through the use of national catalogs alone. In addition to the historical analysis of knowledge production trends, we are releasing the openly licensed source code for catalog harmonization and a notably improved version of the Finnish national bibliography, Fennica. This code and data release demonstrates the potential of our approach for research use of library catalogs, and the essential role that data harmonization and integration play in this process.
The work directly addresses multiple aspects that are relevant to the CCQ journal in general, and to the special issue on national bibliographies in particular. It demonstrates how comprehensive data harmonization is essential for accurate and useful data retrieval and for the overall usability of the catalogue information, and how the available classification and subject analyses, geographical information, and other data can be utilized, augmented, enriched, and validated using auxiliary information sources, such as digital maps. Integration of national bibliographies, special collections, and archives is relevant for international aspects of digital cataloging. As such, the work highlights specific bottlenecks and shortcomings in the available cataloging and classification information, and can therefore provide relevant information for education, training, and management of cataloguing. Finally, we demonstrate how bibliographic catalog records can be used as a digital research resource, rather than a mere information retrieval tool.