A statistical approach to bibliographic metadata is an emerging research area. In addition to providing novel approaches that can support qualitative research, bibliographic data science can add significant value to Linked Open Data and other infrastructures by providing new techniques to monitor and improve data quality. Whereas bibliographical metadata collections are traditionally used for information storage and retrieval, we have demonstrated that systematic large-scale harmonization of the raw entries both within and across metadata collections can fill an important gap in their research use, and help to realize their research potential in publishing history.
As similar datasets, national bibliographies are not only about mapping the national traditions of publishing; they can also be studied comparatively and ultimately integrated across borders, so that they help overcome a national perspective in analyzing the past. In this article, we have expanded our previous pilot studies on the Finnish and Swedish bibliographies and the ESTC towards large-scale integration of national bibliographies in the CERL Heritage of the Printed Book Database. Our harmonization and integration efforts are not complete, but they clearly demonstrate how such integration can help scholarship to reach a more precise view of print culture beyond the confines of national bibliographies.
Obtaining valid conclusions depends on reliable harmonization and augmentation of the raw entries. However, the power of a large-scale approach is that broad patterns in knowledge production are often overwhelmingly clear, despite occasional inaccuracies and collection biases in individual data sets. Even the HPBD, with its uneven coverage, can already be used to assess some general trends in publishing history, although it does not match the other bibliographic metadata collections used here in reliability and level of detail. This is exemplified by our key observations on vernacularization and the rise of the octavo, which are supported by similar trends across multiple independently maintained bibliographic metadata collections. For a more detailed comparison across European cities, further harmonization and augmentation of the collections are needed.
Our work is part of the emerging trend towards the utilization of large digital data resources in publishing history. For instance, the Culturomics project [DOI 10.1126/science.1199644] analyzed broad historical trends in English language and culture in the period 1800-2000 based on a corpus collected from the full text content of over five million digitized books [FOOTNOTE: on difficulties in interpreting the data, see also article commentary 10.1126/science.332.6025.35-b]. Many of the problems relating to scalable data processing and interpretation were similar to the ones we have encountered in the context of bibliographic metadata collections. A key defining feature of our work is that, whereas the analysis of full texts has drawn considerable attention in digital humanities, we focus on metadata collections. Despite the challenges in data harmonization, metadata is often considerably more structured and standardized, and orders of magnitude smaller than full texts, which greatly facilitates automated analysis. Moreover, the metadata could provide valuable context for interpreting full text collections, as bibliographical metadata is often available for a larger number of documents than digitized full texts are, and in a more standardized format.
We have investigated four different types of bibliographic metadata collections (FNB, SNB, ESTC, and HPBD), providing a fully transparent, detailed, and replicable account of data harmonization and analysis so that details of the data processing can be independently investigated and verified. Our current harmonization strategies are based on manually implemented rules for data processing. Future developments could take increasing advantage of adaptive machine learning techniques that can learn such rules and exceptions from training examples, hence reducing the need for human input and improving the overall scalability of automated data harmonization. Furthermore, we have used external sources of metadata, for instance on authors, publishers, and places, to enrich and verify the information. When combined with proper quality control, such data analytical ecosystems have potential for wider implementation in related studies in the digital humanities. Moreover, ecology and related fields, which have well-established methods for spatio-temporal data analysis, provide a variety of statistical techniques for the analysis of such data collections. Open availability of the raw data as well as the analysis methods is central for efficient, collaborative, and cumulative research use of bibliographic collections in modern society. Our work has greatly benefited from open source methods, and we are contributing to the growing body of algorithms and harmonized data collections in this research area.
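To make the idea of manually implemented harmonization rules concrete, the following is a minimal sketch, written in Python purely for illustration. It is not the released harmonization code: the lookup table `GATHERINGS`, the functions `harmonize_gatherings` and `harmonize_year`, and the example strings are all hypothetical, and real catalogue fields contain far more variants and exceptions than are covered here.

```python
import re

# Hypothetical lookup table: cleaned raw notation -> harmonized format label.
GATHERINGS = {
    "2": "folio", "fol": "folio", "folio": "folio",
    "4": "quarto", "4to": "quarto", "quarto": "quarto",
    "8": "octavo", "8vo": "octavo", "octavo": "octavo",
    "12": "duodecimo", "12mo": "duodecimo", "duodecimo": "duodecimo",
}


def harmonize_gatherings(raw):
    """Map a raw format string such as '8°' or '8vo.' to a standard label.

    Returns None when no rule applies, so unmatched entries can be
    collected and reviewed manually.
    """
    # Keep only ASCII digits and lowercase letters before the lookup.
    key = re.sub(r"[^0-9a-z]", "", raw.lower())
    return GATHERINGS.get(key)


def harmonize_year(raw):
    """Return the first four-digit year (1400-2099) in a raw field, or None."""
    match = re.search(r"(?<!\d)(1[4-9]\d\d|20\d\d)(?!\d)", raw)
    return int(match.group(1)) if match else None


print(harmonize_gatherings("8vo."))   # octavo
print(harmonize_year("[1757?]"))      # 1757
```

Returning `None` for unmatched entries, rather than guessing, keeps the rules transparent: the residue of unhandled cases shows where new rules, or learned ones, would be needed.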
Whereas our current work is based on the analysis of national bibliographies, it is helping to challenge the national confinement of individual metadata collections, and paves the way towards large-scale data integration. We have demonstrated that significant historical trends, such as the rate of change in language use or book sizes, are often overwhelmingly clear and seen across multiple independently collected repositories. Integrative analysis can thus help to verify the information and provide more detailed perspectives on these historical trends in Western Europe. Integration of collections demands further work on reliably detecting duplicates, different editions, and translations across catalogues. Our systematic approach provides a starting point, guidelines, and a set of practically tested algorithms for more extensive analysis and integration.
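As an illustration of what duplicate detection across catalogues involves, the sketch below flags two records as candidate duplicates when their publication years agree and their normalized titles are nearly identical. This is an assumed, simplified heuristic rather than the method used in this work: the record layout (dictionaries with `title` and `year` keys), the function names, and the 0.9 similarity threshold are all hypothetical choices.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase a title, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]", " ", no_accents.lower())
    return " ".join(cleaned.split())


def likely_duplicates(a, b, threshold=0.9):
    """Return True if two records share a year and have very similar titles.

    `a` and `b` are hypothetical records with 'title' and 'year' keys;
    `threshold` is an arbitrary cut-off on the similarity ratio.
    """
    if a["year"] != b["year"]:
        return False
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio >= threshold


record_a = {"title": "Paradise Lost. A Poem, in Twelve Books", "year": 1674}
record_b = {"title": "Paradise lost: a poem in twelve books.", "year": 1674}
print(likely_duplicates(record_a, record_b))  # True
```

A heuristic of this kind only produces candidates. Distinguishing true duplicates from different editions or translations would require further fields, such as publisher, place, language, and physical extent, and manual validation of a sample.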

Conclusion

We have conceptualized a new approach and technologies to expand the research potential of bibliographic cataloguing and classification, calling this approach bibliographic data science. Whereas national bibliographies can provide comprehensive quantitative insights into the overall historical dynamics of the evolving publishing landscape across time and geography, we have encountered specific and largely overlooked challenges in using bibliographic metadata collections for historical research. Drawing valid conclusions critically depends on efficient and reliable harmonization and augmentation of the raw entries, and biases, gaps, and inaccuracies in data collection may considerably hinder productive research use of the bibliographies. Here, we have overcome some of these challenges with specifically tailored open data analytical ecosystems that facilitate robust statistical research use of bibliographic metadata collections. This approach has potential for wider implementation in related studies and other bibliographies, and provides guidelines for more extensive integration of national metadata collections, thus helping to get at transnational historical processes and to move towards a more precise view of print culture beyond the confines of national bibliographies.

Supplementary Material

All analysis source code for data cleaning and harmonization and the reproducible Rmarkdown documents for generating the figures and tables in this document are available through the Helsinki Computational History Group (COMHIS) website at https://comhis.github.io. The specific versions used in this work have been included as supplementary material. We have also included the harmonized version of Fennica, the Finnish national bibliography, whose original MARC data entries are openly available from the National Library of Finland. The harmonized version has been prepared for this manuscript, is openly available, and can be freely used. We are committed to maintaining and further improving the data harmonization, and future versions of this data release will also be available via the indicated research website.

Acknowledgements

This work was supported by the Academy of Finland under Grant 293316. We are grateful to the National Library of Finland, the National Library of Sweden, the British Library, and CERL for providing the bibliographies for use in this research, and to the members of the Helsinki Computational History Group for supporting this work.
Cover letter
Dear Editor,
We kindly ask you to consider the attached manuscript for publication in the special issue on "The Role and Function of National Bibliographies for Research in Different Academic Disciplines" in Cataloging & Classification Quarterly. The work is original; it has not been published elsewhere or submitted simultaneously for publication elsewhere.
We present an analysis of the overall publishing landscape in the period 1500-1800 based on comprehensive harmonization and joint analysis of four large bibliographic catalogs. This has allowed us to assess publishing activity beyond what is accessible by the use of national catalogs alone. In addition to the historical analysis of knowledge production trends, we are releasing the openly licensed source code for catalog harmonization, and a notably improved version of the Finnish national bibliography, Fennica. This code and data release demonstrates the potential of our approach for research use of library catalogs, and the essential role that data harmonization and integration play in this process.
The work directly addresses multiple aspects that are relevant to the CCQ journal in general, and the special issue on national bibliographies in particular. This work demonstrates how comprehensive data harmonization is essential for accurate and useful data retrieval tasks and relevant for the overall usability of the metadata information, and how the available classification and subject analyses, geographical information, and other data can be utilized, augmented, enriched, and validated based on auxiliary information sources, such as digital maps. Integration of national bibliographies, special collections, and archives is relevant for international aspects of digital cataloging. As such, the work highlights specific bottlenecks and shortcomings in the available cataloging and classification information, and can therefore provide relevant information for education, training, and management of cataloguing. Finally, we demonstrate how bibliographic catalog records can be used as a digital research resource, rather than a mere information retrieval tool.