Thus, even though no cumulative, integrated catalogue of national bibliographic data will ever be perfect and free from errors, it can be sufficiently representative of important trends in knowledge production. This hypothesis carries huge research potential, but it has yet to be systematically explored and tested.
We have started to develop novel ways of creating a data analytical ecosystem designed to harmonize and integrate different sources of metadata maintained by research libraries, so that they can be used in quantitative research. Whereas contemporary cataloguing efforts and data management technologies, including Linked Open Data (LOD), have focused on data storage, management, and distribution, our efforts have a different, complementary target: ensuring overall data quality and commensurability between independently maintained library catalogues through systematic large-scale harmonization and quality control. It is widely observed that metadata collections contain large numbers of inaccurate entries, data collection biases, and missing information. While this has posed severe challenges for statistical analysis and research use of these collections, many of these problems can potentially be overcome by systematic data harmonization. We aim to fill this critical gap, which has repeatedly proven to be a central bottleneck in large-scale statistical analysis of library catalogues. At the same time, as we demonstrate in the case of the Finnish National Bibliography (FNB), the harmonized data can then be combined into LOD releases, opening new doors for research on national bibliographies. To emphasize the need for such complementary approaches and their vast research potential, we propose a new research paradigm of bibliographic data science, which is specifically targeted at enabling the use of library catalogues as research objects in their own right.
The question of data quality has rarely been raised with respect to national bibliographies because, even though they are cherished within information science, bibliographic information has nevertheless remained an undervalued research resource. Our study is motivated by our analysis of the Swedish National Bibliography and the Finnish National Bibliography, focusing on publication patterns in Sweden and Finland, where we have encountered specific and largely overlooked challenges in using bibliographic catalogues for historical research. Using national bibliographies as a research resource, rather than as a mere information retrieval tool, has proven challenging, as obtaining valid conclusions depends critically not only on an overall understanding of the historical context but also on technical issues of data quality and completeness. Biases, inaccuracies, and gaps in data collection or quality may severely hinder productive research use of library catalogues. Scalable solutions to these challenges, and subsequent research cases, in particular regarding large-scale statistical analysis of these data collections, have been missing.
Here, we demonstrate how such challenges can be overcome by specifically tailored and openly collaborative data analytical ecosystems that provide scalable tools for data processing and analysis, from efficient and reliable harmonization and augmentation of the raw entries to integration and statistical analysis of national bibliographies. We show how external sources of metadata, for instance on authors, publishers, or geographical places, can be used to enrich and verify bibliographic information. Such a systematic approach has potential for wider implementation in related studies and other bibliographies. In particular, we present an analysis of the overall publishing landscape in the period c. 1450-1800, based on comprehensive harmonization and joint analysis of four large bibliographies, which has allowed us to assess publishing activity beyond what is accessible through national catalogues alone. We have prepared the first harmonized versions of the Finnish and Swedish National Bibliographies (FNB and SNB, respectively), the English Short Title Catalogue (ESTC), and the Heritage of the Printed Book Database (HPBD). This work clearly indicates that whereas national bibliographies are essentially about mapping the national canon of publishing, integrating data across borders should be managed in a way that takes specific local circumstances into account while also helping to overcome the national view in analyzing the past. Such integration can help scholarship reach a more precise view of print culture beyond the confines of national bibliographies.

Bibliographic data science

Quantitative, data-intensive research has not been the original or intended goal of analytical bibliography. Instead, a primary motivation for cataloguing has been to preserve as much information about the original documents as possible, including potential errors and shortcomings [for a good discussion of W. W. Greg and Fredson Bowers, who largely shaped the field, see http://ihl.enssib.fr/analytical-bibliography-an-alternative-prospectus/definitions-of-bibliography-and-in-particular-of-the-variety-called-analytical]. Thus, if for instance a place name is wrongly spelled, for cataloguing purposes it is relevant to preserve that misspelling as well. Recently, however, the potential of library catalogues as a valuable data resource for large-scale statistical analysis in publishing history has started to draw attention [REFS]. Whereas library catalogues and national bibliographies have traditionally been used as search tools for document identification and information retrieval, recent studies have indicated the vast research potential of these data resources when the information is appropriately controlled and verified. We build on and expand the existing research traditions in this emerging area. Where possible, we use established definitions of bibliographic concepts. For most analytical bibliographical definitions, we rely on Philip Gaskell, A New Introduction to Bibliography (New Castle, Del.: Oak Knoll, 1977, rev. ed. 1995).
Our work prepares the ground for large-scale quantitative research use of bibliographic information. Here, reliable data processing and quality control are essential, and can build on the latest advances in data science. We propose the term bibliographic data science to describe this emerging field in which bibliographic catalogues are viewed as quantitative research material. Relevant techniques range from data storage and retrieval to harmonization, enrichment, and statistical analysis. The harmonized data sets can be further converted into Linked Open Data [REFS] and other popular formats in order to utilize the vast pool of existing software tools. As the value of LOD and other infrastructures critically depends on data quality, our efforts are complementary, and intended to increase the accuracy and overall value of the data. In addition to improving the overall data quality in linked data infrastructures, the harmonized data collections can be analysed in statistical programming environments such as R [REFS], Python [REFS], or Julia [REFS] in order to gain access to the latest advances in modern data analysis. The key difference between LOD and other infrastructure services on the one hand, and statistical programming environments on the other, is that the former focus on efficient data storage and retrieval, whereas the latter focus on data analysis and statistical inference. Hence, database infrastructures and statistical environments can serve different, complementary purposes.
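As an illustration of the latter, the following minimal R sketch summarizes title counts per decade and publication place from a harmonized catalogue. The in-memory example table and its column names are hypothetical placeholders for the actual harmonized FNB fields, not the published data structure.

```r
# A minimal sketch of downstream statistical analysis in R. The example
# table and its column names are hypothetical; real data would be read
# from the published harmonized data sets.
fnb <- data.frame(
  publication_year  = c(1642, 1688, 1695, 1710, 1712),
  publication_place = c("Turku", "Turku", "Stockholm", "Turku", "Stockholm"),
  language          = c("Latin", "Finnish", "Swedish", "Latin", "Swedish")
)

# Title counts per decade and place: the kind of summary statistic that
# underlies quantitative analyses of publishing activity.
fnb$decade <- 10 * (fnb$publication_year %/% 10)
aggregate(
  list(titles = fnb$publication_year),
  by  = list(decade = fnb$decade, place = fnb$publication_place),
  FUN = length
)
```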

Challenges in the research use of digital bibliographies

Library catalogues have great potential for providing new quantitative evidence of historical trends, and for contributing to broader qualitative analysis and understanding. Unfortunately, the available library catalogues are seldom readily amenable to quantitative analysis. Key challenges include shortcomings in data quality and availability, as well as the multi-disciplinary expertise that research use requires.
First, library catalogues tend to include large amounts of manually entered information, prone to mistakes and omissions. Varying notations and languages can pose serious challenges for automated harmonization efforts, and biases in data collection may further hinder productive research use. Hence, raw bibliographic records have to be systematically quality controlled and harmonized. Second, the lack of open data availability is slowing down the development of bibliographic data science, as some of the most comprehensive library catalogues are not generally available even for research purposes. This forms a major bottleneck, as open data availability would greatly advance critical, collaborative, and cumulative efforts to design and utilize targeted data analysis algorithms, identify and fix shortcomings in the data, and innovatively integrate and reuse the available resources. Successful examples exist in other data-intensive fields, such as computational biology, where open availability of commonly generated data resources and algorithms is an established norm. Third, whereas large portions of data analysis could potentially be automated, efficient and reliable research use requires expertise from multiple, traditionally distinct academic and technical disciplines, such as history, informatics, data science, and statistics. Multi-disciplinary consortia that have the critical combination of expertise are, however, more easily established in research plans than in practice.
To meet these challenges, and to facilitate generic research use of library catalogues, we propose new approaches for the systematic and scalable analysis of library catalogues, and demonstrate how integrative analysis of independent data collections could help to overcome the nationalistic emphasis of the individual catalogues. This work is based on four library catalogues that we have acquired for research use: the Finnish National Bibliography Fennica (FNB), the Swedish National Bibliography Kungliga (SNB), the English Short Title Catalogue (ESTC), and the Heritage of the Printed Book Database (HPBD). The HPBD is a compilation of dozens [CHECK EXACT NUMBER?] of smaller, mostly national, bibliographies (https://www.cerl.org/resources/hpb/content). Altogether, these bibliographies cover millions of print products printed in Europe and elsewhere between 1470 and 1950. The original MARC files of these catalogues include ... entries (Table XXX).
Our harmonization of the records follows similar principles, and largely identical algorithms, across all catalogues. We have designed custom workflows to facilitate data parsing, harmonization, and enrichment. As the catalogue sizes in this study vary from the 70 thousand [CHECK] raw entries in the FNB to 6 million [CHECK] entries in the HPBD, automation and scalability are critical. In this work, we focus on a few selected fields: the document publication year and place, language, and physical dimensions. Summaries of the final data sets, and full algorithmic details of the harmonization process, are available via the Helsinki Computational History Group website (https://comhis.github.io/2019_CCQ/).

For publication years, we had to interpret varying notation formats and remove spelling mistakes and apparent errors, such as future years. For publication places, we mapped disambiguated city names to open geographic databases, in particular Geonames [REFS], and to manually curated city-country mapping lists. Unified treatment of geographical names allows integration across independently maintained catalogues. We separated the primary and other languages for multilingual documents. The physical dimension fields of the MARC entries were converted to centimeters and standard gatherings. Finally, we developed custom algorithms to summarize the page count field of the MARC format into a single numeric page count [REFS], as implemented in the publicly available bibliographica R package.

Where possible, we have augmented the missing values and added derivative fields, such as print area, which quantifies the number of sheets used to print different documents in a given period, and thus the overall breadth of printing activity. The print area reflects the overall breadth of print products in a way that is complementary to the mere title count, yet independent of print run estimates. In addition to the difficulties in obtaining reliable print run estimates, the standard gatherings show variation across time and place. A key innovation in our approach is that the available large-scale data allows us to rigorously estimate the varying gathering sizes in different places and time periods, and thus to obtain more accurate estimates for the missing entries and the overall paper consumption.

Due to automation, any potential shortcomings in the processing can be fixed, with subsequent updates in the complete data collection. Reliable research use relies on sufficiently accurate, unbiased, and comprehensive information, which is ideally verifiable based on external sources. We have used external data sources, for instance on geographical places, to further complement, enrich, and verify the information that is available in the original library catalogues. We constantly monitor the data processing quality based on automated unit tests, cross-linking, manual curation, and matching with external databases. In this, we have incorporated best practices and tools from data science.
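To make these steps concrete, the following R sketch illustrates two of them: publication year harmonization and the derivation of print area from page counts and gatherings. This is a simplified illustration, not the bibliographica implementation; the function name, example strings, and fixed pages-per-sheet values are our own assumptions.

```r
# A simplified illustration of two harmonization steps; this is not the
# actual bibliographica implementation, and the example inputs are invented.

# 1) Publication year: extract the first four-digit run from the raw field
#    and discard apparent errors such as years outside the handpress era.
harmonize_year <- function(x, earliest = 1450, latest = 1800) {
  m <- regexpr("[0-9]{4}", x)
  year <- rep(NA_integer_, length(x))
  year[m > 0] <- as.integer(regmatches(x, m))
  year[!is.na(year) & (year < earliest | year > latest)] <- NA
  year
}
harmonize_year(c("[1642]", "Anno 1642.", "17??", "2642"))  # 1642 1642 NA NA

# 2) Print area: page counts are turned into sheet counts via the pages
#    per sheet implied by the gathering (folio 4, quarto 8, octavo 16).
pages_per_sheet <- c("2fo" = 4, "4to" = 8, "8vo" = 16)
pages     <- c(480, 96, 320)
gathering <- c("2fo", "8vo", "4to")
sheets <- pages / pages_per_sheet[gathering]
sum(sheets)  # total sheets: a proxy for the breadth of printing activity
```

In the actual workflow, the gathering-specific sheet sizes are further estimated from the data for each time and place, rather than fixed as in this sketch.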

Open bibliographic data science

[COULD INCLUDE A SCHEMATIC FIGURE WHICH OUTLINES THE CATALOGUES, HARMONIZATION, AND INTEGRATION?]

What we are proposing is a roadmap towards systematic large-scale standardization and integration of bibliographic information. In this work, we demonstrate the potential of this approach based on the first substantial implementations of such a data analytical ecosystem at the level of national catalogues. This generic approach relies on a number of technologies, from database management to reproducible statistical data analysis and visualization. The overall workflow is modular and reproducible. Our open source approach to code sharing, and our efforts to support existing standards where possible, facilitate the collaborative development of targeted data analysis algorithms. We are constantly taking advantage of, and contributing to, the growing body of open source algorithms in the relevant research fields.
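As one example of such reproducibility practices, automated unit tests can guard each harmonization step. The sketch below uses the testthat R package together with the hypothetical harmonize_year() function from the earlier example; the plausibility range and missingness threshold are illustrative choices, not our production settings.

```r
# A sketch of automated quality monitoring with the testthat package,
# reusing the hypothetical harmonize_year() sketch from above.
library(testthat)

test_that("harmonized years are plausible and sufficiently complete", {
  years <- harmonize_year(c("[1642]", "Anno 1642.", "2642"))
  # No harmonized year may fall outside the handpress era
  expect_true(all(years >= 1450 & years <= 1800, na.rm = TRUE))
  # The share of unresolved entries must stay below an agreed threshold
  expect_lt(mean(is.na(years)), 0.5)
})
```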