Biases, inaccuracies, and gaps may severely hinder productive research use of bibliographic metadata collections. Varying standards and languages pose further challenges for data integration, highlighting the need to reconsider the underlying principles of the management of bibliographic records. Our data harmonization efforts follow similar principles and largely identical algorithms across all catalogues. In this work, we focus on a few selected fields, namely publication year and place, language, and physical dimensions. We have removed spelling errors, disambiguated and standardized terms, augmented missing values, and developed custom algorithms, such as conversions from the raw MARC notation to numerical page count estimates [REFS]. We have also added derivative fields, such as print area, which quantifies the overall number of sheets in distinct documents in a given period, and thus the overall breadth of printing activity. We have also used external data sources on authors, publishers, and places to enrich and verify bibliographic information. Automation, scalability, and quality monitoring are critical, as the catalogues in this study contain as many as 6 million [CHECK] entries in the case of the HPBD. We have ensured data quality through automated unit tests, manual curation, and cross-linking with external databases, incorporating best practices and tools from data science. Bibliographic data science is an iterative process in which improved understanding of the investigated phenomena often leads to enhancements in data harmonization and validation. This cumulative process has equipped us with a vast body of methods that support the research use of bibliographic metadata collections [LINK - bibliographica R package]. Automation allows us to fix observed shortcomings with subsequent updates in the harmonized data collection.
An overview of the harmonized data sets and full algorithmic details of our analysis are available via the Helsinki Computational History Group website [LINK: https://comhis.github.io/2019_CCQ **].
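To make the idea of converting raw physical descriptions into numerical page count estimates more concrete, the following minimal Python sketch sums the page sequences in a simplified extent statement such as "[4], xii, 345 p." and derives a rough sheet count from the result. It is an illustration under stated assumptions, not the algorithm implemented in the bibliographica R package: the function names, the format codes, and the handling of bracketed and roman-numeral sequences are all hypothetical simplifications, and leaves, plates, page ranges, and multi-volume notation are deliberately ignored.

```python
import re

# Pages per printed sheet for common formats (both sides of each leaf counted).
# The format codes used as keys are an assumption made for this sketch.
PAGES_PER_SHEET = {"2fo": 4, "4to": 8, "8vo": 16, "12mo": 24}

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a lowercase roman numeral such as 'xii' to an integer."""
    total = 0
    previous = 0
    for char in reversed(numeral):
        value = ROMAN_VALUES[char]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def estimate_pages(extent):
    """Sum the page sequences that precede 'p.' or 'pp.' in an extent statement.

    Returns None when no page sequence is recognised, so that missing
    values stay explicit instead of silently becoming zero.
    """
    match = re.match(r"^(.*?)\s*pp?\.", extent.strip().lower())
    if match is None:
        return None
    total = 0
    recognised = False
    for token in match.group(1).split(","):
        # Bracketed sequences such as '[4]' denote unnumbered pages.
        token = token.strip().strip("[]")
        if re.fullmatch(r"[0-9]+", token):
            total += int(token)
            recognised = True
        elif re.fullmatch(r"[ivxlcdm]+", token):
            total += roman_to_int(token)
            recognised = True
        # Anything else (ranges, plates, free text) is skipped in this sketch.
    return total if recognised else None


def estimate_sheets(pages, book_format):
    """Estimate printed sheets from a page count and a format code."""
    per_sheet = PAGES_PER_SHEET.get(book_format)
    if pages is None or per_sheet is None:
        return None
    return pages / per_sheet


pages = estimate_pages("[4], xii, 345 p.")
print(pages, estimate_sheets(pages, "8vo"))
```

Returning None for unrecognised input, rather than zero, keeps gaps in the data visible to later validation steps. Simple assertions on known inputs could serve as automated unit tests of the kind mentioned above.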
Ideally, such harmonization and validation efforts are fully transparent in terms of both data and source code. Many of the most comprehensive library catalogues are not yet generally available as open data, however, and may be difficult to obtain even for research purposes. The lack of data availability forms a major bottleneck for transparent and collaborative development of bibliographic data science, and for innovative reuse of the available resources. This might be gradually changing, however. The National Library of Finland, for instance, recently made available the complete MARC entries of the FNB (LINK http://data.nationallibrary.fi/) for download and reuse under the CC0 open data license. We have also shared our harmonized version, and started preparations to integrate the harmonization algorithms into the LOD releases of this bibliography. Large-scale harmonization, in combination with the existing data management infrastructures, could open up new doors for research on national bibliographies.
Whereas large portions of data analysis can be automated, efficient and reliable research use requires collaboration between traditionally distinct disciplines, such as history, informatics, and data science. Finding the right combination and balance of expertise may prove challenging in practice. Data harmonization is only the starting point for our analysis, albeit an important one. The harmonized data sets can be further integrated and converted into Linked Open Data [REFS] and other popular formats in order to utilize the vast pool of existing software tools. In addition to improving the overall data quality, and hence the overall value of LOD and other data infrastructures that focus on data management and retrieval, the harmonization enables statistical analysis with scientific programming environments such as R [REFS] or Python [REFS], which provide advanced tools for modern data analysis and statistical inference. Hence, these two approaches serve different, complementary purposes. Our analysis of the FNB demonstrates the advantages of open availability of library catalogues. The raw MARC entries of the FNB have been openly released by the National Library of Finland. We have now harmonized, augmented, and enriched this data with tools from the open data analytical ecosystem, and hereby release the final harmonized data set used in this study so that it can be further verified, investigated, and enriched by academics as well as the general public. The open availability allows us to demonstrate the advantages of a reproducible data analysis workflow, which provides a transparent account of every step from raw data to the final results.
The content of bibliographic metadata collections is the product of at least three multi-layered historical processes. First, the digitization of traditional card catalogues may have meant an exclusion of material that was regarded as less important or covered elsewhere. Second, the compilation of early national bibliographies has in general been based on existing bibliographies that were originally compiled for other purposes (FOOTNOTE: For a discussion on the Danish National Bibliography, see Horstbøll 1999**). Naturally, the national bibliographies have not been able to include everything published, although the effort towards completeness has been remarkable. Third, the records reflect different historical practices of printing and publishing. In eighteenth-century Sweden, for instance, the printing of laws and decrees formed a crucial part of political discourse and was of great economic value to the book industry (CITE: Rimm, A.-M. 2005a. Den kungliga boktryckaren, del 1. Biblis 30: 4–31; Rimm, A.-M. 2005b. Den kungliga boktryckaren, del 2. Biblis 31: 27–44.**), whereas in Britain this was the case to a much lesser degree. Such practices are visible in the bibliographic metadata collections, but they tell us primarily about printing practices themselves, and not necessarily about other social and political phenomena, such as language relations, that we might want to study through the data. Any historically oriented study using national bibliographies must therefore be attentive to these historical layers contained in the data in order to propose reasonable interpretations of quantitative data analysis.
Substantiate: