Whereas most research algorithms are nowadays open source, many of the most comprehensive library catalogues are not yet generally available as open data and may be difficult to obtain even for research purposes. This lack of open data forms a major bottleneck for the transparent and collaborative development of bibliographic data science, and for the innovative integration and reuse of the available data and software resources.
Third, whereas large parts of the data analysis can be automated, efficient and reliable research use requires collaboration across traditionally distinct disciplines such as history, informatics, and data science, and finding the right combination and balance of expertise may prove challenging in practice.
Fourth, the contents of bibliographic metadata collections are the products of at least three multi-layered historical processes. First, the digitization of traditional card catalogues may have excluded material that was regarded as less important or as covered elsewhere. Second, the early national bibliographies have in general been compiled from existing bibliographies that were originally collected for other purposes (FOOTNOTE: For a discussion of the Danish National Bibliography, see Horstbøll 1999). Naturally, the national bibliographies have not been able to include everything published, although the effort towards completeness has been remarkable. Third, the records reflect different historical practices of printing and publishing. In eighteenth-century Sweden, for instance, the printing of laws and decrees formed a crucial part of political discourse and was of great economic value to the book industry (CITE: Rimm, A.-M. 2005a. Den kungliga boktryckaren, del 1. Biblis 30: 4–31; Rimm, A.-M. 2005b. Den kungliga boktryckaren, del 2. Biblis 31: 27–44.), whereas in Britain this was the case to a much lesser degree. Such practices are noticeable in the bibliographic metadata collections, but they tell us precisely about printing practices, and not necessarily about other social and political phenomena, such as language relations, that we might want to study through the data. Any historically oriented study using national bibliographies must therefore be attentive to these historical layers in the data in order to propose reasonable interpretations of the quantitative data analysis.
Data harmonization is only the starting point for our analysis, albeit an important one. The harmonized data sets can be further integrated and converted into Linked Open Data [REFS] and other popular formats in order to utilize the vast pool of existing software tools. In addition to improving the overall data quality, and hence the overall value of LOD and other data infrastructures that focus on data management and retrieval, harmonization enables statistical analysis in scientific programming environments such as R [REFS] or Python [REFS], which provide advanced tools for modern data analysis and statistical inference. Hence, these two approaches serve different, complementary purposes. Our analysis of the FNB demonstrates the advantages of the open availability of library catalogues. The raw MARC entries of the FNB have been openly released by the National Library of Finland. We have harmonized, augmented, and enriched this data with the open data analytical ecosystem, and we release the final harmonized data set used in this study so that it can be further verified, investigated, and enriched by academics as well as the general public. This open availability allows us to demonstrate the advantages of a reproducible data analysis workflow, which provides a transparent account of every step from the raw data to the final results.
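As a simple illustration of such a workflow, the following R sketch loads a harmonized data set and summarizes publishing activity by decade. The file name and column names are hypothetical placeholders rather than the exact schema of the released FNB data; the point is that every step from raw data to summary is scripted and hence repeatable.

```r
# Minimal sketch of one step in a reproducible analysis workflow.
# The file name "fnb_harmonized.csv" and the column "publication_year"
# are hypothetical placeholders, not the actual released data schema.
library(dplyr)

fnb <- read.csv("fnb_harmonized.csv", stringsAsFactors = FALSE)

# Count catalogued titles per decade; because the step is scripted,
# the summary can be regenerated and verified from the raw data.
titles_per_decade <- fnb %>%
  filter(!is.na(publication_year)) %>%
  mutate(decade = 10 * floor(publication_year / 10)) %>%
  count(decade, name = "n_titles")

print(titles_per_decade)
```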
This process has generated a vast body of custom algorithms and concepts that support the reproducible analysis of library catalogues [REFS - bibliographica R package]. These methods complement traditional software interfaces, which have been designed for browsing and automated retrieval rather than for scalable statistical research. Data harmonization and quantitative analysis are intricately related objectives: often, the actual data analysis reveals previously unnoticed shortcomings in the data. Hence, bibliographic data science is an inherently iterative process, in which an improved understanding of the data and of historical trends can lead to enhancements in the data harmonization procedures, and to new, independent ways to validate the data and the observed patterns.
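This iterative loop between analysis and harmonization can be sketched with a small, self-contained example. The sample fields and the extraction rule below are deliberate simplifications for illustration and do not reproduce the actual bibliographica routines.

```r
# Standalone sketch of one harmonization step: extracting publication
# years from messy, MARC-derived imprint fields. The sample values and
# the rule are simplified illustrations, not the bibliographica API.
raw_years <- c("1764.", "[1799]", "17??", "MDCCLXXII", "1785-1790")

polish_year <- function(x) {
  # Keep the first plausible hand-press era year (1400-1899)
  hit <- regmatches(x, regexpr("1[4-8][0-9]{2}", x))
  if (length(hit) == 1) as.integer(hit) else NA_integer_
}

polished <- vapply(raw_years, polish_year, integer(1))
print(polished)

# Entries left as NA (truncated years, roman numerals) flag the cases
# the current rules cannot resolve; inspecting such cases during the
# analysis stage feeds back into refining the harmonization rules.
```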
Finally, the challenges of dirty data also carry key messages for research regarding the content, management, use, and usability of bibliographic records, with further implications for the underlying principles, functions, and techniques of descriptive cataloguing.