Reliable research use depends on sufficiently accurate, unbiased, and comprehensive information, which is ideally verifiable against external sources. We have also used external data sources, for instance on geographical places, to further complement and verify the information available in the original library catalogues. We monitor data processing quality with automated unit tests, manual curation, and cross-linking with external databases, incorporating best practices and tools from data science. Because the processing is automated, any shortcomings that are identified can be fixed and then propagated to the complete data collection in subsequent updates.
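As one illustration of such monitoring, a unit test of the following kind can check that harmonized fields stay within plausible ranges and match external reference data. This is a minimal sketch with hypothetical field names and thresholds, not the project's actual test suite.

```r
# Illustrative quality check on harmonized output (field names are assumptions)
library(testthat)

harmonized <- data.frame(
  publication_year  = c(1712, 1855, 1901),
  publication_place = c("Turku", "Helsinki", "Porvoo")
)

# Reference list of accepted place names, e.g. derived from an external gazetteer
known_places <- c("Turku", "Helsinki", "Porvoo", "Vaasa")

test_that("publication years fall within a plausible range", {
  expect_true(all(harmonized$publication_year >= 1400 &
                  harmonized$publication_year <= 2025, na.rm = TRUE))
})

test_that("publication places match the external reference list", {
  expect_true(all(harmonized$publication_place %in% known_places))
})
```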
Data harmonization is only the starting point for our analysis, albeit an important one. The harmonized data sets can be further integrated and converted into Linked Open Data [REFS] and other popular formats in order to utilize the vast pool of existing software tools. In addition to improving the overall data quality, and hence the overall value of LOD and other data infrastructures that focus on data management and retrieval, harmonization enables statistical analysis with scientific programming environments such as R [REFS] or Python [REFS], which provide advanced tools for modern data analysis and statistical inference. Hence, these two approaches serve different, complementary purposes. Our analysis of the FNB demonstrates the advantages of openly available library catalogues. The raw MARC entries of the FNB have been openly released by the National Library of Finland. We have harmonized, augmented, and enriched this data with open data analysis tools, and hereby release the final harmonized data set used in this study so that it can be further verified, investigated, and enriched by academics as well as the general public. This open availability allows us to demonstrate the advantages of a reproducible data analysis workflow, which provides a transparent account of every step from raw data to final results.
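To make the notion of a reproducible analysis step concrete, the following minimal sketch counts catalogued titles per decade from the released harmonized table. The file name and column name are placeholder assumptions, not the actual names used in the released data set.

```r
# Minimal reproducible analysis step on the harmonized data (names are assumptions)
library(readr)
library(dplyr)

fnb <- read_csv("fnb_harmonized.csv")

# Number of catalogued titles per decade of publication
titles_per_decade <- fnb %>%
  filter(!is.na(publication_year)) %>%
  mutate(decade = 10 * floor(publication_year / 10)) %>%
  count(decade)

print(titles_per_decade)
```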
This process has generated a substantial body of custom algorithms and concepts that support reproducible analysis of library catalogues [REFS - bibliographica R package]. These methods complement traditional software interfaces, which have been designed for browsing and automated retrieval rather than for scalable statistical research. Data harmonization and quantitative analysis are intricately related objectives: the actual data analysis often reveals previously unnoticed shortcomings in the data. Hence, bibliographic data science is an inherently iterative process, in which improved understanding of the data and of historical trends can lead to enhancements in the data harmonization procedures and to new, independent ways to validate the data and the observed patterns.
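As an illustration of such harmonization routines, the following simplified sketch extracts a publication year from a raw, free-text imprint field. It is a stand-in written for this example, not the actual API of the bibliographica package.

```r
# Simplified field-polishing routine of the kind used during harmonization
library(stringr)

polish_publication_year <- function(x) {
  # Extract the first plausible four-digit year (1400-2029) from the raw field
  year <- str_extract(x, "\\b1[4-9][0-9]{2}\\b|\\b20[0-2][0-9]\\b")
  as.integer(year)
}

raw_imprint <- c("Turusa, Prandatty J. C. Frenckellilda, 1798.",
                 "Helsingissa : [s.n.], 1855",
                 "s.a.")
polish_publication_year(raw_imprint)
#> [1] 1798 1855   NA
```

Routines of this kind are refined iteratively: when the analysis exposes entries that the current rules miss or misinterpret, the rules are adjusted and the whole collection is reprocessed.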
The challenges posed by dirty data also carry key messages for research regarding the content, management, use, and usability of bibliographic records, with further implications for the underlying principles, functions, and techniques of descriptive cataloging.