The harmonized data sets can be further integrated and converted into Linked Open Data [REFS] and other popular formats to utilize the vast pool of existing software tools. In addition to improving data quality, and hence the value of LOD and other data infrastructures that focus on data management and retrieval, harmonization enables statistical analysis with scientific programming environments such as R [REFS] or Python [REFS], which provide advanced tools for modern data analysis and statistical inference. These two approaches thus serve different, complementary purposes.
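As a minimal sketch of the kind of conversion referred to above (not the project's actual LOD pipeline or schema), the following base R code serializes a single harmonized record as Dublin Core style triples in Turtle syntax; the record identifier, field names, and vocabulary choices are illustrative assumptions only.

```r
# Illustrative sketch: serializing one harmonized record as simple
# Dublin Core style RDF triples in Turtle syntax. The URI, prefixes,
# and field names are hypothetical, not the project's actual schema.
record <- list(
  id        = "http://example.org/record/T12345",
  title     = "A Modest Proposal",
  date      = "1729",
  language  = "English",
  placename = "Dublin"
)

triples <- c(
  "@prefix dcterms: <http://purl.org/dc/terms/> .",
  sprintf("<%s> dcterms:title \"%s\" ;", record$id, record$title),
  sprintf("  dcterms:issued \"%s\" ;", record$date),
  sprintf("  dcterms:language \"%s\" ;", record$language),
  sprintf("  dcterms:spatial \"%s\" .", record$placename)
)

# Write the triples to a Turtle file that generic LOD tools can consume.
writeLines(triples, "record.ttl")
```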
The data harmonization that we have implemented follows similar principles, and largely identical algorithms, across all catalogues. As the catalogue sizes in this study are as high as 6 million [CHECK] entries in the HPBD, automation and scalability are critical. In this work, we focus on a few selected fields, namely publication year and place, language, and physical dimensions. In summary, the data harmonization includes removal of spelling errors, term disambiguation, standardization and quality control, and custom algorithms such as page count estimation from the raw MARC notation [REFS], implemented in the associated bibliographica R package. We have filled in missing values and added derived fields, such as print area, which quantifies the overall number of sheets printed across the distinct documents in a given period, and thus the overall breadth of printing activity. Print area complements the mere title count, and differs from paper consumption, which would additionally require print run estimates. An overview of the harmonized data set contents and full algorithmic details of the harmonization process are available via the Helsinki Computational History Group website [LINK: https://comhis.github.io/2019_CCQ].
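To make the print area idea concrete, the following R sketch estimates per-document sheet counts from harmonized page counts and gatherings (document format), assuming the standard mapping from format to pages per sheet (folio 4, quarto 8, octavo 16, duodecimo 24), and then aggregates them per decade. This is an illustrative simplification, not the bibliographica implementation; the column and function names are hypothetical.

```r
# Sketch only: derive a "print area" style summary (total sheets and
# title counts per decade) from harmonized records. Column names,
# function names, and example values are hypothetical.
library(dplyr)

# Assumed mapping from gatherings to pages per sheet: a folio sheet
# folded once yields 2 leaves = 4 pages, a quarto 8, an octavo 16, etc.
pages_per_sheet <- c(folio = 4, quarto = 8, octavo = 16, duodecimo = 24)

estimate_sheets <- function(pagecount, gatherings) {
  pagecount / pages_per_sheet[gatherings]
}

# Example harmonized records
records <- data.frame(
  publication_decade = c(1700, 1700, 1710),
  pagecount          = c(120, 48, 320),
  gatherings         = c("octavo", "quarto", "folio")
)

records %>%
  mutate(sheets = estimate_sheets(pagecount, gatherings)) %>%
  group_by(publication_decade) %>%
  summarise(print_area = sum(sheets, na.rm = TRUE),
            titles     = n())
```

Note that, consistent with the distinction drawn above, the sketch sums sheets over distinct documents only; estimating paper consumption would additionally multiply each document's sheet count by its print run.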
Reliable research use depends on sufficiently accurate, unbiased, and comprehensive information, which is ideally verifiable against external sources. We have used external data sources, for instance on geographical places, to further complement and verify the information available in the original library catalogues. We monitor data processing quality with automated unit tests, manual curation, and cross-linking with external databases, incorporating best practices and tools from data science. Because the processing is automated, any shortcomings that are identified can be fixed centrally and propagated to the complete data collection in subsequent updates.
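The following testthat-style unit test illustrates the kind of automated quality check mentioned above; it is a self-contained sketch rather than part of the project's actual test suite, and the polish_year helper is a simplified stand-in for the real harmonization routines.

```r
# Sketch of an automated quality check for field harmonization.
library(testthat)

polish_year <- function(x) {
  # Extract the first four-digit year from a raw publication-year field,
  # e.g. "[1785?]" -> 1785; return NA when no year can be recovered.
  m <- regmatches(x, regexpr("[0-9]{4}", x))
  if (length(m) == 0) NA_integer_ else as.integer(m[1])
}

test_that("publication years are harmonized to plausible integers", {
  expect_equal(polish_year("[1785?]"), 1785L)
  expect_equal(polish_year("MDCCLXXXV [1785]"), 1785L)
  expect_true(is.na(polish_year("s.a.")))    # "sine anno": no year given
  expect_true(polish_year("1701") >= 1470)   # simple plausibility check
})
```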