Biases, inaccuracies and gaps in data collection or quality may severely hinder productive research use of bibliographic metadata collections. Varying standards and languages pose further challenges for data integration, highlighting the need to reconsider the underlying principles regarding the collection and management of bibliographic records.
Automation and scalability are critical, as the catalogues in this study are large, with up to 6 million [CHECK] entries in the HPBD.
In this work, we focus on a few selected fields: publication year and place, language, and physical dimensions. We have removed spelling errors, disambiguated and standardized terms, augmented and validated missing values, and developed custom algorithms, such as conversions from the raw MARC notation to numerical page count estimates [REFS], which we have implemented in the bibliographica R package. We have also added derivative fields, such as print area, which quantifies the overall number of sheets in distinct documents in a given period. The print area thus reflects the overall breadth of printing activity, and complements both the mere title count and the overall paper consumption, which includes print run estimates. We have also used external data sources on authors, publishers, and places to enrich and verify the bibliographic information. An overview of the harmonized data sets and full algorithmic details of our analysis are available via the Helsinki Computational History Group website (https://comhis.github.io/2019_CCQ).
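
To make the two derived quantities more concrete, the following Python sketch illustrates how a raw page statement might be converted to a numerical page count estimate, and how a sheet-based print area could then be summed over a period. It is a simplified illustration only, not the algorithm implemented in the bibliographica R package: the function names, the record layout, the format labels (`2fo`, `4to`, `8vo`, `12mo`) and the rule of simply adding up all page runs are assumptions made for this example. The pages-per-sheet values follow the standard folding conventions (a folio sheet gives 4 pages, a quarto 8, an octavo 16, a duodecimo 24).

```python
import re

# Values of Roman numeral symbols, used for preliminary page runs such as "xii".
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# Pages obtained from one printed sheet per book format.
# The format labels are hypothetical names chosen for this sketch.
PAGES_PER_SHEET = {"2fo": 4, "4to": 8, "8vo": 16, "12mo": 24}

# A page statement: one or more page runs followed by "p" or "p.".
PAGE_STATEMENT = re.compile(r"^(?P<runs>.+?)\s*p\.?$")


def roman_to_int(numeral):
    """Convert a Roman numeral (e.g. "xii") to an integer."""
    total, largest = 0, 0
    # Read right to left: a symbol smaller than one already seen is subtracted.
    for symbol in reversed(numeral.lower()):
        value = ROMAN_VALUES[symbol]
        total += value if value >= largest else -value
        largest = max(largest, value)
    return total


def estimate_pages(extent):
    """Estimate a page count from a raw statement such as "xii, 345 p.".

    Assumption of this sketch: all comma-separated runs are added up, and
    square brackets (unnumbered pages) are ignored. Returns None when the
    statement does not fit this simple pattern.
    """
    match = PAGE_STATEMENT.match(extent.strip().lower())
    if match is None:
        return None
    total = 0
    for run in match.group("runs").split(","):
        token = run.strip().strip("[]").strip()
        if re.fullmatch(r"[0-9]+", token):
            total += int(token)
        elif token and all(symbol in ROMAN_VALUES for symbol in token):
            total += roman_to_int(token)
        else:
            return None  # unrecognized notation: leave the value missing
    return total


def print_area(records, start_year, end_year):
    """Sum the estimated number of sheets over documents in a period.

    Each record is a dict with the keys "year", "extent" and "format"
    (a hypothetical layout). Records that cannot be parsed are skipped.
    """
    sheets = 0.0
    for record in records:
        if not start_year <= record["year"] <= end_year:
            continue
        pages = estimate_pages(record["extent"])
        pages_per_sheet = PAGES_PER_SHEET.get(record["format"])
        if pages is None or pages_per_sheet is None:
            continue
        sheets += pages / pages_per_sheet
    return sheets
```

Under these assumptions, `estimate_pages("xii, 345 p.")` gives 357, while a statement such as `"2 v."` yields `None` and is left out of the sum rather than guessed. A production pipeline would need to handle many more notations (plates, leaves, multi-volume works) and would typically report how many records were skipped.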