Regarding data quality, bibliographic data collections tend to include large portions of manually entered information, which is prone to spelling errors, missing or ambiguous values, duplicates, and varying notation conventions and languages, all of which pose serious challenges for automated harmonization. Biases in data collection processes or quality may further hinder the productive use of bibliographies as a research resource. In practice, the raw bibliographic records have to be systematically harmonized and quality controlled. Reliable research use depends on sufficiently accurate, unbiased, and comprehensive information, which is ideally verifiable against external sources.
In terms of data availability, the lack of open large-scale bibliographic data is another key challenge for the development of bibliographic data science. When key data resources are not openly available, severe limitations are placed on the critical, collaborative, and cumulative development of automated harmonization and analysis algorithms; on the identification of errors, biases, and gaps in the data; on the integration of data across multiple bibliographies and supporting information sources; and on the innovative reuse of the available resources.
Efficient multi-disciplinary consortia with the critical combination of expertise are more easily established on paper than in practice. Whereas large portions of the data analysis can be automated, efficient and reliable research use requires expertise from multiple, traditionally distinct academic and technical disciplines, such as history, informatics, data science, and statistics.
In our recent work [REFS], we have encountered specific and largely overlooked challenges in using bibliographic catalogues for historical research. We propose new, systematic, and scalable solutions for the large-scale integrative analysis of national bibliographies, and demonstrate how the integration of bibliographies can overcome the nationalistic emphasis of the individual catalogues.

Large-scale data harmonization and integration [IT MAY BE THAT SOME DETAILS ARE BETTER LEFT OUT SO THAT THE TEXT BECOMES LESS HEAVY. A QUESTION OF WRITING TECHNIQUE. THE MAIN OPEN ISSUE IS HOW COMPREHENSIVELY WE WANT TO DESCRIBE THE TECHNICAL PROCESS AT ALL; SOME COMPROMISE WILL PROBABLY BE NEEDED. IDEALLY THE TECHNICAL SIDE SHOULD NOT JUMP OUT AT THE READER BUT HUM SMOOTHLY IN THE BACKGROUND. IN MANY PLACES IT MAY THEREFORE BE ENOUGH TO REFER TO OUR ONLINE RESOURCES, AND FOR THOSE WE COULD EVEN PUT TOGETHER A CLEARER LANDING PAGE THAN THE CURRENT ONE]

This work is based on four bibliographies that we have acquired for research use from the respective research libraries. These include the Finnish National Bibliography Fennica (FNB), the Swedish National Bibliography Kungliga (SNB), the English Short-Title Catalog (ESTC), and the Heritage of the Printed Book Database (HPBD). The HPBD is a compilation of dozens [CHECK EXACT NUMBER?] of smaller, mostly national, bibliographies (https://www.cerl.org/resources/hpb/content).
Altogether, these bibliographies cover millions of print products printed in Europe and elsewhere between 1470 and 1950. The original MARC files of these catalogues include ... entries (Table XXX).
Harmonization of the bibliographic records follows similar principles across all catalogues. We have designed custom workflows to facilitate data parsing, harmonization, enrichment, and integration. As the sizes of the bibliographies in this study vary from 70 thousand [CHECK] raw entries in the FNB to 6 million [CHECK] entries in the HPBD, we have paid particular attention to the automation and scalability of the approach.
These details are now summarized at https://comhis.github.io/2019_CCQ/. Here, we focus on a few selected fields: the document publication year and place, language, and physical dimensions. Summaries of the final data sets and full algorithmic details of the harmonization process are available for each catalogue via the Helsinki Computational History Group website (https://comhis.github.io).
For publication years, we had to identify and interpret varying notation formats, such as free text and Arabic and Roman numerals, and to remove spelling mistakes and erroneous entries (a simplified sketch is given below). For publication places, we harmonized and disambiguated city names through a combination of string clustering and mappings to open geographic databases, in particular Geonames [REFS]. In addition, we complemented the automated searches with manually curated lists to merge synonymous place names (arising from spelling errors, varying writing conventions, and language versions) and to fill in missing city-country mappings. We have paid particular attention to the unified treatment of geographical names across catalogues, in order to allow comparison and integration of geographical information across independently maintained catalogues.
Regarding the language information, we mapped the language identifiers in the MARC format to the corresponding full language names, and in this process also identified and corrected occasional spelling errors. Where available, we separately listed the primary and other languages of multilingual documents. In the SNB, manual checking revealed that many non-Swedish documents had been marked as Swedish. The likely explanation is that missing language entries were filled in with a misleading default choice. We therefore excluded the language information field of the SNB from our analyses. We are now developing systematic ways to detect and correct such biases, for instance through automated comparisons of the language information, the geographical location, and the languages used in the document title.
The physical dimensions include the gatherings (document width and height) and page count information. Although these fields contain numerical information, the MARC entries do not readily provide standardized height, width, and page count estimates. The notations vary from standard gatherings (such as folio or octavo) to inches and centimeters, and sometimes only partial information is available. For page counts, the MARC notation follows a standard convention [REFS], which separates the cover sheets, figures, and other document details. We have constructed custom algorithms to convert the raw page count information into numeric values, as implemented in the publicly available bibliographica R package (a second sketch below illustrates this step in simplified form). Where possible, we have estimated and augmented missing values based on the available information; for instance, the gatherings can often be reliably estimated when only the height or width of the document is available. Finally, we have added derivative fields, such as print area, which refers to the number of sheets used to print the documents of a given period, ignoring print run estimates. This is, in a sense, a measure of the overall breadth of printing activity, reflecting the overall volume of print products in a way that is complementary to mere title counts. With rough print run estimates of 1,000 copies [REFS], we can also provide estimates of approximate paper consumption.
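To keep the technical side in the background while still making it concrete, the following minimal sketch (in base R) illustrates the kind of logic involved in parsing publication years. It is not the production code, which is documented via the links above; the function name and the handled cases are illustrative only.
```r
# Illustrative sketch only; the production rules handle many more cases.
# Parse a raw publication year entry into a numeric year, accepting Arabic
# numerals, Roman numerals, and free-text variants such as "[1642?]".
parse_publication_year <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("\\[|\\]|\\?|\\.", "", x)              # drop brackets, question marks, periods
  if (grepl("^[0-9]{3,4}$", x)) {
    year <- as.numeric(x)                          # Arabic numerals, e.g. "1642"
  } else if (grepl("^[MDCLXVI]+$", x)) {
    year <- as.integer(utils::as.roman(x))         # Roman numerals, e.g. "MDCXLII"
  } else {
    m <- regmatches(x, regexpr("[0-9]{4}", x))     # free text: pick out a four-digit year
    year <- if (length(m) == 1) as.numeric(m) else NA_real_
  }
  if (!is.na(year) && (year < 1400 || year > 1950)) year <- NA_real_  # reject implausible values
  year
}

parse_publication_year("[1642?]")                   # 1642
parse_publication_year("MDCXLII")                   # 1642
parse_publication_year("printed in the year 1642")  # 1642
```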
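In the same spirit, the second sketch shows a deliberately simplified page count parser for MARC 300$a-style statements. It is a stand-in rather than the actual implementation in the bibliographica package; plates, illustrations, and volume structure are ignored here.
```r
# Illustrative, simplified page-count parser for statements such as
# "[8], 224, [4] p." or "xii, 340 p.".
parse_pagecount <- function(x) {
  x <- gsub("p\\.?$", "", trimws(tolower(x)))                     # drop the trailing "p." unit
  parts <- gsub("\\[|\\]", "", trimws(unlist(strsplit(x, ","))))  # brackets mark unnumbered pages
  total <- 0
  for (p in parts) {
    if (grepl("^[0-9]+$", p)) {
      total <- total + as.numeric(p)                              # Arabic page counts
    } else if (grepl("^[ivxlcdm]+$", p)) {
      total <- total + as.integer(utils::as.roman(toupper(p)))    # Roman-numbered prefatory pages
    }                                                             # other segments are skipped here
  }
  total
}

parse_pagecount("[8], 224, [4] p.")  # 236
parse_pagecount("xii, 340 p.")       # 352
```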
However, in addition to the difficulty of obtaining reliable print run estimates, the standard gatherings dimensions vary somewhat across time and place. This highlights a key innovation in our approach: the available large-scale data allows us to estimate the variation in gathering sizes across time and place, and this information can in turn be used to obtain more accurate estimates of the missing entries and of the overall paper consumption (a worked example follows below). All of these harmonization steps are fully transparent and, thanks to automation, any shortcomings in the processing can be fixed and the complete data collection subsequently updated. We have also used external sources of metadata, for instance on authors, publishers, and geographical places, to further enrich and verify the information available in the bibliographies.
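As a worked example of how the harmonized fields combine into these derived quantities, the sketch below converts a page count and gatherings format into sheets per copy, and then into a rough paper consumption figure. The idealized pages-per-sheet mapping and the flat 1,000-copy print run are simplifying assumptions for illustration, not measured values.
```r
# Illustrative sketch: from harmonized page count and gatherings format to
# sheets per copy and an approximate paper consumption figure.
pages_per_sheet <- c(folio = 4, quarto = 8, octavo = 16, duodecimo = 24)

sheets_per_copy <- function(pages, gatherings) {
  ceiling(pages / pages_per_sheet[[gatherings]])   # partial sheets are rounded up
}

sheets_per_copy(236, "folio")    # 59 sheets per copy
sheets_per_copy(352, "octavo")   # 22 sheets per copy

# approximate paper consumption of one octavo edition, assuming ~1000 copies
print_run <- 1000
sheets_per_copy(352, "octavo") * print_run   # about 22,000 sheets of paper
```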
Our automated harmonization efforts are coupled with systematic monitoring and verification of the quality and coverage of the harmonized entries [COULD ADD ESTIMATES ON THE PERCENTAGE OF MISSING ENTRIES THAT COULD BE AUGMENTED FOR EACH FIELD?]. This is facilitated by automatically generated summaries of the data conversions and of the mappings between the raw entries and the final data, which are available on the project homepage [LINK]. We continuously monitor the accuracy and coverage of the data processing through automated unit tests, cross-linking of complementary information fields (such as document title and language, or author life years and publication times), manual curation, and, where possible, matching with external databases to estimate the overall accuracy and completeness of the harmonized entries (a simplified example of such a check is sketched below). We have incorporated best practices from data science, taking advantage of tidy data formats [REFS], standard database structures and query tools, and statistical programming [REFS]. A number of R packages and Python libraries have been essential in this work [ADD MOST IMPORTANT ONES?]; for a full list, see the software page of the project.
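As one deliberately simplified example of such cross-linking checks, the sketch below flags documents whose publication year falls outside the author's plausible life span, and applies a basic unit test on the harmonized years. The column names and example records are hypothetical and do not reflect our actual data schema.
```r
# Hypothetical example records for illustration only
docs <- data.frame(
  author           = c("Agricola, Mikael", "Linné, Carl von"),
  publication_year = c(1543, 1901),
  author_birth     = c(1510, 1707),
  author_death     = c(1557, 1778)
)

# Flag documents published before the author's birth or long after the
# author's death (posthumous editions are tolerated within a margin);
# flagged rows can then be passed on to manual curation.
flag_life_years <- function(df, margin = 30) {
  suspicious <- !is.na(df$publication_year) & !is.na(df$author_birth) & !is.na(df$author_death) &
    (df$publication_year < df$author_birth |
       df$publication_year > df$author_death + margin)
  df[suspicious, ]
}
flag_life_years(docs)   # returns the 1901 Linné edition for inspection

# A simple automated unit test: harmonized years must fall within the
# period covered by the catalogues (1470-1950)
stopifnot(all(docs$publication_year >= 1470 & docs$publication_year <= 1950))
```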
Our analysis of the FNB demonstrates the research potential of openly available bibliographic data resources. We have enriched and augmented the raw MARC entries that have been openly released by the National Library of Finland. The open availability of the source data allows us to implement reproducible data analysis workflows, which provide a transparent account of every step from the raw data to the final summaries. In addition, the open licensing of the original data allows us to share our enriched version [WE NEED TO CHECK HERE THAT WE ALSO HAVE PERMISSION TO USE ALL THE EXTERNAL DATA SETS USED FOR THE ENRICHMENT..!] openly, so that it can be further verified, investigated, and enriched by other investigators. Although we do not have permission to provide access to the original raw data entries of the other catalogues, we are releasing the full source code of our algorithms. With this, we aim to contribute to the growing body of tools that are specifically tailored for use in this field. Moreover, we hope that the increasing availability of open analysis methods can pave the way towards a gradual opening of bibliographic data collections. This could follow related successes in other fields, such as the human genome sequencing project and subsequent research programs, which critically rely on centrally maintained and openly licensed data resources, as well as on the thousands of algorithmic tools that the research community has independently built to draw information and insights from these data collections [REFS].