Bibliographic data collections tend to include large portions of manually inserted information, which is prone to spelling errors, mistakes, missing or ambiguous information, duplicates, and varying notation conventions and languages, all of which can pose serious challenges for automated harmonization efforts. Bias in the data collection process or in data quality may further hinder productive use of bibliographies as a research resource. In practice, the raw bibliographic records have to be systematically harmonized and quality controlled. Reliable research use relies on sufficiently accurate, unbiased, and comprehensive information, which is ideally verifiable against external sources.
The lack of openly available large-scale bibliographic data is another key challenge for the development of bibliographic data science. In particular, the critical, collaborative, and cumulative development of automated data harmonization and analysis algorithms; the identification of errors, biases, and gaps in the data; the integration of data across multiple bibliographies and supporting information sources; and the innovative reuse of available resources all face severe limitations when key data resources are not openly available.
Finally, whereas large portions of the data analysis can be automated, efficient and reliable research use requires expertise from multiple, traditionally distinct academic and technical disciplines, such as history, informatics, data science, and statistics. Multi-disciplinary consortia with the critical combination of expertise are more easily established on paper than in practice.
In our recent work on the Swedish and Finnish national bibliographies [REFS], we have encountered specific and largely overlooked challenges in using bibliographic catalogues for historical research. Here, we demonstrate solutions based on our recent work on large-scale integrative analysis of national bibliographies. Moreover, we show how the integration of bibliographic information carries remarkable research potential for overcoming the nationalistic emphasis of the individual catalogues.

Bibliographies used in this study

This work is based on four bibliographies that we have acquired for research use from the respective research libraries: the Finnish National Bibliography Fennica (FNB), the Swedish National Bibliography Kungliga (SNB), the English Short Title Catalogue (ESTC), and the Heritage of the Printed Book Database (HPBD). The HPBD is a compilation of dozens [CHECK EXACT NUMBER?] of smaller, mostly national, bibliographies (https://www.cerl.org/resources/hpb/content).
Altogether, these bibliographies cover millions of print products printed in Europe and elsewhere between 1470 and 1950. The original MARC files of these catalogues include ... entries (Table XXX).
Furthermore, our analysis of the FNB demonstrates the research potential of openly available bibliographic data resources. We have substantially enriched and augmented the raw MARC entries that have been openly released by the National Library of Finland. Open availability of the source data allows us to implement reproducible data analysis workflows, which provide a transparent account of every step from raw data to the final summaries. In addition, the open licensing of the original data allows us to share our enriched version [WE MUST CHECK THAT WE ALSO HAVE PERMISSION TO USE ALL THE EXTERNAL DATASETS USED FOR THE ENRICHMENT..!] openly, so that it can be further verified, investigated, and enriched by other investigators. Although we do not have permission to provide access to the original raw data entries of the other catalogues, we are releasing the full source code of our algorithms. With this, we aim to contribute to the growing body of tools specifically tailored for use in this field. Moreover, we hope that the increasing availability of open analysis methods can pave the way towards a gradual opening of bibliographic data collections. This can follow related successes in other fields, such as the human genome sequencing project and subsequent research programs, which critically rely on centrally maintained and openly licensed data resources, as well as on thousands of algorithmic tools that have been independently built by the research community to draw information and insights from these data collections [REFS].
The data harmonization follows similar principles across all catalogues. In brief, we have built custom workflows to facilitate raw data access, parsing, filtering, entry harmonization, enrichment, validation, and integration. We have paid attention to the automation and scalability of the approach, as the sizes of the bibliographies in this study vary from 70 thousand [CHECK] raw entries in the FNB to 6 million [CHECK] entries in the HPBD. In this analysis, we have focused on a few key fields: the document publication year and place, language, and physical dimensions.
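As a simplified illustration of the overall workflow, the following sketch reads raw MARC records and extracts the key fields for subsequent harmonization. It assumes the open-source pymarc library; the input file name is hypothetical, and the extraction is limited to the fields discussed below.

```python
# Sketch of the workflow: parse raw MARC -> extract key fields -> harmonize.
# Assumes the open-source pymarc library; the input file name is hypothetical.
from pymarc import MARCReader

def extract_fields(record):
    """Extract the raw values of the key fields from a single MARC record."""
    entry = {}
    # Publication place and year are catalogued in field 260 (subfields a and c).
    for field in record.get_fields('260'):
        entry['place_raw'] = ' '.join(field.get_subfields('a'))
        entry['year_raw'] = ' '.join(field.get_subfields('c'))
    # The physical description (extent, gatherings) is catalogued in field 300.
    for field in record.get_fields('300'):
        entry['physical_raw'] = ' '.join(field.get_subfields('a') + field.get_subfields('c'))
    return entry

entries = []
with open('fennica.mrc', 'rb') as fh:   # hypothetical raw data dump
    for record in MARCReader(fh):
        if record is not None:           # skip records that fail to parse
            entries.append(extract_fields(record))
```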
For the publication year, we identify varying notation formats, such as free text, Arabic, and Roman numerals, and remove spelling mistakes and erroneous entries.
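A minimal sketch of this parsing step is given below; the regular expressions and plausibility range are illustrative choices rather than our complete rule set.

```python
import re

ROMAN = {'M': 1000, 'D': 500, 'C': 100, 'L': 50, 'X': 10, 'V': 5, 'I': 1}

def roman_to_int(s):
    """Convert a Roman numeral such as MDCCLXXVI to an integer."""
    total = 0
    for ch, nxt in zip(s, s[1:] + ' '):
        val = ROMAN[ch]
        total += -val if nxt in ROMAN and ROMAN[nxt] > val else val
    return total

def parse_year(raw):
    """Extract a publication year from a free-text MARC date statement."""
    # Arabic numerals: pick a plausible four-digit year, e.g. from "anno 1788."
    match = re.search(r'\b(1[4-9]\d\d)\b', raw)
    if match:
        return int(match.group(1))
    # Roman numerals, e.g. "MDCCLXXVI" -> 1776 (years 1000+ only in this sketch).
    match = re.search(r'\b(M[MDCLXVI]*)\b', raw.upper())
    if match:
        year = roman_to_int(match.group(1))
        if 1400 <= year <= 1950:   # reject implausible values as erroneous
            return year
    return None                     # leave missing rather than guess

assert parse_year('anno 1788.') == 1788
assert parse_year('MDCCLXXVI') == 1776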
For the publication place, we harmonize and disambiguate the names through a combination of string clustering, manually constructed correction lists, mapping to open geographic databases, in particular Geonames [REFS and CHECK THE REST], and city-country mappings, followed by verification of data quality and coverage. The mapping takes the catalogue languages into account, and we have unified the approach so that it allows integration of geographical information across catalogues.
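The sketch below shows the name harmonization step using only the Python standard library; the correction list contains illustrative examples rather than our full correction tables, and the similarity threshold is an arbitrary example.

```python
from difflib import SequenceMatcher

# Manually constructed corrections for known variants and misspellings
# (illustrative entries only).
CORRECTIONS = {'Helsingfors': 'Helsinki', 'Abo': 'Turku', 'Stockholmiae': 'Stockholm'}

def normalize_place(raw):
    """Normalize a raw imprint place: strip noise, apply manual corrections."""
    name = raw.strip(' .,:;[]').title()
    return CORRECTIONS.get(name, name)

def cluster_places(names, threshold=0.9):
    """Greedy string clustering: attach each name to the first similar cluster."""
    clusters = []
    for name in sorted(set(names)):
        for cluster in clusters:
            if SequenceMatcher(None, name, cluster[0]).ratio() >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```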
For the language information, we distinguish the primary language from additional languages, account for misleading cataloguing practices, such as filling in missing entries with a default choice, and map the languages to standardized names.
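One possible implementation is sketched below: it combines the primary language code with any additional codes, flags default-filled entries rather than trusting them, and maps the codes to standardized names. The mapping table is an illustrative subset, not our complete list.

```python
# Illustrative subset of the MARC (ISO 639-2) language code mapping.
LANGUAGE_NAMES = {'fin': 'Finnish', 'swe': 'Swedish', 'lat': 'Latin',
                  'ger': 'German', 'eng': 'English',
                  'mul': 'Multiple languages', 'und': 'Undetermined'}

def parse_languages(code_008, codes_041):
    """Combine the primary language (008/35-37) with additional 041 codes."""
    primary = code_008.strip() or None
    # Entries filled in with a default such as 'und' may reflect cataloguing
    # practice rather than the document itself; treat them as missing.
    if primary in (None, 'und', '|||'):
        primary = None
    languages = [primary] if primary else []
    languages += [c for c in codes_041 if c not in languages]
    return [LANGUAGE_NAMES.get(c, c) for c in languages]
```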
For the physical dimensions, which include the gatherings and the page count, similar initial steps of data cleaning have been implemented, followed by a more in-depth analysis of the varying exceptions and notation conventions, and a final validation.
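The following sketch parses page counts and gatherings from a MARC 300 extent statement; the notation table covers only a few common cases, and real statements involve many more conventions (for instance, bracketed unnumbered pages are ignored here).

```python
import re

# Common gatherings notations and the corresponding standard names
# (illustrative subset).
GATHERINGS = {'2to': 'folio', 'fol': 'folio', '4to': 'quarto',
              '8vo': 'octavo', '12mo': 'duodecimo'}

def parse_physical(raw):
    """Parse page count and gatherings from a MARC 300 extent statement,
    e.g. '[4], 232 p. ; 8vo.'"""
    pages = sum(int(n) for n in re.findall(r'(\d+)\s*(?:p\b|pages)', raw))
    gatherings = None
    match = re.search(r'\b(2to|fol|4to|8vo|12mo)\b', raw.lower())
    if match:
        gatherings = GATHERINGS[match.group(1)]
    return {'pages': pages or None, 'gatherings': gatherings}

assert parse_physical('[4], 232 p. ; 8vo.') == {'pages': 232, 'gatherings': 'octavo'}
```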
We have used external sources of metadata, for instance on authors, publishers, and geographical places, to further enrich and verify the information available in the bibliographies.
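As a minimal sketch of such enrichment, the code below joins harmonized place names against an external geographic table, such as a Geonames extract; the file name and column layout are hypothetical.

```python
import csv

def load_geo_table(path):
    """Load an external place table (e.g. a Geonames extract) keyed by name.
    The file name and column layout here are hypothetical."""
    table = {}
    with open(path, newline='', encoding='utf-8') as fh:
        for row in csv.DictReader(fh):
            table[row['name']] = {'country': row['country'],
                                  'lat': float(row['latitude']),
                                  'lon': float(row['longitude'])}
    return table

def enrich_entry(entry, geo_table):
    """Attach country and coordinates to an entry with a harmonized place."""
    geo = geo_table.get(entry.get('place'))
    if geo:
        entry.update(geo)
    return entry
```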
We quantify data coverage by analyzing, for each key field, the fraction of entries for which a valid harmonized value could be recovered.
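A coverage summary of this kind can be computed directly from the harmonized entries; the field names below follow the hypothetical entry layout used in the earlier sketches.

```python
def field_coverage(entries, fields=('year', 'place', 'language', 'pages')):
    """Fraction of entries with a valid (non-missing) value for each field."""
    if not entries:
        return {}
    total = len(entries)
    return {f: sum(e.get(f) is not None for e in entries) / total for f in fields}
```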
While documentation and polishing continue, we have made all source code openly available, so that every detail of the data processing can be independently investigated and verified.
[MT: Explanations of how the different estimates were made. These would be best written almost as a separate piece first, so that they could then also be used elsewhere. After that, merge them into the text and perhaps shorten, etc. I mean, for instance, a description of how missing format information has been filled in, etc. -> LL: Right, this is one sub-project within the catalogue cleaning. It cannot simply be added comprehensively at this point; each field has to be gone through in more detail to document which steps were taken. This will become part of the cleaning effort and has (long) been on the work list. That work will not be done now; instead, I will write short, concise descriptions here, which can later be expanded into proper documentation in the cleaning pipelines.]

Towards a unified view: catalogue integration

Obtaining valid conclusions depends on efficient and reliable harmonization and augmentation of the raw entries.
This paper demonstrates how such challenges can be overcome by specifically tailored data analytical ecosystems that provide scalable tools for data processing and analysis.
A further key step in catalogue integration is the recognition of duplicate entries, both within and across the individual catalogues.
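A minimal sketch of one common approach is given below: entries are blocked by a coarse fingerprint and candidate pairs are then confirmed by fuzzy title similarity. The field names follow the hypothetical entry layout used above, and the thresholds are arbitrary examples.

```python
import re
from difflib import SequenceMatcher

def fingerprint(entry):
    """Coarse blocking key: normalized title prefix, author surname, and year."""
    title = re.sub(r'\W+', '', (entry.get('title') or '').lower())[:20]
    author = (entry.get('author') or '').lower().split(',')[0]
    return (title, author, entry.get('year'))

def find_duplicates(entries, threshold=0.95):
    """Group entries by fingerprint, then confirm candidates by title similarity."""
    blocks = {}
    for e in entries:
        blocks.setdefault(fingerprint(e), []).append(e)
    duplicates = []
    for block in blocks.values():
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                ratio = SequenceMatcher(None, a.get('title', ''),
                                        b.get('title', '')).ratio()
                if ratio >= threshold:
                    duplicates.append((a, b))
    return duplicates
```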
Furthermore, we show how external sources of metadata, for instance on authors, publishers, or geographical places, can be used to enrich and verify bibliographic information. This type of ecosystem has potential for wider implementation in related studies and other bibliographies.

Open bibliographic data science

Open bibliographic data science builds on shared practices for data organization, data and code sharing, interfaces, software modules, and analytical ecosystems.