Our automated harmonization efforts are coupled with systematic monitoring and verification of the quality and coverage of the harmonized entries [COULD ADD ESTIMATES ON THE PERCENTAGE OF MISSING ENTRIES THAT COULD BE AUGMENTED FOR EACH FIELD?]. This is facilitated by automatically generated summaries of the data conversions and mappings between the raw entries and the final data, which are available on the project homepage [LINK]. We continuously monitor the accuracy and coverage of data processing through automated unit tests, manual curation, and, where possible, matching against external databases, in order to estimate the overall accuracy and completeness of the harmonized entries. We have incorporated best practices from data science, taking advantage of tidy data formats [REFS], standard database structures and query tools, and statistical programming [REFS]. A number of R packages and Python libraries have been essential in this work [ADD MOST IMPORTANT ONES?]. For a full list, see the software page of the project.
We analyze data coverage quantitatively.
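As a rough illustration of what such a coverage analysis could look like, the following minimal Python sketch computes the percentage of records with a non-empty value for each field, before and after harmonization. It is not the project's actual implementation: the function names (`field_coverage`, `coverage_gain`), the field names, and the toy records are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Mapping, Optional, Sequence

# A record maps field names to string values; None or "" means "missing".
Record = Mapping[str, Optional[str]]


def field_coverage(records: Iterable[Record], fields: Sequence[str]) -> Dict[str, float]:
    """Return, for each field, the percentage of records with a non-empty value."""
    records = list(records)
    total = len(records)
    coverage = {}
    for field in fields:
        # Count values that are neither None nor empty/whitespace-only.
        filled = sum(1 for r in records if (r.get(field) or "").strip())
        coverage[field] = 100.0 * filled / total if total else 0.0
    return coverage


def coverage_gain(raw: Iterable[Record], harmonized: Iterable[Record],
                  fields: Sequence[str]) -> Dict[str, float]:
    """Return the change in coverage (percentage points) from raw to harmonized."""
    before = field_coverage(raw, fields)
    after = field_coverage(harmonized, fields)
    return {f: after[f] - before[f] for f in fields}


# Toy records, invented for illustration only (hypothetical field names).
FIELDS = ["publication_year", "publication_place"]
raw = [
    {"publication_year": "1642", "publication_place": "Aboae"},
    {"publication_year": None, "publication_place": "Turku"},
    {"publication_year": "", "publication_place": None},
    {"publication_year": "[1700?]", "publication_place": ""},
]
harmonized = [
    {"publication_year": "1642", "publication_place": "Turku"},
    {"publication_year": None, "publication_place": "Turku"},
    {"publication_year": "1650", "publication_place": None},
    {"publication_year": "1700", "publication_place": "Stockholm"},
]

if __name__ == "__main__":
    print("raw:       ", field_coverage(raw, FIELDS))
    print("harmonized:", field_coverage(harmonized, FIELDS))
    print("gain:      ", coverage_gain(raw, harmonized, FIELDS))
```

Note that this sketch only measures whether a field is filled, not whether the value is correct; assessing accuracy would additionally require the manual curation and external matching described above.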
Our analysis of the FNB demonstrates the research potential of openly available bibliographic data resources. We have substantially enriched and augmented the raw MARC entries that have been openly released by the National Library of Finland. Open availability of the source data allows us to implement reproducible data analysis workflows, which provide a transparent account of every step in the analysis, from raw data to the final summaries. In addition, the open licensing of the original data allows us to share our enriched version [HERE WE NEED TO CHECK THAT WE ALSO HAVE PERMISSION TO USE ALL THE EXTERNAL DATA SOURCES USED FOR ENRICHMENT..!] openly so that it can be further verified, investigated, and enriched by other investigators. Although we do not have permission to provide access to the original raw data entries for the other catalogues, we are releasing the full source code of our algorithms. With this, we aim to contribute to the growing body of tools that are specifically tailored for use in this field. Moreover, we hope that the increasing availability of open analysis methods can pave the way towards the gradual opening of bibliographic data collections. This can follow related successes in other fields, such as the human genome sequencing project and subsequent research programs, which critically rely on centrally maintained and openly licensed data resources, as well as on thousands of algorithmic tools that have been independently built by the research community to draw information and insights from these data collections [REFS].