We have investigated four bibliographies: the FNB, SNB, ESTC, and HPBD. Each catalogue is associated with a similar open source harmonization workflow, which provides a detailed and transparent account of the data processing steps from raw data harmonization to the final statistical analysis, summaries, and visualization. Furthermore, we have shown how external sources of metadata, for instance on authors, publishers, or places, can be used to enrich and verify the information. Future developments could take increasing advantage of machine learning, as well as ecology and related fields that have well-established methods for spatio-temporal data analysis. Adaptive machine learning could significantly improve the scalability of data harmonization, as models could be trained with a limited set of well-chosen training examples, and the accuracy of the conversions could be monitored until satisfactory accuracy and coverage are reached. This type of data analytical ecosystem has potential for wider implementation in related studies and other bibliographies, as many of the data analytical problems encountered here are common in digital humanities.
Open availability of the raw data as well as the analysis methods is central to efficient, collaborative, and transparent research use of bibliographic collections in modern society. We hope that open availability of data and methods, such as the ones released in this project, can pave the way towards open availability of the library catalogues. In some other fields of science, open data availability has already been established as a standard research practice. For instance, the human genome sequencing project and subsequent research programs have critically relied on open data sharing as well as on the vast body of open source algorithms that have been collaboratively developed by the research community [REFS]. We seek to advance open research by releasing a notably improved version of the Finnish national bibliography, FNB. We hope that our work thereby sets an example of a dedicated open science project that aims to open the complete research workflow to collaborative criticism and development.
Although our current work is based on the analysis of national catalogues, it helps to challenge the nationalistic view of individual catalogues and paves the way towards large-scale data integration. A number of key challenges in enhancing data quality remain to be overcome. Nevertheless, we have demonstrated that significant historical trends, such as the rate of change in language use or book sizes, are often overwhelmingly clear and can be seen across multiple independently collected catalogues. Integrative analysis can thus help to verify the information and provide complementary views on the universally observed historical trends. Our systematic approach provides a starting point, guidelines, and a set of practically tested algorithms for more extensive analysis and integration.
What is the significance of this work and these publications relative to what has already been published -- including a projection concerning other comparable catalogues? This is a valuable part of the paper itself. - Yes, this should be elaborated further / LL

In systematic data harmonization, the original raw entries are polished, disambiguated, mapped to controlled vocabularies, and verified by internal and external cross-checking of the correspondence between available data sources.
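The harmonization steps named above (polishing, mapping to a controlled vocabulary, and cross-checking against an external source) can be illustrated with a minimal Python sketch. This is not the project's actual workflow: the function names, the place-name vocabulary, and the external reference set are all hypothetical toy values chosen for illustration only.

```python
import re
import unicodedata

# Hypothetical controlled vocabulary: polished variant -> canonical place name.
# These are toy entries for illustration, not data from the actual catalogues.
PLACE_VOCABULARY = {
    "abo": "Turku",
    "aboae": "Turku",
    "turku": "Turku",
    "holmiae": "Stockholm",
    "stockholm": "Stockholm",
}

# Hypothetical external reference: places confirmed by another data source.
EXTERNAL_PLACES = {"Turku", "Stockholm"}


def polish(raw):
    """Strip accents, brackets, punctuation and extra whitespace; lowercase."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[\[\]().,:;?]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def harmonize_place(raw):
    """Map a raw entry to the vocabulary; return None when unrecognized."""
    return PLACE_VOCABULARY.get(polish(raw))


def harmonize_all(raw_entries):
    """Return mapped entries, unmapped raw entries, and externally unconfirmed values."""
    mapped, unmapped = {}, []
    for raw in raw_entries:
        place = harmonize_place(raw)
        if place is None:
            unmapped.append(raw)  # left for manual review or a new vocabulary rule
        else:
            mapped[raw] = place
    # External cross-check: harmonized values missing from the reference set.
    unverified = sorted({p for p in mapped.values() if p not in EXTERNAL_PLACES})
    return mapped, unmapped, unverified
```

In a sketch like this, the list of unmapped entries gives a simple measure of coverage, and the unverified list flags harmonized values that the external source does not confirm. Both could be inspected iteratively as the vocabulary is extended.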