Importantly, our key observations regarding vernacularization and related trends are supported by similar patterns across multiple independently maintained bibliographies. This exemplifies the power of such an approach in uncovering broad patterns in knowledge production that are robust to occasional inaccuracies in the data. While documentation and polishing continue, we have made all source code openly available, so that every detail of the data processing can be independently investigated and verified. Obtaining valid conclusions depends on efficient and reliable harmonization and augmentation of the raw entries. This paper demonstrates how such challenges can be overcome by specifically tailored data analytical ecosystems that provide scalable tools for data processing and analysis.
Although our current work is based on the analysis of national catalogues, it helps to challenge the nationalistic view of individual catalogues and paves the way towards large-scale integration of bibliographic data resources. A number of key challenges remain to be overcome, however. For instance, ambiguous author and place names cause additional difficulties; reliable identification of duplicate entries and of biases in the data collection processes remains to be solved; and the dozens of commonly used languages and local cataloguing conventions complicate the analyses. On the other hand, as we have demonstrated, significant historical trends, such as the rate of change in language use or book size, are often overwhelmingly clear and robust to variations in individual data entries. Integrative analysis of multiple catalogues can thus help to verify the information and provide complementary views on the universally observed historical trends. Hence, our systematic approach provides a starting point, guidelines, and a set of practically tested algorithms for more extensive integration of national catalogues. Development of targeted open source algorithms and transparent data processing workflows is a central component in such work.
We have investigated four bibliographies: the FNB, SNB, ESTC, and HPBD. Each catalogue is associated with an open harmonization workflow that is largely based on the same overall harmonization methodology, with custom modifications for each catalogue. These open data analytical ecosystems provide a transparent account from raw data harmonization to the final statistical analysis, summaries, and visualization. We have taken advantage of a number of openly available generic data analytical tools. Our algorithms focus specifically on bibliographic data analysis and can potentially be used by others working on related research challenges in this or other areas, as many of the problems relating to name disambiguation, entry harmonization, and integrative analysis are commonly encountered in digital humanities and other fields. Furthermore, we have shown how external sources of metadata, for instance on authors, publishers, or geographical places, can be used to enrich and verify bibliographic information. This type of ecosystem has potential for wider implementation in related studies and other bibliographies. We are continuing to improve code documentation in order to facilitate collaborative methods development in this field. The open data analytical ecosystems that we have developed are designed and implemented by and for the users of bibliographic records, and they simultaneously augment missing information and enrich the data with supporting information from external sources.
In addition to the historical analysis of knowledge production trends and algorithmic tools for such analysis, we are releasing a notably improved version of the Finnish national bibliography (FNB). The lack of open data availability severely limits efficient and collaborative research use, as well as the accumulation of knowledge regarding the research use of these digital resources. Here, the combination of code and data demonstrates the potential of our open science approach for the open research use of library catalogues, and the essential role that data harmonization and integration play in this process. As such, we hope that our work sets an example of a dedicated open science project that aims to open the complete research workflow for collaborative criticism and development.
Research use is part of validation. Automation of the workflow would in principle also allow analysis of the robustness of this approach to varying technical choices in the data harmonization, although such analysis falls beyond the scope of this manuscript. Future development could take increasing advantage of machine learning and borrow further methods from ecology and related fields that have well-established methods for spatio-temporal data analysis. Machine learning and artificial intelligence (AI) could help to significantly improve the scalability and accuracy of data harmonization and verification. For instance, the raw page count fields have a systematic structure, and instead of a lengthy algorithm construction process, adaptive machine learning algorithms could be trained with a limited set of well-chosen training examples, and the accuracy of the conversions into page counts could be easily monitored and exactly quantified until satisfactory accuracy and coverage are reached.
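The monitoring of conversion accuracy and coverage mentioned above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the actual implementation: the example field strings and their manual labels are hypothetical, and `roman_to_int`, `toy_convert`, and `evaluate` are names introduced here only for this example. The toy rule-based converter stands in for any conversion method, including a trained model.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical hand-labelled examples: raw page count field -> total pages.
LABELLED: Dict[str, int] = {
    "128 p.": 128,
    "[4], 60 p.": 64,
    "xii, 240 p.": 252,
    "ca. 100 p.": 100,
}

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text: str) -> int:
    """Convert a lowercase roman numeral such as 'xii' to an integer."""
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def toy_convert(raw: str) -> Optional[int]:
    """Toy converter (an assumption, stands in for any conversion method).

    Sums arabic, bracketed, and roman page sequences in fields ending in 'p.'.
    Returns None when any part of the field is not recognized.
    """
    body = raw.strip().lower()
    if not body.endswith("p."):
        return None
    total = 0
    for part in body[:-2].split(","):
        token = part.strip().strip("[]")
        if re.fullmatch(r"[0-9]+", token):
            total += int(token)
        elif re.fullmatch(r"[ivxlcdm]+", token):
            total += roman_to_int(token)
        else:
            return None
    return total


def evaluate(
    convert: Callable[[str], Optional[int]], labelled: Dict[str, int]
) -> Tuple[float, float]:
    """Return (coverage, accuracy) of a converter on labelled examples.

    coverage: share of fields for which the converter returns a value.
    accuracy: share of returned values that equal the manual label.
    """
    converted = {raw: convert(raw) for raw in labelled}
    covered = {raw: v for raw, v in converted.items() if v is not None}
    coverage = len(covered) / len(labelled)
    correct = sum(1 for raw, v in covered.items() if v == labelled[raw])
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy


coverage, accuracy = evaluate(toy_convert, LABELLED)
print(f"coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

Because `evaluate` only requires a function from a raw string to an optional integer, the same check could be rerun after each change to the converter, or after each round of additional training examples, until the desired accuracy and coverage are reached.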
Our approach provides a starting point and guidelines for more extensive integration of national catalogues. National bibliographies are essentially about mapping the national canon of publishing, but integrating data across borders should be managed in a way that takes into account specific local circumstances while also helping to overcome the national view in analyzing the past. Such integration can help scholarship to reach a more precise view of print culture beyond the confines of national bibliographies. Open availability of the raw data as well as the analysis methods is central for efficient, collaborative, and transparent research use of bibliographic collections in modern society. Although traditional data management policies do not support open sharing of these digital resources, the time is ripe for change. Open availability of bibliographic data collections and supporting data sources can foster innovative and nontraditional research use of the catalogues, as demonstrated in this article. In this rapidly changing field, a move toward more collaborative development of research methods can advance the transition from data management towards collaborative quality control and research. Our work demonstrates how comprehensive data harmonization is essential for accurate and useful data retrieval and relevant for the overall usability of the catalogue information, and how the available classification and subject analyses, geographical information, and other data can be utilized, augmented, enriched, and validated based on auxiliary information sources, such as digital maps. Integration of national bibliographies, special collections, and archives is relevant for international aspects of digital cataloguing.
As such, the work highlights specific bottlenecks and shortcomings in the available cataloguing and classification information, and can therefore provide relevant information for education, training, and management of cataloguing. Finally, we demonstrate how bibliographic catalogue records can be used as a digital research resource rather than a mere information retrieval tool.
What is the significance of this work and these publications relative to what has already been published, including a projection concerning other similar catalogues? This is a valuable part of the paper in itself. Yes, this should be opened up further / LL