Future development could take increasing advantage of machine learning and borrow further methods from ecology and related fields that have well-established methods for spatio-temporal data analysis. Machine learning and artificial intelligence (AI) could significantly improve the scalability and accuracy of data harmonization and verification. For instance, the raw page count fields have a systematic structure. Instead of a lengthy process of manual algorithm construction, adaptive machine learning algorithms could be trained with a limited set of well-chosen training examples, and the accuracy of the conversions into page counts could be monitored and quantified until satisfactory accuracy and coverage are reached.
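The monitoring step described above can be illustrated with a minimal sketch. It assumes a small hand-labelled set of raw page count fields and reports coverage (the share of entries the converter can handle) and accuracy (the share of handled entries that match the label). The example entries, the function names, and the rule-based `toy_converter` are all hypothetical. The converter is only a stand-in for a trained model, and this is not the conversion procedure used in this study.

```python
import re
from typing import Callable, Optional

# Hypothetical hand-labelled examples: (raw page count field, total pages).
LABELLED: list[tuple[str, int]] = [
    ("[2], 34 p.", 36),
    ("viii, 120 p.", 128),
    ("48 p.", 48),
    ("[4], xii, 200, [6] p.", 222),
    ("[2] 34 p.", 36),  # missing comma: the toy converter gives up here
]

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text: str) -> int:
    """Convert a Roman numeral such as 'xiv' to an integer."""
    values = [ROMAN[ch] for ch in text.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (e.g. 'iv').
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def toy_converter(raw: str) -> Optional[int]:
    """Toy stand-in for a trained model (an assumption, not the real method).

    Sums comma-separated Arabic or Roman page sequences and returns None
    when the field cannot be interpreted.
    """
    if not re.search(r"\bp\.", raw):
        return None
    body = re.sub(r"\bp\.", "", raw)
    total = 0
    for token in body.split(","):
        token = token.strip().strip("[]").strip()
        if token.isdigit():
            total += int(token)
        elif token and all(ch in ROMAN for ch in token.lower()):
            total += roman_to_int(token)
        else:
            return None
    return total


def evaluate(
    converter: Callable[[str], Optional[int]],
    labelled: list[tuple[str, int]],
) -> tuple[float, float]:
    """Return (coverage, accuracy) of a converter on labelled examples."""
    answered = 0
    correct = 0
    for raw, expected in labelled:
        predicted = converter(raw)
        if predicted is None:
            continue  # not covered: the converter declined to answer
        answered += 1
        correct += int(predicted == expected)
    coverage = answered / len(labelled)
    accuracy = correct / answered if answered else 0.0
    return coverage, accuracy


if __name__ == "__main__":
    coverage, accuracy = evaluate(toy_converter, LABELLED)
    print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Because `evaluate` accepts any callable, the same loop could score a trained model instead of the toy rules, and could be rerun as training examples are added until both figures reach an acceptable level.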
National bibliographies are essentially about mapping the national canon of publishing, but integrating data across borders should be managed in a way that takes into account specific local circumstances while also helping to overcome the national view in analyzing the past. We are now expanding our pilot study on the Finnish and Swedish bibliographies towards large-scale integration of national bibliographies in the CERL Heritage of the Printed Book Database. Such integration can help scholarship to reach a more precise view of print culture beyond the confines of national bibliographies.
Importantly, our key observations, such as those regarding vernacularization, are supported by similar trends across multiple independently maintained bibliographies. This exemplifies the power of such an approach in uncovering broad patterns in knowledge production that are robust to occasional inaccuracies in the data. While documentation and polishing continue, we have made all source code openly available, so that every detail of the data processing can be independently investigated and verified. Obtaining valid conclusions depends on efficient and reliable harmonization and augmentation of the raw entries. This paper demonstrates how such challenges can be overcome by specifically tailored data-analytical ecosystems that provide scalable tools for data processing and analysis.