; Bell and Barnard 1992; Weedon 2009 **] Thus, even though no cumulative integrated catalogue of bibliographic data will be perfect and free from errors, it can be sufficiently representative of important trends in the history of the book and knowledge production. This hypothesis carries considerable research potential, but it has yet to be systematically explored and tested. Using national bibliographies as a research resource, rather than as a mere information retrieval tool, has proven challenging, because obtaining valid conclusions depends critically not only on an overall understanding of the historical context but also on technical issues of data quality and completeness. Scalable solutions to these challenges, and subsequent research cases that build on statistical analysis of these data collections, have been missing.
We have started to develop novel ways of addressing these needs by creating a data analytical ecosystem designed to harmonize and integrate different sources of library catalogue metadata maintained by research libraries. We call this approach bibliographic data science; it is specifically targeted at enabling the use of library catalogues as research objects. Whereas data management technologies, including LOD, have focused on data storage, management, and distribution, our efforts have a different, complementary target. We focus on enhancing the overall data quality and commensurability between independently maintained library catalogues through systematic large-scale harmonization and quality control. It is widely observed that metadata collections contain many inaccurate entries, data collection biases, and missing information. Many of these issues can potentially be overcome. We aim to show how large-scale quantitative analysis of bibliographic metadata can be made reliable.
Our analysis covers the overall publishing landscape in the period c. 1500-1800, based on a joint analysis of four large bibliographies, which has allowed us to assess publishing activity beyond what is accessible through national catalogues alone. In particular, we have prepared the first harmonized versions of the Finnish and Swedish National Bibliographies (FNB and SNB, respectively), the English Short-Title Catalogue (ESTC), and the Heritage of the Printed Book Database (HPBD). The HPBD is a compilation of 45 smaller, mostly national, bibliographies [LINK: https://www.cerl.org/resources/hpb/content **]. Altogether, these bibliographies cover over 5 million entries on print products printed in Europe and elsewhere between c. 1470 and 1950. The original MARC files of these catalogues include ... entries (Table XXX). At the same time, we demonstrate in the case of the FNB that the harmonized data can then be combined into LOD releases, opening new doors for quantitative research on national bibliographies.
Such a systematic approach has vast potential for wider implementation in related studies and other bibliographies. Our work indicates that national bibliographies have essentially been about mapping the national canon of publishing. Although print culture has obviously been tied to the nation and national culture, there have been cultural processes that transgressed national and state borders. Integrating data across the borders set by national bibliographies helps us to get at those cross-border processes and trends, and to overcome the national view in analyzing the past.
Bibliographic data science
Quantitative, data-intensive research has not been the original or intended goal of analytical bibliography. Instead, a primary motivation for cataloguing has been to preserve as much information about the original document and its physical creation as possible, including potential errors caused by the printer [FOOTNOTE: for a good discussion of W. W. Greg and Fredson Bowers, who largely shaped the field, see \cite{analytical} **]. Thus, if for instance a place name is wrongly spelled, for cataloguing purposes it is also relevant to preserve that misspelling. For anyone wishing to take a quantitative approach to bibliographic metadata, this is a crucial point to understand and respect. Our work builds on traditional bibliographic research, and we use established definitions of bibliographic concepts where possible. [FOOTNOTE: For most analytical bibliographical definitions, we rely on (gaskell1995new) **].
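As a minimal illustration of how a harmonized value can sit alongside the catalogued form without overwriting it, consider the following Python sketch. It is not the implementation used in this work: the lookup table, the `PlaceField` structure, and the `harmonize_place` function are hypothetical names introduced only for this example, and the spelling variants are illustrative rather than drawn from the catalogues.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table from variant spellings to a harmonized name.
# Entries are illustrative assumptions, not data from the catalogues.
PLACE_SYNONYMS = {
    "londini": "London",
    "london": "London",
    "lodnon": "London",      # an invented misprint, for illustration
    "holmiae": "Stockholm",
    "stockholm": "Stockholm",
    "aboae": "Turku",
    "åbo": "Turku",
}


@dataclass(frozen=True)
class PlaceField:
    original: str               # the string exactly as catalogued
    harmonized: Optional[str]   # None when no mapping is known


def harmonize_place(raw: str) -> PlaceField:
    """Return the catalogued string together with a harmonized name."""
    # Normalize only the lookup key; the catalogued string is kept untouched.
    key = raw.strip(" .,:;[]").casefold()
    return PlaceField(original=raw, harmonized=PLACE_SYNONYMS.get(key))


for raw in ["Londini:", "[Holmiae]", "Lodnon", "Utopia"]:
    print(harmonize_place(raw))
```

Keeping both fields means that quantitative summaries can use the harmonized name while the misspelling recorded by the cataloguer remains available, and unmapped values stay visible as `None` rather than being silently guessed.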
Our use of the term bibliographic data science implies that bibliographic data is viewed as quantitative research material, and that we make systematic efforts to ensure data reliability and completeness. Available bibliographic metadata, however, is seldom readily amenable to quantitative analysis. Key challenges include data quality, availability, and the need for multi-disciplinary expertise.