Introduction
Library catalogues are essential tools in information science, and their use has been greatly advanced by digitalization. The need to manage and organize the ever-increasing body of digital information has motivated the development of new concepts and technologies, such as Linked Open Data (LOD), which was first introduced some twenty years ago and has been on the agenda of most national libraries since then. The metadata collections of published material that different libraries hold are particularly suitable for interlinking and enrichment with different semantic layers [FOOTNOTE: Good examples of LOD services among national libraries are \cite{finto} and \cite{data}. On the development of LOD in the library sector, see \cite{Yoose_2013} **]. LOD represents a crucial step in taking full advantage of digital resources through the integration of web sources and open, reusable metadata, and its enrichment.
This article relates closely to these efforts in national libraries, as it claims that it is extremely important that we can rely on data quality and subsequently make robust statistical claims based on metadata collections [FOOTNOTE: The relevance of this type of quantitative analysis of library catalogues has been recognized in \cite{Buringh_2009} and \cite{Baten_2008}; Suarez, M. F. 2009. Towards a bibliometric analysis of the surviving record, 1701–1800. In The Cambridge history of the book in Britain, vol. 5, ed. M. F. Suarez and M. L. Turner, 37–65. Cambridge: Cambridge University Press; Suarez, M. F. 2015. Book history from descriptive bibliographies. In The Cambridge companion to the history of the book, ed. L. Howsam, 199–219. Cambridge: Cambridge University Press; Bell, M. and J. Barnard. 1992. Provisional Count of STC Titles, 1475–1640. Publishing History 31 (1): 47–64; Weedon, A. 2009. The uses of quantification. In A Companion to the History of the Book, ed. S. Eliot and J. Rose, 33–49. London: Wiley Blackwell. **]. Thus, even though no cumulative integrated catalogue of bibliographic data will ever be perfect and free of errors, it can be sufficiently representative of important trends in the history of the book and knowledge production. This hypothesis carries huge research potential, but it has yet to be systematically explored and tested. The use of national bibliographies as a research resource, rather than as a mere information retrieval tool, has proven challenging, as obtaining valid conclusions depends critically not only on an overall understanding of the historical context but also on technical issues of data quality and completeness. Scalable solutions to these challenges, and subsequent research cases that build on statistical analysis of these data collections, have been missing.
We have started to address these needs by creating a data-analytical ecosystem designed to harmonize and integrate different sources of library catalogue metadata maintained by research libraries. We call this approach bibliographic data science; it is specifically targeted at enabling the use of library catalogues as research objects. Whereas data management technologies, including LOD, have focused on data storage, management, and distribution, our efforts have a different, complementary target: enhancing the overall data quality and commensurability between independently maintained library catalogues through systematic large-scale harmonization and quality control. It is widely observed that metadata collections contain large numbers of inaccurate entries, data collection biases, and missing information. Many of these issues can potentially be overcome. We aim to show how large-scale quantitative analysis of bibliographic metadata becomes reliable by turning to two historical research cases: the rise of the octavo format in European printing and the breakthrough of vernacular languages in public discourse.
Our analysis covers the overall publishing landscape in the period c. 1500-1800, based on a joint analysis of four large bibliographies, which has allowed us to assess publishing activity beyond what is accessible through national catalogues alone. In particular, we have prepared the first harmonized versions of the Finnish and Swedish National Bibliographies (FNB and SNB, respectively), the English Short-Title Catalogue (ESTC), and the Heritage of the Printed Book Database (HPBD). The HPBD is a compilation of 45 smaller, mostly national, bibliographies [LINK:
https://www.cerl.org/resources/hpb/content **]. Altogether, these bibliographies cover over five million entries on print products printed in Europe and elsewhere between c. 1470 and 1950. The original MARC files of these catalogues include ... entries (Table XXX). At the same time, we demonstrate in the case of the FNB that the harmonized data can then be combined into LOD releases, opening new doors for quantitative research on national bibliographies.
Such a systematic approach has vast potential for wider implementation in related studies and other bibliographies. Our work indicates that national bibliographies have essentially been about mapping the national canon of publishing. Although print culture has obviously been tied to the nation and national culture, there have been cultural processes that transgressed national and state borders. Integrating data across the borders set by national bibliographies helps us to get at those cross-border processes and trends and to overcome the national view in analyzing the past.
Bibliographic data science
Quantitative, data-intensive research has not been the original or intended goal of analytical bibliography. Instead, a primary motivation for cataloguing has been to preserve as much information as possible about the original document and its physical creation, including potential errors introduced by the printer [FOOTNOTE: For a good discussion of W. W. Greg and Fredson Bowers, who largely shaped the field, see \cite{analytical} **]. Thus, if, for instance, a place name is spelled incorrectly, for cataloguing purposes it is relevant to preserve that misspelling as well. For anyone wishing to take a quantitative approach to bibliographic metadata, this is a crucial point to understand and respect. Our work builds on traditional bibliographic research, and we use established definitions of bibliographic concepts where possible [FOOTNOTE: For most analytical bibliographical definitions, we rely on \cite{gaskell1995new} **].
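The principle of preserving catalogued spellings while still enabling aggregate analysis can be illustrated with a minimal sketch: harmonization keeps the verbatim field and records a normalized value next to it, rather than overwriting the original. The variant table, field names, and function below are hypothetical illustrations, not the actual harmonization rules of our pipeline.

```python
# A minimal sketch of field harmonization that preserves the original
# catalogue entry alongside a normalized value. The variant table and
# field names are illustrative assumptions only.

# Hypothetical spelling variants of imprint place names.
PLACE_VARIANTS = {
    "Stockholmiae": "Stockholm",
    "Holmiae": "Stockholm",
    "Aboae": "Turku",
}

def harmonize_place(raw: str) -> dict:
    """Return both the verbatim field and its harmonized form."""
    cleaned = raw.strip().strip(".,:;")      # drop surrounding punctuation
    return {
        "original": raw,                     # preserved exactly as catalogued
        "harmonized": PLACE_VARIANTS.get(cleaned, cleaned),
    }

for place in ["Holmiae,", "Aboae", "London"]:
    record = harmonize_place(place)
    print(record["original"], "->", record["harmonized"])
```

Keeping both fields means aggregate counts can be computed over the harmonized values while the catalogued spelling remains available for bibliographic inspection.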
Our use of the term bibliographic data science implies that bibliographic data is viewed as quantitative research material and that systematic efforts are made to facilitate this by ensuring data reliability and completeness. Available bibliographic metadata is, however, seldom readily amenable to quantitative analysis. Key challenges include data quality, availability, and the need for multidisciplinary expertise.