We have started to develop novel ways of addressing these needs by creating a data-analytical ecosystem designed to harmonize and integrate different sources of bibliographic metadata maintained by research libraries. We call this approach bibliographic data science; it is specifically targeted at enabling the use of bibliographic metadata as a research object. Whereas data management technologies, including LOD, have focused on data storage, management, and distribution, our efforts have a different, complementary target. We focus on enhancing the overall data quality and the commensurability between independently maintained metadata collections through systematic, large-scale harmonization and quality control. It is widely observed that bibliographic data contains many inaccurate entries, data collection biases, and missing pieces of information. Many of these issues can potentially be overcome. We aim to show how large-scale quantitative analysis of bibliographic metadata becomes reliable by turning to two historical research cases: the rise of the octavo format in European printing and the breakthrough of vernacular languages in public discourse.
Our analysis covers the overall publishing landscape in the period c. 1500-1800 based on a joint analysis of four large bibliographies, which has allowed us to assess publishing activity beyond what is accessible through individual national bibliographies alone. In particular, we have prepared the first harmonized versions of the Finnish and Swedish National Bibliographies (FNB and SNB, respectively), the English Short-Title Catalogue (ESTC), and the Heritage of the Printed Book Database (HPBD). The HPBD is a compilation of 45 smaller, mostly national, bibliographies [LINK:
https://www.cerl.org/resources/hpb/content **]. Altogether, these bibliographies cover over 5 million entries on works printed in Europe and elsewhere between c. 1470 and 1950. The original MARC files of these metadata collections include ... entries (Table XXX). At the same time, we demonstrate in the case of the Finnish National Bibliography (FNB) that the harmonized data can then be combined into LOD releases, opening new doors for quantitative research on national bibliographies.
Such a systematic approach has vast potential for wider implementation in related studies and other bibliographies. Our work indicates that national bibliographies have essentially been about mapping the national canon of publishing. Although print culture has obviously been tied to the nation and national culture, there have been cultural processes that transgressed national and state borders. Integrating data across the borders set by national bibliographies helps us get at those cross-border processes and trends, and to overcome the national view in analyzing the past.
Bibliographic data science
Quantitative, data-intensive research has not been the original or intended goal of analytical bibliography. Instead, a primary motivation for cataloguing has been to preserve as much information as possible about the original document and its physical creation, including potential errors caused by the printer [FOOTNOTE: for a good discussion of W. W. Greg and Fredson Bowers, who largely shaped the field, see \cite{analytical} **]. Thus, if for instance a place name is wrongly spelled, for cataloguing purposes it is relevant also to preserve that misspelling. For anyone wishing to take a quantitative approach to bibliographic metadata, this is a crucial point to understand and respect. Our work builds on traditional bibliographic research, and we use established definitions of bibliographic concepts where possible [FOOTNOTE: For most analytical bibliographical definitions, we rely on \cite{gaskell1995new} **]. Our use of the term bibliographic data science implies that bibliographic data is viewed as quantitative research material, and that systematic efforts are carried out on our part to facilitate this by ensuring data reliability and completeness.
Available bibliographic metadata is thus seldom readily amenable to quantitative analysis. Biases, inaccuracies, and gaps hinder the productive research use of bibliographic metadata collections, and varying standards and languages pose challenges for data integration, highlighting the need to reconsider the overall metadata collection and management methods. Moreover, the contents of bibliographic metadata collections are the products of at least three multi-layered historical processes. First, the digitization of traditional card catalogues may have meant the exclusion of material that was regarded as less important or covered elsewhere. Second, early national bibliographies have in general been based on collections of existing bibliographies that were originally compiled for other purposes [FOOTNOTE: For a discussion of the Danish National Bibliography, see Horstbøll 1999 **]. Naturally, the national bibliographies have not been able to include everything published, although the effort towards completeness has been remarkable in many cases. Third, the records reflect different historical practices of printing and publishing. In eighteenth-century Sweden, for instance, the printing of laws and decrees formed a crucial part of political discourse and was of great economic value to the book industry [CITE: Rimm, A.-M. 2005a. Den kungliga boktryckaren, del 1. Biblis 30: 4–31; Rimm, A.-M. 2005b. Den kungliga boktryckaren, del 2. Biblis 31: 27–44. **], whereas in Britain this was the case to a much lesser degree. Such practices are noticeable in the bibliographic metadata collections, but they tell us about printing practices specifically, not necessarily about other social and political phenomena, such as language relations, that we might want to study through the data.
Any historically oriented study using national bibliographies must therefore be attentive to these historical layers contained in the data in order to propose reasonable interpretations of the quantitative data analysis.
Our data harmonization follows similar principles and largely identical algorithms across all metadata collections. In this work, we focus on a few selected fields, namely publication year and place, language, and physical dimensions. We have removed spelling errors, disambiguated and standardized terms, augmented missing values, and developed custom algorithms, such as conversions from the raw MARC notation to numerical page count estimates [REFS]. We have also added derivative fields, such as
print area, which quantifies the overall number of sheets in distinct documents in a given period, and thus the overall breadth of printing activity, and we have used external data sources on authors, publishers, and places to enrich and verify bibliographic information. Automation, scalability, and quality control are critical, as the data collection sizes in this study are as high as 6 million [CHECK] entries in the HPBD. We are incorporating best practices and tools from data science, such as code libraries, unit tests, and cross-linking between data sources. Bibliographic data science is an iterative process in which improved understanding often leads to enhancements in data harmonization and validation that can be incorporated into the automated processing steps. An overview of the harmonized data sets and full algorithmic details of our analysis are available via the Helsinki Computational History Group website [LINK:
https://comhis.github.io/2019_CCQ **].
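To illustrate the kind of conversion involved, consider the sketch below. It is a deliberately simplified Python illustration, not the actual bibliographica implementation: the function name, regular expressions, and rules are our illustrative assumptions, and the real harmonization handles far more notation variants (roman numerals, plates, multi-volume statements, and so on).

```python
import re

def estimate_pages(extent):
    """Rough page count from a raw MARC extent statement (field 300$a).

    Simplified sketch: sums bracketed (unnumbered) page counts, plain
    arabic page counts, and leaves (1 leaf = 2 pages). Illustrative only.
    """
    total = 0
    # bracketed unnumbered sections, e.g. "[2]"
    total += sum(int(n) for n in re.findall(r"\[(\d+)\]", extent))
    # arabic page counts, e.g. "14 p."
    total += sum(int(n) for n in re.findall(r"(?<!\[)\b(\d+)\s*p\b", extent))
    # leaves carry two pages each, e.g. "24 leaves"
    total += sum(2 * int(n) for n in re.findall(r"(\d+)\s+leaves?\b", extent))
    return total
```

For example, an extent statement such as "[2], 14 p." would yield an estimate of 16 pages, and "24 leaves" an estimate of 48 pages.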
Ideally, such harmonization and validation efforts are fully transparent in terms of both data and source code. The cumulative harmonization process has equipped us with a vast body of methods that support the research use of bibliographic metadata collections, and we have collected these custom algorithms for bibliographic data science in the openly available bibliographica R package [LINK]. In contrast to code availability, however, many of the most comprehensive bibliographic metadata collections are not yet generally available as open data, and they may be difficult to obtain even for research purposes. This lack of data availability forms a major bottleneck for the transparent and collaborative development of bibliographic data science and for innovative reuse of the available resources. This might be gradually changing, however. The National Library of Finland, for instance, has recently made available the complete MARC entries of the FNB [LINK:
http://data.nationallibrary.fi/ **] under the CC0 open data license, and the harmonized version that we have used in this study is openly available in the supplementary material so that it can be further verified, investigated, and enriched by others. The harmonized data sets can be further integrated and converted into Linked Open Data [REFS] and other popular formats in order to utilize the vast pool of existing software tools. As a next step, we are planning to incorporate our validated harmonization algorithms into the existing Linked Open Data releases of the FNB. Combining large-scale harmonization with existing data management infrastructures could open new doors for research on national bibliographies.
Data harmonization and management is only the starting point for analysis, albeit an important one. In addition to improving the overall data quality, and hence the value of LOD and other data infrastructures that focus on data management and retrieval, harmonization enables statistical analysis with scientific programming environments such as R [REFS] or Python [REFS], which provide advanced tools for modern data analysis and statistical inference. These two approaches serve different, complementary purposes. Moreover, whereas large portions of data analysis can be automated, efficient and reliable research use requires collaboration between traditionally distinct disciplines, such as history, informatics, and data science. Finding the right combination of expertise may be challenging.
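As an illustration of the kind of analysis such harmonized fields enable, a few lines in a scripting environment suffice to aggregate publishing trends. The sketch below is in Python, with a hypothetical tuple layout of harmonized fields (publication year, gatherings, sheets); it is not our actual schema or pipeline, which is written in R.

```python
from collections import defaultdict

def octavo_share_by_decade(entries):
    """Share of octavo in the total print area (sheets) per decade.

    `entries` holds harmonized (year, gatherings, sheets) tuples;
    the layout is illustrative, not the actual data schema.
    """
    total = defaultdict(float)
    octavo = defaultdict(float)
    for year, gatherings, sheets in entries:
        decade = (year // 10) * 10
        total[decade] += sheets
        if gatherings == "octavo":
            octavo[decade] += sheets
    return {d: octavo[d] / total[d] for d in sorted(total)}
```

Given, say, two octavo entries and one folio entry, the function returns the octavo share of the total sheet count for each decade, which is the kind of aggregate plotted in our format analyses.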
Language and format of early modern publications
The hand-press period is particularly fruitful for quantitative research because there were remarkably few changes in printing technology from 1450 to approximately the 1830s. It has been famously claimed that Gutenberg himself would have been able to operate a printing press in late eighteenth-century London, since it would have been so similar to the one found in mid-fifteenth-century Mainz. As revolutionary as the movable-type printing press was for early modern culture and the economy in general, it is a good fortune for our aspirations to understand the development of early modern publishing that there were no game-changing innovations in printing technology for the next 400 years or so after Gutenberg's time [FOOTNOTE: \cite{McKitterick2005}. On the relevance of the movable-type printing press, see \cite{eisenstein1980printing}. See also \cite{cipolla1972} and \cite{Pettegree2008}. On the economic impact of the printing press on early modern cities: \cite{Dittmar_2011}. See also \cite{coldiron2015printers} and \cite{Coldiron_2004} **]. In our research on different bibliographic metadata collections we have come to realise that the relatively stable nature of printing opens up new avenues for cross-European research. For example, we can estimate the long-term development of book formats in some detail across Europe, which in turn is significant for understanding the relevance of printing for the establishment of the public sphere. This is why, for this article, we have developed two Europe-wide bibliographical metadata cases: the rise of the octavo format and the process of vernacularization in the early modern period. These cases also test the metadata collections at their different levels of data harmonization and their respective levels of historical representativeness.
Both research cases represent large-scale, European-wide transformations that took place predominantly during the hand-press era, but inspecting them through several metadata collections, and zooming in and out while bearing in mind the different publication profiles of European cities, shows intriguing variety. The cases also make it possible to discuss how the methods used, the varying levels of data harmonization, and gaps in the data affect the analyses, thus paving the way for new research and for guidelines for future data integration.
The rise of octavo in the Enlightenment period
The general trend in the metadata collections that we have studied is that the octavo format supersedes other printing formats during the eighteenth century [FOOTNOTE: Henrik Horstbøll has previously studied the relevance of the octavo format for Danish publishing in detail, based on analogue methods and smaller samples. Our work confirms his findings and further extends their scope by studying much larger, cross-European data. See \cite{Horstbøll1999}; \cite{Horstböll2009} and \cite{Horstböll2010} **]. We can measure this by a simple title count of documents published in different formats, or by studying the paper consumed by these formats, in which case we focus on the print area of the documents instead of counting the number of documents. We find the study of the print area particularly useful, and our choice in this article has been to examine the paper consumed by the printed documents. Here, we use two complementary measures: the print area, which quantifies the number of sheets used for unique copies of titles, and the paper consumption, which additionally takes possibly variable print run estimates into account. Print area is a measure of the overall breadth of publishing activity, and paper consumption can be used to compare our findings to our earlier studies that have provided estimates of paper consumption. When we examine the publishing trends of book formats in the HPBD, we notice that at a general European level the rise of the octavo format is particularly strong during the eighteenth century; this is further supported by the ESTC and SNB (Fig. 1), where octavo is not only the fastest gainer of market share but also holds the largest share of the print area by the end of the eighteenth century. If we look at particular places with respect to octavo share in the HPBD, a striking feature is the octavo share in the German cities of Frankfurt (Supplementary Fig. 1), Leipzig (Supplementary Fig. 1), Halle (Supplementary Fig. 1), and Berlin (Supplementary Fig. 1). The manner in which folio drops and octavo rises on German soil during the eighteenth century suggests that the octavo format was the rising star of the Enlightenment.
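The two measures can be made concrete with a small sketch. In the hand-press formats, a sheet folds into two leaves in folio, four in quarto, eight in octavo, and twelve in duodecimo, and each leaf carries two pages. The Python below is illustrative only: the flat print run figure is a placeholder assumption, not the estimates used in our analysis.

```python
# leaves obtained from one folded sheet in each gatherings type
LEAVES_PER_SHEET = {"folio": 2, "quarto": 4, "octavo": 8, "duodecimo": 12}

def sheets_per_copy(pages, gatherings):
    """Sheets of paper needed for one copy: each leaf carries two pages."""
    return pages / (2 * LEAVES_PER_SHEET[gatherings])

def print_area(docs):
    """Print area: total sheets over unique titles (one copy per title)."""
    return sum(sheets_per_copy(pages, g) for pages, g in docs)

def paper_consumption(docs, print_run=1000):
    """Paper consumption: print area weighted by an estimated print run.

    A flat per-title run is a placeholder; real estimates vary by title.
    """
    return sum(sheets_per_copy(pages, g) * print_run for pages, g in docs)
```

For instance, a 160-page octavo takes 10 sheets per copy and a 400-page folio 100 sheets, so together they contribute 110 sheets of print area; with an assumed run of 1,000 copies each, the paper consumption would be 110,000 sheets.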
Within these general Europe-wide trends there are, of course, local differences. For example in Turku (Supplementary Fig. 1), in Finland, which was part of Sweden at the time, the rise of octavo comes much later than in Sweden in general. This was because most of the documents printed in Finland were official documents, pamphlets, and theses. Looking at the share of the different formats in Turku, another way of saying this would be that printing in Turku only takes off in the later eighteenth century, whereas in Stockholm the hand-press printing industry seems to have reached a different level of maturity earlier (Supplementary Fig. 1). The simplest explanation for the success of the octavo format is that it was particularly suited for smaller books that could be carried around and read practically anywhere, whereas the quarto (and folio) were more commonly used for governmental and academic documents, pamphlets, and larger books alike, especially in the earlier centuries [FOOTNOTE: on the relationship between books and pamphlets, see \cite{raymond2003} **]. We have earlier analysed the relevance of the rise of octavo with respect to book printing in the case of "history" publishing [CITE: \cite{Lahti_2015} **]. Of course, larger formats in book printing carried a certain prestige in the eighteenth century, even as reading began to move beyond stately mansion libraries and become more egalitarian, and the price of a book became a decisive factor in the dissemination of ideas [CITE: \cite{Allan_2013}. \cite{Allan2008}. \cite{Allan2008b}.
\cite{Towsey_2010} **]. When considering quarto and octavo publications, it is quite telling that David Hume (1711-1776) wanted his History of England to be printed as a quarto-sized, fine-paper, six-volume set in the late 1760s (as it had appeared earlier), but the editions that were actually published after 1767, until Hume's death (including the posthumous 1778 edition), are octavo editions in eight volumes. The octavo editions might have lacked the exclusivity and finesse of heavier tomes with large margins, which connoisseurs might have preferred for aesthetic reasons, but it was precisely the cheaper and smaller formats, octavo and duodecimo, that changed the nature and relevance of printing in the later part of the eighteenth century.