Bibliographic data science is an iterative process in which improved understanding of the data and of historical trends can lead to enhancements in the data harmonization procedures, and to new, independent ways to validate the data and the observed patterns. Whereas most research algorithms are nowadays open source, many of the most comprehensive library catalogues are not yet generally available as open data, and may be difficult to obtain even for research purposes. The lack of open data forms a major bottleneck for the transparent and collaborative development of bibliographic data science, and for innovative integration and reuse of the available data and software resources.
Whereas large portions of data analysis can be automated, efficient and reliable research use requires collaboration between traditionally distinct disciplines, such as history, informatics, and data science, and finding the right combination and balance of expertise may prove challenging in practice. Data harmonization is only the starting point for our analysis, albeit an important one. The harmonized data sets can be further integrated and converted into Linked Open Data (LOD) [REFS] and other popular formats in order to utilize the vast pool of existing software tools. In addition to improving the overall data quality, and hence the value of LOD and other data infrastructures that focus on data management and retrieval, harmonization enables statistical analysis in scientific programming environments such as R [REFS] or Python [REFS], which provide advanced tools for modern data analysis and statistical inference. Hence, these two approaches serve different, complementary purposes.
Our analysis of the FNB demonstrates the advantages of open availability of library catalogues. The raw MARC entries of the FNB have been openly released by the National Library of Finland. We have now harmonized, augmented, and enriched these data within an open data-analytical ecosystem, and hereby release the final harmonized data set used in this study so that it can be further verified, investigated, and enriched by academics as well as the general public. This open availability allows us to demonstrate the advantages of a reproducible data analysis workflow, which provides a transparent account of every step from raw data to the final results.
The contents of bibliographic metadata collections are the product of at least three multi-layered historical processes. First, the digitization of traditional card catalogues may have excluded material that was regarded as less important or as covered elsewhere. Second, the early national bibliographies were in general compiled from existing bibliographies that had originally been collected for other purposes (FOOTNOTE: For a discussion on the Danish National Bibliography, see Horstbøll 1999**). Naturally, the national bibliographies have not been able to include everything published, although the effort towards completeness has been remarkable. Third, the records reflect different historical practices of printing and publishing. In eighteenth-century Sweden, for instance, the printing of laws and decrees formed a crucial part of political discourse and was of great economic value to the book industry (CITE: Rimm, A.-M. 2005a. Den kungliga boktryckaren, del 1. Biblis 30: 4–31; Rimm, A.-M. 2005b. Den kungliga boktryckaren, del 2. Biblis 31: 27–44.**), whereas in Britain this was the case to a much lesser degree. Such practices are noticeable in the bibliographic metadata collections, but they tell us more about printing practices as such than about other social and political phenomena, such as language relations, that we might want to study through the data. Any historically oriented study using national bibliographies must therefore be attentive to these historical layers in the data in order to propose reasonable interpretations of quantitative data analysis.
Language and format of early modern publications
The hand-press period is particularly fruitful for quantitative research because there were remarkably few changes in printing technology from 1450 to approximately the 1830s. It has been famously claimed that Gutenberg himself would have been able to operate a printing press in late eighteenth-century London, since it would have been so similar to the one found in mid-fifteenth-century Mainz. As revolutionary as the movable-type printing press was for early modern culture and economy in general, it is fortunate for our aspirations to understand the development of early modern publishing that there were no game-changing innovations in printing technology for the next 400 years or so after Gutenberg's time [FOOTNOTE: \cite{McKitterick2005}. On the relevance of the movable-type printing press, see \cite{eisenstein1980printing}. See also \cite{cipolla1972} and \cite{Pettegree2008}. On the economic impact of the printing press on early modern cities, see \cite{Dittmar_2011}. See also \cite{coldiron2015printers} and \cite{Coldiron_2004} **]. In our research on the metadata of different library catalogues we have come to realise that the relatively stable nature of printing opens up different avenues for cross-European research. For example, we can estimate the long-term development of book formats in some detail across Europe, which in turn is significant for understanding the relevance of printing for the establishment of the public sphere. This is why we have developed two cross-catalogue cases for this article, analysing the rise of the octavo format and the process of vernacularization in the early modern period. These cases also test the catalogues at their different levels of data harmonization and historical representativeness.
Both of these research cases represent large-scale, Europe-wide transformations that took place predominantly during the hand-press era, but inspecting them through several catalogues, and zooming in and out with the different publication profiles of European cities in mind, reveals intriguing variety. The cases also make it possible to discuss how the methods used, the varying levels of data harmonization, and gaps in the data affect the analyses, thus paving the way for new research and for guidelines for future data integration.
The rise of octavo in the Enlightenment period
The general trend in the catalogues that we have studied is that the octavo format supersedes other printing formats during the eighteenth century [FOOTNOTE: Henrik Horstbøll has previously studied the relevance of the octavo format for Danish publishing in detail, based on analogue methods and smaller samples. Our work confirms his findings and extends the scope to a much larger, cross-European data set. See \cite{Horstbøll1999}; \cite{Horstböll2009} and \cite{Horstböll2010}]. We can measure this with a simple title count of documents published in different formats, or we can study the paper consumed by these formats, in which case we focus on the print area of the documents rather than on the number of documents. We find the study of the print area particularly useful, and in this article we have chosen to examine the paper consumed in the printed documents. Here, we use two complementary measures: the print area, which quantifies the number of sheets used for unique copies of titles, and the paper consumption, which additionally takes the possibly variable print run estimates into account. Print area is a measure of the overall breadth of publishing activity, whereas paper consumption can be used to compare our findings with our earlier studies that have provided estimates of paper consumption. When we examine the publishing trends of book formats in the HPBD, we notice that at a general European level the rise of the octavo format is particularly strong during the eighteenth century. This is further supported by the ESTC and SNB (Fig. 1), where octavo is not only the fastest gainer in the market but also holds the largest share of the print area by the end of the eighteenth century. If we look at octavo shares in particular places in the HPBD, a striking feature is the octavo share in the German cities of Frankfurt (Supplementary Fig. 1), Leipzig (Supplementary Fig. 1), Halle (Supplementary Fig. 1) and Berlin (Supplementary Fig. 1). The manner in which folio drops and octavo rises on German soil during the eighteenth century suggests that the octavo format was the rising star of the Enlightenment.
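The print area and paper consumption measures described above can be made concrete with a short formalisation. The following is a minimal sketch under stated assumptions, not a definitive account of the estimation procedure: the symbols $p_i$, $g_i$, $s_i$, $r_i$, $A$ and $C$ are introduced here for illustration only. It assumes that each title $i$ has a known page count $p_i$, a format with $g_i$ leaves per sheet (2 for folio, 4 for quarto, 8 for octavo, 12 for duodecimo), and an estimated print run $r_i$. Since each leaf carries two pages, the number of sheets in one copy of title $i$ is
\begin{equation}
s_i = \frac{p_i}{2\, g_i},
\end{equation}
and the print area $A$ and the paper consumption $C$ of a set of titles can then be written as
\begin{equation}
A = \sum_i s_i, \qquad C = \sum_i s_i \, r_i .
\end{equation}
For example, a 320-page octavo requires $320 / (2 \times 8) = 20$ sheets per copy. With a hypothetical print run of 1000 copies, it contributes 20 sheets to $A$ and 20,000 sheets to $C$.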
Within such general Europe-wide trends there are of course local differences. For example, in Turku (Supplementary Fig. 1), and in Finland, which was part of Sweden at the time, the rise of octavo comes much later than in Sweden in general. This was because most of the documents printed in Finland were official documents, pamphlets and theses. Looking at the shares of the different formats in Turku, another way of putting this is that printing in Turku only takes off in the later eighteenth century, whereas in Stockholm the hand-press printing industry seems to have reached a different level of maturity earlier (Supplementary Fig. 1). The simplest explanation for the success of the octavo format is that it was particularly suited to smaller books that could be carried around and read practically anywhere, whereas quarto (and folio) were more commonly used for governmental and academic documents, for pamphlets, and for larger books alike, especially in the earlier centuries [FOOTNOTE: On the relationship between books and pamphlets, see \cite{raymond2003} **]. We have earlier analysed the relevance of the rise of octavo for book printing in the case of "history" publishing [CITE: \cite{Lahti_2015} **]. Of course, larger formats in book printing still carried a certain prestige in the eighteenth century, even as reading began to move partly out of stately mansion libraries and became more equal, and as the price of a book turned out to be a decisive factor for the dissemination of ideas [CITE: \cite{Allan_2013}. \cite{Allan2008}. \cite{Allan2008b}. \cite{Towsey_2010} **]. When considering quarto and octavo publications, it is quite telling that David Hume (1711-1776) wanted his History of England to be printed as a fine-paper, six-volume quarto set in the late 1760s (as it had appeared earlier), but the editions that were actually published after 1767 until Hume's death (including the 1778 posthumous edition) are octavo editions in eight volumes. The octavo editions might have lacked the exclusivity and finesse of heavier tomes with large margins that connoisseurs might have preferred for aesthetic reasons, but it was particularly the cheaper and smaller formats, octavo and duodecimo, that changed the nature and relevance of printing in the later part of the eighteenth century.