Bibliographic data science

Quantitative, data-intensive research has not been the original or intended goal of analytical bibliography. Instead, a primary motivation for cataloguing has been to preserve as much information about the original documents as possible, including potential errors and shortcomings [for a good discussion of W. W. Greg and Fredson Bowers, who largely shaped the field, see http://ihl.enssib.fr/analytical-bibliography-an-alternative-prospectus/definitions-of-bibliography-and-in-particular-of-the-variety-called-analytical]. Thus, if a place name is wrongly spelled, for cataloguing purposes it is relevant also to preserve that misspelling. Recently, however, the potential of library catalogues as a valuable data resource for large-scale statistical analysis of publishing history has started to draw attention [REFS]. Whereas library catalogues and national bibliographies have traditionally been used as search tools for document identification and information retrieval, recent studies have indicated the vast research potential of these data resources when the information is appropriately controlled and verified. We build on and expand the existing research traditions in this emerging area. Where possible, we use established definitions of bibliographic concepts. For most analytical bibliographical definitions, we rely on Philip Gaskell, A New Introduction to Bibliography (New Castle, Del.: Oak Knoll, 1977, rev. ed. 1995).
Our work prepares the ground for the large-scale quantitative research use of bibliographic information. Here, reliable data processing and quality control are essential, and can build on the latest advances in data science. We propose the term bibliographic data science to describe this emerging field, in which bibliographic catalogues are treated as quantitative research material. Relevant techniques range from data storage and retrieval to harmonization, enrichment, and statistical analysis. The harmonized data sets can be further converted into Linked Open Data (LOD) [REFS] and other popular formats in order to utilize the vast pool of existing software tools. As the value of LOD and other infrastructures critically depends on data quality, our efforts are complementary, and intended to increase the accuracy and overall value of the data. In addition to improving the overall data quality in linked data infrastructures, the harmonized data collections can be analysed in statistical programming environments such as R [REFS], Python [REFS], or Julia [REFS], in order to gain access to the latest advances in modern data analysis. The key difference between LOD and other infrastructure services on the one hand, and statistical programming environments on the other, is that the former focus on efficient data storage and retrieval, whereas the latter focus on data analysis and statistical inference. Hence, database infrastructures and statistical environments can serve different, complementary purposes.
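As a minimal sketch of the latter kind of use, a harmonized catalogue exported as a tabular file can be summarized directly in base R. The file name and column names below (publication_year, language_primary, id) are hypothetical and only illustrate the general idea, not our exact data layout.

# Minimal sketch (hypothetical file and column names): summarize a harmonized
# catalogue by decade and primary language in base R.
fnb <- read.csv("fnb_harmonized.csv", stringsAsFactors = FALSE)
fnb$decade <- 10 * floor(fnb$publication_year / 10)
titles_per_decade <- aggregate(id ~ decade, data = fnb, FUN = length)  # title counts per decade
language_shares <- prop.table(table(fnb$decade, fnb$language_primary), margin = 1)  # language shares per decade
head(titles_per_decade)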

Challenges in the research use of digital bibliographies

Library catalogues have great potential to provide new quantitative evidence of historical trends and to contribute to broader qualitative analysis and understanding. Unfortunately, the available library catalogues are seldom readily amenable to quantitative analysis. Key challenges include shortcomings in data quality and availability, and the need for multi-disciplinary expertise.
First, library catalogues tend to include large amounts of manually entered information, prone to mistakes and omissions. Varying notations and languages can pose serious challenges for automated harmonization efforts, and biases in data collection may further hinder productive research use. Hence, raw bibliographic records have to be systematically quality-controlled and harmonized. Second, the lack of openly available data is slowing down the development of bibliographic data science, as some of the most comprehensive library catalogues are not generally available even for research purposes. This forms a major bottleneck, as open data availability would greatly advance critical, collaborative, and cumulative efforts to design and utilize targeted data analysis algorithms, to identify and fix shortcomings in the data, and to integrate and reuse the available resources in innovative ways. Successful examples exist in other data-intensive fields, such as computational biology, where the open availability of commonly generated data resources and algorithms is an established norm. Third, although large parts of the data analysis could potentially be automated, efficient and reliable research use requires expertise from multiple, traditionally distinct academic and technical disciplines, such as history, informatics, data science, and statistics. Multi-disciplinary consortia with the critical combination of expertise are, however, more easily established in research plans than in practice.
To meet these challenges, and to facilitate generic research use of library catalogues, we propose new approaches for the systematic and scalable analysis of library catalogues, and demonstrate how integrative analysis of independent data collections can help to overcome the nationalistic emphasis of the individual catalogues. This work is based on four library catalogues that we have acquired for research use: the Finnish National Bibliography Fennica (FNB), the Swedish National Bibliography Kungliga (SNB), the English Short-Title Catalogue (ESTC), and the Heritage of the Printed Book Database (HPBD). The HPBD is a compilation of dozens [CHECK EXACT NUMBER?] of smaller, mostly national, bibliographies (https://www.cerl.org/resources/hpb/content). Altogether, these bibliographies cover millions of print products printed in Europe and elsewhere between 1470 and 1950. The original MARC files of these catalogues include ... entries (Table XXX).
Our harmonization of the records follows similar principles, and largely identical algorithms, across all catalogues. We have designed custom workflows to facilitate data parsing, harmonization, and enrichment. As the catalogue sizes in this study vary from the 70 thousand [CHECK] raw entries in the FNB to 6 million [CHECK] entries in the HPBD, automation and scalability are critical. In this work, we focus on a few selected fields: the document publication year and place, language, and physical dimensions. Summaries of the final data sets and full algorithmic details of the harmonization process are available via the Helsinki Computational History Group website (https://comhis.github.io/2019_CCQ/).
For publication years, we interpreted varying notation formats and removed spelling mistakes and apparent errors, such as impossible future years. For publication places, we mapped disambiguated city names to open geographic databases, in particular Geonames [REFS], and to manually curated city-country mapping lists. This unified treatment of geographical names allows integration across independently maintained catalogues. For multilingual documents, we separated the primary language from the other languages. The physical dimension fields of the MARC entries were converted to centimeters and standard gatherings. Finally, we developed custom algorithms to summarize the page count field of the MARC format into a single numeric page count [REFS], as implemented in the publicly available bibliographica R package. Where possible, we have filled in missing values and added derivative fields, such as print area, which quantifies the number of sheets used to print the documents in a given period, and thus the overall breadth of printing activity. Print area reflects the overall breadth of print products in a way that complements mere title counts, yet is independent of print run estimates. Beyond the difficulty of obtaining reliable print run estimates, the standard gatherings vary across time and place. A key innovation in our approach is that the available large-scale data allow us to rigorously estimate the varying gathering sizes in different places and time periods, and thus to obtain more accurate estimates for the missing entries and for overall paper consumption.
Thanks to automation, any shortcomings in the processing can be fixed and the complete data collection subsequently updated. Reliable research use relies on sufficiently accurate, unbiased, and comprehensive information, which is ideally verifiable against external sources. We have used external data sources, for instance on geographical places, to further complement, enrich, and verify the information available in the original library catalogues. We constantly monitor data processing quality through automated unit tests, cross-linking, manual curation, and matching with external databases. In this, we have incorporated best practices and tools from data science.
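To illustrate the kind of rule-based cleaning involved, the following sketch extracts a four-digit publication year from a free-form MARC date string and discards apparent errors such as years outside the plausible printing period. The function name, the cut-off values, and the example strings are illustrative; this is not the exact bibliographica implementation.

# Illustrative sketch of publication year harmonization: take the first
# four-digit sequence in a free-form date string and reject implausible years.
harmonize_year <- function(x, min_year = 1450, max_year = 1950) {
  m <- regexpr("[0-9]{4}", x)                 # position of first 4-digit run
  year <- rep(NA_integer_, length(x))
  hit <- m != -1
  year[hit] <- as.integer(regmatches(x, m))   # matched substrings, in order
  year[!is.na(year) & (year < min_year | year > max_year)] <- NA_integer_
  year
}

harmonize_year(c("Anno M.DC.LXXXVI. [1686]", "printed in the year 1644.", "[17--?]"))
# 1686 1644   NA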
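Place harmonization can be sketched in a similar spirit: spelling variants from the imprints are first mapped to a canonical city name via a curated synonym list, and the canonical names are then joined to country and coordinate information derived from Geonames. The file names, column names, and example variants (Latin imprint forms for London, Stockholm, and Turku) are hypothetical.

# Sketch of place harmonization (hypothetical files and columns):
# variant -> canonical city -> country and coordinates.
synonyms <- read.csv("place_synonyms.csv", stringsAsFactors = FALSE)        # columns: variant, city
geo      <- read.csv("city_country_geonames.csv", stringsAsFactors = FALSE) # columns: city, country, lat, lon

places <- data.frame(variant = c("Londini", "Holmiae", "Aboae"), stringsAsFactors = FALSE)
places <- merge(places, synonyms, by = "variant", all.x = TRUE)  # attach canonical city
places <- merge(places, geo, by = "city", all.x = TRUE)          # attach country and coordinates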
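The logic behind the print area field can also be made concrete. One printed sheet folded in folio yields 2 leaves (4 pages), in quarto 4 leaves (8 pages), in octavo 8 leaves (16 pages), and so on, so the per-copy sheet consumption is roughly the page count divided by the pages per sheet for the document's gatherings. The sketch below is illustrative, not the exact bibliographica implementation, and the gathering labels are only examples.

# Illustrative per-copy sheet consumption from page count and gatherings.
pages_per_sheet <- c("1to" = 2, "2fo" = 4, "4to" = 8, "8vo" = 16, "12mo" = 24)

sheets_per_copy <- function(pagecount, gatherings) {
  pagecount / pages_per_sheet[gatherings]
}

sheets_per_copy(480, "8vo")  # a 480-page octavo uses roughly 30 sheets per copy

The print area of a document could then be approximated by multiplying this sheet count by a sheet area estimated for the relevant period and place, which is where the gathering-size estimation described above enters.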
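Automated quality monitoring can likewise be expressed as explicit tests. A minimal sketch using the testthat package (assumed to be installed) is shown below; fnb refers to the hypothetical harmonized data frame from the earlier sketch, and the field name and range are illustrative.

# Sketch of an automated quality check on the harmonized data.
library(testthat)

test_that("harmonized publication years fall within the catalogue coverage", {
  ok <- is.na(fnb$publication_year) |
        (fnb$publication_year >= 1450 & fnb$publication_year <= 1950)
  expect_true(all(ok))
})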

Open bibliographic data science

[COULD INCLUDE A SCHEMATIC FIGURE WHICH OUTLINES THE CATALOGUES, HARMONIZATION, AND INTEGRATION?]

We propose a roadmap towards the systematic, large-scale standardization and integration of bibliographic information. In this work, we demonstrate the potential of this approach based on the first substantial implementations of such a data-analytical ecosystem at the level of national catalogues. This generic approach relies on a range of technologies, from database management to reproducible statistical data analysis and visualization. The overall workflow is modular and reproducible. Our open-source approach to code sharing, and our efforts to support existing standards where possible, facilitate the collaborative development of targeted data analysis algorithms. We are constantly taking advantage of, and contributing to, the growing body of open-source algorithms in the relevant research fields.