
Bibliographic data science

Quantitative research is not the original or intended goal of analytical bibliography. Instead, a primary motivation for cataloguing has been to preserve as much information about the original documents as possible, including potential errors and shortcomings. Thus, if a place name is spelled wrong, for cataloguing purposes it is relevant to preserve that misspelling as well. Recently, the potential of bibliographies as a data resource for large-scale statistical analysis in publishing history has started to draw attention. Whereas bibliographies have traditionally been used as search tools for document identification and information retrieval, recent studies have indicated the vast research potential of these data resources when the information is appropriately controlled and verified.
Our work prepares the ground for more general large-scale research use of bibliographic information. Scalable and reliable data processing and quality control are essential components of this, and they rely on our ability to take advantage of the latest methods of data science, covering aspects that range from data storage to retrieval, harmonization, enrichment, and informative summaries. We suggest that the term bibliographic data science could be used to refer to this emerging multi-disciplinary approach, in which bibliographic catalogues are used as research material and subjected to comprehensive quantitative analysis. We build on and expand the existing research traditions in this emerging area. Where possible, we use established definitions of bibliographic concepts; for most analytical bibliographical definitions, we rely on Philip Gaskell, A New Introduction to Bibliography (New Castle, Del.: Oak Knoll, 1977, rev. ed. 1995). In addition, we propose novel concepts that are useful for our research purposes and could prove more widely informative in this field.
The relationship of this work to Linked Open Data (LOD) deserves to be made explicit from the outset, as LOD is a central consideration for almost everyone working with library catalogues. We view bibliographic data science as the groundwork on which linked data efforts can subsequently build: the systematic harmonization and quality control described below are prerequisites for publishing and interlinking bibliographic information as reliable open data. We therefore see the two approaches as complementary, while emphasizing that the data cleaning has to come first.

Challenges in the research use of digital bibliographies

Research on digital bibliographic information has vast potential to provide new quantitative evidence on the timing and relative magnitude of historical trends and events, and ultimately to inform qualitative understanding. Unfortunately, the available bibliographies are rarely readily amenable to large-scale quantitative analysis. Key challenges include limitations in data quality and availability, and the need for multi-disciplinary expertise.
Bibliographic data collections tend to include large portions of manually inserted information, which is prone to spelling errors, mistakes, missing or ambiguous information, duplicates, and varying notation conventions and languages, all of which can pose serious challenges for automated harmonization efforts. Bias in the data collection processes or in data quality may further hinder productive use of the bibliographies as a research resource. In practice, the raw bibliographic records have to be systematically harmonized and quality controlled. Reliable research use relies on sufficiently accurate, unbiased, and comprehensive information, which is ideally verifiable against external sources.
The lack of open availability of large-scale bibliographic data is another key challenge for the development of bibliographic data science. When key data resources are not openly available, severe limitations are placed on the critical, collaborative, and cumulative development of automated data harmonization and analysis algorithms; on the identification of errors, biases, and gaps in the data; on the integration of data across multiple bibliographies and supporting information sources; and on the innovative reuse of the available resources.
Finally, whereas large portions of the data analysis can be automated, efficient and reliable research use requires expertise from multiple, traditionally distinct academic and technical disciplines, such as history, informatics, data science, and statistics. Multi-disciplinary consortia with the critical combination of expertise are more easily established on paper than in practice.
In our recent work on the Swedish and Finnish national bibliographies, which focuses on publication patterns in Sweden and Finland during the period 1640-1910, we have encountered specific and largely overlooked challenges in using bibliographic catalogues for historical research. Based on this work on the large-scale integrative analysis of national bibliographies, we aim to demonstrate solutions to some of these challenges. Moreover, we demonstrate how the integration of bibliographic information carries remarkable research potential for overcoming the nationalistic emphasis of the individual catalogues.

Bibliographies used in this study

This work is based on four bibliographies that we have acquired for research use from the respective research libraries: the Finnish National Bibliography Fennica (FNB), the Swedish National Bibliography Kungliga (SNB), the English Short-Title Catalogue (ESTC), and the Heritage of the Printed Book Database (HPBD). The HPBD is a compilation of dozens [CHECK EXACT NUMBER?] of smaller, mostly national, bibliographies (https://www.cerl.org/resources/hpb/content).
Altogether, these bibliographies cover millions of print products printed in Europe and elsewhere between 1470 and 1950. The original MARC files of these catalogues include ... entries (Table XXX).
Furthermore, our analysis of the FNB demonstrates the research potential of openly available bibliographic data resources. We have substantially enriched and augmented the raw MARC entries that have been openly released by the National Library of Finland. Open availability of the source data allows us to implement reproducible data analysis workflows, which provide a transparent account of every step of the analysis from raw data to the final summaries. In addition, the open licensing of the original data allows us to share our enriched version [WE NEED TO CHECK THAT WE ALSO HAVE PERMISSION TO USE ALL OF THE EXTERNAL DATA SETS USED FOR THE ENRICHMENT!] openly, so that it can be further verified, investigated, and enriched by other investigators. Although we do not have permission to provide access to the original raw data entries of the other catalogues, we are releasing the full source code of our algorithms. With this, we aim to contribute to the growing body of tools that are specifically tailored for use in this field. Moreover, we hope that the increasing availability of open analysis methods can pave the way towards the gradual opening of bibliographic data collections. This could follow related successes in other fields, such as the human genome sequencing project and subsequent research programs, which critically rely on centrally maintained and openly licensed data resources, as well as on the thousands of algorithmic tools that have been independently built by the research community to draw information and insights from these data collections [REFS].
The data harmonization follows similar principles across all catalogues. In brief, we have built custom workflows to facilitate raw data access, parsing, filtering, entry harmonization, enrichment, validation, and integration. We have paid particular attention to the automation and scalability of the approach, as the sizes of the bibliographies in this study vary from the 70 thousand [CHECK] raw entries of the FNB to the 6 million [CHECK] entries of the HPBD. In this analysis, we have focused on a few key fields: the publication year and place, the language, and the physical dimensions of each document.
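The overall flow can be illustrated with a minimal Python sketch; this is an illustration rather than the implementation used in the study, and it assumes MARC transmission files read with the pymarc library. The MARC tags (260/264 for the imprint, 300 for the physical description, 008 for the language code) follow standard conventions, while the dictionary keys and function names are arbitrary.

```python
# Illustrative sketch of the raw data access step: read MARC records and
# collect the raw values of the fields used in this study, to be passed on
# to the field-specific harmonizers sketched below.
from pymarc import MARCReader

def extract_raw_fields(record):
    """Collect raw values of selected MARC fields from one record."""
    def first_subfield(tag, code):
        for field in record.get_fields(tag):
            values = field.get_subfields(code)
            if values:
                return values[0]
        return None

    return {
        "publication_place": first_subfield("260", "a") or first_subfield("264", "a"),
        "publication_year_raw": first_subfield("260", "c") or first_subfield("264", "c"),
        "extent": first_subfield("300", "a"),
        # The dimensions subfield often carries the gatherings (e.g. '4:o').
        "dimensions": first_subfield("300", "c"),
        "language_code": record["008"].data[35:38] if record["008"] else None,
    }

def read_catalogue(path):
    """Yield one dictionary of raw field values per record in a MARC file."""
    with open(path, "rb") as handle:
        for record in MARCReader(handle):
            if record is not None:  # pymarc yields None for unparseable records
                yield extract_raw_fields(record)
```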
For the publication year, the harmonization includes the identification of varying notation formats, such as free text, Arabic, and Roman numerals, and the removal of spelling mistakes and erroneous entries.
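As an illustration of this step, the following sketch extracts a year from a raw imprint date: it recognizes Arabic four-digit years in free text, converts Roman numerals such as 'M.DCC.XLII.', and rejects values outside a plausible printing period. The cut-off years and function names are illustrative choices, not the rules of the actual workflow.

```python
import re

ROMAN_VALUES = {"M": 1000, "D": 500, "C": 100, "L": 50, "X": 10, "V": 5, "I": 1}

def roman_to_int(numeral):
    """Convert a Roman numeral such as 'MDCCXLII' to an integer (1742)."""
    total, prev = 0, 0
    for char in reversed(numeral.upper()):
        value = ROMAN_VALUES[char]
        total += value if value >= prev else -value
        prev = value
    return total

def harmonize_year(raw, earliest=1450, latest=1950):
    """Extract a plausible publication year from a raw imprint string."""
    if not raw:
        return None
    # Arabic notation: take the first four-digit number found in the free text.
    match = re.search(r"\b(1[0-9]{3})\b", raw)
    if match:
        year = int(match.group(1))
    else:
        # Roman notation, possibly with printer's punctuation ('M.DCC.XLII.').
        # Real entries contain many more exceptions than are handled here.
        cleaned = re.sub(r"[^IVXLCDMivxlcdm]", "", raw)
        if not cleaned:
            return None
        year = roman_to_int(cleaned)
    return year if earliest <= year <= latest else None
```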
For the publication place, the names are harmonized and disambiguated through a combination of string clustering, manually constructed correction lists, mapping to open geographic databases, in particular Geonames [REFS and CHECK THE REST], and city-country mappings, followed by verification of data quality and coverage. The mapping takes the different languages into account, and we have unified the approach so that it allows the integration of geographical information across catalogues.
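A simplified version of the place-name step might look as follows: normalize the raw string, apply a manually curated correction list (covering, for example, Latin forms), and fall back to fuzzy matching against a canonical list of place names, which in the full workflow is linked to Geonames identifiers and country codes. The correction and canonical lists shown here are tiny illustrative samples, and the similarity threshold is arbitrary.

```python
import difflib
import re

# Tiny illustrative samples; the real lists contain thousands of entries
# and are linked to Geonames identifiers and country codes.
MANUAL_CORRECTIONS = {"holmiae": "Stockholm", "aboae": "Turku", "helsingfors": "Helsinki"}
CANONICAL_PLACES = ["Stockholm", "Turku", "Helsinki", "Uppsala", "London"]

def harmonize_place(raw):
    """Map a raw publication place string to a canonical place name, or None."""
    if not raw:
        return None
    # Normalize case, punctuation, and surrounding whitespace.
    key = re.sub(r"[\[\]\.,:;]", " ", raw).strip().lower()
    key = re.sub(r"\s+", " ", key)
    # 1. Manually curated corrections for known variants and Latin forms.
    if key in MANUAL_CORRECTIONS:
        return MANUAL_CORRECTIONS[key]
    # 2. Exact match against the canonical list.
    for place in CANONICAL_PLACES:
        if key == place.lower():
            return place
    # 3. Fuzzy match to catch spelling variants; the cutoff is illustrative.
    close = difflib.get_close_matches(key, [p.lower() for p in CANONICAL_PLACES],
                                      n=1, cutoff=0.85)
    if close:
        return next(p for p in CANONICAL_PLACES if p.lower() == close[0])
    return None  # left for manual inspection or later enrichment
```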
For the language information, we distinguish between the primary language and additional languages, account for misleading cataloguing practices such as the augmentation of missing entries with a default choice, and map the languages to standardized names.
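This step can be sketched as a mapping from MARC language codes to standardized names, while keeping track of whether a record lists one or several languages and flagging values that look like cataloguing defaults. The code table below is a small illustrative excerpt, and the flagging rule is a simplification.

```python
# Small illustrative excerpt of the MARC / ISO 639-2 language code table.
LANGUAGE_NAMES = {"fin": "Finnish", "swe": "Swedish", "lat": "Latin",
                  "eng": "English", "ger": "German",
                  "mul": "Multiple languages", "und": "Undetermined"}

def harmonize_languages(codes):
    """Map raw MARC language codes to standardized names.

    Returns (primary, all_names, flagged), where `flagged` marks entries whose
    value looks like a cataloguing default rather than genuine information.
    """
    codes = [code.strip().lower() for code in (codes or []) if code and code.strip()]
    names = [LANGUAGE_NAMES.get(code, code) for code in codes]
    primary = names[0] if names else None
    flagged = primary is None or primary == "Undetermined"
    return primary, names, flagged
```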
For the physical dimensions, which include the gatherings and the page count, similar initial data cleaning steps are followed by a more in-depth analysis of the varying exceptions and notation conventions, and by a final validation.
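A first cleaning pass for the physical dimensions might normalize the gatherings statement and extract a page count from the extent statement, as in the sketch below. The notation variants covered here are only a small subset of those encountered in practice, and summing all numbers in the extent field is a deliberately crude approximation.

```python
import re

# A few common gatherings notations mapped to normalized labels; only a small
# subset of the variants encountered in the catalogues.
GATHERINGS = {"2:o": "folio", "fol": "folio",
              "4:o": "quarto", "4to": "quarto",
              "8:o": "octavo", "8vo": "octavo",
              "12:o": "duodecimo", "12mo": "duodecimo"}

def harmonize_gatherings(raw):
    """Normalize a gatherings statement such as '8:o', '8vo', or 'Fol.'."""
    if not raw:
        return None
    return GATHERINGS.get(raw.strip().lower().rstrip("."))

def harmonize_pages(raw_extent):
    """Approximate the total page count of an extent statement such as
    '[4], 232, [8] s.' by summing its numeric parts (a crude first pass)."""
    if not raw_extent:
        return None
    numbers = [int(n) for n in re.findall(r"\d+", raw_extent)]
    return sum(numbers) if numbers else None
```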
We have used external sources of metadata, for instance, on authors, publishers, or geographical places, to further enrich and verify the information that is available in the bibliographies.
We also quantify data coverage, reporting for each key field the number and share of entries with complete and validated values.
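Such coverage summaries can be produced directly from the harmonized table; a minimal sketch using pandas, assuming one row per document and one column per harmonized field (the example values are hypothetical):

```python
import pandas as pd

def coverage_summary(entries: pd.DataFrame) -> pd.DataFrame:
    """Per-field coverage: number and share of non-missing harmonized values."""
    available = entries.notna().sum()
    return pd.DataFrame({
        "available": available,
        "coverage": (available / len(entries)).round(3),
    })

# Hypothetical harmonized entries, one row per document:
harmonized = pd.DataFrame({
    "publication_year": [1742, 1801, None],
    "publication_place": ["Stockholm", None, "Turku"],
    "language": ["Swedish", "Finnish", "Finnish"],
})
print(coverage_summary(harmonized))
```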
While documentation and polishing continue, we have made all source code openly available, so that every detail of the data processing can be independently investigated and verified.
[MT: We need explanations of how the different estimates have been made. These should almost be written up separately first, so that they can also be used elsewhere; after that they can be integrated into the text and perhaps shortened. I mean, for instance, a description of how missing format information has been completed. -> LL: Right, this is one sub-project within the catalogue cleaning. It cannot simply be added here in full; each field needs to be gone through in more detail to document which steps have been taken. This will become part of the cleaning work and has long been on the work list. That work will not be done now; instead, I will write brief, compact descriptions here, which can later be expanded into proper documentation within the cleaning pipelines.]

Towards a unified view: catalogue integration

Obtaining valid conclusions depends on the efficient and reliable harmonization and augmentation of the raw entries. This paper demonstrates how such challenges can be overcome by specifically tailored data analytical ecosystems that provide scalable tools for data processing and analysis. A further key step in catalogue integration is the recognition of duplicate entries across the catalogues (a simplified approach is sketched below). Furthermore, we show how external sources of metadata, for instance on authors, publishers, or geographical places, can be used to enrich and verify bibliographic information. This type of ecosystem has potential for wider implementation in related studies and other bibliographies.
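One simple way to approach duplicate recognition is to block candidate pairs on harmonized fields such as publication year and place, and to compare normalized titles within each block. The sketch below illustrates this idea; the similarity threshold and field names are illustrative and not the criteria used in our workflow.

```python
import difflib
import re
from collections import defaultdict

def title_key(title):
    """Normalize a title for comparison: lowercase, drop punctuation, squeeze spaces."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())

def find_duplicate_candidates(entries, threshold=0.9):
    """Return likely duplicate pairs among harmonized entries.

    `entries` is a list of dicts with 'publication_year', 'publication_place',
    and 'title' fields. Pairs are blocked on (year, place) so that only
    titles within the same block are compared.
    """
    blocks = defaultdict(list)
    for index, entry in enumerate(entries):
        blocks[(entry["publication_year"], entry["publication_place"])].append(index)

    candidates = []
    for indices in blocks.values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                i, j = indices[a], indices[b]
                similarity = difflib.SequenceMatcher(
                    None, title_key(entries[i]["title"]), title_key(entries[j]["title"])
                ).ratio()
                if similarity >= threshold:
                    candidates.append((i, j, similarity))
    return candidates
```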

Open bibliographic data science

Key elements of open bibliographic data science include data organization, data and code sharing, open interfaces, reusable software modules, and shared analytical ecosystems.