for download and reuse under the CC0 open data license
the harmonized data can then be combined into LOD releases, opening new doors for research on national bibliographies.
Although large portions of the data analysis can be automated, efficient and reliable research use requires collaboration between traditionally distinct disciplines, such as history, informatics, and data science, and finding the right combination and balance of expertise may prove challenging in practice. Data harmonization is only the starting point for our analysis, albeit an important one. The harmonized data sets can be further integrated and converted into Linked Open Data (LOD) [REFS] and other popular formats in order to utilize the vast pool of existing software tools. Harmonization improves overall data quality, and hence the value of LOD and other data infrastructures that focus on data management and retrieval. It also enables statistical analysis in scientific programming environments such as R [REFS] or Python [REFS], which provide advanced tools for modern data analysis and statistical inference. Hence, these two approaches, LOD infrastructures and statistical programming environments, serve different, complementary purposes.
Our analysis of the FNB demonstrates the advantages of openly available library catalogues. The raw MARC entries of the FNB have been openly released by the National Library of Finland. We have now harmonized, augmented, and enriched these data with the open data analytical ecosystem, and hereby release the final harmonized data set used in this study so that it can be further verified, investigated, and enriched by academics as well as the general public. Open availability also allows us to demonstrate the advantages of a reproducible data analysis workflow, which provides a transparent account of every step from the raw data to the final results.
The contents of bibliographic metadata collections are the product of at least three multi-layered historical processes. First, the digitization of traditional card catalogues may have excluded material that was regarded as less important or as covered elsewhere. Second, early national bibliographies have generally been compiled from existing bibliographies that were originally collected for other purposes (FOOTNOTE: For a discussion on the Danish National Bibliography, see Horstbøll 1999**). Naturally, the national bibliographies have not been able to include everything published, although the effort towards completeness has been remarkable. Third, the records reflect different historical practices of printing and publishing. In eighteenth-century Sweden, for instance, printed laws and decrees formed a crucial part of political discourse and were of great economic value to the book industry (CITE: Rimm, A.-M. 2005a. Den kungliga boktryckaren, del 1. Biblis 30: 4–31; Rimm, A.-M. 2005b. Den kungliga boktryckaren, del 2. Biblis 31: 27–44.**), whereas in Britain this was the case to a much lesser degree. Such practices are noticeable in the bibliographic metadata collections, but they tell us more about printing practices specifically than about other social and political phenomena, such as language relations, that we might want to study through the data. Any historically oriented study using national bibliographies must therefore be attentive to these historical layers in the data in order to propose reasonable interpretations of quantitative data analysis.
Substantiate: