Bibliographic data science is an emerging research area, focusing on statistical analysis and research use of library catalogues. In addition to providing novel approaches that can support qualitative research, it can add significant value to Linked Open Data and other infrastructures by providing new techniques to monitor and improve data quality. Whereas library catalogues are traditionally used for information storage and retrieval, we have demonstrated that systematic large-scale harmonization of the raw entries can fill an important gap in the research use of library catalogues, and help to realize their remarkable research potential in publishing history.
National bibliographies [CATALOGUES VS BIBLIOGRAPHIES?] are essentially about mapping the national canon of publishing, but integrating data across borders should be managed in a way that takes into account specific local circumstances while also helping to overcome the national view in analysing the past. We have now expanded our pilot study on the Finnish and Swedish bibliographies towards large-scale integration of national bibliographies in the CERL Heritage of the Printed Book Database. Our harmonization and integration efforts are far from complete, but they clearly demonstrate how such integration can help scholarship to reach a more precise view of print culture beyond the confines of national bibliographies.
Importantly, our key observations, for instance regarding vernacularization, are supported by similar trends across multiple independently maintained bibliographies. This exemplifies the power of such an approach in uncovering broad patterns in knowledge production that are robust to occasional inaccuracies in the data. While documentation and polishing continue, we have made all source code openly available, so that every detail of the data processing can be independently investigated and verified. Obtaining valid conclusions depends on efficient and reliable harmonization and augmentation of the raw entries. This paper demonstrates how such challenges can be overcome by specifically tailored data analytical ecosystems that provide scalable tools for data processing and analysis.
There also seems to be a correlation between the language of a document and its format. Comparing (USA GATHERINGS, ESTC) books published in London in English (fig.???), Latin (fig.???) and other languages (fig.???) suggests that duodecimo was the preferred format for books printed in languages other than English and Latin, whereas octavo was used proportionally more in Latin books than in the others. The small share of folio documents in Latin is particularly interesting, and the quarto share of Latin books in London is also noteworthy (fig.???).
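The comparison above rests on the proportional share of each format within a language group, rather than on raw counts. A minimal sketch of that calculation follows. The records are invented toy data, not ESTC entries, and the function name `format_shares` is a hypothetical helper introduced only for illustration; it is not part of the project's released workflow.

```python
from collections import Counter, defaultdict

# Toy (language, format) records, invented for illustration only.
# These are NOT ESTC data and do not reproduce the figures in the text.
records = [
    ("English", "octavo"), ("English", "octavo"),
    ("English", "quarto"), ("English", "folio"),
    ("Latin", "octavo"), ("Latin", "octavo"),
    ("Latin", "octavo"), ("Latin", "quarto"),
    ("Other", "duodecimo"), ("Other", "duodecimo"), ("Other", "octavo"),
]


def format_shares(records):
    """Return {language: {format: share}}; shares sum to 1 within each language."""
    counts = defaultdict(Counter)
    for language, fmt in records:
        counts[language][fmt] += 1
    shares = {}
    for language, fmt_counts in counts.items():
        total = sum(fmt_counts.values())
        shares[language] = {fmt: n / total for fmt, n in fmt_counts.items()}
    return shares


shares = format_shares(records)
for language in sorted(shares):
    for fmt, share in sorted(shares[language].items()):
        print(f"{language:8s} {fmt:10s} {share:.2f}")
```

Normalizing within each language is what makes groups of very different size comparable; with real data the same shares would typically be computed per decade as well, so that changes over time are not conflated with differences between languages.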
We have investigated four bibliographies: the FNB, SNB, ESTC, and HPBD. Each catalogue is associated with a similar open source harmonization workflow, which provides a detailed and transparent account of the data processing steps from raw data harmonization to the final statistical analysis, summaries, and visualization. Furthermore, we have shown how external sources of metadata, for instance on authors, publishers, or places, can be used to enrich and verify the information. Future developments could take increasing advantage of machine learning, ecology, and related fields that have well-established methods for spatio-temporal data analysis. Adaptive machine learning could significantly improve the scalability of data harmonization: models could be trained with a limited set of well-chosen training examples, and the accuracy of the conversions could be monitored until satisfactory accuracy and coverage are reached. This type of data analytical ecosystem has potential for wider implementation in related studies and other bibliographies, as many of the data analytical problems encountered here are common in digital humanities.
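To make the idea of monitoring coverage during harmonization concrete, the following is a minimal sketch for a single field, the publication year. It uses a simple regular-expression rule rather than machine learning, and it is not the harmonization workflow released with this project. The raw strings are invented, the accepted range of 1400 to 1999 is an arbitrary assumption, and the names `harmonize_year` and `coverage` are hypothetical.

```python
import re

# Assumption: a valid publication year is a standalone four-digit number
# between 1400 and 1999. The range is arbitrary and chosen for illustration.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[4-9]\d{2})(?!\d)")


def harmonize_year(raw):
    """Return the first plausible four-digit year found in raw, or None."""
    match = YEAR_PATTERN.search(raw)
    return int(match.group(1)) if match else None


def coverage(raw_entries):
    """Fraction of raw entries for which a year could be extracted."""
    if not raw_entries:
        return 0.0
    converted = [harmonize_year(raw) for raw in raw_entries]
    return sum(year is not None for year in converted) / len(raw_entries)


# Invented raw entries in the style of catalogue year fields.
raw_years = ["1757.", "[1640]", "Anno 1702", "printed in the year 1688", "s.a.", "MDCCX"]

for raw in raw_years:
    print(repr(raw), "->", harmonize_year(raw))
print(f"coverage: {coverage(raw_years):.2f}")
```

Entries that the rule cannot convert, here the undated entry and the Roman numeral, remain visible as gaps in the coverage figure. These are the cases that would guide the next refinement of the rules, or the choice of training examples for a learned model.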
Open availability of the raw data as well as the analysis methods is central to efficient, collaborative, and transparent research use of bibliographic collections in modern society. We hope that open availability of data and methods, such as the ones released in this project, can pave the way towards open availability of the library catalogues. In some other fields of science, open data availability has already been established as a standard research practice. For instance, the human genome sequencing project and subsequent research programs have critically relied on open data sharing as well as on the vast body of open source algorithms that have been collaboratively developed by the research community [REFS]. We seek to advance open research by releasing a notably improved version of the Finnish national bibliography FNB. We hope that our work thereby sets an example of a dedicated open science project that aims to open the complete research workflow for collaborative criticism and development.
Although our current work is based on the analysis of national catalogues, it helps to challenge the nationalistic view of individual catalogues and paves the way towards large-scale data integration. A number of key challenges in enhancing data quality remain to be overcome. Nevertheless, we have demonstrated that significant historical trends, such as the rate of change in language use or book sizes, are often overwhelmingly clear and are seen across multiple independently collected catalogues. Integrative analysis can thus help to verify the information and provide complementary views of the universally observed historical trends. Our systematic approach provides a starting point, guidelines, and a set of practically tested algorithms for more extensive analysis and integration.

Conclusion [300 words provisionally agreed - we are now roughly there]

This work represents a new research program, and associated technologies, for expanding the research potential of bibliographic cataloging and classification. We discuss the future implications of these methods. The work covers key research aspects of the content, management, use, and usability of bibliographic records, with further implications for the underlying principles, functions, and techniques of descriptive cataloging. It combines traditional and contemporary elements of research, and joins theory and scholarly research with a practical application. [NOTE: THESE POINTS WERE PICKED FROM THE JOURNAL'S SCOPE - MIGHT BE GOOD TO TRY TO TAKE THEM INTO ACCOUNT(?)] National bibliographies can provide comprehensive quantitative insights into the overall historical dynamics of the evolving publishing landscape across time and geography. Biases in data collection or quality may, however, seriously hinder productive research use of the bibliographies. Drawing valid conclusions critically depends on efficient and reliable harmonization and augmentation of the raw entries. In our study, based on the Swedish National Bibliography and the Finnish National Bibliography and focusing on publication patterns in Sweden and Finland during the period 1640-1910, we have encountered specific and largely overlooked challenges in using bibliographic catalogues for historical research. Here, we have demonstrated how such challenges can be overcome by specifically tailored open source workflows for data processing and analysis. Furthermore, we show how external sources of metadata, for instance on authors, publishers, or geographical places, can be used to enrich and verify bibliographic information.
This work has potential for wider implementation in related studies and other bibliographies, and provides guidelines for more extensive integration of national catalogues. It thus helps to move beyond the national view in analysing the past, towards a more precise view of print culture beyond the confines of national bibliographies.
The Nordic aspect is no longer relevant for this article; we have simply moved towards cross-catalogue analysis, in which these catalogues are also included. However, they are not conceptually relevant to it in any way.
The same goes for Fennica and Kungliga here. The starting point was that this piece would be a reflection on how the work with them was done. The end result, however, was that we arrived at a more ambitious starting point, so there is now much more here, and it is much more meaningful. However, we must also remember to update the text accordingly.
Transfer into practice!