The processing of language information covers, among other steps, distinguishing between primary and multiple languages; identifying misleading cataloguing practices, such as the filling of missing entries with a default choice; and mapping the languages to standardized names.
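As an illustration of this kind of language harmonization, the following minimal Python sketch splits a raw language field, maps codes to standardized names, and leaves missing values as missing rather than filling in a default. The code table, the `;` separator, the treatment of `und` as "undetermined", and the function and field names are all assumptions made for the example; they do not describe the project's actual implementation.

```python
from typing import NamedTuple, Optional, Tuple

# Hypothetical subset of a code-to-name table (MARC-style three-letter codes).
LANGUAGE_NAMES = {
    "fin": "Finnish",
    "swe": "Swedish",
    "lat": "Latin",
    "ger": "German",
    "eng": "English",
    "fre": "French",
}


class LanguageInfo(NamedTuple):
    primary: Optional[str]         # first recognized language, or None if missing
    languages: Tuple[str, ...]     # all recognized languages, in catalogue order
    multilingual: bool             # True if more than one language was recognized
    unrecognized: Tuple[str, ...]  # codes left for manual review


def parse_languages(raw: Optional[str], sep: str = ";") -> LanguageInfo:
    """Split a raw language field and map each code to a standardized name.

    Assumption: codes are separated by `sep`, and "und" marks an
    undetermined language that should be treated as missing.
    """
    codes = [c.strip().lower() for c in (raw or "").split(sep)]
    codes = [c for c in codes if c and c != "und"]

    names, unknown = [], []
    for code in codes:
        name = LANGUAGE_NAMES.get(code)
        if name is None:
            unknown.append(code)   # keep for inspection, do not guess
        elif name not in names:
            names.append(name)     # drop duplicates, keep order

    return LanguageInfo(
        primary=names[0] if names else None,
        languages=tuple(names),
        multilingual=len(names) > 1,
        unrecognized=tuple(unknown),
    )
```

Keeping a missing entry as `None` instead of substituting a default language makes gaps in the catalogue visible in later summaries, and the `unrecognized` field collects codes that would need a manual check.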
For the physical dimensions, which include gatherings and the page count, similar initial data cleaning steps have been implemented, followed by a more in-depth analysis of the varying exceptions and notation conventions, and by final validation.
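To give a sense of what handling page count notation can involve, the sketch below estimates a total page count from a raw extent statement such as `[8], 120, [4] p.`. It assumes that comma-separated sequences are summed, that square brackets mark unnumbered pages, and that Roman numerals denote preliminary pages; anything else is returned as `None` so that it can be reviewed separately. These conventions and the function names are assumptions for illustration only and cover far fewer cases than a real catalogue contains.

```python
import re
from typing import Optional

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral (e.g. 'xii') to an integer."""
    total, largest = 0, 0
    for ch in reversed(numeral.lower()):
        value = ROMAN_VALUES[ch]
        if value < largest:
            total -= value   # subtractive notation, e.g. the 'i' in 'iv'
        else:
            total += value
            largest = value
    return total


def estimate_pages(raw: str) -> Optional[int]:
    """Estimate the page count by summing the page sequences in `raw`.

    Returns None when any sequence is not understood, so that the
    entry can be flagged instead of silently miscounted.
    """
    # Drop a trailing page marker such as 'p.', 'pp.' or 's.' (assumed forms).
    text = re.sub(r"\b(pp?|s)\.?\s*$", "", raw.strip(), flags=re.IGNORECASE)

    total, parsed = 0, 0
    for token in text.split(","):
        token = token.strip().strip("[]").strip()
        if not token:
            continue
        if token.isdigit():
            total += int(token)
        elif re.fullmatch(r"[ivxlcdm]+", token, flags=re.IGNORECASE):
            total += roman_to_int(token)
        else:
            return None      # unknown notation: leave for manual review
        parsed += 1

    return total if parsed else None
```

For example, `estimate_pages("[8], 120, [4] p.")` sums the three sequences, whereas an entry such as `"2 leaves of plates"` is returned as `None` rather than guessed.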
Overall, we have aspired to use best practices from data science, including the concept of tidy data [REFS], statistical programming [REFS], and standard database structures and query tools. A number of R packages and Python libraries have been essential in this work. For a full list, see the software page of the project.
We have used external sources of metadata, for instance, on authors, publishers, or geographical places, to further enrich and verify the information that is available in the bibliographies.
We also quantify data coverage, that is, the share of entries for which each field is available.
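A coverage summary of this kind can be sketched as follows. The example assumes that records are represented as dictionaries and that `None` and the empty string both count as missing; the field names are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping


def field_coverage(records: List[Mapping], fields: Iterable[str]) -> Dict[str, float]:
    """Return, for each field, the fraction of records with a non-missing value."""
    n = len(records)
    coverage = {}
    for field in fields:
        # A value counts as available unless it is absent, None, or empty.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        coverage[field] = filled / n if n else 0.0
    return coverage


# Hypothetical toy records, for illustration only.
records = [
    {"language": "Finnish", "pagecount": 132},
    {"language": None, "pagecount": 40},
    {"language": "Swedish"},
    {"language": "", "pagecount": 12},
]
coverage = field_coverage(records, ["language", "pagecount"])
```

Reporting coverage per field, rather than a single overall figure, shows which parts of the catalogue are complete enough to support a given analysis.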
Our analysis of the FNB demonstrates the research potential of openly available bibliographic data resources. We have substantially enriched and augmented the raw MARC entries that have been openly released by the National Library of Finland. Open availability of the source data allows us to implement reproducible data analysis workflows, which provide a transparent account of every step in the analysis, from raw data to the final summaries. In addition, the open licensing of the original data allows us to share our enriched version [NOTE: check that we also have permission to use all external data sources used for the enrichment] openly, so that it can be further verified, investigated, and enriched by other investigators. Although we do not have permission to provide access to the original raw data entries for the other catalogues, we are releasing the full source code of our algorithms. With this, we aim to contribute to the growing body of tools that are specifically tailored for use in this field. Moreover, we hope that the increasing availability of open analysis methods can pave the way towards the gradual opening of bibliographic data collections. This can follow related successes in other fields, such as the human genome sequencing project and subsequent research programs, which critically rely on centrally maintained and openly licensed data resources, as well as on thousands of algorithmic tools that have been independently built by the research community to draw information and insights from these data collections [REFS].