1 Introduction

Bibliographic data science is a relatively new interdisciplinary field of research that lies at the intersection of library science (or, more broadly, cultural heritage science), history and social sciences, and certain components of computer science. The objective of bibliographic data science is to establish previously hidden or possibly only suspected historical or collection trends based on data sources containing a (typically but not exclusively) large number of bibliographic records, ideally all those related to a given topic (e.g., national bibliographies), and on data science methods. Some of the field’s research questions:

What was the spatial distribution and prosopography of 17th-century German legal dissertations? (Heßbrüggen-Walter 2025)
What degree of interdisciplinarity can be observed based on the metadata of philosophical dissertations? (Heßbrüggen-Walter 2024)
How did the format and language of books change over time in different regions? (Lahti et al. 2019)
What are the patterns of translations from a given language, how have they changed, and which languages were super-central, central, and peripheral in a given era? (Heilbron 1999)
What impact do publishers have on fiction? (Bourdieu 2008)
What were the profiles of the various book collections?
Is there a correlation between the genre and format of the book? (Lahti et al. 2019)
How have genre proportions changed? (Király and Kiséry 2025)
How many early modern publications could have been destroyed without a trace? (Farkas et al. 2025)
How can the reception of works be examined using bibliographic data? (Szemes and Dobás 2025)
What is the quality of cultural heritage data, and what improvement strategies can be developed? (Király 2019)
How do cultural heritage data, data structures, and standards help (or hinder) answering the above questions? What development opportunities does the research suggest for cultural heritage data standards? (Király et al. 2025)

Although digital humanities education has developed dynamically in recent years, it rarely covers the computer-based analysis of bibliographic sources, and the topic is similarly absent from library science and IT education. In my opinion, this gap could be filled by a new informal vocational training program aimed at those who are interested in some of the above questions and who already have some knowledge of one of the relevant fields (e.g., library science, cultural history, literary sociology, information technology). The analysis of records based on library bibliographic standards would probably also be of interest in library training. The training could take the form of a summer university or a seminar or course jointly organized by several university departments. Participants could be university students or practicing professionals.

1.1 Preparation

In this book we use the Python programming language. You should have a basic knowledge of the language and know how to install it on your machine. To keep our environment separate from Python modules that are already installed, we use a virtual environment. To create it, run the following:

python -m venv venv

When you run the code in the book, you should first activate this virtual environment:

source venv/bin/activate

… and when you finish the session, you should deactivate it:

deactivate
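
If you are unsure whether the virtual environment is active, you can check from Python itself. The following is a minimal sketch, assuming the environment was created with the standard venv module as described above; the function name in_virtual_environment is our own choice for illustration and is not used elsewhere in the book.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points to the environment directory,
    # while sys.base_prefix still points to the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment is active; run 'source venv/bin/activate' first.")
```

If the check reports that no environment is active, activate it and start Python again.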

Whenever we talk about installing a module, do it within this environment, using the standard Python module installation method:

venv/bin/pip install pandas

We provide a list of the modules used in this book; you can install them in a single step:

venv/bin/pip install -r requirements.txt
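
After the installation you may want to verify that every listed module is actually available in the environment. The sketch below is one possible way to do this with the standard library only. It assumes that requirements.txt is in the current directory and contains simple name-and-version lines (the file's contents are not shown here); the helper names requirement_names and missing_modules are hypothetical and introduced only for this illustration. Lines pointing to URLs or version control repositories are not handled.

```python
import re
from importlib import metadata
from pathlib import Path


def requirement_names(text: str) -> list[str]:
    """Extract distribution names from the text of a requirements file."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options
            continue
        # The name ends where a version specifier, extra, or marker begins.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_modules(names: list[str]) -> list[str]:
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    path = Path("requirements.txt")
    if path.exists():
        missing = missing_modules(requirement_names(path.read_text()))
        print("Missing modules:", ", ".join(missing) or "none")
    else:
        print("requirements.txt not found in the current directory")
```

Run the script with the virtual environment activated; otherwise it inspects the modules of the system-wide Python installation.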

Some of the code examples run on the command line and are written in bash, which is available by default on Linux and Mac machines. On Windows you can install it via WSL (Windows Subsystem for Linux).

Open the command line or PowerShell and enter:

wsl --install -d Ubuntu
wsl --set-default-version 2
wsl --set-default ubuntu

Then enter Ubuntu in Windows search and click on the Ubuntu icon, or enter the following in the command line or PowerShell:

ubuntu

When you enter this virtual Ubuntu for the first time, you have to provide a user name (which may be the same as or different from your Windows user name) and a password.

You can find more details and troubleshooting tips on the following documentation page: WSL Installation

1.2 A note on code

The code (Python, HTML, XML, etc.) in this book has been lightly reformatted by adding spaces and line breaks to make it easier to understand. These changes affect neither the original intention of the code nor the processing workflow. For the original format, please check the source code of the examples.
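
For Python code, the claim that layout changes of this kind do not alter the program can be checked by comparing syntax trees. The following is a small illustration only, not part of the book's workflow: the record dictionary is made-up sample data, and same_program is a hypothetical helper name. It applies to Python sources; HTML and XML would need a different comparison.

```python
import ast

# Hypothetical one-line snippet, as it might appear in a source file.
compact = "record={'title':'Example','year':1650,'languages':['lat','ger']}"

# The same snippet with spaces and line breaks added for readability.
formatted = """
record = {
    'title': 'Example',
    'year': 1650,
    'languages': ['lat', 'ger'],
}
"""


def same_program(first: str, second: str) -> bool:
    """Compare two Python sources by their syntax trees, ignoring layout."""
    # ast.dump omits line and column numbers by default,
    # so only the structure of the code is compared.
    return ast.dump(ast.parse(first)) == ast.dump(ast.parse(second))


print(same_program(compact, formatted))  # prints True
```

Changing a value, for example the year, would make the comparison return False, because that alters the syntax tree and not only the layout.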

Bourdieu, Pierre. 2008. “A Conservative Revolution in Publishing.” Translation Studies 1 (2): 123–53. https://doi.org/10.1080/14781700802113465.

Farkas, Farkas Gábor, János Káldos, and Péter Király. 2025. “A Régi Magyarországi Kiadványok „Sötét Anyaga”.” Magyar Könyvszemle 141 (2): 226–66. https://doi.org/10.17167/mksz.2025.2.226-266.

Heilbron, Johan. 1999. “Towards a Sociology of Translation: Book Translations as a Cultural World-System.” European Journal of Social Theory 2 (4): 429–44. https://doi.org/10.1177/136843199002004002.

Heßbrüggen-Walter, Stefan. 2024. “Interdisciplinarity in the 17th Century? A Co-Occurrence Analysis of Early Modern German Dissertation Titles.” Synthese 203 (2): 67. https://doi.org/10.1007/s11229-024-04494-2.

Heßbrüggen-Walter, Stefan. 2025. “Early Modern Dissertations in French Libraries: The EMDFL Dataset.” Journal of Open Humanities Data 11 (June): 36. https://doi.org/10.5334/johd.307.

Király, Péter. 2019. “Measuring Metadata Quality.” Doctoral dissertation, University of Göttingen. https://doi.org/10.13140/RG.2.2.33177.77920.

Király, Péter, and András Kiséry. 2025. “‘Mór Jókai, Alas’: The Most Successful Hungarian Writer. A Quantitative Analysis.” In Patterns of Translation. https://translationpatterns.substack.com/p/mor-jokai-alas-the-most-successful.

Király, Péter, Tomasz Umerle, Vojtěch Malínek, et al. 2025. “Effects of Open Science and the Digital Transformation on the Bibliographical Data Landscape.” In Library Catalogues as Data, 1st ed., edited by Paul Gooding, Melissa Terras, and Sarah Ames. Facet. https://doi.org/10.29085/9781783306602.004.

Lahti, Leo, Jani Marjanen, Hege Roivainen, and Mikko Tolonen. 2019. “Bibliographic Data Science and the History of the Book (c. 1500–1800).” Cataloging & Classification Quarterly 57 (1): 5–23. https://doi.org/10.1080/01639374.2018.1543747.

Szemes, Botond, and Kata Dobás. 2025. “A Visegrádi Országok Digitális Irodalmi Emlékezete : Wikipedia, Wikidata – a Regionális Irodalomtörténet Új Alakzatai.” Irodalomtörténeti Közlemények 129 (2): 191–212. https://doi.org/10.56232/itk.2025.2.04.