8  data harmonisation

abstract: Normalization and data enrichment: the reproducible conversion of source data into a data set suitable for quantitative humanities analysis.

Data maintained by others rarely suit, in every respect, the specific analytical purpose for which we are preparing them. The data that matter to us must be harmonized, i.e., normalized (standardized, with contradictions resolved and certain data types converted, such as particular text variables to numeric ones) and enriched (by calculating derived data, such as page numbers, and by importing data from external sources). Below, we examine four such harmonization steps: the harmonization of dates, place names, persons, and concepts.

Dates show a high degree of variation, and not only in the MoMe collection: in almost every library catalog we find dates that differ from the formats that programs can easily handle. Examples include dates given in Roman numerals (“MDCCLXXX. [1780]”), in text form (“druk janvier 2016.”), by the regnal year of a monarch (“Meiji 40”), or according to another calendar. Another problem is the handling of uncertain dates (e.g., “18–” and “18uu” in library catalogs both mean that the publication dates from the 19th century). Because of this variety, the conversion is not trivial, but neither is deciding what target form the dates should be converted to. Approaches differ between fields (see, for example, archival standards or the practice of Europeana). A recent proposal, the undate Python library¹ created by the DHTech community, stores the following data elements: the unchanged form of the date as it appears in the source, the calendar, the accuracy of the date, the earliest and latest normalized dates, and the duration. In other words, for the sake of consistent retrieval, a date is always represented as a time range.

For place names, gazetteers are available for identification and, where necessary, for supplying the additional data elements needed for map display or for showing language variants. Among the most important are the CERL Thesaurus,² the Getty Thesaurus of Geographic Names,³ and GeoNames,⁴ all of which can be queried via APIs.
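
To make the range-based date model described above more concrete, the following is a minimal sketch in plain Python. It does not use the undate API: the names `DateRange`, `normalize`, `roman_to_int`, and `year_range` are hypothetical, and only a few of the patterns mentioned earlier are covered. Forms it does not recognize, such as “Meiji 40”, are returned as `None` for manual review, and the month in “druk janvier 2016.” is deliberately ignored.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


@dataclass(frozen=True)
class DateRange:
    """A normalized date: the source string plus an inclusive time range."""
    original: str   # unchanged form found in the catalog record
    earliest: date
    latest: date

    @property
    def duration_days(self) -> int:
        # Both endpoints are included in the range.
        return (self.latest - self.earliest).days + 1


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral (subtractive notation allowed) to an integer."""
    values = [ROMAN[ch] for ch in numeral.upper()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (e.g. IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def year_range(original: str, first: int, last: int) -> DateRange:
    """Build a range covering whole years, from 1 January to 31 December."""
    return DateRange(original, date(first, 1, 1), date(last, 12, 31))


def normalize(raw: str) -> Optional[DateRange]:
    """Return a DateRange for a few illustrative patterns, else None."""
    text = raw.strip()

    # Uncertain century or decade: "18uu", "18--", "18–", "186u".
    m = re.fullmatch(r"(\d{2,3})[u\-–?]{1,2}", text)
    if m:
        known = m.group(1)
        return year_range(raw, int(known.ljust(4, "0")), int(known.ljust(4, "9")))

    # A four-digit year anywhere in the string, e.g. "MDCCLXXX. [1780]".
    m = re.search(r"(?<!\d)(\d{4})(?!\d)", text)
    if m:
        year = int(m.group(1))
        return year_range(raw, year, year)

    # A bare Roman-numeral year, e.g. "MDCCLXXX."
    m = re.fullmatch(r"([IVXLCDM]+)\.?", text, flags=re.IGNORECASE)
    if m:
        year = roman_to_int(m.group(1))
        return year_range(raw, year, year)

    return None  # left for manual review or a fuller parser
```

With this sketch, `normalize("18uu")` yields the range from 1 January 1800 to 31 December 1899 while keeping the original string for display. Storing every date as a range means that precise and uncertain dates can be queried in the same way.
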
Although these gazetteers are rich databases, built from many sources and thoroughly checked, practice shows that almost every bibliographic source contains name forms that these services do not recognize. Such forms can be incorporated into our own database after some manual data refinement.

The same procedure can be followed for persons, naturally using different services: VIAF (Virtual International Authority File),⁵ the CERL Thesaurus personal name database, ISNI (International Standard Name Identifier),⁶ and Wikidata.⁷ It is important to note that any given database will contain many more personal names than geographical names, so the hit rate is likely to be lower.

The world of concepts is much more diverse than that of geographical and personal names. Although universal concept dictionaries (knowledge organization systems) exist, there is virtually no library catalog whose records contain only the concepts of a single dictionary. Rather than recommending specific dictionaries, we suggest using the BARTOC service (Basic Register of Thesauri, Ontologies and Classifications)⁸ to find the dictionary that best suits your research questions.

When discussing harmonization, it is essential to mention the categories of inaccurate, incomplete, subjective, and uncertain data.⁹ We have seen an example of inaccurate data and its handling in the case of dates. Data are incomplete when we do not know all the details: for example, not all authors of a work are listed, or there are gaps in the provenance history of an object. We can deduce some data, but it is very difficult to describe what does not exist. Subjective data concern the provenance of a statement, i.e., who made it. Such statements are often hypothetical and may even contradict one another. Finally, data are uncertain when the truth of a statement is doubtful.
An important part of the theories cited above is that the past is constructed, and that the interpretation of sources also depends on the interpreter’s prior knowledge. Consequently, historical information systems must allow contradictory interpretations to coexist, and instead of binary (true/false) logic, uncertainties could be described using probability values.¹⁰ For example, “Alexandre Dumas” (if no other information is available) could refer to either the father or the son, both of whom were writers. The former is more likely, and this probability can be recorded in the database and used, for example, when sorting search results.
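
The Dumas example can be sketched as follows. This is only an illustration of storing competing identifications side by side and sorting them by probability. The class `Candidate`, the function `rank`, the `example:` identifiers, and the values 0.7 and 0.3 are all assumptions made for the sketch, not figures or identifiers taken from any authority file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible identification of a name form found in a source."""
    label: str          # preferred name form
    identifier: str     # placeholder for an authority-file ID (hypothetical)
    probability: float  # editor-assigned likelihood, between 0 and 1


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Return all candidates, sorted from most to least likely."""
    for c in candidates:
        if not 0.0 <= c.probability <= 1.0:
            raise ValueError(f"probability out of range for {c.label!r}")
    # Every reading is kept; only the order changes.
    return sorted(candidates, key=lambda c: c.probability, reverse=True)


# Illustrative values only: 0.7 and 0.3 are assumptions, not measured figures.
dumas = [
    Candidate("Dumas, Alexandre (fils)", "example:dumas-fils", 0.3),
    Candidate("Dumas, Alexandre (père)", "example:dumas-pere", 0.7),
]

for candidate in rank(dumas):
    print(f"{candidate.probability:.1f}  {candidate.label}")
```

Neither reading is discarded: both remain in the data, and the probability only determines the order in which they are offered to the user.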


  1. (Koeser et al. 2025)

  2. https://data.cerl.org/thesaurus/

  3. https://www.getty.edu/research/tools/vocabularies/tgn/index.html

  4. https://geonames.org

  5. https://viaf.org/

  6. https://isni.org/

  7. https://wikidata.org

  8. https://bartoc.org

  9. (Mariani 2023)

  10. Thaller, Manfred. “On vagueness and uncertainty in historical data.” Ivory Tower blog, 2020. https://ivorytower.hypotheses.org/88