5 data acquisition

abstract: Main types of bibliographic data and data sources (library catalogs, citation databases, research data repositories, historical sources). Methods of data acquisition (standards, application programming interfaces, and tools), information about data structure (metadata schemas and serialization formats), terms of use.

Digital data can be retrieved in three main methods. The most convenient option is for these to be available as downloadable files (e.g., as reusable research data), however this type of data sharing is relatively rare. It is more common for data sources to be accessed through some kind of application programming interface. Various applications are available for the most common interfaces (OAI-PMH, Z39.50, SRW/SRU, SPARQL), so programming is not necessarily required, but time must be set aside to study the institution-specific settings and parameters of the interfaces. Finally, it is often the case that no previous opportunity was available. At this point, we extract the data from the HTML source of the website, assuming that the typographical formatting consistently indicates certain semantic elements17 – but in this case, it is worth consulting with the website operator to see if there are any other options not documented on the site. Whichever solution you choose, make sure that the data license allows reuse. After downloading the data, the first step is to import it. Programming libraries supporting various bibliographic formats are available; for MARC21, for example, there are ones for Java, Python, Go, JavaScript, R, PHP, and other programming languages. In this textbook we will work with Python, but the code could be adapted to other programming languages.

There are four main types of data sources for library history research:

library catalogs (e.g., national libraries or general purpose union catalogs, as well as catalogs of specifically old books, such as the VD16, VD17 and VD18 series that register 16th, 17th, and 18th century German books, their Italian counterpart EDIT16, and the Heritage of the Printed Book database),
digital library catalogs (Europeana, Gallica, German Digital Library, Hungarian Electronic Library, HathiTrust),
citation databases and research data repositories (DataCite, Zenodo, OpenAlex, Open Citation), and finally
databases of digitized (book) historical sources (the database of the Société Typographique de Neuchâtel, the database of 18th-century Dutch auction book catalogs MEDIATE, or the no defunct Eruditio in Hungary).

5.1 Creating directories

In this textbook we will work with different directories:

data: the data we created
raw-data: our original data sources downloaded from data providers’ sites
plots: the output of data visualization

The first programming task is to create these directories. To create them we use the os Python library that provides standard operating system functions, such as makedirs that creates a directory.

1import os

2directories = ['data', 'raw-data', 'plots']

3for directory in directories:
4    if not os.path.exists(directory):
5        os.makedirs(directory)

1: imports the os library
2: creates a list with three strings, the names of the directories
3: iterates the directories one by one. In each iteration the name of the current directory will be stored in the variable directory
4: checks if the directory does not exist. If it exists the script skips the directory and takes the next one. If it doesn’t exist, it continue with the next command
5: creates the directory (due to the test in the previous line only if it doesn’t exist already)

The code of this section is available at scripts/ch3a-create-directories.py of the repository.

5.2 Download a MARCXML file

You can find a list of downloadable library catalogues ḣere. Now download a relatively small file with MARC records of the Latvian National Bibliography (2014-2023) provided by the Open Data Portal of the National Library of Latvia. The files are compressed with zip, so in order to use them we should extract them, and because the zip file contains a data subdirectory, we rename that to lnb (after the domain name of the National Library of Latvia).

1import urllib.request
import zipfile
import os

2url = 'https://dati.lnb.lv/files/natl_bibliography-2014-2023-marc.zip'
target_dir = 'raw-data'
target_file = target_dir + '/lnb-natl_bibliography-2014-2023-marc.zip'

3urllib.request.urlretrieve(url, target_file)

4with zipfile.ZipFile(target_file, 'r') as zip_ref:
    zip_ref.extractall(target_dir)

5os.rename(target_dir + '/data', target_dir + '/lnb')

1: Imports the Python libraries: urllib for the download, and zipfile for uncompressing
2: creats variables
3: calls the download function with two arguments: the URL of the data source and the target file in our system
4: uncompresses the zip file to our target directory
5: renames the data subdirectory (comes inside the zip file) to lnb

At the end of the process we will have a file raw-data/lnb/natl_bibliography-2014-2023-marc.xml, a set of MARC21 record in MARCXML serialization format, that is MARC21 in XML.

The code of this section is available at scripts/ch3a-download-a-file.py of the repository.

5.3 APIs to retrieve records

5.3.1 OAI-PMH

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP. From https://www.openarchives.org/pmh/

The protocol provides 6 ‘verbs’ or endpoints to retrieve information:

GetRecord: to retrieve an individual metadata record
Identify: information about a repository
ListIdentifiers: to harvest record identifiers
ListMetadataFormats: to retrieve the metadata formats available from a repository. These formats are alternative XML representations of the same content, but it also happens that a library provides different content via different formats. In the ListIdentifiers and ListRecords verbs it is used as the metadataPrefix parameter. At least one format, the Dublin Core is mandatory.
ListRecords: to harvest records
ListSets: to retrieve the set structure of a repository. Sets are optional collections within a repository, e.g. there might be individual sets according to the physical collections of a library.

The response of these verbs is always XML. For the List* verbs the maintainer of the service sets a pager size, so you do not retrieve all records at once, only limited number (e.g. 100). The pagination is implemented by a distinct XML element called resumptionToken, which the client should use in the next call. Its value can change from call to call, so the client always have to extract is. In the above links you can find more detailed information and examples.

There are different implementations of OAI-PMH clients in Python, we use Mathias Loesch’s Sickle module (version 0.7.0) to retrieve the Estonian National Bibliography from the DataLab, the data sharing platform of the Estonian National Library. At the time of writing it returns almost 410 000 MARCXML records. We save them into local files each containing 100 000 records. As OAI-PMH is based on XML technology we use the lxml module (version 6.0.0), a lightweight XML and HTML processing toolbox.

import io
import sys
import os
from lxml import etree
from sickle import Sickle

1if len(sys.argv) == 1:
    print('Please give a directory name')
    exit()
2dir = sys.argv[1]
if not os.path.exists(dir):
    os.mkdir(dir)
print('saving to %s' % (dir))

3namespaces = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'marc21': 'http://www.loc.gov/MARC21/slim'
}

4header = '<?xml version="1.0" encoding="utf8"?>' + "\n" \
       + '<collection>' + "\n"
footer = '</collection>'

1: When we run Python, the sys.argv list contains the arguments we passed to the interpreter. The len() function returns the number of elements in a collection, so it gives here the number of arguments. The first argument is the name of the current script, so if the number here is zero it means that the script itself does not have any argument. We should pass at least one argument: the name of the directory where the script will store the records – if we don’t provide it, the script will send us a message and stop running (with the exit() function).
2: If we provide an argument it will be stored as the name of the output directory. Then it creates the directory if it does not already exist, and notify us.
3: namespaces variable registers the XML namespaces the process needs - otherwise the XPath expressions would not work.
4: header and footer needs at the beginning and end of the output XML files. The XML files are hierarchical: their main content is a list of MARC records, and their parent element will be the <collection>.

To save the records we have to do several things, like naming an output file, insert header and footer. We should save records several times, so if we have such repeatable task, it is always suggested to create a method.

1def write_output(xmlrecords, file_counter, dir_name):
    """
    Writes MARC records to file
    Parameters                              
    ----------
    xmlrecords : list
        A list of string contains the individual MARCXML records
    file_counter : int
        A file name counter
    dir_name : int
        The name of the output directory
    """
2    file_name = os.path.join(dir_name, f'{file_counter:06}.xml')
3    print(file_name)
4    with io.open(file_name, 'w', encoding='utf8') as f:
5        f.write(header)
6        f.write("\n".join(xmlrecords) + "\n")
7        f.write(footer)

1: A write_output() function creates the output files in the specified directory. It prints the XML header, join together and prints the XML records, and finally prints the XML footer.
2: To create an output file name might looks a bit complicated at first sight. The os.path.join function is a safe way to create a path in your operating system – you might know that the directory separator character is different in Windows (\) than in Mac and Linux (/). With this function Python returns the path syntax that fits to your OS. In this case it has two parameters: a directory name, and a file name. We generate the later with the f-string syntax. {file_counter:06} returns a string that takes the value of the file_counter variable (a number) and pad it with zeros on the left side up to six characters so 1 becomes 000001, 1000 becomes 001000 etc. The last part of the f-string append this with the file extension.
3: Just for logging the progress it prints out the generated path.
4: opens the file in writing mode with UTF-8 character encoding
5: first, writes out the above defined XML header
6: then, joins the individual XML records to a single string and writes it (alternatively we could iterate over the records and write them one by one).
7: finally, writes out the XML footer (the closing collection tag).

Then we create some variables to store the harvested records, and keeping status information, then initialize and start the harvester.

1xmlrecords = []
file_counter = 0
record_counter = 0

2sickle = Sickle('https://data.digar.ee/repox/OAIHandler', max_retries=4)
3it = sickle.ListRecords(metadataPrefix='marc21xml', set='erb')

1: Initialisation of the necessary variables: the collector of the individual records, the output file counter and the record counter.
2: The Sickle harvester is initialized with the endpoint of the OAI-PMH service and an extra argument to specify how many times it can retry to fetch a single URL. It is needed because sometimes there are communication problems between the server and the client (due to problems on either side or in the network).
3: Start the harvest using the ListRecords verb. It returns an iterator, so we should not take care of pagination with individual HTTP calls and resumptionToken mentioned above, it is handled automatically by the Sickle module. We use two arguments: we set the metadata schema as MARCXML and the set as erb which stands for the Estonian National Bibliography. These values can be extracted from a preliminary investigation of the supported values returned by the ListMetadataFormats and ListSets verbs.

In the following the script iterates over the individual XML responses of the calls.

1for content in it:
2    tree = etree.ElementTree(content)

    records = tree.xpath(
3                  '/oai:record[*]/oai:metadata/marc21:record',
                  namespaces=namespaces)
4    for record in records:
5        xmlrecord = etree\
6              .tostring(record, encoding='utf8', method='xml')\
7              .decode("utf-8")\
8              .replace("<?xml version='1.0' encoding='utf8'?>\n", '')
9        xmlrecords.append(xmlrecord)
10        record_counter += 1

11    if len(xmlrecords) >= 100000:
12        write_output(xmlrecords, file_counter, dir)
13        xmlrecords = []
        file_counter += 1

1: Iterates over the responses. We do not need to call individual HTTP requests, the underlying library Sickle does it for us automatically.
2: The raw XML content is transformed to an XML element tree that is an internal data structure with an API.
3: Selection of the individual records with an XPath expression. For the expression we should provide the XML namespace. In OAI-PMH the individual records (<oai:record>) have two parts: a header, that contains record identifier and other metadata (such as the set it belongs to, the date of last modification) and the actual record (<oai:metadata>). This later is the container of the MARCXML record (<marc21:record>). If you would like to apply this script to another OAI-PMH server, please check the namespaces it uses. The abbreviations are not important, you can name them as you like (however to keep them as in the source, i.e. the XML response of the service helps in debugging the problems, if there are), but you should always use the URLs used by the service.
4: Iterate over the individual records that xpath() returned.
5: Transformation of the above mentioned tree structure of the record to XML via a sequence of steps:
6: create the XML string…
7: convert to UTF-8 encoding…
8: remove the XML declaration – we do not need it for each record.
9: add the record into the record collection
10: increase the record counter by one
11: if the record collection contains more than 100 000 records…
12: the records should be written to a file…
13: then empty the collection, and increase file counter by one

Since during the iteration we collect records, but do not save the list to file only when the list size reaches a limit, there is a chance that when the iteration ends we still have unsaved record. So we should save them and print a report.

1if len(xmlrecords) > 0:
    write_output(xmlrecords, file_counter, dir)
2print('saved %d records to %d files' % (record_counter, file_counter))

1: after the iteration if the record collection is not empty write it to a file
2: finally notify the user about the number of records ingested and files written

If you reached this point without errors, congratulation, you downloaded the Estonian national bibliography’s MARCXML records.

A final note: while you can use OAI-PMH to download records from many libraries, the protocol and its implementations have some downsides:

you can not select or search for records, you are depending on what the organisation have selected for you via the sets of the protocol
the iteration of the record is serial, and you can not skip some records or jump to the n-th record. The resumption token gives you the key for the next iteration. Some implementation however contains a counter that you might “hack” by manipulating it. Sometimes it works, however it is not supported by the protocol itself. In this post we did not see the resumption token, you should do some investigation yourself to figure out how to implement such hack when using Sickle or other Python library.
sometimes it happens that server stops responding, or responds an error or even the resumption token is invalidated. In that case you can retry issuing the same URL request again (we saw max_retries parameter when Sickle was initialized). If that would not work you have two options: either hack the resumption token to jump ahead (but as mentioned before, it is not available if there is no counter in it) or you should start over.

5.4 Download an SQL file

Some bibliographic data sources are available as a relational database. These databases are distributed as a so-called SQL dump, a non binary text based file, that contains both the definition of the database structure (the properties of tables and columns), and the values. The dump files can be imported to the database. In this section we will import the following database:

Simon Burrows and Mark Curran, The French Book Trade in Enlightenment Europe Database, 1769-1794 (https://fbtee.westernsydney.edu.au/stn/interface/, 6 May 2014)

It maps the trade of the Société Typographique de Neuchâtel, 1769-1794, and based on the almost intactly surviving archive of the alliance of the printers at Neuchâtel (Switzerland). The database (henceforth FBTEE) could be imported into MySQL or its Open Source form MariaDB. the database is downloadable from https://fbtee.westernsydney.edu.au/main/download-database/.

As now we would like to download another file we might choose to copy the previous script, and replace the variables url and target_file with another one, but it would duplicate the business logic of the script. In software development there is a principle Don’t Repeat Yourself abbreviated as DRY. It suggests that we should avoid code repetition, or in other words: the code should be reusable. Instead of writing two (or more) specialized scripts that repeat the actual download and uncompressing part, we will create a function that will be used by these scripts.

First we will create a utility script that will contain functions we can use in the course. It will be called bdsutils (short for Bibliographic Data Science utilities). This is the

import urllib.request
import zipfile

1target_dir = 'raw-data'

2def download_and_unzip(url, target_file):
3    """
    Download a zip file, and uncompress it
    
    Parameters
    ----------
    url : str
        The URL of the zip file to download
    sound : str
        The local file name
    """

    urllib.request.urlretrieve(url, target_file)

    with zipfile.ZipFile(target_file, 'r') as zip_ref:
        zip_ref.extractall(target_dir)

1: the script keeps only one variable target_dir, that will remain the same across calls
2: the definition of the method
3: the documentation of the method

TODO: The code should be explained more!

The file is available in the repository as scripts/bdsutils.py.

Once we have this utility script, we can create the specific one for FBTEE. This zip file does not contain directories, only a single file chop_leeds_ac_uk.sql.

1from bdsutils import target_dir, download_and_unzip
import os

2url = 'https://fbtee.uws.edu.au/stn/database/download/STN_database.zip'
target_file = target_dir + '/stn_database.zip'

3download_and_unzip(url, target_file)

directory = target_dir + '/fbtee'
4if not os.path.exists(directory):
    os.mkdir(directory)

5os.rename(target_dir + '/chop_leeds_ac_uk.sql',
          directory + '/chop_leeds_ac_uk.sql')

1: imports both the variable and the function from bdsutils (we should not add the .py extension). This way we should not use the prefix bdsutils for these components (such as bdsutils.target_dir)
2: the definition of the FBTEE specific variables
3: usage of the method defined in bdsutils
4: creates a dedicated directory for the database if it doesn’t already exist
5: moves the file to this directory

If you run it, you will have two files: - raw-data/fbtee/chop_leeds_ac_uk.sql - raw-data/stn_database.zip

The file is available in the repository as scripts/ch3a-download-stn.py.

5.4.1 import BTEE to the database

Importing a datadump into MySQL or MariaDB is a two step process. First you should create the database itself with permissions (for security reasons it is advisable to create a dedicated user as well and not to use the default database user), then you can import the data.

Step 1. Prepare the database

Log in to MySQL, then create the database fbtee and a dedicated user (replace <username> and <password> with your chosen credentials):

CREATE DATABASE fbtee;

CREATE USER '<username>'@'localhost' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON fbtee.* TO '<username>'@'localhost' WITH GRANT OPTION;
FLUSH PRIVILEGES;
quit

Note: you can choose any name for the database, this material will use fbtee.

Step 2. Import the data dump

mysql -u <username> -p fbtee < raw-data/fbtee/chop_leeds_ac_uk.sql

This command will ask for the password. Use the same credentials you set in Step 1. If you did not get any error message, you can check the database structure. Log in again with the dedicated credentials:

mysql -u <username> -p fbtee

The you can check the list of tables with

SHOW TABLES;

that returns

+-------------------------------------+
| Tables_in_fbtee                     |
+-------------------------------------+
| authors                             |
| books                               |
| books_authors                       |
| books_call_numbers                  |
| books_stn_catalogues                |
| clients                             |
| clients_addresses                   |
| clients_correspondence_manuscripts  |
| clients_correspondence_places       |
| clients_people                      |
| clients_professions                 |
| keyword_assignments                 |
| keyword_free_associations           |
| keyword_tree_associations           |
| keywords                            |
| orders                              |
| orders_agents                       |
| orders_sent_via                     |
| orders_sent_via_place               |
| parisian_keywords                   |
| parisian_system_keyword_assignments |
| people                              |
| people_professions                  |
| places                              |
| professions                         |
| super_books                         |
| super_books_keywords                |
| tags                                |
| transactions                        |
| transactions_volumes_exchanged      |
+-------------------------------------+
30 rows in set (0,001 sec)

or the structure of the books table with

DESCRIBE books;

The result:

+---------------------------+---------------+------+-----+---------+-------+
| Field                     | Type          | Null | Key | Default | Extra |
+---------------------------+---------------+------+-----+---------+-------+
| book_code                 | char(9)       | NO   | PRI | NULL    |       |
| super_book_code           | char(11)      | NO   |     | NULL    |       |
| edition_status            | varchar(15)   | NO   |     | NULL    |       |
| edition_type              | varchar(50)   | NO   |     | NULL    |       |
| full_book_title           | varchar(750)  | YES  | MUL | NULL    |       |
| short_book_titles         | varchar(1000) | YES  |     | NULL    |       |
| translated_title          | varchar(750)  | YES  |     | NULL    |       |
| translated_language       | varchar(50)   | YES  | MUL | NULL    |       |
| languages                 | varchar(200)  | YES  | MUL | NULL    |       |
| stated_publishers         | varchar(1000) | YES  | MUL | NULL    |       |
| actual_publishers         | varchar(1000) | YES  |     | NULL    |       |
| stated_publication_places | varchar(1000) | YES  |     | NULL    |       |
| actual_publication_places | varchar(1000) | YES  |     | NULL    |       |
| stated_publication_years  | varchar(1000) | YES  |     | NULL    |       |
| actual_publication_years  | varchar(10)   | YES  |     | NULL    |       |
| pages                     | varchar(250)  | YES  |     | NULL    |       |
| quick_pages               | varchar(10)   | YES  |     | NULL    |       |
| number_of_volumes         | int(11)       | YES  |     | NULL    |       |
| section                   | varchar(10)   | YES  |     | NULL    |       |
| edition                   | varchar(100)  | YES  |     | NULL    |       |
| book_sheets               | varchar(200)  | YES  |     | NULL    |       |
| notes                     | varchar(4000) | YES  |     | NULL    |       |
| research_notes            | varchar(1000) | YES  |     | NULL    |       |
+---------------------------+---------------+------+-----+---------+-------+
23 rows in set (0,001 sec)

This material is not intended to provide an introduction to SQL, if you are interested, we can suggest the Library Carpentry’s SQL lesson.

5.5 Web scraping

Oftentimes it happens that an interesting bibliographic data source does not provide any download or programmable access option. We have two options here. The first is to contact the data provider and ask help to access the data. If we can not find contact information, or we do not receive any answer to our question, the second option is to extract data from the public web site, called screen scraping or web scraping. In this section the method will be demonstrated on UNESCO’s Index Translationum database (IT).

IT collected bibliographical records describing translated books in any genre. These records came from national libraries between 1978 and roughly 2009 (IT never announced the cease of data collection, but practically records after this year are quite rare). Before 1978 IT had been published in printed volumes, however I am not aware of digitised versions. Another important problem is that the IT website is not always available.

The database does not provide browsing functionality, so one should start with a query, such as looking for translations of an author, or translations from a particular language (specifying the source language). The search queries (expressed as URLs) look like these:

for authors: https://www.unesco.org/xtrans/bsresult.aspx?a=[author], where [author] should contain the URL encoded name of the author in family name, first name format (e.g. William Shakespeare should be encoded as Shakespeare,%20William)
for source language: https://www.unesco.org/xtrans/bsresult.aspx?sl=[language code]&tie=a, where the [language code] should be substituted with a concrete language code, such as hun for Hungarian

You can play at the user interface to find the good initial query term, then you can start with the code (note: the URL encoding is done by the web form, but if you do it programmatically you have to take care with your programming language).

A note on terminology: in translation study the language from which the work has been written is called the source language, and the translation is written in the target language. Sometimes the translator does not work from the source language, but from another translation, the language of which is called the intermediate language.

A note on code: every HTML code in this section is a bit formatted by adding spaces and line breaks in order to make it easier to understand. For the original format please check IT’s HTML source.

The result is a HTML page that you have to parse and extract data. The first thing we extract now is pagination. You receive only 10 results for a single query. The next set can be accessed by clicking on the right arrow icon representing the link to the next page. Fortunately IT applies the class attribute of the HTML element to inject some semantics into the page, and we will utilize this feature. Here is the next link’s HTML structure:

<td class=”next”>
  <a href=”bsresult.aspx?lg=0&amp;a=Shakespeare, William&amp;fr=10”>
    <img border=”0” src=”Images/cy_r_arr.gif” alt=”prev”>
  </a>
</td>

The class=”next” part is unique in the page, however on the last page it is empty, since there are no more hits. Thus we should parse it in two steps: first get the content of the <td> element, then we should find inside the <a> element’s href attribute. Since it contains a relative link, in order to fetch it, we should add https://www.unesco.org/xtrans/ to the beginning. We can continue the iteration until the td element becomes empty.

The next thing is to parse the content, the individual bibliographical descriptions. The results are displayed in a table structure, where the table has the following structure:

<table class=”restable”>
  <tr>
    <td class=”res1”>1/4380</td>
    <td class=”res2”>[bibliographical description]</td>
  </tr>
  <tr>
    <td class=”res1”>2/4380</td>
    <td class=”res2”>[bibliographical description]</td>
  </tr>
  ...
</table>

so we should find the table with class=”restable”, then inside it we should process the bibliographical descriptions, which always take place in a cell (<td> element) with class=”res2”.

Thanks to the unknown software designers and developers behind IT, all elements of the bibliographical description are in a semantically identifiable <span> element. Here is an example:

<td class=”res2”>
  <span class=”sn_auth_name”>Shakespeare</span>,
  <span class=”sn_auth_firstname”>William</span>:
  <span class=”sn_target_title”>König Lear: Tragödie</span>
  [<span class=”sn_target_lang”>German</span>]
  / <span class=”sn_transl_name”>Baudissin</span>,
  <span class=”sn_transl_firstname”>Wolf Heinrich</span>
  / <span class=”sn_pub”><span class=”place”>Stuttgart</span>:
  <span class=”publisher”>Reclam</span>
  [<span class=”sn_country”>Germany</span>]</span>,
  <span class=”sn_year”>1979</span>.
  <span class=”sn_pagination”>112 p.</span>
  <span class=”sn_orig_title”>King Lear</span>
  [<span class=”sn_orig_lang”>English</span>]
</td>

The IT records come with the following vocabulary:

place: publication place
publisher: publisher
sn_auth_name: author’s family name
sn_auth_firstname: author’s first name
sn_target_title: the title in the target language
sn_target_lang: the target language
sn_country: country of the publication place
sn_year: publication year
sn_pagination: pagination
sn_orig_lang: original (or source) langue
sn_transl_name: translator’s family name
sn_transl_firstname: translator’s first name
sn_orig_title: title in source languge
sn_editionstat: edition statement
_sn_interm_titl_e: title in intermediate language
sn_interm_lang: intermediate language
sn_isbn: ISBN
sn_auth_quality: quality indicator for the author
sn_transl_quality: quality indicator for the translator

Of course not all data elements are available in every record. IT collected bibliographical records from national libraries, and the quality of them vary a lot. There are character encoding problems, and sometimes falsely tagged data elements.

There are a number of technical possibilities to parse HTML. Since IT’s HTML pages are seemingly transformed from a database with strict rules, one can apply regular expressions, which is a risky approach in lots of other cases. There are more advanced HTML parser libraries however such as BeautifulSoap. Here is the documentation, and Programming Historian also has a tutorial:

Jeri Wieringa, “Intro to Beautiful Soup,” Programming Historian 1 (2012), https://doi.org/10.46430/phen0008

which unfortunately a ‘retired’ lesson, since the HTML page on which the author demonstrates the strength of the library is no longer available, but still a good introduction, and one can apply the methods on IT.

1import requests
2import lxml.html
import os
3import re
4import urllib.parse
5import csv
6from collections import Counter

1: we use another URL handling library: the request library that provides more convenient methods for the same task as the urllib package. The request package needs installation with a pip install request.
2: previously we used another module from the lxml library. html provides methods for parsing HTML files.
3: re provides regular expression functionalities. As this part of core Python, it doesn’t need installation.
4: the parse module helps to parse URLs and extract important parameters
5: the csv module helps in reading and writing CSV files.
6: Counter, well it is counting things

We define two variables for our URL requests.

base_url = 'https://www.unesco.org/xtrans/bsresult.aspx'
it_params = {
    'a': 'Bourdieu, Pierre',
    'fr': 0
}

base_url is part of the URL we always use in the requests. To retrieve the second and any following pages of the hit list we should change the rest of the URL, the so-called query part (after the question mark). It is a list of key=value pairs separated by ampersand (&) character. We set up the it_params variable, which represents the URL query. This is a dict data type, and will be automatically transformed to URL during the request.

We should do some more preparation. When we parse HTML pages the process is almost always iterative, because we usually move forward with baby steps adding new and new parts to our script. In practice it means that we fetch the website many times – that might cause problems on the data provider’s web site. A best practice is to be nice to them, and try to eliminate the number of requests by caching: when we first retrieve a page, we should store it locally as a file in our disk, and next time we read it from the disk.

Add a new function to our bsutils module:

import os
...
def create_dir(dir):
    """
    Create a directory if it doesn't yet exist
    
    :param dir: the directory name
    """
    if not os.path.exists(dir):
        os.makedirs(dir)

Then we use it in the web scraping script:

1import bsutils
...
2create_dir('cache')
3cache_dir = os.path.join('cache', 'it')
create_dir(cache_dir)

1: imports our module
2: creates a directory called ‘cache’
3: defines a directory called it inside the cache directory. Different operating systems handle subdirectories differently, os.path.join provides a safe way to use the appropriate syntax.

We also have to create containers for the data we plan to extract.

1records = []
2authors = []
translators = []
3record_counter = 1
4field_names = Counter({})

1: record (a list of dictionaries) will contain our records.
2: as there might be multiple authors, and translators, we collect them separately. Records will have an id field (IT does not provide one), and these two entities will refer to that identifier as the external keys in relational databases.
3: record_counter is the source of this identifier. When we process a new record, we will increment this value
4: during the development we do not always know each field beforehand. Counter is a special Python data type that provides us a method to count things.

We start the process with the initial values:

request_page(it_params)

that triggers the following:

def request_page(it_params):
1    cache_file = os.path.join(cache_dir, f'results_{it_params['fr']}.html')
2    if os.path.exists(cache_file):
3        with open(cache_file, 'r') as f:
            content = f.read()
        process_page(content)
    else:
4        response = requests.get(base_url, params=it_params)
5        if response.status_code == 200:
6            content = response.text
7            if re.search('The requested URL was rejected.', content) is None:
8                with open(cache_file, 'wb') as f:
                    f.write(response.content)
9            process_page(content)

1: defines the cache file name out of the current URL’s fr parameter (the offset of the first record in the page from the first record in the hit list).
2: checks if the cache file already exists
3: if it does, the content should be read from the file, then process it with process_page() method.
4: otherwise we should request the page from the IT server
5: if it returns with HTTP status code 200 – which means the communication went well
6: extracts the content
7: however, the IT web server also returns code 200 even though there are some errors, for example if we issue too many requests (which is now prevented by the caching mechanism). During the development we met only one error message. re.search(pattern, string) returns mathing objects if pattern is found in the string, otherwise it returns None.
8: if it doesn’t find the error message, the content will be saved to the cache file.
9: processes it with process_page() method.

def process_page(content):
1    doc = lxml.html.fromstring(content)
2    extract_translations(doc)
3    from_param = extract_next_link(doc)
4    if from_param != None:
5        it_params['fr'] = from_param
6        request_page(it_params)

1: the lxml library parses the document into an internal object
2: extract_translations() method will use it to extract the records from this object
3: extract_next_link() method extracts the above mentioned from (fr) parameter from the “next” link on the page.
4: if it is not None
5: update the parameter holder
6: and call request_page() to process the next page.

So these two functions call each other until there is a “next” link on the page.

def extract_translations(doc):
1    global record_counter, authors, translators
2    items = doc.findall('body/table[@class="restable"]/tr/td[@class="res2"]', {})
    for item in items:
3        record = {'id': record_counter}
4        spans = item.findall('span')
        for span in spans:
5            key = span.get('class')
6            if key == 'sn_pub':
                subs = span.findall('span')
                for sub in subs:
7                    extract_key_value(record, sub)
            else:
8                extract_key_value(record, span)
9        record, record_authors, record_translators = normalize_record(record)
10        records.append(record)
11        authors += record_authors
        translators += record_translators
12        record_counter += 1

1: the method uses some global variables, we make it possible with the global keyword
2: extracts individual table cells with an XPath expression (explained above). For each HTML snippet…
3: creates a dictionary for the record structure, initially only with the identifier
4: all the important data elements are in <span> elements, so we have to find extract them (inside the current cell), and iterate over them
5: the semantic information is found in the class attribute, so we should save it
6: however there is a special span: sn_pub is a container of other spans, and we have to run a second iteration to extract its children
7: and 8. at the end we call the extract_key_value() method with the record structure and the current (non complex) span element. It extracts the value and stores it in the record
8: ditto
9: after processing all spans we should normalize the record with the normalize_record() method. It returns the normalized method, the list of authors, authors and translators
10: as record is a single dictionary it can be appended easily to the list of records
11: however as authors and translators are list of dictionaries, append() can not be used here, we should concatenate the lists
12: increases the record counter by one

def extract_key_value(record, span):
1    key = re.sub(r'^sn_', '', span.get('class'))
2    if key not in ['auth_name', 'auth_firstname', 'transl_name', 'transl_firstname']:
        field_names.update([key])
3    if key not in record:
        record[key] = []
4    record[key].append(span.text)

1: the key of the field is derived from the class attribute, but the ‘sn_’ string from its beginning should be removed with a regular expression. re.sub has three mandatory parameters: pattern, replacement and the input string. ^ is a building block of regular expressions, meaning the beginning of the string, so it will not match same string elsewhere (see the full syntax here)
2: author and translator names will be stored in a distinct table, but we count how many times the rest appear in the pages.
3: we expect that each field is repeatable, so if the key is not available in the dictionary, we initialize it as a list
4: and append the value (the text attribute of the span) to this list

def extract_next_link(doc):
1    next = doc.findall('body/table[@class="nav"]/tr[1]/td[@class="next"]/a', {})
2    if len(next) > 0:
3        link = next[0].get('href')
4        parsed_link = urllib.parse.urlparse(link)
5        parameters = urllib.parse.parse_qs(parsed_link.query)
6        if 'fr' in parameters and len(parameters['fr']) == 1:
7            return parameters['fr'][0]
8    return None

1: extracts the “next” link
2: it is not available on the last page, so it should be checked if it is processable
3: the fr parameter is available in the href attribute of the first (and only) “next” link
4: parses it with the urllib’s method, that returns a fixed structure: 6-item named tuple.
5: its element query contains only a string, which should be parsed with parse_qs
6: the keys in the query string are repeatable, and parse returns a dictionary. Make sure that the parameters you look for is available and has only one value
7: returns the first (and only) fr value
8: in every other cases returns a special value: None (not above we had a check on this None)

As mentioned above there are two things we should improve in the extracted record structure:

we would store the author and translators in distinct CSV files (and we will join tables only during data analysis)
the values of the other fields are also lists, however we do not necessary need them as list, and in CSV it is not easy to handle lists in cells

def normalize_record(record):
1    for key, value in record.items():
2        if key not in ['auth_name', 'auth_firstname', 'transl_name', 'transl_firstname']:
3            record[key] = ', '.join(value)

4    authors = extract_names(record, 'auth_name', 'auth_firstname')
    translators = extract_names(record, 'transl_name', 'transl_firstname')

5    return record, authors, translators

1: iterates over the key-value pairs of the dictionary
2: if the key is not one of the author and translator name related field
3: “flatten” it by concatenate the value with comma (and space)
4: handles author and translator names with the extract_names() method, that returns the list of authors, and the list of translators
5: returns the cleared record, and the two name lists

The extract_names() method however is not that simple as it seems to be. There are two problems.

in HTML we have sequences of last name, first name pairs, while our structure collects all last names and first names in distinct lists
sometimes the first name pair is missing. It would not be the problem if it would be the last element, but there are cases when an ‘et al.’ (as last name) takes place in the middle of the list. Because of this problem we can not use the otherwise handy built-in zip function, that creates tuples out of the pairs.

def extract_names(record, lastname_key, firstname_key):
1    name_records = []
2    i = -1
3    if lastname_key in record:
4        firstnames = record.get(firstname_key, [])
5        for name in record.get(lastname_key):
6            name_record = {'rid': record['id'], 'last': name, 'first': None}
7            if name != 'et al.':
8                i += 1
9                if (len(firstnames) > i):
                    name_record['first'] = firstnames[i]
            name_records.append(name_record)

10        del record[lastname_key]
11        if len(firstnames) > 0:
            del record[firstname_key]
    
    return name_records

1: creates an empty list for the names
2: initializes a counter with -1, because 0 is the index of the first element in a list and we will increase it before using it
3: sometimes translators are missing from the records, so we check if we should process it at all
4: the first name is also missing sometimes, so we get an empty list if this is the case. If we would not add the default parameter, we would retrieve None instead which is not iterable, and raises errors when we check the length later.
5: iterates over the last names
6: creates a dictionary with None as the default first name that will be updated if there is a real first name
7: ‘et al.’ does not have first name, so we can skip it
8: in every other case it increases our counter
9: if there is a corresponding first name by the index it updates the dictionary with that value. At the end of the check the current name will be appended to the list of names.
10: as we do not need it any more in the main record structure, we delete the last name
11: and the first name as well if it exists at all.

What remained is to save the extracted data into CSV files:

create_diroutput_dir = os.path.join('raw-data', 'it')
create_dir(output_dir)

1output_file = os.path.join(output_dir, 'it.csv')
with open(output_file, 'w', encoding='utf-8') as csv_file: 
2    column_names = ['id'] + list(field_names.keys())
3    output_writer = csv.DictWriter(csv_file, fieldnames=column_names)
4    output_writer.writeheader()
5    output_writer.writerows(records)

6column_names=['rid', 'last', 'first']
7names = {'authors.csv': authors, 'translators.csv': translators}
8for file_name, data in names.items():
9    output_file = os.path.join(output_dir, file_name)
    with open(output_file, 'w', encoding='utf-8') as csv_file:
        output_writer = csv.DictWriter(csv_file, fieldnames=column_names)
        output_writer.writeheader()
        output_writer.writerows(data)

1: opens a CSV file raw-data/it/it.csv using the safe os.path.join function
2: the column names are ‘id’ and the names we extracted from the class attributes
3: CSV writer’s API is similar to other file writers in Python. The csv library provides a number of functions, here we use the DictWriter that allows us to turn the list of dictionaries to CSV. It has two mandatory parameters: the file handler and the list of field names. All keys in the dictionaries should be available in the field name list, however they might be missing values in the records. E.g. if “location” is a key in the record dictionary, the field name list must contain it, but not every record should have the “location” key.
4: writes out the header line to the CSV file
5: writes out the individual records
6: repeats the same process for authors and translators, but here we give a fixed field name list, and not extract it from the HTML.
7: creates a dictionary for enabling the iteration. Its keys will be used as file names and the values are the name lists.
8: items() on a dictionary iterates over its key-value pairs – we can name them anyhow.
9: the rest is very similar to the first CSV writing process.

The full script is available in the repository as scripts/ch3a-web-scraping.py. The Intra-Belgian literary translations since 1970 (BELTRANS) project of KBR (the Belgian National Library) provides a more elaborated approach to the same problem in their Github repository – thanks to Sven Lieber for the pointer.