Metadata Quality Assessment Framework

A Metadata Quality Assurance Framework — My research project plan.
Library APIs
Libraries provide different application programming interfaces (APIs) that allow users to access library data programmatically, e.g. via a script. The APIs can be implementations of widely used specifications, which are available in many libraries, or custom interfaces, which are available only at particular institutions. APIs have three important layers:
Running QA catalogue with Docker
The QA catalogue (more information about this software at the end of this post) can now be run with Docker. Once you have Docker on your machine, you can run the full end-to-end quality assessment process (including building the web user interface) with the following two commands:
Measuring subject term usage in bibliographic records
The following text is one of my current research plans. I submitted it to Koninklijke Bibliotheek’s (the Dutch National Library) Researcher-in-residence programme (without success). From this version I removed some administrative and contact details. I continue to work on this plan, and will present some preliminary results at SWIB 2019 (2019-11-27 11:05 - 11:30, if you attend the conference). Any feedback or suggestions are welcome!
Incremental Solr indexing
In the Metadata Quality Assurance Framework, as a first step, we create some CSV files. The structure of the CSV (in the case of Europeana) looks like this:
```
[record ID],[dataset ID],[data provider ID],score1,score2,...scoreN
[record ID],[dataset ID],[data provider ID],score1,score2,...scoreN
...
```
where each score belongs to a field- or record-level quality metric. There are several measurements (such as frequency and cardinality of fields, multilinguality, language distribution, metadata anti-pattern detection, uniqueness), each of which produces such CSV files.

Once we have them, we run statistical analyses on these files to get a collection-level overview, and we present the result in a web UI. From the web UI it is not easy to go back to the records, or at least to figure out which records have a given value (or range of values). We have received feedback from the Data Quality Committee members, most recently from Tom Miles (British Library), that in some cases it would be necessary to check the records themselves, e.g. ‘which records contain values in Dutch?’, ‘which records have lots of dc:subject values?’ etc.
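The lookup described above can be sketched in plain Python. This is only a minimal illustration, assuming every score column is numeric; the sample rows, the function names `parse_scores` and `records_in_range`, and the column positions are all hypothetical and not taken from the framework.

```python
import csv
import io


def parse_scores(csv_text):
    """Yield (record_id, dataset_id, provider_id, scores) for each CSV row."""
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        record_id, dataset_id, provider_id, *raw_scores = row
        yield record_id, dataset_id, provider_id, [float(v) for v in raw_scores]


def records_in_range(csv_text, score_index, low, high):
    """Return the IDs of records whose score at score_index lies in [low, high]."""
    return [
        record_id
        for record_id, _, _, scores in parse_scores(csv_text)
        if low <= scores[score_index] <= high
    ]


# Hypothetical sample: the second score stands in for a dc:subject count.
SAMPLE = """rec1,ds1,prov1,0.5,3,1
rec2,ds1,prov1,0.9,12,0
rec3,ds2,prov2,0.7,8,1
"""

if __name__ == "__main__":
    # Which records have at least 8 values in the (assumed) dc:subject column?
    print(records_in_range(SAMPLE, 1, 8, 100))
```

A linear scan like this is enough for a small file; for a full collection a search index would be the more natural choice.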
Report about LDCX 2018 (in Hungarian)
Thanks to Christina Harlow I had the privilege of participating in the Stanford University Library’s yearly developer unconference, LDCX 2018. I wrote a report about it in Hungarian. An English version is coming soon.
MARC21 structure in JSON
This page displays the MARC21 structure as formatted JSON. It is generated from the Java classes which describe the structure. The structure follows the Avram JSON schema, which was created by Jakob Voß in order to provide a common ground for MARC21, PICA and other metadata standards. Some attributes are not exported to JSON because they are specific to MARC and are not available elsewhere, but both this export and Avram are works in progress. If you miss something, please let me know via the issue tracker. Credits: the JSON schema was originally requested by Jakob Voß and Johann Rolschewski; Avram, as far as I understand, grew out of this discussion.

A raw version is available here.
Running MARC21 analysis in Spark
So far I have worked with MARC21 files in a standalone manner. Now it is time to run MARC analysis in Apache Spark. Here I describe only the first steps; the tool is not yet able to run all the analyses that are possible in standalone mode.
Self-descriptive MARC21 codes
The project’s main purpose is to evaluate and assess the quality of MARC21 records. Since I am not one of those cataloguers who know the whole MARC21 standard by heart, it turned out that I can do this more effectively if I create a map which turns the purely technical codes into more understandable, self-descriptive codes.
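The idea of such a map can be sketched as a simple lookup table. The tags and subfield meanings below come from the MARC21 bibliographic format, but the self-descriptive labels and the function names are hypothetical; the project's actual naming may well differ.

```python
# Hypothetical self-descriptive labels keyed by (MARC21 tag, subfield code).
# Only the tag/subfield meanings come from the MARC21 standard.
SELF_DESCRIPTIVE = {
    ("020", "a"): "Isbn",
    ("100", "a"): "MainPersonalName",
    ("245", "a"): "Title_mainTitle",
    ("245", "b"): "Title_subtitle",
    ("245", "c"): "Title_responsibilityStatement",
    ("650", "a"): "Topic_topicalTerm",
}


def describe(tag, subfield):
    """Return the readable label, or the raw 'tag$subfield' code if unmapped."""
    return SELF_DESCRIPTIVE.get((tag, subfield), f"{tag}${subfield}")


def relabel(record):
    """Rename the 'tag$subfield' keys of a flat record to readable labels."""
    relabelled = {}
    for key, value in record.items():
        tag, _, subfield = key.partition("$")
        relabelled[describe(tag, subfield)] = value
    return relabelled


if __name__ == "__main__":
    print(relabel({"245$a": "Moby Dick", "100$a": "Melville, Herman"}))
```

Falling back to the raw code for unmapped fields keeps the output usable while the map is still incomplete.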
Recent updates
In recent months we had the opportunity to present our research project at several conferences.
Report Number Four (Making General)
In the last weeks I have been working on making the framework truly general and reusable in other projects. The initiative came from Europeana, who would like to build the measurement into their ingestion service. It was a good occasion to separate the different concerns of the source code.
A poor man's join in Apache Spark
There are API functions for joining two CSV files in Apache Spark, but it turned out that they require more robust machines than I have access to, so I had to use some tricks to achieve the goal.
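The excerpt does not say which tricks were used, so the following is not the author's method. It is only a plain-Python sketch of what joining two CSV files on a shared key means, written as a simple hash join and assuming the key is the first column of both files; the sample data and the name `hash_join` are hypothetical.

```python
import csv
import io


def hash_join(left_text, right_text):
    """Inner-join two CSV texts on their first column (assumed to be the key)."""
    # Build an in-memory index of the right-hand file: key -> remaining columns.
    index = {}
    for row in csv.reader(io.StringIO(right_text)):
        if row:
            index[row[0]] = row[1:]

    # Stream the left-hand file and append the matching right-hand columns.
    joined = []
    for row in csv.reader(io.StringIO(left_text)):
        if row and row[0] in index:
            joined.append(row + index[row[0]])
    return joined


# Hypothetical inputs: record ID plus a score, and record ID plus a language.
LEFT = "rec1,0.5\nrec2,0.9\n"
RIGHT = "rec2,nl\nrec3,en\n"

if __name__ == "__main__":
    for row in hash_join(LEFT, RIGHT):
        print(",".join(row))
```

This approach only works when one of the two files fits in memory, which is exactly the kind of constraint that motivates looking for workarounds on modest hardware.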
Report Number Three (Cassandra, Spark and Solr)
Changing from Hadoop to Spark, refining mandatory calculation, adding field statistics, storing records in Cassandra, indexing with Solr and calculating uniqueness.
Report Number Two (Finishing harvest)
The second report describes the harvesting process with tips and tricks, the running of the full ‘completeness’ analyses, and the first visible results.
The strange APIs of The Metropolitan Museum, New York

Last Friday I heard a presentation by Fernando Pérez, who is, among many other things, one of the founders of the Berkeley Institute for Data Science (BIDS for short). After his presentation I spent some time on the BIDS website, investigating their projects. One of those is ROpenSci, a community for “transforming science through open data”. They create R packages in several scientific domains.
Report Number One (Baby steps)
I have started working on the implementation of my plan [1], and I have now attached an image that resulted from it.