• A Metadata Quality Assurance Framework — My research project plan.
  • Incremental Solr indexing

    In the Metadata Quality Assurance Framework, the first step is to create CSV files. The structure of the CSV (in the case of Europeana) looks like this:

    [record ID],[dataset ID],[data provider ID],score1,score2,...scoreN
    [record ID],[dataset ID],[data provider ID],score1,score2,...scoreN
    ...
    

    where each score belongs to a field- or record-level quality metric. There are several measurements (such as frequency and cardinality of fields, multilinguality, language distribution, metadata anti-pattern detection, and uniqueness), each producing such CSV files.

    Once we have them, we run statistical analyses on these files to get a collection-level overview, and we present the results in a web UI. From the web UI it is not easy to go back to the records, or at least to figure out which records have a given value (or range of values). We have received feedback from Data Quality Committee members – most recently from Tom Miles (British Library) – that in some cases it would be necessary to check the records themselves, e.g. ‘which records contain values in Dutch?’, ‘which records have lots of dc:subject values?’ etc. Indexing the scores in Solr would let us answer such questions directly; a sketch follows below.
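
    One way to make the scores queryable at record level is to index each CSV row into Solr, one document per record, with one field per metric. Below is a minimal sketch using SolrJ; the core name, the dynamic field names and the CSV path are my assumptions for illustration, not the framework's actual configuration.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch: index per-record quality scores into Solr with SolrJ.
    // Core name, field names and file path are illustrative assumptions.
    public class ScoreIndexer {
      public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/europeana").build();
        try (BufferedReader reader = new BufferedReader(new FileReader("scores.csv"))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] cells = line.split(",");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", cells[0]);          // [record ID]
            doc.addField("dataset_s", cells[1]);   // [dataset ID]
            doc.addField("provider_s", cells[2]);  // [data provider ID]
            // one dynamic float field per quality metric
            for (int i = 3; i < cells.length; i++) {
              doc.addField("score" + (i - 2) + "_f", Float.parseFloat(cells[i]));
            }
            solr.add(doc);
          }
        }
        solr.commit();
        solr.close();
      }
    }

    With such an index, a question like ‘which records have lots of dc:subject values?’ becomes a simple range query on the corresponding score field, e.g. q=score7_f:[5 TO *] (assuming score7 held the dc:subject cardinality).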

  • Report about LDCX 2018 (in Hungarian)

    Thanks to Christina Harlow I had the privilege of participating in Stanford University Library’s yearly developer unconference, LDCX 2018. I wrote a report about it in Hungarian; an English version is coming soon.

  • MARC21 structure in JSON

    This page displays the MARC21 structure as formatted JSON. It is generated from the Java classes which describe the structure. The structure follows the Avram JSON schema, which was created by Jakob Voß in order to create a common ground for MARC21, PICA and other metadata standards (a simplified snippet follows below). There are some more attributes which are not exported into the JSON format, because they are specific to MARC and not available elsewhere, but both this export and Avram are works in progress. If you miss something, please let me know via the issue tracker. Credits: the JSON schema was originally requested by Jakob Voß and Johann Rolschewski; Avram – as far as I understand – grew out of this discussion.

    A raw version is available here.
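
    To give an impression of the shape, here is a simplified, hand-written Avram-style snippet for a single MARC field; it is illustrative only, so please consult the Avram specification for the authoritative set of keys.

    {
      "title": "MARC 21 Format for Bibliographic Data",
      "fields": {
        "245": {
          "tag": "245",
          "label": "Title Statement",
          "repeatable": false,
          "indicator1": {
            "label": "Title added entry",
            "codes": {
              "0": { "label": "No added entry" },
              "1": { "label": "Added entry" }
            }
          },
          "subfields": {
            "a": { "label": "Title", "repeatable": false },
            "c": { "label": "Statement of responsibility, etc.", "repeatable": false }
          }
        }
      }
    }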

  • Running MARC21 analysis in Spark

    So far I have worked with MARC21 files in a standalone manner. Now it is time to run the MARC analysis in Apache Spark. Here I describe only the first steps; the tool is not yet able to run all the analyses which are possible in the standalone version.
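
    As a first illustration of what ‘running it in Spark’ means, here is a minimal Java sketch that only loads the records and counts them; the input path and the one-record-per-line assumption are placeholders, and a real analysis would parse each record with a MARC library.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Sketch: a first Spark step over MARC records stored one per line
    // (e.g. MARC-in-JSON); submit with spark-submit, which sets the master.
    public class MarcSparkFirstSteps {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("marc-analysis");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          JavaRDD<String> records = sc.textFile("hdfs:///marc/records.json");
          // a real job would parse each line and emit field-level scores
          long count = records.filter(line -> !line.trim().isEmpty()).count();
          System.out.println("number of records: " + count);
        }
      }
    }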

  • Self-descriptive MARC21 codes

    The project’s main purpose is to evaluate and assess the quality of MARC21 records. It turned out that – since I am not one of those cataloguers who know the whole MARC21 standard by heart – I can work more effectively if I create a map which turns the purely technical codes into more understandable, self-descriptive codes.
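
    The idea, in a small sketch (the readable names below are made up for illustration and are not the project’s actual mapping):

    import java.util.Map;

    // Sketch: resolve opaque MARC codes to self-descriptive names.
    public class SelfDescriptiveCodes {

      // illustrative entries only; a full map covers the whole standard
      private static final Map<String, String> CODES = Map.of(
          "245$a", "TitleStatement$title",
          "245$c", "TitleStatement$responsibilityStatement",
          "100$a", "MainEntryPersonalName$personalName",
          "020$a", "Isbn$isbn"
      );

      public static String resolve(String marcCode) {
        return CODES.getOrDefault(marcCode, marcCode);
      }

      public static void main(String[] args) {
        System.out.println(resolve("245$a")); // -> TitleStatement$title
      }
    }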

  • Recent updates

    In the last months we have had the opportunity to present our research project at different conferences.

  • Report Number Four (Making General)

    In the last weeks I have been working on making the framework truly general and reusable in other projects. The initiative came from Europeana: they would like to build the measurements into their ingestion service. It was a good occasion to separate the different concerns of the source code.

  • A poor man's join in Apache Spark

    There are API functions for joining two CSV files in Apache Spark, but it turned out that they require more robust machines than I have access to, so I had to use some tricks to achieve the goal.
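
    One classic trick in this situation (possibly close to what the post describes, though the details here are my own) is a map-side join: load the smaller CSV into memory, broadcast it to the executors, and look the key up while streaming over the big file, avoiding the shuffle that a regular join triggers.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;

    // Sketch: a "poor man's join" via a broadcast variable. Assumes
    // small.csv fits into driver memory and both files are keyed by
    // the record ID in the first column.
    public class PoorMansJoin {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("poor-mans-join");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          Map<String, String> small = new HashMap<>();
          for (String line : sc.textFile("small.csv").collect()) {
            String[] cells = line.split(",", 2);
            small.put(cells[0], cells[1]);
          }
          Broadcast<Map<String, String>> lookup = sc.broadcast(small);

          JavaRDD<String> joined = sc.textFile("big.csv")
              .map(line -> {
                String id = line.split(",", 2)[0];
                String extra = lookup.value().getOrDefault(id, "");
                return line + "," + extra;
              });
          joined.saveAsTextFile("joined-output");
        }
      }
    }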

  • Report Number Three (Cassandra, Spark and Solr)

    Changing from Hadoop to Spark, refining the mandatory-field calculation, adding field statistics, storing records in Cassandra, indexing with Solr, and calculating uniqueness.

  • Report Number Two (Finishing harvest)

    The second report describes the harvesting process with tips and tricks, the running of the full ‘completeness’ analysis, and the first visible results.

  • The strange APIs of The Metropolitan Museum, New York

    Last Friday I heard a presentation by Fernando Pérez, who is — among many other things — one of the founders of the Berkeley Institute for Data Science (BIDS for short). After his presentation I spent some time on the BIDS website, investigating their projects. One of them is rOpenSci, a community for “transforming science through open data”. They create R packages in several scientific domains.

  • Report Number One (Baby steps)

    I have started working on the implementation of my plan [1], and I have now attached a resulting image that came from it.