• A Metadata Quality Assurance Framework — My research project plan.
  • MARC21 structure in JSON

    This page displays the MARC21 structure as formatted JSON. It is generated from the Java classes that describe the structure. Some attributes are not yet exported to JSON; this is a work in progress. If you miss something, please let me know via the issue tracker. Credits: the JSON schema was requested by Jakob Voß and Johann Rolschewski.

    A raw version is available here.

  • Running MARC21 analysis in Spark

    So far I have worked with MARC21 files in a standalone manner. Now it is time to run the MARC analysis in Apache Spark. Here I describe only the first steps; the tool cannot yet run every analysis that is possible in standalone mode.

  • Self-descriptive MARC21 codes

    The project’s main purpose is to evaluate and assess the quality of MARC21 records. Since I am not one of those cataloguers who know the whole MARC21 standard by heart, it turned out that I can do this more effectively if I create a map which turns the purely technical codes into more understandable, self-descriptive codes.
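
    A minimal sketch of what such a map could look like, assuming records are flattened into `tag$subfield` keys. The readable labels and the `describe` helper are hypothetical and chosen only for illustration; they are not the names used by the project.

```python
# Hypothetical mapping from MARC21 tag/subfield codes to readable labels.
# The codes are standard MARC21 (100$a personal name, 245$a title,
# 245$c statement of responsibility, 260$b publisher); the labels are
# illustrative assumptions, not the project's actual names.
SELF_DESCRIPTIVE = {
    "100$a": "MainPersonalName_personalName",
    "245$a": "Title_mainTitle",
    "245$c": "Title_responsibilityStatement",
    "260$b": "Publication_publisher",
}


def describe(record):
    """Return a copy of a flat record with known codes replaced by labels.

    Codes missing from the map are kept unchanged, so no data is lost.
    """
    return {SELF_DESCRIPTIVE.get(code, code): value
            for code, value in record.items()}


record = {"245$a": "Moby Dick", "100$a": "Melville, Herman", "999$x": "local"}
readable = describe(record)
```

    Keeping unmapped codes as they are means the map can grow gradually without breaking the analysis of fields that have no label yet.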

  • Recent updates

    In recent months we have had the opportunity to present our research project at several conferences.

  • Report Number Four (Making General)

    In recent weeks I have been working on making the framework truly general and reusable in other projects. The initiative came from Europeana, who would like to build the measurement into their ingestion service. It was a good occasion to separate the different concerns in the source code.

  • A poor man's join in Apache Spark

    There are API functions for joining two CSV files in Apache Spark, but it turned out that they require more powerful machines than I have access to, so I had to use some tricks to achieve the goal.

  • Report Number Three (Cassandra, Spark and Solr)

    Changing from Hadoop to Spark, refining mandatory calculation, adding field statistics, storing records in Cassandra, indexing with Solr and calculating uniqueness.

  • Report Number Two (Finishing harvest)

    The second report describes the harvesting process, with tips and tricks, the running of the full ‘completeness’ analyses, and the first visible results.

  • The strange APIs of The Metropolitan Museum, New York

    Last Friday I heard a presentation by Fernando Pérez, who is, among many other things, one of the founders of the Berkeley Institute for Data Science (BIDS for short). After his presentation I spent some time on the BIDS website, investigating their projects. One of those is rOpenSci, a community for “transforming science through open data”. They create R packages in several scientific domains.

  • Report Number One (Baby steps)

    I have started working on the implementation of my plan [1], and I have now attached an image produced by it.