• A Metadata Quality Assurance Framework — My research project plan.
  • Report Number Four (Making General)

    Over the last few weeks I have been working on making the framework truly general and reusable in other projects. The initiative came from Europeana, who would like to build the measurement into their ingestion service. It was a good occasion to separate the different concerns of the source code.

  • A poor man's join in Apache Spark

    There are API functions for joining two CSV files in Apache Spark, but it turned out that they require more robust machines than I have access to, so I had to use some tricks to achieve the goal.

  • Report Number Three (Cassandra, Spark and Solr)

    Changing from Hadoop to Spark, refining mandatory calculation, adding field statistics, storing records in Cassandra, indexing with Solr and calculating uniqueness.

  • Report Number Two (Finishing harvest)

    The second report describes the harvesting process, with tips and tricks, the running of the full ‘completeness’ analyses, and the first visible results.

  • The strange APIs of The Metropolitan Museum, New York

    Last Friday I heard a presentation by Fernando Pérez, who is, among many other things, one of the founders of the Berkeley Institute for Data Science (BIDS for short). After his presentation I spent some time on the BIDS website, investigating their projects. One of those is ROpenSci, a community for “transforming science through open data”. They create R packages in several scientific domains.

  • Report Number One (Baby steps)

    I have started working on the implementation of my plan [1], and I have now attached an image that resulted from it.