- A Metadata Quality Assurance Framework — My research project plan.
Report about LDCX 2018 (in Hungarian)
Thanks to Christina Harlow, I had the privilege of participating in the Stanford University Library’s annual developer unconference, LDCX 2018. I wrote a report about it in Hungarian. An English version is coming soon.
MARC21 structure in JSON
This page displays the MARC21 structure as formatted JSON. It is generated from the Java classes that describe the structure. The structure follows the Avram JSON schema, which was created by Jakob Voß to provide a common ground for MARC21, PICA and other metadata standards. Some attributes are not exported into the JSON because they are specific to MARC and not available elsewhere, but both this export and Avram are works in progress. If you miss something, please let me know via the issue tracker. Credits: the JSON schema was originally requested by Jakob Voß and Johann Rolschewski; Avram, as far as I understand, grew out of this discussion.
A raw version is available here.
Running MARC21 analysis in Spark
So far I have worked with MARC21 files in a standalone manner. Now it is time to run MARC analysis in Apache Spark. Here I describe only the first steps; the tool is not yet ready to run all the analyses that are possible in standalone mode.
Self-descriptive MARC21 codes
The project’s main purpose is to evaluate and assess the quality of MARC21 records. It turned out that, since I am not one of those cataloguers who know the whole MARC21 standard by heart, I can do this more effectively if I create a map that turns the purely technical codes into more understandable, self-descriptive codes.
In recent months we had the opportunity to present our research project at several conferences.
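To illustrate the idea of such a map, here is a minimal Python sketch. The label names (`Isbn`, `MainPersonalName`, `MainTitle`) and the helper functions are hypothetical and are not taken from the project; only the MARC21 tag and subfield meanings (020$a ISBN, 100$a personal name, 245$a title) come from the standard.

```python
from collections import defaultdict

# Hypothetical self-descriptive labels, keyed by (MARC21 tag, subfield code).
# The label strings are illustrative assumptions, not the project's real names.
LABELS = {
    ("020", "a"): "Isbn",
    ("100", "a"): "MainPersonalName",
    ("245", "a"): "MainTitle",
}


def describe(tag, code):
    """Return the self-descriptive label, or the raw 'tag$code' if unmapped."""
    return LABELS.get((tag, code), f"{tag}${code}")


def relabel(subfields):
    """Group subfield values under self-descriptive keys.

    `subfields` is an iterable of (tag, subfield code, value) triples.
    Repeated fields are kept, so every key maps to a list of values.
    """
    result = defaultdict(list)
    for tag, code, value in subfields:
        result[describe(tag, code)].append(value)
    return dict(result)


record = [
    ("245", "a", "Hamlet"),
    ("650", "a", "Drama"),
    ("650", "a", "Tragedy"),
]
relabelled = relabel(record)
```

Unmapped codes fall back to the technical `tag$code` form, so a partial map can still be used while it is being extended.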
Report Number Four (Making General)
In the last few weeks I have been working on making the framework truly general and reusable in other projects. The initiative came from Europeana, who would like to build the measurement into their ingestion service. It was a good occasion to separate the different concerns of the source code.
A poor man's join in Apache Spark
There are API functions for joining two CSV files in Apache Spark, but it turned out that they require more robust machines than I have access to, so I had to use some tricks to achieve the goal.
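As a rough illustration of what a low-memory join can look like, here is a plain-Python sketch of a merge join over two CSV inputs. This is not the Spark code from the post, and the technique is an assumption chosen for illustration: it requires both inputs to be sorted by their first column and the key to be unique within each input. The function name `merge_join` and the sample data are hypothetical.

```python
import csv
import io


def merge_join(left_rows, right_rows):
    """Inner-join two row iterables on their first column.

    Assumptions: both inputs are sorted by the first column, and each key
    occurs at most once per input. Only one row per input is held in memory.
    """
    left_iter, right_iter = iter(left_rows), iter(right_rows)
    left, right = next(left_iter, None), next(right_iter, None)
    while left is not None and right is not None:
        if left[0] == right[0]:
            # Keys match: emit the combined row and advance both sides.
            yield left + right[1:]
            left, right = next(left_iter, None), next(right_iter, None)
        elif left[0] < right[0]:
            left = next(left_iter, None)   # left key has no partner
        else:
            right = next(right_iter, None)  # right key has no partner


# Hypothetical sample data standing in for two CSV files.
left_csv = io.StringIO("a,1\nb,2\nd,4\n")
right_csv = io.StringIO("b,x\nc,y\nd,z\n")
joined = list(merge_join(csv.reader(left_csv), csv.reader(right_csv)))
```

Because the rows are consumed as streams, the memory footprint stays constant regardless of file size, at the cost of needing pre-sorted inputs.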
Report Number Three (Cassandra, Spark and Solr)
Changing from Hadoop to Spark, refining mandatory calculation, adding field statistics, storing records in Cassandra, indexing with Solr and calculating uniqueness.
Report Number Two (Finishing harvest)
The second report describes the harvesting process, with tips and tricks, the running of the full ‘completeness’ analyses, and the first visible results.
The strange APIs of The Metropolitan Museum, New York
Last Friday I heard a presentation by Fernando Pérez, who is, among many other things, one of the founders of the Berkeley Institute for Data Science (BIDS for short). After his presentation I spent some time on the BIDS website, investigating their projects. One of those is ROpenSci, which is a community for “transforming science through open data”. They create R packages in several scientific domains.
Report Number One (Baby steps)
I have started working on the implementation of my plan, and here I attach an image that resulted from it.