So far I have worked with MARC21 files in a standalone manner. Now it is time to run the MARC analysis in Apache Spark. Here I describe only the first steps; the tool is not yet able to run in Spark all the analyses that are possible in the standalone manner.

If the source of the Spark analysis is a file, Spark reads the file in a line-by-line manner. Usual MARC21 files are big files with thousands or millions of records in a single line: the record separator is not a line-ending character but the 1d character (in hexadecimal notation). The MARC validation tool so far accepts one or more file names as input. In a Spark context we therefore have to do two things (a quick way to inspect the separator follows the list):

  • transform the MARC files so that each record is in a separate line
  • change the code to accept a binary string instead of a file name
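
You can check the separator yourself; a minimal sanity check, assuming the hexdump utility is available on your system:

# the last byte of a valid MARC21 file should be the 1d record terminator
tail -c 16 marc.mrc | hexdump -C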

The first step can be done with one line of Bash code:

sed 's/\x1d/\x1d\n/g' marc.mrc > marc-line-separated.mrc
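
To make sure the conversion worked, you can compare the number of record terminators in the original file with the number of lines in the converted one; a quick check with standard tools (tr takes octal notation, and 035 is the octal form of hexadecimal 1d):

# count the 1d record terminators in the original file
tr -cd '\035' < marc.mrc | wc -c
# count the lines of the converted file; the two numbers should match
wc -l < marc-line-separated.mrc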

If you have several files, put the command into a loop:

# derive the output name by replacing the .mrc suffix
for IN in *.mrc; do
  OUT=${IN%.mrc}-line-separated.mrc
  sed 's/\x1d/\x1d\n/g' "$IN" > "$OUT"
done

I won’t describe the process of the second step in detail; it took an hour or two to adapt the code to work in the Spark environment. From the user’s perspective the important thing is to know how to run it.

Running on a local file system

1) Make sure HADOOP_CONF_DIR is not set. If it is set, Spark will try to communicate with the Hadoop File System, and if that is not running, the whole process will stop.

echo $HADOOP_CONF_DIR

If it returns anything other than an empty line, unset it:

unset HADOOP_CONF_DIR

2) Run analysis!

spark-submit \
  --class de.gwdg.metadataqa.marc.cli.spark.ParallelValidator \
  --master local[*] \
  target/metadata-qa-marc-0.2-SNAPSHOT-jar-with-dependencies.jar \
  --format "tab-separated" \
  --fileName output \
  --marcVersion MARC21 \
  /path/to/\*-line-separated.mrc

This command has two parts: the first three arguments are for Apache Spark: the class to run, the master together with the number of cores to use (* means all available cores), and the jar file which contains the application. The rest are the standard arguments of the MARC analyzer application.

It is important to escape the asterisk with a backslash character (\*); this guarantees that the shell will not substitute it with the names of all the files matching the pattern, so Spark can expand the pattern itself.
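
You can see the difference with echo (an illustration only; the path is the same placeholder as above):

# without the backslash the shell substitutes the matching file names
echo /path/to/*-line-separated.mrc
# with the backslash the argument stays a single literal pattern
echo /path/to/\*-line-separated.mrc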

3) Retrieve output:

The output is a directory; you can merge the results into a single file with the following command:

cat output/part-* > output.csv
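
For a quick look at the merged result:

# peek at the first few lines of the output
head -3 output.csv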

Running it with Hadoop

1) Upload files to Hadoop file system:

hdfs dfs -put /path/to/\*-line-separated.mrc /marc

This will upload the files into the /marc directory of Hadoop FS.
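
You can verify the upload with a directory listing:

hdfs dfs -ls /marc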

2) Make sure that HADOOP_CONF_DIR is set (we unset it in the local file system example):

echo $HADOOP_CONF_DIR

If it returns an empty line, it means it is not set, so set it:

export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop

3) Run analysis!

spark-submit \
  --class de.gwdg.metadataqa.marc.cli.spark.ParallelValidator \
  --master local[*] \
  target/metadata-qa-marc-0.2-SNAPSHOT-jar-with-dependencies.jar \
  --format "tab-separated" \
  --fileName hdfs://localhost:54310/output \
  --marcVersion MARC21 \
  hdfs://localhost:54310/marc/\*-line-separated.mrc

4) Retrieve output:

hdfs dfs -getmerge /output output.csv

This merges the part files of the /output directory (the one given with --fileName) into a local output.csv file.