4  Validating MARC records

The validator checks each record against the MARC21 standard, including locally defined fields, which are selected by the MARC version parameter.

The issues are classified into the following categories: record, control field, data field, indicator, subfield and their subtypes.

Issue detection involves some uncertainty. Almost all library catalogues contain fields that are part of neither the MARC standard nor the catalogue's own documentation of locally defined fields (such documentation is rarely publicly available, and even when it is, it does not always cover every field). When the tool encounters an undefined field, it cannot decide whether the field is valid or invalid in that particular context. In some places the tool therefore reflects this uncertainty by providing two calculations: one that treats these fields as errors, and another that treats them as valid.
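The idea of the two calculations can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function name and the sample data are hypothetical, and it assumes that undefined fields are reported under the issue type `undefinedField` listed in the table below.

```python
from typing import Iterable

# Assumption: undefined fields are reported under this issue type name.
UNDEFINED_FIELD = "undefinedField"

def count_invalid_records(records: Iterable[list[str]]) -> tuple[int, int]:
    """Return (strict, lenient) counts of records that have issues.

    strict:  every issue counts, including undefined fields
    lenient: undefined fields are treated as valid
    """
    strict = lenient = 0
    for issues in records:
        if issues:
            strict += 1
        if any(issue != UNDEFINED_FIELD for issue in issues):
            lenient += 1
    return strict, lenient

# Hypothetical sample: one list of issue types per record.
sample = [
    [],                                 # clean record
    ["undefinedField"],                 # only an undefined field
    ["undefinedField", "invalidISBN"],  # undefined field plus another issue
]
print(count_invalid_records(sample))  # (2, 1)
```

The second record is counted as invalid only under the strict calculation, which is exactly the ambiguity described above.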

The tool detects the following issues:

machine name                     explanation

record level issues
undetectableType                 the document type is not detectable
invalidLinkage                   the linkage in field 880 is invalid
ambiguousLinkage                 the linkage in field 880 is ambiguous

control field position issues
obsoleteControlPosition          the code in the position is obsolete (it was valid in a previous version of MARC, but it is not valid now)
controlValueContainsInvalidCode  the code in the position is invalid
invalidValue                     the position value is invalid

data field issues
missingSubfield                  missing reference subfield (880$6)
nonrepeatableField               repetition of a non-repeatable field
undefinedField                   the field is not defined in the specified MARC version(s)

indicator issues
obsoleteIndicator                the indicator value is obsolete (it was valid in a previous version of MARC, but not in the current version)
nonEmptyIndicator                indicator that should be empty is non-empty
invalidValue                     the indicator value is invalid

subfield issues
undefinedSubfield                the subfield is undefined in the specified MARC version(s)
invalidLength                    the length of the value is invalid
invalidReference                 the reference to the classification vocabulary is invalid
patternMismatch                  content does not match the patterns specified by the standard
nonrepeatableSubfield            repetition of a non-repeatable subfield
invalidISBN                      invalid ISBN value
invalidISSN                      invalid ISSN value
unparsableContent                the value of the subfield is not well-formed according to its specification
nullCode                         null subfield code
invalidValue                     invalid subfield value

Usage:

java -cp $JAR de.gwdg.metadataqa.marc.cli.Validator [options] <file>

or with a bash script

./validator [options] <file>

or

catalogues/<catalogue>.sh validate

or

./qa-catalogue --params="[options]" validate

options:

Outputs:

* count.csv: the count of bibliographic records in the source dataset

total
1192536

* the number of issues per category:

id,category,instances,records
2,control field,994241,313960
3,data field,12,12
4,indicator,5990,5041
5,subfield,571,555

* the number of issues per type (the subtypes of the categories):

id,categoryId,category,type,instances,records
5,2,control field,"invalid code",951,541
6,2,control field,"invalid value",993290,313733
8,3,data field,"repetition of non-repeatable field",12,12
10,4,indicator,"obsolete value",1,1
11,4,indicator,"non-empty indicator",33,32
12,4,indicator,"invalid value",5956,5018
13,5,subfield,"undefined subfield",48,48
14,5,subfield,"invalid length",2,2
15,5,subfield,"invalid classification reference",2,2
16,5,subfield,"content does not match any patterns",286,275
17,5,subfield,"repetition of non-repeatable subfield",123,120
18,5,subfield,"invalid ISBN",5,3
19,5,subfield,"invalid ISSN",105,105

* issue-summary.csv: the individual issues with basic statistics

id,MarcPath,categoryId,typeId,type,message,url,instances,records
53,008/33-34 (008map33),2,5,invalid code,'b' in 'b ',https://www.loc.gov/marc/bibliographic/bd008p.html,1,1
70,008/00-05 (008all00),2,5,invalid code,Invalid content: '2023  '. Text '2023  ' could not be parsed at index 4,https://www.loc.gov/marc/bibliographic/bd008a.html,1,1
28,008/22-23 (008map22),2,6,invalid value,| ,https://www.loc.gov/marc/bibliographic/bd008p.html,12,12
19,008/31 (008book31),2,6,invalid value, ,https://www.loc.gov/marc/bibliographic/bd008b.html,1,1
17,008/29 (008book29),2,6,invalid value, ,https://www.loc.gov/marc/bibliographic/bd008b.html,1,1

* the issues found in each record, by record identifier:

recordId,errors
99117335059205508,1:2;2:1;3:1
99117335059305508,1:1
99117335059405508,2:2
99117335059505508,3:1

1:2;2:1;3:1 means that three different types of issues occurred in the record: the issue with ID 1 occurred twice, the issue with ID 2 once, and the issue with ID 3 once. The issue IDs can be resolved from the first column of the issue-summary.csv file.
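As an illustration of this encoding, here is a short Python sketch that turns the value of the errors column into a mapping from issue ID to number of instances. The function name is hypothetical and not part of the tool.

```python
def parse_errors(errors: str) -> dict[int, int]:
    """Map issue ID -> number of instances for one record."""
    counts: dict[int, int] = {}
    for pair in errors.split(";"):
        if not pair:
            continue  # tolerate an empty value or a trailing semicolon
        issue_id, instances = pair.split(":")
        counts[int(issue_id)] = int(instances)
    return counts

print(parse_errors("1:2;2:1;3:1"))  # {1: 2, 2: 1, 3: 1}
```

Each key of the result can then be looked up in the first column of issue-summary.csv.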

* the same listing in normalized form, one row per record and issue:

id,errorId,instances
99117335059205508,1,2
99117335059205508,2,1
99117335059205508,3,1
99117335059305508,1,1
99117335059405508,2,2
99117335059505508,3,1

* the number of issue instances and records by record type:

type,instances,records
0,0,251
1,1711,848
2,413,275

where the types are:

- 0: records without errors
- 1: records with any kind of error
- 2: records with errors excluding invalid field errors

* the record identifiers per issue:

errorId,recordIds
1,99117329355705508;99117328948305508;99117334968905508;99117335067705508;99117335176005508;...

An example of the parameters, here those used for analysing a PICA dataset. When an input parameter is a complex expression, it is displayed in parsed form. The output also contains some metadata, such as the versions of the MQAF API and QA catalogue.

{
  "args":["/path/to/input.dat"],
  "marcVersion":"MARC21",
  "marcFormat":"PICA_NORMALIZED",
  "dataSource":"FILE",
  "limit":-1,
  "offset":-1,
  "id":null,
  "defaultRecordType":"BOOKS",
  "alephseq":false,
  "marcxml":false,
  "lineSeparated":false,
  "trimId":true,
  "outputDir":"/path/to/_output/k10plus_pica",
  "recordIgnorator":{
    "criteria":[],
    "booleanCriteria":null,
    "empty":true
  },
  "recordFilter":{
    "criteria":[],
    "booleanCriteria":{
      "op":"AND",
      "children":[
        {
          "op":null,
          "children":[],
          "value":{
            "path":{
              "path":"002@.0",
              "tag":"002@",
              "xtag":null,
              "occurrence":null,
              "subfields":{"type":"SINGLE","input":"0","codes":["0"]},
              "subfieldCodes":["0"]
            },
            "operator":"NOT_MATCH",
            "value":"^L"
          }
        },
        {"op":null,"children":[],"value":{"path":{"path":"002@.0","tag":"002@","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"0","codes":["0"]},"subfieldCodes":["0"]},"operator":"NOT_MATCH","value":"^..[iktN]"}},
        {"op":"OR","children":[{"op":null,"children":[],"value":{"path":{"path":"002@.0","tag":"002@","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"0","codes":["0"]},"subfieldCodes":["0"]},"operator":"NOT_MATCH","value":"^.v"}},{"op":null,"children":[],"value":{"path":{"path":"021A.a","tag":"021A","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"a","codes":["a"]},"subfieldCodes":["a"]},"operator":"EXIST","value":null}}],"value":null}
      ],
      "value":null
    },
    "empty":false
  },
  "ignorableFields":{
    "fields":["001@","001E","001L","001U","001U","001X","001X","002V","003C","003G","003Z","008G","017N","020F","027D","031B","037I","039V","042@","046G","046T","101@","101E","101U","102D","201E","201U","202D"],
    "empty":false
  },
  "stream":null,
  "defaultEncoding":null,
  "alephseqLineType":null,
  "picaIdField":"003@$0",
  "picaSubfieldSeparator":"$",
  "picaSchemaFile":null,
  "picaRecordTypeField":"002@$0",
  "schemaType":"PICA",
  "groupBy":null,
  "detailsFileName":"issue-details.csv",
  "summaryFileName":"issue-summary.csv",
  "format":"COMMA_SEPARATED",
  "ignorableIssueTypes":["FIELD_UNDEFINED"],
  "pica":true,
  "replacementInControlFields":null,
  "marc21":false,
  "mqaf.version":"0.9.2",
  "qa-catalogue.version":"0.7.0-SNAPSHOT"
}

* id-groupid.csv: pairs of record identifier and group identifier

id,groupId
010000011,0
010000011,77
010000011,2035
010000011,70
010000011,20

Currently, validation detects the following errors:

Leader specific errors:

Control field specific errors:

Data field specific errors:

Errors of specific fields:

An example:

Error in '   00000034 ': 
  110$ind1 has invalid code: '2'
Error in '   00000056 ': 
  110$ind1 has invalid code: '2'
Error in '   00000057 ': 
  082$ind1 has invalid code: ' '
Error in '   00000086 ': 
  110$ind1 has invalid code: '2'
Error in '   00000119 ': 
  700$ind1 has invalid code: '2'
Error in '   00000234 ': 
  082$ind1 has invalid code: ' '
Errors in '   00000294 ': 
  050$ind2 has invalid code: ' '
  260$ind1 has invalid code: '0'
  710$ind2 has invalid code: '0'
  710$ind2 has invalid code: '0'
  710$ind2 has invalid code: '0'
  740$ind2 has invalid code: '1'
Error in '   00000322 ': 
  110$ind1 has invalid code: '2'
Error in '   00000328 ': 
  082$ind1 has invalid code: ' '
Error in '   00000374 ': 
  082$ind1 has invalid code: ' '
Error in '   00000395 ': 
  082$ind1 has invalid code: ' '
Error in '   00000514 ': 
  082$ind1 has invalid code: ' '
Errors in '   00000547 ': 
  100$ind2 should be empty, it has '0'
  260$ind1 has invalid code: '0'
Errors in '   00000571 ': 
  050$ind2 has invalid code: ' '
  100$ind2 should be empty, it has '0'
  260$ind1 has invalid code: '0'
...

4.0.0.1 Post-processing the validation result (validate-sqlite)

Usage:

catalogues/<catalogue>.sh validate-sqlite

or

./qa-catalogue --params="[options]" validate-sqlite

or

./common-script [options] validate-sqlite

[options] are the same as for validation

4.0.0.1.1 Catalogue for a single library

If the data is not grouped by library (no --groupBy <path> parameter), it creates the following SQLite3 database structure and imports some of the CSV files into it:

issue_summary table for the issue-summary.csv:

Each row represents a particular type of error

id         INTEGER,  -- identifier of the error
MarcPath   TEXT,     -- the location of the error in the bibliographic record
categoryId INTEGER,  -- the identifier of the category of the error
typeId     INTEGER,  -- the identifier of the type of the error
type       TEXT,     -- the description of the type
message    TEXT,     -- extra contextual information 
url        TEXT,     -- the url of the definition of the data element
instances  INTEGER,  -- the number of instances of this error
records    INTEGER   -- the number of records in which this error occurred

issue_details table for the issue-details.csv:

Each row represents how many instances of an error occur in a particular bibliographic record

id         TEXT,    -- the record identifier
errorId    INTEGER, -- the error identifier (-> issue_summary.id)
instances  INTEGER  -- the number of instances of an error in the record
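
The two tables can be joined on issue_details.errorId = issue_summary.id. The following Python sketch shows such a query with the standard sqlite3 module. It is only an illustration: it uses an in-memory database as a stand-in for the generated database file, the rows are made up rather than real validation output, and the function name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the generated database file
conn.executescript("""
CREATE TABLE issue_summary (
  id INTEGER, MarcPath TEXT, categoryId INTEGER, typeId INTEGER,
  type TEXT, message TEXT, url TEXT, instances INTEGER, records INTEGER
);
CREATE TABLE issue_details (id TEXT, errorId INTEGER, instances INTEGER);
""")

# Illustrative rows only, not real validation output.
conn.executemany(
    "INSERT INTO issue_summary VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "110$ind1", 4, 12, "invalid value", "2", None, 3, 2),
        (2, "082$ind1", 4, 12, "invalid value", " ", None, 1, 1),
    ],
)
conn.executemany(
    "INSERT INTO issue_details VALUES (?, ?, ?)",
    [("rec-a", 1, 2), ("rec-a", 2, 1), ("rec-b", 1, 1)],
)

def issues_of_record(record_id: str) -> list[tuple[str, str, int]]:
    """List (MarcPath, type, instances) for one record."""
    return conn.execute(
        """
        SELECT s.MarcPath, s.type, d.instances
        FROM issue_details AS d
        JOIN issue_summary AS s ON s.id = d.errorId
        WHERE d.id = ?
        ORDER BY s.id
        """,
        (record_id,),
    ).fetchall()

print(issues_of_record("rec-a"))
# [('110$ind1', 'invalid value', 2), ('082$ind1', 'invalid value', 1)]
```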

4.0.0.1.2 Union catalogue for multiple libraries

If the dataset is a union catalogue, and the record contains a subfield for the libraries holding the item (the --groupBy <path> parameter is set), it creates the following SQLite3 database structure and imports some of the CSV files into it:

issue_summary table for the issue-summary.csv (it is similar to the other issue_summary table, but it has an extra groupId column)

groupId    INTEGER,
id         INTEGER,
MarcPath   TEXT,
categoryId INTEGER,
typeId     INTEGER,
type       TEXT,
message    TEXT,
url        TEXT,
instances  INTEGER,
records    INTEGER

issue_details table (same as the other issue_details table)

id         TEXT,
errorId    INTEGER,
instances  INTEGER

id_groupid table for id-groupid.csv:

id         TEXT,
groupId    INTEGER

issue_group_types table contains statistics for the error types per group:

groupId    INTEGER,
typeId     INTEGER,
records    INTEGER,
instances  INTEGER
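
To make the meaning of the records and instances columns concrete, the following Python sketch computes the same kind of per-group, per-type statistics with a plain SQL aggregation. This is an assumption-laden illustration, not the way the tool populates the table: the issue_summary table is reduced to two columns, the rows are made up, and the reading of records as the number of distinct records and instances as the sum of instances is inferred from the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE issue_summary (id INTEGER, typeId INTEGER);  -- other columns omitted
CREATE TABLE issue_details (id TEXT, errorId INTEGER, instances INTEGER);
CREATE TABLE id_groupid (id TEXT, groupId INTEGER);
""")

# Illustrative rows only, not real validation output.
conn.executemany("INSERT INTO issue_summary VALUES (?, ?)", [(1, 12), (2, 12), (3, 18)])
conn.executemany(
    "INSERT INTO issue_details VALUES (?, ?, ?)",
    [("rec-a", 1, 2), ("rec-a", 2, 1), ("rec-b", 3, 1)],
)
conn.executemany(
    "INSERT INTO id_groupid VALUES (?, ?)",
    [("rec-a", 77), ("rec-b", 77), ("rec-b", 20)],
)

# One row per (group, type): distinct records affected and total instances.
rows = conn.execute("""
    SELECT g.groupId, s.typeId,
           COUNT(DISTINCT d.id) AS records,
           SUM(d.instances)     AS instances
    FROM issue_details AS d
    JOIN issue_summary AS s ON s.id = d.errorId
    JOIN id_groupid   AS g ON g.id = d.id
    GROUP BY g.groupId, s.typeId
    ORDER BY g.groupId, s.typeId
""").fetchall()
print(rows)  # [(20, 18, 1, 1), (77, 12, 1, 3), (77, 18, 1, 1)]
```

In group 77, record rec-a has two different issues of type 12, so it counts as one record with three instances.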

issue_group_categories table contains statistics for the error categories per group:

groupId    INTEGER,
categoryId INTEGER,
records    INTEGER,
instances  INTEGER

issue_group_paths table contains statistics for the error types per path per group:

groupId    INTEGER,
typeId     INTEGER,
path       TEXT,
records    INTEGER,
instances  INTEGER

For union catalogues it also creates an extra Solr index with the suffix _validation. It contains one Solr document for each bibliographic record with three fields: the record identifier, the list of group identifiers and the list of error identifiers (if any). This Solr index is needed for populating the issue_group_types, issue_group_categories and issue_group_paths tables. This index will be ingested into the main Solr index.