4  Validating MARC records

It validates each records against the MARC21 standard, including those local defined field, which are selected by the MARC version parameter.

The issues are classified into the following categories: record, control field, data field, indicator, subfield and their subtypes.

There is an uncertainty in the issue detection. Almost all library catalogues have fields, which are not part of the MARC standard, neither that of their documentation about the locally defined fields (these documents are rarely available publicly, and even if they are available sometimes they do not cover all fields). So if the tool meets a field which are undefined, it is impossible to decide whether it is valid or invalid in a particular context. So in some places the tool reflects this uncertainty and provides two calculations, one which handles these fields as error, and another which handles these as valid fields.

The tool detects the following issues:

machine name explanation
record level issues
undetectableType the document type is not detectable
invalidLinkage the linkage in field 880 is invalid
ambiguousLinkage the linkage in field 880 is ambiguous
control field position issues
obsoleteControlPosition the code in the position is obsolete (it was valid in a previous version of MARC, but it is not valid now)
controlValueContainsInvalidCode the code in the position is invalid
invalidValue the position value is invalid
data field issues
missingSubfield missing reference subfield (880$6)
nonrepeatableField repetition of a non-repeatable field
undefinedField the field is not defined in the specified MARC version(s)
indicator issues
obsoleteIndicator the indicator value is obsolete (it was valid in a previous version of MARC, but not in the current version)
nonEmptyIndicator indicator that should be empty is non-empty
invalidValue the indicator value is invalid
subfield issues
undefinedSubfield the subfield is undefined in the specified MARC version(s)
invalidLength the length of the value is invalid
invalidReference the reference to the classification vocabulary is invalid
patternMismatch content does not match the patterns specified by the standard
nonrepeatableSubfield repetition of a non-repeatable subfield
invalidISBN invalid ISBN value
invalidISSN invalid ISSN value
unparsableContent the value of the subfield is not well-formed according to its specification
nullCode null subfield code
invalidValue invalid subfield value
Usage
  ./qa-catalogue --params="[options]" validate
  # or
  catalogues/<catalogue>.sh validate
Options
  • general parameters
    • granularity of the report
  • -S, --summary: creating a summary report instead of record level reports
  • -H, --details: provides record level details of the issues
    • output parameters:
  • -G <file>, --summaryFileName <file>: the name of summary report the program produces. The file provides a summary of issues, such as the number of instance and number of records having the particular issue.
  • -F <file>, --detailsFileName <file>: the name of report the program produces. Default is validation-report.txt. If you use “stdout”, it won’t create file, but put results into the standard output.
  • -R <format>, --format <format>: format specification of the output. Possible values:
    • text (default),
    • tab-separated or tsv,
    • comma-separated or csv
    • -W, --emptyLargeCollectors: the output files are created during the process and not only at the end of it. It helps in memory management if the input is large, and it has lots of errors, on the other hand the output file will be segmented, which should be handled after the process.
    • -T, --collectAllErrors: collect all errors (useful only for validating small number of records). Default is turned off.
    • -I <types>, --ignorableIssueTypes <types>: comma separated list of issue types not to collect. The valid values are (for details see the issue types table):
  • undetectableType: undetectable type
  • invalidLinkage: invalid linkage
  • ambiguousLinkage: ambiguous linkage
  • obsoleteControlPosition: obsolete code
  • controlValueContainsInvalidCode: invalid code
  • invalidValue: invalid value
  • missingSubfield: missing reference subfield (880$6)
  • nonrepeatableField: repetition of non-repeatable field
  • undefinedField: undefined field
  • obsoleteIndicator: obsolete value
  • nonEmptyIndicator: non-empty indicator
  • invalidValue: invalid value
  • undefinedSubfield: undefined subfield
  • invalidLength: invalid length
  • invalidReference: invalid classification reference
  • patternMismatch: content does not match any patterns
  • nonrepeatableSubfield: repetition of non-repeatable subfield
  • invalidISBN: invalid ISBN
  • invalidISSN: invalid ISSN
  • unparsableContent: content is not well-formatted
  • nullCode: null subfield code
  • invalidValue: invalid value

4.1 Output files

flowchart LR
  A(Catalogue) --> B[validate]
  B --> C(count.csv)
  B --> D(issue-by-category.csv)
  B --> E(issue-by-type.csv)
  B --> F(issue-summary.csv)
  B --> G(issue-details.csv)
  B --> H(issue-details-normalized.csv)
  B --> I(issue-total.csv)
  B --> J(issue-collector.csv)
  B --> K(id-groupid.csv)
  B --> L(validation.params.json)
  F -.-> Q(qa_catalogue.sqlite)
  G -.-> Q
  G -.-> S(Solr)
  K -.-> S(Solr)

Execution of the dotted lines requires postprocessing.

4.1.1 count.csv

The count of bibliographic records in the source dataset

total
1192536

4.1.2 issue-by-category.csv

The counts of issues by categories. Columns:

  • id the identifier of error category
  • category the name of the category
  • instances the number of instances of errors within the category (one record might have multiple instances of the same error)
  • records the number of records having at least one of the errors within the category
id,category,instances,records
2,control field,994241,313960
3,data field,12,12
4,indicator,5990,5041
5,subfield,571,555

4.1.3 issue-by-type.csv

The count of issues by types (subcategories).

id,categoryId,category,type,instances,records
5,2,control field,"invalid code",951,541
6,2,control field,"invalid value",993290,313733
8,3,data field,"repetition of non-repeatable field",12,12
10,4,indicator,"obsolete value",1,1
11,4,indicator,"non-empty indicator",33,32
12,4,indicator,"invalid value",5956,5018
13,5,subfield,"undefined subfield",48,48
14,5,subfield,"invalid length",2,2
15,5,subfield,"invalid classification reference",2,2
16,5,subfield,"content does not match any patterns",286,275
17,5,subfield,"repetition of non-repeatable subfield",123,120
18,5,subfield,"invalid ISBN",5,3
19,5,subfield,"invalid ISSN",105,105

4.1.4 issue-summary.csv

Details of individual issues including basic statistics

id,MarcPath,categoryId,typeId,type,message,url,instances,records
53,008/33-34 (008map33),2,5,invalid code,'b' in 'b ',https://www.loc.gov/marc/bibliographic/bd008p.html,1,1
70,008/00-05 (008all00),2,5,invalid code,Invalid content: '2023  '. Text '2023  ' could not be parsed at index 4,https://www.loc.gov/marc/bibliographic/bd008a.html,1,1
28,008/22-23 (008map22),2,6,invalid value,| ,https://www.loc.gov/marc/bibliographic/bd008p.html,12,12
19,008/31 (008book31),2,6,invalid value, ,https://www.loc.gov/marc/bibliographic/bd008b.html,1,1
17,008/29 (008book29),2,6,invalid value, ,https://www.loc.gov/marc/bibliographic/bd008b.html,1,1

4.1.5 issue-details.csv

List of issues by record identifiers. It has two columns, the record identifier, and a complex string, which contains the number of occurrences of each individual issue concatenated by semicolon.

recordId,errors
99117335059205508,1:2;2:1;3:1
99117335059305508,1:1
99117335059405508,2:2
99117335059505508,3:1

1:2;2:1;3:1 means that 3 different types of issues are occurred in the record, the firs issue which has issue ID 1 occurred twice, issue ID 2 which occurred once and issue ID 3, which occurred once. The issue IDs can be resolved from the issue-summary.csv file’s firs column.

4.1.6 issue-details-normalized.csv

The normalized version of the previous file

id,errorId,instances
99117335059205508,1,2
99117335059205508,2,1
99117335059205508,3,1
99117335059305508,1,1
99117335059405508,2,2
99117335059505508,3,1

4.1.7 issue-total.csv

The number of issue free records, and number of record having issues

type,instances,records
0,0,251
1,1711,848
2,413,275

where types are - 0: records without errors - 1: records with any kinds of errors - 2: records with errors excluding invalid field errors

4.1.8 issue-collector.csv

Non normalized file of record ids per issues. This is the “inverse” of issue-details.csv, it tells you in which records a particular issue occurred.

errorId,recordIds
1,99117329355705508;99117328948305508;99117334968905508;99117335067705508;99117335176005508;...

4.1.9 validation.params.json

The list of the actual parameters during the running of the validation

An example with parameters used for analysing a PICA dataset. When the input is a complex expression it is displayed here in a parsed format. It also contains some metadata such as the versions of MQFA API and QA catalogue.

{
  "args":["/path/to/input.dat"],
  "marcVersion":"MARC21",
  "marcFormat":"PICA_NORMALIZED",
  "dataSource":"FILE",
  "limit":-1,
  "offset":-1,
  "id":null,
  "defaultRecordType":"BOOKS",
  "alephseq":false,
  "marcxml":false,
  "lineSeparated":false,
  "trimId":true,
  "outputDir":"/path/to/_output/k10plus_pica",
  "recordIgnorator":{
    "criteria":[],
    "booleanCriteria":null,
    "empty":true
  },
  "recordFilter":{
    "criteria":[],
    "booleanCriteria":{
      "op":"AND",
      "children":[
        {
          "op":null,
          "children":[],
          "value":{
            "path":{
              "path":"002@.0",
              "tag":"002@",
              "xtag":null,
              "occurrence":null,
              "subfields":{"type":"SINGLE","input":"0","codes":["0"]},
              "subfieldCodes":["0"]
            },
            "operator":"NOT_MATCH",
            "value":"^L"
          }
        },
        {"op":null,"children":[],"value":{"path":{"path":"002@.0","tag":"002@","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"0","codes":["0"]},"subfieldCodes":["0"]},"operator":"NOT_MATCH","value":"^..[iktN]"}},
        {"op":"OR","children":[{"op":null,"children":[],"value":{"path":{"path":"002@.0","tag":"002@","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"0","codes":["0"]},"subfieldCodes":["0"]},"operator":"NOT_MATCH","value":"^.v"}},{"op":null,"children":[],"value":{"path":{"path":"021A.a","tag":"021A","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"a","codes":["a"]},"subfieldCodes":["a"]},"operator":"EXIST","value":null}}],"value":null}
      ],
      "value":null
    },
    "empty":false
  },
  "ignorableFields":{
    "fields":["001@","001E","001L","001U","001U","001X","001X","002V","003C","003G","003Z","008G","017N","020F","027D","031B","037I","039V","042@","046G","046T","101@","101E","101U","102D","201E","201U","202D"],
    "empty":false
  },
  "stream":null,
  "defaultEncoding":null,
  "alephseqLineType":null,
  "picaIdField":"003@$0",
  "picaSubfieldSeparator":"$",
  "picaSchemaFile":null,
  "picaRecordTypeField":"002@$0",
  "schemaType":"PICA",
  "groupBy":null,
  "detailsFileName":"issue-details.csv",
  "summaryFileName":"issue-summary.csv",
  "format":"COMMA_SEPARATED",
  "ignorableIssueTypes":["FIELD_UNDEFINED"],
  "pica":true,
  "replacementInControlFields":null,
  "marc21":false,
  "mqaf.version":"0.9.2",
  "qa-catalogue.version":"0.7.0-SNAPSHOT"
}

4.1.10 id-groupid.csv

The pairs of record identifiers - group identifiers.

id,groupId
010000011,0
010000011,77
010000011,2035
010000011,70
010000011,20

4.2 Validation errors

validation detects the following errors:

Leader specific errors:

  • Leader/[position] has an invalid value: ‘[value]’ (e.g. Leader/19 (leader19) has an invalid value: '4')

Control field specific errors:

  • 006/[position] ([name]) contains an invalid code: ‘[code]’ in ‘[value]’ (e.g. 006/01-05 (tag006book01) contains an invalid code: 'n' in ' n ')
  • 006/[position] ([name]) has an invalid value: ‘[value]’ (e.g. 006/13 (tag006book13) has an invalid value: ' ')
  • 007/[position] ([name]) contains an invalid code: ‘[code]’ in ‘[value]’
  • 007/[position] ([name]) has an invalid value: ‘[value]’ (e.g. 007/01 (tag007microform01) has an invalid value: ' ')
  • 008/[position] ([name]) contains an invalid code: ‘[code]’ in ‘[value]’ (e.g. 008/18-22 (tag008book18) contains an invalid code: 'u' in 'u ')
  • 008/[position] ([name]) has an invalid value: ‘[value]’ (e.g. 008/06 (tag008all06) has an invalid value: ' ')

Data field specific errors

  • Unhandled tag(s): [tags] (e.g. Unhandled tag: 265)
  • [tag] is not repeatable, however there are [number] instances
  • [tag] has invalid subfield(s): [subfield codes] (e.g. 110 has invalid subfield: s)
  • [tag]\([indicator] has invalid code: '[code]' (e.g. `110\)ind1 has invalid code: ‘2’`)
  • [tag]\([indicator] should be empty, it has '[code]' (e.g. `110\)ind2 should be empty, it has ‘0’`)
  • [tag]\([subfield code] is not repeatable, however there are [number] instances (e.g. `072\)a is not repeatable, however there are 2 instances`)
  • [tag]\([subfield code] has an invalid value: [value] (e.g. `046\)a has an invalid value: ‘fb—–’`)

Errors of specific fields:

  • 045\(a error in '[value]': length is not 4 char (e.g. `045\)a error in ‘2209668’: length is not 4 char`)
  • 045$a error in ‘[value]’: ‘[part]’ does not match any patterns
  • 880 should have subfield $a
  • 880 refers to field [tag], which is not defined (e.g. 880 refers to field 590, which is not defined)

An example:

Error in '   00000034 ': 
  110$ind1 has invalid code: '2'
Error in '   00000056 ': 
  110$ind1 has invalid code: '2'
Error in '   00000057 ': 
  082$ind1 has invalid code: ' '
Error in '   00000086 ': 
  110$ind1 has invalid code: '2'
Error in '   00000119 ': 
  700$ind1 has invalid code: '2'
Error in '   00000234 ': 
  082$ind1 has invalid code: ' '
Errors in '   00000294 ': 
  050$ind2 has invalid code: ' '
  260$ind1 has invalid code: '0'
  710$ind2 has invalid code: '0'
  710$ind2 has invalid code: '0'
  710$ind2 has invalid code: '0'
  740$ind2 has invalid code: '1'
Error in '   00000322 ': 
  110$ind1 has invalid code: '2'
Error in '   00000328 ': 
  082$ind1 has invalid code: ' '
Error in '   00000374 ': 
  082$ind1 has invalid code: ' '
Error in '   00000395 ': 
  082$ind1 has invalid code: ' '
Error in '   00000514 ': 
  082$ind1 has invalid code: ' '
Errors in '   00000547 ': 
  100$ind2 should be empty, it has '0'
  260$ind1 has invalid code: '0'
Errors in '   00000571 ': 
  050$ind2 has invalid code: ' '
  100$ind2 should be empty, it has '0'
  260$ind1 has invalid code: '0'
...

4.3 Postprocessing

Postprocessing command validate-sqlite writes validation results into SQLite database file qa_catalogue.sqlite and into Solr index.

Usage
  ./qa-catalogue --params="[options]" validate-sqlite
  # or
  catalogues/<catalogue>.sh validate-sqlite

Options are the same as for validation.

4.3.1 Catalogue for a single library

If the data is not grouped by libraries (no --groupBy <path> parameter), it creates the database tables issue_summary (with data from issue-summary.csv) and issue_details (with data from issue-details.csv) in qa_catalogie.sqlite:

BEGIN TRANSACTION;

-- particular types of errors and how often they have been found
DROP TABLE IF EXISTS issue_summary;
CREATE TABLE IF NOT EXISTS "issue_summary" (
  "id"         INTEGER,  -- identifier of the error
  "MarcPath"   TEXT,     -- the location of the error in the bibliographic record
  "categoryId" INTEGER,  -- the identifier of the category of the error
  "typeId"     INTEGER,  -- the identifier of the type of the error
  "type"       TEXT,     -- the description of the type
  "message"    TEXT,     -- extra contextual information 
  "url"        TEXT,     -- the url of the definition of the data element
  "instances"  INTEGER,  -- the number of instances this error occured
  "records"    INTEGER   -- the number of records this error occured in
);

-- how many instances of an error occur in a particular bibliographic record
DROP TABLE IF EXISTS issue_details;
CREATE TABLE IF NOT EXISTS "issue_details" (
  "id"         TEXT,    -- the record identifier
  "errorId"    INTEGER, -- the error identifier (-> issue_summary.id)
  "instances"  INTEGER  -- the number of instances of an error in the record
);

COMMIT;

The postprocessing also writes validation results into Solr index.

4.3.2 Union catalogue for multiple libraries

If the dataset is a union catalogue, and the record contains a subfield for the libraries holding the item (there is --groupBy <path> parameter), it creates the following SQLite3 database structure and import some of the CSV files into it:

issue_summary table for the issue-summary.csv (it is similar to the other issue_summary table, but it has an extra groupId column)

groupId    INTEGER,
id         INTEGER,
MarcPath   TEXT,
categoryId INTEGER,
typeId     INTEGER,
type       TEXT,
message    TEXT,
url        TEXT,
instances  INTEGER,
records    INTEGER

issue_details table (same as the other issue_details table)

id         TEXT,
errorId    INTEGER,
instances  INTEGER

id_groupid table for id-groupid.csv:

id         TEXT,
groupId    INTEGER

issue_group_types table contains statistics for the error types per groups.

groupId    INTEGER,
typeId     INTEGER,
records    INTEGER,
instances  INTEGER

issue_group_categories table contains statistics for the error categories per groups

groupId    INTEGER,
categoryId INTEGER,
records    INTEGER,
instances  INTEGER

issue_group_paths table contains statistics for the error types per paths per groups

groupId    INTEGER,
typeId     INTEGER,
path       TEXT,
records    INTEGER,
instances  INTEGER

For union catalogues it also creates an extra Solr index with the suffix _validation. It contains one Solr document for each bibliographic record with three fields: the record identifier, the list of group identifiers and the list of error identifiers (if any). This Solr index is needed for populating the issue_group_types, issue_group_categories and issue_group_paths tables. This index will be ingested into the main Solr index.

4.4 Internals

The validate task internally calls script validator which calls java -cp $JAR de.gwdg.metadataqa.marc.cli.Validator [options] <file>.