flowchart LR
  A(Catalogue) --> B[completeness]
  B ---> C(marc-elements.csv)
  B ---> D(packages.csv)
  B ---> E(libraries.csv)
  B ---> F(libraries003.csv)
  B ---> G(completeness.params.json)
  C ---> H(qa_catalogue.sqlite)
6 Completeness
The completeness task counts data elements in the input records and creates basic statistics.
Usage:

./qa-catalogue --params="[options]" completeness
# or
catalogues/<catalogue>.sh completeness

Options:

- general parameters
- -R <format>, --format <format>: format specification of the output. Possible values are:
  - tab-separated or tsv,
  - comma-separated or csv,
  - text or txt,
  - json
- -V, --advanced: advanced mode (not implemented yet)
- -P, --onlyPackages: only packages (not implemented yet)
6.1 Output files
6.1.1 marc-elements.csv
A list of MARC elements (field$subfield) and their occurrences, counted in two ways: as the number of records and as the number of instances. The columns in the file are:
- documenttype: the document types found in the dataset. There is an extra document type, all, representing all records
- path: the notation of the data element
- packageid and package: each path belongs to one package, such as Control Fields, and each package has an internal identifier
- tag: the label of the tag
- subfield: the label of the subfield
- number-of-record: the number of records in which the element is available
- number-of-instances: the total number of instances (some records might contain more than one instance, while others have none at all)
- min, max, mean, stddev: the minimum, maximum, mean and standard deviation of the number of instances per record (as floating point numbers)
- histogram: the histogram of the instances (1=1; 2=1 means: a single instance is available in one record, two instances are available in one record)
documenttype | path | packageid | package | tag | subfield | number-of-record | number-of-instances | min | max | mean | stddev | histogram |
---|---|---|---|---|---|---|---|---|---|---|---|---|
all | leader23 | 0 | Control Fields | Leader | Undefined | 1099 | 1099 | 1 | 1 | 1.0 | 0.0 | 1=1099 |
all | leader22 | 0 | Control Fields | Leader | Length of the implementation-defined portion | 1099 | 1099 | 1 | 1 | 1.0 | 0.0 | 1=1099 |
all | leader21 | 0 | Control Fields | Leader | Length of the starting-character-position portion | 1099 | 1099 | 1 | 1 | 1.0 | 0.0 | 1=1099 |
all | 110$a | 2 | Main Entry | Main Entry - Corporate Name | Corporate name or jurisdiction name as entry element | 4 | 4 | 1 | 1 | 1.0 | 0.0 | 1=4 |
all | 340$b | 5 | Physical Description | Physical Medium | Dimensions | 2 | 3 | 1 | 2 | 1.5 | 0.3535533905932738 | 1=1; 2=1 |
all | 363$a | 5 | Physical Description | Normalized Date and Sequential Designation | First level of enumeration | 1 | 1 | 1 | 1 | 1.0 | 0.0 | 1=1 |
all | 340$a | 5 | Physical Description | Physical Medium | Material base and configuration | 2 | 3 | 1 | 2 | 1.5 | 0.3535533905932738 | 1=1; 2=1 |
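The relation between the histogram column and the two counts can be checked with a short script. The following is a minimal sketch and not part of QA catalogue: the helper names parse_histogram and summarize are hypothetical, and it assumes the histogram is serialized as instances=records pairs separated by semicolons, as in the rows above. It does not attempt to reproduce the stddev column.

```python
from __future__ import annotations


def parse_histogram(histogram: str) -> dict[int, int]:
    """Parse '1=1; 2=1' into {instances_per_record: number_of_records}."""
    pairs = {}
    for item in histogram.split(";"):
        item = item.strip()
        if not item:
            continue
        instances, records = item.split("=")
        pairs[int(instances)] = int(records)
    return pairs


def summarize(histogram: str) -> tuple[int, int, float]:
    """Derive (number-of-record, number-of-instances, mean) from a histogram."""
    pairs = parse_histogram(histogram)
    records = sum(pairs.values())
    instances = sum(n * count for n, count in pairs.items())
    return records, instances, instances / records


# The 340$b row of the sample table: 2 records, 3 instances, mean 1.5
print(summarize("1=1; 2=1"))
```

For the 340$b row this yields 2 records and 3 instances, which agrees with the number-of-record, number-of-instances and mean columns of that row.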
6.1.2 packages.csv
The completeness of packages (packages are groups of tags). Its columns:
- documenttype: the document type of the record
- packageid: the identifier of the package
- name: name of the package
- label: label of the package
- iscoretag: whether the package belongs to the Library of Congress MARC standard
- count: the number of records having at least one data element from this package
documenttype | packageid | name | label | iscoretag | count |
---|---|---|---|---|---|
all | 1 | 01X-09X | Numbers and Code | true | 1099 |
all | 2 | 1XX | Main Entry | true | 816 |
all | 6 | 4XX | Series Statement | true | 358 |
all | 5 | 3XX | Physical Description | true | 715 |
all | 8 | 6XX | Subject Access | true | 514 |
all | 4 | 25X-28X | Edition, Imprint | true | 1096 |
all | 7 | 5XX | Note | true | 354 |
all | 0 | 00X | Control Fields | true | 1099 |
all | 99 | unknown | unknown origin | false | 778 |
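As an illustration of how this file might be consumed, the sketch below ranks packages by record count for one document type. It is not part of QA catalogue: the function name packages_by_count is hypothetical, the sample rows are copied from the table above, and it assumes the file is comma-separated with a header row.

```python
import csv
import io

# Sample rows copied from the table above; a real run would open packages.csv.
SAMPLE = """documenttype,packageid,name,label,iscoretag,count
all,1,01X-09X,Numbers and Code,true,1099
all,2,1XX,Main Entry,true,816
all,5,3XX,Physical Description,true,715
all,99,unknown,unknown origin,false,778
"""


def packages_by_count(source, documenttype="all"):
    """Return (label, count, is_core) tuples, most frequent package first."""
    rows = [r for r in csv.DictReader(source) if r["documenttype"] == documenttype]
    rows.sort(key=lambda r: int(r["count"]), reverse=True)
    return [(r["label"], int(r["count"]), r["iscoretag"] == "true") for r in rows]


for label, count, is_core in packages_by_count(io.StringIO(SAMPLE)):
    marker = "" if is_core else " (not a core package)"
    print(f"{count:>6}  {label}{marker}")
```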
6.1.3 libraries.csv
Lists the content of the 852$a (it is useful only if the catalog is an aggregated catalog). Its columns are:
- library: the code of a library
- count: the number of records having a particular library code
library | count |
---|---|
“00Mf” | 713 |
“British Library” | 525 |
“Inserted article about the fires from the Courant after the title page.” | 1 |
“National Library of Scotland” | 310 |
“StEdNL” | 1 |
“UkOxU” | 33 |
6.1.4 libraries003.csv
Lists the content of the 003 (it is useful only if the catalog is an aggregated catalog). Its columns are:
- library: the code of a library
- count: the number of records having a particular library code
library | count |
---|---|
“103861” | 1 |
“BA-SaUP” | 143 |
“BoCbLA” | 25 |
“CStRLIN” | 110 |
“DLC” | 3 |
6.1.5 completeness.params.json
The list of the actual parameters used in the analysis.
Below is an example with the parameters used for analysing a MARC dataset. When the input is a complex expression, it is displayed here in a parsed format. The file also contains some metadata, such as the versions of the MQAF API and QA catalogue.
{
"args":["/path/to/input.xml.gz"],
"marcVersion":"MARC21",
"marcFormat":"XML",
"dataSource":"FILE",
"limit":-1,
"offset":-1,
"id":null,
"defaultRecordType":"BOOKS",
"alephseq":false,
"marcxml":true,
"lineSeparated":false,
"trimId":false,
"outputDir":"/path/to/_output/",
"recordIgnorator":{
"conditions":null,
"empty":true
},
"recordFilter":{
"conditions":null,
"empty":true
},
"ignorableFields":{
"fields":null,
"empty":true
},
"stream":null,
"defaultEncoding":null,
"alephseqLineType":null,
"picaIdField":"003@$0",
"picaSubfieldSeparator":"$",
"picaSchemaFile":null,
"picaRecordTypeField":"002@$0",
"schemaType":"MARC21",
"groupBy":null,
"groupListFile":null,
"format":"COMMA_SEPARATED",
"advanced":false,
"onlyPackages":false,
"replacementInControlFields":"#",
"marc21":true,
"pica":false,
"mqaf.version":"0.9.2",
"qa-catalogue.version":"0.7.0"
}
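Because the file is plain JSON, it can be used to document how an output directory was produced. The sketch below is an illustration only: the function name describe_run is hypothetical, the sample is a trimmed excerpt of the example above, and treating a non-null groupBy as the sign of a grouped (union catalogue) run is an assumption.

```python
import json

# Trimmed excerpt of the example above; a real run would read
# completeness.params.json from the output directory.
SAMPLE = """
{
  "args": ["/path/to/input.xml.gz"],
  "schemaType": "MARC21",
  "format": "COMMA_SEPARATED",
  "groupBy": null,
  "mqaf.version": "0.9.2",
  "qa-catalogue.version": "0.7.0"
}
"""


def describe_run(params: dict) -> dict:
    """Pick the fields that tell how the analysis was run."""
    return {
        "input": params["args"],
        "schema": params["schemaType"],
        "format": params["format"],
        # Assumption: groupBy is set only for grouped (union catalogue) runs.
        "grouped": params.get("groupBy") is not None,
        "versions": (params["mqaf.version"], params["qa-catalogue.version"]),
    }


print(describe_run(json.loads(SAMPLE)))
```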
6.2 Output files for union catalogues
For union catalogues, marc-elements.csv and packages.csv have special versions.
6.2.1 completeness-grouped-marc-elements.csv
The same as marc-elements.csv but with an extra column, groupId:
- groupId: the library identifier available in the data element specified by the --groupBy parameter. 0 has a special meaning: all libraries
groupId | documenttype | path | packageid | package | tag | subfield | number-of-record | number-of-instances | min | max | mean | stddev | histogram |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
350 | all | 044K$9 | 50 | PICA+ bibliographic description | “Schlagwortfolgen (GBV, SWB, K10plus)” | PPN | 1 | 1 | 1 | 1 | 1.0 | 0.0 | 1=1 |
350 | all | 044K$7 | 50 | PICA+ bibliographic description | “Schlagwortfolgen (GBV, SWB, K10plus)” | Vorläufiger Link | 1 | 1 | 1 | 1 | 1.0 | 0.0 | 1=1 |
6.2.2 completeness-grouped-packages.csv
The same as packages.csv but with an extra column, group:
- group: the library identifier available in the data element specified by the --groupBy parameter. 0 has a special meaning: all libraries
group | documenttype | packageid | name | label | iscoretag | count |
---|---|---|---|---|---|---|
0 | Druckschriften (einschließlich Bildbänden) | 50 | 0… | PICA+ bibliographic description | false | 987 |
0 | Druckschriften (einschließlich Bildbänden) | 99 | unknown | unknown origin | false | 3 |
0 | Medienkombination | 50 | 0… | PICA+ bibliographic description | false | 1 |
0 | Mikroform | 50 | 0… | PICA+ bibliographic description | false | 11 |
0 | Tonträger, Videodatenträger, Bildliche Darstellungen | 50 | 0… | PICA+ bibliographic description | false | 1 |
0 | all | 50 | 0… | PICA+ bibliographic description | false | 1000 |
0 | all | 99 | unknown | unknown origin | false | 3 |
100 | Druckschriften (einschließlich Bildbänden) | 50 | 0… | PICA+ bibliographic description | false | 20 |
100 | Medienkombination | 50 | 0… | PICA+ bibliographic description | false | 1 |
6.2.3 completeness-groups.csv
This file is available for union catalogues and lists the groups. Its columns:
- id: the group identifier
- group: the name of the library
- count: the number of records from the particular library
id | group | count |
---|---|---|
0 | all | 1000 |
100 | Otto-von-Guericke-Universität, Universitätsbibliothek Magdeburg [DE-Ma9] | 21 |
1003 | Kreisarchäologie Rotenburg [DE-MUS-125322…] | 1 |
101 | Otto-von-Guericke-Universität, Universitätsbibliothek, Medizinische Zentralbibliothek (MZB), Magdeburg [DE-Ma14…] | 6 |
1012 | Mariengymnasium Jever [DE-Je1] | 19 |
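The group identifiers in the grouped files can be resolved to library names through completeness-groups.csv. The following is a minimal sketch under stated assumptions: the function names load_group_names and label_rows are hypothetical, the sample rows are adapted from the tables in this section (the elided package name is kept as it appears there), and both files are assumed to be comma-separated with a header row and with quoted values where a value contains a comma.

```python
import csv
import io

# Sample rows adapted from the tables in this section; real runs would open
# completeness-groups.csv and completeness-grouped-packages.csv.
GROUPS = '''id,group,count
0,all,1000
100,"Otto-von-Guericke-Universität, Universitätsbibliothek Magdeburg [DE-Ma9]",21
'''

GROUPED_PACKAGES = '''group,documenttype,packageid,name,label,iscoretag,count
0,all,50,0…,PICA+ bibliographic description,false,1000
100,Druckschriften (einschließlich Bildbänden),50,0…,PICA+ bibliographic description,false,20
100,Medienkombination,50,0…,PICA+ bibliographic description,false,1
'''


def load_group_names(source):
    """Map group id (as text) to the library name."""
    return {row["id"]: row["group"] for row in csv.DictReader(source)}


def label_rows(packages_source, names):
    """Replace the group id with the library name; keep the id if it is unknown."""
    return [
        (names.get(row["group"], row["group"]), row["documenttype"], int(row["count"]))
        for row in csv.DictReader(packages_source)
    ]


names = load_group_names(io.StringIO(GROUPS))
for library, documenttype, count in label_rows(io.StringIO(GROUPED_PACKAGES), names):
    print(f"{library} | {documenttype} | {count}")
```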
6.2.4 id-groupid.csv
This is the very same file that the validation task creates. Completeness creates it only if it is not yet available.
6.2.5 qa_catalogue.sqlite
The content of marc-elements.csv or completeness-grouped-marc-elements.csv is imported into the marc_elements table of qa_catalogue.sqlite. For catalogues without the --groupBy parameter, the groupId column is filled with 0. The table definition is:
DROP TABLE IF EXISTS "marc_elements";
CREATE TABLE IF NOT EXISTS "marc_elements" (
"groupId" INTEGER,
"documenttype" TEXT,
"path" TEXT,
"sortkey" TEXT,
"packageid" INTEGER,
"package" TEXT,
"tag" TEXT,
"subfield" TEXT,
"number-of-record" INTEGER,
"number-of-instances" INTEGER,
"min" INTEGER,
"max" INTEGER,
"mean" REAL,
"stddev" REAL,
"histogram" TEXT
);
CREATE INDEX IF NOT EXISTS "gme_groupId" ON "marc_elements" ("groupId");
CREATE INDEX IF NOT EXISTS "gme_documenttype" ON "marc_elements" ("documenttype");
CREATE INDEX IF NOT EXISTS "gme_sortkey" ON "marc_elements" ("sortkey");
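Once imported, the table can be queried with any SQLite client. The sketch below uses Python's standard sqlite3 module against an in-memory database, so that it runs without a real qa_catalogue.sqlite. The function name elements_for is hypothetical, the two rows are taken from the marc-elements.csv sample, and the sortkey values are placeholders (the sample CSV does not show that column). Column names containing a hyphen must be quoted in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would open qa_catalogue.sqlite
conn.execute("""
CREATE TABLE "marc_elements" (
  "groupId" INTEGER, "documenttype" TEXT, "path" TEXT, "sortkey" TEXT,
  "packageid" INTEGER, "package" TEXT, "tag" TEXT, "subfield" TEXT,
  "number-of-record" INTEGER, "number-of-instances" INTEGER,
  "min" INTEGER, "max" INTEGER, "mean" REAL, "stddev" REAL, "histogram" TEXT
)""")

# Rows from the marc-elements.csv sample; groupId 0 means "all libraries".
# The sortkey values are placeholders, not real QA catalogue output.
rows = [
    (0, "all", "110$a", "110$a", 2, "Main Entry", "Main Entry - Corporate Name",
     "Corporate name or jurisdiction name as entry element",
     4, 4, 1, 1, 1.0, 0.0, "1=4"),
    (0, "all", "340$b", "340$b", 5, "Physical Description", "Physical Medium",
     "Dimensions", 2, 3, 1, 2, 1.5, 0.3535533905932738, "1=1; 2=1"),
]
conn.executemany(
    "INSERT INTO marc_elements VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", rows
)


def elements_for(conn, group_id=0, documenttype="all"):
    """List data elements of one group and document type, most frequent first."""
    query = '''SELECT "path", "number-of-record", "number-of-instances"
               FROM "marc_elements"
               WHERE "groupId" = ? AND "documenttype" = ?
               ORDER BY "number-of-record" DESC'''
    return conn.execute(query, (group_id, documenttype)).fetchall()


print(elements_for(conn))
```

The filter columns groupId and documenttype are the ones covered by the indexes defined above.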
6.3 Internals
The completeness task consists of two steps:
- the completeness script calls java -cp $JAR de.gwdg.metadataqa.marc.cli.Completeness [options] <file>
- the result is imported into qa_catalogue.sqlite

The second step can also be called independently as the command completeness-sqlite.