11 Classification analysis
It analyses the coverage of subject indexing/classification in the catalogue. It checks specific fields, which might have subject indexing information, and provides details about how and which subject indexing schemes have been applied.
java -cp $JAR de.gwdg.metadataqa.marc.cli.ClassificationAnalysis [options] <file>
Rscript scripts/classifications/classifications-type.R <output directory>
with a bash script
./classifications [options] <file>
Rscript scripts/classifications/classifications-type.R <output directory>
or
catalogues/[catalogue].sh classifications
or
./qa-catalogue --params="[options]" classifications
options:
- general parameters
-w
,--emptyLargeCollectors
: empty large collectors periodically. It is a memory optimization parameter, turn it on if you run into a memory problem.
The output is a set of files:
classifications-by-records.csv
: general overview of how many records has any subject indexingclassifications-by-schema.csv
: which subject indexing schemas are available in the catalogues (such as DDC, UDC, MESH etc.) and where they are referredclassifications-histogram.csv
: a frequency distribution of the number of subjects available in records (x records have 0 subjects, y records have 1 subjects, z records have 2 subjects etc.)classifications-frequency-examples.csv
: examples for particular distributions (one record ID which has 0 subject, one which has 1 subject, etc.)classifications-by-schema-subfields.csv
: the distribution of subfields of those fields, which contains subject indexing information. It gives you a background that what other contextual information behind the subject term are available (such as the version of the subject indexing scheme)classifications-collocations.csv
: how many record has a particular set of subject indexing schemesclassifications-by-type.csv
: returns the subject indexing schemes and their types in order of the number of records. The types are TERM_LIST (subtypes: DICTIONARY, GLOSSARY, SYNONYM_RING), METADATA_LIKE_MODEL (NAME_AUTHORITY_LIST, GAZETTEER), CLASSIFICATION (SUBJECT_HEADING, CATEGORIZATION, TAXONOMY, CLASSIFICATION_SCHEME), RELATIONSHIP_MODEL (THESAURUS, SEMANTIC_NETWORK, ONTOLOGY).