16 Indexing bibliographic records with Solr

Run indexer:

java -cp $JAR de.gwdg.metadataqa.marc.cli.MarcToSolr [options] [file]

With script:

catalogues/[catalogue].sh all-solr

./qa-catalogue --params="[options]" all-solr

options:

general parameters
-S <URL>, --solrUrl <URL>: the URL of Solr server including the core (e.g. http://localhost:8983/solr/loc)
-A, --doCommit: send commits to Solr regularly (not needed if you set up Solr as described below)
-T <type>, --solrFieldType <type>: a Solr field type, one of the predefined values. See examples below.
- marc-tags - the field names are MARC codes
- human-readable - the field names are Self Descriptive MARC code
- mixed - the field names are mixed of the above (e.g. 245a_Title_mainTitle)
-C, --indexWithTokenizedField: index data elements as tokenized field as well (each bibliographical data elements will be indexed twice: once as a phrase (fields suffixed with _ss), and once as a bag of words (fields suffixed with _txt). [This parameter is available from v0.8.0]
-D <int>, --commitAt <int>: commit index after this number of records [This parameter is available from v0.8.0]
-E, --indexFieldCounts: index the count of field instances [This parameter is available from v0.8.0]
-G, --indexSubfieldCounts: index the count of subfield instances [This parameter is available from v0.8.0]
-F, --fieldPrefix <arg>: field prefix

The ./index file (which is used by catalogues/[catalogue].sh and ./qa-catalogue scripts) has additional parameters: * -Z <core>, --core <core>: The index name (core). If not set it will be extracted from the solrUrl parameter * -Y <path>, --file-path <path>: File path * -X <mask>, --file-mask <mask>: File mask * -W, --purge: Purge index and exit * -V, --status: Show the status of index(es) and exit * -U, --no-delete: Do not delete documents in index before starting indexing (be default the script clears the index)

16.1 Solr field names

QA catalogue builds a Solr index which contains a) a set of fixed Solr fields that are the same for all bibliographic input, and b) Solr fields that depend on the field names of the metadata schema (MARC, PICA, UNIMARC etc.) - these fields should be mapped from metadata schema to dynamic Solr fields by an algorithm.

16.1.1 Fixed fields

id: the record ID. This comes from the identifier of the bibliographic record, so 001 for MARC21
record_sni: the JSON representation of the bibliographic record
groupId_is: the list of group IDs. The content comes from the data element specified by the --groupBy parameter split by commas (‘,’).
errorId_is: the list of error IDs that come from the result of the validation.

16.1.2 Mapped fields

The mapped fields are Solr fields that depend on the field names of the metadata schema. The final Solr field follows the pattern:

Field prefix:

With --fieldPrefix parameter you can set a prefix that is applied to the variable fields. This might be needed because Solr has a limitation: field names start with a number can not be used in some Solr parameter, such as fl (field list selected to be retrieved from the index). Unfortunately bibliographic schemas use field names start with numbers. You can change a mapping parameter that produces a mapped value that resembles the BIBFRAME mapping of the MARC21 field, but not all field has such a human readable association.

Field suffixes:

*_sni: not indexed, stored string fields – good for storing fields used for displaying information
*_ss: not parsed, stored, indexed string fields – good for display and facets
*_tt: parsed, not stored, indexed string fields – good for term searches (these fields will be availabe if --indexWithTokenizedField parameter is applied)
*_is: parsed, not stored, indexed integer fields – good for searching for numbers, such as error or group identifiers (these fields will be availabe if --indexFieldCounts parameter is applied)

The mapped value

With --solrFieldType you can select the algorithm that generates the mapped value. Right now there are three formats: * marc-tags - the field names are MARC codes (245$a → 245a) * human-readable - the field names are Self Descriptive MARC code (245$a → Title_mainTitle) * mixed - the field names are mixed of the above (e.g. 245a_Title_mainTitle)

16.1.2.1 “marc-tags” format

"100a_ss":["Jung-Baek, Myong Ja"],
"100ind1_ss":["Surname"],
"245c_ss":["Vorgelegt von Myong Ja Jung-Baek."],
"245ind2_ss":["No nonfiling characters"],
"245a_ss":["S. Tret'jakov und China /"],
"245ind1_ss":["Added entry"],
"260c_ss":["1987."],
"260b_ss":["Georg-August-Universität Göttingen,"],
"260a_ss":["Göttingen :"],
"260ind1_ss":["Not applicable/No information provided/Earliest available publisher"],
"300a_ss":["141 p."],

16.1.2.2 “human-readable” format

"MainPersonalName_type_ss":["Surname"],
"MainPersonalName_personalName_ss":["Jung-Baek, Myong Ja"],
"Title_responsibilityStatement_ss":["Vorgelegt von Myong Ja Jung-Baek."],
"Title_mainTitle_ss":["S. Tret'jakov und China /"],
"Title_titleAddedEntry_ss":["Added entry"],
"Title_nonfilingCharacters_ss":["No nonfiling characters"],
"Publication_sequenceOfPublishingStatements_ss":["Not applicable/No information provided/Earliest available publisher"],
"Publication_agent_ss":["Georg-August-Universität Göttingen,"],
"Publication_place_ss":["Göttingen :"],
"Publication_date_ss":["1987."],
"PhysicalDescription_extent_ss":["141 p."],

16.1.2.3 “mixed” format

"100a_MainPersonalName_personalName_ss":["Jung-Baek, Myong Ja"],
"100ind1_MainPersonalName_type_ss":["Surname"],
"245a_Title_mainTitle_ss":["S. Tret'jakov und China /"],
"245ind1_Title_titleAddedEntry_ss":["Added entry"],
"245ind2_Title_nonfilingCharacters_ss":["No nonfiling characters"],
"245c_Title_responsibilityStatement_ss":["Vorgelegt von Myong Ja Jung-Baek."],
"260b_Publication_agent_ss":["Georg-August-Universität Göttingen,"],
"260a_Publication_place_ss":["Göttingen :"],
"260ind1_Publication_sequenceOfPublishingStatements_ss":["Not applicable/No information provided/Earliest available publisher"],
"260c_Publication_date_ss":["1987."],
"300a_PhysicalDescription_extent_ss":["141 p."],

A distinct project metadata-qa-marc-web, provides a web application that utilizes to build this type of Solr index in number of ways (a facetted search interface, term lists, search for validation errors etc.)

16.2 Index preparation

The tool uses different Solr indices (aka cores) to store information. In the following example we use loc as the name of our catalogue. There are two main indices: loc and loc_dev. loc_dev is the target of the index process, it will create it from scratch. During the proess loc is available and searchable. When the indexing has been successfully finished these two indices will be swaped, so the previous loc will become loc_dev, and the new index will be loc. The web user interface will always use the latest version (not the dev).

Besides these two indices there is a third index that contains different kind of results of the analyses. At the time of writing it contains only the results of validation, but later it will cover other information as well. It can be set by the following parameter:

-4, --solrForScoresUrl <arg>: the URL of the Solr server used to store scores (it is populated in the validate-sqlite process which runs after validation)

During the indexing process the content of this index is meged into the _dev index, so after a successfull end of the process this index is not needed anymore.

In order to make the automation easier and still flexible there are some an auxilary commands:

./qa-catalogue prepare-solr: created these two indices, makes sure that their schemas contain the necessary fields
./qa-catalogue index: runs the indexing process
./qa-catalogue postprocess-solr: swap the two Solr cores ( and _dev)
./qa-catalogue all-solr: runs all the three steps

If you would like to maintain the Solr index yourself (e.g. because the Solr instance wuns in a cloud environment), you should skip prepare-solr and postprocess-solr, and run only index. For maintaining the schema you can find a minimal viable schema among the test resources

You can set autocommit the following way in solrconfig.xml (inside Solr):

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <maxDocs>5000</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>
...
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

It needs if you choose to disable QA catalogue to issue commit messages (see --commitAt parameter), which makes indexing faster.

In schema.xml (or in Solr web interface) you should be sure that you have the following dynamic fields:

<dynamicField name="*_ss" type="strings" indexed="true" stored="true"/>
<dynamicField name="*_tt" type="text_general" indexed="true" stored="false"/>
<dynamicField name="*_is" type="pints" indexed="true" stored="true" />
<dynamicField name="*_sni" type="string_big" docValues="false" multiValued="false" indexed="false" stored="true"/>

<copyField source="*_ss" dest="_text_"/>

or use Solr API:

NAME=dnb
SOLR=http://localhost:8983/solr/$NAME/schema

// add copy field
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-dynamic-field":{
     "name":"*_sni",
     "type":"string",
     "indexed":false,
     "stored":true}
}' $SOLR

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field":{
     "source":"*_ss",
     "dest":["_text_"]}
}' $SOLR
...

See the solr-functions file for full code.

QA catalogue has a helper scipt to get information about the status of Solr index (Solr URL, location, the list of cores, number of documents, size in the disk, and last modification):

$ ./index --status
Solr index status at http://localhost:8983
Solr directory: /opt/solr-9.3.0/server/solr

core                 | location        | nr of docs |       size |       last modified
.................... | ............... | .......... | .......... | ...................
nls                  | nls_1           |     403946 | 1002.22 MB | 2023-11-25 21:59:39
nls_dev              | nls_2           |     403943 |  987.22 MB | 2023-11-11 15:59:49
nls_validation       | nls_validation  |     403946 |   17.89 MB | 2023-11-25 21:35:44
yale                 | yale_2          |    2346976 |    9.51 GB | 2023-11-11 13:12:35
yale_dev             | yale_1          |    2346976 |    9.27 GB | 2023-11-11 10:58:08