Validating Bibliographic Data

danfowler · October 31, 2016, 4:24pm

Recently, we’ve received some requests to see how Frictionless Data specs and tooling (most likely, JSON Table Schema) can be usefully applied to validating bibliographic records. A very common format for bibliographic data is MARC 21. This format mandates some validation rules for data which are broadly analogous to the type of rules we would define in JSON Table Schema. After doing a bit of research here myself, I thought it might be useful to reach out to the wider community.

Specifically, I was wondering if anyone here who has experience with, say, MARC could weigh in. If this does indeed make sense, perhaps a first step would be to define a standard way to export MARC records to a tabular representation in CSV and then define a schema to validate that.
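As a rough illustration of what such a tabular export might look like, here is a minimal sketch using only the Python standard library. The record representation (lists of tag, subfield code, value triples), the `tag$subfield` column naming, and the `|` separator for repeated subfields are all assumptions made for this example, not an agreed convention; a real pipeline would read the records from a MARC file with a library such as pymarc.

```python
import csv
import io

# Hypothetical in-memory stand-in for parsed MARC records: each record is a
# list of (tag, subfield code, value) triples. The sample values are made up.
records = [
    [("245", "a", "Open data handbook"), ("043", "c", "gb"),
     ("650", "a", "Open data"), ("650", "a", "Libraries")],
    [("245", "a", "Bibliographic formats"), ("043", "c", "us")],
]

def records_to_csv(records, columns, repeat_sep="|"):
    """Flatten records into CSV text with one column per tag$subfield."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for record in records:
        row = []
        for column in columns:
            tag, code = column.split("$")
            # Collect every occurrence, since MARC (sub)fields can repeat.
            values = [v for (t, c, v) in record if t == tag and c == code]
            row.append(repeat_sep.join(values))
        writer.writerow(row)
    return out.getvalue()

csv_text = records_to_csv(records, ["245$a", "043$c", "650$a"])
print(csv_text)
```

Joining repeated subfields into one cell keeps one row per record, but it pushes the multiple-value problem into the cell itself, so any schema applied afterwards has to account for the separator.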

Some tools I came across while researching this:

pymarc: GitHub - edsu/pymarc: process MARC records from Python
catmandu: Catmandu
MarcEdit: Validating Records - MarcEdit - LibGuides at University of Illinois at Urbana-Champaign

@dimin @todrobbins

dianemercier · November 1, 2016, 3:23pm

It would be better to also look at the semantic open format used by Zotero (RDF with the Dublin Core schema), the open source bibliographic management software.

Please work with information science experts such as librarians, archivists, documentalists, museologists, historians, etc.

Diane Mercier, Ph.D.
https://www.zotero.org/dmercier/items/order/dateModified/sort/desc

danfowler · November 1, 2016, 3:31pm

Thanks @dianemercier! I’d love to get as much feedback as possible!

I’m still researching this. As far as I can tell, the first critical step is to get MARC data into a tabular format (e.g. CSV, Excel) using something like MarcEdit. In MarcEdit, this is done in the MARC Utilities → Export Delimited window, where one can choose which (sub)fields to include in the CSV file.

For each (sub)field, we can probably define a good JSON Table Schema constraint and type for validation. For instance, we can define 043$c to be limited to ISO codes. For (sub)fields with multiple values embedded, that might be more difficult. Once the data is in tabular form, we can do the validation. Once the data is validated, one can then use an Excel import tool (e.g. http://manual.koha-community.org/3.2/en/marceditexcel.html) to get the data into software like Koha.
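To make the idea concrete, here is a minimal sketch of a JSON Table Schema descriptor for two exported columns, together with a small hand-written checker for the `required` and `pattern` constraints. The column names, the sample rows, and the two-letter pattern for 043$c are assumptions for illustration only (a real schema might use an `enum` of the permitted codes instead), and the `validate` function is a hypothetical stand-in for proper Frictionless Data tooling, not part of it.

```python
import csv
import io
import re

# Illustrative JSON Table Schema descriptor. The pattern is a placeholder
# for "looks like a two-letter code", not the actual MARC 21 rule.
schema = {
    "fields": [
        {"name": "245$a", "type": "string",
         "constraints": {"required": True}},
        {"name": "043$c", "type": "string",
         "constraints": {"pattern": "[a-z]{2}"}},
    ]
}

def validate(csv_text, schema):
    """Return (line number, column, message) tuples for failed constraints."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2, assuming no cell contains a line break.
    for line_number, row in enumerate(reader, start=2):
        for field in schema["fields"]:
            name = field["name"]
            value = row.get(name) or ""
            constraints = field.get("constraints", {})
            if value == "":
                if constraints.get("required"):
                    errors.append((line_number, name, "required value is missing"))
                continue
            pattern = constraints.get("pattern")
            # The pattern must match the whole value, not just a prefix.
            if pattern and not re.fullmatch(pattern, value):
                errors.append((line_number, name, "value does not match pattern"))
    return errors

# Made-up sample export with one missing title and one malformed code.
sample = (
    "245$a,043$c\n"
    "Open data handbook,gb\n"
    ",us\n"
    "Bibliographic formats,United States\n"
)
errors = validate(sample, schema)
for error in errors:
    print(error)
```

A column holding several values in one cell would need either a looser pattern that allows the separator or a split step before validation, which is the harder case mentioned above.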

dianemercier · November 1, 2016, 3:47pm

Zotero, like other open source software, has great tools to “translate” the RDF format to CSV or other formats. What is important is the schema, not the file format. CKAN uses Dublin Core. Why do you use MARC?

If we want to look to the future, I am confident that a semantic approach is better, with RDFa and other mechanisms.

rufuspollock · November 7, 2016, 7:08pm

Just to say I’ve done a lot of biblio work over the years including quite a bit recently.

Could you give a bit more detail about the requests re JTS for biblio records? In general biblio records are not that tabular. In addition, most of the MARC I have seen in the wild is XML, not CSV or even JSON …

Would def be interesting but not sure the match here is that nice.

todrobbins · April 6, 2017, 9:46pm

@danfowler @dianemercier @rufuspollock let’s continue this conversation! Especially in light of recent momentum.

todrobbins · April 6, 2017, 9:46pm

PS: feel free to tag or post at #working-groups:open-bibliography