MARC parser

Reasoning
To create a mapping from MARC to RDF, we need a way to identify values in MARC precisely, even inside a single subfield value. For example, you may be trying to parse values from MARC field 773, which contains different types of values, such as the date of publication or the name of the publisher. To refer precisely to the date of publication in your mapping, you need not only the field tag, the subfield code, and the indicator values, but also the location of this piece of data in the parse tree of the value.

We propose to identify MARC data elements by XPath expressions on the upper level and by location in the parse tree on the level of the value. For this purpose, we need to create small parsers for the different kinds of MARC data elements. For example, there can be a specific parser for field 300, subfield $a. This field can hold values such as "vii, 269 p. : ill. ; 30 cm.", and in your mapping you might want to refer to the number of pages specifically. XPath alone cannot do this, so you need a more precise means of identification, namely the location in the parse tree. An identifier then has two parts: an XPath expression that identifies the data element on the upper level, and a parse-tree location that adds precision by pointing into the parse tree produced by the parser written for subfield $a of field 300.
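As a minimal sketch of what such a value-level parser could look like, the following Python code parses the example value with a regular expression and addresses the number of pages by a slash-separated path. The grammar, the key names (`extent`, `pages`, `illustrations`, `height_cm`) and the path syntax are assumptions made for illustration; they are not defined anywhere on this page and cover only values shaped like the example.

```python
import re

# Hypothetical pattern for values shaped like "vii, 269 p. : ill. ; 30 cm."
# Real 300 values vary far more; this only covers the example on this page.
PHYSICAL_DESCRIPTION = re.compile(
    r"^(?:(?P<preliminary_pages>[ivxlcdm]+),\s*)?"   # optional roman-numbered pages
    r"(?P<pages>\d+)\s*p\."                          # number of pages
    r"(?:\s*:\s*(?P<illustrations>[^;]+?))?"         # optional illustration note
    r"(?:\s*;\s*(?P<height_cm>\d+)\s*cm\.?)?\s*$"    # optional height in cm
)


def parse_physical_description(value):
    """Return a small parse tree (nested dicts), or None if the value does not match."""
    match = PHYSICAL_DESCRIPTION.match(value)
    if match is None:
        return None
    groups = match.groupdict()
    return {
        "extent": {
            "preliminary_pages": groups["preliminary_pages"],
            "pages": int(groups["pages"]),
        },
        "illustrations": groups["illustrations"],
        "height_cm": int(groups["height_cm"]) if groups["height_cm"] else None,
    }


def lookup(tree, path):
    """Follow a slash-separated path (an assumed notation) into the parse tree."""
    node = tree
    for key in path.split("/"):
        node = node[key]
    return node


tree = parse_physical_description("vii, 269 p. : ill. ; 30 cm.")
print(lookup(tree, "extent/pages"))
```

A mapping could then pair the XPath expression for subfield $a of field 300 with a path such as `extent/pages`.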

Suggestions

 * Write a BNF grammar for each MARC data element that needs parsing on the level of the value and put it on this wiki, so that people can write mappings based on the parse tree the grammar produces.
 * Producing a context-free grammar can be almost impossible for the kind of unstructured text found in MARC, so a contextual grammar or other parsing methods may have to be employed.
 * Where the value can have several different structures, as in field 773, write multiple BNF grammars and train a Bayesian classifier on open bibliographic datasets to choose which grammar to use for parsing.
 * Write a parser based on each grammar and put it in an openly accessible repository, such as GitHub.
 * Combine these parsers with an XPath-based MARC parser.
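The last suggestion, combining value-level parsers with an XPath-based MARC parser, could be sketched as follows. The sketch assumes the record is available as MARCXML and uses the limited XPath support in Python's standard `xml.etree.ElementTree`. The sample record, the `VALUE_PARSERS` registry and the `parse_pages` helper are hypothetical names introduced only for this example.

```python
import re
import xml.etree.ElementTree as ET

# MARCXML namespace as published by the Library of Congress.
MARCXML_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up sample record containing only the example value from this page.
SAMPLE_RECORD = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">vii, 269 p. : ill. ; 30 cm.</subfield>
  </datafield>
</record>
"""


def parse_pages(value):
    """Hypothetical value-level parser: extract the page count if present."""
    match = re.search(r"(\d+)\s*p\.", value)
    return {"pages": int(match.group(1))} if match else {}


# Upper level: XPath expression. Value level: the parser to apply to each hit.
VALUE_PARSERS = {
    "marc:datafield[@tag='300']/marc:subfield[@code='a']": parse_pages,
}


def extract(record_xml):
    """Run every registered parser on the elements its XPath expression selects."""
    record = ET.fromstring(record_xml)
    results = []
    for xpath, parser in VALUE_PARSERS.items():
        for element in record.findall(xpath, MARCXML_NS):
            results.append((xpath, parser(element.text or "")))
    return results


results = extract(SAMPLE_RECORD)
print(results)
```

Keeping the registry keyed by XPath expression means a new parser can be added for another data element without touching the extraction loop.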

Alternative suggestion

 * Write a small DSL for describing all the MARC fields, one that is easy to write and that metadata experts can annotate with comments.

For example, in a Clojure-like syntax:

    245
    245 :ind1 :a
    008 :pos [7 11]
    755 :g :pos [7 11]
    700 :d #"-(\d{4})"
    245 :a :when (245 :c #",$")
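To make the intent of rules such as `008 :pos [7 11]` and `700 :d #"-(\d{4})"` concrete, here is a rough Python sketch of how they might be evaluated. It rests on several assumptions not stated on this page: that `[7 11]` means a start-inclusive, end-exclusive character range, that a regular expression returns its first capture group, and that a record is held in the simple dictionary shape shown. The sample data is made up, and the `:when` condition is not covered.

```python
import re

# Made-up record: control fields are plain strings, data fields are lists of
# dicts holding indicators and (code, value) subfield pairs.
record = {
    "008": "850101s1985    xx            000 0 eng d",
    "700": [
        {
            "ind1": "1",
            "ind2": " ",
            "subfields": [("a", "Doe, Jane,"), ("d", "1901-1985")],
        }
    ],
}


def select(record, tag, code=None, pos=None, pattern=None):
    """Evaluate one rule: pick values by tag and code, then slice and/or match."""
    field = record.get(tag)
    if field is None:
        return []
    if isinstance(field, str):
        values = [field]  # control field, no subfields
    else:
        values = [
            value
            for occurrence in field
            for subfield_code, value in occurrence["subfields"]
            if code is None or subfield_code == code
        ]
    if pos is not None:
        start, end = pos  # assumed: start inclusive, end exclusive
        values = [value[start:end] for value in values]
    if pattern is not None:
        matches = (re.search(pattern, value) for value in values)
        values = [match.group(1) for match in matches if match]
    return values


print(select(record, "008", pos=(7, 11)))                   # 008 :pos [7 11]
print(select(record, "700", code="d", pattern=r"-(\d{4})"))  # 700 :d #"-(\d{4})"
```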

In some cases you can't easily parse out the data using regular expressions, and specialized parsers may have to be implemented. The domain expert could document the syntax or point to an external document describing the content. A list of fields can be provided to serve as hints about the types of data that can be found in the MARC subfield. In other cases, an N3 or Turtle example of the suggested parser output could be provided.

    852 :a
        :manual http://somewhere.org/holding.pdf
        :plugin http://github/mary/java_loction_parser
        :plugin http://github/john/my_ruby_parser
        :output http://githib/john/852-a-output.n3
        :fields [:bibo:year :bibo:volume :fa0:missing]

Alternative examples using the SOLRMARC language

    title = 245a
    title_uniform = 130adfgklmnoprst:240adfgklmnoprs, first
    author_meeting = custom, getAllAlphaSubfields(111)

Here getAllAlphaSubfields is a reference to an external procedure, which might be documented separately.
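To show how little machinery such specifications need, the following Python sketch reads the three example lines into a structured form. It is not the SolrMarc implementation and covers only the shapes shown above: a three-digit tag followed by subfield codes, alternatives separated by `:`, an optional modifier after a comma, and the `custom` keyword. The function name and the output keys are assumptions made for this example.

```python
def parse_index_line(line):
    """Parse one 'name = spec' line of the kind shown in the examples above."""
    name, _, spec = line.partition("=")
    parts = [part.strip() for part in spec.split(",", 1)]
    if parts[0] == "custom":
        # The remainder names an external procedure; it is kept as plain text.
        return {"index": name.strip(), "custom": parts[1]}
    fields = [
        {"tag": item[:3], "subfields": list(item[3:])}
        for item in parts[0].split(":")
    ]
    modifier = parts[1] if len(parts) > 1 else None
    return {"index": name.strip(), "fields": fields, "modifier": modifier}


for example in (
    "title = 245a",
    "title_uniform = 130adfgklmnoprst:240adfgklmnoprs, first",
    "author_meeting = custom, getAllAlphaSubfields(111)",
):
    print(parse_index_line(example))
```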

Examples of parsing problems
Holdings statements (found in 866 $a) are an example of patterns that are very difficult, and perhaps impossible, to match with regular grammars. The suggested syntaxes follow Z39.42, Z39.71 and ISO 10324, but many local variations can be found.

Consider the holdings statements that Belgian libraries record for the journal 'Nature':



Royal Library:

Vol. 1, no 1 (1869)-vol. 227, no 5278 (1970); vol. 247, no 5485 (1974)-vol. 378, no 6559 (1995); missing:  93(1914)2336-2339; 99(1917)3-8; 100(1917/18)9-2; 102(1919)2567-2570,2572; 103(1919)2601-2618; 147(1940)3718, 3719; 157(1946)3974- 3979, 3982-3990, 3992-3994; 179(1957)4561; (1992)6341recl. 20/I/92

KU Leuven:


 * 1) 141(1938) - 447(2007) 7141 * Ontbr.: 147(1941) - 156(1945)

UCL:

289,1981--353,1991; 354(MQ.6351),1991; 355,1992--426,2003; 427(MQ.6970/6973),2003; 428,2003--437, 2005; 438(MQ.7069),2005; 439,2005; 440(MQ.7085),2006; 441,2006--442,2006

UGent:

#24(1881) - 25(1881/82); 30(1884) - 33(1885/86); 48(1893); 52(1895); 54(1896); 68(1903); 75(1906);  87(1911)2187-2188; 124(1929)3116, 3118, 3120-3126, 3128; 128(1931)3236-3243; 129(1932) - 188(1960);   209(1966) - 294(1981)5836; 316(1985)6023-6028, 6030-6031; 359(1992); (2004)6969 - (2005)7056

ULB:

v.76, no1(1907)-v.84(1915) ; v.103, no1(1919)-v.224(1969) ; v.225(1970)-v.280, no5724(1979) ; v.281, no5726(1979)-v.380, no6559(1996) ; v.380, no6561(1996)-v.380, no6569(1996) ; v.380, no6571(1996)-v.421, no6924(2002) ; v.421, no6926(2002)-v.422, no6929(2003) ; v.422, no6931(2003)-

It would be interesting to find an RDF syntax for these statements, as they contain valuable information to share.
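Of the statements above, the ULB one is the most regular, so it can serve as a first target. The Python sketch below splits such a statement into ranges and parses each endpoint of the form `v.76, no1(1907)`. It is written against the ULB notation only; the other libraries' statements will not parse with it, which illustrates the local variation problem. The function names and output keys are assumptions made for this example.

```python
import re

# One endpoint in the ULB notation: volume, optional issue number, year.
ENDPOINT = re.compile(
    r"v\.(?P<volume>\d+)(?:,\s*no(?P<issue>\d+))?\((?P<year>\d{4})\)"
)


def parse_endpoint(text):
    """Return volume/issue/year as integers, or None if the text is empty or unparsable."""
    match = ENDPOINT.fullmatch(text.strip())
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items() if value}


def parse_holdings(statement):
    """Split a statement on ';' into ranges and each range on '-' into endpoints."""
    ranges = []
    for chunk in statement.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        start, _, end = chunk.partition("-")
        # A trailing '-' leaves an empty end, which parses to None (open range).
        ranges.append({"from": parse_endpoint(start), "to": parse_endpoint(end)})
    return ranges


holdings = parse_holdings("v.76, no1(1907)-v.84(1915) ; v.422, no6931(2003)-")
for holding in holdings:
    print(holding)
```

Note that this sketch does not distinguish an open-ended range from an endpoint it failed to parse; a real parser would need to report the two cases separately.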

Links

 * Solr Marc Language
 * Tim Brody. JISC ParaCite Project
 * Patrick Hochstenbach's MARC parser implemented in Clojure.