Words as identifiers

One of the main problems of the MARC format is its use of words (textual strings) as identifiers for data entities. This is how links in library data are expressed. Examples include formalized headings that identify authority records (e.g., "Segaran, Toby" as a link to the author entity) and formalized codes that point to values from controlled vocabularies (e.g., "xxu" for the United States as the country in which the document was published).

The problem with words is that they identify an entity only partially. As identifiers they are neither unique nor universal. Words drawn from a natural language are inherently ambiguous, so if they are to serve as identifiers they must be strictly formalized to achieve a reasonable degree of uniqueness. One example of such formalization is writing a person's name in the form "Surname, Given name". Even a formalized form may not be unique, so another value has to be added to refine the identifier. For example, personal names are often entered together with the person's dates of birth and death. The same principle applies to subject headings when they are accompanied by a qualifier code referring to the category to which the heading belongs.
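The effect of such refinement can be shown with a small Python sketch. The two persons, the `make_heading` helper and the exact heading layout are invented for illustration and do not follow any particular cataloguing rule; the point is only that an undated heading collapses two people into one identifier, while the dates keep them apart.

```python
from collections import defaultdict

def make_heading(surname, given, born=None, died=None):
    """Build a formalized heading such as 'Smith, John, 1900-1980'."""
    heading = f"{surname}, {given}"
    if born is not None or died is not None:
        # An open-ended date range (person still living) ends with a bare hyphen.
        heading += f", {born or ''}-{died or ''}"
    return heading

# Two distinct, invented persons who share the same name.
persons = [
    {"id": 1, "surname": "Smith", "given": "John", "born": 1900, "died": 1980},
    {"id": 2, "surname": "Smith", "given": "John", "born": 1945, "died": None},
]

def group_by_heading(persons, with_dates):
    """Map each heading to the ids of all persons it would identify."""
    groups = defaultdict(list)
    for p in persons:
        if with_dates:
            heading = make_heading(p["surname"], p["given"], p["born"], p["died"])
        else:
            heading = make_heading(p["surname"], p["given"])
        groups[heading].append(p["id"])
    return dict(groups)

plain = group_by_heading(persons, with_dates=False)  # one heading, two persons
dated = group_by_heading(persons, with_dates=True)   # one heading per person
```

Even the dated form is only as unique as the data allows: two people with the same name and the same dates would still collide.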

When words are used as identifiers they can be misspelled, and different words, such as synonyms, can be used to express the same concept. The same word can also denote different things, so its meaning depends on the context in which it is used. By contrast, identifiers such as URIs work even outside of their original context. Lexical identifiers are also bound to a natural language, which makes them a barrier to internationalization and to sharing library data between countries that use different languages. For these reasons, the identifiers used in library data should be neutral and not bound to any particular language.
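A minimal Python sketch of the difference, under stated assumptions: the three records, the misspelled variant and the `http://example.org/` URI are all made up for illustration. Each record carries both a textual label and a URI for the same author, so the two ways of linking can be compared directly.

```python
# Three invented records that all refer to the same author.
# The labels differ (canonical form, misspelling, natural word order);
# the URI is a hypothetical, language-neutral identifier.
AUTHOR_URI = "http://example.org/person/1"

records = [
    {"record": "Record 1", "author_label": "Segaran, Toby", "author_uri": AUTHOR_URI},
    {"record": "Record 2", "author_label": "Segaran, Tobby", "author_uri": AUTHOR_URI},
    {"record": "Record 3", "author_label": "Toby Segaran", "author_uri": AUTHOR_URI},
]

# Linking by label: every spelling variant looks like a different author.
authors_by_label = {r["author_label"] for r in records}

# Linking by URI: all three records point to one and the same entity.
authors_by_uri = {r["author_uri"] for r in records}

label_count = len(authors_by_label)  # 3 apparent authors
uri_count = len(authors_by_uri)      # 1 actual author
```

The labels can still be kept for display in any language; they just stop carrying the burden of identification.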

Another problem with words as identifiers is that they cannot be used to directly obtain a representation of the entity they refer to. One must first search a database to find the entity that is labelled with the word identifier. In other words, there is no standard mechanism for dereferencing words to obtain the description of the entity they point to. Likewise, to search for an entity one has to re-create the exact identifier that is used for it. Identifiers should therefore be easy to re-use without introducing errors.
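The contrast between searching and dereferencing can be sketched in Python. This is a simulation, not a real service: the authority file is an in-memory dictionary, the `http://example.org/` URIs and both helper functions are hypothetical, and `dereference` merely stands in for what would be an HTTP request on the URI in practice.

```python
# A toy authority file keyed by hypothetical URIs.
AUTHORITY = {
    "http://example.org/authority/1": {"label": "Smith, John", "born": 1900},
    "http://example.org/authority/2": {"label": "Smith, John", "born": 1945},
}

def find_by_label(label):
    """Word identifier: scan the whole file for records with this exact label."""
    return [uri for uri, rec in AUTHORITY.items() if rec["label"] == label]

def dereference(uri):
    """URI identifier: go straight to the description (stand-in for an HTTP GET)."""
    return AUTHORITY.get(uri)

# The label has to be re-created exactly, and even then it may be ambiguous.
matches = find_by_label("Smith, John")  # two candidates
missed = find_by_label("Smith, Jon")    # a typo finds nothing

# The URI leads to exactly one description without any search step.
record = dereference("http://example.org/authority/2")
```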

Many of these issues can be solved by using URIs instead. However, using URIs in MARC is cumbersome because it requires adding new non-standard subfields, much as RDFa extends the structure of HTML.

What we want
We want identifiers that are:

 * unique
 * universal
 * context-independent
 * neutral
 * not bound to any particular language
 * dereferenceable
 * easily reusable