Deconstructing RIS (part II)

I've tried to explain in a previous entry what's wrong with the RIS bibliography format, and I also figured out a likely reason why RIS came to be what it is. In this entry I'd like to show a possible way to "fix" RIS.

What are our options for designing a reference data format? Using XML is a fairly obvious first step, as it makes the data fit for validation. Beyond that, there are basically two approaches, each with its pros and cons:

- Use the smallest possible number of general-purpose elements. This approach has been promoted e.g. by Bruce D'Arcus and is enormously flexible, something that is greatly appreciated in the humanities. However, reference data of this kind do not lend themselves well to being stored in databases (except in native XML databases), and the approach gives the authors of reference data no guidance on how to encode their material.

- Use a predefined set of reference types, each of which has its own structure. This approach is weaker in terms of flexibility (but see below how some flexibility can still be gained), but the reference author gets pretty good support on how to encode the data.

In the first case the question is "how do I encode this?". In the second case the question is "which type is the best fit?". I believe the second question is far easier to answer for the uninitiated.

So what are the goals?

- Preserve the scope of RIS. For the sake of compatibility all RIS reference types should be supported, and the new format should be able to hold all information contained in RIS entries.
- Extend the scope of RIS where the latter is crippled. If there is a CHAP type and a BOOK type, then there must also be a SONG type along with the SOUND type.
- Untangle the multiple-purpose fields like M1-M3, IS and the like. Only analogous information should be stored in the same elements/database fields. Define separate elements/database fields for unrelated information. With 200GB platters hitting the market, there is no need to fold unrelated stuff into the same field.
- Sanitize the illogical A1-A3 and T1-T3 levels. These used to be a mix of the orthogonal concepts of "most likely to be asked for" on the one hand and the three-layer librarian approach on the other. Stick with the latter.
- Drop the distinction between journals and other publications that contain parts. Turn each publication with parts into a separately citable entry. Do the same with sets which are composed of several publications.
- Support relations between entries. "Is-part-of" is an obvious one, but we also might have "also-published-as" or "cited-after" relationships.
- Provide validation during data entry. This is best done using an XML schema language.
- Turn the schema into a data entry form. The schema should restrict data entry for each supported publication type so that you can't enter information which is not useful for this type, and which would therefore not be stored in the database anyway.
- Turn the schema into a database schema blueprint. It should be easy to deduce what information needs to be stored in order to support all reference types.
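
To make the goals about containers and relations a bit more tangible, here is a minimal Python sketch of entries that point at each other through named relations. It is not rbib and not RefDB code. The `Entry` class, the `containers` helper, the ids and the titles are all made up for illustration; only the type names CHAP, BOOK and SER are taken from RIS.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entry:
    """One separately citable entry (hypothetical structure, not rbib)."""
    entry_id: str
    ref_type: str
    title: str
    # Each relation is a (relation name, target entry id) pair.
    relations: List[Tuple[str, str]] = field(default_factory=list)


def containers(entry_id: str, entries: Dict[str, Entry]) -> List[str]:
    """Follow "is-part-of" links upward and return the ids of all containers."""
    chain: List[str] = []
    seen = {entry_id}
    current = entries[entry_id]
    while True:
        parents = [target for rel, target in current.relations
                   if rel == "is-part-of"]
        # Stop at the top-level entry, or if the data contains a cycle.
        if not parents or parents[0] in seen:
            break
        chain.append(parents[0])
        seen.add(parents[0])
        current = entries[parents[0]]
    return chain


# Invented example data: a chapter in a book which belongs to a series.
entries = {
    "series1": Entry("series1", "SER", "An Example Series"),
    "book1": Entry("book1", "BOOK", "An Example Book",
                   [("is-part-of", "series1")]),
    "chap1": Entry("chap1", "CHAP", "An Example Chapter",
                   [("is-part-of", "book1"),
                    ("also-published-as", "paper1")]),
}

print(containers("chap1", entries))
```

Every container is an ordinary entry of its own here, so the chapter, the book and the series can each be cited separately, and further relation names can be added without changing the structure.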

A first attempt to improve RIS was risx.dtd, used as an XML data format in RefDB. It was deliberately a small step, designed only to fix the most obvious problems. As it was implemented as an XML DTD, it added the capability to validate your references, and it made risx a target for transformations of bibliographic data stored in other SGML or XML applications. It also cleaned up the A1-A3 and T1-T3 mess by using three levels of bibliographic information. However, data entry was not really simplified, as risx did not offer the user much help with encoding the different reference types. Validation was limited to checking whether the structure matched what the database can store - it did not take into account the special requirements of each reference type. Nor did risx make any attempt to clean up the RIS multi-purpose fields and other relics of record-based data storage. On the plus side, risx is fairly simple to understand and to store in a database.

Time for another try then. I figured that a DTD would not be flexible enough to implement the idea of a data entry form. It would have required more than 30 top-level elements, one for each reference type, each with a wealth of subelements to encode the information appropriately for that type. Remember that risx didn't prevent you from adding e.g. part information to a book. This was still valid, albeit useless.

Relax NG allows you to rearrange a limited set of elements in an almost unlimited number of different patterns. The idea was to define e.g. a publication element which can hold all the information that any reference type might want to put in there. This automatically tells a database schema designer which fields and relationships need to be implemented. Each reference type is then implemented by a set of patterns which pick the required subelements, e.g. from the pool of subelements defined in the publication element. This in turn tells the authors of reference information which combinations of elements are allowed. The schema (the working name is rbib for want of something better) is implemented in three files which separate these concerns. rbib.rnc defines the reference types, rbib-start.rnc defines the allowed top-level elements, and rbib-library is sort of the element pool.
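
The "pool plus per-type patterns" idea can be imitated in a few lines of Python. This is only a rough, flattened sketch of the principle and not the actual rbib schema: the field names, the `REFERENCE_TYPES` table and the `validate` function are assumptions made up for this example, and the three bibliographic levels are ignored to keep it short.

```python
# Pool of every field that any reference type may use (hypothetical names).
# The pool doubles as the list of things a database has to be able to store.
PUBLICATION_POOL = {
    "author", "title", "year", "publisher",
    "volume", "issue", "pages", "media_type",
}

# Each reference type merely selects required and optional fields from the pool.
REFERENCE_TYPES = {
    "BOOK": {"required": {"author", "title", "year"},
             "optional": {"publisher", "volume"}},
    "JOUR": {"required": {"author", "title", "year"},
             "optional": {"volume", "issue", "pages", "media_type"}},
}

# Sanity check: no type may use a field which is missing from the pool.
for spec in REFERENCE_TYPES.values():
    assert spec["required"] | spec["optional"] <= PUBLICATION_POOL


def validate(ref_type, data):
    """Return a list of problems; an empty list means the entry is valid."""
    spec = REFERENCE_TYPES[ref_type]
    allowed = spec["required"] | spec["optional"]
    present = set(data)
    problems = [f"missing required field: {name}"
                for name in sorted(spec["required"] - present)]
    problems += [f"field not allowed for {ref_type}: {name}"
                 for name in sorted(present - allowed)]
    return problems


# A book entry with page information is rejected in this sketch.
book = {"author": "A. Author", "title": "Some Book",
        "year": "2005", "pages": "10-20"}
print(validate("BOOK", book))
```

The pool is the database designer's view and the per-type table is the reference author's view. A real Relax NG schema additionally constrains element order, nesting and repetition, which plain sets cannot express.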

Now let's see how this schema addresses the goals mentioned above:

- Preserve the scope of RIS: rbib currently implements all types known to RIS, except GEN which is not supposed to be used by reference authors. This type made sense decades ago when Reference Manager still used a record-based database engine. If a reference did not match one of the predefined types, you could dump the data any way you wanted into the record by means of the GEN type. It does not make any sense in the context of rbib, hence it was dropped. Some information that RIS accepts was restricted. E.g. I don't see why a BOOK entry should contain page information. If you're interested in a chapter, it is a CHAP entry. If you want to refer to particular pages in a book, it is a BOOK entry, but the page information goes into the citation in your document. In a few cases the content models of particular types were extended to simplify the schema. E.g. in RIS, journal articles and abstracts on the one hand and magazine and newspaper articles on the other differ only in that the former may record a media type. By allowing the media type field for the latter types too, all four types now share the same content model.
- Extend the scope of RIS where it is crippled: This is currently not implemented, but doable and planned for the near future. Improved handling for author names and additional types (e.g. for encoding songs on a record) come to mind.
- Untangle the multiple-purpose fields like M1-M3, IS and the like: This is implemented.
- Sanitize the illogical A1-A3 and T1-T3 levels: rbib uses analytical, monographic, and series information to avoid these problems.
- Drop the distinction between journals and other publications that contain parts: Several types that can contain parts may now act as a container, regardless of whether they're monographs or published periodically. Same for series.
- Support relations between entries: some are implemented, but more could be added.
- Provide validation during data entry: obviously done by using Relax NG.
- Turn the schema into a data entry form: done by the design of the schema, see below.
- Turn the schema into a database schema blueprint: also done by the design of the schema, as discussed above.

How does this affect data entry? If we use a validating editor like the marvellous nXML mode in Emacs (there are also non-free tools that support this, so you don't have to become an Emacs adept), we won't be able to add part information to a book entry as the schema does not accept this. All you need to do at the beginning is to figure out the most appropriate reference type for your data. From then on, you can basically let nXML mode suggest the next element and fill in its value if it is available, until you've finished the reference entry.
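
What "suggest the next element" means can be illustrated with a toy content model in Python. This is not how nXML mode works internally and not the real rbib content model for books. `BOOK_MODEL`, its element names and `next_candidates` are hypothetical, and the sketch assumes a strictly ordered sequence in which each element occurs at most once.

```python
# Toy content model for a book entry: (element name, required?) in the
# order in which the elements must appear. All names are invented.
BOOK_MODEL = [
    ("author", True),
    ("title", True),
    ("edition", False),
    ("publisher", False),
    ("year", True),
]


def next_candidates(model, entered):
    """Return the elements that may follow the ones entered so far."""
    names = [name for name, _ in model]
    # Continue after the last element the user has already filled in.
    # names.index raises ValueError for an element the model doesn't know.
    start = names.index(entered[-1]) + 1 if entered else 0
    candidates = []
    for name, required in model[start:]:
        candidates.append(name)
        if required:
            # A required element can't be skipped, so nothing beyond it
            # may be offered yet.
            break
    return candidates


print(next_candidates(BOOK_MODEL, []))
print(next_candidates(BOOK_MODEL, ["author", "title"]))
```

Since part information does not occur anywhere in this model, it is never offered as a candidate, which is exactly the behaviour a schema-driven data entry form is meant to provide.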

All in all, the first implementation of the rbib schema does not attempt to solve all problems of bibliographic data once and for all, but it is a suggestion that helps both users and database implementers get beyond the current limitations.

