Scholarly Machine-readable Texts


Electronic texts are nothing new. A variety of texts created in proprietary software formats, as well as those in plain ASCII format, have been available for years. But these texts suffer from a number of limitations. Texts created in proprietary software formats are not easy to exchange. ASCII texts are easy to exchange but are not formatted in any useful way. The web makes sharing texts easy, but finding them based on keyword searching is problematic. All of these texts are electronic, but they are limited in how we can work with them.

Machine-readable or electronic text: what's the difference?

Let's take a look at a text:


Conventions of our written language, developed over hundreds of years, allow us to add structure to this paragraph by adding spacing and punctuation, making it easier for us to read:

It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife.

However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.

"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"

As human readers we understand that this text is three paragraphs, contains dialog, and concerns two people, Mr. and Mrs. Bennet. We may even recognize this text as the opening of Jane Austen's novel, Pride and Prejudice. Computers have no such inherent understanding. Recreating this text with a word processor allows us to use these same written conventions, as well as giving us a way to copy, alter, rearrange, print and read this text. Putting this text on a web site gives us the added benefit of being able to share it with many others simultaneously. It also allows for limited searching of the document.

But is that good enough? As a scholar I may want to know more about this text. I may want to know who wrote it, when it was written, what form it took in the original, where it was published, and who the characters are. What are their speech patterns, and what can I learn about them through their speech? What can I learn about the author through this work? What can I learn about the culture in which it was written? If I am a book historian, what can I learn about this book as an artifact? If I'm reading this text online I may need to know the original pagination. All these questions and more may be asked of this text, and on all these points this electronic version is silent.

It is another "truth universally acknowledged" that computers should be able to help scholars. But while computers are wonderful inventions for processing data, they have no intuitive sense about that data. For example, if we were to search the web for the phrase "Jane Austen" we might find the text of Pride and Prejudice, but we would be as likely to find the Jane Austen Society home page, a paper by a high school student on Jane Austen, a list of books for sale at a bookstore, or an e-mail note that mentions a recent television adaptation of one of her works. The list goes on. So how do we create texts that we can read as texts but that can also be processed as data by a computer in a useful way?

SGML: Adding "intelligence" to an electronic text

Enter SGML, or the Standard Generalized Markup Language. The premise behind SGML is that we can build intelligence into the document itself. Once we have done that, we can still do everything we do with word-processed documents (copy and paste, transport, print, and search by keyword), and do it more easily, but we can also do new things with them. For example, we can automatically apply formatting to a number of documents based on one document, we can identify specific portions within a document (all sub-titles, all first paragraphs of chapters, all lines spoken by Hamlet in the play, etc.), or we can find all the instances of a given word and see how they are distributed throughout a text.
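To make the last two ideas concrete, here is a minimal sketch in Python. It uses an XML encoding (the simplified form of SGML discussed later in this article) because Python's standard library includes an XML parser but no SGML parser. The tag names <play>, <act>, <speech>, and <line>, the speaker and n attributes, and the helper function names are all assumptions made for this illustration, not part of any standard DTD.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical encoding of a few lines from Hamlet; tag names are illustrative.
PLAY = """
<play>
  <act n="1">
    <speech speaker="Hamlet"><line>A little more than kin, and less than kind.</line></speech>
    <speech speaker="King"><line>How is it that the clouds still hang on you?</line></speech>
    <speech speaker="Hamlet"><line>Not so, my lord; I am too much i' the sun.</line></speech>
  </act>
  <act n="3">
    <speech speaker="Hamlet"><line>To be, or not to be, that is the question.</line></speech>
  </act>
</play>
"""

def lines_spoken_by(root, speaker):
    """Collect the text of every <line> inside a <speech> by the given speaker."""
    return [line.text
            for speech in root.iter("speech")
            if speech.get("speaker") == speaker
            for line in speech.iter("line")]

def word_distribution(root, word):
    """Count how often a word occurs in each act, keyed by the act's n attribute."""
    counts = Counter()
    for act in root.iter("act"):
        for line in act.iter("line"):
            # Strip common punctuation and ignore case before comparing.
            words = [w.strip(".,;:?!").lower() for w in line.text.split()]
            counts[act.get("n")] += words.count(word.lower())
    return dict(counts)

root = ET.fromstring(PLAY)
hamlet_lines = lines_spoken_by(root, "Hamlet")   # three lines in this sample
be_by_act = word_distribution(root, "be")        # occurrences of "be" per act
```

Neither question could be answered reliably by a keyword search of the plain text, because plain text does not record who is speaking or where an act begins.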

How does it work?

SGML allows us to define a language to describe our document. For example, if I were dealing with a series of plays I would want to define items like "speaker," "act," "scene," "stage directions," etc. If I were creating a series of poems I would want to define items like "stanza," "line," etc., or if I were dealing with a library collection of the papers of George Perkins Marsh, I would need to define items like "letters," "journals," "photographs," etc. These descriptions, or Document Type Definitions (DTDs), allow SGML to be applied flexibly to any type of document.

Defining the document type means deciding on names for tags that will be embedded in the text itself. If I were creating an electronic edition of the novel Pride and Prejudice, I would probably want to have tags for the book's title, the author, the chapters, the publisher, the place of publication, the date, and so on. The first three of these tags might be, respectively, <title>, <author>, and <chapter>. The DTD must also specify how these tags can be used. My DTD might say that a section marked as a <chapter> cannot come before text marked as a <title>.
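As a rough illustration of what such tags buy us, the sketch below encodes the opening of the novel and then asks for the title, the author, and the first paragraph of each chapter. Only <title>, <author>, and <chapter> come from the discussion above; the <novel> wrapper, the <p> tag, and the n attribute are assumptions added for the example, and the text is written as XML so that Python's standard parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding; <novel>, <p>, and the n attribute are illustrative only.
NOVEL = """
<novel>
  <title>Pride and Prejudice</title>
  <author>Jane Austen</author>
  <chapter n="1">
    <p>It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife.</p>
    <p>However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.</p>
  </chapter>
</novel>
"""

root = ET.fromstring(NOVEL)

# The tags let a program answer questions the plain text cannot.
title = root.findtext("title")
author = root.findtext("author")

# find("p") returns the first <p> child, i.e. each chapter's first paragraph.
first_paragraphs = [chapter.find("p").text for chapter in root.iter("chapter")]
```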

Once the DTD is created there are several steps to completing our online edition. First, we must encode the text itself, that is, insert the appropriate tags in the appropriate places. Because we can have an unlimited number of tags, many of which are allowed only in a certain order, we usually use an SGML editor to help with this task. The SGML editor reads the DTD and allows us to insert only those tags that are allowed at any given point, based on that DTD. After we have inserted the tags, an SGML parser checks the document to validate that all the tags are inserted correctly. Some SGML editors come with this function built in. There are also separate programs that can handle this validation.
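The sketch below is not an SGML parser, and it does not read a DTD. It is a hand-written stand-in that checks a single rule, the one suggested earlier that a <chapter> may not come before the <title>, just to show the kind of answer a validating parser gives. The function name, the <novel> wrapper, and the sample documents are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

def check_title_before_chapters(xml_text):
    """Return a list of problems; an empty list means the rule is satisfied."""
    root = ET.fromstring(xml_text)
    problems = []
    seen_title = False
    # Walk the top-level elements in document order.
    for child in root:
        if child.tag == "title":
            seen_title = True
        elif child.tag == "chapter" and not seen_title:
            problems.append("<chapter> appears before <title>")
    if not seen_title:
        problems.append("no <title> found")
    return problems

# Two tiny sample documents: one follows the rule, one breaks it.
GOOD = "<novel><title>Pride and Prejudice</title><chapter>text</chapter></novel>"
BAD = "<novel><chapter>text</chapter><title>Pride and Prejudice</title></novel>"

good_result = check_title_before_chapters(GOOD)
bad_result = check_title_before_chapters(BAD)
```

A real validating parser derives every such rule from the DTD itself, so nothing has to be written by hand for each document type.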

What are the TEI DTD and the EAD DTD?

Computing scholars have long realized that having a consistent encoding scheme for the interchange of scholarly electronic texts would be beneficial. Although SGML allows you to define your own DTD, exchanging documents based on multiple DTDs could prove confusing. For example, you may want to search through all the titles in a document and you have them tagged as <title> while a colleague has them tagged as <head>. Ten years ago the Association for Computers and the Humanities decided to address this issue. From that effort has come the Text Encoding Initiative. The TEI DTD strives to provide encoding for all types of texts that might be used by humanities scholars. It is designed to be flexible yet comprehensive. You can read more about the TEI at their web page:

A similar effort has been developing for encoding Finding Aids. These catalogs provide information about a library or other organization's special holdings. For example, UVM's Special Collections contains the papers of George Perkins Marsh. Typically these items are stored in cartons and are not catalogued in the library's usual online catalog. The Encoded Archival Description DTD, or EAD DTD, seeks to provide a standardized way of encoding these guides. More information about the EAD DTD is available at: http://lcweb.loc.gov/ead/
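The <title> versus <head> mismatch mentioned above is easy to reproduce. In this small sketch, a search written for my tag name silently finds nothing in a colleague's document until it is told about the second name. The <doc> and <p> tags, the sample text, and the find_titles helper are assumptions for illustration; shared DTDs such as the TEI and EAD aim to make this kind of per-project adjustment unnecessary.

```python
import xml.etree.ElementTree as ET

# The same content, encoded under two different home-grown tag sets.
MINE = "<doc><title>Chapter One</title><p>Some text.</p></doc>"
COLLEAGUE = "<doc><head>Chapter One</head><p>Some text.</p></doc>"

def find_titles(xml_text, title_tags=("title",)):
    """Return the text of every element whose tag is treated as a title."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter() if element.tag in title_tags]

mine_titles = find_titles(MINE)
colleague_titles = find_titles(COLLEAGUE)            # finds nothing
both = find_titles(COLLEAGUE, title_tags=("title", "head"))
```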

We have our DTD, our document is tagged and has been validated. What next?

What do we want to do with the document? SGML allows us to create documents that can be used and reused in a variety of ways. Word processors use procedural markup to control formatting (left margin 1 inch, font Times New Roman, size 12 for text and 16 for titles, etc.). They specify how a document will look. SGML uses descriptive rather than procedural markup. It describes the parts of a document, not how those parts should be rendered. Because of this, SGML documents can be rendered in a variety of ways based on how they will be used. This is accomplished through style sheets. For example, if I plan to print the text, I might create a style sheet that specifies 1 inch margins, a 12 point font, and I might specify that titles should be bolded. If I intend that this document will be read on a computer screen with an SGML browsing program, I might create a style sheet that specifies a larger font size for easier reading. And if I plan to make this text available across the web, I might create a style sheet that includes HTML tags so the web browser will display it properly. In all cases I do not have to change the original document, only the style sheets. As an added benefit, if I have a series of documents I want to format in the same way, I can create a single style sheet or set of style sheets that can then be applied to all those documents simultaneously.
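The separation between a document and its style sheets can be sketched in a few lines. Here each "style sheet" is only a Python dictionary that maps a descriptive tag to an output template, which is far simpler than a real SGML style sheet language. The tag names, the templates, and the render function are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# One descriptively tagged document; it is never changed below.
DOC = (
    "<novel>"
    "<title>Pride and Prejudice</title>"
    "<p>It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune must be in want of a wife.</p>"
    "</novel>"
)

# Two hypothetical style sheets: one for plain printed output, one for the web.
PRINT_STYLE = {"title": "*** {} ***\n\n", "p": "    {}\n"}
HTML_STYLE = {"title": "<h1>{}</h1>\n", "p": "<p>{}</p>\n"}

def render(xml_text, style, escape_text=False):
    """Render each top-level element using the template for its tag."""
    root = ET.fromstring(xml_text)
    pieces = []
    for element in root:
        text = element.text or ""
        if escape_text:
            text = escape(text)  # protect characters such as & and < in HTML
        # Tags the style sheet does not mention fall back to plain text.
        template = style.get(element.tag, "{}\n")
        pieces.append(template.format(text))
    return "".join(pieces)

print_version = render(DOC, PRINT_STYLE)
html_version = render(DOC, HTML_STYLE, escape_text=True)
```

Changing how titles look means editing one entry in a style sheet, and the same style sheet can be applied to any number of documents that use the same tags.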

Where does HTML fit in?

HTML is actually an SGML DTD with a slight twist. True SGML documents can be read with any SGML browsing program because the DTD that describes their parts accompanies them. HTML documents, on the other hand, are designed to be read with a specific HTML browser, that is, the browser is created to understand only the limited HTML tag set. So, you, as an HTML document author, may use only those predefined tags. And, since some browsers are built to understand their own tags and not others, you may often find that people with other browsers have difficulty reading your documents. Also, HTML straddles the procedural/descriptive boundary in that some of its tags describe how a portion of a document should look (<B>, <I>, etc.) while some describe the function (<H1>, <STRONG>, etc.).

What about XML?

SGML is complicated. The technical specification for how SGML programs should be written, how they must be structured, and how they can do what they must do, is a document that weighs in at several pounds. Because of this there have been few SGML tools written and these tend to be expensive. HTML is easy, but that ease comes at a price: the limited number of tags means that you can only create limited types of documents. XML is an attempt to provide the flexibility of SGML with the ease of HTML. XML documents are SGML documents. They will be understood by any SGML program. But the programs needed to create, validate, and display XML documents will be much easier to create (the entire XML technical specification is only 30 pages long). Because of this we should see any number of XML programs becoming available soon. Like SGML, XML will allow you to define your own tags and style sheets, thus gaining the full benefit of SGML.
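One consequence of XML's simpler rules is that a general-purpose parser can read a document that uses tags it has never seen before, and can at least confirm that the tags are properly nested. The sketch below uses the XML parser in Python's standard library; the <poem>, <stanza>, and <line> tags and the is_well_formed helper are assumptions for illustration. Note that this checks only well-formedness, which is a weaker test than validation against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if nesting is broken."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Made-up tags are fine as long as they are opened and closed in order.
ok = is_well_formed("<poem><stanza><line>text</line></stanza></poem>")

# Here </stanza> closes before </line>, so the parser rejects the document.
broken = is_well_formed("<poem><stanza><line>text</stanza></line></poem>")
```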

Where is there more information on this?

Visit the web page at http://www.uvm.edu/~hag/scriptorium.html for more on SGML, the TEI, XML and other electronic text information. From that page you can also visit two UVM initiatives in this area: the UVM Libraries Finding Aids project, and the experimental server for Scholarly Electronic Resources. If you would like to try creating your own SGML documents, come see me. If you are a faculty member with students interested in working in machine-readable texts, let's talk about internships or class projects. Or if you are those students, let's work together. You can reach me at: hope.greenberg@uvm.edu.


Hope Greenberg, Academic Computing Services, University of Vermont. 14 November 1997.