The Burlington Agenda

Research Issues in Intellectual Access

Electronically Published Historical Documents

Background

On April 7-9, 2000, ten people⁽¹⁾ representing a variety of disciplines convened in Burlington, Vermont, to discuss ways to improve and standardize intellectual access to electronically published historical documents. The meeting, sponsored by the University of Vermont and funded by a grant from the National Historical Publications and Records Commission (NHPRC), focused on identifying issues for study and describing research opportunities, methods, projects, and collaborators for projects that will contribute to an intellectual framework or set of editorial guidelines for electronic publication of historical documents in a way that assures effective intellectual access.

The meeting was organized by Elizabeth Dow of the Special Collections Department of the University of Vermont in close collaboration with Mark Conrad, a member of the NHPRC staff.

The Need

The World Wide Web provides a highly attractive distribution medium for primary historical documents. To enable scholars and non-scholars alike to retrieve those documents effectively, publishers must learn to provide intellectual access to the contents that will rival or exceed the standard set by printed historical documentary editions. Modern editions of historical documents provide a highly sophisticated level of information retrieval through back-of-the-book indexes. The indexers of these volumes, usually the editors, capture references ranging from ordinary terms such as "illness" to sophisticated abstractions such as "property rights." They sort out people and places with the same names. They organize and classify the contents of the documents. They reference concepts stated in many different ways by applying their knowledge of the subject to analyze the text and label the concepts with a consistent vocabulary. They use cross-references frequently. Retrieval is precise and accurate, but the work is highly labor intensive. As we approach the age of electronic historical editions, we must ask, "How can we best achieve the same results in the most cost-effective manner?"

Today's powerful search engines can find explicit terms that occur in a text; they can find terms that the user links together; they can isolate terms from other terms. Front end vocabulary systems and search models based on statistical analysis of documents can retrieve documents that reflect the concepts in the search. A superficial examination of a printed index would reveal that today's search engines could retrieve many of the references, but it is not clear whether they could build the web of relationships and analysis provided by a good back-of-the-book index, or the navigational support provided by such an index. Publishers don't yet know how to sort out what technology can do, what requires human intervention, and how the two can be woven into frameworks for providing intellectual access to electronically published historical documents.

The University of Vermont's George Perkins Marsh Online Research Center offers a modern scholarly edition of historical documents online: it contains professionally transcribed and annotated documents, published with explanatory essays in a well-considered design. It lacks only an "index" to its contents; its sole retrieval tool is the full-text search engine built into its publication software. The need to develop an index-equivalent for this scholarly edition drove the efforts that culminated in the Burlington meeting.

The Meeting

The meeting brought together experts working in documentary editing, experimental electronic publishing, library and information science research, and computer science research, as well as the staff from the Marsh project. The group convened Friday afternoon and Elizabeth Dow, who chaired the meeting, opened with remarks in which she emphasized that:

1. The goal of the research on intellectual access to electronically published historical documents should be to develop generalized policies, practices, methods, and applications to ensure that publications will provide full intellectual access.

2. Research on electronically published historical documents should anticipate technological trends so far as possible, but should not wait for anticipated trends to materialize.

3. Research on electronically published historical documents should recognize historical standards created by the back-of-the-book indexes that provide passage-level retrieval rather than document-level retrieval.

4. Research should be interdisciplinary to draw from and build on knowledge available in a number of fields confronting similar issues, including artificial intelligence, information retrieval, human computer interaction, information extraction, etc.

5. Resources in technological expertise, funding, and even cultural/political power are limited within the community of documentary editors and other publishers of historical documents. Thus researchers must find ways to coordinate activities to maximize available resources.

Dow emphasized that the question of conveying or discovering what the content of a document "means" is an old question that has a new urgency. It must be approached once more by developing good research hypotheses, involving interdisciplinary teams whenever possible, and using many research methods.

Following Dow's remarks, the participants focused on a set of primary documents they had received prior to the meeting. Each had prepared notes on the information retrieval problems, which they shared with the group. Over the course of an hour and a half's conversation, they made the following points about primary documents, the nature of the Web, and traditional indexing methods:

Primary historical documents are full of oblique references, undefined terms, unidentified references, abbreviations, words that have many meanings depending on discipline, time, and context. A topic may be discussed throughout a lengthy document and have no specific reference to it. All of these may be presented without the analysis or abstraction customarily found in secondary literature. While in an edited collection of documents the identification of people, places, and topics may appear in the annotations, many, if not most, concepts remain unnamed, at least in a form recognizable to most modern readers or non-specialists. Unedited documents lack even rudimentary clarification of the simplest factual matter.

Traditional book indexing developed in response to and within the limits of the physical environment of a book. The index belongs to a closed system, i.e., the volume or set of volumes it references. The look-and-feel of indexes has been standardized to a narrow range of structures that most scholars understand, having learned to use them as school children. Like all book indexers, documentary editors develop their own systems within a given set of volumes, including their own thesaurus of index terms, their own standards of detail, and their own method for handling abstraction, ambiguities of language, etc. They tend to rely on their documents as the source of the problem and the source of many of the solutions they develop. In multi-volume editions, they rarely have the luxury of time and resources to go back and re-work their indexes for any purpose, though they find it difficult to index consistently over time in an environment of changing scholarly interests.

The nature of the Web has changed much that documentary editors took for granted, starting with the fact that they are less and less able to predict who will use their materials. The Web opens historical documents to a wide range of ages, educational levels, and ethnic and national groups. Given the diversity of the audience, no editorial apparatus will serve all Web audiences, even if added using several layers of language and references.

Even when a publisher has determined which audience it will strive to serve, there has evolved no generally accepted way of presenting editorial apparatus - no generally accepted "system," like a back-of-the-book index, for people to learn. It is very unclear what the "index" of a collection of electronically published historical documents, with or without editing, might look like.

In the Web environment, a publisher can neither predict nor control the level at which any one user will "discover" material. Web search engines find materials at the level comparable to the "front" of a collection as well as at the level of an individual document. The Web can flatten or destroy the structure a publisher develops within a collection of documents.

These considerations guided the discussion of the next day and a half, and the following recommendations flowed from that discussion. Throughout the discussion, the participants used the terms "index" and "indexing" to mean a wide variety of approaches to providing intellectual access. Further, unless they referred specifically to traditional scholarly editions of historical documents, they made no distinction between edited and unedited documents.

Research Agenda

General Areas of Research

Providing intellectual access to electronically published historical documents requires research in three very broad areas: user studies, publication management studies, and studies of access to information itself. While the meeting, and so the research focus, was driven by the needs of the document editing community, the participants understand the potential for a network of cultural heritage resources that would connect documents, arts and crafts, and common artifacts on the Web. One research issue addresses that potential directly, but the language of the rest reflects the meeting's roots in the document world. In choosing to write the report in the language of document publishing, we seek clarity of expression only, and do not mean to exclude from the rest of the agenda the needs of non-document cultural heritage repositories choosing to make their holdings available through the Web; some issues are common to all.

User studies: Providing intellectual access to electronically published historical documents raises the issue of identifying the users of the document. Traditionally, documentary editors have assumed a readership of fairly sophisticated scholars and students using their volume(s) in a small number of research libraries. The Web makes the edited works, as well as a growing number of unedited historical documents, accessible to many audiences ranging from the traditional readership of sophisticated scholars to school children all over the world. These studies should explore the information needs and behaviors of a wide range of users so publishers may optimize surrogation, presentation, user interface design, and other tools to facilitate intellectual access.

Publication management: Electronically publishing historical documents will alter the way in which documentary editions are created and published. These studies should explore procedural and management changes in the creation and publication processes.

Access to information: Intellectual access to electronically published historical documents depends on three procedures, labeled here as discovery, navigation, and retrieval. "Discovery" refers to the process by which potential users find materials related to their needs, i.e., finding the Web site(s) that will supply useful information. "Navigation" refers to the process by which users move around or among the Web site(s). "Retrieval" refers to the process of identifying and collecting detailed information within a given Web site. Navigation implies the use of links to move from page to page or document to document, whereas retrieval implies the use of a search engine within the Web site to locate specific documents or passages within documents. These studies should explore ways to improve these processes.

Specific Areas of Research

The participants believe that resolution of the eight issues below is key to significant advances in the development of efficient and effective methods for providing intellectual access to historical documents and will eventually lead to editorial guidelines. These issues contain within their broadly stated questions many specific research questions. None will be resolved easily, but without clarification of these issues, we cannot assure optimum intellectual access to electronically published historical documents and therefore their maximum use.

The participants also believe that the items on the research agenda are broad enough for individuals and institutions to develop research projects involving a wide variety of disciplines and approaches. While the following discussion includes "possible approaches" to the research, it does not mean to dictate that these are either the only possible approaches or even the best. Creative and innovative approaches need documentary editors, and others publishing primary historical documents, to raise compelling questions drawing on and directed to the skills and knowledge of researchers in many related disciplines. Large collaborative institutional efforts and funding arrangements can significantly enhance the research effort. However, well-conceived and developed small projects can also contribute answers to specific questions.

Research Issue 1

Who are the users of electronically published historical documents, what do they need from sites publishing documents, and what do they need from the documents themselves?

Purpose

To assure the design and implementation of effective systems for intellectual access, so that the product meets the needs of both the intended and the actual audience.

Background

Print editions focused on the needs of the scholarly community and were generally available only in research libraries. The Web brings this work to everyone. Electronic editions have been designed based on the print metaphor and without study of users and potential users. While no publication project can address the needs of every potential audience, many may, with relatively little additional effort, improve their publications to meet the needs of a much broader audience than print allows. Once publishers have identified their primary, and perhaps secondary, audience(s), they need to know how to alter their print model for publications to meet Web users' needs.

Possible Approaches

Examine the use of current sites to determine the actual audience and patterns of usage of the publications.
Set up review/chat functions on sites to invite user responses, reviews, etc.
Survey traditional users of printed publications and ask them about their potential use of electronically published historical documents.
Explore the value and effectiveness of alternative dialog interfaces.
Develop user studies examining the research goals and strategies of traditional users of documents, and the ways in which digital, markup, and network technologies will serve these goals and strategies.
Develop user studies to identify potential new users, especially K-16 educators and students.
Develop user studies exploring the economic, intellectual, and technical feasibility of expanding traditional editing and publishing in the electronic environment to serve new users.
Develop user studies to test the effectiveness of the discovery, navigation, retrieval, and presentation features in a prototype system for online access to historical documents.

Result

An understanding of user needs.

Benefits

Web publication projects that meet user needs. Better sites may produce greater user support for the effort to fund the publication of historical documents.

Research Issue 2

How do we assure that users effectively and efficiently navigate and retrieve information from collections within sites? In other words, how do we accommodate user needs through contextual and navigational aids?

Purpose

To understand how users use contextual and navigational tools so publishers can retain the contextual integrity of their collections while they assure that the user obtains maximum benefit from the editorial apparatus.

Background

Among its principal objectives, documentary editing contextualizes documents, which is to say, it orients or situates documents among other historically and intellectually related documents. All editorial work, from document ordering to added apparatus, stems from the editor's intention to present the documents in a way that makes them understandable. Web publishers do likewise. Yet, on the Web, users may come into a collection of documents along a path that does not use the structure. For instance, looking at the results of a Web-search, users may not know that they have arrived in the "middle" of a larger site and that they would benefit greatly from various forms of editorial apparatus that the site may provide. Maintaining context in the electronic environment may require special apparatus or techniques. Web publishers need to know how users make use of navigational and other contextual tools before they can design effective sites.

Possible Approaches

Explore ways to maintain the document and the descriptive and historical context of documents in the networked, electronic environment.
Explore the design of the user interface to enhance navigation.
Explore the infrastructure requirements enabling effective contextualization of document components, documents within collections, and document collections in cross-cultural heritage collections.
Identify essential contextual information required for identification, understanding, and navigational orientation of document components, collection documents, and sub-collections in collections.
Explore use of hierarchical subject analysis apparatus for providing intellectual and historical context and interrelationships.
Explore user models.

Result

Clearer understanding of the contextual needs of Web publications and ways to maintain and navigate through them.

Benefits

Users will have an opportunity to understand electronically published historical documents more fully.

Research Issue 3

What represents the most efficient and effective use of markup in support of intellectual access, presentation, and contextual understanding of electronically published historical documents?

Purpose

To determine an effective level of markup to support intellectual access to text, generate an effective presentation of electronically published historical documents, and provide contextual understanding during the discovery, navigation, and retrieval processes.

Background⁽²⁾

Whether a standalone item or part of a large collection or union database of documents, each document will need some level of markup. Markup serves several purposes.⁽³⁾ It defines the structure of the document (e.g., head, salutation, body, closing, signature); it defines the nature of the content (e.g., person, speaker, supporter, supplied); and it links one document to another or to information outside the document itself (e.g., ixe (index), xptr, xref). Markup can be used to provide intellectual access by identifying the function of a fragment of text, normalizing names, disambiguating words, embedding index terms, linking to external resources, etc.⁽⁴⁾ Publishers need guidelines for the kinds and amount of markup they should use to make their documents intellectually accessible.
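
As a rough illustration of how embedded markup can yield access points, the sketch below parses a small hand-written XML fragment with Python's standard library and collects normalized names and passage-level index terms. The element and attribute names (letter, person, key, index, term) and the sample text are hypothetical; they are not taken from the MEP DTD or any other published standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag and attribute names are illustrative only.
SAMPLE = """
<letter id="doc1">
  <salutation>Dear <person key="marsh_gp">Mr. Marsh</person>,</salutation>
  <body>
    <p id="p1">I saw <person key="marsh_gp">George</person> in Burlington.
       <index term="travel"/></p>
    <p id="p2">The fever has not left me. <index term="illness"/></p>
  </body>
</letter>
"""

def build_access_points(xml_text):
    """Return (names, terms) harvested from the markup.

    names maps a normalized person key to the surface forms used in the text.
    terms maps an embedded index term to the ids of the paragraphs carrying it.
    """
    root = ET.fromstring(xml_text)
    names = {}
    terms = {}
    for person in root.iter("person"):
        names.setdefault(person.get("key"), set()).add(person.text)
    for para in root.iter("p"):
        for entry in para.iter("index"):
            terms.setdefault(entry.get("term"), []).append(para.get("id"))
    return names, terms

names, terms = build_access_points(SAMPLE)
```

Here the two different surface forms of one person collapse to a single key, and the concept "illness" points to a paragraph that never uses that word, which is the kind of access a full-text search alone would miss.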

Possible Approaches

Explore the use of markup for augmenting data and the impact on discovery and retrieval.
Explore use of emerging markup and style sheet technologies to gather and present essential, dispersed contextual information in navigational and hierarchical contexts.
Explore modifications to the Model Editions Partnership (MEP) DTD⁽⁵⁾ and other DTDs used to publish historical documents which might provide better support for discovery, navigation, and retrieval, as well as interoperability with other electronically published documents such as archival inventories and web-based databases.
Explore markup required for effective presentation.
Explore use of markup, perhaps at the passage level, to provide intellectual access comparable to back-of-the-book indexes.

Result

A standard for markup to achieve maximum intellectual access as well as effective presentation of information.

Benefits

Clearer editorial guidelines for markup of electronically published historical documents.
Improved discovery, navigation, retrieval, presentation and use of electronically published historical documents.

Research Issue 4

What is the minimum range of uniform encoding practice and "indexing" required to support effective discovery and navigation of items in a union database (a database holding many collections) or collaborative efforts involving several types of cultural heritage institutions?

Purpose

To determine the degree of standardization and uniform practice which will be required to provide discovery, navigation and retrieval to an enriched Web-based research environment which provides integrated access to the collections of many repositories and repository types.

Background

Traditionally, publishers of documents have treated each printed volume or multi-volume series as a self-contained or stand-alone entity, and thus decisions concerning style, indexing, design, and technical apparatus have been made on a case-by-case basis, without reference to other documents. While the tradition of the documentary editing community has involved following suggested "best practice" approaches, the editors had no reason to develop professional standards for data structure, data content, or data value for back-of-the-book indexes. Lack of such standards will be particularly problematic in an electronic union collection.

Similarly, cultural heritage repositories have developed their exhibits in a self-contained environment and take some pride in the style and creativity each has displayed. They rely on a largely informal network for providing information about related holdings in many institutions.

Thanks to digital/network technology, we may see the first true union access to historical evidence held in all types of cultural heritage institutions, with each holding linked to any others that relate to it. Such access is highly desirable, because it will support more efficient research. Additionally, such access opens these documents and exhibits to potential new users, in particular K-16 educators and students.

In the first decade of Web-based access, cultural heritage repositories have developed "best practice" formats and protocols for creating Web exhibits, which vary by both repository focus and the implementation of individual institutions. While each repository's Web exhibitions or published collection will continue to have idiosyncratic needs, hyperlinked union databases and exhibits will require some standardization and uniform practice, although it is not well understood how much. For instance, while it's convenient to think in terms of discovery through higher-level descriptors at the collection or exhibit level and retrieval as the sort of detailed retrieval provided by searching within a collection or exhibit, in actual practice the distinction becomes blurred when the researcher cannot rely on standard metadata to identify a resource. If a genealogist seeks a particular family name not included in high-level metadata, it is highly unlikely that the genealogist will discover a collection. It is important to begin thinking about ways in which we can improve the discovery of related collections, regardless of repository.

Possible Approaches

Explore these issues in an experimental union database of historical documents.
Explore these issues in an experimental collaboration of diverse cultural heritage institutions, including a wide variety of document repositories and museums.
Develop models for collaboration among various cultural heritage specialists, including "special collections" curators, university archivists, public archivists, museum personnel, etc.
Explore ways to create and use document- or publication-level descriptors to assure "discovery" and/or high-level access.
Develop different types of topical thesauri to support document-level, collection level, and cross-collection level description.
Develop crosswalks between various existing descriptor schemes.
Undertake semantic analysis of the way various cultural heritage specialists fill in and use data fields in existing descriptor schemes.
Develop ways to automate the completion of descriptor scheme data fields to assure uniformity and efficiency.
Develop benchmark tests for discovery, navigation, retrieval, and presentation within a union database of electronically published historical documents.
Develop benchmark tests for discovery, navigation, retrieval, and presentation within a collaboration of diverse cultural heritage repositories.
Explore incorporating collection indexes into a union database of indexes which would serve as a master index to collections.
Explore the level of contextuality required for accurate identification and understanding of components of document sub-collections in cross-cultural heritage collections.

Result

Guidelines for the effective markup and vocabulary enhancement for a union database.
Guidelines for effectively establishing links between diverse cultural heritage collections through shared high-level descriptors.
Efficient and effective technologies for linking related materials across many institutions and institution types.

Benefits

Better integration of scholarly resources in union databases/exhibits, and enhanced discovery, navigation, retrieval, and presentation of resources.
Guidelines developed through this work to support more efficient and effective editing and publishing of documents and development of integrated resources.
Researchers find what they want while identifying where to pursue an interest further.
Institutions get both greater exposure and more clearly relevant researchers using their materials.

Research Issue 5

What are the benefits and challenges of linking electronically published historical documents to external resources?

Purpose

To maximize the use of available resources published on the Web.

Background

The added editorial and scholarly information in well-edited historical documents requires a substantial amount of time and personnel to research, verify, annotate, and edit. Much of that time is spent in identifying people, places, and events for which there may already exist shareable Web-based information. The editing process could be made both more efficient and more accurate if documents or portions of documents could simply link to these resources and either display them as an enhancement to the document itself, or download the information into the document rather than require that the editorial project create its own records that duplicate information available elsewhere.

Possible Approaches

Develop a method for dynamically accessing information in Web-based authority file records, handbook data, encyclopedia articles, etc.
Develop a method for dynamically updating information from Web-based authority file records, handbook data, encyclopedia articles, etc.
Explore ways to determine levels of external linking appropriate to a given audience.
Explore emerging standards for external linking and develop guidelines for their use.

Result

Techniques that would allow electronic publication projects to access and use the information already available in external resources.

Benefits

More efficiently produced Web-based publications.
Increased accuracy and potentially richer additional information delivered with the document.

Research Issue 6

What are the capabilities and limitations of information retrieval technologies for providing intellectual access to online historical editions? Can human intervention during the editorial process be used to overcome some of the limitations?

Purpose

To develop new methods for providing intellectual access to electronically published historical editions that use the best of information retrieval methods combined with the knowledge of documentary editors and indexers.

Background

Historically, all intellectual access to published historical documents came through editorial apparatus such as a table of contents, lists of various sorts, and most powerfully, a back-of-the-book index. The Model Editions Partnership provides intellectual access to online historical editions, using DynaText and DynaWeb. While Boolean text retrieval facilitates access to the electronic edition, it does not provide intellectual access comparable to that of a back-of-the-book index.

Most information retrieval systems are document retrieval systems. They attempt to find those documents that are relevant to a user's request. All approaches to document retrieval create some sort of index to the documents. Boolean search and retrieval uses an inverted index of all content terms in the documents. Vector space methods use a vector of keywords associated with each document to determine relevance to a user query. Probabilistic methods use an estimate of the probability that a document is relevant to a query. The proceedings of Text Retrieval Conferences (TREC) provide a comprehensive view of document retrieval research activities and results.⁽⁶⁾
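
The first two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a description of any system named in this report: the three sample documents are invented, the tokenizer is deliberately naive, and the vector space ranking uses raw term counts with cosine similarity rather than the weighting schemes production systems apply.

```python
import math
import re
from collections import Counter, defaultdict

# Invented sample documents for illustration.
DOCS = {
    "d1": "The fever spread through the town and illness kept the court closed",
    "d2": "The court ruled on property rights and the town boundary",
    "d3": "A letter about the harvest and the weather",
}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return the ids of documents containing every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def cosine(query, text):
    """Cosine similarity between raw term-count vectors of query and text."""
    q, d = Counter(tokenize(query)), Counter(tokenize(text))
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return document ids ordered from most to least similar to the query."""
    scores = {doc_id: cosine(query, text) for doc_id, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

index = build_inverted_index(DOCS)
both = boolean_and(index, "court", "town")
ranking = rank("property rights court", DOCS)
```

Both methods match only the literal terms in the text: a Boolean search for "illness" and "property" returns nothing here, and neither method would connect "fever" with "illness" without additional vocabulary support.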

Passage retrieval approaches to information retrieval, which identify relevant passages within a document, come closer to providing intellectual access to an electronic document than document-retrieval systems. Passage retrieval indicates the sentence(s) or paragraph(s) related to a user request.⁽⁷⁾
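
A minimal sketch of the idea, assuming that passages are paragraphs separated by blank lines and that relevance is simply the number of distinct query terms a paragraph contains. Both assumptions, and the sample letter, are illustrative simplifications rather than a method endorsed by the participants.

```python
import re

# Invented sample document; paragraphs are separated by blank lines.
DOCUMENT = """We arrived in Burlington on Tuesday after a long journey by stage.

My brother has been ill with fever since March and the doctor calls daily.

The harvest was poor and the price of wool has fallen again."""

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def split_passages(text):
    """Split a document into non-empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def best_passages(query, text, top_n=1):
    """Return up to top_n (position, passage) pairs that share query terms.

    Passages are ordered by the number of distinct query terms they contain,
    with ties broken by position in the document.
    """
    query_terms = set(tokenize(query))
    scored = []
    for position, passage in enumerate(split_passages(text)):
        overlap = len(query_terms & set(tokenize(passage)))
        if overlap:
            scored.append((overlap, position, passage))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(position, passage) for _, position, passage in scored[:top_n]]

hits = best_passages("fever doctor", DOCUMENT)
```

The result points the reader to the second paragraph rather than to the letter as a whole, which is the passage-level locator a back-of-the-book index also provides.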

There have been several attempts to integrate natural language analysis into information retrieval. These approaches use knowledge of the morphology of words, the syntactic structure of sentences, and semantic relationships of terms. They create a semantic or conceptual representation of the meaning of the text. A conceptual representation is also created for a user's request and is matched against the conceptual representation of the text. These approaches have had mixed results.⁽⁸⁾ None seems to have achieved the precision of a back-of-the-book index, but the question remains open because no experiments have tested whether they do.

Lexicons and thesauri are key natural language processing tools for determining the semantic relationships of terms and for constructing a conceptual representation of the meaning of sentences, paragraphs, and documents. Lexicons (machine-readable dictionaries) and thesauri are kinds of ontologies, or specifications of concepts.⁽⁹⁾ WordNet is an example of a lexicon that is also an ontology.⁽¹⁰⁾ It represents common knowledge of the semantic relationships of English terms. For the natural language analysis of electronic historical editions, ontologies would be needed for terms and concepts specific to the historical context. For instance, to understand a collection of papers from the period of the American Revolution, knowledge of British provincial and American revolutionary government would need to be included as an ontology.
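
To make the role of a domain thesaurus concrete, the sketch below expands a modern search term into period vocabulary before matching it against a passage. The thesaurus is a small hand-built dictionary invented for this example; it does not use WordNet, and its entries are illustrative assumptions rather than a vetted historical vocabulary.

```python
import re

# Hypothetical domain thesaurus: modern concept -> terms a period writer might use.
THESAURUS = {
    "illness": {"fever", "ague", "consumption", "distemper"},
    "government": {"assembly", "governor", "council"},
}

def expand(query_terms, thesaurus):
    """Return the query terms together with their thesaurus equivalents."""
    expanded = set()
    for term in query_terms:
        expanded.add(term)
        expanded |= thesaurus.get(term, set())
    return expanded

def matches(passage, query_terms, thesaurus):
    """Report whether the passage contains any term from the expanded query."""
    words = set(re.findall(r"[a-z]+", passage.lower()))
    return bool(words & expand(query_terms, thesaurus))

PASSAGE = "The ague returned last night and I could not attend the council."

with_thesaurus = matches(PASSAGE, ["illness"], THESAURUS)
without_thesaurus = matches(PASSAGE, ["illness"], {})
```

With the thesaurus the search for "illness" finds the passage through "ague"; without it the same search fails, which illustrates why vocabulary specific to the historical context matters.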

A back-of-the-book index is a kind of conceptual taxonomy for the book. In other words, it can be viewed as an ontology, or conceptual specification for the subjects of a document.⁽¹¹⁾
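
Viewed this way, an index can be modeled as a small data structure of headings, subheadings, locators, and cross-references. The sketch below is one possible representation, with invented entries and page numbers; the field names (locators, sub, see, see_also) are assumptions made for illustration.

```python
# Hypothetical index entries; headings and page locators are invented.
INDEX = {
    "illness": {
        "locators": ["12", "47"],
        "sub": {"fever": ["47"]},
        "see_also": ["physicians"],
    },
    "sickness": {"see": "illness"},
    "physicians": {"locators": ["48"], "sub": {}, "see_also": []},
}

def lookup(index, heading, _seen=None):
    """Resolve a heading, following "see" references to the preferred term.

    Returns (preferred_heading, entry), or None if the heading is absent.
    Raises ValueError if the "see" references form a loop.
    """
    seen = set() if _seen is None else _seen
    if heading in seen:
        raise ValueError("circular cross-reference at " + heading)
    seen.add(heading)
    entry = index.get(heading)
    if entry is None:
        return None
    if "see" in entry:
        return lookup(index, entry["see"], seen)
    return heading, entry

def all_locators(index, heading):
    """Collect locators for a heading, its subheadings, and its see-also targets."""
    resolved = lookup(index, heading)
    if resolved is None:
        return []
    _, entry = resolved
    found = list(entry["locators"])
    for locators in entry["sub"].values():
        found.extend(locators)
    for related in entry["see_also"]:
        target = lookup(index, related)
        if target is not None:
            found.extend(target[1]["locators"])
    return sorted(set(found), key=int)

preferred, _ = lookup(INDEX, "sickness")
pages = all_locators(INDEX, "sickness")
```

The "see" reference plays the role of a controlled vocabulary (one preferred label per concept), while "see also" records the relationships among concepts that a plain term list lacks.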

Possible Approaches

Compare the work done by an automatic indexing system to that done by the indexer of a single collection of electronically published historical documents.
Explore various information retrieval tools against a database of historical publications.
Digitize (where necessary), integrate, reconcile, and rationalize existing back-of-the-book indexes to develop a base conceptual taxonomy.
Develop ontologies for historical people, places, events, etc. thus creating a suite of ontologies for various domains. More real world knowledge will be represented through the addition of each new collection's vocabulary.
Explore the potential for integrating domain-specific taxonomies into an open set of taxonomies and ontologies for electronically published historical documents.
Explore application of existing computer science/information technology research and software in topical ontologies, taxonomies, and thesauri to historical documents.
Evaluate and compare the relative effectiveness of computer-generated ontologies and human-created back-of-the-book indexes.
Compare the results of information retrieval on a body of annotated and un-annotated documents.
Examine whether human intervention in the creation of conceptual taxonomies improves passage retrieval.
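Several of the comparisons listed above could be scored with the standard information retrieval measures of precision and recall, treating the human indexer's entries for a topic as the reference set. The following is a minimal sketch under that assumption; the document identifiers and the choice of the human index as the reference are hypothetical, and a real study would need a more careful relevance judgment.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = share of retrieved items that are relevant;
    recall = share of relevant items that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: documents a human indexer listed under one heading,
# and documents an automatic system returned for the same concept.
human_index = {"doc03", "doc07", "doc12", "doc19"}
automatic = {"doc03", "doc07", "doc21"}

precision, recall = precision_recall(automatic, human_index)
```

In this invented case the automatic system finds two of the four documents the indexer identified and returns one document the indexer did not list, so both measures fall short of the human index.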

Result

Delineation of the point at which human effort is indispensable for optimal intellectual access within the current technological environment. These tests could be repeated periodically as new technology comes on the scene. In other words, we should routinely update guidelines on the capabilities and limitations of information retrieval and thus guidelines on intellectual access points that must be applied to documents through human intervention.
Discovery of most effective automated processes for intellectual access.
Provision of a "baseline" vocabulary drawn from the documents themselves that can be applied retrospectively to other collections to improve computer-assisted subject analysis and enhance subject-based retrieval of documents.

Benefits

Development of rich conceptual taxonomies and subject ontologies for many levels and domains within the electronic publishing world.
More consistent and more thorough indexing throughout those projects that make use of the growing taxonomies and ontologies.
Increased indexing efficiency that may lead to better use of project funds.
Retrospective enhancing of previously published documents at minimum expense.
Improved conceptualization of scholarly efforts in these projects that can, in turn, lay the groundwork for more efficient and effective work.

Research Issue 7

How do we assure that users find and retrieve information from within sites most efficiently and effectively?

Purpose

To present results from an in-site search quickly, precisely, and effectively while retaining the contextual integrity of a set of hits within either a single database or union database.

Background

A back-of-the-book index shows the researcher the language applied to define the volume(s) in use. Though many bibliographic databases display their controlled vocabularies with their records, the user has this information only after making a search, and may never create the best search statement for lack of understanding of the best approaches for their needs. Yet, library research shows that people recognize the terms they need to follow more quickly and accurately than they can predict them. Within the world of full-text publications, publishers rely entirely on the documents' text to guide the user. Given that electronically published historical documents may receive some terminological augmentation, and given that union databases of such collections will contain many thousands of documents, users may be well served, once at a site, if they have access to some form of vocabulary from which to choose terms before doing a search, and to something other than a list of hits to work with after they have done the search.

Possible Approaches

Develop ways to allow users to view the "indexing vocabulary" by experimenting with various interfaces, e.g., offering the choice of a conceptual query or a view of the index/taxonomy.
Explore the design of the user interface to enhance navigation (graphical and/or topographical representation, etc.) within a set of search results.
Look for better metaphors than the book or the bibliographic database as a way to present data.
Identify essential contextual information required for identification, understanding, and navigational orientation of document components, collection documents, and sub-collections in collections.
Explore use of emerging markup and style sheet technologies to gather and present essential, dispersed contextual information in appropriate navigational and hierarchical contexts.
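Two of the approaches above, letting users see the indexing vocabulary before they search and keeping search results in their collection context, can be illustrated with a short sketch. The vocabulary, posting counts, collection names, and both function names below are hypothetical and stand in for whatever a real site would supply.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def browse_vocabulary(vocabulary: Dict[str, int], prefix: str) -> List[Tuple[str, int]]:
    """Return index terms that start with the prefix, alphabetically, with posting counts."""
    prefix = prefix.lower()
    return sorted(
        (term, count) for term, count in vocabulary.items() if term.lower().startswith(prefix)
    )


def group_hits(hits: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (collection, document) hits by collection, preserving the order of hits."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for collection, document in hits:
        grouped[collection].append(document)
    return dict(grouped)


# Hypothetical indexing vocabulary with the number of documents under each term.
vocabulary = {"property rights": 14, "property tax": 3, "illness": 9}
suggestions = browse_vocabulary(vocabulary, "prop")

# Hypothetical hits from a union database, returned as (collection, document) pairs.
hits = [("Papers A", "letter 1"), ("Papers B", "diary 4"), ("Papers A", "letter 9")]
by_collection = group_hits(hits)
```

Showing the matching terms first lets users recognize a heading rather than guess it, and grouping hits by collection is one simple alternative to a flat list of results.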

Result

More effective use of site-level search engines.
More effective presentation formats for the results of a search, including a variety of approaches to respond to a variety of levels and types of intellectual access within a publication.

Benefits

More efficient and effective use of the electronically published historical documents.

Research Issue 8⁽¹²⁾

What kind of intellectual framework will be appropriate for editors of historical documents published in electronic format to meet users' need to find information as well as the editor's goal of providing suitable context for that information?

Purpose

To assess the impact of the electronic environment on the documentary editing process.

Background

Indexing historical documents published as a collection has always been and remains today an idiosyncratic endeavor. Long before computers, all long-term, multi-volume editorial projects developed procedures for indexing that included thesauri of accepted index terms (some sophisticated, some crude), rules for treatment of problems encountered routinely from volume to volume, patterns of cross-reference, etc. These survive in one form or another.

For twenty years, some editors have employed one form or another of the CINDEX and NLCINDEX computerized indexing tool designed for documentary series. This tool, however, was designed for print editions, and no plans are under way for modifications that might be helpful in electronic editions.

Some (but not all) editors have used word-processing equipment to maintain in-process "annotational indexes" that facilitate the preparation of traditional back-of-book indexes. No one, however, has yet explored whether work in natural language understanding and information retrieval can ease the task of creating such traditional access tools by drawing on the existing ontological resources in multi-volume index entries for the same series, much less how computational linguistics research might be used to provide access to Web-based documentary editions.

Incorporating knowledge-based tools into the editing process has the potential to change it drastically. Editors, and those who fund documentary publishing projects, need to know how these tools will change the process.

Possible Approaches

Explore approaches for integrating the application of information retrieval technology in preparing different aspects of a documentary edition.
Explore the feasibility of establishing a centralized service center for providing technical assistance in electronic publication of historical documents, including intellectual access technologies.
Explore approaches to developing the information technology expertise within the documentary editing community.

Result

Guidelines for cost-effective procedures for publication projects.
Greater understanding of staffing patterns and knowledge needed for the electronic publication environment.
New process models for documentary editing.

Benefits

Increased and improved productivity, uniformity, project compatibility and quality through standardization.

Resources

Resources come in the form of individuals and institutions with whom researchers can collaborate and those from whom they may receive funding. Collaborators can provide support with and without funding that includes, but is not limited to, contacts with funders, political support, test beds of documents, test populations, counsel and advice, expertise in research design and evaluation, trained staff time, proprietary technology, etc. Such collaborators might include:

Repositories of all sorts that collect, organize, and disseminate primary documents and other forms of evidence relating to our cultural heritage, e.g., special collections departments in college and university libraries, presidential libraries, governmental archives, and a wide variety of museums and galleries. All these types of repositories are publishing their holdings on the Web; all want them discovered and used.

Educational institutions and associations such as state departments of education and individual schools and teachers using the Web as an educational tool. These institutions want well designed Web sites to support curriculum requirements. Further, they understand how they and their students use the Web and the resources they find there today, and can provide information on how that use changes over time.

Academic researchers in conceptual taxonomies, topical ontologies, human/computer interfaces, computational linguistics, knowledge representation, information retrieval, and information user studies. These researchers have worked with secondary literature, but most have not worked with primary documents and would welcome the challenge. They can offer several decades of experience with research designs and methods in this area.

Web interface developers and e-commerce companies need to know how people look for and find information, and what makes good interfaces.

Publishers of both Web and print collections have an interest in intellectual access to electronically published text.

Professional associations such as the Society of American Archivists, the American Library Association, American Society of Indexers, Association for Documentary Editing, the National Association of Social Studies Teachers, and many others, include many members with reason to care about these issues.

The research will need funding as well as collaboration. Funders could include:

Federal granting agencies which sponsor research in computer applications, text understanding, digital libraries, educational research, support for or use of cultural heritage repositories and their holdings.

State granting agencies that sponsor the issues identified above at the local level and which may have local issues that relate to these questions.

Private foundations of all sizes and relevant interests.

Private industry such as computer manufacturers, e-commerce companies, browser companies, and commercial publishers of electronic text, who have established research facilities and/or an interest in a better-functioning Web.

Suggested Criteria for Project Evaluation

The participants urged the adoption of the following criteria for the solicitation and evaluation of proposed projects. They assume that these criteria will appear in the public invitations and development instructions for funding of projects. The order of presentation here carries no particular significance as the participants assume reviewers will apply them with flexibility based on the nature of the proposals under consideration and their objectives. Projects should:

Appeal to multiple funding and institutional sources.
Apply sound methods based on established research standards including stated means of evaluation.
Build on prior work.
Consider political and policy implications.
Create usable models, reproducible and generalizable results.
Determine cost, benefits, and other economic impacts.
Encourage cooperation, collaboration, and coordination among repositories and publishers of all kinds.
Enhance intellectual access to electronic documentary resources.
Enhance the effectiveness and/or efficiency of creating documentary editions.
Expand the user community.
Find publication in the professional literature of all relevant disciplines.
Identify mechanisms required for widespread implementation.
Produce recommendations that will benefit documentary publishing or use.

Conclusions

As increasing numbers of primary historical documents appear on the Web, publishers of those documents will need ways to provide intellectual access to the contents so the documents may be discovered and used. Today's search engines still have limitations, but since the advent of the Web many different types of researchers have increased the amount of work going into improving them. Most researchers, however, work with secondary materials which rarely present the severe problems of missing and ambiguous information that primary documents contain. On the other hand, even today's search engines can identify much that's in primary materials. Publishers don't yet know how to sort out what technology can do, what requires human intervention, and how the two can be woven into frameworks for providing intellectual access to electronically published historical documents comparable to or better than that provided by a back-of-the-book index.

At a three-day meeting in Burlington, Vt., experts in documentary editing, experimental electronic publishing, library and information science research, and computer science research identified three general areas of research which can contribute to our understanding of how to construct and improve intellectual access to historical documents on the Web. These include:

user studies to determine the needs and reactions of the audience(s),
implications for change in publication management, and
technological approaches to access to information.

Within these areas, the group identified a series of specific research issues which were painted in broad strokes to allow the flexibility necessary for individuals and institutions to build proposals which serve the needs of the documentary editing community and others engaged in publishing historical documents. These areas of research can provide significant advances in the development of efficient and effective publication of historical documents and may eventually lead to editorial guidelines to ensure full intellectual access for users. They may also lead to more systematic and interoperable standards for intellectual access. Without those standards, content providers and publishers will be wasting much of the time, money, and effort that goes into that sort of publication, while denying potential users access to the very materials they seek.

In retrospect, the potential alliance of documentary editors, archivists, librarians, and specialists in the information sciences seems to offer some of the most immediate benefits, both in improving information retrieval and in improving efficiency in preparing material for publication. As one participant later put it, "the most important practical outcome of our meeting in Burlington was the mutual realization of what [the editors of] multi-volume documentary editions and researchers in the field of computer-based conceptual indexing have to offer each other.... It's an opportunity left 'missed' too long and a collaboration that should begin as soon as possible and continue for decades to come."⁽¹³⁾

1. See list of participants - Appendix I.

2. See: David Chesnutt, excerpt from "Model Editions Partnership: Smart Text and Beyond." Appendix II.

3. J. H. Coombs, A. H. Renear and S. J. DeRose. "Markup systems and the future of scholarly text processing," Communications of the ACM, Vol. 30, No. 11 (Nov 1987), pp. 933-947.

4. C. Welty and N. Ide. "Using the Right Tools: Enhancing Retrieval from Marked-up Documents." J. Computers and the Humanities. 33 (1-2). Spring, 1999. Kluwer.

5. D. R. Chesnutt, S. M. Hockey and C. M. Sperberg-McQueen. Markup Guidelines for Documentary Editions, 4 July 1999. http://adh.sc.edu/MepGuide.html

6. Text REtrieval Conference (TREC). http://trec.nist.gov/

7. G. Salton, J. Allan, and C. Buckley, "Approaches to passage retrieval in full text information systems." SIGIR 1993, pp. 49-58; M. Kaszkiel and J. Zobel, "Passage retrieval revisited." SIGIR 1997, pp. 178-185.

8. D. D. Lewis and K. Sparck Jones. "Natural language processing for information retrieval." Communications of the ACM, 39 (1), pp. 92-101, 1996; T. Strzalkowski, F. Lin and J. Perez-Carballo. "Natural language information retrieval." Proceedings of the Sixth Text Retrieval Conference, 1997; J. Ambroziak and W.A. Woods, "Natural language technology in precision content retrieval," Proceedings of the International Conference on Natural Language Processing and Industrial Applications, August 18-21, 1998.

9. See Chris Welty, What is an Ontology, Appendix III.

10. C. Fellbaum (ed.) WordNet: An Electronic Lexical Database, MIT Press, 1998.

11. C. Welty. "The ontological nature of subject taxonomies" in N. Guarino, ed. Formal Ontology in Information Systems. IOS Press Frontiers in AI Applications Series, Trento, Italy, June 1998.

12. See: Mary-Jo Kline, "Basic Steps to Documentary Editing: before and after Computerization." Appendix IV.

13. Mary-Jo Kline. Private email to Elizabeth Dow, May 6, 2000.