Introduction
Research libraries and other academic institutions collect, amongst other things, authors' archives, including drafts and working versions of published and unpublished works, and correspondence. Nowadays, authors use word processors for writing and email for correspondence. This raises the issue of how such material should be archived for posterity. The Electronic Palimpsest is a project to investigate how this may be best accomplished. This document consists of notes on one aspect of that project: how best to archive emails. It is itself a work-in-progress, and for so long as it is indicated above as being in "Draft" status you are advised to check back periodically for updates.
Retrieving the physical mail files
Email programs (Mail User Agents - MUAs henceforth) often represent messages and message folders in ways which bear no resemblance to their physical location on disk. Some MUAs insist on storing their mail stores in particular locations (e.g. Microsoft Outlook and Microsoft Outlook Express), whilst others allow the location to be user configurable (e.g. Eudora, Pegasus Mail, Netscape Mail, Mozilla Mail). Users cannot be relied upon to have any clear idea where their mail is located, or often even which MUA they in fact use. They may use more than one: they may have used Eudora in the past, but used Netscape Mail for mailto: URLs, and then switched over to Microsoft Outlook whilst using Microsoft Outlook Express for mailto: URLs. Therefore, their total store of email may be distributed over many locations and be in many different formats. Tracking all of it down will be no easy matter. Attempting to establish which MUAs are actually installed and then checking their configuration options to find their mail stores will help in some cases but not in all; a user may very well uninstall an MUA they no longer need whilst still leaving its mail store on their hard disk.
Further complications arise if their mail is not stored locally at all. Many people use web-based MUAs, e.g. Hotmail or Yahoo. IMAP, in which messages are stored remotely on the mail server, is increasingly popular. I'm afraid that I have no concrete suggestions for web-based MUAs - I'm not even really sure how the mail is stored; perhaps it's in an IMAP folder or in a SQL database. Whether Hotmail or Yahoo offer users the ability to download their mail is something I don't know, and would bear further investigation. In the case of IMAP, the user's MUA should (hopefully) offer the capability of copying all the messages from their subscribed folders to a local mail store.
Whilst it is certainly feasible to hire a programmer to write a program which attempted to search through a hard disk and locate all possible mail stores, the effort required and cost incurred may simply not be worthwhile. A far more reliable method is human ingenuity. Whoever is responsible for retrieving the physical mail files from an author's computer should be trained in what to look for, where to search, and the quirks of not only the popular MUAs but also the downright obscure (obscurity is a relative issue: what is an obscure MUA now may have been the most popular application of its day ten years ago). Furthermore, training must cover more than PCs running Microsoft Windows. Authors may very well be using Apple Macintoshes, they may be using some ancient machine that they have had for years, perhaps running Acorn RiscOS, OS/2, or plain old DOS, or they may be using something more exotic like Linux, BSD, or BeOS. The variety of MUAs available, together with their various particular ways of doing things, is platform dependent.
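Whilst a trained human searcher remains the more reliable option, a crude automated first pass is cheap to sketch. The following Python fragment is only an illustration, not a complete tool: it walks a directory tree and flags files which begin with the "From " separator line used by mbox-style stores. It will not recognise proprietary binary stores, and the function names and the demonstration files are hypothetical.

```python
import os
import tempfile

def looks_like_mbox(path):
    """Return True if the file starts with the mbox 'From ' separator."""
    try:
        with open(path, "rb") as handle:
            return handle.read(5) == b"From "
    except OSError:
        # Unreadable files are skipped rather than treated as errors.
        return False

def find_mbox_candidates(root):
    """Walk a directory tree and collect files that look like mbox folders."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if looks_like_mbox(full):
                hits.append(full)
    return sorted(hits)

# Demonstration on a throwaway directory (hypothetical file names).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "Mail"))
with open(os.path.join(demo_root, "Mail", "Inbox"), "wb") as handle:
    handle.write(b"From poet@example.org Mon Jan  1 00:00:00 2001\n"
                 b"Subject: Hi\n\nBody\n")
with open(os.path.join(demo_root, "notes.txt"), "wb") as handle:
    handle.write(b"Not a mail folder\n")

found = find_mbox_candidates(demo_root)
print(found)
```

A match is only a candidate: any text file that happens to begin with "From " will also be flagged, so the results still need checking by hand.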
Conversion from proprietary mail formats
Once the first hurdle of actually tracking down the location of the author's mail and copying it onto removable media has been overcome, the fun really begins. What format is it all in? Microsoft Outlook and Outlook Express use (different!) proprietary binary formats. Pegasus Mail uses a variation on the standard UNIX mbox format. Eudora and Netscape Mail use the mbox format.
Why not simply leave the retrieved mail stores in their original format? Because there is no guarantee that in ten or twenty years' time there will be any means of reading them.
"Much of the data from the original moon landings in the late 1960s and early 1970s is now effectively lost. Even if you can find a tape drive that reads the now obsolete tapes, nobody knows what format the data on the tapes is stored in!" (Elliotte Rusty Harold and W. Scott Means, XML in a Nutshell, p. 6.)
A proprietary file format, and the ability to read it, is hostage to the whims and fortunes of the company or organisation which developed it. Also, attempting to create a unified archive of many authors' emails, possibly each created using many different MUAs, requires that we find some neutral format to store it all in.
Unfortunately, there is no absolute open public standard for mail stores. The three popular UNIX formats - mbox, maildir and mh - are probably the best known and constitute something of a de facto standard. Of these mbox is the most widely used and has the advantage of also being the simplest. Unfortunately, an mbox file represents one folder in a user's total mail store. Whilst I can imagine that some people in fact store all their incoming mail in their inbox, and copies of all their outgoing mail in a sent folder, thus reducing the number of folders to worry about to two, that cannot be assumed to be the norm. Information about the folder hierarchy may be stored in a separate file, as it is in Pegasus Mail, in which all mail folders are stored as separate files in one directory in the filesystem, or the folder hierarchy may be mirrored in the subdirectory structure within the mail directory in the filesystem. (It is important to bear in mind that what is displayed within the MUA as a hierarchy of folders may bear absolutely no resemblance to how the mail is actually physically stored on disk. For example, Microsoft Outlook stores all of its data, including emails, addresses and appointments, in one single file. For the sake of clarity, I will use the terms "folder" and "message" when referring to an MUA's organisation of its data and "directory" and "file" when referring to the file system.)
As the way an author has chosen to organise and file their messages should be regarded as an intrinsic part of the archive along with the messages themselves, some way is needed of storing the messages and the mail folder hierarchy. It may be the case that either the maildir or the mh formats offer some advantage over the mbox format regarding this. That should be investigated further.
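As a rough illustration of the case in which the folder hierarchy is mirrored in subdirectories, the sketch below uses Python's standard mailbox module to read one mbox file per folder and record the hierarchy alongside the messages. The layout (one mbox file per folder, one subdirectory per folder with subfolders) is an assumption, as are the function name and the demonstration data; a store which keeps its hierarchy in a separate index file would need different handling.

```python
import mailbox
import os
import tempfile

def load_folder_tree(root):
    """Map a directory of mbox files onto a nested folder structure.

    Assumes each regular file is an mbox folder and each subdirectory
    holds the subfolders of the folder named after it.
    """
    tree = {"messages": [], "subfolders": {}}
    for entry in sorted(os.listdir(root)):
        full = os.path.join(root, entry)
        if os.path.isdir(full):
            tree["subfolders"][entry] = load_folder_tree(full)
        else:
            box = mailbox.mbox(full, create=False)
            try:
                # Only the subject is kept here, to keep the output short.
                subjects = [msg["Subject"] for msg in box]
            finally:
                box.close()
            tree["subfolders"][entry] = {"messages": subjects,
                                         "subfolders": {}}
    return tree

# Demonstration: a "Work" folder containing a "Publishers" subfolder.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "Work"))
with open(os.path.join(demo_root, "Work", "Publishers"), "w",
          newline="\n") as handle:
    handle.write(
        "From poet@example.org Mon Jan  1 00:00:00 2001\n"
        "From: I. M. A. Poet <poet@example.org>\n"
        "Subject: New drafts\n"
        "\n"
        "Here are the new drafts.\n"
    )

tree = load_folder_tree(demo_root)
print(tree)
```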
Nevertheless, what is needed is a utility, or a suite of utilities, which will convert from all known proprietary mail store formats into a non-proprietary format in which both the messages and the folder hierarchy are preserved. Unfortunately, no such utility or suite of utilities exists. I cannot see any way around this issue other than to hire a programmer to write the software for you. This is not as unattractive an option as may at first appear. If the software were open source and released under the GNU General Public License (GPL), the developer could benefit from all that the open source community has to offer. Existing import filters from open source MUAs such as Evolution, Kmail, Mozilla Mail, and others, could be used and drawn upon, and those projects in turn could contribute to and draw upon features of the software under development. Thus, not only would a developer hired by your institution be able to draw upon existing code and the expertise of the open source community, they could also expect to have others help directly with the project, solving problems, debugging and adding features. In turn the finished product would be available to everyone for free, greatly enhancing the reputation of your institution in making outstanding contributions to problems in information retrieval, processing and archiving. Given the usefulness of such a product it seems highly likely that external funding could be secured to finance it.
Archiving, cataloguing and annotating emails
Having got an author's mail archive, complete with folder hierarchy, what do we then do with it? We want to be able to add cataloguing information to the archives, annotate them for scholars, and make their contents available to others. The ideal solution is to use XML. XML is an ideal format for the kind of data involved because it is relatively free-form, easily redeployable, and provides the greatest level of integration with existing archival and cataloguing data formats. If the entire mail store is in XML then it is trivial to add RDF and Dublin Core metadata to the entire archive, to folders within the archive, and to messages within the folders, and then export that metadata as RSS to harvesting clients; it is also trivial to add cataloguing information in MARC or EAD format; and it is trivial to annotate the messages using TEI.
Once the mail archive is in XML format it can be transformed with XSLT to provide a variety of resources. Furthermore, it must be remembered that a single XML document such as our author's mail archive need not be a single physical file. For speed of access the XML document could be serialised into a SQL database.
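To make the database option a little more concrete, here is a minimal sketch using Python's standard sqlite3 module. The two-table schema, the column names and the sample identifiers are all assumptions made for illustration; the point is only that a parent reference on each folder is enough to preserve the folder hierarchy outside of a single physical file.

```python
import sqlite3

# Hypothetical schema: each folder points at its parent (NULL at top level).
SCHEMA = """
CREATE TABLE folder (
    id        TEXT PRIMARY KEY,
    title     TEXT NOT NULL,
    parent_id TEXT REFERENCES folder(id)
);
CREATE TABLE message (
    id        TEXT PRIMARY KEY,
    folder_id TEXT NOT NULL REFERENCES folder(id),
    raw       TEXT NOT NULL
);
"""

def open_store(path=":memory:"):
    """Create an empty store; the default keeps everything in memory."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def folder_path(conn, folder_id):
    """Return folder titles from the top level down to folder_id."""
    titles = []
    while folder_id is not None:
        title, folder_id = conn.execute(
            "SELECT title, parent_id FROM folder WHERE id = ?",
            (folder_id,),
        ).fetchone()
        titles.append(title)
    return list(reversed(titles))

conn = open_store()
conn.executemany("INSERT INTO folder VALUES (?, ?, ?)", [
    ("folder-2", "Work", None),
    ("folder-2-1", "Publishers", "folder-2"),
])
conn.execute(
    "INSERT INTO message VALUES (?, ?, ?)",
    ("msg-2-1-1", "folder-2-1", "Subject: New drafts\n\nHere are the drafts."),
)

print(folder_path(conn, "folder-2-1"))
```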
Storage: Creating an email archive format in XML
Imagine the simplest possible scenario: we have a MailStore DTD, in which an entire mail store is a single XML document. Our MailStore DTD defines three elements: <folder>, <subfolders> and <message> in addition to a root <mailstore> element:
- <mailstore> must contain one or more <folder> elements, and nothing else.
- <folder> must contain either exactly one <subfolders> element followed by any number (including none) of <message> elements, or at least one <message> element. In addition, <folder> is defined as having a mandatory attribute: title.
- <subfolders> must contain at least one <folder> element.
- <message> contains individual messages, complete with headers and MIME attachments.
In order to avoid anything in the messages themselves rendering the document ill-formed, such as a literal < or & character (both are common in email bodies), we can for the time being stipulate that the contents of each <message> element are wrapped between <![CDATA[ and ]]> to prevent them being parsed. (Naturally, this would be a very unsatisfactory way of encoding the mail messages, but we are currently dealing with the simplest possible case. We will add features as we progress.)
We should also insist that each <folder> and <message> element has a required id attribute so that we have some unique way of referring to them. Thus, so far, our toy MailStore.dtd looks like this:
<!-- A simple Mail Store DTD -->
<!ELEMENT mailstore (folder+) >
<!ELEMENT folder ( (subfolders, message*) | (message+) ) >
<!ATTLIST folder
title CDATA #REQUIRED
id ID #REQUIRED >
<!ELEMENT subfolders (folder+) >
<!ELEMENT message (#PCDATA) >
<!ATTLIST message
id ID #REQUIRED >
As an example then, imagine we have as an instance of this document type Joe Bloggs's mail store, Joe_Bloggs.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE mailstore SYSTEM "MailStore.dtd">
<mailstore>
<folder id="folder-1" title="Inbox">
<message id="msg-1-1">
<![CDATA[
Email, unindented, complete with headers and all, here.
]]>
</message>
</folder>
<folder id="folder-2" title="Work">
<subfolders>
<folder id="folder-2-1" title="Publishers">
<message id="msg-2-1-1">
<![CDATA[
Email, unindented, complete with headers and all, here.
]]>
</message>
</folder>
</subfolders>
</folder>
<folder id="folder-3" title="Sent">
<message id="msg-3-1">
<![CDATA[
Email, unindented, complete with headers and all, here.
]]>
</message>
</folder>
</mailstore>
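A document of this shape can be generated rather than typed. The sketch below uses Python's standard xml.etree.ElementTree module to build a small mail store following the content model above. Note one deliberate difference: ElementTree escapes markup characters in text rather than writing CDATA sections, but a parser reading the result back sees the same character data either way. The helper name add_folder and the id-numbering scheme are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def add_folder(parent, folder_id, title, messages=(), subfolders=()):
    """Append a <folder> to parent: <subfolders> first, then <message>s.

    The DTD requires a folder to hold either subfolders or at least one
    message; this helper does not enforce that.
    """
    folder = ET.SubElement(parent, "folder",
                           {"id": folder_id, "title": title})
    if subfolders:
        holder = ET.SubElement(folder, "subfolders")
        for i, (sub_title, sub_messages) in enumerate(subfolders, start=1):
            add_folder(holder, f"{folder_id}-{i}", sub_title, sub_messages)
    for i, raw in enumerate(messages, start=1):
        # "folder-2-1" plus message 1 becomes "msg-2-1-1".
        msg_id = folder_id.replace("folder", "msg", 1) + f"-{i}"
        ET.SubElement(folder, "message", {"id": msg_id}).text = raw
    return folder

inbox_mail = "From: a@example.org\n\nQuoted:\n> hello"
store = ET.Element("mailstore")
add_folder(store, "folder-1", "Inbox", [inbox_mail])
add_folder(store, "folder-2", "Work",
           subfolders=[("Publishers", ["Subject: New drafts\n\nText & more"])])

xml_text = ET.tostring(store, encoding="unicode")
print(xml_text)
```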
This seemingly trivial DTD is in fact adequate to the task of representing the folder hierarchy and the individual messages within it. Of course that alone is not adequate for our needs. But everything else can be achieved by using pre-existing XML applications, together with XML Namespaces. Firstly, we need to remove the <![CDATA[…]]> sections, and encode the messages themselves as XML.
XMTP - XML MIME Transformation Protocol, described here and here - is just such an encoding. It provides a mapping of SMTP headers and multipart MIME messages into XML. The root element of each message is <Message>, and the unique MIME Message-ID is preserved as the content of a <Message-ID> element. This means that the above Mail Store DTD can be further simplified by dropping the <message> element in favour of <xmtp:Message xmlns:xmtp="http://www.openhealth.org/xmtp#">. There are several examples of mail messages encoded in XMTP at the URLs above.
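To give a feel for what such a mapping involves, the following Python sketch turns a simple, single-part message into an XML element, with one child element per header. It is written loosely in the spirit of XMTP and is not an implementation of it: only the namespace URI and the Message and Message-ID names come from the text above, whilst the Body element, the one-element-per-header rule and the function name are assumptions, and multipart messages are not handled at all.

```python
import email
import xml.etree.ElementTree as ET

XMTP_NS = "http://www.openhealth.org/xmtp#"
ET.register_namespace("xmtp", XMTP_NS)

def message_to_xml(raw):
    """Map headers (and a plain body) onto child elements of <Message>.

    Assumes every header name is also a legal XML element name, which
    holds for common headers such as From, To and Message-ID.
    """
    msg = email.message_from_string(raw)
    root = ET.Element(f"{{{XMTP_NS}}}Message")
    for name, value in msg.items():
        ET.SubElement(root, f"{{{XMTP_NS}}}{name}").text = value
    if not msg.is_multipart():
        # "Body" is a hypothetical element name chosen for this sketch.
        ET.SubElement(root, f"{{{XMTP_NS}}}Body").text = msg.get_payload()
    return root

raw = (
    "Message-ID: <1234@example.org>\n"
    "From: I. M. A. Poet <poet@example.org>\n"
    "Subject: New drafts\n"
    "\n"
    "Here are the new drafts.\n"
)
element = message_to_xml(raw)
print(ET.tostring(element, encoding="unicode"))
```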
Also through the use of XML namespaces RDF and Dublin Core metadata can be added to the entire mail store, and, through the use of XPath and XPointer to identify sub-parts of the mail store, to individual folders and messages. Inclusion of the fifteen elements of the Dublin Core Metadata Element Set as RDF is trivial:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description about="">
<dc:title></dc:title>
<dc:creator></dc:creator>
<dc:subject></dc:subject>
<dc:description></dc:description>
<dc:publisher></dc:publisher>
<dc:contributor></dc:contributor>
<dc:date></dc:date>
<dc:type></dc:type>
<dc:format></dc:format>
<dc:identifier></dc:identifier>
<dc:source></dc:source>
<dc:language></dc:language>
<dc:relation></dc:relation>
<dc:coverage></dc:coverage>
<dc:rights></dc:rights>
</rdf:Description>
</rdf:RDF>
The about attribute of the <rdf:Description> element has as its value a URL pointing to the subject of the metadata, be it the entire mailstore or parts of it such as folders or messages.
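Records like this are easily produced by program. The sketch below builds one with Python's standard xml.etree.ElementTree module, using the namespace URI published for the Dublin Core 1.1 element set. The function name describe and the fragment-style subject URL (Joe_Bloggs.xml#folder-2) are assumptions; how best to address a folder or message within the mail store is left open here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

def describe(about, **fields):
    """Build an rdf:RDF record holding Dublin Core fields for one subject."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description",
                         {f"{{{RDF_NS}}}about": about})
    for name, value in fields.items():
        # Each keyword becomes a dc: element, e.g. title -> <dc:title>.
        ET.SubElement(desc, f"{{{DC_NS}}}{name}").text = value
    return rdf

record = describe("Joe_Bloggs.xml#folder-2",
                  title="Work", creator="Joe Bloggs", language="en")
print(ET.tostring(record, encoding="unicode"))
```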
The problem of attachments
Needless to say, the fun doesn't stop there. Imagine this is one of the messages in our author's email archive:
To: B. I. G. Publisher
From: I. M. A. Poet
Subject: New drafts

Here are the new drafts of the poems I've been working on.
The poems themselves are however in an attached Microsoft Word document. We may have all the mail messages in some standard XML format, but what about the attachments? In particular, many word processors have a "Mail this to ..." command on their file menu. How many emails may turn out to be in fact letters written in a word processor, and so saved in some proprietary format, which have been sent as a MIME attachment in an otherwise blank email message?
It is crucial that, having once got the mail archives into an XML document, we traverse the document tree and examine all binary attachments (XPath is our friend here, and another justification for choosing XML). Some file formats may constitute public standards which we can rely upon to be around for some time to come: JPEG and PNG for images, MPEG for movies, MP3 for audio, PostScript and PDF for documents. But other file formats will need to be converted. In the case of office documents such as spreadsheets and word processor files, we can hopefully import them into another office suite which uses XML as its native file format such as Open Office or Microsoft Office 2000, save them as XML and then reimport them back into the XML mail archive. With documents where visual appearance is important, distilling them into PDF may be a more appropriate route. Whilst PDF is a proprietary format in the sense that it is developed by Adobe (basically it is PostScript with the programming constructs removed), it is open in the sense that it is well documented, as it was developed specifically as a format for document exchange, and there exist open source PDF readers (such as Ghostscript). With other kinds of documents more creative techniques will have to be devised.
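The same audit can be prototyped on the original messages before any XML conversion has taken place. The Python sketch below walks the MIME parts of a message with the standard email package and reports attachments whose content type is not on a list of formats regarded as durable. The list simply mirrors the formats named above and is a policy choice, not a recommendation; the function name and the sample attachments are hypothetical.

```python
from email.message import EmailMessage

# Content types for the formats named above as likely to stay readable.
DURABLE_TYPES = {
    "image/jpeg", "image/png", "video/mpeg", "audio/mpeg",
    "application/postscript", "application/pdf",
}

def attachments_needing_conversion(msg):
    """List (filename, content type) for attachments not in DURABLE_TYPES."""
    flagged = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold parts; they are not attachments
        filename = part.get_filename()
        if filename and part.get_content_type() not in DURABLE_TYPES:
            flagged.append((filename, part.get_content_type()))
    return flagged

# A message with a word processor file and a PDF attached.
msg = EmailMessage()
msg["Subject"] = "New drafts"
msg.set_content("Here are the new drafts of the poems I've been working on.")
msg.add_attachment(b"\xd0\xcf\x11\xe0", maintype="application",
                   subtype="msword", filename="poems.doc")
msg.add_attachment(b"%PDF-1.4", maintype="application",
                   subtype="pdf", filename="proofs.pdf")

flagged = attachments_needing_conversion(msg)
print(flagged)
```

A declared content type is only a claim made by the sending MUA, so anything flagged (or not flagged) in this way would still merit a closer look.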
Cataloguing: Integration with MALVINE and LEAF
So far, we have only covered the storage of the mail archives, together with metadata about those archives. What about the addition of cataloguing information?
Integration with MALVINE and its follow-up project LEAF is clearly a desideratum. MALVINE uses EAD as an interchange format for sharing catalogue and archive information amongst institutions. EAD is actually an SGML application, although because it uses none of SGML's features which are not available in XML it is also trivially an XML application. MALVINE resulted in the development of a suite of Perl scripts which can translate a number of different catalogue formats from and to EAD. This then makes our task really quite easy. As long as cataloguing information in or associated with the mail archives is either itself in EAD or in one of the popular formats MALVINE can import and export then integration should be no problem.
Integration with LEAF is harder to assess, as LEAF is still in its early stages, and few concrete details have emerged. Nevertheless, there is a close point of contact. LEAF is concerned with pooling the various name authority files that institutions such as libraries maintain. Clearly, one of the most important features of an email archive is the clear identification of the senders and recipients of those emails, especially when one considers that a single individual may have many different email addresses over the years, and that many email messages might well contain only an email address in the "From:" or "To:" header, and not a full name. Once LEAF has reached some level of maturity, it will be salient to investigate not only how best to associate names in the email archive with the "pooled" name authority files, but also whether there might be some way of automating the identification of email senders and recipients by querying those "pooled" resources.
We can only speculate about how such integration with LEAF might be possible, but if we use RDF to encode our cataloguing information then we can define certain RDF resources which are in fact pointers to LEAF, and then use those RDF resources directly in our cataloguing information like so (in the following example, we are using an imaginary cataloguing format which conveniently defines elements for cataloguing email messages, and an imaginary way of accessing the LEAF authority files):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:mcs="http://my.cataloguing.system"
xmlns:leaf="http://www.leaf.org">
<rdf:Description about="John_Smith">
<rdf:type resource="http://www.leaf.org/person"/>
<leaf:person>http://www.leaf.org/person?firstname=John&amp;lastname=Smith&amp;email=john.smith@company.com</leaf:person>
</rdf:Description>
<rdf:Description about="Joe_Bloggs">
<rdf:type resource="http://www.leaf.org/person"/>
<leaf:person>http://www.leaf.org/person?firstname=Joe&amp;lastname=Bloggs&amp;email=joe.bloggs@organisation.org</leaf:person>
</rdf:Description>
<rdf:Description about="url-of-mail-store/x-pointer-to-message">
<rdf:type resource="http://my.cataloguing.system/email"/>
<mcs:from rdf:resource="John_Smith"/>
<mcs:to rdf:resource="Joe_Bloggs"/>
</rdf:Description>
</rdf:RDF>
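Before any authority file can be queried, the names and addresses actually present in the archive have to be gathered. The Python sketch below uses the standard email.utils.getaddresses function to group the display names seen for each address across a set of "From:" and "To:" header values. Folding addresses to lower case is a heuristic rather than a rule, and the function name and sample headers are hypothetical; nothing here touches LEAF itself.

```python
from email.utils import getaddresses

def names_by_address(header_values):
    """Group the display names seen for each email address.

    Addresses are lower-cased so that variants in capitalisation are
    merged; an address seen only without a name maps to an empty set.
    """
    seen = {}
    for display_name, address in getaddresses(header_values):
        if not address:
            continue  # skip anything that did not parse as an address
        names = seen.setdefault(address.lower(), set())
        if display_name:
            names.add(display_name)
    return seen

headers = [
    "John Smith <john.smith@company.com>",
    "john.smith@company.com",
    "J. Smith <John.Smith@company.com>",
    "joe.bloggs@organisation.org",
]
index = names_by_address(headers)
print(index)
```

Addresses which end up with an empty set of names are exactly the cases where a lookup against pooled name authority files would be most useful.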
Annotating the archives
Finally, it is clear that archivists and curators may actually wish to annotate the messages themselves. TEI (available in an XML "Lite" [sic] version) is a markup language for humanities scholars. This would seem the most likely candidate for adding such annotations. Again, the beauty and simplicity of XML namespaces allows us to seamlessly add such markup to the archive.
Authoring tools
One problem remains. What tools will curators, archivists and librarians use to add RDF, Dublin Core, and EAD markup to the mail archives? That is not an easy question to answer. Whilst I tend to the view that there is no substitute for a good text editor (such as Emacs with psgml mode), it is unlikely that library staff would be comfortable using such tools. One possibility is to write in-house tools specifically for the job, and the other is to invest in commercial XML editing software. Commercial XML editing packages can however be prohibitively expensive. A further option would be to use a commercial database application such as Microsoft Access if the mail archives are stored in a database (an XML document need not correspond to a single file; there is no reason why it can't be broken into parts and stored in a database). Which would be the most cost-effective? Making everyone use Emacs - it's free! However, this would not be resource effective, as staff would almost certainly be less productive. The other options would need to be investigated further.
Summary
To conclude:
- Retrieving the mail stores from authors' machines, or from web or mail servers, is a non-trivial business, and is an unlikely candidate for automation. It would probably require some amount of training for the person whose responsibility it is.
- Converting the mail stores from proprietary, closed file formats to public open ones will probably require bespoke software development, although the open source community could be relied upon to provide some help with this, and external funding should be attainable.
- Having the archives in an open public mail format is only an intermediate step. From there, the archives should be converted into XML (a relatively trivial issue), so that they can be annotated and shared in the widest number of ways.
Some useful resources
- What is Extensible Markup Language (XML)?
- What is Extensible Style Language Transformations (XSLT)?
- What is Resource Description Framework (RDF)?
- An Introduction to Dublin Core
- Dublin Core in the Wild - contains information on Dublin Core and RDF/Rich Site Summary (RSS).
