Data Collection Methods

The LISTSERV software used to distribute BEE-L maintains the list's archives automatically. Archiving by the host is common to other list management software, and provides a repository of previous messages. These archives can provide a wide range of information in addition to the actual content of the messages. Because they are designed primarily for searching by subject and specific text, however, the stored archives are not always in a form that allows immediate analysis of overall list trends.

In the BEE-L list's first few years, the archives were written as text files, each containing one month's postings to the list. In January 1996 the archiving interval was changed to weekly.

The archives for the first twelve years of this list, then, consist of a total of 398 text files, containing a total of 35,852 messages (see Table 3). The archive files range in size from 1.4 kB (July 1989, the first, partial month of the list’s existence, with only two messages) to 1.15 MB (September 1997, including one very large binary file attachment). Cumulatively, the archive consists of 66.4 MB of data.

As each archive file contained a number of individual messages, message delimiters and header delimiters were used to parse the files into individual messages. In the archive, a line of 73 “equals” signs precedes each new message:



=========================================================================

Following this are a number of message “header lines”, in no particular order, containing information about the message: the address provided by the email programme of the person who sent it, the date and time it was written, the subject as provided by the writer and a number of other such meta-content lines. The last of these header lines is followed by one blank line, indicating that the body of the message is to begin.

One example of a series of header lines for a particular message would be:



Date:         Sat, 8 Aug 1992 19:20:00 +1200
Reply-To:     Discussion of Bee Biology <BEE-L@ALBNYVM1.BITNET>
Sender:       Discussion of Bee Biology <BEE-L@ALBNYVM1.BITNET>
From:         NICKW@WAIKATO.AC.NZ
Subject:      AFB sterilisation

[text/body of the message]
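The delimiter and header conventions described above could be handled along the following lines. This is a minimal sketch in Python rather than the grep and awk tools actually used for this work; the function names split_archive and split_message are hypothetical, and the sketch assumes that every header fits on a single line and that no message body contains a line of exactly 73 equals signs.

```python
DELIMITER = "=" * 73  # the line of 73 equals signs that precedes each message


def split_message(lines):
    """Separate one message into a header dictionary and a list of body lines."""
    # Ignore any blank lines between the delimiter and the first header.
    while lines and lines[0].strip() == "":
        lines = lines[1:]
    # The first blank line marks the end of the headers.
    for i, line in enumerate(lines):
        if line.strip() == "":
            header_lines, body_lines = lines[:i], lines[i + 1:]
            break
    else:
        header_lines, body_lines = lines, []
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep:  # skip anything that is not of the form "Name: value"
            headers[name.strip()] = value.strip()
    return headers, body_lines


def split_archive(text):
    """Split the text of one archive file into (headers, body_lines) pairs."""
    messages = []
    current = None
    for line in text.splitlines():
        if line.rstrip() == DELIMITER:
            if current is not None:
                messages.append(current)
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append(current)
    return [split_message(lines) for lines in messages]
```

Each returned pair holds the header fields by name (for example headers["Subject"]) and the body as a list of lines, which is the form assumed for any later counting.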

Some header lines, such as the Sender: and Reply-To: lines, are identical in every message in the archive, and thus convey no information of value for this analysis.

Other lines, such as the From: line, contain, at a minimum, the email address from which the original message was sent to the list server. For most messages, this email address is accompanied by some form of name for the person whose email address it is. While this information might take a number of valid forms, it can generally be parsed textually to identify both the name and the email address of origin. Examples of acceptable formatting of the address in the From: header line include:



nickw@beekeeping.co.nz
"Nick Wallingford" <nickw@beekeeping.co.nz>
<nickw@beekeeping.co.nz>
Nick Wallingford <nickw@beekeeping.co.nz>
nickw@beekeeping.co.nz (Nick Wallingford)
<nickw@beekeeping.co.nz> (Nick Wallingford)

as well as a number of other such formatting combinations.

Resnick (2001) describes in detail the standards and conventions related to addressing of emails on the Internet.
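As an illustration of how the From: formats listed above might be reduced to a name and an address, the following Python sketch uses a single regular expression. It is an assumption-laden simplification, not the procedure used for this dissertation and not a full implementation of the addressing standard: the name parse_from and the pattern ADDR_RE are hypothetical, only one address per line is expected, and addresses are lower-cased so that variants in capitalisation are treated as the same address.

```python
import re

# Simplified pattern for an email address; the full standard allows far more.
ADDR_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+")


def parse_from(value):
    """Return (name, address) for the value of a From: header line."""
    value = value.strip()
    match = ADDR_RE.search(value)
    if not match:
        return ("", "")
    address = match.group(0).lower()
    # Whatever remains once the address is removed is treated as the name.
    rest = value[:match.start()] + value[match.end():]
    name = rest.strip(' <>()"')
    return (name, address)
```

Python's standard library also provides email.utils.parseaddr for this purpose, which may be preferable where stricter handling of the standard is wanted.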

Perhaps even more problematic for this dissertation has been the use of multiple email addresses for the same individual, either over time or even during the same range of time. The identification of the “name” of the sender of any given set of messages was determined by examining the comment/name field, the email address and in some cases the signature within the body of the message, if present.

As well, the domains of the email addresses (the material to the right of the @ symbol in the address) were further analysed for the information they contained. For instance, the country of origin, and even the nature of the domain (educational, government, non-profit organisation) can be obtained by analysis of the domain name. Mockapetris (1987a and 1987b) detailed the format of the hierarchical information contained in the domain name components, as well as describing how “resolvers” are used to ultimately route email messages to their destinations.
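The kind of information that can be read from a domain name might be sketched as follows. The lookup tables here are small illustrative samples and are assumptions for the purposes of the example, not the classification actually used in the analysis; the function name describe_domain is likewise hypothetical. Addresses that do not follow the domain name hierarchy, such as the BITNET host names seen in early messages, simply yield no country or type.

```python
# Illustrative samples only; a real analysis would need fuller tables.
GENERIC_TYPES = {
    "edu": "educational",
    "gov": "government",
    "org": "non-profit organisation",
    "com": "commercial",
}
COUNTRIES = {"nz": "New Zealand", "uk": "United Kingdom", "au": "Australia"}
# Second-level labels used beneath some country codes, e.g. ac.nz or co.uk.
SECOND_LEVEL_TYPES = {
    "ac": "educational",
    "co": "commercial",
    "govt": "government",
    "gov": "government",
    "org": "non-profit organisation",
}


def describe_domain(address):
    """Return the domain, country and nature of the domain for an address."""
    domain = address.rpartition("@")[2].lower()
    labels = domain.split(".")
    top = labels[-1]
    if top in GENERIC_TYPES:
        return {"domain": domain, "country": None, "type": GENERIC_TYPES[top]}
    country = COUNTRIES.get(top)
    second = labels[-2] if len(labels) >= 2 else ""
    # Only interpret the second-level label when the country code is known.
    kind = SECOND_LEVEL_TYPES.get(second) if country else None
    return {"domain": domain, "country": country, "type": kind}
```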

After each message was parsed into an individual text file, the messages were analysed to extract information from the headers, as well as information about the size of the headers and the size of the messages.

Most textual parsing and extraction was done using tools such as grep, awk and internal utilities such as head and tail within the framework of the Linux operating system. As information was drawn from the files, it was appended to a Microsoft Excel® spreadsheet, where the data analysis was performed, primarily through the use of the pivot table functionality. When data was appropriately summarised, charts were produced from within the spreadsheet.

Specific header lines from the text files that were used included:



From: (for the name, email address and domain of the sender)
Subject: (for the nature of the particular message)
Date: (for the date, time and weekday of the posting)

The numbers of individual header lines in each message were recorded, as well as the number of lines in the body of the message.
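The date, time and weekday of a posting, together with the line counts, could be derived as in the sketch below. It relies on parsedate_to_datetime from Python's standard email.utils module; the function names date_fields and size_fields are hypothetical, and the sketch assumes a well-formed Date: line (a malformed date raises an exception rather than being repaired).

```python
from email.utils import parsedate_to_datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def date_fields(date_header):
    """Return the date, time and weekday, as written in the sender's time zone."""
    dt = parsedate_to_datetime(date_header)
    return {
        "date": dt.date().isoformat(),
        "time": dt.strftime("%H:%M"),
        "weekday": WEEKDAYS[dt.weekday()],
    }


def size_fields(header_lines, body_lines):
    """Return the number of header lines and body lines in one message."""
    return {"header_lines": len(header_lines), "body_lines": len(body_lines)}
```

For the example header shown earlier, date_fields("Sat, 8 Aug 1992 19:20:00 +1200") gives the date 1992-08-08, the time 19:20 and the weekday Saturday.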

The only investigation into the body of the message was to measure its size and to identify the lines that could be considered “quotes” of other messages. As many email client programmes use the convention of prepending a > character to indicate a line copied from another message, these lines were counted as quotes. It is acknowledged that, for a variety of reasons discussed later, this would tend to understate the overall number of quoted lines. An example of quoted material would be:



> I haven't seen anyone willing to answer my question. Would
> someone please try?

I'll try to answer the question, but need more information first.

In this instance, it is seen that a previous writer had written the first two lines, and only the final line was written as original material in the posting.
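Counting quoted lines by this convention can be sketched in a few lines of Python. The function name count_quoted is hypothetical, and the sketch makes two assumptions: only lines beginning with the > character are treated as quotes, and blank lines are counted as neither quoted nor original.

```python
def count_quoted(body_lines):
    """Return (quoted, original) counts of the non-blank lines in a body."""
    quoted = sum(1 for line in body_lines if line.startswith(">"))
    non_blank = sum(1 for line in body_lines if line.strip())
    return quoted, non_blank - quoted
```

Applied to the example above, this gives two quoted lines and one original line. Quoting styles that do not use the > character are missed entirely, which is one reason the count tends to understate the amount of quoted material.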

This same convention carries through from the email client into the email distribution lists, with a message conveying a sense of context through the use of these areas indicated as being quotes. The term “threaded conversation” is sometimes used to describe the effect, though strictly the term should probably be restricted to the nature of discussions on USENET and discussion boards. In those cases, replies and replies-to-replies are presented to the user in a manner that indicates both the chronology and the directed nature of the responses, rather than referring to the specific portions of text within the messages.