Does converting "born digital" documents affect their evidentiary value?

Dear Editor,

As you are no doubt aware, more and more "original" documents (especially email) have never been anything but digital. While it is often recommended or required that such "born digital" documents be stored as received, they can become unreadable over time due to software changes, or may need to be converted into another format for archiving. Personally, I would like to extract my emails from the email program's database and store them in a way that allows them to be consulted in the longer term.

Converting "born digital" documents to a more "timeless" format, such as Tagged Image File Format (TIFF), plain text (TXT) or hardcopy, could be argued to affect their evidentiary value. I do know that some government agencies are particularly concerned about document conversion. Does a direct conversion to a recognized archival format actually affect the documents' evidentiary value (or classification) for genealogical work?
 

Submitted by EE on Tue, 03/12/2019 - 10:27

That's an interesting question, History-Hunter. I, too, convert emails of long-term value to a more permanent format. Post-conversion, I do scan the conversion to ensure that no visible content is lost or altered. However, because I am not an IT expert, there are undoubtedly problems I don't perceive.

Let's see if we can generate some advice from colleagues with IT expertise.

Submitted by yhoitink on Wed, 03/13/2019 - 02:39

When born digital files are converted to a different file format, we are dealing with a derivative source. Just like in the paper world, the way the derivative source is created can affect the reliability. With a paper record, a full-color scan would technically be a derivative, but the quality is such that no information is lost so we can treat the scan as an original source. A photocopy of a photocopy of a photocopy is much worse because of compound quality loss. A transcription has a chance of interpretation errors being introduced. They are all derivative sources, but the numbers and types of errors we can expect are different. 

Just like in the paper world, not all derivative sources of born digital files are created equally. We have to assess the way in which the derivative was created to determine whether it affects the evidentiary value. 

Key questions are:

  • Does the derivative contain all the information that the original contains? 
  • Does it retain the structure of elements that was present in the original, if the structure affects meaning? 
  • Is the metadata that was present in the original still present in the derivative? 

For example, when a born digital photo is converted to PDF, the EXIF metadata about the location and date the photo was taken may be lost. Similarly, if a Word document is converted to a plain text file, formatting will be lost, which may affect the interpretation, e.g. when there are strike-throughs or tables in the original. 

As keepers of born digital information, we have to choose the preservation strategy that best retains the evidentiary value of the original. The most common strategies are migration and emulation. 
With migration, you convert the file to a different file format, typically in an open standard. Choosing a format that preserves the information, structure of the information, and metadata is important to keep the quality high. 
With emulation, we keep the file as it is and use the original software to view the file, for example in a virtual machine with an old operating system and an old version of the software that is capable of reading the file. 

A best practice for archives is to use migration as the primary preservation strategy and convert obsolete born digital files to a current file format in an open standard, in addition to keeping a copy of the original file. Future conversions can then be done from the original, rather than risking compound quality loss. Having the original also allows the archive to view the original file using emulation to verify that all the information, structure, and metadata are preserved. Sometimes, a presentation copy is also created, for example a JPG for quicker viewing. This strategy can result in different manifestations of the same file: the original, several preservation copies in different open file formats over time, and presentation copies. 

The preservation and presentation copies are derivative sources. By analyzing how they were created and what information, structure, or metadata could have been lost in the conversion, we can establish how reliable these derivatives are.

Submitted by yhoitink on Wed, 03/13/2019 - 03:44

While talking about this thread with my friend Jeroen van Luin, a digital preservation specialist at the National Archives of the Netherlands, he had a comment:

"The best strategy is to convince the archival creator to create the born digital record in an open file format in the first place, eliminating the need for the first conversion."

Thanks, Jeroen, for that addition and the permission to share. 

Submitted by vcorrales on Wed, 03/13/2019 - 10:37

This is a very interesting topic. I will try not to be too technical (I have a computer science background).

E-mail is always a digital copy. The only original existed briefly while the person was typing the email; once he or she presses the "send" button, a copy is transmitted to a server (the outgoing SMTP server; SMTP stands for Simple Mail Transfer Protocol). Eventually, the email reaches the SMTP server of the receiver (it may have passed through several servers on its way from the sender to the receiver).

An email is composed of a set of headers and a body. The headers include several pieces of information, such as To, From, and Subject, which most email software displays, but they also contain a list of all the servers the email has passed through. Something similar to:

Received: by 2002:a4f:92d3:0:0:0:0:0;
        Wed, 13 Mar 2019 08:04:57 -0700 (PDT)
Received: from rs224.mailgun.us by mx.google.com
        Wed, 13 Mar 2019 08:04:57 -0700 (PDT)

In this case the email has been copied twice. Email software usually has an option to show the “original” (which, in reality, is itself a copy). When we read an email in a mail program or in webmail, we are not looking at the original email but at a rendered version of it (for example, the Received headers are hidden).
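
For readers who want to look at these headers themselves, here is a minimal Python sketch using the standard library `email` package. The sample message is hypothetical: only the two Received lines come from the example above, and the addresses, subject, and the function name `received_hops` are placeholders of mine. It simply lists each Received header, i.e. each recorded hop.

```python
from email import message_from_string
from typing import List

# Hypothetical sample message. Only the two Received headers are taken
# from the example in this thread; everything else is a placeholder.
RAW_MESSAGE = (
    "Received: by 2002:a4f:92d3:0:0:0:0:0;\n"
    "        Wed, 13 Mar 2019 08:04:57 -0700 (PDT)\n"
    "Received: from rs224.mailgun.us by mx.google.com\n"
    "        Wed, 13 Mar 2019 08:04:57 -0700 (PDT)\n"
    "From: sender@example.com\n"
    "To: receiver@example.com\n"
    "Subject: Test message\n"
    "\n"
    "Body of the message.\n"
)


def received_hops(raw_message: str) -> List[str]:
    """Return every Received header, with folded lines joined into one."""
    message = message_from_string(raw_message)
    hops = message.get_all("Received", [])
    # A header may be folded over several lines; collapse the whitespace.
    return [" ".join(hop.split()) for hop in hops]


hops = received_hops(RAW_MESSAGE)
for number, hop in enumerate(hops, start=1):
    print(number, hop)
```

Each server normally adds its Received line at the top, so the list reads from the most recent hop back to the earliest one.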

To make things even more interesting, an email is not transmitted as a single unit: the communication protocols used by the Internet (called TCP/IP) will probably divide the email into small pieces (called IP packets) that are transmitted separately and later reassembled into the email.

With that said, an email is really a copy that has been copied several times. However, there are technological safeguards (e.g. checksums) in place to help ensure that it is an accurate copy.
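
To illustrate the idea of a checksum, here is a small Python sketch. It is an assumption-laden illustration rather than the mechanism the network itself uses: transport protocols rely on their own, simpler checksums, while SHA-256 from the standard `hashlib` module is the kind of digest commonly used to check archived files. The message text and the helper name `sha256_hex` are made up for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical message content, used only for illustration.
original = b"Subject: Family reunion\n\nSee you on 13 March 2019.\n"
exact_copy = bytes(original)
altered_copy = original.replace(b"13 March", b"18 March")

# An exact copy yields the same digest; any change yields a different one.
print(sha256_hex(original) == sha256_hex(exact_copy))    # True
print(sha256_hex(original) == sha256_hex(altered_copy))  # False
```

Recording such a digest next to an archived file lets anyone confirm later that the stored copy has not changed since the digest was taken.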

Submitted by yhoitink on Wed, 03/13/2019 - 10:55

I think an email, though technically a copy, can be treated like an original just like we would treat the full-color scan of a paper record. Checksums are the digital equivalent of archivists in the reading room who see to it we don't tear any pages out of the registers, or use our pens to change the information. Preservation is about ensuring the authenticity and integrity of the data, regardless of its form. Digital may seem to bring new challenges, but many of the same principles can be applied. 

Just a quick note.

What has been discussed regarding emails is the potential for corruption during transit. The issue that affects many genealogists is that there isn't a universal standard for taking them offline for storage. This is evidenced by the practice, in many companies, of requiring the users to cull their email in order to have the capacity to maintain key emails within the email database. Home users often have a much greater problem with storage space. So the issue, in this case, is what can a home user do to transfer the emails from their system to a reasonably standard "archival" format. 

Not necessarily only in transit. Even when the emails sit on a server (where they are physically stored), when a person tries to view them (for example using a website like Gmail), a new copy is generated and transmitted from the server to the client. To make things more interesting, this copy is rendered as the software was programmed to do (many programs show only some of the available headers). 

Personally, when I am archiving, I just print the email to PDF from Gmail. I have found that this is the most suitable approach for my purposes, as I am interested in the content of the email. Technically, however, I am losing information, as the headers are not saved.

Another archival option is to download them in EML format (which Outlook can read). This is a well-documented standard, originally published as RFC 822 and since revised as RFC 2822 and RFC 5322:

https://tools.ietf.org/html/rfc822

I will just put a copy of the standard in the archive, as it explains how to read the format.
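
As a rough sketch of why the EML route keeps more than a printed PDF, the Python below writes a hypothetical message to a temporary `.eml` file and reads it back with the standard `email` package. The sample message, the file name, and the helper `read_eml` are assumptions for illustration only; a real export from a mail program will contain many more headers.

```python
import tempfile
from email import policy
from email.parser import BytesParser
from pathlib import Path

# Hypothetical EML content; a real export will have many more headers.
SAMPLE_EML = (
    b"Received: from rs224.mailgun.us by mx.google.com\n"
    b"        Wed, 13 Mar 2019 08:04:57 -0700 (PDT)\n"
    b"From: Sender <sender@example.com>\n"
    b"To: Receiver <receiver@example.com>\n"
    b"Subject: Archive test\n"
    b"Date: Wed, 13 Mar 2019 08:04:50 -0700\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"Body text worth keeping.\n"
)


def read_eml(path):
    """Parse an EML file into a message object from the standard library."""
    with open(path, "rb") as handle:
        return BytesParser(policy=policy.default).parse(handle)


with tempfile.TemporaryDirectory() as folder:
    eml_path = Path(folder) / "message.eml"
    eml_path.write_bytes(SAMPLE_EML)

    message = read_eml(eml_path)
    header_names = list(message.keys())   # every header, including Received
    subject = str(message["Subject"])
    body = message.get_content()          # the plain-text body

print(header_names)
print(subject)
print(body)
```

The point of the sketch is that the Received headers, which a PDF print normally drops, are still in the file and can be read with ordinary tools.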

 

 

 

Yes, I agree, they could be treated as an original, in the same way we might treat a microfilm or a digital image of a microfilm. I think the point of classifying something as a derivative is to indicate that some errors might have been introduced or some information might have been lost.

Technically, checksums are more accurate than the archivist. Computers and software have bugs, but they are more reliable than humans when dealing with a repetitive task :-) 

 

Submitted by History-Hunter on Wed, 03/13/2019 - 12:17

I appreciate the detail in your answers and do understand that you are attempting to cover the several scenarios and potential solutions that are possible. I really do hope that people will continue to think about this and voice both opinions and suggestions.

Perhaps the following will shed some light on why my question arose...

As genealogists, we really do not often have a great deal of control over the format in which an electronic document is originally produced or even any conversions it undergoes in its lifetime prior to reaching us.

When we download a document from some of the major genealogy sites, we don't know if the document is "as scanned". It may have been converted from the originally scanned format and may even have had its metadata altered. We just don't know for sure. Yet we often accept these as equivalent to the original without any questions.

Even in truly "born digital" documents, the mere act of transmitting them electronically causes them to be converted into packets and reassembled when received. Yet we don't seem to have an issue with this, either.

Any document that has been archived as a "zip" file or similar has also undergone a conversion. This is yet another example of an accepted conversion.
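
As a small illustration of why this particular conversion is generally accepted, here is a Python sketch (standard library only; the file name and content are made up) showing that compressing a document into a zip archive and extracting it again gives back exactly the same bytes.

```python
import io
import zipfile

# Hypothetical document content, used only for illustration.
document = b"Dear cousin, the baptism record you asked about is attached.\n" * 20

# Compress the document into an in-memory zip archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("letter.txt", document)

# Extract it again and compare with what went in.
with zipfile.ZipFile(buffer) as archive:
    restored = archive.read("letter.txt")

print(restored == document)  # True: the round trip is lossless
```

The compression changes how the bytes are stored, not the information content, which is the distinction that matters in the discussion that follows.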

First, I think that there needs to be a more "formal" definition of what it is that we are trying to preserve in an electronic document and, therefore, what "equivalent to original" really means with respect to electronic documents. It seems to me that there are two key elements that need to be addressed in such a definition: information content (that which we may wish to cite) and quality (meaning our ability to distinguish what we may wish to cite).

Let me address the second criterion first. Assuming the first criterion is met, the noted "quality" is something we address daily in analyzing physical documents. We already have a reasonably good track record of doing this, so this is likely not an issue.

The first, however, is critical. If we cannot demonstrate to an independent party (e.g. a client) that the information content is intact, then the entire item may need to be discarded. In an electronic artefact, I would suggest that this boils down to whether we know, and can convey our confidence in, the process by which the document was generated, transmitted, received and stored, including any conversions. When I worked as a software product assurance engineer in the aerospace industry, we would typically only accept electronic data that came from a controlled source, via a controlled and validated procedure, to a controlled destination. We had full control over that. As genealogists, we usually do not. 

So the real question is: can we find out enough about the "history" of an electronic document we received to convey a reasonable level of confidence in what we're seeing, and are we (ourselves) using a validated process to handle, archive and store it?

Submitted by vcorrales on Wed, 03/13/2019 - 22:01

I think the Genealogical Standards Glossary gives us the clue to the answer when it defines a facsimile :-)

"Facsimile An image showing a source with no sign of cropping, blurring, or other alteration, including color or shading changes that mask information; an exact copy; see image" [1]

At the end of the day, we treat a "copy" as an original if we can assert that it is an "exact copy". In the physical world, the process of "copying" is not that frequent. In the digital world, we are continuously creating copies of copies that are 100% exact. As there are safeguards in place to guarantee (to a reasonable extent) that the copying process is accurate, we have no issue treating them as originals.

I think digital conversion only reduces the evidentiary value when there is a risk of an alteration that could result in incorrect information (e.g. if a JPEG file has been converted and reconverted, it might become blurry). However, I would not discard it; I would just make a note of it (in the source citation) and use it as another piece of evidence in my analysis and correlation.

[1] Genealogy Standards: 50th Anniversary Edition (Kindle Locations 927-929). Turner Publishing Company. Kindle Edition. 

Submitted by CrankyToday on Thu, 03/14/2019 - 08:29

Assuming that we don't crop or edit the image or create a low quality copy, it is highly unlikely that converting format will make changes any greater than were made before the image was presented to us. Most of the images that we see were first microfilmed, then scanned to an image (probably in a TIFF format), then converted to JPG, PNG, or PDF format for presentation on the website. When we save it to a convenient format with appropriate care, we're probably causing a loss of information that is much smaller than what has already been lost. Of course, we need to think about future genealogists who may need to trace back to the source. If the image is likely to remain available online (such as a file from a reputable repository or website), then we should document where to get it. If the image is not likely to be available to others (such as an email note), then we should keep the original as well as our copy.