Born Digital Text | Evidence Explained

By Anonymous on Fri, 03/13/2015 - 12:31

Increasingly, official government recorders enter information directly into a computer rather than writing on paper forms. This is sometimes called "born digital." I think the evidentiary strength of a born-digital record is akin to that of a digital image of an original record. That leaves me uncomfortable citing the born-digital database like this:

"Database name," database, Website Title (URL : access date).

The strength is more like that of a "database and digital image," even though there is no image. One could attempt to signal the strength of the database by attributing it to the government agency:

Name of government agency, "Database name," database, Website Title (URL : access date).

But that doesn't differentiate the born-digital database from one created later by extraction. And when third parties republish government databases, they often "massage" the information. I consider those to be derivatives that should not be attributed to the original government agency.

What would you think about including some term akin to "born-digital" somewhere in the citation:

"Database name," born-digital database, Website Title (URL : access date).

What are your thoughts?

I don't quite see the difference like that. If data had been recorded in typescript or manuscript then you might indicate that fact, and so in this case of "digital script" you could also indicate that. However, I don't believe it's akin to a digital image because (a) digital text is not the same as an image, and (b) there was no original typescript/manuscript to take an image of.

From that point of view (noting that I am no expert), I would expect it to be just like a digital publication.

Tony

I agree with the original poster that there is a big difference between a database created as a derivative source (for example, a database where a volunteer typed in information from an original source) and a database created by an organization as a digital born object (for example, a database with current addresses of residents, like our municipalities keep). The 'derivative' database would be less reliable, since the typing in introduces the chance of errors and the original record may have information not transmitted to the database. The digital born database would be an original source with all the hallmarks of quality that go with that. Source citations are important to assess the quality of the evidence, so making a distinction between these two different critters is essential. I personally like the idea of a separate designation like "digital born database."

Raymond, Tony, and Yvette,

At this point, EE doesn't agree or disagree with either view. There do seem to be questions that need resolving. Bottom line: What, exactly, constitutes "digitally born?"

Yvette gives the example of a city that maintains a database with current addresses of residents. From the standpoint of quality—a point Yvette raised—we might consider that high, because the city has a vested interest in knowing who lives where. But it's not error free. People move without letting the city know. Typists make typos. More to the point: originality is a different issue from quality.

Such a database also involves keying in data that the data entry person takes from something else—a form someone has filled out for electric service, for example. So ...

How does this make the database different from one in which, say, someone takes old voter rolls and creates a database from them for historical research?
How would you define a database with medical records, wherein a physician dictates a synopsis of a person's complaint, symptoms, diagnosis, and treatment, after which his dictation is entered into a database with no paper trail in between? Would this be a digital-born database?

Would you mind sharing other examples of what you consider "digital born" and why or how each differs from, say, the voter-roll example above?

There is talk of 'databases' here but surely the citation depends on the actual published medium.

Recalling that we can only cite what we saw and used, then if a database was specifically mentioned (incl. name and version details) then that is the published medium to be cited. However, if you were using some smart text or document retrieval system, with no mention of how the data was stored behind the scenes, then that system is the published medium to be cited.

When I mentioned 'electronic publication', above, then I was imagining that this data would be presented through such a system (e.g. via Web pages) rather than raw database queries.

Although not a genealogical source (one would hope!), I can give an example of a "digital born" database (first time I've ever heard that term). When I go to the doctor's office, nobody except an incoming patient writes anything on paper. When the medical assistant takes my blood pressure and all, the numbers are entered directly into the computer. When the doctor makes his notes, they go directly into the computer. No medical transcriptionists are involved.

At this point I can't think of such a database that would be routinely used by genealogists, but I'm sure they're out there. I like the idea of differentiating between the two types of databases, but have one question: how do you know which type you're referencing?

Dave

Dave, your closing question is the crux of the issue.

This article on the Library of Congress website has a good explanation of the difference between born digital and digitized:

Trevor Owens, "All Digital Objects Are Born Digital Objects," Library of Congress, The Signal, 15 May 2012 (http://blogs.loc.gov/digitalpreservation/2012/05/all-digital-objects-are-born-digital-objects : accessed 14 March 2015).

Hmm, I did not count on the fancy blockquote markup, putting all of that in italics instead of just The Signal.

Owens makes an interesting argument, Yvette. Certainly an original photo taken with a digital camera would be "born digital." On the other hand, it could more validly (I think) be argued that a digitization of an existing image is only an adaptation or an enhancement, rather than the birth of something new. Most of the elements of the "new" image already existed, unless the artist-behind-the-camera deliberately tried to obscure the original details. Yes? No?

But there's a huge difference between image data and digital text (or 'digital script' as I described it above). As you say, a digital-camera image is an image of something else, and that applies to any sort of recorded digital image (e.g. a scan).

When that doctor or nurse types data into some program (which may or may not be using a 'database' to store it) then it is stored as digital text -- not in some image format.

For these questions to make sense, we have to be careful with these distinctions. It's worth noting that all the hassles we may encounter with OCR arise because we're trying to convert image data into digital text.

ACProctor, your penultimate sentence is, for me, the bottom line. Coupled with the fact that most historical researchers are not tech professionals and will have difficulties drawing distinctions that you, Robert, and Yvette can easily draw, would we not invite more confusion by "mandating" technical distinctions that the general public is not equipped to make?

Raymond, Tony, and All,

Tony's distinction between "born digital image" and "born digital text" is critical. That raises a new question. Raymond, if we say that "born digital text" is a different entity from "born digital database," would that "database" distinction be limited to situations in which pieces of data were actually entered, for the one and only time, in a relational database that is searchable by elements? If so, then

How will users be able to discern whether the data was truly born digital (unless they have "inside knowledge" as you would have with, say, FS databases)?
What label would we apply to born-digital narratives (books, blogs, etc.) that are also searchable through a database of some type?
Would it not suffice for (a) users who are unable to discern the mechanism behind the produ—i.e., most researchers—to continue to use the basic descriptors, article, blog, book, database, database with images, etc.; while (b) those with inside knowledge of the creation of the digital source can use the comments field of a citation to identify instances in which they are certain the text or the image was "born digital"?

Elizabeth

That works for me Elizabeth.

Tony

I am joining the conversation late, and I agree with the Editor's conclusion that the basic descriptions will suffice for most situations.

Since more and more sources will be digital, perhaps I can share my thoughts.

It seems to me that the issue is not whether the information was "born digital" or not. Perhaps the real issue is how to evaluate the strength or weakness of the source (EE p. 10) and convey that in the citation.

A source can be born digital, live digital, and, just like a paper source, still contain errors through later digital translations that would weaken the validity of the source.

At what point does a born digital original source become a derivative source? We can probably use the same reasoning as we do for traditional sources.

Let me illustrate with an example.

When I type a compiled lineage into my word processor, it is "born digital." For the sake of simplicity, let’s suppose that I release all claims to copyright on this document. I then save the document as an Acrobat file (.pdf). At this point, it is slightly "less original" than my word processor document, but still almost as useful since the information is clearly readable and the text is embedded into the document so that it can be copy-pasted as needed. However, if someone finds an error in my .pdf document, they can't fix it without tracking me down because it is not easily editable.

Next, I attach the .pdf as a source in FamilySearch, thus releasing it into the wild, wild Internet. Later, someone converts the .pdf to a 72dpi jpeg image for some reason - perhaps to include it on their web site. Now the jpeg document is "less original" than the .pdf. It is harder to read and you can no longer select the text since it is just an image.

Finally, someone attempts to OCR the jpeg image (i.e. convert the jpeg image of text to editable text characters) to recreate the text. The resulting text document is "less original" than the jpeg and probably contains errors.

In this example, the document was born digital and was digital every step of the way. However, the different derivative versions of the document have varying abilities to convey accurate information.

So perhaps the question we are trying to address is how we can evaluate the strength or weakness of these digital sources and convey that in the citation. Maybe we can use similar reasoning to traditional sources, but it is also possible that we are trying to fit too much information in the citation.

- Brad

Brad, you've just raised more issues here than 83% of researchers will want to think about. Well done. (And, for the record, that stat is totally made up, which is why it carries no citation.)

Brad and all,

Here's an example illustrating my need.

Citation 1:

Suppose someone goes to Newspaper A's website and types in an obituary. A copy of that obituary is electronically transferred to genealogical website B with 99.999% accuracy. Newspaper A prints the obituary in a paper edition of its newspaper. Website B publishes the obituary as part of a database of obituaries. You look at the database record of the obituary. Knowing the propensity for OCR errors, you look for a link to an image so you can verify the information, but there is none. The website provides a proper citation which you use in a compiled genealogy which you enter in your word processor.

Citation 2:

Genealogy website B gets permission to publish the text from newspaper A, but without images. It scans the paper with an OCR process that yields 99% accuracy of common words, 96% accuracy of names, and 93% accuracy of numbers. Website B publishes the obituary--the same one as above--as part of a database of newspaper stories. You look at the database record of the obituary. Knowing the propensity for OCR errors, you look for a link to an image so you can verify the information, but there is none. The website provides a proper citation which you use in a compiled genealogy which you enter in your word processor.

Citation 3:

Genealogy website B finally gets permission to publish images of newspaper A. You look at the database record of the obituary. You find a link to an image and you utilize it instead of the database record. The website provides a proper citation which you use in a compiled genealogy which you enter in your word processor.

Here's my quandary:

What would the three citations look like?
If the citations all had to lead with website B's publication, what would they look like?
How can we make the citations as succinct as possible?

I've always looked at the item type, "database" vs. "image," as a leading indicator of the evidentiary value of the source. From a reputable publisher, "database" always had the evidentiary value of a textual derivative while "image" always had an evidentiary value close to an original. That's no longer true. Must we add a tertiary element to the citation to signal this new item flavor?

---Robert

Robert, my brain is awash in possibilities and possibilities, so perhaps I've missed something in your scenarios. However, it seems to me that the database vs. image proposition sets up a "false choice." If I've understood you correctly, the original 'born digital' item isn't either a database or an image. It's an article. More specifically, it's an obituary, but obits fall into the larger category of "articles."

I know you recall EE's first QuickSheet, Citing Online Historical Sources. (I know it because you often seem to know EE better than I do!) Most of that QuickSheet deals with databases and images. But page 1 begins (on purpose) with a diagrammed template for an article. That is a very important "third option" for online materials, though I wouldn't call it tertiary.

Beyond that point, the derivatives do offer us the database vs. image dichotomy. Do you care to offer us those three citations for debate or rubber-stamping?

Elizabeth

We accept that a recorded image (as opposed to a created or modified image) is a derivative copy of something, and it's correctly pointed out here that OCR is also a further derivative, and so subject to all the same issues. However, born-digital text should not degrade, and so every digital copy should be exactly the same. I'm avoiding issues here such as character-set conversion, and also any differences in the rendition of what's stored (e.g. you may have perfectly valid Japanese text stored, irrespective of whether your system can display it or not). In effect, I don't see any situation that cannot be addressed by Elizabeth's earlier suggestion.

I can't close this post, too, without mentioning a particular issue that concerns me as a software developer, and that's the distinction between database and image. I understand the goal, and I have discussed this with Elizabeth before, but databases do not simply store text; they can store images too. Typically, you cannot tell whether a provider is storing images in the database or just the names of image files held externally to it. Under these circumstances, I am happy to use database-with-images while acknowledging that database is the indexed method of organised storage rather than some group of textual items extracted from it.

"Mr Pedantic" ;-)

>"A particular issue ... concerns me as a software developer, and that's the distinction between database and image. ... Databases do not simply store text; they can store images, too. ... I am happy to use database-with-images."

We hear you, Tony. :)

I think the underlying issue is original versus derivative. We have the same issues with manuscripts: only by analyzing them properly will we understand the context in which the record was created and conclude that a manuscript is an original document or derived (copied) from an earlier document. We're used to thinking of databases as derivatives, but digital advances mean that distinction isn't so easy to make anymore, since databases can now be originals too.

Yvette,

I completely agree. I don't think that the term image or database in the citation alone should be used to interpret the quality of a source such as original or derivative. These descriptors merely indicate the way the information is stored. We need further analysis to determine the quality of the source.

If we can accept this premise, then it seems that the use of current descriptors such as database, image, blog, article, etc., will work just fine no matter how the source was created.

- Brad