Born Digital Text

Increasingly, official government recorders enter information directly into a computer rather than writing on paper forms. This is sometimes called "born digital." I think the evidentiary strength of a born-digital record is akin to that of a digital image of an original record. That leaves me uncomfortable citing the born-digital database like this:

          "Database name," database, Website Title (URL : access date).

The strength is more like that of a "database and digital image," even though there is no image. One could attempt to signal the strength of the database by attributing it to the government agency:

          Name of government agency, "Database name," database, Website Title (URL : access date).

But that doesn't differentiate the born-digital database from one created later by extraction. And when third parties republish government databases, they often "massage" the information. I consider those to be derivatives that should not be attributed to the original government agency.

What would you think about including some term akin to "born-digital" somewhere in the citation:

          "Database name," born-digital database, Website Title (URL : access date).

What are your thoughts?
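
For anyone who stores citations in software, the three layouts proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `DatabaseCitation` and its field names are hypothetical, and the "born-digital database" label is the proposal under discussion here, not an established item type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatabaseCitation:
    """Hypothetical container for the citation elements discussed above."""
    database_name: str
    website_title: str
    url: str
    access_date: str
    agency: Optional[str] = None   # government agency, if the citation leads with it
    born_digital: bool = False     # proposed flag; not an established convention

    def format(self) -> str:
        # Choose the item-type label: the proposed term or the plain one.
        item_type = "born-digital database" if self.born_digital else "database"
        # Lead with the agency only when one was supplied.
        lead = f"{self.agency}, " if self.agency else ""
        return (
            f'{lead}"{self.database_name}," {item_type}, '
            f"{self.website_title} ({self.url} : {self.access_date})."
        )


if __name__ == "__main__":
    plain = DatabaseCitation("Database name", "Website Title", "URL", "access date")
    with_agency = DatabaseCitation(
        "Database name", "Website Title", "URL", "access date",
        agency="Name of government agency",
    )
    flagged = DatabaseCitation(
        "Database name", "Website Title", "URL", "access date", born_digital=True
    )
    for citation in (plain, with_agency, flagged):
        print(citation.format())
```

Keeping the flag as a separate field means the label can be dropped or reworded later without touching the other citation elements.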

Submitted by ACProctor on Fri, 03/13/2015 - 14:12

I don't quite see the difference like that. If data had been recorded in typescript or manuscript then you might indicate that fact, and so in this case of "digital script" you could also indicate that. However, I don't believe it's akin to a digital image because (a) digital text is not the same as an image, and (b) there was no original typescript/manuscript to take an image of.

From that point of view (noting that I am no expert), I would expect it to be just like a digital publication.

Tony

Submitted by yhoitink on Sat, 03/14/2015 - 09:13

I agree with the original poster that there is a big difference between a database created as a derivative source (for example, a database where a volunteer typed in information from an original source) and a database created by an organization as a digital-born object (for example, a database with current addresses of residents, like our municipalities keep). The 'derivative' database would be less reliable, since the typing in introduces the chance of errors and the original record may have information not transmitted to the database. The digital-born database would be an original source with all the hallmarks of quality that go with that. Source citations are important to assess the quality of the evidence, so making a distinction between these two different critters is essential. I personally like the idea of a separate designation like "digital born database."

Submitted by EE on Sat, 03/14/2015 - 10:39

Raymond, Tony, and Yvette,

At this point, EE doesn't agree or disagree with either view. There do seem to be questions that need resolving. Bottom line: What, exactly, constitutes "digitally born?"

Yvette gives the example of a city that maintains a database with current addresses of residents. From the standpoint of quality—a point Yvette raised—we might consider that high, because the city has a vested interest in knowing who lives where. But it's not error free. People move without letting the city know. Typists make typos. More to the point: originality is a different issue from quality.

Such a database also involves keying in data that the data entry person takes from something else—a form someone has filled out for electric service, for example.  So ...

  • How does this make the database different from one in which, say, someone takes old voter rolls and creates a database from them for historical research?
  • How would you define a database with medical records, wherein a physician dictates a synopsis of a person's complaint, symptoms, diagnosis, and treatment, after which his dictation is entered into a database, with no paper trail in between? Would this be a digital-born database?

Would you mind sharing other examples of what you consider "digital born" and why or how each differs from, say, the voter-roll example above?

Submitted by ACProctor on Sat, 03/14/2015 - 12:29

There is talk of 'databases' here but surely the citation depends on the actual published medium.

Recalling that we can only cite what we saw and used, then if a database was specifically mentioned (incl. name and version details) then that is the published medium to be cited. However, if you were using some smart text or document retrieval system, with no mention of how the data was stored behind the scenes, then that system is the published medium to be cited.

When I mentioned 'electronic publication', above, then I was imagining that this data would be presented through such a system (e.g. via Web pages) rather than raw database queries.

Submitted by dsliesse on Sat, 03/14/2015 - 12:30

Although not a genealogical source (one would hope!), I can give an example of a "digital born" database (first time I've ever heard that term).  When I go to the doctor's office, nobody except an incoming patient writes anything on paper.  When the medical assistant takes my blood pressure and all, the numbers are entered directly into the computer.  When the doctor makes his notes, they go directly into the computer.  No medical transcriptionists are involved.

At this point I can't think of such a database that would be routinely used by genealogists, but I'm sure they're out there.  I like the idea of differentiating between the two types of databases, but have one question: how do you know which type you're referencing?

Dave

Submitted by yhoitink on Sat, 03/14/2015 - 17:18

This article on the Library of Congress website has a good explanation of the difference between born digital and digitized:

Trevor Owens, "All Digital Objects Are Born Digital Objects," Library of Congress, The Signal, 15 May 2012 (http://blogs.loc.gov/digitalpreservation/2012/05/all-digital-objects-are-born-digital-objects : accessed 14 March 2015).

Owens makes an interesting argument, Yvette. Certainly an original photo taken with a digital camera would be "born digital."  On the other hand, it could more validly (I think) be argued that a digitization of an existing image is only an adaptation or an enhancement, rather than the birth of something new. Most of the elements of the "new" image already existed, unless the artist-behind-the-camera deliberately tried to obscure the original details. Yes? No?

But there's a huge difference between image data and digital text (or 'digital script' as I described it above). As you say, a digital-camera image is an image of something else, and that applies to any sort of recorded digital image (e.g. a scan).

When that doctor or nurse types data into some program (which may or may not be using a 'database' to store it) then it is stored as digital text -- not in some image format.

In order for these questions to make sense then we have to be careful with these distinctions. It's worth noting that all the hassles we may encounter with OCR are because we're trying to convert image data into digital text.

ACProctor, your penultimate sentence is, for me, the bottom line. Coupled with the fact that most historical researchers are not tech professionals and will have difficulties drawing distinctions that you, Robert, and Yvette can easily draw, would we not invite more confusion by "mandating" technical distinctions that the general public is not equipped to make?

Submitted by EE on Sun, 03/15/2015 - 14:29

Raymond, Tony, and All,

Tony's distinction between "born digital image" and "born digital text" is critical. That raises a new question. Raymond, if we say that "born digital text" is a different entity from "born digital database," would that "database" distinction be limited to situations in which pieces of data were actually entered, for the one and only time, in a relational database that is searchable by elements? If so, then

  • How will users be able to discern whether the data was truly born digital (unless they have "inside knowledge" as you would have with, say, FS databases)?
  • What label would we apply to born-digital narratives (books, blogs, etc.) that are also searchable through a database of some type?
  • Would it not suffice for (a) users who are unable to discern the mechanism behind the product (i.e., most researchers) to continue to use the basic descriptors, article, blog, book, database, database with images, etc.; while (b) those with inside knowledge of the creation of the digital source can use the comments field of a citation to identify instances in which they are certain the text or the image was "born digital"?

Elizabeth

 

I am joining the conversation late, and I agree with the Editor's conclusion that the basic descriptions will suffice for most situations.

Since more and more sources will be digital, perhaps I can share my thoughts.

It seems to me that the issue is not whether the information was "born digital" or not.  Perhaps the real issue is how to evaluate the strength or weakness of the source (EE p. 10) and convey that in the citation.

A source can be born digital, live digital, and, just like a paper source, still contain errors through later digital translations that would weaken the validity of the source. 

At what point does a born digital original source become a derivative source?  We can probably use the same reasoning as we do for traditional sources.

Let me illustrate with an example.

When I type a compiled lineage into my word processor, it is "born digital."  For the sake of simplicity, let’s suppose that I release all claims to copyright on this document.  I then save the document as an Acrobat file (.pdf).  At this point, it is slightly "less original" than my word processor document, but still almost as useful since the information is clearly readable and the text is embedded into the document so that it can be copy-pasted as needed.  However, if someone finds an error in my .pdf document, they can't fix it without tracking me down because it is not easily editable.

Next, I attach the .pdf as a source in FamilySearch, thus releasing it into the wild, wild Internet.  Later, someone converts the .pdf to a 72dpi jpeg image for some reason - perhaps to include it on their web site.  Now the jpeg document is "less original" than the .pdf.  It is harder to read and you can no longer select the text since it is just an image.

Finally, someone attempts to OCR the jpeg image (i.e. convert the jpeg image of text to editable text characters) to recreate the text.  The resulting text document is "less original" than the jpeg and probably contains errors.

In this example, the document was born digital and was digital every step of the way.  However, the different derivative versions of the document have varying abilities to convey accurate information.

So perhaps the question we are trying to address is how we can evaluate the strength or weakness of these digital sources and convey that in the citation.  Maybe we can use similar reasoning to traditional sources, but it is also possible that we are trying to fit too much information in the citation.

     - Brad

 

Submitted by rraymond on Thu, 03/19/2015 - 18:02

Brad and all,

Here's an example illustrating my need.

Citation 1:

Suppose someone goes to Newspaper A's website and types in an obituary. A copy of that obituary is electronically transferred to genealogical website B with 99.999% accuracy. Newspaper A prints the obituary in a paper edition of its newspaper. Website B publishes the obituary as part of a database of obituaries. You look at the database record of the obituary. Knowing the propensity for OCR errors, you look for a link to an image so you can verify the information, but there is none. The website provides a proper citation which you use in a compiled genealogy which you enter in your word processor.

Citation 2:

Genealogy website B gets permission to publish the text from newspaper A, but without images. It scans the paper with an OCR process that yields 99% accuracy of common words, 96% accuracy of names, and 93% accuracy of numbers. Website B publishes the obituary--the same one as above--as part of a database of newspaper stories. You look at the database record of the obituary. Knowing the propensity for OCR errors, you look for a link to an image so you can verify the information, but there is none. The website provides a proper citation which you use in a compiled genealogy which you enter in your word processor.

Citation 3:

Genealogy website B finally gets permission to publish images of newspaper A. You look at the database record of the obituary. You find a link to an image and you utilize it instead of the database record. The website provides a proper citation which you use in a compiled genealogy which you enter in your word processor.
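
To put the accuracy figures in these scenarios in perspective, here is a rough back-of-the-envelope sketch. It assumes (my assumption, not something the figures themselves state) that each percentage is a per-item rate and that errors on different items are independent; the function name and the count of ten names per obituary are hypothetical.

```python
def chance_all_correct(per_item_accuracy: float, item_count: int) -> float:
    """Probability that every one of item_count items was captured correctly,
    assuming independent errors at the given per-item accuracy."""
    if not 0.0 <= per_item_accuracy <= 1.0:
        raise ValueError("per_item_accuracy must be between 0 and 1")
    if item_count < 0:
        raise ValueError("item_count must not be negative")
    return per_item_accuracy ** item_count


if __name__ == "__main__":
    names_in_obituary = 10  # hypothetical count, for illustration only
    scenarios = [
        ("electronic transfer (99.999%)", 0.99999),
        ("OCR of names (96%)", 0.96),
    ]
    for label, accuracy in scenarios:
        chance = chance_all_correct(accuracy, names_in_obituary)
        print(f"{label}: {chance:.4f} chance that all {names_in_obituary} names are right")
```

Under those assumptions, an obituary with ten names comes through the 96% OCR process fully intact only about two times in three, while the electronic transfer is almost always clean, even though both end up described as a "database."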

Here's my quandary:

  • What would the three citations look like?
  • If the citations all had to lead with website B's publication, what would they look like?
  • How can we make the citations as succinct as possible?

I've always looked at the item type, "database" vs. "image," as a leading indicator of the evidentiary value of the source. From a reputable publisher, "database" always had the evidentiary value of a textual derivative while "image" always had an evidentiary value close to an original. That's no longer true. Must we add a tertiary element to the citation to signal this new item flavor?

---Robert

Submitted by EE on Thu, 03/19/2015 - 20:24

Robert, my brain is awash in possibilities and possibilities, so perhaps I've missed something in your scenarios. However, it seems to me that the database vs. image proposition sets  up a "false choice." If I've understood you correctly, the original 'born digital' item isn't either a database or an image. It's an article. More specifically, it's an obituary, but obits fall into the larger category of "articles." 

I know you recall EE's first QuickSheet, Citing Online Historical Sources. (I know it because you often seem to know EE better than I do!) Most of that QuickSheet deals with databases and images. But page 1 begins (on purpose) with a diagrammed template for an article. That is a very important "third option" for online materials, though I wouldn't call it tertiary.

Beyond that point, the derivatives do offer us the database vs. image dichotomy.  Do you care to offer us those three citations for debate or rubber-stamping?

Elizabeth

Submitted by ACProctor on Fri, 03/20/2015 - 03:35

We accept that a recorded image (as opposed to a created or modified image) is a derivative copy of something, and it's correctly pointed out here that OCR is also a further derivative, and so subject to all the same issues. However, born-digital text should not degrade and so every (digital-)copy should be exactly the same. I'm avoiding issues here such as character-set conversion, and also any differences in the rendition of what's stored (e.g. you may have perfectly valid Japanese text stored, irrespective of whether your system can display it or not). In effect, I don't see any situation that cannot be addressed by Elizabeth's earlier suggestion.

I can't close this post, too, without mentioning a particular issue that concerns me as a software developer, and that's the distinction between database and image. I understand the goal, and I have discussed this with Elizabeth before, but databases do not simply store text; they can store images too. Typically, you cannot tell whether a provider is storing images in the database or just the names of image files held externally to it. Under these circumstances, I am happy to use database-with-images but acknowledging that database is the indexed method of organised storage rather than some group of textual items extracted from it.
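
For the non-developers, here is a minimal sketch of those two storage styles, using Python's built-in sqlite3 module. Everything in it is made up for illustration: the table names, the column names, the file path, and the placeholder bytes standing in for a real image. It does not describe how any particular provider works.

```python
import sqlite3

# Throwaway in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Style 1: the image bytes themselves live inside the database (a BLOB column).
conn.execute(
    "CREATE TABLE record_inline (id INTEGER PRIMARY KEY, caption TEXT, image BLOB)"
)
# Style 2: the database holds only the name of an image file stored elsewhere.
conn.execute(
    "CREATE TABLE record_linked (id INTEGER PRIMARY KEY, caption TEXT, image_path TEXT)"
)

placeholder_image = b"placeholder bytes standing in for a scanned page"
conn.execute(
    "INSERT INTO record_inline (caption, image) VALUES (?, ?)",
    ("Obituary scan", placeholder_image),
)
conn.execute(
    "INSERT INTO record_linked (caption, image_path) VALUES (?, ?)",
    ("Obituary scan", "scans/obit-0001.png"),  # hypothetical path
)

# Only someone with direct access to the tables can tell the two apart.
inline_type = conn.execute("SELECT typeof(image) FROM record_inline").fetchone()[0]
linked_type = conn.execute("SELECT typeof(image_path) FROM record_linked").fetchone()[0]
print(inline_type, linked_type)

conn.close()
```

A visitor to the website would see the same caption and the same picture in both cases, which is why the storage detail is invisible to the person writing the citation.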

"Mr Pedantic"  ;-)

 

 

Submitted by EE on Fri, 03/20/2015 - 10:01

>"A particular issue ... concerns me as a software developer, and that's the distinction between database and image. ... Databases do not simply store text; they can store images, too. ... I am happy to use database-with-images."

We hear you, Tony. :)

 

Submitted by yhoitink on Fri, 03/20/2015 - 11:42

I think the underlying issue is original versus derivative. We have the same issues with manuscripts: only by analyzing them properly will we understand the context in which the record was created and conclude that a manuscript is an original document or derived (copied) from an earlier document. We're used to thinking of databases as derivatives, but digital advances mean that distinction isn't so easy to make anymore, since databases can now be originals too.

Yvette,

I completely agree.  I don't think that the term image or database in the citation alone should be used to interpret the quality of a source such as original or derivative.  These descriptors merely indicate the way the information is stored.  We need further analysis to determine the quality of the source.

If we can accept this premise, then it seems that the current descriptors such as database, image, blog, article, etc., will work just fine no matter how the source was created.

     - Brad