Researching the digitized ORs - a cautionary tale

Discussion in 'Battle of Gettysburg' started by Bob Velke, Oct 10, 2017.

  1. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    52
    I'm really disappointed in what I'm seeing available in the digitized versions of the Gettysburg ORs (s1, v27, p1-3) so I think that I'm going to have to digitize them myself. I started with the third volume (Correspondence) and thought that I'd post some of the results here by way of a warning.

    The trouble, of course, is not false positives. The insidious errors are the false negatives: when the search engine fails to find what is there. If you search and find something, it is easy to assume that you're getting everything - especially if the search results show you a scan of the actual page. But that is not what you're really searching, of course. You're searching an OCRed version of the text that often has had little or no proofreading. And it is shocking to discover what you're missing!

    Cornell University's online version at http://ebooks.library.cornell.edu/m/moawar/waro.html is often cited as a good source so I started with that. It has a powerful search engine that I know is used by a lot of people. But how many people have clicked on "View entire text" to see what it is actually searching? It's scary.

    When you look at that text, the first clue of the problem is in the title: The ..."OFFJCJAL IRECOIRDS"...

    Yikes. They couldn't even proofread the title??

    For the first piece of correspondence alone (Page 3), I found several errors in the text, including:
    • Pleasonton coded as "Plea" and "sonton"
    • Washington coded as "Washing-" and "ton" (preserving the line break)
    • intended coded as "lutended"
    • positively coded as "posi.." and "tively" (another line break)
    That means that the search engine is not going to find those references. When you also count missing punctuation ("Stuart's" coded as "Stuarts"), I found more than 100 errors in the first 10 pages.
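
    The line-break splits listed above can be partly repaired before searching. The following is only a minimal sketch, not how Cornell or any other publisher processes its text; the function names and the sample sentence are made up for illustration.

```python
import re

def dehyphenate(ocr_text):
    """Rejoin words split across line ends, e.g. 'Washing-' + 'ton'."""
    # Heuristic: a hyphen at the end of a line followed by a lowercase
    # letter on the next line is treated as a line-break split. This will
    # also join genuine hyphenated compounds that happen to break there.
    return re.sub(r"(\w)-\s*\n\s*([a-z])", r"\1\2", ocr_text)

def count_hits(text, term):
    """Case-insensitive count of whole-word occurrences of term."""
    pattern = rf"\b{re.escape(term)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical sample text, not a quotation from the ORs.
sample = "The dispatch went to Washing-\nton today. Washington replied."

print(count_hits(sample, "Washington"))               # raw OCR text
print(count_hits(dehyphenate(sample), "Washington"))  # after rejoining
```

    On this sample the raw text yields one hit and the rejoined text yields two. The sketch only handles splits marked with a hyphen; a split marked with stray dots ("posi..") or with no marker at all ("Plea" and "sonton") would still be missed, as would misreadings like "lutended".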

    Speaking of Pleasonton, I found 320 references to him in my own scan of Part 3. Cornell lists 311. You might think 97% sounds pretty good - unless you're researching the guy and don't realize that you've missed 9 potentially critical pieces of correspondence that are from, to, or about him.

    In another test, Cornell finds only 92 of the 122 references to Emmitsburg in that volume (or 75%). For "Emmitsburg Road", it finds just 6 of 10 (60%). Of the last group, Cornell coded the first three errors as "iEmmits-"+"bare", "Einmitsburg", and "Emmits-"+"burg", respectively. The fourth error is due to the fact that Page 555 is completely missing from their scan!
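
    The percentages quoted above are simple recall figures: hits found by the search divided by references actually present. A short sketch of the arithmetic, using only the counts given in this post (the variable and function names are illustrative):

```python
# (found by Cornell's search, found in the poster's own scan of Part 3)
counts = {
    "Pleasonton": (311, 320),
    "Emmitsburg": (92, 122),
    "Emmitsburg Road": (6, 10),
}

def recall(found, actual):
    """Fraction of the real references that the search returned."""
    return found / actual

for term, (found, actual) in counts.items():
    missed = actual - found
    print(f"{term}: {recall(found, actual):.0%} found, {missed} missed")
```

    This prints 97%, 75%, and 60%, with 9, 30, and 4 references missed respectively.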

    I bought a CD of the whole set of ORs and the search results there are even worse.

    Those of us who have a copy might double-check against the paper index but (1) honestly, who does that? (2) you have to check three indexes, one for each volume, and (3) it is woefully incomplete too! Of the 122 references to Emmitsburg in Part 3, the index lists exactly ONE. In fact, the incompleteness of the paper index is often cited as justification for using the digital version.

    Does anyone know of a digitized copy of the Gettysburg ORs which is more reliable?
     
    Last edited: Oct 10, 2017

  3. 19thGeorgia

    19thGeorgia Sergeant

    Joined:
    Apr 4, 2017
    Messages:
    769
    Location:
    Cleburne Co
    "it is shocking to discover what you're missing!"

    Oh, yeah...

    Oh, yoali...
     
  4. Jimklag

    Jimklag Captain Silver Patron Trivia Game Winner

    Joined:
    Mar 3, 2017
    Messages:
    6,567
    Location:
    Chicagoland, Land of Lincoln
    About six months ago I purchased a CD-ROM of the Official Records which also includes A Compendium of the War of the Rebellion, Regimental Losses in the American Civil War, User's Guide to the Official Records, and Military Operations of the Civil War: A Guide to the Official Records. I have found it a very useful tool. The title of the CD is The Civil War CD-Rom and it is published by The Guild Press of Indiana.
     
  5. 19thGeorgia

    19thGeorgia Sergeant

    Joined:
    Apr 4, 2017
    Messages:
    769
    Location:
    Cleburne Co
    I have that one too. It's better than what is available online. I believe they went through it and corrected a lot of the errors in the scans.
     
  6. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    52
    Yes, there are many versions for sale and all of them seem good on the surface. But have you audited them in any way?

    How many hits do you get for Pleasonton, Emmitsburg, and Emmitsburg Road in s1, v27, p3?
     
  7. connecticut yankee

    connecticut yankee Private

    Joined:
    Jun 2, 2017
    Messages:
    113
    Thanks for the heads-up. I'm sure many of us on this forum use Cornell's O.R. site and just assume the search is complete and accurate.
     
  8. Jimklag

    Jimklag Captain Silver Patron Trivia Game Winner

    Joined:
    Mar 3, 2017
    Messages:
    6,567
    Location:
    Chicagoland, Land of Lincoln
    No. I am not a professional historical researcher. I bought the cd-rom to help me learn more about what I read. I have yet to find a discrepancy when looking up a footnote or an author's citation. That's good enough for me.
     
  9. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    52
    Would you mind checking for Pleasonton, Emmitsburg, and/or Emmitsburg Road in s1, v27, p3 and let us know how many hits you get?
     
  10. Jimklag

    Jimklag Captain Silver Patron Trivia Game Winner

    Joined:
    Mar 3, 2017
    Messages:
    6,567
    Location:
    Chicagoland, Land of Lincoln
    I'll take a look when I get back to my computer. I'm on a tablet right now.
     
  11. Eric Calistri

    Eric Calistri 2nd Lieutenant

    Joined:
    May 31, 2012
    Messages:
    2,763
    Location:
    Austin Texas

    Hi Bob. I have owned the Guild Press CD for a long time, since about 2000. I don't know about newer versions that may be available; the user interface is strictly old school, but I am used to it. I also use the Cornell online version. The Guild Press CD text is way more accurate, and thus more complete in a search, but it does not have the page images that Cornell has. I'll use one, the other, or both, depending on what end result I need.

    On the Guild Press search for Pleasonton in that volume I am getting 33 hits, Emmitsburg 12 and "Emmitsburg Road" 4 hits. I think when multiple references are on the same page, the page comes up as a single hit.

    A couple screen shots:

    Screenshot 2017-10-10 10.00.44.png Screenshot 2017-10-10 10.01.13.png
     
  12. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    52
    In my own scan, I have recorded each piece of correspondence according to the page number that it started on. So it's possible, e.g., that a letter starts on Page 20, continues onto Page 21, and the reference is actually on Page 21 where there happens to be another letter with another reference (causing an undercount compared to yours). On the other hand, a single letter might span two pages with one reference on each page (causing an overcount compared to yours). But they should mostly cancel each other out, I would think.

    In other words, there will be small differences compared to the way Guild Press counts them. But with that in mind, I have:

    Pleasonton - 320 references in 280 letters on 172 unique pages.
    Emmitsburg - 122 references in 78 letters on 55 unique pages.
    Emmitsburg Road - 10 references in 8 letters on 7 unique pages.

    Any way you count it, if you're getting Pleasonton=33, Emmitsburg=12, and Emmitsburg Road=4, then that's a huge difference!

    My search will find Pleasonton and PLEASONTON. Could it be that your search is case-sensitive? The signature line on most letters is in all caps but other references are not.
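
    The three kinds of counts above (references, letters, unique pages) can be tallied along these lines. This is a sketch under stated assumptions: the Letter record, the tally function, and the sample letters are all hypothetical, and each letter is filed under the page it starts on, as described in this post.

```python
import re
from dataclasses import dataclass

@dataclass
class Letter:
    start_page: int  # page on which the letter begins
    text: str

def tally(letters, term):
    """Return (references, letters with a hit, unique start pages)."""
    # re.IGNORECASE lets "Pleasonton" also match an all-caps signature line.
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    references = 0
    letters_with_hit = 0
    pages = set()
    for letter in letters:
        n = len(pattern.findall(letter.text))
        if n:
            references += n
            letters_with_hit += 1
            pages.add(letter.start_page)
    return references, letters_with_hit, len(pages)

# Made-up sample letters, not quotations from the ORs.
letters = [
    Letter(20, "General Pleasonton reports the enemy moving. A. PLEASONTON, Brigadier-General."),
    Letter(20, "Send word to Pleasonton."),
    Letter(21, "No cavalry mentioned here."),
]

print(tally(letters, "Pleasonton"))  # (3, 2, 1)
```

    A case-sensitive version of the same search would skip the all-caps signature and report 2 references instead of 3, which is the kind of gap the question about case sensitivity is probing.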
     
  13. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    52
    The count for "Emmitsburg road" is pretty straightforward because none of those letters wrap onto a second page.

    pg 490: 2 references (in two different letters)
    pg 531: 1 reference
    pg 532: 1 reference
    pg 533: 2 references (in the same letter)
    pg 555: 1 reference
    pg 558: 1 reference
    pg 1087: 2 references (in the same letter)
     
  14. Eric Calistri

    Eric Calistri 2nd Lieutenant

    Joined:
    May 31, 2012
    Messages:
    2,763
    Location:
    Austin Texas
    The search can be either case-sensitive or not. For the above, it was not case sensitive. The results highlight each hit.

    If you click on the screenshots I posted, you can see them in better resolution that will enable you to look at the results.

    What Guild Press does is divide the correspondence into "topics" of about 20-25 pages (these are the #1, #2, #3, #33, #34, etc. that appear in the search results) in a way that is not present in the printed or Cornell version. Each "hit" shows that "Pleasonton", for example, shows up in that "topic" at least once. However, it could be multiple times in that "topic". There is a button to click that moves through each hit. Topic #1 has about 10 hits; topic #2 has over 30.

    So when I said 33 hits above, that was inaccurate as a count of the times the search found "Pleasonton" in the text of this volume. I should have said he appears in that many "topics."
     
    Last edited: Oct 10, 2017
  15. Eric Calistri

    Eric Calistri 2nd Lieutenant

    Joined:
    May 31, 2012
    Messages:
    2,763
    Location:
    Austin Texas
    "Emmitsburg Road" is a more manageable result size. I can summarize that search result from the Guild Press CD as follows:

    page 490 (2)
    page 531, 532, 533 (2)
    page 554, 558
    page 1087 (2)

    This seems identical to your result in #12, except for a page number discrepancy: 554 vs. 555.
     
  16. DaveBrt

    DaveBrt First Sergeant

    Joined:
    Mar 6, 2010
    Messages:
    1,265
    Location:
    Charlotte, NC
    Digital newspapers are even worse. I have seen pages from the Library of Congress's Chronicling America where I have searched for the word "rail". This should catch all the "rail" and "railroad" occurrences. I have seen pages with one hit, but the word is visible on the page over a dozen times!

    I agree that the false negatives are the real problem, especially when you have no reason to check a particular page and so never catch the misses.
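
    One way to reduce this kind of miss is approximate ("fuzzy") matching, which none of the tools discussed in this thread is known to use. The sketch below relies on Python's standard difflib module; the fuzzy_find name, the 0.8 cutoff, and the sample line are assumptions chosen for illustration, with "Einmitsburg" being one of the OCR misreadings quoted earlier in the thread.

```python
import re
from difflib import SequenceMatcher

def fuzzy_find(text, target, cutoff=0.8):
    """Return the words in text whose similarity to target is >= cutoff."""
    target = target.lower()
    matches = []
    for token in re.findall(r"[A-Za-z]+", text):
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share fewer characters in order.
        score = SequenceMatcher(None, token.lower(), target).ratio()
        if score >= cutoff:
            matches.append(token)
    return matches

# Made-up sample line containing one correct and one misread spelling.
ocr_line = "marched by Einmitsburg toward Gettysburg, then back to Emmitsburg"

print(fuzzy_find(ocr_line, "Emmitsburg"))
```

    On this sample the search returns both "Einmitsburg" and "Emmitsburg" while leaving "Gettysburg" out. The cutoff is a trade-off: lowering it catches worse misreadings but starts pulling in unrelated place names, so every fuzzy hit still needs to be checked against the page image.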
     
  17. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    52
    Well, that sounds good! I found a copy and bought it ($29!). I'll report back after I've had some time to work with it. Thanks.
     
  18. Eric Calistri

    Eric Calistri 2nd Lieutenant

    Joined:
    May 31, 2012
    Messages:
    2,763
    Location:
    Austin Texas

    Well, that's a lot less than I paid for mine all those years ago. Hope it works for you!
     
