
Researching the digitized ORs - a cautionary tale

Discussion in 'Battle of Gettysburg' started by Bob Velke, Oct 10, 2017.

  1. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    54
    I'm really disappointed in what I'm seeing available in the digitized versions of the Gettysburg ORs (s1, v27, p1-3) so I think that I'm going to have to digitize them myself. I started with the third volume (Correspondence) and thought that I'd post some of the results here by way of a warning.

    The trouble, of course, is not false-positives. The insidious errors are the false-negatives: when the search engine fails to find what is there. If you search and find something, it is easy to assume that you're getting everything - especially if the search results show you a scan of the actual page. But that is not what you're really searching, of course. You're searching an OCRed version of the text that often has had little or no proofreading. And it is shocking to discover what you're missing!

    Cornell University's online version at http://ebooks.library.cornell.edu/m/moawar/waro.html is often cited as a good source so I started with that. It has a powerful search engine that I know is used by a lot of people. But how many people have clicked on "View entire text" to see what it is actually searching? It's scary.

    When you look at that text, the first clue of the problem is in the title: The ..."OFFJCJAL IRECOIRDS"...

    Yikes. They couldn't even proofread the title??

    For the first piece of correspondence alone (Page 3), I found several errors in the text, including:
    • Pleasonton coded as "Plea" and "sonton"
    • Washington coded as "Washing-" and "ton" (preserving the line break)
    • intended coded as "lutended"
    • positively coded as "posi.." and "tively" (another line break)
    That means that the search engine is not going to find those references. When you also count missing punctuation ("Stuart's" coded as "Stuarts"), I found more than 100 errors in the first 10 pages.
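
    A minimal sketch of why those line-break splits defeat a plain text search, and how rejoining them might recover some of the hits. The sample text and the `dehyphenate` helper are hypothetical illustrations built from the errors listed above; this is not Cornell's actual pipeline.

```python
import re

def dehyphenate(ocr_text):
    """Rejoin words that the OCR split at a line break (hypothetical helper)."""
    # "Washing-\nton" -> "Washington"; "posi..\ntively" -> "positively".
    text = re.sub(r"(\w)(?:-|\.\.)\s*\n\s*(\w)", r"\1\2", ocr_text)
    # Collapse the remaining line breaks into single spaces.
    return re.sub(r"\s*\n\s*", " ", text)

# Made-up sample imitating the kinds of errors found on Page 3.
sample = "ordered to Washing-\nton, where General Plea\nsonton posi..\ntively refused"

cleaned = dehyphenate(sample)
print("Washington" in sample, "Washington" in cleaned)
print("Pleasonton" in cleaned)
```

    Note that a split with no hyphen at all ("Plea" + "sonton") is not repaired by a rule like this, and the rule would also wrongly join genuine hyphenated compounds that happen to break at a line end, so it is no substitute for proofreading.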

    Speaking of Pleasonton, I found 320 references to him in my own scan of Part 3. Cornell lists 311. You might think 97% sounds pretty good - unless you're researching the guy and don't realize that you've missed 9 potentially-critical pieces of correspondence that are from, to, or about him.

    In another test, Cornell finds only 92 of the 122 references to Emmitsburg in that volume (or 75%). For "Emmitsburg Road", it finds just 6 of 10 (60%). Of the last group, Cornell coded the first three errors as "iEmmits-"+"bare", "Einmitsburg", and "Emmits-"+"burg", respectively. The fourth error is due to the fact that Page 555 is completely missing from their scan!
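
    One way to catch misreadings such as "Einmitsburg" would be an approximate match instead of an exact one. The sketch below uses Python's standard `difflib`; the `fuzzy_hits` helper, the 0.8 cutoff, and the sample line are assumptions for illustration only, and a loose cutoff will also produce false positives on similar place names.

```python
import re
from difflib import SequenceMatcher
from typing import List

def fuzzy_hits(text: str, target: str, cutoff: float = 0.8) -> List[str]:
    """Return the words in `text` that closely resemble `target`, ignoring case."""
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):
        score = SequenceMatcher(None, word.lower(), target.lower()).ratio()
        if score >= cutoff:
            hits.append(word)
    return hits

# Made-up line, not a quotation from the ORs.
ocr_line = "moved by the Einmitsburg road toward Frederick and Emmitsburg"
print(fuzzy_hits(ocr_line, "Emmitsburg"))
```

    This works word by word, so it would not by itself recover a name that was split across two lines.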

    I bought a CD of the whole set of ORs and the search results there are even worse.

    Those of us who have a copy might double-check against the paper index but (1) honestly, who does that? (2) you have to check three indexes, one for each volume, and (3) it is woefully incomplete too! Of the 122 references to Emmitsburg in Part 3, the index lists exactly ONE. In fact, the incompleteness of the paper index is often cited as justification for using the digital version.

    Does anyone know of a digitized copy of the Gettysburg ORs which is more reliable?
     
    Last edited: Oct 10, 2017

  3. 19thGeorgia

    19thGeorgia Sergeant

    Joined:
    Apr 4, 2017
    Messages:
    896
    Location:
    Cleburne Co
    "it is shocking to discover what you're missing!"

    Oh, yeah...

    Oh, yoali...
     
    JohnW. likes this.
  4. Jimklag

    Jimklag Captain Forum Host Silver Patron Trivia Game Winner

    Joined:
    Mar 3, 2017
    Messages:
    7,392
    Location:
    Chicagoland, Land of Lincoln
    About six months ago I purchased a CD-ROM of the Official Records which also includes A Compendium of The War Of Rebellion, Regimental Losses in the American Civil War, User's Guide to the Official Records, and Military Operations of the Civil War: A Guide to the Official Records. I have found it a very useful tool. The title of the CD is The Civil War CD-Rom and it is published by The Guild Press of Indiana.
     
  5. 19thGeorgia

    19thGeorgia Sergeant

    Joined:
    Apr 4, 2017
    Messages:
    896
    Location:
    Cleburne Co
    I have that one too. It's better than what is available online. I believe they went through it and corrected a lot of the errors in the scans.
     
    JohnW., NH Civil War Gal and Jimklag like this.
  6. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    54
    Yes, there are many versions for sale and all of them seem good on the surface. But have you audited them in any way?

    How many hits do you get for Pleasonton, Emmitsburg, and Emmitsburg Road in s1, v27, p3?
     
    JohnW. likes this.
  7. connecticut yankee

    connecticut yankee Private

    Joined:
    Jun 2, 2017
    Messages:
    233
    Thanks for the heads-up. I'm sure many of us on this forum use Cornell's O.R. site and just assume the search is complete and accurate.
     
    JohnW. likes this.
  8. Jimklag

    Jimklag Captain Forum Host Silver Patron Trivia Game Winner

    Joined:
    Mar 3, 2017
    Messages:
    7,392
    Location:
    Chicagoland, Land of Lincoln
    No. I am not a professional historical researcher. I bought the cd-rom to help me learn more about what I read. I have yet to find a discrepancy when looking up a footnote or an author's citation. That's good enough for me.
     
    JohnW. and NH Civil War Gal like this.
  9. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    54
    Would you mind checking for Pleasonton, Emmitsburg, and/or Emmitsburg Road in s1, v27, p3 and let us know how many hits you get?
     
    JohnW. likes this.
  10. Jimklag

    Jimklag Captain Forum Host Silver Patron Trivia Game Winner

    Joined:
    Mar 3, 2017
    Messages:
    7,392
    Location:
    Chicagoland, Land of Lincoln
    I'll take a look when I get back to my computer. I'm on a tablet right now.
     
    JohnW. likes this.
  11. Eric Calistri

    Eric Calistri 2nd Lieutenant

    Joined:
    May 31, 2012
    Messages:
    2,791
    Location:
    Austin Texas

    Hi Bob. I have owned the Guild Press CD for a long time, since about 2000. I don't know about newer versions that may be available. The user interface is strictly old school, but I am used to it. I also use the Cornell on-line version. The Guild Press CD text is way more accurate, and thus more complete in a search, but it does not have the page images that Cornell has. I'll use one, the other, or both depending on what end result I need.

    On the Guild Press search for Pleasonton in that volume I am getting 33 hits, Emmitsburg 12 and "Emmitsburg Road" 4 hits. I think when multiple references are on the same page, the page comes up as a single hit.

    A couple screen shots:

    Screenshot 2017-10-10 10.00.44.png Screenshot 2017-10-10 10.01.13.png
     
    JohnW., mofederal and AndyHall like this.
  12. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    54
    In my own scan, I have recorded each piece of correspondence according to the page number that it started on. So it's possible, e.g., that a letter starts on Page 20, continues onto Page 21, and the reference is actually on Page 21 where there happens to be another letter with another reference (causing an undercount compared to yours). On the other hand, a single letter might span two pages with one reference on each page (causing an overcount compared to yours). But they should mostly cancel each other out, I would think.

    In other words, there will be small differences compared to the way Guild Press counts them. But with that in mind, I have:

    Pleasonton - 320 references in 280 letters on 172 unique pages.
    Emmitsburg - 122 references in 78 letters on 55 unique pages.
    Emmitsburg Road - 10 references in 8 letters on 7 unique pages.

    Any way you count it, if you're getting Pleasonton=33, Emmitsburg=12, and Emmitsburg Road=4, then that's a huge difference!

    My search will find Pleasonton and PLEASONTON. Could it be that your search is case-sensitive? The signature line on most letters is in all caps but other references are not.
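
    The three ways of counting above (references, letters, unique pages) and the case question can be sketched as follows. The `Letter` record, the `tally` helper, and the sample letters are all hypothetical; they only illustrate the counting rules described in this post, with each letter filed under the page it starts on.

```python
import re
from typing import NamedTuple

class Letter(NamedTuple):
    start_page: int  # page on which the letter begins
    text: str        # full transcribed text of the letter

def tally(letters, term):
    """Count references, letters, and unique start pages for `term`, ignoring case."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    references = 0
    letters_with_hit = 0
    pages = set()
    for letter in letters:
        n = len(pattern.findall(letter.text))
        if n:
            references += n
            letters_with_hit += 1
            pages.add(letter.start_page)
    return references, letters_with_hit, len(pages)

# Made-up sample data, not taken from the ORs.
sample = [
    Letter(20, "General Pleasonton reports the enemy moving. A. PLEASONTON, Brigadier-General."),
    Letter(20, "Pleasonton is ordered to move at once."),
    Letter(21, "No cavalry news today."),
]
print(tally(sample, "Pleasonton"))

# A case-sensitive count misses the all-caps signature line.
case_sensitive = sum(letter.text.count("Pleasonton") for letter in sample)
print(case_sensitive)
```

    A tool that reports one hit per page or per section would return the third number (or something like it) rather than the first, which is one possible source of very different totals.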
     
    JohnW. likes this.
  13. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    54
    The count for "Emmitsburg road" is pretty straightforward because none of those letters wrap onto a second page.

    pg 490: 2 references (in two different letters)
    pg 531: 1 reference
    pg 532: 1 reference
    pg 533: 2 references (in the same letter)
    pg 555: 1 reference
    pg 558: 1 reference
    pg 1087: 2 references (in the same letter)
     
    JohnW. likes this.
  14. Eric Calistri

    Eric Calistri 2nd Lieutenant

    Joined:
    May 31, 2012
    Messages:
    2,791
    Location:
    Austin Texas
    The search can be either case-sensitive or not. For the above, it was not case sensitive. The results highlight each hit.

    If you click on the screenshots I posted, you can see them in better resolution that will enable you to look at the results.

    What Guild Press does is divide the correspondence into "topics" of about 20-25 pages (these are the #1, #2, #3, #33, #34, etc. that appear in the search results) in a way that is not present in the printed or Cornell version. Each "hit" shows that "Pleasonton", for example, shows up in that "topic" at least once. However, it could be multiple times in that "topic". There is a button to click that moves through each hit. Topic #1 has about 10 hits, topic #2 has over 30.

    So when I said 33 hits above, that was inaccurate as to the number of times the search found "Pleasonton" in the text of this volume. I should have said he appears in that many "topics."
     
    Last edited: Oct 10, 2017
    JohnW. likes this.
  15. Eric Calistri

    Eric Calistri 2nd Lieutenant

    Joined:
    May 31, 2012
    Messages:
    2,791
    Location:
    Austin Texas
    "Emmitsburg Road" is a more manageable result size. I can summarize that search result from the Guild Press CD as follows:

    page 490 (2)
    page 531, 532, 533 (2)
    page 554, 558
    page 1087 (2)

    These seem identical to your result in #12, except for a page number discrepancy: 554 vs 555.
     
    JohnW. likes this.
  16. DaveBrt

    DaveBrt First Sergeant

    Joined:
    Mar 6, 2010
    Messages:
    1,381
    Location:
    Charlotte, NC
    Digital newspapers are even worse. I have seen pages from the Library of Congress, Chronicling America where I have searched for the word "rail". This should catch all the "rail" and "railroad" occurrences. I have seen pages with one hit, but the word is visible on the page over a dozen times!

    I agree that the false-negatives are the real problems, especially when you have no reason to check a particular page and therefore catch the misses.
     
    JohnW. and mofederal like this.
  17. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    54
    Well, that sounds good! I found a copy and bought it ($29!). I'll report back after I've had some time to work with it. Thanks.
     
    JohnW. and Eric Calistri like this.
  18. Eric Calistri

    Eric Calistri 2nd Lieutenant

    Joined:
    May 31, 2012
    Messages:
    2,791
    Location:
    Austin Texas

    Well, that's a lot less than I paid for mine all those years ago. Hope it works for you!
     
    JohnW. likes this.
  19. Bob Velke

    Bob Velke Private

    Joined:
    Jan 25, 2014
    Messages:
    54
    Well, here's the follow-up report that I promised.

    As I mentioned, I found a copy of "The Civil War CD-ROM" by Guild Press online for $29 and bought it. It was v1.5 dated 1996 (ISBN 1-878208-76-4). I tried to install it on my Windows 10 system and it just wouldn't install ("Operation Failed" error). I guess that shouldn't be too surprising since the software is 21 years old. I was able to look at the contents of the CD with Windows Explorer but everything seemed to be compressed or encrypted in a way that my system didn't understand. I couldn't find any remnants of Guild Press online either. They went out of business a long time ago, I guess. I put the CD aside thinking that I might dig out an old computer some day.

    But a few days ago I was cleaning up and found a copy of the CD which I had bought many years ago. (That happens more often than I'd like to admit.) The CD case said that it was v1.6 but the date and ISBN were the same. On a lark, I put it in the CD drive and, as I expected, it didn't do anything. But then I looked at the contents of the CD and there was a SETUP.EXE. I ran it directly but, sadly, it didn't work either. But I noticed that there was also a SETUP95.EXE so I tried that instead ... and it worked!

    It installed successfully but it required the CD to be always in the drive in order to run. Well, that's a pain. Apparently, it was looking for the database in the original CD location. So I uninstalled, copied the entire contents of the CD to a folder on my hard drive, removed the CD from the drive, and ran SETUP95.EXE directly from the hard drive. It installed again and now it runs/searches MUCH faster and doesn't require the CD to be in the drive.

    There are still some side effects. It doesn't recognize my mouse wheel - so scrolling is a PITA. And it won't let me access the help file. Its help format is too old. I might dig out that old computer just for the purpose of reading the help file.

    The CD does have some powerful search features. And it is clear to me that the full text of the ORs has been retyped (!), not just OCRed. Tables and figures are intact. As Eric said, it is hard to determine exactly how many "hits" there are from a search (I'll let you know if I find that in the help file) but it would seem that it is reliable except to the extent that there might be typos in the transcription.

    The back of the newer CD case has a label over the Guild Press contact info. It says "Published by Oliver Computing, LLC, www.civilwaramerica.com". It is good to know that all of the hard work by Guild Press hasn't been completely lost. Oliver Computing's web site still lists the CD as being for sale ($69.95) but "Temporarily out of stock". I called them and left a message asking about future availability.

    Oliver Computing also sells another product, "The Complete Civil War DVD-ROM" which it says includes everything from the earlier CD plus the "Medical and Surgical History..." - which is nice - and a few other things. But it's "A $350 value for only $169.95". It's not clear whether that DVD uses the original Guild Press database and/or is compatible with Windows 10. It says that it requires "Windows 98 or above" which means that it is pretty old. Maybe someone here has this DVD and can comment on its database and compatibility?

    All of this is to agree with other writers that it is a good product IF you can find it and IF you can manage to get it to run on your computer. If you already have it installed on an older computer, my experience may give you another factor to consider before upgrading.

    Thanks to Eric for his help.
     
