Extracted Text and OCR Text

Now that you’re processing native files with Breeze eDiscovery Suite, you’ll see some new glossary terms in our instructions.  I want to start talking about these terms and how they apply to your processing.  Let’s begin with “extracted text” and talk about how extracted text is similar to and different from OCR text.

In a previous blog post we talked about OCR.  OCR (optical character recognition) is a processing option that produces TXT files the end user can load into their database in order to perform text searches on their documents.  When documents are processed for OCR, the software program being used “looks” at the page images using its computer eyes, searches for characters (letters, numbers, symbols), and saves those characters to a TXT file.  Then, the TXT file is loaded into the database and linked to the image that was OCR’d.

When processing native files, we can forgo some of the OCR processing and extract the text that already exists in the native files.  Huh?!  Ok, let’s break this down.  Some files are text based, like Word, WordPerfect, email, and Excel files.  The text in these files is stored as actual characters that can be added to or removed, and because they contain text in that form, they are considered text based.  During electronic document discovery software processing (EDD software processing), the software performs the metadata extraction, load file creation, and text extraction.

Let’s think of native files with extractable text as a jar of jellybeans.  Still with me?  Good!  Each jellybean is a word, letter, number, or symbol.  When the eDiscovery software processes each jar of jellybeans (native file with extractable text), each jellybean is duplicated and saved to a new jar, aka a TXT file.  Now, each TXT file contains the same number of jellybeans as the original native file.  The bonus to this process is that each jellybean is a complete replica of the original jellybean in the original jar we processed.  The same goes for extracted text.  Since we are taking the words, letters, numbers, and symbols directly from the native file and copying the information to a TXT file, the accuracy of the TXT file created is 100%.
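For readers who like to see the jellybeans counted, here is a minimal Python sketch of the idea.  It is only an illustration, not how Breeze or any other tool measures accuracy: the helper name char_accuracy and both sample strings are made up, and the "OCR" string simply imitates the kind of misreads OCR can produce (a letter O where a zero belongs).

```python
from difflib import SequenceMatcher


def char_accuracy(original: str, captured: str) -> float:
    """Return how closely two strings match, from 0.0 (nothing) to 1.0 (identical)."""
    return SequenceMatcher(None, original, captured).ratio()


# Hypothetical sample: the text stored inside a native file.
native_text = "Invoice 1001 total: $250.00"

# Extraction copies the characters directly, so the copy is identical.
extracted_text = native_text

# Made-up OCR result with typical misreads (letter O in place of zero).
ocr_text = "Invoice 1OO1 total: $25O.OO"

print("extracted:", char_accuracy(native_text, extracted_text))
print("ocr:      ", char_accuracy(native_text, ocr_text))
```

The extracted copy scores exactly 1.0 because every character was copied rather than recognized, while the imitation OCR text scores lower because some characters were misread.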

Is there still a need to OCR since I’m processing native files?

Yes.  Some files are not text based and do not contain extractable text.  Scans and digital photographs are examples.  Consider that when processing a PST file, your software may extract emails that have document scans attached.  These attachments are image files: they hold a picture of the page rather than characters, so text cannot be added to or removed from them, and thus they have no extractable text.  You end up with extracted native files, native load files, and extracted text files, but no OCR.  What do you do about the image files?  Well, you perform OCR on the non-text-based files.

It is important that your software is able to generate OCR for just the image files.  Let’s go back to our PST example that had a scanned document which was attached to an email.  Since the attached document does not have extracted text associated with it, the document is not searchable once loaded into the database.  Performing OCR on the image files will allow you to load the OCR TXT files into the database for searching.  Your software solution should allow you to OCR the image files, but keep the extracted text intact.
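The sorting step described above can be sketched in a few lines of Python.  This is a rough illustration under stated assumptions, not a description of any product's internals: the extension lists, the function names text_source and files_needing_ocr, and the sample file names are all hypothetical, and real processing software would typically inspect the file itself rather than trust the extension.

```python
from pathlib import Path

# Assumed extension lists, for illustration only.
TEXT_BASED = {".doc", ".docx", ".wpd", ".xls", ".xlsx", ".msg", ".eml", ".txt"}
IMAGE_BASED = {".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp"}


def text_source(filename: str) -> str:
    """Decide how a file's searchable text should be produced."""
    ext = Path(filename).suffix.lower()
    if ext in TEXT_BASED:
        return "extract"  # copy the text that already exists in the file
    if ext in IMAGE_BASED:
        return "ocr"      # no text inside, so it has to be recognized
    return "review"       # unknown type: flag it for a person to check


def files_needing_ocr(filenames):
    """Return only the files that should be sent to OCR."""
    return [name for name in filenames if text_source(name) == "ocr"]


# Hypothetical items extracted from a PST file.
extracted = ["email.msg", "scan001.tif", "budget.xlsx", "notes.pdf"]
print(files_needing_ocr(extracted))
```

Only the scanned attachment is queued for OCR, and the text already extracted from the email and the spreadsheet is left alone.  The PDF falls into the "review" bucket here because a PDF can be either text based or a scan.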

What other terms are you interested in discussing?  If you have any questions about this process, you can ask them via comment below.
