OCR: Making an Image Searchable

What does OCR mean?  Not a bad question.  OCR stands for Optical Character Recognition.  Of course, the name alone doesn't fully explain what OCR is or what it does.  OCR is a process performed by a program on your system.  Once activated, the OCR process is like a pair of virtual eyes taking a peek at the documents you selected on your computer or hard drive.  (Don't worry, these eyes only look at images, not users!)  When the eyes recognize a character (a letter, number, or symbol) on your document, the OCR program saves that character to an output file.  The OCR process then moves on to the next character and records it in the same way.  This repeats until the OCR program has worked through all of the selected documents.

After the OCR program has finished with the documents, you can run searches on the results.  This saves the document reviewer time, because “key people” or “issues” can be located within a document without reading the whole document.  Many databases let you load OCR text and search for the same key people or issues across the entire document population.
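The search step described above can be sketched in a few lines of Python.  This is a minimal illustration, not how any particular review database works: the function name search_ocr_text, the document IDs, and the sample text are all hypothetical, and the match is a simple case-insensitive substring test.

```python
from collections import defaultdict


def search_ocr_text(documents, terms):
    """Return {term: [doc_id, ...]} listing the documents whose OCR text contains each term.

    documents: dict mapping a document ID to its OCR text (hypothetical layout).
    terms: the key people or issues to look for.
    """
    hits = defaultdict(list)
    for doc_id, text in documents.items():
        lowered = text.lower()  # compare in lowercase so the search ignores case
        for term in terms:
            if term.lower() in lowered:
                hits[term].append(doc_id)
    return dict(hits)


# Hypothetical OCR text for three documents.
documents = {
    "DOC-0001": "Memo from Jane Smith regarding the merger timeline.",
    "DOC-0002": "Invoice for office supplies.",
    "DOC-0003": "Email: Smith and Jones discuss the merger.",
}

results = search_ocr_text(documents, ["Smith", "merger", "Jones"])
for term, doc_ids in results.items():
    print(term, "->", doc_ids)
```

A real database would normally build an index rather than scan every document for every term, but the idea is the same: once the text exists, the reviewer searches it instead of reading each page.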

Not all OCR software programs are equal.  A more accurate OCR program yields better results because it recognizes more characters correctly and saves them to the output file.  Always consider OCR accuracy in addition to speed when evaluating a product offering OCR capabilities.
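One common way to put a number on OCR accuracy is to compare the OCR output against a hand-checked version of the same text and count the character-level differences.  The sketch below assumes you have such a reference text for a sample page; the function names and sample strings are hypothetical, and real products may measure accuracy differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Fraction of the reference text recovered, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical sample: the OCR engine misread "m" as "rn" in two words.
reference = "Smith signed the merger agreement."
ocr_output = "Srnith signed the rnerger agreement."

print(f"Character accuracy: {character_accuracy(reference, ocr_output):.1%}")
print("Search for 'Smith' finds it:", "Smith" in ocr_output)
```

Even a handful of misread characters matters: in this sample the page is still mostly correct, yet a search for "Smith" no longer matches.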

 

OCR Output

After the OCR process is complete, the information can be saved in different formats.  Some programs will save the OCR text into Word files.  In litigation support, we typically encounter OCR text either in .txt files or embedded in PDF files.

 

OCR – PDFs with Text

PDFs with text are PDF files with embedded text.  These files can be created either (1) by converting a text-based native file (such as a Word doc) directly to PDF or (2) by performing OCR on image files and combining the images and OCR text into a PDF file.  Either method will create a PDF with text.  Converting a text-based file directly to PDF provides the most accurate searchability.  The original file is converted to a PDF image and the text from that file is embedded into the PDF.  Since the original text is embedded into the file, this method should find all instances of the search phrase.

Creating a PDF with text using OCR produces a less accurate file.  The OCR engine must find and recognize every character on the image file, and any character it misses or misreads will not match a search.  The OCR text is then combined with the image file into a searchable PDF file.  The reviewer may then search the document for a specific word or term.

 

OCR – Txt File Output

In litigation support, we see image files accompanied by .txt files.  Many litigation support databases allow us to load the .txt files into our case along with image files.  The document records with corresponding .txt files are then searchable.  In programs such as Summation, we can search for certain terms across the entire database.
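To make the pairing of image files and .txt files concrete, here is a minimal sketch that matches each image to a .txt file with the same base name.  The naming convention, the list of image extensions, and the record layout are assumptions for illustration only; they are not how Summation or any other database actually loads data.

```python
import tempfile
from pathlib import Path

# Assumed image extensions; adjust to whatever the production set contains.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".png"}


def pair_images_with_text(folder):
    """Build one record per image, attaching the text of a same-named .txt file if present."""
    records = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        text_file = image.with_suffix(".txt")  # e.g. DOC-0001.tif -> DOC-0001.txt
        text = None
        if text_file.exists():
            text = text_file.read_text(encoding="utf-8", errors="replace")
        records.append({"doc_id": image.stem, "image": image.name, "text": text})
    return records


# Demo with a temporary folder holding two hypothetical documents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "DOC-0001.tif").write_bytes(b"")
    Path(tmp, "DOC-0001.txt").write_text("Memo from Jane Smith.", encoding="utf-8")
    Path(tmp, "DOC-0002.tif").write_bytes(b"")  # image with no OCR text
    records = pair_images_with_text(tmp)

for record in records:
    status = "searchable" if record["text"] is not None else "no OCR text"
    print(record["doc_id"], "-", status)
```

Records without a matching .txt file are kept but flagged, which mirrors the point above: only the document records with corresponding .txt files are searchable.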

OCR text is a great tool in your review arsenal.  You can use OCR text to search across a large population of documents in a database or to search individual documents for key terms.  Keep in mind that the quality of the OCR software you use has a direct effect on the OCR results.
