Tesseract is one of the leading engines created for these purposes, and those who want to try it can do so through the Tesseract-OCR package.
Installation with multiple settings
Before using this tool, it is worth paying attention to the setup procedure, since it offers some useful extras that may be needed when handling documents in foreign languages.
More precisely, the 'Language data' section enables you to choose the desired languages and also add the math and equation detection module if you plan to extract this type of data as well.
No GUI and quick execution via Command Prompt
As soon as Tesseract-OCR is installed on your system, you can run it from the command line and start using it immediately. There are only a few parameters to apply when working on the target files, and the program's usage text explains them well enough.
The most important values are those of the 'pagesegmode' parameter, which mainly control page segmentation and image handling.
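As a rough illustration of the command-line usage described above, the following Python sketch assembles a Tesseract invocation and runs it. The helper names, the file names and the default values are hypothetical and not part of the package. It assumes the 'tesseract' executable is on the PATH and that the page segmentation value is passed with the '-psm' flag, as in the 3.x releases; newer releases spell it '--psm', so the flag is left configurable.

```python
import subprocess
from typing import List, Optional


def build_tesseract_command(image: str,
                            output_base: str,
                            lang: Optional[str] = None,
                            psm: Optional[int] = None,
                            psm_flag: str = "-psm") -> List[str]:
    """Assemble the argument list for one OCR run (hypothetical helper).

    image       -- path of the picture to recognize
    output_base -- output name WITHOUT extension
    lang        -- language code such as "eng" (optional)
    psm         -- page segmentation mode value (optional)
    psm_flag    -- assumed to be "-psm" for 3.x; use "--psm" on newer releases
    """
    cmd = ["tesseract", image, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += [psm_flag, str(psm)]
    return cmd


def run_ocr(image: str, output_base: str, **options) -> None:
    """Run Tesseract and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_tesseract_command(image, output_base, **options),
                   check=True)


if __name__ == "__main__":
    # Only prints the command; call run_ocr(...) to actually execute it.
    print(build_tesseract_command("scan.png", "scan_text", lang="eng", psm=3))
```

Building the arguments as a list, instead of a single string, avoids quoting problems with file names that contain spaces.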
Fast operation and widely supported output
One of the main strong points of Tesseract-OCR is its ability to recognize and process a variety of image file types. Another great thing about this utility is its processing speed, which should satisfy the needs of most users.
When it comes to saving the extracted content, the program generates text (TXT) files with the names you set before starting the task.
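Since the output name is chosen before the task starts, a script can work out where the text will land. The minimal Python sketch below assumes that Tesseract appends ".txt" to the output name it was given and writes UTF-8 text; the function names are hypothetical.

```python
from pathlib import Path


def expected_output_path(output_base: str) -> Path:
    """Return the TXT file expected for a given output name.

    The extension is appended rather than substituted, so an output name
    that already contains a dot (e.g. "scan.v2") keeps its full name.
    """
    return Path(output_base + ".txt")


def read_ocr_text(output_base: str, encoding: str = "utf-8") -> str:
    """Read the extracted text back; raises FileNotFoundError if absent."""
    return expected_output_path(output_base).read_text(encoding=encoding)
```

For example, an output name of "scan_text" maps to the file "scan_text.txt" in the current directory.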
Simple tool for all users
All things considered, this command-line application should not be too difficult for less experienced users to understand, as it uses a fairly simple syntax. It is quick in processing and accurate enough to be considered among the best in its category.
Reviewed by Olivian Puha on February 24th, 2014
- Moved ResultIterator/PageIterator to ccmain.
- Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
- Added paragraph detection in layout analysis/post OCR.
- Fixed inconsistent xheight during training and over-chopping.
Application description
Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little ...