How to build the tesseract OCR engine on Windows/Cygwin

Optical Character Recognition (OCR) comes in handy if you need to edit text that is not available in an editable format, e.g. scans of dead-tree documents. I needed an OCR solution that works both on my Unix boxes at home and on my Windows box at work. Tesseract ships a native Windows version, although it is only a command-line tool. Building tesseract on Cygwin seemed the better idea, and some success stories about earlier versions on the web encouraged me to go ahead and try.

Since version 3, tesseract depends on the image library leptonica, so I had to build that first. I used version 1.69. A first test run of ./configure told me that it could not find a couple of graphics libraries such as libpng and libtiff. I noticed that I didn't have the development versions of these libraries installed. Remember that in order to build software that uses a particular library, the DLL alone won't do. You need the headers and the import libraries too. These are packaged separately on Cygwin. I installed the required packages using setup.exe, with the exception of a GIF library. Cygwin does not seem to ship one, probably due to licensing issues or because it's no longer relevant these days.
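A quick way to see which development headers are still missing before running ./configure is to look for them directly. The following is only a minimal sketch: the function name missing_headers is made up for this example, and the assumption that the headers live in /usr/include (png.h for libpng, tiffio.h for libtiff) may not hold for every Cygwin setup.

```shell
# Hypothetical helper: print each header that is missing from the given
# include directory. Returns non-zero if at least one header is missing.
missing_headers() {
    incdir=$1
    shift
    status=0
    for header in "$@"; do
        if [ ! -f "$incdir/$header" ]; then
            echo "$header"
            status=1
        fi
    done
    return $status
}
```

Calling missing_headers /usr/include png.h tiffio.h prints nothing if both headers are found. Any name it does print points to a development package that still needs to be installed via setup.exe.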

With all libraries in place, I configured and built leptonica like this:

./configure --without-giflib

make && make install && make clean


Next I tried to build tesseract proper. I downloaded the 3.01 sources. ./configure complained that the leptonica library was present but lacked a particular function (pixCreate, to be specific). It turned out that tesseract (or is that Cygwin?) does not include /usr/local/lib in the library search path, which caused the configure test code to fail. After that was fixed, ./configure complained about a missing Makefile.in. This called for running the autotools using the autogen.sh script contained in the source. So the proper commands to build tesseract on Cygwin are:

./autogen.sh
./configure LDFLAGS=-L/usr/local/lib

make && make install && make clean


Finally, you'll need at least one language pack to run tesseract successfully. They can be downloaded from the same location as the tesseract sources. Trial and error told me that it is sufficient to unpack the xxx.traineddata.gz files (where xxx are the three-letter codes of the languages of your choice) into /usr/local/share/tessdata.
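If you want several languages, the unpacking step can be scripted. This is a rough sketch rather than an official install procedure: the function name install_langpacks is invented for this example, and it assumes the downloaded files are plain gzip archives named xxx.traineddata.gz, as described above.

```shell
# Hypothetical helper: unpack every *.traineddata.gz found in a download
# directory into a tessdata directory, leaving the archives untouched.
install_langpacks() {
    srcdir=$1
    tessdata=$2
    mkdir -p "$tessdata" || return 1
    for archive in "$srcdir"/*.traineddata.gz; do
        [ -f "$archive" ] || continue          # glob matched nothing
        name=$(basename "$archive" .gz)        # deu.traineddata.gz -> deu.traineddata
        gunzip -c "$archive" > "$tessdata/$name" || return 1
    done
}
```

With the archives in the current directory, install_langpacks . /usr/local/share/tessdata should leave one xxx.traineddata file per language in the tessdata directory.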

The following command extracts the text from some German document provided as a TIFF scan and puts it into the file output.txt (tesseract automagically adds the suffix .txt):

tesseract input.tif output -l deu
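For a whole stack of scans, the same invocation can be wrapped in a loop. This is a sketch under assumptions: the function name ocr_all is made up here, the scans are assumed to carry the extension .tif, and the only tesseract arguments used are the ones shown in the command above.

```shell
# Hypothetical helper: run tesseract on every .tif file in a directory,
# so that page.tif produces page.txt next to it.
ocr_all() {
    dir=$1
    lang=$2
    for scan in "$dir"/*.tif; do
        [ -f "$scan" ] || continue             # glob matched nothing
        # tesseract appends .txt itself, so pass the name without extension
        tesseract "$scan" "${scan%.tif}" -l "$lang" || return 1
    done
}
```

For example, ocr_all ./scans deu processes all German scans in ./scans and stops at the first file tesseract fails on.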

Comments

bernard polarski says:

Many thanks. I just installed tesseract 3.03 easily on my cygwin.
Just had to remove the -std=c++11 from configure
Wednesday, 26 February, 16:36