Contacts

TesseRact training. TesseRact-OCR in Visual Studio - recognize the page of the text. Digitization of images one by one

Tesseract. - a free platform for optical recognition of text, the sources of which Google presented the community in 2006. If you write a software to recognize text, you probably have to access the services of this powerful library. And if she did not cope with your text, then you have one way left - to teach it. This process is quite complicated and replete with no obvious and sometimes directly with magical actions. The original description is. I needed almost a whole day to comprehend all his depths, so I want to save, I hope its more understandable option. So to help yourself and others go through this path the next time faster.

0. What we need

  • TesseRact actually.
Assembling this library is under Windows (you can download the installer from the official repository) and under Linux. For most Linux distributions, install TESSERACT can simply be installed through sudo Apt-Get Install Tesseract-OCR, I have to add source for my fashionable electary os:
gedit /etc/apt/sources.list.
dEB http://notesalexp.net/debian/precise/ Precise Main
wget -o - http://notesalexp.net/debian/alexp_Key.Asc
aPT-KEY Add alexp_key.asc
aPT-GET UPDATE
aPT-Get Install Tesseract-OCR
  • Image with text for training
It is desirable that it was a real text, which then will have to recognize. It is important that each font symbol met in the scanned fragment is at least 5 times, and preferably 20 times. TIFF format, without compression, preferably not multiplocked. Between all the characters should be clearly distinguishable gaps. Put our image into a separate directory and call in the form<код языка>.<имя шрифта>.exp<номер>.tif The image may not be one and differ they should only number in the file name. Format file names are very important. On files with incorrect names of the utility that we will use will swear to segmentation errors, etc. For definiteness, we assume that we study the SCC language and the EEE font. Thus, call the file with the scanning sample CCC.Eee.Exp0.tif

1. Create and edit the Box file
For. To mark the characters on the image and set their matches of UTF-8 text symbols are Box files. These are the usual text files in which each character corresponds to a string with a symbol and rectangle coordinates in pixels. Initially, you generate a utility from the TESSERACT package:
tesseRact CCC.Eee.Eee.Exp0 Batch.Nochop MakeBox
Received CCC.Eee.Exp0.Box file in the current directory. Look at it. Symbols at the beginning of the line fully match the characters in the file? If so, then you don't need to train anything, you can sleep well. In our case, most likely the characters will not coincide any of the essentially in number. Those. TesseRact with the default dictionary did not recognize not only the characters but also considered some of them for two or more. Perhaps part of the characters we will "stick", i.e. will fall into a common box and will be recognized as one. It all needs to fix before you go further. Work tedious and painstaking, but fortunately there are a number third-party utilities. For example, I used PytesTrainer-1.03. We open the image to them, the box-file with the same name he himself pulls out.
Half a day passed ... You with a sense of deep satisfaction with the PyteSeractTrainer (you did not forget to save the result, right?) And you have a correct Box file. Now you can go to the next step.

2. Training Tesseract
tesseRact CCC.Eee.Eee.EEE.EEE.EEE.EE.EEE.EVP0 NOBATCH BOX.TRAIN
We get a lot of errors, but we are looking at something like "Found 105 Good Blobs". If the figure is significantly greater than the number of "studied" characters, that is, the chance that the training as a whole has succeeded. Otherwise, we return to the beginning. As a result of this step, you have a CCC.Eee.EEE.Exp0.tr file

3. Remove the symbol set
unicharset_Extractor CCC.Eee.Exp0.Box.
We receive a set of characters as a Unicharset file in the current directory, where each character and its characteristics are located in a separate line. Here our task will check and correct the characteristics of the characters (the second column in the file). For small letters of the alphabet, we put a sign 3 for large 5, for the punctuation of 10 for numbers 8, everything else (type + \u003d -) marked 0. Chinese and Japanese hieroglyphs We note 1. Usually all signs are right, so this stage is a lot of time You will not take.

4. Describe the font style
Create a CCC.Font_Properties file with a single string: Eee 0 0 0 0 0. Here you first write the name of the font, then the number 1 or 0 is labeled the presence of style characters (respectively, ITALIC BOLD Fixed Serif Fraktur). In our case there are no styles, so we leave everything in zeros.

5. Clusters of figures, prototypes and other magic
For further studies, we need to perform three more operations. You can try to understand their meaning from the official description, I was not before :). Just perform:
shapeclustering -f CCC.Font_Properties -u Unicharset CCc.Eee.Exp0.tr
... file shapetable
and then:
mFTRAINING -F CCC.Font_Properties -u Unicharset -o CCC.UNICHARSET CCC.Eee.Exp0.tr
... get files cCC.Unicharset, Inttemp, PFFMTABLE
Finally:
cntraining CCC.Eee.Exp0.tr.
... We get the NORMPROTO file.

6. Dictionaries
The theoretically filling of the dictionaries of frequently used words (and words at all) helps TesseRact-u to deal with your doodles. Understanding dictionaries, but if you suddenly want, make filesfrequent_Words_List and Words_List in which enter (each with new String) Accordingly, frequently used and simply words of the language.
To convert these lists to the correct format, perform:
wordList2Dawg Frequent_Words_List CCC.FREQ-DAWG CCC.UNICHARSET

wordList2Dawg Words_List CCC.WORD-DAWG CCC.UNICHARSET

7. Last mysterious file
His name is UnicharamBigs. In theory, he must pay attention to the TesseRact on similar symbols. it text file In each row with the separators, the tabs are described by pairs of strings that can be confused when recognizing. Fully file format is described in the documentation, it was not needed and I left it empty.

8. Last team
All files must be renamed so that their names begin with the name of the language. Those. Only files will remain in the directory:

cCC.BOX.
cCC.INTTEMP.
cCC.PFFMTABLE
cCC.tif.
cCC.Font_Properties.
cCC.NORMPROTO.
cCC.Shapetable
cCc.tr.
cCC.Unicharset.

And finally, we perform:
cOMBINE_TESSDATA CCC.
(!) The point is obligatory. As a result, we get the CCC.TrainedData file, which will allow us further to recognize our mysterious new language.

9. We check whether it was worth it :)
Now let's try to recognize our sample using the already trained TESSERACT-A:
sudo CP CCC.TrainedData / USR / Share / Tesseract-OCR / TESSDATA /
tesseRact CCC.tif Output -L CCC
Now we look at Output.txt and rejoice (or getting up-minded, depending on the result).

We needed to improve the workflow in our company, first of all, to increase the speed of processing paper documents. To do this, we decided to develop software solution On the basis of one of the OCR (Optical Character Recognition) libraries.

OCR, or optical text recognition, is a mechanical or electronic conversion of printed text images into a machine. OCR is a method of digitizing printed text so that it can be electronically saved, edited, is displayed and applied in such machine processes as cognitive calculations, machine translation and intelligent data analysis.

In addition, OCR is used as a method for entering information from paper documents (including financial entries, business Cards, invoices and much more).

Before implementing the application itself, we conducted a thorough analysis of the three most popular OCR libraries in order to determine the most appropriate option to solve our tasks.

We analyzed the three most popular OCR libraries:

- Google Text Recognition API

Google Text Recognition API

Google Text Recognition API is the process of detecting text in images and video streams and recognition text contained in it. After detection, the recognizer determines the actual text in each block and breaks it to words and lines. He discovers the text of various languages \u200b\u200b(French, German, English, etc.) in real time.

It is worth noting that, in general, this OCR has coped with the task. We got the opportunity to recognize the text both in Real-Time and with already ready-made images of text documents. In the course of the analysis of this library, we revealed both the advantages and disadvantages of its use.

Benefits:

- the ability to recognize text in real time

- the ability to recognize text from images;

- small size library;

- High recognition speed.

Disadvantages:

- Large size files with trained data (~ 30MB).

Tesseract.

TesseRact is an open source OCR library for different operating systems. It is free software, issued under the Apache license, version 2.0, supports various languages.

The development of TesseRACT was funded company Google Since 2006, the time when it was considered one of the most accurate and efficient OCR libraries with open source.

Be that as it may, at that time, the results of the introduction of TesseRact we remained not very satisfied, because The library is incredibly volumetric and does not allow to recognize the text in real time.

Benefits:

- has an open source code;

- Accordingly, it is enough to teach OCR to recognize the desired fonts and improve the quality of recognizable information. After fast settings Libraries and learning The quality of recognition results rapidly increased.

Disadvantages:

- insufficient recognition accuracy, which is eliminated by training and learning the recognition algorithm;

- To recognize text in real time, additional processing of the resulting image is required;

- Small recognition accuracy when using standard files with font data, words and symbols.

Anyline

Anyline provides multi-platform SDK, which allows developers to easily integrate OCR functions in applications. This OCR library attracted us by numerous capabilities setting recognition parameters and provided by models to solve specific applied tasks. It is worth noting that the library is paid and intended for commercial use.

Benefits:

- pretty simple setup recognition of the desired fonts;

- text recognition in real time;

- Easy and convenient setting of recognition parameters;

- the library can recognize barcodes and QR codes;

- Provides finished modules To solve various tasks.

Disadvantages:

- low recognition speed;

- to obtain satisfactory results required initial setup Fonts for recognition.

In the course of the analysis performed to solve our tasks, we were staying on Google Text Recognition API, which combines high speed in itself, light setting and high recognition results.

The solution developed by us allows you to scan paper documents automatically digitize them and save to a single database. The quality of recognizable information is about 97%, which is a very good result.

Due to the introduction of the developed system internal (including document processing, their creation and exchange between departments, etc.) was accelerated by 15%.

It took me to get the values \u200b\u200bof the clogged numbers. Numbers robbed From the screen.

I thought, and not try to try OCR? I tried TesseRact.

Below I will tell you how I tried to adapt TesseRact, why did I train it, and what happened from it. The project on Hithabe lies a CMD script that automates how much time the workout process is possible, and the data on which I conducted training. In a word, there is everything you need to train TesseRact to train something useful.

Preparation

Cloning the repository or download zIP archive (~ 6MB). Install Tesserator 3.01 with from.Syta. If it is no longer there, then from the ZIP archive subdirectory / Distros.

Go to the samples folder, run montage_all.cmd.
This script will create a final image. samples / total.png., you can not start the script, because I already placed it in the root folder of the project.

Why train?

Perhaps and without training the result will be good? Check.
./exp1 - As IS\u003e TesseRact ../total.png Total

Position the corrected result in the file model_total.txtTo compare the results of recognition. An asterisk marks the wrong values.

model_total.txt Recognition
default
27
33
39
625.05
9
163
1,740.10
15
36
45
72
324
468
93
453
1,200.10
80.10
152.25
158.25
176.07
97.50
170.62
54
102
162
78
136.50
443.62
633.74
24
1,579.73
1,576.73
332.23
957.69
954.69
963.68
1,441.02
1,635.34
50
76
168
21
48
30
42
108
126
144
114
462
378
522
60
240
246
459.69
456.69
198
61
255
27
33
39
525 05*
9
153*
1,740 10*
15
35*
45
72
324
455*
93
453
1,200 10*
50 10*
152 25*
155 25*
175 07*
97 50*
170 52*
54
102
152*
75*
135 50*
443 52*
533 74*
24
1,579 73*
1,575 73*
332 23*
957 59*
954 59*
953 55*
1,441 02*
1,535 34*
50
75*
155*
21
45*
30
42
105*
125*
144
114
452*
375*
522
50*
240
245*
459 59*
455 59*
195*
51*
255

default recognition errors

It can be seen that there are many errors. If you look closely, you can see that the decimal point is not recognized, the numbers 6 and 8 are recognized as 5. Will the workout help get rid of errors?

Workout

Training TesseRact allows you to ride it on recognizing images of texts in the form in which you will feed it similar images in the recognition process.
You pass TesseRact training images, correct the recognition errors and transmit these directions to the Tesseract-y. And he corrects the coefficients in his algorithms to continue to prevent the errors you found.

To perform workout you want to run. / Exp2 - Trained\u003e Train.cmd

What is the workout process? And in the fact that TesseRACT processes the training image and forms so-called. Boxes of characters - highlights individual characters from the text creating a list of restricting rectangles. At the same time, he makes guessing about what kind of symbol is limited by a rectangle.

The results of this work writes to the Total.Box file, which looks like this:
2 46 946 52 956 0
7 54 946 60 956 0
3 46 930 52 940 0
3 54 930 60 940 0
3 46 914 52 924 0
9 53 914 60 924 0
6 31 898 38 908 0
2 40 898 46 908 0
5 48 898 54 908 0
0 59 898 66 908 0

Here in the singel column symbol, and in 2 - 5 columns coordinates of the lower left corner of the rectangle, its height and width.

Of course, it is difficult to edit it manually and uncomfortable, so graphic utilities facilitating this work were created by enthusiasts. I used to be written in Java.

After startup. / EXP2 - Trained\u003e Java -Jar JTESSBOXEDITOR-0.6JTESSBOXEditor.jar You need to open the file. / Exp2 - trained / total.png, the file will be automatically open. / Exp2 - trained / total.box and rectangle defined in it Will be superimposed on the training image.

In the left part, the contents of the Total.Box file are given on the right there is a training image. Above the image is the active line of the Total.Box file

Blue depicted boxes, and red - boxing corresponding to the active row.

I corrected all the wrong 5-ki on the right 6-ki and 8th, added rows with definitions of all decimal points in the file and saved Total.Box

After editing is complete, it is necessary that the script continues to work, you need to close the JTESSBOXEDITOR. Further, all actions are performed by a script automatically without user participation. The script writes the results of training under the code TTN

To use the learning outcomes when recognizing, you need to start TESSERACT C key -L TTN
./exp2 - Trained /\u003e Tesseract ../total.png Total-Trained -l TTN

It can be seen that all the numbers began to be recognized correctly, but the decimal point is still not recognized.

model_total.txt Recognition
default
Recognition
after workout
27
33
39
625.05
9
163
1,740.10
15
36
45
72
324
468
93
453
1,200.10
80.10
152.25
158.25
176.07
97.50
170.62
54
102
162
78
136.50
443.62
633.74
24
1,579.73
1,576.73
332.23
957.69
954.69
963.68
1,441.02
1,635.34
50
76
168
21
48
30
42
108
126
144
114
462
378
522
60
240
246
459.69
456.69
198
61
255
27
33
39
525 05*
9
153*
1,740 10*
15
35*
45
72
324
455*
93
453
1,200 10*
50 10*
152 25*
155 25*
175 07*
97 50*
170 52*
54
102
152*
75*
135 50*
443 52*
533 74*
24
1,579 73*
1,575 73*
332 23*
957 59*
954 59*
953 55*
1,441 02*
1,535 34*
50
75*
155*
21
45*
30
42
105*
125*
144
114
452*
375*
522
50*
240
245*
459 59*
455 59*
195*
51*
255
27
33
39
625 05*
9
163
1,740 10*
15
36
45
72
324
468
93
453
1,200 10*
80 10*
152 25*
158 25*
176 07*
97 50*
170 62*
54
102
162
78
136 50*
443 62*
633 74*
24
1,579 73*
1,576 73*
332 23*
957 69*
954 69*
963 68*
1,441 02*
1,635 34*
50
76
168
21
48
30
42
108
126
144
114
462
378
522
60
240
246
459 69*
456 69*
198
61
255

learning Recognition Errors

Enlarge image

You can increase in different ways, I tried two ways: scale and resize

total-scaled.png (fragment) total-resized.png (fragment)
Convert Total.png Total-Scaled.png -Scale "208x1920" convert Total.png Total-resized.png -Resize "208x1920"

Since the images of the characters increased with the images themselves, the training data under the code TTN is outdated. Therefore, then I recognized without a key -L TTN.

It can be seen that in the image Total-scaled.png Tesseract confuses 7-ku with a 2nd, and it does not confuse Total-resized.png. On both images, the decimal point is correct. Image recognition Total-resized.png is almost perfect. There are only three errors - the gap between the numbers in numbers 21, 114 and 61.

But this error is not critical, because It is easy to fix it by simply removal from the lines of spaces.

total-Scaled.png Recognition Errors

total-Resized.png Recognition Errors

model_total.txt Recognition
default
Recognition
after workout
total-scaled.png. total-resized.png.
27
33
39
625.05
9
163
1,740.10
15
36
45
72
324
468
93
453
1,200.10
80.10
152.25
158.25
176.07
97.50
170.62
54
102
162
78
136.50
443.62
633.74
24
1,579.73
1,576.73
332.23
957.69
954.69
963.68
1,441.02
1,635.34
50
76
168
21
48
30
42
108
126
144
114
462
378
522
60
240
246
459.69
456.69
198
61
255
27
33
39
525 05*
9
153*
1,740 10*
15
35*
45
72
324
455*
93
453
1,200 10*
50 10*
152 25*
155 25*
175 07*
97 50*
170 52*
54
102
152*
75*
135 50*
443 52*
533 74*
24
1,579 73*
1,575 73*
332 23*
957 59*
954 59*
953 55*
1,441 02*
1,535 34*
50
75*
155*
21
45*
30
42
105*
125*
144
114
452*
375*
522
50*
240
245*
459 59*
455 59*
195*
51*
255
27
33
39
625 05*
9
163
1,740 10*
15
36
45
72
324
468
93
453
1,200 10*
80 10*
152 25*
158 25*
176 07*
97 50*
170 62*
54
102
162
78
136 50*
443 62*
633 74*
24
1,579 73*
1,576 73*
332 23*
957 69*
954 69*
963 68*
1,441 02*
1,635 34*
50
76
168
21
48
30
42
108
126
144
114
462
378
522
60
240
246
459 69*
456 69*
198
61
255
22*
33
39
625.05
9
163
1,240.10*
15
36
45
22*
324
468
93
453
1,200.10
80.10
152.25
158.25
126.02*
92.50*
120.62*
54
102
162
28*
136.50
443.62
633.24*
24
1,529.23*
1,526.23*
332.23
952.69*
954.69
963.68
1,441.02
1,635.34
50
26*
168
2 1*
48
30
42
108
126
144
1 14*
462
328*
522
60
240
246
459.69
456.69
198
6 1*
255
27
33
39
625.05
9
163
1,740.10
15
36
45
72
324
468
93
453
1,200.10
80.10
152.25
158.25
176.07
97.50
170.62
54
102
162
78
136.50
443.62
633.74
24
1,579.73
1,576.73
332.23
957.69
954.69
963.68
1,441.02
1,635.34
50
76
168
2 1*
48
30
42
108
126
144
1 14*
462
378
522
60
240
246
459.69
456.69
198
6 1*
255

Digitization of images one by one

OK, what if you need to digitize images one after another in real linen mode?

I try one by one.
./exp5 - one by one\u003e for / r% i in (* .png) do tesseract "% I" "% i"
Two and three-digit numbers are not determined at all!

625.05
1740.10

Digitization of small packages

And if you want to digitize images by packets by several images (6 or 10 in the package)? Ten time.
./exp6 - Ten in Line\u003e TesseRact Teninline.png Teninline

Recognized, and even without a gap among 61.

conclusions

In general, I expected the worst results, because Small raster fonts are a boundary case due to their small size, distinct granularity and constancy - different images One symbol completely coincide. And the practice has shown that the arranged artificially inspired figures are better recognized.

Pre-processing image has a greater effect than learning. Increase with smoothing: convert -resize ...

Recognition of separate "short" two and three-digit numbers of unsatisfactory - numbers need to be collected in packages.

But in general, Tesseract almost coped with the task, despite the fact that it is sharpened for other tasks - recognizing inscriptions in the photo and video, scans of documents.

Tesseract. - a free platform for optical recognition of text, the sources of which Google presented the community in 2006. If you write a software to recognize text, you probably have to access the services of this powerful library. And if she did not cope with your text, then you have one way left - to teach it. This process is quite complicated and replete with no obvious and sometimes directly with magical actions. The original description is. I needed almost a whole day to comprehend all his depths, so I want to save, I hope its more understandable option. So to help yourself and others go through this path the next time faster.

0. What we need

  • TesseRact actually.
Assembling this library is under Windows (you can download the installer from the official repository) and under Linux. For most Linux distributions, install TESSERACT can simply be installed through sudo Apt-Get Install Tesseract-OCR, I have to add source for my fashionable electary os:
gedit /etc/apt/sources.list.
dEB http://notesalexp.net/debian/precise/ Precise Main
wget -o - http://notesalexp.net/debian/alexp_Key.Asc
aPT-KEY Add alexp_key.asc
aPT-GET UPDATE
aPT-Get Install Tesseract-OCR
  • Image with text for training
It is desirable that it was a real text, which then will have to recognize. It is important that each font symbol met in the scanned fragment is at least 5 times, and preferably 20 times. TIFF format, without compression, preferably not multiplocked. Between all the characters should be clearly distinguishable gaps. Put our image into a separate directory and call in the form<код языка>.<имя шрифта>.exp<номер>.tif The image may not be one and differ they should only number in the file name. Format file names are very important. On files with incorrect names of the utility that we will use will swear to segmentation errors, etc. For definiteness, we assume that we study the SCC language and the EEE font. Thus, call the file with the scanning sample CCC.Eee.Exp0.tif

1. Create and edit the Box file
For. To mark the characters on the image and set their matches of UTF-8 text symbols are Box files. These are the usual text files in which each character corresponds to a string with a symbol and rectangle coordinates in pixels. Initially, you generate a utility from the TESSERACT package:
tesseRact CCC.Eee.Eee.Exp0 Batch.Nochop MakeBox
Received CCC.Eee.Exp0.Box file in the current directory. Look at it. Symbols at the beginning of the line fully match the characters in the file? If so, then you don't need to train anything, you can sleep well. In our case, most likely the characters will not coincide any of the essentially in number. Those. TesseRact with the default dictionary did not recognize not only the characters but also considered some of them for two or more. Perhaps part of the characters we will "stick", i.e. will fall into a common box and will be recognized as one. It all needs to fix before you go further. Work tedious and painstaking, but fortunately there are a number of third-party utilities for this. For example, I used PytesTrainer-1.03. We open the image to them, the box-file with the same name he himself pulls out.
Half a day passed ... You with a sense of deep satisfaction with the PyteSeractTrainer (you did not forget to save the result, right?) And you have a correct Box file. Now you can go to the next step.

2. Training Tesseract
tesseRact CCC.Eee.Eee.EEE.EEE.EEE.EE.EEE.EVP0 NOBATCH BOX.TRAIN
We get a lot of errors, but we are looking at something like "Found 105 Good Blobs". If the figure is significantly greater than the number of "studied" characters, that is, the chance that the training as a whole has succeeded. Otherwise, we return to the beginning. As a result of this step, you have a CCC.Eee.EEE.Exp0.tr file

3. Remove the symbol set
unicharset_Extractor CCC.Eee.Exp0.Box.
We receive a set of characters as a Unicharset file in the current directory, where each character and its characteristics are located in a separate line. Here our task will check and correct the characteristics of the characters (the second column in the file). For small letters of the alphabet, we put a sign 3 for large 5, for the punctuation of 10 for numbers 8, everything else (type + \u003d -) marked 0. Chinese and Japanese hieroglyphs We note 1. Usually all signs are right, so this stage is a lot of time You will not take.

4. Describe the font style
Create a CCC.Font_Properties file with a single string: Eee 0 0 0 0 0. Here you first write the name of the font, then the number 1 or 0 is labeled the presence of style characters (respectively, ITALIC BOLD Fixed Serif Fraktur). In our case there are no styles, so we leave everything in zeros.

5. Clusters of figures, prototypes and other magic
For further studies, we need to perform three more operations. You can try to understand their meaning from the official description, I was not before :). Just perform:
shapeclustering -f CCC.Font_Properties -u Unicharset CCc.Eee.Exp0.tr
... file shapetable
and then:
mFTRAINING -F CCC.Font_Properties -u Unicharset -o CCC.UNICHARSET CCC.Eee.Exp0.tr
... get files cCC.Unicharset, Inttemp, PFFMTABLE
Finally:
cntraining CCC.Eee.Exp0.tr.
... We get the NORMPROTO file.

6. Dictionaries
The theoretically filling of the dictionaries of frequently used words (and words at all) helps TesseRact-u to deal with your doodles. Understanding dictionaries, but if you suddenly want, make filesfrequent_Words_List and Words_List in which you enter (each from a new line), respectively, often used and simply words of the language.
To convert these lists to the correct format, perform:
wordList2Dawg Frequent_Words_List CCC.FREQ-DAWG CCC.UNICHARSET

wordList2Dawg Words_List CCC.WORD-DAWG CCC.UNICHARSET

7. Last mysterious file
His name is UnicharamBigs. In theory, he must pay attention to the TesseRact on similar symbols. This text file in each row with tab separators is described by pairs of strings that can be confused when recognizing. Fully file format is described in the documentation, it was not needed and I left it empty.

8. Last team
All files must be renamed so that their names begin with the name of the language. Those. Only files will remain in the directory:

cCC.BOX.
cCC.INTTEMP.
cCC.PFFMTABLE
cCC.tif.
cCC.Font_Properties.
cCC.NORMPROTO.
cCC.Shapetable
cCc.tr.
cCC.Unicharset.

And finally, we perform:
cOMBINE_TESSDATA CCC.
(!) The point is obligatory. As a result, we get the CCC.TrainedData file, which will allow us further to recognize our mysterious new language.

9. We check whether it was worth it :)
Now let's try to recognize our sample using the already trained TESSERACT-A:
sudo CP CCC.TrainedData / USR / Share / Tesseract-OCR / TESSDATA /
tesseRact CCC.tif Output -L CCC
Now we look at Output.txt and rejoice (or getting up-minded, depending on the result).

TESSERACT-OCR is a free library for text recognition. In order to connect it to you need to download the following components:
Leptonica - http://code.google.com/p/leptonica/downloads/detail?name\u003dleptonica-1.68-win32-lib-include-dirs.zip.
The latest version of TeesRact-OCR (on this moment 3.02) - https://code.google.com/p/tesseract-ocr/downloads/detail?name\u003dtesshant-3.02.02-win32-lib-include-dirs.zip&can\u003d2&q\u003d
Labor learning data - https://tessract-ocr.googlecode.com/files/Tesseract-ocr-3.02.rus.tar.gz
Everything can be collected yourself by downloading source codesBut we will not do this.

By creating a new project, connect paths to LIB and H files. And write a simple code.
#Include. #Include. INT MAIN () (Tesseract :: TessBaseAPI * Myocr \u003d New Tessract :: TessBaseAPI (); PrintF ("Tesseract-Ocr Version:% S \\ N", MyOco-\u003e Version ()); PrintF ("Leptonica Version:% s \\ n ", GetleptonicaSION ()); if (myocr-\u003e init (," rus ")) (fprintf (STDERR," COULD NOT INITIALIZE TESSERACT. \\ n "); EXIT (1);) PIX * PIX \u003d Pixread ("test.tif"); myocr-\u003e setImage (PIX); char outtext; lstrcpy (outtext, myocr-\u003e getutf8text ()); printf ("OCR OUTPUT: \\ N \\ n"); PrintF (outText); myocr -\u003e Clear (); myocr-\u003e end (); pixdestroy (& pix); Return 0;)

Connect LIB files:
libteseratorAct302.Lib.
liblept168.lib

Complete - the program is successfully created. As an example, take the following picture:

We run the program so that the information is displayed in the file (since UTF-8 in the Capo console will be):
Test\u003e A.Txt.

File content below:
Tesseract-OCR Version: 3.02
Leptonica Version: Leptonica-1.68 (Mar 14 2011, 10:47:28)
OCR OUTPUT:

, Substituting this expression in (63), we see that
Titled single-band, signal is industrial
and the depth of the modulation is equal to a.
7 envelope ho) primary signal directly
It is impossible to observe the oscilloscope, so
How this signal is narrow-band, and we
'Case "clarity" is absent, but
With single-band modulation, narrow
'Salmon signal with the same envelope, and then she
"And manifests itself explicitly and sometimes (as in op
"Sunny case) makes confusion in the minds of poorly
and researchers ..
6.4. "Costa Formula"
G U.
With the advent of OM in textbooks, coffee `
Estatys and monographs debaught the question of
how winning gives the transition from amplitude
Modulations to single-lane. It was much expressed
Disordered opinions. At the beginning of the 60s of Ame-
Rican scientist J. Kostas wrote that, see
Trebuned journal literature on Ohm, he
discovered in each article its estimate of energy
"Clear winnings about AM-two to
Multitious dozen. As a result, he installed
- Winning, indicated in each article,
It makes approximately (z-k-s!) dB, where M-number co-'g
\u003e Authors of this article.
E, '11 If this joke and inaccurate, it is still correct
'Music reflects that maturity that existed
; In those years. In addition to the fact that different authors
D climbed a comparison in various conditions and once-m
, Nomu determined the energy winnings, they are so- 1
', `The same allowed a lot of different mistakes. four "
, `Here are examples of some reasoning. ",
1. With usual AM, believing the power 'carrier



Did you like the article? Share it