Hi!
I'm working with Google Tesseract (an OCR engine) to convert scanned PDFs to XML/text files, with the goal of collecting and sorting the data into a larger data set for more in-depth analysis.
The OCR process works fine, and now I want to sort and collect the data from the XML output files into a larger database.
A typical input has the following layout:
http://image.bayimg.com/f724281a9943f3716df9c6e4d42b81ffa279a0d1.jpg
For each word or number I get the text plus its bounding box (bbox) as x0, y0, x1, y1 (a rectangle). In XML it looks like this:
<p class='ocr_par' dir='ltr' id='par_4' title="bbox 128 1462 575 1541">
<span class='ocr_line' id='line_8' title="bbox 128 1462 575 1541"><span class='ocrx_word' id='word_20' title="bbox 128 1462 575 1541">Reparationer</span>
</span>
</p>
The bbox coordinates for the paragraph and line elements ('par_4' and 'line_8' in the sample above) can be ignored, since they reflect the OCR engine's own interpretation of lines etc., which can differ from the "actual" coordinates depending on how the page is segmented during the OCR process.
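To make the input concrete, here is a minimal sketch of pulling the word-level boxes out of the fragment above. It is written in Python purely for illustration (not VBA), and `extract_words` and `BBOX_RE` are names made up for this example. It assumes the fragment is well-formed XML without a namespace; a full hOCR file is usually XHTML, so the tag names would then carry a namespace prefix and the lookup would need adjusting.

```python
import re
import xml.etree.ElementTree as ET

# The sample fragment from the post, copied verbatim.
HOCR_SNIPPET = """<p class='ocr_par' dir='ltr' id='par_4' title="bbox 128 1462 575 1541">
<span class='ocr_line' id='line_8' title="bbox 128 1462 575 1541"><span class='ocrx_word' id='word_20' title="bbox 128 1462 575 1541">Reparationer</span>
</span>
</p>"""

# Matches "bbox x0 y0 x1 y1" inside a title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def extract_words(fragment):
    """Return (text, x0, y0, x1, y1) for every ocrx_word element.

    Paragraph and line elements are skipped, so only word boxes remain.
    """
    root = ET.fromstring(fragment)
    words = []
    for elem in root.iter("span"):
        if elem.get("class") != "ocrx_word":
            continue  # ignore ocr_line and any other span
        match = BBOX_RE.search(elem.get("title", ""))
        if match is None:
            continue  # no usable bbox on this word
        x0, y0, x1, y1 = map(int, match.groups())
        text = "".join(elem.itertext()).strip()
        words.append((text, x0, y0, x1, y1))
    return words


words = extract_words(HOCR_SNIPPET)
print(words)
```

Each tuple has the same column order as the table further down (text, x0, y0, x1, y1), so the result can be fed straight into the grouping steps.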
After importing the data and manipulating it a bit I have the following result (only showing data for one line):
Word / Digit | x0 | y0 | x1 | y1 |
Årsavgifter | 123 | 813 | 505 | 909 |
och | 523 | 827 | 655 | 891 |
hyror | 689 | 827 | 655 | 891 |
029 | 2695 | 828 | 2818 | 891 |
8 | 2578 | 830 | 2655 | 890 |
417 | 2852 | 828 | 2984 | 891 |
xxx | 125 | 923 | 348 | 1014 |
All of these except the last entry should end up on the same row (i.e. be treated as having the same y0), but depending on how the engine maps the words, the coordinates can differ a bit. The distance between rows is typically on the order of 50-90 pixels, so allowing a tolerance of 20-30 pixels lets me collect all the values that belong on the same row.
The order of the digits in the table above is not correct: given the x-coordinates, the "8" should come first, followed by 029 and then 417.
The logic to build the rows and columns should be:
1. Sort by y0 coordinates
2. Group into rows given a certain "flexibility"
3. Repeat step 2 for the columns, using the x-coordinates (typically the distance between words/digits in the same column is <40 pixels and the distance between columns is >100 pixels)
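The three steps above could be sketched roughly as follows, again in Python for illustration only. The function names and the two thresholds (`ROW_TOLERANCE = 30`, `COLUMN_GAP = 70`) are assumptions chosen to sit inside the ranges mentioned in this post, not tuned values. The sketch also assumes that the "distance between words" means the gap between one word's x1 and the next word's x0. The data is the table above, copied as-is.

```python
# (text, x0, y0, x1, y1), copied from the table in the post.
WORDS = [
    ("Årsavgifter", 123, 813, 505, 909),
    ("och", 523, 827, 655, 891),
    ("hyror", 689, 827, 655, 891),
    ("029", 2695, 828, 2818, 891),
    ("8", 2578, 830, 2655, 890),
    ("417", 2852, 828, 2984, 891),
    ("xxx", 125, 923, 348, 1014),
]

ROW_TOLERANCE = 30  # assumed: within the 20-30 pixel flexibility
COLUMN_GAP = 70     # assumed: between the <40 and >100 pixel figures


def group_rows(words, tolerance):
    """Sort by y0 and start a new row when y0 jumps by more than tolerance."""
    rows = []
    last_y0 = None
    for word in sorted(words, key=lambda w: w[2]):
        if last_y0 is None or word[2] - last_y0 > tolerance:
            rows.append([])
        rows[-1].append(word)
        last_y0 = word[2]
    return rows


def group_cells(row, max_gap):
    """Sort a row by x0 and merge words whose horizontal gap is at most max_gap."""
    cells = []
    last_x1 = None
    for word in sorted(row, key=lambda w: w[1]):
        # The gap is measured from the previous word's right edge (x1)
        # to this word's left edge (x0).
        if last_x1 is None or word[1] - last_x1 > max_gap:
            cells.append([])
        cells[-1].append(word[0])
        last_x1 = word[3]
    return [" ".join(cell) for cell in cells]


table = [group_cells(row, COLUMN_GAP) for row in group_rows(WORDS, ROW_TOLERANCE)]
for line in table:
    print(line)
```

One caveat with this variant: rows are split on the gap between consecutive y0 values, so a long run of slowly drifting values would be chained into one row. With rows 50-90 pixels apart that should not matter, but it is a different rule from comparing every value against the first y0 of the row.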
What I would appreciate some feedback on
I have some experience in VBA but given that my dataset is quite large I would be grateful for any tips / hints for more efficient solutions.
My plan for the VBA module
1) Compute the size/length of the array
2) Sort the data on the y0 values
3) Define the lowest y0 value as line 1
4) Look for values within 20 pixels of that y0 and collect that data into arrays
5) Define line 2 by taking the lowest y0 value that is larger than the line 1 y0 + 20 pixels
6) Repeat step 4 and so on (and do a similar pass for the columns)
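For comparison, the plan above can be sketched as one sort followed by a single pass, shown here in Python for illustration rather than VBA. `assign_lines` and the `window` parameter are hypothetical names; the y0 values are the ones from the table, in table order.

```python
def assign_lines(y0_values, window=20):
    """Return a 1-based line number for each y0, in the original input order.

    The first (lowest) y0 of each line acts as the anchor. A new line starts
    as soon as a value exceeds anchor + window (steps 3 to 6 of the plan).
    """
    # Steps 1-2: indices sorted by y0, so the original order is not lost.
    order = sorted(range(len(y0_values)), key=lambda i: y0_values[i])
    line_numbers = [0] * len(y0_values)
    line = 0
    anchor = None
    for i in order:
        y0 = y0_values[i]
        if anchor is None or y0 > anchor + window:
            line += 1    # steps 3 and 5: this value opens the next line
            anchor = y0
        line_numbers[i] = line  # step 4: value falls within the window
    return line_numbers


Y0_VALUES = [813, 827, 827, 828, 830, 828, 923]
print(assign_lines(Y0_VALUES))
```

Because the data is sorted first, there is no need to rescan the array for each line: the sort costs O(n log n) and the line assignment is one linear pass. The same function could be reused for the columns by passing x-coordinates and a different window.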
Given your experience in VBA / XML, is there some more efficient way of doing this versus my plan above?