Grouping and sorting data in Excel based out OCR/XML output

M1985 · Jun 28, 2014

Hi!

I'm working with Google Tesseract (OCR-engine) to convert scanned PDF's to XML/Text files, with the goal of collecting and sorting the data into a larger data set for more in-depth analysis.

The OCR-process works fine and now I want to sort and collect the data from output files in XML-output into a larger database.

A typical input has the following layout:
http://image.bayimg.com/f724281a9943f3716df9c6e4d42b81ffa279a0d1.jpg

For each word / digit I get the word / digit + coordinates (bbox) in terms of x0,y0, x1, y1 (a rectangle). In XML it looks like this:

<p class='ocr_par' dir='ltr' id='par_4' title="bbox 128 1462 575 1541">
<span class='ocr_line' id='line_8' title="bbox 128 1462 575 1541"><span class='ocrx_word' id='word_20' title="bbox 128 1462 575 1541">Reparationer</span>
</span>
</p>

the bbox coordinates for 'par_7' and 'line_10' can be ignored since they refer to the OCR engine's interpretation of lines etc., which can be different from the "actual" coordinates depending on how the page is segmented during the OCR-process.

After importing the data and manipulating it a bit I have the following result (only showing data for one line):

Word / Digit	x0	y0	x1	y1
Årsavgifter	123	813	505	909
och	523	827	655	891
hyror	689	827	655	891
029	2695	828	2818	891
8	2578	830	2655	890
417	2852	828	2984	891
xxx	125	923	348	1014

All of these should be on the same row except the last row (i.e have the same y0), but depending on how the engine maps the words, the coordinates can differ a bit. The difference between the rows is typically in the magnitude of 50-90 pixels so adding a bit of flexibility of 20-30 pixels allows me to collect all the values on the same row.
The order of the digits in the table above is not correct since the "8" should be the first digit followed by 417 and then the 029 given the x-coordinates.

The logic to build the rows and columns should be:
1. Sort by y0 coordinates
2. Group into rows given a certain "flexibility"
3. Repeat 2 but with columns (typically the distance between words/digits in the same column is <40 pixels and the difference between columns is >100 pixels)

What I would appreciate some feedback on
I have some experience in VBA but given that my dataset is quite large I would be grateful for any tips / hints for more efficient solutions.

My plan for VBA-module
1) Compute the size/length of the array
2) Sort the data on y0 values
3) Define the lowest y0 value as line 1
4) Look for values within a range of 20 pixels and collect that data into arrays.
5) Define line 2 by taking the lowest value that is larger than line 1 y0 + 20 pixels
6) Repeat step 4 and so on (and do a similar step for the columns)

Given your experience in VBA / XML, is there some more efficient way of doing this versus my plan above?

Grouping and sorting data in Excel based out OCR/XML output

M1985

New Member

Excel Facts

Forum statistics

Share this page

Grouping and sorting data in Excel based out OCR/XML output

M1985

New Member

Excel Facts

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock