Grouping and sorting data in Excel based out OCR/XML output

M1985

New Member
Joined
Jan 15, 2014
Messages
5
Hi!

I'm working with Google Tesseract (OCR-engine) to convert scanned PDF's to XML/Text files, with the goal of collecting and sorting the data into a larger data set for more in-depth analysis.

The OCR-process works fine and now I want to sort and collect the data from output files in XML-output into a larger database.

A typical input has the following layout:
http://image.bayimg.com/f724281a9943f3716df9c6e4d42b81ffa279a0d1.jpg

For each word / digit I get the word / digit + coordinates (bbox) in terms of x0,y0, x1, y1 (a rectangle). In XML it looks like this:


<p class='ocr_par' dir='ltr' id='par_4' title="bbox 128 1462 575 1541">
<span class='ocr_line' id='line_8' title="bbox 128 1462 575 1541"><span class='ocrx_word' id='word_20' title="bbox 128 1462 575 1541">Reparationer</span>
</span>
</p>


the bbox coordinates for 'par_7' and 'line_10' can be ignored since they refer to the OCR engine's interpretation of lines etc., which can be different from the "actual" coordinates depending on how the page is segmented during the OCR-process.

After importing the data and manipulating it a bit I have the following result (only showing data for one line):


Word / Digitx0y0x1y1
Årsavgifter123813505909
och523827655891
hyror689827655891
02926958282818891
825788302655890
41728528282984891
xxx1259233481014

<tbody>
</tbody>

All of these should be on the same row except the last row (i.e have the same y0), but depending on how the engine maps the words, the coordinates can differ a bit. The difference between the rows is typically in the magnitude of 50-90 pixels so adding a bit of flexibility of 20-30 pixels allows me to collect all the values on the same row.
The order of the digits in the table above is not correct since the "8" should be the first digit followed by 417 and then the 029 given the x-coordinates.

The logic to build the rows and columns should be:
1. Sort by y0 coordinates
2. Group into rows given a certain "flexibility"
3. Repeat 2 but with columns (typically the distance between words/digits in the same column is <40 pixels and the difference between columns is >100 pixels)

What I would appreciate some feedback on
I have some experience in VBA but given that my dataset is quite large I would be grateful for any tips / hints for more efficient solutions.

My plan for VBA-module
1) Compute the size/length of the array
2) Sort the data on y0 values
3) Define the lowest y0 value as line 1
4) Look for values within a range of 20 pixels and collect that data into arrays.
5) Define line 2 by taking the lowest value that is larger than line 1 y0 + 20 pixels
6) Repeat step 4 and so on (and do a similar step for the columns)

Given your experience in VBA / XML, is there some more efficient way of doing this versus my plan above?
 
Last edited:

Excel Facts

Shade all formula cells
To shade all formula cells: Home, Find & Select, Formulas to select all formulas. Then apply a light fill color.

Forum statistics

Threads
1,215,093
Messages
6,123,067
Members
449,090
Latest member
fragment

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top