VBA to loop through multiple Word files and extract parts to one Excel sheet

Ovisele

Hi guys!

I know there are at least 5-6 topics on how to extract certain parts of text from Word to Excel, but I kindly ask you to look over my request, as it has some specifics of its own. I will try to be brief:

  • I have over 5000 PDF and Word files that I need to extract data from. Word files are the majority, so I extracted the first page of each PDF (done) and converted that page to .doc (done), so now all files are in the same format.
  • What I need to do:
    • Loop through each Word file (1-page document)
    • Extract only parts of the data (see point 4)
    • Insert that data into an Excel sheet for further analysis
  • The entire page looks like the one in the image – the black boxes are not relevant
  • The information that I need to extract is highlighted in yellow
  • The problem is that in the last section, the Secondary diagnostics can vary from 2-3 to over 20 entries, depending on the patient. I was considering using wildcards, but I am not sure that this is a good approach. In fact, this patient-to-patient variability is the main problem; otherwise I could (possibly) have extracted the data with an automated tool.
Sample word file here: Sample Word.docx

Many thanks for taking the time to read this! I really appreciate it!
 

Attachments

  • Document 1.png
  • 2020-03-13_14-08-09.png

This is close to mission impossible, since your Word docs are quite unstructured. Presuming they are all like your example, what looks like it could be reliably found isn't: the document is full of tabs, column breaks, section breaks, etc., so trying to reliably extract content from this mess is not likely to be fruitful. You may get some way, but I expect the output will need some level of review.

1584210898333.png

For example, you may find "Pana la..." with MS Word Regex (wildcard search, screenshot 1584213128758.png), presuming it's always followed by a tab character! "Sectia...", on the other hand, is not actually followed by "Cardiology" at all. It looks that way to a human, but for a machine it's followed by a return and "De la".
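To illustrate why the tab matters, here is a minimal sketch in Python (not VBA, and not something I have run against the real files). It assumes the page text has already been pulled out of Word as a plain string, with "\r" as the paragraph mark. The function name `value_after_tab` and the sample dates are made up for the illustration; only the labels come from the example document.

```python
import re

def value_after_tab(text, label):
    """Return the text between `label` + tab and the next paragraph mark, or None."""
    # Allow extra characters after the label, but insist on a tab before the value.
    pattern = re.escape(label) + r"[^\t\r\n]*\t([^\r\n]*)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

# Hypothetical page text: the values are invented, "\r" stands for a paragraph mark.
sample = "Pana la\t2020-03-13\rSectia\rDe la\t2020-03-01\r"

print(value_after_tab(sample, "Pana la"))  # value found, because a tab follows the label
print(value_after_tab(sample, "Sectia"))   # None: the label is followed by a return, not a tab
```

A label that returns None is exactly the kind of case that would need manual review.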

The diagnostics could be found (in this example) with something like Secondary diagnostics:?^n^13(*)^13(*)^13(*)^13(*)^13(*)^13(*)^13(*)^13(*)^13 but who knows whether a different doc has the column break, the tiny paragraph mark, etc.?
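A fixed wildcard pattern like that only fits one list length. If the text is available as a plain string, a more tolerant approach is to take everything after the label, treat every kind of break as a separator, and collect the non-empty pieces. The sketch below is in Python rather than VBA and rests on assumptions: `list_after_label`, the "Signature" stop label, and the sample strings are all hypothetical, and it presumes Word's usual control characters (paragraph mark "\r", line break "\x0b", page/section break "\x0c", column break "\x0e").

```python
import re

# Paragraph marks, line breaks, page/section breaks and column breaks all act as separators.
BREAKS = re.compile(r"[\r\n\x0b\x0c\x0e]+")

def list_after_label(text, label, stop_labels=()):
    """Collect the non-empty paragraphs that follow `label`, however many there are."""
    _, found, tail = text.partition(label)
    if not found:
        return []
    items = []
    for chunk in BREAKS.split(tail):
        item = chunk.strip()
        if not item:
            continue  # skip empty paragraphs left over from stray breaks
        if any(item.startswith(stop) for stop in stop_labels):
            break  # the next section begins here
        items.append(item)
    return items

# Hypothetical documents: one with 2 entries, one with 20 entries and a trailing section.
short_doc = "Secondary diagnostics:\x0e\rB\rC\r"
long_doc = ("Secondary diagnostics:\r"
            + "".join(f"Item {i}\r\r" for i in range(1, 21))
            + "Signature\r")

print(list_after_label(short_doc, "Secondary diagnostics:"))
print(len(list_after_label(long_doc, "Secondary diagnostics:", stop_labels=("Signature",))))
```

This still depends on the label text being identical in every file, so the output would need spot-checking.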

I suggest going back to the source files and seeing whether a more reliable/structured input could be sourced.
 
Thank you very much for taking the effort to respond. Indeed, the Word files are a total mess :( Much appreciated!
 