Loop through PDF files in a folder and save data in Excel

L

Legacy 316613

Guest
Hi All,

I am a very amateur and green VBA coder but learning new tricks every day.

Here is a big project I have:

I have PDF documents saved in a specific folder and would like to have the VBA code to do the following:

1) Either access the particular folder automatically or allow me to paste the path of the folder in a particular cell, lets say particular folder is C:\pdf or path of folder is in cell A2
2) I would like the program to loop through each one of the PDF files and paste all the pages into a new workbook which we can call C:\pdf\ExcelData.xlsx or xlsm (whichever is relevant)
3) Once I have the data in Excel and I assume it will be a very big file (I have 100+ pdf's) I want the code to look through the entire file of "bad data" and pick out a specific 8 digit numeric field such as "12345678" and paste it into a worksheet called "Codes" as well as capture the original file name of the PDF - for example file name is PDF1.pdf and the numerical field is "12345678" I would like the code to capture the following in the "Codes" worksheet - Column A - pdf1.pdf and Column B - "12345678"
4) To add further complexities, there may be several numeric fields in the same original PDF file and hence I would like to capture the following in the "Codes" worksheet:
Column A: pdf1.pdf Column B - 12345678
Column A: pdf1.pdf Column B - 23456789

etc...

I know this is asking for a lot but if I did this manually I would be looking at a minimum of 200-300 manhours and here is me hoping that you VBA experts can help me.

I would be using this code regularly and it will save me a lot of time and could lead to a promotion :)

Plz help!!
 

Excel Facts

Shade all formula cells
To shade all formula cells: Home, Find & Select, Formulas to select all formulas. Then apply a light fill color.
Looping through a folder is relatively straightforward. The second item is considerably more complex and would require using Adobe Acrobat Pro, not just the free reader, to automate the extraction. It also isn't obvious what you expect to obtain in terms of formatting, where the data are to go, or what constitutes "bad data".
 
Upvote 0
Looping through a folder is relatively straightforward. The second item is considerably more complex and would require using Adobe Acrobat Pro, not just the free reader, to automate the extraction. It also isn't obvious what you expect to obtain in terms of formatting, where the data are to go, or what constitutes "bad data".

Hi Macropod.

Thank you for prompt reply.

1) I do have Adobe Acrobat Pro and can use it for the extraction.

I stand corrected for my wrong terminology - it is not "bad data". What I meant was "badly formatted data".

Once the extraction from PDF to xls has been done, I expect to have a large file with badly formatted data.

I would need a code which "scrapes" through this bad data to pick up the numeric fields and the original PDF file name.

Let me know if you need any further information.
 
Upvote 0
Try the following code. It allows you to select the folder to process. As coded, the text from all the PDFs will be output to individual worksheets (named after the PDFs) in the active workbook. Comments in the code show where you could substitute a different workbook. You'll see when each worksheet is added as the code executes.
Code:
Sub ExtractPDFs()
'Note: Requires reference to Acrobat object library, via Tools|References
Application.ScreenUpdating = False
Dim strFolder As String, strFile As String, xlBook As Workbook, xlSheet As Worksheet
Dim AcroApp As CAcroApp, AcroAVDoc As CAcroAVDoc, AcroPDDoc As CAcroPDDoc
Dim AcroTextSelect As CAcroPDTextSelect, PageNumber As Object, PageContent As Object
Dim i As Long, j As Long, k As Long, StrContent As String
strFolder = GetFolder
If strFolder = "" Then Exit Sub
Set xlBook = ActiveWorkbook 'Workbooks.Open(Filename:="", AddToMRU:=False)
Set AcroAVDoc = CreateObject("AcroExch.AVDoc")
Set AcroApp = CreateObject("AcroExch.App")
strFile = Dir(strFolder & "\*.pdf", vbNormal)
While strFile <> ""
  If AcroAVDoc.Open(strFolder & "\" & strFile, vbNull) = True Then
    Set xlSheet = xlBook.Sheets.Add
    xlSheet.Name = Split(strFile, ".pdf")(0)
    Application.ScreenUpdating = True
    Application.ScreenUpdating = False
    While AcroAVDoc Is Nothing
      Set AcroAVDoc = AcroApp.GetActiveDoc
    Wend
    StrContent = ""
    Set AcroPDDoc = AcroAVDoc.GetPDDoc
    For i = 0 To AcroPDDoc.GetNumPages - 1
      Set PageNumber = AcroPDDoc.AcquirePage(i)
      Set PageContent = CreateObject("AcroExch.HiliteList")
      If PageContent.Add(0, 9000) = True Then
        Set AcroTextSelect = PageNumber.CreatePageHilite(PageContent)
        ' The next line is needed to avoid errors with protected PDFs that can't be read
        On Error Resume Next
        For j = 0 To AcroTextSelect.GetNumText - 1
          StrContent = StrContent & AcroTextSelect.GetText(j)
        Next j
        With xlSheet
          j = .UsedRange.Rows.Count + 1
          For k = 0 To UBound(Split(StrContent, vbCr))
            .Range("A" & j + k).Value = Split(StrContent, vbCr)(k)
          Next
          .Range("A1").Value = strFile
          .UsedRange.WrapText = False
        End With
      End If
    Next i
    AcroAVDoc.Close True
  End If
  strFile = Dir()
Wend
Set xlSheet = Nothing: Set xlBook = Nothing
Set PageContent = Nothing: Set PageNumber = Nothing
Set AcroTextSelect = Nothing: Set AcroAVDoc = Nothing: Set AcroApp = Nothing
Application.ScreenUpdating = True
End Sub

Function GetFolder() As String
Dim oFolder As Object
GetFolder = ""
Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
Set oFolder = Nothing
End Function
At this stage, I haven't tried implementing your steps 3 & 4 - there should be plenty of code on this forum to show you how to do that.
 
Upvote 0
Dear Macropod,

I have the same problem althugh i just need the code to loop through all pdfs in a folder and paste all their contents (text) into one Sheet where all pdfs are in one column e.g. A one under the other. I then will extract the relevant data via formulas but the copying of the many pdfs is the problem. I have tried to run your code but the line
Code:
Set AcroAVDoc = CreateObject("AcroExch.AVDoc")

[CODE]

Throws an error and the debugger says Run time error 429...

Can You please help? This make my wokr so much easier.
[/CODE]
 
Upvote 0
Hi Macropod,

Of course I did and I enabled everything that had Acrobat or ActiveX in the description... Can You tell me which Library Exactly is needed?

THX in advance.

P
 
Upvote 0
Hi Macropod,

That would explain a lot... don't have it but maybe my IT has it.

Anyway thanks a lot. I will write on my progress.
 
Upvote 0
HI Macropod,

can You write which version of Acrobat Professional You are using before I ask my IT guys?

Thanks in advance.
 
Upvote 0

Forum statistics

Threads
1,215,330
Messages
6,124,308
Members
449,151
Latest member
JOOJ

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top