Read a PDF file line by line as text from VBA

drom

Office Version
  1. 2021
  2. 2019
  3. 2016
  4. 2013
  5. 2011
  6. 2010
  7. 2007
Hi and thanks in advance!!
If I use the following link
How To Count The Page Numbers Of Pdf Files In Excel?
I get the desired page count, but is there a way to search every PDF page for a keyword (e.g. Country) and get the different countries in a PDF file?

I do not know which Acrobat version my colleagues are using (and this is where I run into problems).
And when I open the PDF file in Word to get the countries...
  • It works fine, but takes too long because the PDF files are huge

When I use:

VBA Code:
Const Form_FileName As String = "C:\WordWideEmployees.pdf"   'Could contain over 350 pages

'Early binding: needs a reference to "Microsoft Scripting Runtime" (Tools > References)
Sub AAA()
  Dim fso As New FileSystemObject
  Dim wStream As TextStream
  Dim wLine As String
  Dim wKey As String
  Dim aCountry_List() As Variant
  Dim xArrayIndex As Long
  Dim xRow As Long

  wKey = "Country"   'Country is located in every page, so once found I can get the countries like: Spain, France...
  Set wStream = fso.OpenTextFile(Form_FileName, ForReading, False)

  Do While Not wStream.AtEndOfStream
    wLine = wStream.ReadLine
    If wLine <> "" Then
      'Debug.Print wLine
      If InStr(wLine, wKey) > 0 Then
        xArrayIndex = xArrayIndex + 1
        ReDim Preserve aCountry_List(1 To xArrayIndex)
        aCountry_List(xArrayIndex) = Mid(wLine, 7, 50)   'Not real but for this eg I do not care
      End If
    End If
  Loop
  wStream.Close

  If xArrayIndex = 0 Then Exit Sub   'Nothing found, so the array was never dimensioned

  'Step -1 is required when counting down from UBound to LBound
  For xRow = UBound(aCountry_List) To LBound(aCountry_List) Step -1
    Debug.Print xRow, aCountry_List(xRow)
  Next xRow

End Sub

For wLine I get very strange strings like:
  • /Length1 123180
  • /Type /Stream
  • >>
  • stream
  • xœì½ xEþ7þêžÉ\Éœ™#“¹2™É1™Ü áä’+"DA@ð>@]Eï\o
  • —†€‚ŠëÉzëzìŠ.¢«â‰¬É¼Ÿê™ uwÿÿû>Ïû<ïvQߪ®ª®®þÞßêž@Œˆl ????????????????...
  • šA¯èãl›C€7ŽR˜Ka)©‹{òàÝ3¨?ð¼³W~Wlgí®Rø
and I can never get the countries I want.
My PDFs are not protected, not photos...

Any help?
 

I've been working on a project that reads the metadata of PDF files (like the page count, in much the same way as the approach you've used) and automates text extraction from PDFs, but the problem is that it uses Word to do this. The difficulty is that even when a PDF file has an accessible text layer (i.e., it hasn't simply been scanned, which would make it effectively a collection of pictures), that text layer isn't readily accessible just by reading the file line by line. As I understand it, the PDF file format usually stores the page text and other contents in compressed streams (commonly with the FlateDecode filter), and that would explain the strings of data you're getting.
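
If you want to confirm that this is what is happening in your file, here is a minimal sketch that counts how many streams are flagged as Flate-compressed. It assumes the file fits comfortably in memory, and it only looks for the `/FlateDecode` marker in the raw bytes; it does not decompress anything. The helper names (`CountOccurrences`, `ReadRawFile`, `ReportCompressedStreams`) are made up for this example.

```vba
Option Explicit

' Pure helper: counts non-overlapping occurrences of token in haystack.
Public Function CountOccurrences(ByVal haystack As String, ByVal token As String) As Long
    Dim pos As Long
    If Len(token) = 0 Then Exit Function
    pos = InStr(1, haystack, token, vbBinaryCompare)
    Do While pos > 0
        CountOccurrences = CountOccurrences + 1
        pos = InStr(pos + Len(token), haystack, token, vbBinaryCompare)
    Loop
End Function

' Loads the raw bytes of a file into a string (one character per byte).
Public Function ReadRawFile(ByVal filePath As String) As String
    Dim fileNo As Integer
    Dim buffer As String
    fileNo = FreeFile
    Open filePath For Binary Access Read As #fileNo
    buffer = Space$(LOF(fileNo))   ' pre-size the buffer to the file length
    Get #fileNo, , buffer
    Close #fileNo
    ReadRawFile = buffer
End Function

Public Sub ReportCompressedStreams()
    Dim raw As String
    raw = ReadRawFile("C:\WordWideEmployees.pdf")   ' path taken from the question
    Debug.Print "Streams marked /FlateDecode: " & CountOccurrences(raw, "/FlateDecode")
End Sub
```

A non-zero count suggests the readable text is sitting inside compressed streams, which is why `ReadLine` only ever returns dictionary entries and binary noise.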

So you need to extract the text from the PDF file somehow, and that's usually done through Adobe, a third-party provider, an online API service, Word (the free and easy approach), or command-line tools like Xpdf or PDFtk. I've seen PDFtk mentioned on this forum but haven't used it personally. I do use the Xpdf tools: they're a free collection of small executable files that each do one thing. pdftotext.exe, for example, will extract the text from a PDF file, pdftopng.exe converts pages to PNG images, and so on. You can use VBA to automate the process with a command-line tool. That said, if your files are 350 pages or so, I don't know that the process will necessarily be especially quick.
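
As a rough sketch of the command-line route, the code below runs pdftotext.exe from VBA, waits for it to finish, and then scans the resulting text file for the keyword. Several things here are assumptions rather than facts about your setup: the `C:\Tools\xpdf\` folder is hypothetical, the helper names are made up, and I'm assuming each page has a line along the lines of `Country: Spain`, so adjust the parsing to match the real layout.

```vba
Option Explicit

' Hypothetical location of the Xpdf executable: change to suit.
Private Const PDFTOTEXT_EXE As String = "C:\Tools\xpdf\pdftotext.exe"
Private Const PDF_FILE As String = "C:\WordWideEmployees.pdf"

' Pure helper: returns the text following key on a line, or "" if key is absent.
' Assumes a layout such as "Country: Spain" (an assumption about the PDF).
Public Function ValueAfterKey(ByVal textLine As String, ByVal key As String) As String
    Dim pos As Long
    Dim rest As String
    pos = InStr(1, textLine, key, vbTextCompare)
    If pos = 0 Then Exit Function
    rest = Trim$(Mid$(textLine, pos + Len(key)))
    ' Drop a leading colon, if there is one.
    If Left$(rest, 1) = ":" Then rest = Trim$(Mid$(rest, 2))
    ValueAfterKey = rest
End Function

Public Sub ListCountriesInPdf()
    Dim txtFile As String
    Dim cmd As String
    Dim exitCode As Long
    Dim found As Object
    Dim ts As Object
    Dim country As String
    Dim k As Variant

    txtFile = Environ$("TEMP") & "\pdf_extract.txt"

    ' Build: "pdftotext.exe" -layout "input.pdf" "output.txt"
    cmd = """" & PDFTOTEXT_EXE & """ -layout """ & PDF_FILE & """ """ & txtFile & """"

    ' Window style 0 = hidden; True = wait until the program exits.
    exitCode = CreateObject("WScript.Shell").Run(cmd, 0, True)
    If exitCode <> 0 Then
        MsgBox "pdftotext failed with exit code " & exitCode
        Exit Sub
    End If

    ' Late binding, so no reference to Microsoft Scripting Runtime is needed.
    Set found = CreateObject("Scripting.Dictionary")
    found.CompareMode = vbTextCompare

    Set ts = CreateObject("Scripting.FileSystemObject").OpenTextFile(txtFile, 1) ' 1 = ForReading
    Do While Not ts.AtEndOfStream
        country = ValueAfterKey(ts.ReadLine, "Country")
        If Len(country) > 0 Then found(country) = True   ' keys stay unique
    Loop
    ts.Close

    For Each k In found.Keys
        Debug.Print k
    Next k
End Sub
```

The dictionary is only there to collapse repeats, since the same country will presumably appear on many pages. Reading the extracted text file line by line should then behave the way you expected your original loop to.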

I should also add that Power Query can read PDF files, but that tends to be better suited to PDFs containing tables than to text documents of the size you're contemplating. I could be entirely wrong on that point, but it may be worth investigating.
 