PDF challenge

xenou

Hi,
I've got some co-workers using Acrobat in a long-handed way. They are going through these steps which will be repeated about 14,000 times:

1. Search one long pdf document for a name
2. Print the page
3. Repeat (14,000 times)

The search takes about 10 seconds (alone) on average - so it adds up to a bit of time, with all clicking and waiting for search results.

I'm not at all familiar with coding to Adobe's object model. Does anyone know if it's possible to automate these steps, given a list of names (they come in about 100 at a time in Excel, to be matched and printed)? Naturally, the names will not always be a perfect match, so we'd probably need to print all matching names and sort out the duplicates (common names will likely have two or more matches).

Any ideas welcome. Cheers,
xenou
 

I'm not sure about reading PDF data (it can be done but I think you have to buy a program to do it), but if it's taking 10 seconds to search and you've got 14,000 searches to do, the searches alone would take about 39 hours of solid searching (14,000 x 10 seconds = 140,000 seconds), or roughly 5 eight-hour working days.

Talk about job security! :ROFLMAO:
 
I scraped something together here. One mildly problematic item: the PDF search sometimes doesn't go to the page where the text was found, and if the search didn't "select" the found text then I don't have the page number(!). I haven't been able to nail down the cause yet, but it may be something weird like running the code in the VBE window. Also, search results sometimes show up in a "treeview" with the document still at the top of the tree, rather than the search result being shown immediately. I couldn't figure out why there are two ways of presenting the result (I tried the search options in the Acrobat program, but no joy). BTW, I did find a post where the search was done "page by page"; that may be the best way to do it. I decided to close and re-open the file each time as a precaution, since I don't really know what Acrobat does with searches and so on.

This is a stripped down version of the code. I found two search methods, one a method of the avDoc and one a method of the JSO interface. They seem to do the same thing - I decided on the latter.

Code:
Public Sub PrintCNA()

'//Note: b (the array of search strings), sMasterFullFilePath and sMasterFileName
'//are set up in the full version; that part was stripped out of this post.
Dim gApp As Acrobat.CAcroApp '//Acrobat app
Dim avDoc As Acrobat.CAcroAVDoc '//Visible pdf document with a UI Window
Dim pdDoc As Acrobat.CAcroPDDoc '//Underlying pdf document
Dim avView As Acrobat.CAcroAVPageView '//For access to page numbers
Dim jso As Object '//Javascript interface
Dim i As Long, pgNum As Long
Dim tStart As Single

    Set gApp = CreateObject("AcroExch.App")
    Set avDoc = CreateObject("AcroExch.AVDoc")
    gApp.Show

    For i = 1 To UBound(b)

        Call avDoc.Open(sMasterFullFilePath, sMasterFileName)
        Set avView = avDoc.GetAVPageView()
        Set pdDoc = avDoc.GetPDDoc()
        Set jso = pdDoc.GetJSObject

        '//Search for ID Numbers
        pgNum = 0
        'Call avDoc.FindText(b(i), False, False, True)
        Call jso.Search.query(b(i), "ActiveDoc")
        '//Poll for the page the search jumped to, but give up after 30 seconds
        '//(an arbitrary limit) instead of looping forever when nothing is found.
        '//GetPageNum is 0-based, so a hit on the first page also reads as 0
        '//and is skipped here.
        tStart = Timer
        Do While pgNum = 0 And Timer - tStart < 30
            DoEvents
            pgNum = avView.GetPageNum
        Loop
        If pgNum > 0 Then
            Call avDoc.PrintPages(pgNum, pgNum, 2, False, False)
        End If

        avDoc.Close (True) '//close w/o saving changes
        pdDoc.Close

    Next i

    gApp.Exit

End Sub
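
The "page by page" idea mentioned above could be sketched roughly as follows. This is not tested against a real document; it assumes the JSObject exposes the Acrobat JavaScript members numPages, getPageNumWords and getPageNthWord, and the names BuildPageTexts and PagesContaining are made up for illustration. The idea is to read every page's words once, then look names up in memory, which also returns every matching page rather than only the first. One caveat to watch for: the VBA editor may change the capitalization of these late-bound method names, and the JavaScript side is case-sensitive.

```vb
Option Explicit

'// Builds one string per page (index = 0-based page number) from the words
'// Acrobat reports. Assumes the document has at least one page.
Public Function BuildPageTexts(ByVal jso As Object) As String()
    Dim nPages As Long, p As Long, w As Long, nWords As Long
    Dim texts() As String, parts() As String

    nPages = jso.numPages
    ReDim texts(0 To nPages - 1)
    For p = 0 To nPages - 1
        nWords = jso.getPageNumWords(p)
        If nWords > 0 Then
            ReDim parts(0 To nWords - 1)
            For w = 0 To nWords - 1
                '// True = strip punctuation/whitespace; & "" guards against Null
                parts(w) = jso.getPageNthWord(p, w, True) & ""
            Next w
            texts(p) = Join(parts, " ")
        End If
    Next p
    BuildPageTexts = texts
End Function

'// Returns every 0-based page whose text contains nameToFind (case-insensitive).
'// An empty collection means "not found", so there is nothing to loop on.
Public Function PagesContaining(texts() As String, ByVal nameToFind As String) As Collection
    Dim hits As New Collection, p As Long
    For p = LBound(texts) To UBound(texts)
        If InStr(1, texts(p), nameToFind, vbTextCompare) > 0 Then hits.Add p
    Next p
    Set PagesContaining = hits
End Function
```

Because the words are re-joined with single spaces and punctuation is stripped, "Smith, John" in the PDF would be matched by searching for "Smith John". Reading a long document word by word will be slow, but it only has to happen once per file rather than once per name.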
 
Do you have the original C# code?
What is b in UBound(b)? Is it a parameter, string[] b?
Any hint about select all, or going page by page without searching?

I looked at the Adobe Acrobat site and the object browser for Acrobat, but I just don't know how to make the leap from getting the object to getting the text.
 
For those who want to extract all words from all pages of a PDF document without searching, here is a slightly modified version of the original post in the Adobe forum by Eldrarak82:
// requires: using System; using System.Collections.Generic; using Acrobat;
private static string PdDocGetText(AcroPDDoc pdDoc)
{
    AcroPDPage page;
    int pages = pdDoc.GetNumPages();
    string pageText = "";

    for (int i = 0; i < pages; i++)
    {
        page = (AcroPDPage)pdDoc.AcquirePage(i);
        object jso, jsNumWords, jsWord;
        List<string> words = new List<string>();

        try
        {
            jso = pdDoc.GetJSObject();

            if (jso != null)
            {
                object[] args = new object[] { i };
                jsNumWords = jso.GetType().InvokeMember("getPageNumWords", System.Reflection.BindingFlags.InvokeMethod, null, jso, args, null);
                int numWords = Int32.Parse(jsNumWords.ToString());

                // word indices are 0-based, so the last valid index is numWords - 1
                for (int j = 0; j < numWords; j++)
                {
                    // third argument false = keep punctuation and whitespace on each word
                    object[] argsj = new object[] { i, j, false };
                    jsWord = jso.GetType().InvokeMember("getPageNthWord", System.Reflection.BindingFlags.InvokeMethod, null, jso, argsj, null);
                    words.Add((string)jsWord);
                }
            }

            foreach (string word in words)
            {
                pageText += word;
            }
        }
        catch
        {
            // errors are silently ignored; a failing page contributes no text
        }
    }

    return pageText;
}
// end of the modified code from Eldrarak82

// the following I added using info from a previous post in this thread
public static string getTextFromPDF(string filespec)
{
    Acrobat.AcroAppClass gAppClass = new Acrobat.AcroAppClass();
    Acrobat.AcroApp gApp = (Acrobat.AcroApp)gAppClass.GetInterface("AcroExch.App"); //Acrobat app
    Acrobat.AcroAVDoc avDoc = (Acrobat.AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc"); //Visible pdf document with a UI Window
    avDoc.Open(System.IO.Path.GetFullPath(filespec), System.IO.Path.GetFileName(filespec));

    AcroPDDoc doc = (AcroPDDoc)avDoc.GetPDDoc();
    string txt = PdDocGetText(doc);
    doc.Close();
    return txt;
}
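
To tie the word extraction back to the original question (a list of names from Excel, each matching page printed once), a rough VBA sketch might look like the following. It is untested and rests on some assumptions: pageTexts(p) already holds the extracted text of 0-based page p, names is a one-dimensional array (a Range's .Value is two-dimensional and would need flattening first), and PagesToPrint and PrintMatchedPages are hypothetical names. The PrintPages arguments are copied from the VBA post earlier in the thread.

```vb
Option Explicit

'// Returns a Scripting.Dictionary keyed by 0-based page number.
'// Each page appears once, however many names match it; the stored
'// item is the first name that matched, to help sort out duplicates later.
Public Function PagesToPrint(pageTexts() As String, names As Variant) As Object
    Dim hits As Object, p As Long, n As Long
    Set hits = CreateObject("Scripting.Dictionary")
    For p = LBound(pageTexts) To UBound(pageTexts)
        For n = LBound(names) To UBound(names)
            If Len(names(n)) > 0 Then
                If InStr(1, pageTexts(p), names(n), vbTextCompare) > 0 Then
                    hits(p) = names(n)
                    Exit For   '// one hit is enough to mark this page
                End If
            End If
        Next n
    Next p
    Set PagesToPrint = hits
End Function

'// Prints each matched page exactly once. avDoc is assumed to be an
'// already opened AcroExch.AVDoc object.
Public Sub PrintMatchedPages(ByVal avDoc As Object, pageTexts() As String, names As Variant)
    Dim hits As Object, key As Variant
    Set hits = PagesToPrint(pageTexts, names)
    For Each key In hits.Keys
        '// PrintPages takes 0-based first and last page numbers
        avDoc.PrintPages CLng(key), CLng(key), 2, False, False
    Next key
End Sub
```

With a batch of about 100 names, this loops over the page texts in memory instead of running 100 separate Acrobat searches, and a page shared by two common names is only printed once.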
 
