PDF challenge

xenou · Jun 24, 2010

Hi,
I've got some co-workers using Acrobat in a long-handed way. They are going through these steps which will be repeated about 14,000 times:

1. Search one long pdf document for a name
2. Print the page
3. Repeat (14,000 times)

The search takes about 10 seconds (alone) on average - so it adds up to a bit of time, with all clicking and waiting for search results.

I'm not at all familiar with coding to Adobe's object model. Does anyone know if it's possible to automate these steps, given a list of names (they come in about 100 at a time in Excel, to be matched and printed)? Naturally, the names will not always be a perfect match, so we'd probably need to print all matching names and sort out the duplicates (common names will likely have two or more matches).

Any ideas welcome. Cheers,
xenou

Rekd · Jun 25, 2010

I'm not sure about reading PDF data (it can be done but I think you have to buy a program to do it), but if it's taking 10 seconds to search and you've got 14,000 searches to do, the searches alone come to 14,000 × 10 s = 140,000 s, which is about 39 hours of solid searching, or roughly five 8-hour working days before any clicking and printing.

Talk about job security!

xenou · Jun 26, 2010

I scraped something together here. One item that was mildly problematic is that the pdf search sometimes does not go to the page where the text was found, and if the search doesn't "select" the found text then I don't have the page number(!). I haven't been able to nail down the cause yet, but it may be something weird like running the code in the VBE window. Also, search results sometimes appear in a "treeview" with the document still at the top of the tree, rather than the view jumping straight to the result. I couldn't figure out why there are two ways of presenting the result (I tried the search options in the Acrobat program but no joy). (BTW, I did find a post where the search was done "page by page" - that may be the best way to do it.) I decided to close and re-open the file each time as a precaution since I don't really know what Acrobat does with searches and so on.

This is a stripped down version of the code. I found two search methods, one a method of the avDoc and one a method of the JSO interface. They seem to do the same thing - I decided on the latter.

Code:

Public Sub PrintCNA()

    Dim gApp As Acrobat.CAcroApp           '//Acrobat app
    Dim avDoc As Acrobat.CAcroAVDoc        '//Visible pdf document with a UI Window
    Dim pdDoc As Acrobat.CAcroPDDoc        '//Underlying pdf document
    Dim avView As Acrobat.CAcroAVPageView  '//For access to page numbers
    Dim jso As Object                      '//Javascript interface
    Dim i As Long, pgNum As Long
    Dim tStart As Single

    '//Stripped out of this version (declared and filled elsewhere):
    '//  b                     - 1-based array of search strings
    '//  sMasterFullFilePath   - full path of the master pdf
    '//  sMasterFileName       - title shown for the document window

    Set gApp = CreateObject("AcroExch.App")
    Set avDoc = CreateObject("AcroExch.AVDoc")
    gApp.Show

    For i = 1 To UBound(b)

        Call avDoc.Open(sMasterFullFilePath, sMasterFileName)
        Set avView = avDoc.GetAVPageView()
        Set pdDoc = avDoc.GetPDDoc()
        Set jso = pdDoc.GetJSObject

        '//Search for ID Numbers
        pgNum = 0
        'Call avDoc.FindText(b(i), False, False, True)
        Call jso.Search.query(b(i), "ActiveDoc")

        '//Wait for the view to move to the hit. The 15 second limit is an
        '//arbitrary choice so that a miss can no longer loop forever.
        '//GetPageNum is zero-based, so a hit on the first page still looks like "not found".
        tStart = Timer
        Do While pgNum = 0 And Timer - tStart < 15
            DoEvents
            pgNum = avView.GetPageNum
        Loop

        If pgNum > 0 Then
            Call avDoc.PrintPages(pgNum, pgNum, 2, False, False)
        Else
            Debug.Print "No page found for: " & b(i)
        End If

        avDoc.Close (True) '//close w/o saving changes
        pdDoc.Close

    Next i

    gApp.Exit

End Sub
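
The "page by page" idea mentioned above might look roughly like the sketch below. It is untested against the real file and is only meant to show the shape of the approach: instead of running a search per name, it reads each page's words once through the JSO methods getPageNumWords and getPageNthWord and checks every search term against that text. PageText and PrintMatchingPages are made-up names, and the plain substring match is an assumption (because punctuation is stripped from the words, a term like "Smith, John" would not match as written).

```vb
'// Sketch only. jso is assumed to come from pdDoc.GetJSObject, as in PrintCNA.
'// Returns the words of one page (zero-based index) joined by single spaces.
Private Function PageText(ByVal jso As Object, ByVal pageIdx As Long) As String
    Dim n As Long, j As Long
    Dim words() As String

    n = jso.getPageNumWords(pageIdx)
    If n = 0 Then Exit Function              '// blank or image-only page

    ReDim words(0 To n - 1)
    For j = 0 To n - 1
        '// third argument True strips punctuation and whitespace from the word
        words(j) = jso.getPageNthWord(pageIdx, j, True)
    Next j
    PageText = Join(words, " ")
End Function

'// Hypothetical driver: prints every page that contains at least one term.
'// terms is an array of search strings, like b in PrintCNA.
Public Sub PrintMatchingPages(ByVal avDoc As Object, terms() As String)
    Dim pdDoc As Object, jso As Object
    Dim p As Long, i As Long
    Dim txt As String

    Set pdDoc = avDoc.GetPDDoc()
    Set jso = pdDoc.GetJSObject

    For p = 0 To pdDoc.GetNumPages() - 1     '// page numbers are zero-based
        txt = PageText(jso, p)
        For i = LBound(terms) To UBound(terms)
            '// skip empty terms (InStr would report a match for "")
            If Len(terms(i)) > 0 Then
                '// case-insensitive substring match
                If InStr(1, txt, terms(i), vbTextCompare) > 0 Then
                    Call avDoc.PrintPages(p, p, 2, False, False)
                    Exit For                 '// print each page once, even if several terms hit
                End If
            End If
        Next i
    Next p
End Sub
```

Since each page is read only once for the whole list of names, the document does not need to be closed and re-opened per name, and a page that matches several names is printed a single time, which should take care of some of the duplicates.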

gg89 · May 29, 2012

Do you have the original C# code?
What is b in UBound(b)? Is it a parameter, string[] b?
Any hint about select all or page by page without searching?

I looked at the Adobe Acrobat site and the object browser for Acrobat but I just don't know how to leap from getting the object to getting the text.
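
On the question about b: it is not declared in the stripped-down VBA above, but from the way it is used (For i = 1 To UBound(b), then b(i) passed to the search) it appears to be a 1-based array of search strings. A minimal sketch of how such an array might be filled from the Excel list, assuming the names sit in column A of a worksheet (LoadSearchTerms is a made-up helper, not part of the original code):

```vb
'// Hypothetical helper: returns the non-empty values in column A of the
'// given worksheet as a 1-based array of strings.
Public Function LoadSearchTerms(ByVal ws As Worksheet) As String()
    Dim lastRow As Long, r As Long, n As Long
    Dim s As String
    Dim terms() As String

    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    ReDim terms(1 To lastRow)

    For r = 1 To lastRow
        s = Trim$(CStr(ws.Cells(r, "A").Value))   '// assumes no error values in the column
        If Len(s) > 0 Then
            n = n + 1
            terms(n) = s
        End If
    Next r

    If n = 0 Then Exit Function                   '// nothing found: array is left unallocated
    ReDim Preserve terms(1 To n)                  '// trim off the unused tail
    LoadSearchTerms = terms
End Function
```

It could then be used as b = LoadSearchTerms(ThisWorkbook.Worksheets("Names")), with b declared as Dim b() As String; the sheet name "Names" is only an example.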

gg89 · May 30, 2012

For those who want to extract all words from all pages of a PDF document without searching, here is a slightly modified version of the original post in the Adobe forum by Eldrarak82:

// Needs: using System; using System.Collections.Generic; using Acrobat;
private static string PdDocGetText(AcroPDDoc pdDoc)
{
    int pages = pdDoc.GetNumPages();
    string pageText = "";

    for (int i = 0; i < pages; i++)
    {
        object jso, jsNumWords, jsWord;
        List<string> words = new List<string>();

        try
        {
            jso = pdDoc.GetJSObject();

            if (jso != null)
            {
                // Late-bound call into the JavaScript object: number of words on page i
                object[] args = new object[] { i };
                jsNumWords = jso.GetType().InvokeMember("getPageNumWords", System.Reflection.BindingFlags.InvokeMethod, null, jso, args, null);
                int numWords = Int32.Parse(jsNumWords.ToString());

                // Word indexes are zero-based, so the last valid index is numWords - 1
                for (int j = 0; j < numWords; j++)
                {
                    // false = do not strip punctuation and whitespace, so words can be joined as-is
                    object[] argsj = new object[] { i, j, false };
                    jsWord = jso.GetType().InvokeMember("getPageNthWord", System.Reflection.BindingFlags.InvokeMethod, null, jso, argsj, null);
                    words.Add((string)jsWord);
                }
            }

            foreach (string word in words)
            {
                pageText += word;
            }
        }
        catch
        {
            // best effort: a page whose words cannot be read is skipped
        }
    }

    return pageText;
}
// end of the modified code from Eldrarak82

// the following I added using info from a previous post in this thread
public static string getTextFromPDF(string filespec)
{
    Acrobat.AcroAppClass gAppClass = new Acrobat.AcroAppClass();
    Acrobat.AcroApp gApp = (Acrobat.AcroApp)gAppClass.GetInterface("AcroExch.App"); //Acrobat app
    Acrobat.AcroAVDoc avDoc = (Acrobat.AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc"); //Visible pdf document with a UI Window
    avDoc.Open(System.IO.Path.GetFullPath(filespec), System.IO.Path.GetFileName(filespec));

    AcroPDDoc doc = (AcroPDDoc)avDoc.GetPDDoc();
    string txt = PdDocGetText(doc);
    doc.Close();
    return txt;
}
