Best method for scraping data from the Web using a VB macro?

dustybootz

This is something of a conceptual question rather than one about the specifics of code (i.e. am I going about this the right way in general, or is there a better technique I could use?). I think my problem represents a broad issue affecting many of the inexperienced people who post on this forum, so an overview and a sharing of best practice would also help many people.

My aim is to scrape statistical data from a website (here is an exemplar page: www.racingpost.com/horses/result_home.sd?race_id=572318&r_date=2013-02-26&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS).

I have a very basic (if you'll pardon the pun) knowledge of VB, which I use through Excel, but I know nothing about other programming languages or conventions (SQL, HTML, XML etc.). However, I am quite good at writing code to manipulate strings: once I can scrape the data, even in a very noisy form, I am expert at processing it. I am trying to build an automated process that will scrape up to 1,000 pages in one hit. In one form or another I have been working on this for years, and the last few weeks have been very frustrating: I have come up with several new methods, each of which took days of work, but each has had one fatal flaw that has stopped my progress.

Here are the methods I have tried (all using a VB macro run from Excel):
1) Controlling Firefox (as a shell application) - this was the poorest; I found that I could not interact with Firefox properly from a VB Excel macro (I tried mainly using keystrokes etc.).
2) Getting inner text, outer text, inner HTML or outer HTML from Internet Explorer (IE) - this method was by far the most reliable, but the data was at times very hard to parse and did not always contain everything I needed (good for some applications but bad for others).
3) Automated copy and paste from IE - this was tantalisingly close to being perfect, but it is given to throwing up inexplicable errors whereby the type of information copied to the clipboard differs depending on whether the copy is done manually (i.e. Ctrl+A, Ctrl+C) or through the automated process: with the former I could get the HTML structure (i.e. tables), with the latter only text. The enigma here was that the automated copy/paste would give me the right information if I first clicked inside the IE window with the mouse pointer, but I was unable to automate that click from a VB macro (I tried SendKeys and various other methods).
4) Automating an Excel web query - I recorded a macro of a query, and this worked flawlessly, giving me the table structure I needed. The snag was that it was very, very slow: even a single page might take 14 to 16 seconds (some of the other methods were near instantaneous). This method also appears to suffer severe lagging/crashing problems when many refreshes are done (that may be because I don't know how to update the queries with different criteria, or how to dispose of them properly).
5) Loading the page as an XML document - I am investigating this method now. I know next to nothing about XML, but I have a hunch that the sort of pages I am scraping (see the example above) are suitable for it. I have managed to load the pages as an XML object, but at present I am running into difficulties trying to parse the structure (i.e. the various nodes) to extract the text - I keep hitting object errors.

For the record, I have posted highly specific questions with code relating to these individual methods without response, so I am trying a broader question. What is the experience of others here? Which of these methods should I focus on? (Bear in mind that I am trying to keep everything to Excel VB macros.) I am getting to the point where I might look to pay someone to code something for me (as this is taking hundreds of hours) - have people had good experiences employing others to write ad hoc code in this manner?

Best wishes,
David
 

The best method is the one that works, is reliable and is relatively fast. Which method you choose depends on the specific web site, and some methods may not work at all.

1) Controlling Firefox (as a shell application)...
I wouldn't bother. SendKeys is unreliable. Anyway, Excel uses IE as its underlying browser, not Firefox.

2) Getting inner text, outer text, inner HTML or outer HTML from Internet Explorer (IE)...
Good, but only if you're using the MS HTML Object Library to find and parse HTML elements. Some people parse the whole outerHTML with VBA string functions like InStr, Mid, etc., which I think is more easily broken by web site design changes. A sketch of the IE-plus-Object-Library combination follows below.
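Something like this - an untested minimal sketch, assuming references to Microsoft Internet Controls and the Microsoft HTML Object Library are set; the URL and the table index are placeholders, not code for your specific page:

Sub ScrapeWithIE()
    Dim IE As SHDocVw.InternetExplorer
    Dim doc As MSHTML.HTMLDocument
    Dim tbl As MSHTML.HTMLTable
    Dim r As Long, c As Long

    Set IE = New SHDocVw.InternetExplorer
    IE.Visible = False
    IE.navigate "http://www.example.com/results"   ' placeholder URL

    Do While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE   ' wait for the page to load
        DoEvents
    Loop

    Set doc = IE.document
    Set tbl = doc.getElementsByTagName("table")(0)   ' first table; adjust, or use getElementById

    For r = 0 To tbl.Rows.Length - 1                 ' copy the table into the active sheet
        For c = 0 To tbl.Rows(r).Cells.Length - 1
            ActiveSheet.Cells(r + 1, c + 1).Value = tbl.Rows(r).Cells(c).innerText
        Next c
    Next r

    IE.Quit
End Sub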

3) Automated copy and paste from IE...
Did you try IE.ExecWB to select all and copy to the clipboard (sketched below)? Forget SendKeys and AppActivate; there is better code on the web for sending keystrokes and mouse clicks to a specific window, which VBA's SendKeys can't do - see SendKeys.
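For what it's worth, a minimal sketch of how the ExecWB route might be wired up - this assumes IE is an already-loaded SHDocVw.InternetExplorer object (the OLECMDID constants come with the Microsoft Internet Controls reference), and the destination sheet is a placeholder:

' Select the whole page, copy it to the clipboard, then paste into Excel
IE.ExecWB OLECMDID_SELECTALL, OLECMDEXECOPT_DODEFAULT
IE.ExecWB OLECMDID_COPY, OLECMDEXECOPT_DODEFAULT
Worksheets("Sheet1").Activate   ' placeholder destination
Range("A1").Select
ActiveSheet.Paste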

4) Automating an Excel web query...
This is the easiest method because no knowledge of HTML is needed, but it is also very slow, as you've seen, because I think it renders the web page behind the scenes using IE. It also only works if the web page has HTML tables, though it is very good at parsing the tables, dealing with odd HTML formatting like <br> tags, and importing into Excel cells. For multiple web query retrievals, have a single web query on a separate sheet and change its Connection property for each URL (see the sketch below). I've also seen it freeze or crash when many web queries are done; the solution is to clear the IE cache every 40-50 queries - there is code on the web to do this. Other methods (XMLhttp, IE automation, CreateDocumentFromURL) can be near instantaneous because they might be retrieving the data from the local cache instead of requesting it from the web site again.
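A sketch of the single-reusable-query idea - the sheet name is hypothetical, and the query must already exist as that sheet's first QueryTable:

Sub RefreshQueryForURL(ByVal url As String)
    ' Reuse one pre-built web query for many URLs
    With Worksheets("Query").QueryTables(1)   ' "Query" sheet is a placeholder
        .Connection = "URL;" & url
        .Refresh BackgroundQuery:=False       ' block until the data arrives
    End With
End Sub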

5) Loading the page as an XML document...
Do you mean an XMLhttp request? Or an XML DOMDocument? Very few web pages are XML documents with properly structured nodes.

Summary:

The fastest method is generally an XMLhttp request combined with the HTML Object Library. For a mega-fast technique which retrieves multiple pages simultaneously instead of sequentially, use these methods with multiple VBScript processes.
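For illustration, a minimal sketch of that combination, using late binding so no references are needed (the URL is a placeholder):

Sub ScrapeWithXMLhttp()
    Dim http As Object, doc As Object, tbl As Object
    Dim r As Long, c As Long

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "http://www.example.com/results", False   ' synchronous request
    http.send

    Set doc = CreateObject("htmlfile")         ' HTML Object Library document
    doc.body.innerHTML = http.responseText     ' parse the response into a DOM

    Set tbl = doc.getElementsByTagName("table")(0)   ' first table; adjust as needed
    For r = 0 To tbl.Rows.Length - 1
        For c = 0 To tbl.Rows(r).Cells.Length - 1
            ActiveSheet.Cells(r + 1, c + 1).Value = tbl.Rows(r).Cells(c).innerText
        Next c
    Next r
End Sub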

CreateDocumentFromURL (HTML Object Library method) is also fast but does some rendering of the web page. Advantage - you don't need XMLhttp.
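A sketch of that method too, assuming a reference to the Microsoft HTML Object Library (again, the URL is a placeholder):

Sub ScrapeWithCreateDocument()
    Dim doc As New MSHTML.HTMLDocument
    Dim page As MSHTML.HTMLDocument

    Set page = doc.createDocumentFromUrl("http://www.example.com/results", vbNullString)
    Do While page.readyState <> "complete"   ' wait for the (partial) render
        DoEvents
    Loop
    Debug.Print page.getElementsByTagName("table").Length & " tables found"
End Sub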

IE automation is the next best, but relatively slow because it renders the web page. However some web pages require rendering before you can retrieve the data so this might be the only method that works. And some sites generate pages based on input boxes, dropdown menus etc., which require IE automation. Combine IE automation with the HTML Object Library.

Excel web query - the slowest, but reliable and the easiest to use and write code for, provided the web page has HTML tables.
 
Hi John,

I know this post is kind of old, but it really interests me and I'd like to ask you for some deeper insights.
I am currently running some scripts to retrieve information from several websites. Until today I had no more than 1,000 pages to scan at a time, but I now find myself having to scan more than 50,000 pages, as I'd like to achieve a better and deeper level of information.

Where you specified that the fastest way is XMLhttp ("The fastest method is generally an XMLhttp request combined with the HTML Object Library. For a mega-fast technique which retrieves multiple pages simultaneously instead of sequentially, use these methods with multiple VBScript processes."), I'd like to ask what you mean by running multiple VBScript processes, and how exactly you retrieve multiple pages simultaneously instead of sequentially, because I think that is exactly my problem at the moment - waiting every time for the page to load...

Thanks in advance,
marco

Write a VBA program which creates and runs multiple VBScript (.vbs) processes. Each VBScript process retrieves data for a specific URL of the same web site using XMLhttp and writes the data directly to the Excel worksheet. See the following:

Multithreaded VBA – An Approach To Processing Using VBScript | Excel & VBA – Databison
Multi-threaded VBA - Excel Hero Blog
http://www.mrexcel.com/forum/excel-questions/593519-vbscript-extracting-html-tables-2.html
Multiprocess VBScript written in VBA with example workbook

The last post includes all the code and an example workbook for a specific use of the technique. The code includes a VBA-to-VBScript converter, allowing you to write and test the VBScript code in Excel VBA.
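To give a flavour of the technique - this is only a stripped-down sketch of the idea, not the tested code from the links (that code writes to the worksheet directly; here each script just saves its response to a temp file, and all paths are hypothetical):

Sub LaunchParallelDownloads(urls As Variant)
    ' Write one small .vbs file per URL and launch them all at once,
    ' so the XMLhttp downloads run in parallel
    Dim fso As Object, sh As Object, ts As Object
    Dim i As Long, vbsFile As String, outFile As String

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set sh = CreateObject("WScript.Shell")

    For i = LBound(urls) To UBound(urls)
        vbsFile = Environ$("TEMP") & "\scrape" & i & ".vbs"
        outFile = Environ$("TEMP") & "\scrape" & i & ".html"

        ' Each script fetches one URL with XMLhttp and saves the response
        Set ts = fso.CreateTextFile(vbsFile, True)
        ts.WriteLine "Set http = CreateObject(""MSXML2.XMLHTTP"")"
        ts.WriteLine "http.Open ""GET"", """ & urls(i) & """, False"
        ts.WriteLine "http.send"
        ts.WriteLine "Set fso = CreateObject(""Scripting.FileSystemObject"")"
        ts.WriteLine "Set f = fso.CreateTextFile(""" & outFile & """, True)"
        ts.WriteLine "f.Write http.responseText"
        ts.WriteLine "f.Close"
        ts.Close

        sh.Run "wscript.exe """ & vbsFile & """", 0, False   ' 0 = hidden, False = don't wait
    Next i
    ' The VBA side then polls for the output files and parses them
End Sub

Launching with False for WaitOnReturn is what makes the downloads overlap; the VBA-to-VBScript converter in the last link saves you from hand-writing the script strings like this.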
 
Hi Marco
Lately I've come across some web-scraping-tool websites (e.g. Kimono - "Turn websites into structured APIs from your browser in seconds" - no affiliation, and there may be others that do a better job for your purposes). You could potentially use their tools/services to create a neat HTML mega-table that scrapes in all your common-format data, which you could then link to with the XMLhttp method. Just a thought - it might be easier to manage.


 