Best method for scraping data from Web using VB macro?

dustybootz

New Member
Joined
Feb 19, 2013
Messages
3
This is something of a conceptual question rather than on the specifics of code (ie am I going about this the right way in general or is there a better technique I could use?). I think that my problem represents a broad issue affecting many of the inexperienced people who post on this forum so an overview and sharing of best practice would also help many people.

My aim is to scrape statistical data from a website (here is an exemplar page: www.racingpost.com/horses/result_home.sd?race_id=572318&r_date=2013-02-26&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS

I have a very basic (if you pardon the pun) knowledge of VB which I use through excel but know nothing about other programming languages or conventions (SQL, HTML, XML etc.), however I am quite good at writing code to manipulate strings- that is, once I can scrape the data, even if it is in a very noisy form then I am expert at processing it. I am trying to build an automated process that will scrape up to 1000 pages in one hit. In one form or another, I have been working on this for years and the last few weeks have been very frustrating in that I have come up with several new methods which have taken days of work but have each had one fatal flaw that has stopped my progress.

Here are the methods I have tried (all using a VB macro run from Excel):
1) Control Firefox (as a shell application) - this was the poorest, I found that I could not interact with Firefox properly using a VB excel macro- i tried mainly using keystrokes etc.
2) Get inner text, outer text, inner html or outer html from internet explorer (IE)- this method was by far the most reliable but the data was, at times, very hard to parse and did not always contain everything I needed (good for some applications but bad for others)
3) automated Copy and Paste from IE- this was tantalisingly close to being perfect but is given to throwing up inexplicable errors whereby the type of information copied to the clipboard differs depending on whether it is done manually (ie CTRL+A, CTRL+C) or through the automated process (with the former I could get the HTML structure- ie tables, with the latter only text). The enigma here was that I could get the automated copy/paste to give me the right info IF I FIRST CLICKED INSIDE THE IE WINDOW USING MOUSE POINTER- however I was unable to automate this using a VB MACRO (I tried sendkeys and various other methods)
4) By automating an excel webquery- I recorded a macro of a query, this worked flawlessly giving me the structure of tables I needed. Snag was it was very very slow- even for a single page it might take 14 to 16 seconds (some of the other methods I used were near instantaneous). Also this method appears to encounter severe lagging/crashing problems when many refreshes are done (that may be because I don't know how to update the queries with different criteria, or properly extinguish them)
5) Loading the page as an XML document- I am investigating this method now- I know next-to-nothing about XML but have a hunch the sort of pages I am scraping (see example above) are suitable for this. I have managed to load the pages as an XML object but at present seem to be running into difficulties trying to parse the structure (ie various nodes) to extract text- keep running into object errors.

For the record I have posted highly specific questions with code relating to these individual methods without response so I am trying a broader question. What is the experience of others here? Which of these methods should I focus on? (bear in mind I am trying to keep everything to Excel VB Macros). I am getting to the point where I might look to get someone to code something for me and pay them (as this is taking hundreds of hours) - have people had good experiences employing others to write ad hoc code in this manner?

Best wishes,
David
 
Last edited:

Some videos you may like

Excel Facts

Copy PDF to Excel
Select data in PDF. Paste to Microsoft Word. Copy from Word and paste to Excel.

John_w

MrExcel MVP
Joined
Oct 15, 2007
Messages
6,207
The best method is the one that works, is reliable and relatively fast. The method you choose depends on the specific web site and some methods may not work.

Here are the methods I have tried (all using a VB macro run from Excel):
1) Control Firefox (as a shell application) - this was the poorest, I found that I could not interact with Firefox properly using a VB excel macro- i tried mainly using keystrokes etc.
I wouldn't bother. SendKeys is unreliable. Anyway, Excel uses IE as its underlying browser, not Firefox.

2) Get inner text, outer text, inner html or outer html from internet explorer (IE)- this method was by far the most reliable but the data was, at times, very hard to parse and did not always contain everything I needed (good for some applications but bad for others)
Good, but only if you're using the MS HTML Object Library to find and parse HTML elements. Some people parse the whole outerHTML using VBA string functions like Instr, Mid, etc., which I think is more prone to web site design changes.

3) automated Copy and Paste from IE- this was tantalisingly close to being perfect but is given to throwing up inexplicable errors whereby the type of information copied to the clipboard differs depending on whether it is done manually (ie CTRL+A, CTRL+C) or through the automated process (with the former I could get the HTML structure- ie tables, with the latter only text). The enigma here was that I could get the automated copy/paste to give me the right info IF I FIRST CLICKED INSIDE THE IE WINDOW USING MOUSE POINTER- however I was unable to automate this using a VB MACRO (I tried sendkeys and various other methods)
Did you try IE.ExecWB to select all, and copy to the clipboard? Forget SendKeys and AppActivate; there is better code on the web for sending keystrokes and mouse clicks to a specific window, which VBA SendKeys can't do - see SendKeys.

4) By automating an excel webquery- I recorded a macro of a query, this worked flawlessly giving me the structure of tables I needed. Snag was it was very very slow- even for a single page it might take 14 to 16 seconds (some of the other methods I used were near instantaneous). Also this method appears to encounter severe lagging/crashing problems when many refreshes are done (that may be because I don't know how to update the queries with different criteria, or properly extinguish them)
This is the easiest method because no knowledge of HTML is needed, but also very slow, as you've seen, because I think it renders the web page behind the scenes using IE. It also only works if the web page has HTML tables, though it is very good at parsing the table, dealing with odd HTML formatting like < BR > tags and importing into Excel cells. For multiple web query retrievals, have a single web query on a separate sheet and change its Connection property for each URL. I've also seen it freezing or crashing when many web queries are done; the solution is to clear the IE cache every 40-50 queries - there is code on the web to do this. Other methods (XMLhttp, IE automation, CreateDocumentFromURL) can be near instantaneous because they might be retrieving the data from the local cache instead of requesting it from the web site again.

5) Loading the page as an XML document- I am investigating this method now- I know next-to-nothing about XML but have a hunch the sort of pages I am scraping (see example above) are suitable for this. I have managed to load the pages as an XML object but at present seem to be running into difficulties trying to parse the structure (ie various nodes) to extract text- keep running into object errors.
Do you mean XMLhttp request? Or XMLhttp DOMdocument? Very few web pages are XML documents with properly structured nodes.

Summary:

The fastest method is generally an XMLhttp request combined with the HTML Object Library. For a mega-fast technique which retrieves multiple pages simultaneously instead of sequentially, use these methods with multiple VBScript processes.

CreateDocumentFromURL (HTML Object Library method) is also fast but does some rendering of the web page. Advantage - you don't need XMLhttp.

IE automation is the next best, but relatively slow because it renders the web page. However some web pages require rendering before you can retrieve the data so this might be the only method that works. And some sites generate pages based on input boxes, dropdown menus etc., which require IE automation. Combine IE automation with the HTML Object Library.

Excel web query - the slowest, but also the easiest to use if the web page has HTML tables, write code for and reliable.
 

marcowaf1

New Member
Joined
Feb 21, 2014
Messages
1
Hi John,

I know the post is kind of old but it really interested me and I'd like to ask you some deeper insights.
I am actually runninng some scripts to retrieve infos from several websites, if up to today I had no more than a 1000 pages to scan at a time, I today find my self in the position of having to look at more than 50.000 pages to scan as I'd like to achieve a better and deeper information level.

Over where you specified the fastest way is XMLHTTP... ("The fastest method is generally an XMLhttp request combined with the HTML Object Library. For a mega-fast technique which retrieves multiple pages simultaneously instead of sequentially, use these methods with multiple VBScript processes.") I'd like to ask you what you mean by running multiple vbscripts processes and how exactelly do you retrieve multiple pages simoultaneosly instead of subsequentially, because I think that is exactelly my problem at the time, waiting everytime the page to charge...

Thanks in advance,
marco














The best method is the one that works, is reliable and relatively fast. The method you choose depends on the specific web site and some methods may not work.


I wouldn't bother. SendKeys is unreliable. Anyway, Excel uses IE as its underlying browser, not Firefox.


Good, but only if you're using the MS HTML Object Library to find and parse HTML elements. Some people parse the whole outerHTML using VBA string functions like Instr, Mid, etc., which I think is more prone to web site design changes.


Did you try IE.ExecWB to select all, and copy to the clipboard? Forget SendKeys and AppActivate; there is better code on the web for sending keystrokes and mouse clicks to a specific window, which VBA SendKeys can't do - see SendKeys.


This is the easiest method because no knowledge of HTML is needed, but also very slow, as you've seen, because I think it renders the web page behind the scenes using IE. It also only works if the web page has HTML tables, though it is very good at parsing the table, dealing with odd HTML formatting like < BR > tags and importing into Excel cells. For multiple web query retrievals, have a single web query on a separate sheet and change its Connection property for each URL. I've also seen it freezing or crashing when many web queries are done; the solution is to clear the IE cache every 40-50 queries - there is code on the web to do this. Other methods (XMLhttp, IE automation, CreateDocumentFromURL) can be near instantaneous because they might be retrieving the data from the local cache instead of requesting it from the web site again.

Do you mean XMLhttp request? Or XMLhttp DOMdocument? Very few web pages are XML documents with properly structured nodes.

Summary:

The fastest method is generally an XMLhttp request combined with the HTML Object Library. For a mega-fast technique which retrieves multiple pages simultaneously instead of sequentially, use these methods with multiple VBScript processes.

CreateDocumentFromURL (HTML Object Library method) is also fast but does some rendering of the web page. Advantage - you don't need XMLhttp.

IE automation is the next best, but relatively slow because it renders the web page. However some web pages require rendering before you can retrieve the data so this might be the only method that works. And some sites generate pages based on input boxes, dropdown menus etc., which require IE automation. Combine IE automation with the HTML Object Library.

Excel web query - the slowest, but also the easiest to use if the web page has HTML tables, write code for and reliable.
 

John_w

MrExcel MVP
Joined
Oct 15, 2007
Messages
6,207
Over where you specified the fastest way is XMLHTTP... ("The fastest method is generally an XMLhttp request combined with the HTML Object Library. For a mega-fast technique which retrieves multiple pages simultaneously instead of sequentially, use these methods with multiple VBScript processes.") I'd like to ask you what you mean by running multiple vbscripts processes and how exactelly do you retrieve multiple pages simoultaneosly instead of subsequentially, because I think that is exactelly my problem at the time, waiting everytime the page to charge...
Write a VBA program which creates and runs multiple VBScript (.vbs) processes. Each VBScript process retrieves data for a specific URL of the same web site using XMLhttp and writes the data directly to the Excel worksheet. See the following:

Multithreaded VBA – An Approach To Processing Using VBScript | Excel & VBA – Databison
Multi-threaded VBA - Excel Hero Blog
http://www.mrexcel.com/forum/excel-questions/593519-vbscript-extracting-html-tables-2.html
Multiprocess VBScript written in VBA with example workbook

The last post includes all the code and example workbook for a specific use of the technique. The code includes a VBA to VBScript converter, allowing you to write and test the VBScript code in Excel VBA.
 

Cadoova

New Member
Joined
Apr 30, 2014
Messages
1
Hi Marco
Lately I've come across some web-scraping-tool websites (eg. kimono : Turn websites into structured APIs from your browser in seconds - no affiliation, and there may be others that do a better job for your purposes). You could potentially use their tools/services to create a neat html mega-table that scrapes in all your common format data, that you can then link to using the xmlhttp method. Just a thought - it might be an easier to manage.


Hi John,

I know the post is kind of old but it really interested me and I'd like to ask you some deeper insights.
I am actually runninng some scripts to retrieve infos from several websites, if up to today I had no more than a 1000 pages to scan at a time, I today find my self in the position of having to look at more than 50.000 pages to scan as I'd like to achieve a better and deeper information level.

Over where you specified the fastest way is XMLHTTP... ("The fastest method is generally an XMLhttp request combined with the HTML Object Library. For a mega-fast technique which retrieves multiple pages simultaneously instead of sequentially, use these methods with multiple VBScript processes.") I'd like to ask you what you mean by running multiple vbscripts processes and how exactelly do you retrieve multiple pages simoultaneosly instead of subsequentially, because I think that is exactelly my problem at the time, waiting everytime the page to charge...

Thanks in advance,
marco
 

Watch MrExcel Video

Forum statistics

Threads
1,095,146
Messages
5,442,639
Members
405,190
Latest member
KFrank

This Week's Hot Topics

  • Copy entire row if CountA <>0 to another sheet
    [B]I want to copy entire row if CountA <>0 for column J7:AM7 (headers on J6:AM6) and so on till the last used cell is column D and paste the...
  • Select last used Row in Table
    I have created a Table in a Worksheet which is locked to prevent user errors and protect formula. Some of the cells require freetext entries which...
  • excel workbook: do not allow certain file name
    Hello all, Don't think this has ever been asked before, but how do I restrict file save [Before_Save Event] if the name of the file being saved...
  • fixing problem autofilter
    hello i need help about my code when i search by code in textbox it doesn't show anything this is my data [ATTACH type="full"...
  • “Weight”
    Hi, i’ve got a long sheet filled with weights such as kg,g,L & ml. i can build a formula to convert kg into g and liter into ml. How ever, my...
  • How to capitalize everything before a certain character?
    In column A, I have some text: Hello good day.mp3 Hello good day.flac etc. I'd like to capitalize everything before the period. I don't need the...
Top