Web Scraping with IE

Comake

New Member
Joined
Sep 17, 2014
Messages
17
Hi,

I'm making an Excel sheet that uses a lot of web scraping to pull the data from "td" tags within "tr". Sometimes there are more than 80 "tr" tags with 4 "td" tags in each. While it pulls it in it places the "td" tags in the same row, then goes to the next row for the next "tr" tag.

The code works fine, probably isn't close to being proper coding because I'm fairly new. My issue becomes is when IE runs into the "Automation error". Which, from what I can tell, is related to the fact that IE isn't quite ready before trying to run.

Here's the code (edited slightly to hide some personal information)

Code:
Sub Button1_Click()
Dim IE As New InternetExplorer
Dim shellWins As ShellWindows
Dim tr_data, td_data As IHTMLElement
Dim website_number, row_number, column_number, td_number, page_count As Integer
Set IE = CreateObject("internetexplorer.application")
website_number = 1
row_number = 1
column_number = 1
td_number = 1


For website_number = 1 To 500


IE.Visible = False
IE.navigate Sheets("Sheet2").Range("A" & website_number).Value


 Do
 On Error GoTo ErrorHandler
     If Not IE.Busy And IE.readyState = 4 Then
        Application.Wait (Now + TimeValue("00:00:01"))
        If Not IE.Busy And IE.readyState = 4 Then
           Exit Do
        End If
     End If
     DoEvents
   Loop




For Each tr_data In IE.document.getElementById("table").getElementsByTagName("tr")
For Each td_data In tr_data.getElementsByTagName("td")
Sheets("Sheet1").Cells(row_number, column_number).Value = Trim(td_data.innerText)
column_number = column_number + 1
td_number = td_number + 1
Next td_data
column_number = 1
td_number = 1
row_number = row_number + 1
Next tr_data


website_number = website_number + 1


Next website_number


ErrorHandler:
Set IE = Nothing
Set IE = CreateObject("internetexplorer.application")
IE.Visible = False
IE.navigate Sheets("Sheet2").Range("A" & website_number).Value
Resume Next


End Sub

On Sheet2 it contains lists of websites that I want to pull the information from.

I've found help related to this with IE, but none of it seems to work exactly. For the Automation error is usually is around the Do...Loop that's running to wait for IE. I put in the ErrorHandling to see if it could clear it up by recreating the InternetExplorer object, but it ends up hanging after about 100 websites.

If there's anything, I'm more than appreciative for the help.

Comake
 

Excel Facts

Whats the difference between CONCAT and CONCATENATE?
The newer CONCAT function can reference a range of cells. =CONCATENATE(A1,A2,A3,A4,A5) becomes =CONCAT(A1:A5)

Kyle123

Well-known Member
Joined
Jan 24, 2012
Messages
2,761
I'd skip the Internet Explorer step altogether, it's not necessary and slows the process down. I've re-written your code for a single url for you to try, but if you uncomment the green bits then it should work for your full list:
Code:
Sub test()


    Const url = "http://www.defender.com/category.jsp?path=-1|7504|2290217&id=2290219"
    
    Dim HTMLRow As Object, HTMLCell As Object
    
    Dim HTMLDocument As Object: Set HTMLDocument = CreateObject("htmlFile")
    Dim Request As Object: Set Request = CreateObject("winhttp.winhttprequest.5.1")
    
    Dim RowCounter As Long, ColumnCounter As Long
    
    RowCounter = 1: ColumnCounter = 1
    
    
    'For your data
    'Dim urls As Variant
    'Dim currentUrl As Variant
    'With Sheets("sheet2").Range("a1").CurrentRegion
    '   urls = .Resize(.Rows.Count, 1).Value
    'End With
    
     'For Each currentUrl In urls
    
        With Request
            .Open "GET", url, False 'Comment this line out for multiple urls
            '.Open "GET", currentUrl, False
            .send
            HTMLDocument.body.innerHtml = .responseText
        End With
        
        For Each HTMLRow In HTMLDocument.GetElementById("tableSorter").Rows
            For Each HTMLCell In HTMLRow.Cells
                If Len(Replace(HTMLCell.innertext, " ", "")) > 0 Then
                    Sheets("Sheet1").Cells(RowCounter, ColumnCounter).Value = HTMLCell.innertext
                    ColumnCounter = ColumnCounter + 1
                End If
            Next HTMLCell
        
            RowCounter = RowCounter + 1
            ColumnCounter = 1
        Next HTMLRow
    
     'Next currentUrl


End Sub
 
Last edited:
Upvote 0

Comake

New Member
Joined
Sep 17, 2014
Messages
17
Perfect! It works really well, except after it hits about 200 items it comes up with this: "Run-time error '91': Object variable or With block variable not set" in relation to the "For Each HTMLRow In HTMLDocument.GetElementById("tableSorter").Rows" line. Do you know what could cause this? Could it just be it running faster than the Request can handle? Maybe even that particular website could be messing it up?

I created a test ErrorHandler just to resume the next line, and it continues to run and get about 18000 items. So I'm leaning towards the particular URL is messing it up, but I'd like your input on it.

Thank you so much for your help! I had no idea how to do a non-IE way.
 
Upvote 0

Kyle123

Well-known Member
Joined
Jan 24, 2012
Messages
2,761
I suspect that the table doesn't exist in that particular url, it may be a dead link or no results returned etc - probably worth checking the offending url manually
 
Upvote 0

Forum statistics

Threads
1,190,767
Messages
5,982,822
Members
439,799
Latest member
matts12

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Top