Email Address Scrape from A Website

ross88guy

New Member
Joined
Dec 8, 2010
Messages
17
Hi All,

I am trying to learn how to navigate web pages using internet explorer and VBA in Excel.

What I would like to do is to go to the following web page (allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs), navigate to each 'more info' page and download all the email addresses from the 'more details' pages into a sheet (there is only one in each of the 'more details' tabs)

I would then like the macro to click the number 2 on the page allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs, to display the next 10 items and then the macro needs to go to each of their 'more info' pages and download the email address from each page.

I have made a few attempts but I have made a complete mess of the code and nothing seems to work.

If someone could write the first part of the code so that I can have a look and learn from it then that would be awesome

Regards

Ross
 

Excel Facts

How to total the visible cells?
From the first blank cell below a filtered data set, press Alt+=. Instead of SUM, you will get SUBTOTAL(9,)
Ross

What code do you have so far?
 
Upvote 0
Thanks for the quick reply. I have tried altering something that I found on the web to go to the initial URL and then find the button that has the name 'more-detail-button.png' but even this simple code doesnt seem to want to run!
Code:
Sub emails1()
 
    Dim IE As InternetExplorer
    Dim RegEx As Variant, RegMatch As Variant
    Dim MyStr As String
     
    Set IE = New IE
    Set RegEx = New RegExp
     
    
    IE.Navigate "[URL]http://www.allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs/[/URL]"
    Do Until IE.ReadyState = READYSTATE_COMPLETE
    Loop
     
     'String to parse google search for a VBAX reference
    With RegEx
        .Pattern = "wp-content/themes/ayntt/images/more-detail-button.png"
        .MultiLine = True
    End With
     
     'return text from google page
    MyStr = IE.Document.body.innertext
    Set RegMatch = RegEx.Execute(MyStr)
     
     'If a match to our RegExp searchstring is found then launch this page
    If RegMatch.Count > 0 Then
        IE.Navigate RegMatch(0)
        Do Until IE.ReadyState = READYSTATE_COMPLETE
        Loop
        MsgBox "Loaded VBAX link"
         'show internet explorer
        IE.Visible = True
    Else
        MsgBox "No VBAX link found"
    End If
     
    Set RegEx = Nothing
    Set IE = Nothing
End Sub
 
Upvote 0
You need to start with a new instance of Internet Explorer.

I think this is the code for it when early binding*, which I assume you are.
Rich (BB code):
Set IE = New InternetExplorer.Application
I normally use this.
Rich (BB code):
Set IE = CreateObject("InternetExplorer.Application")
I honestly don't understand the rest of the code, especially the use of RegExp.

This is how I would get started:
Rich (BB code):
Option Explicit
Sub TriClubEmails()
Dim IE As Object
Dim doc As Object
Dim nxtpage As Object
Dim email As Object
Dim strURL As String
    strURL = "http://www.allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs/"
    Set IE = CreateObject("InternetExplorer.Application")
    With IE
        .navigate strURL
        .Visible    ' optional
        Do While .Busy: DoEvents: Loop
        Do While .ReadyState <> 4: DoEvents: Loop
        Set doc = .document
 
    End With
 
    Set IE = Nothing
End Sub
This will get you a reference to the page's document, and from there you can find the links for each club and also the link to goto the next page.

Once you've got the link for a club you can follow the link and then get the email address.

I've only got as far as opening the page and getting the document reference.

*A loose definitiion of 'early binding' is you add a reference to the application you are automating, in this case InternetExplorer.

With 'late binding' you don't have the reference so the code is slightly different.

There are advantages/disadvantages to both, I'm sure you'll find a fuller and more accurate description elsewhere.
 
Upvote 0
Thanks for this! I have tried to find the various urls to navigate to from that main page but I cannot think how to find them? In the page source each has a button graphic associated with them so could I use this to somehow get the link that the macro needs to click?

Much appreciated

Ross



Code:
Option Explicit
Sub TriClubEmails()
Dim IE As Object
Dim doc As Object
Dim nxtpage As Object
Dim email As Object
Dim strURL As String
    strURL = "http://www.allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs/"
    Set IE = CreateObject("InternetExplorer.Application")
    With IE
        .navigate strURL
        .Visible = True
        Do While .Busy: DoEvents: Loop
        Do While .ReadyState <> 4: DoEvents: Loop
        Set doc = .Document
 
    End With
    Dim elementcol As Object
    Set elementcol = strURL.Document.getElements    '?????????

For Each btnInput In elementcol
    If btnInput.Value = "Submit" Then
        btnInput.Click
        Exit For
    End If
Next btnInput
 
    'Set IE = Nothing
End Sub
 
Upvote 0
strURL is just the URL for the page, use doc which is a reference to the page's document.

You can get all the elements of a certain type using something like this:
Rich (BB code):
' get the collection of all the DIV elements on the page
Set divs = doc.GetelementsByTagName("DIV")

This is how I used that method to get a list of all the clubs and the link.

I've actually taken the link from the name rather than the button, but they are the same.
Rich (BB code):
Option Explicit
Sub TriClubEmails()
Dim IE As Object
Dim doc As Object
Dim nxtpage As Object
Dim club As Object
Dim detail As Object
Dim email As Object
Dim strURL As String
Dim div As Object
Dim rng As Range
    strURL = "http://www.allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs/"
    Set rng = Range("A1")
    Set IE = CreateObject("InternetExplorer.Application")
    With IE
        .navigate strURL
        .Visible = True   ' optional
        Do While .Busy: DoEvents: Loop
        Do While .ReadyState <> 4: DoEvents: Loop
        Set doc = .document

        For Each div In doc.GetelementsByTagName("DIV")
            Select Case div.classname
                Case "dlistcolA"
                
                    Set club = div.GetelementsByTagName("A")(0)
                    rng.Value = club.OuterText
                    rng.Offset(, 1).Value = club.href
                    Set rng = rng.Offset(1)
                Case "dirpage"
                    Set nxtpage = div
            End Select
        Next div
        
    End With
    
    Set IE = Nothing
End Sub
That's really just a start, you would need to decide how you want to deal with all the links - I think there's quite a few.

One though I had was to go through and get all the links in a array then have code that follows each link and gets the email.

PS I also have a bit in that code for getting the link to the next page but I've not completed it.
 
Upvote 0
Thanks for the hints so far!

I have expanded on your clues and have now managed to create something that collates all the pages where the emails that I want to pull out lie. However, I have no idea how to use an array to store these variables and then access each one to find and download the email. Would you be able to give me some clues please! :)

Regards

Ross



Code:
Option Explicit
Sub TriClubEmails()
Dim IE As Object
Dim doc As Object
Dim nxtpage As Object
Dim club As Object
Dim detail As Object
Dim email As Object
Dim strURL As String
Dim div As Object
Dim rng As Range
Dim i As Integer

    i = 1
    For i = 1 To 17



        If i = 1 Then
        strURL = "http://www.allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs/"
        Else
        strURL = "http://www.allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs/?sortA=&sortC=&typclk=&prt=" & i
        End If
        
        If Range("a1") = 0 Then
        Set rng = Range("a1")
        Else
        Set rng = Range("A1").End(xlDown).Offset(1, 0)
        End If
        
        If IE Is Nothing Then
        Set IE = CreateObject("InternetExplorer.Application")
        Else
        End If
        With IE
            .navigate strURL
            .Visible = True   ' optional
            Do While .Busy: DoEvents: Loop
            Do While .ReadyState <> 4: DoEvents: Loop
            Set doc = .document
    
        
            For Each div In doc.GetelementsByTagName("DIV")
                Select Case div.classname
                    Case "dlistcolA"
                    
                        Set club = div.GetelementsByTagName("A")(0)
                        rng.Value = club.OuterText
                        rng.Offset(, 1).Value = club.href
                        Set rng = rng.Offset(1)
                        'MsgBox (rng.Value)
                    Case "dirpage"
                        Set nxtpage = div
                End Select
                
            Next div
            
        End With
        
    Next i
    
End Sub
 
Upvote 0
ross

So this code gets all the links for the clubs in column B of a worksheet?

If it is doing that then you don't need an array, that was just one idea I had.

You can loop through the links in column B, navigate to the detail page of each club and grab the email.

I only had a quick look at the detail page(s) and it should be quite straightforward.

Bit late for me to look at it now but I think that's how I would approach it.


PS You don't need to set i to 1 before the loop.
 
Upvote 0
Thank for all of your help so far! <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:eek:ffice:eek:ffice" /><o:p></o:p>
<o:p></o:p>
I have almost managed to get everything working but am running into a small problem. <o:p></o:p>
<o:p></o:p>
Currently, the macro looks at each cell in column B for the URL, navigates to this URL, finds the email address and dumps it into column C. For some reason the macro doesn’t seem to be inserting an email for column C for each corresponding url in column B. <o:p></o:p>
<o:p></o:p>
I have checked and there is an email existing on each Url so I am stuck! <o:p></o:p>
<o:p></o:p>
The problem is with the second sub (findemails)<o:p></o:p>

Code:
Option Explicit
 

Sub findurls()
Dim IE As Object
Dim doc As Object
Dim nxtpage As Object
Dim club As Object
Dim detail As Object
Dim email As Object
Dim strURL As String
Dim div As Object
Dim rng As Range
Dim i As Integer
Dim email1 As Object
Dim rows1 As Integer
    For i = 1 To 17
 
        If i = 1 Then
        strURL = "[URL]http://www.allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs/[/URL]"
        Else
        strURL = "[URL]http://www.allyouneedtotri.com/directory/listing/clubs-and-training-venues/159/triathlon-clubs/?sortA=&sortC=&typclk=&prt[/URL]=" & i
        End If
        
        If Range("a1") = 0 Then
        Set rng = Range("a1")
        Else
        Set rng = Range("A1").End(xlDown).Offset(1, 0)
        End If
        
        If IE Is Nothing Then
        Set IE = CreateObject("InternetExplorer.Application")
        Else
        End If
        With IE
            .navigate strURL
            .Visible = False   ' optional
            Do While .Busy: DoEvents: Loop
            Do While .readyState <> 4: DoEvents: Loop
            Set doc = .document
    
        
            For Each div In doc.GetelementsByTagName("DIV")
                Select Case div.classname
                    Case "dlistcolA"
                    
                        Set club = div.GetelementsByTagName("A")(0)
                        rng.Value = club.OuterText
                        rng.Offset(, 1).Value = club.href
                        Set rng = rng.Offset(1)
                        'MsgBox (rng.Value)
                    Case "dirpage"
                        Set nxtpage = div
                End Select
                
            Next div
            
        End With
    Next i
End Sub
 
 
 
 

Sub findemails()
Dim IE  As Object
Dim doc As Object
Dim strURL As String
Dim div As Object
Dim rng As Range
Dim i As Integer
Dim email1 As Object
Dim rows1 As Integer
 
    rows1 = Range(Range("a1"), Range("a1").End(xlDown)).rows.Count
    i = 0
    For i = 0 To rows1
    Set rng = Range("b1").Offset(i, 0)
    
    If IE Is Nothing Then
        Set IE = CreateObject("InternetExplorer.Application")
        Else
        End If
    
    With IE
        .navigate rng
        .Visible = False
        
        IESTATUS IE
        Application.Wait Now() + TimeValue("00:00:02")
        
        Set doc = .document
        
        For Each div In doc.GetelementsByTagName("DIV")
        
            Select Case div.classname
            Case "detcot"
                Set email1 = div.GetelementsByTagName("A")(0)
                If email1 Is Nothing = False Then
                rng.Offset(0, 1) = ExtractEmailAddress(email1.href)
                Else
                End If
            End Select
        Next div
    End With
    Next i
    
End Sub
 

Function ExtractEmailAddress(s As String) As String
    Dim AtSignLocation As Long
    Dim i As Long
    Dim TempStr As String
    Const CharList As String = "[A-Za-z0-9._-]"
    
    'Get location of the @
    AtSignLocation = InStr(s, "@")
    If AtSignLocation = 0 Then
        ExtractEmailAddress = "" 'not found
    Else
        TempStr = ""
        'Get 1st half of email address
        For i = AtSignLocation - 1 To 1 Step -1
            If Mid(s, i, 1) Like CharList Then
                TempStr = Mid(s, i, 1) & TempStr
            Else
                Exit For
            End If
        Next i
        If TempStr = "" Then Exit Function
        'get 2nd half
        TempStr = TempStr & "@"
        For i = AtSignLocation + 1 To Len(s)
            If Mid(s, i, 1) Like CharList Then
                TempStr = TempStr & Mid(s, i, 1)
            Else
                Exit For
            End If
        Next i
    End If
    'Remove trailing period if it exists
    If Right(TempStr, 1) = "." Then TempStr = _
       Left(TempStr, Len(TempStr) - 1)
    ExtractEmailAddress = TempStr
End Function
Private Sub IESTATUS(ByVal IE As Object)
     
    Dim lngFrames As Long
ErrH:
    Err.Clear: On Error GoTo 0: On Error GoTo -1: On Error GoTo ErrH
    Do While IE.Busy = True: Loop
        Do Until IE.readyState = 4: Loop
            'For lngFrames = 0 To IE.document.frames.Length - 1
                'Do Until IE.document.frames(lngFrames).document.readyState = "complete": Loop
                'Next lngFrames
                 
            End Sub
 
Upvote 0
Does every club have an email address on the details page?
 
Upvote 0

Forum statistics

Threads
1,215,219
Messages
6,123,684
Members
449,116
Latest member
HypnoFant

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top