Web Scraper

Sharid

Well-known Member
Joined
Apr 22, 2007
Messages
1,064
Office Version
  1. 2016
Platform
  1. Windows
I have a web scraper that works, it extracts URLs from Google, I need to update it so it can scrape emails as well as the urls.

I have 2 sheets Data and Keywords, a KEYWORD is e.g. "Taxi Base" is placed in sheet2 "Keywords" cell C3 and that is what is searched in Google, currently I have set it to search only first 5 pages with a 7 second delay, not 100% sure if the delay is 1 to 7 seconds or every 7 seconds, could someone please advise.

I NEED the emails to go into row 2 column B and Urls in Row 2 Column C. If NO EMAILS are found it leaves that cell in column B blank put places the urls in C

VBA Code:
Private Sub CommandButton1_Click()

[COLOR=rgb(26, 188, 156)][B]'URL SCRAPER[/B][/COLOR] 
Dim ie As Object
    Dim htmlDoc As Object
    Dim nextPageElement As Object
    Dim div As Object
    Dim link As Object
    Dim url As String
    Dim pageNumber As Long
    Dim i As Long
    
 [COLOR=rgb(26, 188, 156)]   ' Takes seach from C3 and places it into google[/COLOR]
    url = "https://www.google.co.uk/search?q=" & Replace(Worksheets("Keywords").Range("C3").Value, " ", "+")

    Set ie = CreateObject("InternetExplorer.Application")
    
    With ie
        .Visible = True
        .navigate url
        Do While .Busy Or .readyState <> 4
            DoEvents
        Loop
    End With

    Application.Wait Now + TimeSerial(0, 0, 5)
     Set htmlDoc = ie.document

 [COLOR=rgb(26, 188, 156)] ' Searches URLS and places them in ROW 2 Column C[/COLOR]
    pageNumber = 1
    i = 2[COLOR=rgb(26, 188, 156)] 'This is Row 2[/COLOR]
    Do
        For Each div In htmlDoc.getElementsByTagName("div")
            If div.getAttribute("class") = "r" Then
                Set link = div.getElementsByTagName("a")(0)
                Cells(i, 3).Value = link.getAttribute("href") [COLOR=rgb(26, 188, 156)]'This bit seaches the href (URL), I need it to do urls and emails[/COLOR]
                i = i + 1
            End If
        Next div
[COLOR=rgb(26, 188, 156)]       'Currently only set to search the first 5 pages on GOOGLE[/COLOR]
        If pageNumber >=[B] 5[/B] Then Exit Do[COLOR=rgb(26, 188, 156)] [/COLOR]
        Set nextPageElement = htmlDoc.getElementById("pnnext")
        If nextPageElement Is Nothing Then Exit Do
        
        [COLOR=rgb(26, 188, 156)]' Clicks web next page every 7 seconds [/COLOR]
        nextPageElement.Click 'next web page
        Do While ie.Busy Or ie.readyState <> 4
            DoEvents
        Loop
        Application.Wait Now + TimeSerial(0, 0, 7)[COLOR=rgb(26, 188, 156)] 'NOT sure if this is between 1 to 7 seconds or every 7 seconds[/COLOR]
        Set htmlDoc = ie.document
        pageNumber = pageNumber + 1
    Loop

ie.Quit
    Set ie = Nothing
    Set htmlDoc = Nothing
    Set nextPageElement = Nothing
    Set div = Nothing
    Set link = Nothing

 MsgBox "All Done"
End Sub

Thanks for having a look
 

Attachments

  • Screenshot.jpg
    Screenshot.jpg
    141.8 KB · Views: 8

Excel Facts

When they said...
When they said you are going to "Excel at life", they meant you "will be doing Excel your whole life".
please ignore all the following on the code above COLOR=rgb(26, 188, 156)
 
Upvote 0
Hi
I have made a few changes to my code as google would stop the search after 10 pages as it was dedecting a bot. I am having the following problems with the new search engine.

1) I can not get it to paginate as before with Google
2) Extracting the URL

Rich (BB code):
Private Sub CommandButton1_Click()
'URL SCRAPER
Dim IE As Object
    Dim HTMLdoc As Object
    Dim nextPageElement As Object
    Dim div As Object
    Dim link As Object
    Dim url As String
    Dim pageNumber As Long
    Dim i As Long
   
    ' Takes seach from C3 and places it into search engine
    url = "https://www.startpage.com/search?q=" & Replace(Worksheets("Keywords").Range("C3").Value, " ", "+") ' NEW Search engine removed Google
    url = "https://www.google.co.uk/search?q=" & Replace(Worksheets("Keywords").Range("C3").Value, " ", "+") ' For Google
    Set IE = CreateObject("InternetExplorer.Application")
   
    With IE
        .Visible = True
        .navigate url
        Do While .Busy Or .readyState <> 4
            DoEvents
        Loop
    End With

    Application.Wait Now + TimeSerial(0, 0, 5)
   
    Set HTMLdoc = IE.document
IE.document.parentWindow.Scroll 0&, 2000&
    ' Searches URLS and places them in ROW 2 Column C
    pageNumber = 1
    i = 2
    Do
        For Each div In HTMLdoc.getElementsByTagName("div")
           ' If div.getAttribute("class") = "r" Then ' this was for Google
            If div.getAttribute("class") = "w-gl__result-url" Then
                Set link = div.getElementsByTagName("a")(0)
                Cells(i, 3).Value = link.getAttribute("href")
                i = i + 1
            End If
        Next div
        If pageNumber >= 3 Then Exit Do 'the first 3 pages, i would like to get the amount of pages from a cell e.g. sheet1 "Data" D1
        'Set nextPageElement = htmlDoc.getElementById("pnnext") ' this was for Google
         Set nextPageElement = HTMLdoc.getElementByClassName("desktop-arrow desktop-arrow__next")
       
        If nextPageElement Is Nothing Then Exit Do
     
        ' Set it to scroll the page so it looks like it is a person and not a bot, added a 10 second time deal, can increase this, would like to get this from a sheet D2           Clicks web next page
        IE.document.parentWindow.Scroll 0&, 99999
        Application.Wait Now + TimeSerial(0, 0, 10)
       
        nextPageElement.Click [B]'Click next web page[/B]
        Do While IE.Busy Or IE.readyState <> 4
            DoEvents
        Loop
       'Added another page delay of 10 seconds
        Application.Wait Now + TimeSerial(0, 0, 10)
        Set HTMLdoc = IE.document
        pageNumber = pageNumber + 1
    Loop
  
IE.Quit
    Set IE = Nothing
    Set HTMLdoc = Nothing
    Set nextPageElement = Nothing
    Set div = Nothing
    Set link = Nothing

MsgBox "All Done"
End Sub

Thanks for having a look
 
Upvote 0
Hi

I have managed to sort out how to get the data from a cell e.g. how many pages to scrape and delay time. But I have had to go back to google search as I can not resolve the other issues

Gets page number to scrape from Sheet "Data" D1
VBA Code:
  If pageNumber >= Replace(Worksheets("Data").Range("D1").Value, " ", "+") Then Exit Do

For the two delays
Gets delay time in seconds from Sheet "Data" D2
VBA Code:
Application.Wait Now + TimeSerial(0, 0, Replace(Worksheets("Data").Range("D2").Value, " ", "+"))

Gets delay time in seconds from Sheet "Data" D3
VBA Code:
Application.Wait Now + TimeSerial(0, 0, Replace(Worksheets("Data").Range("D3").Value, " ", "+"))

If someone could help, that would be super.

Thanks
 
Upvote 0

Forum statistics

Threads
1,213,538
Messages
6,114,218
Members
448,554
Latest member
Gleisner2

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top