Excel vba scraper - guidance part II

Anka

New Member
Joined
Oct 20, 2012
Messages
45
Office Version
  1. 2016
  2. 2010
Platform
  1. Windows
Hi all,

I'm start from here , but seems that was the easiest part...
In the second part, I try to take the data from all rows that corresponding with the Bet365 bookmaker (for example), from this page.

I'm able, do that but my vba is very slow.
Is there a way to do the task without IE as I have hundreds of similar links and using ie is slower?
I need to use another method to do this job.
This is an example of link that I need to scrape.
How difficult is to scrape ajax sites because this site have this kind of pages (just for the odds)?

And the final result have to be something like this.
image.jpg


Thank you in advance.
 
Last edited:
Thanks again for your hard workJohn_w, and thanks for your suggestion, ScottR.

If I may, but I do not know how difficult it is the achievement.
I saw that odds are added dynamically (I think this is the right term and I want to apologize because I'm not familiar with these terms) in site.
From what I understand, the site has a fixed part where we find the basic data as Country, League, Date & Time, Home, Away, Score, Halfs and the dynamic part were the odds & bookmaker are added to every match.

From this link (because I think that don't have any problem to be accessed) we can scrape the fixed data of any match (Country, League, Date & Time, Home, Away, Score, Halfs)
and the odds from the ajax pages related on each matches.

For example, for the link above

1X2 Odds from this page
Over/Under Odds from this page
Asian Handicap Odds from this page
Draw No Bet Odds from this page
Double Chance from this page
Both Team to Score from this page

Maybe it's a stupid thing what I say, I do not know ...

1X2 Odds ina ajax page

Code:
<a href=\"\/[COLOR=#ff0000]bookmaker[/COLOR]\/11\/http:\/\/www.[COLOR=#ff0000]bet365[/COLOR].com\/?affiliate=365_021894\" target=\"_blank\" data-stream-bid=\"11\"><i class=\"icon icon__stream\"><\/i><\/a><\/td>\n<td class=\"table-main__odds table-main__odds--first\" [COLOR=#ff0000]data-odd=\"2.60[/COLOR]\" data-created=\"15,05,2016,17,22\"><i class=\"icon icon__increasing table-main__icon\"><\/i> <span class=\"table-main__odds--hasarchive\" title=\"Click to see odds movements\" *******=\"arch_odds(this, '27496xv464x0x4kkj0', 16);\"><\/span><\/td>\n<td class=\"table-main__odds\" [COLOR=#ff0000]data-odd=\"3.50[/COLOR]\" data-created=\"09,05,2016,10,58\"><span><\/span><\/td>\n<td class=\"table-main__odds\" [COLOR=#ff0000]data-odd=\"2.60[/COLOR]\" data-created=\"15,05,2016,17,22\">
 
Last edited:
Upvote 0

Excel Facts

Remove leading & trailing spaces
Save as CSV to remove all leading and trailing spaces. It is faster than using TRIM().
This requests the match result page then the associated Over/Under odds page using WinHttpRequest, so you need a reference to Microsoft WinHTTP Services version 5.1 and Microsoft HTML Object Library.

Code:
Public Sub Extract_Data2()
    
    Dim httpReq As WinHttp.WinHttpRequest
    Dim HTMLdoc As HTMLDocument
    Dim para As HTMLParaElement
    Dim tRows As IHTMLElementCollection
    Dim tRow As HTMLTableRow
    Dim URLs As Range, URL As Range
    Dim destSheet As Worksheet
    Dim parts As Variant
    Dim matchURL As String, matchOddsURL As String
    Dim r As Long
    Dim matchData(1 To 7) As Variant
    Dim HTML As String
    Dim startTime As Single
    
    startTime = Timer
    
    Set destSheet = ThisWorkbook.Worksheets("Ou_Odds")
    With destSheet
        .UsedRange.ClearContents
        .Range("A1:K1").Value = Array("Country", "League", "Date & Time", "Home", "Away", "Score", "Halfs", "Bookmaker", "Total", "Over", "Under")
    End With
    r = 2
    
    With ThisWorkbook.Worksheets("MatchDetails")
        Set URLs = .Range("D2", .Cells(Rows.Count, "D").End(xlUp))
    End With
    
    Set httpReq = New WinHttp.WinHttpRequest
    
    For Each URL In URLs
        
        'Request the match result page by removing "#ou" from the URL
        
        matchURL = Replace(URL.Value, "#ou", "")
        Debug.Print matchURL
        With httpReq
            .Open "GET", matchURL, False
            .setRequestHeader "Host", "www.betexplorer.com"
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
            .setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            .setRequestHeader "Accept-Language", "en-US,en;q=0.5"
            .setRequestHeader "Upgrade-Insecure-Requests", "1"
            .setRequestHeader "Referer", "http://www.betexplorer.com/soccer/italy/serie-a-2015-2016/results/"
            .send
            Debug.Print .Status, .statusText
            
            'Put response in a HTMLDocument for parsing
            
            Set HTMLdoc = New HTMLDocument
            HTMLdoc.body.innerHTML = .responseText
        End With
        DoEvents
        
        matchData(1) = HTMLdoc.getElementsByClassName("list-breadcrumb__item")(2).innerText
        matchData(2) = HTMLdoc.getElementsByClassName("list-breadcrumb__item")(3).innerText
        
        'Date and time of match is in data-dt attribute of "match-date" P element:
        '< P id=match-date class=list-details__item__date data-dt="15,5,2016,18,00">< /P>
        
        Set para = HTMLdoc.getElementById("match-date")
        parts = Split(para.getAttribute("data-dt"), ",")
        matchData(3) = DateSerial(parts(2), parts(1), parts(0)) + TimeSerial(parts(3), parts(4), 0)
        
        matchData(4) = HTMLdoc.getElementsByTagName("h2")(0).innerText
        matchData(5) = HTMLdoc.getElementsByTagName("h2")(2).innerText
        matchData(6) = HTMLdoc.getElementById("js-score").innerText
        matchData(7) = HTMLdoc.getElementById("js-partial").innerText
        
        'Construct match odds URL from match result URL
        
        parts = Split(URL.Value, "/")
        matchOddsURL = "http://www.betexplorer.com/gres/ajax/matchodds.php?p=1&b=ou&e=" & parts(7)
        Debug.Print matchOddsURL
        
        'Request the match odds data.  The response is a JSON string containing HTML which itself contains the odds data
        
        With httpReq
            .Open "GET", matchOddsURL, False
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
            .setRequestHeader "Accept", "application/json, text/javascript, */*; q=0.01"
            .setRequestHeader "Accept-Language", "en-US,en;q=0.5"
            .setRequestHeader "X-Requested-With", "XMLHttpRequest"
            .setRequestHeader "Referer", URL.Value
            .send
            Debug.Print .Status, .statusText
            
            'Extract HTML from JSON response and put in HTMLDocument
            
            HTML = Mid(.responseText, Len("{'odds':'") + 1)     'remove {"odds":" at the start
            HTML = Left(HTML, Len(HTML) - 2)                    'remove "} at the end
            HTML = Replace(HTML, "\n", "")                      'remove newlines
            HTML = Replace(HTML, "\", "")                       'remove escape characters (assumes that every "\" is the escape character preceding the escaped character)
            
            Set HTMLdoc = New HTMLDocument
            HTMLdoc.body.innerHTML = HTML
        End With
        DoEvents
    
        Set tRows = HTMLdoc.getElementsByTagName("TR")
        
        For Each tRow In tRows
        
            If tRow.getElementsByTagName("TABLE").Length = 0 Then   'no inner tables?
            
                If tRow.innerText Like "*bet365*" Then
                    
                    destSheet.Cells(r, "A").Resize(1, 7) = matchData
                    destSheet.Cells(r, "H").Value = tRow.Cells(0).innerText
                    destSheet.Cells(r, "I").Value = tRow.Cells(3).innerText
                    destSheet.Cells(r, "J").Value = tRow.Cells(4).getAttribute("data-odd")
                    destSheet.Cells(r, "K").Value = tRow.Cells(5).getAttribute("data-odd")
                    r = r + 1
                
                End If
            
            End If
            
        Next
        
    Next
    
    Debug.Print "Elapsed time = " & Timer - startTime & " seconds"
    
End Sub
 
Upvote 0
This is the result: 106 seconds / 100 links. I love that !!
I think that is not necessary to tell you: You're a genius (because you know that)!
I just want to thank you for everything you do in helping newbies like me.
Thank you !

image.jpg


PS: I just want to point out that the timer from the code does not work in my system, but this is not a problem, I used another one.
 
Upvote 0
John,

This is really impressive. Can you please spend your few minutes to answer some questions about this code so that we could learn new things. Many thanks in advance for your valuable time.

Code:
With httpReq
            .Open "GET", matchURL, False
            .setRequestHeader "Host", "www.betexplorer.com"
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
            .setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            .setRequestHeader "Accept-Language", "en-US,en;q=0.5"
            .setRequestHeader "Upgrade-Insecure-Requests", "1"
            .setRequestHeader "Referer", "http://www.betexplorer.com/soccer/italy/serie-a-2015-2016/results/"
            .send

How did you know which requestheaders are needed for this site ? If there are any tutorials available then please share the links. Can you please explain a little about what all these lines are doing ?

Code:
"http://www.betexplorer.com/gres/ajax/matchodds.php?p=1&b=ou&e=" & parts(7)

Where you find this link and what this link is doing ?


Code:
With httpReq
            .Open "GET", matchOddsURL, False
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
            .setRequestHeader "Accept", "application/json, text/javascript, */*; q=0.01"
            .setRequestHeader "Accept-Language", "en-US,en;q=0.5"
            .setRequestHeader "X-Requested-With", "XMLHttpRequest"
            .setRequestHeader "Referer", URL.Value
            .send

This request is different from previous http request. Again, can you please explain the purpose of all these lines ?

and last one

Why you separated whole code into two http requests. Was that really necessary or just a user preference ?


I would be really grateful for your valuable time and advice.
 
Upvote 0
Excellent news - I'm pleased it works for you too!

PS: I just want to point out that the timer from the code does not work in my system, but this is not a problem, I used another one.
The Timer code should work on any system. To display it in hh:mm:ss:

Code:
    Debug.Print "Elapsed time = " & Format(TimeSerial(0, 0, Timer - startTime), "hh:mm:ss")
Timer gives the number of seconds since midnight, so the calculation needs tweaking if the elapsed time spans midnight.
 
Upvote 0
Thanks - hopefully this will help.
Code:
With httpReq
            .Open "GET", matchURL, False
            .setRequestHeader "Host", "www.betexplorer.com"
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
            .setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            .setRequestHeader "Accept-Language", "en-US,en;q=0.5"
            .setRequestHeader "Upgrade-Insecure-Requests", "1"
            .setRequestHeader "Referer", "http://www.betexplorer.com/soccer/italy/serie-a-2015-2016/results/"
            .send

How did you know which requestheaders are needed for this site ? If there are any tutorials available then please share the links. Can you please explain a little about what all these lines are doing ?
I used the Developer Tool in Firefox (IE and Chrome, etc. have the same thing), accessed by right-clicking on any part of the web page then 'Inspect Element'. Then look at the Console tab and it shows the requests sent by the browser to the website.

With a match result page loaded (or refreshed) in the browser, for example Empoli - Torino, 15.05.2016 - H2H stats, results, odds, you will see
HTML:
GET  http://www.betexplorer.com/soccer/italy/serie-a-2015-2016/empoli-torino/bFpRibLT/
as the first request in the Console. Click that and the tool switches to the Network tab which shows the request and response headers. I used most of the request headers shown by the tool, so the above code is emulating what a browser does when it sends a "GET" request to the web server to request that webpage. Most of the headers probably aren't needed (you could experiment and omit them to find out), though I decided to keep them so that the web server thinks a browser is requesting the page.

The various header fields are described at https://en.wikipedia.org/wiki/List_of_HTTP_header_fields.

I haven't really searched for tutorials on the Developer Tool because I tend to learn by using it. I also use Fiddler which is good for examining http requests and responses outside a browser.

Code:
"http://www.betexplorer.com/gres/ajax/matchodds.php?p=1&b=ou&e=" & parts(7)

Where you find this link and what this link is doing ?
When you click the "Over/Under" tab link in the webpage, a Javascript function is called which uses XMLHttpRequest to send a GET request and you will see this in the Console:

HTML:
GET XHR http://www.betexplorer.com/gres/ajax/matchodds.php?p=1&e=bFpRibLT&b=ou
XHR denotes XMLHttpRequest. In the above URL, "bFpRibLT" is the same string as parts(7) split from the original match URL. Therefore that link is requesting the match odds data for the same match as the original URL (the match result page being displayed).

Code:
With httpReq
            .Open "GET", matchOddsURL, False
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
            .setRequestHeader "Accept", "application/json, text/javascript, */*; q=0.01"
            .setRequestHeader "Accept-Language", "en-US,en;q=0.5"
            .setRequestHeader "X-Requested-With", "XMLHttpRequest"
            .setRequestHeader "Referer", URL.Value
            .send

This request is different from previous http request. Again, can you please explain the purpose of all these lines ?
That is sending a GET to request the match odds data. Again, the headers were found by looking at the Network tab in the Developer Tool. Here the "Accept" header is telling the web server that a JSON response is expected.

and last one

Why you separated whole code into two http requests. Was that really necessary or just a user preference ?
Two http requests are needed because that is what the webpage sends. The first request responds with HTML containing the match result data (league, teams, score, date and time, etc) and the second request responds with the match odds data in a JSON HTML string.
 
Upvote 0
John,

Thank you very much for the detailed explanation and your valuable time. This is really helpful and learned new things today. I'll definitely use this knowledge to help other people on this forum.


Most of the headers probably aren't needed (you could experiment and omit them to find out), though I decided to keep them so that the web server thinks a browser is requesting the page.

I was playing with request headers and found that in first http request no request header is necessary and in second one this is
Code:
.setRequestHeader "Referer", URL.Value
doing the job.

Thanks once again.

I was wondering if you're famous John_Walkenbach from spreadsheetpage ?
 
Upvote 0
Not me - he's a real Excel guru!

No problem John. As Will Roger once said, Everybody is Ignorant only on different subjects and everybody is Expert on something. Always respect that expertise and yours is Web automation.
 
Upvote 0
Excel vba scraper

I ran the code, but there's a problem. Input is omitted if a bookmaker does not exist.
If there are no bookmaker, the following information shall be entered.
< "Country", "League", "Date & Time", "Home", "Away", "Score", "Halfs", >

ex)
bookmaker yes : < "Country", "League", "Date & Time", "Home", "Away", "Score", "Halfs", "Bookmaker", "Total", "Over", "Under" >
bookmaker non : < "Country", "League", "Date & Time", "Home", "Away", "Score", "Halfs", >



Public Sub Extract_Data2()

Dim httpReq As WinHttp.WinHttpRequest
Dim HTMLdoc As HTMLDocument
Dim para As HTMLParaElement
Dim tRows As IHTMLElementCollection
Dim tRow As HTMLTableRow
Dim URLs As Range, URL As Range
Dim destSheet As Worksheet
Dim parts As Variant
Dim matchURL As String, matchOddsURL As String
Dim r As Long
Dim matchData(1 To 7) As Variant
Dim HTML As String
Dim startTime As Single

startTime = Timer

Set destSheet = ThisWorkbook.Worksheets("Ou_Odds")
With destSheet
.UsedRange.ClearContents
.Range("A1:K1").Value = Array("Country", "League", "Date & Time", "Home", "Away", "Score", "Halfs", "Bookmaker", "Total", "Over", "Under")
End With
r = 2

With ThisWorkbook.Worksheets("MatchDetails")
Set URLs = .Range("D2", .Cells(Rows.Count, "D").End(xlUp))
End With

Set httpReq = New WinHttp.WinHttpRequest


For Each URL In URLs




'Request the match result page by removing "#ou" from the URL

matchURL = Replace(URL.Value, "#ou", "")
Debug.Print matchURL
With httpReq
.Open "GET", matchURL, False
.setRequestHeader "Host", "www.betexplorer.com"
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
.setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
.setRequestHeader "Accept-Language", "en-US,en;q=0.5"
.setRequestHeader "Upgrade-Insecure-Requests", "1"
.setRequestHeader "Referer", "http://www.betexplorer.com/soccer/italy/serie-a-2015-2016/results/"
.send
Debug.Print .Status, .statusText

'Put response in a HTMLDocument for parsing

Set HTMLdoc = New HTMLDocument
HTMLdoc.body.innerHTML = .responseText
End With
DoEvents

matchData(1) = HTMLdoc.getElementsByClassName("list-breadcrumb__item")(2).innerText
matchData(2) = HTMLdoc.getElementsByClassName("list-breadcrumb__item")(3).innerText

'Date and time of match is in data-dt attribute of "match-date" P element:
'< P id=match-date class=list-details__item__date data-dt="15,5,2016,18,00">< /P>

Set para = HTMLdoc.getElementById("match-date")
parts = Split(para.getAttribute("data-dt"), ",")
matchData(3) = DateSerial(parts(2), parts(1), parts(0)) + TimeSerial(parts(3), parts(4), 0)

matchData(4) = HTMLdoc.getElementsByTagName("h2")(0).innerText
matchData(5) = HTMLdoc.getElementsByTagName("h2")(2).innerText
matchData(6) = HTMLdoc.getElementById("js-score").innerText
matchData(7) = HTMLdoc.getElementById("js-partial").innerText

'Construct match odds URL from match result URL

parts = Split(URL.Value, "/")
matchOddsURL = "http://www.betexplorer.com/gres/ajax/matchodds.php?p=1&b=ou&e=" & parts(7)
Debug.Print matchOddsURL

'Request the match odds data. The response is a JSON string containing HTML which itself contains the odds data

With httpReq
.Open "GET", matchOddsURL, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
.setRequestHeader "Accept", "application/json, text/javascript, */*; q=0.01"
.setRequestHeader "Accept-Language", "en-US,en;q=0.5"
.setRequestHeader "X-Requested-With", "XMLHttpRequest"
.setRequestHeader "Referer", URL.Value
.send
Debug.Print .Status, .statusText

'Extract HTML from JSON response and put in HTMLDocument

HTML = Mid(.responseText, Len("{'odds':'") + 1) 'remove {"odds":" at the start
HTML = Left(HTML, Len(HTML) - 2) 'remove "} at the end
HTML = Replace(HTML, "\n", "") 'remove newlines
HTML = Replace(HTML, "", "") 'remove escape characters (assumes that every "" is the escape character preceding the escaped character)

Set HTMLdoc = New HTMLDocument
HTMLdoc.body.innerHTML = HTML
End With
DoEvents

Set tRows = HTMLdoc.getElementsByTagName("TR")

For Each tRow In tRows

If tRow.getElementsByTagName("TABLE").Length = 0 Then 'no inner tables?

If tRow.innerText Like "*bet365*" Then

destSheet.Cells(r, "A").Resize(1, 7) = matchData
destSheet.Cells(r, "H").Value = tRow.Cells(0).innerText
destSheet.Cells(r, "I").Value = tRow.Cells(3).innerText
destSheet.Cells(r, "J").Value = tRow.Cells(4).getAttribute("data-odd")
destSheet.Cells(r, "K").Value = tRow.Cells(5).getAttribute("data-odd")
r = r + 1

End If

End If

Next

Next

Debug.Print "Elapsed time = " & Timer - startTime & " seconds"

End Sub
 
Upvote 0

Forum statistics

Threads
1,213,564
Messages
6,114,334
Members
448,567
Latest member
Kuldeep90

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top