Web Scraping using vba

Abdmujib

Board Regular
Joined
May 15, 2022
Messages
88
Office Version
  1. 365
Platform
  1. Windows
Pls, I want to write a vba code for web scaraping. I want to access betexplorer.com soccer section, for the matches that would be played today (if possible with option to be extend it to to other days i.e tomorrow and any other days).

The program looks at all the games being played that day. It checks each game's record to see if certain pattern happened before in previous games between those teams. For example, if Manchester United is playing against Arsenal, the program will check if both teams scored, times they scored more than 1 goal, times they scored more than 2 goals, times they scored more than 3 goals and times they scored more than 4 goals and times when both team did not score. If the program finds these things happening in at least 70% of the previous games between Manchester United and Arsenal. Then the result will be matches that has the pattern occurrence at least 70%. with the pattern that occurred listed against it and the link to the match.

@Anthony47 pls can you help. thanks for the usual help

pls anybody is free to help.
Thanks

These are the codes suggested by chatgtp. pls kindly help to modify

VBA Code:
Sub CheckMatches()
    Dim xmlHttp As Object
    Dim htmlDoc As Object
    Dim table As Object
    Dim rows As Object
    Dim row As Object
    Dim link As Object
    Dim h2hLink As String
    Dim h2hDoc As Object
    Dim h2hTable As Object
    Dim h2hRows As Object
    Dim h2hRow As Object
    Dim h2hCells As Object
    Dim i As Long
    Dim score As String
    Dim goals As Long
    Dim btts As Boolean
    Dim matches As Collection
    
    ' Create a new XMLHTTP object
    Set xmlHttp = CreateObject("MSXML2.XMLHTTP")
    
    ' Send a GET request to the website
    xmlHttp.Open "GET", "https://www.betexplorer.com/soccer/", False
    xmlHttp.send
    
    ' Parse the HTML response using the HTMLDocument object
    Set htmlDoc = CreateObject("HTMLFile")
    htmlDoc.body.innerHTML = xmlHttp.responseText
    
    ' Extract the links to the matches
    Set table = htmlDoc.getElementById("sortable-1")
    Set rows = table.getElementsByTagName("tr")
    
    ' Loop through the rows of the table
    For Each row In rows
        ' Get the link to the match
        Set link = row.getElementsByTagName("a")(0)
        h2hLink = "https://www.betexplorer.com" & link.getAttribute("href") & "&h2h=1"
        
        ' Send a GET request to the H2H page
        xmlHttp.Open "GET", h2hLink, False
        xmlHttp.send
        
        ' Parse the HTML response using the HTMLDocument object
        Set h2hDoc = CreateObject("HTMLFile")
        h2hDoc.body.innerHTML = xmlHttp.responseText
        
        ' Extract the rows of the H2H table
        Set h2hTable = h2hDoc.getElementById("h2h-matches")
        Set h2hRows = h2hTable.getElementsByTagName("tr")
        
        ' Loop through the rows of the H2H table
        For Each h2hRow In h2hRows
            ' Skip the header row
            If h2hRow.className = "hl" Then
                Continue For
            End If
            
            ' Extract the cells of the H2H row
            Set h2hCells = h2hRow.getElementsByTagName("td")
            
            ' Check if the H2H row matches the pattern
            score = h2hCells(2).innerText
            goals = CLng(h2hCells(3).innerText) + CLng(h2hCells(4).innerText)
            btts = (h2hCells(5).innerText = "BTS")
            
            ' Update the match list if the H2H row matches the pattern
            If (score = "1:1" Or score = "2:2") And goals >= 3 And btts Then
                matches.Add link.getAttribute("href")
                Exit For
            End If
        Next h2hRow
    Next row
    
    ' Print the links to the matches that match the pattern
    For i = 1 To matches.Count
        Debug.Print matches.Item(i)
    Next i
End Sub


The second code

VBA Code:
Sub CheckMatchesForPatterns()
    ' Set up variables
    Dim url As String
    Dim response As String
    Dim soup As Object
    Dim today_matches As Object
    Dim match As Object
    Dim match_link As String
    Dim match_url As String
    Dim h2h_stats As Variant
    Dim patterns As Variant
    
    ' Set the URL to scrape
    url = "https://www.betexplorer.com/soccer/"
    
    ' Create a new HTTP request and send it to the URL
    Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
    xmlhttp.Open "GET", url, False
    xmlhttp.send
    
    ' Get the response text from the request
    response = xmlhttp.responseText
    
    ' Parse the HTML response text with BeautifulSoup
    Set soup = CreateObject("HTMLFile")
    soup.body.innerHTML = response
    
    ' Get all the matches in today's list
    Set today_matches = soup.getElementById("sortable-1").getElementsByClassName("dark")
    
    ' Loop through today's matches and check h2h stats and patterns
    For Each match In today_matches
        ' Get the match link and URL
        match_link = match.getElementsByTagName("a")(0).getAttribute("href")
        match_url = "https://www.betexplorer.com" & match_link
        
        ' Call the function to calculate the h2h stats
        h2h_stats = GetH2HStats(match_url)
        
        ' Check if there are any h2h stats for this match
        If Not IsEmpty(h2h_stats) Then
            ' Call the function to check for patterns in the h2h stats
            patterns = CheckH2HPatterns(h2h_stats)
            
            ' Check if there are any patterns for this match
            If Not IsEmpty(patterns) Then
                ' Print the match link and patterns to the immediate window
                Debug.Print "Match: " & match_link
                Debug.Print "Patterns: " & Join(patterns, ", ")
            End If
        End If
    Next match
End Sub

Function GetH2HStats(match_url As String) As Variant
    ' Set up variables
    Dim match_response As String
    Dim match_soup As Object
    Dim h2h_table As Object
    Dim h2h_rows As Object
    Dim total_matches As Long
    Dim both_scored As Long
    Dim total_over2 As Long
    Dim total_under2 As Long
    Dim no_score As Long
    Dim row As Object
    Dim columns As Object
    Dim goals_home As Long
    Dim goals_away As Long
    Dim both_scored_pct As Double
    Dim total_over2_pct As Double
    Dim total_under2_pct As Double
    Dim no_score_pct As Double
    
    ' Create a new HTTP request and send it to the match URL
    Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
    xmlhttp.Open "GET", match_url, False
    xmlhttp.send
    
    ' Get the response text from the request
    match_response = xmlhttp.responseText
    
    ' Parse the HTML response text with BeautifulSoup
    Set match_soup = CreateObject("HTMLFile")
    match_soup.body.innerHTML = match_response
    
    ' Get the h2h table from the match page
    Set h2h_table = match_soup.getElementsByClassName("h2h")(0).getElementsByTagName("tbody")(0)
    
    ' Get all the rows in the h2h table
    Set h2h_rows = h2h_table.getElementsByTagName("tr")
 
If you need a help in importing the h2h tables, well I can do that; if you need to analyze the results before doing anything else then that 's beyond my availability.

As the former task is concerned, replace the whole block from Set tObj = H2H.FindElementsById("js-mutual-table") up to End Sub with this new block:
Code:
    Set tObj = H2H.FindElementsById("js-mutual-table")
    If tObj.Count > 0 Then
    Debug.Print I & " out of " & mColl.Count
        Set liColl = H2H.FindElementsByCss("li[class='list-breadcrumb__item']")
        Debug.Print ">H2H for: " & liColl(liColl.Count).Text
        Debug.Print ">Listed: " & tObj(1).FindElementsByTag("tbody").Count
'        Stop
        'Import h2h table:
        Selection.Cells(100000, 1).End(xlUp).Offset(2, 0).Value = ">H2H for: " & liColl(liColl.Count).Text
        tObj(1).AsTable.ToExcel Selection.Cells(100000, 1).End(xlUp).Offset(1, 0)
        '
    Else
        Debug.Print ">No h2h for: " & Replace(mLink, "https://www.betexplorer.com", "", , , vbTextCompare)
    End If
    DoEvents
Next I
MsgBox ("H2H imported")
End Sub
Please remember that the cell selected when Sub GetH2H get started determine the date taken into consideration:
-if the selection is row1 and selected cell contains a date than that date will be used for quering Betexplorer
-if not a date or not row1 then "today" will be used

The h2h info will be imported in the selected column, from the first free row
Hello,

thanks so much for the help.
When i run the code early this morning I got an error message. Not sure if it is due to the driver. But i updated the selenium chrome drivers few days ago and it worked for the previous code. The photo of the error is attached.

I understand that the analysis would take much of your time. I so much appreciate for the help so far. But the analysis is an integral part of the project. Can I provide an alternative website where some of the analysis has been done. The code just need to import them into the excel. It's a different website entirely. Sofascore.com then for each mtches listed for that day has a section named "head to head streak" which has the the analysed data (2nd attached picture named screenshot... ). So I want thesection out for all the matches that has that part.

Thank you so much sir.

I hope for your kind consideration sir
 

Attachments

  • IMG-20230317-WA0000.jpg
    IMG-20230317-WA0000.jpg
    150.4 KB · Views: 16
  • Screenshot_20230317-152806_Chrome.jpg
    Screenshot_20230317-152806_Chrome.jpg
    219.1 KB · Views: 12
Upvote 0

Excel Facts

Remove leading & trailing spaces
Save as CSV to remove all leading and trailing spaces. It is faster than using TRIM().
Again, these automations require time and patience, and have an unknown expiration date, because sooner or later the website will change its html souce and the macro need to be adapted to the new situation; so either you learn how to approach the site or you need a consultant (and that is not my job) to mantain the code for you.
Moving to a new site would not change the situation for the future, but would make useless what has been done up to now.

Going to the error situation, that should not depends on the driver revision.

I kept the currente Sub GetH2H running for hours importing h2h information and I was able to identify two possible area of error, both due to timing and connection errors.

The main modifications involves the following area, that was amended for a better controlled access to the elements:
[code For I = 1 To mColl.Count
mLink = mColl(I).FindElementsByTag("a")(1).Attribute("href")
H2H.Get mLink
H2H.Wait 300
If I = 1 Then H2H.FindElementById("onetrust-accept-btn-handler").Click
'
'Open h2h panel:
H2H.FindElementById("mutual_div").FindElementsByTag("A")(1).Click
H2H.Wait 800 [/code]
I also added more Debug.Prints for better tracing the events, so it is better I publish the complete code for the new version of Sub GetH2H:
Code:
 Sub GetH2H()
Dim wPage As Object         'Selenium.WebDriver
Dim H2H As Object
Dim myUrl As String
Dim myTim As Single, I As Long
Dim TDate As Date, mColl As Object, mLink, liColl As Object
Dim Cippa As WebElement, J As Long
Dim Errr(1 To 2) As Long, IPos As Range
Dim reeLoad As Long, reGet As Long
'
Sheets("Foglio5").Select                            '<<< The working sheet
myTim = Timer
'Crea Driver:
Set wPage = CreateObject("Selenium.WebDriver")      '1st container
Set H2H = CreateObject("Selenium.WebDriver")        '2nd container

If IsDate(Selection) And Selection.Row = 1 Then     'Get the date
    TDate = Selection.Value
Else
    TDate = Int(Now)
End If
Set IPos = Selection.Cells(1, 1)

myUrl = "https://www.betexplorer.com/next/soccer/?year=" & Year(TDate) & "&month=" & _
   Month(TDate) & "&day=" & Day(TDate) '
wPage.Start "chrome", myUrl
wPage.Get "/"
wPage.Wait 500
Debug.Print "URL=" & myUrl
Debug.Print ">Page loaded", Format(Timer - myTim, "0.0")
H2H.Start "chrome"
Set mColl = wPage.FindElementsByClass("table-main__tt")
For I = 1 To mColl.Count
    Debug.Print ">>> I=" & I
    mLink = mColl(I).FindElementsByTag("a")(1).Attribute("href")
    Err.Clear
    On Error Resume Next
    For J = 1 To 10
        H2H.Get mLink
        H2H.Wait 200
        If Err.Number = 0 Then Exit For
        H2H.Wait 1000
        Debug.Print "Error# " & Err.Number & " on Get mLink", J
        reeLoad = reeLoad + 1
        Err.Clear
        DoEvents
    Next J
    H2H.Wait 100
    If I = 1 Then H2H.FindElementById("onetrust-accept-btn-handler").Click
    '
    'Open h2h panel:
    On Error Resume Next
    For J = 1 To 30
        Set Cippa = H2H.FindElementById("mutual_div").FindElementsByTag("A")(1)
        Debug.Print "Cippa=" & Cippa.IsEnabled, Cippa.IsDisplayed, Timer
        Cippa.Click
        If Err.Number = 0 Then
            'No error
            Debug.Print "Clicked Ok", "J=" & J, Timer
            Exit For
        Else
            'Errors
            If J = 1 Then Errr(1) = Errr(1) + 1
            Errr(2) = Errr(2) + 1
            H2H.Wait 300
            Debug.Print Err.Number, Left(Err.Description, 100) & "...."
            If Err.Number <> 11 Then
                'refresh page if <> 11
                H2H.Get mLink
                H2H.Wait 1000
                reGet = reGet + 1
            End If
            Err.Clear
'            Stop
            Beep
            DoEvents
        End If
    Next J
    On Error GoTo 0
'    H2H.FindElementById("mutual_div").FindElementsByTag("A")(1).Click
    H2H.Wait 200
    Set tObj = H2H.FindElementsById("js-mutual-table")
    If tObj.Count > 0 Then
    Debug.Print I & " out of " & mColl.Count
        Set liColl = H2H.FindElementsByCss("li[class='list-breadcrumb__item']")
        Debug.Print ">H2H for: " & liColl(liColl.Count).Text
        Debug.Print ">Listed: " & tObj(1).FindElementsByTag("tbody").Count
'        Stop
        'Import h2h table:
        IPos.Cells(100000, 1).End(xlUp).Offset(2, 0).Value = ">H2H for: " & liColl(liColl.Count).Text
        tObj(1).AsTable.ToExcel IPos.Cells(100000, 1).End(xlUp).Offset(1, 0)
        '
    Else
        Debug.Print ">No h2h for: " & Replace(mLink, "https://www.betexplorer.com", "", , , vbTextCompare)
    End If
    Debug.Print
    DoEvents
Next I
Debug.Print "Completed", "FailCases=" & Errr(1), "FailCycles=" & Errr(2), "Re.Get=" & reGet, "Re.Load=" & reeLoad, "Secs: " & Format(Timer - myTim, "0.0")
MsgBox ("H2H imported")
wPage.Quit
H2H.Quit
End Sub
Using this code I was able to import without problems some 50thousands lines of H2H information

Try...
 
Upvote 0
Again, these automations require time and patience, and have an unknown expiration date, because sooner or later the website will change its html souce and the macro need to be adapted to the new situation; so either you learn how to approach the site or you need a consultant (and that is not my job) to mantain the code for you.
Moving to a new site would not change the situation for the future, but would make useless what has been done up to now.

Going to the error situation, that should not depends on the driver revision.

I kept the currente Sub GetH2H running for hours importing h2h information and I was able to identify two possible area of error, both due to timing and connection errors.

The main modifications involves the following area, that was amended for a better controlled access to the elements:
[code For I = 1 To mColl.Count
mLink = mColl(I).FindElementsByTag("a")(1).Attribute("href")
H2H.Get mLink
H2H.Wait 300
If I = 1 Then H2H.FindElementById("onetrust-accept-btn-handler").Click
'
'Open h2h panel:
H2H.FindElementById("mutual_div").FindElementsByTag("A")(1).Click
H2H.Wait 800 [/code]
I also added more Debug.Prints for better tracing the events, so it is better I publish the complete code for the new version of Sub GetH2H:
Code:
 Sub GetH2H()
Dim wPage As Object         'Selenium.WebDriver
Dim H2H As Object
Dim myUrl As String
Dim myTim As Single, I As Long
Dim TDate As Date, mColl As Object, mLink, liColl As Object
Dim Cippa As WebElement, J As Long
Dim Errr(1 To 2) As Long, IPos As Range
Dim reeLoad As Long, reGet As Long
'
Sheets("Foglio5").Select                            '<<< The working sheet
myTim = Timer
'Crea Driver:
Set wPage = CreateObject("Selenium.WebDriver")      '1st container
Set H2H = CreateObject("Selenium.WebDriver")        '2nd container

If IsDate(Selection) And Selection.Row = 1 Then     'Get the date
    TDate = Selection.Value
Else
    TDate = Int(Now)
End If
Set IPos = Selection.Cells(1, 1)

myUrl = "https://www.betexplorer.com/next/soccer/?year=" & Year(TDate) & "&month=" & _
   Month(TDate) & "&day=" & Day(TDate) '
wPage.Start "chrome", myUrl
wPage.Get "/"
wPage.Wait 500
Debug.Print "URL=" & myUrl
Debug.Print ">Page loaded", Format(Timer - myTim, "0.0")
H2H.Start "chrome"
Set mColl = wPage.FindElementsByClass("table-main__tt")
For I = 1 To mColl.Count
    Debug.Print ">>> I=" & I
    mLink = mColl(I).FindElementsByTag("a")(1).Attribute("href")
    Err.Clear
    On Error Resume Next
    For J = 1 To 10
        H2H.Get mLink
        H2H.Wait 200
        If Err.Number = 0 Then Exit For
        H2H.Wait 1000
        Debug.Print "Error# " & Err.Number & " on Get mLink", J
        reeLoad = reeLoad + 1
        Err.Clear
        DoEvents
    Next J
    H2H.Wait 100
    If I = 1 Then H2H.FindElementById("onetrust-accept-btn-handler").Click
    '
    'Open h2h panel:
    On Error Resume Next
    For J = 1 To 30
        Set Cippa = H2H.FindElementById("mutual_div").FindElementsByTag("A")(1)
        Debug.Print "Cippa=" & Cippa.IsEnabled, Cippa.IsDisplayed, Timer
        Cippa.Click
        If Err.Number = 0 Then
            'No error
            Debug.Print "Clicked Ok", "J=" & J, Timer
            Exit For
        Else
            'Errors
            If J = 1 Then Errr(1) = Errr(1) + 1
            Errr(2) = Errr(2) + 1
            H2H.Wait 300
            Debug.Print Err.Number, Left(Err.Description, 100) & "...."
            If Err.Number <> 11 Then
                'refresh page if <> 11
                H2H.Get mLink
                H2H.Wait 1000
                reGet = reGet + 1
            End If
            Err.Clear
'            Stop
            Beep
            DoEvents
        End If
    Next J
    On Error GoTo 0
'    H2H.FindElementById("mutual_div").FindElementsByTag("A")(1).Click
    H2H.Wait 200
    Set tObj = H2H.FindElementsById("js-mutual-table")
    If tObj.Count > 0 Then
    Debug.Print I & " out of " & mColl.Count
        Set liColl = H2H.FindElementsByCss("li[class='list-breadcrumb__item']")
        Debug.Print ">H2H for: " & liColl(liColl.Count).Text
        Debug.Print ">Listed: " & tObj(1).FindElementsByTag("tbody").Count
'        Stop
        'Import h2h table:
        IPos.Cells(100000, 1).End(xlUp).Offset(2, 0).Value = ">H2H for: " & liColl(liColl.Count).Text
        tObj(1).AsTable.ToExcel IPos.Cells(100000, 1).End(xlUp).Offset(1, 0)
        '
    Else
        Debug.Print ">No h2h for: " & Replace(mLink, "https://www.betexplorer.com", "", , , vbTextCompare)
    End If
    Debug.Print
    DoEvents
Next I
Debug.Print "Completed", "FailCases=" & Errr(1), "FailCycles=" & Errr(2), "Re.Get=" & reGet, "Re.Load=" & reeLoad, "Secs: " & Format(Timer - myTim, "0.0")
MsgBox ("H2H imported")
wPage.Quit
H2H.Quit
End Sub
Using this code I was able to import without problems some 50thousands lines of H2H information

Try...
Thank you so much sir.

I will run it and give you my feedback sir.
 
Upvote 0
Thank you so much sir.

I will run it and give you my feedback sir.
Thank you so much. it works but the imported score were in time format. i.e if the score was 2:1 on the website, excel will read it as a time (02:01) . then it was difficult to use left and right funtions to split the score into another cell.

Thanks
 
Upvote 0
Let's format the destination area "as Text" before saving the results.

For this, replace this block:
Rich (BB code):
        'Import h2h table:
        IPos.Cells(100000, 1).End(xlUp).Offset(2, 0).Value = ">H2H for: " & liColl(liColl.Count).Text
        tObj(1).AsTable.ToExcel IPos.Cells(100000, 1).End(xlUp).Offset(1, 0)
        '
    Else
with:
VBA Code:
        'Import h2h table:
        NextR = IPos.Cells(100000, 1).End(xlUp).Offset(2, 0).Address
        Range(NextR).Value = ">H2H for: " & liColl(liColl.Count).Text
        Range(NextR).Resize(tObj(1).FindElementsByTag("tbody").Count * 4, 7).NumberFormat = "@"
        tObj(1).AsTable.ToExcel Range(NextR).Offset(1, 0)
    Else
 
Upvote 0
Let's format the destination area "as Text" before saving the results.

For this, replace this block:
Rich (BB code):
        'Import h2h table:
        IPos.Cells(100000, 1).End(xlUp).Offset(2, 0).Value = ">H2H for: " & liColl(liColl.Count).Text
        tObj(1).AsTable.ToExcel IPos.Cells(100000, 1).End(xlUp).Offset(1, 0)
        '
    Else
with:
VBA Code:
        'Import h2h table:
        NextR = IPos.Cells(100000, 1).End(xlUp).Offset(2, 0).Address
        Range(NextR).Value = ">H2H for: " & liColl(liColl.Count).Text
        Range(NextR).Resize(tObj(1).FindElementsByTag("tbody").Count * 4, 7).NumberFormat = "@"
        tObj(1).AsTable.ToExcel Range(NextR).Offset(1, 0)
    Else
Thank you so much sir
 
Upvote 0

Forum statistics

Threads
1,215,001
Messages
6,122,648
Members
449,092
Latest member
peppernaut

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top