Webscrape into Excel with VBA

DeeEmmEss

New Member
Joined
Nov 25, 2015
Messages
43
I am looking to Webscrape several pages from a password protected site - to which I have access. I can negotiate my way with VBA to the first page I wish to scrape from, but am stuck from there. My VBA knowledge is basic, but my HTML skills are non existent, so I require help with what exactly to do from here. The first page I wish to scrape is below. I wish to have the headings, 'blueheadings' and data 'readablerowlink'. If I can be guided how to scrape this page, and then paste into my active workbook I can negotiate the other, similar, pages on the site. Help greatly appreciated.

Dim a As String, url As String
Set ie = CreateObject("InternetExplorer.Application")
url = "Horse Racing Database Solutions"
With ie
.Visible = True
.Navigate url
Do Until .ReadyState = 4
DoEvents
Loop
.Document.all.Item("Login").Value = "****"
.Document.all.Item("Password").Value = "*********"
.Document.forms(4).submit
End With
url = "Daily Jockeys Report"
ie.Navigate url
While ie.Busy Or ie.ReadyState <> 4: DoEvents: Wend

Below is an extract from the page concerned
<a target=_blank href='helpsection.php#jocktrainreport' title='Jockey Report Help Section'>Understand the Jockey Report.</a>
</td></tr></table></td></tr></table><br><hr><table><tr><th></th>
<th class=blueheadings valign=top colspan=6 title='Stats relate to the last 14 days'>Last 14 Days</th>
<th class=blueheadings valign=top colspan=6 title='Stats relate to the last 2 years'>Last 2 Years</th>
<th class=blueheadings valign=top colspan=2 title='Stats refer to time lapsed since last win'>Since Win</th>
<th class=blueheadings valign=top title='Booked rides for today by track'>TODAY</th>
<th class=blueheadings valign=top title='Runs, wins & places for each of the last seven days'>7 Day Form Figures</th></tr>
<tr><th class=readablerowlink valign=top title='Name of jockey, click to open'><a href='?sortselection=1'>Jockey</a></th>
<th class=readablerowlink valign=top title='Number of rides in last 14 days'><a href='?sortselection=2'>R</a></th>
<th class=readablerowlink valign=top title='Number of winning rides in last 14 days'><a href='?sortselection=3'>W</a></th>
<th class=readablerowlink valign=top title='Number of placed rides in last 14 days'><a href='?sortselection=4'>P</a></th>
<th class=readablerowlink valign=top title='14 day winning strike rate percentage'><a href='?sortselection=5'>W%</a></th>
<th class=readablerowlink valign=top title='14 day place strike rate percentage'><a href='?sortselection=6'>P%</a></th>
<th class=readablerowlink valign=top title='14 day profit/loss when backing at SP'><a href='?sortselection=7'>P/L</a></th>
<th class=readablerowlink valign=top title='Number of rides in last 2 years'><a href='?sortselection=8'>R</a></th>
<th class=readablerowlink valign=top title='Number of winning rides in last 2 years'><a href='?sortselection=9'>W</a></th>
<th class=readablerowlink valign=top title='Number of placed rides in last 2 years'><a href='?sortselection=10'>P</a></th>
<th class=readablerowlink valign=top title='2 year winning strike rate percentage'><a href='?sortselection=11'>W%</a></th>
<th class=readablerowlink valign=top title='2 year place strike rate percentage'><a href='?sortselection=12'>P%</a></th>
<th class=readablerowlink valign=top title='2 year profit/loss when backing at SP'><a href='?sortselection=13'>P/L</a></th>
<th class=readablerowlink valign=top title='The number of days since last rode a winner'><a href='?sortselection=14'>Days</a></th>
<th class=readablerowlink valign=top title='The number of runs since riding a winner'><a href='?sortselection=15'>Runs</a></th>
<th class=readablerowlink valign=top></th>
<th class=readablerowlink valign=top>(4th)(3rd)(2nd)(1st)(31st)(30th)(29th)</th></tr><tr><td valign=top class=readablerowlink><a href='jockeys.php?id=314' target='_blank' title='Allan, David, click to open form page'>Allan, David</a> <a title='FC Odds include 6/1 shot or less' class='texttype926'> $</a></td><td class=readablerow valign=top title='Allan, David has had 65 rides in last 14 days'>65</td>
<td class=readablerow valign=top title='Allan, David has had 10 winning rides in last 14 days'>10</td>
<td class=readablerow valign=top title='Allan, David has had 27 placed rides in last 14 days'>27</td>
<td class=readablerow valign=top title='Allan, David winning strike rate in last 14 days is 15%'>15%</td>

etc. down to row 3000
 

Excel Facts

Excel Joke
Why can't spreadsheets drive cars? They crash too often!
It's a matter of finding the correct table and then looping through its rows and columns. Try something like this - assumes the required table isn't nested in another table (can't tell from your HTML), otherwise an extra check on the InStr line is needed:

VBA Code:
    Dim HTMLdoc As HTMLDocument
    Dim tables As IHTMLElementCollection
    Dim table As HTMLTable
    Dim tRow As HTMLTableRow, tCell As HTMLTableCell
    Dim i As Long, col As Long
    Dim destCell As Range
    
    With ActiveSheet
        Set destCell = .Range("A1")
    End With
    
    Set HTMLdoc = ie.document
    Set tables = HTMLdoc.getElementsByTagName("TABLE")
    Set table = Nothing
    i = 0
    While i < tables.Length And table Is Nothing
        If InStr(tables(i).innerText, "Last 14 Days") Then Set table = tables(i)
        i = i + 1
    Wend
    
    For Each tRow In table.Rows
        col = 0
        For Each tCell In tRow.Cells
            destCell.Offset(tRow.RowIndex, col).Value = tCell.innerText
            col = col + tCell.colSpan
        Next
    Next
 
Upvote 0
Thanks John W for your prompt reply. Almost what I was looking for, and it was my mistake in not outlining what I needed correctly. Those first few lines are headings - the lines starting th class=blueheadings value, and the next lines th class=readable link, but what I actually require are the next 2500 odd lines with individual names and stats for each of those above lines i.e those lines that begin with td class=readablerow valign= etc.
 
Upvote 0
If it's found the correct table and the table isn't nested then that code should extract the headings and data rows.
 
Upvote 0
Thanks John W, my apologies, it has found and extracted the correct data. However, it is all in one column and one line on my Excel sheet which is why I thought it was incorrect. Oddly, when I look at it, I can only see the first few items of data, but if I copy and paste to another sheet, I get all the information it has scraped. I would be grateful therefore if you you indicate how I send it to multiple columns and lines. Thanks again.
 
Upvote 0
It sounds like the required table is nested and the code is finding the outer table, which contains 1 row and 1 cell, hence why all the data is being put into 1 cell. You can verify this by stepping through the For Each tRow In table.Rows loop with the F8 key.

Change the code to:
VBA Code:
    Set tables = HTMLdoc.getElementsByTagName("TABLE")
    Set table = Nothing
    i = 0
    While i < tables.Length And table Is Nothing
        If tables(i).getElementsByTagName("TABLE").Length = 0 Then
            If InStr(tables(i).innerText, "Last 14 Days") Then Set table = tables(i)
        End If
        i = i + 1
    Wend
 
Upvote 0
Solution
Thanks John, cracked it. I had to set col=0 after 1st next, but it is now doing as I wish. Appreciate the help.
 
Upvote 0

Forum statistics

Threads
1,214,587
Messages
6,120,405
Members
448,958
Latest member
Hat4Life

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top