Scraping HTML Tables with VBA

DeonM

New Member
Joined
Sep 18, 2014
Messages
26
I am scraping HTML tables with a GET request. The req.responseText is then passed to the module below to process the HTML tables.

My code works 100% but I'm trying to optimise it. It is a large daily crawl. I have 2 questions:

1. In the section where I'm processing the "className = "views-field views-field-field-date-published", I don't want the innerText of the TableCell, I want the innerText of the ...content="2020-10-22T00:00:00+02:00". At the moment I'm processing the innerHTML with a custom function, which works fine. But is there a way to access the content="2020-10-22T00:00:00+02:00" innerText itself, similar to what I'm doing for the href in the TableCell above?

2. I need to do different things with the content of each TableCell, therefore I've resorted to the multiple If statements based on TableCell.className. But it does require a lot of extra looping. Is there a more efficient way to get right to the applicable section?

Thanks in advance for any help.


VBA Code:
Option Explicit

Sub ProcessHTMLTables(HTMLString As String)

    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim HTMLTable As MSHTML.IHTMLElement
    Dim TableRow As MSHTML.IHTMLElement
    Dim TableCell As MSHTML.IHTMLElement
    Dim Category As String, URL As String, Description As String, PDate As String, ClDate As String, BDate As String
   
    HTMLDoc.body.innerHTML = HTMLString

    Set HTMLTable = HTMLDoc.getElementsByTagName("table")(0)
        'Debug.Print HTMLTable.innerHTML
        If HTMLTable Is Nothing Then
            Exit Sub
        End If

    For Each TableRow In HTMLTable.getElementsByTagName("tr")
        'Debug.Print TableRow.innerHTML

        For Each TableCell In TableRow.getElementsByTagName("td")
               
            If TableCell.className = "views-field views-field-field-category" Then
                Category = TableCell.innerText

            ElseIf TableCell.className = "views-field views-field-title" Then
                URL = "https://****.com" & Mid(TableCell.getElementsByTagName("a")(0).href, 7)
                Description = TableCell.innerText
             
            ElseIf TableCell.className = "views-field views-field-field-date-published" Then
                Debug.Print TableCell.innerHTML 'Returns: <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-10-22T00:00:00+02:00">22/10/2020</span>
                Debug.Print TableCell.innerText 'Returns: 22/10/2020
                PDate = ExtractDateValue(TableCell.innerHTML)
                
            ElseIf TableCell.className = "views-field views-field-field-date-closing" Then
                Debug.Print TableCell.innerHTML 'Returns: <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-10-23T00:00:00+02:00">22/10/2020</span>
                Debug.Print TableCell.innerText 'Returns: 23/10/2020
                ClDate = ExtractDateValue(TableCell.innerHTML)
                
            ElseIf TableCell.className = "views-field views-field-field-date-briefing" Then
                Debug.Print TableCell.innerHTML 'Returns: <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-10-24T00:00:00+02:00">22/10/2020</span>
                Debug.Print TableCell.innerText 'Returns: 24/10/2020
                BDate = ExtractDateValue(TableCell.innerHTML)
                                
            End If

        Next TableCell
        
    Next TableRow

End Sub
 

Excel Facts

Return population for a City
If you have a list of cities in A2:A100, use Data, Geography. Then =A2.Population and copy down.

John_w

MrExcel MVP
Joined
Oct 15, 2007
Messages
6,752
1. In the section where I'm processing the "className = "views-field views-field-field-date-published", I don't want the innerText of the TableCell, I want the innerText of the ...content="2020-10-22T00:00:00+02:00". At the moment I'm processing the innerHTML with a custom function, which works fine. But is there a way to access the content="2020-10-22T00:00:00+02:00" innerText itself, similar to what I'm doing for the href in the TableCell above?

Try TableCell.getAttribute("content")

2. I need to do different things with the content of each TableCell, therefore I've resorted to the multiple If statements based on TableCell.className. But it does require a lot of extra looping. Is there a more efficient way to get right to the applicable section?

It depends on what you want to do with the extracted data and your code doesn't show that.

There is no need to loop through the table cells (the inner loop) and look at the class name if you know that a particular column index always relates to the same 'field'. For example, if the first column is always the Category then you could have Category = TableRow.Cells(0).innerText

Other ways of getting all the table cell elements for a specific class name and looping through them are:
VBA Code:
    Dim Categories As IHTMLDOMChildrenCollection
    Dim Category As IHTMLDOMNode
    Set Categories = HTMLdoc.querySelectorAll("td.views-field views-field-field-category")
    'or
    'Set Categories = HTMLdoc.querySelectorAll("views-field-field-category")
    'or
    'Set Categories = HTMLdoc.querySelectorAll("td[class='views-field views-field-field-category']")
    For Each Category In Categories
        Debug.Print Category.innerText
    Next
   
    Dim Categories2 As IHTMLElementCollection
    Dim Category2 As HTMLTableCell
    Set Categories2 = HTMLdoc.getElementsByClassName("td.views-field views-field-field-category")
    'or
    'Set Categories2 = HTMLdoc.getElementsByClassName("views-field-field-category")
    For Each Category2 In Categories2
        Debug.Print Category2.innerText
    Next
 
Last edited:

DeonM

New Member
Joined
Sep 18, 2014
Messages
26
Hi John

Thanks for the reply.

1.
VBA Code:
TableCell.getAttribute("content")
If I use that I get a "Run time error 94, Invalid use of Null." If I look in the locals window I get a Null return for the variable. Any ideas?

2.
VBA Code:
Category = TableRow.Cells(0).innerText
That approach worked perfectly for me. As you say the columns are fixed 'fields', so I can read directly from each.
(I had to rewrite the code a bit to deal with the header row which has no usable data for me, but it's much preferable to the many if loops.).

Thanks for the help.
 

John_w

MrExcel MVP
Joined
Oct 15, 2007
Messages
6,752
I didn't notice that the content attribute is in a child span element, therefore this should work:
VBA Code:
Debug.Print TableCell.Children(0).getAttribute("content")
 
Solution

DeonM

New Member
Joined
Sep 18, 2014
Messages
26
That works perfectly. I used your way of referencing TableRow.Cells so ended up with:

VBA Code:
If TableRow.Cells(3).innerHTML <> "" Then DatePublished = TableRow.Cells(3).Children(0).getAttribute("content") Else DatePublished = ""

I also had to deal with some instances where the table cells could be empty, which would cause an error.

Thanks again.
 

Watch MrExcel Video

Forum statistics

Threads
1,129,804
Messages
5,638,450
Members
417,025
Latest member
MusterDuster

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Top