Scraping HTML Tables with VBA

DeonM

New Member
Joined
Sep 18, 2014
Messages
26
I am scraping HTML tables with a GET request. The req.responseText is then passed to the module below to process the HTML tables.

My code works 100% but I'm trying to optimise it. It is a large daily crawl. I have 2 questions:

1. In the section where I'm processing the "className = "views-field views-field-field-date-published", I don't want the innerText of the TableCell, I want the innerText of the ...content="2020-10-22T00:00:00+02:00". At the moment I'm processing the innerHTML with a custom function, which works fine. But is there a way to access the content="2020-10-22T00:00:00+02:00" innerText itself, similar to what I'm doing for the href in the TableCell above?

2. I need to do different things with the content of each TableCell, therefore I've resorted to the multiple If statements based on TableCell.className. But it does require a lot of extra looping. Is there a more efficient way to get right to the applicable section?

Thanks in advance for any help.


VBA Code:
Option Explicit

Sub ProcessHTMLTables(HTMLString As String)

    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim HTMLTable As MSHTML.IHTMLElement
    Dim TableRow As MSHTML.IHTMLElement
    Dim TableCell As MSHTML.IHTMLElement
    Dim Category As String, URL As String, Description As String, PDate As String, ClDate As String, BDate As String
   
    HTMLDoc.body.innerHTML = HTMLString

    Set HTMLTable = HTMLDoc.getElementsByTagName("table")(0)
        'Debug.Print HTMLTable.innerHTML
        If HTMLTable Is Nothing Then
            Exit Sub
        End If

    For Each TableRow In HTMLTable.getElementsByTagName("tr")
        'Debug.Print TableRow.innerHTML

        For Each TableCell In TableRow.getElementsByTagName("td")
               
            If TableCell.className = "views-field views-field-field-category" Then
                Category = TableCell.innerText

            ElseIf TableCell.className = "views-field views-field-title" Then
                URL = "https://****.com" & Mid(TableCell.getElementsByTagName("a")(0).href, 7)
                Description = TableCell.innerText
             
            ElseIf TableCell.className = "views-field views-field-field-date-published" Then
                Debug.Print TableCell.innerHTML 'Returns: <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-10-22T00:00:00+02:00">22/10/2020</span>
                Debug.Print TableCell.innerText 'Returns: 22/10/2020
                PDate = ExtractDateValue(TableCell.innerHTML)
                
            ElseIf TableCell.className = "views-field views-field-field-date-closing" Then
                Debug.Print TableCell.innerHTML 'Returns: <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-10-23T00:00:00+02:00">22/10/2020</span>
                Debug.Print TableCell.innerText 'Returns: 23/10/2020
                ClDate = ExtractDateValue(TableCell.innerHTML)
                
            ElseIf TableCell.className = "views-field views-field-field-date-briefing" Then
                Debug.Print TableCell.innerHTML 'Returns: <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-10-24T00:00:00+02:00">22/10/2020</span>
                Debug.Print TableCell.innerText 'Returns: 24/10/2020
                BDate = ExtractDateValue(TableCell.innerHTML)
                                
            End If

        Next TableCell
        
    Next TableRow

End Sub
 

Some videos you may like

Excel Facts

Shade all formula cells
To shade all formula cells: Home, Find & Select, Formulas to select all formulas. Then apply a light fill color.

John_w

MrExcel MVP
Joined
Oct 15, 2007
Messages
6,546
1. In the section where I'm processing the "className = "views-field views-field-field-date-published", I don't want the innerText of the TableCell, I want the innerText of the ...content="2020-10-22T00:00:00+02:00". At the moment I'm processing the innerHTML with a custom function, which works fine. But is there a way to access the content="2020-10-22T00:00:00+02:00" innerText itself, similar to what I'm doing for the href in the TableCell above?

Try TableCell.getAttribute("content")

2. I need to do different things with the content of each TableCell, therefore I've resorted to the multiple If statements based on TableCell.className. But it does require a lot of extra looping. Is there a more efficient way to get right to the applicable section?

It depends on what you want to do with the extracted data and your code doesn't show that.

There is no need to loop through the table cells (the inner loop) and look at the class name if you know that a particular column index always relates to the same 'field'. For example, if the first column is always the Category then you could have Category = TableRow.Cells(0).innerText

Other ways of getting all the table cell elements for a specific class name and looping through them are:
VBA Code:
    Dim Categories As IHTMLDOMChildrenCollection
    Dim Category As IHTMLDOMNode
    Set Categories = HTMLdoc.querySelectorAll("td.views-field views-field-field-category")
    'or
    'Set Categories = HTMLdoc.querySelectorAll("views-field-field-category")
    'or
    'Set Categories = HTMLdoc.querySelectorAll("td[class='views-field views-field-field-category']")
    For Each Category In Categories
        Debug.Print Category.innerText
    Next
   
    Dim Categories2 As IHTMLElementCollection
    Dim Category2 As HTMLTableCell
    Set Categories2 = HTMLdoc.getElementsByClassName("td.views-field views-field-field-category")
    'or
    'Set Categories2 = HTMLdoc.getElementsByClassName("views-field-field-category")
    For Each Category2 In Categories2
        Debug.Print Category2.innerText
    Next
 
Last edited:

DeonM

New Member
Joined
Sep 18, 2014
Messages
26
Hi John

Thanks for the reply.

1.
VBA Code:
TableCell.getAttribute("content")
If I use that I get a "Run time error 94, Invalid use of Null." If I look in the locals window I get a Null return for the variable. Any ideas?

2.
VBA Code:
Category = TableRow.Cells(0).innerText
That approach worked perfectly for me. As you say the columns are fixed 'fields', so I can read directly from each.
(I had to rewrite the code a bit to deal with the header row which has no usable data for me, but it's much preferable to the many if loops.).

Thanks for the help.
 

John_w

MrExcel MVP
Joined
Oct 15, 2007
Messages
6,546
I didn't notice that the content attribute is in a child span element, therefore this should work:
VBA Code:
Debug.Print TableCell.Children(0).getAttribute("content")
 
Solution

DeonM

New Member
Joined
Sep 18, 2014
Messages
26
That works perfectly. I used your way of referencing TableRow.Cells so ended up with:

VBA Code:
If TableRow.Cells(3).innerHTML <> "" Then DatePublished = TableRow.Cells(3).Children(0).getAttribute("content") Else DatePublished = ""

I also had to deal with some instances where the table cells could be empty, which would cause an error.

Thanks again.
 

Watch MrExcel Video

Forum statistics

Threads
1,119,110
Messages
5,576,162
Members
412,702
Latest member
maggielrux
Top