Scraping a web page and looping through every element on the page

mctabish (New Member; joined Nov 2, 2009; 33 messages)
Hello, I want to write a tool that will help me scrape a web page.
I would like to step through EVERY element on a page and at least Debug.Print it or write it to a worksheet. The Debug.Print or worksheet output itself is not important; I can do that part.
I am using MSXML2.XMLHTTP.
I know how to get the DIVs and step through each one, but I do not know how to get all of the children, of every type, starting from the main document.
Basically, how do you get all of the divs, then within each div process all of the tables, spans, classes, etc., and then drill down through each of those recursively?

I know how to get the innerText, innerHTML, outerText, outerHTML, ID and all of that, so I do not need those details (not that it would hurt).
The specific page should not really matter for what I am looking for, but here is one of the pages: Acorn Nut Fasteners - Luke Rivets for Handles - (.186HD x 4-40US x .9L) - Stainless Steel | KnifeKits.com

I am very surprised I have not found samples of how to do this.

I don't think this would be hard, but I am just not quite sure how to do it.


Thanks
Mc
 

The ScanNodes function here loops recursively through all DOM nodes and outputs each element's tag name and outerHTML to the Immediate Window. You need to set references (Tools > References in the VBA editor) to Microsoft XML, v6.0 and Microsoft HTML Object Library.

VBA Code:
Public Sub Scan_HTML_DOM()

    Dim HTMLdoc As HTMLDocument
    Dim n As Long
    Dim XmlHttp As MSXML2.XMLHTTP60

    'Download the page synchronously and load its HTML into a new document
    Set XmlHttp = New MSXML2.XMLHTTP60
    With XmlHttp
        .Open "GET", "https://www.knifekits.com/vcom/acorn-nut-fasteners-luke-rivets-for-handles-186hd-40us-stainless-steel-p-16068.html", False
        .send
        Set HTMLdoc = New HTMLDocument
        HTMLdoc.body.innerHTML = .responseText
    End With

    'Walk the DOM from the top node; n receives the number of element nodes found
    n = ScanNodes(HTMLdoc.FirstChild)

End Sub


'Prints this node (if it is an element) and then all of its descendants.
'Returns the number of element nodes printed.
Private Function ScanNodes(node As IHTMLDOMNode) As Long

    Dim n As Long
    Dim HTMLelem As Object
    Dim child As IHTMLDOMNode

    n = 0
    'Only element nodes have a tag name; text and comment nodes are skipped
    If node.NodeType = tagDOMNodeType.NODE_ELEMENT Then
        Set HTMLelem = node
        Debug.Print HTMLelem.tagName, HTMLelem.outerHTML
        n = n + 1
    End If

    'Traverse child nodes, whatever their type, and add up their counts
    If node.HasChildNodes Then
        For Each child In node.ChildNodes
            n = n + ScanNodes(child)
        Next
    End If

    ScanNodes = n

End Function
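
Since the question also mentions writing the results to a worksheet, here is a minimal sketch of the same recursive walk that writes one row per element instead of printing it. The names Write_DOM_To_Sheet and WriteNodes, the four-column layout and the use of ActiveSheet are assumptions made for illustration; the sketch relies on the same two references as the code above.

```vba
'Hypothetical helper: writes one row per element of an already loaded document
Public Sub Write_DOM_To_Sheet(HTMLdoc As HTMLDocument)

    Dim ws As Worksheet
    Dim nextRow As Long

    Set ws = ActiveSheet    'assumption: output goes to the active sheet
    ws.Range("A1:D1").Value = Array("Depth", "Tag", "ID", "Class")
    nextRow = 2

    WriteNodes HTMLdoc.FirstChild, 0, ws, nextRow

End Sub


'Writes this node (if it is an element) and then all of its descendants.
'nextRow is passed ByRef so every recursive call advances the same row counter.
Private Sub WriteNodes(node As IHTMLDOMNode, ByVal depth As Long, _
                       ws As Worksheet, ByRef nextRow As Long)

    Dim HTMLelem As Object
    Dim child As IHTMLDOMNode

    If node.NodeType = tagDOMNodeType.NODE_ELEMENT Then
        Set HTMLelem = node
        ws.Cells(nextRow, 1).Value = depth              'nesting level below the top node
        ws.Cells(nextRow, 2).Value = HTMLelem.tagName
        ws.Cells(nextRow, 3).Value = HTMLelem.ID
        ws.Cells(nextRow, 4).Value = HTMLelem.className
        nextRow = nextRow + 1
    End If

    'Recurse into every child node, one level deeper
    If node.HasChildNodes Then
        For Each child In node.ChildNodes
            WriteNodes child, depth + 1, ws, nextRow
        Next
    End If

End Sub
```

To try it, add the line Write_DOM_To_Sheet HTMLdoc to Scan_HTML_DOM after the End With. The Depth column shows how far each element is nested, which makes the parent/child structure easier to follow than a flat list.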
 
I am sorry for the late reply! I have been sick with COVID. I am feeling better, but not perfect yet.

This looks very promising!
I am going to test it with a few other sites that I am having trouble getting items from.
Again, thanks!
 