Trying to extract data from HTML header

simonphillips

New Member
Joined
Mar 25, 2008
Messages
28
Hello Everyone,

I have managed to open an HTML file from IE and start importing some data from the HTML source to an excel table.

The problem is that I need to get access to some information in a function that is in the HTML header and I do not know the syntax to use.

I am defining the HTML document using ".Document.body.innerHTML". But I need to find the syntax that is equivalent to ".Document.head.innerHTML".

I would welcome any pointers to tutotials on this syntax and also the syntax for opening the HTML (see below) so I can learn more.

With HTMLdoc
.Visible = True
.Navigate PageString
Do Until .ReadyState = 4: DoEvents: Loop
End With

TIA

Simon
 

Excel Facts

Repeat Last Command
Pressing F4 adds dollar signs when editing a formula. When not editing, F4 repeats last command.
Simon

Is that your current code?

If you want to get the HTML for the header you could try this.
Code:
Msgbox HTMLdoc.all(1)
This assumes HTMLdoc is a reference to the document for the web page.

It also assumes the header is the first item, which I'm not 100% sure is going to be true in all cases.

I'm trying to think of another way to get the header that doesn't rely on this.

If I can't think of a direct method then we could always use brute force and loop through all the elements in the document looking for it.:)

By the way what information are you trying to get from the header?

Is the function script eg javascript ?
 
Upvote 0
Hi Norie,

Thanks for taking the time to respond.

Unfortunately I did not explain that I have defined HTMLdoc as an Object.

As a result the command to display the HTML code in the message box fails.

The full code is:

Dim HTMLdoc as object
Dim HTMLString as String

Set HTMLdoc = CreateObject("InternetExplorer.Application")

With HTMLdoc
.Visible = True
.Navigate PageString
Do Until .ReadyState = 4: DoEvents: Loop
End With

String = HTMLdoc.Document.body.innerHTML
 
Upvote 0
Eh, it should be an object - just not that one.
Code:
Dim IE As Object
Dim doc As Object
    
     Set IE = CreateObject("InternetExplorer.Application")
 
     With IE
            .Visible = True
            .Navigate PageString
            Do Until .ReadyState = 4: DoEvents: Loop
            doc = .document
     End With
 
    strHeadHTML = doc.all(1)
 
    MsgBox strHeadHTML
 
    IE.quit
By the way using body means you would probably miss the header.

PS Is that definitely all the code?
 
Upvote 0
Use getElementsByTagName("HEAD") or loop through the doc.all() array looking for the "HEAD" node. The "HEAD" node is not necessarily doc.all(1). Note also this array starts at index 0, i.e. doc.all(0).
 
Upvote 0
John

I knew there was a way to do it without assuming the header wasn't the first element.

And there are appears to be at least 2 including the one you mentioned and this.
Code:
    Set objHead = doc.all.tags("HEAD")

PS I did suggest looping.:)

PPS Oh, and here's a strange thing - which I suppose isn't really strange actually.

If you have HTML with two headers the above methods only seem to return the first. Obviously a web page will/should have only one anyway so I suppose it makes sense.

But when I look up the two methods it states they both returns collections, and they do when you use something like TABLE.:)
 
Upvote 0
Thanks Norie and John,

The code you provided is now correctly returning the header.

Unfortunately the part of my code that looks through the body of the HTML page (see below) returns a run time error 91. I think this has something to do with the do loop until readystate = 4. I have tried putting in a basic counter for i = 1 to 100000 type of thing to give it some more time. This just delays the problem in the fact that the program looks through multiple pages and instead of failing on the first page it fails on the 5th.

Any suggestions?

Sub PopValues(ByVal HyperString As String, ByVal RowCounter As Integer, ByVal SheetName As String)

Dim DetailPageHTMLStringDocument, DetailPageTEXTStringDocument As Object
Dim DetailPageHTMLString, DetailPageTEXTString As String
Dim StartLink1, StartLink2, StartLink3, StartLink4, StartLink5, StartLink6 As Integer
Dim EndLink1, EndLink2, EndLink3, EndLink4, EndLink5, EndLink6 As Integer
Dim TmpRng As String
Dim Cutstring1, Cutstring2, Cutstring3, Cutstring4, Cutstring5, Cutstring6 As String
Dim Counter As Long
Dim IEforContractDetail As Object


Set IEforContractDetail = CreateObject("InternetExplorer.Application")


With IEforContractDetail
.Visible = True
.Navigate HyperString ' should work for any URL
Do Until .ReadyState = 4: DoEvents: Loop
DetailPageHTMLStringDocument = .document
End With

' Need time to allow IEforContractList.Document.body.innerHTML string to populate and avoid run time errro 91
' Counter = 0
' Do While Counter < 4500000
' Counter = Counter + 1
' Loop
DetailPageHTMLString = DetailPageHTMLStringDocument.document.body.innerHTML
DetailPageTEXTString = DetailPageHTMLStringDocument.body.innerTEXT
 
Upvote 0
Set DetailPageHTMLStringDocument = .document

should fix the run-time error 91.

Also, you will find it a lot easier to understand/parse/manipulate the HTML if you use the HTML DOM library, with its named objects, rather than 'Object', because then you get VB intellisense completion for the properties and methods, etc. For this you need to set a reference to the Microsoft HTML DOM Library in the VB editor.

The following code takes your PopValues routine and shows how to parse the HTML HEAD tag. The code outputs the HEAD data to the debug window (Immediate Window).

Code:
Option Explicit

'Requires reference to Microsoft HTML Object Library.  Set this in Tools - References in VB editor.

Sub Test()
    PopValues "http://www.mrexcel.com", 1, "Sheet1"
End Sub

Sub PopValues(ByVal HyperString As String, ByVal RowCounter As Integer, ByVal SheetName As String)

    Dim DetailPageHTMLStringDocument As HTMLDocument, DetailPageTEXTStringDocument As Object
    Dim DetailPageHTMLString, DetailPageTEXTString As String
    Dim StartLink1, StartLink2, StartLink3, StartLink4, StartLink5, StartLink6 As Integer
    Dim EndLink1, EndLink2, EndLink3, EndLink4, EndLink5, EndLink6 As Integer
    Dim TmpRng As String
    Dim Cutstring1, Cutstring2, Cutstring3, Cutstring4, Cutstring5, Cutstring6 As String
    Dim Counter As Long
    Dim IEforContractDetail As Object
    
    Set IEforContractDetail = CreateObject("InternetExplorer.Application")
    
    With IEforContractDetail
        .Visible = True
        .Navigate HyperString ' should work for any URL
        Do Until .readystate = 4: DoEvents: Loop
        Do Until .document.readystate = "complete": DoEvents: Loop
        Set DetailPageHTMLStringDocument = .document
    End With
    
    Get_and_Print_Head_Element DetailPageHTMLStringDocument

End Sub


Private Sub Get_and_Print_Head_Element(doc As HTMLDocument)

    'Get HEAD element using getElementsByTagName and print all its nodes in the Immediate Window

    Dim headElements As IHTMLElementCollection
    Dim headElement As HTMLHeadElement
    Dim i As Integer
    
    'Get collection of head elements.  There is normally only one <HEAD> tag, so the length of this collection is
    'normally 1
    
    Set headElements = doc.getElementsByTagName("HEAD")
    
    'Two ways of looping through the collection of head elements
    
    For Each headElement In headElements
        Debug.Print headElement.innerHTML
        PrintAllHeadNodes4 headElement
    Next
    
    For i = 0 To headElements.Length - 1
        Set headElement = headElements(i)
        Debug.Print headElement.innerHTML
        PrintAllHeadNodes4 headElement
    Next
    
End Sub


Private Sub PrintAllHeadNodes4(ByVal head As HTMLHeadElement)
    
    'One way of looping through the head element nodes
    
    Dim mNode As IHTMLDOMNode
    
    For Each mNode In head.childNodes
        PrintHeadNode mNode
    Next
    
End Sub


Private Sub PrintHeadNode(ByVal mNode As IHTMLDOMNode)

    'Print details of a HEAD node according to the node's name
    
    Dim title As HTMLTitleElement
    Dim meta As HTMLMetaElement
    Dim style As HTMLStyleElement
    Dim link As HTMLLinkElement
    
    Debug.Print mNode.nodeName
    
    Select Case mNode.nodeName
    
        Case "TITLE"
            Set title = mNode
            Debug.Print title.Text
    
        Case "META"
            Set meta = mNode
            If meta.httpEquiv <> "" Then Debug.Print "Http-equiv=" & meta.httpEquiv
            If meta.Name <> "" Then Debug.Print "Name=" & meta.Name
            Debug.Print "Content=" & meta.content
            
        Case "STYLE"
            Set style = mNode
            Debug.Print "Type=" & style.Type
            Debug.Print style.innerHTML
            
        Case "LINK"
            Set link = mNode
            Debug.Print "href=" & link.href
            Debug.Print "type=" & link.Type
            Debug.Print "rel=" & link.rel
            
    End Select
    
End Sub
 
Upvote 0
Also, your Dim statements don't declare the variable you expect. For example,

Dim DetailPageHTMLStringDocument, DetailPageTEXTStringDocument As Object

declares DetailPageHTMLStringDocument as a Variant and DetailPageTEXTStringDocument as an object.

Similarly,

Dim DetailPageHTMLString, DetailPageTEXTString As String

declares DetailPageHTMLString as a Variant and DetailPageTEXTString as a string.

Not that it really matters because I think Excel automatically maps a variant to the required type, but for your code these statements should really be:

Dim DetailPageHTMLStringDocument As Object, DetailPageTEXTStringDocument As Object
Dim DetailPageHTMLString As String, DetailPageTEXTString As String

The same applies to your other Dims.
 
Upvote 0
John,

Thank you for such a detailed reply. I thought I knew a little about Excel/VBA. Clearly I know "very" little.

Sincere Thanks

Simon
 
Upvote 0

Forum statistics

Threads
1,215,575
Messages
6,125,629
Members
449,241
Latest member
NoniJ

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top