excel vba to extract html info from websites

Sayre

Board Regular
Joined
Apr 12, 2005
Messages
180
Hi I'm trying to figure out how to get html data from websites into excel sheet using vba. I've been looking around for hours specifically trying to figure out how to get pagename and pageid values (to start, I want to get a lot more than that after like website images etc...).

There seems to be a lot about getting tag info like <div> etc..... but I can't seem to find anything on extracting pagename and pageid. Both occur in the header of html code and I can clearly see what it is if I copy the code using right-click>inspect element on any given webpage. But I can't seem to do this.

This particular thread from this forum looks potentially promising but I'm having a problem with it and hoping someone can help.
Link:
http://www.mrexcel.com/forum/excel-questions/391035-trying-extract-data-html-header.html


Here's the code from john_w:
Code:
Sub Test()
    PopValues "http://www.mrexcel.com", 1, "Sheet1"
End Sub

Sub PopValues(ByVal HyperString As String, ByVal RowCounter As Integer, ByVal SheetName As String)


 Dim DetailPageHTMLStringDocument As HTMLDocument, DetailPageTEXTStringDocument As Object
 Dim DetailPageHTMLString As String, DetailPageTEXTString As String


    Dim StartLink1, StartLink2, StartLink3, StartLink4, StartLink5, StartLink6 As Integer
    Dim EndLink1, EndLink2, EndLink3, EndLink4, EndLink5, EndLink6 As Integer
    Dim TmpRng As String
    Dim Cutstring1, Cutstring2, Cutstring3, Cutstring4, Cutstring5, Cutstring6 As String
    Dim Counter As Long
    Dim IEforContractDetail As Object

    Set IEforContractDetail = CreateObject("InternetExplorer.Application")

    Set DetailPageHTMLStringDocument = .Document
    
    With IEforContractDetail
        .Visible = True
        .Navigate HyperString ' should work for any URL
        Do Until .ReadyState = 4: DoEvents: Loop
        Do Until .Document.ReadyState = "complete": DoEvents: Loop
        Set DetailPageHTMLStringDocument = .Document
    End With
    
    Get_and_Print_Head_Element DetailPageHTMLStringDocument

End Sub

I'm getting this error: "Invalid or unqualified reference"
on this line: Set DetailPageHTMLStringDocument = .Document

".document" is the portion highlighted in the error.
This happens whether I dim the variable DetailPageHTMLStringDocument as object or as HTMLDocument.

I have the reference libraries for ms html object lib as well as ms internet controls turned on as well as several others.
Not sure how t fix this, would appreciate any help or direction.

Every time I think I"m starting to know vba pretty well, I am humbled by some little thing like this. This website has been so great for me over the years. I have contributed back a little in the way of answers but not nearly enough compared to what I have received. I need to make more effort in this area, and will do so.

Again any assistance much appreciated.
 

Excel Facts

What is the last column in Excel?
Excel columns run from A to Z, AA to AZ, AAA to XFD. The last column is XFD.
Code:
    [COLOR=#FF0000]Set DetailPageHTMLStringDocument = .Document[/COLOR]
    
    With IEforContractDetail
        .Visible = True
        .Navigate HyperString ' should work for any URL
        Do Until .ReadyState = 4: DoEvents: Loop
        Do Until .Document.ReadyState = "complete": DoEvents: Loop
        [COLOR=#008000]Set DetailPageHTMLStringDocument = .Document[/COLOR]
    End With

I think you can delete that line (red) as it is a duplicated in the With code-block where .Document refers to the IEforContractDetail.Document
 
Upvote 0
Yea! Ok that appears to have worked, guess I should have tried that. It did not get the results I was hoping for. I'm intermediate with vba and know almost nothing of html.
I guess I need to figure out how to parse the html code of the url.

For instance, the url MrExcel.com | Excel Resources | Excel Seminars | Excel Products, lets say I inspect the element and decide I want to extract the number 14 from this line of code:<a href="http://www.mrexcel.com" abp="14">
Then I want to extract the number 21 from this line:<td width="100%" align="center" abp="21">
Then I want to find other links this page is linked to and do the same for them.

Any idea how I could write this? Or just point me at another resource that I may have missed in my hours of searching.... anything at this point may be helpful.
 
Upvote 0
Er, my last reply did not post right, let my try again ajd I'll try to simplify.

I am looking to create a macro that will take a given url, lets say MrExcel.com | Excel Resources | Excel Seminars | Excel Products in this case, search through the html code (the stuff you can view by right clicking on the site and choose inspect element).
Find certain tags, in this case lets say
HTML:
<TR>
, and then search for a key word next to an = sign, in this case "href", and then debug.print what comes after the key word.

HTML:
<tr>
<td width="100%" align="center" height="16"><a href="http://www.mrexcel.com/">Home</a></td>
</tr>

The result for the debug.print here would be "http://www.mrexcel.com/"

I apologize if I'm using the wrong verbiage here and hope I've simplified this enough that someone could actually help!
 
Last edited:
Upvote 0

Forum statistics

Threads
1,215,219
Messages
6,123,685
Members
449,117
Latest member
Aaagu

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top