Extract data from websites

Lositom

New Member
Joined
Apr 23, 2017
Messages
6
I have no idea about vba but have found this forum really helpful and have managed to put together a table where websites entered into column A are then extracted so that B returns page title, C returns h1 tag but along with this I would like D to return og:image URL from the meta properties and E to return favicon URL. Is this possible? I just can't work out how to adjust the existing code to select these final two urls from within the site's code.

Many thanks
 
Forget about the previous post. This board doesn't handle the "< " sign very well and it crippled the code.
New version:


Code:
Option Explicit
Sub Get_That_Data()

Dim strHTML As String, strImageUrl As String
Dim strTextTemp As String, strUrl As String
Dim LastRow As Long, i As Long
Const strTitle = "< title>"
Const strTitleEnd = "< /title>"
Const strMetaTag = "< meta property=""og:image"" content="""
Const strEnd = """ />"
Const strH1Tag = "< h1"
Const strH1End = ">"
Const strH1EndTag = "< /h1>"
    
    'As you can see I don't use CreateObject("InternetExplorer.Application")
    'Instead I use CreateObject("msxml2.xmlhttp")
    With CreateObject("msxml2.xmlhttp")
        LastRow = ActiveSheet.Cells(Rows.Count, "A").End(xlUp).Row
        
        'Loop through Column 1 from Activesheet
        For i = 2 To LastRow
        
            'I've just this strUrl for testing:
            '"http://www.metmuseum.org/art/collection/search/11122"
            strUrl = Cells(i, 1)
            .Open "GET", strUrl, False
            .send
            strHTML = .responseText
            
            'To find the elements I don't use DOM methods
            'but just search through the HTML.
            
            'Get the titel
            Cells(i, 2) = Mid(strHTML, InStr(strHTML, strTitle) + Len(strTitle), _
                            InStr(strHTML, strTitleEnd) - InStr(strHTML, strTitle) - Len(strTitleEnd) + 1)
            
            'Get the image url
            strTextTemp = Mid(strHTML, InStr(1, strHTML, strMetaTag) + Len(strMetaTag))
            Cells(i, 3) = Mid(strTextTemp, 1, InStr(strTextTemp, strEnd) - 1)
            
            'Get the h1 header
            strTextTemp = Mid(strHTML, InStr(InStr(1, strHTML, strH1Tag), strHTML, strH1End))
            Cells(i, 4) = Mid(strTextTemp, 2, InStr(strTextTemp, strH1EndTag) - 2)
            
            'Autofit
            Range(Cells(i, 1), Cells(i, 4)).Columns.AutoFit
        Next i
    End With
End Sub

Make sure you remove the space between "< t, "< m", "< h1", "< /h1" in:
Const strTitle = "< title>"
Const strTitleEnd = "< /title>"
Const strMetaTag = "< meta property=""og:image"" content="""
Const strH1Tag = "< h1"
Const strH1EndTag = "< /h1>"

Take good note of the comments.

Questions:
Please run this code. Does this example work for you?
What's the Url you fetch the data from?
 
Upvote 0

Excel Facts

Format cells as currency
Select range and press Ctrl+Shift+4 to format cells as currency. (Shift 4 is the $ sign).

Forum statistics

Threads
1,215,432
Messages
6,124,858
Members
449,194
Latest member
HellScout

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top