Extract data from websites

Lositom

New Member
Joined
Apr 23, 2017
Messages
6
I have no idea about vba but have found this forum really helpful and have managed to put together a table where websites entered into column A are then extracted so that B returns page title, C returns h1 tag but along with this I would like D to return og:image URL from the meta properties and E to return favicon URL. Is this possible? I just can't work out how to adjust the existing code to select these final two urls from within the site's code.

Many thanks
 

Excel Facts

Save Often
If you start asking yourself if now is a good time to save your Excel workbook, the answer is Yes
Welcome to the board.

If you could post an example of the data it will make it clearer what you are trying to achieve and we will possibly be able to help.
 
Upvote 0
This is the code I'm using which I got from here (or equivalent) after a google search. It means little to me but works. I just need to capture those other bits of information is url format.

Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim sURL As String
Dim lastrow As Long
lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
For i = 2 To lastrow
Set wb = CreateObject("internetExplorer.Application")
sURL = Cells(i, 1)
wb.navigate sURL
wb.Visible = True
While wb.Busy
DoEvents
Wend
'HTML document
Set doc = wb.document
Cells(i, 2) = doc.title
On Error GoTo err_clear
Cells(i, 3) = doc.GetElementsByTagName("h1")(0).innerText
err_clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
wb.Quit
Range(Cells(i, 1), Cells(i, 3)).Columns.AutoFit

On Error GoTo err_clear2
Cells(i, 4) = doc.GetElements("meta[property=og:image]")(0).innerText
err_clear2:
If Err <> 0 Then
Err.Clear
Resume Next
End If
wb.Quit
Range(Cells(i, 1), Cells(i, 4)).Columns.AutoFit
Next i
End Sub
 
Upvote 0
This is still baffling me! I would really appreciate some help. It looks like the codes within pages are something like ****** property ="og:image" content=http://www.website.com/image.jpg /> and it's this link I need to be extracted and placed into its own separate column after the page title and h1 tag.

Many thanks
 
Upvote 0
I'm sorry I'm still struggling with this. I presume this isn't quite right

doc.GetElements("meta[property=og:image]")(0).innerText

But I've tried various things to try to make it work.
 
Upvote 0
I'm sorry I'm still struggling with this. I presume this isn't quite right

doc.GetElements("meta[property=og:image]")(0).innerText

I don't see that part in the code i referenced above. Perhaps you can post what you have already.

This also works.

Code:
Option Explicit
Sub Get_Graph_Og()

Dim ArrConst
Dim objHttp As Object
Dim Title As String, strEnd As String, strResult As String
Dim i As Byte

'For example you are looking for the image-link in the tag og:image
'In this example looks something like:
'< meta property="og:image" content="http://images.metmuseum.org/CRDImages/ap/web-large/DP140858.jpg" />
'First, create an array with your search parameters

ArrConst = Array("****** property=""og:url"" content=""", _
    "< meta property=""og:image"" content=""", _
    "< meta property=""og:title"" content=""", _
    "< meta property=""og:site_name"" content=""", _
    "< meta property=""og:description"" content=""")

'The end tag
strEnd = """ />"

'Create object
Set objHttp = CreateObject("MSXML2.ServerXMLHTTP")

'Get the webpage (adjust url to your needs)
objHttp.Open "GET", "http://www.metmuseum.org/art/collection/search/11122", False
objHttp.send ""

'Get the source of webpage
Title = objHttp.responseText

'Cycle through items in the array you created
For i = LBound(ArrConst) To UBound(ArrConst)
    
    'Check if search item exists
    If InStr(1, Title, ArrConst(i)) Then
    'If yes, then this
        strResult = Mid(Title, InStr(1, Title, ArrConst(i)) + Len(ArrConst(i)))
        strResult = Mid(strResult, 1, InStr(strResult, strEnd) - 1)
    Else
        'If no, then this
        strResult = ""
    End If
    
    'Print result to immediate window
    Debug.Print strResult
    strResult = ""
Next i
End Sub

'This should be your results
'http://www.metmuseum.org/art/collection/search/11122
'http://images.metmuseum.org/CRDImages/ap/web-large/DP140858.jpg
'Winslow Homer | The Gulf Stream | The Met
'The Metropolitan Museum of Art, i.e. The Met Museum
'Back in Prouts Neck, Maine, after one of his winter visits to the Bahamas, Homer painted this dramatic scene of imminent disaster. A man faces his demise on a dismasted, rudderless fishing boat, sustained by only a few stalks of sugarcane and threatened by sharks and a distant waterspout

Make sure you remove the space between "< m" in: "< meta property=
 
Last edited:
Upvote 0
Thank you very much for that - still unable to make it work.

This is the code I've started with. It's the final part of the process which I want to extract the og:image into the 4th column in excel.

Code:
Sub get_title_header()
 Dim wb As Object
 Dim doc As Object
 Dim sURL As String
 Dim lastrow As Long
 lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
 For i = 2 To lastrow
 Set wb = CreateObject("internetExplorer.Application")
 sURL = Cells(i, 1)
 wb.navigate sURL
 wb.Visible = True
 While wb.Busy
    DoEvents
 Wend
 'HTML document
 Set doc = wb.document
 Cells(i, 2) = doc.title
 On Error GoTo err_clear
 Cells(i, 3) = doc.GetElementsByTagName("h1")(0).innerText
 err_clear:
 If Err <> 0 Then
 Err.Clear
 Resume Next
 End If
 wb.Quit
 Range(Cells(i, 1), Cells(i, 3)).Columns.AutoFit

 On Error GoTo err_clear2
 Cells(i, 4) = doc.GetElements("meta[property=og:image]")(0).innerText
 err_clear2:
 If Err <> 0 Then
 Err.Clear
 Resume Next
 End If
 wb.Quit
 Range(Cells(i, 1), Cells(i, 4)).Columns.AutoFit
 Next i
 End Sub
 
Upvote 0
Actually, this is the code - not sure what I did wrong in that one

Code:
Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim sURL As String
Dim lastrow As Long
lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
For i = 2 To lastrow
Set wb = CreateObject("internetExplorer.Application")
sURL = Cells(i, 1)
wb.navigate sURL
wb.Visible = True
While wb.Busy
    DoEvents
Wend
'HTML document
Set doc = wb.document
Cells(i, 2) = doc.title
On Error GoTo err_clear
Cells(i, 3) = doc.getElementsByTagName("h1")(0).innerText
err_clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
wb.Quit
Range(Cells(i, 1), Cells(i, 3)).Columns.AutoFit

On Error GoTo err_clear2
Cells(i, 4) = doc.GetElements("meta[property=og:image]")(0).innerText
err_clear2:
If Err <> 0 Then
Err.Clear
Resume Next
End If
wb.Quit
Range(Cells(i, 1), Cells(i, 4)).Columns.AutoFit
Next i
End Sub
 
Upvote 0
Another approach:

Code:
Option Explicit
Sub Get_That_Data()

Dim strHTML As String, strTitle As String, strImageUrl As String
Dim strH1 As String, strTextTemp As String, sURL As String
Dim LastRow As Long, i As Long
Const strMetaTag = "< meta property=""og:image"" content="""
Const strEnd = """ />"
Const strH1Tag = "< h1"
Const strH1End = ">"
Const strH1EndTag = "< /h1>"
    
    'As you can see I don't use CreateObject("InternetExplorer.Application")
    'Instead I use CreateObject("msxml2.xmlhttp")

    With CreateObject("msxml2.xmlhttp")
        LastRow = ActiveSheet.Cells(Rows.Count, "A").End(xlUp).Row
        
        'Loop through Column 1 from Activesheet
        For i = 2 To LastRow
        
            'I've just this sURL for testing:
            '"http://www.metmuseum.org/art/collection/search/11122"
            sURL = Cells(i, 1)
            .Open "GET", sURL, False
            .send
            strHTML = .responseText
            
            'To find the elements I don't use DOM methods
            'but just search through the HTML.
            
            'Get the titel
            Cells(i, 2) = Mid(strHTML, InStr(strHTML, "<title>") + Len("<title>"), _
                            InStr(strHTML, "</title>") - InStr(strHTML, "<title>") - Len("</title>") + 1)
            
            'Get the image url
            strTextTemp = Mid(strHTML, InStr(1, strHTML, strMetaTag) + Len(strMetaTag))
            Cells(i, 3) = Mid(strTextTemp, 1, InStr(strTextTemp, strEnd) - 1)
            
            'Get the h1 header
            strTextTemp = Mid(strHTML, InStr(InStr(1, strHTML, "h1"), strHTML, ">"))
            Cells(i, 4) = Mid(strTextTemp, 2, InStr(strTextTemp, "</h1>") - 2)
            
            'Autofit
            Range(Cells(i, 1), Cells(i, 4)).Columns.AutoFit
        Next i
    End With
End Sub

Make sure you remove the space between "< m", "< h1", "< /h1" in:
Const strMetaTag = "< meta property=""og:image"" content="""
Const strH1Tag = "< h1"
Const strH1EndTag = "< /h1>"

Take good note of the comments.

Questions:
Please run this code. Does this example work for you?
What's the Url you fetch the data from?
 
Last edited:
Upvote 0

Forum statistics

Threads
1,214,909
Messages
6,122,189
Members
449,072
Latest member
DW Draft

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top