VBA Parsing for values from a list of websites

kmpoaquests

New Member
Joined
Sep 26, 2011
Messages
33
I'd like to refer to this post for a better understanding of what I'm trying to do

http://www.mrexcel.com/forum/excel-...sing-webpage-specific-value-between-tags.html


I need help trying to extract specific values between tags from a list of websites.
The particular values I need to extract are :

[h=1]< dd> </dd>[/h]< h1> </h1 >

<dd></dd>

Yahoo contributor website example is here MD Lynn's Contributor Profile - Yahoo! Contributor Network - contributor.yahoo.com

Ideally Column A would contain the hyperlinks pointing to each website I need values scraped from
Column B,C,D,E would then contain the values scraped from yahoo contributor website such as page views , fans, intersets, experiences, education, etc

If anyone can help me I'm greatly in your debt

thank you
 

Some videos you may like

Excel Facts

Will the fill handle fill 1, 2, 3?
Yes! Type 1 in a cell. Hold down Ctrl while you drag the fill handle.

ukmikeb

Well-known Member
Joined
Jul 10, 2009
Messages
2,757
Hi

This piece of code shows a different approach and I am sure it can be adopted for your needs :-
Code:
Sub YahooContrib()
Dim IE As InternetExplorer
Set IE = New InternetExplorer
my_URL = "http://contributor.yahoo.com/user/11173/md_lynn.html"

   IE.Visible = True    'Some things don't work unless it's visible



    IE.navigate my_URL
    
    'Loop unitl ie page is fully loaded
    Do Until IE.ReadyState = READYSTATE_COMPLETE
        DoEvents
    Loop
    
    Do Until IE.Document.ReadyState = "complete"
        DoEvents
    Loop

    ' Set document object
    Set HTMLdocMain = IE.Document
    hd1 = HTMLdocMain.getElementsByTagName("h1")(0).innerText '   Author
    lg3 = HTMLdocMain.getElementsByClassName("sec_col")(0).innerText ' General Background
    lg4 = HTMLdocMain.getElementsByClassName("stats")(0).innerText '   Page views etc
    lg5 = HTMLdocMain.getElementsByClassName("prim_col_sect")(1).innerText ' Education/Experience
    lg6 = HTMLdocMain.getElementsByClassName("prim_col_sect")(2).innerText ' Interests
    lg7 = HTMLdocMain.getElementsByClassName("prim_col_sect")(3).innerText ' Motto
    lg8 = HTMLdocMain.getElementsByClassName("prim_col_sect")(4).innerText 'Affiliations 
 End Sub

hth
 

kmpoaquests

New Member
Joined
Sep 26, 2011
Messages
33
Hi

This piece of code shows a different approach and I am sure it can be adopted for your needs :-
Code:
Sub YahooContrib()
Dim IE As InternetExplorer
Set IE = New InternetExplorer
my_URL = "http://contributor.yahoo.com/user/11173/md_lynn.html"

   IE.Visible = True    'Some things don't work unless it's visible



    IE.navigate my_URL
    
    'Loop unitl ie page is fully loaded
    Do Until IE.ReadyState = READYSTATE_COMPLETE
        DoEvents
    Loop
    
    Do Until IE.Document.ReadyState = "complete"
        DoEvents
    Loop

    ' Set document object
    Set HTMLdocMain = IE.Document
    hd1 = HTMLdocMain.getElementsByTagName("h1")(0).innerText '   Author
    lg3 = HTMLdocMain.getElementsByClassName("sec_col")(0).innerText ' General Background
    lg4 = HTMLdocMain.getElementsByClassName("stats")(0).innerText '   Page views etc
    lg5 = HTMLdocMain.getElementsByClassName("prim_col_sect")(1).innerText ' Education/Experience
    lg6 = HTMLdocMain.getElementsByClassName("prim_col_sect")(2).innerText ' Interests
    lg7 = HTMLdocMain.getElementsByClassName("prim_col_sect")(3).innerText ' Motto
    lg8 = HTMLdocMain.getElementsByClassName("prim_col_sect")(4).innerText 'Affiliations 
 End Sub

hth

I appreciate the response but I'm a total noob when it comes to manipulating vba/macro code.
If possible could you provide an example or modified code ?

Thank You
 

ukmikeb

Well-known Member
Joined
Jul 10, 2009
Messages
2,757
Hi

Here is the modified code :-
Code:
Option Base 1
Sub YahooContrib()
'
' Created by ukmikeb (MrExcel forum) on 14/05/2013
'
Dim IE As InternetExplorer
Dim Header As String, my_URL As String
Dim Rw As Long
Dim PtrCntnt, PtrCntrib, PtrFans
Dim lg4, lg5, lg6, lg7, lg8

Header = Array("URL", "Author", "General Background", "Page views", "Fans", "Contrib Since", "Education", "Affiliations", "Motto", "Interests")
Columns("A").ColumnWidth = 30
Columns("B").ColumnWidth = 30
Columns("C").ColumnWidth = 60
Columns("D").ColumnWidth = 12
Columns("E").ColumnWidth = 10
Columns("F").ColumnWidth = 10
Columns("G").ColumnWidth = 15
Columns("H").ColumnWidth = 30
Columns("I").ColumnWidth = 30
Columns("J").ColumnWidth = 30
Columns("K").ColumnWidth = 50

Range("A1:K1") = Header
With Range("A1:K1")
  .Font.Bold = True
  .HorizontalAlignment = xlCenter
End With
  
Set IE = New InternetExplorer
Rw = 2

Do While Not IsEmpty(Range("A" & Rw))
my_URL = Range("A" & Rw).Value
'   IE.Visible = True    'Some things don't work unless it's visible

'  Navigate the web to Race Result oage for race-id

    IE.navigate my_URL
    
    'Loop unitl ie page is fully loaded
    Do Until IE.ReadyState = READYSTATE_COMPLETE
        DoEvents
    Loop
    
    Do Until IE.Document.ReadyState = "complete"
        DoEvents
    Loop

    ' Set document object
    Set HTMLdocMain = IE.Document
     
     Range("B" & Rw) = HTMLdocMain.getElementsByTagName("h1")(0).innerText ' Author
     
     Range("C" & Rw) = HTMLdocMain.getElementsByClassName("sec_col")(0).innerText ' General Background
     
     lg4 = HTMLdocMain.getElementsByClassName("stats")(0).innerText ' Page views etc
     PtrCntnt = InStr(lg4, "Content")
     PtrFans = InStr(lg4, "Fans")
     PtrCntrib = InStr(lg4, "Contributor")
     Range("D" & Rw) = Mid(lg4, 11, PtrCntnt - 11)
     Range("E" & Rw) = Mid(lg4, PtrCntnt + 7, PtrFans - (PtrCntnt + 7))
     Range("F" & Rw) = Mid(lg4, PtrFans + 4, PtrCntrib - (PtrFans + 4))
    
     Range("G" & Rw) = Right(lg4, Len(lg4) - (PtrCntrib + 16))
     
     lg5 = HTMLdocMain.getElementsByClassName("prim_col_sect")(1).innerText ' Education/Experience
     Range("H" & Rw) = Right(lg5, Len(lg5) - 20)
    
     lg6 = HTMLdocMain.getElementsByClassName("prim_col_sect")(4).innerText ' Affiliations
     Range("I" & Rw) = Right(lg6, Len(lg6) - 12)
     
     lg7 = HTMLdocMain.getElementsByClassName("prim_col_sect")(3).innerText ' Motto
     Range("J" & Rw) = Right(lg7, Len(lg7) - 5)
     
     lg8 = HTMLdocMain.getElementsByClassName("prim_col_sect")(2).innerText 'Interests
     Range("K" & Rw) = Right(lg8, Len(lg8) - 12)

     Rw = Rw + 1
 Loop
End Sub

The module creates the header line in Row 1.
You just need all your URLs in column A.

Good luck.

hth
 

kmpoaquests

New Member
Joined
Sep 26, 2011
Messages
33

ADVERTISEMENT

Hi

Here is the modified code :-
Code:
Option Base 1
Sub YahooContrib()
'
' Created by ukmikeb (MrExcel forum) on 14/05/2013
'
Dim IE As InternetExplorer
Dim Header As String, my_URL As String
Dim Rw As Long
Dim PtrCntnt, PtrCntrib, PtrFans
Dim lg4, lg5, lg6, lg7, lg8

Header = Array("URL", "Author", "General Background", "Page views", "Fans", "Contrib Since", "Education", "Affiliations", "Motto", "Interests")
Columns("A").ColumnWidth = 30
Columns("B").ColumnWidth = 30
Columns("C").ColumnWidth = 60
Columns("D").ColumnWidth = 12
Columns("E").ColumnWidth = 10
Columns("F").ColumnWidth = 10
Columns("G").ColumnWidth = 15
Columns("H").ColumnWidth = 30
Columns("I").ColumnWidth = 30
Columns("J").ColumnWidth = 30
Columns("K").ColumnWidth = 50

Range("A1:K1") = Header
With Range("A1:K1")
  .Font.Bold = True
  .HorizontalAlignment = xlCenter
End With
  
Set IE = New InternetExplorer
Rw = 2

Do While Not IsEmpty(Range("A" & Rw))
my_URL = Range("A" & Rw).Value
'   IE.Visible = True    'Some things don't work unless it's visible

'  Navigate the web to Race Result oage for race-id

    IE.navigate my_URL
    
    'Loop unitl ie page is fully loaded
    Do Until IE.ReadyState = READYSTATE_COMPLETE
        DoEvents
    Loop
    
    Do Until IE.Document.ReadyState = "complete"
        DoEvents
    Loop

    ' Set document object
    Set HTMLdocMain = IE.Document
     
     Range("B" & Rw) = HTMLdocMain.getElementsByTagName("h1")(0).innerText ' Author
     
     Range("C" & Rw) = HTMLdocMain.getElementsByClassName("sec_col")(0).innerText ' General Background
     
     lg4 = HTMLdocMain.getElementsByClassName("stats")(0).innerText ' Page views etc
     PtrCntnt = InStr(lg4, "Content")
     PtrFans = InStr(lg4, "Fans")
     PtrCntrib = InStr(lg4, "Contributor")
     Range("D" & Rw) = Mid(lg4, 11, PtrCntnt - 11)
     Range("E" & Rw) = Mid(lg4, PtrCntnt + 7, PtrFans - (PtrCntnt + 7))
     Range("F" & Rw) = Mid(lg4, PtrFans + 4, PtrCntrib - (PtrFans + 4))
    
     Range("G" & Rw) = Right(lg4, Len(lg4) - (PtrCntrib + 16))
     
     lg5 = HTMLdocMain.getElementsByClassName("prim_col_sect")(1).innerText ' Education/Experience
     Range("H" & Rw) = Right(lg5, Len(lg5) - 20)
    
     lg6 = HTMLdocMain.getElementsByClassName("prim_col_sect")(4).innerText ' Affiliations
     Range("I" & Rw) = Right(lg6, Len(lg6) - 12)
     
     lg7 = HTMLdocMain.getElementsByClassName("prim_col_sect")(3).innerText ' Motto
     Range("J" & Rw) = Right(lg7, Len(lg7) - 5)
     
     lg8 = HTMLdocMain.getElementsByClassName("prim_col_sect")(2).innerText 'Interests
     Range("K" & Rw) = Right(lg8, Len(lg8) - 12)

     Rw = Rw + 1
 Loop
End Sub

The module creates the header line in Row 1.
You just need all your URLs in column A.

Good luck.

hth

copied and pasted your code into the workbook
placed a url in a1 and tried to run the macro

got an error

compile error
"user-defined type not defined"

any ideas ?
(I have microsoft dao 3.6 object library" enabled fyi

I tried some googling to fix the problem , no dice
 

ukmikeb

Well-known Member
Joined
Jul 10, 2009
Messages
2,757
Hi

Please don't quote the whole post when replying.

Go into the VBE - Tools - References and ensure you have the following selected :-

Microsoft HTML Object Library
Microsoft XML, v6.0
HTML Dialogs 1.0 Type Library and
Microsoft VbScript Regular Expressions 5.5

hth
 

kmpoaquests

New Member
Joined
Sep 26, 2011
Messages
33

ADVERTISEMENT

I added all the libraries you requested within the references, but

I do not have the HTML Dialogs 1.0 Type Library

any idea how I can add it as well ?

I REALLY SERIOUSLY APPRECIATE THE HELP !
 

kmpoaquests

New Member
Joined
Sep 26, 2011
Messages
33
I've also enabled microsoft internet controls

it seemed to have done something because I'm getting a new error

Run-Time Error 13 - Type Mismatch
 
Last edited:

ukmikeb

Well-known Member
Joined
Jul 10, 2009
Messages
2,757
What line threw up the error?

It worked ok for me with the M D Lynn URL.
 

ukmikeb

Well-known Member
Joined
Jul 10, 2009
Messages
2,757
Hi

Further to your PM.

Probably my bad with this line (confirm it's line 83 in the code) :-
Rich (BB code):
     Range("K" & Rw) = Right(lg8, Len(lg8) - 12)
which should be :-
Rich (BB code):
     Range("K" & Rw) = Right(lg8, Len(lg8) - 9)

and replace the Header line with :-
Rich (BB code):
Header = Array("URL", "Author", "General Background", "Page views", "Content", "Fans", "Contrib Since", "Education", "Affiliations", "Motto", "Interests")

If it is not the line I have corrected above, could you let me know with the text of the line.

hth
 

Watch MrExcel Video

Forum statistics

Threads
1,122,961
Messages
5,599,061
Members
414,281
Latest member
Engjamal2021

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Top