Scraping web site...pulling in html source to parse for data

OaklandJim

Well-known Member
Joined
Nov 29, 2018
Messages
833
Office Version
  1. 365
Platform
  1. Windows
Am trying to scrape Morningstar web pages for funds data. In the past I paid up to get help with this as it is beyond my abilities. I'll pay up again if I have to.

Anyway, before Morningstar redesigned their web pages the following code used to pull in the entire html source for a web url.

VBA Code:
Function GetSource(sURL As String) As String
    Dim oXHTTP As Object
    
    Set oXHTTP = CreateObject("MSXML2.XMLHTTP")
    
    On Error GoTo haveError
    oXHTTP.Open "GET", sURL, False
    oXHTTP.send
    GetSource = oXHTTP.responsetext
        
    Set oXHTTP = Nothing
        
Exit Function

haveError:

    Debug.Print "Error in Function GetSource for URL " & sURL
    Err.Clear
    GetSource = ""
    Set oXHTTP = Nothing
End Function

Now it pulls in text that looks a lot different than the source does when I look at it and save it using a browser. Specifically the code pulls in SOME html but it is a much smaller file (55k chars) than the source (~1.8 million chars ) that I see/save using the browser.

In the latter I can see my data and its tags and know how to parse to get it. BUT I cannot get the "full" source code to parse. The "small" source I get does not have my data in it.

Also, I presume that VBA string vars and string functions can handle that many chars?

Here is a typical URL for me. This is for an ETF

Here is one for a mutual fund
 

Excel Facts

Copy formula down without changing references
If you have =SUM(F2:F49) in F50; type Alt+' in F51 to copy =SUM(F2:F49) to F51, leaving the formula in edit mode. Change SUM to COUNT.
Hi Jim,

what data are you looking for? If I load the page (VNQ), I see that they load some stuff after loading the initial page, calling on an internal API. So it could be that your scraping can be more efficient by tapping into those. If I check the source of the page (View page source), I also see a more compact version of the page, where also a lot of stuff is not showing. So it seems that they made their side scrape-unfriendly on purpose.
Having said that, you could try this trick to get to their API (python): Google Translate
And to finish it off: I felt intrigued and spent some time building a small tool that can scrape the code through the JSONs, see this dropbox folder and find the file morningstar_scrape.xlsm (the others are for other questions I answered on this forum): MrExcel

Koen
 
Upvote 0
Koen I really appreciate the thoughtful response! Thanx a lot. I just want to scrape basic info from Morningstar like fund expense, yield, stars rating, medal rating and asset allocation for my portfolio spreadsheet. I skimmed the resource you suggested. Looks great. But, unfortunately I'm getting old and not so up-to-date enuf on modern programming stuff. VBA, php, limited javascript but never worked with an API or Python or JOSN. Anyway, I cannot expect much handholding for such a journey so I conclude that 1. what I want can be done but 2. I am not really up to learning what seems like a lot. I'll suck it up and do by transcribing data for my 50 funds. I'll need a half day at least. Prefer to do quarterly or more often but may have to stick with one update a year. Heheh, if I could program it in two weeks I'd be surprised. Last thought, is there a good clearinghouse for programmers-for-hire who might do this for me for a reasonable cost? Have a great day all.
 
Upvote 0
Koen I did not want to feel ungrateful so I took a look at the workbook that you put together. Wow. That took a lot of work. I can't tell how good it is but it sure seems sophisticated. What to do with it though? I looked at it and I see some semblance of VBA that I know but a lot is not understandable. Might you add a super simple basic front end on it so I can mess with it? A button even that makes relevant call to do the scraping and output of the array? Otherwise I just won't know what to do with your hard work. Thanks again, really.
 
Upvote 0
Koen I did not want to feel ungrateful so I took a look at the workbook that you put together. Wow. That took a lot of work. I can't tell how good it is but it sure seems sophisticated. What to do with it though? I looked at it and I see some semblance of VBA that I know but a lot is not understandable. Might you add a super simple basic front end on it so I can mess with it? A button even that makes relevant call to do the scraping and output of the array? Otherwise I just won't know what to do with your hard work. Thanks again, really.
Hi Jim,
thanks for your kind words. I wrote a big chunck of the code for some other project, so that was basically copy-paste. I could build something with a button if you provide me with a mockup. So if you can make an excel sheet with your user interface, I can fuse that with my code. You can't attach excel files, so use e.g. dropbox to get your file accross.
Thanks,
Koen
 
Upvote 0

Forum statistics

Threads
1,214,641
Messages
6,120,695
Members
448,979
Latest member
DET4492

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top