Scrape Webpages Using Power Query
October 25, 2017 - by Bill Jelen
Power Query is pretty powerful. But I am about to do Power Query Squared... writing a query for one page and then having Excel perform the same query for a whole list of web pages.
- Today's trick is adapted from the M is for Data Monkey book
- Build a query to get data from one web page
- Edit the query to change it into a function with (VariableName)=> before Let
- Change the hard-coded URL to VariableName
- Rename the query to fxWeather
- Close & Load. The data will disappear.
- Use Excel Trickery to create a table of all URL's
- Create a query from that table.
- Add a new column of Weather
- Expand the column. Uncheck Prefix
Learn Excel from MrExcel Podcast, Episode 2056: Power Query Squared
Hey, welcome back to the MrExcel netcast, I'm Bill Jelen. I came across this trick while I was preparing to do a seminar at a conference in Dallas called Excelapalooza, greatest Excel conference name ever. You should check it out every September in Dallas.
And credit to Ken Puls and Miguel Escobar because I had to do an hour on power query, so of course, I pulled out there awesome book, the world's greatest book on Power Query. I was flipping through the book and I saw they had one technique and I said, “Wait a second. I'm going to see if I can adapt this technique.” And this to me is awesome in power query. And here's what we're going to do, we're going to take one power query and then we're going to run that power query, that query, dozens of times, alright?
And so, the example that I came up with was where I wanted to pull data from a web page, alright. And I went out to - I just looked for some web pages I could- that I could use as an example. I ended up at Weather Underground and here's the URL, and you can see that I was in Dallas-Fort Worth, so we're pulling data for Dallas and it looks like January 2nd, 2015. So right there in the URL is the parameters, right? And that is a URL that is just ripe for making this thing work.
We'll take a quick look at the web page although it's not so important. You see there's a lot of different data out here on the web page and I decided I was just going to try and get precipitation and high and low temperature. And here's that query. And let me tell you right here that this podcast is not about how to Create this Query, New Query, From Other Sources, From Web, specify the URL and then a bunch of steps that I'm not going to detail here to get my final answer of Max Temp, Min Temp, and precipitation. The point is you're going to take your own query and make it work for a bunch of things.
So, I click Close & Load and this query is working, it's returning my one row. Everything is awesome. And I'm going to come back in, I'm going to Edit this query and I'm going to go to View, Advanced Editor. I'm going to take this query and I'm going to make it be a function, alright? So right here before the word LET, I press Enter. And in parentheses, I'm going to give it a variable (MyURL) and then => little arrow there, alright? Cool. And then down here, where they have the URL in quotes, I want to get rid of the entire URL including the quotes and then type my variable name MyURL, alright. So what we're saying is, we're going to pass it to URL and it's going to do that same query but with whatever URL we happen to pass it.
Now, a couple of disconcerting things here, when I click Done, oh men! All my applied steps are gone and they want me to enter a parameter. Just ignore all that. We're going to rename this; we're going to call it fxWeather. FX, of course, being the abbreviation for function and you have to really remember this name and remember which letters are capitalized, that's going to be very important in a couple of minutes. Home, Close & Load, and BAM! Everything is gone. Oh, no! But that's okay. Alright, so we know it's there. It's a connection only. Now, I'm going to come over here and this is just straight old Excel, alright? So here's the URL, I broke it out into the first part of the URL, the ending part of the URL. I know that I need to take the date; I need to format it in this weird format of year, month, and day so I used the TEXT function to do that. Put in the starting date here. I can even change the airport, so now I'm back and forth. Let's do MCO for Orlando and let's do some recent data. So I’ll start at 10/1/2016, alright. So now we have this great little table set up here. And by the way, it does have to be a table. You have to use format as table or Ctrl+T. So, you know, this is just grabbing that date and then +1+1+1. I format it, I build the URL.
Alright now, we're going to build a query from this table. Alright, and there's my information. I'm going to Add a New Column, Add a Custom Column, the columns were going to be called Weather, and the formula is going to be =fxWeather. Make sure it's the exact same case, same upper and lower case letters, and we'll insert that field called URL like that, closing parenthesis. No syntax errors, click OK. They want to know about Privacy here, this is all Public data, click Save, alright. So, there's our date. It's funny they've actually changed my format into something that doesn't look like what I started out with. And then here's Weather with the expand symbol. So I'm going to click the expand symbol, uncheck Use original column name as prefix. I want the Max, the Min, the Precipitation, click OK. Alright, and now all I need is the date and that information out there. So I'll right-click and remove this column, right-click and remove this column. Over here, I don't need that time so I'll say this is just a Date, alright. And see what it's doing every date that I'm passing it; it's returning the high, the low, and the precipitation for Orlando. Every row here is going out to a different web page. Just imagine, if it wasn't 15 rows but 5,000 rows, you'd set it up to run overnight. I used to write macros for this. In fact, one of the web pages out at MrExcel is how to build a macro to scrape web pages from a thousand different web pages at a site, not necessary anymore with power query.
Now, when I Close & Load, it's funny the preview here is showing me all the results. When I Close & Load, they're actually going to go do each query. And so right now we have preview rows loaded and it will take a good long time for this information to get updated. So, go to lunch, do something especially if you're doing more than 15 rows. And it's funny the preview is correct but they're still going out and could chunk, could chunk, could chunking through each individual row.
And there it's loaded. Is this amazing or what? Hey, I do a lot of Excel seminars, the Power Excel seminar in Orlando, Florida. Look at these beautiful temperatures we have down here on November 4, 2016. My afternoon will be all about Power Query, Power BI, Power Pivot, Power Map. So, I'm going to invite you down to Orlando to check out this seminar. There'll be a link there in the top-right hand corner.
Alright, so recap. Today's trick is from this book, M is for (DATA) MONKEY. We built a query to get one web page and then edit that query to change it into a function. So right before the Let statement variable name => and then change the hard-coded URL to whatever that variable name is. Rename the query to fxWeather, Close & Load, the data disappears. Then, we use some sort of Excel trickery to create a table of all the URLs we want to crawl, create a query from that table. This has to be a Ctrl+T table, add a new column of Weather = fxWeather and again it has to match the case there, [URL], Expand that column, uncheck Prefix. BAM! It is amazing.
Thanks to Ken and Miguel for writing this book. Thanks to you for stopping by. Hope to see you in Orlando on November 4th 2016. See you next time for another netcast from MrExcel.
Download the sample file here: Podcast2056.xlsm
Title Photo: Tama66 / Pixabay