Using Web Queries and a Loop to Download 4000 Database Entries from 4000 Web Pages


September 26, 2004

One day, I received a broadcast e-mail from Jan at the PMA. She was passing along a great idea from Gary Gagliardi of Clearbridge Publishing. Gary mentioned that some search engines rank a page based in part on how many other sites link to it. He suggested that if each of the 4000 PMA members linked to all of the other members, it would boost everyone's rankings. Jan thought this was a great idea and noted that all PMA member web addresses are listed in the members area of the current PMA website.

Personally, I think the "number of links" theory is a bit of a myth, but I was willing to give it a try in order to help out.

So, I visited the PMA Members area, where I quickly learned that there wasn't a single list of members, but in fact 27 lists of members.

I visited the PMA Members area.

As I clicked through to the "A" page, I saw that it was even worse. The links on this page did not lead to the members' websites. Instead, each link led to an individual page at PMA-Online that contained the member's website.

Links in the web page.


This would mean that I would have to visit thousands of web pages in order to compile the list of members. This clearly would be an insane proposition.

Luckily, I am the co-author of VBA & Macros for Microsoft Excel. I wondered if I could customize the code from the book to solve the problem of extracting member URLs from thousands of linked pages.

Chapter 14 of the book is about using Excel for reading from and writing to the web. On page 335, I found code that could create a web query on the fly.

The first step was to see if I could customize the code in the book to produce 27 web queries: one for each of the 26 letters of the alphabet, plus one for the listings filed under the number 1. This would give me lists of all the links on the 27 member listing pages.

Each page has a URL similar to http://www.pma-online.org/scripts/showmemlist.cfm?letter=A. I took code from page 335 and customized it a bit to do 27 web queries.

Sub CreateNewQuery()
    ' Page 335
    Dim QT As QueryTable
    Dim m As Long
    Dim MyStr As String, MyName As String, ConnectString As String
    
    For m = 1 To 27
        Select Case m
            Case 27
                MyStr = "1"
            Case Else
                MyStr = Chr(64 + m)
        End Select
        MyName = "Query" & m
        ConnectString = "URL;http://www.pma-online.org/scripts/showmemlist.cfm?letter=" & MyStr
        ' Add a new worksheet for this letter; the sheets end up named 1 through 27
        ThisWorkbook.Worksheets.Add
        ActiveSheet.Name = m
        
        ' On the new worksheet, clear any existing query tables
        For Each QT In ActiveSheet.QueryTables
            QT.Delete
        Next QT
        
        ' Define a new Web Query
        Set QT = ActiveSheet.QueryTables.Add(Connection:=ConnectString, Destination:=Range("A1"))
        With QT
            .Name = MyName
            .FieldNames = True
            .RowNumbers = False
            .FillAdjacentFormulas = False
            .PreserveFormatting = True
            .RefreshOnFileOpen = False
            .BackgroundQuery = False
            .RefreshStyle = xlInsertDeleteCells
            .SavePassword = False
            .SaveData = True
            .AdjustColumnWidth = True
            .RefreshPeriod = 0
            .WebSelectionType = xlSpecifiedTables
            .WebFormatting = xlWebFormattingAll
            .WebTables = "7"
            .WebPreFormattedTextToColumns = True
            .WebConsecutiveDelimitersAsOne = True
            .WebSingleBlockTextImport = False
            .WebDisableDateRecognition = False
            .WebDisableRedirections = False
        End With
        
        ' Refresh the Query
        QT.Refresh BackgroundQuery:=True
        
    Next m
    
End Sub

I customized four items in the above code:

  • First, I had to build the correct URL. This was achieved by appending the proper letter to the end of the URL string.
  • Second, I modified the code to run each query on a new worksheet in the workbook.
  • Third, the code in the book was grabbing the 20th table from the web page. By recording a macro pulling in the table from PMA, I learned that I needed the 7th table on the web page.
  • Fourth, after running the macro, I was disappointed to see that I was getting the names of the publishers but not the hyperlinks. The code in the book specified .WebFormatting = xlWebFormattingNone. Using VBA help, I figured out that if I changed this to .WebFormatting = xlWebFormattingAll, I would get the actual hyperlinks.

After running this first macro, I had 27 worksheets, each with a series of hyperlinks that looked like this:

Extracted links with hyperlinks in Excel.

The next step was to extract the hyperlinked address from every hyperlink on the 27 worksheets. It is not in the book, but Excel exposes a Hyperlink object. That object has an .Address property, which returns the URL behind the link; in this case, that is the individual page at PMA-Online holding the publisher's website.

Sub GetEmAll()
    ' Gather the address behind every hyperlink on the 27 query sheets
    ' into column A of Sheet1
    Dim WSD As Worksheet
    Dim WS As Worksheet
    Dim cll As Range
    Dim hl As Hyperlink
    Dim NextRow As Long
    NextRow = 1
    Set WSD = Worksheets("Sheet1")
    For Each WS In ActiveWorkbook.Worksheets
        If Not WS.Name = "Sheet1" Then
            For Each cll In WS.UsedRange.Cells
                For Each hl In cll.Hyperlinks
                    WSD.Cells(NextRow, 1).Value = hl.Address
                    NextRow = NextRow + 1
                Next hl
            Next cll
        End If
    Next WS
End Sub

After running this macro, I finally learned that there were 4119 individual web pages at the PMA site. I am glad that I did not try to visit each of them one at a time!

My next goal was to build a web query to visit each of the 4119 individual web pages. I recorded a macro that returned one of the individual publisher pages and learned that I wanted table #5 from each page. I could see that the publisher name was returned in the fifth row of the table. In most cases, the website was returned in the 13th row. However, in some cases, if the street address was three lines instead of two, the website URL was actually in row 14. If a publisher had three telephone numbers instead of two, the website was pushed down another row. The macro would have to be flexible enough to search roughly rows 13 through 20 in order to find the cell that started with "WWW:".
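
That flexible search can be isolated into a small helper. The function below is only a sketch under the assumption that the imported table sits in a single column and that the website line, if present, falls somewhere in rows 13 through 20; the AllQuery macro further down does the same search inline.

Function FindWebsiteText(WS As Worksheet, Col As Long) As String
    ' Scan rows 13 through 20 of the given column and return the first
    ' cell whose text starts with "WWW". Returns an empty string if no
    ' website line is found.
    Dim j As Long
    Dim CheckIt As Variant
    FindWebsiteText = ""
    For j = 13 To 20
        CheckIt = WS.Cells(j, Col).Value
        If Left(CheckIt, 3) = "WWW" Then
            FindWebsiteText = CheckIt
            Exit For
        End If
    Next j
End Function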

There was another dilemma. The code in the book lets the web query refresh in the background; in most cases, I would actually watch the query finish after the macro had already ended. My initial thought was to allow 40 rows for each publisher and to build all 4100 queries on a single worksheet. That would have required over 160,000 rows of spreadsheet and a lot of memory, which is more rows than an Excel 2002 worksheet even offers. Instead, I experimented with changing BackgroundQuery to False. VBA did a good job of pulling the information into the worksheet before the macro moved on. This allowed me to build a query, refresh it, save the values to a database worksheet, and then delete the query. Using this method, there was never more than one query at a time on the worksheet.
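
Stripped of the loop, the pattern for a single page looks like this. It is only a sketch: the URL below is a made-up placeholder standing in for one of the individual member pages, and the worksheet name is an assumption.

Sub PullOnePage()
    Dim WS As Worksheet
    Dim QT As QueryTable
    Set WS = Worksheets("Sheet1")
    
    ' Build the query (the URL here is only a placeholder)
    Set QT = WS.QueryTables.Add( _
        Connection:="URL;http://www.pma-online.org/scripts/showmember.cfm?memberid=12345", _
        Destination:=WS.Range("H1"))
    With QT
        .BackgroundQuery = False
        .RefreshStyle = xlOverwriteCells
        .WebSelectionType = xlSpecifiedTables
        .WebTables = "5"
    End With
    
    ' Refresh synchronously so the data is on the sheet before the next line runs
    QT.Refresh BackgroundQuery:=False
    
    ' Freeze the results as plain values, then remove the query definition
    WS.Range("H1").Resize(40, 2).Value = WS.Range("H1").Resize(40, 2).Value
    QT.Delete
End Sub

The full AllQuery macro below simply wraps this pattern in a loop over all of the individual pages.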

Sub AllQuery()
    Dim WS As Worksheet
    Dim WD As Worksheet
    Set WD = Worksheets("database")
    Set WS = Worksheets("Sheet1")

    Dim QT As QueryTable
    WS.Activate
    ' Each query lands in column H of Sheet1, always starting at row 1
    OutCol = 8
    OutRow = 1
    FinalRow = WS.Cells(65536, 1).End(xlUp).Row
    For i = 2 To FinalRow
        ' Build the connect string from the URL stored in column 12 of the database sheet
        ConnectString = "URL;" & WD.Cells(i, 12).Value
        ' Show progress in the status bar
        Application.StatusBar = i
        
        ' Save after every 500 queries
        If i Mod 500 = 0 Then
            ThisWorkbook.Save
        End If
        
        MyName = "Query" & i
        
        ' Define a new Web Query
        Set QT = ActiveSheet.QueryTables.Add(Connection:=ConnectString, Destination:=WS.Cells(OutRow, OutCol))
        With QT
            .Name = MyName
            .FieldNames = True
            .RowNumbers = False
            .FillAdjacentFormulas = False
            .PreserveFormatting = True
            .RefreshOnFileOpen = False
            .BackgroundQuery = False
            .RefreshStyle = xlOverwriteCells
            .SavePassword = False
            .SaveData = True
            .AdjustColumnWidth = True
            .RefreshPeriod = 0
            .WebSelectionType = xlSpecifiedTables
            .WebFormatting = xlWebFormattingNone
            .WebTables = "5"
            .WebPreFormattedTextToColumns = True
            .WebConsecutiveDelimitersAsOne = True
            .WebSingleBlockTextImport = False
            .WebDisableDateRecognition = False
            .WebDisableRedirections = False
        End With
    
        ' Refresh the Query
        QT.Refresh BackgroundQuery:=False
        ' Change from a live query to values
        WS.Cells(OutRow, OutCol).Resize(40, 2).Value = WS.Cells(OutRow, OutCol).Resize(40, 2).Value
        For Each QT In WS.QueryTables
            QT.Delete
        Next QT
    
        ' Copy to Database: the publisher name is in row 5 of the results;
        ' the website is in whichever row from 13 to 20 starts with "WWW"
        WD.Cells(i, 1).Value = WS.Cells(5, 8).Value
        For j = 13 To 20
            CheckIt = WS.Cells(j, 8).Value
            If Left(CheckIt, 3) = "WWW" Then
                WD.Cells(i, 8).Value = CheckIt
            End If
        Next j
    
    Next i
    
    ' Restore the normal status bar
    Application.StatusBar = False
    
End Sub

This macro took more than an hour to run. After all, it was doing the work of visiting over 4000 web pages. It ran without a hitch and did not crash the computer or Excel.

I then had a nice database in Excel with the publisher name in column A and the website in column B. After sorting by the website in column B, I found that over 1000 publishers did not list a website; their entry in column B was blank. I sorted the list and deleted those rows.

Also, the websites listed in column B had "WWW: " before each URL. I used Edit > Replace to change each occurrence of "WWW: " (with the trailing space) to nothing. That left me with a clean list of 2339 publishers on the spreadsheet.
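
Both cleanup steps could also be scripted instead of done by hand. Here is a rough sketch, assuming the publisher names sit in column A and the websites in column B of the database worksheet:

Sub CleanUpList()
    Dim WD As Worksheet
    Dim FinalRow As Long
    Dim i As Long
    Set WD = Worksheets("database")
    
    ' Remove the "WWW: " prefix from every URL in column B
    WD.Columns("B").Replace What:="WWW: ", Replacement:="", LookAt:=xlPart
    
    ' Delete rows where no website was listed, working from the bottom up
    FinalRow = WD.Cells(WD.Rows.Count, 1).End(xlUp).Row
    For i = FinalRow To 2 Step -1
        If WD.Cells(i, 2).Value = "" Then
            WD.Rows(i).Delete
        End If
    Next i
End Sub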

Publishers list on the spreadsheet.

The last step was to write out a text file that could be copied and pasted into any member's website. The following macro (adapted from the code on page 345) handled this task nicely.

Sub WriteHTML()
    On Error Resume Next
    Kill "C:\PMALinks.txt"
    On Error GoTo 0
    
    Open "C:\PMALinks.txt" For Output As #1
    Print #1, "Visit the websites of our fellow PMA members:<br><UL>"
    
    ' Rows 2 through 2340 hold the 2339 publishers with websites
    For i = 2 To 2340
        MyStr = "<LI><a href=""" & Cells(i, 2).Value & """>" & Cells(i, 1).Value & "</a>"
        Print #1, MyStr
    Next i
    
    Print #1, "</UL>"
    Close #1
End Sub

The result was a text file with the name and URL of 2000+ publishers.
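
For illustration, the file contained one line per publisher and looked roughly like this; the publisher names and URLs shown here are made-up placeholders, not actual PMA members.

Visit the websites of our fellow PMA members:<br><UL>
<LI><a href="http://www.examplepress.com">Example Press</a>
<LI><a href="http://www.samplebooks.com">Sample Books</a>
</UL>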

All of the above code was adapted from the book. When I started, I was just writing a one-off program that I did not envision running regularly. However, I can now imagine going back to the PMA website every month or so to get the updated list of URLs.

It would be possible to put all of the above steps into a single macro.

Sub DoEverything()
    Dim WSW As Worksheet
    Dim WST As Worksheet
    Dim QT As QueryTable
    Dim cll As Range
    Dim hl As Hyperlink
    ' This macro expects worksheets named Workspace and Template to exist
    Set WSW = Worksheets("Workspace")
    Set WST = Worksheets("Template")
    
    On Error Resume Next
    Kill "C:\AutoLinks.txt"
    On Error GoTo 0
    
    Open "C:\PMALinks.txt" For Output As #1
    Print #1, "Visit the websites of our fellow PMA members:<br><UL>"
    
    For m = 1 To 27
        Select Case m
            Case 27
                MyStr = "1"
            Case Else
                MyStr = Chr(64 + m)
        End Select
        MyName = "Query" & m
        ConnectString = "URL;http://www.pma-online.org/scripts/showmemlist.cfm?letter=" & MyStr
                
                
        ' On the Workspace worksheet, clear all existing query tables and
        ' the data left behind by the previous letter's query
        For Each QT In WSW.QueryTables
            QT.Delete
        Next QT
        WSW.Cells.Clear
        
        ' Define a new Web Query
        Set QT = WSW.QueryTables.Add(Connection:=ConnectString, Destination:=WSW.Range("A1"))
        With QT
            .Name = MyName
            .FieldNames = True
            .RowNumbers = False
            .FillAdjacentFormulas = False
            .PreserveFormatting = True
            .RefreshOnFileOpen = False
            .BackgroundQuery = False
            .RefreshStyle = xlInsertDeleteCells
            .SavePassword = False
            .SaveData = True
            .AdjustColumnWidth = True
            .RefreshPeriod = 0
            .WebSelectionType = xlSpecifiedTables
            .WebFormatting = xlWebFormattingAll
            .WebTables = "7"
            .WebPreFormattedTextToColumns = True
            .WebConsecutiveDelimitersAsOne = True
            .WebSingleBlockTextImport = False
            .WebDisableDateRecognition = False
            .WebDisableRedirections = False
        End With
        
        ' Refresh the Query
        QT.Refresh BackgroundQuery:=False
        
        ' Next, loop through all of the hyperlinks in the resulting page
        For Each cll In WSW.UsedRange.Cells
            For Each hl In cll.Hyperlinks
                MyURL = hl.Address
                ' Build a web query on WST
                ConnectString = "URL;" & MyURL
                
                MyName = "Query" & NextRow
                
                ' Define a new Web Query
                Set QT = WST.QueryTables.Add(Connection:=ConnectString, Destination:=WST.Cells(1, 1))
                With QT
                    .Name = MyName
                    .FieldNames = True
                    .RowNumbers = False
                    .FillAdjacentFormulas = False
                    .PreserveFormatting = True
                    .RefreshOnFileOpen = False
                    .BackgroundQuery = False
                    .RefreshStyle = xlOverwriteCells
                    .SavePassword = False
                    .SaveData = True
                    .AdjustColumnWidth = True
                    .RefreshPeriod = 0
                    .WebSelectionType = xlSpecifiedTables
                    .WebFormatting = xlWebFormattingNone
                    .WebTables = "5"
                    .WebPreFormattedTextToColumns = True
                    .WebConsecutiveDelimitersAsOne = True
                    .WebSingleBlockTextImport = False
                    .WebDisableDateRecognition = False
                    .WebDisableRedirections = False
                End With
    
                ' Refresh the Query
                QT.Refresh BackgroundQuery:=False
                ' Change from a live query to values
                WST.Cells(1, 1).Resize(40, 2).Value = WST.Cells(1, 1).Resize(40, 2).Value
                For Each QT In WST.QueryTables
                    QT.Delete
                Next QT
        
                ' Find the publisher name and URL. The query landed in column A
                ' of the Template sheet, so read from column 1 here.
                ThisPub = WST.Cells(5, 1).Value
                ThisURL = "WWW: http://"
                For j = 13 To 20
                    CheckIt = WST.Cells(j, 1).Value
                    If Left(CheckIt, 3) = "WWW" Then
                        ThisURL = CheckIt
                    End If
                Next j
                
                If Not ThisURL = "WWW: http://" Then
                    ' Strip the "WWW: " prefix and write a record to the .txt file
                    MyStr = "<LI><a href=""" & Replace(ThisURL, "WWW: ", "") & """>" & ThisPub & "</a>"
                    Print #1, MyStr
                End If
            Next hl
        Next cll
    Next m
                
    Print #1, "</UL>"
    Close #1

End Sub

Excel and VBA provided a quick alternative to individually visiting thousands of web pages. In theory, the PMA should have been able to query their database and provide this information far more quickly than using this method. However, sometimes you are dealing with someone who is uncooperative or possibly doesn't know how to get data out of a database that someone else wrote for them. In this case, a bit of VBA macro code solved our problem.