Scrape website with pagination

beijing0305

New Member
Joined
Feb 14, 2021
Messages
9
Office Version
  1. 2016
Platform
  1. Windows
I have tried a couple of approach to scrape a website with simple data layout, but was not successful with the following 3 issues:
1) tweaking the TagName "tabble", "tr", "tb", not resulting desirable results
2) when (1) was able to pick up the title row, unicode result display the Chinese Character as ???
3) Pagination - probably will need to add another loop to pick up page 3, etc.
Thanks in advance for your help.
Bei


VBA Code:
Sub CAACCertVendorPartListXML()

    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLPage.Open "GET", "http://fsop.caac.gov.cn/g145/CARS/WebSiteQueryServlet?method=loadAircraftConditionsResultPage&enterpriseName=&licenceCode=&partsNumber=&partsName=&ataChaptersection=", False
    XMLPage.send
    
    HTMLDoc.body.innerHTML = XMLPage.responseText
    
    ProcessHTMLPage HTMLDoc
    
End Sub

VBA Code:
Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    Dim HTMLTable As MSHTML.IHTMLElement
    Dim HTMLTables As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
        
    Set HTMLTables = HTMLPage.getElementsByTagName("tbody")
    
    For Each HTMLTable In HTMLTables
        Debug.Print HTMLTable.tagName
                
        For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
            'Debug.Print HTMLRow.Children
                                                
            For Each HTMLCell In HTMLRow.getElementsByTagName("td")
                Debug.Print vbTab & HTMLCell.innerText
            Next HTMLCell
                      
        Next HTMLRow
            
    Next HTMLTable
    
End Sub
 
Need some help with this line...following don't seem to work. Thanks.

Set pt = driver.FindElementByXPath("//*[@id=""example_next""]/a[Next]") ' next button

Set pt = driver.FindElementByXPath("//*[@id=""example_next""]/[Next]") ' next button
 
Upvote 0

Excel Facts

Copy PDF to Excel
Select data in PDF. Paste to Microsoft Word. Copy from Word and paste to Excel.
Here is how you can get the next button. Note that if you click the fifth button, two more buttons became available.

Also be careful with the loop because there are 48K pages…

VBA Code:
Public driver As New ChromeDriver
Sub WChr()
Dim pt As WebElement
driver.get "http://fsop.caac.gov.cn/g145/CARS/WebSiteQueryServlet?method=loadAircraft" & _
"ConditionsResultPage&enterpriseName=&licenceCode=&partsNumber=&partsName=&ataChaptersection="
Application.Wait Now + TimeValue("0:00:10")
Set pt = driver.FindElementById("example_next")
MsgBox pt.Text
pt.Click
End Sub
 
Upvote 0
Solution
Thanks again for the guidance. Works as expected now. Still can't see to get rid of the title row. How would I go about starting with "tbody" starting 2nd page and on?


Public driver As New ChromeDriver
Sub WChr()
Dim pt As WebElement
Dim pcount As Integer 'page counter
'Get first page
driver.get "http://fsop.caac.gov.cn/g145/CARS/WebSiteQueryServlet?method=loadAircraft" & _
"ConditionsResultPage&enterpriseName=&licenceCode=&partsNumber=&partsName=&ataChaptersection="
Application.Wait Now + TimeValue("0:00:10")
Set pt = driver.FindElementById("example")
pt.AsTable.ToExcel ActiveSheet.[a10]
For pcount = 2 To 10
Set pt = driver.FindElementById("example_next") ' next button
'MsgBox pt.Text
pt.Click
Application.Wait Now + TimeValue("0:00:10")
Set pt = driver.FindElementById("example")
pt.AsTable.ToExcel ActiveSheet.Cells(Range("a" & Rows.Count).End(xlUp).Row + 1, 1)
Next pcount
End Sub
 
Upvote 0
I have always used this to navigate to next page on internet explorer, never had an issue with it. The next page class will need changing, and you can increase the page amount.

VBA Code:
If pageNumber >= 5 Then Exit Do 'the first 5 pages
        Set nextPageElement = htmlDoc.getElementsByClassName("gspr next")(0
        If nextPageElement Is Nothing Then Exit Do
       
        nextPageElement.Click 'next web page
        Do While ie.Busy Or ie.readyState <> 4
            DoEvents
        Loop
        Set htmlDoc = ie.document
        pageNumber = pageNumber + 1
 
Upvote 0
You can delete the headers afterwards:

VBA Code:
Sub Multiple()
Dim cell As Range, i%, continue As Boolean
i = 1
Do
    Set cell = [a:a].Find([a10].Value, , xlValues, xlWhole, xlByRows, 2)
    continue = False
    If cell.Address <> [a10].Address Then
        continue = True
        cell.EntireRow.Delete
    End If
    i = i + 1
Loop While i < 20 And continue
End Sub
 
Upvote 0
You can delete the headers afterwards:

VBA Code:
Sub Multiple()
Dim cell As Range, i%, continue As Boolean
i = 1
Do
    Set cell = [a:a].Find([a10].Value, , xlValues, xlWhole, xlByRows, 2)
    continue = False
    If cell.Address <> [a10].Address Then
        continue = True
        cell.EntireRow.Delete
    End If
    i = i + 1
Loop While i < 20 And continue
End Sub
Thanks much for this. I can do post processing! :)
 
Upvote 0

Forum statistics

Threads
1,213,568
Messages
6,114,348
Members
448,570
Latest member
rik81h

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top