Error duplicating data when scraping from Web to Excel

randaubienghoc

New Member
Joined
Apr 25, 2020
Messages
15
Office Version
  1. 2016
Platform
  1. Windows
Hi guys,

I am looking for your kind help on my coding to pull data from Web to Excel.

Web to get data: https://eport.saigonnewport.com.vn/Pages/Common/Containers_new

  • Steps to get data:
Put "Cát Lái" to field "Khu vực giao nhận container" (choose seaport)

Put container number to field "Container"

Deselect "Chỉ vòng luân chuyển cuối" to show all rows in data table

Click Search to show data table - result for searching Container information

*****ISSUE: Data scraped from web to Excel for each row in Excel (respective to each container number found) seems to be the same with the previous result WHILE information for this container number can be blank. For example: Event time 2 "10/4/2020 3:07:00 PM" is repeated for container "TEMU3311320" while this container does not have Event time 2.

Hope you could give me any advices to resolve this duplicating issue. Attached Excel file for your references. Thanks. Link download Excel file example

VBA Code:
Sub PullDataFromWeb()
  Dim IE As Object, W As Excel.Worksheet
  Dim doc As HTMLDocument
  Dim lastRow As Integer, b As Boolean, tmp As String
  Dim lis, li
  Set W = ThisWorkbook.Sheets("Sheet1")
  Set IE = VBA.CreateObject("InternetExplorer.Application")
  IE.Visible = True   'hien cua so IE
  IE.navigate "https://eport.saigonnewport.com.vn/Pages/Common/Containers_new"
  Do While IE.Busy Or IE.readyState <> 4      'doi IE chay xong
    Application.Wait DateAdd("s", 1, Now)
  Loop
  Set doc = IE.document

  lastRow = W.Range("B" & W.UsedRange.Rows.Count + 2).End(xlUp).Row        'dong cuoi cung trong cot B container
  If lastRow < 2 Then GoTo Ends
  On Error Resume Next
  For intRow = 2 To lastRow     'tu dong toi dong
    b = False
    b = W.Range("I" & intRow).Value Like "[Yy]"
    If W.Range("B" & intRow).Value <> "" And Not b Then
      doc.getElementById("txtItemNo_I").Value = W.Range("B" & intRow).Value 'so cont
      doc.getElementById("cbSite_VI").Value = W.Range("A" & intRow).Value
      doc.getElementById("chkInYard_I").Checked = False
      doc.getElementById("ContentPlaceHolder2_btnSearch").Click 'click Search
      '----------------------------------------------
      Do While IE.Busy Or IE.readyState <> 4
        Application.Wait DateAdd("s", 1, Now)
      Loop
      '----------------------------------------------
      strFindContainer = doc.getElementById("ContentPlaceHolder2_lblNotice").innerText
      W.Range("H" & intRow) = strFindContainer
      If strFindContainer Like "T*m th*y * container*" Then
        strEventtime1 = doc.getElementById("grdContainer_DXDataRow0").Cells(0).innerText
        strEventtype1 = doc.getElementById("grdContainer_DXDataRow0").Cells(1).innerText
        strLocation1 = doc.getElementById("grdContainer_DXDataRow0").Cells(2).innerText
        strEventtime2 = doc.getElementById("grdContainer_DXDataRow1").Cells(0).innerText
        strEventtype2 = doc.getElementById("grdContainer_DXDataRow1").Cells(1).innerText
        W.Range("C" & intRow) _
          .Resize(, 5).Value = Array(strEventtime1, strEventtype1, strLocation1, _
                         strEventtime2, strEventtype2)
      End If
    End If
  Next
Ends:
  IE.Quit
  Set IE = Nothing    'Cleaning up
  Set objElement = Nothing
  Set objCollection = Nothing
  Application.StatusBar = ""
  Application.DisplayAlerts = True
End Sub
 

Excel Facts

Quick Sum
Select a range of cells. The total appears in bottom right of Excel screen. Right-click total to add Max, Min, Count, Average.

Dan_W

Active Member
Joined
Jul 11, 2018
Messages
435
Office Version
  1. 365
Platform
  1. Windows
I haven't downloaded the code, so I could be wrong, but it seems to me that from just look at the code that:

Missing container name
You're not getting the container name because your code doesn't appear to try and get it. You have variables for EventTime1, EventType1 and Location, and those appear to be sourced from the element on the website, but spread out across different cells EventType1 = cell(0), EventTime1 = cell(1), and so forth. You don't have a variable for the container name - for your reference, the container name information is located in the cell(4):
VBA Code:
strContainerName = doc.getElementById("grdContainer_DXDataRow0").Cells(4).innerText

You then need to rewrite this line to make sure the info is then written to the worksheet:

VBA Code:
W.Range("C" & intRow) _
          .Resize(, 5).Value = Array(strEventtime1, strEventtype1, strLocation1, _
                         strEventtime2, strEventtype2)

I don't know what other data is on the sheet, but I can see that Column H is populated by the notification text on how many containers were found. I would suggest commenting out the following line:
W.Range("H" & intRow) = strFindContainer
and instead put that information in strFiendContainer with the rest of the array:

VBA Code:
W.Range("C" & intRow) _
          .Resize(, 7).Value = Array(strContainerName, strEventtime1, strEventtype1, strLocation1, _
                         strEventtime2, strEventtype2, strFindContainer)

Duplicate Entries
Looking at your For ... Next loop, you don't seem to refresh the variables - meaning that you don't delete their values, so if they are not overwritten by the data obtained during the scrape, they will just continue to record the last thing that they scrape.

Does that sound like it might be right?
 

randaubienghoc

New Member
Joined
Apr 25, 2020
Messages
15
Office Version
  1. 2016
Platform
  1. Windows
I haven't downloaded the code, so I could be wrong, but it seems to me that from just look at the code that:

Missing container name
You're not getting the container name because your code doesn't appear to try and get it. You have variables for EventTime1, EventType1 and Location, and those appear to be sourced from the element on the website, but spread out across different cells EventType1 = cell(0), EventTime1 = cell(1), and so forth. You don't have a variable for the container name - for your reference, the container name information is located in the cell(4):
VBA Code:
strContainerName = doc.getElementById("grdContainer_DXDataRow0").Cells(4).innerText

You then need to rewrite this line to make sure the info is then written to the worksheet:

VBA Code:
W.Range("C" & intRow) _
          .Resize(, 5).Value = Array(strEventtime1, strEventtype1, strLocation1, _
                         strEventtime2, strEventtype2)

I don't know what other data is on the sheet, but I can see that Column H is populated by the notification text on how many containers were found. I would suggest commenting out the following line:
W.Range("H" & intRow) = strFindContainer
and instead put that information in strFiendContainer with the rest of the array:

VBA Code:
W.Range("C" & intRow) _
          .Resize(, 7).Value = Array(strContainerName, strEventtime1, strEventtype1, strLocation1, _
                         strEventtime2, strEventtype2, strFindContainer)

Duplicate Entries
Looking at your For ... Next loop, you don't seem to refresh the variables - meaning that you don't delete their values, so if they are not overwritten by the data obtained during the scrape, they will just continue to record the last thing that they scrape.

Does that sound like it might be right?
I updated the Excel file as below:
- For container name: I put container names in Excel and use these names to search.
- For duplicate Entries: Could you please show me how to refresh variables to avoid duplicate entries?
Thanks.
 

Dan_W

Active Member
Joined
Jul 11, 2018
Messages
435
Office Version
  1. 365
Platform
  1. Windows
Hi. .Give me a few minutes. I'll have a look now.
 

Dan_W

Active Member
Joined
Jul 11, 2018
Messages
435
Office Version
  1. 365
Platform
  1. Windows

ADVERTISEMENT

Ok. so I think this will work. Try adding the following five lines into your code:

strEventtime1 = ""
strEventtype1 = ""
strLocation1 = ""
strEventtime2 = ""

strEventtype2 = ""

here:

VBA Code:
 W.Range("C" & intRow) _
          .Resize(, 5).Value = Array(strEventtime1, strEventtype1, strLocation1, _
                         strEventtime2, strEventtype2)
      End If
    End If
    strEventtime1 = ""
    strEventtype1 = ""
    strLocation1 = ""
    strEventtime2 = ""
    strEventtype2 = ""
        
  Next
Ends:
 

randaubienghoc

New Member
Joined
Apr 25, 2020
Messages
15
Office Version
  1. 2016
Platform
  1. Windows
Ok. so I think this will work. Try adding the following five lines into your code:

strEventtime1 = ""
strEventtype1 = ""
strLocation1 = ""
strEventtime2 = ""

strEventtype2 = ""

here:

VBA Code:
 W.Range("C" & intRow) _
          .Resize(, 5).Value = Array(strEventtime1, strEventtype1, strLocation1, _
                         strEventtime2, strEventtype2)
      End If
    End If
    strEventtime1 = ""
    strEventtype1 = ""
    strLocation1 = ""
    strEventtime2 = ""
    strEventtype2 = ""
       
  Next
Ends:
Oh my god. It works eventually. Thanks so much. I have never thought it turns out to be such an easy method ^_^
 

Dan_W

Active Member
Joined
Jul 11, 2018
Messages
435
Office Version
  1. 365
Platform
  1. Windows
You're very welcome. You script works very nicely. I hope it works out well.
 

Watch MrExcel Video

Forum statistics

Threads
1,130,016
Messages
5,639,559
Members
417,099
Latest member
duhafnusa4

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Top