VBS? Maybe? Website Text Rip

joe.afusco

Board Regular
Joined
Jun 25, 2010
Messages
80
Now, I know this is typically a forum for VBA, but I also know there are many extremely knowledgeable people here, which I can't seem to find on any VBS forum. So, I apologize if this is too far out context.

I am attempting to rip some text from a webpage, eventually bringing the data into excel for manipulation. Here is a working VBS script that completes that task:

Code:
'sub to wait for page loading
Sub WaitForLoad(obj)


Do While ie.Busy: Loop
Do While obj.readyState <> 4:  Loop
wscript.sleep(100)
End Sub


'create internet explorer object
Set ie = WScript.CreateObject("InternetExplorer.Application")


'setup ie properties
ie.ToolBar = 1
ie.StatusBar = 1
ie.Width = 999
ie.Height = 999
ie.Left = 0
ie.Top = 0
ie.Visible = 0


ie.Navigate("http://en.wikipedia.org/wiki/Rdio")


WaitForLoad(ie)

websiteText = ie.Document.Body.innerText

Set fso = CreateObject("Scripting.FileSystemObject")
'OpenTextFile Parameters:
'-Filename
'-The 2 is for writing... 1 is reading and 8 is appending
'-The "True" is to create if not already there.
Set fl = fso.OpenTextFile("C:\temp\textFile.txt", 2, True) 
fl.Write(websiteText)
fl.Close : Set fl = Nothing
Set fso = Nothing 
set WshShell = nothing
set http = nothing

The issue is, when I attempt it on a larger webpage, it fails, for what I assume to be too much text to handle. The website I am trying to rip the text off of is sometimes very large, so I am looking for a solution that isn't limited. I'm sure someone out that has had to accomplish this before.

The reason I am using VBS, is because I know VBS can access elements of a webpage, input fields, buttons, etc. Not sure if VBA can do that. I'm open to a VBA solution if it can accomplish logging into a webpage, stripping the data, and jamming it in Excel.

Any thoughts? Thanks in advance for any help.
 

Excel Facts

How can you turn a range sideways?
Copy the range. Select a blank cell. Right-click, Paste Special, then choose Transpose.
Hi,
Try this version of WaitForLoad code:
Rich (BB code):
Sub WaitForLoad(obj)
  Do While obj.Busy: wscript.sleep (100): Loop
  Do While obj.readyState <> 4: wscript.sleep (100): Loop
  Do While obj.Document Is Nothing: wscript.sleep (100): Loop
End Sub
 
Upvote 0
Still get the same error on larger sites:

Line: 49
Char: 1
Error: Invalid procedure call or argument
Code: 800A0005

Source: Microsoft VBScript runtime error

Line 49 is the write command.
 
Upvote 0
VBA alternative without IE:
Rich (BB code):
FunctionUrl2File(Url$, PathName$, Optional Login$,Optional Password$, OptionalParseTxt As Boolean)As Boolean
'ZVI:2012-07-13 http://www.mrexcel.com/forum/showthread.php?646573
  Dim FN%, txt$
  OnError GoTo exit_
  If Len(Dir(PathName))Then Kill PathName
  WithCreateObject("MSXML2.XMLHTTP")
    .Open "GET", Url,False, Login, Password
    .send
    If .Status <>200 Then ExitFunction
    FN = FreeFile
    Open PathNameFor BinaryAccess WriteAs #FN
    If ParseTxtThen
      txt = .responseText
      WithCreateObject("HTMLFile")
        .Body.innerHTML = txt
        txt = .Body.innerText
        Put #FN, , txt
      EndWith
    Else
      Put #FN, , .responseBody
    EndIf
exit_:
    If FNThen Close #FN
    Url2File = .Status = 200
  EndWith
EndFunction
<o:p> </o:p>
SubTest_Url2File()
  Const Url$ ="http://en.wikipedia.org/wiki/Rdio"
  Const FILENAME$ ="c:\Temp\textFile.txt"
  Dim ret
  ret = Url2File(Url, FILENAME, ParseTxt:=True)
  If Url2File(Url,FILENAME, ParseTxt:=True) Then
    MsgBox "Downloaded file:" &vbLf & FILENAME, vbInformation, "Success"
  Else
    MsgBox "Can't download from"& vbLf & Url, vbCritical, "Error"
  EndIf
[FONT=&quot]End Sub
[/FONT]
 
Last edited:
Upvote 0
Still get the same error on larger sites:

Line: 49
Char: 1
Error: Invalid procedure call or argument
Code: 800A0005

Source: Microsoft VBScript runtime error

Line 49 is the write command.
Seems something is wrong in your code.
Try deleting manually C:\temp\textFile.txt and then run this script without any modifications:
Rich (BB code):
' VBSript file: test.vbs
Call Main
 
Sub Main()
  Dim ie, fso, fl, websiteText
  'create internet explorer object
  Set ie = WScript.CreateObject("InternetExplorer.Application")
 
  'setup ie properties
  ie.Toolbar = 1
  ie.StatusBar = 1
  ie.Width = 999
  ie.Height = 999
  ie.Left = 0
  ie.Top = 0
  ie.Visible = 0
 
  ie.Navigate ("http://en.wikipedia.org/wiki/Rdio")
 
  Call WaitForLoad(ie)
 
  websiteText = ie.Document.Body.innerText
 
  Set fso = CreateObject("Scripting.FileSystemObject")
  'OpenTextFile Parameters:
  '-Filename
  '-The 2 is for writing... 1 is reading and 8 is appending
  '-The "True" is to create if not already there.
  Set fl = fso.OpenTextFile("C:\temp\textFile.txt", 2, True)
  fl.Write (websiteText)
  fl.Close: Set fl = Nothing
  Set fl = Nothing
  Set fso = Nothing
  Set ie = Nothing
  Set WshShell = Nothing
 
End Sub
 
'sub to wait for page loading
Sub WaitForLoad(obj)
  Do While obj.Busy: WScript.sleep (100): Loop
  Do While obj.readyState <> 4: WScript.sleep (100): Loop
  Do While obj.Document Is Nothing: WScript.sleep (100): Loop
End Sub
 
Last edited:
Upvote 0
VBA alternative without IE:
Here is reposting because the code of post#5 was badly formatted:
Rich (BB code):
Function Url2File(Url$, PathName$, Optional Login$, Optional Password$, Optional ParseTxt As Boolean) As Boolean
' ZVI:2012-07-13 http://www.mrexcel.com/forum/showthread.php?646573
  Dim FN%, txt$, x
  On Error GoTo exit_
  If Len(Dir(PathName)) Then Kill PathName
  With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", Url, False, Login, Password
    .send
    If .Status <> 200 Then Exit Function
    FN = FreeFile
    Open PathName For Binary Access Write As #FN
    If ParseTxt Then
      txt = .responseText
      With CreateObject("HTMLFile")
        .Body.innerHTML = txt
        txt = .Body.innerText
        Put #FN, , txt
      End With
    Else
      Put #FN, , .responseBody
    End If
exit_:
    If FN Then Close #FN
    Url2File = .Status = 200
  End With
End Function
 
Sub Test_Url2File()
  Const Url$ = "http://en.wikipedia.org/wiki/Rdio"
  Const FILENAME$ = "c:\Temp\textFile.txt"
  Dim ret
  ret = Url2File(Url, FILENAME, ParseTxt:=True)
  If Url2File(Url, FILENAME, ParseTxt:=True) Then
    MsgBox "Downloaded file:" & vbLf & FILENAME, vbInformation, "Success"
  Else
    MsgBox "Can't download from" & vbLf & Url, vbCritical, "Error"
  End If
End Sub
 
Upvote 0
Seems something is wrong in your code.
Try deleting manually C:\temp\textFile.txt and then run this script without any modifications:
ie.quit is added now
Rich (BB code):
' VBSript file: test.vbs
Call Main
 
Sub Main()
 
  Dim ie, fso, fl, websiteText
 
  'create internet explorer object
  Set ie = WScript.CreateObject("InternetExplorer.Application")
 
  'setup ie properties
  ie.Toolbar = 1
  ie.StatusBar = 1
  ie.Width = 999
  ie.Height = 999
  ie.Left = 0
  ie.Top = 0
  ie.Visible = 0
 
  ie.Navigate ("http://en.wikipedia.org/wiki/Rdio")
 
  Call WaitForLoad(ie)
 
  websiteText = ie.Document.Body.innerText
 
  Set fso = CreateObject("Scripting.FileSystemObject")
  'OpenTextFile Parameters:
  '-Filename
  '-The 2 is for writing... 1 is reading and 8 is appending
  '-The "True" is to create if not already there.
  Set fl = fso.OpenTextFile("C:\temp\textFile.txt", 2, True)
  fl.Write (websiteText)
  fl.Close
  Set fl = Nothing
  Set fso = Nothing
  ie.Quit
  Set ie = Nothing
 
End Sub
 
'sub to wait for page loading
Sub WaitForLoad(obj)
  Do While obj.Busy: WScript.sleep (100): Loop
  Do While obj.readyState <> 4: WScript.sleep (100): Loop
  Do While obj.Document Is Nothing: WScript.sleep (100): Loop
End Sub
 
Upvote 0
Here is reposting because the code of post#5 was badly formatted:
Rich (BB code):
Function Url2File(Url$, PathName$, Optional Login$, Optional Password$, Optional ParseTxt As Boolean) As Boolean
' ZVI:2012-07-13 http://www.mrexcel.com/forum/showthread.php?646573
  Dim FN%, txt$, x
  On Error GoTo exit_
  If Len(Dir(PathName)) Then Kill PathName
  With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", Url, False, Login, Password
    .send
    If .Status <> 200 Then Exit Function
    FN = FreeFile
    Open PathName For Binary Access Write As #FN
    If ParseTxt Then
      txt = .responseText
      With CreateObject("HTMLFile")
        .Body.innerHTML = txt
        txt = .Body.innerText
        Put #FN, , txt
      End With
    Else
      Put #FN, , .responseBody
    End If
exit_:
    If FN Then Close #FN
    Url2File = .Status = 200
  End With
End Function
 
Sub Test_Url2File()
  Const Url$ = "http://en.wikipedia.org/wiki/Rdio"
  Const FILENAME$ = "c:\Temp\textFile.txt"
  Dim ret
  ret = Url2File(Url, FILENAME, ParseTxt:=True)
  If Url2File(Url, FILENAME, ParseTxt:=True) Then
    MsgBox "Downloaded file:" & vbLf & FILENAME, vbInformation, "Success"
  Else
    MsgBox "Can't download from" & vbLf & Url, vbCritical, "Error"
  End If
End Sub

You're calling Url2File twice, re-writing the text file :)
 
Upvote 0
You're calling Url2File twice, re-writing the text file :)
Yea, thank you! :)
These debugging lines should be removed:
Dim ret
ret = Url2File(Url, FILENAME, ParseTxt:=True)
 
Upvote 0

Forum statistics

Threads
1,223,098
Messages
6,170,106
Members
452,302
Latest member
TaMere

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top