Parsing strings from HTML code and writing to cells via VBA code

dobri69

New Member
Joined
Oct 15, 2012
Messages
2
Hi All,
I'm new to this forum but found it very useful whenever I've had a problem with Excel or VBA. But get to the point. I'am currently working on a macro that will enable me to parse few strings from an online HTML source (ie. webpage) and write these values to an excel using VBA. So far I've come up with the code presented below, however there are few things:
1. I don't know how to tell excel to write each found string to one cell and then move to the next cell to write the next string. (it is a loop within a loop within a loop)
2. The data I'm interested in occurs several times within HTML code always between two another strings (I've managed to build a very naive loop to find all (or at least most) of the data occurences I'm interested in and show me each occurence in a message box). I'd like to put these values in excel's cells rather than see them in msgbox.
3. Maybe You know some better, more robust, way to find every occurence of a string within a string and paste each time it is found to a different cell?

The code is as follows:
Code:
Sub GatherData()

'Gather data from HTML code
'References to enable: Microsoft Internet Controls, OLE Automation, Microsoft Excel 'apppropriate ver.' Object Library


    Dim IE As Object
    Dim x As String
    Dim z As String
    Dim lTemp As Long
    Dim xtemp As String
         
    Set IE = CreateObject("InternetExplorer.Application") 'Load IE
    IE.Visible = True 'Like to see how everything happens, as well might be turned to False
        
    For Mystr = 2 To 3 'To loop through subpages
        
        IE.navigate "http://www.webpage.com/search?page=" & Mystr
    
        ' Let IE load everything on the page
        Do Until IE.readyState = READYSTATE_COMPLETE
        DoEvents
        Loop
        
        z = "tr id=" 'the string after which the interesting part of data is within HTML code
        x = IE.Document.Body.InnerHTML 'Use HTML code as a string
        k = InStr(x, z) 'The position of the first occurence of "tr id=" string within the HTML code
        w = Len(x) 'Length of HTML code (string)
        
            'There are many different occurences of the data i'm looking for,
            'Below is a very simplistic approach to extract most(due to the fixed step - I don't know how to apply variable(?))
            'of the occurences. The idea is to find the position of the first ocurence of the data ('k' above)
            'and then setting a constant distance to the next occurence (in terms of number of signs).
            'This number is subtracted for each iteration
            '(to shorten the string, i.e. exclude the first occurence to find the next, 'xtemp' below).
            'The GetBetween function is meant to extract text between "tr id=" and "class" string from the xtemp string.
            'As for now the data I'm interested in pops-up in a message box, but I'd like to write each occurence to excel cell,
            'then move down one cell and write the next occurence
            '- all the way down until all occurences on all subpages are in one column.
                        
            For y = k To w Step 3000
                
            lTemp = w - y
            xtemp = Right(x, lTemp)
            occur = GetBetween(xtemp, "tr id=", "class")
            MsgBox occur


            Next y
            
    Next Mystr
    
End Sub
 

Excel Facts

Square and cube roots
The =SQRT(25) is a square root. For a cube root, use =125^(1/3). For a fourth root, use =625^(1/4).
Check out Split.
If you were to:
Code:
zz=Split(x,"tr id=")
it will (if said 'tr id=' exists) result in a zero-based array of strings, between (but not including) the occurrences of 'tr id='.
If you then:
Code:
For n = 1 to ubound(zz)
tt = Split(zz(n),"class")(0)
'tt will now contain the string between the nth occurrence of 'tr id=' and the first occurrence of 'class' after it.
msgbox tt
activesheet.cells(rows.count,"A").end(xlup).offset(1).value = tt' will put it into the cell after the last entry in column A
next n
 
Last edited:
Upvote 0
OK, there are quite a few points here to consider.

Here's a bit of sample html which will make things easier to explain

http://pastebin.com/raw.php?i=FLw9cMFb


The kneejerk response to your problems is to suggest that you use regex to (string pattern matching) to extract the data you are looking for, so something like:

Code:
Option Explicit
Sub ParseHTML()
Dim coll As Collection
Dim o


    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "http://pastebin.com/raw.php?i=FLw9cMFb", False
        .send
        Set coll = GetBetween(.responseText)
    End With
    
    For Each o In coll
        MsgBox o
    Next o
End Sub


Function GetBetween(stri As String) As Collection
    Dim d As Object, match
    Set GetBetween = New Collection
    With CreateObject("vbscript.regexp")
        .Pattern = "tr id=(.*?)class"
        .Global = True
        .ignorecase = True
        Set d = .Execute(stri)
        For Each match In d
            GetBetween.Add match.submatches(0)
        Next match
    End With
    
End Function

This will return all the html between "id=" and "class" in a collection (so it only needs running once on the page). However using regex successfully on html is like trying to eat soup with a fork, it's too much of a blunt instrument to deal with the oddities and poor coding habits of HTML. You're generally also looking for something specific so even after you've got the match, you'd still need to format the string to present it how you would like.

My advice would therefore always be to use a library to build the code into an object (unless you are doing something absurdly simple. So if all we were interested in from the above html are the ids of the rows, consider:
Code:
Sub ParseHTML2()
Dim htm As Object: Set htm = CreateObject("htmlfile")
Dim tr As Object
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "http://pastebin.com/raw.php?i=FLw9cMFb", False
        .send
        htm.body.innerhtml = .responseText
    End With


    For Each tr In htm.getElementsByTagName("tr")
        if len(tr.ID)>0 then MsgBox tr.ID
    Next tr
End Sub

This way we can access all the properties and loop through tables easily, for example:
Code:
Sub ParseHTML2()


Dim htm As Object: Set htm = CreateObject("htmlfile")
Dim tr As Object
Dim td As Object
Dim x As Long
Dim y As Long


    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "http://pastebin.com/raw.php?i=FLw9cMFb", False
        .send
        htm.body.innerhtml = .responseText
    End With


    With htm.getElementsByTagName("table")(0)
        For Each tr In .Rows
            For Each td In tr.Cells
                With Sheets(3).Range("a1")
                    .Offset(x, y).Value = td.innerText
                End With
                y = y + 1
            Next td
            y = 0
            x = x + 1
        Next tr
    End With
    
End Sub
 
Last edited:
Upvote 0
Thank You Both for such a quick answer. The first solution works like a charm, in case of the second it looks quite complicated :) However, this code was just the first step in what I'm trying to achieve, therefore I'll take a closer look (if only I will be able to understand everything) on the second solution as it might be quite useful for the second step.
 
Upvote 0
Hey,
i was searching a long time for such a solution. So I hope it's not a problem that I reactivate those posts.


I know how HTML works but I have difficulties to understand how I have to change the Code to make it work for me.


1. I have several tables on one website and I only need some values which one can find in the table. I started to fill an arrray. Then I took those values out which I need. I think there will excists a beautiful way where I don't have so much waste. Can I adress some values of the tabel directly?


Code:
 With htm.getElementsByTagName("table")(1)   
            For Each tr In .Rows
                For Each td In tr.Cells
                    'test(x, y) = td(row,col).innerText 'something like this
                    table1(y + 1, x + 1) = td.innerText
                    y = y + 1
                Next td
                y = 0
                x = x + 1
            Next tr
        End With
        'then I woul use some values and transform them if neede 
	' only as an example, made via debug.pring
                Debug.Print table1(2, 3) 
                Debug.Print table1(2, 4) 
                Debug.Print table1(2, 4) - table1(2, 3) Spread


2. I do not only have tables, I also have some div container where I need to extract data.


Code:
<article class="STAMMDATEN"><h1 class="LISTE_STERN">Stammdaten zu Wacker Chemie</h1><div><dl><dt>WKN</dt><dd>WCH888</dd><dt>ISIN</dt><dd itemprop="productID">DE000WCH8881</dd><dt>Symbol</dt><dd>WCH</dd></dl><dl><dt>Land</dt><dd><span class="FLAGGE ICON icon-de"></span>Deutschland</dd><dt>Branche</dt><dd>Spezialchemie</dd><dt>Sektor</dt><dd title="Chemie / Pharma / Gesundheit">Chemie / Pharma ...</dd></dl><dl><dt>Typ</dt><dd>Inhaberaktie</dd><dt>Nennwert</dt><dd>ohne Nennwert</dd><dt>Unternehmen</dt><dd>WACKER CHEMIE AG</dd></dl></div></article>
Code:
I found somethin like: 


[code]'With htm


'.getElementById("USER").setAttribute "value", "myuser"
'.getElementById("PASSWORD").setAttribute "value", "mypass"
'.getElementById("Button").Click


'End With


So instead of Tag I would use ID but as you can see in the HTML Doc there are different layer in the div container. The easiest would be if the e.g. function would return only the necessary values.


Help is appreciated and I'll do as much as
 
Upvote 0

Forum statistics

Threads
1,215,695
Messages
6,126,262
Members
449,307
Latest member
Andile

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top