Structure of parsing 3k html files

mos

New Member
Joined
May 8, 2011
Messages
15
Good morning,

I have about 3-4k html files, each looking like this
iRA1K.png


Code (could not use Code tag, got f ed up by the html code itself)

idix4s.png


I look for something like this:

iV1is.png



NAO, my question:

What methods do you suggest?

I had in mind
  • using import function and the re-order the tables (disadv: gives 1sheet per html file)
  • parse the html files as text, look for String "Company Name" and get following String bracket content somehoe

what else would be good?

will be a bigger task, at least for me, so I wanna check whats best.

thanks, cheers

mos
 

Excel Facts

Back into an answer in Excel
Use Data, What-If Analysis, Goal Seek to find the correct input cell value to reach a desired result
Use regular expressions. For instance, get Company Name:

Code:
Dim re As New RegExp
Dim mc As MatchCollection
re.Pattern = [HTML]"<tr><th>Company Name</th><td colspan=""2"">(\w*)<br />"[/HTML]
Set mc = re.Execute("[I]YOUR_HTML[/I]")
For Each m in mc
    MsgBox m.Value
Next
 
Last edited:
Upvote 0
awesome idea, thanks Sektor!
I think I got some problems with the "YOUR HTML" part.
Not sure if I get the html txt in the string here:

I also get a undefined type error in the "Dim RE as RegExp" line.
As well es in the version of Sektor as well in this pre-written version from
http://www.tmehta.com/regexp/add_code.htm

How do I declare and initialize regexp object correctly?

I find only these formulations use, but they dont seem to be recognized.

Code:
Sub fileloop()
    Dim MyDir As String
    Dim strPath As String
    Dim vaFileName As Variant
    Dim i As Integer
    Dim searchstring As String
        
    MyDir = ActiveWorkbook.Path ' current path
    strPath = MyDir & "\files" ' files subdir

    With Application.FileSearch
        .NewSearch
        .LookIn = strPath
        .SearchSubFolders = False
        .Filename = ".htm"

        If .Execute > 0 Then
     
               For i = 1 To .FoundFiles.Count
               searchstring = OpenTextFileToString2(.FoundFiles(i))
                                            
              'Company Name
              Cells(i + 1, 1) = RegExpFind(searchstring, [HTML]"<tr><th>Company Name</th><td colspan=""2"">(\w*)"[/HTML], True)
              
               Next i
 
            End If
    End With
End Sub

Code:
Function OpenTextFileToString2(ByVal strFile As String) As String
Dim hFile As Long
hFile = FreeFile
Open strFile For Input As #hFile
OpenTextFileToString2 = Input$(LOF(hFile), hFile)
Close #hFile
End Function
Code:
Function RegExpFind(FindIn, FindWhat As String, _
        Optional IgnoreCase As Boolean = False)
    Dim i As Long
    #If Not LateBind Then
    Dim RE As RegExp, allMatches As MatchCollection, aMatch As Match
    Set RE = New RegExp
    #Else
    Dim RE As Object, allMatches As Object, aMatch As Object
    Set RE = CreateObject("vbscript.regexp")
        #End If
    RE.Pattern = FindWhat
    RE.IgnoreCase = IgnoreCase
    RE.Global = True
    Set allMatches = RE.Execute(FindIn)
    ReDim rslt(0 To allMatches.Count - 1)
    For i = 0 To allMatches.Count - 1
        rslt(i) = allMatches(i).Value
        Next i
    RegExpFind = rslt
    End Function


Thanks for any advice!

cheers, mos
 
Upvote 0
crap. I forgot the Tools - Reference to RegExp.

Sorry.

Works now.


But result is:


HTML:
<tr><th>Company Name</th><td colspan="2">ABB

why is it giving back the html code, not only the \w* part?


Basically in the html code it is:


HTML:
<tr><th>Company Name</th><td colspan="2">ABB Bla2 Ltd. LLC-XYZ </tr> <whatever html code..>

I need ABB Bla2 Ltd. LLC-XYZ

So
HTML:
<tr><th>Company Name</th><td colspan="2">(\w|\s|\d)*</tr> <whatever html code..>

?

thanks a lot
 
Upvote 0
What happens if you open these files in Excel?
 
Upvote 0
Please, write HTML code of your company. From screen shot it can't be recognized.
 
Upvote 0
What happens if you open these files in Excel?

20110610140323.png


which is actually quite good yes. As I wrote in the first post I was considering importing them. Each file as one sheet and then reorganize to one sheet.
But this with 3-4k files, dont know...

As I look back now, maybe would have been easier.


Sektor: thanks for you pm, works almost perfect now!

html line is:
HTML:
<tr><th>Company Name</th><td colspan="2">ABB 123 Ltd. XYZ"<br />"

Code:
RegExpFind(searchstring, [HTML]"<tr><th>Company Name</th><td colspan=""2"">(.*)<br />"[/HTML], True)

gives me

HTML:
<tr><th>Company Name</th><td colspan="2">ABB 123 Ltd. XYZ<br />

How can I strip context away and just get
ABB 123 Ltd. XYZ


Code:
Function RegExpFind(FindIn, FindWhat As String, _
        Optional IgnoreCase As Boolean = False)
    Dim i As Long
    #If Not LateBind Then
    Dim RE As RegExp, allMatches As MatchCollection, aMatch As Match
    Set RE = New RegExp
    #Else
    Dim RE As Object, allMatches As Object, aMatch As Object
    Set RE = CreateObject("vbscript.regexp")
        #End If
    RE.Pattern = FindWhat
    RE.IgnoreCase = IgnoreCase
    RE.Global = True
    Set allMatches = RE.Execute(FindIn)
    ReDim rslt(0 To allMatches.Count - 1)
    For i = 0 To allMatches.Count - 1
        rslt(i) = allMatches(i).Value
        Next i
    RegExpFind = rslt
    End Function


thanks
 
Upvote 0
In
HTML:
<tr><th>Company Name</th><td colspan="2">ABB 123 Ltd. XYZ"
what is going after "XYZ"?
 
Upvote 0
HTML:
Sub FindCompanies()

    Dim re As New RegExp, mc As MatchCollection, m As Match
    Dim str As String
    
    str = "<tr><th>Company Name</th><td colspan=""2"">ABB 123 Ltd. XYZ</td>" & vbNewLine & _
          "<tr><th>Company Name</th><td colspan=""2"">CCC 456 Ltd. FFF</td>"
    
    With re
        .Global = True
        .Pattern = "<tr><th>Company Name</th><td colspan=""2"">(.*)<.*"
    End With
    
    For Each m In re.Execute(str)
        MsgBox m.SubMatches(0)
    Next
    
End Sub

BTW, I don't see rslt variable declaration.
 
Last edited:
Upvote 0
Use Replace rather than Execute and then just return the first submatch, ie (*.?) - which I made non greedy

Cheers

Dave


Code:
Function RegExpFind(FindIn, FindWhat As String)
    Dim RegExp
    Set RegExp = CreateObject("vbscript.regexp")
    With RegExp
        .Pattern = FindWhat
        .Global = True
    End With
    RegExpFind = RegExp.Replace(FindIn, "$1")
End Function
 
Last edited:
Upvote 0

Forum statistics

Threads
1,224,609
Messages
6,179,882
Members
452,948
Latest member
Dupuhini

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top