Read unicode from html file by VBA

morningstar1001

New Member
Joined
Jul 23, 2018
Messages
10
Hi everyone,

I need to extract the description text from a lot of html files. So I use this code to open and read the text from the html files.

Code:
myFile = myPath & Filename
    
    Open myFile For Input As [URL=https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=1]#1[/URL] 
    Do Until EOF(1)
        Line Input [URL=https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=1]#1[/URL] , textline
        posContent = InStr(1, textline, "title-detail")
        If posContent > 0 Then
            Sheets("data").Cells(j, 3) = textline
        End If
        
    Loop
    Close [URL=https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=1]#1[/URL]

It worked, but the description text which I need to collect is not UTF-8 and become "Nội thất văn phòng"

So is there any other way to extract correct data?

Thanks.
 

Excel Facts

Why are there 1,048,576 rows in Excel?
The Excel team increased the size of the grid in 2007. There are 2^20 rows and 2^14 columns for a total of 17 billion cells.
Code:
Function ExtractTitleDetail(strHtmlFilePath As String) As String
  Const TristateTrue = -1 ' Opens the file as Unicode
  Dim strLineOfText As String
  Dim objFileSys As Object
  Dim objStream As Object
  
  On Error GoTo ErrHandler
  Set objFileSys = CreateObject("Scripting.FileSystemObject")
  Set objStream = objFileSys.OpenTextFile(strHtmlFilePath, Format:=TristateTrue)
  
  Do Until objStream.AtEndOfStream
    strLineOfText = objStream.ReadLine()
    If InStr(1, strLineOfText, "title-detail", vbTextCompare) > 0 Then
      ExtractTitleDetail = strLineOfText
      GoTo ExitProc
    End If
  Loop
  
ExitProc:
  On Error Resume Next
  objStream.Close
  Set objStream = Nothing
  Set objFileSys = Nothing
  Exit Function
  
ErrHandler:
  ExtractTitleDetail = vbNullString
  Resume ExitProc
End Function
 
Upvote 0
Thanks for your help.
I tried with your code but the strLineOfText = "?????????".

However, I already found another solution using ADODB.Stream which work perfectly. Here the code just in case someone may need it.

Code:
 Dim adoStream As ADODB.Stream
    Dim var_String As Variant

    Set adoStream = New ADODB.Stream
    adoStream.Charset = "UTF-8"
    adoStream.Open
    adoStream.LoadFromFile myFile 'change this to point to your text file
  
    var_String = Split(adoStream.ReadText, vbCrLf) 'split entire file into array - lines delimited by CRLF
    
    For k = LBound(var_String) To UBound(var_String)
        textline = var_String(k)
        posContent = InStr(1, textline, "title-detail")
        If posContent > 0 Then
            Sheets("data2").Cells(j, 3) = textline
        End If
    Next k
 
Upvote 0
Thanks for your help.
I tried with your code but the strLineOfText = "?????????".

However, I already found another solution using ADODB.Stream which work perfectly. Here the code just in case someone may need it.

Code:
 Dim adoStream As ADODB.Stream
    Dim var_String As Variant

    Set adoStream = New ADODB.Stream
    adoStream.Charset = "UTF-8"
    adoStream.Open
    adoStream.LoadFromFile myFile 'change this to point to your text file
  
    var_String = Split(adoStream.ReadText, vbCrLf) 'split entire file into array - lines delimited by CRLF
    
    For k = LBound(var_String) To UBound(var_String)
        textline = var_String(k)
        posContent = InStr(1, textline, "title-detail")
        If posContent > 0 Then
            Sheets("data2").Cells(j, 3) = textline
        End If
    Next k

You said mine didn't work, but you stated in the title that the file is encoded as unicode rather than utf-8. If that was indeed the case, mine works fine.
 
Upvote 0

Forum statistics

Threads
1,214,996
Messages
6,122,636
Members
449,092
Latest member
bsb1122

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top