Convert text file to true utf-8 (not utf-16 BE BOM)

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
Hi, I have created an xml file that is based on data from Excel. However, the result is that it is in an ansi format. The program that uses this file requires the utf-8 format and I am currently able to emulate this by throwing it into Notepad++ to format it correctly. I would like to accomplish this automatically.

I best code I found in several places on the web was this:
VBA Code:
        Set fso = CreateObject("Scripting.FileSystemObject")
        Set stream = CreateObject("ADODB.Stream")
        stream.Open
        stream.Type = 2
        stream.Charset = "utf-8"
        stream.LoadFromFile strPath
        fso.OpenTextFile(strPath, 2, True, True).Write stream.ReadText
        stream.Close
Unfortunately, this results in the "utf-16 BE BOM" instead of the desired "utf-8".
My question is "How can I modify the above code to get the right output or is there a better method?"

Thank you for the use of your brains.
 

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
Ok so I had not thought of using Word for this. Did have a problem with the Portuguese unicode characters. I had this.
1637272631279.png

Portions like "Esta é a " or "de Abraão." bring up this error. Is there a way to have Word ignore these characters while loading the Text?
For reference, this is the line that it hit the error on.
VBA Code:
    Set WdDoc = WdApp.Documents.Open(FileName:=FileName, ReadOnly:=True, Encoding:=msoEncodingUnicodeLittleEndian) 'Open the Unicode text file (UTF-16 LE)

Sorry for the extra innings.
 

Excel Facts

Why are there 1,048,576 rows in Excel?
The Excel team increased the size of the grid in 2007. There are 2^20 rows and 2^14 columns for a total of 17 billion cells.

rlv01

Well-known Member
Joined
May 16, 2017
Messages
1,160
Office Version
  1. 365
  2. 2010
Platform
  1. Windows
VBA Code:
    FileName = "D:\MyPath\MyFile.txt"
    UTF8_FileName = "D:\MyPath\MyFile_UTF8.xml"

Make sure that the file extension for FileName is .txt and NOT .xml. That apparently triggers some ill-advised effort by word to try to understand the xml structure. UTF8_FileName can be xml.

Also I ran across another potential method that does not involve word

VBA Code:
  ActiveWorkbook.SaveAs FileName:=UTF8_FileName, FileFormat:=xlCSVUTF8

It worked pretty well on my test text:

Quando Abrão tinha noventa e nove anos, o Senhor apareceu a ele e disse: Eu sou o Deus Todo-Poderoso; anda diante de mim e sê irrepreensível. ... Você não será mais chamado de Abrão; seu nome será Abraão, porque eu tenho fiz de você um pai de muitas nações.
 

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
Well you were right that I did not have an .txt extension. However, even after I changed it, the routine still failed in the same manner.

As to your alternate suggestion, This would only work for a workbook, not a text file, correct?
 

rlv01

Well-known Member
Joined
May 16, 2017
Messages
1,160
Office Version
  1. 365
  2. 2010
Platform
  1. Windows
I don't know why you are getting that error. I've tested the method on a couple of different test cases, including files using Chinese character sets. It works for me with no errors and good conversion results. The output UTF-8 correctly encodes the special characters. You might want to consider shifting over to a programming language that has better support for re-encoding to create a custom conversion routine.
 

Domenic

MrExcel MVP
Joined
Mar 10, 2004
Messages
19,829
Office Version
  1. 365
Platform
  1. Windows

ADVERTISEMENT

Here's a solution that should convert your file into UTF-8 without the BOM. It too uses the ADODB Stream, but without the use of the FileSystemObject. It simply loads the file onto a stream as UTF-8 with the BOM, moves the current position within the stream to the point after the BOM, copies this stream (starting from the new position after the BOM) to another stream, and then saves it. Note, as in your original code, it overwrites the original file. If this is not what you want, simply change the path and filename assigned to .SaveToFile accordingly.

VBA Code:
Option Explicit

Sub SaveFileAsUTF8NoBOM()

    Dim oStreamUTF8         As Object
    Dim oStreamUTF8NoBOM    As Object
    Dim strPath             As String
   
    strPath = "c:\users\domenic\desktop\book1.txt" 'change the path and filename accordigly
   
    Set oStreamUTF8 = CreateObject("ADODB.Stream")
    With oStreamUTF8
        .Charset = "UTF-8"
        .Type = 2 'adTypeText
        .Open
        .LoadFromFile strPath
        .Position = 3 'skip BOM (EF BB BF)
    End With
   
    Set oStreamUTF8NoBOM = CreateObject("ADODB.Stream")
    With oStreamUTF8NoBOM
        .Type = 1 'adTypeBinary
        .Open
        oStreamUTF8.CopyTo oStreamUTF8NoBOM
        .SaveToFile strPath, 2 'adSaveCreateOverWrite
    End With
   
    oStreamUTF8.Close
    oStreamUTF8NoBOM.Close
   
End Sub

Hope this helps!
 
Last edited:

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
Thanks rlv01, I will keep hacking at that method

Domenic, I did try your method. It had the runtime error:
1637357246816.png

It happens on this line:
VBA Code:
.SaveToFile strPath, adSaveCreateOverWrite

I am sure it is something simple, Just having Friday brain right now.
 

Domenic

MrExcel MVP
Joined
Mar 10, 2004
Messages
19,829
Office Version
  1. 365
Platform
  1. Windows

ADVERTISEMENT

Sorry, I forgot to change that particular constant to it's value, since we're using late binding. So it should be . . .

VBA Code:
.SaveToFile strPath, 2 'adSaveCreateOverWrite

I'll amend my post accordingly.

Does this help?
 

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
Thanks, that fixed the debug.
However, the file is still showing up in npp as ansi.
1637358105669.png


I almost wish I could dig into the npp to see how it handles this. It is instant and seamless to convert the file. I am sure once I find the answer it will be something really simple. Thanks for the input, no pun intended.
 

Domenic

MrExcel MVP
Joined
Mar 10, 2004
Messages
19,829
Office Version
  1. 365
Platform
  1. Windows
If you open it up in Notepad or Notepad++, does it show up as UTF-8 ?
 

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
When I open it in notepad++ it still come up as ansi, not utf-8.
 

Forum statistics

Threads
1,148,422
Messages
5,746,587
Members
424,032
Latest member
pochie2741

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Top