Convert text file to true utf-8 (not utf-16 BE BOM)

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
Hi, I have created an xml file that is based on data from Excel. However, the result is that it is in an ansi format. The program that uses this file requires the utf-8 format and I am currently able to emulate this by throwing it into Notepad++ to format it correctly. I would like to accomplish this automatically.

I best code I found in several places on the web was this:
VBA Code:
        Set fso = CreateObject("Scripting.FileSystemObject")
        Set stream = CreateObject("ADODB.Stream")
        stream.Open
        stream.Type = 2
        stream.Charset = "utf-8"
        stream.LoadFromFile strPath
        fso.OpenTextFile(strPath, 2, True, True).Write stream.ReadText
        stream.Close
Unfortunately, this results in the "utf-16 BE BOM" instead of the desired "utf-8".
My question is "How can I modify the above code to get the right output or is there a better method?"

Thank you for the use of your brains.
 

Excel Facts

Format cells as time
Select range and press Ctrl+Shift+2 to format cells as time. (Shift 2 is the @ sign).

rlv01

Well-known Member
Joined
May 16, 2017
Messages
1,160
Office Version
  1. 365
  2. 2010
Platform
  1. Windows
If you are sure that the input file will always be ANSI.

VBA Code:
    Dim strpathUTF8 As String, TextLine As String, BOM As String
    Dim LineNo As Long
    
    'strpathUTF8 is up to you to define
    strpathUTF8 = Left(strpath, InStrRev(strpath, ".") - 1) & "_UTF8" & Mid(strpath, InStrRev(strpath, "."), 10)
    
    BOM = Chr(239) & Chr(187) & Chr(191) 'UTF-8 byte order marker
    Open strpath For Input Access Read As #1
    Open strpathUTF8 For Output Access Write As #2
    
    Do While Not EOF(1)
        Line Input #1, TextLine
        LineNo = LineNo + 1
        
        If LineNo = 1 And Not (Left(TextLine, 3) = BOM) Then
            Print #2, BOM & TextLine
        Else
            Print #2, TextLine
        End If
    Loop
    Close #1
    Close #2
 

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
Thanks rlv01,
This is one step closer.
This is producing an "utf-8 BOM" file instead of straight "utf-8".

I have never worked with any xml files until this so I don't even know what the byte order marker represents. I do know that the file still does not display correctly with it.

I tried to rem this, but it resulted in an ansi version.
VBA Code:
'        If LineNo = 1 And Not (Left(TextLine, 3) = BOM) Then
'            Print #2, BOM & TextLine
'        Else
            Print #2, TextLine
'        End If

Any thoughts?
 

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
Here is an example of how the utf-8 BOM displays:
1637191251213.png


Here is an example of the straight utf-8.
1637191380708.png


I thought I would add this to show you how it actually looks.
 

rlv01

Well-known Member
Joined
May 16, 2017
Messages
1,160
Office Version
  1. 365
  2. 2010
Platform
  1. Windows

ADVERTISEMENT

Thanks rlv01,
This is one step closer.
This is producing an "utf-8 BOM" file instead of straight "utf-8".

I have never worked with any xml files until this so I don't even know what the byte order marker represents. I do know that the file still does not display correctly with it.

I tried to rem this, but it resulted in an ansi version.
VBA Code:
'        If LineNo = 1 And Not (Left(TextLine, 3) = BOM) Then
'            Print #2, BOM & TextLine
'        Else
            Print #2, TextLine
'        End If

Any thoughts?
The strict definition of UTF-8 includes the BOM (Byte Order Marker) at the beginning of the file, which is what my posted code adds. It is standard marker so that a program reading it can determine what flavor of Unicode it is. So you should be good to go with whatever application requires UTF-8, since if they are making a big deal out of it then they MUST support the BOM. A less strict definition has it that a file can be UTF-8 without a BOM, but then it could also just be an ANSI file. I'm actually surprised your program won't simply read the excel export file directly.
 

rlv01

Well-known Member
Joined
May 16, 2017
Messages
1,160
Office Version
  1. 365
  2. 2010
Platform
  1. Windows
Here is an example of how the utf-8 BOM displays:
View attachment 51526

Here is an example of the straight utf-8.
View attachment 51527

I thought I would add this to show you how it actually looks.
Yes and no. That those two files display differently may have more to do with which ANSI Code Page (ACP) your windows PC has loaded as part of the regional setup you are configured for. A non UTF-8 ANSI file on my Windows 1252 encoded PC would produce the same before and after results because the extended ascii character set encoding is different.
 

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows

ADVERTISEMENT

In principle I agree with what you said, and really do not know why they created it the way they did, but the fact remains that the file will display correctly with the basic utf-8 and will not display correctly with utf-8 BOM. In my original post I mentioned that I have been using npp to convert the file. In the menu for npp it has these options.
1637246811377.png

Only this option will successfully format my output to display correctly in the other program. As I said earlier, I do not understand the mechanics of utf, I only know what works.

I do appreciate the help, and you have already taught me more about it than I knew before.
 

rlv01

Well-known Member
Joined
May 16, 2017
Messages
1,160
Office Version
  1. 365
  2. 2010
Platform
  1. Windows
So the issue is not that your other program won't execute using an ANSI file, but that it won't display the data correctly? If you open your raw file (before NPP conversion) does NPP tell you what the encoding is? This should be on the right side of the status bar in npp. When you say that created an xml file that is 'based on' data from Excel, does that mean you are using VBA to create the file automatically? And if so, can you post the lines of code that are doing that save? Lastly, what is the installed ACP for your PC? Run this code to see:

VBA Code:
Sub GetACP()
    MsgBox "ACP (ANSI Code Page) setting on this PC is " & CreateObject("WScript.Shell").RegRead("HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP")
End Sub
 

MPW

Well-known Member
Joined
Oct 7, 2009
Messages
550
Office Version
  1. 365
Platform
  1. Windows
The encoding in npp is ansi before I convert it to utf-8.
The ACP is 1252.

To answer how my file is created is another matter.

The dev team created program(a) to record languages. The script for program(a) is generated from program(b). Program(b) will generate an xml or an xlsx. The dev team designed program(a) to work with the xml, however, it does not allow the team to manipulate the text so we use the xlsx for this purpose. I have used the xml as a template to emulate the needed lines from xlsx to xml. The only thing left is to encode it correctly.

For example this:
1637259628192.png


Becomes this: (The left side is the original xml. The right is created from excel.)
1637259675496.png


Here is how it looks in program(a):
1637259900466.png


In other words, I am not saving an Excel sheet as an xml, I am writing a txt doc based on information in an Excel sheet and then converting the txt doc into an xml.

I hope this helps. If this doesn't answer all of your questions, let me know.
 

Attachments

  • 1637258830505.png
    1637258830505.png
    24.3 KB · Views: 3

rlv01

Well-known Member
Joined
May 16, 2017
Messages
1,160
Office Version
  1. 365
  2. 2010
Platform
  1. Windows
Word has a much better set of tools for re-encoding files, so an expedient method might be to have Excel save the data file in Unicode format then use a macro like this to launch word to do the conversion:
VBA Code:
'Excel VBA, late binding for Word App.
Sub UnicodeToUTF8()
    Const wdFormatEncodedText = 7
    Const msoEncodingUnicodeLittleEndian = 1200
    Const msoEncodingUTF8 = 65001
    
    Dim WdApp As Object, WdDoc As Object
    Dim FileName As String, UTF8_FileName As String
    
    Set WdApp = CreateObject("word.Application")
    WdApp.Visible = True
    
    FileName = "D:\MyPath\MyFile.txt"
    UTF8_FileName = "D:\MyPath\MyFile_UTF8.txt"
    
    Set WdDoc = WdApp.Documents.Open(FileName:=FileName, ReadOnly:=True, Encoding:=msoEncodingUnicodeLittleEndian) 'Open the Unicode text file (UTF-16 LE)
    WdDoc.SaveAs FileName:=UTF8_FileName, FileFormat:=wdFormatEncodedText, Encoding:=msoEncodingUTF8 'Re-encode and save to new UTF-8 file
    WdDoc.Close False
    WdApp.Quit False
End Sub

Whether the UTF-8 format is acceptable is for you to judge.
 

Forum statistics

Threads
1,148,390
Messages
5,746,431
Members
424,015
Latest member
kyites

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Top