Managing Data From Text File

Clanrat

New Member
Joined
Jun 20, 2011
Messages
2
Hello everyone!


I'm currently trying to make an excel spreadsheet of around 19 000 chemical compounds with their name, formula, mass and database links (CAS number etc.). All the info I need is taken from a text file, which was generated by a script. The text file in question looks like this:

Code:
    ENTRY:    C00001    Compound    NAME:    H2O;    Water    FORMULA:    H2O    EXACT_MASS:    18.0106    DBLINKS:    CAS: 7732-18-5    PubChem: 3303    ChEBI: 15377    PDB-CCD: HOH    3DMET: B01124    NIKKAJI: J43.587B
    ENTRY:    C00002    Compound    NAME:    ATP;    Adenosine 5'-triphosphate    FORMULA:    C10H16N5O13P3    EXACT_MASS:    506.9957    DBLINKS:    CAS: 56-65-5    PubChem: 3304    ChEBI: 15422    KNApSAcK: C00001491    PDB-CCD: ATP    3DMET: B01125    NIKKAJI: J10.680A
    ENTRY:    C00003    Compound    NAME:    NAD+;    NAD;    Nicotinamide adenine dinucleotide;    DPN;    Diphosphopyridine nucleotide;    Nadide    FORMULA:    C21H28N7O14P2    EXACT_MASS:    664.1169    DBLINKS:    CAS: 53-84-9    PubChem: 3305    ChEBI: 15846    KNApSAcK: C00007256    PDB-CCD: NAD NAJ    3DMET: B01126    NIKKAJI: J136.554A
    ENTRY:    C00004    Compound    NAME:    NADH;    DPNH;    Reduced nicotinamide adenine dinucleotide    FORMULA:    C21H29N7O14P2    EXACT_MASS:    665.1248    DBLINKS:    CAS: 58-68-4    PubChem: 3306    ChEBI: 16908    KNApSAcK: C00019343    PDB-CCD: NAI    3DMET: B01127    NIKKAJI: J213.546I
    ENTRY:    C00005    Compound    NAME:    NADPH;    TPNH;    Reduced nicotinamide adenine dinucleotide phosphate    FORMULA:    C21H30N7O17P3    EXACT_MASS:    745.0911    DBLINKS:    CAS: 2646-71-1    PubChem: 3307    ChEBI: 16474    KNApSAcK: C00019545    PDB-CCD: NDP    3DMET: B01128    NIKKAJI: J208.978E
    ENTRY:    C00006    Compound    NAME:    NADP+;    NADP;    Nicotinamide adenine dinucleotide phosphate;    beta-Nicotinamide adenine dinucleotide phosphate;    TPN;    Triphosphopyridine nucleotide    FORMULA:    C21H29N7O17P3    EXACT_MASS:    744.0833    DBLINKS:    CAS: 53-59-8    PubChem: 3308    ChEBI: 18009    PDB-CCD: NAP    3DMET: B01129    NIKKAJI: J247.824B
    ENTRY:    C00007    Compound    NAME:    Oxygen;    O2    FORMULA:    O2    EXACT_MASS:    31.9898    DBLINKS:    CAS: 7782-44-7    PubChem: 3309    ChEBI: 15379    PDB-CCD: OXY    3DMET: B00001    NIKKAJI: J44.420K
    ENTRY:    C00008    Compound    NAME:    ADP;    Adenosine 5'-diphosphate    FORMULA:    C10H15N5O10P2    EXACT_MASS:    427.0294    DBLINKS:    CAS: 58-64-0    PubChem: 3310    ChEBI: 16761    KNApSAcK: C00019353    PDB-CCD: ADP    3DMET: B01130    NIKKAJI: J10.683F
    ENTRY:    C00009    Compound    NAME:    Orthophosphate;    Phosphate;    Phosphoric acid;    Orthophosphoric acid    FORMULA:    H3PO4    EXACT_MASS:    97.9769    DBLINKS:    CAS: 7664-38-2    PubChem: 3311    ChEBI: 18367    KNApSAcK: C00007408    PDB-CCD: 2HP PI PO4    3DMET: B00002    NIKKAJI: J3.746J
    ENTRY:    C00010    Compound    NAME:    CoA;    Coenzyme A;    CoA-SH    FORMULA:    C21H36N7O16P3S    EXACT_MASS:    767.1152    DBLINKS:    CAS: 85-61-0    PubChem: 3312    ChEBI: 15346    KNApSAcK: C00007258    PDB-CCD: COA COZ    3DMET: B04618    NIKKAJI: J192.630F
The problem I am facing is that, when it gets imported into excel, the columns get mis-aligned as not all compounds alternative names. The text file also contains entries which lack mass and chemical formula.
This wouldn't normally be a problem if I only had to deal with 10 or so compounds (then I would just fix it manually) but when I have to manage about 19 000 of them...

The end-result I am after looks pretty much like this:

Code:
ENTRY    NAME    FORMULA    MASS    DBLINKS
1     H20; Water     H2O    18.0106    CAS: 7732-18-5    PubChem: 3303    ChEBI: 15377    PDB-CCD: HOH    3DMET: B01124    NIKKAJI: J43.587B
I am guessing I need a macro of some sort to get the result I want. However, I am not very experienced in working with excel or programming in general. I will appreciate any help I can get.

Thanks!

Clanrat
 

Excel Facts

How to create a cell-sized chart?
Tiny charts, called Sparklines, were added to Excel 2010. Look for Sparklines on the Insert tab.
I shall hesitatingly say that there are two broadly similar possible approaches: import the text into Excel as it stands and try to parse each line by using Text-to-Columns, or write a routine in VBA to read the file, parse it, and load the values into Excel.

I shall have a go at making some progress with the latter method as I've done similar things before. Maybe someone else will come up with another approach.

First of all, I see the file has spaces separating the various fields and values. Are those single spaces, multiple spaces, TAB characters maybe?

Next, are you able to let me have a definitive list of 'field names' - EXACT_MASS, DBLINKS, CAS, etc?
 
Upvote 0
I've had a first attempt. Create a new workbook, then right-click the Sheet1 worksheet tab and select View Code. Press Ctrl-R to make sure the Project Explorer is visible then go Insert > Module and make sure Module1 has been created.

Double click the module name to open the code window and delete anything that's already in there, then paste this code in:-
Code:
[FONT=Courier New]Option Explicit[/FONT]
[FONT=Courier New][/FONT] 
[FONT=Courier New]Sub ImportData()[/FONT]
[FONT=Courier New][/FONT] 
[FONT=Courier New]  Dim ws As Worksheet
  
  Dim intFH As Integer
  Dim sRec As String
  Dim sTemp As String
  
  Dim aPtr As Integer
  Dim iRow As Long
  Dim iPtr As Integer
  Dim iPtr2 As Integer
  Dim iColumn As Long
  
  Dim sFileName As Variant
  Dim vFieldnames As Variant
  Dim sDelimiter As String
  
  sFileName = Application.GetOpenFilename(FileFilter:="All file types (*.*), *.*")
  If sFileName = "False" Then Exit Sub[/FONT]
[FONT=Courier New]  vFieldnames = Array("ENTRY", "NAME", "FORMULA", "EXACT_MASS", "DBLINKS", _
                      "PubChem", "ChEBI", "KNApSAcK", "PDB-CCD", "3DMET", "NIKKAJI")
  
  Set ws = ThisWorkbook.Sheets("Sheet1")
  ws.Cells.ClearContents
  ws.Range("A1:Z1") = vFieldnames
  ws.Range("A1:Z1").Font.Bold = True
  
  sDelimiter = Chr(255)
  
  Close
  intFH = FreeFile()
  Open sFileName For Input As #intFH
  iRow = 1
  
  Do Until EOF(intFH)
    Line Input #intFH, sRec
    For aPtr = LBound(vFieldnames) To UBound(vFieldnames)
      sRec = Replace(sRec, vFieldnames(aPtr) & ":", sDelimiter & vFieldnames(aPtr) & ":")
    Next aPtr
    sRec = sRec & sDelimiter
    iPtr = InStr(sRec, "  ")
    Do Until iPtr = 0
      sRec = Replace(sRec, "  ", " ")
      iPtr = InStr(sRec, "  ")
    Loop
    sRec = Replace(sRec, " " & sDelimiter, sDelimiter)
    sRec = Replace(sRec, ": ", ":")
    iRow = iRow + 1
    For aPtr = LBound(vFieldnames) To UBound(vFieldnames)
      iPtr = InStr(sRec, sDelimiter & vFieldnames(aPtr) & ":")
      If iPtr > 0 Then
        sTemp = Mid(sRec, iPtr + Len(vFieldnames(aPtr)) + 2)
        iPtr2 = InStr(sTemp, sDelimiter)
        sTemp = Left(sTemp, iPtr2)
        sTemp = Replace(sTemp, sDelimiter, "")
        iColumn = 1
        Do Until ws.Cells(1, iColumn) = vFieldnames(aPtr) Or iColumn > UBound(vFieldnames)
          iColumn = iColumn + 1
        Loop
        ws.Cells(iRow, iColumn) = sTemp
      End If
    Next aPtr
  Loop
    
  Close
  
  ws.Columns("A:Z").EntireColumn.AutoFit
  ws.Columns("B").ColumnWidth = 60
  ws.Range("A1").Activate
  With ActiveWindow
    .SplitColumn = 0
    .SplitRow = 1
    .FreezePanes = True
  End With
  
  MsgBox vbCrLf _
       & "Import file: " & sFileName & Space(10) & vbCrLf & vbCrLf _
       & CStr(iRow - 1) & " entries imported" & Space(10) & vbCrLf & vbCrLf _
       & CStr(UBound(vFieldnames) + 1) & " data fields identified", _
       vbOKOnly + vbInformation, "Clanrat's Data Import Routine"
  
End Sub[/FONT]

Now return to your worksheet and save it as a macro-enabled workbook (.xlsm). Finally go Developer > Macros > ImportData > Run to execute the code. When the dialog box opens, navigate to your text file and select it, then click Open.

Has that done the trick?

I've assumed that there are no funny characters in the text file like TABs: if there are, the code will need to be tweaked to allow for them.

Let me know how it goes?
 
Upvote 0
Oh yes, forgot to say: the code seems to work here with the small sample of data you've provided. I would have to have a much larger sample to test all possible variations, since you say some of the eNtries have data missing from them. Perhaps later, if you spot any inconsistencies in the results.
 
Last edited:
Upvote 0
The text file does include tabs, and it look like it replaces them with boxes containing question marks.

Other then that it worked, perfectly! Thanks a lot! :biggrin:
 
Upvote 0
I think you could solve that issue with a minor change in the code.

After:-
Code:
    Line Input #intFH, sRec
insert this:-
Code:
    sRec = Replace(sRec, vbTab, " ")

If that doesn't do the trick and you'd rather get it sorted out, as I said, I might need a sample of the actual data at some point.

Same goes if you spot any fields which aren't being extracted, or which aren't being extracted correctly.
 
Upvote 0

Forum statistics

Threads
1,224,521
Messages
6,179,285
Members
452,902
Latest member
Knuddeluff

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top