Convert Excel Multiline data to a Usable Data Set
Alan Z sent in this week's Excel problem. His MIS department gives him a file generated from an old COBOL report. After opening the file in Excel, he has 2500 lines of this:
CustLastName | CustFirstName | CustMiddle | CustAddress | CustCity |CustState | CustZip Useless LINE1 Useless LINE2 CustLastName | CustFirstName | CustMiddle | CustAddress | CustCity |CustState | CustZip Useless LINE1 Useless LINE2
Alan wants to get this ASCII report into a useful format: one line per customer, with State and Zip appended to the rest of the information. He also wants to zap the 2 useless lines. Obviously, we don't want to manually do all of this. Here is one way to quickly deal with the mess.
With the formulas
- Insert two blank columns to the left of the data.
- Add a header row above the data.
- Column A is called "Sequence"
- Column B is called "RowType"
- Column C is called "Data"
- Make all of the headings bold
- Column A is going to be used to assign a number to each logical record in the report. Because this report has 4 physical lines for each logical record, we need each set of 4 lines to have the same logical record number. I usually enter values for the first record and then design formulas for the 2nd record that can be copied down the entire report.
- In Cells A2:A5, enter a 1. In Cell A6, enter =A5+1. In cell A7, enter =A6. Copy A7 to A8 & A9. You now have a copyable set of formulas for the 2nd logical record of the report.
- Select A6:A9 and hit Ctrl C to copy. Select A10:A2501 and hit Ctrl V to paste.
- Column B is going to be used to identify if the particular row is the 1st, 2nd, 3rd, or 4th segment of the logical record.
- In cells B2:B5, enter 1, 2, 3, and 4. In cell B6, enter =B2. Copy cell B6 from B7:B2501.
After changing to values
- Now that you have sequence numbers and row types for all of your data, you need to change the formulas to values. Select A2:B2501. Edit > Copy, Edit > Paste Special > Values > OK.
Now that sequence numbers and row types have been assigned for all rows, we are almost done. Sort the data by Row Type as the primary key and Sequence as the secondary key. This will cause the 625 top lines of each record to float up to cells C2:C626. The 2nd line of each record will be in C626:C1251. The "useless" lines will start in C1252 and can be deleted. Move cells C626:C1251 to cell D2. In Cell E2, enter the formula =C2&D2. You can copy this formula from E2 to E626. Use the same Paste Special Value trick to change from formulas to values, delete columns A-D and you have your result.
From here, you can use the Text to Columns wizard to process this data further
You can easily customize this procedure to deal with any variety of ASCII reports. You need to figure out how many physical printed lines make up a single logical record on the report.