Excel tips for data normalisation needed

excelsishya

Board Regular
Joined
Jul 25, 2010
Messages
107
Hi all,

I have data list whose names are not in consistent format.

I am looking for excel tips which helps in name normalization.

i know it is not cent 100% accurate but any excel tips which reduces manual normalisation process is very helpful.

i am using excel 2007.

Thanks in advance.

Here is small sample set

Sheet2

<table style="padding-right: 2pt; padding-left: 2pt; font-size: 11pt; font-family: Calibri,Arial; background-color: rgb(255, 255, 255);" border="1" cellpadding="0" cellspacing="0"> <colgroup> <col style="font-weight: bold; width: 30px;"> <col style="width: 267px;"> <col style="width: 114px;"></colgroup> <tbody> <tr style="font-weight: bold; font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;"> <td>
</td> <td>B</td> <td>C</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">1</td> <td style="font-weight: bold;">Raw data</td> <td style="font-weight: bold;">Normalised</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">2</td> <td>zata ltd</td> <td>zata</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">3</td> <td>zATA LIMITED</td> <td>zata</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">4</td> <td>zata L T D</td> <td>zata</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">5</td> <td>VR GROUP</td> <td>VR GROUP</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">6</td> <td>V R GROUP</td> <td>VR GROUP</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">7</td> <td>ENFOY TECHNOLOGIES</td> <td>ENFOY</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">8</td> <td>ENFOY TECH</td> <td>ENFOY</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">9</td> <td>FIRST FORCE SA</td> <td>FIRST FORCE</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">10</td> <td>FIRST FORCE SRL</td> <td>FIRST FORCE</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">11</td> <td>FIRST FORCE OYA</td> <td>FIRST FORCE</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">12</td> <td>FIRST FORCE INTERNATION</td> <td>FIRST FORCE</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">13</td> <td>FIRST FORCE INTL</td> <td>FIRST FORCE</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">14</td> <td>PHK ENGINERING</td> <td>PHK ENGINERING</td></tr> <tr style="height: 18px;"> <td style="font-size: 8pt; background-color: rgb(202, 202, 202); text-align: center;">15</td> <td>P H K ENGINERING</td> <td>PHK ENGINERING</td></tr></tbody></table>
 

Excel Facts

Will the fill handle fill 1, 2, 3?
Yes! Type 1 in a cell. Hold down Ctrl while you drag the fill handle.
A Data Validation List is a good way to enforce consistent data.

It could be created with a Stop Error Alert to force a choice from the list, or it could use a Warning to encourage use of standardized choices but allow users to add items not on the list.
 
Upvote 0
I have data list whose names are not in consistent format. I am looking for excel tips which helps in name normalization.
Two problems:-
  • Enforce correct name selection going forwards
  • Correct existing list
How big is the list?
 
Upvote 0
foverman and Ruddles thanks for showing interest in thread.

Problem i am facing is not related to data validation.

here are more details on it.

data list is 20 thousand+.

these are companies name tracked over a period.

eg in 2008 name was entered as zata ltd with say 1000 Rs amount.

same company in 2009 has been entered as zATA LIMITED with 500RS.

my aim is to analyse growth rate after normalising.

if i find growth rate as it is without normalisation i would end up with incorrect analysis.

one idea i had was to match letters in strings in same order get some kind of % match, then sort by % giving highest % match.(Ocurrence of company names in same column makes more difficult though).


zata l t d

zata ltm

these should be match as letters are same & occurence also order.

zata limited would be say 90or 80 % match.

These kind of thoughts. but unfortunately my excel skills are not sufficient to pull this kind of stuff.

Any ideas or thoughts on how to normalise this kind of data is very helpful.

Thanks
 
Upvote 0
In the past when I've had to deal with dirty data, I've used another column and used the =lower function to ensure consistency in case type and then removed all spaces.

That will get you a single string that is at least consistent in type and with no extraneous spaces between words.

It's not 100%, obviously, but it's a good place to start.
 
Upvote 0
thanks VOG . these link has so many pages. i will read it and try to apply to my problem. i will let you know the results.
 
Upvote 0
Thanks chuckles1066. its certainly good place to start. atleast it will bring names spread allover 20k rows into visual range by sorting helper column after removing space .
 
Upvote 0
One approach I've used to clean data is to use progressive or multi-pass filtering.

Filter on the Normalised column to show only blanks, then filter Raw data for Zata. Enter the normalised name then move on to enfoy, first force, etc.
 
Upvote 0

Forum statistics

Threads
1,224,550
Messages
6,179,463
Members
452,915
Latest member
hannnahheileen

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top