Finding Duplicates and Storing Results

amerifax-heather · Jun 8, 2011

I am going to try my best to explain my situation. I am using Excel 2007. I have an extremely large data file (1,010,165 records to be exact). It is a combination of 2 different data files (2010 and 2005) with the same exact structure. Between the 2 files I need to know if I have any duplicates in 2 different scenarios, Columns A-F and Columns B-E. I want to keep the record from 2010 and mark it as MST and delete the record from 2005 (eventually), before it is deleted I need to have a record of that duplicate in the MST record as either 2005(A-F) or 2005(B-E) in a field called "ORG".

Columns A-F contain - Name, Addr, City, St, Zip, Phn
Columns B-E contain - Addr, City, St, Zip, Phn

What I need to determine is this...Each of the records have an identifier in a Field called "YR-Q" which contains either 2010 or 2005. I need to search the records and find out where there is a duplicate using columns A-F. When a duplicate is found, I need to mark the 1st rec (if YR-Q=2010 as MST or if YR-Q=2005 as DUP). I also need to record the result in the MST record as "2005(A-F)" in a field called "ORG". And then run the same scenario using columns B-E.

The other thing I am going to need to know is if the Address information (Columns B-E) is unique (not duplicated) where YR-Q=2005. If this is the case I would like to put the value "2005(ADDR)" in the "ORG" field.

I should also tell you that before the files are joined together each individual file is checked for duplicates within the file and all duplicates removed. So the only occurrence of a duplicate will occur between the 2010 file and the 2005 file.

I hope I've explained myself well enough. Thank you in advance for any help or guidance you can give.

Heather

Rijnsent · Jun 15, 2011

Hi Heather,
I'm going to assume here that you want to solve this question with formula only in Excel? (As in: with macros other solutions would be possible, Access would be an other option.)
What I would do:
-Create for each line one cell with a unique identifier, which in your case would be Column B-E, so the formula for that identifier column would be "=B1&C1&D1&E1" (for the first row). This is also assuming that all your addresses are formatted the same (capitals, spaces, etc. If e.g. one address is "Main street" and another "Mainstr." and another "Mainstreet", those are 3 different names and really hard to match...)
-Okay, now each address has a unique identifier (let's call it ID). With your amount of addresses I would copy the formula down to your last datarow, let Excel calculate and then copy-paste that new column as values.
-Next, in an extra column you can count the number of times an ID is in the list: =COUNTIF(F$2:F$6;F2) (change the ranges for your sheet). Copy paste that formula down your list, let excel calculate and copy-paste as values.
-Then, you know for each line whether it is from 2005 or 2010 and how many times the address is in the list. So 2005 and 1 time in the list means it's only in 2005, not in 2010. A 2 and 2010 means that it's both in 2005 and 2010... So one extra formula for each line should get you your result: =IF(F1=2005;IF(G1=1;"ONLY 2005";"2005 and 2010");IF(G1=1;"ONLY 2010";"2010 and 2005")) (change ranges and names in the formula).

Would that do the trick?

Cheers,

Koen

Finding Duplicates and Storing Results

amerifax-heather

New Member

Excel Facts

Rijnsent

Well-known Member

Similar threads

Forum statistics

Share this page

Finding Duplicates and Storing Results

amerifax-heather

New Member

Excel Facts

Rijnsent

Well-known Member

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock