Fast way to Identify Duplicates (Huge data set)

MarkCBB

Active Member
Joined
Apr 12, 2010
Messages
497
Hi there,

I have 100,000,000 (One hundred Million) Alphanumeric Codes, (100 columns with 1 Million in each column).

I have tried to use worksheetfunction.coutif(range,cell.value) > 1 but this take about 3-4 seconds per cell, thus over 5-6 years to calculate.

Does anyone have ideas on how to do this efficiently? like under 24 hours.
 

Excel Facts

What does custom number format of ;;; mean?
Three semi-colons will hide the value in the cell. Although most people use white font instead.
What do you want to do with the duplicates?
If you want to remove them, you could try Excel's built-in Remove Duplicates tool.
 
Upvote 0
I just need to remove them, the built in option is a row by row function which won't work in this case and my data is not really a row my row database but a cell by cell.
So If I remove the duplicates from Column A there could still be a duplicate in column B,c,,d,f etc...
I need to be able to scan all 100 million cells for duplicates.
 
Upvote 0
Why not try first column by column.
Then cut and paste to reduce the columns.
Repeat until no space left to cut and paste, and Remove Duplicates no longer working.
Assuming that there will be more than one column left, maybe by then the size will be more manageable to use other methods.

I wonder how long Remove Duplicates takes on 1 million rows.
 
Upvote 0
A possibility is to write a macro to split the columns into 200 columns and for each of all the combinations of 2 two columns (about 20,000), do a loop to Remove duplicates.
If each step in the loop takes less than about 4.3 seconds, the total time should be less than 24 hours.
I think the only way is to test it on some of the data.
 
Upvote 0
I just need to remove them, the built in option is a row by row function which won't work in this case and my data is not really a row my row database but a cell by cell.
So If I remove the duplicates from Column A there could still be a duplicate in column B,c,,d,f etc...
I need to be able to scan all 100 million cells for duplicates.

Hi, if get this right: you want to remove all the duplicated cells in your sheets and not the duplicated rows, or columns.
This macro will remove all duplicated cells in each column, move the column data below each other then remove the duplicated cells again.
Please try this on a COPY of the original sheet, and make sure there is no merged cells.

Code:
Sub mrexcel_markcbb_0327()

Application.screenupdateing = false

''' get rid of duplicates on each column
lc = ActiveSheet.Cells(1, Columns.Count).End(xlToLeft).column
Do
i = i + 1
lr = ActiveSheet.Cells(Rows.Count, i).End(xlUp).Row
ActiveSheet.Range(Cells(1, i), Cells(lr, i)).RemoveDuplicates Columns:=1, Header:=xlNo
Loop Until i = lc




'''' move columns below each other
i = 0
Do
lr = ActiveSheet.Cells(Rows.Count, 1).End(xlUp).Row
i = i + 1
k = ActiveSheet.Cells(Rows.Count, i + 1).End(xlUp).Row
Range(Cells(1, i + 1), Cells(k, i + 1)).Cut
Cells(lr + 1, 1).Select
ActiveSheet.Paste
Loop Until i = lc


'''redo the "removeduplicates"
lr = ActiveSheet.Range("A" & Rows.Count).End(xlUp).Row
ActiveSheet.Range("A1:A" & lr).RemoveDuplicates Columns:=1, Header:=xlNo

Msgbox ("done")
End Sub

I hope it helps, good luck
 
Last edited:
Upvote 0
CsJHUN
The problem is that each column has about 1 million rows of data, so column 1 will probably soon run out of space to receive more data.
 
Upvote 0
I understand and aware of this problem, but we don't know how many duplicated data is there.
The macro will run on error if it cannot cut and paste the data in "A".

The other way should be read all the data into array... currently working on it but don't know too much about them.
 
Upvote 0
@MarkCBB
Could we have small set of dummy data (say 10-15 rows and 4 columns) to show/explain the issue and also show the corresponding required result for that small sample?
 
Upvote 0
I understand and aware of this problem, but we don't know how many duplicated data is there.
The macro will run on error if it cannot cut and paste the data in "A".
After the first removal of duplicates, if the combined total rows of cols A & B have not reduced from 2M to 1M, there will not be enough room in col A - and there are 98 other columns starting with 1M rows each to be accounted for.
 
Upvote 0

Forum statistics

Threads
1,214,641
Messages
6,120,692
Members
448,979
Latest member
DET4492

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top