Fast way to Identify Duplicates (Huge data set)

MarkCBB

Active Member
Joined
Apr 12, 2010
Messages
497
Hi there,

I have 100,000,000 (One hundred Million) Alphanumeric Codes, (100 columns with 1 Million in each column).

I have tried to use worksheetfunction.coutif(range,cell.value) > 1 but this take about 3-4 seconds per cell, thus over 5-6 years to calculate.

Does anyone have ideas on how to do this efficiently? like under 24 hours.
 
MarkCCB
What is the reduction in cells after running this ? :
Code:
Sub First_Run()
Dim t As Date
Dim lc%, I%, lr&
Application.ScreenUpdating = False
lc = ActiveSheet.Cells(1, Columns.Count).End(xlToLeft).Column
On Error Resume Next
t = Now()
Do
    I = I + 1
    lr = ActiveSheet.Cells(Rows.Count, I).End(xlUp).Row
    ActiveSheet.Range(Cells(1, I), Cells(lr, I)).RemoveDuplicates Columns:=1, Header:=xlNo
Loop Until I = lc
MsgBox Format(Now() - t, "hh:mm:ss")
On Error GoTo 0
End Sub

And how long was the run time ?
 
Upvote 0

Excel Facts

How to create a cell-sized chart?
Tiny charts, called Sparklines, were added to Excel 2010. Look for Sparklines on the Insert tab.
Excel's built-in Remove Duplicates is unreliable. Test with this simple example (post #8).
With the huge data sample being spoken of here, how would you know if something like that example had happened within your data?

(I'd still like to see a small set of dummy data and expected results)
 
Last edited:
Upvote 0
The other way should be read all the data into array... currently working on it but don't know too much about them.

Here's one way, but have no idea how long it would take (or whether 100M items can be stored).
It should write the non-duplicates to a blank sheet ("Sheet2") :
Code:
Sub GetUnique()
Dim d As Object, cel As Range, k, tmp$, c%, r&
Set d = CreateObject("scripting.dictionary")
For Each cel In ActiveSheet.UsedRange
    tmp = Trim(cel.Value)
    If Len(tmp) > 0 Then d(tmp) = d(tmp) + 1
Next
c = 1
r = 1
For Each k In d.keys
    Sheets("Sheet2").Cells(r, c) = k
    r = r + 1
    If r = 1000000 Then
        r = 1
        c = c + 1
    End If
Next
End Sub
 
Upvote 0
Excel's built-in Remove Duplicates is unreliable. Test with this simple example (post #8).
With the huge data sample being spoken of here, how would you know if something like that example had happened within your data?

In the link you posted the data supplied by kalak was properly processed for me.
But the widely reported unreliability of Remove Duplicates probably means it should not be a part of MarkCCB's solution.
 
Upvote 0
I tested the macro in post #13 on 2,500,000 cells spread over 3 columns and containing 2,000,000 unique items.
It took about 4 minutes 18 seconds.
So with 100,000,000 cells it would probably take about 3½ hours (less if there are fewer than 80,000,000 unique items).
 
Upvote 0
Hi Everyone, sorry I am only replying now. I have tried all the options, but the codes take forever, I think the best way to tackle this is to create the unique code in a way that I can ensure there will not be any duplicates, creating them should be faster than removing duplicates.
 
Upvote 0

Forum statistics

Threads
1,215,372
Messages
6,124,532
Members
449,169
Latest member
mm424

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top