Fast Duplicate Check For Large Data Sets Using VBA?

CaliKidd

Board Regular
Joined
Feb 16, 2011
Messages
173
Hey Experts,

I am using the following logic to check for and highlight duplicate entries. The way the logic works, it starts off slow and speeds up as it goes because the number of cells it has to check for duplicates becomes less and less. It works OK as long as the data set is relatively small (hundreds to low thousands), but I decided to load-test it on 1,000,000 cells and it crawled! It took about 30 minutes to just get through the first 15 duplicate checks (and my PC has a quad-core processor overclocked to 4Ghz and 4Gb RAM). At this rate, it would probably take months to finish... :eek:

Here's the code I borrowed from another online site:
Code:
Sub DupsGreen()
[INDENT]Application.ScreenUpdating = False
Rng = Selection.Rows.Count
For i = Rng To 1 Step -1
[INDENT]myCheck = ActiveCell
ActiveCell.Offset(1, 0).Select
For j = 1 To i
[INDENT]If ActiveCell = myCheck Then
[INDENT]Selection.Font.Bold = True
Selection.Font.ColorIndex = 4
[/INDENT]End If
ActiveCell.Offset(1, 0).Select
[/INDENT]Next j
ActiveCell.Offset(-i, 0).Select
[/INDENT]Next i
Application.ScreenUpdating = True
[/INDENT]End Sub
Does anyone know of a slicker, faster way?
 
@CaliKidd,

I'm guessing the code I posted was not as quick as I imagined it would be. I'm wondering if that might be because I didn't turn off enough of the automatic events Excel performs. Since you have an existing quite large database, I was hoping you would perform an experiment for me. Could you run the following revision of my posted macro and tell me how long it takes to run (I'll use the comparison to mirabeau's code to judge the effectiveness of the technique)? Thanks!

Code:
Sub HighlightDuplicates()
  Dim DataColumn As Long, LastRow As Long, UnusedColumn As Long, Diff As Long
  DataColumn = ActiveCell.Column
  LastRow = Cells(Rows.Count, ActiveCell.Column).End(xlUp).Row
  UnusedColumn = Cells.Find(What:="*", SearchOrder:=xlByColumns, _
                 SearchDirection:=xlPrevious, LookIn:=xlFormulas).Column + 1
  Diff = UnusedColumn - DataColumn
  With Application
    .ScreenUpdating = False
    .Calculation = xlCalculationManual
    .EnableEvents = False
  End With
  With Columns(UnusedColumn)
    .Resize(LastRow).FormulaR1C1 = "=IF(COUNTIF(R1C[-" & Diff & "]:RC[-" & Diff & "],RC[-" & Diff & "])-1,""X"","""")"
    .Value = .Value
    Intersect(.Offset(, -Diff).EntireColumn, .SpecialCells(xlCellTypeConstants).EntireRow).Interior.ColorIndex = 6
    .Clear
  End With
  With Application
    .ScreenUpdating = True
    .Calculation = xlCalculationAutomatic
    .EnableEvents = True
  End With
End Sub
 
Upvote 0

Excel Facts

Did you know Excel offers Filter by Selection?
Add the AutoFilter icon to the Quick Access Toolbar. Select a cell containing Apple, click AutoFilter, and you will get all rows with Apple
Rick, I'd be glad to run that benchmark test for you. I suspect it will be the fastest.

The reason why I didn't comment on your code was that I posted my last reply at 3am last night and at that time I had only had a chance to modify and successfully test Mike's and Mirabeau's code. I did work with your code, too, but I ran into a few issues and, frankly, because the technique you employed is beyond my current skill level, I wasn't able to get it to work.

So, I would need your assistance in making some small changes in order to benchmark your code:

1. My data starts on row 4 (row 3 is a data header and rows 1 & 2 have some other stuff). When your code sets up the unused column, installs the formulas, and calculates the results, it does so on the entire column.

2. I believe when your intersect code finds a duplicate, it immediately changes the background to yellow. Would it be possible to break this down into an decision point event? In other words, when a duplicate is found, I would like to have the choice to highlight the cell, add a comment, or both. I already have those subs defined, so the logic would need to be something like:

Code:
If Duplicate_Found then
[INDENT]If Highlight_Option = True Then Call Highlight_Cell
If Comment_Option = True Then Call Comment_Cell
[/INDENT]End If
 
Upvote 0
Rick, just a quick test update. Your code is fast. Very fast. But I entered a duplicate and your existing code did not highlight it. Can you double-check it?
 
Last edited:
Upvote 0
Rick, just a quick test update. Your code is fast. Very fast. But I entered a duplicate and your existing code did not highlight it.
The only way I can think that would happen is if the "duplicates" are not really duplicates. That could happen if there is a trailing space or other "invisible" character attached to them. If you check the missed duplicates and find that is not the case, would you mind sending the file to me or, alternately, making it available to me online?
 
Upvote 0
I suspect it will be the fastest.
I'm not completely sure if that will be the case or not; hence the request for you to test it out on real data.

The reason why I didn't comment on your code was that I posted my last reply at 3am last night and at that time I had only had a chance to modify and successfully test Mike's and Mirabeau's code.
No problem... I just assumed since you didn't comment, that it must have been much slower than I thought it would be.

I did work with your code, too, but I ran into a few issues and, frankly, because the technique you employed is beyond my current skill level, I wasn't able to get it to work.
Is that still the case? Given your follow up message, it would appear you overcame the first of your numbered problems

So, I would need your assistance in making some small changes in order to benchmark your code:

1. My data starts on row 4 (row 3 is a data header and rows 1 & 2 have some other stuff). When your code sets up the unused column, installs the formulas, and calculates the results, it does so on the entire column.
As I said, your follow up message seems to indicate you fixed this problem. Did you?

So, I would need your assistance in making some small changes in order to benchmark your code:

2. I believe when your intersect code finds a duplicate, it immediately changes the background to yellow. Would it be possible to break this down into an decision point event? In other words, when a duplicate is found, I would like to have the choice to highlight the cell, add a comment, or both. I already have those subs defined, so the logic would need to be something like:

That is a little bit problematic with the code as I wrote it. My code does not iterate the duplicates changing them one at a time; rather, it identifies and highlights all of them at the same time. That seemed to be the functionality you were after based on your original message (or at least how I read it). Is there some way you can identify the "rule" that you will use to decide if the duplicate should be marked or not (perhaps I can incorportate it into my code)?
 
Upvote 0
Rick,

Those two issues I listed have not been resolved. I thought the fact that my data starts on row 4 may be part of the problem, so I copied the data to different column and started it at row 1.

But I ran into another problem...

I am now getting the message "Excel cannot complete this task with available resources. Choose less data or close other applications."

I rebooted my PC, opened Task Manager and killed off all non-essential tasks, but apparently 4GB of RAM is not enough. I cut the data set down to 500,000 and then to 250,000, but I continued to get this error.

Apparently your solution is done in memory and is therefore dependent on how much memory is available. I am running Win-7 64-bit and even though I have 6 GB, I believe only 4 GB max is recognized by the OS. Hence, I'm not sure your solution is viable until Microsoft can overcome this limitation in future OS's.
 
Upvote 0
If I might re-enter the thread ...

1 million entries to find duplicates is quite a large problem. In my experience many of the built-in Excel functions (although not all) just don't handle this sized problem very well.

But 6 minutes seems to me a very long time to produce the results for this problem even with a million entries.

If you want do do it faster you might consider trying the following code. This supposes you are only looking at duplicates in ColumnA, with relevant data starting from A4.

It's faster if there's lots of duplicates (took me 13 seconds with one example) and slower with fewer duplicates.

I'd also note that if your data are all integers then the problem can be solved quite a bit faster with a different code.
Code:
Sub colorduplicates()
Dim t As Single
t = Timer
Dim q&, x&, i&, a
Dim ash As Worksheet
Set ash = ActiveSheet
q = Range("A" & Rows.Count).End(3).Row - 3
a = Range("A4").Resize(q)
Application.ScreenUpdating = False
With Sheets.Add
    .Cells(1).Resize(q) = a
    .Cells(2) = 1: .Cells(2).Resize(q).DataSeries
    .Cells(1).Resize(q, 2).Sort .Cells(1), 1, Header:=xlNo
    a = .Cells(1).Resize(q + 1)
    For i = 1 To q
        If a(i, 1) <> a(i + 1, 1) Then
            If i > x + 1 Then _
                .Cells(x + 2, 1).Resize(i - x - 1).Font.ColorIndex = 4
            x = i
        End If
    Next i
    .Cells(1).Resize(q, 2).Sort .Cells(2), 1, Header:=xlNo
    .Cells(1).Resize(q).Copy ash.Range("A4")
Application.DisplayAlerts = False
    .Delete
Application.DisplayAlerts = True
End With
Application.ScreenUpdating = True
MsgBox "Code took " & Format(Timer - t, "0.00 secs")
End Sub
 
Upvote 0
I'd expect that the quickest results are when you can apply Ricks Conditional Formatting, standard methods in VBA such as Conditional Formatting or AutoFilter data (with or without a working formula column) offer the quickest solutions

If you do need to run significant data manipulation on the records then mirabeau's variant array approach is the way to go, my Duplicate Master addin uses this exact approach

Cheers

Dave
 
Upvote 0
Mirabeau,

Let me qualify the "6 minutes" figure by saying that it is my benchmark time using code I modified. Part of this modification is running a progress bar form as the code is executing so I have a visual of how the operation is progressing. I would estimate this progress bar creates a performance hit that probably amounts to 3 of those 6 minutes. There are some other subs I threw in also, which could slow it down even more. So, just the core code run by itself would probably execute in, say, 2 minutes (or less).

But, sometimes it helps to step back and put things into a new perspective to really understand the enormity of the problem. One million rows is a LOT of data. It's hard to visualize large numbers like this, but let me give it a whirl.

My fully populated worksheet I am load testing has 1,045,873 rows of data. I have a 30 inch LCD monitor of which I can comfortably display about 60 rows of data on screen at one time. So, doing the math, I would need 17,431 of these monitors to display all 1,045,873 rows. Now, the vertical height of my monitor is 18 inches, so if I placed all 17,431 monitors on the ground, one on top of the other, and formed a planked road, that road would be 4.95 miles (almost 8 km)!

So, even if the routine took a whopping 5 minutes to finish, if my computer was moving down this planked road of monitors, it would be going 60 miles per hour (100 kph). And if it can finish in 2 minutes, then it would be the equivalent of going 180 mph (300 kph), which is close to Indy 500 speed, lol. :eek:

So, in the final analysis, I'll gladly take the 6 minutes, especially given my original starting point was weeks/months! And, of course, I'll keep looking for faster methods in the meantime. :biggrin:

It's almost 1:30am, so I won't be able to try your modified code until tomorrow. I do appreciate your help (as well as the help of all the other experts, too).

Dave, thanks for letting me know of your cool add-in. I didn't realize there was one, so I'll give that link a look tomorrow.
 
Upvote 0

Forum statistics

Threads
1,217,380
Messages
6,136,227
Members
450,000
Latest member
jgp19

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top