Fast Duplicate Check For Large Data Sets Using VBA?

CaliKidd

Board Regular
Joined
Feb 16, 2011
Messages
173
Hey Experts,

I am using the following logic to check for and highlight duplicate entries. The way the logic works, it starts off slow and speeds up as it goes because the number of cells it has to check for duplicates becomes less and less. It works OK as long as the data set is relatively small (hundreds to low thousands), but I decided to load-test it on 1,000,000 cells and it crawled! It took about 30 minutes to just get through the first 15 duplicate checks (and my PC has a quad-core processor overclocked to 4Ghz and 4Gb RAM). At this rate, it would probably take months to finish... :eek:

Here's the code I borrowed from another online site:
Code:
Sub DupsGreen()
[INDENT]Application.ScreenUpdating = False
Rng = Selection.Rows.Count
For i = Rng To 1 Step -1
[INDENT]myCheck = ActiveCell
ActiveCell.Offset(1, 0).Select
For j = 1 To i
[INDENT]If ActiveCell = myCheck Then
[INDENT]Selection.Font.Bold = True
Selection.Font.ColorIndex = 4
[/INDENT]End If
ActiveCell.Offset(1, 0).Select
[/INDENT]Next j
ActiveCell.Offset(-i, 0).Select
[/INDENT]Next i
Application.ScreenUpdating = True
[/INDENT]End Sub
Does anyone know of a slicker, faster way?
 
hi,

if you are using Excel 2007 or further version, you can use the conditional formatting to highlight duplicate changes. Select your column which you want to highlight the duplicates and go to "Home" Menu --> Conditional Formatting --> Higlight Cell Rules --> Duplicate values.
 
Upvote 0

Excel Facts

Excel motto
Not everything I do at work revolves around Excel. Only the fun parts.
Maybe I don't understand the question: what about using conditional formatting? Untested on large dataset, as I use Excel 2003.

Code:
With Selection
  .FormatConditions.Delete
  .FormatConditions.Add Type:=xlExpression, Formula1:="=countif(" & .Address(ReferenceStyle:=xlR1C1) & ",RC)>1"
  .FormatConditions(1).Interior.ColorIndex = 4
End With
 
Upvote 0
Mirabeau,

I executed your last solution and it took 10.48 seconds, which is commendable on such a large record set. As I indicated in my last post, my approach thus far has been to identify particular cells that have duplicate values in them and then to highlight the cell, add a cell comment, display a msgbox, or any combination thereof, based on user selections.

Having said that, your last solution and other conditional formatting solutions (mentioned by Fazza and ganeshpoojary05), would require me to modify my approach; that is, I believe I would have to "mark" all of the duplicates first (in your case with a green font), then swing back through in a second pass to find all of those marked cells and then take action (e.g., highlight, add cell comment, display msgbox, etc).

I may invest the time to experiment with this because the execution speed in finding the duplicates is so fast I wouldn't need a progress bar. The speed is intriguing and somewhat addicting, lol.

As for your other comment about if the data were integers, there would be another approach -- well, unfortunately, in my case, the duplicates would be alphanumeric/string values.

P.S. For kicks and grins, I clocked my PC's run of 10.48 seconds at about 1700 mph (~ 2720 kph), or Mach 2.2. ;)
 
Upvote 0
FYI - Fazza's conditional formatting solution clocked in at 3.08 seconds (5,786 mph / 9311 kph / Mach 7.5). (y)
 
Upvote 0
Having said that, your last solution and other conditional formatting solutions (mentioned by Fazza and ganeshpoojary05), would require me to modify my approach; that is, I believe I would have to "mark" all of the duplicates first (in your case with a green font), then swing back through in a second pass to find all of those marked cells and then take action (e.g., highlight, add cell comment, display msgbox, etc).
Do I understand by this that you'd like to have some input box pop up whenever a duplicate is found, asking what you want done with the duplicate(s)?

That's easy enough to do a (fast) code for, but if you've got a large dataset with lots of duplicates then you might be doing your own inputs to it for quite some time.

It could be done either where you want to action each individual duplicate separately, or for a whole slab of similar duplicates, i.e. whenever the string CaliK75432 (or whatever) appears, all in one fell swoop.
 
Upvote 0
I take the large datasets here to be for the purpose of speed tests. Any actual dataset with millions of items would otherwise be a very poor candidate for highlighting duplicates as it would terribly impractical to actually intend for someone to visually scan millions of items...
 
Upvote 0
Xenou, yes, this is just a load-testing exercise that I was performing and asking others for faster solutions.

Mirabeau,
Do I understand by this that you'd like to have some input box pop up whenever a duplicate is found, asking what you want done with the duplicate(s)?

That's easy enough to do a (fast) code for, but if you've got a large dataset with lots of duplicates then you might be doing your own inputs to it for quite some time.

It could be done either where you want to action each individual duplicate separately, or for a whole slab of similar duplicates, i.e. whenever the string CaliK75432 (or whatever) appears, all in one fell swoop.
Let me clarify. I have a setup form for finding duplicates in my worksheet that allows me to click separate checkboxes for highlighting the cell, adding a cell comment, and/or displaying a message box. The options selected, execute. Those that are not selected, do not execute.

My initial approach to finding duplicates was to use a VBA loop. It went cell by cell. If a duplicate was found, I called the necessary subs to highlight, add a cell comment, display a msgbox, or any combination thereof. The logic was simple: If Highlight = True then Call Highlight; If Comment = True then Call Comment; etc.

I was also able to keep a running tally because everytime I found a duplicate, I incremented a counter variable (e.g., Duplicates = Duplicates +1).

The problem was the VBA loop method was SLOW, so I created a progress bar form so I could monitor the macro's progress and could be sure the system hadn't locked up. This gave me the visual feedback I wanted, but it slowed the operation down even more. :rolleyes: As I mentioned earlier in this thread, using your initial solution, it took about 6 minutes.

Your last solution, as well as Fazza's, is so fast, a progress bar is no longer necessary. However, since it is not using a cell-by-cell VBA loop approach, I have to rethink a few things. Let's use Fazza's code, for example.

After this conditional formatting macro runs, the duplicates are highlighted (cell background green). Now, I need to a fast way to find those highlighted cells, so I can:
1. Tally them (to get a total number of duplicates found)
2. Remove the highlighting (if the highlighting option is turned off)
3. Add comments to those duplicate cells (if this option is turned on)
4. Display a msgbox indicating which cells have duplicates (if this option is turned on). BTW, if there are a lot of duplicates found, the way I handle the neverending stream of msgboxes is simple to say "A duplicate has been found in cell ___. Do you want to continue?" , with a "Yes" response proceeding, and a "No" response exiting the sub.

In summary, what I now need is to come up with a fast way of finding and tallying those cells that Fazza's code highlighted. Any ideas of how to do this?

Once I know which cells they are and can select them, I can do steps 2-4 by simply call those subs I have in place to handle these options.
 
Last edited:
Upvote 0
In summary, what I now need is to come up with a fast way of finding and tallying those cells that Fazza's code highlighted. Any ideas of how to do this?

This code uses variant arrays to find the dupes, then a range object to hold the duplicate range. The code will cater for multiple range areas, empty cells are ignored

Reporting the range object will slow the code down if there are many dupes
Early binding the Dictionary Object will probably reduce code time by 20-30%

Lastly I have stored the address of the initial value in the dictionary, so the code can easily be tweaked to report all duplicate values (not just the 2nd, 3rd, x occurrences)

Code:
Sub GetDupes()
    Dim rng1 As Range
    Dim rngArea As Range
    Dim rng2 As Range
    Dim lngRow As Long
    Dim lngCol As Long
    Dim lngCalc As Long
    Dim objDic
    Dim X()


    On Error Resume Next
    Set rng1 = Application.InputBox("Select range", "User select", Selection.Address, , , , , 8)
    If rng1 Is Nothing Then Exit Sub
    On Error GoTo 0


    Set objDic = CreateObject("scripting.dictionary")


    'Speed up the code by turning off screenupdating and setting calculation to manual
    'Disable any code events that may occur when writing to cells
    With Application
        lngCalc = .Calculation
        .ScreenUpdating = False
        .Calculation = xlCalculationManual
        .EnableEvents = False
    End With

    'Test each area in the user selected range
    'Non contiguous range areas are common when using SpecialCells to define specific cell types to work on
    For Each rngArea In rng1.Areas
        'The most common outcome is used for the True outcome to optimise code speed
        If rngArea.Cells.Count > 1 Then
            'If there is more than once cell then set the variant array to the dimensions of the range area
            'Using Value2 provides a useful speed improvement over Value. On my testing it was 2% on blank cells, up to 10% on non-blanks
            X = rngArea.Value2
            For lngRow = 1 To rngArea.Rows.Count
                For lngCol = 1 To rngArea.Columns.Count
                    If Len(X(lngRow, lngCol)) > 0 Then
                        If Not objDic.exists(X(lngRow, lngCol)) Then
                            objDic.Add X(lngRow, lngCol), rngArea.Cells(1).Offset(lngRow - 1, lngCol - 1).Address
                        Else
                            If Not rng2 Is Nothing Then
                                Set rng2 = Union(rng2, rngArea.Cells(1).Offset(lngRow - 1, lngCol - 1))
                            Else
                                Set rng2 = rngArea.Cells(1).Offset(lngRow - 1, lngCol - 1)
                            End If
                        End If
                    End If
                Next lngCol
            Next lngRow
        Else
            If Len(rngArea.Value) > 0 Then
                If Not objDic.exists(rngArea.Value) Then
                    objDic.Add rngArea.Value, rngArea.Address
                Else
                    If Not rng2 Is Nothing Then
                        Set rng2 = Union(rng2, rngArea)
                    Else
                        Set rng2 = rngArea
                    End If
                End If
            End If
        End If
    Next rngArea

    'cleanup the Application settings
    With Application
        .ScreenUpdating = True
        .Calculation = lngCalc
        .EnableEvents = True
    End With

    If Not rng2 Is Nothing Then
        MsgBox "Duplicate range is " & rng2.Address
    Else
        MsgBox "No dupes"
    End If
End Sub
 
Upvote 0
Why don't you just request a code to make a list of the duplicates and/or their addresses, and then operate directly from that list?

Highlighted cells may be OK visually, but for further work they would very rarely be an efficient way to proceed.

Perhaps I should add that the codes I've provided and the conditional formatting approach don't seem to give quite the same results.

Say you have the number 37 repeated 3 times on your worksheet, my code as it stands highlights the 2nd and 3rd occurrences but not the first (although it can be trivially modified to highlight all 3) while the conditional formatting highlights all 3.

This is I guess is because of some ambiguity about just what is meant (on this Forum and elsewhere) by "duplicate". Apposite, there was a case that went to the Irish Supreme Court where 3 cyclists were riding 3 abreast when only 2 were allowed. The argument was about which one was the third.

Again quoting Kafka (getting to be a habit) "die gemeinsame Unsicherheit in dieser Hinsicht ist so groß, dass selbst die besten Schülerinnen und Schüler darin gezeichnet werden"
 
Upvote 0
Why don't you just request a code to make a list of the duplicates and/or their addresses, and then operate directly from that list?

Highlighted cells may be OK visually, but for further work they would very rarely be an efficient way to proceed.
How exactly would this be done? One caveat - I cannot alter/remove the data in any way. I can only analyze it and then annotate it in some fashion (hence, the highlight, cell comments, and/or msgbox) so the owner of the data can then fix it.
Perhaps I should add that the codes I've provided and the conditional formatting approach don't seem to give quite the same results.

Say you have the number 37 repeated 3 times on your worksheet, my code as it stands highlights the 2nd and 3rd occurrences but not the first (although it can be trivially modified to highlight all 3) while the conditional formatting highlights all 3.
Good catch! I completely misssed that one. I actually need the solution to work the way your code works, not the way Fazza's conditional formatting solution does.
This is I guess is because of some ambiguity about just what is meant (on this Forum and elsewhere) by "duplicate". Apposite, there was a case that went to the Irish Supreme Court where 3 cyclists were riding 3 abreast when only 2 were allowed. The argument was about which one was the third.
That's hilarious! :LOL:

@Brettdj:
Thanks again! I'm going to give your code a try, too. It's getting late here in the US, so it may be tomorrow.
 
Upvote 0

Forum statistics

Threads
1,217,380
Messages
6,136,227
Members
450,000
Latest member
jgp19

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top