
Thread: Fast way to Identify Duplicates (Huge data set)

  1. #1
    MarkCBB, Board Regular (Johannesburg, South Africa; joined Apr 2010; 497 posts)

    Fast way to Identify Duplicates (Huge data set)

    Hi there,

    I have 100,000,000 (one hundred million) alphanumeric codes: 100 columns with 1 million codes in each column.

    I have tried WorksheetFunction.CountIf(range, cell.Value) > 1, but this takes about 3-4 seconds per cell, thus over 5-6 years to calculate the whole range.

    Does anyone have ideas on how to do this efficiently, say in under 24 hours?
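    A COUNTIF per cell rescans the whole range every time, so the work grows quadratically with the number of codes. A single hash-based pass touches each value only once. As an illustrative sketch outside Excel (plain Python, made-up sample codes), the idea looks like:

    ```python
    # Set-based duplicate detection: one pass over the data,
    # O(1) average-cost membership test per code.
    codes = ["A1X9", "B2Y8", "A1X9", "C3Z7", "B2Y8", "D4W6"]

    seen = set()
    duplicates = set()
    for code in codes:
        if code in seen:
            duplicates.add(code)   # second (or later) occurrence
        else:
            seen.add(code)         # first occurrence

    print(sorted(duplicates))  # -> ['A1X9', 'B2Y8']
    ```

    The same single-pass idea is what makes the "read it all into an array" approach discussed later in this thread attractive.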

  2. #2
    Board Regular (joined Sep 2016; 2,206 posts)

    Re: Fast way to Identify Duplicates (Huge data set)

    What do you want to do with the duplicates?
    If you want to remove them, you could try Excel's built-in Remove Duplicates tool.

  3. #3
    MarkCBB, Board Regular (Johannesburg, South Africa; joined Apr 2010; 497 posts)

    Re: Fast way to Identify Duplicates (Huge data set)

    I just need to remove them. The built-in tool works row by row, which won't help here: my data isn't really a row-by-row database, it's cell by cell.
    So if I remove the duplicates from column A, there could still be a duplicate in columns B, C, D, F, etc.
    I need to be able to scan all 100 million cells for duplicates.

  4. #4
    Board Regular (joined Sep 2016; 2,206 posts)

    Re: Fast way to Identify Duplicates (Huge data set)

    Why not try it column by column first?
    Then cut and paste to reduce the number of columns.
    Repeat until there is no space left to cut and paste and Remove Duplicates no longer reduces the data.
    Assuming more than one column is left, maybe by then the size will be manageable enough for other methods.

    I wonder how long Remove Duplicates takes on 1 million rows.

  5. #5
    Board Regular (joined Sep 2016; 2,206 posts)

    Re: Fast way to Identify Duplicates (Huge data set)

    One possibility is to write a macro to split the data into 200 columns and, for each combination of two columns (about 20,000 pairs), loop through running Remove Duplicates on the pair.
    If each step in the loop takes less than about 4.3 seconds, the total time should be under 24 hours.
    I think the only way to know is to test it on some of the data.
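    The pair count and the per-step time budget above can be checked quickly (the 200-column split and 24-hour target are the numbers from this post):

    ```python
    from math import comb

    # Unordered pairs of 200 columns: 200 choose 2.
    pairs = comb(200, 2)
    print(pairs)                 # -> 19900, roughly the "about 20,000" quoted

    # Seconds available per pair to finish within 24 hours.
    budget = 24 * 3600 / pairs
    print(round(budget, 2))      # -> 4.34, matching the ~4.3 s estimate
    ```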

  6. #6
    CsJHUN, Board Regular (Hungary, Budapest; joined Jan 2015; 245 posts)

    Re: Fast way to Identify Duplicates (Huge data set)

    Quote Originally Posted by MarkCBB View Post
    I just need to remove them. The built-in tool works row by row, which won't help here: my data isn't really a row-by-row database, it's cell by cell.
    So if I remove the duplicates from column A, there could still be a duplicate in columns B, C, D, F, etc.
    I need to be able to scan all 100 million cells for duplicates.
    Hi, if I get this right: you want to remove all the duplicated cells in your sheet, not duplicated rows or columns.
    This macro removes the duplicated cells in each column, moves the column data below each other, then removes the duplicated cells again.
    Please try this on a COPY of the original sheet, and make sure there are no merged cells.

    Code:
    Sub mrexcel_markcbb_0327()
    
    Dim lc As Long, lr As Long, k As Long, i As Long
    
    Application.ScreenUpdating = False
    
    ''' get rid of duplicates in each column
    lc = ActiveSheet.Cells(1, Columns.Count).End(xlToLeft).Column
    Do
    i = i + 1
    lr = ActiveSheet.Cells(Rows.Count, i).End(xlUp).Row
    ActiveSheet.Range(Cells(1, i), Cells(lr, i)).RemoveDuplicates Columns:=1, Header:=xlNo
    Loop Until i = lc
    
    ''' move the columns below each other in column A
    i = 0
    Do
    lr = ActiveSheet.Cells(Rows.Count, 1).End(xlUp).Row
    i = i + 1
    k = ActiveSheet.Cells(Rows.Count, i + 1).End(xlUp).Row
    Range(Cells(1, i + 1), Cells(k, i + 1)).Cut
    Cells(lr + 1, 1).Select
    ActiveSheet.Paste
    Loop Until i = lc - 1    'column i + 1 is cut, so stop after moving column lc
    
    ''' redo the RemoveDuplicates on the combined column
    lr = ActiveSheet.Range("A" & Rows.Count).End(xlUp).Row
    ActiveSheet.Range("A1:A" & lr).RemoveDuplicates Columns:=1, Header:=xlNo
    
    Application.ScreenUpdating = True
    MsgBox "done"
    End Sub
    I hope it helps, good luck.
    Last edited by CsJHUN; Mar 27th, 2017 at 06:24 AM. Reason: screenupdate = off
    Answer:= 42, but please be aware of possible different locale settings (in Control Panel).
    Make a backup of your original data before testing any VBA; it can't be undone.
    I'm still learning with you, don't be hard on me. Playing on Excel '10 and Win7 x64.

  7. #7
    Board Regular (joined Sep 2016; 2,206 posts)

    Re: Fast way to Identify Duplicates (Huge data set)

    CsJHUN:
    The problem is that each column has about 1 million rows of data, so column 1 (capped at Excel's 1,048,576-row sheet limit) will probably soon run out of space to receive more data.

  8. #8
    CsJHUN, Board Regular (Hungary, Budapest; joined Jan 2015; 245 posts)

    Re: Fast way to Identify Duplicates (Huge data set)

    I understand and am aware of this problem, but we don't know how much duplicated data there is.
    The macro will raise an error if it cannot cut and paste the data into column A.

    The other way would be to read all the data into an array... I'm currently working on that, but I don't know too much about arrays yet.
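    The in-memory approach sketched above avoids repeated sheet operations entirely: flatten the columns into one sequence and keep only first occurrences. A hedged illustration in Python (small made-up lists standing in for the sheet's columns):

    ```python
    # Flatten the columns and keep only the first occurrence of each code,
    # preserving order - the in-memory analogue of the cut-and-paste macro.
    columns = [
        ["A1", "B2", "A1"],   # column 1
        ["C3", "B2"],         # column 2
        ["D4", "C3", "A1"],   # column 3
    ]

    seen = set()
    unique = []
    for col in columns:
        for code in col:
            if code not in seen:
                seen.add(code)
                unique.append(code)

    print(unique)  # -> ['A1', 'B2', 'C3', 'D4']
    ```

    In VBA the same idea would use a Scripting.Dictionary over a Variant array read from the range, which sidesteps the 1,048,576-row column limit discussed above.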

  9. #9
    Peter_SSs, MrExcel MVP, Moderator (Macksville, Australia; joined May 2005; 39,410 posts)

    Re: Fast way to Identify Duplicates (Huge data set)

    @MarkCBB
    Could we have a small set of dummy data (say 10-15 rows and 4 columns) to show/explain the issue, and also the corresponding required result for that small sample?
    Hope this helps, good luck.
    Peter
    Excel 365 - Windows 10
    - Want to help your helpers by posting a small, copyable, screen shot directly in your post? Look here
    - If posting vba code, please use Code Tags - like this [code]Paste code here[/code] - or use the VBHTML Maker
    - Read: Forum Rules, Forum Use Guidelines, & FAQ

  10. #10
    Board Regular (joined Sep 2016; 2,206 posts)

    Re: Fast way to Identify Duplicates (Huge data set)

    Quote Originally Posted by CsJHUN View Post
    I understand and am aware of this problem, but we don't know how much duplicated data there is.
    The macro will raise an error if it cannot cut and paste the data into column A.
    After the first removal of duplicates, if the combined total rows of columns A and B have not dropped from 2M to 1M, there will not be enough room in column A; and there are 98 other columns, each starting with 1M rows, still to be accounted for.
