Compare Data

Huey72

New Member
Joined
Nov 6, 2019
Messages
32
Hi everyone, I am migrating data between information systems, and want to compare the source dataset from the legacy system to the migrated dataset in the new system to ensure they match.

To do this, I've imported both datasets into the same workbook, into their respective worksheets, named Source and Migrated. The dataset in both sheets is large: 300 columns x 600,000 rows.

For smaller datasets, I have a macros which will loop through each cell in the data and return a list of mismatch results e.g., where values don't match, the result list would return a list containing values like: Cell: D75 Source value: 1 <> Migrated value: 2, etc.

The problem is to loop through each cell on a dataset this size takes a really long time. I looked at using Access, and it has a 255 column limit that I prefer not to work around. I'm not familiar with PowerQuery enough to do this in a way that will return the clear result. I think this can be done using the scripting dictionary in a much faster way than looping through cells, but I can't get the syntax right.

The following code is sort of doing what I want it to do, but it's currently returning the complete record from each dataset that doesn't match, rather than only the specific value. Rather than returning two records for the mismatch, I'd prefer to have one record for each specific difference e.g., Cell: D75 Source value: 1 <> Migrated value: 2, Cell: JK 525,431 Source value: A <> Migrated value: B, etc.. To do this, I think the code may need to be changed to nest the second dictionary within the first, to then iterate through the items of each dictionary, compare and return the mismatch.

I'm new to using the scripting dictionary and would like to learn this - any guidance on this would be greatly appreciated. Thanks!


VBA Code:
Sub CompareData()

Dim vstr As String
Dim vData As Variant
Dim vitm As Variant
Dim vArr As Variant
Dim v()
Dim currentSht As String
Dim shtResults As String

Dim a As Long
Dim b As Long
Dim c As Long

Application.ScreenUpdating = False

shtResults = "Results"

'If Sheet doesn't exist, create it
If SheetExists(shtResults) = True Then
    Application.DisplayAlerts = False
    Sheets(shtResults).Delete
    Application.DisplayAlerts = True
    Sheets.Add
    ActiveSheet.Name = shtResults
Else
    Sheets.Add
    ActiveSheet.Name = shtResults
End If

vData = Sheets("Source").Range("A1:M11").Value
currentSht = "Source"

With CreateObject("Scripting.Dictionary")
        .CompareMode = 1
        
        ReDim v(1 To UBound(vData, 2))
        
        For a = 2 To UBound(vData, 1)
            
            For b = 1 To UBound(vData, 2)
                vstr = vstr & Chr(2) & vData(a, b)
                'Debug.Print vstr
                v(b) = "[Cell: " & a & ", " & b & " -- " & currentSht & "] " & vData(a, b)
            Next
            
            .Item(vstr) = v
            vstr = ""
            
        Next
        
        vData = Sheets("Migrated").Range("A1:M11").Value
        currentSht = "Migrated"

        For a = 2 To UBound(vData, 1)
            
            For b = 1 To UBound(vData, 2)
                vstr = vstr & Chr(2) & vData(a, b)
                v(b) = "[Cell: " & a & ", " & b & " -- " & currentSht & "] " & vData(a, b)
            Next
        
            If .Exists(vstr) Then
                .Item(vstr) = Empty
                Else
                .Item(vstr) = v
            End If
            
            vstr = ""
        Next
        
        For Each vitm In .Keys
            If IsEmpty(.Item(vitm)) Then
            .Remove vitm
            End If
        Next
            
        vArr = .Items
        c = .Count

End With

With Sheets(shtResults).Range("A1").Resize(, UBound(vData, 2))

    .Cells.Clear
    .Value = vData
    
    If c > 0 Then
        .Offset(1).Resize(c).Value = Application.Transpose(Application.Transpose(vArr))
        
    End If
    
End With

Application.ScreenUpdating = True

End Sub
 

Excel Facts

Highlight Duplicates
Home, Conditional Formatting, Highlight Cells, Duplicate records, OK to add pink formatting to any duplicates in selected range.
I suspect that the dictionary is the least of your problems. Depending on your data you will probably not be able to pull 600,000 rows by 300 columns into an array.
 
Upvote 0
Thanks Fluff, that's something I hadn't considered. Based on tests from today, I was able to run the code I listed above on a dataset of 400,000 rows by 387 columns successfully, it took 34 min to complete...to your point, it did fail when I expanded to 500,000 rows.

Based on this, I think we could split the range in half and do a loop i.e., set the first range to be run on rows 1-300,000, and loop again for rows 300,001 to 600,00, or something of the sort. Might that work around the limitation or array size and still provide better performance than looping through each cell?

I'm interested in trying it, but could use advice on how to nest the dictionaries (if that's the right approach), since I can see uses for learning how to do this for other applications, and even better if can apply the loop and also use it for this case. What do you think?
 
Upvote 0
Are you trying to do a direct cell to cell comparison, ie Z10 to Z10 or is it something else?
 
Upvote 0
Yes, it's exactly that. I'm expecting the data cell by cell to match exactly. I typically do it by looping through each cell in the dataset, but because the dataset is so large, I'm expecting the run time to be unreasonable, so looking for a faster method. For every mismatch found, I'm hoping to return a list of discrepancies, with each discrepancy in the list to have a description like this: Cell: Z10 Source value: 1 <> Migrated value: 2. Thanks again!
 
Upvote 0
In that case there is no real need for a dictionary, I'd do it like
VBA Code:
Sub huey()
   Dim Mary As Variant, Sary As Variant, Rary(1 To 1000000, 1 To 1) As Variant
   Dim r As Long, c As Long, nr As Long
   
   Rary = Sheets("Migrated").Range("A1:M11").Value2
   Sary = Sheets("Source").Range("A1:M11").Value2
   
   For r = 1 To UBound(Mary)
      For c = 1 To UBound(Mary, 2)
         If Mary(r, c) <> Sary(r, c) Then
            nr = nr + 1
             Rary(nr, 1) = "Cell: " & Cells(r, c).Address & " Migrated = " & Mary(r, c) & " Source = " & Sary(r, b)
         End If
      Next c
   Next r
   Sheets("Results").Resize(nr).Value = Rary
End Sub
 
Upvote 0
Solution
Amazing, thank you! This is a big help, I was spinning on how to approach it - thanks so much for clarifying. I'll test today and let you know.
 
Upvote 0
You're welcome & thanks for the feedback.
 
Upvote 0
Hi Fluff, wanted to confirm with you your code works like a charm, thanks again! I've added two pieces of code, and wouldn't mind your thoughts to see if I approached properly, here's what I've done:
  1. I added an array to capture the header in the sheet, and include it in the output - it's useful information, in addition to the cell address and the mismatch values to help troubleshoot any mismatch results. I also added it in its own column in the output to enable sort/group/pivot to see if multiple errors on particular fields. I think I did that properly?
  2. To work around the array size limit you mentioned ( I get an out of memory error somewhere between 400,000 and 500,000 rows), I added a loop to step through groups of 400,000 rows, and then release the memory after each group. It runs on 531,000 rows x 387 columns in under 10 min.
On first glance at results, everything looks good with what I've added, but being new to arrays, would appreciate any feedback for what I should have done differently or if ok? Thanks for you help with this!

VBA Code:
Sub huey() 
Dim Mary As Variant, Sary As Variant, Hary As Variant, Rary(1 To 1000000, 1 To 2) As Variant
Dim r As Long, c As Long, nr As Long, myRow As Long, lastRow  As Long

lastRow = Sheets("Source").Cells.Find("*", searchorder:=xlByRows, searchdirection:=xlPrevious).Row

With Sheets("Results")
    .Range("A1").Value = "Field"
    .Range("B1").Value = "Mismatch"
End With
    
For myRow = 1 To lastRow Step 400000

    Sary = Sheets("Source").Range("A" & myRow & ":NW" & myRow + 399999 & "").Value2
    Mary = Sheets("Migrated").Range("A" & myRow & ":NW" & myRow + 399999 & "").Value2
    Hary = Sheets("Source").Range("A1:NW1").Value2
    
     For r = 1 To UBound(Mary)
        For c = 1 To UBound(Mary, 2)
           If Mary(r, c) <> Sary(r, c) Then
              nr = nr + 1
              Rary(nr, 1) = Hary(1, c)
              Rary(nr, 2) = "Cell: " & Cells(r, c).Address & "  Migrated value: " & Mary(r, c) & "  <>  " & "Source value: " & Sary(r, c)
           End If
        Next c
     Next r

    ThisWorkbook.Sheets("Results").Cells(Rows.Count, 1).End(xlUp).Offset(1, 0).Resize(nr, 2).Value = Rary

    '//Release memory from array before it is reset, otherwise will reach memory limits and fail on large datasets
    Erase Sary
    Erase Mary
    Erase Rary
    nr = 0

Next myRow

End Sub
 
Upvote 0

Forum statistics

Threads
1,214,583
Messages
6,120,380
Members
448,955
Latest member
BatCoder

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top