Is there a way to VBA coding to detect and replace "Fuzzy" duplicates and typos with the text of the 1st match occurrence?

LydiaA

New Member
Joined
Nov 13, 2020
Messages
3
Office Version
  1. 2010
Platform
  1. Windows
Essentially I have a list of customer names and products sold, the issue is that the data source uses manually input Customer Names which often results in slight variances (typos, middle initials excluded/included, etc) This makes finding a true count of total products sold to each customer impossible. I am looking for a way to find and replace the near duplicates with their 1st near duplicate occurrence in an adjacent column so that I can use that new name column to determine the count of total customers. Please see the example below...

A​
B​
C​
D​
E​
DateMonthNameNew NameProduct
9/18/2020September360 THRIFT 3 LLC361 THRIFT 3 LLCReferred Merchant Services
10/7/2020October360 THRIFT LLC - missing a "3"362 THRIFT 3 LLCOther Products
9/17/2020SeptemberA & R NATURELLES, INC.A & R NATURELLES, INC.Business DDA
9/17/2020SeptemberA & R NATURELLES, INC - missing the "." after INCA & R NATURELLES, INC.Business DDA
10/27/2020OctoberA NATIONAL DELIVERER OF EVERYTHING LLC A NATIONAL DELIVERER OF EVERYTHING LLCOther Products
10/1/2020OctoberA NATIONAL DELIVERER OF EVERYTHING LLC D - added an extra "D"A NATIONAL DELIVERER OF EVERYTHING LLCBusiness Money Market
9/22/2020SeptemberA TO Z THERAPIES LLCA TO Z THERAPIES LLCOther Products
9/1/2020SeptemberA+ CREDIT AND DEBT HELP ASSOCIATIONA+ CREDIT AND DEBT HELP ASSOCIATIONOther Products
8/25/2020AugustA+ CREDIT AND DEBT HELP ASSOCIATIONA+ CREDIT AND DEBT HELP ASSOCIATIONBusiness DDA
8/25/2020AugustA& CREDIT AND DEBT HELP ASSOCIATION - typo, entered A"&" instead of A+A+ CREDIT AND DEBT HELP ASSOCIATIONBusiness Money Market
8/24/2020AugustAARON NICHOLAS AMARALAARON NICHOLAS AMARALRetail DDA
10/29/2020OctoberAARON R VEGAAARON R VEGARetail DDA
8/7/2020AugustAARON VEGA - missing "R"AARON R VEGARetail DDA

Using column C in the above data set, my Customer count totals: 12 unique customers
Whereas using the column D it totals: 8 unique customers (which is the correct total)

Unfortunately I do not have the Ablebits add-in to detect fuzzy duplicates etc
Is there a VBA code that can populate column D with the 1st occurrence of a near duplicate from column C?

Thank you kindly! :)
 

Excel Facts

Test for Multiple Conditions in IF?
Use AND(test, test, test, test) or OR(test, test, test, ...) as the logical_test argument of IF.
If you enter this search term "excel 365 vba fuzzy match" into your favourite search engine you should find something that suits your needs.
 
Upvote 0
Essentially I have a list of customer names and products sold, the issue is that the data source uses manually input Customer Names which often results in slight variances (typos, middle initials excluded/included, etc) This makes finding a true count of total products sold to each customer impossible. I am looking for a way to find and replace the near duplicates with their 1st near duplicate occurrence in an adjacent column so that I can use that new name column to determine the count of total customers. Please see the example below...

A​
B​
C​
D​
E​
DateMonthNameNew NameProduct
9/18/2020September360 THRIFT 3 LLC361 THRIFT 3 LLCReferred Merchant Services
10/7/2020October360 THRIFT LLC - missing a "3"362 THRIFT 3 LLCOther Products
9/17/2020SeptemberA & R NATURELLES, INC.A & R NATURELLES, INC.Business DDA
9/17/2020SeptemberA & R NATURELLES, INC - missing the "." after INCA & R NATURELLES, INC.Business DDA
10/27/2020OctoberA NATIONAL DELIVERER OF EVERYTHING LLCA NATIONAL DELIVERER OF EVERYTHING LLCOther Products
10/1/2020OctoberA NATIONAL DELIVERER OF EVERYTHING LLC D - added an extra "D"A NATIONAL DELIVERER OF EVERYTHING LLCBusiness Money Market
9/22/2020SeptemberA TO Z THERAPIES LLCA TO Z THERAPIES LLCOther Products
9/1/2020SeptemberA+ CREDIT AND DEBT HELP ASSOCIATIONA+ CREDIT AND DEBT HELP ASSOCIATIONOther Products
8/25/2020AugustA+ CREDIT AND DEBT HELP ASSOCIATIONA+ CREDIT AND DEBT HELP ASSOCIATIONBusiness DDA
8/25/2020AugustA& CREDIT AND DEBT HELP ASSOCIATION - typo, entered A"&" instead of A+A+ CREDIT AND DEBT HELP ASSOCIATIONBusiness Money Market
8/24/2020AugustAARON NICHOLAS AMARALAARON NICHOLAS AMARALRetail DDA
10/29/2020OctoberAARON R VEGAAARON R VEGARetail DDA
8/7/2020AugustAARON VEGA - missing "R"AARON R VEGARetail DDA

Using column C in the above data set, my Customer count totals: 12 unique customers
Whereas using the column D it totals: 8 unique customers (which is the correct total)

Unfortunately I do not have the Ablebits add-in to detect fuzzy duplicates etc
Is there a VBA code that can populate column D with the 1st occurrence of a near duplicate from column C?

Thank you kindly! :)
Fuzzy Function.xlsm
ABCDE
1DateMonthNameNew NameProduct
29/18/2020September360 THRIFT 3 LLC360 THRIFT 3 LLCReferred Merchant Services
310/07/2020October360 THRIFT LLC360 THRIFT 3 LLCOther Products
49/17/2020SeptemberA & R NATURELLES, INC.A & R NATURELLES, INC.Business DDA
59/17/2020SeptemberA & R NATURELLES, INCA & R NATURELLES, INC.Business DDA
610/27/2020OctoberA NATIONAL DELIVERER OF EVERYTHING LLCA NATIONAL DELIVERER OF EVERYTHING LLCOther Products
710/01/2020OctoberA NATIONAL DELIVERER OF EVERYTHING LLC DA NATIONAL DELIVERER OF EVERYTHING LLCBusiness Money Market
89/22/2020SeptemberA TO Z THERAPIES LLCA TO Z THERAPIES LLCOther Products
909/01/2020SeptemberA+ CREDIT AND DEBT HELP ASSOCIATIONA+ CREDIT AND DEBT HELP ASSOCIATIONOther Products
108/25/2020AugustA+ CREDIT AND DEBT HELP ASSOCIATIONA+ CREDIT AND DEBT HELP ASSOCIATIONBusiness DDA
118/25/2020AugustA& CREDIT AND DEBT HELP ASSOCIATIONA+ CREDIT AND DEBT HELP ASSOCIATIONBusiness Money Market
128/24/2020AugustAARON NICHOLAS AMARALAARON NICHOLAS AMARALRetail DDA
1310/29/2020OctoberAARON R VEGAAARON R VEGARetail DDA
1408/07/2020AugustAARON VEGAAARON R VEGARetail DDA
Sheet2
Cell Formulas
RangeFormula
D2:D14D2=GetFirstMatch(C2,C$2:C2,0.8)

VBA Code:
Option Explicit

Function GetFirstMatch(ByVal String1 As String, ByRef Rangex As Range, Optional MinPercent As Single = 0.8) As String

Dim sResult As String
Dim sngMatchPercent As Single

Dim rCur As Range

For Each rCur In Rangex
    sResult = CStr(rCur.Value)
    sngMatchPercent = GetLevenshteinPercentMatch(String1:=String1, _
                                                String2:=sResult, _
                                                Normalised:=True)
    If sngMatchPercent >= MinPercent Then
        GetFirstMatch = sResult
        Exit Function
    End If
Next rCur
GetFirstMatch = String1
End Function

Public Function GetLevenshteinPercentMatch(ByVal String1 As String, _
                                            ByVal String2 As String, _
                                            Optional Normalised As Boolean = True) As Single
Dim iLen As Integer
If Normalised = False Then
    String1 = UCase$(WorksheetFunction.Trim(String1))
    String2 = UCase$(WorksheetFunction.Trim(String2))
End If
iLen = WorksheetFunction.Max(Len(String1), Len(String2))
GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(String1, String2)) / iLen
End Function

Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
'********************************
'*** Compute Levenshtein Distance
'********************************

Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim N As Integer ' length of s
Dim I As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

  ' Step 1

  N = Len(s)
  m = Len(t)
  If N = 0 Then
    LevenshteinDistance = m
    Exit Function
  End If
  If m = 0 Then
    LevenshteinDistance = N
    Exit Function
  End If
  ReDim d(0 To N, 0 To m) As Integer

  ' Step 2

  For I = 0 To N
    d(I, 0) = I
  Next I

  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For I = 1 To N
    s_i = Mid$(s, I, 1)
    ' Step 4
    For j = 1 To m
      t_j = Mid$(t, j, 1)
      
      ' Step 5
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      ' Step 6

      d(I, j) = WorksheetFunction.Min(d(I - 1, j) + 1, d(I, j - 1) + 1, d(I - 1, j - 1) + cost)
    Next j
  Next I

  ' Step 7
  LevenshteinDistance = d(N, m)
End Function
 
Upvote 0
On reviewing, this code is more robust:
VBA Code:
Option Explicit

Function GetFirstMatch(ByVal String1 As String, ByRef Rangex As Range, Optional MinPercent As Single = 0.8) As String

Dim sResult As String
Dim sCompare As String
Dim sCurCompare As String

Dim sngMatchPercent As Single

Dim rCur As Range

sCompare = LCase$(WorksheetFunction.Trim(String1))      '** Remove leading, trailing and intermediate multiple spaces
For Each rCur In Rangex
    sResult = Trim$(CStr(rCur.Value))
    sCurCompare = LCase$(WorksheetFunction.Trim(sResult))
    sngMatchPercent = GetLevenshteinPercentMatch(String1:=sCompare, _
                                                String2:=sCurCompare)
    If sngMatchPercent >= MinPercent Then
        GetFirstMatch = sResult
        Exit Function
    End If
Next rCur
GetFirstMatch = String1
End Function

Public Function GetLevenshteinPercentMatch(ByVal String1 As String, _
                                            ByVal String2 As String) As Single
Dim iLen As Integer

iLen = WorksheetFunction.Max(Len(String1), Len(String2))
GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(String1, String2)) / iLen
End Function

Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
'********************************
'*** Compute Levenshtein Distance
'********************************

Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim N As Integer ' length of s
Dim I As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

  ' Step 1

  N = Len(s)
  m = Len(t)
  If N = 0 Then
    LevenshteinDistance = m
    Exit Function
  End If
  If m = 0 Then
    LevenshteinDistance = N
    Exit Function
  End If
  ReDim d(0 To N, 0 To m) As Integer

  ' Step 2

  For I = 0 To N
    d(I, 0) = I
  Next I

  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For I = 1 To N
    s_i = Mid$(s, I, 1)
    ' Step 4
    For j = 1 To m
      t_j = Mid$(t, j, 1)
      
      ' Step 5
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      ' Step 6

      d(I, j) = WorksheetFunction.Min(d(I - 1, j) + 1, d(I, j - 1) + 1, d(I - 1, j - 1) + cost)
    Next j
  Next I

  ' Step 7
  LevenshteinDistance = d(N, m)
End Function
 
Upvote 0

Forum statistics

Threads
1,214,896
Messages
6,122,132
Members
449,066
Latest member
Andyg666

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top