How to find the most similar Word to a word? (Using Excel or Vba)

makiwara

Board Regular
Joined
Mar 8, 2018
Messages
171
INPUT:

List1 BASIS (FIX)

all
allow
dog
dumm
elephant
element


List2

alarm
allowance
dogtag
donut
duck
elemental


OUTPUT

List1 BASIS (FIX) List2

all allowance
allow allowance
dog dogtag
dumm duck
elephant elemental
element elemental

I tried to use the FUZZY LOOKUP, but it doesn't return the right values I expect.

Thank you for your help, have a very nice day! :)
 
Dear Eric! I tested today the code. And everything works fine but I discovered a little problem, maybe I couldn't explain everything correctly. I don't know whether you have the time or not, but I really appreciate your help. And I don't know how could I thank this.

List1 Accept matches List2 access, instead of acceptance

The code should choose a word which has the most matching caracter from the start of the word.

For example:
accept --> acceptance and not access
able --> ably and not ablaze (in this case the code works fine) (in both words there is a string "abl", but as you have mentioned, ably can be easier created from able, so this is the right choice.)

So your code is great, which make decisions based on the characters need to be changed to get the word. Could you combine this with a code, which first examines the words from the left, and if there are more words in list2 which contains the word from list1 (or a part of that, for example able --> ably and not ablaze (here works code fine), adapt-->adaptableand not adept (the program shows adept here) then continues your code that you have written? (based on Levenhstein distance, everything is perfect you have written in the code)

I really hope that you don't get angry because I couldn't describe exactly what i am looking for. But your code is great and I will use that, but it will be great, if you could add a little part to it. Have a very nice day




You're welcome! :cool:
 
Upvote 0

Excel Facts

What is the shortcut key for Format Selection?
Ctrl+1 (the number one) will open the Format dialog for whatever is selected.
I added a check so that if we compare a word, and it finds a word on the list that's exactly the same, but with additional letters, we'll take that.

Rich (BB code):
Public Function LevDist(str1 As String, str2 As String)
Dim d() As Long, m As Long, n As Long, i As Long, j As Long, Cost As Long

    m = Len(str1)
    n = Len(str2)
    ReDim d(0 To m, 0 To n)
    
    For i = 1 To m
        d(i, 0) = i
    Next i
    
    For j = 1 To n
        d(0, j) = j
    Next j
    
    For j = 1 To n
        For i = 1 To m
            Cost = IIf(Mid(str1, i, 1) = Mid(str2, j, 1), 0, 1)
            d(i, j) = WorksheetFunction.Min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + Cost)
        Next i
    Next j
    
    LevDist = d(m, n)
    
End Function


Public Function GetClosest(str1 As String, r1 As Range)
Dim d1 As Variant, MinDist As Long, i As Long, j As Long, LD As Long

    d1 = r1.Value
    MinDist = 99999
    GetClosest = ""
    
    For i = 1 To UBound(d1)
        For j = 1 To UBound(d1, 2)
            LD = LevDist(str1, CStr(d1(i, j)))
            If LD < MinDist Then
                MinDist = LD
                GetClosest = d1(i, j)
            End If
            If Left(CStr(d1(i, j)), Len(str1)) = str1 Then
                GetClosest = CStr(d1(i, j))
                Exit Function
            End If
        Next j
    Next i
        
End Function
I hope this works for you, since I don't have time to do much more. Good luck! :)
 
Upvote 0
makiwara, I don't see where you ever explicitly described "most similar" as Fazza requested. One way of describing similarity between strings is the Levenshtein distance. See:

Levenshtein distance - Wikipedia

The distance is roughly defined as how many changes are needed to change one string into the other, adds, deletes, changes. I converted the algorithm on that web page to VBA. Then I wrote another function that compares one word to every word on another list, and returns the word with the smallest Levenshtein distance.

To try it, open a copy of your workbook. Press Alt-F11 to open the VBA editor. From the menu, select Insert > Module. Paste the following code into the window that opens:

Code:
Public Function LevDist(str1 As String, str2 As String)
Dim d() As Long, m As Long, n As Long, i As Long, j As Long, Cost As Long

    m = Len(str1)
    n = Len(str2)
    ReDim d(0 To m, 0 To n)
   
    For i = 1 To m
        d(i, 0) = i
    Next i
   
    For j = 1 To n
        d(0, j) = j
    Next j
   
    For j = 1 To n
        For i = 1 To m
            Cost = IIf(Mid(str1, i, 1) = Mid(str2, j, 1), 0, 1)
            d(i, j) = WorksheetFunction.Min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + Cost)
        Next i
    Next j
   
    LevDist = d(m, n)
   
End Function


Public Function GetClosest(str1 As String, r1 As Range)
Dim d1 As Variant, MinDist As Long, i As Long, j As Long, LD As Long

    d1 = r1.Value
    MinDist = 99999
    GetClosest = ""
   
    For i = 1 To UBound(d1)
        For j = 1 To UBound(d1, 2)
            LD = LevDist(str1, CStr(d1(i, j)))
            If LD < MinDist Then
                MinDist = LD
                GetClosest = d1(i, j)
            End If
        Next j
    Next i
       
End Function
Press Alt-Q to close the editor. Then enter the function like this on your worksheet:


ABCDEFGHI
1List1List2List1List2
2a
bandon
abandonedlyabandonedlyabandonallalarmalarmall
3abilityablyabasedlyabandonallowalarmallowanceallow
4ableablyabashedlyabandondogdogtagdogtagdog
5aboutaboutabatedlyabandondummduckdonutdog
6aboutaboutabatinglyabandonelephantelementalduckdumm
7aboveaboutabbreviatedlyabsolutelyelementelementalelementalelement
8aboveaboutaberrantlyabandon
9ab
road
aboutabeyantlyabandon
10absenceabsentlyabhorrentlyabsolutely
11absoluteabsolutelyabidinglyability>
12absolutelyabsolutelyabjectlyability
13absorbaboutablyable
14abuseablyabnormallyabsolutely
15academicacademicallyabominablyability
16acceptaboutabortedlyabsolutely
17acceptableacceptinglyaboutabout
18accessaccessiblyabra
sively
absolutely
19accidentaccidentallyabruptlyability
20accommodationaccommodatinglyabsentlyabsence
21accompanyabnormallyabsentmindedlyabsence
22according toabsorbinglyabsent-mindedlyabsence
23accountaboutabsolutelyabsolutely
24accountaboutabsorbantlyabsolutely
25accurateaboutabsorbedlyabsolutely
26accuseaboutabsorbinglyabsolutely
27achieveacapellaabstainedlyabsolutely
28achievementabhorrentlyabstentiouslyabsolutely
29acidablyabstinentlyability
30acknowledgeabandonedlyabstractedlyabsolutely
31acquireaboutabstractlyabsolutely
32d>acrossaboutabstruselyabsolutely
33actablyabsurdlyabsolutely
34actablyabundantlyabandon
35actionablyabusedlyabuse
36activeablyabusivelyabuse<
/td>
37activistabatinglyabysmallyactually
38activityabatinglyacademicallyacademic
39actorablyacapellaacademic
40actualaboutacceptinglyacceptable
41actuallyabruptlyaccessiblyacceptable
42adablyaccidentallyaccident
43adaptaboutacclamatorilyacceptable
44addablyaccommodatinglyaccommodation

<tbody>
</tbody>
Sheet1

Worksheet
Formulas

CellFormula
B2=getclosest(A2,$C$2:$C$44)

<tbody>
</tbody>

<tbody>
</tbody>



Note that this does NOT return all the results you expect. For example, on your sheet you show "ability" returning "abasedly", while on my sheet I get "ably". The Levenshtein distance between "ability" and "abasedly" is 5 (change "ilit" to "ased", 4 changes, and insert "l"). The Levenshtein distance between "ability" and "ably" is 3 (delete "i" and "it").

So there you go. If this does not return the results you want, you'll need to describe how to get those results.
This is terrific! Exactly what I was looking for. Is there a way to edit the "getclosest" function to show the closest with a Lev. Dist. of greater than 0? In other words, if there is an exact match in the list, exclude it and go to the next closest?
 
Upvote 0
Welcome to the forum!

Sure, change this line:

VBA Code:
            If LD < MinDist Then

to

VBA Code:
            If LD < MinDist And LD > 0 Then

It's up to you whether you want to keep the code in red from post 12, or if you want to go back to the earlier version.
 
Upvote 0
Welcome to the forum!

Sure, change this line:

VBA Code:
            If LD < MinDist Then

to

VBA Code:
            If LD < MinDist And LD > 0 Then

It's up to you whether you want to keep the code in red from post 12, or if you want to go back to the earlier version.
You're awesome! Thank you so much!
 
Upvote 0
Dear Eric - Great code! This works great, Thanks!!

Any way to modify the LevDist function to compute a similarity score instead of the actual Levenshtein distance.

So:

1 - LevDist(m, n) / max character length between A and B
 
Upvote 0
Hi banamwana, Welcome to the forum!

You could easily do that as part of the formula:

Excel Formula:
=LevDist(A1,B1)/MAX(LEN(A1),LEN(B1))
 
Upvote 0
Thanks, that works, but given some of the other xls formulas already wrapped around, that's making things a little unwieldy. I had been hoping to implement the similarity score in the VBA function.

I'm really unfamiliar with VBA code, but hoping to use this function to help a global Public Health agency strengthen its disease surveillance by matching up suspect cases with lab reports.
 
Upvote 0
OK, here's a minor tweak to the code:

VBA Code:
Public Function LevDist(str1 As String, str2 As String, Optional type1 = 0)
Dim d() As Long, m As Long, n As Long, i As Long, j As Long, Cost As Long

    m = Len(str1)
    n = Len(str2)
    ReDim d(0 To m, 0 To n)
    
    For i = 1 To m
        d(i, 0) = i
    Next i
    
    For j = 1 To n
        d(0, j) = j
    Next j
    
    For j = 1 To n
        For i = 1 To m
            Cost = IIf(Mid(str1, i, 1) = Mid(str2, j, 1), 0, 1)
            d(i, j) = WorksheetFunction.Min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + Cost)
        Next i
    Next j
    
    LevDist = d(m, n)
    If type1 > 0 Then LevDist = 1 - (LevDist / WorksheetFunction.Max(m, n))
    
End Function

I added an optional code to the parameters to return the percent match instead of the distance. Add a 1 as a third parameter like so:

Book1
IJ
1
2bananas86%
3banana
4
5John Doe63%
6Jane Doe
7
8Jim Smith57%
9James A. Smith
Sheet15
Cell Formulas
RangeFormula
J2,J8,J5J2=levdist(I2,I3,1)




However, here are a few thoughts. I don't answer a lot of fuzzy matching questions anymore, it's just too hard to do well. Too many false positives or missed positives. I actually do some of this in my fulltime job, and the users are constantly asking me to make it better, but there's very little room for improvement. The Jim Smith example is especially telling. A human would instantly recognize that as a 99% likely match, but this algorithm only 57%. There are much better algorithms. Google "java fuzzy string matching" and you can find a java (or Python) tool that figures out a percentage match based on similarities other than just the Levenshtein distance. Sadly, I can't help much with that though.

But maybe this might work for your situation. I hope so! Good luck!
 
Upvote 0

Forum statistics

Threads
1,215,172
Messages
6,123,447
Members
449,100
Latest member
sktz

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top