Fuzzy Match Challenge
Back in 1999, I used to publish a monthly challenge. It would be an insane problem and I would offer a MrExcel.com Coffee Mug to whoever came up with the best answer. We brought the challenge back in 2001 to solve this fuzzy match problem. Read about it below.
So - as a great big experiment, I am proud to announce the return of the Challenge. I'll explain the problem below. This is a tough challenge, so I will upgrade the prize offerings. Everyone who contributes and makes a significant contribution towards moving this challenge along to a solution will win an official MrExcel.com Olympic Winter Games Salt Lake 2002 Calendar. The calendar is packed with Winter Olympic history, schedules, and cool stickers for the kids. Whoever makes the most significant contribution to the answer with receive one of the super-deluxe embroidered MrExcel.com long sleeve denim shirts.
Here is the problem: Build in VBA a routine that will calculate a "fuzzy match" between two text strings. This routine will allow us to say that one string is a 75% match to the other string. Here are some examples:
"Mr. Excel.com Consulting"
There are 11 characters which match and are in order between these two strings. We'll divide the 11 by the length of string1, 11/15 = 73% match.
"I B M"
This has 3 characters that match, divided by 5 in the top string, for a 60% match.
The characters that match are A-space-S-h-u-l-m-a-n. That is 9 characters out of 11, for a 82% match. Note that the characters have to be in order.
These two words have all the same characters, but the longest section in order is l-v-s for a 60% match.
I'll post any significant advances here in the Web log. Remember it shows posts in reverse chronological order, so the original post is at the bottom. The Blogger Comment feature is just so-so, so feel free to send advances to Chal @ MrExcel . com and I will periodically update the progress.
Sunday, February 03, 2002
Ed Acosta sent this note and a nice little bit of code. It is not a UDF, but if you need to perform a fuzzy match on two columns of numbers it will work well.
I guess this is kind of late for the contest, but my code lines are less then the ones you posted and it works from what I tested. I basically determined that once a match is made on the first string from the second string the next comparison starts from that point of the first string.
Sub FuzzyMatch() Dim L, L1, L2, M, SC, T, R As Integer Dim Fstr, Sstr As String For R = 1 To Range("A65536").End(xlUp).Row L = 0: M = 0: SC = 1 Fstr = UCase(Cells(R, 1).Value) Sstr = UCase(Cells(R, 2).Value) L1 = Len(Fstr) L2 = Len(Sstr) Do While L < L1 L = L + 1 For T = SC To L1 If Mid$(Sstr, L, 1) <> Mid$(Fstr, T, 1) Then GoTo RS M = M + 1 SC = T T = L1 + 1 RS: Next T Loop Cells(R, 3).Value = M / L1 Next R End Sub
Wednesday, November 07, 2001
Damon Ostrander followed up with this code. It is nice and compact. Excellent entry from Damon. I'll queue him up for one of the MrExcel.com T-Shirts. First his comments, then the code:
I took a different approach to the problem, writing a function using only built-in VBA resources. The function can be called either from VBA or directly from a worksheet using for example the syntax:
=Fuzzy("I B M","The IBM Corporation")
to return the percentage match.
I limited the first string to 24 characters max since the algorithm used is not the fastest--it does an exhaustive compare against every ordered combination of characters in the first string. I believe it could be made much more efficient via the use of Grey Code rather than straight binary so that each time a match is found all lesser matches would automatically be eliminated.
This method yields the correct percentages for each of the examples given in the challenge including the IBM case above, which yields 60%.
Dim TopMatch As Integer Dim strCompare As String Function Fuzzy(strIn1 As String, strIn2 As String) As Single Dim L1 As Integer Dim In1Mask(1 To 24) As Long 'strIn1 is 24 characters max Dim iCh As Integer Dim N As Long Dim strTry As String Dim strTest As String TopMatch = 0 L1 = Len(strIn1) strTest = UCase(strIn1) strCompare = UCase(strIn2) For iCh = 1 To L1 In1Mask(iCh) = 2 ^ iCh Next iCh 'Loop thru all ordered combinations of characters in strIn1 For N = 2 ^ (L1 + 1) - 1 To 1 Step -1 strTry = "" For iCh = 1 To L1 If In1Mask(iCh) And N Then strTry = strTry & Mid(strTest, iCh, 1) End If Next iCh If Len(strTry) > TopMatch Then TestString strTry Next N Fuzzy = TopMatch / CSng(L1) End Function Sub TestString(strIn As String) Dim L As Integer Dim strTry As String Dim iCh As Integer L = Len(strIn) If L <= TopMatch Then Exit Sub strTry = "*" For iCh = 1 To L strTry = strTry & Mid(strIn, iCh, 1) & "*" Next iCh If strCompare Like strTry Then If L > TopMatch Then TopMatch = L End If End Sub
Wednesday, October 31, 2001
Damon Ostrander checked in with this note:
The problem definition is not entirely clear. For example, in the comparison of the two strings:
"Mr. Excel.com Consulting"
There are supposed to be 11 characters that match. But what about the s in Ask matching with the s in Consulting? And if this is not a legitimate match, then what about matching the two strings:
"Mr. Excel.com Consulting, Ask"
Would the two "Ask" strings not count as matching?
Great questions. As I posted on October 17th, the characters must be in order to match. This would preclude the "s" in Ask to match the "s" in Consulting. If the letters could match in any order, then these two strings would be a 100% match:
I like Damon's second question, which makes me want to change the original question. It would be tempting to include a "significant" number of sequential letters matching as in:
Sunday, October 28, 2001
Andrew from Australia wrote with comments. Juan Pablo's code looks OK. The solution to handle the IBM pair is to swap str1 & str2 and re-run the i1 loop. Thanks to Andrew for his contribution. A MrExcel.com 2002 Olympic Calendar is on its way to Australia for his suggestion shown below.
Sub FuzzyMatch() Dim i1 As Integer Dim i As Integer Dim TopString As String Dim TopMatch As String Dim Str1 As String Dim Str2 As String Dim Ar1() As Integer Dim Fn As WorksheetFunction Set Fn = WorksheetFunction Str1 = Application.InputBox("Cell Address First String", , , , , , , 8) Str2 = Application.InputBox("Cell Address Second String", , , , , , , 8) ReDim Ar1(Len(Str1)) As Integer On Error Resume Next FirstRun = True ReRun: For i1 = 1 To Len(Str1) For i = 1 To Len(Str1) Ar1(i) = 0 Next i TopString = "" i = 1 F = Fn.Search(Mid(Str1, i1, 1), Str2) If Not IsEmpty(F) Then TopString = IIf(1 >= Len(TopString), Mid(Str1, i1, 1), TopString) Ar1(i) = F i = i + 1 F = Empty For j1 = i1 + 1 To Len(Str1) F = Fn.Search(Mid(Str1, j1, 1), Str2, Ar1(i - 1)) If Not IsEmpty(F) Then If F > Ar1(i - 1) Then TopString = IIf(Len(TopString & Mid(Str1, j1, 1)) >= Len(TopString),_ TopString & Mid(Str1, j1, 1), TopString) Ar1(i) = F i = i + 1 End If F = Empty End If Next j1 End If TopMatch = IIf(Len(TopMatch) < Len(TopString), TopString, TopMatch) Next i1 If FirstRun then FirstRun = False StrTemp = Str1 Str1 = Str2 Str2 = StrTemp GoTo ReRun End if MsgBox "Top Match is: '" & TopMatch & "', with a fuzzy match of " & _ Fn.Text(Len(TopMatch) / Len(Str1), "0.00%") & "." End Sub
Thursday, October 18, 2001
After Qroozn got us off to a stellar start ;-), Juan Pablo sent in the first serious stab at a solution. The thing works for every case except "I B M" and "IBM Corporation". Take a look at it, build on it.
Sub FuzzyMatch() Dim i1 As Integer Dim i As Integer Dim TopString As String Dim TopMatch As String Dim Str1 As String Dim Str2 As String Dim Ar1() As Integer Dim Fn As WorksheetFunction Set Fn = WorksheetFunction Str1 = Application.InputBox("Cell Address First String", , , , , , , 8) Str2 = Application.InputBox("Cell Address Second String", , , , , , , 8) ReDim Ar1(Len(Str1)) As Integer On Error Resume Next For i1 = 1 To Len(Str1) For i = 1 To Len(Str1) Ar1(i) = 0 Next i TopString = "" i = 1 F = Fn.Search(Mid(Str1, i1, 1), Str2) If Not IsEmpty(F) Then TopString = IIf(1 >= Len(TopString), Mid(Str1, i1, 1), TopString) Ar1(i) = F i = i + 1 F = Empty For j1 = i1 + 1 To Len(Str1) F = Fn.Search(Mid(Str1, j1, 1), Str2, Ar1(i - 1)) If Not IsEmpty(F) Then If F > Ar1(i - 1) Then TopString = IIf(Len(TopString & Mid(Str1, j1, 1)) >= Len(TopString), _ TopString & Mid(Str1, j1, 1), TopString) Ar1(i) = F i = i + 1 End If F = Empty End If Next j1 End If TopMatch = IIf(Len(TopMatch) < Len(TopString), TopString, TopMatch) Next i1 MsgBox "Top Match is: '" & TopMatch & "', with a fuzzy match of " & _ Fn.Text(Len(TopMatch) / Len(Str1), "0.00%") & "." End Sub
Juan Pablo definitely has an official MrExcel.com 2002 Salt Lake City Official Olympic Calendar coming his way. Thanks for the entry.
November 19, 2003
Joe Stanton checked in to thank us for the routine posted by Ed Acosta. Joe offered his wrapper application that allows the search to operate on a word by word basis with garbage-characters removed and with extra words discarded (optional).
Function FuzzyMatchByWord(ByVal lsPhrase1 As String, ByVal lsPhrase2 As String, Optional lbStripVowels As Boolean = False, Optional lbDiscardExtra As Boolean = False) As Double ' ' Compare two phrases and return a similarity value (between 0 and 100). ' ' Arguments: ' ' 1. Phrase1 String; any text string ' 2. Phrase2 String; any text string ' 3. StripVowels Optional to strip all vowels from the phrases ' 4. DiscardExtra Optional to discard any unmatched words ' 'local variables Dim lsWord1() As String Dim lsWord2() As String Dim ldMatch() As Double Dim ldCur As Double Dim ldMax As Double Dim liCnt1 As Integer Dim liCnt2 As Integer Dim liCnt3 As Integer Dim lbMatched() As Boolean Dim lsNew As String Dim lsChr As String Dim lsKeep As String 'set default value as failure FuzzyMatchByWord = 0 'create list of characters to keep lsKeep = "BCDFGHJKLMNPQRSTVWXYZ0123456789 " If Not lbStripVowels Then lsKeep = lsKeep & "AEIOU" End If 'clean up phrases by stripping undesired characters 'phrase1 lsPhrase1 = Trim$(UCase$(lsPhrase1)) lsNew = "" For liCnt1 = 1 To Len(lsPhrase1) lsChr = Mid$(lsPhrase1, liCnt1, 1) If InStr(lsKeep, lsChr) <> 0 Then lsNew = lsNew & lsChr End If Next lsPhrase1 = lsNew lsPhrase1 = Replace(lsPhrase1, " ", " ") lsWord1 = Split(lsPhrase1, " ") If UBound(lsWord1) = -1 Then Exit Function End If ReDim ldMatch(UBound(lsWord1)) 'phrase2 lsPhrase2 = Trim$(UCase$(lsPhrase2)) lsNew = "" For liCnt1 = 1 To Len(lsPhrase2) lsChr = Mid$(lsPhrase2, liCnt1, 1) If InStr(lsKeep, lsChr) <> 0 Then lsNew = lsNew & lsChr End If Next lsPhrase2 = lsNew lsPhrase2 = Replace(lsPhrase2, " ", " ") lsWord2 = Split(lsPhrase2, " ") If UBound(lsWord2) = -1 Then Exit Function End If ReDim lbMatched(UBound(lsWord2)) 'exit if empty If Trim$(lsPhrase1) = "" Or Trim$(lsPhrase2) = "" Then Exit Function End If 'compare words in each phrase For liCnt1 = 0 To UBound(lsWord1) ldMax = 0 For liCnt2 = 0 To UBound(lsWord2) If Not lbMatched(liCnt2) Then ldCur = FuzzyMatch(lsWord1(liCnt1), lsWord2(liCnt2)) If ldCur > ldMax Then liCnt3 = liCnt2 ldMax = ldCur End If End If Next lbMatched(liCnt3) = True ldMatch(liCnt1) = ldMax Next 'discard extra words ldMax = 0 For liCnt1 = 0 To UBound(ldMatch) ldMax = ldMax + ldMatch(liCnt1) Next If lbDiscardExtra Then liCnt2 = 0 For liCnt1 = 0 To UBound(lbMatched) If lbMatched(liCnt1) Then liCnt2 = liCnt2 + 1 End If Next Else liCnt2 = UBound(lsWord2) + 1 End If 'return overall similarity FuzzyMatchByWord = 100 * (ldMax / liCnt2) End Function Function FuzzyMatch(Fstr As String, Sstr As String) As Double ' ' Code sourced from: http://www.mrexcel.com/pc07.shtml ' Credited to: Ed Acosta ' Modified: Joe Stanton ' Dim L, L1, L2, M, SC, T, R As Integer L = 0 M = 0 SC = 1 L1 = Len(Fstr) L2 = Len(Sstr) Do While L < L1 L = L + 1 For T = SC To L1 If Mid$(Sstr, L, 1) = Mid$(Fstr, T, 1) Then M = M + 1 SC = T T = L1 + 1 End If Next T Loop If L1 = 0 Then FuzzyMatch = 0 Else FuzzyMatch = M / L1 End If End Function