[an error occurred while processing this directive]

Challenge of the Month


Here is the problem: Build in VBA a routine that will calculate a "fuzzy match" between two text strings. This routine will allow us to say that one string is a 75% match to the other string. Here are some examples:

"Ask MrExcel.com"
"Mr Excel.com Consulting"
There are 11 characters which match and are in order between these two strings. We'll divide the 11 by the length of string1, 11/15 = 73% match.

"I B M"
"IBM Corporation"
This has 3 characters that match, divided by 5 in the top string, for a 60% match.

"A. Schulman"
"A Shulman"
The characters that match are A-space-S-h-u-l-m-a-n. That is 9 characters out of 11, for a 82% match.

Note that the characters have to be in order.

"Elvis"
"Lives"
have all the same characters, but the longest section in order is l-v-s for a 60% match.

I'll post any significant advances here in the Weblog. Remember it shows posts in reverse chronological order, so the original post is at the bottom. The Blogger Comment feature is just so-so, so feel free to send advances to Challenge@MrExcel.com and I will periodically update the progress.




Monthly Challenge Problem






This page is powered by Blogger. Isn't yours?
Sunday, February 03, 2002
 
Ed Acosta sent this note and a nice little bit of code. It is not a UDF, but if you need to perform a fuzzy match on two columns of numbers it will work well.

I guess this is kind of late for the contest, but my code lines are less then the ones you posted and it works from what I tested. I basically determined that onces a match is made on the first string from the second string the next comparison starts from that point of the first string.


Sub FuzzyMatch()

Dim L, L1, L2, M, SC, T, R As Integer
Dim Fstr, Sstr As String

For R = 1 To Range("A65536").End(xlUp).Row
L = 0: M = 0: SC = 1
Fstr = UCase(Cells(R, 1).Value)
Sstr = UCase(Cells(R, 2).Value)
L1 = Len(Fstr)
L2 = Len(Sstr)

Do While L < L1
L = L + 1
For T = SC To L1
If Mid$(Sstr, L, 1) <> Mid$(Fstr, T, 1) Then GoTo RS
M = M + 1
SC = T
T = L1 + 1
RS:

Next T
Loop
Cells(R, 3).Value = M / L1

Next R

End Sub



Wednesday, November 07, 2001
 
Damon Ostrander followed up with this code. It is nice and compact. Excellent entry from Damon. I'll queue him up for one of the MrExcel T-Shirts. First his comments, then the code:

I took a different approach to the problem, writing a function using only built-in VBA resources. The function
can be called either from VBA or directly from a worksheet using for example the syntax:

=Fuzzy("I B M","The IBM Corporation")

to return the percentage match.

I limited the first string to 24 characters max since the algorithm used is not the fastest--it does an exhaustive compare against every ordered combination of characters in the first string. I believe it could be made much more efficient via the use of Grey Code rather than straight binary so that each time a match is found all lesser matches would automatically be eliminated.

This method yields the correct percentages for each of the examples given in the challenge including the IBM case above, which yields 60%.


Dim TopMatch As Integer
Dim strCompare As String

Function Fuzzy(strIn1 As String, strIn2 As String) As Single

Dim L1 As Integer
Dim In1Mask(1 To 24) As Long 'strIn1 is 24 characters max
Dim iCh As Integer
Dim N As Long
Dim strTry As String
Dim strTest As String

TopMatch = 0
L1 = Len(strIn1)
strTest = UCase(strIn1)
strCompare = UCase(strIn2)

For iCh = 1 To L1
In1Mask(iCh) = 2 ^ iCh
Next iCh

'Loop thru all ordered combinations of characters in strIn1
For N = 2 ^ (L1 + 1) - 1 To 1 Step -1
strTry = ""
For iCh = 1 To L1
If In1Mask(iCh) And N Then
strTry = strTry & Mid(strTest, iCh, 1)
End If
Next iCh
If Len(strTry) > TopMatch Then TestString strTry
Next N

Fuzzy = TopMatch / CSng(L1)

End Function

Sub TestString(strIn As String)

Dim L As Integer
Dim strTry As String
Dim iCh As Integer

L = Len(strIn)
If L <= TopMatch Then Exit Sub

strTry = "*"

For iCh = 1 To L
strTry = strTry & Mid(strIn, iCh, 1) & "*"
Next iCh

If strCompare Like strTry Then
If L > TopMatch Then TopMatch = L
End If

End Sub



Wednesday, October 31, 2001
 
Damon Ostrander checked in with this note:

The problem definition is not entirely clear. For example, in the comparison of the two strings:

"Ask MrExcel.com"
"Mr Excel.com Consulting"

there are supposed to be 11 characters that match. But what about the s in Ask matching with the s in Consulting? And if this is not a legitimate match, then what about matching the two strings:

"Ask MrExcel.com"
"Mr Excel.com Consulting, Ask"

Would the two "Ask" strings not count as matching?

Great questions. As I posted on October 17th, the characters must be in order to match. This would preclude the "s" in Ask to match the "s" in Consulting. If the letters could match in any order, then these two strings would be a 100% match:
"Damon Ostrander"
"Mastadon Errand"

I like Damon's second question, which makes me want to change the original question. It would be tempting to include a "significant" number of sequential letters matching as in:
"LastName, FirstName"
"FirstName, LastName"


Sunday, October 28, 2001
 
Andrew from Australia wrote with comments. Juan Pablo's code looks OK. The solution to handle the IBM pair is to swap str1 & str2 and re-run the i1 loop. Thanks to Andrew for his contribution. A MrExcel 2002 Olympic Calendar is on its way to Australia for his suggestion shown below.


Sub FuzzyMatch()
Dim i1 As Integer
Dim i As Integer
Dim TopString As String
Dim TopMatch As String
Dim Str1 As String
Dim Str2 As String
Dim Ar1() As Integer
Dim Fn As WorksheetFunction

Set Fn = WorksheetFunction

Str1 = Application.InputBox("Cell Address First String", , , , , , , 8)
Str2 = Application.InputBox("Cell Address Second String", , , , , , , 8)

ReDim Ar1(Len(Str1)) As Integer
On Error Resume Next
FirstRun = True
ReRun:
For i1 = 1 To Len(Str1)
For i = 1 To Len(Str1)
Ar1(i) = 0
Next i
TopString = ""
i = 1
F = Fn.Search(Mid(Str1, i1, 1), Str2)
If Not IsEmpty(F) Then
TopString = IIf(1 >= Len(TopString), Mid(Str1, i1, 1), TopString)
Ar1(i) = F
i = i + 1
F = Empty
For j1 = i1 + 1 To Len(Str1)

F = Fn.Search(Mid(Str1, j1, 1), Str2, Ar1(i - 1))
If Not IsEmpty(F) Then
If F > Ar1(i - 1) Then
TopString = IIf(Len(TopString & Mid(Str1, j1, 1)) >= Len(TopString), _
TopString & Mid(Str1, j1, 1), TopString)
Ar1(i) = F
i = i + 1
End If
F = Empty
End If
Next j1
End If
TopMatch = IIf(Len(TopMatch) < Len(TopString), TopString, TopMatch)
Next i1
If FirstRun then
FirstRun = False
StrTemp = Str1
Str1 = Str2
Str2 = StrTemp
GoTo ReRun
End if
MsgBox "Top Match is: '" & TopMatch & "', with a fuzzy match of " & _
Fn.Text(Len(TopMatch) / Len(Str1), "0.00%") & "."
End Sub



Thursday, October 18, 2001
 
After Qroozn got us off to a stellar start ;-), Juan Pablo sent in the first serious stab at a solution. The thing works for every case except "I B M" and "IBM Corporation". Take a look at it, build on it.


Sub FuzzyMatch()
Dim i1 As Integer
Dim i As Integer
Dim TopString As String
Dim TopMatch As String
Dim Str1 As String
Dim Str2 As String
Dim Ar1() As Integer
Dim Fn As WorksheetFunction

Set Fn = WorksheetFunction

Str1 = Application.InputBox("Cell Address First String", , , , , , , 8)
Str2 = Application.InputBox("Cell Address Second String", , , , , , , 8)

ReDim Ar1(Len(Str1)) As Integer
On Error Resume Next
For i1 = 1 To Len(Str1)
For i = 1 To Len(Str1)
Ar1(i) = 0
Next i
TopString = ""
i = 1
F = Fn.Search(Mid(Str1, i1, 1), Str2)
If Not IsEmpty(F) Then
TopString = IIf(1 >= Len(TopString), Mid(Str1, i1, 1), TopString)
Ar1(i) = F
i = i + 1
F = Empty
For j1 = i1 + 1 To Len(Str1)

F = Fn.Search(Mid(Str1, j1, 1), Str2, Ar1(i - 1))
If Not IsEmpty(F) Then
If F > Ar1(i - 1) Then
TopString = IIf(Len(TopString & Mid(Str1, j1, 1)) >= Len(TopString), _
TopString & Mid(Str1, j1, 1), TopString)
Ar1(i) = F
i = i + 1
End If
F = Empty
End If
Next j1
End If
TopMatch = IIf(Len(TopMatch) < Len(TopString), TopString, TopMatch)
Next i1
MsgBox "Top Match is: '" & TopMatch & "', with a fuzzy match of " & _
Fn.Text(Len(TopMatch) / Len(Str1), "0.00%") & "."
End Sub


Juan Pablo definitely has an official MrExcel.com 2002 Salt Lake City Official Olympic Calendar coming his way. Thanks for the entry.


Wednesday, October 17, 2001
 
Back in 1999, I used to publish a monthly challenge. It would be an insane problem and I would offer a MrExcel.com Coffee Mug to whoever came up with the best answer. Well, I recently came across an insane problem.

Web technology has come along. If you are a frequent reader of the message board, you can see how collaboration helps come to the optimal solution for a problem. Someone will ask a question, someone else will comment, someone else will build on that comment, etc. Now, with the Weblog technology, we can, in theory all collaborate right here on the Weblog.

So - as a great big experiment, I am proud to annouce the return of the Challenge. I'll explain the problem below. Post your replies, thoughts, etc. below that. This is a tough challenge, so I will upgrade the prize offerings. Everyone who contributes and makes a significant contribution towards moving this challenge along to a solution will win an official MrExcel.com Olympic Winter Games Salt Lake 2002 Calendar. The calendar is packed with Winter Olympic history, schedules, and cool stickers for the kids. Whoever makes the most significant contribution to the answer with receive one of the super-deluxe embroidered MrExcel.com long sleeve denim shirts.

Here is the problem: I would like to build in VBA a routine that will calculate a "fuzzy match" between two text strings. This routine will allow us to say that one string is a 75% match to the other string. Here are some examples:

"Ask MrExcel.com"
"Mr Excel.com Consulting"
There are 11 characters which match and are in order between these two strings. We'll divide the 11 by the length of string1, 11/15 = 73% match.

"I B M"
"IBM Corporation"
This has 3 characters that match, divided by 5 in the top string, for a 60% match.

"A. Schulman"
"A Shulman"
The characters that match are A-space-S-h-u-l-m-a-n. That is 9 characters out of 11, for a 82% match.

Note that the characters have to be in order.

"Elvis"
"Lives"
have all the same characters, but the longest section in order is l-v-s for a 60% match.

It is really easy for our brains to look at those two strings and to pick out the patterns. I've challenged a few friends to try and do this in VBA, and we have concluded: "its hard!".

There you have it. We'll let the challenge run until we all come up with some sort of a solution or mutually agree that it can not be done. If you have comments, ideas, etc. you can either e-mail the to Challenge@MrExcel.com or click the little comment hyperlink below and post it interactively. If you send e-mail, I will periodically summarize and post here.

Best of luck.





MrExcel.com provides examples of Visual Basic procedures for illustration only, without warranty either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose. The Visual Basic procedures on this web site are provided "as is" and we do not guarantee that they can be used in all situations.

Return to the Mr Excel Home Page.

Thanks for visiting Mr Excel.com! Proudly serving the web since November 21, 1998.

All contents copyright 2001 by MrExcel.com.

Send questions to ask@MrExcel.com

Excel is a registered trademark of the Microsoft Corporation.