Fuzzy Matching - new version plus explanation

al_b_cnu · Mar 27, 2006

It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:

Fuzzy Examples.xls

A

B

C

D

E

1

Starts With

Ends

Contains

2

Bill

jelen

Bill

3

Mr Bill Jelen

4

3

4

Bill Jelen

5

Joe Bloggs

6

Fred Smith

MATCH Example

... Continued ...

al_b_cnu · Mar 2, 2013

ipswichblues said:
Thanks so much, I am merging 2 databases for work and this is just what I needed.
Just making sure though. Is the most up to date code on the first page?

Hi ipswichblues,
Follow the link in my signature for latest code.
Enjoy

ipswichblues · Mar 2, 2013

Superb!!
quick question. What would u recommend the best settings are for the following.

I am merging 2 databases and will be checking for duplicates in names, emails and addresses (street name and zip code)

one database has 11,000 records and the other 6,000, so don't want to spend all day figuring out what's best for each lookup. Have a quad core processor but even that takes a while to get results.

thanks in advance!!

al_b_cnu · Mar 4, 2013

ipswichblues said:
Superb!!
quick question. What would u recommend the best settings are for the following.

I am merging 2 databases and will be checking for duplicates in names, emails and addresses (street name and zip code)

one database has 11,000 records and the other 6,000, so don't want to spend all day figuring out what's best for each lookup. Have a quad core processor but even that takes a while to get results.

thanks in advance!!

Hi,

Algorithm 4 (Levenstein distance) is the most accurate, but slower than algorithm 2 which matches pairs, triplets, quadfs etc..

Given the number of entries, I'd suggest algorithm 2.

nicklear · Mar 11, 2013

Unusual reason for non-completion of fuzzy vlookup

Just thought I'd post this because it is quite funny.

Was running a fuzzy vlookup for a column of 2000 odd names and while in the midst of the calculation, my laptop was stolen! From the kitchen while we were in the room next door!

(I am in Africa doing a child sponsorship program - was comparing names to last years list with plenty of spelling errors).

I wonder if the bandito (as they call them here in Mozambique) allowed the calculation to complete...

al_b_cnu · Mar 11, 2013

Re: Unusual reason for non-completion of fuzzy vlookup

Don't they know that they can get the Fuzzy Matching code for free on this site?

irhumbled · Apr 3, 2013

I've been using FuzzyVlookup to match account information between two databases. I'm matching based on the street address and then removing duplicates. This has saved me considerable time where Vlookup would have caused all sorts of issues in regards to slight deviations in spelling street names or in how certain streets were abbreviated when entered in. Sometimes "6th st" was entered "sixth st" for example and this has saved me time in matching.

However, I feel that my procoss of merging databases, given the criteria of matching by street address, would be more accurate by matching first on street number and then referencing the street name in another column. I'm not quite sure how to inform the macro that's run though to first match to street number and then to compare that against the street name. Basically I want to have a higher threshold for the street number with a lower threshhold for the street name. I imagine this would be best served with one column having the respective street numbers and the column over having the street names.

This is probably already possible to do, but could someone provide an example of how to do this? Perhaps it involves the group or column offset?

Thanks for your help!

al_b_cnu · Apr 4, 2013

Hi irhumbled,

Can you provide a sample set of inputs?

you may want to start a new thread for this (although I'm sure at least one reply will say "try FuzzyVLookup") - by all means PM me with a link if you wish

irhumbled · Apr 4, 2013

al_b_cnu said:
Hi irhumbled,

Can you provide a sample set of inputs?

you may want to start a new thread for this (although I'm sure at least one reply will say "try FuzzyVLookup") - by all means PM me with a link if you wish

I'll try to explain what I'm trying to accomplish hopefully more thoroughly.

You might imagine in Column A I have street numbers going down.

For example in A1: 122, A2: 156, A3: 994

And in Column B I'll have the street name of the street address

For example in B1: Thoroughway St, B2: Douglas Ct, B3: Clearwater Dr

And in Column C and/or D I'd have the street addresses (split or concated together into one column) from another database.

I'd like to have the next column over attempt to match the information in column C (or C and D if the info is split) to columns A and B.

The reason I mention this is in addresses I fee it makes the most sense to first priorotize matching street numbers, and then to see if the street name is correct. I'm using the FuzzyVlookup currently but I feel it's giving the same value to a street name as it would to a street number. Basically when there's not a perfect or near perfect match, I'd rather have it have a high priority of matching street numbers even if the street name is moderatly off. This was, something like 1113 6th st matches more likely to 1113 sixth street because of the shared street number.

If I can clarify any further, please let me know and I'll be happy to. It may not be necessary to split the contents of a street address into two columns but I thought it might be helpful in attempting what I want to do.

Thanks!

al_b_cnu · Apr 5, 2013

ok, try this example:

Excel 2003

A

B

C

D

E

F

1

Num

Address

Lookup 1

Lookup 2

Fuzzy Match Row

2

122

Thoroughway St

123 thoroughway Street

5

3

156

Douglas Ct

157 Douglas Ct

#N/A

4

994

Clearwater Dr

156

duglas Ct

3

5

123

Thoroway St

156 duglas court

3

Sheet2

irhumbled · Apr 5, 2013

al_b_cnu said:
ok, try this example:
Excel 2003

A B C D E F
1 Num Address Lookup 1 Lookup 2 Fuzzy Match Row
2 122 Thoroughway St 123 thoroughway Street 5
3 156 Douglas Ct 157 Douglas Ct #N/A
4 994 Clearwater Dr 156 duglas Ct 3
5 123 Thoroway St 156 duglas court 3

<colgroup><col><col><col><col><col><col><col></colgroup><thead>
</thead><tbody>
</tbody>
Sheet2

So... I'm not quite sure what to make of it. Perhaps the formulas posted would help me understand how the matching is done so I can replicate it in my own workbook?

Thank so much for you help al_b_cnu, you're UDF has been a lifesaver!

Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member

al_b_cnu

Well-known Member

Excel Facts

ipswichblues

New Member

al_b_cnu

Well-known Member

nicklear

New Member

al_b_cnu

Well-known Member

irhumbled

New Member

al_b_cnu

Well-known Member

irhumbled

New Member

al_b_cnu

Well-known Member

irhumbled

New Member

Similar threads

Forum statistics

Share this page

Fuzzy Matching - new version plus explanation

Well-known Member

Well-known Member

Excel Facts

New Member

Well-known Member

New Member

Well-known Member

New Member

Well-known Member

New Member

Well-known Member

New Member

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock