trying to detect look-alike 8-digit numbers - Excel fuzzy lookup?

radia

New Member
Joined
May 20, 2010
Messages
5
Hi - I've a list of unique 8-digit numbers (identities). While I know that each of them is unique, some of them are very similar. For example only one digit might be different, or two digits might be in the opposite order, at any location within the 8-digit sequence.

People will be manually entering data against these 8-digit numbers. Some are not great at data entry, and I'm concerned that they may enter it against the wrong identity.
E.g. one identity is 12345678 and another is 12346578. The inattentive worker may put the data belonging to the former against the ID number for the latter.

Therefore I want to draw up a list of 'look-alikes' - pairs (or perhaps more - there could be clusters of three or four IDs similar to each other) of numbers within the database that I can highlight as needing special attention when entering data against them (and that can be double-checked afterwards).

I downloaded the Excel fuzzy lookup add-on from Microsoft, and set it up to compare the table of ID numbers with a copy of itself, but reporting a relatively low similarity threshold and a high number of matches. But even though I brought the similarity down to 0.25 and the number of matches up to 10 for experimental purposes, it still only ever reported one result for each number - the perfect similarity matches with the same ID numbers in the second copy of the table. Whereas what I'm really interested in are the ones with slightly lower similarity scores in order to pick out the non-identical but look-alike IDs.

I know there are definitely some look-alikes within the database because I've identified a few of them manually.

I'm probably doing something very basic wrong but I can't see what it is. Can anyone please help? Or alternatively suggest a different way of achieving what I want (flagging the look-alikes)?
(Note - the data entry has to be manual. It's first time entry of new data that doesn't yet exist in any database. Therefore it's not simply a question of using vlookup or similar to get it from one place to another.)

Many thanks for any help.
 

Excel Facts

Can Excel fill bagel flavors?
You can teach Excel a new custom list. Type the list in cells, File, Options, Advanced, Edit Custom Lists, Import, OK
One thing you could do is add up the sum of the individual numbers. For instance the sum of all the numbers in your example above is 36 for each of the entries, but the numbers are not equal to each other. Maybe if the sums vary by no more than 0, 1, or 2 then it could be counted as similar and potentially incorrect. I think if you talked more about the process and layout, because in this example the numbers do not equal each other, therefore one is wrong. I don't see how that is not caught with a simple equal sign, given what we know so far.
 
Upvote 0
Let me know if this works for you.

Book1
ABCD
1123456124365TRUE
2123456147546FALSE
3123456654321TRUE
4123456654322FALSE
Sheet4
Cell Formulas
RangeFormula
D1:D4D1=isSame(A1,B1)


VBA Code:
Function isSame(val1 As String, val2 As String) As Boolean
Dim res(0 To 1) As String

res(0) = loadArray(val1)
res(1) = loadArray(val2)
isSame = res(0) = res(1)
End Function

Function loadArray(val)
Dim tmp() As String
With CreateObject("System.Collections.ArrayList")
    tmp = Split(StrConv(val, vbUnicode), Chr(0))
    For i = LBound(tmp) To UBound(tmp) - 1
        .Add tmp(i)
    Next i
    .Sort
    loadArray = Join(.toArray, vbNullString)
End With
End Function
 
Upvote 0
It seems to me that even identifying a similar number and flagging it to be checked does not help a great deal.

Is it feasible to have the eight digit number entered as well so that the input operator gets a warning immediately that the numbers do not match?
 
Upvote 0
Many thanks for all the suggestions.
I think I probably didn't explain myself clearly enough! The data entry staff will not be entering the numbers. They will have a spreadsheet with these ID numbers, and will be entering other new data against them. I'm concerned that they will accidentally enter this new data against the wrong identity. (It's completely new data, so has to be typed in, rather than being automatically pulled by vlookup or similar from another database.)
So in the attached example, they would enter new data in column B. I'm worried that since the ID numbers in A6 and A7 are similar, they will accidentally type data belonging to the wrong person in column B - i.e. they'll type data in B6 that really should be in B7.

So before issuing the spreadsheet to them for data entry, what I want to do is identify pairs (or potentially clusters of 3-4) IDs that are potentially easily confused, so that I can highlight them to them as ones to be particularly careful with - in a similar way to how I have coloured the look-alike pairs in the attachment.
I don't need an Excel formula that's good enough to conditionally format the similar pairs - I can do that bit manually.
What I'm concerned about is actually identifying the look-alikes.

I appreciate your help with this!
lookalike_example.png
 
Upvote 0
So essentially what I'm trying to do is to compare each number with all the other numbers in the database to see if there are any that are similar, and if so, to flag which one(s) are look-alikes for it.
Many thanks for your help.
 
Upvote 0
So essentially what I'm trying to do is to compare each number with all the other numbers in the database to see if there are any that are similar, and if so, to flag which one(s) are look-alikes for it.
Many thanks for your help.
I think the difficulty you have set yourself is identifying that what constitutes a similar number may vary depending on the operator, as you have to identify both the common differences - substitutes 2 for 5 , 6 for 9, that sort of thing, and those that are mis-located i.e. 09 instead of 90. Any routine to identify and flag problematic numbers is still liable to failure for not searching broadly enough for the possible problem numbers, and if you make the search too broad, than every number gets flagged. That is why I suggested that the operator re-enter the identifying number as their first item, so that they can immediately know if the entry is for the correct item or if they have made a mistake, and to check to see what the cause of the mis-match is.

Unfortunately that is about as far as my Excel expertise goes, as I would think that to achieve what you want will require a VBA solution, which is definitely outside my limited expertise with Excel.

Good luck
 
Upvote 0
Thanks DSCfromCFA.
Yes, it's a tricky one and there's definitely an element of subjectivity possible.
I guess I'd be happy with identifying any that:
* differ by just one digit; or
* have two digits transposed (thus differing by those two digits.

Your idea to have them enter the number manually as an additional prompt is a good one.
 
Upvote 0

Forum statistics

Threads
1,213,543
Messages
6,114,237
Members
448,555
Latest member
RobertJones1986

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top