Fuzzy matching on textual data in Power Query

absherzad

New Member
Joined
Jun 19, 2017
Messages
42
Hello everyone,
Is there a way to incorporate Microsoft Fuzzy Lookup Add-In for Excel in Power Query to perform fuzzy matching on textual data in Power Query? Or is it possible to calculate string similarity in Power Query?
 

Excel Facts

Shade all formula cells
To shade all formula cells: Home, Find & Select, Formulas to select all formulas. Then apply a light fill color.
Hi.
The simplest vesion of Fuzzy
Code:
let
    words1 = "together",
    words2 = "to their",
    simpleLikeFuzzy = (words1 as text, words2 as text) as number =>
        2 * List.Count(List.Intersect({Text.ToList(words1), Text.ToList(words2)})) / (Text.Length(words1) + Text.Length(words2)),
    return = simpleLikeFuzzy(words1, words2)
in
    return
Regards,
 
Upvote 0
Thank you for your response, and it is appreciated :)
The algorithm sounds simple, does it have a name?

My observation:
In some cases this simple version of Fuzzy can be used, such as, minor typo in a column where there is a standard master list for it. However in the following cases it is not advised:

  • Columns that contains values with few characters. Even with one character difference the result may significantly be decreased.
  • Also, sometimes there are words completely different, they, however, share common characters.
  • Moreover, situations that require partial matching.

Fuzzy lookup and comparison is great and has numerous applications. I am looking forward and vote for adding this feature to the Power Query :)

Best Regards,
 
Upvote 0
Hi
Thank you for your observation.
Columns that contains values with few characters. Even with one character difference the result may significantly be decreased.
You can use another criteria for it like this
Code:
    simpleLikeFuzzy = (words1 as text, words2 as text) as number =>
        List.Count(List.Intersect({Text.ToList(words1), Text.ToList(words2)}))
        / (if Text.Length(words1) > Text.Length(words2) then Text.Length(words2) else Text.Length(words1)),
Because chars intersection cannot be more than the shortest char sequence.
Also, sometimes there are words completely different, they, however, share common characters.
You can use more a strong method. It is not optimized as is
Code:
let
    words1 = "A big father waks with his own children about a green garden.",
    words2 = "With nice children a big father walks about a greate garden.",
    toTable = (words as text, idName as text) as any =>
    let
    col1 = Table.RenameColumns(Table.FromList(Text.ToList(words)), {{"Column1", "char" & idName}}),
    idx = Table.AddIndexColumn(col1, idName)
    in
    idx,
    chars1 = toTable(words1, "row"),
    chars2 = toTable(words2, "col"),
    joined = Table.Join(chars1, "charrow", chars2, "charcol"),
    addDif = Table.AddColumn(joined, "dif", each [col] -[row]),
    sorted = Table.Sort(addDif,{{"dif", Order.Ascending}, {"row", Order.Ascending}}),
    addIdx = Table.AddIndexColumn(sorted, "idx"),
    addGlobalDif = Table.AddColumn(addIdx, "gdif", each [row] - [idx])[[charrow], [row], [dif], [gdif]],
    rankGroup = Table.Group(addGlobalDif, {"gdif", "dif"}, {{"forExpand", each [[charrow], [row]]}, {"rows", each [row]}, {"count", each List.Count([row])}}),
    rankSort = Table.Sort(rankGroup,{{"count", Order.Ascending}}),
    rankIndex = Table.Buffer(Table.AddIndexColumn(rankSort[[forExpand], [rows]], "idx")),
    step1 = List.Transform(rankIndex[idx], (cur) =>
    let
        rec = rankIndex{cur},
        sub = Table.SelectRows(rankIndex, each (_[idx] > cur) and ( List.Count(List.Intersect({_[rows], rec[rows]})) > 0 )),
        result = if Table.RowCount(sub) > 0 then null else rec
    in
        result
    ),
    step2 = Table.FromRecords(List.RemoveNulls(step1))[[forExpand]],
    uniqieRows = Table.ExpandTableColumn(step2, "forExpand", {"charrow", "row"}),
    sortedUniques = Table.Sort(uniqieRows,{{"row", Order.Ascending}}),
    return = Text.Combine(sortedUniques[charrow])
in
    return
Regards, Andrey
 
Upvote 0
Thank you very much for more examples.
I think I like the following solution in your first reply:
Code:
let
    words1 = "together",
    words2 = "to their",
    simpleLikeFuzzy = (words1 as text, words2 as text) as number =>
        2 * List.Count(List.Intersect({Text.ToList(words1), Text.ToList(words2)})) / (Text.Length(words1) + Text.Length(words2)),
    return = simpleLikeFuzzy(words1, words2)
in
    return

By the way, could you please write me which algorithm and approach this method uses? Is it known one?

Best Regards,
Sherzad
 
Upvote 0
Dear Andrey,

Is it possible to use the simpleLikeFuzzy function in the merge process while joining two tables?
Let me elaborate my question, I have two tables without a common key. I like to consider the name column and compare their similarity and if the result of the similarity is greater than the threshold .85 then it is considered a match, otherwise, not.


Regards,
 
Upvote 0
Hi
Is it possible to use the simpleLikeFuzzy function in the merge process while joining two tables?
Alright
Code:
let    // an etalon word table, maybe loaded from external source
    defEtalon = #table(type table [value = text], 
    { {"spruce"}, {"cedar"}, {"birch"}, {"pine"}, {"poplar"}, {"aspen"}, 
      {"baobab"}, {"larch"}, {"willow"} }),
    addChars = Table.AddColumn(defEtalon, "chars", each Text.ToList([value]), type list),
    addLength = Table.AddColumn(addChars, "length", each Text.Length([value]), Int64.Type),
    etalonTable = Table.Buffer(addLength),
    // define a search function in the etalon table by the simplest fuzzy
    lookup = (testText as text) as record =>
    let
        simpleFuzzy = (testChars as list) as record =>
        let
            coefficient = Table.AddColumn(etalonTable, "c", each 2 * List.Count(List.Intersect({testChars, [chars]})) / ([length] + List.Count(testChars)), type number)
        in
            Table.Max(coefficient, "c"),
        testChars = Text.ToList(testText)
    in
        // define record fields from the etalon table for returning
        simpleFuzzy(testChars)[[value], [c]],
    // a test table for join with etalon table
    testTable = #table(type table [word = text],
    { {"sprce"}, {"cidar"}, {"pane"}, {"aspern"}, {"babab"} }),
    // to join them
    result = Table.AddColumn(testTable, "rec", each lookup([word]), type record),
    // expand returned joined records
    return = Table.ExpandRecordColumn(result, "rec", {"value", "c"})


in
    return
Regards,
 
Upvote 0
Many thanks for the input. Frankly speaking I could not manage to customize it in my scenario, attached screenshot.
One idea crossed my mind is to use the combination of Name, Job and Department columns using the concatenate transformation function. Can you please have a look at the screenshot and kindly share your opinion with me how to customize the code.
 
Last edited:
Upvote 0
Hi Andrey,
Any idea how to use the simplest vesion of Fuzzy on two tables/datasets similar to what Microsoft Excel Fuzzy Lookup Add-In does?

Best,
 
Upvote 0
Anvg - Brilliant solution. Your solution delivers better results that the 'untweaked' Microsoft fuzzy add-in. With tweaks - the two solutions are comparable. DrewBbc
 
Upvote 0

Forum statistics

Threads
1,213,554
Messages
6,114,280
Members
448,562
Latest member
Flashbond

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top