Use regular expressions to look for unicode words

Jubinell · Jun 3, 2010

Hi,

I'm trying to look for some whole-word Unicode words in a range of cells.
My lookup words may look something like these:
"an" "tiếp" "lị" "nguyễn", etc. (they are made up of characters from the Vietnamese alphabet)
and my lookin cells may look something like these:
"an" "ban, tiếp; lịn, ngạn"

In this case, word # 1 is in cell 1 and not in cell 2, word # 2 is in cell 2 and not cell 1. Words 3 and 4 are in neither cells.

Unable to find a good solution with Excel's basic functions, I turned to regular expressions. I ended up using the RegExpFind() function given here: http://www.vbaexpress.com/kb/getarticle.php?kb_id=841
There are a bunch of them floating around but they are mostly the same. I'm not too worried about these functions themselves. Instead, I'm wondering if somebody can help me with the patterns.

For example, if I look up the word "an" and only want to match whole words, I simply use "\ban\b" and the double \b word boundaries take care of it for me. However, when I move onto non-English-character words, such as "tiếp", "\btiếp\b" would not work.

Am I missing something here? Does Excel's regular expression not support unicode characters?

Jubinell · Jun 7, 2010

Thank you pgc01, I've added my own values of Vietnamese characters into the include/exclude string variables, added a Match Case option, and done some other minor tweaks. Things work OK for now.

I'm still not sure I understand the .Pattern conditions. What do the texts in each pattern mean (in terms of how RegExp interprets and uses them) and how do they differ between the two conditions?

Forgive me I just started treading into regular expression's and the rules/patterns are still a mystery...no biggies if you have no time for this last question.

pgc01 · Jun 8, 2010

Hi Jubinell

I'm glad it helped.

What do the texts in each pattern mean (in terms of how RegExp interprets and uses them) and how do they differ between the two conditions?

The logic here is the following:

1 -

Code:

    sWP = Replace(sWordPattern, "\w", sLetterInc)

since the default "\w" (defined as [a-zA-Z0-9_]) is not valid in this case, we replace it with the character class we build ourselves including the characters that are valid for this language.

2 -

Code:

        .Pattern = "(?:^|" & sLetterExc & ")(" & sWP & ")(?=" & sLetterExc & "|$)"

This is the pattern to look for a whole word (bWhole = True). Ideally, the pattern should be:

<preceeded by non-word character> <pattern defined by the user> <followed by non-word character>

Unfortunately the "preceeded by" and "followed by", that in RegExp are called the lookaround assertions, are not (yet) both implemented in VBA, only the second one, the lookahead is. Since we don't have access to the lookbehind in vba we have to use a workaround, in this case we match a non-word character before the word and we capture the word in a sub-match (enclosing it in parentheses). We replace the "ideal" pattern with a possible one:

<NON-WORD character><non-word character> <pattern defined by the user> <followed by non-word character>

Now we match this pattern and then our word is the first submatch.

I defined the non-word character "(?:^|" & sLetterExc & ")" as

- either the beginning of the text
- or a character in the exclude character class

I defined the lookahead for the non-word character "(?=" & sLetterExc & "|$)" as looking for:

- either a character in the exclude character class
- or the end of the text

In the middle, between both, we have the word we are looking for "(" & sWP & ")", using the parentheses to capture it into a submatch. This part is also what we use in the case of bWhole = False, where we don't care what preceeds or follows the word.

So, in conclusion, in both patterns (for bWhole = True/False) the resulting word will be in the first submatch.

I've added my own values of Vietnamese characters into the include/exclude string variables

If you want it to be more efficient you can also delete the ones I included for the European languages that are not used in the Vietnamese language.

The smaller the number of characters in the include/exclude character classes, the faster the code will execute.

Hope this helps.

Use regular expressions to look for unicode words

Jubinell

Board Regular

Jubinell

Board Regular

Excel Facts

pgc01

MrExcel MVP

Similar threads

Forum statistics

Share this page

Use regular expressions to look for unicode words

Jubinell

Board Regular

Jubinell

Board Regular

Excel Facts

pgc01

MrExcel MVP

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock