Use regular expressions to look for unicode words

Jubinell

Board Regular
Joined
Jan 17, 2008
Messages
166
Hi,

I'm trying to look for some whole-word Unicode words in a range of cells.
My lookup words may look something like these:
"an" "tiếp" "lị" "nguyễn", etc. (they are made up of characters from the Vietnamese alphabet)
and my lookin cells may look something like these:
"an" "ban, tiếp; lịn, ngạn"

In this case, word # 1 is in cell 1 and not in cell 2, word # 2 is in cell 2 and not cell 1. Words 3 and 4 are in neither cells.

Unable to find a good solution with Excel's basic functions, I turned to regular expressions. I ended up using the RegExpFind() function given here: http://www.vbaexpress.com/kb/getarticle.php?kb_id=841
There are a bunch of them floating around but they are mostly the same. I'm not too worried about these functions themselves. Instead, I'm wondering if somebody can help me with the patterns.

For example, if I look up the word "an" and only want to match whole words, I simply use "\ban\b" and the double \b word boundaries take care of it for me. However, when I move onto non-English-character words, such as "tiếp", "\btiếp\b" would not work.

Am I missing something here? Does Excel's regular expression not support unicode characters?
 
Thank you pgc01, I've added my own values of Vietnamese characters into the include/exclude string variables, added a Match Case option, and done some other minor tweaks. Things work OK for now.

I'm still not sure I understand the .Pattern conditions. What do the texts in each pattern mean (in terms of how RegExp interprets and uses them) and how do they differ between the two conditions?

Forgive me I just started treading into regular expression's and the rules/patterns are still a mystery...no biggies if you have no time for this last question.
 
Upvote 0

Excel Facts

Can Excel fill bagel flavors?
You can teach Excel a new custom list. Type the list in cells, File, Options, Advanced, Edit Custom Lists, Import, OK
Hi Jubinell

I'm glad it helped.

What do the texts in each pattern mean (in terms of how RegExp interprets and uses them) and how do they differ between the two conditions?

The logic here is the following:

1 -

Code:
    sWP = Replace(sWordPattern, "\w", sLetterInc)

since the default "\w" (defined as [a-zA-Z0-9_]) is not valid in this case, we replace it with the character class we build ourselves including the characters that are valid for this language.

2 -

Code:
        .Pattern = "(?:^|" & sLetterExc & ")(" & sWP & ")(?=" & sLetterExc & "|$)"

This is the pattern to look for a whole word (bWhole = True). Ideally, the pattern should be:

<preceeded by non-word character> <pattern defined by the user> <followed by non-word character>

Unfortunately the "preceeded by" and "followed by", that in RegExp are called the lookaround assertions, are not (yet) both implemented in VBA, only the second one, the lookahead is. Since we don't have access to the lookbehind in vba we have to use a workaround, in this case we match a non-word character before the word and we capture the word in a sub-match (enclosing it in parentheses). We replace the "ideal" pattern with a possible one:

<NON-WORD character><non-word character> <pattern defined by the user> <followed by non-word character>

Now we match this pattern and then our word is the first submatch.

I defined the non-word character "(?:^|" & sLetterExc & ")" as

- either the beginning of the text
- or a character in the exclude character class

I defined the lookahead for the non-word character "(?=" & sLetterExc & "|$)" as looking for:

- either a character in the exclude character class
- or the end of the text

In the middle, between both, we have the word we are looking for "(" & sWP & ")", using the parentheses to capture it into a submatch. This part is also what we use in the case of bWhole = False, where we don't care what preceeds or follows the word.

So, in conclusion, in both patterns (for bWhole = True/False) the resulting word will be in the first submatch.

I've added my own values of Vietnamese characters into the include/exclude string variables

If you want it to be more efficient you can also delete the ones I included for the European languages that are not used in the Vietnamese language.

The smaller the number of characters in the include/exclude character classes, the faster the code will execute.


Hope this helps.
 
Last edited:
Upvote 0

Forum statistics

Threads
1,214,622
Messages
6,120,585
Members
448,972
Latest member
Shantanu2024

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top