Understanding/using RegExp for string processing in VBA

JackDanIce

Well-known Member
Joined
Feb 3, 2010
Messages
9,922
Office Version
  1. 365
Platform
  1. Windows
Hi,

Imported data (csv file) has a column of strings and I need to extract 11 characters after a particular value, as long as those 11 characters are alpha-numeric, i.e. A-Z, a-z and 0-9 (no special characters).

The string possibly contains an ISIN code value preceded by the term "ANY." (constant in every string).

Example with required extraction in blue:
Rich (BB code):
RMDS_IDN_EM::IDN_RDF.ANY.AT000034269=TE.NaE
(I already have a check function to verify it's an ISIN and generate a 12th digit check digit)

However, the column can contain false ISIN values (in red) such as:
Rich (BB code):
RMDS_IDN_EM::IDN_RDF.ANY.ITEEU3Y=TE.NaE

I can already extract the 11 characters using a combination of MID and FIND and currently my code eliminates any extraction that contains the character "=".


I'd like to learn how Reg Exp object can be used to extract this value, I suspect it will shorten the code as well for the matching part.

Can you suggest the required code please?

Thank you in advance,
Jack

Alternatively, every string contains one and only one instance of "=", so another test might be extract all the characters between "ANY." and "=" (discard row if length <> 11 or contains symbols not alpha-numeric)
 
Last edited:

Excel Facts

Add Bullets to Range
Select range. Press Ctrl+1. On Number tab, choose Custom. Type Alt+7 then space then @ sign (using 7 on numeric keypad)
See if this helps:

You could try this user-defined function. To implement ..
1. Right click the sheet name tab and choose "View Code".
2. In the Visual Basic window use the menu to Insert|Module
3. Copy and Paste the code below into the main right hand pane that opens at step 2.
4. Close the Visual Basic window.
5. Enter the formula as shown in the screen shot below and copy down.
6. If using Excel 2007 or later your workbook will need to be saved as a macro-enabled workbook (*.xlsm)

If it fails for any data, please provide samples of that data and the expected results.

Rich (BB code):
Function TruckID(s As String) As String
  With CreateObject("VBScript.RegExp")
    .Pattern = "^(D\*)?(\D+)(0*)(\d+)(\D)?$"
    TruckID = .Replace(Replace(s, " ", ""), "$2$4")
  End With
End Function


Excel Workbook
BC
1Truck IDsExtracted ID
2ACL123ACL123
3D*ACL123ACL123
4ABC 329ABC329
5D*ABC329ABC329
6ABC 35060BABC35060
7D*ABC35060ABC35060
8PVHC20BPVHC20
9D*PVHC20PVHC20
10CC 01116CC1116
11D*CC1116CC1116
12GOPV 036BGOPV36
13D*GOPV36BGOPV36
14PVBL0099PVBL99
15D*PVBL99PVBL99
Truck IDs

Regular expressions look for a 'pattern', defined by the code, in a string.
In a RegEx pattern () divide the pattern into sections and certain characters have special meaning as well as their own (literal) meaning. To identify whether a character represents itself or a special meaning, that character can be preceded by a "\" or not.

The pattern in this RailID function works like this, after any spaces are first removed from the string.

^(B\*)?(\D+)(0*)(\d+)(\D)?$

^ represents the beginning of the string
followed by
(B\*)? the actual letter B followed by an asterisk. The "?" means that the preceding section may or may not occur (once) in the string
followed by
(\D+) \D represents a non-digit, "+" means one or more times
followed by
(0*) the digit zero, "*" means zero or more times.
followed by
(\d+) \d represents a digit, "+" means one or more times
followed by
(\D)? \D represents a non-digit, "?" means zero or 1 times
followed by
$ the end of the string

Finally, the .Replace(...., "$2$4") says that if the pattern is found, then return the 2nd and 4th sections. They are the red sections, being a series of 1 or more non-digits followed by a series of 1 or more digits.

Hopefully from that you can see why not all the "D"s were changed to "B"s. Most of them were preceded by a "\" giving their special meaning as either a non-digit or a digit.
 
Upvote 0
Thank you sheetspread, it's useful and I'm starting to understand how it works, but would appreciate if someone can suggest the pattern I need; I find it easier to "back out" the understanding from the answer!
 
Upvote 0
Jack

Try this pattern for a valid ISIN :-
Code:
([A-Z]{2}[0-9]{9})

Two Alphabetic Characters followed by 9 digits.

Change the pattern to :-
Code:
([A-Z]{2}[0-9]{1,9})
and then if the length of the result is not 11 you have an invalid ISIN.

hth
 
Last edited:
Upvote 0
Hi Mike, thanks for replying.

I found that in the link sheetspread suggested but the difficulty I'm having is the ISIN is in the middle of a longer string and I can't work out how to extract the part between "ANY." and "="

Those two will always wrap around the part which I want to test is an ISIN or not.

Completely made up pysuedo reg ex but what I would like is:
Code:
[(^*ANY.)([A-Z]{2}[0-9]{1,9})(^=.$)]

By which I mean, "drop everything before and including "ANY." and drop everything after and including "="

Whatever's left, I can then test with your suggestion provided. It's the parts I want to drop I'm struggling with and only the terms "ANY." and "=" are consistent in all the strings.

If I wasn't using RegExp then this works:
Code:
Dim str As String: str = "xxxx_xxx_xx::xxx_xxx.ANY.AT000034269=xx.xx"

Debug.Print Mid(str, InStr(str, "ANY.") + Len("ANY."), InStr(str, "=") - InStr(str, "ANY.") - Len("ANY."))
However, I'd rather use RegExp to gain experience of it.
 
Last edited:
Upvote 0
I don't know what an ISIN is, but maybe using ukmikeb's pattern and a look behind is what you need.
Code:
(?<=ANY\.)([A-Z]{2}[0-9]{1,9})
 
Last edited:
Upvote 0
Hi Bruce,

ISIN is an acronym and non-important (more to add colour to the problem). It's the middle part of the string I'm after, that is a look behind "=" And a look after "ANY." i.e. part in blue of string below:

xxxxxxxANY.AB123456789=xxxxx

NB there is a '.' after the word ANY so think that means expression would contain "?>ANY.\." based on your suggestion? Testing on some code now...
 
Last edited:
Upvote 0
No you have to have \. in order to match the period or full stop, whatever you want to call it.
The backslash removes the normal meaning of the period or full stop, which is it matches any character except a new line.
Putting the backslash changes the meaning to mean, match this character literally.
 
Last edited:
Upvote 0
Sorry yes you're right, I got confused in my reply over excluding the \ before the full stop, forgetting that . represents any character so ANY. could mean ANYyyyyyyyyyyy instead of ANY.!

Any thoughts on an expression to extract part in blue though? Also, with the following, I'm getting a 5019 error (application or object defined error) so there's likely something not intialised correctly. Testing code is:
Rich (BB code):
Sub testme()

Dim str As String: str = "xxxx_xxx_xx::xxx_xxx.ANY.AT000034269=xx.xx"

testRegExp str
'Debug.Print Mid(str, InStr(str, "ANY.") + Len("ANY."), InStr(str, "=") - InStr(str, "ANY.") - Len("ANY."))

End Sub
Rich (BB code):
Function testRegExp(ByRef str As String) As String

    With CreateObject("VBScript.RegExp")
        .Pattern = "\[ANY\.](.*?)[=\]"
        testRegExp = .test(str)
        testRegExp = .Execute(str)(0)
    End With

End Function
Blue line is line causing error, though I get the same error with the line below it (i.e. if I comment out the blue line).

Later, I will replace (.*?) with ([A-Z]{2}[A-Z0-9]{1,9})

This link suggests first 2 chars are alpha characters and next 9 can be alpha numeric, rather than only numeric:
https://en.wikipedia.org/wiki/International_Securities_Identification_Number
 
Last edited:
Upvote 0

Forum statistics

Threads
1,214,606
Messages
6,120,483
Members
448,967
Latest member
visheshkotha

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top