Understanding/using RegExp for string processing in VBA

JackDanIce

Well-known Member
Joined
Feb 3, 2010
Messages
9,604
Office Version
365
Platform
Windows
Hi,

Imported data (csv file) has a column of strings and I need to extract 11 characters after a particular value, as long as those 11 characters are alpha-numeric, i.e. A-Z, a-z and 0-9 (no special characters).

The string possibly contains an ISIN code value preceded by the term "ANY." (constant in every string).

Example with required extraction in blue:
Rich (BB code):
RMDS_IDN_EM::IDN_RDF.ANY.AT000034269=TE.NaE
(I already have a check function to verify it's an ISIN and generate a 12th digit check digit)

However, the column can contain false ISIN values (in red) such as:
Rich (BB code):
RMDS_IDN_EM::IDN_RDF.ANY.ITEEU3Y=TE.NaE
I can already extract the 11 characters using a combination of MID and FIND and currently my code eliminates any extraction that contains the character "=".


I'd like to learn how Reg Exp object can be used to extract this value, I suspect it will shorten the code as well for the matching part.

Can you suggest the required code please?

Thank you in advance,
Jack

Alternatively, every string contains one and only one instance of "=", so another test might be extract all the characters between "ANY." and "=" (discard row if length <> 11 or contains symbols not alpha-numeric)
 
Last edited:

Some videos you may like

Excel Facts

Convert text numbers to real numbers
Select a column containing text numbers. Press Alt+D E F to quickly convert text to numbers. Faster than "Convert to Number"

sheetspread

Well-known Member
Joined
Sep 19, 2005
Messages
5,117
See if this helps:

You could try this user-defined function. To implement ..
1. Right click the sheet name tab and choose "View Code".
2. In the Visual Basic window use the menu to Insert|Module
3. Copy and Paste the code below into the main right hand pane that opens at step 2.
4. Close the Visual Basic window.
5. Enter the formula as shown in the screen shot below and copy down.
6. If using Excel 2007 or later your workbook will need to be saved as a macro-enabled workbook (*.xlsm)

If it fails for any data, please provide samples of that data and the expected results.

Rich (BB code):
Function TruckID(s As String) As String
  With CreateObject("VBScript.RegExp")
    .Pattern = "^(D\*)?(\D+)(0*)(\d+)(\D)?$"
    TruckID = .Replace(Replace(s, " ", ""), "$2$4")
  End With
End Function

Excel Workbook
BC
1Truck IDsExtracted ID
2ACL123
ACL123
3D*ACL123ACL123
4ABC 329ABC329
5D*ABC329ABC329
6ABC 35060BABC35060
7D*ABC35060ABC35060
8PVHC20BPVHC20
9D*PVHC20PVHC20
10CC 01116CC1116
11D*CC1116CC1116
12GOPV 036BGOPV36
13D*GOPV36BGOPV36
14PVBL0099PVBL99
15D*PVBL99PVBL99
Truck IDs
Regular expressions look for a 'pattern', defined by the code, in a string.
In a RegEx pattern () divide the pattern into sections and certain characters have special meaning as well as their own (literal) meaning. To identify whether a character represents itself or a special meaning, that character can be preceded by a "\" or not.

The pattern in this RailID function works like this, after any spaces are first removed from the string.

^(B\*)?(\D+)(0*)(\d+)(\D)?$

^ represents the beginning of the string
followed by
(B\*)? the actual letter B followed by an asterisk. The "?" means that the preceding section may or may not occur (once) in the string
followed by
(\D+) \D represents a non-digit, "+" means one or more times
followed by
(0*) the digit zero, "*" means zero or more times.
followed by
(\d+) \d represents a digit, "+" means one or more times
followed by
(\D)? \D represents a non-digit, "?" means zero or 1 times
followed by
$ the end of the string

Finally, the .Replace(...., "$2$4") says that if the pattern is found, then return the 2nd and 4th sections. They are the red sections, being a series of 1 or more non-digits followed by a series of 1 or more digits.

Hopefully from that you can see why not all the "D"s were changed to "B"s. Most of them were preceded by a "\" giving their special meaning as either a non-digit or a digit.
 

JackDanIce

Well-known Member
Joined
Feb 3, 2010
Messages
9,604
Office Version
365
Platform
Windows
Thank you sheetspread, it's useful and I'm starting to understand how it works, but would appreciate if someone can suggest the pattern I need; I find it easier to "back out" the understanding from the answer!
 

ukmikeb

Well-known Member
Joined
Jul 10, 2009
Messages
2,757
Jack

Try this pattern for a valid ISIN :-
Code:
([A-Z]{2}[0-9]{9})
Two Alphabetic Characters followed by 9 digits.

Change the pattern to :-
Code:
([A-Z]{2}[0-9]{1,9})
and then if the length of the result is not 11 you have an invalid ISIN.

hth
 
Last edited:

JackDanIce

Well-known Member
Joined
Feb 3, 2010
Messages
9,604
Office Version
365
Platform
Windows
Hi Mike, thanks for replying.

I found that in the link sheetspread suggested but the difficulty I'm having is the ISIN is in the middle of a longer string and I can't work out how to extract the part between "ANY." and "="

Those two will always wrap around the part which I want to test is an ISIN or not.

Completely made up pysuedo reg ex but what I would like is:
Code:
[(^*ANY.)([A-Z]{2}[0-9]{1,9})(^=.$)]
By which I mean, "drop everything before and including "ANY." and drop everything after and including "="

Whatever's left, I can then test with your suggestion provided. It's the parts I want to drop I'm struggling with and only the terms "ANY." and "=" are consistent in all the strings.

If I wasn't using RegExp then this works:
Code:
Dim str As String: str = "xxxx_xxx_xx::xxx_xxx.ANY.AT000034269=xx.xx"

Debug.Print Mid(str, InStr(str, "ANY.") + Len("ANY."), InStr(str, "=") - InStr(str, "ANY.") - Len("ANY."))
However, I'd rather use RegExp to gain experience of it.
 
Last edited:

skywriter

Well-known Member
Joined
Feb 15, 2014
Messages
1,642
I don't know what an ISIN is, but maybe using ukmikeb's pattern and a look behind is what you need.
Code:
(?<=ANY\.)([A-Z]{2}[0-9]{1,9})
 
Last edited:

JackDanIce

Well-known Member
Joined
Feb 3, 2010
Messages
9,604
Office Version
365
Platform
Windows
Hi Bruce,

ISIN is an acronym and non-important (more to add colour to the problem). It's the middle part of the string I'm after, that is a look behind "=" And a look after "ANY." i.e. part in blue of string below:

xxxxxxxANY.AB123456789=xxxxx

NB there is a '.' after the word ANY so think that means expression would contain "?>ANY.\." based on your suggestion? Testing on some code now...
 
Last edited:

skywriter

Well-known Member
Joined
Feb 15, 2014
Messages
1,642
No you have to have \. in order to match the period or full stop, whatever you want to call it.
The backslash removes the normal meaning of the period or full stop, which is it matches any character except a new line.
Putting the backslash changes the meaning to mean, match this character literally.
 
Last edited:

JackDanIce

Well-known Member
Joined
Feb 3, 2010
Messages
9,604
Office Version
365
Platform
Windows
Sorry yes you're right, I got confused in my reply over excluding the \ before the full stop, forgetting that . represents any character so ANY. could mean ANYyyyyyyyyyyy instead of ANY.!

Any thoughts on an expression to extract part in blue though? Also, with the following, I'm getting a 5019 error (application or object defined error) so there's likely something not intialised correctly. Testing code is:
Rich (BB code):
Sub testme()

Dim str As String: str = "xxxx_xxx_xx::xxx_xxx.ANY.AT000034269=xx.xx"

testRegExp str
'Debug.Print Mid(str, InStr(str, "ANY.") + Len("ANY."), InStr(str, "=") - InStr(str, "ANY.") - Len("ANY."))

End Sub
Rich (BB code):
Function testRegExp(ByRef str As String) As String

    With CreateObject("VBScript.RegExp")
        .Pattern = "\[ANY\.](.*?)[=\]"
        testRegExp = .test(str)
        testRegExp = .Execute(str)(0)
    End With

End Function
Blue line is line causing error, though I get the same error with the line below it (i.e. if I comment out the blue line).

Later, I will replace (.*?) with ([A-Z]{2}[A-Z0-9]{1,9})

This link suggests first 2 chars are alpha characters and next 9 can be alpha numeric, rather than only numeric:
https://en.wikipedia.org/wiki/International_Securities_Identification_Number
 
Last edited:

Watch MrExcel Video

Forum statistics

Threads
1,095,929
Messages
5,447,361
Members
405,447
Latest member
WPY

This Week's Hot Topics

Top