just extract the alphanumeric characters

vacation

Board Regular
Joined
Dec 6, 2003
Messages
56
Hi,

In column C, there are over 30,000 items of data that look like this:

...
R#CS2719==
R#CS2653=
R#CS2720=
R#CS2654=
R#CS2770===
BMHH235.5===#
BMFA63023.1=%
BMFA63023===)
BMHP286.1===$
BMHP286=====!
BMRS192A1===~
BMFD280.1===#
BMFD281.1===#
BMHD196======
BMRS283=====(
BMH34641====~
...

They contain all kinds of non-alphanumeric characters.
Note: All possible non-alphanumeric characters are not shown above.
The above is just a small sample.

I would like column E to contain data from the corresponding cells in
column C but exclude all the non-alphanumeric characters.
So the above column C data would appear in column E as:

...
RCS2719
RCS2653
RCS2720
RCS2654
RCS2770
BMHH2355
BMFA630231
BMFA63023
BMHP2861
BMHP286
BMRS192A1
BMFD2801
BMFD2811
BMHD196
BMRS283
BMH34641
...


What combination of nested functions should I assign to each cell in column E to achieve this?


Thanks.
 
Hi Dave,

No reason to dump all kinds of capability into one function.

To get a specific occurence, use INDEX as in:

=INDEX(regexpfind(F15,"(\d+)+"),2)

To combine all the elements into a single cell use MCONCAT (from Laurent Longre's add-in at http://longre.free.fr/english/ Alternatively, use the foll. function, which lacks even the most basic error trapping:
Code:
Function VBAJoin(anArr, Optional Delim As String = "")
    VBAJoin = Join(anArr, Delim)
    End Function
So, you can do something like =VBAJoin(RegExpFind(F15,"(\d+)+")) Note that the Join function is not available in older versions of VBA, though I don't remember whether it was introduced with 2000 or 2002.

Compile time directives are a rather rudimentary capability implemented in VBA some versions ago. You would use something like:

Code:
#Const LateBind = True

Function RegExpSubstitute(ReplaceIn, _
        ReplaceWhat As String, ReplaceWith As String)
    #If Not LateBind Then
    Dim RE As RegExp
    Set RE = New RegExp
    #Else
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
        #End If
    '...
    End Function
Check VBA help for more.
brettdj said:
RegExpFind does return all matches. Use it as an array formula. For example if F15 contains 9809abc123, then select G15:H15 and array-enter =RegExpFind(F15,"(\d+)+") to get both sets of numbers.

Thats a clever way of returning the matches discretely, I understand now why you created the array of matches

My comment was aimed more so at being able to return the entire collection of matches (ie 9809123) or any particular match (either 9809 or 123) in a single cell.

Have you tried writing a UDF to capture submatches as well?

I use a compile-time constant to switch to late-binding for released software

Can you please expand further on this?

Cheers

Dave
 
Upvote 0

Excel Facts

Get help while writing formula
Click the italics "fx" icon to the left of the formula bar to open the Functions Arguments dialog. Help is displayed for each argument.
tusharm said:
Hi Dave,

No reason to dump all kinds of capability into one function.

Fair enough, you have listed a couple of methods to achieve this with your existing function. I thought this step worthwhile as I've often seen people just wanting the total stripped down result in one cell rather than each portion.

Compile time directives are a rather rudimentary capability implemented in VBA some versions ago. You would use something like:

Code:
#Const LateBind = True

Function RegExpSubstitute(ReplaceIn, _
        ReplaceWhat As String, ReplaceWith As String)
    #If Not LateBind Then
    Dim RE As RegExp
    Set RE = New RegExp
    #Else
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
        #End If
    '...
    End Function
Check VBA help for more.

Thanks for the pointer on this

Cheers

Dave
 
Upvote 0
Here’s another alternative based on postings from Tom Oglivy and JE McGimpsey relating to the trapping of illegal characters in file names. When used to remove unwanted characters in a string, the UDF, as amended, has certain advantages, and one disadvantage, relative to previous solutions posted to this thread.
Code:
Public Function cleanString(fNameStr As String)
Dim i As Integer

Const BadChars = "=+/'<|>*?).,#%~!($[]÷™©"
   
   For i = 1 To Len(BadChars)
     fNameStr = Application.WorksheetFunction.Substitute(fNameStr, _
               Mid(BadChars, i, 1), "")
   Next i
   cleanString = fNameStr

End Function
Calling the UDF:

=cleanString(A1)

Retain certain non-alphanumeric characters

The constant “BadChars” includes all characters, including symbols, to be removed from a string. Conversely, a string will not be cleansed of a character(s) if that character is omitted from the constant e.g. if you want to retain the trade mark symbol (Alt 0153) or the copyright symbols (Alt 0169) then simply remove those character(s) from the constant. This feature is particularly useful if you want to retain say commas and periods in a string.

Amend the constant to exclude the comma and period, then try the UDF with:

##?!Happ??y Bir#=thday=% to( you==, John. Con==$gratuations=== on your++== 21st birthda==++?y===!.

(BTW, two of the previous solutions, as presented, do not remove the “+” from the above string, and one solution returns “HBJC21”).

Spaces

With this string:

Happ??y Bir#=thday=% to( you==

The above UDF will return “Happy Birthday to you”

Two previous solutions return HappyBirthdaytoyou (i.e. all spaces are eliminated), and one solution returns “HB” (no quotes).

To eliminate spaces with the UDF, simply use the spacebar to add a space to the constant.

Leading Characters

All solutions remove leading non-numeric characters.

Numbers

With numbers prefixed by an apostrophe, the solutions from brettdj (Dave) and Yogi will return a true number. However, while the UDFs remove the apostrophe, the “numbers” are still returned as text unless the double unary operator is used e.g.

=--cleanstring(A1)
=--RegExpSubstitute(A1,"[^\w+]","")


Regards,

Mike
 
Upvote 0
At the very least, I would pass the BadCharacters 'constant' as a 2nd parameter. And, take it a step further and pass the target string as a 3rd parameter. I'm surprised Tom and JE would hardcode such information -- or was that part of some modifications you made? That way the function is more generic in nature.

At a more fundamental level, there is clearly some confusion between the method illustrated in the regular expression functions I shared and their application to the specific problem.

The regexp function works as intended. To do something different, one needs to change the regular expression pattern. For example, ^\w+ specifically removes everything other than a-z, A-Z, 0-9, and underscore. So, the fact that it removes spaces is not a disadvantage or a problem, but an intended result. To change the specifications and preserve spaces, one needs to change the associated regexp pattern.

To get RegExpSubstitute to do the equivalent of the cleanString function use ReplaceWhat="=+/'<|>*?).,#%~!($[]÷™©" and ReplaceWith="" Or put those constants in worksheet cells and use cell references.

Cleaning a text string is only one of a myriad of functions that can be performed by just those two functions RegExpFind and RegExpSubstitute. For other capabilities search this forum or the google.com archives of the XL newsgroups for related posts. And, even then what you will find will be only a fraction of the possibilities -- limited both by my finding open questions to answer and my (relatively) basic knowledge about regexp patterns.

Ekim said:
Here’s another alternative based on postings from Tom Oglivy and JE McGimpsey relating to the trapping of illegal characters in file names. When used to remove unwanted characters in a string, the UDF, as amended, has certain advantages, and one disadvantage, relative to previous solutions posted to this thread.
Code:
Public Function cleanString(fNameStr As String)
Dim i As Integer

Const BadChars = "=+/'<|>*?).,#%~!($[]÷™©"At 
   
   For i = 1 To Len(BadChars)
     fNameStr = Application.WorksheetFunction.Substitute(fNameStr, _
               Mid(BadChars, i, 1), "")
   Next i
   cleanString = fNameStr

End Function
Calling the UDF:

=cleanString(A1)

Retain certain non-alphanumeric characters

The constant “BadChars” includes all characters, including symbols, to be removed from a string. Conversely, a string will not be cleansed of a character(s) if that character is omitted from the constant e.g. if you want to retain the trade mark symbol (Alt 0153) or the copyright symbols (Alt 0169) then simply remove those character(s) from the constant. This feature is particularly useful if you want to retain say commas and periods in a string.

Amend the constant to exclude the comma and period, then try the UDF with:

##?!Happ??y Bir#=thday=% to( you==, John. Con==$gratuations=== on your++== 21st birthda==++?y===!.

(BTW, two of the previous solutions, as presented, do not remove the “+” from the above string, and one solution returns “HBJC21”).

Spaces

With this string:

Happ??y Bir#=thday=% to( you==

The above UDF will return “Happy Birthday to you”

Two previous solutions return HappyBirthdaytoyou (i.e. all spaces are eliminated), and one solution returns “HB” (no quotes).

To eliminate spaces with the UDF, simply use the spacebar to add a space to the constant.

Leading Characters

All solutions remove leading non-numeric characters.

Numbers

With numbers prefixed by an apostrophe, the solutions from brettdj (Dave) and Yogi will return a true number. However, while the UDFs remove the apostrophe, the “numbers” are still returned as text unless the double unary operator is used e.g.

=--cleanstring(A1)
=--RegExpSubstitute(A1,"[^\w+]","")


Regards,

Mike
 
Upvote 0
I've put together a paper with an attached Excel workbook example on using Regular Expressions with Excel, see http://www.vbaexpress.com/forum/showthread.php?t=226

Code examples include

1. Extracting the numeric portion of a string.
2. Reversing the order of a persons name.
3. Parsing a single cell street address into street number, address, direction & street type.
4. Testing whether the first three characters at the start of a string are found at the end of a string in reverse order
5. Converting the nth cell reference in a range of Excel Formulae from relative to absolute reference.

Cheers

Dave
 
Upvote 0
Hello Brettdj,

I found you explanations and links very helpful. I am curious, why did you use this line instead of just checking column c?

Set Myrange = Intersect(ActiveSheet.UsedRange, ActiveSheet.Columns("C"))
 
Upvote 0
Hey Brett,

Thanks for posting the code example. I have just recently started exploring Regular Expressions because I have some complex formulas that I have to parse. The end result will be a replacable string I can copy down the range to react to row insertions and deletions. You saved me at least the first bit of work as you are pulling in addresses.
 
Upvote 0
Hello David,

I responded to your post earlier today using quick reply when it came up on a search. I didn't realize my response would be so far removed from your Dec 9, 2003 post. A couple questions for you:

1. Why do you use the intersection method? Won't special cells return all of the forumulas without needing the rest?

Set Myrange = Intersect(Selection,
Selection.Cells.SpecialCells(xlFormulas))

'Equivalent ???
Set Myrange = Selection.Cells.SpecialCells(xlFormulas)

2. How do I exclude a valid address embedded in a string?

=a2&"a3+a4"&a5 will return =a2&"$a$3+a4"&a5 using your example posted Dec 9, 2003. If preceeding double quote count is even then I am not currently in a string, or can you say something like [^"."]? Thanks for your patience, I am pretty new to this and learned a lot by watching your example work. Thanks again!
 
Upvote 0
Hi there

I don’t come by here very often so I missed your first two posts :)

1) Laziness on my behalf. I edited the intersect which I'd used elsewhere to pull the formulas out of a column without simplifying it as I should have

2) I think it’s easiest to use two RegExp's here. The first uses the Replace method to cull all the enclosed internal string text. The second uses Execute to pull out the valid cell references.

I made your example a little longer to ensure it worked on multiple internal


Code:
Sub RegExp_Late_Execute_1()
'Late binding
'Dimension the RegExp objects
Dim RegEx As Object, RegMatchCollection As Object, RegMatch As Object
Dim TestStr As String, OutPutStr As String, NewStr As String

    ' create the RegExp Object with late binding
    Set RegEx = CreateObject("vbscript.regexp")

    TestStr = "a2&""a3+a4""&a5 & a12&""a13+a14""&a15"

    ' set the RegExp parameters
    With RegEx
        'look for global matches
        .Global = True
        'ignore case as string has lower case cell addresses :)
        .ignorecase = True
        .Pattern = """.+?"""
        NewStr = .Replace(TestStr, "")
        .Pattern = "(\$?)([A-Z]{1,2})(\$?)(\d{1,5})"
        Set RegMatchCollection = .Execute(NewStr)
    End With

    'Loop through each match in the string and concatenate them
    For Each RegMatch In RegMatchCollection
        OutPutStr = OutPutStr & RegMatch & vbNewLine
    Next
    MsgBox OutPutStr

    Set RegMatchCollection = Nothing
    Set RegEx = Nothing

End Sub

I updated my RegExp example list a while back @ http://www.vbaexpress.com/forum/showthread.php?t=226

  • 1. Extracting the numeric portion of a string.
    2. Reversing the order of a persons name.
    3. Testing the validity of an email address (assuming simple contsraints) and then parsing it.
    4. Testing whether the first three characters at the start of a string are found at the end of a string in reverse order.
    5. Converting the nth cell reference in a range of Excel Formulae from relative to absolute reference.
    6. Testing validity of a numeric string input.
    7. Removing back to back alphanumeric strings.

Best wishes

Dave
 
Upvote 0

Forum statistics

Threads
1,216,099
Messages
6,128,818
Members
449,469
Latest member
Kingwi11y

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top