RegEx criteria - remove punctuation, name honorifics, suffixes

Dr. Demento

Well-known Member
Joined
Nov 2, 2010
Messages
618
Office Version
  1. 2019
  2. 2016
Platform
  1. Windows
I am comparing a large dataset by concatenating non-unique identifiers. However, when it comes to names, I'm not getting matches that I should. One of the issues is punctuation and name suffixes (LNAME, Jr., etc.). I thought a RegEx would do the trick, but I'm not that proficient yet.

My goal is to remove all punctuation from a string (easy; see below) but in also exclude certain name honorifics/suffixes (Mr, Mrs, Ms, Jr, Sr, III, etc) AND to keep my pipe-delimiter in the string (desired format: LASTNAME|FIRSTNAME). I know this is used as an OR in RegEx, but hopefully there's a workaround.

Thanks y'all.

Code:
Function RemovePunctuation(rts As String) As String
' ~~ Remove punctuation
' [URL]https://www.extendoffice.com/documents/excel/3296-excel-remove-all-punctuation.html[/URL]  
  With CreateObject("VBScript.RegExp")
    .Pattern = "[^a-zA-Z0-9 ]"
    .IgnoreCase = True
    .Global = True
    RemovePunctuation = .Replace(rts, vbNullString)
  End With
  
End Function
 

Excel Facts

Can a formula spear through sheets?
Use =SUM(January:December!E7) to sum E7 on all of the sheets from January through December
Can you post a sample of your data as it would appear with the honorific and one without and how you would like them to appear after the function is run?
 
Last edited:
Upvote 0
Here's a list of inputs/outputs. It contains not only suffixes, honorifics, but also punctuation. Please note, an apostrophe could be (' or `) or a dash could be en dash, em dash, or horizontal bar, so I'm hoping the function would strip any/all punctuation rather than identifying specifics.

O'Malley|Mike --> OMALLEY|MIKE
Smythe-Wilson|John --> SMYTHEWILSON|JOHN
West, Jr.|Phillip --> WEST|PHILLIP
Thomas III|Aaron --> THOMAS|AARON
Adams|Dr. Greg --> ADAMS|GREG

Does that help?
 
Upvote 0
Sorry this took so long. After a few attempts at this, this is what I came up with:

Code:
Function FirstLast(r As String) As String
Dim t, u, v As String
t = Split(r, "|")
u = Split(t(0))(0)
t(0) = UCase(u)
u = Split(t(1))
t(1) = UCase(u(UBound(u)))
FirstLast = RemovePunctuation(Join(t, "|"))
End Function

Private Function RemovePunctuation(s As String) As String
With CreateObject("vbscript.regexp")
    .Pattern = "[^A-Z\|]"
    .ignorecase = True
    .Global = True
    RemovePunctuation = .Replace(s, "")
End With
End Function


Excel 2010
AB
1O'Malley|MikeOMALLEY|MIKE
2Smythe-Wilson|JohnSMYTHEWILSON|JOHN
3West, Jr.|PhillipWEST|PHILLIP
4Thomas III|AaronTHOMAS|AARON
5Adams|Dr. GregADAMS|GREG
Sheet1
Cell Formulas
RangeFormula
B1=firstlast(A1)
B2=firstlast(A2)
B3=firstlast(A3)
B4=firstlast(A4)
B5=firstlast(A5)
 
Last edited:
Upvote 0
Much appreciated, Scott!! That worked great.

I did make one change; if A2 = Smythe Wilson|John (sans hyphen), the function returned SMYTHE|JOHN. My solution was to run it thru the original string (r) thru the RemovePunctuation function PRIOR to splitting it for the first time. It now looks like this:

Code:
Function FirstLast(r As String) As String
Dim t, u, v As String

[B][COLOR=#0000FF]r = RemovePunctuation(r)[/COLOR][/B]
t = Split(r, "|")
u = Split(t(0))(0)
t(0) = UCase(u)
u = Split(t(1))
t(1) = UCase(u(UBound(u)))
FirstLast = RemovePunctuation(Join(t, "|"))
End Function

Thanks again!
 
Upvote 0
Looks like I spoke too soon. I need to determine if there's a space before blindly removing punctuation from the original string (r).

New, new code:
Code:
Function FirstLast(r As String) As String
Dim t, u, v As String

[B][COLOR=#0000ff]if instr(r, " ") = 0 then r = RemovePunctuation(r)[/COLOR][/B]
t = Split(r, "|")
u = Split(t(0))(0)
t(0) = UCase(u)
u = Split(t(1))
t(1) = UCase(u(UBound(u)))
FirstLast = RemovePunctuation(Join(t, "|"))
End Function
 
Upvote 0
The problem with that is that it wouldn't be able to differentiate between that and say Thomas III. This is always the problem with trying to parse items that aren't delimited correctly between each piece of data.
 
Upvote 0
I see what you mean.

What do you mean I can't automate EVERYTHING??!!! I don't want to have to think!

*sigh*

Thanks again
 
Upvote 0
The problem with that is that it wouldn't be able to differentiate between that and say Thomas III. This is always the problem with trying to parse items that aren't delimited correctly between each piece of data.
Worse, it screws up names with natural spaces in them. For example, the first name of one woman I worked with before retirement was Mary Ann with the space... the code Dr. D posted strips out the "Ann" part of her name. Another person I worked with back then had this last name... Della Rosa, again, with the space in it... the code Dr. D posted converts that to Rosa. It would seem that without full and proper delimiters, there can be no splitting of names without some kind of dictionary of names (both first and last) to check against or, perhaps smaller, a dictionary of prefixes and suffixes to be removed)
 
Last edited:
Upvote 0
Thanks for the feedback, Rick.

My main intent for this function is a part of a sub that compares two lists by non-unique identifiers (concatenated names and demographics). In my work, I'll often get two individuals submitted to me with different unique identifiers (SSN) but when you compare name/demographics, they are obviously the same person.

Where it was getting jacked was with those individuals who had spaces, punctuation (O'Malley), suffixes (Brown III or Jones Jr), or honorifics (Dr., PhD - which shouldn't be there at all, but data entry QC is not the greatest). My intent was to delete any punctuation, suffixes, and honorifics (which worked fine), but then those folks with natural spaces in their names gacked it all up. My kludge is to do a manual search for spaces in the name and replace it with a '-' if they had two "real" parts to their name (I'm not working with middle names at all, only first/last). Then, using Scott's function, the full name (sans non-letter characters) provides a concatenated "full" name. The fact I had mashed together their name was only to test for duplicates vs uniques, so it wasn't important to anyone but me.
 
Upvote 0

Forum statistics

Threads
1,214,584
Messages
6,120,387
Members
448,957
Latest member
Hat4Life

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top