Regular Expression to find last names & firstname

Magic_Doctor

Board Regular
Joined
Mar 18, 2009
Messages
56
Hello,


I try, by means of regular expressions, to write a function allowing me to extract from a string of characters either last names (all capitalized) or first names (up to 3, as in France for example) all starting with a capital letter followed by lower case letters. Vocal accents will be taken into account (I am not Anglo-Saxon).
My function works almost well. Indeed, when there are 3 first names the third is added to the last name.
Code:
Option Explicit
Dim regEx As Object 'memorization to save time in creating the object ("VBScript.RegExp") 'job75


Function DissectNP(NP As String, x As Byte) As String
'- NP : the string containing the NAME(s) + first name(s)
'- x : if x = 1 --> NAME(s)
'      if x = 2 --> first name(s)


Dim strPattern As String, NBNomsMaj As Integer, NBNomsFirstMaj As Integer
Dim PatternNoms As String, PatternPreNoms As String
    
    Set regEx = CreateObject("VBScript.RegExp")


    NBNomsMaj = NB_MotsMmFirstM(NP, 1) 'number of names in CAPITALS
    NBNomsFirstMaj = NB_MotsMmFirstM(NP, 3) 'number of names beginning with a CAPITAL followed by lowercase
    
    PatternNoms = "([A-ZÈÉÊËÄÇÆŒÁÍÓÚÂÊÎÔÛÑÄËÏÜ\-]*\s)" 'pattern of NAMES
    PatternPreNoms = "([a-zçæéèíóúâêîôûäëïöüñA-ZÈÉÊËÄÇÆŒÁÍÓÚÂÊÎÔÛÄËÏÜÑ\-]*)" 'pattern of last names
    'PatternPreNoms = "([a-zçæéèíóúâêîôûäëïöüñA-ZÈÉÊËÄÇÆŒÁÍÓÚÂÊÎÔÛÄËÏÜÑ\-]*\s)" 'do not work!!
    
    'strPattern = Application.WorksheetFunction.Rept(PatternNoms, NBNomsMaj) & PatternPreNoms & "( )" & PatternPreNoms
    strPattern = Application.WorksheetFunction.Rept(PatternNoms, NBNomsMaj) & PatternPreNoms & IIf(NBNomsFirstMaj = 1, "", "( )" & PatternPreNoms)
    'strPattern = Application.WorksheetFunction.Rept(PatternNoms, NBNomsMaj) & Application.WorksheetFunction.Rept(PatternPreNoms, NBNomsFirstMaj) 'do not work!
    
    'If strPattern <> "" Then 'is it really useful?
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .Pattern = strPattern
        End With


        If regEx.Test(NP) Then
            Select Case x
                Case 1 'NAMES
                    Select Case NBNomsMaj
                        Case 1
                            DissectNP = regEx.Replace(NP, "$1")
                        Case 2
                            DissectNP = regEx.Replace(NP, "$1" & "$2")
                        Case 3
                            DissectNP = regEx.Replace(NP, "$1" & "$2" & "$3")
                    End Select
                Case 2 'last names
                    DissectNP = regEx.Replace(NP, "$" & NBNomsMaj + 1 & IIf(NBNomsFirstMaj = 1, "", "$" & NBNomsMaj + 2 & "$" & NBNomsMaj + 3)) 'bricolage...
            End Select
        Else
            DissectNP = "Not matched"
        End If
    'End If
End Function


Function NB_MotsMmFirstM(phrase As String, x As Byte) As Integer
'Returns the number of words, in UPPERCASE or lowercase or starting with a UPPERCASE followed by lowercase, of a sentence
'Magic_Doctor
'- phrase : the string of characters studied
'- x : if x = 1 --> words in CAPITALS
'      if x = 2 --> words in lowercases
'      if x = 3 --> words starting with a CAPITAL followed by lowercases


Dim tmp, nbmots As Integer, mot() As String, i As Integer, lemot As String, n As Byte, nbmm As Integer


    tmp = Split(phrase, " ")
    nbmots = UBound(tmp) + 1 'total number of words (whatever they are) in the sentence
    mot = Split(phrase)
    
    For i = 1 To nbmots
        lemot = mot(i - 1) 'Option Base 0
        Select Case x
            Case 1 'words in CAPITALS
                n = IIf(UCase(lemot) = lemot, 1, 0)
            Case 2 'words in lowercases
                n = IIf(LCase(lemot) = lemot, 1, 0)
            Case 3 'words starting with a CAPITAL followed by lowercases
                n = IIf(Left(lemot, 1) = UCase(Left(lemot, 1)) And Mid(lemot, 2, 1) = LCase(Mid(lemot, 2, 1)), 1, 0)
        End Select
        nbmm = nbmm + n
    Next
    
    NB_MotsMmFirstM = nbmm
End Function


Thank you for helping me solve this problem.
 
Last edited:

Excel Facts

What is the shortcut key for Format Selection?
Ctrl+1 (the number one) will open the Format dialog for whatever is selected.
Hi.

I think you'd get better answers if you gave us some of your data - examples of where the code was working, where it wasn't etc, rather than just the code...
 
Upvote 0
Hello PaddyD,


Unfortunately I can not attach my Excel sheet.
The function works in almost all cases, except one, when there are 3 first names.
Let's take an example where the function does not work:
"CARRERA Mónica Pitaluga Diabólica"
I get for the name: "CARRERA Diabólica"
with 2 spaces between CARRERA and Diabólica, whereas there should be only CARRERA.
I get for the first names: "Mónica Pitaluga Diabólica" (it's OK).
 
Upvote 0
Yes, I get it!

Last names can be any length (compound last names).
The number of first names is unlimited.
Code:
Option Explicit
Dim regEx As Object 'memorization to save time in creating the object ("VBScript.RegExp") 'job75


Function DissectNP(NP As String, x As Byte) As String
'- NP : the string containing the NAME(s) + first name(s)
'- x : if x = 1 --> LAST NAME(s)
'      if x = 2 --> first name(s)
'Magic_Doctor


Dim strPattern As String, LongueurNBNomsMaj As Integer, LongueurNBNomsFirstMaj As Integer
Dim PatternNoms As String, PatternPreNoms As String
    
    Set regEx = CreateObject("VBScript.RegExp")


    LongueurNBNomsMaj = NB_MotsMmFirstM(NP, 1, 2) 'number of names in CAPITALS
    LongueurNBNomsFirstMaj = NB_MotsMmFirstM(NP, 3, 2) 'number of names beginning with a CAPITAL followed by lowercase
    
    PatternNoms = "([A-ZÈÉÊËÄÇÆŒÁÍÓÚÂÊÎÔÛÄËÏÜÑ\-\s]{" & LongueurNBNomsMaj + 1 & "})" 'pattern of NAMES
    PatternPreNoms = "([a-zçæéèíóúâêîôûäëïöüñA-ZÈÉÊËÄÇÆŒÁÍÓÚÂÊÎÔÛÄËÏÜÑ\-\s]{" & LongueurNBNomsFirstMaj & "})" 'pattern of last names
    strPattern = PatternNoms & PatternPreNoms
    
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = strPattern
    End With
        
    If regEx.test(NP) Then
        Select Case x
            Case 1 'NAMES
                DissectNP = regEx.Replace(NP, "$1")
            Case 2 'last names
                DissectNP = regEx.Replace(NP, "$2")
        End Select
    Else
        DissectNP = "Not matched"
    End If
End Function


Function NB_MotsMmFirstM(phrase As String, x As Byte, Optional longueur As Byte = 0) As Integer
'Returns the number of words, in UPPERCASE or lowercase or starting with a UPPERCASE followed by lowercase, of a sentence
'Magic_Doctor
'- phrase : the string of characters studied
'- x: si = 1 --> words in CAPITALS
'      si = 2 --> words in lowercases
'      si = 3 --> words starting with a CAPITAL followed by lowercases
'- longueur: optional: "0" by default ---> the function then returns the sum of the lengths of all the words
'                      if longueur = 1 ---> sum of the lengths of all the searched words (without spaces)
'                      if longueur = 2 ---> sum of the lengths of all the searched words (with spaces)


Dim tmp, nbmots As Integer, mot() As String, i As Integer, lemot As String, n As Byte, nbmm As Integer
Dim longword As Byte, longallwords As Byte


    tmp = Split(phrase, " ")
    nbmots = UBound(tmp) + 1 'total number of words (whatever they are) in the sentence
    mot = Split(phrase)
    
    For i = 1 To nbmots
        lemot = mot(i - 1) 'Option Base 0
        Select Case x
            Case 1 'words in CAPITALS
                n = IIf(UCase(lemot) = lemot, 1, 0)
                longword = IIf(UCase(lemot) = lemot, Len(lemot), 0)
            Case 2 'words in lowercases
                n = IIf(LCase(lemot) = lemot, 1, 0)
                longword = IIf(LCase(lemot) = lemot, Len(lemot), 0)
            Case 3 'words starting with a CAPITAL followed by lowercases
                n = IIf(Left(lemot, 1) = UCase(Left(lemot, 1)) And Mid(lemot, 2, 1) = LCase(Mid(lemot, 2, 1)), 1, 0)
                longword = IIf(Left(lemot, 1) = UCase(Left(lemot, 1)) And Mid(lemot, 2, 1) = LCase(Mid(lemot, 2, 1)), Len(lemot), 0)
        End Select
        nbmm = nbmm + n 'total number of words
        longallwords = longallwords + longword 'sum of the lengths of all the words
    Next
    
    NB_MotsMmFirstM = IIf(longueur = 0, nbmm, IIf(longueur = 1, longallwords, longallwords + nbmm - 1))
End Function
 
Upvote 0
Here's another approach. This also handles apostrophes and hyphens in the last names, as shown by the examples in the test routine.
Code:
Public Sub Test()

    Dim nameString As String, parsed As String
    
    nameString = "CARRERA Mónica Pitaluga Diabólica"
    nameString = "D'ABO-CARON Ann-Marie Mónica"
    'nameString = "ÈÉÊËÄÇÆŒÁÍÓÚÂÊÎÔÛÄËÏÜÑ Açæéèíóúâêîôûäëïöüñ"
    Debug.Print nameString
    
    parsed = Get_LastNames(nameString)
    Debug.Print "Last Names: " & parsed
    
    parsed = Get_FirstNames(nameString)
    Debug.Print "First Names: " & parsed
    
End Sub


Private Function Get_LastNames(allNames As String) As String

    Dim regEx As Object
    Dim nameMatches As Object, nameMatch As Object
    
    Set regEx = CreateObject("VBScript.RegExp")
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
    End With
    
    '(              Start of numbered capture group
    '[A-ZÀ-Ÿ'-]+    One or more uppercase letters or apostrophe or hyphen (including letters with an umlaut)
    ')              End of capture group
    '\s+            One or more whitespace
    
    regEx.Pattern = "([A-ZÀ-Ÿ'-]+)\s+"
    Set nameMatches = regEx.Execute(allNames)
    
    If nameMatches.Count > 0 Then
        Get_LastNames = ""
        For Each nameMatch In nameMatches
            Get_LastNames = Get_LastNames & nameMatch.SubMatches(0) & " "
        Next
        Get_LastNames = Left(Get_LastNames, Len(Get_LastNames) - 1)
    Else
        Get_LastNames = "**NOT MATCHED**"
    End If
    
End Function


Private Function Get_FirstNames(allNames As String)
    
    Dim regEx As Object
    Dim nameMatches As Object, nameMatch As Object
    
    Set regEx = CreateObject("VBScript.RegExp")
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
    End With
    
    '\s+            One or more whitespace
    '(              Start of numbered capture group
    '[A-ZÀ-Ÿ]       Exactly one uppercase letter (including letters with an umlaut)
    '[a-zA-ZÀ-ÿ-]+  One or more lowercase or uppercase letters or hyphen (including letters with an umlaut)
    ')              End of capture group
    
    regEx.Pattern = "\s+([A-ZÀ-Ÿ][a-zA-ZÀ-ÿ-]+)"
    Set nameMatches = regEx.Execute(allNames)
    
    If nameMatches.Count > 0 Then
        Get_FirstNames = ""
        For Each nameMatch In nameMatches
            Get_FirstNames = Get_FirstNames & nameMatch.SubMatches(0) & " "
        Next
        Get_FirstNames = Left(Get_FirstNames, Len(Get_FirstNames) - 1)
    Else
        Get_FirstNames = "**NOT MATCHED**"
    End If
    
End Function
 
Last edited:
Upvote 0
Hello John_w,

It is indeed another approach that I will look more closely.
Meanwhile, I had completed my function so that now we can put between the names other than a simple whitespace ("|!", "," * "...).
I also simplified the "Patterns" by taking inspiration from your function.
Code:
Option Explicit
Dim regEx As Object 'memorization to save time in creating the object ("VBScript.RegExp") 'job75


Function DissectNP(NP As String, x As Byte, Optional sep As String = "") As String
'- NP : the string containing the NAME(s) + first name(s)
'- x : if x = 1 --> LAST NAME(s)
'      if x = 2 --> first name(s)
'- sep (optional: "" by default): the type of separator you want to put between the first names, rather than just a whitespace
'Magic_Doctor


Dim strPattern As String, LongueurNBNomsMaj As Integer, LongueurNBNomsFirstMaj As Integer
Dim PatternNoms As String, PatternPreNoms As String, i As Integer
Dim Tableau() As String, lindacadena As String
    
    Set regEx = CreateObject("VBScript.RegExp")


    LongueurNBNomsMaj = NB_MotsMmFirstM(NP, 1, 2) 'sum of the lengths of all LAST NAMES sought with the spaces between them
    LongueurNBNomsFirstMaj = NB_MotsMmFirstM(NP, 3, 2) 'sum of the lengths of all the first names sought with the spaces separating them
    
    PatternNoms = "([A-ZÀ-Ÿ-\s]{" & LongueurNBNomsMaj + 1 & "})" '"Pattern" of the LAST NAME
    PatternPreNoms = "([a-zA-ZÀ-ÿ-\s]{" & LongueurNBNomsFirstMaj & "})" '"Pattern" of the first names
    strPattern = PatternNoms & PatternPreNoms 'the complete "Pattern" that we will look for in the chain
    
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = strPattern
    End With


    If regEx.test(NP) Then 'the "pattern" is well found in the chain
        Select Case x 'first names or last name)
            Case 1 'NOMS
                DissectNP = regEx.Replace(NP, "$1") 'we substitute the string by the 1st part of "strPattern" (= "PatternNoms")
            Case 2 'Prénoms
                DissectNP = regEx.Replace(NP, "$2") 'we substitute the string by the second part of "strPattern" (= "PatternPreNoms")
                If sep <> "" Then 'first names are separated by something other than a simple whitespace (default)
                    Tableau = Split(DissectNP, " ")
                    For i = LBound(Tableau) To UBound(Tableau)
                        lindacadena = lindacadena & Tableau(i) & sep
                    Next
                    DissectNP = Left(lindacadena, Len(lindacadena) - Len(sep)) 'we remove the unnecessary separator at the end of the chain
                End If
        End Select
    Else 'the pattern is not found in the chain
        DissectNP = "Not matched"
    End If
    Set regEx = Nothing
End Function
'


Function NB_MotsMmFirstM(phrase As String, x As Byte, Optional longueur As Byte = 0) As Integer
'Returns the number of words, in UPPERCASE or lowercase or starting with a UPPERCASE followed by lowercase, of a sentence
'Magic_Doctor
'- phrase : the string of characters studied
'- x: if = 1 --> words in CAPITALS
'     if = 2 --> words in lowercases
'     if = 3 --> words starting with a CAPITAL followed by lowercases
'- longueur: optional: "0" by default ---> the function then returns the sum of the lengths of all the words
'                      if longueur = 1 ---> sum of the lengths of all the searched words (without whitespaces)
'                      if longueur = 2 ---> sum of the lengths of all the searched words (with whitespaces)


Dim tmp, nbmots As Integer, mot() As String, i As Integer, lemot As String, n As Byte, nbmm As Integer
Dim longword As Byte, longallwords As Byte


    tmp = Split(phrase, " ")
    nbmots = UBound(tmp) + 1 'total number of words (whatever they are) in the sentence
    mot = Split(phrase)
    
    For i = 1 To nbmots
        lemot = mot(i - 1) 'Option Base 0
        Select Case x
            Case 1 'words in CAPITALS
                n = IIf(UCase(lemot) = lemot, 1, 0)
                longword = IIf(UCase(lemot) = lemot, Len(lemot), 0)
            Case 2 'words in lowercases
                n = IIf(LCase(lemot) = lemot, 1, 0)
                longword = IIf(LCase(lemot) = lemot, Len(lemot), 0)
            Case 3 'words starting with a CAPITAL followed by lowercases
                n = IIf(Left(lemot, 1) = UCase(Left(lemot, 1)) And Mid(lemot, 2, 1) = LCase(Mid(lemot, 2, 1)), 1, 0)
                longword = IIf(Left(lemot, 1) = UCase(Left(lemot, 1)) And Mid(lemot, 2, 1) = LCase(Mid(lemot, 2, 1)), Len(lemot), 0)
        End Select
        nbmm = nbmm + n 'total number of words
        longallwords = longallwords + longword 'sum of lengths of all searched words (without any whitespaces)
    Next
    
    NB_MotsMmFirstM = IIf(longueur = 0, nbmm, IIf(longueur = 1, longallwords, longallwords + nbmm - 1))
End Function
 
Upvote 0

Forum statistics

Threads
1,214,784
Messages
6,121,535
Members
449,037
Latest member
tmmotairi

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top