Find, remove matching text from string

Jeff P

Board Regular
Joined
Jun 11, 2011
Messages
69
In column C, i have a Text string spanning several hundred rows.
In columns ("H:M") are a list of names, that are repeated down.

I'm in need of a VBA script that will search through the text string, for key words from ("H:M") and remove all except those matching words.

As an example to go from this.

C ........................................................... H..... I..... J....K.....L....M
Bob and Mary, went to a dinner party...........Bob Fred Mary John Tim Sam
Sam met with John...................................Bob Fred Mary John Tim Sam

to this:

C........................................................... H..... I..... J....K.....L....M
Bob and Mary, went to a dinner party...........Bob Fred Mary
Sam met with John...................................Bob Fred Mary John Tim Sam


Thanks for any help!

Jeff
 

Peter_SSs

MrExcel MVP, Moderator
Joined
May 28, 2005
Messages
42,734
Office Version
365
Platform
Windows
This sort of text comparison can be tricky.
To clarify, what results would you expect from these 2 rows given that in row 4 John and Fred are not whole words, but neither really is Mary in row 5.

Excel Workbook
CDEFGHIJKLM
4Sam & Mary Johnson went to FredericktonBobFredMaryJohnTimSam
5Sam & Mary's dog had pupsBobFredMaryJohnTimSam
Extract names



Edit: .. and another one

Excel Workbook
CDEFGHIJKLM
6John-Bob is elevenBobFredMaryJohnTimSam
Extract names
 
Last edited:

Jeff P

Board Regular
Joined
Jun 11, 2011
Messages
69
Hi Peter_SSs, Thanks for the reply!

Lol, you highlighted a dilemma i've been struggling with for quite a while, extracting names from text.

My project involves searching through and identifying major names within a large body of text. Ive ran into a few problems whereby a particular name was part of another word. ( as you've pointed out)
So far my solution has been to search "exact" matching names. In a case like Fredrickson, Ive been doing a search on "Fred ", with a space at the end. Works fine, as long as its not followed by a punctuation. Either way, Ive been making alterations along the way for those sticky cases.

So to answer your question, If possible to search for Exact matches.. Unless you can think of a better solution?

Thanks again for the reply Peter!


Jeff
 
Last edited:

Peter_SSs

MrExcel MVP, Moderator
Joined
May 28, 2005
Messages
42,734
Office Version
365
Platform
Windows
A few more questions

1. Can you confirm that the names down columns H:M are identical in each row or might some rows have a different set of names?

2. Assuming the names are identical in each row,do you actually have, or need to have, them physically all the way down? Could they just be listed once somewhere - say P1:U1? I'm not exactly sure how I will attack this problem so I'm not sure if it would be best to have them repeated down the rows or just in a single spot so asking the question in case. :)

3. Do you actually want the names to remain in their original columns or would you want or accept results like this?

<b>Excel 365 (Windows) 32 bit</b><table cellpadding="2.5px" rules="all" style=";background-color: rgb(255, 255, 255);border: 1px solid;border-collapse: collapse; border-color: rgb(182, 170, 166)"><colgroup><col width="25px" style="background-color: rgb(240, 224, 224)" /><col /><col /><col /><col /><col /><col /><col /><col /></colgroup><thead><tr style=" background-color: rgb(240, 224, 224);text-align: center;color: rgb(32, 17, 22)"><th></th><th>C</th><th>D</th><th>E</th><th>F</th><th>G</th><th>H</th><th>I</th><th>J</th></tr></thead><tbody><tr ><td style="color: rgb(32, 17, 22);text-align: center;">1</td><td style=";">Text</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Names</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(32, 17, 22);text-align: center;">2</td><td style=";">Bob and Mary, went to a dinner party</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Bob</td><td style=";">Mary</td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(32, 17, 22);text-align: center;">3</td><td style=";">Sam met with John</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Sam</td><td style=";">John</td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(32, 17, 22);text-align: center;">4</td><td style=";">Sam & Mary Johnson went to Frederickton</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Sam</td><td style=";">Mary</td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(32, 17, 22);text-align: center;">5</td><td style=";">Sam & Mary's dog had pups</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Sam </td><td style=";">Mary</td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(32, 17, 22);text-align: center;">6</td><td style=";">John-Bob is eleven</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(32, 17, 22);text-align: center;">7</td><td style=";">John-Bob is eleven & Fred, Tim & Sam are ten</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Fred</td><td style=";">Tim</td><td style=";">Sam</td></tr></tbody></table><p style="width:7.8em;font-weight:bold;margin:0;padding:0.2em 0.6em 0.2em 0.5em;border: 1px solid rgb(182, 170, 166);border-top:none;text-align: center;background-color: rgb(240, 224, 224);color: rgb(32, 17, 22)">Extract names</p><br /><br />

Edit: One more

4. What about repeated names? Any different result?
eg "Mary & Bob went out. Only Mary came home."
 
Last edited:

Jeff P

Board Regular
Joined
Jun 11, 2011
Messages
69
Hi Peter_SSs, Thank you very much again on this.

To answer your questions.

1. The names will always be identical all the way down the rows. Ive indicated only columns H through M, for simplicity sake. My final need will vary and could encompass 50 or more columns of names in Row1, but always begin in column H.

If you can imagine a header of 50 or more names in row 1, beginning at H. and paragraphs of text containing those names throughout columns C2:C5000. What I'm hoping to achieve is a kind of visual grid point where they intersect letting me know what names are inside all that text.

2. I have a master list of names in another sheet, in a column, that is continually updated. New names added, others removed. Its sorted alphabetically, which means their name positions in the column varies. I will create a script to copy this column and paste /transpose into row H1 of my working sheet, acting as a header. I have another script will copy/paste this header row down to beyond the last cell of information in Column 1, creating identical copies.

My need was to remove all the names that are not found within the paragraphs of text in column C2:C5000, leaving only those visible inside.

I realize my method of copying identical rows is a very sloppy way to achieve what I'm after. However, If there's a way to simply reference a single column of names and achieve the same grid like positioning where the order of names in a column are identical the order of the header. that would be great!. And If you need to access another sheet, or work beyond P1:U1 this is perfectly fine.

3. The names will need to remain in their original column, to maintain the accurate positioning reflected in the header order.

4. As for repeated names? The names in the header top row are unique and will only appear once. In the paragraphs of text C1, there will likely be multiple instances. If the code is designed to remove (or find) all names but those indicated in the header list, would multiple instances in the paragraphs of text matter? But not knowing much in terms of VBA code myself, If multiple instances appear in the paragraphs of text, from C1, only one unique name should be present in its corresponding header column/row.

Sorry for the long winded reply.
Trying to give as much clarity as possible.

Thank you very much again!


Jeff
 

Akuini

Well-known Member
Joined
Feb 1, 2016
Messages
2,484
Office Version
365
Platform
Windows
I will create a script to copy this column and paste /transpose into row H1 of my working sheet, acting as a header. I have another script will copy/paste this header row down to beyond the last cell of information in Column 1, creating identical copies.
Hi, Jeff

Actually in col H onwards you only need the header (in row 1), you can leave row 2 downward empty.

Example:

BEFORE:
<b>Excel 2013/2016</b><table cellpadding="2.5px" rules="all" style=";background-color: rgb(255,255,255);border: 1px solid;border-collapse: collapse; border-color: rgb(187,187,187)"><colgroup><col width="25px" style="background-color: rgb(218,231,245)" /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /></colgroup><thead><tr style=" background-color: rgb(218,231,245);text-align: center;color: rgb(22,17,32)"><th></th><th>C</th><th>D</th><th>E</th><th>F</th><th>G</th><th>H</th><th>I</th><th>J</th><th>K</th><th>L</th><th>M</th></tr></thead><tbody><tr ><td style="color: rgb(22,17,32);text-align: center;">1</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Adrien Blaze</td><td style=";">Boone</td><td style=";">Dax Dominik</td><td style=";">Ruben</td><td style=";">Zaiden</td><td style=";">Zein</td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">2</td><td style=";">Adrien Ruben, Adrien Blaze Malaki Dax Leandro</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">3</td><td style=";">Maximo  Frederick Blaze Anson</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">4</td><td style=";">Bowen Dax Dominik, Rubens zein </td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">5</td><td style=";">Boone Dominik Connor ,Zein</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">6</td><td style=";">Cory Zeins deBoone</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr></tbody></table><p style="width:4.8em;font-weight:bold;margin:0;padding:0.2em 0.6em 0.2em 0.5em;border: 1px solid rgb(187,187,187);border-top:none;text-align: center;background-color: rgb(218,231,245);color: rgb(22,17,32)">Sheet2</p><br /><br />

AFTER
<b>Excel 2013/2016</b><table cellpadding="2.5px" rules="all" style=";background-color: rgb(255,255,255);border: 1px solid;border-collapse: collapse; border-color: rgb(187,187,187)"><colgroup><col width="25px" style="background-color: rgb(218,231,245)" /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /></colgroup><thead><tr style=" background-color: rgb(218,231,245);text-align: center;color: rgb(22,17,32)"><th></th><th>C</th><th>D</th><th>E</th><th>F</th><th>G</th><th>H</th><th>I</th><th>J</th><th>K</th><th>L</th><th>M</th></tr></thead><tbody><tr ><td style="color: rgb(22,17,32);text-align: center;">1</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Adrien Blaze</td><td style=";">Boone</td><td style=";">Dax Dominik</td><td style=";">Ruben</td><td style=";">Zaiden</td><td style=";">Zein</td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">2</td><td style=";">Adrien Ruben, Adrien Blaze Malaki Dax Leandro</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Adrien Blaze</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Ruben</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">3</td><td style=";">Maximo  Frederick Blaze Anson</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">4</td><td style=";">Bowen Dax Dominik, Rubens zein </td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Dax Dominik</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Zein</td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">5</td><td style=";">Boone Dominik Connor ,Zein</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Boone</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style=";">Zein</td></tr><tr ><td style="color: rgb(22,17,32);text-align: center;">6</td><td style=";">Cory Zeins deBoone</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td></tr></tbody></table><p style="width:4.8em;font-weight:bold;margin:0;padding:0.2em 0.6em 0.2em 0.5em;border: 1px solid rgb(187,187,187);border-top:none;text-align: center;background-color: rgb(218,231,245);color: rgb(22,17,32)">Sheet2</p><br /><br />

THE CODE:
Code:
[FONT=lucida console][color=Royalblue]Sub[/color] a1111478a()
[i][color=seagreen]'https://www.mrexcel.com/forum/excel-questions/1111478-find-remove-matching-text-string.html[/color][/i]
[color=Royalblue]Dim[/color] i [color=Royalblue]As[/color] [color=Royalblue]Long[/color], j [color=Royalblue]As[/color] [color=Royalblue]Long[/color], n [color=Royalblue]As[/color] [color=Royalblue]Long[/color], rc [color=Royalblue]As[/color] [color=Royalblue]Long[/color]
[color=Royalblue]Dim[/color] va, vb, vx
[color=Royalblue]Dim[/color] regEx [color=Royalblue]As[/color] [color=Royalblue]Object[/color]

n = Range([color=brown]"C"[/color] & Rows.count).[color=Royalblue]End[/color](xlUp).Row
rc = Cells([color=crimson]1[/color], Columns.count).[color=Royalblue]End[/color](xlToLeft).Column - [color=crimson]7[/color]

va = Range([color=brown]"C2:C"[/color] & n)
vx = Range([color=brown]"H1"[/color]).Resize([color=crimson]1[/color], rc)
[color=Royalblue]ReDim[/color] vb([color=crimson]1[/color] [color=Royalblue]To[/color] UBound(va, [color=crimson]1[/color]), [color=crimson]1[/color] [color=Royalblue]To[/color] UBound(vx, [color=crimson]2[/color]))

[color=Royalblue]For[/color] i = [color=crimson]1[/color] [color=Royalblue]To[/color] UBound(va, [color=crimson]1[/color])
    [color=Royalblue]For[/color] j = [color=crimson]1[/color] [color=Royalblue]To[/color] UBound(vx, [color=crimson]2[/color])

       [color=Royalblue]Set[/color] regEx = CreateObject([color=brown]"VBScript.RegExp"[/color])
       
        [color=Royalblue]With[/color] regEx
            .[color=Royalblue]Global[/color] = [color=Royalblue]True[/color]
            .MultiLine = [color=Royalblue]True[/color]
            .ignorecase = [color=Royalblue]True[/color]
            .pattern = [color=brown]"\b"[/color] & vx([color=crimson]1[/color], j) & [color=brown]"\b"[/color]
        [color=Royalblue]End[/color] [color=Royalblue]With[/color]

        [color=Royalblue]If[/color] regEx.test(va(i, [color=crimson]1[/color])) [color=Royalblue]Then[/color]
            vb(i, j) = vx([color=crimson]1[/color], j)
        [color=Royalblue]End[/color] [color=Royalblue]If[/color]

    [color=Royalblue]Next[/color]
[color=Royalblue]Next[/color]

Range([color=brown]"H2"[/color]).Resize(UBound(vb, [color=crimson]1[/color]), UBound(vb, [color=crimson]2[/color])) = vb

[color=Royalblue]End[/color] [color=Royalblue]Sub[/color][/FONT]
 

Peter_SSs

MrExcel MVP, Moderator
Joined
May 28, 2005
Messages
42,734
Office Version
365
Platform
Windows
I also have a Regular Expression approach but first some comments.

As already mentioned, & you are well aware, text comparisons are difficult to get 100% right. We also have very little idea of what the extent of variation in your data is so some of this is just guessing. However, it might shed some light or help develop a solution that better meets your needs.

Firstly, both Akuini's code and mine below, only require the 'names list' once in row 1 H1:??1 so unless you need them in all the other rows first for some reason, don't bother

I tested Akuini's code on the following sample data and by my understanding it produced correct results except for the coloured cells which are all part of longer or compound 'words'. This is due to how regular expressions define 'word boundaries'. Apostrophes are a particular problem as ideally we would like to treat the one in row 5 as a word boundary but not treat the one in row 10 as a word boundary.

Excel Workbook
CHIJKLMN
1TextBobFredMaryJohnTimSamConnor
2Bob and Mary, went to a dinner partyBobMary
3Sam met with JohnJohnSam
4Sam & Mary Johnson went to FredericktonMarySam
5Sam & Mary's dog had pupsMarySam
6John-Bob is eleven & Fred, Tim & Sam are tenBobFredJohnTimSam
7Mary & Bob went out. Only Mary came home.BobMary
8Tom John-Jones is hereJohn
9Connor WatsonConnor
10Bob O'ConnorBobConnor
Extract names (1)



My code produces this result instead, so you can still see the problem with the blue cell, but it eliminates the green ones. I'm assuming they should be eliminated?

Excel Workbook
CHIJKLMN
1TextBobFredMaryJohnTimSamConnor
2Bob and Mary, went to a dinner partyBobMary
3Sam met with JohnJohnSam
4Sam & Mary Johnson went to FredericktonMarySam
5Sam & Mary's dog had pupsMarySam
6John-Bob is eleven & Fred, Tim & Sam are tenFredTimSam
7Mary & Bob went out. Only Mary came home.BobMary
8Tom John-Jones is here
9Connor WatsonConnor
10Bob O'ConnorBobConnor
Extract names 2


My final comment is that this code is much more efficient. I tested with about 900 rows and 7 columns and this code took less than 0.01 seconds whereas the other code took a little over 4 seconds. With your 5,000 and up to 50 columns, that time different would be increased significantly I imagine.

Rich (BB code):
Sub Get_Names()
  Dim RX As Object, d As Object
  Dim a As Variant, b As Variant, mtch
  Dim lr As Long, cols As Long, i As Long

  Set d = CreateObject("Scripting.Dictionary")
  d.compareMode = 1
  cols = Cells(1, Columns.Count).End(xlToLeft).Column - 7
  For i = 1 To cols
    d(Cells(1, i + 7).Value) = i
  Next i
  Set RX = CreateObject("VBScript.RegExp")
  RX.Global = True
  RX.IgnoreCase = True
  RX.MultiLine = True
  RX.Pattern = "\b(" & Join(Application.Index(Cells(1, 8).Resize(, cols).Value, 1, 0), "|") & ")\b"
  lr = Cells(Rows.Count, 3).End(xlUp).Row
  a = Range("C2:C" & lr).Value
  ReDim b(1 To UBound(a), 1 To cols)
  For i = 1 To UBound(a)
    For Each mtch In RX.Execute(Replace(a(i, 1), "-", "5"))
      b(i, d(CStr(mtch))) = mtch
    Next mtch
  Next i
  Range("H2").Resize(UBound(b), cols).Value = b
End Sub

Anyway, give them both a try and post back with any problems or if you think we need to do something different.
 

Jeff P

Board Regular
Joined
Jun 11, 2011
Messages
69
Wow, Thank you very much for this!

Both codes worked perfectly. I couldn't see a difference in the results between them on my working sheets. ( other than the code from Akuini, adhering to the capitalization of names from the header.)

On my large sheets, where my header names reached to over 130, and the paragraph texts, more than 1000 rows,.. Peter_SSs, your code was lighting fast!,.. near instantaneous.

Amazing ..

Thanks again for the help .. and patience!
Cheers!!


Jeff
 

Peter_SSs

MrExcel MVP, Moderator
Joined
May 28, 2005
Messages
42,734
Office Version
365
Platform
Windows
.. adhering to the capitalization of names from the header.
Are the headers
- all upper case or
- all lower case or
- all proper case or
- could some names be upper and some proper etc and you want to match whatever is there?

Also, would your heading names ever include a hyphenated name (eg "John-Bob" or "Ashley-Cooper")? If so my code needs a few more tweaks.
 
Last edited:

Jeff P

Board Regular
Joined
Jun 11, 2011
Messages
69
Hi Peter_SSs. Thanks for the additional inquiry.

The comment I made regarding Akuini's code was merely a noticeable difference between his and yours. In my test sheet example, the first letter of each name in the Header was capitalized, which also reflected in their corresponding column throughout. For cosmetic purposes this would be great, however the code you provided works perfectly for my needs! Also, the text and names never include hyphens, so its all good!

Many thanks again, Peter_SSs

Cheers!

Jeff
 

Forum statistics

Threads
1,085,894
Messages
5,386,603
Members
402,009
Latest member
dcouto

Some videos you may like

This Week's Hot Topics

Top