Finding text-string matches of X character length in column - Not matching to known text string, looking for any repeated strings

Churchy LaFemme

Board Regular
Joined
Sep 22, 2010
Messages
135
Not sure how to phrase this. Have reviewed prior questions but most seem to be about matching known text.

I have a product column with almost 100,000 entries. Some are straight up duplicates (e.g., "Home Movies") but many included duplicpated text as part of the product name (e.g., "Home Movies Superstars").

I need a formula (although it's not advisable to use a formula on that many rows) that will say TRUE for any line for which the text in column A includes a text string that matches any other cell or cells in column A.

ShowsFormula Results
The Oddball Comedy & Curiosity Festival featuring Flight of the ConchordsTRUE
A Very Conshords ChristmasTRUE
Behind the ConchordsTRUE
Dr. Katz Bikini Beach Party FALSE
Flight of the Conchords, PilotFALSE
Flight of the Conchords, Season 2TRUE
Flight of the Conchords, Season 3TRUE
Flight of the Conchords: Censored TracksTRUE
Home Movies: Brendon Gets RabiesTRUE
Home Movies: Brendon's HatTRUE
Home Movies: Get Away From My MomTRUE
Home Movies: YokoTRUE
Rhys Darby - Conchords Exposed!TRUE
Science Court Outakes and BloopersFALSE
Short PoppiesFALSE
The Black Seeds (Debut)
The Mighty Boosh
FALSE

In this example, both the extact phrase "Conchords" and "Home Movies" have a "TRUE" result. This is fine. My goal at this point is to upload into Access and remove everything that is "FALSE."

Trimming won't help because the duplications may not be the first x characters in the cell. (X is probably 5, but I can play with this if that's too short.)

I have done this in the past by trimming both from left and from right and looking for duplicates, but this was a pain for a few thousand lines. For tens of thousands, a manual process would kill me.

Any ideas?
FALSE




<tbody>
</tbody><colgroup><col><col></colgroup>
 

Excel Facts

How to fill five years of quarters?
Type 1Q-2023 in a cell. Grab the fill handle and drag down or right. After 4Q-2023, Excel will jump to 1Q-2024. Dash can be any character.
So...if ANY word matches, you want "TRUE". We'd have to find a way to not match simple words (The, If, And, Or, Two, Three, Sequel, etc." and then find a way to retroactively match as well. The possibility of false positives is very high in this data set, at least how I envision performing the process.
 
Upvote 0
False positives are going to be a problem, but one I can live with. My only goal here is to knock out the lines that do not have anything matching another line. Prehaps I should have phrased it as looking for lines that are utterly unique.

However, once I have IDed lines that have duplicate strings such as "Home Movies," then I can do match or lookup functions to catalog those.
 
Upvote 0
Code:
Sub findingTextStringMatches()

    Dim strTest As String
    Dim strArray() As String
    Dim intCount As Integer
    
    Dim rowStart As Integer
    Dim rowEnd As Long
    
    rowStart = 2
    rowEnd = ActiveSheet.Cells(Rows.Count, "A").End(xlUp).Row
    
    For x = rowStart To rowEnd Step 1
    
        strTest = Cells(x, 1).Value
        strArray = Split(strTest, " ")
        
        For intCount = LBound(strArray) To UBound(strArray)
            For y = rowStart To rowEnd Step 1
                
                If Trim(strArray(intCount)) = "The" Or Trim(strArray(intCount)) = "On" Or Trim(strArray(intCount)) = "And" Or Trim(strArray(intCount)) = "In" Then
                    Exit For
                ElseIf InStr(Cells(y, 1), Trim(strArray(intCount))) Then
                    If y <> x Then
                        Cells(y, 2).Value = "True"
                    End If
                End If
            Next y


        Next
        
    Next x
End Sub

Try this. It assumes your values are in column A, and your results in column B.
 
Upvote 0
This works! But only on test data. My actual table is so large that Excel dies. I'm going to keep poking at it, though. This is a very exciting Macro.

Thank you.
 
Upvote 0
This code works and I have successfully used it on several data sets to test.

However, there something wrong with my data and it does not work on the full set. Possibly, there is some cruft in there from the original download.

For instance, a cell with z4west shows as "True." However, there is only one instance of z4west on the worksheet.

Thank you for assistance. This is an terrific macro.
 
Upvote 0
I mean I have time to go downstairs and get lasagna, eat the lasagna (with garlic bread), come back up to my office, and there is still an hourglass and I can't save. Had to use a break or a force shutdown.
 
Upvote 0

Forum statistics

Threads
1,214,585
Messages
6,120,391
Members
448,957
Latest member
Hat4Life

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top