clean data - Removing Unwanted characters

John Caines

Well-known Member
Joined
Aug 28, 2006
Messages
1,155
Office Version
  1. 2019
Platform
  1. Windows
Hello All.
I have a problem.:(
I have an excel sheet with approx 30,000 rows of data.
These are all keyword phrases.
They can be related to any subject, but for this example these are related to the root keyword phrase of "car Rent"

I have some software which basically pulls in keyword phrases from search engines and meta tags etc, including misspelled keyword phrases.

My problem;
The data is uncleaned.
In other words there might be I think it's called "Carriage return" data in there, so the row of data might be very deep (Instead of a row height of say 10.5 it could be anything, IE some could be 100 or 200 even).
There are unwanted characters, for example; ()[]{}+?!""^*

(If it could delete all unwanted characters except for letters/digits)
There is a problem I see, that if it removes _ or - between words, that it will join the words together which won't be of any use. If it deletes anything with a letter either side of it or a letter and digit, or 2 digits, 1 either side it would then need to add a space to replace the hyphen.

So for example; if there was a phrase in the list like
car_for rent
if it just removed the underscore, then the phrase would be
carfor rent
Which isn't correct. It would need to replace the underscore with a space.
I hope I'm making sense here:)
So basically I'd love to have if possible a macro button that runs through my entire column of data,
(Always in Column A , on a sheet called "AllKWs", and always starting from row3 downwards.)

If it could go through the list and delete all unwanted characters including double spaces.
So the end result is a keyword phrase list without a lot of junk basically.
After it's gone through the list I suppose it needs to then look at what's left and delete any duplicate phrases last (As once some of these unwanted characters are removed, the keyword list may have duplicates).

Once all this is completed, can a pop up window appear saying something like;
=======================================

Starting No. Phrases: 29,745
Finishing No. Phrases: 29,722
No.Deleted Characters: 12,345
No.Deleted Carriage Returns: 234
No.Deleted Spaces: 235
No.Deleted Duplicates: 23

Time Elapsed: 7.78seconds
======================================
I think that's about it:)
I really hope someone can help me out on this 1.
I can't write this for sure.
Out of my league I'm afraid:(
I hope it is possible as this would be very very useful for me.
Maybe it isn't possible as it is quite complicated.
If someone can have a look at it for me and have a go that would be brilliant


Thanks for your time.
Many Thanks
John Caines
 
Thanks sbendbuckeye

Many Thanks sbendbuckeye,,
I will look at the links.
Yes, this language sounds really powerful.

I'm just guessing here,, as I'm no programmer/coder. I'm a real novice actually.
But I know what I want to do,, but I'm not capable, and probably will never have the skills to program and write code that you and many others here have written.

What I'm guessing is;
The kind of questions and macros I seem to be asking for,, ie, searching for certain characters and not others, deleting some but not others etc,
this seems to be exactly what RE is built for?? And as such would; I'm guessing again; return results a lot quicker than say a formula written in an ordinary vb language??

So,, Regular Expression Language, usually equals speed??
A kind of VB language on steroids?:)

I'll read the link now sbendbuckeye.
I have to go to work in a couple of hours though.
I have sent you my workbook, as it looks now, and the keyword list.
Again, many thanks for your time
A very grateful
John Caines
 
Upvote 0

Excel Facts

Show numbers in thousands?
Use a custom number format of #,##0,K. Each comma after the final 0 will divide the displayed number by another thousand
Hello John,

Try posting this code into a Module in your workbook. While in the VB Editor, click Debug, Compile VBA Project to see if it compiles cleanly. If so, try the following:

A. Change this line to the first row of data you want to cleanse:
Private Const m_FIRST_DATA_ROW As Long = 3
B. Change this line to match the column you want to cleanse:
Private Const m_COLUMN_TO_TEST As String = "B"
C. Ensure that this line matches exactly the sheet name you want to cleanse: With ActiveWorkbook.Sheets("AllKWS")
Code:
Option Explicit

Private Const m_FIRST_DATA_ROW As Long = 3

Private Const m_COLUMN_TO_TEST As String = "B"

' Note: Requires a reference to Microsoft VBScript Regular Expressions 5.5

' \s{2,}
'
' Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.) «\s{2,}»
'    Between 2 and unlimited times, as many times as possible, giving back as needed (greedy) «{2,}»
Private Const m_REGEX_PATTERN_MULTIPLE_WHITESPACE As String = "\s{2,}"

' [^a-zA-Z0-9 ]+
'
' Match a single character NOT present in the list below «[^a-zA-Z0-9 ]+»
'    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
'    A character in the range between "a" and "z" «a-z»
'    A character in the range between "A" and "Z" «A-Z»
'    A character in the range between "0" and "9" «0-9»
'    The character " " « »
Private Const m_REGEX_PATTERN_ALPHANUMERIC_ONLY As String = "[^a-zA-Z0-9 ]+"

Public Sub CleanData()
    On Error Resume Next
    Dim regEx As Object
    Dim strValue As String
    Dim lngRow As Long, lngEndRow As Long
    Set regEx = CreateObject("VBScript.RegExp")
    With regEx
        .MultiLine = True
        .Global = True
    End With
    With ActiveWorkbook.Sheets("AllKWS")
        lngEndRow = .Cells(.Rows.Count, m_COLUMN_TO_TEST).End(xlUp).Row
        For lngRow = m_FIRST_DATA_ROW To lngEndRow
            If lngRow Mod 10 = 0 Then Application.StatusBar = "Processing row " & lngRow
            regEx.Pattern = m_REGEX_PATTERN_ALPHANUMERIC_ONLY
            ' Are there any non alphanumeric characters in the current cell.
            If regEx.Test(.Cells(lngRow, m_COLUMN_TO_TEST).Value) Then
                ' Replace all non alphanumeric characters with a space.
                strValue = regEx.Replace(.Cells(lngRow, m_COLUMN_TO_TEST).Value, " ")
                regEx.Pattern = m_REGEX_PATTERN_MULTIPLE_WHITESPACE
                If regEx.Test(strValue) Then
                    ' Convert all multiple whitespaces to a single space.
                    strValue = Trim$(regEx.Replace(strValue, " "))
                End If
                .Cells(lngRow, m_COLUMN_TO_TEST).Value = Trim$(strValue)
                .Rows(lngRow).AutoFit
            End If
        Next lngRow
    End With
    Application.StatusBar = False
    Set regEx = Nothing
    If Err.Number Then Err.Clear
    On Error GoTo 0
End Sub 'CleanData
 
Upvote 0
many thanks-will try now

Many thanks for the reply sbendbuckeye.

I will try this now,, and post back as soon as I've looked at it closely
(Hopefully about 1hr):)

Many thanks
John Caines
 
Upvote 0
Hello John

I know that I told you privately that I was going to 'bow out' on this but with a little lateral thinking I've come up with a revised code:

Code:
Sub strp2()
Dim lastrow As Long, X As Variant, i As Long, j As Integer, r As Range, iRow As Long
Dim A As Integer, t As Date, Replaced As Variant, Spaces As Variant
Dim Returns As Variant, nDelRow As Long
t = Now()
Replaced = 0
Spaces = 0
Returns = 0
lastrow = Cells(Rows.Count, 1).End(xlUp).Row
Application.ScreenUpdating = False
Application.Cursor = xlWait
Set r = Range(Cells(3, 1), Cells(lastrow, 1))
For i = 3 To lastrow
    Application.StatusBar = Format(100 * i / (lastrow - 2), "0") & "% done..."
    X = Cells(i, 1).Value
    For j = 1 To Len(X)
        A = Asc(Mid(X, j, 1))
        Select Case A
            Case Is < 32
                Cells(i, 1).Value = Application.Substitute(Cells(i, 1).Value, Chr(A), Chr(32))
                Replaced = Replaced + 1
                If A = 10 Or A = 13 Then Returns = Returns + 1
            Case 33 To 47
                Cells(i, 1).Value = Application.Substitute(Cells(i, 1).Value, Chr(A), Chr(32))
                Replaced = Replaced + 1
            Case 58 To 63
                Cells(i, 1).Value = Application.Substitute(Cells(i, 1).Value, Chr(A), Chr(32))
                Replaced = Replaced + 1
            Case 91 To 96
                Cells(i, 1).Value = Application.Substitute(Cells(i, 1).Value, Chr(A), Chr(32))
                Replaced = Replaced + 1
            Case Is > 122
                Cells(i, 1).Value = Application.Substitute(Cells(i, 1).Value, Chr(A), Chr(32))
                Replaced = Replaced + 1
        End Select
    Next j
    X = Cells(i, 1).Value
    Cells(i, 1).Value = Application.Trim(X)
    Spaces = Spaces + Len(X) - Len(Cells(i, 1).Value)
'special case tilde ~ as first character
If Left(Cells(i, 1).Value, 1) = "~" Then Cells(i, 1).Value = Right(Cells(i, 1).Value, Len(Cells(i, 1).Value) - 1)
Next i
r.Sort Key1:=Range("A3"), Order1:=xlAscending, Header:=xlGuess
For iRow = r.Rows.Count To 3 Step -1
    If r(iRow) = r(iRow - 1) Then
        r(iRow).EntireRow.Delete
        nDelRow = nDelRow + 1
    End If
Next iRow
lastrow = Cells(Rows.Count, 1).End(xlUp).Row
For i = 1 To lastrow
    Rows(i).EntireRow.AutoFit
Next i
Application.ScreenUpdating = True
Application.Cursor = xlNormal
Application.StatusBar = False
Spaces = Spaces - Replaced
Replaced = Replaced - Returns
MsgBox ("Number of phrases: " & vbTab & vbTab & lastrow - 2 _
& vbCrLf & "Deleted characters: " & vbTab & Replaced _
& vbCrLf & "Deleted carriage returns: " & vbTab & Returns _
& vbCrLf & "Deleted spaces: " & vbTab & vbTab & Spaces _
& vbCrLf & "Deleted rows: " & vbTab & vbTab & nDelRow _
& vbCrLf & "Time elapsed: " & vbTab & vbTab & Format(Now() - t, "hh:mm:ss"))
End Sub

I've found it to be a lot faster than my previous efforts, plus it gets rid of the ~ problem.

For anybody still following this, why is

Code:
                Cells(i, 1).Value = Application.Substitute(Cells(i, 1).Value, Chr(A), Chr(32))

faster than

Code:
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
:confused:

Please try it, John, and let us know how you get on. :)
 
Upvote 0
update to this macro conundrum :)

Thanks everyone for there help on this macro.

First off,
I had just finished this reply,,,,
I mean I was typing in the "return message box",
It was a large reply.
Took me almost an hour to write an in depth reply to cover everything,,, and guess what??

Yes,, I've lost the bloody lot!:(
Can't believe it!
So,,, I'm typing this again,,, in TextPad first:)


I can't bring myself to do this in such depth again:)

A few points though..

sbendbuckeyes formula first,,
Written I believe in "Regular Expression".
Someone has written a reply to me as I asked about this and this is a short extract from it;
=========================================
"Regular expressions can do very complex string recognition, but processing time ranges from slow to hideously slow depending on the implementation. What you are doing doesn’t need them.
=========================================

Sbendbuckeyes formula returned the clean data perfectly. I mean the end result was so so clean.
I suppose the end result is the most important thing,, but there are a few issues compared to the other formulas by SHG, and VoGII.

1. Speed. It took approx 8minutes and 36seconds to run.
2. Didn't delete duplicates
3.No pop up info window at the end.

In no way am I criticizing here let me state. I have a huge respect for all you coders out there. I can't write this type of thing, and I admire you all, and also how genuinely helpful everyone is here.
What I'm just try to do is make some fair comparisons between the 3 formulas of
sbendbuckeyes
SHG's
VoGII's.

I must admit. I'd heard this "Reg Exp" language was powerful, and as such, I just assumed it would be quickest actually.
So, to sum up,,
sbendbuckeyes formula;
The end result of data seems absolutely perfect.
Runs quite slow though, and doesn't delete the duplicates.

Onto VoGII's new post.
Hello VoGII.
I saw your post when I got in at 5am this morning.
I had to try it.
Does have a problem though.
A few points;;;
It runs quite quickly. It took about 4minute 16 seconds (So about 100% quicker than the Reg Expression formula)
It does delete the duplicates.
Has pop up info window at the end.

I know you had an issue finding the ~ character VoGII with your previous formula,,, but this new post has found this also.
The data returned looks perfect also.
I can't see any errors actually.

there is a problem though VoGII;
After the macro has finished,
It's altered my rows 1 and 2.
In Row 1 I have a header Graphic,
Row 2 has Title headings and macro buttons.
What the macro has done ,, it's adjusted the row heights of Row 1 to 10.5 and row to to 14.25.
Row height of 1 was originally 120 (For the inserted jpeg header)
Row 2 was 21.

So,, to sum up,

Cleans data perfectly now also like sbendbuckeyes,
deletes duplicates also,
pop up info box
and quite quick.
Downside,, just an error in it altering Rows 1 and 2.

Now onto SHG
Well, SHG has just got back to me with version 2 of his macro.
I'll post it at the end of this.

Well,,, You just have to try it.
I think he's inserted a line of code that says "Nitro Supercharger mode":)
It's quick, real quick.

I know nothing about coding as I've stated many a time,, but you might want to take a look at this, as I don't know how he's written this, but it returns the data perfectly, and in a very quick time.

Time to run: 73 seconds!
Deletes duplicates and pop up info box.

There is 1 error which I've just emailed SHG about;
It takes out my cell "A2" the heading which is "Keyword Phrase",, which I hope he can fix.

Again, I have no idea as to how he has increased the speed of this by about 300% from his previous version, but it is very quick.

So, it looks like this post is coming to an end.
I do thank everyone for there efforts.
It's great to see how all these different ideas come in and how things just seem to evolve.
I'll post SHG's formula now for people to decipher.

Again,
many thanks to you all for your help on this post
I am very very grateful.
A very thankful
John Caines
Code:
Option Explicit 
 
Sub CleanEm() 
    Dim r As Range 
    Dim iRow As Long, lRowBeg As Long, lRowEnd As Long 
    Dim s As String 
    Dim iChr As Integer 
    Dim nDelChr As Long, nDelCR As Long, nDelRow As Long, nDelSp As Long 
    Const sFmt As String = "#,##0" 
    Dim t As Date 
     
    t = Now 
    lRowEnd = Range("A65536").End(xlUp).Row 
    lRowBeg = Cells(lRowEnd, 1).End(xlUp).Row 
    Set r = Range(Cells(lRowBeg, 1), Cells(lRowEnd, 1)) 
     
    Application.ScreenUpdating = False 
     
    For iRow = 1 To r.Rows.Count 
        s = r(iRow) 
        For iChr = 1 To Len(s) 
            Select Case Mid(s, iChr, 1) 
            Case "A" To "Z", "a" To "z", "0" To "9", " " 
                 ' do nothing
            Case vbLf ' replace with space
                s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1) 
                nDelCR = nDelCR + 1 
            Case Else ' not alphanumeric - replace with space
                s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1) 
                nDelChr = nDelChr + 1 
            End Select 
        Next 
        r(iRow) = Application.Trim(s) ' remove leading, trailing, and multiple spaces
         
        If iRow Mod 1000 = 0 Then 
            Application.ScreenUpdating = True 
            ActiveWindow.ScrollRow = iRow 
            Application.ScreenUpdating = False 
        End If 
         
        nDelSp = nDelSp + Len(s) - Len(r(iRow, 1)) 
    Next 
     
     ' sort to locate and delete duplicates
    With ActiveWorkbook.Worksheets(ActiveSheet.Name).Sort 
        .SortFields.Clear 
        .SortFields.Add Key:=r, SortOn:=xlSortOnValues, _ 
        Order:=xlAscending, DataOption:=xlSortNormal 
        .SetRange r 
        .Header = xlNo 
        .MatchCase = False 
        .Orientation = xlTopToBottom 
        .SortMethod = xlPinYin 
        .Apply 
    End With 
     
    For iRow = r.Rows.Count To 2 Step -1 
        If r(iRow) = r(iRow - 1) Then 
             'Application.ScreenUpdating = True
             'ActiveWindow.ScrollRow = r(iRow).Row
            r(iRow).EntireRow.Delete 
             'Application.ScreenUpdating = False
            nDelRow = nDelRow + 1 
        End If 
    Next 
     
    Application.ScreenUpdating = True 
    ActiveSheet.UsedRange 
     
    MsgBox _ 
    "Starting No Phrases:      " & Format(lRowEnd - lRowBeg + 1, sFmt) & vbLf _ 
    & "Deleted Duplicates:       " & Format(-nDelRow, sFmt) & vbLf _ 
    & "Finishing No. Phrases:    " & Format(r.Rows.Count, sFmt) & vbLf _ 
    & "Deleted Characters:       " & Format(nDelChr, sFmt) & vbLf _ 
    & "Deleted Carriage Returns: " & Format(nDelCR, sFmt) & vbLf _ 
    & "Deleted Spaces:           " & Format(nDelSp, sFmt) & vbLf _ 
    & "Elapsed time [s]:         " & Format(86400# * (Now - t), "0.0") 
End Sub


conunderum
 
Upvote 0
Hello John.

To correct the problem with rows 1 and 2 with my code, near the end change

Code:
For i = 1 To lastrow
    Rows(i).EntireRow.AutoFit
Next i

to

Code:
For i = 3 To lastrow
    Rows(i).EntireRow.AutoFit
Next i
 
Upvote 0
Thanks VoGII

Hello VoGII
I'll try this now and get back to you to see how it works.
Have to go to work later so,, really pushed for time:(

Have you looked at SHG's formula VoGII??

It seemed when I first started this post that your formula and SHG's were quite similar?
Can you see how he has speeded it up?

I don't have a clue, but is it just like adjusting say 1 or 2 lines of code and that makes all the difference??

Just wondered that's all.

I'll try your adjusted lines of code now VoGII.
Many Thanks
John Caines
 
Upvote 0
VoGII-Update

Hello VoGII,
Adjusted the code as you said,

Runs perfectly now:)

Many Thanks,,, funny how just 1 really small adjustment makes such a difference:)

Many Thanks
John aAines
 
Upvote 0

Forum statistics

Threads
1,215,297
Messages
6,124,113
Members
449,142
Latest member
championbowler

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top