clean data - Removing Unwanted characters

John Caines

Well-known Member
Joined
Aug 28, 2006
Messages
1,155
Office Version
  1. 2019
Platform
  1. Windows
Hello All.
I have a problem.:(
I have an excel sheet with approx 30,000 rows of data.
These are all keyword phrases.
They can be related to any subject, but for this example these are related to the root keyword phrase of "car Rent"

I have some software which basically pulls in keyword phrases from search engines and meta tags etc, including misspelled keyword phrases.

My problem;
The data is uncleaned.
In other words there might be I think it's called "Carriage return" data in there, so the row of data might be very deep (Instead of a row height of say 10.5 it could be anything, IE some could be 100 or 200 even).
There are unwanted characters, for example; ()[]{}+?!""^*

(If it could delete all unwanted characters except for letters/digits)
There is a problem I see, that if it removes _ or - between words, that it will join the words together which won't be of any use. If it deletes anything with a letter either side of it or a letter and digit, or 2 digits, 1 either side it would then need to add a space to replace the hyphen.

So for example; if there was a phrase in the list like
car_for rent
if it just removed the underscore, then the phrase would be
carfor rent
Which isn't correct. It would need to replace the underscore with a space.
I hope I'm making sense here:)
So basically I'd love to have if possible a macro button that runs through my entire column of data,
(Always in Column A , on a sheet called "AllKWs", and always starting from row3 downwards.)

If it could go through the list and delete all unwanted characters including double spaces.
So the end result is a keyword phrase list without a lot of junk basically.
After it's gone through the list I suppose it needs to then look at what's left and delete any duplicate phrases last (As once some of these unwanted characters are removed, the keyword list may have duplicates).

Once all this is completed, can a pop up window appear saying something like;
=======================================

Starting No. Phrases: 29,745
Finishing No. Phrases: 29,722
No.Deleted Characters: 12,345
No.Deleted Carriage Returns: 234
No.Deleted Spaces: 235
No.Deleted Duplicates: 23

Time Elapsed: 7.78seconds
======================================
I think that's about it:)
I really hope someone can help me out on this 1.
I can't write this for sure.
Out of my league I'm afraid:(
I hope it is possible as this would be very very useful for me.
Maybe it isn't possible as it is quite complicated.
If someone can have a look at it for me and have a go that would be brilliant


Thanks for your time.
Many Thanks
John Caines
 

Excel Facts

Is there a shortcut key for strikethrough?
Ctrl+S is used for Save. Ctrl+5 is used for Strikethrough. Why Ctrl+5? When you use hashmarks to count |||| is 4, strike through to mean 5.
Try this for a start. It is somewhat inefficient so it might take a while to run. As there is no Undo I suggest that you run it on a copy of your sheet. There is no summary or elimination of duplicates yet (I thought that the dups problem was dealt with in an earlier thread but maybe I'm getting confused :oops: ).

Code:
Sub strp()
Dim lastrow As Integer, X As Variant, i As Long, j As Integer, c As String, A As Integer
lastrow = Cells(Rows.Count, 1).End(xlUp).Row
For i = 3 To lastrow
    X = Cells(i, 1).Value
    For j = 1 To Len(X)
        c = Mid(X, j, 1)
        A = Asc(c)
        Select Case A
            Case Is < 48
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
            Case 58 To 64
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
            Case 91 To 96
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
            Case Is > 122
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
        End Select
    Next j
    X = Cells(i, 1).Value
    Cells(i, 1).Value = Application.Trim(X)
Next i
End Sub
 
Upvote 0
thanks for your help

Thanks for your reply VoGII

Just trying it now.

Yes, I did put a post up for a delete phrases, still not closed yet.

I'll reply shortly as to how this runs.
Many thanks though for taking a look.
John Caines

By the way (I still use your "No of Characters" Macro.
Brilliant.
That has a good pop up box on it with relevant info after it's ran.
Excellent:)
 
Upvote 0
update

Hello VogII.
There seems to be an error I think :(
When I ran your macro on some data (25,155 rows, which took about 85 seconds roughly)

I did get some rows come back empty.
I haven't looked at all 100 (as it was exactly 100 rows that were empty).
But all the ones I've looked at so far that have gone empty have ended with a "?" (Without parentheses).

So, a phrase (Row) like
luxury car rental boston?
is now an empty cell.

If you need the data to look at VoGII I can email it to you.
I would post it but I'm sure I would get banned for posting 25000 rows of data.:)
I have a car keyword list sheet also in a very raw state that has about 45,000 rows of data, which would be a very good tester.

If you need it VoGII or anyone else, just PM me, it will be no problem forwarding this.

Just to add VoGII,, when it did finish. There haven't been any rows deleted,
So, the formula you wrote VoGII, once it goes searching and deleting all odd characters etc, it at the end, it doesn't delete any duplicates?

I deleted duplicates manually and it said it deleted 178 rows of data.
(So now there are 24,977 Rows Of Phrases).
Well, exactly 98 were empty cells, so there must have been after the character clean up, 80 rows of duplicate phrases.

If possible VoGII, can this "Delete Duplicates" be written into the macro as it's last run function so it automatically cleans and removes the duplicate cells?

Hope this all makes sense.
It's definitely getting there though VoGII.
The list is much cleaner:)

Many Thanks again
John Caines
 
Upvote 0
John - please bear with me. '?' is being treated as a wildcard - I need to remember how to disable this :)

Back soon I hope...
 
Upvote 0
Richard - thanks - just found the same. Will need to do the same with asterisks.

How anybody remembers all this stuff is beyond me :)
 
Upvote 0
How anybody remembers all this stuff is beyond me :)

I know exactly where you're coming from - when I started out with the VBA stuff, I couldn't understand how anyone could ever remember that the arguments to Range().End() was xlUp, xlDown BUT! xlToRight and xToLeft - where the heck does the To come from???? What on earth is wrong with xlRight and xlLeft????? Of course, now I type those without thinking....
 
Upvote 0
formula

VoGII,

just to mention,
someone has got back to me also on this.
they gave me a formula but it had a debug error.
I will post it now, as it might be of some use to you??
here it is anyway;

=============================
Sub CleanEm()
Dim r As Range
Dim iRow As Long, lRow As Long
Dim s As String
Dim iChr As Integer
Dim nDelChr As Long, nDelCR As Long, nDelRow As Long, nDelSp As Long
Const sFmt As String = "#,##0"

lRow = Cells(Rows.Count, 1).End(xlUp)
Set r = Range(Cells(3, 1), Cells(lRow, 1))

For iRow = 1 To r.Rows.Count
s = r(iRow)
For iChr = 1 To Len(s)
Select Case Mid(s, iChr, 1)
Case "A" To "Z", "A" To "z", "0" To "9"
' do nothing
Case vbLf ' replace with space
s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1)
nDelCR = nDelCR + 1
Case Else ' not alphanumeric - replace with space
s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1)
nDelChr = nDelChr + 1
End Select
Next
r(iRow) = Trim(s) ' remove leading, trailing, and multiple spaces
nDelSp = nDelSp + Len(s) - Len(r(iRow))
Next

' sort to locate and delete duplicates
r.Sort key1:="A1", order1:=xlAscending, Header:=xlNo
For iRow = r.Rows.Count To 2 Step -1
If r(iRow) = r(iRow - 1) Then
r(iRow).EntireRow.Delete
nDelRow = nDelRow + 1
End If
Next

MsgBox "Starting No Phrases: " & Format(lRow - 3 + 1, sFmt) & vbLf _
& "Finishing No. Phrases: " & Format(r.Rows.Count, sFmt) & vbLf _
& "Deleted Characters: " & Format(nDelChr, sFmt) & vbLf _
& "Deleted Carriage Returns: " & Format(nDelCR, sFmt) & vbLf _
& "Deleted Spaces: " & Format(nDelSp, sFmt) & vbLf _
& "Deleted Duplicates: " & Format(nDelRow, sFmt)
End Sub
=============================
Hope this helps VoGII.
As I said, I got an error, so it didn't work.
the error by the way was this
============================
lRow = Cells(Rows.Count, 1).End(xlUp)
==============================

Again, hope this might help.

Many thanks
John Caines
 
Upvote 0
Try this

Code:
Sub strp()
Dim lastrow As Integer, X As Variant, i As Long, j As Integer
Dim A As Integer, t As Date, Replaced As Variant, Spaces As Variant, Returns As Variant
t = Now()
Replaced = 0
Spaces = 0
Returns = 0
lastrow = Cells(Rows.Count, 1).End(xlUp).Row
Application.ScreenUpdating = False
Application.Cursor = xlWait
For i = 3 To lastrow
    Application.StatusBar = Format(100 * i / (lastrow - 3), "0") & "% done..."
    X = Cells(i, 1).Value
    For j = 1 To Len(X)
        A = Asc(Mid(X, j, 1))
        Select Case A
            Case Is < 32
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
                If A = 10 Or A = 13 Then Returns = Returns + 1
            Case 33 To 47
                Cells(i, 1).Replace what:="~" & Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
            Case 58 To 64
                Cells(i, 1).Replace what:="~" & Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
            Case 91 To 96
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
            Case Is > 122
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
        End Select
    Next j
    X = Cells(i, 1).Value
    Cells(i, 1).Value = Application.Trim(X)
    Spaces = Spaces + Len(X) - Len(Cells(i, 1).Value)
    Rows(i).EntireRow.AutoFit
Next i
Application.ScreenUpdating = True
Application.Cursor = xlNormal
Application.StatusBar = False
Spaces = Spaces - Replaced
Replaced = Replaced - Returns
MsgBox ("Number of phrases: " & vbTab & vbTab & lastrow - 2 _
& vbCrLf & "Deleted characters: " & vbTab & Replaced _
& vbCrLf & "Deleted carriage returns: " & vbTab & Returns _
& vbCrLf & "Deleted Spaces: " & vbTab & vbTab & Spaces _
& vbCrLf & "Time elapsed: " & vbTab & vbTab & Format(Now() - t, "hh:mm:ss"))
End Sub
 
Upvote 0

Forum statistics

Threads
1,215,315
Messages
6,124,211
Members
449,148
Latest member
sweetkt327

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top