clean data - Removing Unwanted characters

John Caines

Well-known Member
Joined
Aug 28, 2006
Messages
1,155
Office Version
  1. 2019
Platform
  1. Windows
Hello All.
I have a problem.:(
I have an excel sheet with approx 30,000 rows of data.
These are all keyword phrases.
They can be related to any subject, but for this example these are related to the root keyword phrase of "car Rent"

I have some software which basically pulls in keyword phrases from search engines and meta tags etc, including misspelled keyword phrases.

My problem;
The data is uncleaned.
In other words there might be I think it's called "Carriage return" data in there, so the row of data might be very deep (Instead of a row height of say 10.5 it could be anything, IE some could be 100 or 200 even).
There are unwanted characters, for example; ()[]{}+?!""^*

(If it could delete all unwanted characters except for letters/digits)
There is a problem I see, that if it removes _ or - between words, that it will join the words together which won't be of any use. If it deletes anything with a letter either side of it or a letter and digit, or 2 digits, 1 either side it would then need to add a space to replace the hyphen.

So for example; if there was a phrase in the list like
car_for rent
if it just removed the underscore, then the phrase would be
carfor rent
Which isn't correct. It would need to replace the underscore with a space.
I hope I'm making sense here:)
So basically I'd love to have if possible a macro button that runs through my entire column of data,
(Always in Column A , on a sheet called "AllKWs", and always starting from row3 downwards.)

If it could go through the list and delete all unwanted characters including double spaces.
So the end result is a keyword phrase list without a lot of junk basically.
After it's gone through the list I suppose it needs to then look at what's left and delete any duplicate phrases last (As once some of these unwanted characters are removed, the keyword list may have duplicates).

Once all this is completed, can a pop up window appear saying something like;
=======================================

Starting No. Phrases: 29,745
Finishing No. Phrases: 29,722
No.Deleted Characters: 12,345
No.Deleted Carriage Returns: 234
No.Deleted Spaces: 235
No.Deleted Duplicates: 23

Time Elapsed: 7.78seconds
======================================
I think that's about it:)
I really hope someone can help me out on this 1.
I can't write this for sure.
Out of my league I'm afraid:(
I hope it is possible as this would be very very useful for me.
Maybe it isn't possible as it is quite complicated.
If someone can have a look at it for me and have a go that would be brilliant


Thanks for your time.
Many Thanks
John Caines
 
The other code that you posted with my comments.

Code:
Sub CleanEm()
Dim r As Range
Dim iRow As Long, lRow As Long
Dim s As String
Dim iChr As Integer
Dim nDelChr As Long, nDelCR As Long, nDelRow As Long, nDelSp As Long
Const sFmt As String = "#,##0"

lRow = Cells(Rows.Count, 1).End(xlUp) 'lRow = Cells(Rows.Count, 1).End(xlUp).Row
Set r = Range(Cells(3, 1), Cells(lRow, 1))

For iRow = 1 To r.Rows.Count
s = r(iRow)
For iChr = 1 To Len(s)
Select Case Mid(s, iChr, 1)
Case "A" To "Z", "A" To "z", "0" To "9"
' do nothing
Case vbLf ' replace with space
s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1)
nDelCR = nDelCR + 1
Case Else ' not alphanumeric - replace with space
s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1)
nDelChr = nDelChr + 1
End Select
Next
r(iRow) = Trim(s) ' does this actually remove replicate spaces or just leading and trailing? That's why I used Application.Trim
nDelSp = nDelSp + Len(s) - Len(r(iRow))
Next

' sort to locate and delete duplicates
r.Sort key1:="A1", order1:=xlAscending, Header:=xlNo
For iRow = r.Rows.Count To 2 Step -1
If r(iRow) = r(iRow - 1) Then
r(iRow).EntireRow.Delete
nDelRow = nDelRow + 1
End If
Next

MsgBox "Starting No Phrases: " & Format(lRow - 3 + 1, sFmt) & vbLf _
& "Finishing No. Phrases: " & Format(r.Rows.Count, sFmt) & vbLf _
& "Deleted Characters: " & Format(nDelChr, sFmt) & vbLf _
& "Deleted Carriage Returns: " & Format(nDelCR, sFmt) & vbLf _
& "Deleted Spaces: " & Format(nDelSp, sFmt) & vbLf _
& "Deleted Duplicates: " & Format(nDelRow, sFmt)
End Sub

See how much easier it is to read with code tags :biggrin: [/code]
 
Upvote 0

Excel Facts

Square and cube roots
The =SQRT(25) is a square root. For a cube root, use =125^(1/3). For a fourth root, use =625^(1/4).
getting there, looking good VoGII:)

Just tried it VoGII on an easier sheet.
I will test it on a really bad sheet in 2 minutes.
It's looking good VoGII.

I love these pop up info boxes!:)
(I know, I sound kind of sad really:)

As to the information.
It did take a while to run VoGII.
Info as follows in pop up window,
======================
No Phrases: 25153
Deleted Characters:777
Deleted Carriage Returns:0
Deleted Spaces:38
Time Elapsed:00:01:43
=====================
It's looking great VoGII.
I will try this now on a larger sheet that is terrible as in carriage returns and unwanted characters.
This is so we can see if the carriage returns data is working correctly.

It hasn't deleted any duplicate rows of data as a last run function though VoGII.
When it did finish, I manually had to delete duplicates. It said it found 86

As another point.
I remember another VB expert that helped me on a macro , PGC01, I think he is called.
He asked if ever there would be more than 64,000 (It was an odd number actually) in the sheet.
I said no.
I think he asked me this, as when he wrote the code, he wrote it differently, I think maybe to speed it up?
I'm not 100% sure, so don't quote me on this VoGII.
I mention it, as if you know this, then you might write something differently also.
I just thought I'd mention it to you anyway VoGII.

Many thanks for this.
I will try this large sheet with carriage returns in to see how your formula works with it.
Many Thanks
John Caines
 
Upvote 0
sorry VoGII:)

Sorry VoGII,,

I'll post from now on in Code boxes:)

I'm still looking at the macro.
Hang on.
Sorry:)
John Caines
 
Upvote 0
Error??

Sorry VoGII.
I'm trying to run your macro on this larger sheet of phrases that is ugly to be honest:)

I keep getting an error in your code???
On this line of code;
Code:
lastrow = Cells(Rows.Count, 1).End(xlUp).Row

Any ideas???
I can't understand why it won't run on this??
It did on my other keyword sheet VoGII??
very strange

John Caines
 
Upvote 0
just to add VoGII

Sorry VogII,
Just to add it says;
Run-time error'6'
Overflow
--------------

Hope this helps.
I haven't a clue??
Sorry:(

John Caines
 
Upvote 0
I apologise and I don't know how I made this error :oops:

Please change this line (the second one)

Code:
Dim lastrow As Long, X As Variant, i As Long, j As Integer
[/code]
 
Upvote 0
Excellent VoGII:)

I've just tried it on a really horrible list VoGII that is full of horrible characters/carriage returns etc.

Made the changes you said.
Worked brilliantly.
Was quicker also than the 25,000 phrase list??

Here's the results VoGII
========================
Number Of Phrases: 43697
Deleted Characters: 3324
Deleted Carriage Returns: 7366
Deleted Spaces: 6670
Time Elapsed: 00:01:19
========================
There are now no empty Rows VoGII.
There are some characters (Just a few when I sorted by your other macro ( No. Of Characters) that I'd like it to include and delete also.
these have slipped through the net VoGII;
Actually I'll sceen capture the first few rows of data so you can see.
It's mainly though {} - ', (This last character I think is a comma. It's in Row 3. Doesn't look like a comma though?)
==============================
HuaHin Car Rental Keyword Sorting Sheet.xlsm
ABCD
1
2KeywordPhrase
3.1
4a1
5-1
6v1
7}1
8{1
9i1
10`1
11c1
12uk2
13wv2
14cb2
15ab2
All KWs

======================================
1 last final thing VoGII.
It didn't delete any duplicates.
Can this be added also. So once it's been through the list it deletes duplicates also??
Before I ran the macro there were no duplicates.
After it had finished, I manually checked for duplicates.
It found 319.
So, if this could be added also that would be great, (And imputted into the pop up box also as No of Duplicates Deleted.

Many thanks for this VoGII.
It's almost there:)
John Caines
 
Upvote 0
characters to delete

Sorry VoGII

If I wasn't too clear on the characters to delte also.
Looks like the following
========================
'-,{}
========================
Actually VoGII the character in row 3 might be a full stop.
I think it is VogII.
I'm looking through the sheet now.
I'll paste in 6 rows of data;
========================
old cheap car
cheap car rim
www.geico.com
car.insurance
car insurance
cypruscardeal
=========================
See the car.insurance.
If that full stop was removed, that would be great, also that would then mean the list had another duplicate to delete.

Hope this all makes sense.

Many Thanks again VoGII
John Caines
 
Upvote 0
Found 1 more

Just 1 more VoGII.
I'm looking at row 10748, which is the China phrase below, has a colon in it.
===========
hotel bookings
hotels in agra
china: in 2004
hire hong kong
=======================
1 last thing VoGII.
At the end of my sheet there are some very long phrase, (Which really aren't of much use to be honest and I'll probably delete).
My question is as follows;
Why is cell 43,641 a deep row, yet row 43,641 isn't VoGII?
I'm not really 100% sure I understand deleteing carriage returns.
Should row 43,641 still be that deep? Or is it the next row 43,642, should that be deeper?
Not too sure what I'm trying to say actually:)
I'll attach a screen capture anyway
Maybe you might have a clue, and tell me what I'm supposed to be thinking:) :)
==============================
HuaHin Car Rental Keyword Sorting Sheet.xlsm
ABCD
43445ohusacvg ciudaddelcarmenmexicocme ciudadjuarezmexicocjs ciudadobregonmexicocen ciudadvictoriamxeicocvm clarksburg148
43446nfcanadafsp st.thomasvirginislandsstt st.vincentst.vincentsvd stansted(london)unitedkingdomstn statecollege/belefonte148
43447ohusacvg ciudaddelcarmenmexicocme ciudadjuarezmexicocjs ciudadobregonmexicocen ciudadvictoriamexicocvm clarksburg148
43448mtusabil billunddenmarkbll bintulumalaysiabtu birarocentralafricanrepubliciro birminghamunitedkingdombhx birmingham148
43449carhireaustraliarentalbrisbanehirequeenslandaustraliamelbournecairnsairliebeachaustraliantruck4wdcommercialairportpickuprentacar148
43450carrentalpaylessdenvercoloradocheaprentalcardiscountcarrentalsspecialsdenfreecontest25340east78thavenuedenvercolorado80249-6383148
43451carrentalpaylesstampafloridacheaprentalcardiscountcarrentalsspecialstpafreecontest1965nwestshoreboulevardtampaflorida33607-3908148
43452carrentalpaylesslasvegasnevadacheaprentalcardiscountcarrentaslspecialslasfreecontest5175rent-a-carroadlasvegasnevada89119-1229148
43453carrentalpaylesslasvegasnevadacheaprentalcardiscountcarrentalsspecialslasfreecontest5175rent-a-carroadlasvegasnevada89119-1229148
43454akusadut dysartaustraliadya dzaoudzimayottedza topofthepageabcdefghijklmnopqrstuvwxyz e city152
All KWs

==================================
Hope this helps
Many Thanks
John Caines
 
Upvote 0
as a note

Sorry VogII
The attached sheet above. The information of the cell numbers doesn't correspond to the sheet.
Sorry, I copied the wrong cells.

Should still hopefully make sense though.
Many Thanks
John Caines
 
Upvote 0

Forum statistics

Threads
1,215,308
Messages
6,124,178
Members
449,146
Latest member
el_gazar

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top