The list of frequent words used in a text

dscheste · Oct 16, 2004

Hello, gurus

I read the post here about an alphabetically sorted list of frequently used words.

The post caught my attention, and I decided to play around with the idea of writing a macro that counts how many times each word is used in a text and then sorts the list so that the most frequent words are at the top.

As the text might be large, I dismissed the idea of doing it in Excel and had to go to the "bastard child" (™ Tom Urtis), Microsoft Word.

The code works like this:

1. Copy any text and paste it as unformatted text into the document (in this case named Frequent_words.doc).
2. Run the macro.
3. The macro deletes the words it has already counted, then reports how many words were processed and how long it took.
4. The stats are written to Stats.doc, which is saved in the root of C:\ by default.

I would like to ask you to comment on the code: any ideas on how to make it run faster, or how to make the code shorter?

I appreciate your feedback.

Here is the code (VBE in Word):

Sub Frequent_words()
Dim MyRange, WordOrder, Word, Frequency, Statistics, stimer, ftimer, Text, cellWordCount, prevcharacter, lastspace, i, nextCharacter
stimer = Timer
With Application
.ScreenUpdating = False ' we do not wanna see what is happening
 With Selection ' We have to move to the end of the document and add a space so as not to lose the last word when parsing
 .EndKey unit:=wdStory 'Goes to the end
 .InsertAfter (" ") 'inserts a space
 End With
Set MyRange = ActiveDocument.Content
With MyRange.Find ' we have to pre-format the text and get rid of the paragraphs
 .Text = "^p"
 .Replacement.Text = " "
 .Forward = True
 .Wrap = wdFindContinue
 .Format = False
 .MatchCase = False
 .MatchWholeWord = False
 .MatchWildcards = False
 .MatchSoundsLike = False
 .MatchAllWordForms = False
 .Execute Replace:=wdReplaceAll ' run the replace on MyRange.Find, which holds the settings above (not on Selection.Find)
 End With
Text = Trim(MyRange) 'Now we trim the text a bit to work with nice text

cellWordCount = 1
prevcharacter = ""
lastspace = 1

 Documents.Add DocumentType:=wdNewBlankDocument ' Here we would record our findings
 ChangeFileOpenDirectory "C:\" 'This path can be changed, when changing the name rename the window as well
 ActiveDocument.SaveAs Filename:="Stats.doc", FileFormat:= _
 wdFormatDocument, LockComments:=False, Password:="", AddToRecentFiles:= _
 True, WritePassword:="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts:= _
 False, SaveNativePictureFormat:=False, SaveFormsData:=False, _
 SaveAsAOCELetter:=False
Set Statistics = Windows("Stats")
'Get back to our sheep
Windows("Frequent_Words").Activate
For i = 1 To Len(Text)
' we will have to loop through the text and analyze every character, a space will mean new word is following
nextCharacter = Mid(Text, i, 1)
 If (nextCharacter = " ") And (prevcharacter <> " ") Then
 Word = Trim(Mid(Text, lastspace, i - lastspace)) ' that is our actual word
 lastspace = i
 With ActiveDocument.Content.Find
 .ClearFormatting
 Do While .Execute(FindText:=Word, Forward:=True, Format:=True, ReplaceWith:="", MatchWholeWord:=True) = True
 Frequency = Frequency + 1
 ' we have found all the instances of the word and now mark each time the loop revolves
 Loop
With Statistics.Selection
.InsertAfter (Frequency)
.MoveRight unit:=wdCharacter, Count:=1
.InsertAfter (" ")
.MoveRight unit:=wdCharacter, Count:=1
.InsertAfter (Word)
.MoveDown unit:=wdLine, Count:=1
.InsertParagraph
End With
Frequency = 1
 End With
 cellWordCount = cellWordCount + 1 ' count the actual number of words
 End If
 prevcharacter = nextCharacter ' proceed to the next character
Next i
If Len(Text) = 0 Then
 cellWordCount = 0
End If
'Now we sort the results descending
 Statistics.Selection.WholeStory
 Statistics.Selection.Sort ExcludeHeader:=False, FieldNumber:="Field1", _
 SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderDescending, Separator:= _
 wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
 :=wdEnglishUS
 Statistics.Selection.HomeKey unit:=wdStory
.ScreenUpdating = True
End With
ftimer = Timer
'A bit of show-off
MsgBox Str(cellWordCount) & " words processed in " & ftimer - stimer & " seconds."
End Sub



P.S. The code should work for any language written with an alphabet. I just tried it on Russian text and it worked.

SIXTH SENSE · Oct 18, 2004

Hi!
I'm not sure which of the two macros runs faster. Just try it on your machine.
But looking at your code, you loop through each character to find a word, which will take too long for a long document.

Try this and let me know how it goes.

Sub CountFreq()
Dim NonWord
NonWord = "!:;?.,<>()[]{}"
Dim Tmp
Dim uniqlist
Dim Newdoc As Document
Dim st, txt, i, itxt, cnt, wrdcnt ' declared so the macro also compiles under Option Explicit
st = Timer
uniqlist = "+"
Application.ScreenUpdating = False
Selection.Find.Execute Replace:=wdReplaceAll
txt = (Trim(ThisDocument.Content))
For i = 1 To 31
 txt = Trim(Replace(txt, Chr(i), " ", 1, -1, vbTextCompare)) 'non printable characters
Next i
For i = 1 To Len(NonWord)
 txt = Trim(Replace(txt, Mid(NonWord, i, 1), "", 1, -1, vbTextCompare)) ' strip punctuation
Next i
Do Until (Len(txt) - Len(Replace(UCase(txt), "  ", " ", 1, -1, vbTextCompare))) = 0
 txt = Trim(Replace(txt, "  ", " ", 1, -1, vbTextCompare)) ' collapse double spaces into single spaces
Loop
Tmp = Split(txt)

Set Newdoc = Documents.Add(DocumentType:=wdNewBlankDocument)

For Each itxt In Tmp
 If InStr(1, (uniqlist), "+" & UCase(itxt) & "+") > 0 Then
 GoTo nxttxt
 Else
 uniqlist = uniqlist & UCase(itxt) & "+"

 ' Note: Replace matches substrings, so a short word such as "the" is also counted inside "there"
 cnt = (Len(txt) - Len(Replace(UCase(txt), UCase(itxt), "", 1, -1, vbTextCompare))) / Len(itxt)
 End If

Newdoc.Range(Newdoc.Range.Start, Newdoc.Range.End - 1).InsertAfter Format(cnt, "000") & Chr(9) & Trim(itxt) & Chr(13)

nxttxt:
Next itxt

Newdoc.Range.Sort excludeheader:=False, fieldnumber:="Field1", sortorder:=wdSortOrderDescending
Application.ScreenUpdating = True
st = Timer - st

wrdcnt = (Len(uniqlist) - 1 - Len(Replace(uniqlist, "+", "")))
MsgBox "There are " & wrdcnt & " unique words found in " & st & " seconds"
End Sub
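
One caveat with the Replace-based count in the macro above: it counts substring matches, so "the" is also found inside "there". Below is a minimal sketch of a whole-word alternative, assuming the text has already been cleaned into a space-separated string as in the macro. CountWords is a hypothetical helper name (not from the posts), and it assumes the Scripting.Dictionary object from the Microsoft Scripting Runtime is available (Windows only).

```vba
' Hypothetical helper (not part of the original macros):
' counts whole words in a space-separated string, ignoring case.
' Assumes Scripting.Dictionary (Microsoft Scripting Runtime) can be created.
Function CountWords(ByVal txt As String) As Object
    Dim dict As Object
    Dim parts() As String
    Dim i As Long
    Dim w As String

    Set dict = CreateObject("Scripting.Dictionary")
    parts = Split(Trim(txt), " ")
    For i = LBound(parts) To UBound(parts)
        w = UCase(parts(i))
        If Len(w) > 0 Then        ' skip empty items left by repeated spaces
            dict(w) = dict(w) + 1 ' a missing key is created as Empty, which counts as 0
        End If
    Next i
    Set CountWords = dict
End Function
```

For example, CountWords("the cat saw the other cat")("THE") gives 2, because the "the" inside "other" is not counted. Each word is looked up once in the dictionary instead of being searched for in the whole text again.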

dscheste · Oct 25, 2004

Hello Sixth Sense,

Sorry for not getting back to you earlier.

I ran both macros, and for documents of up to about 5000 characters they run at approximately the same speed.

When I launched a test on a John Grisham novel, yours was much more efficient. I will definitely go with your code.

Thank you for the code, and especially for this idea:

txt = Trim(Replace(txt, Mid(NonWord, i, 1), "", 1, -1, vbTextCompare))

beautiful
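
The same idea can be wrapped in a small reusable function. This is only a sketch: StripChars is a hypothetical name, not something from the macro above.

```vba
' Hypothetical helper: removes every character listed in unwanted from txt
' by looping over unwanted one character at a time.
Function StripChars(ByVal txt As String, ByVal unwanted As String) As String
    Dim i As Long
    For i = 1 To Len(unwanted)
        txt = Replace(txt, Mid(unwanted, i, 1), "")
    Next i
    StripChars = txt
End Function
```

For example, StripChars("Hello, (world)!", "!:;?.,<>()[]{}") returns "Hello world". Note that replacing with an empty string joins words separated only by punctuation ("end.Start" becomes "endStart"); replacing with a space instead would keep them apart.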

Have a nice one.

dscheste
