Akuini

VBA Macro to create Word & Phrase Frequency

Akuini

Well-known Member
Joined
Feb 1, 2016
Messages
4,981
Office Version
  1. 365
Platform
  1. Windows
Akuini submitted a new Excel article:

VBA Macro to create Word & Phrase Frequency - VBA Macro to create Word & Phrase Frequency

This macro generates word & phrase frequency.
You can set the number of words contained in a phrase as needed, in this part:
Const sNumber As String = "1,2,3"
You can also set what characters should be considered as word characters, in this part:
Const xPattern As String = "A-Z0-9_'"
but you will need basic knowledge of regular expressions.

Tested on a text (from a novel) with 93623 rows (in the sheet), 586552 total...

Read more about this Excel article...
 

Excel Facts

Does the VLOOKUP table have to be sorted?
No! when you are using an exact match, the VLOOKUP table can be in any order. Best-selling items at the top is actually the best.
Hello Akuini, would you have a version of this for Word? I don't need phrases, just counts of each word in Word documents, excluding common words like a, an, the, is, and, etc. I have found two such macros but receive syntax errors when I try and run them. I have many documents I need to run this on.
 
This script is AMAZING! I've never seen anything like it and I've been searching for YEARS. Thanks very, very much for sharing.
One thing that would be absolutely huge would be the ability to add a list of stop words. Is that something you would possibly consider adding.
Thanks again for sharing this awesome macro :)
 
@wiscochris, welcome to MrExcel
I'm glad that you find the code useful.
One thing that would be absolutely huge would be the ability to add a list of stop words.
Not sure what you want to do with the stop words, but my understanding is that you don't want to count it, so it won't be part of the words count.
I added "sub stopWord" to remove the stop words.
How it works:
  1. Put the text in column A, start at A1.
  2. Put the list of stop words starting at B1 (cell B1 must not be empty).
  3. Run "Sub Word_Phrase_Frequency_v1".
  4. The code will replace all stop words (case insensitive) found in the text (on col A) with a "|", so it won't be counted as word.
Speed test:
Using stop words can actually speed up the process, I tested it on a novel i.e. War and Peace, by Leo Tolstoy, it contains 586930 words.
Without stop words it took 41.87891 seconds.
With 70 words as stop words it took only 12.375 seconds.

Example:
dhee - stop word try.xlsm
ABCDEFGH
1The booka1 WORDCOUNT2 WORDCOUNT
2War and Peace, by Leo Tolstoythebook1book War1
3andLeo1Leo Tolstoy1
4byPeace1
5Tolstoy1
6War1
Sheet1


Rich (BB code):
Option Explicit

Sub Word_Phrase_Frequency_v1()

'The code will generate word/phrase frequency
'How to use:
'1. Add reference to "Microsoft VBScript Regular Expressions 5.5" (you need to do it once only):
'   In Visual Basic Editor menu, select Tools –> References, then select Microsoft VBScript Regular Expressions 5.5, then click OK.
'2. Data must be in column A, start at A1
'3. Run Word_Phrase_Frequency_v1


'--- CHANGE sNumber & xPattern VALUE TO SUIT -----------------------------------

Const sNumber As String = "1,2,3"  '"1,2,3"
'sNumber = "1"  will generate 1 word frequency list
'sNumber = "1,2,3"  will generate 1 word, 2 word & 3 word frequency list

Const xPattern As String = "A-Z0-9_'"
'define the word characters, the above pattern will include letter, number, underscore & apostrophe as word character
'word with apostrophe such as "you're" counts as one word.
'word with underscore such as "aa_bb" counts as one word.


Const xCol As String = "C:ZZ" 'columns to clear
Dim i As Long, j As Long
Dim txa As String
Dim z, t

t = Timer
Application.ScreenUpdating = False
Range(xCol).Clear

'if there are errors, remove them
On Error Resume Next
Range("A:A").SpecialCells(xlCellTypeFormulas, xlErrors).ClearContents
Range("A:A").SpecialCells(xlConstants, xlErrors).ClearContents
On Error GoTo 0

j = Range("A" & Rows.Count).End(xlUp).Row

If j < 65000 Then
    txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), " ")
Else
    For i = 1 To j Step 65000
    txa = txa & Join(Application.Transpose(Range("A" & i).Resize(65000)), " ") & " "
    Next
End If

'if you have stop words list then put the list starting at B1
If Range("B1") <> "" Then Call stopWord(xPattern, txa)

z = Split(sNumber, ",")
    
    'TO PROCESS
    For i = LBound(z) To UBound(z)
        Call toProcessY(CLng(z(i)), txa, xPattern)
    Next

Range(xCol).Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print "It's done in:  " & Timer - t & " seconds"

End Sub


Sub stopWord(xP As String, tx As String)
Dim n As Long
Dim stW, x
Dim regEx As Object
n = Range("B" & Rows.Count).End(xlUp).Row
If n = 1 Then Exit Sub
stW = Range("B1:B" & n)

        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .Ignorecase = True
        End With
tx = " " & tx
For Each x In stW
        regEx.Pattern = "[^" & xP & "]" & x & "[^" & xP & "]"
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, "|")   'replace stop word with "|"
        End If
Next

End Sub

Sub toProcessY(n As Long, ByVal tx As String, xP As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim va, q

        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .Ignorecase = True
        End With

If n > 1 Then

        regEx.Pattern = "( ){2,}"

        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
        
        tx = Trim(tx)
               
'        regEx.Pattern = "[^A-Z0-9_' ]+"
        regEx.Pattern = "[^" & xP & " ]+" 'exclude xp and space
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character (excluding space) with new line char (vbLf)
        End If
        
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line

End If

    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

'    regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n)) 'match n words (the phrase) separated by a space
    regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n)) 'match n words (the phrase) separated by a space
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next
 
For i = 1 To n - 1
        
        regEx.Pattern = "^[" & xP & "]+ "
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

'            regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n))
            regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n))
            
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If
Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

rc = Cells(1, Columns.Count).End(xlToLeft).Column
'put the result

With Cells(2, rc + 2).Resize(d.Count, 2)
    
    Select Case d.Count
    Case Is < 65536 'Transpose function has a limit of 65536 item to process
        
        .Value = Application.Transpose(Array(d.Keys, d.Items))
        
    Case Is <= 1048500
        
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
    
    Case Else
        
        MsgBox "Process is canceled, the result is more than 1048500 rows"
    
    End Select
    
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
    
End With


Cells(1, rc + 2) = n & " WORD"
Cells(1, rc + 3) = "COUNT"

End Sub
 
I'm speechless. You took my very minimally/poorly described specs and totally nailed it (and completed it on Christmas day.)
This is exactly what I was hoping for, crazy fast and very much appreciated. I'm confident others will greatly value this too. Thanks SO much!
 
You're welcome, glad to help & thanks for the feedback.:)
 
@wiscochris
I forgot to explain something:
The code will replace the stop word with "|". It means that the words that are separated by the stop word (or now by the "|") won't be considered as a phrase. For example:
Let's say "a" is a stop word. And you have a text "I buy a car".
It will change to "I buy | car", so you won't get "buy car" as a phrase. Example:

dhee - stop word try.xlsm
ABCDEFGH
1I buy a car.a1 WORDCOUNT2 WORDCOUNT
2I buy the box.thebuy2I buy2
3I2
4box1
5car1
Sheet1


If you want "buy car" as a phrase then in "Sub stopWord" you need to change this part:
VBA Code:
          tx = regEx.Replace(tx, "|")   'replace stop word with "|"
to this:
VBA Code:
          tx = regEx.Replace(tx, " ")   'replace stop word with a space


Example:

dhee - stop word try.xlsm
ABCDEFGHIJK
1I buy a car.a1 WORDCOUNT2 WORDCOUNT3 WORDCOUNT
2I buy the box.thebuy2I buy2I buy box1
3I2buy box1I buy car1
4box1buy car1
5car1
Sheet1
 
Thanks for the explanation. And that's a great addition/option.
 
Hi, I’m wondering if you would be willing to possibly help with a horizontal presentation version of the same script. This version would loop through row multiple multiple rows.

Here is a OneDrive file example of what I am hoping for. There are 2 sheets; "Original", which is your original example that includes your latest macro. The second sheet "Horizontal" is what I am hoping to get. Below is a L2BB share of that for easy reference.

kwfrequencyHorizontalSample.xlsm
ABCDEFGHIJKLMNOPQRSTUVWX
1NameText 1Text 2Text 3Text 4Text 51 WORD 1COUNT1 WORD 2COUNT1 WORD 3COUNT2 WORD 1COUNT2 WORD 2COUNT2 WORD 3COUNT3 WORD 1COUNT3 WORD 2COUNT3 WORD 3COUNT
2Chapter 1In captivity, in the shed, Pierre had learned, not with his mind, but with his whole being, his life, that man is created for happiness, that happiness is within him, in the satisfying of natural human needs,and that all unhappiness comes not from lack, but from superfluity.his3In3that3with his2all unhappiness1but from1all unhappiness comes1but from superfluity1but with his1
3Chapter 2This black-eyed, wide-mouthed girl, not pretty but full of liferan to hide her flushed face in the lace of her mother’s mantillanot paying the least attention to her severe remark—and began to laugh.her3to3not2attention to1began to1but full1attention to her1began to laugh1but full of1
4Chapter 3When everything was ready, the stranger opened his eyes,moved to the table, filled a tumbler with tea for himself and one for the beardless old man to whom he passed it."for2to2beardless1beardless old1everything was1for himself1beardless old man1everything was ready1he passed it1
5Chapter 4The proverbs, of which his talk was full, were..those folk sayings which taken without a context seem so insignificant, but when used appositely suddenly acquire a significance of profound wisdom.of2which2acquire1appositely suddenly1but when1context seem1appositely suddenly acquire1but when used1context seem so1
horizontal


The list of stop words would be on a separate sheet, stay as is in a vertical list, and apply to each loop/row individually as it does currently.

Here are some additional details to help explain the "Horizontal" sheet example:
  • The example has three 1 WORD (G, I, K), 2 WORD (M, O, Q), and 3 WORD (S, U, W)columns. It would be great if I could change that from 3 to 5 or 20 or whatever, depending upon my needs.
  • There are only (up to) 5 columns of text that can be entered (columns B-F). I realized that is limited vs the original script, but this is what would work for my needs.
  • In the example, there are 4 entries in column A. I would like it if the script could automatically loop through all rows that exist per column A, noting that if one is blank, processing of that row can be skipped.
I would really appreciate it if you can help me with this. I really, really LOVE this script and would be thrilled if it could work as I described.

Thanks very much, in advance.
 

Forum statistics

Threads
1,214,833
Messages
6,121,864
Members
449,052
Latest member
Fuddy_Duddy

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top