• If you would like to post, please check out the MrExcel Message Board FAQ and register here. If you forgot your password, you can reset your password.
  • Excel articles and downloadable files provided in the articles have not been reviewed by MrExcel Publishing. Please apply the provided methods / codes and open the files at your own risk.
    If you have any questions regarding an article, please use the Article Discussion section.
Akuini

VBA Macro to create Word & Phrase Frequency

Excel Version
  1. 365
  2. 2019
  3. 2016
  4. 2013
  5. 2011
  6. 2010
  7. 2007
This macro generates word & phrase frequency.
You can set the number of words contained in a phrase as needed, in this part:
Const sNumber As String = "1,2,3"
You can also set what characters should be considered as word characters, in this part:
Const xPattern As String = "A-Z0-9_'"
but you will need basic knowledge of regular expressions.

Tested on a text (from a novel) with 93623 rows (in the sheet), 586552 total words, 17688 unique words, with sNumber = "1,2,3", it took 43.2 seconds.

The criteria used by the code are:
1. A word only consist of alphabet (a-zA-Z), number(0-9), underscore or apostrophe. They are considered as word character.
For example:
each line below counts as 1 word:
You
A11
33
you're
a_bb

each line below counts as 2 words:
a,bb
a book
a\d

2. A phrase is multiple words that are separated by a space or spaces, and as an addition, check criteria no 3.
So for example:
"I see this" > it has 3 phrases:
"I see"
"see this"
"I see this"

"yes, this one" > it has only 1 phrase:
"this one"

3. A sentence/paragraph can occupy more than one cell, it happens when the end of the cell is a word character or a space. For example:
Book2
A
1Please go home
2tomorrow.
Sheet1

"home tomorrow" is considered as a phrase.
I decided to use this criteria because in some cases the text comes from a source that breaks the paragraphs, oftentimes it happens when you copy-paste the text from a PDF.

4. Uppercase letters are considered equivalent to their lowercase counterparts.

5. Multiple spaces will be replaced with a single space.

6. It doesn't support non-ASCII characters. All non-ASCII characters are considered as non-word character. For example: Zhárov counts as 2 words, i.e: "Zh" and "rov", while "á" will be deleted.

How to use:
1. Add reference to "Microsoft VBScript Regular Expressions 5.5":
In Visual Basic Editor menu, select Tools –> References, then select Microsoft VBScript Regular Expressions 5.5, then click OK. You need to do it only once.
2. Data must be in column A, start at A1
3. Run Word_Phrase_Frequency_v1

Example, with sNumber = "1,2,3"

dhee - Macro to create Word & Phrase Frequency 1.xlsm
ABCDEFGHIJ
1He is good you're cc-dd.1 WORDCOUNT2 WORDCOUNT3 WORDCOUNT
2yes, hegood2He is2He is good2
3is good.He2is good2good you're cc1
4Zhárovis2good you're1is good you're1
5cc1you're cc1
6dd1
7rov1
8yes1
9you're1
10Zh1
Sheet1


The code:
VBA Code:
Option Explicit

Sub Word_Phrase_Frequency_v1()

'The code will generate word/phrase frequency
'How to use:
'1. Add reference to "Microsoft VBScript Regular Expressions 5.5" (you need to do it once only):
'   In Visual Basic Editor menu, select Tools –> References, then select Microsoft VBScript Regular Expressions 5.5, then click OK.
'2. Data must be in column A, start at A1
'3. Run Word_Phrase_Frequency_v1


'--- CHANGE sNumber & xPattern VALUE TO SUIT -----------------------------------

Const sNumber As String = "1,2,3"  '"1,2,3"
'sNumber = "1"  will generate 1 word frequency list
'sNumber = "1,2,3"  will generate 1 word, 2 word & 3 word frequency list

Const xPattern As String = "A-Z0-9_'"
'define the word characters, the above pattern will include letter, number, underscore & apostrophe as word character
'word with apostrophe such as "you're" counts as one word.
'word with underscore such as "aa_bb" counts as one word.


Const xCol As String = "C:ZZ" 'columns to clear
Dim i As Long, j As Long
Dim txa As String
Dim z, t

t = Timer
Application.ScreenUpdating = False
Range(xCol).Clear

'if there are errors, remove them
On Error Resume Next
Range("A:A").SpecialCells(xlCellTypeFormulas, xlErrors).ClearContents
Range("A:A").SpecialCells(xlConstants, xlErrors).ClearContents
On Error GoTo 0

j = Range("A" & Rows.Count).End(xlUp).Row

If j < 65000 Then
    txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), " ")
Else
    For i = 1 To j Step 65000
    txa = txa & Join(Application.Transpose(Range("A" & i).Resize(65000)), " ") & " "
    Next
End If


z = Split(sNumber, ",")
    
    'TO PROCESS
    For i = LBound(z) To UBound(z)
        Call toProcessY(CLng(z(i)), txa, xPattern)
    Next

Range(xCol).Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print "It's done in:  " & Timer - t & " seconds"

End Sub

Sub toProcessY(n As Long, ByVal tx As String, xP As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim va, q

        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .ignorecase = True
        End With

If n > 1 Then

        regEx.Pattern = "( ){2,}"

        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
        
        tx = Trim(tx)
               
'        regEx.Pattern = "[^A-Z0-9_' ]+"
        regEx.Pattern = "[^" & xP & " ]+" 'exclude xp and space
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character (excluding space) with new line char (vbLf)
        End If
        
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line

End If

    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

'    regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n)) 'match n words (the phrase) separated by a space
    regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n)) 'match n words (the phrase) separated by a space
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next
 
For i = 1 To n - 1
        
        regEx.Pattern = "^[" & xP & "]+ "
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

'            regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n))
            regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n))
            
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If
Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

rc = Cells(1, Columns.Count).End(xlToLeft).Column
'put the result

With Cells(2, rc + 2).Resize(d.Count, 2)
    
    Select Case d.Count
    Case Is < 65536 'Transpose function has a limit of 65536 item to process
        
        .Value = Application.Transpose(Array(d.Keys, d.Items))
        
    Case Is <= 1048500
        
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
    
    Case Else
        
        MsgBox "Process is canceled, the result is more than 1048500 rows"
    
    End Select
    
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
    
End With


Cells(1, rc + 2) = n & " WORD"
Cells(1, rc + 3) = "COUNT"

End Sub
Author
Akuini
Views
69
First release
Last update
Rating
0.00 star(s) 0 ratings

More Excel articles from Akuini

This Week's Hot Topics

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Top