Count words, but not HTML tags

chilly_bang

Board Regular
Joined
Jun 17, 2016
Messages
57
Hi
i want to count all words in a cell, but not those, which are placed between < and >. Or, count all words, which are placed between > and <.
How can i accomplish this? With regex it is a nobrainer, but in Excel - no clue:(
 
Having waded through those samples (& some smaller ones) I'm also in agreement that Rick's UDF returns the correct count. However, I think the OP's idea of a RexExp solution will produce a UDF at least twice as fast on sample data like that - which of course would only be an issue as the amount of data gets large or if there are cells with much longer strings.

Rich (BB code):
Function CountWords(s As String) As Long
  Static RX As Object
  
  If RX Is Nothing Then
    Set RX = CreateObject("VBScript.RegExp")
    RX.Global = True
    RX.Pattern = "<.*?>"
  End If
  CountWords = UBound(Split(Application.Trim(RX.Replace(s, " ")))) + 1
End Function
 
Last edited:
Upvote 0

Excel Facts

How can you turn a range sideways?
Copy the range. Select a blank cell. Right-click, Paste Special, then choose Transpose.
Right, so the problem with the formula was that there are elements longer than 255 char. So in the name, SplitTag, we can up the string length. E.g:

=TRIM(MID(SUBSTITUTE(SUBSTITUTE(Sheet1!A1,"<",REPT(" ",768)&"<"),">",">"&REPT(" ",768)),ROW(INDEX(Sheet1!A:A,1,1):INDEX(Sheet1!A:A,768,1))*768,768))

Then we get 92. However, as said before, I would lean toward a UDF, particularly the regex approach. ;)
 
Last edited:
Upvote 0
If you want to try a UDF, I think this may work for you...
Code:
[table="width: 500"]
[tr]
	[td]Function NonTagWordCount(S As String) As Long
  Dim X As Long, Phrases() As String
  Phrases = Split(Replace(S, "<", ">"), ">")
  For X = 0 To UBound(Phrases) Step 2
    NonTagWordCount = NonTagWordCount + UBound(Split(Application.Trim(Phrases(X)))) + 1
  Next
End Function[/td]
[/tr]
[/table]
I just got up from a good night's sleep (got a full 4.5 hours of sleep in this time) and in looking at my code, I see I made it slower than it should be. This version should probably be the fastest that a non-RegExp UDF can be although I do not know how that would compare speed-wise to the RegExp UDF posted here (anyone want to test it). Here is my revised code...
Code:
[table="width: 500"]
[tr]
	[td]Function NonTagWordCount(S As String) As Long
  Dim X As Long, Phrases() As String
  Phrases = Split(Replace(S, "<", ">"), ">")
  For X = 1 To UBound(Phrases) Step 2
    Phrases(X) = ""
  Next
  NonTagWordCount = UBound(Split(Application.Trim(Join(Phrases)))) + 1
End Function[/td]
[/tr]
[/table]
 
Last edited:
Upvote 0
I just got up from a good night's sleep (got a full 4.5 hours of sleep in this time) and in looking at my code (Message #5), I see I made it slower than it should be. This version should probably be the fastest that a non-RegExp UDF can be although I do not know how that would compare speed-wise to the RegExp UDF posted here (anyone want to test it). Here is my revised code...
Code:
[table="width: 500"]
[tr]
	[td]Function NonTagWordCount(S As String) As Long
  Dim X As Long, Phrases() As String
  Phrases = Split(Replace(S, "<", ">"), ">")
  For X = 1 To UBound(Phrases) Step 2
    Phrases(X) = ""
  Next
  NonTagWordCount = UBound(Split(Application.Trim(Join(Phrases)))) + 1
End Function[/td]
[/tr]
[/table]
I probably should have emphasized what I show in red above a little bit more... I am guessing that my new code should be faster based on my assumption that the repeated Application.Trim function calls along with additions are more "expensive" than a single Application.Trim function call coupled with a Join function call; however, in thinking about it some more, they may turn out to be equally fast as each other... and possibly my original code might turn out to be faster after all. Confused yet? Sorry, but in thinking about it some more, I am not so sure my analysis of the trade-offs between the two codes is completely solid. For example, if there was only one HTML tag in the text, I kind of think my first UDF would be the fastest, but if there were 100's of HTML tags, then I think my second UDF would have the edge. With that said, I am thinking the RegExp would probably be the way to go (although I would still be interested in a speed comparison test between them).
 
Last edited:
Upvote 0
Since you show some curiosity in terms of performance. ;)

I set-up a very crude test on my rubbish machine. No high-resolution timer needed. Results are clear, to me. Over 11370 cells with tagged text (using the very same A1 sample given):

1. My formula approach, with enhancement * is 47 seconds.
2. Peters Regex approach is 14 seconds.
3. Rick's Split Step 2 approach is 19 seconds.

* The enhancement I made was to my SplitTag name formula, such that it only returns as many elements as there will be in the text, rather than arbitrary e.g. 255.

Personally I think either UDF option is sufficient. Native formula approach (rephrase: my native formula approach) is NOT efficient.
 
Last edited:
Upvote 0
Since you show some curiosity in terms of performance. ;)

I set-up a very crude test on my rubbish machine. No high-resolution timer needed. Results are clear, to me. Over 11370 cells with tagged text (using the very same A1 sample given):

1. My formula approach, with enhancement * is 47 seconds.
2. Peters Regex approach is 14 seconds.
3. Rick's Split Step 2 approach is 19 seconds.

* The enhancement I made was to my SplitTag name formula, such that it only returns as many elements as there will be in the text, rather than arbitrary e.g. 255.

Personally I think either UDF option is sufficient. Native formula approach (rephrase: my native formula approach) is NOT efficient.
Thanks for the tests Jon... much appreciated. I am a little surprised that the UDF solutions are so much faster than the formula solution... I thought the conventional wisdom was that formulas, even complex ones, that used native Excel functions had the edge over formulas using UDF coded functions.
 
Upvote 0
Hi Rick

I am not surprised that the UDF is faster, but extent of the difference does surprise me. I find that Excel is somewhat lacking in string manipulation worksheet functions. I have had instances in the past where I have elected to use UDF's, although again mostly in favour of RegEx ones. But I don't want to label Excel formula's weak in this regard: an import caveat is that I find this true with my formulas. In this instance, to return a count of 92 I have three separate formula entities.

Target: Sheet1!A3

Name: MaxElements
RefersTo: =LEN(SUBSTITUTE(Sheet1!$A3,"<",">"))-LEN(SUBSTITUTE(SUBSTITUTE(Sheet1!$A3,"<",">"),">",""))+1

Name: SplitTag
RefersTo: =TRIM(MID(SUBSTITUTE(SUBSTITUTE(Sheet1!$A3,"<",REPT(" ",1024)&"<"),">",">"&REPT(" ",1024)),ROW(INDEX(Sheet1!$A:$A,1,1):INDEX(Sheet1!$A:$A,MaxElements,1))*1024,1024))

Result in: B3
Formula: =SUMPRODUCT(IF(LEN(SplitTag),IF(LEFT(SplitTag,1)<>"<",1+LEN(SplitTag)-LEN(SUBSTITUTE(SplitTag," ","")),0),0))
 
Last edited:
Upvote 0
@ [/COLOR]HOW TO INSTALL UDFs
Thank you for this tutorial! - it works exactly as you've explained.
The single question i have about this procedure:

On cleaning HTML tags with text editor, like Notepad++, with regex search for <[^>]*> and replace with nothing, i get slightly different word amounts, as with your UDF. Could you explain me such difference? The numbers i get with your UDF are always a little bit higher, but i can't understand, what is counted differently.
 
Last edited:
Upvote 0
How do your results compare if you use the regex pattern per Peter's Regex Function? I am confident in Rick's function result, as Peters function and my formula approach seem to evaluate the same result and indeed my manual evaluation also agreed with the results.

* referring to the sample that you supplied
 
Last edited:
Upvote 0
Thank you for this tutorial! - it works exactly as you've explained.
The single question i have about this procedure:

On cleaning HTML tags with text editor, like Notepad++, with regex search for <[^>]*> and replace with nothing, i get slightly different word amounts, as with your UDF. Could you explain me such difference? The numbers i get with your UDF are always a little bit higher, but i can't understand, what is counted differently.
The routines and formula posted here take the simple approach that anything between two spaces is considered to be a word. I don't know how Notepad++ (or whatever you used to count words) treats stand-alone numbers or groups of characters that do not contain alphabetic letters, but that may be one place that differences could occur. Another might be brackets encased in quote marks... our routines and formula assume a bracket is an HTML tag whereas Notepad++ (or whatever you used to count words) might have a more sophisticated algorithm the treats brackets in quotes as just a character without HTML significance.
 
Last edited:
Upvote 0

Forum statistics

Threads
1,214,824
Messages
6,121,783
Members
449,049
Latest member
greyangel23

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top