Count words, but not HTML tags

chilly_bang · Feb 28, 2017

Hi
i want to count all words in a cell, but not those, which are placed between < and >. Or, count all words, which are placed between > and <.
How can i accomplish this? With regex it is a nobrainer, but in Excel - no clue

Peter_SSs · Feb 28, 2017

Having waded through those samples (& some smaller ones) I'm also in agreement that Rick's UDF returns the correct count. However, I think the OP's idea of a RexExp solution will produce a UDF at least twice as fast on sample data like that - which of course would only be an issue as the amount of data gets large or if there are cells with much longer strings.

Rich (BB code):

Function CountWords(s As String) As Long
  Static RX As Object
  
  If RX Is Nothing Then
    Set RX = CreateObject("VBScript.RegExp")
    RX.Global = True
    RX.Pattern = "<.*?>"
  End If
  CountWords = UBound(Split(Application.Trim(RX.Replace(s, " ")))) + 1
End Function

Jon von der Heyden · Feb 28, 2017

Right, so the problem with the formula was that there are elements longer than 255 char. So in the name, SplitTag, we can up the string length. E.g:

=TRIM(MID(SUBSTITUTE(SUBSTITUTE(Sheet1!A1,"<",REPT(" ",768)&"<"),">",">"&REPT(" ",768)),ROW(INDEX(Sheet1!A:A,1,1):INDEX(Sheet1!A:A,768,1))*768,768))

Then we get 92. However, as said before, I would lean toward a UDF, particularly the regex approach.

Rick Rothstein · Feb 28, 2017

Rick Rothstein said:

If you want to try a UDF, I think this may work for you...

Code:

[table="width: 500"]
[tr]
	[td]Function NonTagWordCount(S As String) As Long
  Dim X As Long, Phrases() As String
  Phrases = Split(Replace(S, "<", ">"), ">")
  For X = 0 To UBound(Phrases) Step 2
    NonTagWordCount = NonTagWordCount + UBound(Split(Application.Trim(Phrases(X)))) + 1
  Next
End Function[/td]
[/tr]
[/table]

I just got up from a good night's sleep (got a full 4.5 hours of sleep in this time) and in looking at my code, I see I made it slower than it should be. This version should probably be the fastest that a non-RegExp UDF can be although I do not know how that would compare speed-wise to the RegExp UDF posted here (anyone want to test it). Here is my revised code...

Code:

[table="width: 500"]
[tr]
	[td]Function NonTagWordCount(S As String) As Long
  Dim X As Long, Phrases() As String
  Phrases = Split(Replace(S, "<", ">"), ">")
  For X = 1 To UBound(Phrases) Step 2
    Phrases(X) = ""
  Next
  NonTagWordCount = UBound(Split(Application.Trim(Join(Phrases)))) + 1
End Function[/td]
[/tr]
[/table]

Rick Rothstein · Feb 28, 2017

Rick Rothstein said:
I just got up from a good night's sleep (got a full 4.5 hours of sleep in this time) and in looking at my code (Message #5), I see I made it slower than it should be. This version should probably be the fastest that a non-RegExp UDF can be although I do not know how that would compare speed-wise to the RegExp UDF posted here (anyone want to test it). Here is my revised code...

Code:

[table="width: 500"] [tr] [td]Function NonTagWordCount(S As String) As Long Dim X As Long, Phrases() As String Phrases = Split(Replace(S, "<", ">"), ">") For X = 1 To UBound(Phrases) Step 2 Phrases(X) = "" Next NonTagWordCount = UBound(Split(Application.Trim(Join(Phrases)))) + 1 End Function[/td] [/tr] [/table]

I probably should have emphasized what I show in red above a little bit more... I am guessing that my new code should be faster based on my assumption that the repeated Application.Trim function calls along with additions are more "expensive" than a single Application.Trim function call coupled with a Join function call; however, in thinking about it some more, they may turn out to be equally fast as each other... and possibly my original code might turn out to be faster after all. Confused yet? Sorry, but in thinking about it some more, I am not so sure my analysis of the trade-offs between the two codes is completely solid. For example, if there was only one HTML tag in the text, I kind of think my first UDF would be the fastest, but if there were 100's of HTML tags, then I think my second UDF would have the edge. With that said, I am thinking the RegExp would probably be the way to go (although I would still be interested in a speed comparison test between them).

Jon von der Heyden · Feb 28, 2017

Since you show some curiosity in terms of performance.

I set-up a very crude test on my rubbish machine. No high-resolution timer needed. Results are clear, to me. Over 11370 cells with tagged text (using the very same A1 sample given):

1. My formula approach, with enhancement * is 47 seconds.
2. Peters Regex approach is 14 seconds.
3. Rick's Split Step 2 approach is 19 seconds.

* The enhancement I made was to my SplitTag name formula, such that it only returns as many elements as there will be in the text, rather than arbitrary e.g. 255.

Personally I think either UDF option is sufficient. Native formula approach (rephrase: my native formula approach) is NOT efficient.

Rick Rothstein · Feb 28, 2017

Jon von der Heyden said:
Since you show some curiosity in terms of performance.

I set-up a very crude test on my rubbish machine. No high-resolution timer needed. Results are clear, to me. Over 11370 cells with tagged text (using the very same A1 sample given):

1. My formula approach, with enhancement * is 47 seconds.
2. Peters Regex approach is 14 seconds.
3. Rick's Split Step 2 approach is 19 seconds.

* The enhancement I made was to my SplitTag name formula, such that it only returns as many elements as there will be in the text, rather than arbitrary e.g. 255.

Personally I think either UDF option is sufficient. Native formula approach (rephrase: my native formula approach) is NOT efficient.

Thanks for the tests Jon... much appreciated. I am a little surprised that the UDF solutions are so much faster than the formula solution... I thought the conventional wisdom was that formulas, even complex ones, that used native Excel functions had the edge over formulas using UDF coded functions.

Jon von der Heyden · Feb 28, 2017

Hi Rick

I am not surprised that the UDF is faster, but extent of the difference does surprise me. I find that Excel is somewhat lacking in string manipulation worksheet functions. I have had instances in the past where I have elected to use UDF's, although again mostly in favour of RegEx ones. But I don't want to label Excel formula's weak in this regard: an import caveat is that I find this true with my formulas. In this instance, to return a count of 92 I have three separate formula entities.

Target: Sheet1!A3

Name: MaxElements
RefersTo: =LEN(SUBSTITUTE(Sheet1!$A3,"<",">"))-LEN(SUBSTITUTE(SUBSTITUTE(Sheet1!$A3,"<",">"),">",""))+1

Name: SplitTag
RefersTo: =TRIM(MID(SUBSTITUTE(SUBSTITUTE(Sheet1!$A3,"<",REPT(" ",1024)&"<"),">",">"&REPT(" ",1024)),ROW(INDEX(Sheet1!$A:$A,1,1):INDEX(Sheet1!$A:$A,MaxElements,1))*1024,1024))

Result in: B3
Formula: =SUMPRODUCT(IF(LEN(SplitTag),IF(LEFT(SplitTag,1)<>"<",1+LEN(SplitTag)-LEN(SUBSTITUTE(SplitTag," ","")),0),0))

chilly_bang · Mar 1, 2017

Rick Rothstein said:
@ [/COLOR]HOW TO INSTALL UDFs

Thank you for this tutorial! - it works exactly as you've explained.
The single question i have about this procedure:

On cleaning HTML tags with text editor, like Notepad++, with regex search for <[^>]*> and replace with nothing, i get slightly different word amounts, as with your UDF. Could you explain me such difference? The numbers i get with your UDF are always a little bit higher, but i can't understand, what is counted differently.

Jon von der Heyden · Mar 1, 2017

How do your results compare if you use the regex pattern per Peter's Regex Function? I am confident in Rick's function result, as Peters function and my formula approach seem to evaluate the same result and indeed my manual evaluation also agreed with the results.

* referring to the sample that you supplied

Rick Rothstein · Mar 1, 2017

chilly_bang said:
Thank you for this tutorial! - it works exactly as you've explained.
The single question i have about this procedure:

On cleaning HTML tags with text editor, like Notepad++, with regex search for <[^>]*> and replace with nothing, i get slightly different word amounts, as with your UDF. Could you explain me such difference? The numbers i get with your UDF are always a little bit higher, but i can't understand, what is counted differently.

The routines and formula posted here take the simple approach that anything between two spaces is considered to be a word. I don't know how Notepad++ (or whatever you used to count words) treats stand-alone numbers or groups of characters that do not contain alphabetic letters, but that may be one place that differences could occur. Another might be brackets encased in quote marks... our routines and formula assume a bracket is an HTML tag whereas Notepad++ (or whatever you used to count words) might have a more sophisticated algorithm the treats brackets in quotes as just a character without HTML significance.

Count words, but not HTML tags

chilly_bang

Board Regular

Peter_SSs

MrExcel MVP, Moderator

Excel Facts

Jon von der Heyden

MrExcel MVP, Moderator

Rick Rothstein

MrExcel MVP

Rick Rothstein

MrExcel MVP

Jon von der Heyden

MrExcel MVP, Moderator

Rick Rothstein

MrExcel MVP

Jon von der Heyden

MrExcel MVP, Moderator

chilly_bang

Board Regular

Jon von der Heyden

MrExcel MVP, Moderator

Rick Rothstein

MrExcel MVP

Similar threads

Forum statistics

Share this page

Count words, but not HTML tags

Board Regular

MrExcel MVP, Moderator

Excel Facts

MrExcel MVP, Moderator

MrExcel MVP

MrExcel MVP

MrExcel MVP, Moderator

MrExcel MVP

MrExcel MVP, Moderator

Board Regular

MrExcel MVP, Moderator

MrExcel MVP

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock