Count words, but not HTML tags

chilly_bang

Board Regular
Joined
Jun 17, 2016
Messages
57
Hi
i want to count all words in a cell, but not those, which are placed between < and >. Or, count all words, which are placed between > and <.
How can i accomplish this? With regex it is a nobrainer, but in Excel - no clue:(
 

Excel Facts

Do you hate GETPIVOTDATA?
Prevent GETPIVOTDATA. Select inside a PivotTable. In the Analyze tab of the ribbon, open the dropown next to Options and turn it off
Have you considered a UDF using regex? If you were to use regex, what would you use?
 
Upvote 0
I think I have a formula that works. I think the best way is to create a name to hold the text, after being split into individual elements.

Hopefully this illustrates the approach using name in column C, but I have also illustrated using a single formula approach (no names) in column B.

<p>
Excel 2010
ABC
1Contentcount wordscount words using name
2some words some more wordsyet more words88
Sheet1
Cell Formulas
RangeFormula
B2{=SUMPRODUCT(IF(LEN(TRIM(MID(SUBSTITUTE(SUBSTITUTE(A2,"<",REPT(" ",255)&"<"),">",">"&REPT(" ",255)),ROW(INDEX(A:A,1,1):INDEX(A:A,255,1))*255,255))),IF(LEFT(TRIM(MID(SUBSTITUTE(SUBSTITUTE(A2,"<",REPT(" ",255)&"<"),">",">"&REPT(" ",255)),ROW(INDEX(A:A,1,1):INDEX(A:A,255,1))*255,255)),1)<>"<",1+LEN(TRIM(MID(SUBSTITUTE(SUBSTITUTE(A2,"<",REPT(" ",255)&"<"),">",">"&REPT(" ",255)),ROW(INDEX(A:A,1,1):INDEX(A:A,255,1))*255,255)))-LEN(SUBSTITUTE(TRIM(MID(SUBSTITUTE(SUBSTITUTE(A2,"<",REPT(" ",255)&"<"),">",">"&REPT(" ",255)),ROW(INDEX(A:A,1,1):INDEX(A:A,255,1))*255,255))," ","")),0),0))}
C2{=SUMPRODUCT(IF(LEN(SplitTag),IF(LEFT(SplitTag,1)<>"<",1+LEN(SplitTag)-LEN(SUBSTITUTE(SplitTag," ","")),0),0))}
Press CTRL+SHIFT+ENTER to enter array formulas.
Named Ranges
NameRefers ToCells
SplitTag=TRIM(MID(SUBSTITUTE(SUBSTITUTE(Sheet1!A2,"" ",255)&"),">",">"&REPT(" ",255)),ROW(INDEX(Sheet1!A:A,1,1):INDEX(Sheet1!A:A,255,1))*255,255))
 
Upvote 0
i want to count all words in a cell, but not those, which are placed between < and >. Or, count all words, which are placed between > and <.
Could we see 8-10 rows of representative sample data (with not too many 'words' in each cell) and the expected results?
 
Last edited:
Upvote 0
If you want to try a UDF, I think this may work for you...
Code:
[table="width: 500"]
[tr]
	[td]Function NonTagWordCount(S As String) As Long
  Dim X As Long, Phrases() As String
  Phrases = Split(Replace(S, "<", ">"), ">")
  For X = 0 To UBound(Phrases) Step 2
    NonTagWordCount = NonTagWordCount + UBound(Split(Application.Trim(Phrases(X)))) + 1
  Next
End Function[/td]
[/tr]
[/table]
 
Upvote 0
Could we see 8-10 rows of representative sample data (with not too many 'words' in each cell) and the expected results?
Sure, here a pair:
Code:
a1;<div class="brandingInfo_wrapper"><div class="footer_bandSingleText"><div id="band_info" class="dark"><h1 class="h2">x-men</h1><div class="textBlock"><p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos e <strong>X-Men</strong> Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum <strong><a href="/lis_Wolverine_bandbrand/">Wolverine</a></strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <strong>X-Men: Apocalypse</strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.</p></div></div></div></div>

a2;<div class="footer_bandSingleText"><div id="band_info" class="dark"><h1 class="h2">star trek</h1><div class="textBlock"><p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <strong>Star Trek</strong> Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. </strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. </strong> Lorem <strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. </strong> Ipsum <strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. </p></div></div></div>

@ Jon von der Heyden: On using your forumla from B2 i get always =, not dependently of content in A2. Is it because i'm on Office 2007?
@
Rick Rothstein: how is this code usable?
 
Last edited:
Upvote 0
@ [/COLOR]Rick Rothstein what do you mean with this code?[/QUOTE]
What I posted is called a UDF (user defined function). Perhaps this write-up which I have posted in the past will help...

HOW TO INSTALL UDFs
------------------------------------
If you are new to UDFs, they are easy to install and use. To install it, simply press ALT+F11 to go into the VB editor and, once there, click Insert/Module on its menu bar, then copy/paste the code I posted in Message #5 into the code window that just opened up. That's it.... you are done. You can now use NonTagWordCount just like it was a built-in Excel function. For example,

=NonTagWordCount(A1)

If you are using XL2007 or above, make sure you save your file as an "Excel Macro-Enabled Workbook (*.xlsm) and answer the "do you want to enable macros" question as "yes" or "OK" (depending on the button label for your version of Excel) the next time you open your workbook.
 
Upvote 0
Sure, here a pair:
@ Jon von der Heyden: On using your forumla from B2 i get always =, not dependently of content in A2. Is it because i'm on Office 2007?
I think that your result includes tags which is being parsed by the forum, so I can't remark on why the formula isn't working. Although you can probably tell from my first post that I would opt for a UDF. Either using Rick's, or if you know regex then you could apply the regex logic to your own UDF.

I should point out though that my formula is an array formula. Note the instruction in my post to type the formula with an "=" and to commit it with Control+Shift+Enter.
 
Last edited:
Upvote 0
I think this is the sample data you tried to post. (When posting HTML text, click Go Advanced at the bottom right of the Quick Reply window and look down for the option to turn off HTML for your post)

a1;<div class="brandingInfo_wrapper"><div class="footer_bandSingleText"><div id="band_info" class="dark"><h1 class="h2">x-men</h1><div class="textBlock"><p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos e <strong>X-Men</strong> Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum <strong><a href="/lis_Wolverine_bandbrand/">Wolverine</a></strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <strong>X-Men: Apocalypse</strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.</p></div></div></div></div>

a2;<div class="footer_bandSingleText"><div id="band_info" class="dark"><h1 class="h2">star trek</h1><div class="textBlock"><p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <strong>Star Trek</strong> Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. </strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. </strong> Lorem <strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. </strong> Ipsum <strong>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. </p></div></div></div>


So that is two sample texts, what about the last part of my question: "the expected results"?
.. and explain those expected results please.
 
Upvote 0
Well done Peter pulling that out!

@chilly_bank: Please do not overlook Peter's request for desired results.

Mine is more of a cautionary note: I believe there is a flaw in my formula, but I'm not feeling the love for the formula-based solution anyway so not feeling very inclined to debug it. Ricks UDF returns 92 (C1), which by my estimation is the correct result. Herewith a comparison of Rick's UDF and my original formula suggestion in B1.

<p>
Excel 2010
ABC
1x-menLorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos eX-Men Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsumWolverineLorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.X-Men: ApocalypseLorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.9492
Sheet1
Cell Formulas
RangeFormula
C1=NonTagWordCount(A1)
B1{=SUMPRODUCT(IF(LEN(SplitTag),IF(LEFT(SplitTag,1)<>"<",1+LEN(SplitTag)-LEN(SUBSTITUTE(SplitTag," ","")),0),0))}
Press CTRL+SHIFT+ENTER to enter array formulas.
Named Ranges
NameRefers ToCells
SplitTag=TRIM(MID(SUBSTITUTE(SUBSTITUTE(Sheet1!XFD1,"" ",255)&"),">",">"&REPT(" ",255)),ROW(INDEX(Sheet1!XFD:XFD,1,1):INDEX(Sheet1!XFD:XFD,255,1))*255,255))
 
Last edited:
Upvote 0

Forum statistics

Threads
1,213,487
Messages
6,113,943
Members
448,534
Latest member
benefuexx

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top