return all text strings beginning with "#" from a column

cosgrove

New Member
Joined
Mar 10, 2011
Messages
3
Hi, I have a single column with 860,000 rows (each cell is a tweet) and I need to extract all unique hashtags from that column.
(A hashtag is a text string that begins with "#" and has no spaces, but will vary in length.)
Many cells contain multiple hashtags, and the vast majority of hashtags will show up in hundreds or even thousands of cells. I need a simple list of all unique hashtags, and expect there to be between 200-500.
Any suggestions? Thanks so much.
 

Excel Facts

Enter current date or time
Ctrl+: enters current time. Ctrl+; enters current date. Use Ctrl+: Ctrl+; Enter for current date & time.
Welcome to the forums!

Try Text to Columns, using the # as a delimiter.
 
Upvote 0
Welcome to the forums!

Try Text to Columns, using the # as a delimiter.

Thanks, yes, only problem is that many cells have up to 7 or 8 hash tags, which will result in an unwieldy number of columns in addition to the 80000 rows, so was hoping to find an elegant formulaic solution. But, yes, I may end up having to T2C...
 
Upvote 0
Could you run a pivot table putting the text strings in the rows area. Then deselect the non-hashtags. That should give you a unique list.
 
Upvote 0
Can you provide a sample of your data, as well as desired outcome?
 
Upvote 0
Can you provide a sample of your data, as well as desired outcome?

Sure, here are some examples from the column:

#revolution or not .. situation is never going to be the same it used to be. and that's a start. #jan25 #egypt #tunisia

What about a No Fly Zone for the #Palestinians? #jan25 #feb14 #feb17

Tahrir Square completely empty this morning, guarded by Egyptian police. #Jan25 #egypt

charity event in #Tanta if you live there http://on.fb.me/eI5YHd #Tanta #Egypt #jan25

For which I would like the following results:
#revolution
#jan25
#Egypt
#tunisia
#feb14
#feb17
#egypt
#Tanta
 
Upvote 0
If you like VBA code, try this on the sample data you provided.

You'll note that the results given (in columnM) differ from the sample results you provided. You might like to examine both carefully and note the reasons for the differences.
Code:
sub reorganizeit()
Dim d As Object, a, e, f, k&, u()
Set d = CreateObject("scripting.dictionary")
For Each e In Range("A1").CurrentRegion.Resize(, 1)
x = e.Replace("#", Chr(30) & "#")
e = Split(e, Chr(30), -1)
For Each f In e
    If Left(f, 1) = "#" Then
        f = Split(f, " ")(0)
        d(f) = 0
    End If
Next f
Next e
ReDim u(1 To d.Count, 1 To 1)
For Each e In d.keys
    k = k + 1
    u(k, 1) = e
Next e
[m1].Resize(d.Count) = u
End Sub
Post back if you see any problem.
 
Upvote 0
Hello Cosgrove,

This method provides a more robust method of parsing the tags using Regular Expressions and, like the previous post, uses a Dictionary object to provide a unique list of tags. The tags are assumed to start in cell "A1" and the results start in cell "M1". Case is ignored when testing for uniqueness.

Code:
Sub ListTweetTags()

  Dim Cell As Range
  Dim Dict As Object
  Dim RegExp As Object
  Dim Item As Variant
  Dim Key As String
  Dim LastRow As Long
  Dim Rng As Range
  Dim Text As String
  Dim Wks As Worksheet
  
    Set Wks = ActiveSheet
    Set Rng = Wks.Range("A1")
    
      LastRow = Wks.Cells(Rows.Count, Rng.Column).End(xlUp).Row
      Set Rng = IIf(LastRow < Rng.Row, Rng, Rng.Resize(LastRow + Rng.Row - 1, 1))
      
      Set Dict = CreateObject("Scripting.Dictionary")
      Dict.CompareMode = vbTextCompare
      
      Set RegExp = CreateObject("VBScript.RegExp")
      RegExp.IgnoreCase = False
      RegExp.Pattern = "[^#]*(#\S+)\s*(.*)"
      
      For Each Cell In Rng
        Text = Cell
        Do While RegExp.Test(Text) = True
          Tag = RegExp.Replace(Text, "$1")
            If Tag <> "#" Then
              If Not Dict.Exists(Tag) Then Dict.Add Tag, ""
            End If
          Text = RegExp.Replace(Text, "$2")
        Loop
      Next Cell
      
   Wks.Range("M1").Resize(Dict.Count, 1).Value = WorksheetFunction.Transpose(Dict.Keys)
   
End Sub
Sincerely,
Leith Ross
 
Upvote 0
hi cosgrove,

Perhaps your example of the output you want was somewhat ambiguous.

Posts #7 and #8 above seem take different interpretations, and give differing results using more general data (irrespective of upper/lower case considerations). i.e. they do somewhat different things.

It seems to me, on studying your example that you wanted either of
(a) all strings listed (without duplication) that start with #, and with no spaces in these strings. My post #7 aimed to do this.
(b) all strings listed (without duplication) that are separated by a space and a hash, apart from the first on the line which needs a hash only.


My post #7 aimed to do (a). That one was only intended to be an example code, and it can be speeded up by 10 or times with very minor modification.

If (b) is what you want, the following code aims to do this. On some test data it took about 5 seconds to do 860,000 rows using a moderately specced computer.
Code:
Sub alternative()
Dim d As Object, n&, j&, x, e
Set d = CreateObject("scripting.dictionary")
n = Cells(Rows.Count, 1).End(3).Row
For Each e In Range("A1").Resize(n).Value
    x = Split(e, " ", -1)
    For j = 0 To UBound(x)
        If Left(x(j), 1) = "#" Then d(x(j)) = 1
    Next j
Next e
[n1].Resize(d.Count) = Application.Transpose(d.Keys)
End Sub
If the output you want is something other, then I guess the answer must lie elsewhere, perhaps in post #8.
 
Upvote 0
Hello Cosgrove,

I did catch a problem with my previous post. After running the macro on a collection of randomly sampled Tweets, I discovered there were tags which had punctuation attached to them. Since the hashtag convention is alphanumeric without spaces, these tags were being incorrectly identified as unique. To solve the problem, I simply changed the following line:
Code:
  From This...
    RegExp.Pattern = "[^#]*(#\S+)\s*(.*)"

  To This...
    RegExp.Pattern = "[^#]*(#\w+)\s*(.*)"
Here is the macro code with the correction included.
Code:
Sub ListTweetTags2()

  Dim Cell As Range
  Dim Dict As Object
  Dim RegExp As Object
  Dim Item As Variant
  Dim Key As String
  Dim LastRow As Long
  Dim Rng As Range
  Dim Text As String
  Dim Wks As Worksheet
  
    StartTime = Timer
    
    Set Wks = ActiveSheet
    Set Rng = Wks.Range("A1")
    
      LastRow = Wks.Cells(Rows.Count, Rng.Column).End(xlUp).Row
      Set Rng = IIf(LastRow < Rng.Row, Rng, Rng.Resize(LastRow + Rng.Row - 1, 1))
      
      Set Dict = CreateObject("Scripting.Dictionary")
      Dict.CompareMode = vbTextCompare
      
      Set RegExp = CreateObject("VBScript.RegExp")
      RegExp.IgnoreCase = False
      RegExp.Pattern = "[^#]*(#\w+)\s*(.*)"
      
      For Each Cell In Rng
        Text = Cell
        Do While RegExp.Test(Text) = True
          Tag = RegExp.Replace(Text, "$1")
            If Tag <> "#" Then
              If Not Dict.Exists(Tag) Then Dict.Add Tag, ""
            End If
          Text = RegExp.Replace(Text, "$2")
        Loop
      Next Cell
      
   Wks.Range("M1").Resize(Dict.Count, 1).Value = WorksheetFunction.Transpose(Dict.Keys)
   
   EndTime = Timer
   TotalTime = EndTime - StartTime
   
End Sub
Sincerely,
Leith Ross
 
Upvote 0

Forum statistics

Threads
1,224,599
Messages
6,179,831
Members
452,947
Latest member
Gerry_F

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top