VBA to combine data from multiple cells into one and then search the combined data for a specific string

BiochemBoi95

New Member
Joined
Jan 22, 2020
Messages
20
Office Version
  1. 365
  2. 2016
Platform
  1. Windows
  2. Web
So I have DNA sequences that excel imports in this format:

>PRS1_SeqF2
NNNNNNNNNNGGTTNNNNGTCNCGNCGTTGTANACGACGGCCAGCATGGAGTACAAGGGA
GGTACTTCCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACA
CAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACT
CACATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCT
GCATTAATGAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGC
TTCCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCA
CTCAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAAGAACATGTG
AGCAAAAGGCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCGCGTTGCTGGCGTTTTTCCA
TAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCTCAAGTCAGAGGTGGCGAAA
CCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGCGCTCTCC
TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC
GCTTTCTCATAGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCT
GGGCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCG
TCTTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAG
GATTAGCAGAGCGAGGTATGTAGGNCNGTGCTACAGAGTTCTTGAAGTGGTGGNCCTAAC
TACNGCTACACTAGAAAGAANNGTATTTGGNATCTGCNCTCTGCTGAANNCANNNACNNN
CGGAAAAAANAGTTGGGTAGCTNNTNNANTCCCGGGNNAANNAAACCACCNNGNTNNNNN
NGNNNNNNNNNNNNNNNNTTNGCNANNCCANNNNNNNNNNNNNNCTNGNNNNNNNNNGGN
NTNNNNANNNNANNNN
>PRS2_SeqF2
NNNNNNNNNNNNTNNNNTCACGNCGTTGTAAACGACGGCCAGCATGGAGTACAAGGGAGG
TACTTCCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACA
ACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCA
CATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGC
ATTAATGAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTT
CCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACT
CAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAAGAACATGTGAG
CAAAAGGCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCGCGTTGCTGGCGTTTTTCCATA
GGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCTCAAGTCAGAGGTGGCGAAACC
CGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGCGCTCTCCTG
TTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCGC
TTTCTCATAGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGG
GCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTC
TTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAGGA
TTAGCAGAGCGAGGTATGTAGGCGGTGCTACAGAGTTCTTGAAGTGGTGGCCTAANTACG
GCTACACTAGAAAGAACAGTATTTGGTATCTGCGCTCTGCTGAAGCCCAGTTACCTTTCG
GAAAAAAGAGTTNGGNAGCTCTTGNTNCCGGGCAAANNNNCNACCGCTNGGTANNNGNNG
NNNTTTTTTTTNNTTGCNNNCNNGCANNNTNACGCCNNANNAAANNNNNNTTNNNANNNN

Each sequence varies in length so it may not occupy the same number of rows. I want each of these blocks of text to be combined into once cell under the sequence labels you see at the top of each block instead of spread out across multiple rows as it currently is. I just showed you these two blocks but my document has many more in the same format. Then I have code for searching these long strings for a specific substring but I want that to be incorporated into the same VBA after the text has been formatted into one cell.

Here is the code that I have for the search part:
=IFERROR(MID($A$1,SMALL(IF(MID($A$1,ROW(INDIRECT("1:"&LEN($A$1)-24)),6)="GTACAA",IF(MID($A$1,ROW(INDIRECT("25:"&LEN($A$1))),6)="GGGAGG",ROW(INDIRECT("7:"&LEN($A$1)-18)))),ROWS($A$3:$A3)),18),"")

What the code is supposed to do is search for subsrings "GTACAA" and "GGGAGG" and returns the letters between these two substrings but only if the two substrings are 18 characters apart. Potentially, I would also like to exclude any returned strings if there is an N present. Ideally I'd like the returned strings for each block of text to be put off the right somewhere in a single column.

I know that was a lot so any help is greatly appreciated.
 

Excel Facts

Control Word Wrap
Press Alt+Enter to move to a new row in a cell. Lets you control where the words wrap.
TestBase.xlsm
A
3
4>PRS1_SeqF2 NNNNNNNNNNGGTTNNNNGTCNCGNCGTTGTANACGACGGCCAGCATGGAGTACAAGGGA GGTACTTCCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACA CAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACT CACATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCT GCATTAATGAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGC TTCCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCA CTCAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAAGAACATGTG AGCAAAAGGCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCGCGTTGCTGGCGTTTTTCCA TAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCTCAAGTCAGAGGTGGCGAAA CCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGCGCTCTCC TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC GCTTTCTCATAGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCT GGGCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCG TCTTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAG GATTAGCAGAGCGAGGTATGTAGGNCNGTGCTACAGAGTTCTTGAAGTGGTGGNCCTAAC TACNGCTACACTAGAAAGAANNGTATTTGGNATCTGCNCTCTGCTGAANNCANNNACNNN CGGAAAAAANAGTTGGGTAGCTNNTNNANTCCCGGGNNAANNAAACCACCNNGNTNNNNN NGNNNNNNNNNNNNNNNNTTNGCNANNCCANNNNNNNNNNNNNNCTNGNNNNNNNNNGGN NTNNNNANNNNANNNN >PRS2_SeqF2 NNNNNNNNNNNNTNNNNTCACGNCGTTGTAAACGACGGCCAGCATGGAGTACAAGGGAGG TACTTCCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACA ACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCA CATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGC ATTAATGAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTT CCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACT CAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAAGAACATGTGAG CAAAAGGCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCGCGTTGCTGGCGTTTTTCCATA GGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCTCAAGTCAGAGGTGGCGAAACC CGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGCGCTCTCCTG TTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCGC TTTCTCATAGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGG GCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTC TTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAGGA TTAGCAGAGCGAGGTATGTAGGCGGTGCTACAGAGTTCTTGAAGTGGTGGCCTAANTACG GCTACACTAGAAAGAACAGTATTTGGTATCTGCGCTCTGCTGAAGCCCAGTTACCTTTCG GAAAAAAGAGTTNGGNAGCTCTTGNTNCCGGGCAAANNNNCNACCGCTNGGTANNNGNNG NNNTTTTTTTTNNTTGCNNNCNNGCANNNTNACGCCNNANNAAANNNNNNTTNNNANNNN
Sheet1

This is what you would see for just the two groups in your example. There is a limit as to the number of characters a cell can hold.

Here is the code I used to put this in cell A4 after copying your data to rows 7 and down. Maybe you can work with it to do what you want. But I would not advise trying to put everything in one cell. Maybe each group per cell would be a better idea.
VBA Code:
Sub t()
Dim c As Range
For Each c In Range("A7", Cells(Rows.Count, 1).End(xlUp))
    If c <> "" Then
        If Range("A4") = "" Then
            Range("A4") = c.Value
        Else
            Range("A4") = Range("A4").Value & vbLf & c.Value
        End If
    End If
Next
End Sub
 
Last edited:
Upvote 0
I would struggle to do this with VBA/formulas because of all of the many things you want to do here, plus as @JLGWhiz says, the cell limit issue is there too.

It is relatively easy to do with other languages, however. This with OmniMark (very old OMLE actually, 1998 vintage), for example, run on your input file does everything you're after:
Code:
process
  submit #main-input

find (">PRS" any-text+)=>lpv-initial ((lookahead not (">PRS")) any)+=>lpv-PRS
  local stream ls-PRS
  local switch lsw-N initial {false}
  open ls-PRS as buffer
  repeat scan lpv-prs
    match uc=>lpv-uc
      put ls-PRS lpv-uc
    match any
      ;ignore
  again
  close ls-PRS
  output lpv-initial || '%n' || ls-PRS
  repeat scan ls-PRS
    match "GTACAA" uc{18}=>lpv-18 "GGGAGG"
      put #console "F"
      do scan lpv-18
        match unanchored "N"
          set lsw-N to true
      done
      do unless lsw-N
        output ",%x(lpv-18)"
      done
    match any
      put #console "."
  again
  output "%n"

I used this because I could very easily visualise the solution given OmniMark's pattern matching and stream processing is so strong...and that is what you are doing here, especially as the input file is clearly text and Excel is not really required? This will handle any size of PRS data.

The output, shown in a text editor:
1584730622851.png

...noting the "..." is manually omitted characters to illustrate, and the comma is to make it a .csv file, which will open in Excel, though who knows how it would behave due to the cell lengths?!

Opened in Excel (without the "..."s) and with bold, green and a highlighter manually applied to make it easier to see:
1584731172647.png


No doubt Perl, Python, maybe Powershell, or whatever could do something similar. I'd prefer a non-Excel solution to this myself as there seems no reason to use a less-suited tool (IMO) to do this particular job.
 
Upvote 0
What the code is supposed to do is search for subsrings "GTACAA" and "GGGAGG" and returns the letters between these two substrings but only if the two substrings are 18 characters apart. Potentially, I would also like to exclude any returned strings if there is an N present. Ideally I'd like the returned strings for each block of text to be put off the right somewhere in a single column.
If such a substring exists, will there be only one such substring at maximum or could any single DNA sequence contain more than one such substring?

While I am waiting for the answer to the above question, I thought you might like to see how I would write the code to combine each set of row for a single DNA sequence into a single celll...
VBA Code:
Sub DNAcombine()
  Dim Ar As Range
  For Each Ar In Columns("A").SpecialCells(xlConstants).Areas
    Ar(1).Offset(, 1).Value = Ar(1).Value
    Ar(1).Offset(1, 1).Value = Join(Application.Transpose(Ar(1).Offset(1).Resize(Ar.Rows.Count - 1)), "")
  Next
End Sub
 
Upvote 0
Wow, you answered that quickly... you probably missed my interim code that I added by editing Message #5 shortly after posting it.
 
Upvote 0

Forum statistics

Threads
1,214,991
Messages
6,122,628
Members
449,095
Latest member
bsb1122

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top