Parsing text strings when the source is not consistent

rizzo93

Active Member
Joined
Jan 22, 2015
Messages
303
Office Version
  1. 365
I'm trying to parse out text strings from a source that is not consistently formatted. These extractions will be inserted into separate cells.

In a "perfect scenario", the data source should be formatted in the following manner:

Cell C1
1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. Pilot, Hal Jordan 4b. Engineering, Barry Allen 4c. IT, Peter Parker 4d. Security, Tony Stark

The use of numbers, spaces and commas should always be constant. Blue text can be considered my markers with which to find things.

Sometimes the data comes over like this:

1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan 4b. , 4c. , 4d. ,
1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. , Hal Jordan 4b. , Barry Allen 4c. , Peter Parker 4d. , Tony Stark
Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan

I need to allow for these variances and extract the names and areas of expertise (the descriptions that are sometimes included just before the person's name like Engineering and Security above).

So using the "perfect scenario" above, it would look like this:

[TABLE="class: grid, width: 800"]
<tbody>[TR]
[TD]A
[/TD]
[TD]B
[/TD]
[TD]C
[/TD]
[TD]D
[/TD]
[TD]E
[/TD]
[TD]F
[/TD]
[TD]G
[/TD]
[/TR]
[TR]
[TD]Clark Kent
[/TD]
[TD]Diana Prince
[/TD]
[TD]Bruce Wayne
[/TD]
[TD]Pilot, Hal Jordan
[/TD]
[TD]Engineering, Barry Allen
[/TD]
[TD]IT, Peter Parker
[/TD]
[TD]Security, Tony Stark
[/TD]
[/TR]
</tbody>[/TABLE]

My formulas so far look like this, but I end up having to add special tweaks to cells E, F and G:

[TABLE="class: grid, width: 710"]
<tbody>[TR]
[TD]A
[/TD]
[TD]B
[/TD]
[TD]C
[/TD]
[TD]D
[/TD]
[TD]E
[/TD]
[TD]F
[/TD]
[TD]G
[/TD]
[/TR]
[TR]
[TD]=TRIM(MID($C1,14,FIND("2",$C1)-14))
[/TD]
[TD]=TRIM(MID($C1,FIND("n,",$C1)+2,FIND("3",$C1)-FIND("n,",$C1)-2))
[/TD]
[TD]=TRIM(MID($C1,FIND("l,",$C1)+2,FIND("4",$C1)-FIND("l,",$C1)-2))
[/TD]
[TD]=IF($G$4=TRUE,IF(LEFT(MID($C$1,FIND("4a",$C1)+4,(FIND("4b",$C1)-FIND("4a",$C1))-5),1)=",",MID($C$1,FIND("4a",$C1)+6,(FIND("4b",$C1)-FIND("4a",$C1))-7),MID($C$1,FIND("4a",$C1)+4,(FIND("4b",$C1)-FIND("4a",$C1))-5)),RIGHT($C1,LEN($C1)-FIND("4a",$C1)-5))
[/TD]
[TD]=IF($H$4=TRUE,IF(LEFT(MID($C$1,FIND("4b",$C1)+4,(FIND("4c",$C1)-FIND("4b",$C1))-5),1)=",",MID($C$1,FIND("4b",$C1)+6,(FIND("4c",$C1)-FIND("4b",$C1))-7),MID($C$1,FIND("4b",$C1)+4,(FIND("4c",$C1)-FIND("4b",$C1))-5)),RIGHT($C1,LEN($C1)-FIND("4b.",$C1,LEN($F7))-3))
[/TD]
[TD]Gets confusing at this point
[/TD]
[TD]Gets confusing at this point
[/TD]
[/TR]
</tbody>[/TABLE]


Any help would be very, very welcome.
 

Excel Facts

Did you know Excel offers Filter by Selection?
Add the AutoFilter icon to the Quick Access Toolbar. Select a cell containing Apple, click AutoFilter, and you will get all rows with Apple
Hi

This isn't perfect but hopefully this helps.

We can use a custom formula to get the Business value and the Security value, but we can use a generic formula to get all the other values.
This relies on having some headings with the correct value, as follows:

Assumes the following data is in cells A2:A5:

[TABLE="width: 1012"]
<tbody>[TR]
[TD][TABLE="width: 1195"]
<tbody>[TR]
[TD]1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. Pilot, Hal Jordan 4b. Engineering, Barry Allen 4c. IT, Peter Parker 4d. Security, Tony Stark[/TD]
[/TR]
[TR]
[TD]1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan 4b. , 4c. , 4d. ,[/TD]
[/TR]
[TR]
[TD]1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. , Hal Jordan 4b. , Barry Allen 4c. , Peter Parker 4d. , Tony Stark[/TD]
[/TR]
[TR]
[TD]Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[/TR]
</tbody>[/TABLE]

In cell B1 enter the value: 1. Business,
in C1: 2. Solution,
in D1: 3. Operational,
in E1: 4a.
in F1: 4b.
in G1: 4c.
in H1: 4d.
These are the headings that the formulas rely on.

Formula in cell B2:
=TRIM(SUBSTITUTE(SUBSTITUTE(LEFT($A2,FIND(C$1,$A2)-1),B$1,""),"Business,",""))
copy down.

Formula in cell C2:
=IFERROR(TRIM(MID($A2,FIND(C$1,$A2)+LEN(C$1),FIND(D$1,$A2)-FIND(C$1,$A2)-LEN(C$1))),"")
copy down and across to column G

Formula in cell H2:
=IFERROR(TRIM(RIGHT($A2,LEN($A2)-FIND(H$1,$A2)-LEN(H$1)+1)),"")
Copy down.

Notice this leaves behind leading commas where the "security" (etc.) value is missing. If however we remove this using a substitute function, then the comma disappears where we want it to remain.

These are the results I am seeing in cells B1:H5....

[TABLE="width: 882"]
<colgroup><col span="7"></colgroup><tbody>[TR]
[TD]1. Business,[/TD]
[TD]2. Solution,[/TD]
[TD]3. Operational,[/TD]
[TD]4a.[/TD]
[TD]4b.[/TD]
[TD]4c.[/TD]
[TD]4d.[/TD]
[/TR]
[TR]
[TD]Clark Kent[/TD]
[TD]Diana Prince[/TD]
[TD]Bruce Wayne[/TD]
[TD]Pilot, Hal Jordan[/TD]
[TD]Engineering, Barry Allen[/TD]
[TD]IT, Peter Parker[/TD]
[TD]Security, Tony Stark[/TD]
[/TR]
[TR]
[TD]Clark Kent[/TD]
[TD]Diana Prince[/TD]
[TD][/TD]
[TD], Hal Jordan[/TD]
[TD],[/TD]
[TD],[/TD]
[TD],[/TD]
[/TR]
[TR]
[TD]Clark Kent[/TD]
[TD]Diana Prince[/TD]
[TD]Bruce Wayne[/TD]
[TD], Hal Jordan[/TD]
[TD], Barry Allen[/TD]
[TD], Peter Parker[/TD]
[TD], Tony Stark[/TD]
[/TR]
[TR]
[TD]Clark Kent[/TD]
[TD]Diana Prince[/TD]
[TD][/TD]
[TD][/TD]
[TD][/TD]
[TD][/TD]
[TD][/TD]
[/TR]
</tbody>[/TABLE]

Hopefully this is enough to get you started?

Andrew
 
Upvote 0
I'm guessing that this won't be perfect either and if it fails you might give a few more examples of data and expected results where it does fail.
For original data in A2 and down, the code produced the results shown below in columns B:H.

Code:
Sub Split_Data()
  Dim RX As Object
  Dim a As Variant, b As Variant, bits As Variant
  Dim i As Long, j As Long
  
  Set RX = CreateObject("VBScript.RegExp")
  RX.Global = True
  RX.IgnoreCase = True
  RX.Pattern = "\b[\d\. ]*Business[ ,]*|[\d\. ]*Solution[ ,]*|[\d\. ]*Operational[ ,]*|4a\.[ ,]*|4b\.[ ,]*|4c\.[ ,]*|4d\.[ ,]*(\b|$)"
  a = Range("A2", Range("A" & Rows.Count).End(xlUp)).Value
  ReDim b(1 To UBound(a), 1 To 7)
  For i = 1 To UBound(a)
    bits = Split(RX.Replace(a(i, 1), "^"), "^")
    For j = 1 To UBound(bits)
      b(i, j) = bits(j)
    Next j
  Next i
  Range("B2").Resize(UBound(b), 7).Value = b
End Sub


Book1
ABCDEFGH
11.Business2. Solution3. Operational4a.4b.4c.4d.
21. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. Pilot, Hal Jordan 4b. Engineering, Barry Allen 4c. IT, Peter Parker 4d. Security, Tony StarkClark KentDiana PrinceBruce WaynePilot, Hal JordanEngineering, Barry AllenIT, Peter ParkerSecurity, Tony Stark
31. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan 4b. , 4c. , 4d. ,Clark KentDiana PrinceHal Jordan
41. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. , Hal Jordan 4b. , Barry Allen 4c. , Peter Parker 4d. , Tony StarkClark KentDiana PrinceBruce WayneHal JordanBarry AllenPeter ParkerTony Stark
5Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal JordanClark KentDiana PrinceHal Jordan
Extract Text Parts (2)
 
Upvote 0
Hi Peter,
My apologies for the late reply (it's been a crazy week).
Thanks for your feedback. Very much appreciated. :)
 
Upvote 0

Forum statistics

Threads
1,223,098
Messages
6,170,103
Members
452,302
Latest member
TaMere

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top