Parsing text strings when the source is not consistent

rizzo93

Active Member
Joined
Jan 22, 2015
Messages
301
Office Version
  1. 365
I'm trying to parse out text strings from a source that is not consistently formatted. These extractions will be inserted into separate cells.

In a "perfect scenario", the data source should be formatted in the following manner:

Cell C1
1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. Pilot, Hal Jordan 4b. Engineering, Barry Allen 4c. IT, Peter Parker 4d. Security, Tony Stark

The use of numbers, spaces and commas should always be constant. Blue text can be considered my markers with which to find things.

Sometimes the data comes over like this:

1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan 4b. , 4c. , 4d. ,
1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. , Hal Jordan 4b. , Barry Allen 4c. , Peter Parker 4d. , Tony Stark
Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan

I need to allow for these variances and extract the names and areas of expertise (the descriptions that are sometimes included just before the person's name like Engineering and Security above).

So using the "perfect scenario" above, it would look like this:

A
B
C
D
E
F
G
Clark Kent
Diana Prince
Bruce Wayne
Pilot, Hal Jordan
Engineering, Barry Allen
IT, Peter Parker
Security, Tony Stark

<tbody>
</tbody>

My formulas so far look like this, but I end up having to add special tweaks to cells E, F and G:

A
B
C
D
E
F
G
=TRIM(MID($C1,14,FIND("2",$C1)-14))
=TRIM(MID($C1,FIND("n,",$C1)+2,FIND("3",$C1)-FIND("n,",$C1)-2))
=TRIM(MID($C1,FIND("l,",$C1)+2,FIND("4",$C1)-FIND("l,",$C1)-2))
=IF($G$4=TRUE,IF(LEFT(MID($C$1,FIND("4a",$C1)+4,(FIND("4b",$C1)-FIND("4a",$C1))-5),1)=",",MID($C$1,FIND("4a",$C1)+6,(FIND("4b",$C1)-FIND("4a",$C1))-7),MID($C$1,FIND("4a",$C1)+4,(FIND("4b",$C1)-FIND("4a",$C1))-5)),RIGHT($C1,LEN($C1)-FIND("4a",$C1)-5))
=IF($H$4=TRUE,IF(LEFT(MID($C$1,FIND("4b",$C1)+4,(FIND("4c",$C1)-FIND("4b",$C1))-5),1)=",",MID($C$1,FIND("4b",$C1)+6,(FIND("4c",$C1)-FIND("4b",$C1))-7),MID($C$1,FIND("4b",$C1)+4,(FIND("4c",$C1)-FIND("4b",$C1))-5)),RIGHT($C1,LEN($C1)-FIND("4b.",$C1,LEN($F7))-3))
Gets confusing at this point
Gets confusing at this point

<tbody>
</tbody>


Any help would be very, very welcome.
 

Excel Facts

Save Often
If you start asking yourself if now is a good time to save your Excel workbook, the answer is Yes
Hi

This isn't perfect but hopefully this helps.

We can use a custom formula to get the Business value and the Security value, but we can use a generic formula to get all the other values.
This relies on having some headings with the correct value, as follows:

Assumes the following data is in cells A2:A5:

1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. Pilot, Hal Jordan 4b. Engineering, Barry Allen 4c. IT, Peter Parker 4d. Security, Tony Stark
1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan 4b. , 4c. , 4d. ,
1. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. , Hal Jordan 4b. , Barry Allen 4c. , Peter Parker 4d. , Tony Stark
Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan

<tbody>
</tbody>

<tbody>
</tbody>

In cell B1 enter the value: 1. Business,
in C1: 2. Solution,
in D1: 3. Operational,
in E1: 4a.
in F1: 4b.
in G1: 4c.
in H1: 4d.
These are the headings that the formulas rely on.

Formula in cell B2:
=TRIM(SUBSTITUTE(SUBSTITUTE(LEFT($A2,FIND(C$1,$A2)-1),B$1,""),"Business,",""))
copy down.

Formula in cell C2:
=IFERROR(TRIM(MID($A2,FIND(C$1,$A2)+LEN(C$1),FIND(D$1,$A2)-FIND(C$1,$A2)-LEN(C$1))),"")
copy down and across to column G

Formula in cell H2:
=IFERROR(TRIM(RIGHT($A2,LEN($A2)-FIND(H$1,$A2)-LEN(H$1)+1)),"")
Copy down.

Notice this leaves behind leading commas where the "security" (etc.) value is missing. If however we remove this using a substitute function, then the comma disappears where we want it to remain.

These are the results I am seeing in cells B1:H5....

1. Business,2. Solution,3. Operational,4a.4b.4c.4d.
Clark KentDiana PrinceBruce WaynePilot, Hal JordanEngineering, Barry AllenIT, Peter ParkerSecurity, Tony Stark
Clark KentDiana Prince, Hal Jordan,,,
Clark KentDiana PrinceBruce Wayne, Hal Jordan, Barry Allen, Peter Parker, Tony Stark
Clark KentDiana Prince

<colgroup><col span="7"></colgroup><tbody>
</tbody>

Hopefully this is enough to get you started?

Andrew
 
Upvote 0
I'm guessing that this won't be perfect either and if it fails you might give a few more examples of data and expected results where it does fail.
For original data in A2 and down, the code produced the results shown below in columns B:H.

Code:
Sub Split_Data()
  Dim RX As Object
  Dim a As Variant, b As Variant, bits As Variant
  Dim i As Long, j As Long
  
  Set RX = CreateObject("VBScript.RegExp")
  RX.Global = True
  RX.IgnoreCase = True
  RX.Pattern = "\b[\d\. ]*Business[ ,]*|[\d\. ]*Solution[ ,]*|[\d\. ]*Operational[ ,]*|4a\.[ ,]*|4b\.[ ,]*|4c\.[ ,]*|4d\.[ ,]*(\b|$)"
  a = Range("A2", Range("A" & Rows.Count).End(xlUp)).Value
  ReDim b(1 To UBound(a), 1 To 7)
  For i = 1 To UBound(a)
    bits = Split(RX.Replace(a(i, 1), "^"), "^")
    For j = 1 To UBound(bits)
      b(i, j) = bits(j)
    Next j
  Next i
  Range("B2").Resize(UBound(b), 7).Value = b
End Sub


Book1
ABCDEFGH
11.Business2. Solution3. Operational4a.4b.4c.4d.
21. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. Pilot, Hal Jordan 4b. Engineering, Barry Allen 4c. IT, Peter Parker 4d. Security, Tony StarkClark KentDiana PrinceBruce WaynePilot, Hal JordanEngineering, Barry AllenIT, Peter ParkerSecurity, Tony Stark
31. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal Jordan 4b. , 4c. , 4d. ,Clark KentDiana PrinceHal Jordan
41. Business, Clark Kent 2. Solution, Diana Prince 3. Operational, Bruce Wayne 4a. , Hal Jordan 4b. , Barry Allen 4c. , Peter Parker 4d. , Tony StarkClark KentDiana PrinceBruce WayneHal JordanBarry AllenPeter ParkerTony Stark
5Business, Clark Kent 2. Solution, Diana Prince 3. Operational, 4a. , Hal JordanClark KentDiana PrinceHal Jordan
Extract Text Parts (2)
 
Upvote 0
Hi Peter,
My apologies for the late reply (it's been a crazy week).
Thanks for your feedback. Very much appreciated. :)
 
Upvote 0

Forum statistics

Threads
1,215,966
Messages
6,127,974
Members
449,414
Latest member
sameri

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top