Extract Characters from a Text String of Varying Length

JM360

New Member
Joined
Oct 9, 2017
Messages
4
I’m trying to come up two formulas that can extract characters from the middle of a string of text of varying lengths. Here is an example-
A1: 1234 309 Embankment Cu. Yds. 217
B1: 4567 587 12" Conduit, Type B, As Per Plan Lin. Ft. 14
C1: 4321 5432 Pull Box Adjusted to Grade - Contingency Ea. 1
D1: 7654 Spec. Power Distribution Cabinet, Complete (Item 4894) Lump Sum 1

My goal:
IDITEMDESCRIPTIONUNITQTY
1234309EmbankmentCu. Yds.217
456758912” Conduit, Type B, As Per PlanLin. Ft.14
43215432Pull Box Adjusted to Grade – ContingencyEa.1
7654Spec.Power Distribution Cabinet, Complete (Item 4894)Lump Sum1

<tbody>
</tbody>

I would like to formulate the Description and Unit columns. I’ve already have formulas set up for ID, ITEM, and QTY using =LEFT, =MID, and =TRIM/RIGHT/SUBSTITUTE/REPT formulas respectively.

Here is what I’ve noticed with the dataset:
ID is consistently four digits
ITEM varies in length but always has one set of characters surrounded by a space on each side
DESCRIPTION varies in length and in character usage
UNIT while different these are consistent in use
QTY numbers vary in length but always have one set of characters surrounded by a space on each side

Is this possible without VBA? Thank you in advance.
 

Excel Facts

Format cells as currency
Select range and press Ctrl+Shift+4 to format cells as currency. (Shift 4 is the $ sign).
I've come up with a bit of a clunky way I guess... someone may have a cleaner method. Mine assumes a couple of things:
  • You already have ID, ITEM and QTY figured out, as you've stated you have
  • You have a complete register of the UNITs you're expecting to find (I'll call this unitRange), and which are numbered consecutively in another column (I'll call this numUnitRange)

To find the UNIT, you could use the formula:
{=INDEX(unitRange,MAX(IF(ISNUMBER(SEARCH(unitRange,"String")),numUnitRange)))}​

Notice the curly brackets: this is an array formula and requires control+shift+enter.

You could then use something like this to remove everything but the DESCRIPTION:
=MID("String",LEN(ID&ITEM)+3,LEN("String")-LEN(ID&ITEM&UNIT&QTY)-4)​

The +3 is to compensate for the spaces I'm assuming you've TRIMmed off your ID and ITEM, and move to the first character of the DESCRIPTION, and the -4 is just for the spaces.

It's a bit scenario-specific, but I hope that works for you? Let me know if I need to clarify anything.
 
Last edited by a moderator:
Upvote 0
Thank you NiMip for your reply. I tried your method and it appears to work at first glance. I experimented with a number of clunky ways myself...some involving character and space (" ") counts but the method below worked the best for my scenario.

It requires all UNIT categories are manually entered but since there aren't many unit categories in my data set, utilizing Excel AutoFill and Flash Fill made this a quick chore. Formulas in Columns B, C, D, and F are simply copied down. Set up is as follows:


<tbody>
</tbody>
ABCDEF
1String Text From PDFIDITEMDESCRIPTIONUNITQTY
21234 309 Embankment Cu. Yds. 217=LEFT(A2,4)=TRIM(MID(A2,FIND(" ",A2)+1,FIND(" ",A2,FIND(" ",A2)+1)-FIND(" ",A2)))=TRIM(MID(A2,SEARCH(C2,A2)+LEN(C2),SEARCH(E2,A2)-SEARCH(C2,A2)-LEN(C2)))Cu. Yds.=TRIM(RIGHT(SUBSTITUTE(A2," ",REPT(" ",255)),255))
34567 587 12" Conduit, Type B, As Per Plan Lin. Ft. 14Lin. Ft.
44321 5432 Pull Box Adjusted to Grade - Contingency Ea. 1Ea.
57654 Spec. Power Distribution Cabinet, Complete (Item 4894) Lump Sum 1Lump Sum

<tbody>
</tbody>

I should note that the above requires that UNITS match the Units referenced in the pasted text in Column-A String Text From PDF and vice versa. For example, if OCR doesn't pick up the first period in one of the PDF rows, one could easily miss "Cu Yds." versus "Cu. Yds"
 
Upvote 0
1. So, do you have a result that you are satisfied with or are you still looking for something better?

2. If still looking, is there a particular reason for disallowing vba?
 
Upvote 0
Yes, both approaches above work. The reason I was looking for a non-VBA approach is because this spreadsheet will be shared with multiple novice Excel users. Less 'technical' steps, the better.
 
Upvote 0
Here is another formula approach that may be some use. The list of possible units (with no dots) is placed somewhere - column J for me.
Each of the formulas is copied down.

Notes:
a) You mentioned about OCR possibly not picking up all the dots so I have included 4 similar rows with various combinations of the dots in the units.
b) I have used a much smaller number of spaces in the QTY formula than you did. The larger number you use, the more processing has to be done & I figured you will not have any final numbers longer than 10 characters. If you might then increase it appropriately.
c) My 'Description' formula will fail (partially at least) if there are multiple dots in the early part of the original text - notice the missing final characters in the last sample. If that needs attention we possibly could reduce that risk a little but I didn't think it worth complicating the formulas further at this stage if it is unlikely to be an issue for you. (We would have a bit more flexibility/control over 'dot issues' if we able to introduce VBA :))


Book1
ABCDEFGHIJ
1DataIDITEMDESCRIPTIONUNITQTYCu Yds
27654 Spec. Power Distribution Cabinet Complete (Item 4894) Lump Sum 17654Spec.Power Distribution Cabinet Complete (Item 4894)Lump Sum1Cu Yards
31234 309 Embankment Cu. Yds. 2171234309EmbankmentCu Yds217Ea
41234 309 Embankment Cu Yds. 2171234309EmbankmentCu Yds217Lin Ft
51234 309 Embankment Cu. Yds 2171234309EmbankmentCu Yds217Lump Sum
61234 309 Embankment Cu Yds 2171234309EmbankmentCu Yds217
74321 5432 Pull Box Adjusted to Grade - Contingency Ea. 143215432Pull Box Adjusted to Grade - ContingencyEa1
84567 587 12" Conduit, Type B, As Per Plan Lin. Ft. 14456758712" Conduit, Type B, As Per PlanLin Ft14
95698 456986 Some description cu yards 2035698456986Some descriptionCu Yards203
107654 Spec. Power. Distribution. Cabinet. Complete (Item 4894) Lump Sum 17654Spec.Power. Distribution. Cabinet. Complete (Item 48Lump Sum1
Sheet1
Cell Formulas
RangeFormula
B2=LEFT(A2,4)
C2=REPLACE(LEFT(A2,FIND(" ",A2,6)-1),1,5,"")
D2=TRIM(REPLACE(LEFT(A2,LOOKUP(9E+99+307,SEARCH(" "&$J$1:$J$5&" "," "&SUBSTITUTE(A2,".","")&" "))-1),1,LEN(B2&C2)+2,""))
E2=LOOKUP(9.99E+307,SEARCH(" "&$J$1:$J$5&" "," "&SUBSTITUTE(A2,".","")&" "),$J$1:$J$5)
F2=TRIM(RIGHT(SUBSTITUTE(A2," ",REPT(" ",10)),10))+0
 
Last edited:
Upvote 0

Forum statistics

Threads
1,214,590
Messages
6,120,423
Members
448,961
Latest member
nzskater

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top