Extract Number from Alphanumeric String

dwgnome

Active Member
Joined
Dec 18, 2005
Messages
441
Looking for assistance in figuring out the best formula to extract the VOC content value from variable length alphanumeric string. The VOC content is the number just before the "g/L".

I need to extract just the number portion just before the units “g/L”. Most of the time there is a space before the number and after the number and before “g/L”. However, there are times when there may not be any spaces. In either case, I would like to extract the numbers with decimals if it has any.

See the following examples:


Excel Workbook
D
1SEALANT-RETAINING COMPOUND SINGLE COMPONENT ANAEROBIC, SCR40372A, *Loctite 242, 10 ML BOTTLE,146 g/L VOC
2RESIN-EPOXY ALUMINUM FILLED 4 OUNCE PATCH KIT\ * * HYSOL 6C10 g/L VOC
3SEALANT-THREAD LOCK *HIGH STR *MIL-S-22473 *GRADE AA *50 ML BOTTLE * * *SCR802269A\ * Loctite 089 Weld Sealant125.3 g/L VOC
4SEALANT-RETAINING COMPOUND ANAEROBIC MIL-S-22473 * * *GRADE C 50 ML BOTTLE * *SCR802266A\ *Loctite 84 * 167.4 g/L VOC
5RESIN-LAMINATING SYSTEM FOR REPAIR OF 250/350 DEG F CURING ADVANCED COMPOSITES ONE QUART KIT\ * EPOCAST 35A *12 g/L VOC
6CLEANER-THRUST REVERSER 55 GAL DRUM\ *TURCO 5805 * *0 g/L VOC
7SOLVENT-TETRAHYDROFURAN REAGENT GRADE IN 4-LITER * * JT Baker JT9450-5 * * 887 g/L VOC
8CLEANER-AIRCRAFT PARTS * HOT TANK * UNTHICKENED *55 GALLON DRUM\ * CALA 805 *25g/L VOC
9CLEANER\ * Fine Organics FO-585 * * *0 g/L VOC
10SEALANT-THREAD LOCK HIGH STRENGTH MIL-S-22473 *GRADE AV, *50 ML BOTTLE, *SCR802270A, 175.2 g/L VOC
11RESIN-EPOXY FIBERGLASS SELF EXTINGUISHING KIT BMS8-201TYII, *1 Qt Kit, *Epocast 50-A1, * *357 g/L VOC
12CLEANER-AIRCRAFT INTERIOR UTILITY 55 GAL DRUM\ * *AEROCLEAN X-410Q * *20 g/L VOC
13RESIN-PASTE PINHOLE FILLER FOR RADOMES 50 GRAM KIT\ * * * EPIBOND 156 A/B * * 10 g/L VOC
14SEALANT-RETAINING COMPOUND ANAEROBIC MIL-R-46082 TY 1 *250 ML BOTTLE * *SCR18016A\ * Loctite 675 141.7g/L *VOC
15RESIN-EPOXY CLEAR 3.35 OUNCE PATCH KIT\ * * *Hysol 0151 *Chemical Database uses UMS *ADH3002-96 * 34. g/L *VOC
Sheet1


The results I am looking for:


Excel Workbook
H
1146
210
3125.3
4167.4
512
60
7887
825
90
10175.2
11357
1220
1310
14141.7
1534
Sheet1


Any help would be appreciated.
 
Hi

I'm glad it helped.

Remark:

A floating point number without thousands separators appears usually in 4 formats

0
0.0
.0
0.

Sandeep is using a floating point pattern that does not account for the third format. I must say that I also considered using it, it's a pattern widely used for floats and you didn't post any example where you'd use it. I ended up using a pattern that accounts for all formats just out of caution but with a small tweak Sandeep's udf will also work for all formats.
 
Upvote 0

Excel Facts

Excel Joke
Why can't spreadsheets drive cars? They crash too often!
Hi,

Making a small change in the Pattern should resolve that issue.

Not tested, but try changing

Code:
"\d+\.?\d* ?g\/L"

to

Code:
"\d*\.?\d* ?g\/L"
 
Upvote 0
Hi Pedro,

Just out of curiosity, is the only way to match numbers including comma separators something like

\d*,?\d*,?\d*,? .......etc. or is there a better way?
 
Upvote 0
Thank you for taking the time to explain the floating point concept. Your right, I did not have that as an example. I only stumbled on it when I tested it outside the examples. Anyhow, thanks to Sandeep for the fix.
 
Upvote 0
Hi Pedro,

Just out of curiosity, is the only way to match numbers including comma separators something like

\d*,?\d*,?\d*,? .......etc. or is there a better way?


Hi Sandeep

Edit: Sandeep, I've just read what I've written and I rambled a lot, but I'm in a hurry and just leaving. Please comment. I'll be back later.


I would not use that pattern not the one you posted before. They are in the category of dangerous patterns as they match what you want AND NOTHING. They are usually avoided in RegEx's, they may cause you a big headache in a complex RegEx, as you may forget that they are not matching what you expect. When you want to match nothing you do it explicitely with a quantifier, like a*, a? or a{0,4}. In these cases there's no doubt that you are allowing zero matches for that pattern.

So in this case, in my opinion, instead of matching a floating point number OR NOTHING, like with your pattern

\d*\.?\d*

it is preferable

(match floating point)?

the expression inside the parentesis is matching a float and you add the ? to say that you allow it not to be present in the text.

In the case of this thread, with the examples that dwgnome posted, it seems there's no problem, but if it happens that the number before "g/L" is not available and you have a record like:

RESIN-EPOXY ALUMINUM FILLED 4 OUNCE PATCH KIT\ * * HYSOL 6C g/L VOC

you'll see that your function errors out. That's because your pattern "\d*\.?\d* ?g\/L" is matching just "g/L" which maybe you didn't want to.


- Whole number with thousand separators

The same logic for "\d*,?\d*,?\d*,?". This will match "123", "1,234,4", "1,2,3,4", ",,,,,", AND NOTHING.

For a whole number whith thousand separators, I think it's preferable to match explicitely the number with the format "a thousand separator each 3 digits", like

\d{1,3}(?:,\d{3})*

This is for the number format with the "," as thousands separator. Some countries use the ".". For some countries, like India, it may be more complex, if I'm not mistaken in India you have a first group of 3 digits and then groups of 2 digits.
 
Upvote 0
Hi Pedro,

Thanks for your explanation. After reading through your post last night (when I didn't understand anything :)) and again this morning, I realized how my pattern would allow a number to not exist at all.

I agree that for the problem posted in this thread it would work fine but not in a general scenario.

No wonder not many people (as far as I know) are comfortable with RegEx. A lot of thinking goes in preparing the pattern :biggrin:

In India we do have a different way of grouping digits which is like you say (1st group of 3 and subsequent groups of 2).

On trying to understand "\d{1,3}(?:,\d{3})*", I was wondering why "?" is included within (). Also wondering why ":" is used.
<input id="gwProxy" type="hidden"><!--Session data--><input *******="jsCall();" id="jsProxy" type="hidden">
 
Upvote 0
Hi Sandeep
Sorry about last post, I was in a hurry and it is confusing. Let's hope this one is clearer

On trying to understand "\d{1,3}(?:,\d{3})*", I was wondering why "?" is included within (). Also wondering why ":" is used.

The "?" and the ":" are together, they are part of the same construct:

(?:pattern)

There are 2 ways you may want to group text in a regex:

- match pattern and capture it for later use: (pattern)
- match pattern but don't capture it for later use: (?:pattern)

You don't really need the second one but if you don't use it the pattern is less aligned with the logic of your problem.

Let me give you an example, using the two.

Problem:

A string has 3 codes that may be separed by some text. Each code starts with "COD" that is followed by groups of (underscore whole number), like "COD_3" or "COD_12_23_4". We want to extract them and present the result as the 3 codes separed by a space.

We could match each code and build the result out of the matches collection, but for the purpose of this example let's match the whole text, capturing the codes into submatches:

After analysing the structure of the text, I see that we could use

Code:
Function ExtractCodes1(s As String) As String
Dim rexMatches As Object
 
With CreateObject("VBScript.RegExp")
    .Pattern = "^.*(COD(_\d+)*).*(COD(_\d+)*).*(COD(_\d+)*).*$"
    Set rexMatches = .Execute(s)
    If rexMatches.Count > 0 Then ExtractCodes1 = .Replace(s, "$1 $3 $5")
End With
End Function

Now this will work for the input strings that I have, the problem is the logic of the pattern. I'm using the submatches 1, 3 and 5. What about the submatches 2, 4 and 6? I'm not using them and if I look into the pattern I see that they have the groups of (underscore whole number) . I don't care about these submatches in this problem, they are an "internal" part of the code pattern and I have not use for them.

Another way is:

Code:
Function ExtractCodes2(s As String) As String
Dim rexMatches As Object
 
With CreateObject("VBScript.RegExp")
    .Pattern = "^.*(COD(?:_\d+)*).*(COD(?:_\d+)*).*(COD(?:_\d+)*).*$"
    Set rexMatches = .Execute(s)
    If rexMatches.Count > 0 Then ExtractCodes2 = .Replace(s, "$1 $2 $3")
End With
End Function

Now it's more clear, I need the 3 codes and the submatches are 1, 2 and 3, exactly as I need. The way I built this pattern is aligned with the logic of the problem, the submatches I define are the ones I need for the goal of the problem.


In conclusion, as I said you can live without the (?:pattern) construct, but it helps if you want to design the pattern to match the goal of the problem.
 
Last edited:
Upvote 0
Hi Pedro,

Sorry for the late reply. I didn't get enough time to read through this properly.

Let me first try to understand the pattern :)

^.*(COD(_\d+)*).*(COD(_\d+)*).*(COD(_\d+)*).*$

^.* = Starts with n number of characters
(COD(_\d+)*) = followed by "COD" and n number of "_" with 1 or more digits
.* = followed by n number of characters
(COD(_\d+)*) = followed by "COD" and n number of "_" with 1 or more digits
.* = followed by n number of characters
(COD(_\d+)*) = followed by "COD" and n number of "_" with 1 or more digits
.*$ = ends with n number of characters

Am I right with this?

I tested out the pattern with the string "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf" on a regexp evaluator

[0] = "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf"
[1] = "COD_342"
[2] = "_342"
[3] = "COD_1"
[4] = "_1"
[5] = "COD_74"
[6] = "_74"

Only the patterns within () are being matched. Why is it ignoring all the ".*"?

I think I get what you mean by match pattern but don't capture it for later use: (?:pattern)

The same string using the 2nd pattern gives

[0] = "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf"
[1] = "COD_342"
[2] = "COD_1"
[3] = "COD_74"

Another thing I'm confused about is use of replace in the format ".Replace(s, "$1 $3 $5")"

I know that, for example, using the pattern "\D" and then using .replace(s,"") will replace all non-digits with blanks. But how does the replace from the above example work?
<input id="gwProxy" type="hidden"><!--Session data--><input *******="jsCall();" id="jsProxy" type="hidden">
 
Upvote 0
Hi Sandeep

You are right about the analysis of the pattern.

Only the patterns within () are being matched. Why is it ignoring all the ".*"?

That's what the () do, they capture a subexpression for later use. In this case I don't need the .* characters for the result and so I don't need to capture them. If I needed them I would also enclose them in ().

Another thing I'm confused about is use of replace in the format ".Replace(s, "$1 $3 $5")"

I'm replacing the whole string by the first, third and fifth submatches separed by a space.

In your analysis you got:

[0] = "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf"
[1] = "COD_342"
[2] = "_342"
[3] = "COD_1"
[4] = "_1"
[5] = "COD_74"
[6] = "_74"

So using ".Replace(s, "$1 $3 $5")", you are replacing the string with

"COD_342 COD_1 COD_74"

I think I get what you mean by match pattern but don't capture it for later use: (?:pattern)

In the previous example you see the submatches 2, 4 and 6 that are not used in the result. They need to be matched but not captured. That's why in the second example I used the (?:pattern) construct. It allowed me to group the text without capturing. If you see the result of this:

[0] = "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf"
[1] = "COD_342"
[2] = "COD_1"
[3] = "COD_74"

you can see that now you have only captured the parts you need for your result.

Hope it helps.
 
Upvote 0
Hi Pedro,

Thanks for the explanations. They go a long way in making regexp a bit more understandable :)
<input id="gwProxy" type="hidden"><!--Session data--><input *******="jsCall();" id="jsProxy" type="hidden">
 
Upvote 0

Forum statistics

Threads
1,215,227
Messages
6,123,745
Members
449,116
Latest member
alexlomt

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top