Extract Number from Alphanumeric String

dwgnome · Apr 21, 2010

Looking for assistance in figuring out the best formula to extract the VOC content value from variable length alphanumeric string. The VOC content is the number just before the "g/L".

I need to extract just the number portion just before the units “g/L”. Most of the time there is a space before the number and after the number and before “g/L”. However, there are times when there may not be any spaces. In either case, I would like to extract the numbers with decimals if it has any.

See the following examples:

Excel Workbook

D

1

SEALANT-RETAINING COMPOUND SINGLE COMPONENT ANAEROBIC, SCR40372A, *Loctite 242, 10 ML BOTTLE,146 g/L VOC

2

RESIN-EPOXY ALUMINUM FILLED 4 OUNCE PATCH KIT\ * * HYSOL 6C10 g/L VOC

3

SEALANT-THREAD LOCK *HIGH STR *MIL-S-22473 *GRADE AA *50 ML BOTTLE * * *SCR802269A\ * Loctite 089 Weld Sealant125.3 g/L VOC

4

SEALANT-RETAINING COMPOUND ANAEROBIC MIL-S-22473 * * *GRADE C 50 ML BOTTLE * *SCR802266A\ *Loctite 84 * 167.4 g/L VOC

5

RESIN-LAMINATING SYSTEM FOR REPAIR OF 250/350 DEG F CURING ADVANCED COMPOSITES ONE QUART KIT\ * EPOCAST 35A *12 g/L VOC

6

CLEANER-THRUST REVERSER 55 GAL DRUM\ *TURCO 5805 * *0 g/L VOC

7

SOLVENT-TETRAHYDROFURAN REAGENT GRADE IN 4-LITER * * JT Baker JT9450-5 * * 887 g/L VOC

8

CLEANER-AIRCRAFT PARTS * HOT TANK * UNTHICKENED *55 GALLON DRUM\ * CALA 805 *25g/L VOC

9

CLEANER\ * Fine Organics FO-585 * * *0 g/L VOC

10

SEALANT-THREAD LOCK HIGH STRENGTH MIL-S-22473 *GRADE AV, *50 ML BOTTLE, *SCR802270A, 175.2 g/L VOC

11

RESIN-EPOXY FIBERGLASS SELF EXTINGUISHING KIT BMS8-201TYII, *1 Qt Kit, *Epocast 50-A1, * *357 g/L VOC

12

CLEANER-AIRCRAFT INTERIOR UTILITY 55 GAL DRUM\ * *AEROCLEAN X-410Q * *20 g/L VOC

13

RESIN-PASTE PINHOLE FILLER FOR RADOMES 50 GRAM KIT\ * * * EPIBOND 156 A/B * * 10 g/L VOC

14

SEALANT-RETAINING COMPOUND ANAEROBIC MIL-R-46082 TY 1 *250 ML BOTTLE * *SCR18016A\ * Loctite 675 141.7g/L *VOC

15

RESIN-EPOXY CLEAR 3.35 OUNCE PATCH KIT\ * * *Hysol 0151 *Chemical Database uses UMS *ADH3002-96 * 34. g/L *VOC

Sheet1

The results I am looking for:

Excel Workbook

H

1

146

2

10

3

125.3

4

167.4

5

12

6

0

7

887

8

25

9

0

10

175.2

11

357

12

20

13

10

14

141.7

15

34

Sheet1

Any help would be appreciated.

pgc01 · Apr 22, 2010

Hi

I'm glad it helped.

Remark:

A floating point number without thousands separators appears usually in 4 formats

0
0.0
.0
0.

Sandeep is using a floating point pattern that does not account for the third format. I must say that I also considered using it, it's a pattern widely used for floats and you didn't post any example where you'd use it. I ended up using a pattern that accounts for all formats just out of caution but with a small tweak Sandeep's udf will also work for all formats.

Sandeep Warrier · Apr 22, 2010

Hi,

Making a small change in the Pattern should resolve that issue.

Not tested, but try changing

Code:

"\d+\.?\d* ?g\/L"

to

Code:

"\d*\.?\d* ?g\/L"

Sandeep Warrier · Apr 22, 2010

Hi Pedro,

Just out of curiosity, is the only way to match numbers including comma separators something like

\d*,?\d*,?\d*,? .......etc. or is there a better way?

dwgnome · Apr 22, 2010

Thank you for taking the time to explain the floating point concept. Your right, I did not have that as an example. I only stumbled on it when I tested it outside the examples. Anyhow, thanks to Sandeep for the fix.

pgc01 · Apr 22, 2010

sandeep.warrier said:
Hi Pedro,

Just out of curiosity, is the only way to match numbers including comma separators something like

\d*,?\d*,?\d*,? .......etc. or is there a better way?

Hi Sandeep

Edit: Sandeep, I've just read what I've written and I rambled a lot, but I'm in a hurry and just leaving. Please comment. I'll be back later.

I would not use that pattern not the one you posted before. They are in the category of dangerous patterns as they match what you want AND NOTHING. They are usually avoided in RegEx's, they may cause you a big headache in a complex RegEx, as you may forget that they are not matching what you expect. When you want to match nothing you do it explicitely with a quantifier, like a*, a? or a{0,4}. In these cases there's no doubt that you are allowing zero matches for that pattern.

So in this case, in my opinion, instead of matching a floating point number OR NOTHING, like with your pattern

\d*\.?\d*

it is preferable

(match floating point)?

the expression inside the parentesis is matching a float and you add the ? to say that you allow it not to be present in the text.

In the case of this thread, with the examples that dwgnome posted, it seems there's no problem, but if it happens that the number before "g/L" is not available and you have a record like:

RESIN-EPOXY ALUMINUM FILLED 4 OUNCE PATCH KIT\ * * HYSOL 6C g/L VOC

you'll see that your function errors out. That's because your pattern "\d*\.?\d* ?g\/L" is matching just "g/L" which maybe you didn't want to.

- Whole number with thousand separators

The same logic for "\d*,?\d*,?\d*,?". This will match "123", "1,234,4", "1,2,3,4", ",,,,,", AND NOTHING.

For a whole number whith thousand separators, I think it's preferable to match explicitely the number with the format "a thousand separator each 3 digits", like

\d{1,3}(?:,\d{3})*

This is for the number format with the "," as thousands separator. Some countries use the ".". For some countries, like India, it may be more complex, if I'm not mistaken in India you have a first group of 3 digits and then groups of 2 digits.

Sandeep Warrier · Apr 23, 2010

Hi Pedro,

Thanks for your explanation. After reading through your post last night (when I didn't understand anything

) and again this morning, I realized how my pattern would allow a number to not exist at all.

I agree that for the problem posted in this thread it would work fine but not in a general scenario.

No wonder not many people (as far as I know) are comfortable with RegEx. A lot of thinking goes in preparing the pattern

In India we do have a different way of grouping digits which is like you say (1st group of 3 and subsequent groups of 2).

On trying to understand "\d{1,3}(?:,\d{3})*", I was wondering why "?" is included within (). Also wondering why ":" is used.
<input id="gwProxy" type="hidden"><input *******="jsCall();" id="jsProxy" type="hidden">

pgc01 · Apr 23, 2010

Hi Sandeep
Sorry about last post, I was in a hurry and it is confusing. Let's hope this one is clearer

sandeep.warrier said:
On trying to understand "\d{1,3}(?:,\d{3})*", I was wondering why "?" is included within (). Also wondering why ":" is used.

The "?" and the ":" are together, they are part of the same construct:

(?

attern)

There are 2 ways you may want to group text in a regex:

- match pattern and capture it for later use: (pattern)
- match pattern but don't capture it for later use: (?

attern)

You don't really need the second one but if you don't use it the pattern is less aligned with the logic of your problem.

Let me give you an example, using the two.

Problem:

A string has 3 codes that may be separed by some text. Each code starts with "COD" that is followed by groups of (underscore whole number), like "COD_3" or "COD_12_23_4". We want to extract them and present the result as the 3 codes separed by a space.

We could match each code and build the result out of the matches collection, but for the purpose of this example let's match the whole text, capturing the codes into submatches:

After analysing the structure of the text, I see that we could use

Code:

Function ExtractCodes1(s As String) As String
Dim rexMatches As Object
 
With CreateObject("VBScript.RegExp")
    .Pattern = "^.*(COD(_\d+)*).*(COD(_\d+)*).*(COD(_\d+)*).*$"
    Set rexMatches = .Execute(s)
    If rexMatches.Count > 0 Then ExtractCodes1 = .Replace(s, "$1 $3 $5")
End With
End Function

Now this will work for the input strings that I have, the problem is the logic of the pattern. I'm using the submatches 1, 3 and 5. What about the submatches 2, 4 and 6? I'm not using them and if I look into the pattern I see that they have the groups of (underscore whole number) . I don't care about these submatches in this problem, they are an "internal" part of the code pattern and I have not use for them.

Another way is:

Code:

Function ExtractCodes2(s As String) As String
Dim rexMatches As Object
 
With CreateObject("VBScript.RegExp")
    .Pattern = "^.*(COD(?:_\d+)*).*(COD(?:_\d+)*).*(COD(?:_\d+)*).*$"
    Set rexMatches = .Execute(s)
    If rexMatches.Count > 0 Then ExtractCodes2 = .Replace(s, "$1 $2 $3")
End With
End Function

Now it's more clear, I need the 3 codes and the submatches are 1, 2 and 3, exactly as I need. The way I built this pattern is aligned with the logic of the problem, the submatches I define are the ones I need for the goal of the problem.

In conclusion, as I said you can live without the (?

attern) construct, but it helps if you want to design the pattern to match the goal of the problem.

Sandeep Warrier · Apr 28, 2010

Hi Pedro,

Sorry for the late reply. I didn't get enough time to read through this properly.

Let me first try to understand the pattern

^.*(COD(_\d+)*).*(COD(_\d+)*).*(COD(_\d+)*).*$

^.* = Starts with n number of characters
(COD(_\d+)*) = followed by "COD" and n number of "_" with 1 or more digits
.* = followed by n number of characters
(COD(_\d+)*) = followed by "COD" and n number of "_" with 1 or more digits
.* = followed by n number of characters
(COD(_\d+)*) = followed by "COD" and n number of "_" with 1 or more digits
.*$ = ends with n number of characters

Am I right with this?

I tested out the pattern with the string "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf" on a regexp evaluator

[0] = "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf"
[1] = "COD_342"
[2] = "_342"
[3] = "COD_1"
[4] = "_1"
[5] = "COD_74"
[6] = "_74"

Only the patterns within () are being matched. Why is it ignoring all the ".*"?

I think I get what you mean by match pattern but don't capture it for later use: (?

attern)

The same string using the 2nd pattern gives

[0] = "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf"
[1] = "COD_342"
[2] = "COD_1"
[3] = "COD_74"

Another thing I'm confused about is use of replace in the format ".Replace(s, "$1 $3 $5")"

I know that, for example, using the pattern "\D" and then using .replace(s,"") will replace all non-digits with blanks. But how does the replace from the above example work?
<input id="gwProxy" type="hidden"><input *******="jsCall();" id="jsProxy" type="hidden">

pgc01 · Apr 28, 2010

Hi Sandeep

You are right about the analysis of the pattern.

Only the patterns within () are being matched. Why is it ignoring all the ".*"?

That's what the () do, they capture a subexpression for later use. In this case I don't need the .* characters for the result and so I don't need to capture them. If I needed them I would also enclose them in ().

Another thing I'm confused about is use of replace in the format ".Replace(s, "$1 $3 $5")"

I'm replacing the whole string by the first, third and fifth submatches separed by a space.

In your analysis you got:

[0] = "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf"
[1] = "COD_342"
[2] = "_342"
[3] = "COD_1"
[4] = "_1"
[5] = "COD_74"
[6] = "_74"

So using ".Replace(s, "$1 $3 $5")", you are replacing the string with

"COD_342 COD_1 COD_74"

I think I get what you mean by match pattern but don't capture it for later use: (?attern)

In the previous example you see the submatches 2, 4 and 6 that are not used in the result. They need to be matched but not captured. That's why in the second example I used the (?

attern) construct. It allowed me to group the text without capturing. If you see the result of this:

[0] = "abcdCOD_342jhamndCOD_1khndnfCOD_74bsdfbsdf"
[1] = "COD_342"
[2] = "COD_1"
[3] = "COD_74"

you can see that now you have only captured the parts you need for your result.

Hope it helps.

Sandeep Warrier · Apr 28, 2010

Hi Pedro,

Thanks for the explanations. They go a long way in making regexp a bit more understandable

Extract Number from Alphanumeric String

dwgnome

Active Member

pgc01

MrExcel MVP

Excel Facts

Sandeep Warrier

Well-known Member

Sandeep Warrier

Well-known Member

dwgnome

Active Member

pgc01

MrExcel MVP

Sandeep Warrier

Well-known Member

pgc01

MrExcel MVP

Sandeep Warrier

Well-known Member

pgc01

MrExcel MVP

Sandeep Warrier

Well-known Member

Similar threads

Forum statistics

Share this page

Extract Number from Alphanumeric String

Active Member

MrExcel MVP

Excel Facts

Well-known Member

Well-known Member

Active Member

MrExcel MVP

Well-known Member

MrExcel MVP

Well-known Member

MrExcel MVP

Well-known Member

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock