RegEx-pert needed to help with string matching!

nickbarnes

New Member
Joined
Aug 2, 2011
Messages
5
Hi all,

I am trying to search two columns of strings to pick out a reference number. The standard for the reference number has been abused over the years so I'm hoping that someone here could help me with a suitable regex to pick out any one of the possibilities so that I can return the official version?

I have a list of all the reference variations but its over 300k items and it takes an age to search it all with the array formula I found.

Here are a couple of examples which should cover any eventuality, the true reference is on the left with every eventuality separated by a pipe:

ABC/012 -> ABC/012 | ABC-012 | ABC_012 | ABC012
ABC/012 -> ABC/12 | ABC-12 | ABC_12 | ABC12
D/EFG/012 -> D/EFG/012 | D-EFG-012 | D_EFG_012 | DEFG012
D/EFG/012 -> D/EFG/12 | D-EFG-12 | D_EFG_12 | DEFG12

Another region includes an additional component to the reference, its not a priority but it would be great if someone is able to include it in the search:

H/IJ/K/012 -> H/IJ/K/012 | H-IJ-K-012 | H_IJ_K_012 | HIJK012
H/IJ/K/012 -> H/IJ/K/12 | H-IJ-K-12 | H_IJ_K_12 | HIJK12

Some of the key features of the reference:

  • There are never more than 3 characters before a /
  • There should always be 3 digits at the end, with leading zeroes for single and double digit numbers. Folks have sometimes left these off, hence the extra variation above
  • Numbers and letters are never mixed together between two /
  • The reference is always between 4 and 10 characters in length, including all numbers, letters and slashes

If the above variations can't be turned into a regex, perhaps someone could help me improve my lookup table efficiency?

=IFERROR(INDEX(<OfficialReferenceColumn>,MAX(IF(ISERROR(FIND(<VariationColumns>,UPPER(<TextToSearch1>))),-1,1)*ROW(<VariationColumns>)-1)),IFERROR(INDEX(<OfficialReferenceColumn>,MAX(IF(ISERROR(FIND(<VariationColumns>,UPPER(<TextToSearch2>))),-1,1)*ROW(<VariationColumns>)-1)),""))

I know this is a big ask so thanks to anyone who takes the time to have a look for me. I'd share my workbook but I'm not able to. Sorry.

Thanks again,

Nick
 

Excel Facts

Last used cell?
Press Ctrl+End to move to what Excel thinks is the last used cell.
Another thought would be to use the " find and replace " feature to get them all structured the same...?
 
Upvote 0
Another thought would be to use the " find and replace " feature to get them all structured the same...
That won't word for those without delimiters such as DEFG... the specs make it seem like D/EFG, DE/FG, DEF/G, D/EF/G, etc. could be possible originals (see István's question above).
 
Upvote 0
Thanks for getting back to me. Basically we are scraping an overloaded Outlook mailbox to get all the subject and body text. We are then going over that text using the formula above and an array of all the possibilities to find the correct reference. The reference itself is actually a unique location code. Using the code and the work type, we hope to generate task lists on either SharePoint or perhaps an Access database to reduce a huge amount of duplication and wasted effort that is going on at the moment.

Unfortunately the emails come from contractors across the country, many of whom work nationally, so there is no other usable location data from which we can form the correct location code. The code is built from things like region, the name of our assets in the region and an incrementing number. The number of components in the code isn't really decipherable from the email text.

I was hoping a regex would be more efficient but I'm starting to think that the lookup table is our best bet to minimise any errors or false-positives (one of the variations is quite similar to the start of a UK postcode, which many of our emails contain in the signature!)

Looking at the formula above, could it be improved? It is searching an array of the variations which is approx. 300k cells. The array itself consists of one column containing the 30k or so correct location codes and then 9 or so additional columns all using the SUBSTITUTE formula to change the "/" to "-", "_", " " and/or remove any leading zeroes etc.

I hope that explains my predicament a bit better.

Thanks again everyone for taking a look - now I think I might need an array formula expert! :P
 
Upvote 0
I am not sure you have addressed István's question in Message #2 yet. IF your list of 30k correct locations are unique enough that there can be no confusion, then the I think a macro in which you load those 30k location into an array, then process that array to create another array with all non-letters (non-numbers too?) removed, then do the same for your 300k data cells, then check the non-punctuated location list against the non-punctuated data array, and use indexes for your hits to check back into the original locations list to get your proper encoding might be an approach that could possibly work.
 
Upvote 0
Try this:

Column A contains the 30k „standard” strings. Sort this col (only) from A to Z.
Column B contains the 300k strings that should be converted to the standard format. Sort this col (only) from A to Z.

Col C is a helper column. Enter into C1 and copy down:

=SUBSTITUTE(A1,"/","")

Col D is another helper column. Enter into D1 and copy down:

=TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(REPLACE(C1,MIN(FIND({0,1,2,3,4,5,6,7,8,9},C1&"0123456789")),10,""),"-",""),"_",""),"/",""))&TEXT(-LOOKUP(1000,-ABS(RIGHT(C1,{1,2,3}))),"000")

You get the user-entered strings in the standard format in col E. Enter into E1 and copy down:

=INDEX($A$1:$A$4,MATCH(D1,$B$1:$B$4))

Let me know how it works (process time was not checked).
Excel Workbook
ABCDE
1ABC/012ABC_22ABC012ABC022ABC/012
2D/EFG/001ABC12DEFG001ABC012ABC/012
3D/EFG/012ABC-12DEFG012ABC012ABC/012
4H/IJ/K/012D/EFG/012HIJK012DEFG012D/EFG/012
5D_EFG_012DEFG012D/EFG/012
6DEFG012DEFG012D/EFG/012
7D-EFG-012DEFG012D/EFG/012
8DEFG1DEFG001D/EFG/001
9H_IJ_K_012HIJK012H/IJ/K/012
10H_IJ_K_12HIJK012H/IJ/K/012
11HIJK012HIJK012H/IJ/K/012
12H-IJ-K-012HIJK012H/IJ/K/012
13HIJK12HIJK012H/IJ/K/012
14H-IJ-K-12HIJK012H/IJ/K/012
Sheet
 
Upvote 0

Forum statistics

Threads
1,214,819
Messages
6,121,729
Members
449,049
Latest member
MiguekHeka

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top