Parse text from HTML text block to multiple columns

trubertiam

New Member
Joined
Aug 12, 2009
Messages
4
Hello everyone,
I have a .csv export file that contains description fields like the following (in A1):

<table border="0" cellpadding="0" cellspacing="0" width="383" height="160"><col width="383"><tr height="160"> <td class="xl63" style="height: 120pt; width: 287pt;" width="383" height="160">Experimentation for product and process improvement: response surface methodology and robust design methods; mixture experiments; optimal design topics; distribution theory and inference for linear models. <br/><em>
(RE) Prerequisite(s): 573 or consent of instructor. <br/>
Registration Restriction(s): Minimum student level - graduate.</em><br/>
</td> </tr></table>I need to parse this field into several fields depending on the type of prerequisite, co-requisite (not present in this example), or registration restriction. My goal is to extract the text following Prerequisite(s): and up to the "." into its own field. Same for any text following Registration Restriction(s): or any other restrictions that may appear in other entries.

I tried to create a formula for Prerequisites in Column B that went something like this:
=MID(A1,(FIND(“Prerequisite”,A1)-5),(FIND(".",A1)))

My intention is to find the text string "Prerequisite", start extracting 5 characters/spaces after the last "e", and extract everything up until the "." I've been adding and subtracting parentheses, wondering if I'm missing some or putting too many in. Or maybe there's something fundamentally wrong with this approach. I need to create similar columns for each type of restriction, so I'll eventually search for text strings like "Registration restriction" and "Co-requisite". I've seen some similar threads, so I hope it's ok to start a new one rather than jumping into an old one.

Any suggestions?
 

Excel Facts

Spell Check in Excel
Press F7 to start spell check in Excel. Be careful, by default, Excel does not check Capitalized Werds (whoops)
check these...

=MID(A1,(FIND(Prerequisite,A1)-5),(FIND(".",A1)))

... different than "
 
Upvote 0
A-ha! That did make a difference! Soooo, now my question is, why did this return:
<table border="0" cellpadding="0" cellspacing="0" width="182"><col width="182"><tr height="160"> <td style="height: 120pt; width: 137pt;" width="182" height="160">(RE) Prerequisite(s): 573 or consent of instructor. <br/>
Registration Restriction(s): Minimum student level - graduate.</em><br/></td> </tr></table>
Is there a way to extract only "573 or consent of instructor."? I hope to have another column capture the Reg Resriction, and cut this field off at the first occurrence of the "." I don't want the "(RE) Prerequisite(s):" text string in the results at all. Or maybe I need to nest a RIGHT formula in there, too. Can a RIGHT formula extract data to the right of something and up to another value? In other words, can I add a RIGHT statement to this to pick up just "573 or consent of instructor." now that it's returning more than I really need?

I wonder if that's the most efficient method to experiment with now.

Thanks so much for your help!
 
Upvote 0
why did this return:...
Because the Find returned the first position of Prerequisite, which in the grand scheme of things, is position 6 (according to my html copy/paste version); from there you backed off five (-5), putting you at position 1 to start the resulting string...

try: =MID(A1,(FIND(":",A1)+2),255)

Yes, you can nest Mid/Left/Right/Trim ; but you really don't need to in this case.
 
Upvote 0
Somehow yesterday's reply didn't make it up here. Trying again....

Starting from this HTML block in A2 (which loses its HTML coding when I paste it here - it really is all in one cell):
<table border="0" cellpadding="0" cellspacing="0" width="383"><col width="383"><tr height="160"> <td class="xl63" style="height: 120pt; width: 287pt;" width="383" height="160">Experimentation for product and process improvement: response surface methodology and robust design methods; mixture experiments; optimal design topics; distribution theory and inference for linear models. <br/><em>
(RE) Prerequisite(s): 573 or consent of instructor. <br/>
Registration Restriction(s): Minimum student level - graduate.</em><br/>
</td> </tr></table>I am now able to extract: (RE) Prerequisite(s): 573.... level - graduate.</em><br/>. The extracted string ends with the HMTL break, in case that disappears when I post this. Thanks for correcting the MID and FIND statements to help me get this far.

Now I need to do the following:
1. Find the first occurrence of the "." in this string and replace everything to the right of it with " ".

In another column I plan to modify the formula to extract the string beginning with Registration Restriction(s) and return everything between that string and the "." in another cell.

Basically, I'm trying to parse these random-length HTML blocks into columns of (RE) Prerequisites, Registration Restrictions, etc based on the predictable text blocks. The strings will always start with a particular text string like (RE)... or Registration.... and will always end in a ".".

This is the ugly formula I'm trying to make work right now. It seems like I might be able to get a REPLACE statement to work in conjunction with FINDs and MIDs, but I'm stuck.
icon9.gif


=REPLACE(RIGHT(MID(A2,(FIND("Prerequisite",A2)+17),255)),(RIGHT(A2,(FIND(".",A2,1)))),255," ")
 
Upvote 0

Forum statistics

Threads
1,215,731
Messages
6,126,539
Members
449,316
Latest member
sravya

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top