Extract PDF File name from HTML long text

AndrewP70

New Member
Joined
Feb 17, 2016
Messages
26
Hello All, Could I please have some help with the following. I have a file with 100,00 lines of html coding as per below. I am wanting to extract the PDF file name into a separate column if possible. All pdf files are preceded by the path Images/Datasheets/ if this helps.

<img src="Images/skf_snip.jpg">&nbsp;<b>22215 E</b>&nbsp;Spherical Roller Bearing<br><br><b><em>75.00 mm ID x 130.00 mm OD x 31.00 mm Width<br></b></em><br>• Cylindrical Bore<br>• Steel Cage<br>• Standard Clearance<br><br>Spherical roller bearings have two rows of symmetrical rollers, a common sphered outer ring raceway and two inner ring raceways inclined at an angle to the bearing axis. The centre point of the sphere in the outer ring raceway is at the bearing axis.<br><br><a href="" onclick="javascript:void window.open('Images/Datasheets/22215E.pdf','','width=700,height=500,toolbar=0,menubar=0,location=0,status=1,scrollbars=1,resizable=1,left=0,top=0');return false;"><img src="/Images/Datasheets/snip_pdf.jpg">&nbsp;Export as PDF</a>

Thanks in advance.
 

Excel Facts

How to fill five years of quarters?
Type 1Q-2023 in a cell. Grab the fill handle and drag down or right. After 4Q-2023, Excel will jump to 1Q-2024. Dash can be any character.
Can you confirm that these texts are currently in a column in Excel?

Are you looking for formula, macro or other approach?
 
Upvote 0
Can you confirm that these texts are currently in a column in Excel?

Are you looking for formula, macro or other approach?
Yes the html code long descriptions are in a column in excel. I think macro would be great if I know how to use it!
 
Upvote 0
I think macro would be great if I know how to use it!
If you are not so sure about that, what about we first see if a standard worksheet formula will suffice?

AndrewP70 2020-04-17 1.xlsm
AB
1<img src="Images/skf_snip.jpg">&nbsp;<b>22215 E</b>&nbsp;Spherical Roller Bearing<br><br><b><em>75.00 mm ID x 130.00 mm OD x 31.00 mm Width<br></b></em><br>• Cylindrical Bore<br>• Steel Cage<br>• Standard Clearance<br><br>Spherical roller bearings have two rows of symmetrical rollers, a common sphered outer ring raceway and two inner ring raceways inclined at an angle to the bearing axis. The centre point of the sphere in the outer ring raceway is at the bearing axis.<br><br><a href="" onclick="javascript:void window.open('Images/Datasheets/22215E.pdf','','width=700,height=500,toolbar=0,menubar=0,location=0,status=1,scrollbars=1,resizable=1,left=0,top=0');return false;"><img src="/Images/Datasheets/snip_pdf.jpg">&nbsp;Export as PDF</a>22215E.pdf
Extract pdf name
Cell Formulas
RangeFormula
B1B1=REPLACE(LEFT(A1,SEARCH(".pdf",A1,SEARCH("Images/Datasheets/",A1))+3),1,SEARCH("Images/Datasheets/",A1)+17,"")
 
Upvote 0
=REPLACE(LEFT(A1,SEARCH(".pdf",A1,SEARCH("Images/Datasheets/",A1))+3),1,SEARCH("Images/Datasheets/",A1)+17,"")
This Formula works perfectly, except I have discovered there are a range items that contain (2) pdf file names. Can this formula be modified to extract both. Can be in the same column. Example below.

<img src="Images/skf_snip.jpg">&nbsp;<b>WSTUC 012-CPSS-DFH</b>&nbsp;<span style="color:dodgerBlue"><b>Blue Range Food Line Ball Bearing Unit</b></span><br><br>• 3/4" Shaft Diameter<br>• Composite Take-Up Housing<br>• Stainless Steel (AISI grade 420) Insert Bearing<br>• Food grade grease NSA approved<br>• Additional Back Seal<br><br>The blue range units have been developed for compliance with food safety regulations – with food grade and corrosion resistant components, using blue materials for optical detectability. Each element has been designed to balance the highest performance and superior hygienic standards considering EHEDG guidelines.<br><br><a href="" onclick="javascript:void window.open('Images/Datasheets/BLUELINE.pdf','','width=700,height=500,toolbar=0,menubar=0,location=0,status=1,scrollbars=1,resizable=1,left=0,top=0');return false;"><img src="/Images/Datasheets/snip_pdf.jpg">&nbsp;Product Range</a>&nbsp;<a href="" onclick="javascript:void window.open('Images/Datasheets/WSTUC 012-CPSS-DFH.pdf','','width=700,height=500,toolbar=0,menubar=0,location=0,status=1,scrollbars=1,resizable=1,left=0,top=0');return false;"><img src="/Images/Datasheets/snip_pdf.jpg">&nbsp;Item Data Sheet</a>

Many thanks for your help.
Andrew
 
Upvote 0
@Peter_SSs
there is underscore, this is jpg not pdf ?

with Power Query
RAW
BLUELINE.pdf
WSTUC 012-CPSS-DFH.pdf

Rich (BB code):
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    Split = Table.ExpandListColumn(Table.TransformColumns(Source, {{"RAW", Splitter.SplitTextByDelimiter("Images/Datasheets/", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "RAW"),
    Extract = Table.TransformColumns(Split, {{"RAW", each Text.BeforeDelimiter(_, "'"), type text}}),
    Filter = Table.SelectRows(Extract, each Text.Contains([RAW], ".pdf"))
in
    Filter
 
Last edited:
Upvote 0
The last part of the html code is only a reference to an image and not a technical pdf. Any html coding <img src= can be ignored
 
Upvote 0
@Peter_SSs
there is underscore, this is jpg not pdf ?

HTML
BLUELINE.pdf
WSTUC 012-CPSS-DFH.pdf


Rich (BB code):
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    Split = Table.ExpandListColumn(Table.TransformColumns(Source, {{"RAW", Splitter.SplitTextByDelimiter("Images/Datasheets/", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "RAW"),
    Extract = Table.TransformColumns(Split, {{"RAW", each Text.BeforeDelimiter(_, "'"), type text}}),
    Filter = Table.SelectRows(Extract, each Text.Contains([RAW], ".pdf"))
in
    Filter
 
Upvote 0
@Peter_SSs
there is underscore, this is jpg not pdf ?
Yes, I realised that & removed my post. :)

Can be in the same column.
Did you mean in the same cell?

You could try this user-defined function. To implement ..
1. Right click the sheet name tab and choose "View Code".
2. In the Visual Basic window use the menu to Insert|Module
3. Copy and Paste the code below (can use the icon at the top right of the code pane below) into the main right hand pane that opens at step 2.
4. Close the Visual Basic window.
5. Enter the formula as shown in the screen shot below and copy down.
6. Your workbook will need to be saved as a macro-enabled workbook (*.xlsm)

VBA Code:
Function PDFName(s As String) As String
  Dim RX As Object, M As Object
  
  Set RX = CreateObject("VBScript.RegExp")
  RX.Global = True
  RX.Ignorecase = True
  RX.Pattern = "(Images/Datasheets/)([^<]+?\.pdf)"
  For Each M In RX.Execute(s)
    PDFName = PDFName & vbLf & M.SubMatches(1)
  Next M
  PDFName = Mid(PDFName, 2)
End Function

AndrewP70 2020-04-17 1.xlsm
AB
1<img src="Images/skf_snip.jpg">&nbsp;<b>22215 E</b>&nbsp;Spherical Roller Bearing<br><br><b><em>75.00 mm ID x 130.00 mm OD x 31.00 mm Width<br></b></em><br>• Cylindrical Bore<br>• Steel Cage<br>• Standard Clearance<br><br>Spherical roller bearings have two rows of symmetrical rollers, a common sphered outer ring raceway and two inner ring raceways inclined at an angle to the bearing axis. The centre point of the sphere in the outer ring raceway is at the bearing axis.<br><br><a href="" onclick="javascript:void window.open('Images/Datasheets/22215E.pdf','','width=700,height=500,toolbar=0,menubar=0,location=0,status=1,scrollbars=1,resizable=1,left=0,top=0');return false;"><img src="/Images/Datasheets/snip_pdf.jpg">&nbsp;Export as PDF</a>22215E.pdf
2<img src="Images/skf_snip.jpg">&nbsp;<b>WSTUC 012-CPSS-DFH</b>&nbsp;<span style="color:dodgerBlue"><b>Blue Range Food Line Ball Bearing Unit</b></span><br><br>• 3/4" Shaft Diameter<br>• Composite Take-Up Housing<br>• Stainless Steel (AISI grade 420) Insert Bearing<br>• Food grade grease NSA approved<br>• Additional Back Seal<br><br>The blue range units have been developed for compliance with food safety regulations – with food grade and corrosion resistant components, using blue materials for optical detectability. Each element has been designed to balance the highest performance and superior hygienic standards considering EHEDG guidelines.<br><br><a href="" onclick="javascript:void window.open('Images/Datasheets/BLUELINE.pdf','','width=700,height=500,toolbar=0,menubar=0,location=0,status=1,scrollbars=1,resizable=1,left=0,top=0');return false;"><img src="/Images/Datasheets/snip_pdf.jpg">&nbsp;Product Range</a>&nbsp;<a href="" onclick="javascript:void window.open('Images/Datasheets/WSTUC 012-CPSS-DFH.pdf','','width=700,height=500,toolbar=0,menubar=0,location=0,status=1,scrollbars=1,resizable=1,left=0,top=0');return false;"><img src="/Images/Datasheets/snip_pdf.jpg">&nbsp;Item Data Sheet</a>BLUELINE.pdf WSTUC 012-CPSS-DFH.pdf
Extract pdf name
Cell Formulas
RangeFormula
B1:B2B1=PDFName(A1)
 
Upvote 0

Forum statistics

Threads
1,216,816
Messages
6,132,858
Members
449,761
Latest member
AUSSW

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top