Extracting URLs from HTML Source code in Excel

bhaggs

New Member
Joined
Apr 17, 2014
Messages
5
Hi,

We receive HTML source code as excel dump that has lot of href attribute which points to a URL.Our knowledge team has to do a Ctrl+F to find each href within a code and list the URL in next column.This is a time consuming process

Iam looking for a VBA solution which can make this process simpler.Here is a sample data.

Answer IdHTML Source
1Some html source code then more html text and one more <href="http://www.oracle.com"> and then one more <href="http://www.microsoft.com">
2Some html source code then more html text and one more <href="http://www.google.com"> and then one more <href="http://www.ibm.com">
3Some html source code then more html text and one more <href="http://www.yahoo.com"> and then one more

<tbody>
</tbody>

Desired Output in next sheet or file
Answer IdURLs
1http://www.oracle.com
1http://www.microsoft.com
2http://www.google.com
2http://www.ibm.com
3http://www.yahoo.com

<tbody>
</tbody>

Thanks in advance !!

Bhaggs
 
Last edited:

Excel Facts

Copy formula down without changing references
If you have =SUM(F2:F49) in F50; type Alt+' in F51 to copy =SUM(F2:F49) to F51, leaving the formula in edit mode. Change SUM to COUNT.
Sub Extract_URL()

Sheets.Add after:=Sheets(Sheets.Count)
ActiveSheet.Name = "Data"
Sheets("sheet1").Select
Range("B2").Select
Range(Selection, Selection.End(xlDown)).Copy
Sheets("Data").Select
Range("A1").Select
Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False, Transpose:=False
Application.CutCopyMode = False
Selection.TextToColumns Destination:=Range("A1"), DataType:=xlDelimited, _
TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=False, Tab:=False, _
Semicolon:=False, Comma:=False, Space:=False, Other:=True, OtherChar _
:="""", FieldInfo:=Array(Array(1, 1), Array(2, 1), Array(3, 1), Array(4, 1), Array(5 _
, 1)), TrailingMinusNumbers:=True

lcolumn = Sheets("Data").Cells(1, Columns.Count).End(xlToLeft).Column

For i = 1 To lcolumn

strFindThis = "http"

Cells(1, i).Select

Set Found = Selection.Find(What:=strFindThis, Lookat:=xlPart, MatchCase:=False)

If Found Is Nothing Then

Cells(1, i).EntireColumn.Delete

End If
Next


Sheets("Sheet1").Select
Range("A2").Select
Range(Selection, Selection.End(xlDown)).Select
Selection.Copy
Sheets("Data").Select
Columns("A:A").Select
Selection.Insert Shift:=xlToRight
Range("A1").Select
Application.CutCopyMode = False
Range("A1").Select
Range(Selection, Selection.End(xlToRight)).Select
Selection.Columns.AutoFit
Range("A1").Select
MsgBox ("Data pasted in Data sheet")


End Sub


Hope this works!
 
Upvote 0
Hi, I have the following solution.
This will just loop through column B in sheet1, and look for urls indicated by the "http://" indication. Change this if you want.

Code:
Sub ExtractHttpADDR()
Dim i%, LR%
Dim Str As String, URL As String
LR = Sheets("Sheet1").Cells(Rows.Count, "b").End(xlUp).Row 'Finds last row column B


For i = 2 To LR
        Str = Sheet1.Cells(i, 2).Value
        Do While InStr(1, Str, "http://") >= 1
            URL = Str
             Str = ExtractURL(Str)[COLOR=#b22222] 'I use a function to extract the URL[/COLOR]
             Sheet2.Cells(Rows.Count, 2).End(xlUp).Offset(1, 0).Value = Str[COLOR=#0000ff] 'places values found in sheet2, first cell available from the bottom[/COLOR]
             Str = Mid(URL, InStr(1, URL, "http://") + Len(Str), Len(URL)) [COLOR=#0000ff]'stores remainder of string[/COLOR]
        Loop
Next i


End Sub

[COLOR=#00ff00]'This is a function used in the sub above[/COLOR]
Function ExtractURL(SourceCode As String)
    Dim i As Long, j As Long
        i = InStr(1, SourceCode, "http://")
        j = InStr(1, Mid(SourceCode, i, Len(SourceCode)), Chr(34)) [COLOR=#0000ff]'finds last quotation which indicates string end, I used this since I was not sure all urls would end in .com, should be safer this way[/COLOR]
        SourceCode = Mid(SourceCode, i, j - 1)
    ExtractURL = SourceCode
End Function

Let me know how it works
 
Upvote 0
First of all Thanks Arithos and rohith554 for your quick responses.Really appreciated it

Arithos- I was getting object required error when I simply copy-pasted>saved (and ran) the VBA code.Do I need to set some checkbox

rohith554 - In your code, though it was generating another sheet tab 'Data', it wasn't populating the desired result

is there a way I can mail you the sample spreadsheet on your personal id's to test it out by yourself.I can't find any option to attach a file here.My email is

Thanks,
Bhaggs
 
Last edited by a moderator:
Upvote 0
Hi,

Arithos- I was getting object required error when I simply copy-pasted>saved (and ran) the <acronym title="visual basic for applications" style="border-width: 0px 0px 1px; border-bottom-style: dotted; border-bottom-color: rgb(0, 0, 0); cursor: help; color: rgb(87, 65, 35); font-size: 12.7272720336914px; background-color: rgb(250, 250, 250);">VBA</acronym> code.Do I need to set some checkbox

What was it hightlighting when it gave you the error (the function?)? If you mean reference by checkbox, that will not be needed.
I can implement a new sheet for the URS's if you want.


Be careful of posting you email online :) You will however recieve an email from me.
 
Last edited by a moderator:
Upvote 0
I think I might know what happened, did you change the code below to something else then Sheet1? If you did not, the macro will not work until you change it, or name one of you sheets as "Sheet1". However, you have recieved an email from me now, containing a wbk with the macro, and your example code. Let me know how it goes.

Code:
Sheets("Sheet1")
 
Upvote 0
Hi,

The code works when the data is placed starting A1.

Well if not you can share the workbook(if it is not confidential) where i can change the code accordingly!
 
Upvote 0
Hi,

The code works when the data is placed starting A1.

Well if not you can share the workbook(if it is not confidential) where i can change the code accordingly!

Hi, @rohith554

Just thought I would share a nicer way of copying data, this bit:
Code:
[COLOR=#333333]ActiveSheet.Name = "Data"[/COLOR]
[COLOR=#333333]Sheets("sheet1").Select[/COLOR]
[COLOR=#333333]Range("B2").Select[/COLOR]
[COLOR=#333333]Range(Selection, Selection.End(xlDown)).Copy[/COLOR]
[COLOR=#333333]Sheets("Data").Select[/COLOR]
[COLOR=#333333]Range("A1").Select[/COLOR]
[COLOR=#333333]Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _[/COLOR]
[COLOR=#333333]:=False, Transpose:=False[/COLOR]

can be typed as this, if you want the format to be the same.

Code:
Sheets("sheet1").Range("b2", Range("b2").End(xlDown)).Copy Sheets("Data").Range("A1")
Much simpler like this.


This next snipp of code will only copy values.

Code:
Dim s1 As Worksheet, Data As Worksheet
Set Data = Sheets("Data")
Set s1 = Sheets("sheet1")


Data.Range("A1").Resize(s1.Range("b2").End(xlDown).Row - 1).Value s1.Range("b2", s1.Range("b2").End(xlDown)).Value


Did some "Dim"'s aswell here, to make the code shorter, its important that in both cases the code is on one line.

It took some time before I was aware of this way of doing things, and thought I might possible help you aswell. You might already know it, but still :) I would have loved someone telling me sooner, would have made my coding so much better at an earlier point :eek:
 
Upvote 0
Assuming you output will go to Sheet2, see if the following macro does what you want (run it with the data sheet active)...
PHP:
Sub GetURLs()
  Dim R As Long, X As Long, Z As Long, Data As Variant, Result As Variant, href() As String
  Dim hrefCount As Long
  Data = Range("A2:B" & Cells(Rows.Count, "A").End(xlUp).Row)
  hrefCount = Evaluate(Replace("SUM((LEN(B2:B#)-LEN(SUBSTITUTE(B2:B#,""<href="","""")))/6)", "#", UBound(Data) + 1))
  ReDim Result(1 To hrefCount, 1 To 2)
  For R = 1 To UBound(Data)
    href = Split(Data(R, 2), "<href=""")
    For X = 1 To UBound(href)
      Z = Z + 1
      Result(Z, 1) = Data(R, 1)
      Result(Z, 2) = Split(href(X), """>")(0)
    Next
  Next
  Sheets("Sheet2").Range("A1:B1") = Array("Answer ID", "URL")
  Sheets("Sheet2").Range("A2:B" & UBound(Result) + 1) = Result
End Sub
 
Upvote 0
Thanks you all! I was overwhelmed by the responses you all guys gave to me

Arithos - Thanks for initiating a 1:1 email conversation on this topic to assist me further. After you shared your updated spreadsheet, I found this link that had another function to validate the URL. I used that and after making minor tweak in the xls, I was finally able to achieve what was needed :)

How can I attach the updated xls in this post so that it can benefit other users?Can anyone guide me?

Thanks,
Bhaggs
 
Upvote 0

Forum statistics

Threads
1,215,067
Messages
6,122,949
Members
449,095
Latest member
nmaske

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top