Scrape what after STRING tag

YasserKhalil

Well-known Member
Joined
Jun 24, 2010
Messages
852
Hello everyone
I am trying to get the text after each STRONG tag but I can't figure it out
Here's the code so far
Code:
Sub Final()
    Dim xmlP As New MSXML2.XMLHTTP60
    Dim htmlSubyDoc As New MSHTML.HTMLDocument
    Dim htmlSubyResults As MSHTML.IHTMLElementCollection
    Dim htmlSubyResult As MSHTML.IHTMLElement
    Dim htmlStrongs As MSHTML.IHTMLElementCollection
    Dim htmlStrong As MSHTML.IHTMLElement
    
    Dim str As String
    Dim r As Long
    Dim i As Long
    Dim counter As Long


    Application.ScreenUpdating = False




    xmlP.Open "GET", "http://plants.newgarden.com/12190005/Plant/3394/Deodar_Cedar", False
    xmlP.send


    If xmlP.Status <> 200 Then
        MsgBox "Problem" & vbNewLine & xmlP.Status & " - " & xmlP.statusText
        Exit Sub
    End If


    htmlSubyDoc.body.innerHTML = xmlP.responseText


    Set htmlSubyResults = htmlSubyDoc.getElementsByClassName("pdpBox")


    For Each htmlSubyResult In htmlSubyResults


        'Debug.Print htmlSubyResult.innerHTML
        Set htmlStrongs = htmlSubyResult.getElementsByTagName("STRONG")
        For Each htmlStrong In htmlStrongs
            Debug.Print htmlStrong.innerHTML
            
        Next htmlStrong
    Next htmlSubyResult


    Application.ScreenUpdating = True
End Sub
 

Excel Facts

How to create a cell-sized chart?
Tiny charts, called Sparklines, were added to Excel 2010. Look for Sparklines on the Insert tab.
Any help in this topic please
I need all the information after </Strong>
The problem is that some of them after the tag directly while others are after but in separate paragraph
 
Upvote 0
I don't do XLM, but as I understand it, the text you want returned to you appear between "<strong>" and "</strong>" tags. If that is correct, then perhaps this function, which can be used as a UDF (user defined function) if desired, will be of help...
Code:
[table="width: 500"]
[tr]
	[td]Function xlmTagText(ByVal Tag As String, xlmSourceText As String, _
                    Optional Delimiter As String = ", ") As String
  Dim X As Long, Parts() As String
  Tag = Replace(Replace(Tag, "<", ""), ">", "")
  Parts = Split(xlmSourceText, "<" & Tag & ">", , vbTextCompare)
  For X = 1 To UBound(Parts)
    xlmTagText = xlmTagText & ", " & Split(Parts(X), "</" & Tag & ">", , vbTextCompare)(0)
  Next
  xlmTagText = Mid(xlmTagText, Len(Delimiter) + 1)
End Function[/td]
[/tr]
[/table]
The function outputs all of the strong tag text within the source text as a delimited text string. The function's first argument is the tag's text (either with or without the angle brackets), its second argument is the source text (I am assuming you have it loaded into a variable or placed within a single cell so that you can pass it into the function) and it has an optional third argument allowing you to set the delimiter to be placed between tag text (the default delimiter is a comma followed by a space).
 
Upvote 0
Thanks a lot Mr. Rick for your reply. And it is great honor to reply in my issue
I tested the udf with STRONG tag ::
Code:
Debug.Print xlmTagText("STRONG", htmlStrong.innerHTML)

In fact the problem with not the headers which is tagged by STRONG .. I need what after STRONG
I tried to change the TAG with <P> But didn't get the desired results
Best Regards
 
Upvote 0
Thanks a lot Mr. Rick for your reply. And it is great honor to reply in my issue
I tested the udf with STRONG tag ::
Code:
Debug.Print xlmTagText("STRONG", htmlStrong.innerHTML)

In fact the problem with not the headers which is tagged by STRONG .. I need what after STRONG
I tried to change the TAG with <P> But didn't get the desired results
Best Regards
Can you show me a (small) sample of what htmlStrong.innerHTML contains and what my function returns for it (using "STRONG" as the tag) and tell me what you think it should have returned (just so I know we are both talking about the same result)?
 
Upvote 0
Well, you think that there's a text alongside with the tag, but it's not true.
For instance:

HTML:
<p class="CCPageText">
    <strong >Landscape Attributes:
</p>


</strong>Although there is < strong > tag, but < p > tag does not contain the text because it is in the next tag:


HTML:
<p class="CCPageText">
Deodar Cedar is an evergreen tree with a strong central leader and a stunning habit of growth which features almost oriental horizontally-tiered branches.
Its relatively fine texture sets it apart from other landscape plants with less refined foliage.
</p>

Thus your logic "required text is with tag" is incorrect.
 
Last edited:
Upvote 0
Thanks a lot for replies
Have a look at this
Code:
Sub Final()    Dim xmlP As New MSXML2.XMLHTTP60
    Dim htmlSubyDoc As New MSHTML.HTMLDocument
    Dim htmlSubyResults As MSHTML.IHTMLElementCollection
    Dim htmlSubyResult As MSHTML.IHTMLElement
    Dim htmlStrongs As MSHTML.IHTMLElementCollection
    Dim htmlStrong As MSHTML.IHTMLElement
    Dim str As String
    Dim r As Long
    Dim i As Long
    Dim counter As Long


    Application.ScreenUpdating = False
    xmlP.Open "GET", "http://plants.newgarden.com/12190005/Plant/3394/Deodar_Cedar", False
    xmlP.send


    If xmlP.Status <> 200 Then
        MsgBox "Problem" & vbNewLine & xmlP.Status & " - " & xmlP.statusText
        Exit Sub
    End If




    htmlSubyDoc.body.innerHTML = xmlP.responseText




    Set htmlSubyResults = htmlSubyDoc.getElementsByClassName("pdpQuickFactsBox")




    For Each htmlSubyResult In htmlSubyResults




        Debug.Print htmlSubyResult.innerHTML
'        Set htmlStrongs = htmlSubyResult.getElementsByTagName("p")
'        For Each htmlStrong In htmlStrongs
'            Debug.Print htmlStrong.innerHTML
'
'            'Debug.Print xlmTagText("P", htmlStrong.innerHTML)
'        Next htmlStrong
    Next htmlSubyResult




    Application.ScreenUpdating = True
End Sub

I have changed the Class name .. and in the immediate window you would find that
Code:
<P class=CCPageText><STRONG>Height:</STRONG> 80 feet</P><P class=CCPageText><STRONG>Spread:</STRONG> 60 feet</P>
<P class=CCPageText><STRONG>Sunlight:</STRONG>  <IMG title="full sun" class=pdpLightIcon alt="full sun" src="about:/Content/Images/Icons/NetPS-light-sun.gif"></IMG>  </P>
<P class=CCPageText><STRONG>Hardiness Zone:</STRONG> 6b</P>

These are four lines .. I need to deal with the first and second and fourth line ... and take what after </STRONG> tag
Which in this case :
80 feet
60 feet
6b

And as for the second line take out the Alt property which is : full sun

These four lines included in P tags while other elements on the page such as Description is not the same ...
Simply I need you to have a look at the page of this link

Deodar Cedar (Cedrus deodara) in Greensboro High Point Jamestown Summerfield Ridge North Carolina NC at New Garden Landscaping & Nursery

Resturn what after STRONG tags ...what after Height & Spread & Sunlight & Hardiness Zone & Other Names & Description & Ornamental Features & Landscape Attributes & Plant Characteristics
 
Upvote 0
This page has weird html formatting. Never seen before. Below code is working fine for this page but I'm not sure if it will work on other pages of this website.
<ul", "<="" ul=""></ul",>
Sub ByeBye2016()
Dim indx1 As Long
Dim indx2 As Long
Dim htmltext As String
Dim output As Variant
Dim reptags As Variant
Dim details As Variant
Dim ie As InternetExplorer
Dim doc As HTMLDocument
Dim plant As HTMLDivElement

Set ie = New InternetExplorer

reptags = Array("class=""CCPageText""", "<strong>", "</strong>", "</p>", _
"<!-- Plant Descriptions -->", "<p", "<li>", "</li>", "<ul", "</ul>", ">")

details = Array("Other Names:", "Description:", "Ornamental Features:", "Landscape Attributes:", "Plant Characteristics:")

With ie
.Visible = True
.Navigate "http://plants.newgarden.com/12190005/Plant/3394/Deodar_Cedar"
Do While .ReadyState <> 4: DoEvents: Loop
End With

Set doc = ie.Document

ReDim output(1 To 11)

Set plant = doc.getElementsByClassName("pdpBox")(0)
htmltext = plant.innerHTML

For Each ele In reptags
htmltext = Replace(htmltext, ele, "")
Next
htmltext = Application.Clean(Trim(htmltext))

output(1) = doc.getElementsByClassName("pdpPlantName")(0).getElementsByTagName("p")(0).innerText
output(2) = doc.getElementsByClassName("pdpPlantName")(0).getElementsByTagName("p")(1).innerText
output(3) = Split(doc.getElementsByClassName("pdpQuickFactsBox")(0).getElementsByTagName("p")(0).innerText, ":")(1)
output(4) = Split(doc.getElementsByClassName("pdpQuickFactsBox")(0).getElementsByTagName("p")(1).innerText, ":")(1)
output(5) = doc.getElementsByClassName("pdpQuickFactsBox")(0).getElementsByTagName("img")(0).Title
output(6) = Split(doc.getElementsByClassName("pdpQuickFactsBox")(0).getElementsByTagName("p")(3).innerText, ":")(1)

ie.Quit

For i = 0 To UBound(details)
indx1 = InStr(htmltext, details(i))
If i = UBound(details) Then
indx2 = 19999
Else
indx2 = InStr(htmltext, details(i + 1))
End If
output(i + 1 + 6) = Split(Mid(htmltext, indx1, indx2 - indx1), ":")(1)
Next

Range("A1:K1") = Array("Name", "Sci Name", "Height", "Spread", "Sunlight", "Hardiness Zone", "Other Names", _
"Description", "Ornamental Features", "Landscape Attributes", "Plant Characteristics")
Range("A2").Resize(, UBound(output)) = output
End Sub



<ul", "<="" ul=""></ul",></p",>
 
Last edited:
Upvote 0

Forum statistics

Threads
1,214,895
Messages
6,122,128
Members
449,066
Latest member
Andyg666

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top