Scrape what after STRING tag

YasserKhalil · Dec 29, 2016

Hello everyone
I am trying to get the text after each STRONG tag but I can't figure it out
Here's the code so far

Code:

Sub Final()
    Dim xmlP As New MSXML2.XMLHTTP60
    Dim htmlSubyDoc As New MSHTML.HTMLDocument
    Dim htmlSubyResults As MSHTML.IHTMLElementCollection
    Dim htmlSubyResult As MSHTML.IHTMLElement
    Dim htmlStrongs As MSHTML.IHTMLElementCollection
    Dim htmlStrong As MSHTML.IHTMLElement
    
    Dim str As String
    Dim r As Long
    Dim i As Long
    Dim counter As Long


    Application.ScreenUpdating = False




    xmlP.Open "GET", "http://plants.newgarden.com/12190005/Plant/3394/Deodar_Cedar", False
    xmlP.send


    If xmlP.Status <> 200 Then
        MsgBox "Problem" & vbNewLine & xmlP.Status & " - " & xmlP.statusText
        Exit Sub
    End If


    htmlSubyDoc.body.innerHTML = xmlP.responseText


    Set htmlSubyResults = htmlSubyDoc.getElementsByClassName("pdpBox")


    For Each htmlSubyResult In htmlSubyResults


        'Debug.Print htmlSubyResult.innerHTML
        Set htmlStrongs = htmlSubyResult.getElementsByTagName("STRONG")
        For Each htmlStrong In htmlStrongs
            Debug.Print htmlStrong.innerHTML
            
        Next htmlStrong
    Next htmlSubyResult


    Application.ScreenUpdating = True
End Sub

YasserKhalil · Dec 30, 2016

Any help in this topic please
I need all the information after 
The problem is that some of them after the tag directly while others are after but in separate paragraph

YasserKhalil · Dec 30, 2016

Can anyone help in this topic ..even if an idea !!

YasserKhalil · Dec 31, 2016

I am still having hope to find reply
Regards

Rick Rothstein · Dec 31, 2016

I don't do XLM, but as I understand it, the text you want returned to you appear between "" and "" tags. If that is correct, then perhaps this function, which can be used as a UDF (user defined function) if desired, will be of help...

Code:

[table="width: 500"]
[tr]
	[td]Function xlmTagText(ByVal Tag As String, xlmSourceText As String, _
                    Optional Delimiter As String = ", ") As String
  Dim X As Long, Parts() As String
  Tag = Replace(Replace(Tag, "<", ""), ">", "")
  Parts = Split(xlmSourceText, "<" & Tag & ">", , vbTextCompare)
  For X = 1 To UBound(Parts)
    xlmTagText = xlmTagText & ", " & Split(Parts(X), "</" & Tag & ">", , vbTextCompare)(0)
  Next
  xlmTagText = Mid(xlmTagText, Len(Delimiter) + 1)
End Function[/td]
[/tr]
[/table]

The function outputs all of the strong tag text within the source text as a delimited text string. The function's first argument is the tag's text (either with or without the angle brackets), its second argument is the source text (I am assuming you have it loaded into a variable or placed within a single cell so that you can pass it into the function) and it has an optional third argument allowing you to set the delimiter to be placed between tag text (the default delimiter is a comma followed by a space).

YasserKhalil · Dec 31, 2016

Thanks a lot Mr. Rick for your reply. And it is great honor to reply in my issue
I tested the udf with STRONG tag ::

Code:

Debug.Print xlmTagText("STRONG", htmlStrong.innerHTML)

In fact the problem with not the headers which is tagged by STRONG .. I need what after STRONG
I tried to change the TAG with But didn't get the desired results
Best Regards

Rick Rothstein · Dec 31, 2016

YasserKhalil said:
Thanks a lot Mr. Rick for your reply. And it is great honor to reply in my issue
I tested the udf with STRONG tag ::

Code:

Debug.Print xlmTagText("STRONG", htmlStrong.innerHTML)

In fact the problem with not the headers which is tagged by STRONG .. I need what after STRONG
I tried to change the TAG with But didn't get the desired results
Best Regards

Can you show me a (small) sample of what htmlStrong.innerHTML contains and what my function returns for it (using "STRONG" as the tag) and tell me what you think it should have returned (just so I know we are both talking about the same result)?

Sektor · Dec 31, 2016

Well, you think that there's a text alongside with the tag, but it's not true.
For instance:

HTML:

 Landscape Attributes: 

Although there is tag, but tag does not contain the text because it is in the next tag:

HTML:

 Deodar Cedar is an evergreen tree with a strong central leader and a stunning habit of growth which features almost oriental horizontally-tiered branches. Its relatively fine texture sets it apart from other landscape plants with less refined foliage. 

Thus your logic "required text is with tag" is incorrect.

YasserKhalil · Dec 31, 2016

Thanks a lot for replies
Have a look at this

Code:

Sub Final()    Dim xmlP As New MSXML2.XMLHTTP60
    Dim htmlSubyDoc As New MSHTML.HTMLDocument
    Dim htmlSubyResults As MSHTML.IHTMLElementCollection
    Dim htmlSubyResult As MSHTML.IHTMLElement
    Dim htmlStrongs As MSHTML.IHTMLElementCollection
    Dim htmlStrong As MSHTML.IHTMLElement
    Dim str As String
    Dim r As Long
    Dim i As Long
    Dim counter As Long


    Application.ScreenUpdating = False
    xmlP.Open "GET", "http://plants.newgarden.com/12190005/Plant/3394/Deodar_Cedar", False
    xmlP.send


    If xmlP.Status <> 200 Then
        MsgBox "Problem" & vbNewLine & xmlP.Status & " - " & xmlP.statusText
        Exit Sub
    End If




    htmlSubyDoc.body.innerHTML = xmlP.responseText




    Set htmlSubyResults = htmlSubyDoc.getElementsByClassName("pdpQuickFactsBox")




    For Each htmlSubyResult In htmlSubyResults




        Debug.Print htmlSubyResult.innerHTML
'        Set htmlStrongs = htmlSubyResult.getElementsByTagName("p")
'        For Each htmlStrong In htmlStrongs
'            Debug.Print htmlStrong.innerHTML
'
'            'Debug.Print xlmTagText("P", htmlStrong.innerHTML)
'        Next htmlStrong
    Next htmlSubyResult




    Application.ScreenUpdating = True
End Sub

I have changed the Class name .. and in the immediate window you would find that

Code:

<P class=CCPageText><STRONG>Height:</STRONG> 80 feet</P><P class=CCPageText><STRONG>Spread:</STRONG> 60 feet</P>
<P class=CCPageText><STRONG>Sunlight:</STRONG>  <IMG title="full sun" class=pdpLightIcon alt="full sun" src="about:/Content/Images/Icons/NetPS-light-sun.gif"></IMG>  </P>
<P class=CCPageText><STRONG>Hardiness Zone:</STRONG> 6b</P>

These are four lines .. I need to deal with the first and second and fourth line ... and take what after tag
Which in this case :
80 feet
60 feet
6b

And as for the second line take out the Alt property which is : full sun

These four lines included in P tags while other elements on the page such as Description is not the same ...
Simply I need you to have a look at the page of this link

Deodar Cedar (Cedrus deodara) in Greensboro High Point Jamestown Summerfield Ridge North Carolina NC at New Garden Landscaping & Nursery

Resturn what after STRONG tags ...what after Height & Spread & Sunlight & Hardiness Zone & Other Names & Description & Ornamental Features & Landscape Attributes & Plant Characteristics

Ombir · Dec 31, 2016

This page has weird html formatting. Never seen before. Below code is working fine for this page but I'm not sure if it will work on other pages of this website.
<ul", "<="" ul=""></ul",>
Sub ByeBye2016()
Dim indx1 As Long
Dim indx2 As Long
Dim htmltext As String
Dim output As Variant
Dim reptags As Variant
Dim details As Variant
Dim ie As InternetExplorer
Dim doc As HTMLDocument
Dim plant As HTMLDivElement

Set ie = New InternetExplorer

reptags = Array("class=""CCPageText""", "", "", "", _
"", "<p", "<li>", "</li>", "<ul", "</ul>", ">")

details = Array("Other Names:", "Description:", "Ornamental Features:", "Landscape Attributes:", "Plant Characteristics:")

With ie
.Visible = True
.Navigate "http://plants.newgarden.com/12190005/Plant/3394/Deodar_Cedar"
Do While .ReadyState <> 4: DoEvents: Loop
End With

Set doc = ie.Document

ReDim output(1 To 11)

Set plant = doc.getElementsByClassName("pdpBox")(0)
htmltext = plant.innerHTML

For Each ele In reptags
htmltext = Replace(htmltext, ele, "")
Next
htmltext = Application.Clean(Trim(htmltext))

output(1) = doc.getElementsByClassName("pdpPlantName")(0).getElementsByTagName("p")(0).innerText
output(2) = doc.getElementsByClassName("pdpPlantName")(0).getElementsByTagName("p")(1).innerText
output(3) = Split(doc.getElementsByClassName("pdpQuickFactsBox")(0).getElementsByTagName("p")(0).innerText, ":")(1)
output(4) = Split(doc.getElementsByClassName("pdpQuickFactsBox")(0).getElementsByTagName("p")(1).innerText, ":")(1)
output(5) = doc.getElementsByClassName("pdpQuickFactsBox")(0).getElementsByTagName("img")(0).Title
output(6) = Split(doc.getElementsByClassName("pdpQuickFactsBox")(0).getElementsByTagName("p")(3).innerText, ":")(1)

ie.Quit

For i = 0 To UBound(details)
indx1 = InStr(htmltext, details(i))
If i = UBound(details) Then
indx2 = 19999
Else
indx2 = InStr(htmltext, details(i + 1))
End If
output(i + 1 + 6) = Split(Mid(htmltext, indx1, indx2 - indx1), ":")(1)
Next

Range("A1:K1") = Array("Name", "Sci Name", "Height", "Spread", "Sunlight", "Hardiness Zone", "Other Names", _
"Description", "Ornamental Features", "Landscape Attributes", "Plant Characteristics")
Range("A2").Resize(, UBound(output)) = output
End Sub

<ul", "<="" ul=""></ul",></p",>

Scrape what after STRING tag

YasserKhalil

Well-known Member

Excel Facts

YasserKhalil

Well-known Member

YasserKhalil

Well-known Member

YasserKhalil

Well-known Member

Rick Rothstein

MrExcel MVP

YasserKhalil

Well-known Member

Rick Rothstein

MrExcel MVP

Sektor

Well-known Member

YasserKhalil

Well-known Member

Ombir

Active Member

Forum statistics

Share this page

Scrape what after STRING tag

Well-known Member

Excel Facts

Well-known Member

Well-known Member

Well-known Member

MrExcel MVP

Well-known Member

MrExcel MVP

Well-known Member

Well-known Member

Active Member

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock