Extract values from nested divs

Anka

New Member
Joined
Oct 20, 2012
Messages
45
Hello Everyone!!
Hope you all are doing well.
Here i am again, to get some advice
I want to scrape a site that have this structure
HTML:
-<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div id="games">

<div id="date-nav-wrapper">
<h1 id="date-nav-title">THIS IS <span>THE SITE</span></h1>
<div id="date-nav">
<div id="dn-left">
<a href="XXX"><span class="dn-left-arrow"></span></a>
</div>
<div id="dn-title">
 1 Oct 2020 </div>
<div id="dn-right">
<a href="XXX"><span class="dn-right-arrow"></span></a>
</div>
</div>
</div>

<div>
<div class="league-header">
<span class="XXX"><img class="XXX" src="XXX" alt="XXX">
</span>
<span class="league-name"><span>Soccer League</span></span>
</div>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">16:30</span>
<span class="players">
<span class="home">BOB CLARENCE (Rou)</span>
<span class="score">3 - 1</span>
<span class="away"><span>KOHNING FRYTZZI(Fin)</span></span>
</span>
<span class="set">
<span class="1">6-5</span><span class="2">5-4</span><span class="3">8-9</span>
</span>
</a>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">16:45</span>
<span class="players">
<span class="home">JOHN B.(Fra)</span>
<span class="score">10 - 2</span>
<span class="away"><span>LILYTH (Lgb)</span></span>
</span>
<span class="set">
<span class="1">4-3</span><span class="2">2-1</span><span class="3">8-3</span>
</span>
</a>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">17:30</span>
<span class="players">
<span class="home">ERYK-VANHAN (Spa)</span>
<span class="score">CANC</span>
<span class="away"><span>KLEIN (Ger)</span></span>
</span>
<span class="set">
<span class="1">6-7</span><span class="2">1-3</span><span class="3">3-2</span>
</span>
</a>
<div class="league-header">
<span class="XXX"><img class="XXX" src="XXX" alt="XXX">
</span>
<span class="league-name"><span>Basket League</span></span>
</div>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">16:30</span>
<span class="players">
<span class="home">MAVERYCK</span>
<span class="score">3 - 3</span>
<span class="away"><span>BUENA DONNA</span></span>
</span>
<span class="set">
<span class="1">6-5</span><span class="2">5-4</span><span class="3">8-9</span>
</span>
</a>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">16:45</span>
<span class="players">
<span class="home">HOLLY MOLLY</span>
<span class="score">3 - 7</span>
<span class="away"><span>UAHAHA</span></span>
</span>
<span class="set">
<span class="1">4-3</span><span class="2">2-1</span><span class="3">8-3</span>
</span>
</a>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">17:30</span>
<span class="players">
<span class="home">***DANIEL</span>
<span class="score">8 - 5</span>
<span class="away"><span>ERIKA </span></span>
</span>
<span class="set">
<span class="1">6-7</span><span class="2">1-3</span><span class="3">3-2</span>
</span>
</a>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">17:30</span>
<span class="players">
<span class="home">IGLESIAS***</span>
<span class="score">8 - 5</span>
<span class="away"><span> MORICHKO</span></span>
</span>
<span class="set">
<span class="1">3-7</span><span class="2">3-3</span><span class="3">3-2</span>
</span>
</a>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">17:30</span>
<span class="players">
<span class="home">MARY</span>
<span class="score">postp</span>
<span class="away"><span>GLEN</span></span>
</span>
<span class="set">
<span class="1">3-7</span><span class="2">3-3</span><span class="3">3-2</span>
</span>
</a>
<div class="league-header">
<span class="XXX"><img class="XXX" src="XXX" alt="XXX">
</span>
<span class="league-name"><span>Hockey League</span></span>
</div>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">16:30</span>
<span class="players">
<span class="home">MARIANOVICH ALEX</span>
<span class="score">3 - 1</span>
<span class="away"><span>BOB BIG-BOY (Usa)</span></span>
</span>
<span class="set">
<span class="1">6-5</span><span class="2">5-4</span><span class="3">8-9</span>
</span>
</a>
<a class="game" href="I WANT THIS HREF">
<span class="time notime started">16:45</span>
<span class="players">
<span class="home">ANTHONY WELD </span>
<span class="score">10 - 2</span>
<span class="away"><span> MARYANN LOUIS**</span></span>
</span>
<span class="set">
<span class="1">4-3</span><span class="2">2-1</span><span class="3">8-3</span>
</span>
</a>
</div>
</body>
</html>
And this is my excel vba code. Maybe it's a little chaotic, but that's the best I can do!
VBA Code:
Sub Getinfo()
    
    Dim http As New XMLHTTP60
    Dim html As New HTMLDocument
    Dim y As Long
    Dim i As Long
    
    Sheets("Sheet3").Select
    Cells.Select
    Selection.ClearContents
    Range("A1").Select
  
    With http
        .Open "GET", "C:\SITE STRUCTURE.HTML", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim posts As MSHTML.IHTMLElementCollection
    Dim post As MSHTML.IHTMLElement
    Set posts = html.getElementsByClassName("game")
    
    Dim homes As MSHTML.IHTMLElementCollection
    Dim home As MSHTML.IHTMLElement
    Set homes = html.getElementsByTagName("body")
    
    Dim Leagues As MSHTML.IHTMLElementCollection
    Dim League As MSHTML.IHTMLElement
    Set Leagues = html.getElementsByClassName("league-name")
    
    If Leagues.Length > 0 Then
            For y = 0 To Leagues.Length - 1
     i = 3
    For Each home In homes
    'For Each league In leagues
    For Each post In posts
    
    
    
        'If league.className = "league-name" Then
        
        Sheet3.Range("A" & i).Value = home.getElementsByTagName("div")(4).innerText   ' Works fine in this case but i'm not sure if is the right mode to do this
        Sheet3.Range("B" & i).Value = Leagues(0).innerText
        Sheet3.Range("C" & i).Value = post.Children(0).innerText
        'Debug.Print post.innerText
        Sheet3.Range("D" & i).Value = post.Children(1).Children(0).innerText
        Sheet3.Range("E" & i).Value = post.Children(1).Children(1).innerText
        Sheet3.Range("F" & i).Value = post.Children(1).Children(2).innerText
        Sheet3.Range("G" & i).Value = post.Children(2).Children(0).innerText
        
        Sheet3.Range("H" & i).Value = post.Children(2).Children(1).innerText
        Sheet3.Range("I" & i).Value = post.Children(2).Children(2).innerText
        Sheet3.Range("J" & i).Value = post.href
      
    i = i + 1
    
    Next
    
    Next
    Next
    End If
    
End Sub

Ok. This is what i get

get.jpg


And this is what i need

need.jpg


I have two (big and wide) problems
1. I can't extract the league type correctly (column B)
2. I need to clean (on the fly) any other character around the players' names, but I do not know how to do .
I need a little help ( with some explanation ) to do this.
Thank you in advance
 

Some videos you may like

Excel Facts

Move date out one month or year
Use =EDATE(A2,1) for one month later. Use EDATE(A2,12) for one year later.

Anka

New Member
Joined
Oct 20, 2012
Messages
45
Hello again.
The second part of the problem ( clean (on the fly) any other character around the players' names ) is not so important. I think I found a solution ...
The most important part is how to extract the league type correctly (column B) .
Any help is appreciated
 

John_w

MrExcel MVP
Joined
Oct 15, 2007
Messages
6,459
This is quite tricky because the games are siblings of the prior league, not children, and the leagues are also siblings. Try this macro (you'll need to change the base address of the web site used in the Replace function):

VBA Code:
Option Explicit

Public Sub XMLhttp_Getinfo()
   
    Dim http As XMLHTTP60
    Dim HTMLdoc As HTMLDocument
    Dim gameDateDiv As HTMLDivElement
    Dim leagues As IHTMLDOMChildrenCollection 'or As IHTMLElementCollection
    Dim game As HTMLAnchorElement
    Dim exitLeague As Boolean
    Dim i As Long, r As Long
    Dim destSheet As Worksheet
   
    Set http = New XMLHTTP60
    Set HTMLdoc = New HTMLDocument
   
    Set destSheet = Worksheets("Sheet1")
    With destSheet
        .Cells.ClearContents
        .Range("A1:J1").Value = Split("DATE,LEAGUE TYPE,K.OFF,HOME PLAYER,FINAL SCORE,AWAY PLAYER,SET 1,SET 2,SET 3,WEB ADDRESS", ",")
        r = 2
    End With
   
    With http
        .Open "GET", "C:\SITE STRUCTURE.HTML", False
        .send
        HTMLdoc.body.innerHTML = .responseText
    End With
   
    '<div id="dn-title">1 Oct 2020 </div>
    Set gameDateDiv = HTMLdoc.getElementById("dn-title")
   
    '<div class="league-header">
    Set leagues = HTMLdoc.querySelectorAll("div.league-header") 'or Set leagues = HTMLdoc.getElementsByClassName("league-header")
   
    For i = 0 To leagues.Length - 1
       
        Set game = leagues(i).NextSibling
       
        Do
       
            exitLeague = True
           
            If Not game Is Nothing Then
                If game.tagName = "A" Then
                    destSheet.Cells(r, "A").Value = CDate(gameDateDiv.innerText)
                    destSheet.Cells(r, "B").Value = leagues(i).innerText
                    destSheet.Cells(r, "C").Value = game.Children(0).innerText  'time
                    destSheet.Cells(r, "D").Value = Clean_Player(game.Children(1).Children(0).innerText)  'home player
                    destSheet.Cells(r, "E").Value = "'" & game.Children(1).Children(1).innerText  'score
                    destSheet.Cells(r, "F").Value = Clean_Player(game.Children(1).Children(2).innerText)  'away player
                    destSheet.Cells(r, "G").Value = "'" & game.Children(2).Children(0).innerText  'set 1
                    destSheet.Cells(r, "H").Value = "'" & game.Children(2).Children(1).innerText  'set 2
                    destSheet.Cells(r, "I").Value = "'" & game.Children(2).Children(2).innerText  'set 3
                    destSheet.Cells(r, "J").Value = Replace(game.href, "about:", "http://the.website.com")  'URL
                    r = r + 1
                    DoEvents
                    Set game = game.NextSibling
                    exitLeague = False
                End If
            End If
           
        Loop Until exitLeague
   
    Next
        
End Sub


Private Function Clean_Player(playerName As String) As String
    Dim p As Long
    Clean_Player = Replace(playerName, "*", "")
    p = InStr(Clean_Player, "(")
    If p > 0 Then Clean_Player = Left(Clean_Player, p - 1)
    Clean_Player = Trim(Clean_Player)
End Function
 

Anka

New Member
Joined
Oct 20, 2012
Messages
45
Say whaaaat !!? Honestly, I have no words to thank you.
It's the second time (if not the third) when you manage to solve my problem.
Anyway, everything work like a charm (including the clean function).
Thanks a lot, again !!
 

Watch MrExcel Video

Forum statistics

Threads
1,112,767
Messages
5,542,420
Members
410,550
Latest member
ganeshsamant
Top