Searching XML File

mchac

Well-known Member
Joined
Apr 15, 2013
Messages
531
The sub i'm currently using does an instr search for a unique string in an xml file. This works fine but i'd like to speed up the search if possible because I have a large number of xml files to search. My problem is that i have very little experience in parsing xml.

I notice that if i load the xml file and look at the namespaces i see item(1) through about item(20) and the unique string is always contained in one of those items. Note that the number of items can be different for each xml file i load.

The xml file is loaded into an object i define as oInstance and i'd like to do something like:

Do
counter = counter+1

if oInstance.namespaces.item(counter) = uniquestring then
DoThis
Exit Do
end if

Loop Until oInstance.namespaces.item(counter) Is Nothing

I tried basically that and it didn't work.

So I tried:
Code:
     Dim COUNTERV2 As Integer    

     Dim oNodelistV2 As MSXML2.IXMLDOMNodeList
    
    Set oNodelistV2 = oInstance.SelectNodes("namespaces")


    COUNTERV2 = 0
    Do
        
        COUNTERV2 = COUNTERV2 + 1
        
        If oNodelistV2.Item(COUNTERV2) = "http://fasb.org/us-gaap/2013-01-31" Then
            m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2013-01-31"
            m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2013-01-31"
            Exit Do
        End If

     Loop Until oNodelistV2.Item(COUNTER) Is Nothing
but now i get Err.Number 91

If anyone can point me in a direction to investigate I would appreciate it.
 
Technically dimming the objects (early or late) will only create a variable for it (more or less something like a pointer to the object that will be created). Either way CreateObject() and New will do the same thing, which is actually instantiate the object. I'm not experienced with trying to reduce memory usage in a case like this. In theory setting any object to nothing will free up the memory. In practice there may be more to it than that. Note that MSXML objects (I think, anyway) would be pretty lightweight - but the xml contents could add up quickly, especially if the space is not really being re-used efficiently as the code progresses).
 
Upvote 0

Excel Facts

What is =ROMAN(40) in Excel?
The Roman numeral for 40 is XL. Bill "MrExcel" Jelen's 40th book was called MrExcel XL.
Ok.
Oddly, when execution is stopped and the sheet closed about half of the the memory increase isn't released. I need to exit excel and start it again.

Tks for your help.
 
Upvote 0
xenou - projects led to other projects and i'm only just now getting back to this.
your code works like a charm.
i bought a book on xml and have been reading stuff on the web but i still know so little that it looks like wizardry to me at the moment.

could you please explain what the following two lines in your code do and why they are necessary?
thanks a million in advance.

Code:
    dom.setProperty "SelectionLanguage", "XPath"
    dom.setProperty "SelectionNamespaces", "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
 
Upvote 0
Hi,

I guess we don't need to use the first setting if using MSXML 6.0 (or higher). The default selection language is XPath already in those versions. Previous versions of MSXML used something called XSLPattern. I've never even heard of it and don't know what it is - we want to use XPath. I suspect XSLPattern is very similar, but I have heard it numbers some indexes differently than XPath (0-based vs. 1-based, or vice versa).
See:
SelectionLanguage Property

For the second setting, it registers the xsl prefix for use with xsl transforms. Personally I just always include these two lines all the time (even if I don't use any transforms). That way I'm safe!

ξ
 
Upvote 0
Ok. Again, thanks very much!
Before seeing your post, I ran the procedure without these two lines and it was fine. But I see your point and will build this in (and will research xsl transforms).

By the way, I'm about to post this separately but search by parsing uses much less memory than InStr. In one search, parsing used 460mb less RAM than InStr.

** I've been fighting this memory problem for months so THANK YOU for helping me sort it out.

I have Win 7 32 bit and 4 gb ram and iterate through 43k xml pages. For some reason I have not yet identified, as I iterate through, the amount of memory used by Excel (Working Set) increases. It's not a monotonic increase but the trend is higher as the number of iterations increase. By about half way through I'm finding that some of the xml pages are not processed (searched) because I get an Out of Memory error in the sub that currently uses an InStr search. I'm going to substitute your idea this afternoon and run overnight. It will be interesting to see if it is faster or slower given that the string being searched for appears at the top of the xml docs.

Thanks again.


I have a follow up question on this for you. The code as I have it at the moment (see the second code block below) iterates through the collection (I think it's a collection) exactly as you suggested. This means that it will take the first namespace (I need to read tonight about what this is) and test it against the four If conditions in the For Next loop.

Is there an alternative where the object is something like n.getNamedItem("http://fasb.org/us-gaap/2013-01-31").Value so the test can be something more like the following code block (I'm just throwing this out to explain what I mean ... I couldn't get it to work). That way the parsing goes right to the needed value, if it is there, and I can avoid iterating through the entire collection for each of my 43k XML docs.

Code:
Do
    Set Result = n.getNamedItem("http://fasb.org/us-gaap/2013-01-31").Value
    If Result.Length > 0 Then
        'Do ThingOne
        Exit Do
    End If

    Set Result = n.getNamedItem("http://fasb.org/us-gaap/2012-01-31").Value
    If Result.Length > 0 Then
        'Do ThingTwo
        Exit Do
    End If

    Set Result = n.getNamedItem("http://fasb.org/us-gaap/2011-01-31").Value
    If Result.Length > 0 Then
        'Do ThingThree
        Exit Do
    End If

    Set Result = n.getNamedItem("http://xbrl.us/us-gaap/2009-01-31").Value
    If Result.Length > 0 Then
        'Do ThingThree
        Exit Do
    End If

Loop While False

Below is the code I'm using that incorporates your solution. The commented out section (obviously) are the InStr calls I will to eliminate.
You can see two .Load commands at the top. The first is for a small XML document (and is currently commented out). The second is for the larger XML doc and it's searching this one that parsing saves me peak memory use of 460mb.
Code:
Sub tester3()


    Dim oInstance As MSXML2.DOMDocument60
    Dim n
    Dim o
    
    Dim oNodelist As MSXML2.IXMLDOMNodeList
    Dim m_strUSGAAP_TaxonomyVersion As String
    Dim m_strInvest_TaxonomyVersion As String
    Dim m_strDEI_TaxonomyVersion As String
    
    Set oInstance = New MSXML2.DOMDocument60
    oInstance.setProperty "SelectionLanguage", "XPath"
    oInstance.setProperty "SelectionNamespaces", "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
    oInstance.async = False
    oInstance.validateOnParse = False
    'oInstance.Load "http://www.sec.gov/Archives/edgar/data/1143908/000114390813000035/ck0001143908-20130831.xml"
    oInstance.Load "http://www.sec.gov/Archives/edgar/data/7323/000006598413000050/etr-20121231.xml"
    
    m_strContextForInstants = "ERROR"
    m_strContextForDurations = "ERROR"
        
    '++++++++++++++++++++*
    
    Set n = oInstance.namespaces
    
    Do
        For Each o In n
            If o = "http://fasb.org/us-gaap/2013-01-31" Then
                m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2013-01-31"
                m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2013-01-31"
                NO_USGAAP_TAXONOMY_INFO = False
                Exit Do
            ElseIf o = "http://fasb.org/us-gaap/2012-01-31" Then
                m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2012-01-31"
                m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2012-01-31"
                NO_USGAAP_TAXONOMY_INFO = False
                Exit Do
            ElseIf o = "http://fasb.org/us-gaap/2011-01-31" Then
                m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2011-01-31"
                m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2011-01-31"
                NO_USGAAP_TAXONOMY_INFO = False
                Exit Do
            ElseIf o = "http://xbrl.us/us-gaap/2009-01-31" Then
                m_strUSGAAP_TaxonomyVersion = "http://xbrl.us/us-gaap/2009-01-31"
                m_strInvest_TaxonomyVersion = "http://xbrl.us/invest/2019-01-31"
                NO_USGAAP_TAXONOMY_INFO = False
                Exit Do
            Else: NO_USGAAP_TAXONOMY_INFO = True
            End If
        Next o
    Loop While False


    'DEI Taxonomy
    Do
        For Each o In n
            If o = "http://xbrl.sec.gov/dei/2013-01-31" Then
                m_strDEI_TaxonomyVersion = "http://xbrl.sec.gov/dei/2013-01-31"
                NO_DEI_TAXONOMY_INFO = False
                Exit Do
            ElseIf o = "http://xbrl.sec.gov/dei/2012-01-31" Then
                m_strDEI_TaxonomyVersion = "http://xbrl.sec.gov/dei/2012-01-31"
                NO_DEI_TAXONOMY_INFO = False
                Exit Do
            ElseIf o = "http://xbrl.sec.gov/dei/2011-01-31" Then
                m_strDEI_TaxonomyVersion = "http://xbrl.sec.gov/dei/2011-01-31"
                NO_DEI_TAXONOMY_INFO = False
                Exit Do
            ElseIf o = "http://xbrl.us/dei/2009-01-31" Then
                m_strDEI_TaxonomyVersion = "http://xbrl.us/dei/2009-01-31"
                NO_DEI_TAXONOMY_INFO = False
                Exit Do
            Else: NO_DEI_TAXONOMY_INFO = True
            End If
        Next o
    Loop While False


    '+++++++++++++++++++++
    
    'If InStr(1, oInstance.XML, "http://fasb.org/us-gaap/2013-01-31") <> 0 Then
        'This IS the 2012 US GAAP taxonomy
    '    m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2013-01-31"
    '    m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2013-01-31"  'NEED TO CONFIRM THE DATE HERE
    'ElseIf InStr(1, oInstance.XML, "http://fasb.org/us-gaap/2012-01-31") <> 0 Then
        'This IS the 2012 US GAAP taxonomy
    '    m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2012-01-31"
    '    m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2012-01-31"
    'ElseIf InStr(1, oInstance.XML, "http://fasb.org/us-gaap/2011-01-31") <> 0 Then
        'This IS the 2011 US GAAP taxonomy
    '    m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2011-01-31"
    '    m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2011-01-31"
    'ElseIf InStr(1, oInstance.XML, "http://xbrl.us/us-gaap/2009-01-31") <> 0 Then
        'This IS the 2009 US GAAP taxonomy
    '    m_strUSGAAP_TaxonomyVersion = "http://xbrl.us/us-gaap/2009-01-31"
    '    m_strInvest_TaxonomyVersion = "http://xbrl.us/invest/2019-01-31"
    'Else: NO_USGAAP_TAXONOMY_INFO = True
    'End If
    
    'DEI Taxonomy
    'If InStr(1, oInstance.XML, "http://xbrl.sec.gov/dei/2013-01-31") <> 0 Then
        'This IS the 2012 DEI taxonomy
    '    m_strDEI_TaxonomyVersion = "http://xbrl.sec.gov/dei/2013-01-31"
    'ElseIf InStr(1, oInstance.XML, "http://xbrl.sec.gov/dei/2012-01-31") <> 0 Then
        'This IS the 2012 DEI taxonomy
    '    m_strDEI_TaxonomyVersion = "http://xbrl.sec.gov/dei/2012-01-31"
    'ElseIf InStr(1, oInstance.XML, "http://xbrl.sec.gov/dei/2011-01-31") <> 0 Then
        'This IS the 2011 DEI taxonomy
    '    m_strDEI_TaxonomyVersion = "http://xbrl.sec.gov/dei/2011-01-31"
    'ElseIf InStr(1, oInstance.XML, "http://xbrl.us/dei/2009-01-31") <> 0 Then
       'This IS the 2009 DEI taxonomy
    '     m_strDEI_TaxonomyVersion = "http://xbrl.us/dei/2009-01-31"
    'Else: NO_DEI_TAXONOMY_INFO = True
    'End If
    
    '+++++++++++++++++++++
    
    If NO_USGAAP_TAXONOMY_INFO = True Or NO_DEI_TAXONOMY_INFO = True Then
        Exit Sub
    End If
    
    oInstance.setProperty "SelectionNamespaces", "xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:xbrli='http://www.xbrl.org/2003/instance' xmlns='http://www.xbrl.org/2003/instance' xmlns:xbrldi='http://xbrl.org/2006/xbrldi' xmlns:us-gaap='" & m_strUSGAAP_TaxonomyVersion & "' xmlns:dei='" & m_strDEI_TaxonomyVersion & "' xmlns:currency='http://xbrl.sec.gov/currency/2012-01-31' xmlns:invest='" & m_strInvest_TaxonomyVersion & "'"
                
    Set oNodelist = oInstance.SelectNodes("//dei:EntityRegistrantName")
    
    If oNodelist.Length > 0 Then
        Debug.Print "found it: " & oInstance.SelectSingleNode("//dei:EntityRegistrantName").Text
    Else
        Debug.Print "didn't find it"
    End If


End Sub
 
Last edited:
Upvote 0

Forum statistics

Threads
1,215,136
Messages
6,123,246
Members
449,093
Latest member
Vincent Khandagale

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top