Searching XML File

mchac

Well-known Member
Joined
Apr 15, 2013
Messages
531
The sub i'm currently using does an instr search for a unique string in an xml file. This works fine but i'd like to speed up the search if possible because I have a large number of xml files to search. My problem is that i have very little experience in parsing xml.

I notice that if i load the xml file and look at the namespaces i see item(1) through about item(20) and the unique string is always contained in one of those items. Note that the number of items can be different for each xml file i load.

The xml file is loaded into an object i define as oInstance and i'd like to do something like:

Do
counter = counter+1

if oInstance.namespaces.item(counter) = uniquestring then
DoThis
Exit Do
end if

Loop Until oInstance.namespaces.item(counter) Is Nothing

I tried basically that and it didn't work.

So I tried:
Code:
     Dim COUNTERV2 As Integer    

     Dim oNodelistV2 As MSXML2.IXMLDOMNodeList
    
    Set oNodelistV2 = oInstance.SelectNodes("namespaces")


    COUNTERV2 = 0
    Do
        
        COUNTERV2 = COUNTERV2 + 1
        
        If oNodelistV2.Item(COUNTERV2) = "http://fasb.org/us-gaap/2013-01-31" Then
            m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2013-01-31"
            m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2013-01-31"
            Exit Do
        End If

     Loop Until oNodelistV2.Item(COUNTER) Is Nothing
but now i get Err.Number 91

If anyone can point me in a direction to investigate I would appreciate it.
 

Excel Facts

How to create a cell-sized chart?
Tiny charts, called Sparklines, were added to Excel 2010. Look for Sparklines on the Insert tab.
Hi,
With XML you should just select the matching node using a XPath expression. Where are you going with this? In short, it should be very easy to get the node but what do you want to do with it?

I might be thinking about this wrong though, so if you can post a little sample of the relevant XML it would help (enclose it in PHP tags to display on the board).

Note that it's a little odd with your loop. Either the string to match is there or it isn't so why bother looping or anything else? What happens if it's not there?
 
Last edited:
Upvote 0
Hi,
I've been looking at some XBRL samples and namespaces come into play in different ways - I think we need to know what kind of xml document you have here (ideally a sample) -- otherwise I'd probably put something out as an example that wouldn't work for you. By the way you can also just search the XML text as text if you just want to know if a namespace value exists in it (may or may not work as well as parsing the XML, depending on the size of the file and location of the searched for text).
 
Upvote 0
Xenou, thanks for your posts and for looking into XBRL.

I am trying to pull financial data from XBRL instance documents that are filed with the SEC. This is a personal project to test a concept and I'm borrowing code available from Accuracy Rate Raises to 97.3% for SEC XBRL Filings, Prototype for Using Information - Blog: Digital Financial Reporting (using XBRL) - Semantic, structured, model-based authoring of digital financial reports leveraging the XBRL medium.

Many financial filings include XBRL instance documents and I'm specifically looking at 10-K and 10-Q data from 2012Q4 through 2013Q4 (or what's been filed through the run date since the quarter isn't over). That leaves me with 43k xml files to search.

XBRL uses a naming convention (taxonomy) that gets updated now once a year and the code needs to evaluate which taxonomy is being used so that you can pull specific pieces of data from the xml file. For example, if you wanted to pull Wal Mart's Current Liabilities you would need to manually look up their financial filing online or download a .xls file and write code to search that file. I did this a few months ago and it is very time consuming to open, search and close those excel files. I've read that using direct access to the xml file on the SEC server by using XBRL naming is much faster so I'm trying to learn about it and implement code to do this.

I have the directory path to those 43k xml files in an excel spreadsheet so my code takes the first xml path, loads the xml into memory, searches for the taxonomy version using Instr then with that framework I search for the specific financial info I want (such as Accounts Payable). Then I loop with the next xml path etc etc.

Using Instr works fine (see code below) but I thought it would be faster if I just search the Items I mentioned in my first post rather than search the entire xml file (multiplied by doing this for 43k xml files). Then I got interested in how this would work since I have basically no xml parsing experience.

Code:
    If InStr(1, strEntireInstanceDocument_TEMP, "http://fasb.org/us-gaap/2012-01-31") <> 0 Then
        'This IS the 2012 US GAAP taxonomy
        m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2012-01-31"
        m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2012-01-31"
    End If
    If InStr(1, strEntireInstanceDocument_TEMP, "http://fasb.org/us-gaap/2011-01-31") <> 0 Then
        'This IS the 2011 US GAAP taxonomy
        m_strUSGAAP_TaxonomyVersion = "http://fasb.org/us-gaap/2011-01-31"
        m_strInvest_TaxonomyVersion = "http://xbrl.sec.gov/invest/2011-01-31"
    End If
    If InStr(1, strEntireInstanceDocument_TEMP, "http://xbrl.us/us-gaap/2009-01-31") <> 0 Then
        'This IS the 2009 US GAAP taxonomy
        m_strUSGAAP_TaxonomyVersion = "http://xbrl.us/us-gaap/2009-01-31"
        m_strInvest_TaxonomyVersion = "http://xbrl.us/invest/2019-01-31"
    End If

So I tried to write a loop to look in each of those Items to see if any one of a set of strings is there. If so, that's the taxonomy and I can proceed to pulling data. I wrote a loop because the Item that holds the taxonomy reference can be different for each xml file. If that taxonomy reference isn't present then I can't use the naming convention to pull the data I want and I skip that financial filing. Note that it appears that most of the filings do have a taxonomy reference.

I think your first post implied that there is a direct way to see of the string is in any of the Items but I don't understand how to write that code. If you can provide references I can read up on it.

Here are some xml document paths:
www.sec.gov/Archives/edgar/data/1084869/000110465913082760/flws-20130929.xml
www.sec.gov/Archives/edgar/data/66740/000110465913079583/mmm-20130930.xml
www.sec.gov/Archives/edgar/data/913142/000119312513429221/bdc-20130929.xml
www.sec.gov/Archives/edgar/data/10795/000119312513456802/bdx-20130930.xml

Thanks again.
 
Upvote 0
For all four of them, it's at the top in about the third line once rendered.

In the first one (for 1 800 flowers) you can see:
xmlns:us-gaap="http://fasb.org/us-gaap/2012-01-31"
for the accounting line items and this:
xmlns:dei="http://xbrl.sec.gov/dei/2012-01-31"
for the document and entity information.

It seems, from a very small sampling, that it's usually at the top of the rendered xml file. Perhaps that means that Instr() as a brute force method is suitable? I suppose for that to be true the xml files would need to be saved in the same order that they a rendered in a browser. I noticed that for a couple seconds the xml file looks very different before the browser does whatever it does to display a typical xml structure. But I don't know what that means, if anything, for how/where things are stored.

If you are interested in info regarding the taxonomies, here are a couple links to the SEC website:
Standard Taxonomies
Information for EDGAR Filers
 
Upvote 0
Okay thanks. I was looking at the xml in a browser instead of downloading the file and view the raw xml.

So here's how to get the namespaces -- a lot like your original code, I think, which works best here. It assumes you have the flowers xml doc on your desktop (but edit the file path as needed wherever you put the file):
Code:
[COLOR="Navy"]Sub[/COLOR] Foo()
[COLOR="Navy"]Dim[/COLOR] dom [COLOR="Navy"]As[/COLOR] [COLOR="Navy"]Object[/COLOR] [COLOR="SeaGreen"]'MSXML2.DOMDocument[/COLOR]
[COLOR="Navy"]Dim[/COLOR] node [COLOR="Navy"]As[/COLOR] [COLOR="Navy"]Object[/COLOR] [COLOR="SeaGreen"]'MSXML2.IXMLDOMNode[/COLOR]
[COLOR="Navy"]Dim[/COLOR] n, o

    [COLOR="Navy"]Set[/COLOR] dom = CreateObject("MSXML2.DOMDocument")
    dom.setProperty "SelectionLanguage", "XPath"
    dom.setProperty "SelectionNamespaces", "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
    dom.Load "C:\users\<username>\Desktop\flws-20130929.xml"  '//EDIT AS NEEDED
    
    [COLOR="Navy"]Set[/COLOR] n = dom.namespaces
    [COLOR="Navy"]For[/COLOR] [COLOR="Navy"]Each[/COLOR] o [COLOR="Navy"]In[/COLOR] n
        [COLOR="Navy"]If[/COLOR] o [COLOR="Navy"]Like[/COLOR] "*us-gaap*" [COLOR="Navy"]Then[/COLOR]
            [COLOR="Navy"]Debug[/COLOR].[COLOR="Navy"]Print[/COLOR] o & " <-- Found!"
        [COLOR="Navy"]Else[/COLOR]
            [COLOR="Navy"]Debug[/COLOR].[COLOR="Navy"]Print[/COLOR] o
        [COLOR="Navy"]End[/COLOR] [COLOR="Navy"]If[/COLOR]
    [COLOR="Navy"]Next[/COLOR] o
    
    [COLOR="Navy"]Set[/COLOR] dom = [COLOR="Navy"]Nothing[/COLOR]
    
[COLOR="Navy"]End[/COLOR] [COLOR="Navy"]Sub[/COLOR]


I've tried to learn a little about XBRL but its a beast! I think they have made it overly complicated - mad scientists are in charge.
 
Upvote 0
ok. Thanks very much for that.

My variable declarations and doc loading code look like this where the LoadInstanceDocIntoMemory sub is called from the sub that iterates through xml paths saved in the excel sheet:

Code:
Private oInstance As MSXML2.DOMDocument60
Private oNodelist As MSXML2.IXMLDOMNodeList

Private Sub LoadInstanceDocIntoMemory(COUNTER As Long)


    If ActiveCell.Offset(COUNTER, 0) <> "" Then
    
        Set oInstance = Nothing
        Set oInstance = New MSXML2.DOMDocument60
        oInstance.async = False
        oInstance.validateOnParse = False
        oInstance.Load "http://" & ActiveCell.Offset(COUNTER, 0)
        
    
        If oInstance.parseError.ErrorCode <> 0 Then
            MsgBox "Error loading file: " & vbCrLf & vbCrLf & _
                "File URL: " & oInstance.parseError.URL & vbCrLf & _
                "Error Code: " & oInstance.parseError.ErrorCode & vbCrLf & _
                "Error Description: " & oInstance.parseError.reason & vbCrLf _
                , vbCritical
        End If
        
    End If


End Sub

What is the difference between your:
Dim dom As Object 'MSXML2.DOMDocument
Set dom = CreateObject("MSXML2.DOMDocument")

And the code I'm using:
Dim oInstance As MSXML2.DOMDocument60
Set oInstance = New MSXML2.DOMDocument60

?
 
Upvote 0
Hi,
My code uses "late-binding". I usually publish code like that since it doesn't requiring setting references in advance. Your code uses "early-binding". I don't think it makes any real difference here - with the exception that I'm not 100% sure which MSXML library I'm using, whereas you have specifically targeted the 6.0 library.

Whenever you use late binding you must use the CreateObject() method to create objects. When you have early-binding you can use the New keyword (or CreateObject - either way would work in that case). Early binding is always better while writing code though, since you get the advantage of intellisense. Late binding *can* be an advantage if the use upgrades to a newer version of an application - the code will just start using the newer version seamlessly since there isn't any reference out there to the old version that no longer exists. However, in the case of MSXML developers generally want to target a specific version and use early binding, since it has been changing rapidly.
 
Upvote 0
ok. Thanks again for taking the time to look into this and provide a solution. I'll implement later today and come back to you after it's run.

So if I understand you correctly, your code dims an object but doesn't create it as MSXML DOMdoc until it's used but the code I'm using dims an MSXML DOMdoc object then drops data into it.

Do you know if there is a way to release the memory taken by the object? I'm running 43k xml lookups and I can see Excel memory increasing on average as the number of iterations executed increases. My code gave me an out of memory error at 13k iterations then when I hit Continue it gave me the same error at 28k iterations. After that I stopped execution to look into this but can't find a solution.

At first I was wondering if the code I was using creates a 'New' document that isn't fully released when I set it = Nothing then the next New document adds to that. I'm running a test between the two sets of code to see if I find a meaningful difference in memory used and that will be done in about an hour.

In the meantime, any thoughts you have on a memory problem like this is appreciated.
 
Upvote 0

Forum statistics

Threads
1,215,150
Messages
6,123,312
Members
449,094
Latest member
Chestertim

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top