how to separate HTML tables from one cell into multiple cells?

Excel702

New Member
Joined
Jun 3, 2020
Messages
4
Office Version
  1. 365
Platform
  1. Windows
Hi,
I have a complicated situation with an unwieldy amount of data.
The situation is I have data from my website in excel, each row is a profile (around 100k profiles), that all biographical data is crammed into one data field (or in this case, a cell, since in Excel) in HTML format.
The HTML format contains 8 generically named tables and a few paragraphs in between.
The previous owner put me between a rock and a hard place because all the individual biographical data fields like name, age, etc were deleted after being merged into a single field for rendering the post and I have no way to undo it. I extracted all my data into excel, it's structured so each row is all data fields of a profile from my website.
the combined profile biographical data per profile has it's own cell, column C.
I am trying to extract each biographical data from this unfortunately combined cell into other cells so I can make the data useful on reimport such as having categories and tags to filter.

I have tried using a formula like this to extract a profile's name
Code:
=MID(C2,36,FIND("</b",C2,36)-36)
But trying to do it this way becomes more complicated/difficult as I progress into the string to try to extract more.

Is there a way or any excel addon that is able to parse through each excel cell in the column and convert the HTML tables contained within, extracting each HTML table cell value to additional columns in excel?
 

Excel Facts

Spell Check in Excel
Press F7 to start spell check in Excel. Be careful, by default, Excel does not check Capitalized Werds (whoops)

Anthony47

Well-known Member
Joined
Mar 29, 2006
Messages
2,147
I seem that this is a job for vba with the help of the Microsoft HTML Object Library, mshtml.tlb
For additional help you should provide a sample workbook with at least one demo profile in the current format.

Bye
 

John_w

MrExcel MVP
Joined
Oct 15, 2007
Messages
6,753
The situation is I have data from my website in excel, each row is a profile (around 100k profiles), that all biographical data is crammed into one data field (or in this case, a cell, since in Excel) in HTML format.
The HTML format contains 8 generically named tables and a few paragraphs in between.
Welcome to Mr Excel forums.

Without seeing a sample of your data, to know exactly what your HTML looks like, I don't know if this macro will work successfully.

It loops through each cell in column C in the active sheet, starting at C2, and extracts all the text (in the HTML table cells and paragraphs) to columns in a newly added sheet.

VBA Code:
Option Explicit

Public Sub Extract_HTML_Text()

    Dim HTMLcells As Range, HTMLcell As Range
    Dim destCell As Range, r As Long
    Dim HTMLdoc As Object
    
    With ActiveSheet
        Set HTMLcells = .Range(.Range("C2"), .Cells(.Rows.Count, "C").End(xlUp))
    End With
    
    With ActiveWorkbook
        .Worksheets.Add After:=.Worksheets(.Worksheets.Count)
        Set destCell = .Worksheets(.Worksheets.Count).Range("A2")
    End With
        
    Set HTMLdoc = CreateObject("HTMLfile")
    
    r = 0
    For Each HTMLcell In HTMLcells
        HTMLdoc.body.innerHTML = HTMLcell.Value
        DOM_Text_To_Cells HTMLdoc.body, destCell.Offset(r)
        r = r + 1
    Next
    
End Sub


Private Function DOM_Text_To_Cells(node As Object, destCell As Range) As Long
    
    Const TEXT_NODE = 3
    Dim child As Object
    
    'Traverse DOM nodes and put text nodes in next column
    
    DOM_Text_To_Cells = 0
    If node.NodeType = TEXT_NODE Then
        destCell.Offset(, DOM_Text_To_Cells).Value = node.NodeValue
        DOM_Text_To_Cells = 1
    Else
        For Each child In node.ChildNodes
            DOM_Text_To_Cells = DOM_Text_To_Cells + DOM_Text_To_Cells(child, destCell.Offset(, DOM_Text_To_Cells))
        Next
    End If
    
End Function
 

Excel702

New Member
Joined
Jun 3, 2020
Messages
4
Office Version
  1. 365
Platform
  1. Windows
I apologize for my late reply, here is an example row
ExampleRow.xlsx
ABC
1idTitleContent
2209Maria Barreto Leite<p style="text-align: justify;"><b>Maria Barreto Leite</b> is a Brazilian famous celebrity, who was born on 1916 in Santana do Livramento, Rio Grande do Sul, Brazil. Age <strong>80 years (age at death)</strong> old. Maria Barreto Leite Zodiac Sign is <strong></strong>, Ethnicity White & religion Not Available.</p> <h2>Maria Barreto Leite Net Worth 2018</h2> <p style="text-align: justify;">Maria Barreto Leite estimated net worth in 2018 is <b>Under Review</b>. Here we also added Maria Barreto Leite previous years Net Worth, Income, Salary & Property details.</p> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Net Worth, Income & Salary</th> </tr> </thead> <tbody class="row-hover"> <tr class="row evan"> <td class="column-1">Net Worth</td><td class="column-2">Under Review</td> </tr> <tr class="row odd"> <td class="column-1">Salary</td><td class="column-2">Not Available</td> </tr> <tr class="row evan"> <td class="column-1">Estimated Net Worth (2017)</td><td class="column-2">Under Review</td> </tr> <tr class="row odd"> <td class="column-1">Estimated Net Worth (2016)</td><td class="column-2">Under Review</td> </tr> <tr class="row evan"> <td class="column-1">Estimated Net Worth (2015)</td><td class="column-2">Under Review</td> </tr> </tbody></table> <p style="text-align: justify;"><b>Height & Weight</b>: Are you interested to know that <strong>How tall is Maria Barreto Leite?</strong> Let’s check it. Maria Barreto Leite’s height <strong>Not Available & Weight Unknown.</strong></p> <p style="text-align: justify;"><b>Body Measurements</b>: Maria Barreto Leite full body measurements Not Available Right Now. We will update once we get the correct information. Body build Large, Eye color Brown – Dark & Hair color Salt and Pepper.</p> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Profile</th> </tr> </thead> <tbody class="row-hover"> <tr class="row"> <td class="column-1">Birth Name</td><td class="column-2">Maria Barreto Leite</td> </tr> <tr class="row odd"> <td class="column-1">Nick Name</td><td class="column-2">Not Known</td> </tr> <tr class="row evan"> <td class="column-1">Alternative Name</td><td class="column-2"></td> </tr> <tr class="row odd"> <td class="column-1">Occupation</td><td class="column-2"></td> </tr> <tr class="row evan"> <td class="column-1">Star/Sun Sign</td><td class="column-2">Not Known</td> </tr> <tr class="row evan"> <td class="column-1">Caste</td><td class="column-2">Unknown</td> </tr> </tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Birth Details</th> </tr> </thead> <tbody class="row-hover"> <tr class="row evan"> <td class="column-1">Age</td><td class="column-2">80 years (age at death)</td> </tr> <tr class="row odd"> <td class="column-1">Date of Birth</td><td class="column-2">1916</td> </tr> <tr class="row evan"> <td class="column-1">Place of Birth</td><td class="column-2">Santana do Livramento, Rio Grande do Sul, Brazil</td> </tr> <tr class="row odd"> <td class="column-1">Nationality</td><td class="column-2">Brazilian</td> </tr> <tr class="row evan"> <td class="column-1">Ethnicity</td><td class="column-2">White</td> </tr> <tr class="row odd"> <td class="column-1">Religion</td><td class="column-2">Not Known</td> </tr> </tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Academic Background</th> </tr> </thead> <tbody class="row-hover"> <tr class="row evan"> <td class="column-1">Qualification</td><td class="column-2">No Information Found About Educational Qualification.</td> </tr> <tr class="row evan"> <td class="column-1">School</td><td class="column-2">School Name will be update soon.</td> </tr> <tr class="row evan"> <td class="column-1">University</td><td class="column-2">Not Available</td> </tr> </tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Parents & Relatives</th> </tr> </thead> <tbody class="row-hover"> <tr class="row"> <td class="column-1">Father</td><td class="column-2">Fathers name not available right now.</td> </tr> <tr class="row"> <td class="column-1">Mother</td><td class="column-2">Not Known</td> </tr> <tr class="row"> <td class="column-1">Brother</td><td class="column-2">Unknown</td> </tr> <tr class="row"> <td class="column-1">Sister</td><td class="column-2">Unknown</td> </tr> <tr class="row"> <td class="column-1">Siblings</td><td class="column-2">Not Available</td> </tr> <tr class="row"> <td class="column-1">Relatives</td><td class="column-2">Vera Barreto Leite (daughter) (actress), Luiza Barreto Leite (daughter) (actress), Mariana de Moraes (granddaughter) (actress)</td> </tr> <tr class="row"> <td class="column-1">Best Friend(s)</td><td class="column-2">Not Shared about Best/Close Friends.</td> </tr> </tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Married Life</th> </tr> </thead> <tbody class="row-hover"> <tr class="row evan"> <td class="column-1">Marital Status</td><td class="column-2">Update Soon</td> </tr> <tr class="row odd"> <td class="column-1">Spouse</td><td class="column-2">Update Soon</td> </tr><tr class="row evan"> <td class="column-1">Children</td><td class="column-2">We will update soon about Childrens.</td> </tr></tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Body Measurements</th> </tr> </thead> <tbody class="row-hover"> <tr class="row odd"> <td class="column-1">Figure/Measurements</td><td class="column-2">No varified information has been found about Body Measurements.</td> </tr> <tr class="row"> <td class="column-1">Bra/Chest</td><td class="column-2">N/A</td> </tr> <tr class="row"> <td class="column-1">Waist</td><td class="column-2">N/A</td> </tr> <tr class="row"> <td class="column-1">Hip/Biceps</td><td class="column-2">N/A</td> </tr> </tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Physical Stats</th> </tr> </thead> <tbody class="row-hover"> <tr class="row"> <td class="column-1">Sexual Orientation</td><td class="column-2">Straight</td> </tr> <tr class="row"> <td class="column-1">Distinctive Features</td><td class="column-2"></td> </tr> <tr class="row"> <td class="column-1">Body Build</td><td class="column-2">Large</td> </tr> <tr class="row"> <td class="column-1">Eye Color</td><td class="column-2">Brown – Dark</td> </tr> <tr class="row"> <td class="column-1">Hair Color</td><td class="column-2">Salt and Pepper</td> </tr> <tr class="row"> <td class="column-1">Shoe/Feet Size</td><td class="column-2"></td> </tr> <tr class="row"> <td class="column-1">Dress Size</td><td class="column-2"></td> </tr> </tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Height & Weight</th> </tr> </thead> <tbody class="row-hover"> <tr class="row evan"> <td class="column-1">Height</td><td class="column-2">Not Known</td> </tr> <tr class="row odd"> <td class="column-1">Weight (in kilograms)</td><td class="column-2">Not Known</td> </tr> <tr class="row evan"> <td class="column-1">Weight (in Pounds)</td><td class="column-2">Not Known</td> </tr> </tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Career Facts</th> </tr> </thead> <tbody class="row-hover"> </tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Career Debut</th> </tr> </thead> <tbody class="row-hover"> </tbody></table> <table class="tablepress tablepress-id-global"><caption></caption> <thead> <tr class="row-tablehead odd"> <th colspan="2">Dating/Affairs History</th> </tr> </thead> <tbody class="row-hover"> <tr class="row evan"> <td class="column-1"
Sheet1
 

Anthony47

Well-known Member
Joined
Mar 29, 2006
Messages
2,147

ADVERTISEMENT

So you have in column C a set of cells containing information similar to what you published.
The following macro will extract from each of the cells "all" the information presented as Tables:
Code:
Sub GetDetails()
Dim HTDoc As Object, OutSh As Worksheet
Dim oArr(), AColl As Object, myItm As Object, I As Long, J As Long
Dim II As Long, tDtD, tRtR, TI As Long
'
Set OutSh = Sheets("Details")       '<<< The Output worksheet
'
OutSh.Cells.ClearContents
Set HTDoc = CreateObject("HTMLfile") 'late binding alla html obj lib
'HTDoc.Open
For II = 2 To Cells(Rows.Count, "C").End(xlUp).Row
    HTDoc.Open
    HTDoc.write Cells(II, "C").Value
    TI = 0: J = 0 + (II - 2) * 4
    If Not HTDoc Is Nothing Then
        Set AColl = HTDoc.getElementsByTagName("table")
        For Each myItm In AColl
            OutSh.Cells(I + 1, J + 1) = "Table# " & TI + 1
            TI = TI + 1: I = I + 1
            For Each tRtR In myItm.Rows
                For Each tDtD In tRtR.Cells
                    OutSh.Cells(I + 1, J + 1) = tDtD.innerText
                    J = J + 1
                Next tDtD
                I = I + 1: J = 0 + (II - 2) * 4
            DoEvents
            Next tRtR
        I = I + 1
        Next myItm
    End If
    I = 0
Next II
End Sub

Copy the code in a "standard module" of your vba project; then select the sheets with the source information and start Sub GetDetails

The macro will create in the worksheet named "Details" (this name can be changed within the code; make sure the worksheet exists before running the macro) a sequence of information organized in column for each of the cells compiled in the starting sheet. The first set starts in A1, the second in E1, the third one in I1, and so on

Have a look at it...

Bye
 

Excel702

New Member
Joined
Jun 3, 2020
Messages
4
Office Version
  1. 365
Platform
  1. Windows
Thank you so much @Anthony47

I am completely new to VBA scripting lol
OutofRange_imageminified.jpg
I realized, oh! I have to make the Details page ahead of time for that to work.


Got it to work but it seems to stop after 6 profiles.
I have 5,000 rows (containing profiles) per excel page.
All cells in the C column on those pages are structured the same.

I am trying to understand the code, to see if there would be a way to output all table data in each C column cell from whatever page, into all cells across a single row in a new page.
That way it matches the structure of the pages with the profile rows so I can add the new columns over into the pages they were extracted from.

Would you be willing to give me a crash course in how this code works so I can learn and make it work?
Maybe point me in the right direction for stuff to read?
I am starting knowledge-wise from nothing

In any event, this was a huge help! I now know it's possible to fix this problem, thank you!
 

Attachments

  • OutofRange_error.jpg
    OutofRange_error.jpg
    8.7 KB · Views: 10

Anthony47

Well-known Member
Joined
Mar 29, 2006
Messages
2,147
You wrote that the process stops after importing only a few profiles: did you get an error, or any message at that point, or what?
I copied the block you published in 20 cells and I got all the 20 profiles.
So if you can publish a sample workbook with blocks that fails we can examine it and check what is the problem and bypass/ fix it

In the meantime I shall find some useful information about the mshtml library that I exploited for extracting the data.

Bye
 

Excel702

New Member
Joined
Jun 3, 2020
Messages
4
Office Version
  1. 365
Platform
  1. Windows
Thank you, I believe it was my own user error.
I figured out that if I don't click the Details page, the script will continue running.
If I click the Details page to view it while the script is running, it will prematurely stop the script.
 

Watch MrExcel Video

Forum statistics

Threads
1,129,829
Messages
5,638,566
Members
417,037
Latest member
vedan2020

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Top