Thanks for the hint. I used the Data sampler in HTML; an example of the data is:
<b>Excel 2010</b><table cellpadding="2.5px" rules="all" style=";background-color: #FFFFFF;border: 1px solid;border-collapse: collapse; border-color: #BBB"><colgroup><col width="25px" style="background-color: #DAE7F5" /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /></colgroup><thead><tr style=" background-color: #DAE7F5;text-align: center;color: #161120">
<th>A</th><th>B</th><th>C</th><th>D</th><th>E</th><th>F</th><th>G</th><th>H</th><th>I</th><th>J</th><th>K</th><th>L</th><th>M</th><th>N</th><th>O</th><th>P</th><th>Q</th><th>R</th><th>S</th><th>T</th><th>U</th><th>V</th></tr></thead><tbody><tr ><td style="color: #161120;text-align: center;">1</td><td style="background-color: #00B0F0;;">Source</td><td style=";">Additional Info</td><td style="background-color: #00B0F0;;">First Name</td><td style="background-color: #00B0F0;;">Last Name</td><td style=";">Type</td><td style="background-color: #00B0F0;;">Title</td><td style=";">Company</td><td style=";">Tags</td><td style="background-color: #00B0F0;;">Notes</td><td style="background-color: #00B0F0;;">Id</td><td style="background-color: #00B0F0;;">Primary Email</td><td style="background-color: #00B0F0;;">PrimaryPhone</td><td style="background-color: #00B0F0;;">PrimaryCell</td><td style="background-color: #00B0F0;;">PrimaryFax</td><td style="background-color: #00B0F0;;">Primary Website</td><td style="background-color: #00B0F0;;">Primary Address_1</td><td style="background-color: #00B0F0;;">Primary Address_2</td><td style="background-color: #00B0F0;;">Primary City</td><td style="background-color: #00B0F0;;">Primary State</td><td style="background-color: #00B0F0;;">Primary Zip_code</td><td style="background-color: #00B0F0;;">Primary Country</td></tr><tr ><td style="color: #161120;text-align: center;">39172</td><td style=";">Dance NPRG</td><td style="text-align: right;;"></td><td style=";">Juanita</td><td style=";">Caddy</td><td style="text-align: right;;"></td>
<td style=";">ISC Funding and Stakeholder Engagement</td><td style=";">Department of Education, Employment and Workplace Relations DEEWR FED</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td>
<td style=";">xxx@deewr.gov.au</td>
<td style=";">02 1234 5678</td><td style=";"></td><td style=";"> </td><td style=";">http://www.deewr.gov.au</td><td style=";"></td><td style="text-align: right;;"></td><td style=";"></td><td style="text-align: right;;"></td><td style=";"></td><td style="text-align: right;;"></td></tr><tr ><td style="color: #161120;text-align: center;">39173</td><td style=";">VACD NPRG</td><td style="text-align: right;;"></td><td style=";">Juanita</td><td style=";">Caddy</td><td style="text-align: right;;"></td><td style=";"></td><td style=";">Department of Education, Employment and Workplace Relations DEEWR FED</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td>
<td style=";">xxx@deewr.gov.au</td>
<td style=";">02 1234 5678</td><td style=";"></td><td style="text-align: right;;"></td><td style=";">http://www.deewr.gov.au</td><td style=";">Gpo Box</td><td style="text-align: right;;"></td><td style=";">Canberra</td><td style=";">Australian Capital Territory</td><td style="text-align: right;;">2601</td><td style=";">Australia</td></tr></tbody></table><p style="width:4.8em;font-weight:bold;margin:0;padding:0.2em 0.6em 0.2em 0.5em;border: 1px solid #BBB;border-top:none;text-align: center;background-color: #DAE7F5;color: #161120">Contacts</p><br /><br />
You can see that the surname and email address are the same in both rows, but the first row has the title and the second row has the address. I'd like to merge the two rows into one, so the end result would be:
<b>Excel 2010</b><table cellpadding="2.5px" rules="all" style=";background-color: #FFFFFF;border: 1px solid;border-collapse: collapse; border-color: #BBB"><colgroup><col width="25px" style="background-color: #DAE7F5" /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /><col /></colgroup><thead><tr style=" background-color: #DAE7F5;text-align: center;color: #161120">
<th>A</th><th>B</th><th>C</th><th>D</th><th>E</th><th>F</th><th>G</th><th>H</th><th>I</th><th>J</th><th>K</th><th>L</th><th>M</th><th>N</th><th>O</th><th>P</th><th>Q</th><th>R</th><th>S</th><th>T</th><th>U</th><th>V</th></tr></thead><tbody><tr ><td style="color: #161120;text-align: center;">1</td><td style="background-color: #00B0F0;;">Source</td><td style=";">Additional Info</td><td style="background-color: #00B0F0;;">First Name</td><td style="background-color: #00B0F0;;">Last Name</td><td style=";">Type</td><td style="background-color: #00B0F0;;">Title</td><td style=";">Company</td><td style=";">Tags</td><td style="background-color: #00B0F0;;">Notes</td><td style="background-color: #00B0F0;;">Id</td><td style="background-color: #00B0F0;;">Primary Email</td><td style="background-color: #00B0F0;;">PrimaryPhone</td><td style="background-color: #00B0F0;;">PrimaryCell</td><td style="background-color: #00B0F0;;">PrimaryFax</td><td style="background-color: #00B0F0;;">Primary Website</td><td style="background-color: #00B0F0;;">Primary Address_1</td><td style="background-color: #00B0F0;;">Primary Address_2</td><td style="background-color: #00B0F0;;">Primary City</td><td style="background-color: #00B0F0;;">Primary State</td><td style="background-color: #00B0F0;;">Primary Zip_code</td><td style="background-color: #00B0F0;;">Primary Country</td></tr><tr ><td style="color: #161120;text-align: center;">39173</td><td style=";">VACD NPRG</td><td style="text-align: right;;"></td><td style=";">Juanita</td><td style=";">Caddy</td><td style="text-align: right;;"></td><td style=";">ISC Funding and Stakeholder Engagement</td><td style=";">Department of Education, Employment and Workplace Relations DEEWR FED</td><td style="text-align: right;;"></td><td style="text-align: right;;"></td><td style="text-align: right;;"></td>
<td style=";">xxx@deewr.gov.au</td>
<td style=";">02 1234 5678</td><td style=";"></td><td style="text-align: right;;"></td><td style=";">http://www.deewr.gov.au</td><td style=";">Gpo Box</td><td style="text-align: right;;"></td><td style=";">Canberra</td><td style=";">Australian Capital Territory</td><td style="text-align: right;;">2601</td><td style=";">Australia</td></tr></tbody></table><p style="width:4.8em;font-weight:bold;margin:0;padding:0.2em 0.6em 0.2em 0.5em;border: 1px solid #BBB;border-top:none;text-align: center;background-color: #DAE7F5;color: #161120">Contacts</p><br /><br />
Ideally I'd like to specify which columns are used to decide whether an entry is a duplicate. I could start with a strict set (First Name, Last Name, Title, Company, etc.) and, as duplicates are found, merged and removed, progressively reduce the columns required for a match. Hopefully that will help me manage the data quality. I hope that makes sense.
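The merge logic described above can be sketched roughly as follows. This is only an illustration in Python, not an Excel formula or macro. The function name `merge_duplicates` and the sample rows are hypothetical, and two rules are assumptions rather than anything stated in the post: a later non-blank cell overwrites an earlier one (which reproduces the expected row above, where Source ends up as "VACD NPRG"), and a row with a blank key column is never treated as a duplicate.

```python
def merge_duplicates(rows, key_columns):
    """Merge rows that share the same values in key_columns.

    rows: list of dicts mapping column name -> cell text.
    Assumption: later non-blank cells overwrite earlier ones.
    """
    merged = {}  # dicts keep insertion order, so output follows first appearance
    for index, row in enumerate(rows):
        # Normalise key cells so stray spaces or letter case do not block a match.
        key = tuple((row.get(col) or "").strip().casefold() for col in key_columns)
        if not all(key):
            # Assumption: a blank key cell is not evidence of a duplicate,
            # so give this row a key of its own.
            key = ("__unmatched__", index)
        target = merged.setdefault(key, {})
        for column, value in row.items():
            text = (value or "").strip()
            if text or column not in target:
                target[column] = text
    return list(merged.values())


# Hypothetical, cut-down version of the two sample rows.
rows = [
    {"Source": "Dance NPRG", "First Name": "Juanita", "Last Name": "Caddy",
     "Title": "ISC Funding and Stakeholder Engagement",
     "Primary Email": "xxx@deewr.gov.au", "Primary City": ""},
    {"Source": "VACD NPRG", "First Name": "Juanita", "Last Name": "Caddy",
     "Title": "", "Primary Email": "xxx@deewr.gov.au", "Primary City": "Canberra"},
]

# Loose key: surname and email only, so the two rows collapse into one.
result = merge_duplicates(rows, ["Last Name", "Primary Email"])

# Strict key: Title is blank in the second row, so nothing is merged.
strict = merge_duplicates(rows, ["First Name", "Last Name", "Title"])
```

Running it first with a strict key list and then with progressively shorter lists matches the workflow described above: each pass merges only the rows that agree on every listed column.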
Thanks for your patience.