greenMamba
New Member
- Joined
- Sep 24, 2015
- Messages
- 1
I'm trying to find out how to properly strip Excel styles out of a workbook. I am using this open source project as a starting point but I'm not sure what' s going wrong here. https://excelfilecleaner.codeplex.com/
Whenever I process certain workbooks through it I open them and get an error that says there is "Unreadable content" contained in the workbook and it asks if I want to attempt to recover it. I click "yes" and the document appears to recover without an issue. When it recovers it gives me a log saying it Removed Records: Style from xl/styles.xml part (Styles)I have tried running through the code to look for any discrepancies in how it is handling the workbook, but can't find any. I have extracted and compared "styles.xml" from the same workbook in its original format, cleaned format, and repaired format, and the only discrepancies I can find are in the number of fonts that are still left in the styles.xml file. The "Cleaned" workbook contains the same as the "Original", but the "Repaired" workbook appears to have much less. Also, in the "Cleaned" workbook there appears to be a "x:" pre-fix before everything. Does anyone have any ideas why this could be happening? Am I missing something on the anatomy of these workbooks?
This is the class for cleaning the workbooks - it's written in c#
Whenever I process certain workbooks through it I open them and get an error that says there is "Unreadable content" contained in the workbook and it asks if I want to attempt to recover it. I click "yes" and the document appears to recover without an issue. When it recovers it gives me a log saying it Removed Records: Style from xl/styles.xml part (Styles)I have tried running through the code to look for any discrepancies in how it is handling the workbook, but can't find any. I have extracted and compared "styles.xml" from the same workbook in its original format, cleaned format, and repaired format, and the only discrepancies I can find are in the number of fonts that are still left in the styles.xml file. The "Cleaned" workbook contains the same as the "Original", but the "Repaired" workbook appears to have much less. Also, in the "Cleaned" workbook there appears to be a "x:" pre-fix before everything. Does anyone have any ideas why this could be happening? Am I missing something on the anatomy of these workbooks?
This is the class for cleaning the workbooks - it's written in c#
Code:
private void CleanStyles(Stylesheet stylesheet)
{
List<CellFormat> cfList = new List<CellFormat>();
// build a list of cell formats
// NOTE: In the CellStyle namespace there is a FormatId
// the CellStyleFormats are actually an array
// of styles and the FormatId in the CellStyle
// references the index position of the
// CellStyleFormat. So what we have to do is
// load all these into an List and then using
// the formatID of the cell style get reference
// to the related CellStyleFormat
Log("Reviewing styles.");
foreach (CellFormat cf in stylesheet.CellStyleFormats)
cfList.Add(cf);
// now get a list of the named Styles
Dictionary<int,styleCount> styleList = new Dictionary<int,styleCount>();
foreach (CellStyle cs in stylesheet.CellStyles)
{
try
{
// add all styles and the cellformats
int formatId = int.Parse(cs.FormatId);
styleList.Add(formatId, new styleCount(cs, cfList[formatId], !(cs.BuiltinId == null)));
}
catch { } // ignore
}
Log("There are " + styleList.Count.ToString() + " styles in the workbook.");
// now that we have aligned our cellstyleformats with cellformats
// we need to delete everyone of them that are not related
// to a specific CellFormat.
int counter = 0;
foreach (CellFormat xf in stylesheet.CellFormats)
{
try
{
int formatId = int.Parse(xf.FormatId);
if (styleList.Keys.Contains(formatId))
{
styleList[formatId].inUse = true;
styleList[formatId].Found = true;
styleList[formatId].cellFormat = xf;
counter++;
}
}
catch { } // ignore
}
Log("There are " + counter.ToString() + " styles in use.");
Log("Cleaning unused styles:");
// now loop though the stylelist and delete anything that
// is not in use and then log it
foreach (KeyValuePair<int,styleCount> c in styleList)
{
// only those not in use and those that are not built in
if (c.Value.Found == false && c.Value.builtIn == false)
{
Log("Cleaning style: " + c.Value.cellStyle.Name);
c.Value.cellStyleFormat.Remove();
c.Value.cellStyle.Remove();
}
}
// clean the list in reverse - deleting styles that were
// removed in the operation above
for (int i = (styleList.Count - 1); i >= 0; i--)
{
int keyVal = styleList.Keys.ToList<int>()[i];
if (styleList[keyVal].inUse == false && styleList[keyVal].builtIn == false)
styleList.Remove(keyVal);
}
Log("Updating style indexes.");
// now correct all the indexes
foreach (KeyValuePair<int, styleCount> c in styleList)
{
int idx = 0; // keep track of the index
// now we loop though the collection of the
// cellstyleformats (xsf) remaining in the
// stylesheet. If the cellstyleformat is the
// same cellstyleformat we are referencing in
// our collection via styleList, then we will
// update both the cellStyle xfId and the
// cellFormat xfId in the stylesheet to
// reference the new index location...
foreach (CellFormat xsf in stylesheet.CellStyleFormats)
{
// CellStyleFormat (refersTo>) CellFormat
// CellStyle (refersTo>) CellFormat
if (c.Value.cellStyleFormat == xsf)
{
if (c.Value.Found) // only if matched to a cellformat
{
// we have a match... index them
c.Value.cellStyle.FormatId.Value = (uint)idx;
c.Value.cellFormat.FormatId.Value = (uint)idx;
}
break;
}
idx++;
}
}
// update counts
stylesheet.CellStyles.Count.Value = (uint)styleList.Count;
stylesheet.CellStyleFormats.Count.Value = (uint)styleList.Count;
// clean
stylesheet.Save();
stylesheet = null;
Log(DIVIDER);
}