Back
Excel

Reading Excel Files from Linux

Today’s author, Chris Rae, a Program Manager on the Excel team, talks about using Perl code to read Excel’s new file format.

The OOXML Format

 

As most readers will know, we spent a lot of time during the development of Office 2007 in creating and documenting a new XML-based file format (Office Open XML) to replace the much more complex binary formats the various Office applications have used for the last twenty years. A lot of time. Many of us dream in XML now. OOXML encompasses an XML file format for each of the largest Office applications, these formats being WordprocessingML, PresentationML and SpreadsheetML. It became an ECMA standard (ECMA-376) in 2006 and was approved with modifications as an ISO/IEC standard (IS 29500) in early 2008, though it has yet to be released by ISO/IEC. I think it would be fair to say that this Office feature has had more press coverage, both good and bad, than all other Office 2007 features combined. Political positions aside, one thing that can be said for definite is that the move made Office files easier to understand, and has created new ways for our customers to work with their data. In this post I’m going to walk through the solution to an imaginary customer’s interoperability problem, but first I’d like to cover a couple of OOXML basics.

 

Reading Material

 

The standardisation process means that the file format itself now belongs to ECMA and ISO/IEC, rather than to Microsoft. All of the documentation, therefore, lives on ECMA International’s web site (and will live on the ISO/IEC site once they have completed their editing of IS 29500). These documents are a huge help to anyone interested in cracking open their spreadsheet files – the documents are really quite readable and, hey, they’re free. The main ECMA-376 page is at http://www.ecma-international.org/publications/standards/Ecma-376.htm – the documents are available for download here as DOCXs or PDFs. There’s a lot in these file formats, so don’t click “print” without first observing the page count. Aside from the format documentation itself there’s also a good XML in Office Developers’ Portal on MSDN.

Several parts of the OOXML standard are going to be particularly useful for looking at spreadsheet files. These are:

  • Parts 1 and 2 (how the OOXML format generally fits together)
  • Part 3 (how the various parts of SpreadsheetML interrelate)
  • Part 4 (detailed information about specific elements and attributes)

I’ll make reference occasionally during this article to these parts.

 

OOXML Files and Excel

 

Any Excel file with the extension “xlsx”, “xlsm” or “xlam” is an OOXML file. It’s the default file format for Office 2007 and can be saved by Office 2003 or Office XP if you download and install the Compatibility Pack (http://office.microsoft.com/en-us/products/HA101686761033.aspx). An OOXML file (or “package”) is simply a zip archive – if you don’t believe me, rename one to .zip and unpack it. Inside the zip file are some .xml files (inside which the worksheets, charts et cetera are defined), some binary files to hold things that don’t lend themselves to being stored in XML (such as images) and some .rels files which define the relationships between them all. The packaging model is covered in much more detail in Part 2 of ECMA-376, section 8.

 

The Scenario

 

I used to work for a derivatives trading desk, writing spreadsheets. As anyone in investment banking knows, Excel is like the Outlook of the trading floor – traders use it to price absolutely everything; keep the departmental vacation calendar; plan their weddings and name their babies. At the same time, the more prudent parts of the organisation have enormous back-end systems to manage the organisation’s risk and monitor their positions. These systems are almost always on a wide range of platforms – Windows Server, Solaris, Linux – and have been in place for very long periods of time. One thing you can say with a lot of confidence is that these machines don’t have Excel installed.

 

The need for agility in trading means that almost all trades start life in Excel before migrating to whatever risk systems the bank has in place. There is a constant need to take numbers from spreadsheets and port them into back-end risk management systems, and this is done in various different ways today. One popular method is to have a “click here before you go home” macro on your spreadsheet, which connects to the database and uploads trades. This works fine until you end up with different versions of the macro smattered around the place in random trading sheets, or the trader goes on vacation and forgets to tell his stand-in about it, or anyone from audit finds out about it. Another option is to write some code on a centrally-managed machine which will load up all of the trading sheets, recalculate them all and write all the data. This works fine until a VBA bug appears in the middle of the night, or the market data permissions aren’t right, or the IT guy who was supposed to run this is out sick.

 

So let’s say we have a derivatives trader, Sally, who is experimenting with some new trades. In this case, Sally is experimenting merely with holding large quantities of her favourite stocks and keeping her fingers crossed, so she’s probably due to be fired soon. The Office 2003 spreadsheet on which these positions live is an enormous beast – it’s full of buggy VBA macros and littered with sheets and sheets of old data and notes from previous trades. Sally knows that somehow she has to get this into the back-end risk systems but she’s terrified of messing with her pride and joy and doesn’t want the responsibility of having to upload them every day. Her list of positions looks like this:

 

image


 

In a darkened room on the other side of the road, we have our IT guy, Andy, who’s tasked with running the Linux-based risk management and reporting functions. Andy knows that Sally has the Compatibility Pack, which means that her copy of Excel 2003 can save OOXML files. He also memorised the entire ECMA standard a few weeks ago whilst investigating whether it could be turned into a hit Disney musical in order to win a bet. His “OOXML on Ice” idea turned out to be impractical, and he is anxious to prove to his wife that the episode wasn’t yet another example of a boneheaded goose chase when he could have spent the time more effectively wallpapering the bathroom. He comes up with a plan to retrieve Sally’s positions directly from the spreadsheet file. Andy gets Sally to agree to do the following:

  • Save the file as an XLSM instead of an XLS (the M denotes a macro-enabled OOXML file – she needs this in order to keep the macros in her workbook)
  • Create a workbook-level named range, “MyPortfolio”, which will point to a two-cell-wide list of stock ticker symbols and numbers of shares held
  • Save the file on a network share that the Linux machines can see

Andy now sets about writing a Perl script which will run every night on one of his Linux machines to suck the positions out of the spreadsheet and store them in the database. Andy isn’t a very good Perl programmer, and as such some of his code may look suboptimal. He’s about to be fired soon too. Andy is me. I’m going to stop talking about him in the third person now.

 

Excuses

 

This is not intended to be a library for reading OOXML files in Perl. It’s really just intended to highlight how simple reading OOXML files is, and serve as a useful reference to anyone thinking of parsing the files themselves, in any programming language. Perhaps worst of all there’s no error handling at all, and I have my suspicions that the method I used for converting relative paths to absolute ones will not withstand being exposed to direct sunlight. This would all have been a bit worse without the help of Alex Babanov, Excel dev and closet Perl hacker.

 

If you’re working on Windows, there is an excellent .NET SDK for reading and writing OOXML files – if you’re not on Windows, have a look at OpenXMLDeveloper.org. They have sample code in Python, Java and others, and some active forums.

 

Libraries

 

The trickiest parts of deconstructing an OOXML document are reading the zip file and traversing the XML. The Comprehensive Perl Archive Network (http://www.cpan.org) is an excellent resource – on there I found Archive::Zip to dismantle the zip file, and XML::Twig to parse the XML. If you’re going to try and run the code that’s in this article, you’ll need those libraries. Let’s go ahead and declare them, initialise them and set a few constants.

#!/bin/perl -w
# Read stock prices in from spreadsheet. There is no error checking. Sheet names in quotes won’t work too well.

use Archive::Zip qw(:ERROR_CODES :CONSTANTS);
use XML::Twig;

# The workbook file and destination named range
my $zipName = ‘/mnt/sally/pabook.xlsm’;
my $rangeName = ‘MyPortfolio’;

# The two parts we’ll be looking for by type
my $typeOfficeDocument = ‘http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument’;
my $typeSharedStrings = ‘http://schemas.openxmlformats.org/officeDocument/2006/relationships/sharedStrings’;

my $zip = Archive::Zip->new();
$zip->read($zipName);
my $tw1=new XML::Twig();

 

Finding the Defined Name

 

In order to find the defined name, we have to dig into the OOXML document a little.

 

One file that must exist inside an OOXML package is /_rels/.rels. This is an XML file which contains a sort of top-level “contents list” for the package. The .rels file in Sally’s workbook happens to contain:

 

<?xml version=”1.0″ encoding=”UTF-8″ standalone=”yes”?>
<Relationships xmlns
=”http://schemas.openxmlformats.org/package/2006/relationships”>
    <Relationship Id=”rId3″ Type=”http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties” Target
=”docProps/app.xml”/>
    <Relationship Id=”rId2″ Type=”http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties” Target
=”docProps/core.xml”/>
    <Relationship Id=”rId1″ Type=”http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument” Target
=”xl/workbook.xml”/>
</Relationships>

 

For each item in the contents list, there is a relationship ID number, a type and a target. The target field points to its location in the zip file. In our package, the most interesting part is xl/workbook.xml, which we can discover from Part 1 of ECMA-376, section 8.4 is an Office document. The beauty of the XML file format, of course, is that we could equally easily have just guessed that from the type. We can go ahead and find the file which represents this part from the .rels file, then load the XML in.

# Find the main rels file and load into a Twig structure
my $strRootRels = $zip->contents(‘_rels/.rels’);
my $xmlRootRels = $tw1->parse($strRootRels);

# Find the workbook part by checking for the correct schema definition
my $fileWorkbook = ($xmlRootRels->root->get_xpath(“/Relationships/Relationship[@Type='$typeOfficeDocument']“))[0]->att(‘Target’);

# Read the workbook part into a Twig structure
my $strWorkbook = $zip->contents($fileWorkbook);
my $xmlWorkbook = $tw1->parse($strWorkbook);

The workbook.xml file contains information about the spreadsheet at a workbook level. Part 3 of ECMA-376 contains an excellent introduction to the content of this and each of the other parts in a spreadsheet. Sally’s workbook.xml contains the following:

 

<?xml version=”1.0″ encoding=”UTF-8″ standalone=”yes”?>
<workbook xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main” xmlns:r=”http://schemas.openxmlformats.org/officeDocument/2006/relationships”>
    <fileVersion appName=”xl” lastEdited=”4″ lowestEdited=”4″ rupBuild=”4505″/>
    <workbookPr codeName=”ThisWorkbook” defaultThemeVersion=”124226″/>
    <bookViews>
        <workbookView xWindow=”120″ yWindow=”105″ windowWidth=”15255″ windowHeight=”14790″/>
    </bookViews>
    <sheets>
        <sheet name=”Output” sheetId=”1″ r:id=”rId1″/>
        <sheet name=”Sheet2″ sheetId=”2″ r:id=”rId2″/>
        <sheet name=”Sheet3″ sheetId=”3″ r:id=”rId3″/>
    </sheets>
    <definedNames>
        <definedName name=”MyPortfolio”>Output!$B$2:$C$5</definedName>
        <definedName name=”othername”>Output!$E$17</definedName>
    </definedNames>
    <calcPr calcId=”125725″/>
</workbook>

 

Here we see the defined name that we’re after – MyPortfolio.

 

Finding The Source Data

 

Let’s go ahead and read in the defined name:

# Find the defined name we wanted. Will look something like SheetName!$A1:B10
my $wholeRef = ($xmlWorkbook->root->get_xpath(“/workbook/definedNames/definedName[@name='$rangeName']“))[0]->text;

From Sally’s file we now have the string “Output!$B$2:$C$5″. Because we’re going to loop through these cells top-left to bottom right, let’s separate it into variables for the sheet name, first column, first row, last column and last row. With Perl regex one can do this very elegantly.

my ($sheetName,$tlCol,$tlRow,$brCol,$brRow) = $wholeRef =~ /^(.+)!$?([A-Z]+)$?(d+):$?([A-Z]+)$?(d+)$/;

Now we need to find the XML file which contains the data for worksheet Output. To do this, we need to look up the relID of the sheet (“rId1″, in this case) inside the .rels file for the workbook (which details all of the relationships that the workbook.xml file has to other parts of the package). If any part has relationships to others, these are stored in a file which has “.rels” appended to the name of the part, and in a subfolder entitled “_rels” (much more about this in Part 2 of ECMA-376, section 8.3.4). Let’s go ahead and load that rels file in, and find the target of relationship rId1.

# Find the sheet to which it refers
# Get the relID for the sheet
my $relIdWorksheet = ($xmlWorkbook->root->get_xpath(“/workbook/sheets/sheet[@name='$sheetName']“))[0]->att(‘r:id’);

# Get the relId of the sheet from the workbook rels file
my $fileWorkbookRels = “$fileWorkbook.rels”;
$fileWorkbookRels =~ s#(/[^/]+)$#/_rels1#;
my $strWorkbookRels = $zip->contents($fileWorkbookRels);
my $xmlWorkbookRels = $tw1->parse($strWorkbookRels);

# Get the sheet filename by looking up the relationship ID
my $fileWorksheet = ($xmlWorkbookRels->root->get_xpath(“/Relationships/Relationship[@Id='$relIdWorksheet']“))[0]->att(‘Target’);

 

This tells us that relationship rId1 points to file worksheets/sheet1.xml. This is where we’re going to retrieve the cell data from. The path is a relative one when compared to the workbook.xml file, so I’ll just append the workbook XML path to the filename, then go ahead and load the XML for the sheet. As I mentioned earlier, my treatment of relative paths here is a little suspicious and will be unhappy if there are any “.” or “..” folders in there. It works fine for our purposes here, though.

my $workbookPath = $fileWorkbook;
$workbookPath =~ s/[^/]+$//;
# Change the sheet path from relative to absolute
$fileWorksheet = $workbookPath . $fileWorksheet;
# And load in the XML
my $tw2=new XML::Twig();
my $strWorksheet = $zip->contents($fileWorksheet);
my $xmlWorksheet = $tw2->parse($strWorksheet);

 

Reading the Cell Data

 

If we look at xl/worksheets/sheet1.xml, it contains something like this (I’ve cut it down a little by removing a few cells):

 

<?xml version=”1.0″ encoding=”UTF-8″ standalone=”yes”?>
<worksheet xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main” xmlns:r=”http://schemas.openxmlformats.org/officeDocument/2006/relationships”>
    <sheetPr codeName=”Sheet1″/>
    <dimension ref=”B2:H5″/>
    <sheetViews>
        <sheetView tabSelected=”1″ workbookViewId=”0″>
            <selection activeCell=”B3″ sqref=”B3″/>
        </sheetView>
    </sheetViews>
    <sheetFormatPr defaultRowHeight=”15″/>
    <cols>
        <col min=”3″ max=”3″ width=”11.5703125″ customWidth=”1″/>
        <col min=”7″ max=”7″ width=”9.42578125″ customWidth=”1″/>
    </cols>
    <sheetData>

 


 

        <row r=”3″ spans=”2:8″>
            <c r=”B3″ t=”s”>
                <v>2</v>
            </c>
            <c r=”C3″>
                <v>85</v>
            </c>
            <c r=”G3″ t=”s”>
                <v>7</v>
            </c>
            <c r=”H3″>
                <v>4</v>
            </c>
        </row>

 

 

    </sheetData>
    <pageMargins left=”0.7″ right=”0.7″ top=”0.75″ bottom=”0.75″ header=”0.3″ footer=”0.3″/>
    <pageSetup orientation=”portrait” r:id=”rId1″/>
</worksheet>

 

We can see pretty clearly from here where Sally’s share positions are – the 85 in cell C3 matches up nicely with our spreadsheet. However, B3 doesn’t bear any resemblance to “MSFT”. Looking at the definition of the c element in Part 4 of ECMA-376, section 3.3.1.3, we can see that the t attribute specifies the type of the cell contents – it’s an optional attribute which defaults to “n” (number). In the definition of the t attribute (section 3.18.12), we can see that “s” denotes “Shared string”. We can then see from section 3.4 that workbooks can contain a table of shared strings, in order to save memory space and speed up load times. There’s a much more detailed description of the Shared String Table in there too.

 

When cells contain shared string data, the value in the v element is an index into the shared string table. To determine the location of this shared string table, we can parse the workbook’s rels file for a relationship of type sharedStrings and find its target. Let’s do that and then load the whole SST into a Twig structure so we can use it while reading data out of the cells.

 

# And while we’re at it, load the shared string table because we need that
# Find the name of the SST
my $fileSharedStrings = ($xmlRootRels->root->get_xpath(“/Relationships/Relationship[@Type='$typeSharedStrings']“))[0]->att(‘Target’);

# Load it. Need a new Twig, for reasons beyond my ken. All hail programming by coincidence
$fileSharedStrings = $workbookPath . $fileSharedStrings;
my $tw3=new XML::Twig();
my $strSharedStrings = $zip->contents($fileSharedStrings);
my $xmlSharedStrings = $tw3->parse($strSharedStrings);

 

We’re almost done. We now have the location of our data, and the data itself. We now have to just loop through the rows and columns of our reference, and display the output. When we come across a cell value which is from the shared string table, we can index into it to find the string. My choice of looping method here doesn’t handle columns after ‘Z’ – this is in the interests of keeping the code readable, rather than a sneak preview of an Office 14 feature.

 

I should point out here that the r attribute of the c element (the one which contains the Excel-style reference) is not obligatory in OOXML. It’s quite possible to have a spreadsheet file without these attributes, although obviously the data will all be left-aligned. Excel reads files like that fine (making the creation of tabular SpreadsheetML files a little simpler for developers) but will always write the r attribute. Because Excel is the app we’re using in my example scenario, I’m going to assume that these tags are there. Reading OOXML files without those tags isn’t tough, I’m just trying to keep the code as short as I can.

# And now start to load the data, starting at the top left. Some more programming
# required if user wants, perhaps not unreasonably, to use columns after ‘Z’
foreach my $y ($tlRow..$brRow) {
       foreach my $x ($tlCol..$brCol) {
              my $val = ($xmlWorksheet->root->get_xpath(“/worksheet/sheetData/row/c[@r='$x$y']“))[0];
              my $type = $val->att(‘t’);
              if ((defined $type) && ($type eq ‘s’)) {
                     #It’s a shared string, so load it from the shared string table
                     my $ss = ($xmlSharedStrings->root->children(‘si’))[$val->text]->first_child(‘t’)->text;
                     print $ss;
                     }
              else {
                     print $val->text;
              }
              print ‘,’ unless $x eq $brCol; # OK, so I admit it… Alex told me I could do that
       }
       print “n”;
}

 

And finally, here’s a screenshot of the final output:

image

Obviously we’re not going to display this on the screen, we’re much more interested in uploading it to our risk management database, but it’s getting late now.

 

Although Office 2007 contained an unusually large number of new features and improvements, I believe that the introduction of the XML file formats will prove to be one of its most important elements. Even for Windows-only organisations, the XML formats are worlds apart from the binary formats in terms of readability and over the next few years I suspect we’ll start to see interesting and innovative ways of using Office documents across different platforms and systems. I didn’t cover it here but, of course, these XML packages can be written to as easily as they can be read, and the same goes for Word and Powerpoint files. And, of course, you can use them on any device which can be made to understand zip and XML so mobile phones, thirty-year-old computers and refrigerators are all well within reach. Because the Compatibility Pack enables OOXML in Office 2003, you can probably start right now.