
Calculating Excel Models on an HPC Cluster

Thanks to Gabhan Berry for putting together this series on high-performance computing.

In a previous post, I introduced the work we have done with the Windows HPC team to enable customers to calculate Excel models on a Windows HPC cluster. In this follow-up post, I go a little deeper into how we can harness the utility and power of Excel 2010 and Windows HPC to solve complex modeling problems.

Modeling Complex Systems using Excel 2010 and Windows HPC

Mathematically modeling complex problems is far more mainstream than it once was, and Excel has played a major role in making that happen. From linear programming to statistical models, customers use Excel to implement many different modeling techniques. But there is one particular type of modeling I want to talk about in this post: Monte Carlo simulations.

Let’s say that we work for the local government and we want to calculate when we will likely have to perform expensive maintenance work on a road bridge in a nearby town. We know that the rate of fatigue of the bridge is a function of several variables: average daily traffic density, average daily air temperature, and average material durability are just a few that come to mind. We also know that each of these variables behaves randomly. That is, we cannot predict their exact values, but we do know what range of values they will take, and we know that some values are more likely than others. Because of the non-deterministic nature of the system we’re modeling, one way of solving our problem is to build a stochastic model and use a Monte Carlo simulation to help us figure out when we’ll most likely need to repair the bridge.

How would we use Excel 2010 and Windows HPC to do that?

First of all, we need to create a mathematical model for the traffic flow over the bridge. We won’t actually build the mathematical model in this post, but we will think about what it would look like because understanding the overall structure of the model is important to understanding the solution.

Our model would really consist of a multi-variable equation. When supplied with values for all the variables, the equation calculates the number of months we should expect the bridge to remain trouble free. The variables of this equation are the different factors that we think affect the lifetime of the bridge. We already know some of these: average daily traffic density, average daily air temperature, material durability, and so on. By the way, this is a gross simplification of what the model would really be, but that’s OK. In this post, we just need to understand the general idea.
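To make this concrete, here is a minimal sketch of what such an equation might look like, written as a Python function. Everything about it is invented for illustration: the coefficients, the functional form, and the variable ranges are assumptions, not a real fatigue model, which would come from engineering analysis.

```python
# A purely illustrative stand-in for the bridge-fatigue model. The
# coefficients and functional form are all invented for this example;
# a real model would come from engineering analysis.
def months_trouble_free(traffic_density, air_temp, durability):
    """Estimated trouble-free months of the bridge.

    traffic_density: average daily traffic density (cars per minute)
    air_temp:        average daily air temperature (degrees Celsius)
    durability:      average material durability score (0.0 to 1.0)
    """
    base_life = 240.0  # 20 years with no stress at all (assumed)
    wear = traffic_density * 1.5 + max(air_temp - 15.0, 0.0) * 0.8
    return max(base_life * durability - wear, 0.0)
```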

Let’s now think about the variables (parameters) of our model. First of all, we know that each of them has some amount of randomness. That is, a variable doesn’t have one correct value; instead, its value follows a particular probability distribution. So we can’t say that the daily average traffic density will be exactly 11 cars per minute, for example; a single precise figure like that would be unrealistic. But we can say that the daily average traffic density will be some value between (for example) 3 cars per minute and 15 cars per minute, and that certain values in that range are more probable than others (this is what the probability distribution defines). We can say the same for the other variables of our model.
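Continuing the sketch, we can sample those inputs from probability distributions. The specific distributions and parameters below are assumptions chosen only to match the ranges in the text; a triangular distribution is one simple way to say “between 3 and 15 cars per minute, with values near the middle most likely”.

```python
import random

# Hypothetical input distributions, chosen only to illustrate the idea.
# The mode of 9 cars per minute and the temperature and durability
# parameters are assumptions, not measured data.
def sample_inputs():
    traffic_density = random.triangular(3.0, 15.0, 9.0)  # cars per minute
    air_temp = random.gauss(12.0, 4.0)                   # degrees Celsius
    durability = random.triangular(0.7, 1.0, 0.9)        # unitless score
    return traffic_density, air_temp, durability
```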

By supplying various possible values for the model’s variables, we can come up with lots of different possible outputs of the model. But which output is the right one? Well, we can’t really say which is right, but we can say which is most probable. To figure that out, we need to recalculate our model over and over again, varying the input variables (according to their distributions) and recording the frequencies of the values calculated by the model. These frequencies then enable us to determine which range of values is most likely to occur. This process is called Monte Carlo simulation.
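Putting the pieces together, a serial version of the simulation is just a loop. This sketch reuses the illustrative months_trouble_free() and sample_inputs() functions from above and buckets the results by year:

```python
from collections import Counter

# A serial Monte Carlo driver: recalculate the model over and over
# with freshly sampled inputs and record how often each outcome
# occurs. Reuses the illustrative functions sketched above.
def run_simulation(iterations):
    frequencies = Counter()
    for _ in range(iterations):
        months = months_trouble_free(*sample_inputs())
        frequencies[round(months / 12)] += 1   # bucket the results by year
    return frequencies

if __name__ == "__main__":
    results = run_simulation(100_000)
    year, count = results.most_common(1)[0]
    print(f"Most likely lifetime: about {year} years ({count / 100_000:.1%} of runs)")
```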

It’s obvious that the more times we calculate the model, the longer it will take: 1,000 calculations of the model take longer than 100. And the law of large numbers tells us that more calculations give a more reliable answer; as a rule of thumb, the statistical error of a Monte Carlo estimate shrinks in proportion to 1/√N, so ten times the accuracy costs a hundred times the runs. That would lead us to think that we should calculate the model as many times as we possibly can. However, we can’t wait forever for the answer, so we need to strike a balance between performing lots and lots of calculations and getting a useful answer as fast as we can, and this is where cluster computing comes in.

By utilizing the resources of a cluster we can perform more calculations in a given time period than we can on a desktop computer. Instead of performing the calculations serially one after another, on the cluster we perform the calculations simultaneously. This enables us to perform many calculations without having to wait for long periods of time. The more computers we have in the cluster, the more calculations we can do at the same time.
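The same idea can be demonstrated in miniature on a single machine. Here is a sketch, reusing run_simulation() from above, that spreads batches across local CPU cores with Python’s standard library; a Windows HPC cluster applies the same pattern across many machines rather than many cores.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

# Run independent batches of the simulation at the same time. On a
# desktop the "nodes" are CPU cores; a cluster applies the same
# pattern across many machines. Reuses run_simulation() from above.
def run_batches_in_parallel(total_iterations, batch_count):
    batch_size = total_iterations // batch_count
    with ProcessPoolExecutor() as pool:
        partial_counts = pool.map(run_simulation, [batch_size] * batch_count)
        merged = Counter()
        for partial in partial_counts:
            merged.update(partial)   # frequency counts from batches simply add up
    return merged

if __name__ == "__main__":
    combined = run_batches_in_parallel(1_000_000, batch_count=8)
```

Merging is trivial here because frequency counts are additive, and that additivity is exactly what makes Monte Carlo simulations so easy to parallelize.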


At this stage, you probably have lots of questions about how this really works. How is the model calculated on each of the nodes in the cluster? How do we split up the model amongst the nodes? How do we collate the results to form the final result? Let’s now go through and provide answers to these questions.

Building a Cluster-Enabled Monte Carlo Simulation with Excel 2010

The ability to calculate Excel workbooks on a Windows HPC cluster is exposed through the Windows HPC SOA API. SOA stands for Service-Oriented Architecture, and you can find out more about it here.

So, to implement our simulation on the cluster, we use the SOA API to build a custom application.

Note: The Windows HPC SOA API will likely ship with example Monte Carlo applications that you may be able to take and use in your solution.

A cluster-enabled Monte Carlo simulation application typically performs three tasks. First, it divides all the calculations into smaller batches, each of which can be calculated in parallel. Then it uses the cluster to calculate each batch asynchronously. Finally, it merges the results of every batch and derives the final result of the simulation.
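Structurally, those three tasks might look something like the sketch below. It is written in Python for readability; the actual SOA API is a .NET interface, and submit_batch() here is a hypothetical stand-in for the call that dispatches a batch to the cluster, assumed to return a handle whose result() method blocks until the batch’s counts come back.

```python
from collections import Counter

# Structural sketch of the three tasks. submit_batch is a hypothetical
# stand-in for dispatching a batch through the Windows HPC SOA API
# (in reality a .NET interface); it is assumed to return a handle
# whose result() blocks until the batch's frequency counts arrive.
def cluster_simulation(total_iterations, batch_count, submit_batch):
    # 1. Divide the work into batches that can be calculated in parallel.
    batch_size = total_iterations // batch_count
    batches = [batch_size] * batch_count

    # 2. Ask the cluster to calculate each batch asynchronously.
    handles = [submit_batch(size) for size in batches]

    # 3. Merge the per-batch frequency counts into the final result.
    final = Counter()
    for handle in handles:
        final.update(handle.result())
    return final
```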

Our application is free to divide up the calculations into batches however it wants; the SOA API doesn’t place any restrictions on how we do that, and likewise for merging the results of the calculations. That’s the benefit of having this service exposed as an API: we can integrate the cluster into our custom applications in ways that suit us.

For example, we could create a generic application dedicated to performing Monte Carlo simulations on a Windows HPC cluster. Or we could integrate the cluster as a feature into an existing application. The API gives us quite a lot of flexibility.


Designing for the Cluster

Although it is the desktop version of Excel that runs on the server, we shouldn’t assume that Excel can do on the cluster everything it can do on the desktop. When running on a cluster, Excel is best thought of as a blend of a calculation engine and a large grid of data. We shouldn’t use it as the general-purpose spreadsheet application we use on the desktop, and we shouldn’t expect to take any slow-running workbook and have it magically go faster by running it on the cluster.

Workbooks need to be designed for the cluster. This means that existing workbooks won’t necessarily work on the cluster without modification. In the same way that we typically need to think slightly differently about code that runs on a cluster, we also need to think slightly differently about workbooks that run on the cluster.

When building a workbook for the cluster, we need to make sure that the workbook only uses Excel’s grid and formulas. Interaction-centric features such as PivotTables and charts are not safe to use on the cluster. However, we can create user-defined functions (UDFs) and use them in our workbooks on the cluster. For example, in our traffic flow problem, if we needed to use a UDF in our equation, that would be fine; we would just need to make sure the add-in containing the UDF is installed on each of the computers in the cluster.

A High Performance Computation Engine

By using Excel’s built-in functions and having the flexibility to develop and use custom worksheet functions, we are able to build a wide variety of models in Excel. During our beta testing, we’ve seen customers build models to address various problems in finance, medical research, and agriculture, and we have seen the breathtaking performance that large-scale parallelization brings. With one customer, a simulation that took more than 7 days to calculate on the desktop took less than 2 hours when run on the cluster, a speedup of more than 80-fold. That’s an amazing boost in performance.

One other thing: when Windows HPC runs jobs on the cluster, the applications used in those jobs run within contexts that let the cluster software monitor and manage them. When an Excel job is run, the cluster software can detect whether calculation has been blocked by a popup dialog (a rogue message box, for example). If this happens, the cluster can dismiss the dialog and unblock the calculation. If it cannot unblock the calculation, it can also automatically terminate and reschedule the job.

Many people in industry and in research build mathematical models using Excel. With Excel 2010 and Windows HPC, these people can now achieve much more in less time. I already mentioned the example where we took a simulation from 7 days to less than 2 hours; that particular example was in finance, and we had similarly impressive results in other domains. In agriculture, we saw a model that took 1.5 days on the desktop run in only 3 minutes on the cluster. When summarizing Excel’s integration with Windows HPC, I say that we’ve created a high-performance computation engine. I think that sums things up rather nicely.