Excel 2010 and High Performance Computing

Thanks to Gabhan Berry for putting together this series on high performance computing.

In this post, I introduce the work we’ve done with the Windows HPC team to integrate Excel 2010 with cluster computing. This is a high-level overview, and I’ll follow up with the technical details in later posts.

High Level Overview of Excel 2010 and Cluster Computing

High Performance Computing (HPC) is a branch of technical computing that uses supercomputers and cluster computing to solve computationally intensive problems. Cluster computing involves utilizing the physical resources of multiple computers in order to perform a computationally intensive task. In effect, cluster computing enables you to create a virtual supercomputer with hundreds or even thousands of CPU cores. You’re probably sitting in an office building right now; imagine if you could hook up every PC in the building and use them as if they were one massive computer at your personal disposal. That’s what cluster computing is.

In this release of Excel, we’ve spent some time partnering with the Windows HPC team to enable Excel 2010 to integrate with cluster computing. This is a very exciting new direction for Excel. Just imagine being able to combine the flexibility and utility of Excel’s modeling capabilities with the processing power of hundreds or even thousands of servers.

We’ve done two pieces of work to integrate Excel with cluster computing. Firstly, we’ve enabled Excel, running on a desktop PC, to offload the evaluation of certain types of user-defined functions (UDFs) to a compatible cluster. Offloading the evaluation of UDFs enables Excel to continue calculating other parts of the workbook while the cluster calculates the UDFs. Also, the calculation of the UDFs is distributed across the cluster. This can reduce the overall calculation time of a workbook, especially if the UDFs take a long time to run.

Secondly, we’ve enabled entire workbooks to calculate on the cluster. In this scenario, we have a model implemented in a workbook and we want to recalculate the model many thousands of times. When calculating on the cluster, the calculations of the model are distributed across all the servers in the cluster, meaning that we make use of the hundreds or thousands of CPU cores on the cluster to perform batches of calculations in parallel. This parallelization greatly reduces the time needed to perform all of the calculations of the model. Additionally, we can use VBA in our model if we want to (though there are restrictions on what the VBA can do on the cluster).
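As a rough illustration of the batching idea (this is not the Excel or Windows HPC API), the Python sketch below splits a number of model recalculations into batches and hands each batch to a worker. The thread pool merely stands in for a cluster, and `run_model`, `split_into_batches`, `run_batch` and `run_on_workers` are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def run_model(seed):
    """Hypothetical stand-in for one recalculation of a workbook model."""
    rng = random.Random(seed)
    return rng.gauss(100.0, 15.0)


def split_into_batches(total, batch_count):
    """Split range(total) into batch_count contiguous, near-equal ranges."""
    base, extra = divmod(total, batch_count)
    batches, start = [], 0
    for i in range(batch_count):
        size = base + (1 if i < extra else 0)
        batches.append(range(start, start + size))
        start += size
    return batches


def run_batch(seeds):
    """Run every recalculation in one batch and return the results."""
    return [run_model(seed) for seed in seeds]


def run_on_workers(total, workers):
    """Distribute the batches across workers and gather results in order."""
    batches = split_into_batches(total, workers)
    results = []
    # The thread pool is only a local stand-in for the servers of a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch_result in pool.map(run_batch, batches):
            results.extend(batch_result)
    return results
```

Each recalculation is independent of the others, which is what makes it safe to run the batches side by side and simply concatenate the results afterwards.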

The key concept behind both of these features is parallelization. Whether we are offloading UDFs to the cluster or calculating models on the cluster, we realize a performance increase compared to the desktop because many parts of the overall task execute in parallel with one another. The individual computations are not (necessarily) happening faster. The cluster simply has the capability to do lots of computations at the same time. Why is the cluster capable of doing this? Because it has many more CPU cores than a desktop PC. It is not unusual for a cluster to have thousands of CPU cores, enabling it to do thousands of computations at the same time.

An everyday example of parallelization at work can be seen at the checkouts in grocery stores. Imagine if grocery stores employed just one checkout operator to serve all customers in the store. Because the operator can serve just one customer at a time, if multiple customers want to be served at the same time they have to form a queue and wait their turn. To avoid this, grocery stores employ multiple checkout operators who each serve customers independently of and in parallel with every other checkout operator. As customers arrive at the checkouts, they can be served by any of the operators and typically go to the operator that is least busy. If you understand this explanation, then you pretty much understand how Excel users benefit from cluster computing.
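The checkout analogy can be put into numbers with a small Python sketch. It assumes, purely for illustration, that all customers are already waiting when the checkouts open and that each customer goes to whichever operator becomes free first; `total_checkout_time` is a hypothetical helper, not part of any Excel or HPC product.

```python
import heapq


def total_checkout_time(service_times, operators):
    """Return the time at which the last customer finishes.

    Customers are served in arrival order, each going to whichever
    operator becomes free first.
    """
    # Each heap entry is the time at which one operator becomes free.
    free_at = [0.0] * operators
    heapq.heapify(free_at)
    finish = 0.0
    for service in service_times:
        start = heapq.heappop(free_at)   # earliest available operator
        end = start + service            # service itself is no faster
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


customers = [3.0] * 12  # twelve customers, three minutes each
print(total_checkout_time(customers, 1))  # 36.0
print(total_checkout_time(customers, 4))  # 9.0
```

Every customer still takes three minutes to serve; the total drops from 36 minutes to 9 only because four customers are served at once.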

What Kinds of Problems Can I Solve with Excel and Cluster Computing?

Cluster computing was originally envisioned to solve big problems; really big problems with lots and lots of data. Solving the Grand Challenges such as computational fluid dynamics and symbolic calculations (e.g., weather forecasting, aerospace engineering, automated reasoning, natural language comprehension) was what the technology was originally intended to do. But the utility and power of cluster computing, together with its relatively low cost, enable businesses and scientists to use it to tackle just about any computationally intensive mathematical problem they can think of. And that’s where Excel comes in.

Excel is a great data analysis and modeling tool. Whether you’re studying the fluid dynamics of air masses, the Brownian motion of particles or modeling customer buying habits, Excel can be used to help solve the problem. When I think about the types of problems that Excel and HPC can be used to solve, I group them into the following three categories.

Mathematical Analysis Problems

These types of problems typically involve performing some set of mathematical tests on a large set of data. For example, finding correlations between huge arrays of data or fitting a large data set to a theoretical distribution.

Data Processing Problems

Corporations and scientists are creating lots of data these days and, like most data, these data need to be cleaned according to certain rules. Typically, interesting models and analyses cannot be built until the input data have been cleaned. The faster we can process the data and get them ready for analysis, the more time we have for analyzing them.

Monte Carlo Simulations

Modeling uncertainty and randomness is becoming commonplace in both scientific and business domains. Monte Carlo simulations are so versatile that they can be used to model a wide variety of problems in many different domains, from financial services to particle physics. The number of times a Monte Carlo model is calculated can be very large. For example, let’s say that your model has 4 input parameters and you want to simulate 100 values for each parameter. Covering every combination of those values means you need to perform 100^4 = 100,000,000 calculations of your model. And that is a fairly modest simulation. Some simulations have hundreds of parameters.
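The arithmetic above is easy to check. The Python sketch below counts the model calculations needed when every combination of parameter values is evaluated; `scenario_count` is a hypothetical helper written for this example, and the tiny two-parameter grid is made-up data used only to show what "every combination" means.

```python
from itertools import product
from math import prod


def scenario_count(values_per_parameter):
    """Number of model calculations needed to cover every combination."""
    return prod(values_per_parameter)


# Four parameters with 100 simulated values each, as in the text.
print(scenario_count([100, 100, 100, 100]))  # 100000000

# A tiny grid makes the combinations visible: 2 parameters x 3 values each.
rates = [0.01, 0.02, 0.03]
volumes = [10, 20, 30]
combinations = list(product(rates, volumes))
print(len(combinations))  # 9
```

Each added parameter multiplies the count by the number of values simulated for it, which is why the workload grows so quickly and why spreading it across many cores pays off.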

Forthcoming Posts on Excel and High Performance Computing…

This has been an introductory, high-level overview of Excel 2010’s integration with cluster computing. My goal here was to introduce the features and to explain a little about how they work and what problems they can be used to solve. In coming posts, I’ll get into the technical details of how these features work.