Thanks to Gabhan Berry for putting together this series on high performance computing.
In this post, I introduce the work we’ve done with the Windows HPC team to integrate Excel 2010 with cluster computing. This is a high-level overview; I’ll follow up with the technical details in later posts.
High Performance Computing (HPC) is a branch of technical computing that uses supercomputers and cluster computing to solve computationally intensive problems. Cluster computing harnesses the physical resources of multiple computers to perform a single computationally intensive task. In effect, cluster computing lets you create a virtual supercomputer with hundreds or even thousands of CPU cores. You’re probably sitting in an office building right now; imagine if you could hook up every PC in the building and use them as if they were one massive computer at your personal disposal. That’s what cluster computing is.
In this release of Excel, we’ve spent some time partnering with the Windows HPC team to enable Excel 2010 to integrate with cluster computing. This is a very exciting new direction for Excel. Just imagine being able to combine the flexibility and utility of Excel’s modeling capabilities with the processing power of hundreds or even thousands of servers.
We’ve done two pieces of work to integrate Excel with cluster computing. Firstly, we’ve enabled Excel, running on a desktop PC, to offload the evaluation of certain types of user defined functions (UDFs) to a compatible cluster. Offloading the evaluation of UDFs lets Excel continue calculating other parts of the workbook while the cluster calculates the UDFs. In addition, the calculation of the UDFs is distributed across the cluster. This can reduce the overall calculation time of a workbook, especially if the UDFs take a long time to run.
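To make the offloading idea concrete, here is a minimal sketch in Python, with a local process pool standing in for the cluster. The names slow_udf and calc_rest_of_workbook are invented for illustration; this is not the Excel or Windows HPC API, just the shape of the idea: slow work is submitted for remote evaluation while the caller keeps calculating.

```python
# A minimal sketch of the offloading idea, NOT the actual Excel/HPC API.
# A local process pool stands in for the cluster: slow "UDFs" are submitted
# for evaluation elsewhere while the caller keeps calculating other cells.
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def slow_udf(x):
    """Stand-in for a long-running user defined function."""
    time.sleep(1)          # simulate an expensive computation
    return x * x

def calc_rest_of_workbook():
    """Stand-in for the cells Excel can keep calculating locally."""
    return sum(range(1_000_000))

if __name__ == "__main__":
    inputs = [1, 2, 3, 4]
    with ProcessPoolExecutor() as cluster:           # pretend this is the cluster
        futures = {cluster.submit(slow_udf, x): x for x in inputs}
        local_result = calc_rest_of_workbook()       # proceeds in parallel
        udf_results = {futures[f]: f.result() for f in as_completed(futures)}
    print(local_result, udf_results)
```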
Secondly, we’ve enabled entire workbooks to calculate on the cluster. In this scenario, we have a model implemented in a workbook and we want to recalculate the model many thousands of times. When calculating on the cluster, the calculations of the model are distributed across all the servers in the cluster, meaning that we make use of the hundreds or thousands of CPU cores on the cluster to perform batches of calculations in parallel. This parallelization greatly reduces the time needed to perform all of the calculations of the model. Additionally, we can use VBA in our model if we want to (though there are restrictions on what the VBA can do on the cluster).
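The whole-workbook scenario can be sketched the same way: if each recalculation of the model is an independent function of its inputs, batches of runs can be farmed out to workers. Again, this is an illustrative sketch with a made-up model function, not the actual cluster mechanism; the chunksize argument plays the role of the batches of calculations sent to each node.

```python
# A sketch of cluster-style workbook calculation, assuming the model can be
# expressed as a pure function of its inputs. A local process pool stands in
# for the cluster's nodes; chunksize mimics sending batches of runs to a node.
from concurrent.futures import ProcessPoolExecutor

def model(run_id):
    """Stand-in for one recalculation of the workbook with a given input set."""
    return sum((run_id * k) % 97 for k in range(1, 1000))

if __name__ == "__main__":
    runs = range(10_000)                              # thousands of recalcs
    with ProcessPoolExecutor() as cluster:
        results = list(cluster.map(model, runs, chunksize=250))
    print(len(results), results[:3])
```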
The key concept behind both of these features is parallelization. Whether we are offloading UDFs to the cluster or calculating models on the cluster, we see a performance increase compared to the desktop because many parts of the overall task execute in parallel with one another. The individual computations are not (necessarily) happening any faster; the cluster simply has the capability to do lots of computations at the same time. Why is the cluster capable of doing this? Because it has many more CPU cores than a desktop PC. It is not unusual for a cluster to have thousands of CPU cores, enabling it to do thousands of computations at the same time.
An everyday example of parallelization at work can be seen at the checkouts in grocery stores. Imagine if grocery stores employed just one checkout operator to serve all customers in the store. Because the operator can serve just one customer at a time, if multiple customers want to be served at the same time they have to form a queue and wait their turn. To avoid this, grocery stores employ multiple checkout operators, each serving customers independently of, and in parallel with, every other operator. As customers arrive at the checkouts, they can be served by any of the operators and typically go to the one that is least busy. If you understand this explanation, then you pretty much understand how Excel users benefit from cluster computing.
Cluster computing was originally envisioned to solve big problems; really big problems with lots and lots of data. The technology was originally intended to tackle the Grand Challenges, such as computational fluid dynamics and symbolic computation (e.g., weather forecasting, aerospace engineering, automated reasoning, natural language comprehension). But the utility, power, and relatively low cost of cluster computing now enable businesses and scientists to use it to tackle just about any computationally intensive mathematical problem they can think of. And that’s where Excel comes in.
Excel is a great data analysis and modeling tool. Whether you’re studying the fluid dynamics of air masses, the Brownian motion of particles or modeling customer buying habits, Excel can be used to help solve the problem. When I think about the types of problems that Excel and HPC can be used to solve, I group them into the following three categories.
Mathematical Analysis Problems
These types of problems typically involve performing some set of mathematical tests on a large set of data. For example, finding correlations between huge arrays of data or fitting a large data set to a theoretical distribution.
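As a toy illustration of this category (not anything from the Excel feature itself), the sketch below computes pairwise Pearson correlations across several made-up data series, farming the independent pairs out to a worker pool; SERIES and pearson are invented names.

```python
# A toy version of a mathematical analysis problem: pairwise Pearson
# correlations across many data series, with the independent pairs farmed
# out to a worker pool. Series names and sizes are made up for illustration.
import math
import random
from itertools import combinations
from concurrent.futures import ProcessPoolExecutor

random.seed(0)
SERIES = {name: [random.gauss(0, 1) for _ in range(10_000)]
          for name in ("a", "b", "c", "d")}

def pearson(pair):
    """Pearson correlation coefficient for one pair of series."""
    xs, ys = SERIES[pair[0]], SERIES[pair[1]]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return pair, cov / (sx * sy)

if __name__ == "__main__":
    pairs = list(combinations(SERIES, 2))     # each pair is independent work
    with ProcessPoolExecutor() as pool:
        for pair, r in pool.map(pearson, pairs):
            print(pair, round(r, 4))
```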
Data Processing Problems
Corporations and scientists are creating lots of data these days and, like most data, these data need to be cleaned according to certain rules. Typically, interesting modeling and analysis cannot begin until the input data has been cleaned. The faster we can process the data and get it ready for analysis, the more time we have for analyzing it.
Monte Carlo Simulations
Modeling uncertainty and randomness is becoming commonplace in both scientific and business domains. Monte Carlo simulations are so versatile that they can be used to model a wide variety of problems in many different domains, from financial services to particle physics. The number of times a Monte Carlo model must be calculated can be very large. For example, let’s say that your model has 4 input parameters and you want to simulate 100 values for each parameter. That means you need to perform 100^4 = 100,000,000 calculations of your model. And that is a fairly modest simulation; some simulations have hundreds of parameters.
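To see that combinatorics at work, here is a scaled-down sketch of such a parameter sweep, using 10 values per parameter (10^4 = 10,000 runs) instead of 100. The model function and parameter ranges are invented for illustration, and the process pool once again stands in for the cluster.

```python
# A scaled-down parameter sweep matching the arithmetic above: 10 values per
# parameter gives 10**4 = 10,000 runs instead of 100**4 = 100,000,000.
# The model and parameter ranges are invented for illustration.
from itertools import product
from concurrent.futures import ProcessPoolExecutor

def model(params):
    """Stand-in for one calculation of the workbook model."""
    a, b, c, d = params
    return a * b - c / (d + 1)

if __name__ == "__main__":
    values = [v / 10 for v in range(10)]          # 10 values per parameter
    grid = list(product(values, repeat=4))        # 10**4 parameter sets
    with ProcessPoolExecutor() as cluster:
        total = sum(cluster.map(model, grid, chunksize=500))
    print(len(grid), "runs, mean =", total / len(grid))
```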
This has been an introductory, high-level overview of Excel 2010’s integration with cluster computing. My goal here was to introduce the features and explain a little about how they work and what problems they can be used to solve. In coming posts, I’ll get into the technical details of how these features work.
Unless I missed something and Excel now includes functions to solve for eigenvectors and eigenvalues as well as exotic norms, metrics and transformations, those people (only 90% out of their minds) who might consider using Excel for any portion of fluid dynamics simulations would need to perform the bulk of their calculations outside of Excel on the HPC cluster. Excel would become little more than a big (grossly too big) grid control for displaying results. Why bother?
As for cleansing data, anyone using Excel rather than scripting languages with regular expressions for pattern matching is either insane or ignorant of the advantages of scripting languages. Either way, Excel is decidedly not the best tool for the task.
I realize the advantages of HPC. I just question the wisdom of using Excel as anything other than the final presentation portion of any compute-intensive process which could benefit from HPC.
"Imagine if grocery stores employed just one checkout operator to serve all customers in the store. Because the operator can serve just one customer at a time, if multiple customers want served at the same time they have to form a queue and wait their turn. To avoid this, grocery stores employ multiple checkout operators who each serve customers independently of and in parallel to every other checkout operator..." --
This is a good idea and Microsoft should pass it along to KMart and Walmart and other stores. Around here these stores always have a dozen check-out lanes, but always have only ONE person working a cash register at a time. And that person usually has to stop the line to go check a price on some shelf that doesn't show up in their computer. Yeah, Microsoft should pass on their "multiple customer checkout" scheme to the people who own modern stores.
I can't agree that it is pointless providing Excel with these capabilities. I spend a large part of my time doing structural analysis, and I use a finite element analysis package to do the heavy number crunching, but typically I'd spend well over half my time in Excel, which is used as both a pre-processor, to generate the finite element model, and a post-processor to analyse the results and carry out the design calculations. Increasingly it is also used to control iterative calculations, analysing intermediate output from the FEA package, modifying the model as appropriate, and moving on to the next stage. It's much more than just a reporting tool.
I agree that providing better facilities for interaction with high performance computing systems is an exciting development.
Excel's success or failure for HPC and visualization will come down to speed. Like many scientists, we found Excel 2007 essentially useless: things that worked immediately in 2003 took minutes under 2007. The charting engine in 2007 stank; it couldn't have been any worse. So we're still with 2003. There are encouraging noises regarding 2010's improved performance, but time will tell.
How do I download this MS Excel 2010? Tell me, I need this.
Harlan - Thanks for your comments. You are correct; Excel does not have all the functions required to solve various types of business and scientific problems. In these scenarios, customers develop and deploy user defined functions (UDFs) - and we have enabled support for UDFs on the cluster.
When running on the cluster, Excel isn't used to display data to a human - it is used as a computation engine. The ability for users to use Excel as a computation engine and extend it (with UDFs) where they need to is something that many people attach a lot of value to. As an example, we recently had some customers deploy this technology in their own environment and use it to model their own complex problems. These customers' models involved custom code and they attached great value to being able to use Excel as a computation engine while integrating their own code where they wanted to.
Tim - yes, performance is critical for HPC scenarios (and for visualization). We have worked hard on improving performance in 2010. Please let us know what you think of how we did.
UDFs are nice but not sufficient for realistic HPC use.
Why not show how dedicated MSFT is for this by implementing Excel add-ins which provide the *full* capabilities of LAPACK? Surely you already have LAPACK implemented for MSFT HPC clusters, no?
I'm really interested to know how these features would work for me. For now I can appreciate them well enough, but they just look monstrous.
Can I get a step-by-step guide on how to use these?
I still don't see where the advantage would be in doing CFD with Excel over, let's say, a compiled Fortran code with a pre/post-processor for the number crunching and Tecplot or Paraview as a graphical post-processor. I'll be happy to be proven wrong...