Thursday, November 20, 2008

Traditional HPC vs Distributed HPC

What has changed, what has remained the same. (or is the grid dead?)

I'm just typing the comments/questions that come up during this BoF.



At this BoF, the organizers noted that while the TeraGrid has the word "grid" in its name, and its systems have grid interfaces (Globus), most users treat the resources as individual HPC systems: either a single system or a few in sequence.

Someone in the audience brought up caBIG/caGrid as "barely grid computing" (data access), but a possible example of a successful grid.

One audience member mentions that middleware has fallen behind: Globus is difficult to install and get working.

Another audience member wants to know exactly what grid computing is and what its goals are. Is the goal for it to be automatic? If a user has to specify all the distributed resources, it would be too difficult to use.

One audience member asked who has programmed with the low-level Globus API. Three people raised their hands, and he noted that was actually a large number. He says the Web Services APIs that Globus pushes have too much overhead. He also mentions that local schedulers are an issue: it is difficult to coordinate multiple resources so that they are available at the same time.
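To make the co-scheduling problem concrete, here is a minimal sketch of my own (not something shown at the BoF). It assumes each site's local scheduler could report its free time windows as sorted, non-overlapping (start, end) hours, which is itself an optimistic assumption, and it looks for the earliest slot where every site is free at once. The names `intersect` and `earliest_common_slot` and the sample windows are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) in hours, end exclusive


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlap of two sorted, non-overlapping interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def earliest_common_slot(free: List[List[Interval]], duration: int) -> Optional[Interval]:
    """Earliest window of `duration` hours in which every site is free."""
    if not free:
        return None
    common = free[0]
    for other in free[1:]:
        common = intersect(common, other)
    for start, end in common:
        if end - start >= duration:
            return (start, start + duration)
    return None


if __name__ == "__main__":
    # Hypothetical free windows for three sites (made-up numbers).
    site_a = [(0, 4), (10, 20)]
    site_b = [(2, 12), (15, 30)]
    site_c = [(3, 11), (16, 18)]
    print(earliest_common_slot([site_a, site_b, site_c], duration=2))
```

Even in this toy case each site has plenty of free time on its own, yet the shared window is small, and a slightly longer job finds no common slot at all.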

One audience member says "a grid project hijacked OGF (Open Grid Forum) and pushed Web Services and then abandoned OGF". He wouldn't mention the project by name, but it was understood he meant Globus. He is also asking whether the additional level of complexity that distributed computing brings is worth it. No one really knows the answers.

Another audience member asks, "do we see a future in Grid computing given all the problems?" One of the organizers responds that she doesn't see a big difference between distributed computing and traditional HPC. She has done distributed computing, and the Grid vision is to automate some of it and make it more accessible.

Comment from the audience: "a lot of people are doing ad-hoc distributed computing". He sees a future in grid computing for sharing medical information.

Can we do everything on large systems, do we really need to span multiple systems? TACC says that it is hard to move data around the country and that drives people to use a single system. File systems are the biggest inhibitors to performance. If the data is naturally distributed (gathered or generated in separate locations) that helps drive people to use distributed computing.
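A rough back-of-the-envelope sketch of the data movement point, added by me and not from the session: the time to ship a dataset is its size in bits divided by the usable link bandwidth. The function name `transfer_hours`, the `efficiency` factor, and the example sizes and link speeds are all hypothetical and are not figures quoted by TACC.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move `size_tb` terabytes over a `link_gbps` link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gbit/s = 1e9 bits/s).
    `efficiency` is the assumed fraction of the link actually achieved.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical dataset and link speeds, for illustration only.
    for gbps in (1, 10):
        print(f"10 TB over {gbps} Gbit/s: {transfer_hours(10, gbps):.1f} hours")
```

If the transfer takes a large fraction of the time the job itself would run, staying on the system where the data already lives is the obvious choice.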

One audience member put up a slide:

why use distributed resources:
  • higher availability
  • peak-requirement not met on any one system: e.g. mpich-g2
  • easy to substitute a single long-running simulation with multiple shorter ones
  • ease of modular and incremental growth
  • automatic spread of resource requirements

One audience member says nothing exciting has happened in the industry, specifically in grid computing: all the problems are still the same. Another audience member says that the advances have been in the applications, which are much more complicated, as are the workflows. Perhaps this is what makes progress on grid computing so slow.


One problem noted is the lack of funding for middleware. Some of the middleware being developed is business-based and not targeted towards science users.

Big argument about why we are stuck with MPI, which is at such a low level. Some people argued that the parallel languages of the 70s and 80s were much better than MPI; another person argued that they just didn't work: there was too much overhead and they didn't scale. MPI may be less elegant to program for, but it does work and it can scale. "So these advanced tools didn't work, so we're stuck with MPI? Well, that is a sorry state."


Not everyone belongs in distributed computing - it may take longer to prepare a problem for the grid or distributed computing than it would take to just run it on a cluster.


Overall themes:

The continuing improvements in the price/performance of HPC reduce the need for distributed computing.

Lack of standards in middleware hinders use of distributed machines.

Globus is hard to use. Need for better tools.

Is the effort worth the return?
