Thursday, November 20, 2008

Traditional HPC vs Distributed HPC

What has changed, what has remained the same. (or is the grid dead?)

I'm just typing the comments/questions that come up during this BoF.



At this BoF, the organizers noted that while TeraGrid has the word "grid" in it, and the systems have grid interfaces (Globus), most users use the resources as individual HPC systems, either a single system or a few in sequence.

caBIG/caGrid was brought up by someone in the audience as "barely grid computing" - mostly data access, but a possible example of a successful grid.

One audience member mentions that middleware has fallen behind; Globus is difficult to install and get working.

Another audience member wants to know exactly what grid computing is, and what the goals are. Are the goals to make things automatic? If a user has to specify all the distributed resources, it would be too difficult to use.

One audience member asked who has programmed with the low-level Globus API; three people raised their hands, and he noted that was actually a large number. He says the Web Services APIs that Globus pushes have too much overhead. He also mentions that local schedulers are an issue; it is difficult to coordinate multiple resources to be available at the same time.

One audience member says "a grid project hijacked OGF (Open Grid Forum) and pushed Web Services and then abandoned OGF." He wouldn't mention the project by name, but it was understood he meant Globus. He also asks whether the additional level of complexity is worth it, since distributed computing does add an additional level of complexity. No one really knows the answers.

Another audience member asks, "Do we see a future in grid computing given all the problems?" One of the organizers responds and says she doesn't see a big difference between distributed computing and traditional HPC; she has done distributed computing, and the Grid vision is to automate some of it and make it more accessible.

Comment from the audience: "a lot of people are doing ad-hoc distributed computing." He sees a future in grid computing for sharing medical information.

Can we do everything on large systems, or do we really need to span multiple systems? TACC says that it is hard to move data around the country, and that drives people to use a single system. File systems are the biggest inhibitors to performance. If the data is naturally distributed (gathered or generated in separate locations), that helps drive people to use distributed computing.

One audience member put up a slide:

Why use distributed resources:
  • higher availability
  • peak requirement not met by any one system (e.g., MPICH-G2)
  • easy to substitute a single long-running simulation with multiple shorter runs
  • ease of modular and incremental growth
  • automatic spread of resource requirements

One audience member says nothing exciting has happened in the industry, specifically in grid computing; all the problems are still the same. Another audience member says that the advances have been in the applications, which are much more complicated now, and workflows are much more complicated as well - perhaps this is what makes progress on grid computing so slow.


One problem noted is the lack of funding for middleware - some of the middleware being developed is business oriented, and not targeted towards science users.

Big argument about why we are stuck with MPI, which is at such a low level. There were parallel languages in the '70s and '80s that some people argue were much better than MPI; another person argued that they just didn't work - there was too much overhead and they just didn't scale. MPI may be less elegant to program for, but it does work and it can scale. "So these advanced tools didn't work, so we're stuck with MPI? Well, that is a sorry state."
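
For anyone reading along who hasn't written MPI, "low level" here means spelling out every buffer, rank, tag, and datatype by hand - a minimal sketch of a two-rank exchange:

```c
/* Minimal sketch of MPI point-to-point messaging in C; it illustrates how much
   bookkeeping (buffers, ranks, tags, types) the programmer handles by hand. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit dest, tag */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* explicit source, tag */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```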


Not everyone belongs in distributed computing - it may take longer to prepare a problem for the grid or distributed computing than it would take to just run it on a cluster.


Overall themes:

The continuing improvements in price/performance of HPC reduces need for distributed computing.

Lack of standards in middleware hinders use of distributed machines.

Globus is hard to use. Need for better tools.

Is the effort worth the return?

Wednesday, November 19, 2008

The G word isn't Grid anymore

Green500 BoF

Grid isn't the hot buzzword in HPC anymore; it is "Green." This is evident from walking around the show floor and seeing how many vendors are touting the "greenness" of their solutions.

With the increased pressure on reducing our carbon footprint, there is now a movement to not only improve the speed of the fastest computers in the world, but to also improve the efficiency with regards to power consumption. If power consumption continues to grow at the current rate (linear or even superlinear) with respect to performance, in the near future a large top-10 cluster could conceivably require its own power plant to operate. There is a desire to level off the power consumption of these large supercomputers and be able to increase performance without increasing power requirements.

The Green 500:
The metric is flops per watt, with the Linpack benchmark, and the system must make the normal Top500 list to be ranked. This list, which I believe was started last year, lets us know how the fastest computers in the world rank when taking power consumption into account. Perhaps making the top of the Green 500 will come with the same prestige as making the top of the Top 500. The major topic of this BoF is how the Green 500 should be redesigned to provide maximum utility. What benchmark and how many should be run? Should we have multiple metrics? If so, how would they be weighted in order to rank the systems? Or should they be condensed into a single score? What should be measured? Just the computer? The entire data center?

Update:
Fist fights
At the Green 500 BoF, a fist fight almost broke out (not really, but the discussion got very "lively") between someone who advocated using single-precision math for the early part of a computation, then switching to double precision to converge to the final solution, and someone who thought this was gaming the system. This idea is "green" because many processors have much better single-precision performance than double-precision - over 10x better on a Cell, 4x better on a GPGPU, probably even 2x on an x86. This lets more work get done with fewer processors or in less time, consuming less power. Should "green algorithms" be allowed, or even encouraged, for the benchmark?
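
For the curious, the "single then double" trick is basically iterative refinement. Here is a toy sketch of the structure (a 1x1 "system", purely to show the shape of the idea - not anyone's actual benchmark code):

```c
/* Toy sketch of mixed-precision iterative refinement: do the cheap solve in
   single precision, then use double-precision residuals to recover accuracy.
   Shown for a 1x1 "system" a*x = b just to illustrate the structure. */
#include <stdio.h>

int main(void)
{
    double a = 3.0, b = 1.0;

    /* Initial solve done entirely in single precision (fast on Cell/GPU). */
    double x = (double)((float)b / (float)a);

    /* Refinement: residual computed in double, correction solved in single. */
    for (int i = 0; i < 3; i++) {
        double r = b - a * x;                    /* true residual             */
        x += (double)((float)r / (float)a);      /* cheap single-prec. solve  */
    }

    printf("x = %.17g (exact 1/3 = %.17g)\n", x, 1.0 / 3.0);
    return 0;
}
```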

There was also discussion of creating various classes of systems, either by price or performance. One problem is that a megaflops per watt ranking favors small systems.

Gregg (who tagged along to steal power) posted his thoughts over at his blog. He tried to think of it from a CIO type perspective.

Genomic Sequence search on Blue Gene/P

This afternoon I find myself in a talk about massively parallel sequence search.

I'll take a look at the conference proceedings and edit this to put in a reference to their paper.


As we know, sequence search is a fundamental tool in computational biology. A popular search algorithm is BLAST.

Genomic databases are growing faster than the compute capability of single CPUs (clock scaling hits the power wall). This requires more and more processors to complete a search in a timely manner. The BLAST algorithm is O(n^2) in the worst case.

These researchers are using mpiBLAST on the Blue Gene/P, with very high efficiency. There have been many scalability improvements to mpiBLAST in the past few years, but it still didn't scale well beyond several thousand processors. They have identified key design issues of scalable sequence search and have made modifications to mpiBLAST to improve its performance at massive scales.

One limitation was the fixed worker-to-master mapping, along with the high overhead of fine-grained load balancing. Their optimizations include allowing arbitrary workers to be mapped to a master, and hiding the balancing overhead with query prefetching.
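
Purely as a sketch of the general pattern (not mpiBLAST's actual code), hiding the load-balancing round trip behind computation looks something like this in MPI; search_chunk here is just a stand-in for the real search:

```c
/* Toy master/worker scheduler with prefetching: each worker asks for its next
   chunk of work before searching the current one, so the round trip to the
   master overlaps with useful computation. */
#include <mpi.h>
#include <stdio.h>

#define TAG_REQ  1
#define TAG_WORK 2
#define NCHUNKS  20     /* pretend the query set is split into 20 chunks */

static void search_chunk(int chunk)          /* stand-in for the real search */
{
    printf("searching chunk %d\n", chunk);
}

int main(int argc, char **argv)
{
    int rank, size, dummy = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                         /* master: hand out chunks */
        int next = 0, finished = 0;
        while (finished < size - 1) {
            MPI_Status st;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                     MPI_COMM_WORLD, &st);
            int chunk = (next < NCHUNKS) ? next++ : -1;   /* -1 = no more work */
            if (chunk == -1) finished++;
            MPI_Send(&chunk, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD);
        }
    } else {                                 /* worker: compute and prefetch */
        int current, next;
        MPI_Request req;
        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
        MPI_Recv(&current, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        while (current != -1) {
            /* prefetch the next chunk before searching the current one */
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
            MPI_Irecv(&next, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD, &req);
            search_chunk(current);           /* overlaps with the round trip */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            current = next;
        }
    }

    MPI_Finalize();
    return 0;
}
```

The real code also has to collect results and deal with uneven chunk costs, which is where the flexible worker-to-master mapping they describe comes in.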

There are I/O challenges as well. They have implemented asynchronous two-phase I/O to get high throughput without forcing synchronization.

They show 93% efficiency on 32,000 processors with their modified mpiBLAST.

OpenMPI BoF

I am currently attending the OpenMPI BoF, being led by Jeff Squyres of Cisco, one of the main OpenMPI developers. Prior to working at Cisco on OpenMPI, Jeff was part of the LAM/MPI project at Indiana University.

For a little background, OpenMPI is a project that was spawned when a bunch of MPI implementers got together and decided to work together, since they were all working on basically the same thing. The contributing projects included LAM/MPI (which we have been using at the Lab), FT-MPI, Sun CT 6, LA-MPI, and PACX-MPI.


What's new in 1.3 (to be released soon):
  • ConnectX XRC support
  • More scalability improvements
  • more compiler and run time environment support
  • fine-grained processor affinity control
  • MPI 2.1 compliant
  • notifier framework
  • better documentation
  • more architectures, more OSes, more batch systems
  • thread safety (some devices, point to point only)
  • MPI_REAL16, MPI_COMPLEX32 (optional, no clean way in C)
  • C++ binding improvements
  • valgrind (memchecker) support
  • updated ROMIO version
  • condensed error messages (MPI_Abort() only prints one error message)
  • lots of little improvements
Scalability
  • keep the same on-demand connection setup as prior version
  • decrease memory footprint
  • sparse groups and communicators
  • many improvements in OpenMPI run time system
Point-to-point Message Layer (PML)
  • improved latency
  • smaller memory footprint

Collectives
  • more algorithms, more performance
  • special shared memory collective
  • hierarchical collective active by default
Open Fabrics
  • now supports iWARP, not just InfiniBand
  • XRC support
  • message coalescing (resisted because it is only really useful for benchmarking)
  • uDAPL improvements by Sun (not really OpenFabrics)

Fault Tolerance
  • coordinated checkpoint/restart
  • support BLCR and self (self means you give function pointer to call for checkpoint)
  • able to handle real process migration (i.e. change network type during migration)
  • improved message logging
OpenMPI on Roadrunner - scaling to 1 petaflop
  • reduce launch times by order of magnitude
  • reliability: cleanup, robustness
  • maintainability: cleanup, simplify program. remove everything not required for OMP
routed out of band communications

Roadmap:
v1.4 in planning phase only, feature list not fully decided

run-time usability
  • parameter usability options
  • sysadmin lock certain parameter values
  • spelling checks, validity checks
run-time system improvements
  • next-gen launcher
  • integration with other run-time systems
more processor and memory affinity support, topology awareness

shared memory improvements: allocation sizes, sharing, scalability to manycore

I/O redirection features
  • line by line tagging
  • output multiplexing
  • "screen"-like features
Blocking progress
MPI connectivity map
refresh included software


Upcoming Challenges:
Fault tolerance: the first step is similar to the FT-MPI approach - if a rank dies, the rest of the ranks are still able to communicate, and it is up to the programmer to detect and recover if possible
Scalability at run time and MPI level
Collective communication - when to switch between algorithms, take advantage of physical topology

MPI Forum
HLRS is selling the MPI 2.1 spec at cost, $22 (586 pages), at booth #1353
what do you want in MPI 3.0?
what don't you want in MPI 3.0?


Feedback:
Question regarding combining OpenMPI with OpenMP: Jeff: yes and no. OpenMPI has better threading support now, but they can't guarantee it won't break yet - it should be fine with devices that support MPI_THREAD_MULTIPLE.
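
For reference, the standard way to ask for full thread support and see what the library actually provides is MPI_Init_thread; a minimal sketch:

```c
/* Minimal sketch of requesting full thread support from the MPI library and
   checking what was actually provided, before mixing in OpenMP threads. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided (got level %d); "
                        "keep MPI calls on a single thread\n", provided);

    /* ... OpenMP parallel regions making MPI calls are only safe when
           provided == MPI_THREAD_MULTIPLE ... */

    MPI_Finalize();
    return 0;
}
```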

Can you compare OpenMPI with other MPI implementations? Jeff: We steal from them, they steal from us. Some say competition is good, but having many implementations available, especially on a single cluster, is confusing to users. Jeff would like to see more consolidation.

Show of hands: how important is...
  • thread safety (multiple threads making simultaneous MPI calls): about 10 in a full room
  • parallel I/O: only a few hands
  • one-sided operations: only a couple of users

Predictive Medicine with HPC

I just attended a talk about using computer modeling to make medical predictions - things such as modeling an aneurysm, predicting how it will grow, the blood flow through it, and the forces applied on the vessel wall. With these computer models they can determine what the dangers are and what course of action could be taken. They also showed computer models of drugs dispersed directly into the blood stream near the heart through a catheter. They wanted to see how well the drugs would be absorbed into the vessel wall.

The major theme was that eventually they will be using these computer models to make predictions about your future health and take proactive approaches to managing it, rather than waiting for something bad to happen and treating reactively.

During the Q&A, someone from the audience asked how much computational power was needed for these models (specifically the turbulence models in aneurysms and cardiac arteries) - is it something that a doctor could do in his office? The presenter said that it isn't something that could be done on a laptop yet, but it can be done with a small cluster - they may run on 128 processors or as few as 16, depending on what they are doing or how quickly they need the computations. They don't require massive systems. He says ideally computing power will get to the point where this can be coupled with a medical imaging device.

Tuesday, November 18, 2008

Adventures on the Floor

Today I visited the SiCortex booth, and didn't really learn anything I didn't know (low power, really great at problems that are "chatty" because everything is connected via a very fast backplane rather than a network). The guy knew about Phil Dickens [1] (they haven't been at this that long, so they don't have too many customers yet) and had a great quote from Phil that has been making its way around the guys at SiCortex: when asked what he needed to do to prep for his 648-processor SiCortex, he said "Well, I might have swept the floor once." He also liked the fact that he could fit it in his lab without adding additional power or cooling, and have a grad student admin it without involving the IT department (don't worry Gregg, I won't put one in my office).




I also visited the NVIDIA booth and spoke with a CUDA expert, and confirmed what I had been suspecting after I actually thought this through a little better - the GPGPU may not be the right fit for my FFT-heavy program. The problem is that the program does a lot of small FFTs (it could be a bunch of 15x15 2D FFTs, definitely not a lot of the 128x128 or 256x256 FFTs that the NVIDIA CUDA expert says would be necessary to start seeing a benefit from offloading to the GPU). There is latency in transferring the data to and from the GPU - you want to have a larger FFT so that you don't end up spending more time waiting for the data to move across the PCI bus than you save by off-loading the FFT. The good news is their CUDA-based FFT library is very similar to the FFTW library I am using, so it would require very few code changes. He said it may be possible to batch multiple small FFTs together, but I don't know if that will fit in well with this code. Perhaps I can get someone to donate a Tesla-equipped workstation to do some testing on.
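
From what I understand of CUFFT's batching interface (written here from memory, so the exact signatures should be checked against the CUFFT docs before trusting this), the idea would be to pack many 15x15 transforms into one plan so a single PCI-e transfer covers the whole batch; a rough sketch:

```c
/* Sketch of batching many small 2D FFTs in one CUFFT call, instead of paying
   the PCI-e transfer latency per 15x15 transform. Treat the exact API details
   as something to verify against the CUFFT documentation. */
#include <cufft.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define NX    15
#define NY    15
#define BATCH 4096                    /* number of small FFTs done per transfer */

int main(void)
{
    size_t count = (size_t)NX * NY * BATCH;
    cufftComplex *h_data = (cufftComplex *)calloc(count, sizeof(cufftComplex));
    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, count * sizeof(cufftComplex));

    /* One plan covering all BATCH transforms of size NX x NY, stored
       contiguously (NULL embed pointers mean "tightly packed"). */
    int n[2] = { NX, NY };
    cufftHandle plan;
    cufftPlanMany(&plan, 2, n, NULL, 1, NX * NY, NULL, 1, NX * NY,
                  CUFFT_C2C, BATCH);

    /* Single large host->device copy, one batched execution, one copy back. */
    cudaMemcpy(d_data, h_data, count * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaMemcpy(h_data, d_data, count * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Whether a batch that large even makes sense for this code is exactly the kind of thing I'd want to test on a borrowed Tesla.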




I also spent some time in a break out room at the Cluster Resources booth with Chris Samuel of VPAC and Scott Jackson of CRI to discuss TORQUE. We agreed that getting job arrays finished and solid is a high priority task for the following year. Hopefully by SC in Portland job arrays will be in wide use among TORQUE users.

Top 500 Bof Live Blog. Mine is Bigger.

I decided to copy Gregg and do a live blog during the Top 500 BoF. This is the 32nd Top 500 list, a list of the 500 fastest computers in the world (that choose to run and submit Linpack benchmark numbers; there are many computers whose existence and specs are not publicly known). This info is all available on the Top 500 website, so nothing new is being unveiled here. If a fist fight breaks out, I'll let you all know though.

#1

A slightly enhanced Roadrunner system (IBM BladeCenter based), which broke the petaflop barrier last June, held on to its top spot on the list with a 1.105 petaflop Linpack benchmark, while Jaguar posted a "mere" 1.058 petaflop score.

update: 5 new or significantly upgraded systems in the top 10

#2 Jaguar at Oak Ridge National Laboratory, a Cray XT5 system, over 150,000 cores

#3 NASA/Ames Research Center, SGI system, over 51,000 cores

#4 DOE/NNSA/LLNL Blue Gene/L

#5 Argonne National Laboratory, IBM Blue Gene/P

#6 TACC Ranger (Sun, over 62,000 AMD Opteron Cores)

#7 NERSC/LBNL Franklin, Cray XT4

#8 Oak Ridge Jaguar, Cray XT4

#9 NNSA/Sandia Cray Red Storm

#10 Shanghai Supercomputer Center, Dawning 5000A (from a Chinese company); the first Windows cluster in the top 10


Update

awards being given out

"Mine is bigger" T-shirt for Roadrunner


update:

The top 3 got certificates and T-shirts with a double meaning.

Also an award for the fastest non-US supercomputer, the Microsoft-powered Dawning 5000A at the Shanghai Supercomputer Center.

update:

Award for fastest in Europe went to an IBM Blue Gene/P at Juelich in Germany.

update:
Purdue won the student challenge, where teams have to build a cluster from scratch and run Linpack in a few days. They got over 700 Gflops. MIT came in 10th with less than 17 Gflops.
Turns out Purdue was using a SiCortex system.

update:
In June 300 systems fell off the list. This time 267 more systems fell off the list!

These are usually systems near the bottom of the list. Systems at the top generally stay near the top, until the institution needs the space to build something even faster.


update:
Single-core processors are almost gone from the list. Quad-core is overtaking dual-core in popularity. A small number of systems use Cell processors with 9 cores.

update:
Processors/performance: the Xeon E54xx (Harpertown) is #1 at 25%; quad-core Opteron processors account for 19% of the performance on the Top 500.

update:
HP delivered 42% of the systems and IBM 37%; Cray, Dell, and SGI provide a large amount of the power because they have a small number of very powerful systems. IBM is still #1 by total performance on the Top 500.

update:
The U.S. is still the #1 consumer, and its share has gone up.
Internationally, Japan's market is going down; in Asia, China has grown to #2 and India to #3, while South Korea is shrinking.


The U.K. is #1 in Europe; Germany is still #2 (they made big gains last year, but fell further behind the U.K. this year).

SC Keynote. Record attendance, Dell Sales pitch, Vector computing is back

Once again, Gregg has done a great job over at Mental Burdocks. I do have to say, I'd rank Michael Dell's keynote below some recent keynote speakers: Bill Gates in Seattle 2005, Ray Kurzweil in Tampa 2006, and Neil Gershenfeld in Reno 2007. It felt a bit like a Dell sales pitch - most of Dell's personal innovation has been in the Dell business model and not the technology. I am typing this after the fact, so I won't try to compete with Gregg's live account of the keynote.

Michael mentioned that all of the advances in graphics processors for game consoles and high-end gaming video cards are beginning to have an impact on high performance computing. I had started to get excited about this, and had attended GPU programming sessions at previous SC conferences, but now it seems like this is really ready to take off. There is now a good developer ecosystem around this, and NVIDIA is actively engaging the community and developing products specific to HPC. It is now possible to use libraries and frameworks that have already been developed to take advantage of these massively parallel processors. Higher-level programming and hardware abstraction are helping to accelerate adoption. Vector computing is back!

I would like to note that although there is record attendance this year, it actually seems less chaotic than other years I have attended. In Seattle, when I was waiting for the show floor to open for the Gala, I felt like I was in line to see Pearl Jam. The line was huge, and when I did get to the show floor the food and drink stations were mobbed by hordes of locust-like geeks. This year, like Reno (I don't remember how smoothly things went in Tampa...), things were much calmer. I didn't wait in line, and there was plenty of food and drink to go around - it was not difficult to find food and beverage stations with little or no line, especially after the initial rush into the show.

Monday, November 17, 2008

Back from the SC08 Gala


I had a long post here, but then I decided it was boring and no one would care! I'm looking forward to getting started with the technical program tomorrow and actually getting to spend some time talking with people on the show floor when the place isn't mobbed by people looking for free swag. Here is me with a laser pointer I snagged from a major networking company.

I should have brought a camera!

So my blog is very boring without pictures. I'm going to hook up with Gregg on the show floor occasionally this week so that he can snap some pictures of me and of some of the things I blog about so that this is not completely text!

Update: I just used my MacBook Pro iSight to snap a photo of all the stuff strewn over my hotel bed. Most of this came out of the bag of goodies I got when I registered over at SC08. Speaking of Macs, I am always pleased by the number of them that I see at the Sun HPC Consortium. They must have accounted for at least 75% of the laptops I saw.

Sun HPC Consortium

After taking a year off from attending the Sun HPC Consortium, I am glad I chose to attend again this year. (I attended the Seattle and Tampa meetings, and skipped Reno - but still attended SC07.) It is always good to see what is in the pipeline at Sun, what a few Sun partners have up their sleeves, and also what other customers are doing. As Gregg mentioned over at Mental Burdocks, we had to sign an NDA to attend, so there are details we can't blog about. I will be blogging at a very high level and not about specific details, so I'll be safe. Sun folks are aware of Gregg's blog, so he is also making sure that he is honoring the NDA, and in some cases has asked them what is and isn't OK to blog about.

Gregg is doing a great job of discussing things, so I'm just going to elaborate on a few things I found particularly exciting. First up is a potential GPGPU application for a project I have been working on:

For quite some time I have been working on software for the Institute for Molecular Biophysics (IMB) (for Joerg Bewersdorf, Dr. rer. nat., at The Jackson Laboratory). The software is a parallel implementation of the 3D sub-diffraction localization software for the Biplane FPALM microscope. The parallel version has been a great improvement over the serial version they run on their lab workstations. For example, a run that had taken over two days on the workstation could be completed in under an hour on our cluster. I don't remember how many cores this test was run on, but I don't think it was any more than 40 cores. Despite the apparent success of this parallelization effort, I am nervous that they will eventually overwhelm our small cluster as they scale their problem size up dramatically. There may be some algorithmic optimizations that could help us get more out of our hardware, but right now a very large percentage of time is spent in the FFTW library, and I would have no hope of implementing a faster FFT than the FFTW authors. I may implement an option to use single-precision floating point math - I don't think double precision is a necessity, but I will have to talk to Dr. Bewersdorf about that. We also have the option of an off-site compute resource, but it would be nice to be able to do the jobs in house, with a fast turnaround, and not have to spend a ton of money on more nodes that might sit idle when there isn't any biplane data to crunch. What I have been thinking about this weekend is the NVIDIA Tesla - I have heard a lot about fast FFT on the Tesla this weekend, and adding some of their 1U 4-Tesla units to a small subset of our cluster nodes could very well give us all the computing power we need for this project. I will be doing more investigation at SC08.
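
As a note to myself, the single-precision switch in FFTW should be mostly mechanical: the fftwf_ variants of the same calls, float data, and linking against the single-precision library. A rough sketch of the shape of it (whether single precision is actually good enough for the localization fits is the question to settle with Dr. Bewersdorf first):

```c
/* Rough sketch of the single-precision FFTW path: same call shapes as the
   double-precision code, but fftwf_ prefixes, float data, and -lfftw3f at
   link time. The 15x15 size matches the small per-spot transforms. */
#include <fftw3.h>

#define N 15

int main(void)
{
    fftwf_complex *in  = (fftwf_complex *)fftwf_malloc(sizeof(fftwf_complex) * N * N);
    fftwf_complex *out = (fftwf_complex *)fftwf_malloc(sizeof(fftwf_complex) * N * N);

    /* Plan first (FFTW_MEASURE may clobber the arrays), then fill 'in'. */
    fftwf_plan p = fftwf_plan_dft_2d(N, N, in, out, FFTW_FORWARD, FFTW_MEASURE);

    /* ... load a 15x15 image patch into 'in' here ... */
    fftwf_execute(p);

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}
```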

Blogging Sun HPC Consortium / SC 08

Gregg, Senior IT Manager and Research Liaison, blogger, and musician, suggested that I should be blogging SC, since I will have a different perspective on things than he does. He noted that we take notes on completely different things during the same talk. The Sun HPC Consortium is almost over, but I'll try to post my thoughts when I get a moment, and then I'll start blogging my adventures over at the SC conference.

Of course there is a very non-zero probability that I will completely blow this off and not post a single thing in this blog.