Preface (translated from German)

"Es (das parallele Programmieren, Anm. nils) ist die bereits greifbare Chance, die fast 50 Jahre waehrende Knechtschaft des sequentiellen Denkens und Programmierens, Testens und Debuggens endlich abstreifen und in eine n-dimensionale Freiheit der parallelen Ausdrucksmoeglichkeiten entfliehen zu koennen."

Foreword of "Parallele Rechnersysteme, Programmierung und Anwendung"; Praktische Informatik; Burkhardt et al.; Verlag Technik, Berlin, Munich

Parallel Computing

My first experience with parallel programming was gained using PVM (see the PVM page at ANL). A ray tracer was parallelised with PVM, reaching a speedup of about 2.5 on a 4-processor Convex SPP-1000. I also used the shared-memory model of that machine to speed up the computation of the Mandelbrot set. Some papers are available from the publications link on my home page.

High Performance Languages


A tool for automatic parallelisation

During my first weeks at FZR I encountered a problem which forced me to think about an automatic parallelisation tool, and I was able to express the idea in some small pieces of code. It seems as if I am on the right track, but I have not yet found the time to continue this project. The concept itself is clear.

Source (for non-commercial use only)

What is PVM (Parallel Virtual Machine)?

Given a heterogeneous environment of workstations and supercomputers, it should be possible to combine all available capacity into one powerful virtual computer. This is what PVM does. I have tested a cluster consisting of:

cluster.html

(This is a clickable image.)
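To give an impression of the programming model, here is a minimal PVM master/worker sketch in C (the task name "worker" and the exchanged data are hypothetical; this is not the code actually run on this cluster):

     /* master.c: enrol in the virtual machine, spawn workers, collect results */
     #include <stdio.h>
     #include "pvm3.h"

     int main(void)
     {
         int tids[4], result, i, n;
         pvm_mytid();                               /* enrol this process in PVM */
         n = pvm_spawn("worker", NULL, PvmTaskDefault, "", 4, tids);
         for (i = 0; i < n; i++) {
             pvm_recv(-1, 1);                       /* receive from any worker, tag 1 */
             pvm_upkint(&result, 1, 1);             /* unpack one integer */
             printf("got %d\n", result);
         }
         pvm_exit();
         return 0;
     }

     /* worker.c: compute something and send one integer back to the master */
     #include "pvm3.h"

     int main(void)
     {
         int value = 42;                            /* placeholder result */
         pvm_mytid();
         pvm_initsend(PvmDataDefault);              /* default (XDR) encoding */
         pvm_pkint(&value, 1, 1);
         pvm_send(pvm_parent(), 1);                 /* message tag 1 */
         pvm_exit();
         return 0;
     }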

This gives a theoretical peak performance of 2.8 GFlop/s (vendor data, applied per architecture) and a more realistic figure of 600 MFlop/s (sum of the performance of each physical processor; data source: LINPACK, Dongarra '95, applied per architecture).
With this set of 13 processors I tried out some benchmarks:

A scalar-product-like benchmark
Using N=20'000'000 iterations of a=b+c*d (where a, b, c, d are float variables), the following realistic figures for scalar floating-point performance (SFPP) were obtained. Scalar means that only one processor of a multiprocessor machine is used. First the time for the loop and the assignment alone was measured (T0), then the full time including the floating-point operation (Tfp). The performance is computed as Flop/s = N/(Tfp-T0). The experiment was done under real load (the machines were not used exclusively).

machine            minimum performance   average performance   peak performance
Convex C-3220      10.20 MFlop/s         10.63 MFlop/s         10.84 MFlop/s
Convex C-3820      34.62 MFlop/s         36.34 MFlop/s         38.71 MFlop/s
Convex SPP-1000    -                     32.41 MFlop/s         57.69 MFlop/s
HP 9000/715        18.18 MFlop/s         24.76 MFlop/s         55.56 MFlop/s
HP 9000/735        24.69 MFlop/s         32.82 MFlop/s         36.36 MFlop/s
IBM RS/6000-550    -                     55.75 MFlop/s         136.36 MFlop/s
IBM RS/6000-591    142.86 MFlop/s        232.80 MFlop/s        666.67 MFlop/s
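
For reference, a sketch of the measurement described above, assuming gettimeofday() for the wall-clock time (the original measurement code is not reproduced here):

     #include <stdio.h>
     #include <sys/time.h>

     /* wall-clock time in seconds */
     static double seconds(void)
     {
         struct timeval tv;
         gettimeofday(&tv, NULL);
         return tv.tv_sec + tv.tv_usec * 1e-6;
     }

     int main(void)
     {
         const long N = 20000000L;               /* number of iterations, as in the text */
         volatile float a = 0.0f, b = 1.1f, c = 2.2f, d = 3.3f;
         long i;
         double t0, tfp, start;

         /* T0: loop and assignment only */
         start = seconds();
         for (i = 0; i < N; i++)
             a = b;
         t0 = seconds() - start;

         /* Tfp: loop, assignment and the floating-point operation a = b + c*d */
         start = seconds();
         for (i = 0; i < N; i++)
             a = b + c * d;
         tfp = seconds() - start;

         /* Flop/s = N/(Tfp-T0), following the convention used in the text */
         printf("T0 = %.3f s, Tfp = %.3f s, %.2f MFlop/s\n",
                t0, tfp, N / (tfp - t0) / 1e6);
         return 0;
     }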
Mandelbrot set
By computing the iteration X[N+1]=X[N]^2+C and recording how fast the iteration diverges, one gets an image of the Mandelbrot set over the complex plane. This computation was done in the region [-1,1]x[-1i,1i] using up to 3200*3200 points. Measuring the time to solution, the following speedups were observed.

One has to treat these data with care because they only represent a kind of overall performance: the machines were not dedicated to my job but were also running other users' tasks.
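
A serial sketch of the escape-time computation described above (the grid size matches the text, the iteration limit is an assumption; in the parallel versions blocks of grid rows are distributed over the processors, since every point is independent):

     #include <stdio.h>

     #define NPOINTS 3200      /* up to 3200x3200 points, as in the text */
     #define MAXITER 256       /* illustrative iteration limit */

     /* number of iterations until |x| exceeds 2, i.e. the speed of divergence */
     static int escape_time(double cre, double cim)
     {
         double xre = 0.0, xim = 0.0;
         int n;
         for (n = 0; n < MAXITER; n++) {
             double re2 = xre * xre, im2 = xim * xim;
             if (re2 + im2 > 4.0)
                 break;
             xim = 2.0 * xre * xim + cim;   /* x[n+1] = x[n]^2 + c */
             xre = re2 - im2 + cre;
         }
         return n;
     }

     int main(void)
     {
         int i, j;
         long long sum = 0;
         /* each point is independent, so rows (or blocks of rows) can be
            handed to different processors without any communication */
         for (i = 0; i < NPOINTS; i++)
             for (j = 0; j < NPOINTS; j++)
                 sum += escape_time(-1.0 + 2.0 * i / (NPOINTS - 1),
                                    -1.0 + 2.0 * j / (NPOINTS - 1));
         printf("checksum: %lld\n", sum);
         return 0;
     }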

Gauss-Banachjewitsch linear equation solver
The algorithm for solving A*x=b is given by:
     procedure swap(k,n)
       - choose a line with index l>k but l<=n with abs(A[l,k])>=EPSILON
       - if there is such line, exchange line k with line l
         else terminate (A is singular)

     for k:=1 to n-1 do begin         { loop1 }
       if (abs(A[k,k]) < EPSILON) then swap(k,n);
       for i:=k+1 to n do begin       { loop2 }
         p:=A[i,k]/A[k,k];
         for j:=k+1 to n do           { loop3 }
           A[i,j]:=A[i,j]-p*A[k,j];
         b[i]:=b[i]-p*b[k];
       end;
     end;
     for i:=n downto 1 do begin       { loop4 }
       x[i]:=b[i];
       for j:=i+1 to n do             { loop5 }
         x[i]:=x[i]-A[i,j]*x[j];
       x[i]:=x[i]/A[i,i];
     end;
Because of data dependencies only the inner loops (loop2 and loop3) can be parallelised. This leads to the following implication:

Consider a parallel machine with S Flop/s per node and a minimum RAM transfer rate of M Byte/s. Distributing loop2 so that each node computes a scalar-product-like update of dimension n-k+1, we need a network data rate of at least

             (3*(n-k+1)-1)*sizeof(float)           
      B >=   ---------------------------
                 (2/S+4/M)*(n-k+1)                          
to be efficient (we have to send both the elimination line and the line to be updated). Taking a closer look at this formula, we can identify the denominator with (n-k+1) divided by the LINPACK rate; with sizeof(float) = 4 bytes and large n-k+1 this gives
               |B| >= 12 * |LINPACK|
For our example this requires rates above 264 MByte/s. Considering an Ethernet network with a peak rate of 10 MBit/s = 1.25 MByte/s, we can state that it is inefficient in any case to parallelise a Gauss-Banachjewitsch linear equation solver on such a PVM cluster.
It is, however, possible to achieve efficiency on the Convex Exemplar SPP-1000, because the CTI (the Exemplar communication system) has raw data rates of around 600 MByte/s, which should translate into a realistic value of about 400 MByte/s.
Since what we discussed above is essentially a scalar-product computation, this conclusion extends to all direct linear equation solvers which do not exploit the (sparse) structure of A.
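To make the bound concrete, here is a small sketch that evaluates the formula numerically (the values for S, M and n-k+1 are placeholders, not measurements of the machines above):

     #include <stdio.h>

     /* required network bandwidth (Byte/s) for distributing loop2, following
        B >= (3*(n-k+1)-1)*sizeof(float) / ((2/S+4/M)*(n-k+1)) */
     static double required_bandwidth(double S, double M, long len)
     {
         /* S: node performance in Flop/s, M: RAM transfer rate in Byte/s,
            len = n-k+1: length of the updates in this elimination step */
         return (3.0 * len - 1.0) * sizeof(float) / ((2.0 / S + 4.0 / M) * len);
     }

     int main(void)
     {
         /* placeholder values: 50 MFlop/s nodes, 400 MByte/s memory, n-k+1 = 1000 */
         double B = required_bandwidth(50e6, 400e6, 1000);
         printf("required bandwidth: %.1f MByte/s\n", B / 1e6);
         return 0;
     }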
Exploiting sparsity and/or matrix structure
If we can transform a given (sparse) matrix A into a block-diagonal structure, we can split one large linear system into several smaller independent problems which can be solved in parallel; a small sketch of this idea follows the references below.
For a better understanding of the problem of sparsity, the studies at the ISC at the TU Dresden should be consulted. We also refer to "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods" (Barrett et al.).
Also have a look at the activities of
  • Argonne National Laboratory
  • Silicon Graphics
  • PSRV
  • ZHR TU Dresden
  • Heinrich, Contact for SGI Origin2000 at URZ TUD

  • Top 500 Supercomputers


    Supercomputers with vendor pages

    SIMD (vector computer)

  • Cambridge Parallel Processing Gamma II
  • Cray J90 and T90
  • HP Convex C4 series
  • Digital Equipment Corp. Alpha Server
  • NEC SX-4
  • SGI Power Challenge
  • Tera MTA

    MIMD (real multiprocessor)

  • Cray T3E
  • Fujitsu VPP300 series
  • HP Convex Exemplar SPP-1200
  • IBM 9076 SP2
  • Intel Paragon XP
  • Meiko Computing Surface 2
  • Parsytec Power Explorer
  • MUlti processor System with Intelligent Communication
  • List from: "Overview of Recent Supercomputers"; A.J. van der Steen, J.J. Dongarra

    LAM/MPI homepage

    Seymour Cray Memorial

