Preface (translated from German)

"Es (das parallele Programmieren, Anm. nils) ist die bereits greifbare Chance, die fast 50 Jahre waehrende Knechtschaft des sequentiellen Denkens und Programmierens, Testens und Debuggens endlich abstreifen und in eine n-dimensionale Freiheit der parallelen Ausdrucksmoeglichkeiten entfliehen zu koennen."

Foreword of "Parallele Rechnersysteme, Programmierung und Anwendung"; Praktische Informatik; Burkhardt et al.; Verlag Technik, Berlin, Munich

Parallel Computing

My first experience with parallel programming was gained using PVM (see the PVM page at ANL). A ray tracer was parallelised with PVM, reaching a speedup of about 2.5 on a 4-processor Convex SPP-1000. I also used the shared-memory model of that machine to speed up the computation of the Mandelbrot set. Some papers are available from the publications link on my home page.

High Performance Languages


A tool for automatic parallelisation

During my first weeks at FZR I encountered a problem which forced me to think about an automatic parallelisation tool, and I was able to express the idea in some small pieces of code. It seems as if I am on the right track, but I have not yet found the time to continue this project. The concept itself is clear.

Source (for non-commercial use only)

What is PVM (Parallel Virtual Machine)?

Given a heterogeneous environment of workstations and supercomputers, it should be possible to combine all available capacity into one powerful virtual computer. This is what PVM does. I have tested a cluster consisting of:

cluster.html

(This is a clickable image.)
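To give an impression of the programming model, here is a minimal PVM master/worker sketch in C (the task name "worker" and the exchanged data are hypothetical; this is not the code actually run on this cluster):

     /* master.c: enrol in the virtual machine, spawn workers, collect results */
     #include <stdio.h>
     #include "pvm3.h"

     int main(void)
     {
         int tids[4], result, i, n;
         pvm_mytid();                               /* enrol this process in PVM */
         n = pvm_spawn("worker", NULL, PvmTaskDefault, "", 4, tids);
         for (i = 0; i < n; i++) {
             pvm_recv(-1, 1);                       /* receive from any worker, tag 1 */
             pvm_upkint(&result, 1, 1);             /* unpack one integer */
             printf("got %d\n", result);
         }
         pvm_exit();
         return 0;
     }

     /* worker.c: compute something and send one integer back to the master */
     #include "pvm3.h"

     int main(void)
     {
         int value = 42;                            /* placeholder result */
         pvm_mytid();
         pvm_initsend(PvmDataDefault);              /* default (XDR) encoding */
         pvm_pkint(&value, 1, 1);
         pvm_send(pvm_parent(), 1);                 /* message tag 1 */
         pvm_exit();
         return 0;
     }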

This gives a theoretical peak performance of 2.8 GFlop/s (vendor data, applied per architecture) and a more realistic figure of 600 MFlop/s (sum of the performance of each physical processor; data source: LINPACK, Dongarra '95, applied per architecture).
With this set of 13 processors I tried out some benchmarks:

A scalar-product-like benchmark
Using N=20'000'000 iterations of a=b+c*d (where a, b, c, d are float variables), the following realistic figures for scalar floating-point performance (SFPP) were obtained. Scalar means that only one processor of a multiprocessor machine is used. First the time for the loop and the assignment alone was measured (T0), then the full time including the floating-point operation (Tfp). The performance is computed as Flop/s = N/(Tfp-T0). The experiment was done under real load (the machines were not used exclusively).

machine            minimum performance   average performance   peak performance
Convex C-3220      10.20 MFlop/s         10.63 MFlop/s         10.84 MFlop/s
Convex C-3820      34.62 MFlop/s         36.34 MFlop/s         38.71 MFlop/s
Convex SPP-1000    -                     32.41 MFlop/s         57.69 MFlop/s
HP 9000/715        18.18 MFlop/s         24.76 MFlop/s         55.56 MFlop/s
HP 9000/735        24.69 MFlop/s         32.82 MFlop/s         36.36 MFlop/s
IBM RS/6000-550    -                     55.75 MFlop/s         136.36 MFlop/s
IBM RS/6000-591    142.86 MFlop/s        232.80 MFlop/s        666.67 MFlop/s
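
For reference, a sketch of the measurement described above, assuming gettimeofday() for the wall-clock time (the original measurement code is not reproduced here):

     #include <stdio.h>
     #include <sys/time.h>

     /* wall-clock time in seconds */
     static double seconds(void)
     {
         struct timeval tv;
         gettimeofday(&tv, NULL);
         return tv.tv_sec + tv.tv_usec * 1e-6;
     }

     int main(void)
     {
         const long N = 20000000L;               /* number of iterations, as in the text */
         volatile float a = 0.0f, b = 1.1f, c = 2.2f, d = 3.3f;
         long i;
         double t0, tfp, start;

         /* T0: loop and assignment only */
         start = seconds();
         for (i = 0; i < N; i++)
             a = b;
         t0 = seconds() - start;

         /* Tfp: loop, assignment and the floating-point operation a = b + c*d */
         start = seconds();
         for (i = 0; i < N; i++)
             a = b + c * d;
         tfp = seconds() - start;

         /* Flop/s = N/(Tfp-T0), following the convention used in the text */
         printf("T0 = %.3f s, Tfp = %.3f s, %.2f MFlop/s\n",
                t0, tfp, N / (tfp - t0) / 1e6);
         return 0;
     }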
Mandelbrot set
By computing the iteration X[N+1]=X[N]^2+C and recording how fast the iteration diverges, one gets an image of the Mandelbrot set over the complex plane. This computation was done in the region [-1,1]x[-1i,1i] using up to 3200*3200 points. Measuring the time to solution, the following speedups were observed.

One has to treat these data with care because they only represent a kind of overall performance: the machines were not dedicated to my job but were also running other users' tasks.
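
A serial sketch of the escape-time computation described above (the grid size matches the text, the iteration limit is an assumption; in the parallel versions blocks of grid rows are distributed over the processors, since every point is independent):

     #include <stdio.h>

     #define NPOINTS 3200      /* up to 3200x3200 points, as in the text */
     #define MAXITER 256       /* illustrative iteration limit */

     /* number of iterations until |x| exceeds 2, i.e. the speed of divergence */
     static int escape_time(double cre, double cim)
     {
         double xre = 0.0, xim = 0.0;
         int n;
         for (n = 0; n < MAXITER; n++) {
             double re2 = xre * xre, im2 = xim * xim;
             if (re2 + im2 > 4.0)
                 break;
             xim = 2.0 * xre * xim + cim;   /* x[n+1] = x[n]^2 + c */
             xre = re2 - im2 + cre;
         }
         return n;
     }

     int main(void)
     {
         int i, j;
         long long sum = 0;
         /* each point is independent, so rows (or blocks of rows) can be
            handed to different processors without any communication */
         for (i = 0; i < NPOINTS; i++)
             for (j = 0; j < NPOINTS; j++)
                 sum += escape_time(-1.0 + 2.0 * i / (NPOINTS - 1),
                                    -1.0 + 2.0 * j / (NPOINTS - 1));
         printf("checksum: %lld\n", sum);
         return 0;
     }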

Gauss-Banachjewitsch linear equation solver
The algorithm for solving A*x=b is given by:
     procedure swap(k,n)
       - choose a line with index l>k but l<=n with abs(A[l,k])>=EPSILON
       - if there is such line, exchange line k with line l
         else terminate (A is singular)

     for k:=1 to n-1 do begin         { loop1 }
       if (abs(A[k,k]) < EPSILON) then swap(k,n);
       for i:=k+1 to n do begin       { loop2 }
         p:=A[i,k]/A[k,k];
         for j:=k+1 to n do           { loop3 }
           A[i,j]:=A[i,j]-p*A[k,j];
         b[i]:=b[i]-p*b[k];
       end;
     end;
     for i:=n downto 1 do begin       { loop4 }
       x[i]:=b[i];
       for j:=i+1 to n do             { loop5 }
         x[i]:=x[i]-A[i,j]*x[j];
       x[i]:=x[i]/A[i,i];
     end;
Because of data dependencies only the inner loops (loop2 and loop3) can be parallelised. This leads to the following implication:

Consider a parallel machine with S Flop/s per node and a minimum RAM transfer rate of M Byte/s. Distributing loop2 so that each node computes a scalar-product-like update of dimension n-k+1, we need a network data rate of at least

             (3*(n-k+1)-1)*sizeof(float)           
      B >=   ---------------------------
                 (2/S+4/M)*(n-k+1)                          
to be efficient (we have to send both the elimination line and the line to be updated). Taking a closer look at this formula, we can identify the denominator with (n-k+1) divided by the LINPACK rate; with sizeof(float) = 4 bytes and large n-k+1 this gives
               |B| >= 12 * |LINPACK|
For our example this requires rates above 264 MByte/s. Considering an Ethernet network with a peak rate of 10 MBit/s = 1.25 MByte/s, we can state that it is inefficient in any case to parallelise a Gauss-Banachjewitsch linear equation solver on such a PVM cluster.
It is, however, possible to achieve efficiency on the Convex Exemplar SPP-1000, because the CTI (the Exemplar communication system) has raw data rates of around 600 MByte/s, which should translate into a realistic value of about 400 MByte/s.
Since what we discussed above is essentially a scalar-product computation, this conclusion extends to all direct linear equation solvers which do not exploit the (sparse) structure of A.
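To make the bound concrete, here is a small sketch that evaluates the formula numerically (the values for S, M and n-k+1 are placeholders, not measurements of the machines above):

     #include <stdio.h>

     /* required network bandwidth (Byte/s) for distributing loop2, following
        B >= (3*(n-k+1)-1)*sizeof(float) / ((2/S+4/M)*(n-k+1)) */
     static double required_bandwidth(double S, double M, long len)
     {
         /* S: node performance in Flop/s, M: RAM transfer rate in Byte/s,
            len = n-k+1: length of the updates in this elimination step */
         return (3.0 * len - 1.0) * sizeof(float) / ((2.0 / S + 4.0 / M) * len);
     }

     int main(void)
     {
         /* placeholder values: 50 MFlop/s nodes, 400 MByte/s memory, n-k+1 = 1000 */
         double B = required_bandwidth(50e6, 400e6, 1000);
         printf("required bandwidth: %.1f MByte/s\n", B / 1e6);
         return 0;
     }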
Exploiting sparsity and/or matrix structure
If we can transform a given (sparse) matrix A into a block-diagonal structure, we can split one large linear system into several smaller independent problems which can be solved in parallel; a small sketch of this idea follows the references below.
For a better understanding of the problem of sparsity, the studies at the ISC at the TU Dresden should be consulted. We also refer to "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods" (Barrett et al.).
Also have a look at the activities of
  • Argonne National Laboratory
  • Silicon Graphics
  • PSRV
  • ZHR TU Dresden
  • Heinrich, Contact for SGI Origin2000 at URZ TUD

  • Top 500 Supercomputers


    Supercomputers with vendor pages

    SIMD (vector computer)

  • Cambridge Parallel Processing Gamma II
  • Cray J90 and T90
  • HP Convex C4 series
  • Digital Equipment Corp. Alpha Server
  • NEC SX-4
  • SGI Power Challenge
  • Tera MTA

    MIMD (real multiprocessor)

  • Cray T3E
  • Fujitsu VPP300 series
  • HP Convex Exemplar SPP-1200
  • IBM 9076 SP2
  • Intel Paragon XP
  • Meiko Computing Surface 2
  • Parsytec Power Explorer
  • MUlti processor System with Intelligent Communication
  • List from: "Overview of Recent Supercomputers"; A.J. van der Steen, J.J. Dongarra

    LAM/MPI homepage

    Seymour Cray Memorial

