\documentstyle [11pt] {article}
\newtheorem{thm}{THEOREM}[section]
\newtheorem{lemma}[thm]{LEMMA}
\newtheorem{cor}[thm]{COROLLARY}
\newtheorem{remark}[thm]{REMARK}
\newcommand{\pf}{{\em Proof:\ }}
\newcommand{\C}{\mbox{${\bf C}$}}
\newcommand{\N}{\mbox{${\bf N}$}}
\newcommand{\nn}{\mbox{${\bf n}$}}
\newcommand{\kk}{\mbox{${\bf k}$}}
\newcommand{\norm}[2]{\mbox{$\parallel #1 \parallel _{#2}$}}
\newcommand{\bin}[2]{\left({#1\atop #2}\right)}
\newcommand{\abs}[1]{\mbox{$\,\mid\! #1 \!\mid\,$}}
\newcommand{\re}{\mbox{Re}\,}
\newcommand{\im}{\mbox{Im}\,}
\newcommand{\h}{\mbox{$\frac{1}{2}$}}
%\renewcommand{\baselinestretch}{2}
\setlength{\topmargin}{0.2cm}
\setlength{\oddsidemargin}{0.5cm}
\setlength{\evensidemargin}{0.5cm}
\setlength{\textwidth}{15cm}
\setlength{\textheight}{21cm}
\begin{document}
\title{Optimal Combinatorial Functions \\
Comparing Multiprocess Allocation Performance \\
in Multiprocessor Systems}
\author{H\aa kan Lennerstad\thanks{Both authors were supported in part
by the Blekinge Research Foundation.}\\
Dept. of Mathematics\\
Univ. of Karlskrona/Ronneby\\
S-371 79 Karlskrona
\and
Lars Lundberg$^*$\\
Dept. of Computer Science\\
Univ. of Karlskrona/Ronneby\\
S-372 25 Ronneby}
\date{}
\maketitle
\section{Abstract}
For the execution of an arbitrary parallel program $P$, consisting of a set
of processes, we consider two alternative multiprocessors.
The first multiprocessor has $q$ processors and allocates
parallel programs dynamically, i.e. processes may be reallocated from one
processor to another. The second employs cluster allocation
with $k$ clusters and $u$ processors in each cluster; here processes
may be reallocated within a cluster only. Let $T_d(P,q)$
and $T_c(P,k,u)$ be execution times for the parallel program $P$
with optimal allocations. We derive a formula for the program independent performance function
$$G(k,u,q)=\sup_{\mbox{{\footnotesize all parallel programs $P$}}}
\:\frac{T_c(P,k,u)}{T_d(P,q)}.$$
Hence, with optimal allocations, the execution of $P$ can never
take more than a factor $G(k,u,q)$ longer time with the second
multiprocessor than with the first, and there exist programs showing that
the bound is sharp.
The supremum is taken over all parallel programs consisting of any number
of processes. Any interprocess dependency structure is allowed for the
parallel programs, except deadlock.
Overhead for synchronization and reallocation is the only cost neglected.
We further present optimal formulas which exploit a priori knowledge of
the class of parallel programs intended for the multiprocessor,
thus resulting in sharper optimal bounds. The function $g(n,k,u,q)$
is the above maximum taken over all parallel programs consisting
of $n$ processes. The function $s(n,v,k,u)$ is the same maximum, with $q=n$,
taken over all parallel programs of $n$ processes which have a
degree of parallelism characterized by a certain parallel profile vector
$v=(v_1,...,v_n)$.
The functions can be used in various ways to obtain optimal performance
bounds, aiding in multiprocessor architecture decisions.
An immediate application is the evaluation
of heuristic allocation algorithms.
It is well known that the problems of finding the corresponding optimal
allocations are NP-complete. We thus in effect present a methodology to obtain optimal control of NP-complete scheduling problems.
\section{Problem definition and results}
This report establishes optimal worst case bounds
comparing the execution time for a program on two multiprocessors.
The multiprocessors have identical processors but different
organization and different numbers of processors.
The multiprocessors execute parallel programs $P$ which
may have any set of runtimes and any possible structure of interprocess dependency, only excepting deadlock.
The parallel program model under consideration is general.
A parallel program consists of a set of processes,
some of which may run in parallel, and a set of dependencies
between the processes, called synchronization:
one process, at a specific point in the process, may
not execute further unless another process has reached a certain point. We place
no limitations on the occurrence of such dependencies in a program,
other than that the program is required to be executable. For example, some
programs include time consuming transferring of data between subprocesses.
Then the transferring itself can be viewed as a part of the process,
while the dependency is realized as synchronization signals of no
time duration at several points during the transferring.
In cluster allocation the processors are organized in $k$ groups,
the clusters, where each cluster contains $u$ processors. Here we thus have
in total $ku$ processors.
Once a process is allocated to a processor in a specific cluster, it is
executed to completion within this cluster. It may be transferred at any
time, but only to another processor in the same cluster.
The cost of all transferring of processes is neglected.
If a process is put into a waiting state, it will thus
later be restarted on a processor in the same cluster.
Dynamic and static allocations are both special cases of cluster
allocation. Dynamic allocation represents the case of having all processors
in one
cluster, $k=1$, hence the processes may be transferred between all processors
without limits. In static allocation each cluster has one single processor,
$u=1$, thus a process may never be transferred from the processor where
it was initiated.
For a parallel program $P$, there is a certain number of
different cluster allocations. This number is usually very large, but
certainly finite. The execution time of each allocation is affected by the
order in which the processes allocated to the same cluster are executed.
However, each allocation of the program has a well defined
minimal execution time. We are interested in the minimal execution times
for two multiprocessor organizations. Therefore, we only
need to consider the minimal execution time for each allocation.
Since we now have a well defined execution time for each allocation, the
set of execution times of the program $P$ over
all cluster allocations is finite, so it has a minimum. An allocation
which results in an execution time shorter than or equal to that of any other
allocation is called an optimal allocation. It is
well known that finding an optimal cluster allocation is an NP-complete problem.
For any fixed parallel program $P$, we compare the minimal execution time of $P$
for two multiprocessors.
The first multiprocessor has $q$ processors and allocates
parallel programs dynamically. For a parallel program $P$, we denote
the execution time for $P$ with optimal dynamic allocation
by $T_d(P,q)$, i.e. with a dynamic allocation such that there is no
other dynamic allocation for which the execution time of $P$ is shorter.
The second multiprocessor performs cluster allocation
with $k$ clusters and $u$ processors in each cluster.
Let $T_c(P,k,u)$ denote the execution time for the parallel program $P$
with optimal cluster allocation.
Overhead for synchronization and reallocation is neglected
throughout the report. In section 4 the allocation problem is defined in detail.
The allocation problem will be formulated as a mathematical optimization
problem about so called
$0,1$-matrices, i.e. matrices where all entries are $0$ or $1$.
The extremal matrices, corresponding to extremal parallel programs,
are characterized. This makes it possible
to establish a formula for the performance function $g$:
$$g(n,k,u,q)=\max_{\mbox{{\footnotesize all $n$-programs $P$}}}\: \frac{T_c(P,k,u)}{T_d(P,q)}.$$
Here $n$-programs denotes parallel programs with $n$ processes.
The formula is presented in section 6.
The degree of parallelism for a program $P$ can be characterized
by a vector $v=(v_1,...,v_n)$. Here $v_i$ is the share of the
total execution time during which exactly $i$ processes execute simultaneously
if each process is allocated to a separate processor.
From the function $g$ we derive a formula for the performance function
$$s(n,v,k,u)=\max_P \:\frac{T_c(P,k,u)}{T_d(P,n)},$$
where the maximum is taken over all parallel programs $P$ of $n$ processes
which has the parallel profile vector $v$.
The function $g(n,k,u,q)$ is proven to be increasing in the variable $n$,
a fact which can be used to compute a formula for the process independent performance function
$$G(k,u,q)=\sup_{\mbox{{\footnotesize all parallel programs $P$}}} \:\frac{T_c(P,k,u)}{T_d(P,q)}=\lim_{n\rightarrow\infty}g(n,k,u,q).$$
This performance function is applicable for a multiprocessor
intended for any parallel program. No prior knowledge of the
parallel programs can be used
to extract further information from the function $G(k,u,q)$.
On the other hand
the functions $g(n,k,u,q)$ and $s(n,v,k,u)$ are
useful when there is prior knowledge
of certain kinds about the parallel programs.
Note that besides the allocations, the performance functions
themselves are optimal, in the sense that they represent bounds
which cannot be improved. The term ``optimal'' thus appears in two senses.
\section{Previous results and applications}
The present report extends the results in [2] and [6], where only
the formula for the function $g$ in the case $u=1, k=q$ is treated,
representing static versus dynamic allocation on the same multiprocessor.
This report focuses on the mathematics of the subject. It can be
viewed as the theoretical foundation for the reports
[4], [7], [8] and [10], which treat different applications
from a computer science point of view.
The report [9] provides an application not covered by the present report:
there, a test execution of the program can be used to provide sharper
bounds. The results involve linear programming where values of the
function $G$ occur as coefficients.
Furthermore, the basic method presented here has proved to
be useful in other contexts. In the report [11], the efficiency of cache
memories for single processors is studied. Here optimal bounds
comparing more flexible with less flexible cache memory organization
alternatives are derived; the final part of the argument is similar
to that of the present report. [5] is a survey article of the results
in [11].
Since the problems of finding optimal allocations are NP-complete,
general and preferably optimal results are needed already
for a medium number of processors and processes in order
to choose a multiprocessor architecture which optimizes performance.
Other than the reports [2] to [14], the only
general results concerning allocation strategies
of parallel programs appear to be the results by R. L. Graham [1]. The overhead
for process reallocation and synchronization is neglected also in that work.
So called self-scheduling algorithms are also considered in [1].
This term is used for dynamic allocation
algorithms where, when a processor becomes idle and there are waiting
executable processes, one of these is immediately allocated to this processor.
It is established in [1] that the execution time for a program allocated
with a self-scheduling algorithm is never more than twice the
execution time with optimal dynamic allocation.
The process allocation scheme is of major importance for the
performance of a multiprocessor, thus an immediate area of applications
for the present results is multiprocessor design.
The results can further be used to evaluate the efficiency of
cluster allocation algorithms versus the best possible algorithm,
i.e. the algorithm which finds the optimal cluster allocation.
One difficulty in this application, which concerns the
functions $G$ and $g$, is that
the cluster execution time is compared to the execution time with
optimal dynamic allocation, which is unknown and NP-hard to compute.
The function $s$ compares the optimal cluster allocation with the case of
having each process allocated to a processor of its own, hence the optimal
allocation is obvious and the difficulty does
not arise in this case.
The most general result in this report is clearly the one represented
by the function $G$. Here we have no demands on the program.
The function $g$ is restricted to programs with $n$ processes,
while the function $s$ concerns programs with $n$ processes and
a certain parallel profile. It is expected that the
techniques presented here can be extended to take advantage of
further kinds of program specifics, thus improving the bounds by
keeping away from those programs which maximize the ratio studied in
this work.
In effect we present a methodology to obtain optimal
control of NP-complete scheduling problems.
One example of a multiprocessor architecture feature which immediately
follows from the function $G(k,u,q)$
is the number of extra processors which would compensate
for a more static allocation, i.e. the trade-off between multiprocessor
power and allocation efficiency. The concluding graphics section
shows plots of this and many other
ways to exploit the performance functions for multiprocessor
architecture purposes.
As a mathematical result, the present work is also applicable to scheduling
problems in other contexts. For example, the construction of a house is clearly
a multiprocess with subprocesses which are in some
respects dependent and in some respects independent. Here the
processors are not as general as computers in the sense that some
processes probably cannot efficiently be allocated to all processors.
Humans are usually specialists.
Within automation and robotics scheduling problems are frequent, and
here the skills of the processors can often be clearly defined.
\vspace{5mm}
We next give an overview of the report.
In the following section the allocation problem is described and analyzed in
detail, and transformed to a mathematical problem.
In section 5 we give a full formulation of the mathematical problem and
introduce necessary notation.
In section 6 the formulas for the program dependent performance functions
$g(n,k,u,q)$ and $s(n,v,k,u)$ are stated and proved. Basic
properties of the function $g(n,k,u,q)$ are also established here.
Section 7 deals with the general performance function $G(k,u,q)$.
The report is concluded with a graphics section. The purpose of
this section is twofold: to show the performance functions quantitatively
and geometrically, and to suggest how the functions can be used
to get optimal aid in multiprocessor design decisions.
\section{From programs to matrices}
A program $P$ consists of $n$ processes of possibly very different
execution times. The processes are of course usually dependent on each other.
One can expect dependencies of the type that process $i$ cannot
execute further at the time point $t_i$ unless process
$j$ has reached the time point $t_j$. When process $j$ has
reached the time point $t_j$
it is said to execute a synchronizing signal to process $i$,
restarting this process. Certainly there can be many synchronizing signals
to a time point $t_i$, in which case all have to be executed before
the process restarts.
The execution time of synchronizing signals is neglected.
Most parallel programs contain many synchronizing signals. In this
report any set of synchronization signals is allowed,
except those which include a deadlock.
Thus a parallel program $P$ of $n$ processes is defined by the total execution
times of the $n$ processes and by the set of synchronizing signals.
Now consider a parallel program $P$. Assume that we have found an optimal dynamic
allocation, with execution time $T_d(P,q)$. This optimal dynamic allocation
will be kept fixed during the entire argument
dealing with the program $P$ and its descendant $P'$.
Next we introduce a discretization
of the time interval in subintervals $(t_i,t_{i+1})$ of equal length, such
that all synchronizing signals, process initiations and process terminations
appear on the time points $t_i$, where
$t_i=\frac{i}{m}T_d(P,q), i=0,...,m$. Obviously,
all process parts in the interval $(t_{i-1},t_i)$ are completed before
any part of the processes corresponding to the interval $(t_i,t_{i+1})$
is executed when using this allocation, since this is so without the discretization.
Such a discretization is possible if all synchronizing
signals and process terminations occur at rational time points, which we can assume.
Observe that $m$ might be very large even if the program $P$ is small and has a simple structure.
For the sake of the function $s(n,v,k,u)$, let $v_1,...,v_n$ be the parallel
profile of the program $P$: when having the processes allocated on
separate processors, $i$ processors are active simultaneously exactly
the share $v_i$ of the total execution time of $P$. The numbers $v_i$
are assumed to be non-negative rational numbers such that
$\sum_{i=1}^{n}v_i=1$. Then $m$ is necessarily a multiple of the least
common denominator of the numbers $v_1,...,v_n$, so there are integers $x_i=mv_i$
such that exactly $x_i$ intervals have $i$ active processes.
Consequently, during a time interval $(t_i,t_{i+1})$, no process of the program $P$ starts, and no process stops.
From the program $P$ and for the functions $G$ and $g$,
we next construct another program $P'$
by two changes of the program $P$: we introduce
new synchronizing signals and prolong certain processes.
For the sake of the function $s(n,v,k,u)$ only one of these changes is done:
new synchronization is introduced.
Any prolonging of processes would destroy the specific
parallel profile character, hence for this case the only difference
between the programs $P$ and $P'$ is the synchronization.
At every time point
$t_i$ we introduce all possible synchronization between the processes. This
means that the synchronization now requires that all processes in the interval
$(t_{i-1},t_i)$ have to be completed before
any part of the processes corresponding to the interval $(t_i,t_{i+1})$,
which will increase the execution time with most other allocations.
Since the execution time of synchronizing signals is neglected this does
not change the total execution time with the fixed optimal dynamic
allocation, which is $T_d(P,q)$. Further, for the functions $G$ and $g$,
all processors are made to be
busy at all time intervals. This is achieved by, if necessary,
prolonging some processes.
However no process is prolonged beyond $T_d(P,q)$, hence $T_d(P,q)=T_d(P',q)$.
It is of no importance
that the prolonging of processes can be made in many ways; many
programs can play the role of $P'$ to a specific program $P$.
By the construction we thus have $T_d(P,q)=T_d(P',q)$.
However, since introducing more synchronization and prolonging processes never
shortens the execution time, for other allocations the execution time is
either increased or unchanged.
In particular, for optimal cluster allocation
we therefore have $T_c(P,k,u)\leq T_c(P',k,u)$. Consequently,
$$\frac{T_c(P,k,u)}{T_d(P,q)}\leq\frac{T_c(P',k,u)}{T_d(P',q)}.$$
Certainly there are programs $P$ which are left unchanged by
the above transformation: programs such that $P=P'$.
Since these programs constitute a subset of the parallel programs
we consider, we actually have
$$\max_P \frac{T_c(P,k,u)}{T_d(P,q)}=\max_P \frac{T_c(P',k,u)}{T_d(P',q)}.$$
Therefore, in order to calculate the maximum,
only programs of the type $P'$ need to be considered.
We next represent a program $P'$ by an $m\times n$ matrix of 0:s and 1:s
only. Here each process is represented by a column, and each time period
is represented by a row. The entry at the position $(i,j)$ of the matrix
is 1 if the $j$:th process
is active between $t_{i-1}$ and $t_i$; if it is inactive the entry is 0.
Each row contains exactly $q$ 1:s, since each processor is constantly busy.
In the sequel such a matrix is referred to as an $m, n, q$-type matrix.
The main part of the report analyzes these matrices. For example we
characterize the type of matrix which corresponds to the worst case.
Because of the complete synchronization,
each row has to be completed before the next
row. In the time unit given by the interval length, the execution time of the program $P'$ with optimal dynamic allocation is $m$.
For the formula for $s$ there are exactly $x_i$ rows which have $i$ 1:s,
$\sum_{i=1}^{n}x_i=m$. The matrix for this case is composed of $n$
matrices of $x_i, n, i$-type, $i=1,...,n$.
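As a small illustration (a Python sketch of ours, not part of the report's formalism), the parallel profile of such a matrix representation can be read off directly by counting the 1:s in each row; we assume every row has at least one 1, as in the model above.

```python
from fractions import Fraction

def parallel_profile(P):
    """Parallel profile v=(v_1,...,v_n) of an m x n 0,1-matrix P:
    v_i is the share of the m rows (time intervals) in which
    exactly i processes are active. Assumes every row has >= 1 one."""
    m, n = len(P), len(P[0])
    counts = [0] * n
    for row in P:
        counts[sum(row) - 1] += 1
    return [Fraction(c, m) for c in counts]
```

For a matrix composed of $x_i,n,i$-type blocks this returns exactly $v_i=x_i/m$.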
Our next objective is to compute the optimal cluster execution
time of the program $P'$.
To compute this we need to
decide how the $n$ processes are to be allocated to the $k$ clusters.
Since every process in this case is to be executed within one
cluster only, the cluster allocation corresponds to
a way of grouping the $n$ columns of the matrix
together in $k$ sets, one set for each
cluster. Assume $l$ processes are allocated to a
specific cluster at a specific time interval.
Within a cluster there are $u$ processors and the processes
are allocated dynamically.
Since there is no synchronization within a time interval,
the processes are independent, and we will next
see that the execution time here is $\max(l/u,1)$. The execution time
clearly cannot be lower than $l/u$.
Further, the execution time cannot be lower than $1$
since each process is considered to be completely synchronized to itself.
That is, no part of a process can be executed in parallel with another
part of the same process.
In the case $l\geq u$
the limit $l/u$ is reached, which can be seen by
dividing the execution time for each processor
in $l$ parts of size $1/u$ each. On the first processor each process is
executed during exactly one of the intervals of size $1/u$. On the second
processor this execution scheme is permuted cyclically.
On the $i$:th processor, the same cyclic permutation is carried
out $i-1$ times. With this execution scheme for the processors, the
total execution time is $l/u$ and
no process is executed in parallel with itself.
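The cyclic scheme can be written out explicitly. The following Python sketch (ours, for illustration) builds the schedule for $l\geq u$: slot $s$ on processor $p$ runs process $(s+p)\bmod l$, and each slot lasts $1/u$, so every processor works for time $l/u$, every process receives total time $u\cdot\frac{1}{u}=1$, and no slot runs a process on two processors at once.

```python
def cyclic_schedule(l, u):
    """Schedule l unit-time processes on u processors (l >= u) in time l/u.
    Returns a u x l table: entry [p][s] is the process run by processor p
    during slot s; every slot has length 1/u."""
    assert l >= u
    return [[(s + p) % l for s in range(l)] for p in range(u)]
```

Each of the $u$ rows lists $l$ slots of length $1/u$; within any slot the processors run pairwise distinct processes, which is the property used in the argument above.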
Because of the complete synchronization at the time points $t_i$, processing
cannot proceed before completion of the slowest cluster.
Hence, for each time interval we obtain a maximum of $\max(l/u,1)$ over
the $k$ clusters.
The execution time with cluster allocation, in the time unit given by the
interval length, is the sum of these maxima.
This is the {\em optimal}
cluster execution time $T_c(P',k,u)$ if we have found an
allocation of the $n$ columns
together in $k$ sets which minimizes
the cluster execution time.
If we change the time unit so that $T_d(P,q)=m$, then
$T_c(P',k,u)$ is an integer, and will be denoted by $T(P',k,u)$.
In section 6 we establish a formula for the function $g(n,k,u,q)$
representing the worst case, i.e. for any program $P'$:
$$\frac{T(P',k,u)}{T_d(P',q)}\leq g(n,k,u,q)=\max_P \frac{T_c(P,k,u)}{T_d(P,q)}=\max_P\frac{T(P,k,u)}{m}.$$
From the function $g$ formulas for the
performance functions $s$ and $G$ can be derived.
\section{The matrix problem}
The operator $T$ with no index in the equation
above belongs to the mathematical formulation. It is
a function, to be defined,
taking $0,1$-matrices as arguments and having integers as values.
$T(P,k,u)$ represents the execution time with optimal cluster allocation,
having $k$ clusters with $u$ processors in each,
while $m$ is the dynamic analogue with $q$ processors.
For simplicity of notation we will
frequently write $T(P)$ instead of $T(P,k,u)$.
In $T(P)$, $P$ is an $m\times n$ matrix of 0:s and 1:s only,
such that each row has exactly $q$ 1:s,
and thus $n-q$ 0:s, $1\leq q\leq n$.
We say that a matrix $P$ of this type is of $m,n,q$-type.
Thus any program with $n$ processes which is
executed with dynamic allocation on a multiprocessor with $q$ processors
is represented by an $m,n,q$-type matrix $P$.
Consider an $m,n,q$-type matrix $P$ and a partition $A$
of the $n$ column vectors into $k$ sets.
We will be mostly concerned with partitions where the sizes of the
sets in the partition differ as little as possible. If $n/k$ is an integer
$w$, every set in such a partition has $w$ members.
Denote the integer part of $n/k$, the floor function, by
$\lfloor n/k\rfloor $, and the smallest integer greater than or equal
to $n/k$, the ceiling function, by $\lceil n/k\rceil$.
If $n/k$ is not an integer, the sets in a partition where the sizes differ
as little as possible have $\lfloor n/k\rfloor $ or $\lceil n/k\rceil$
members.
Denote the number of 1:s in cluster $l$ at row $j$ by $c(l,j)$.
As described in the previous section, the execution time with cluster
allocation using the partition $A$ is then $$T_{A}(P,k,u)=\sum_{j=1}^{m}\max\left(\max_{l=1,...,k}c(l,j)/u,\,1\right).$$
We want to choose the partition $A$ so that $T_{A}(P)$ is minimal.
The execution time with optimal cluster allocation is thus
$$T(P)=\min_{\mbox{{\scriptsize all partitions $A$}}}T_{A}(P).$$
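For small instances $T(P)$ can be computed directly from this definition. The following brute-force Python sketch (ours, exponential in $n$ and intended only for checking) minimizes over all assignments of the $n$ columns to $k$ clusters; for $n\geq k$ this gives the same minimum as ranging over partitions into $k$ sets, since an empty cluster never helps.

```python
from fractions import Fraction
from itertools import product

def T(P, k, u):
    """Optimal cluster execution time of an m x n 0,1-matrix P:
    minimum, over all assignments of the n columns to k clusters,
    of the sum over rows of max(max cluster count / u, 1)."""
    n = len(P[0])
    best = None
    for assign in product(range(k), repeat=n):
        t = Fraction(0)
        for row in P:
            counts = [0] * k
            for j, bit in enumerate(row):
                if bit:
                    counts[assign[j]] += 1
            t += max(Fraction(max(counts), u), Fraction(1))
        if best is None or t < best:
            best = t
    return best
```

For example, for the complete matrix with $n=4$, $q=2$ (six rows) one finds $T(P,2,1)=8$, corresponding to the ratio $8/6=4/3$.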
We can now define the performance function $g(n,k,u,q)$:
$$g(n,k,u,q)=\max\left\{\frac{T(P,k,u)}{m}:
P \mbox{ an } m,n,q\mbox{-type matrix}\right\}.$$
The function $g$ is here defined only for $k\leq n$ and $q\leq n$.
The definition is extended to $k>n$ by $g(n,k,u,q)=1$,
and to $q>n$ by $g(n,k,u,q)=g(n,k,u,n)=\max(\lceil\frac{n}{k}\rceil/u,1)$.
This is in accordance with the application
since in these cases we have unnecessary clusters and unnecessary processors,
respectively. That the extension to $k>n$ is appropriate follows
from results in the sequel.
Let $v_1,...,v_n$ be non-negative rational numbers with sum 1, and let
$m$ be an integer which is a multiple of their least common denominator. Then
$$s(n,v,k,u)=\max\frac{T(P,k,u)}{m},$$
where this maximum is taken over
all $m\times n$ $0,1$-matrices $P$ where the number of rows with exactly
$i$ 1:s is $mv_i$.
The most general performance function is
$$G(k,u,q)=\sup_ng(n,k,u,q).$$
\vspace{2mm}
We conclude the preparations for the main result by introducing
some notation relevant in this situation.
We call a matrix $P$ complete if all possible rows,
that is, all $\bin{n}{q}$ permutations of the
$q$ 1:s, occur equally frequently as rows of $P$. The number of rows
is thus necessarily divisible by $\bin{n}{q}$.
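For concreteness, a complete matrix with one copy of each row can be generated as follows (a Python sketch of ours; the rows come out in lexicographic order, but any order works):

```python
from itertools import combinations

def complete_matrix(n, q):
    """The (n choose q) x n 0,1-matrix having each row with
    exactly q 1:s precisely once."""
    return [[1 if j in c else 0 for j in range(n)]
            for c in combinations(range(n), q)]
```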
We also need the following three combinatorial functions.
Let $I$ be a finite sequence of non-negative integers. Then we define:
\begin{description}
\item $b(I)=$ the number of distinct integers in $I$
\item $a(I,j)=$ the number of occurrences of the $j$:th distinct integer in
$I$, enumerated in size order, $1\leq j\leq b(I)$.
\item $\pi(k,w,q,l)$ = the number of permutations of $q$ 1:s distributed
in $kw$ slots, which are divided in $k$ sets with $w$ slots in each,
such that the set with maximum number of 1:s has exactly $l$ 1:s.
\end{description}
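The function $\pi$ can be evaluated by direct enumeration for small parameters, which is useful for checking the closed formula given later. A Python sketch (ours):

```python
from itertools import combinations

def pi(k, w, q, l):
    """Number of placements of q 1:s in k*w slots, divided into k sets
    of w consecutive slots each, such that the fullest set holds
    exactly l 1:s. Brute-force count."""
    count = 0
    for ones in combinations(range(k * w), q):
        per_set = [0] * k
        for s in ones:
            per_set[s // w] += 1
        if max(per_set) == l:
            count += 1
    return count
```

Summing over $l$ recovers $\bin{kw}{q}$, since every placement of the $q$ 1:s has some maximal set size.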
\newpage
\section{Process dependent performance functions}
We can now state the formula for $g(n,k,u,q)$, from which explicit formulas
for the functions $s(n,v,k,u)$ and $G(k,u,q)$ follow.
\begin{thm}
Let $m, n, k, q$ and $u$ be positive integers with $n\geq k$ and $n\geq q$.
In the case where $w=n/k$ is an integer, we have for all matrices $P$ of $m,n,q$-type:
$$T(P,k,u)/m \leq g(n,k,u,q)=\frac{1}{u\bin{n}{q}}
\sum_{l=1}^{\min(w,q)}\max(l,u)\pi(k,w,q,l).$$
If $w=n/k$ is not an integer, we let $w=\lfloor n/k\rfloor $ and
denote the remainder of $n$ divided by $k$ by $n_k$,
i.e. $n_k=n-k\lfloor n/k\rfloor $. Then we have for all matrices
$P$ of $m,n,q$-type:
$$T(P,k,u)/m \leq g(n,k,u,q)=$$
$$\frac{1}{u\bin{n}{q}}
\sum_{l_1=\max(0,\lceil\frac{q-(k-n_k)w}{n_k}\rceil)}^{\min(w+1,q)}
\sum_{l_2=\max(0,\lceil\frac{q-l_1n_k}{k-n_k}\rceil)}^{\min(w,q-l_1)}
\max(l_1,l_2,u)\times$$
$$\sum_{i=\max(l_1,q-l_2(k-n_k))}^{\min(l_1n_k,q-l_2)}
\pi(n_k,w+1,i,l_1)\pi(k-n_k,w,q-i,l_2).$$
In the special cases $\min(q,w)<l$ or $q>kl$, $\pi(k,w,q,l)=0$.
Otherwise, if $k=1$, $\pi(1,w,q,l)=\bin{n}{l}$. In all other cases we have
$$\pi(k,w,q,l)=\bin{w}{l}
\sum_I \bin{w}{i_1}\cdot...\cdot\bin{w}{i_{k-1}}\frac{k!}
{\Pi_{j=1}^{b(\{l,i_1,...,i_{k-1}\})}a(\{l,i_1,...,i_{k-1}\},j)!}.$$
Here the sum is taken over
all sequences of non-negative integers $I=\{i_1,...,i_{k-1}\}$
which are decreasing: $i_j\geq i_{j+1}$ for all $j=1,...,k-2$,
bounded by $l: i_1\leq l$,
and have sum $q-l$: $\sum_{j=1}^{k-1}i_j=q-l$.
For each $m,n,q-$type matrix $P$ the minimum
$$T(P)=\min_{\mbox{{\scriptsize all partitions $A$}}}T_{A}(P)$$
is attained for a partition where the
sizes of the sets in the partition differ as little as possible.
The bound is optimal in the sense that if $\bin{n}{q}$ divides $m$,
in which case there exist complete matrices,
we have $T(P)/m=g(n,k,u,q)$ for all complete matrices $P$.
\end{thm}
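For small parameters the theorem can be checked numerically. The sketch below (our verification code, not part of the proof) computes $g(n,k,u,q)$ in the case where $w=n/k$ is an integer, both from the formula and as $T(P)/m$ for a complete matrix $P$, minimizing by brute force over all assignments of the $n$ columns to $k$ clusters.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

def pi(k, w, q, l):
    # brute-force count of placements of q 1:s in k sets of w slots
    # whose fullest set holds exactly l 1:s
    return sum(1 for ones in combinations(range(k * w), q)
               if max(sum(1 for s in ones if s // w == c)
                      for c in range(k)) == l)

def g_formula(n, k, u, q):
    """The theorem's formula for g(n,k,u,q), case w = n/k integer."""
    w = n // k
    return Fraction(sum(max(l, u) * pi(k, w, q, l)
                        for l in range(1, min(w, q) + 1)),
                    u * comb(n, q))

def g_complete(n, k, u, q):
    """T(P)/m for the complete n,q-matrix P, brute force over all
    assignments of the n columns to k clusters."""
    rows = list(combinations(range(n), q))
    best = None
    for assign in product(range(k), repeat=n):
        t = Fraction(0)
        for row in rows:
            counts = [0] * k
            for j in row:
                counts[assign[j]] += 1
            t += max(Fraction(max(counts), u), Fraction(1))
        if best is None or t < best:
            best = t
    return best / len(rows)
```

For instance, both routes give $g(4,2,1,2)=4/3$ and $g(4,2,2,2)=1$.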
As described earlier $g(n,k,u,q)$ is defined also in the cases
$k>n$ and $q>n$, here we have $g(n,k,u,q)=1$ and
$g(n,k,u,q)=g(n,k,u,n)=\max(\lceil\frac{n}{k}\rceil/u,1)$, respectively.
A consequence of this and of the theorem is that for $k=1$ we have
$g(n,1,u,q)=\max(\min(n,q)/u,1)$.
The first 256 values of $g(n,k,u,q)$, for $1\leq n,k,u,q\leq4$, are
presented in graphics part 2. Part 3 of the graphics
section shows three plots of $g$ representing three cases of increasing
degrees of cluster organization using the same number of processors.
The last factor in the function $\pi$ is obviously a multinomial coefficient;
we have preferred to write it out explicitly.
The following lemma provides an algorithm
which generates all decreasing sequences. It is proved in [2].
We say that the least decreasing sequence of length $\mu$ and sum
$\sigma$ is the sequence
$\{\lceil\frac{\sigma}{\mu}\rceil,...,\lceil\frac{\sigma}{\mu}\rceil,
\lfloor\frac{\sigma}{\mu}\rfloor,...,\lfloor\frac{\sigma}{\mu}\rfloor\}$.
If $\sigma_{\mu}$ is the remainder when $\sigma$ is divided by $\mu$,
the number of $ \lceil\frac{\sigma}{\mu}\rceil$:s is $\sigma_{\mu}$, and
the number of $ \lfloor\frac{\sigma}{\mu}\rfloor$:s is $\mu-\sigma_{\mu}$,
making the sum of the sequence $\sigma$.
\begin{lemma}
Let $\lambda$ and $\sigma$ be non-negative
integers and $\mu$ be a positive integer such that $\lambda\leq
\sigma\leq \lambda\mu$.
Every sequence of $\mu$ integers in the interval $0\leq i \leq \lambda$
which is decreasing, bounded by $\lambda$ and has
sum $\sigma$ is generated exactly once by the following algorithm:
\begin{enumerate}
\item Take $I$ as the least decreasing sequence of length $\mu$ and sum
$\sigma$.
\item Find the {\bf rightmost} position in $I$, say $j_1$, which fulfills:
\begin{enumerate}
\item $i_{j_1}<i_{j_1-1}$, where we put $i_0=\lambda$,
\item $\sum_{j=j_1+1}^{\mu}i_j>0$.
\end{enumerate}
The algorithm terminates if no such $j_1$ can be found.
\item The next sequence is obtained from $I$
by increasing the entry in position $j_1$ by one, and replacing
the subsequence $\{i_{j_1+1},...,i_{\mu}\}$ with the least
decreasing subsequence of length $\mu-j_1$ and sum $\sum_{j=j_1+1}^{\mu}i_j-1$.
\item Go to step 2.
\end{enumerate}
\end{lemma}
Observe that $\sum_{l=1}^{\min(w,q)}\pi(k,w,q,l)=\bin{n}{q}$ in the
case where $n/k$ is an integer, and otherwise
\newpage
$$\sum_{l_1=0}^{\min(w+1,q)}
\sum_{l_2=0}^{\min(w,q-l_1)}
\sum_{i=\max(l_1,q-l_2(k-n_k))}^{\min(l_1n_k,q-l_2)}
\pi(n_k,w+1,i,l_1)\pi(k-n_k,w,q-i,l_2)=\bin{n}{q},$$
since we are only counting all permutations of the $q$ 1:s in the $n$ slots
in different ways. Thus the formula can be regarded as a weighted average of the
numbers $\{\max(l/u,1):l=1,2,...,\min(\lceil n/k\rceil,q)\}$. In the case
$u\geq\min(\lceil n/k\rceil,q)$
all these numbers are $1$; hence in this case we obtain $g(n,k,u,q)=1$.
In multiprocessor terms this corresponds to a collapse
of cluster allocation into dynamic allocation
since one cluster alone is then as large as the first multiprocessor.
\vspace{4mm}
{\em Proof of the theorem:} This proof is similar to the corresponding
proof in [2] - we refer to that proof.
We will next establish a formula for the function $s$ in terms of $g$.
Let $v_1,...,v_n$ be non-negative rational numbers with sum 1, and let
$m$ be an integer multiple of their least common denominator. Then $s$
has been defined as
$$s(n,v,k,u)=\max\frac{T(P,k,u)}{m},$$
where the maximum is taken over
all $m\times n$ $0,1$-matrices $P$ which have $mv_i$ rows with $i$ 1:s.
\begin{thm}
$$s(n,v,k,u)=\sum^{n}_{i=1}v_ig(n,k,u,i).$$
\end{thm}
{\em Proof:} As in the previous proof, given in [2],
the argument of permuting the columns
is performed. This gives a matrix $\tilde{P}$ of $n!m$ rows, where
$n!mv_i$ rows have exactly $i$ 1:s. Among the rows with $i$ 1:s all
$\bin{n}{i}$ permutations occur equally often. Denote the
matrix composed of these rows by $P_i$. Then
Theorem 6.1 gives $T(P_i)=g(n,k,u,i)\,n!mv_i$. We obtain
$$s(n,v,k,u)=\max\frac{T(P,k,u)}{m}=\max\frac{T(\tilde{P},k,u)}{n!m}$$
$$=\frac{g(n,k,u,1)n!mv_1+...+g(n,k,u,n)n!mv_n}{n!m}
=\sum^{n}_{i=1}v_ig(n,k,u,i).$$
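In the single-cluster case $k=1$ the theorem combines with the closed form $g(n,1,u,q)=\max(\min(n,q)/u,1)$ into an expression for $s$ that can be evaluated exactly. A small sketch of our own, using rational arithmetic (the names `g1` and `s1` are ours):

```python
from fractions import Fraction

def g1(n, u, q):
    """Closed form g(n,1,u,q) = max(min(n,q)/u, 1): a single cluster of u processors."""
    return max(Fraction(min(n, q), u), Fraction(1))

def s1(n, v, u):
    """Theorem: s(n,v,1,u) = sum_i v_i g(n,1,u,i).
    v[i-1] is the fraction of rows with i ones; the v_i must sum to 1."""
    assert sum(v) == 1
    return sum(vi * g1(n, u, i) for i, vi in enumerate(v, start=1))
```

For instance, with $n=3$, $u=2$ and $v=(1/3,1/3,1/3)$ the formula gives $s=(1+1+3/2)/3=7/6$.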
\begin{thm}\label{prop} The function $g(n,k,u,q)$ has the following properties:
\begin{enumerate}
\item $g(n,k,u,q)$ is increasing in the variables $n$ and $q$,
and decreasing in the variables $u$ and $k$.
\item For any positive integer $w$,
$g(wk,k,u,q)$ and $g(wk,k,1,k)$ are increasing as functions of $k$.
\item Given positive integers $\nu$, $w\geq u$ and a
positive real number $\alpha$, we have:
%\begin{eqnarray*}
$$\max(\frac{w+1}{u},1)-\frac{1}{u}(1-(\frac{w}{\alpha})^{-(w+1)})^{\nu}$$
$$\geq\lim_{k\rightarrow\infty}g(wk+\nu,k,u,\alpha k)$$
$$\geq \max(\frac{w}{u},1,\max(\frac{w+1}{u},1)
-(1-(\frac{w}{\alpha})^{-(w+1)})^{\nu}).$$
%\end{eqnarray*}
\item The function $g(n,k,u,q)$ is unbounded.
\end{enumerate}
\end{thm}
Note further that $g(n,k,u,q)=1$ if $u\geq\min(\lceil n/k\rceil,q)$ or if $k\geq n$,
and that $g(n,k,u,q)=g(n,k,u,n)=\max(\lceil n/k\rceil/u,1)$ for all $q\geq n$.
The upper and lower bounds of the functions
$D(w,\nu)=$ $\lim_{k\rightarrow\infty}g(wk+\nu,k,1,k)$
formulated in the theorem are plotted in part 6 of the graphics section
for $\nu=2,\ldots,6$.
\vspace{4mm}
{\em Proof:} (1): $g(n,k,u,q)=
\min_{\mbox{{\scriptsize all partitions $A$}}}T_{A}(P)/m$, where $P$ is
a complete $m,n,q$-type matrix. Let $A$ be a partition
where the sizes of the sets differ as little as possible.
That $g$ is increasing as a function of $n$ follows by adding to the matrix
$P$ one column of zeros only. We then obtain the non-complete
$m,n+1,q$-type matrix $P_0$. This column is added
to the partition $A$, producing $A_0$, in such a way
that $A_0$ also has sizes of the sets which differ as little as
possible. By the proof of theorem 2 it follows that
$g(n,k,u,q)=T_{A}(P)/m=T_{A_0}(P_0)/m\leq T_{A_0}(\tilde{P_0})/m=g(n+1,k,u,q),$
where $\tilde{P_0}$ is a complete matrix of $m,n+1,q$-type.
We can choose $m$ so that it is divisible by both $\bin{n}{q}$ and
$\bin{n+1}{q}$.
For every $\epsilon>0$ there is a domain
$\{(n,k): n>wk,\ n<(w+1)k+\nu_w,\ k>\mu_w\}$ on which
$\abs{g(n,k,u,\alpha k)-w/u}<\epsilon$.
The quantities $\nu_w$ and $\mu_w$ appear to increase very rapidly with $w$.
The function $g(n,k)=g(n,k,1,k)$ which compares static with dynamic allocation
is neither increasing nor decreasing as a function of $k$ (see [2]).
A plot of the local extreme values of $g(n,k)$ as a function of $k$,
for each fixed $n$, is shown in graphics part 8. This plot reveals
detail structure that is otherwise not visible.
The optimal performance function $g(n,k)$
thus has a shape resembling a winding staircase with constant step height
and an infinite number of steps,
where each step is narrower, less sharp-edged and much more distant
from the origin than the previous step.
\section{The process independent performance function}
It turns out that it is also possible to
compute explicitly the optimal performance function $G(k,u,q)$:
\begin{thm}
For any positive integers $k, u$ and $q$,
$$G(k,u,q)=\lim_{n\rightarrow\infty}g(n,k,u,q)=\sup_{n\in{\bf N}}g(n,k,u,q),$$
where $G(1,u,q)=\max(q/u,1)$, and if $k>1$:
$$G(k,u,q)=\frac{k!q!}{k^qu}\sum_{l=1}^{l=q}\frac{\max(l,u)}{l!}\sum_I
(\Pi_{j=1}^{k-1}i_j!\,
\Pi_{j=1}^{b(\{l\}+I)}a(\{l\}+I,j)!)^{-1}.$$
The sum is taken over
all sequences $I=\{i_1,...,i_{k-1}\}$ of non-negative integers
which are decreasing: $i_j\geq i_{j+1}$ for all $j=1,...,k-2$,
bounded by $l$: $i_1\leq l$, and have sum $q-l$: $\sum_{j=1}^{k-1}i_j=q-l$.
Further, the function $G(k,u,q)$ is decreasing as a function of $k$ and
as a function of $u$, and increasing as a function of $q$.
$G(k)=G(k,1,k)$ is an increasing function.
\end{thm}
Here we use the notation $\{l\}+I=\{l,i_1,...,i_{k-1}\}$.
In part 1 of the graphics section
level curves of the function $G(k,u,q)$ are depicted. Part 4
and 5 show two ways in which the function $G$ can be used to
compare the performance of allocation strategies.
In the case of static allocation and $k=q$, the case
considered in [2], we get $G(k)=G(k,1,k)$, where
$$G(k)=\frac{(k!)^2}{k^k}\sum_{l=1}^{l=k}\frac{1}{(l-1)!}\sum_I
(\Pi_{j=1}^{k-1}i_j!\,
\Pi_{j=1}^{b(\{l\}+I)}a(\{l\}+I,j)!)^{-1}.$$
The last sum is taken over the same sequences $I$ as in the previous formula,
with $q=k$.
Note that
the number of terms in the sum increases rapidly with $k$. However,
being unit fractions with products of factorials as denominators,
all terms are very small if $k$ is large.
From the Stirling formula it indeed follows that $(k!)^2\gg k^k\gg k!$
for large $k$.
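The Stirling comparison is easy to confirm numerically; a minimal sketch (the function name is ours):

```python
from math import factorial

def stirling_gaps(k):
    """Return ((k!)^2 / k^k, k^k / k!), exhibiting (k!)^2 >> k^k >> k! for large k."""
    kf = factorial(k)
    return kf * kf / k**k, k**k / kf
```

Both quotients are already huge at $k=20$.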
We give the first values
of $G(k)$ as maximally reduced rational numbers, prime factorizations and decimal expansions:
\begin{description}
\item $G(1)=1$
\item $G(2)=\frac{3}{2}=1.5$
\item $G(3)=\frac{17}{9}=\frac{17}{3^2}=1.888...$
\item $G(4)=\frac{17}{8}=\frac{17}{2^3}=2.125$
\item $G(5)=\frac{1429}{625}=\frac{1429}{5^4}=2.2864$
\item $G(6)=\frac{3121}{1296}=\frac{3121}{2^43^4}=2.40818...$
\item $G(7)=\frac{295189}{117649}=\frac{211\cdot1399}{7^6}=2.50907...$
\item $G(8)=\frac{680849}{262144}=\frac{13\cdot83\cdot631}{2^{18}}=2.597232...$
\item $G(9)=\frac{38404547}{14348907}=\frac{43\cdot107\cdot491\cdot1717}{3^{15}}=2.67648...$
\item $G(10)=\frac{274868911}{100000000}=\frac{3929\cdot69959}{2^85^8}=2.74868911.$
\end{description}
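The values above can be reproduced from the formula for $G(k,u,q)$ using exact rational arithmetic. A sketch of our own, not the authors' software (the multiplicities $a(\{l\}+I,j)$ again come from a `Counter`):

```python
from fractions import Fraction
from math import factorial
from collections import Counter

def decreasing_sequences(mu, bound, total):
    """All decreasing sequences of length mu with entries in [0, bound] and sum total."""
    if mu == 0:
        if total == 0:
            yield ()
        return
    for first in range(min(bound, total), -1, -1):
        if first * mu < total:
            break
        for rest in decreasing_sequences(mu - 1, first, total - first):
            yield (first,) + rest

def G(k, u=1, q=None):
    """Exact G(k,u,q) from the theorem's formula; G(k) = G(k,1,k)."""
    if q is None:
        q = k
    if k == 1:
        return max(Fraction(q, u), Fraction(1))
    total = Fraction(0)
    for l in range(1, q + 1):
        inner = Fraction(0)
        for I in decreasing_sequences(k - 1, l, q - l):
            denom = 1
            for i in I:
                denom *= factorial(i)          # prod_j i_j!
            for c in Counter((l,) + I).values():
                denom *= factorial(c)          # prod_j a({l}+I, j)!
            inner += Fraction(1, denom)
        total += Fraction(max(l, u), factorial(l)) * inner
    return Fraction(factorial(k) * factorial(q), k**q * u) * total
```

Since all arithmetic is done with `Fraction`, the results are the maximally reduced rational numbers listed above.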
In part 7 of the graphics section the function $G(k)$ and
a quantitative plot of the limiting procedure
$\lim_{n\rightarrow\infty}g(n,k)=G(k)$ are shown.
\vspace{5mm}
{\em Proof:} By Theorem \ref{prop}
the function $g(n,k,u,q)$ is increasing as a function of $n$, hence
the limit and the supremum in the theorem coincide. Further it follows that
$\lim_{n\rightarrow\infty}g(n,k,u,q)=\lim_{w\rightarrow\infty}g(wk,k,u,q)$.
Note also that exactly the same decreasing sequences are generated in
the function $\pi(k,w,q,l)$ for all $w\geq q$, if $l$ and $k$ are fixed. This fact
makes it possible to explicitly compute the function $G(k,u,q)$.
When letting $w\rightarrow\infty$, it is enough to study the behaviour of
the $w$-dependent part of each term, which is
$$\frac{\bin{w}{l}\bin{w}{i_1}\cdot...\cdot\bin{w}{i_{c}}}{\bin{wk}{q}}.$$
We will next prove that
$$\lim_{w\rightarrow\infty}
\frac{\bin{w}{l}\bin{w}{i_1}\cdot...\cdot\bin{w}{i_{c}}}{\bin{wk}{q}}=
\frac{q!k^{-q}}{l!i_1!\cdot...\cdot i_c!}.$$
Here $c$ is the number of nonzero entries in the sequence $I$. In the following
we will use the fact that $l+i_1+...+i_c=q$.
From the Stirling formula $n!\approx(\frac{n}{e})^n\sqrt{2\pi n}$, we get for large $n$:
$$\bin{n}{k}\approx\frac{e^{-k}}{k!}\sqrt{\frac{n}{n-k}}\frac{n^n}{(n-k)^{n-k}}.$$
Thus the quotient approximately equals
$$\frac{e^{-l}}{l!}\sqrt{\frac{w}{w-l}}\frac{w^w}{(w-l)^{w-l}}
\frac{e^{-i_1}}{i_1!}\sqrt{\frac{w}{w-i_1}}\frac{w^w}{(w-i_1)^{w-i_1}}\cdot...$$
$$\cdot\frac{e^{-i_c}}{i_c!}\sqrt{\frac{w}{w-i_c}}\frac{w^w}{(w-i_c)^{w-i_c}}
\frac{q!}{e^{-q}}\sqrt{\frac{kw-q}{kw}}\frac{(kw-q)^{kw-q}}{(kw)^{kw}}=$$
$$
\frac{q!}{l!i_1!\cdot...\cdot i_c!}$$
$$\sqrt{\frac{w}{w-l}\frac{w}{w-i_1}\cdot...\cdot\frac{w}{w-i_c}\frac{kw-q}{kw}}$$
$$(1-\frac{q}{kw})^{kw}\frac{w^{(c+1)w}}{(w-l)^{w-l}(w-i_1)^{w-i_1}\cdot...\cdot(w-i_c)^{w-i_c}}(kw-q)^{-q}.$$
The first line after the last equality is clearly independent of $w$. The second
line tends to $1$. The first factor of the third line tends to $e^{-q}$. The second
factor can be written as
$$(1+\frac{l}{w-l})^w(1+\frac{i_1}{w-i_1})^w\cdot...\cdot(1+\frac{i_c}{w-i_c})^w.$$
Now,
$$(1+\frac{i}{w-i})^w\rightarrow e^i\mbox{ as } w\rightarrow\infty,$$
so we get
$$e^{l}e^{i_1}\cdot...\cdot e^{i_c}=e^q$$
as $w\rightarrow\infty$, cancelling the factor $e^{-q}$. From the third factor we get
$$(\frac{kw-q}{w-l})^{-l}(\frac{kw-q}{w-i_1})^{-i_1}\cdot...\cdot(\frac{kw-q}{w-i_c})^{-i_c}\rightarrow k^{-l}k^{-i_1}\cdot...\cdot k^{-i_c}=k^{-q}.$$
Thus the
third line above contributes with a factor $k^{-q}$ as $w\rightarrow\infty$.
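The limit just derived can be checked numerically for any particular sequence $(l,i_1,...,i_c)$ with sum $q$. The sketch below is our own, with arbitrarily chosen parameters in the checks; it compares the binomial quotient at large $w$ with the claimed limit $q!k^{-q}/(l!\,i_1!\cdot...\cdot i_c!)$:

```python
from math import comb, factorial

def binom_ratio(w, k, q, parts):
    """C(w,l) C(w,i_1) ... C(w,i_c) / C(kw, q) for parts = (l, i_1, ..., i_c)."""
    num = 1
    for p in parts:
        num *= comb(w, p)
    return num / comb(k * w, q)

def claimed_limit(k, q, parts):
    """The limit q! k^{-q} / (l! i_1! ... i_c!) derived above; parts must sum to q."""
    denom = 1
    for p in parts:
        denom *= factorial(p)
    return factorial(q) / (k**q * denom)
```

Already at $w=10^6$ the quotient agrees with the limit to several decimal places.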
Since $g(n,1,u,q)=\max(\min(n,q)/u,1)$, it immediately
follows that $G(1,u,q)=\max(q/u,1)$. Note that this case represents
the worst case ratio of dynamic versus dynamic allocation, with
$u$ and $q$ processors respectively.
This completes the proof of the formula for $G(k,u,q)$.
The increasing properties for $G(k,u,q)$ and $G(k)$ follow from (3) and
(4) of theorem \ref{prop}. For example,
the inequality $g(wk,k,u,q)\leq g(w(k+1),k+1,u,q)$ transforms into the inequality $G(k,u,q)\leq G(k+1,u,q)$ as $w\rightarrow\infty$.
\vspace{4mm}
We conclude with some notes on the computability of the
functions.
The variable $u$ does not affect the number of floating point operations
necessary for computing the value of the function $g(n,k,u,q)$.
This number increases slightly with $q$ and $n$, if $n$ is a
multiple of $k$, because of the larger values in the binomials
and factorials to be computed. However, the number of operations increases
rapidly with the variable $k$ since the number of decreasing sequences
increases rapidly. This is also true for the variable $n$ if it is not
a multiple of $k$. The function $h(n,k,u,q)=
g(k\lceil n/k\rceil,k,u,q)$ is thus substantially faster to compute, since
only a fraction of the values are needed, and those which are
used are the computationally fastest. Part 9 of the
graphics section shows $g(n,k)$, $h(n,k)=g(k\lceil n/k\rceil,k,1,k)$ and
the difference $h(n,k)-g(n,k)$ for $n,k\leq50$.
Since $g(n,k)$ is increasing as a function
of $n$ we obviously have $h(n,k)\geq g(n,k)$, with equality at
the straight lines $n=wk$ for any positive integer $w$.
\vspace{4mm}
{\bf References:}
\begin{description}
\item[{[1]}] R. L. Graham, {\em Bounds on Multiprocessing Anomalies},
SIAM Journal on Applied Mathematics, Vol. 17, No. 2, March 1969, pp. 416-429.
\item[{[2]}] H. Lennerstad, L. Lundberg, {\em An Optimal Execution Time
Estimate of Static versus Dynamic Allocation in Multiprocessor Systems},
SIAM Journal on Computing, Vol. 24, No. 4, pp. 751-764, August 1995.
\item[{[3]}] H. Lennerstad, L. Lundberg, {\em Optimal Scheduling Results for Parallel Computing},
SIAM News, Vol. 27, No. 7, 1994. In {\em Applications of
Advanced Architecture Computers}, ed. Greg Astfalk, SIAM, 1996.
\item[{[4]}] H. Lennerstad, L. Lundberg, {\em Combinatorics for Multiprocessor
Scheduling Optimization and Other Contexts in Computer Architecture},
Proceedings of the Conference of Combinatorics and Computer Science, Brest,
France, 1995. In Lecture Notes in Computer Science 1120, Springer Verlag, 1996.
\item[{[5]}] H. Lennerstad, L. Lundberg, {\em Combinatorial Formulas for
Optimal Cache Memory Efficiency}, SIAM News, 1996.
\item[{[6]}] L. Lundberg, H. Lennerstad, {\em An Optimal Bound on
the Gain of Using Dynamic versus Static Allocation in Multiprocessor Computers}, Technical Report, Univ. of Karlskrona/Ronneby (1992).
\item[{[7]}] L. Lundberg, H. Lennerstad, {\em An Optimal Performance Bound
on the Gain of Using Dynamic versus Static Allocation in Multiprocessors
with Clusters}, Technical Report, Univ. of Karlskrona/Ronneby (1993).
\item[{[8]}] L. Lundberg, H. Lennerstad, {\em An Optimal Upper Bound on
the Minimal Completion Time in Distributed Supercomputing},
Proceedings of the 8th ACM Conference on Supercomputing, Manchester, England, July 1994.
\item[{[9]}] L. Lundberg, H. Lennerstad, {\em An Optimal Lower Bound on the
Maximal Speedup in Multiprocessors with Clusters}, First International
Conference on Algorithms and Architectures for Parallel Processing,
Australia, April 1995.
\item[{[10]}] L. Lundberg, H. Lennerstad, {\em An optimal bound on the gain of
using one large processor cluster instead of a number of small clusters}, Proceedings of the ISCA International Conference on Parallel and Distributed
Computing Systems, Orlando, Fl, 21-23 Sept 1995.
\item[{[11]}] L. Lundberg, H. Lennerstad, {\em Optimal worst case formulas
comparing cache memory associativity}, Research Report 5/95, Univ. of Karlskrona/Ronneby (1995).
\item[{[12]}] L. Lundberg, {\em Static Process Allocation Using Information
about Program Behavior}, Proceedings of the 24th Hawaii International
Conference on System Sciences, Hawaii, pp. 1-9 (1991).
\item[{[13]}] C.C. Price, {\em Task Allocation in Data Flow Multiprocessors: an annotated bibliography}, Comput. Archit. News, vol. 19, no. 1, (March 1991) pp. 128-134.
\item[{[14]}] Proceedings of the Twentieth Southeastern Conference on Combinatorics, Graph theory and Computing, Boca Raton, FL (1989).
\item[{[15]}] Proceedings of the Twenty-second Southeastern Conference on Combinatorics, Graph theory and Computing, Baton Rouge, LA, (1991).
\item[{[16]}] B. Shirazi, Ming-fang Wang, {\em Heuristic Functions for Static Task Allocation}, Microprocess. Microprogr. (Netherlands) Vol 26, no. 3, (Oct 1989) pp. 187-194.
\item[{[17]}] J. Zahorjan, C. McCann, {\em Processor Scheduling in Shared Memory Multiprocessors}, Performance Evaluation Review (USA), vol. 18, no. 1, (May 1990).
\end{description}
\end{document}