\documentstyle[11pt]{article}
\newtheorem{thm}{THEOREM}
\newtheorem{lemma}[thm]{LEMMA}
\newtheorem{cor}[thm]{COROLLARY}
\newtheorem{remark}[thm]{REMARK}
\newcommand{\ap}{\mbox{$A_{p}$}}
\newcommand{\dD}{\mbox{$\partial D$}}
\newcommand{\dx}{\mbox{$\partial /\partial x$}}
\newcommand{\dy}{\mbox{$\partial /\partial y$}}
\newcommand{\ainf}{\mbox{$A_{\infty}$}}
\newcommand{\binf}{\mbox{$B_{\infty}$}}
\newcommand{\pf}{{\em Proof:\ }}
\newcommand{\lcd}{Lipschitz crack domain}
\newcommand{\ld}{Lipschitz domain}
\newcommand{\ou}{\mbox{$\cal O $}}
\newcommand{\oet}{\mbox{${\cal O} _{1}$}}
\newcommand{\ot}{\mbox{${\cal O} _{2}$}}
\newcommand{\ru}{\mbox{$\r^{+}$}}
\newcommand{\cu}{\mbox{$\C^{+}$}}
\newcommand{\lu}{\mbox{$\Lambda$}}
\newcommand{\lt}{\mbox{$\tilde{\Lambda}$}}
\newcommand{\fu}{\mbox{$\varphi$}}
\newcommand{\un}{\mbox{$\partial u/\partial n$}}
\newcommand{\vn}{\mbox{$\partial v/\partial y$}}
\newcommand{\fpu}{\mbox{$\Phi_{\pm}$}}
\newcommand{\ppu}{\mbox{$\Psi_{\pm}$}}
\newcommand{\pu}{\mbox{$\Psi$}}
\newcommand{\f}{\mbox{$\Phi_{1}$}}
\newcommand{\ft}{\mbox{$\Phi_{2}$}}
\newcommand{\lpd}{\mbox{$L^{p}(\partial D,ds)$}}
\newcommand{\lpr}{\mbox{$L^{p}(\r,d\Phi)$}}
\newcommand{\lp}{\mbox{$L^{p}(\Lambda,d\nu)$}}
\newcommand{\hpu}{\mbox{$H^{p}(\ou,d\nu)$}}
\newcommand{\hp}{\mbox{$H^{p}({\cal O},d\nu)$}}
\newcommand{\C}{\mbox{${\bf C}$}}
\newcommand{\r}{\mbox{${\bf R}$}}
\newcommand{\gaz}{\mbox{$\Gamma_{K}(z)$}}
\newcommand{\norm}[2] {\mbox{$\parallel #1 \parallel _{#2}$}}
\newcommand{\bin}[2]{\left({#1\atop #2}\right)}
\newcommand{\ra}{\rightarrow}
\newcommand{\abs}[1] {\mbox{$\,\mid\! #1 \!\mid\,$}}
\newcommand{\gn}{\mbox{$\partial g/\partial n$}}
\newcommand{\wn}{\mbox{$\partial w/\partial n$}}
\newcommand{\dist}{\mbox{dist}}
\newcommand{\re}{\mbox{Re}\,}
\newcommand{\im}{\mbox{Im}\,}
\newcommand{\h}{\mbox{$\frac{1}{2}$}}
\newcommand{\tp}{\mbox{$\tilde{\Phi}$}}
\newcommand{\sign}{\mbox{sign}\,}
\newcommand{\var}{\mbox{Var}}
%\renewcommand{\baselinestretch}{2}
\begin{document}
\title{An Optimal Execution Time Estimate \\
of Static versus Dynamic Allocation\\
in Multiprocessor Systems}
\author{H\aa kan Lennerstad\thanks{Both authors were supported in part
by the Blekinge Research Foundation.}\\
Dept. of Mathematics\\
Univ. of Karlskrona/Ronneby\\
S-371 79 Karlskrona
\and
Lars Lundberg$^*$\\
Dept. of Computer Science\\
Univ. of Karlskrona/Ronneby\\
S-372 25 Ronneby}
\date{}
\maketitle
\section{Abstract}
Consider a multiprocessor with $k$ identical processors,
executing parallel programs consisting of $n$ processes.
Let
$T_s(P)$ and $T_d(P)$ denote the execution times for the program $P$
with optimal static and dynamic allocations respectively, i.e. allocations giving minimal execution time.
We derive a general and explicit formula for the
maximal execution time ratio $g(n,k)=\max T_s(P)/T_d(P)$,
where the maximum is taken over all programs $P$ consisting of $n$ processes.
Any interprocess dependency structure for the programs $P$ is allowed,
only avoiding deadlock.
Overhead for synchronization and reallocation is neglected.
Basic properties of the function $g(n,k)$ are established, from which we obtain
a global description of the function. Plots of $g(n,k)$ are included.
The results are obtained by investigating a mathematical
formulation. The mathematical tools involved are essentially tools of
elementary combinatorics. The formula is a combinatorial
function applied to certain extremal matrices corresponding to
extremal programs. It is mathematically
complicated but rapidly computed for reasonable $n$ and $k$, in contrast to
the NP-completeness of the problems of finding optimal allocations.
\section{Introduction}
In this report an optimal bound of the efficiency of using
static compared to dynamic allocation in parallel
computing is derived.
In static allocation no process reallocation is allowed from
the processor where a process was initiated. Dynamic allocation allows
unlimited reallocation.
Consider a multiprocessor with $k$ processors. We calculate the quotient
$$g(n,k)=\max_P \frac{T_s(P)}{T_d(P)},$$
where the maximum is taken over all programs $P$ with $n$ processes.
The processes of the programs $P$ may have any set of runtimes and
any possible structure of interprocess dependency.
$T_s(P)$ and $T_d(P)$ are the execution times for the program $P$ with
optimal static and optimal dynamic allocations, respectively,
i.e. allocations for which the execution time is minimal.
Besides the allocations,
the function $g(n,k)$ itself is optimal, since it is a bound of the above ratio
which cannot be improved; the term ``optimal'' thus appears in two senses.
The problem is fully defined in section 3.
The mathematical aspects of the subject are focused in this report.
The same problem is treated from a
computer science point of view in [6]. That report also contains
a more detailed presentation on how the problem occurs in parallel
computing, and on the significance of the problem for parallel processing.
\vspace{5mm}
Since the problems of finding optimal allocations are NP-complete,
general and preferably optimal results are needed already
for moderate numbers of processors and processes in order
to choose a multiprocessor architecture which optimizes performance.
There appear to be no general results concerning allocation strategies
of parallel programs other than the results by Graham [3]. The overhead
for process reallocation and synchronization is neglected also in that work.
In [3] it is proved that in the
unsynchronized case, i.e. when there are no interprocess dependencies,
the optimal worst case ratio of static versus dynamic allocations is 2, taken
over all parallel programs consisting of any number of processes.
So-called self-scheduling algorithms are also considered.
This term is used for dynamic allocation
algorithms where, when a processor becomes idle and there are waiting
executable processes, one of these is immediately allocated to this processor.
It is established in [3] that the execution time for a program allocated with
a self-scheduling algorithm is never higher than two times the
completion time with optimal dynamic allocation.
The allocation scheme is an important feature for the
performance of a multiprocessor, thus an immediate area of applications
for the present results is multiprocessor design.
The results can further be used to evaluate the efficiency of
allocation algorithms versus the best possible algorithm. By using
the estimates for the allocation obtained by
a self-scheduling algorithm versus an optimal dynamic allocation
mentioned above ([3]), the NP-completeness in
finding the optimal dynamic allocation can be avoided, and an estimate of the
execution time with optimal static allocation is obtained.
This is optimal except for at most a factor of 2. The
estimate can be improved by running several different self-scheduling
algorithms. So-called free-fly algorithms, which allocate
parallel programs which are not completely known at start, can also
be evaluated in this way in the common case when the complete program
is known at completion. The case of average ratio between
the allocation strategies is not considered here.
We believe that this is a problem which requires a different approach.
A consequence of the generality of the present results is
that no specific knowledge about a parallel program can be used to
improve the estimate. However, it is expected that the
techniques presented here can be extended to take advantage of
various kinds of program specifics, thus improving the bounds by
keeping away from those programs which maximize the ratio studied in
this work.
In effect we present a methodology to obtain optimal
control of NP-complete scheduling problems.
Taking the reallocation overhead into account obviously favors static
allocation. The significance of this is strongly program dependent,
although possibly a feature which can be analyzed by similar mathematical
tools.
\vspace{5mm}
We next give an overview of the report.
In the following section the allocation problem is described and analyzed in
detail.
It is formulated as a mathematical problem about so-called
$0,1$-matrices, i.e. matrices where all entries are $0$ or $1$.
In section 4 we give a full formulation of the mathematical problem and
introduce necessary notation.
Section 5 contains the main result; here the formula for the function
$g(n,k)$ is stated and proved. Results which give
a global description of the function $g(n,k)$ are presented in section 6.
Finally plots describing the function $g(n,k)$ are included.
The references [1], [2], [9] and [10] present theoretical results on
$0,1$-matrices.
The references [3], [4], [5], [7], [8], [11], [12]
and [13] are a selection of reports
where scheduling problems are analyzed. However none of these, with the exception
of [3] as described above, appear to
be useful for the present formulation and solution of the problem.
\section{The allocation problem}
In this context the only difference between static and dynamic allocations is
that in the dynamic case processes, after having been started, can be
transferred to another processor. The cost of this transference is
neglected. In the static case processes are always processed to the end
on the processor on which they were initialized. If a process is put into a
waiting state, it will later be restarted on the same processor.
A program $P$ consists of $n$ processes of possibly very different
execution times. The processes are of course usually dependent on each other.
One can expect dependencies of the type that process $i$ cannot
execute further at the time point $t_i$ unless process
$j$ has reached the time point $t_j$. When process $j$ has
reached the time point $t_j$
it is said to execute a synchronizing signal to process $i$,
restarting this process. Certainly there can be many synchronizing signals
to a time point $t_j$, in which case all have to be executed before
the process restarts.
The execution time of synchronizing signals is neglected.
Most parallel programs contain many synchronizing signals. In this
report any set of synchronization signals is allowed, as long as
the parallel program is executable. Our only assumption
about the structure of synchronizing signals is that it is consistent,
i.e. the program can, for example when having $n$ processors, execute to the end
without violating any synchronizing signal.
Thus a parallel program $P$ of $n$ processes is defined by the total execution
times of the $n$ processes and by the set of synchronizing signals.
Now consider a parallel program $P$. Assume that we have found an optimal dynamic
allocation, with execution time $T_d(P)$. This optimal dynamic allocation
will be kept fixed during the entire following argument
dealing with the program $P$ and its descendant $P'$.
Next we introduce a discretization
of the time interval in subintervals $(t_i,t_{i+1})$ of equal length, such
that all synchronizing signals, process initiations and process terminations
appear on the time points $t_i$, where
$t_i=\frac{i}{m}T_d(P), i=0,...,m$. Obviously
all processes in the interval $(t_{i-1},t_i)$ are completed before
any part of the processes corresponding to the interval $(t_i,t_{i+1})$
when using this allocation, since this is so without the discretization.
Such a discretization is possible if all synchronizing
signals and process terminations occur at rational time points, which we can assume. Observe that $m$ might be very large even if the program $P$ is small and has a simple structure. However, $m$ plays no important role in the theory.
Consequently, during a time interval $(t_i,t_{i+1})$, no process of the program $P$ starts, and no process stops.
From the program $P$ we next construct another program $P'$
by two changes of the program $P$: we introduce
new synchronizing signals and prolong certain processes. At every time point
$t_i$ we introduce all possible synchronization between the processes. This
means that the synchronization now requires that all processes in the interval
$(t_{i-1},t_i)$ have to be completed before
any part of the processes corresponding to the interval $(t_i,t_{i+1})$,
which will increase the execution time for most other allocations.
Since the execution time of synchronizing signals is neglected this does
not change the total execution time with the fixed optimal dynamic
allocation, which is $T_d(P)$. Further, all processors are made to be
busy at all time intervals. This is achieved by, if necessary,
prolonging some processes.
However no process is prolonged beyond $T_d(P)$, hence $T_d(P)=T_d(P')$.
It is of no importance
that the transformation from $P$ to $P'$ can be made in many ways; many
programs can play the role of $P'$ to a specific program $P$.
By the construction we thus have $T_d(P)=T_d(P')$.
However, since introducing more synchronization and prolonging processes never
shortens the execution time, for other allocations the execution time is
either increased or unchanged.
In particular, for optimal static allocation
we therefore have $T_s(P)\leq T_s(P')$. Consequently,
$$\frac{T_s(P)}{T_d(P)}\leq\frac{T_s(P')}{T_d(P')}.$$
Certainly there are programs $P$ which are left unchanged by
the above transformation: programs such that $P=P'$.
Since these programs constitute a subset of the parallel programs
we consider, we actually have
$$\max_P \frac{T_s(P)}{T_d(P)}=\max_P \frac{T_s(P')}{T_d(P')}.$$
Therefore, in order to calculate the maximum,
only programs of the type $P'$ need to be considered.
We next represent a program $P'$ with optimal dynamic allocation
on a multiprocessor with $k$ processors by an $m\times n$ matrix of 0:s and 1:s
only. Here each process is represented by a column, and each time period
is represented by a row. The entry at the position $(i,j)$ of the matrix
is 1 if the $j$:th process
is active between $t_{i-1}$ and $t_i$; if it is inactive the entry is 0.
Each row contains exactly $k$ 1:s, since each processor is constantly busy.
In the sequel such a matrix is referred to as an $m, n, k$-type matrix.
The main part of the report analyzes these matrices. For example we
characterize the type of matrix which corresponds to the worst case.
Because of the complete synchronization,
each row has to be completed before the next
row. The optimal dynamic allocation of the program $P'$ represented in this way
is $m$, i.e. the time unit is changed to $T_d(P)/m$.
What is the optimal static execution time of the program $P'$?
To compute this we need to
decide how the $n$ processes are to be allocated to the $k$ processors.
Since every process in the static case is to be executed on one
processor only, the static allocation corresponds to
a way of grouping the $n$ columns of the matrix
together in $k$ sets, one set for each
processor. Because of the complete synchronization, at each
step the processors have to wait for the slowest processor.
This is the processor which has the highest number of processes to
execute, i.e. the maximum number of 1:s in one group. Hence the static
execution time is the sum of the row maxima, multiplied by the factor
$T_d(P)/m$. This is the {\em optimal}
static execution time $T_s(P')$ if we have found the best allocation, i.e.
a way of grouping the $n$ columns together in $k$ sets which minimizes
the static execution time. In the following we denote
$$T(P)=T_s(P)\frac{m}{T_d(P)},$$
i.e. we compute the time in the time unit $T_d(P)/m$.
In the main result of the report we give a formula for the function $g(n,k)$
representing the worst case, i.e. for any parallel program $P_0$:
$$\frac{T_s(P_0)}{T_d(P_0)}\leq g(n,k)=\max_P \frac{T_s(P)}{T_d(P)}=\max_P\frac{T(P)}{m}.$$
\section{The mathematical formulation}
As described in the previous section,
the corresponding mathematical problem can be formulated as follows:
Consider an $m\times n$ matrix $P$ of 0:s and 1:s only,
such that each row has exactly $k$ 1:s,
and thus $n-k$ 0:s, $1\leq k\leq n$. These matrices are referred to as
$m,n,k$-type matrices.
The column vectors of $P$ will sometimes be denoted by $v_i$.
Consider a partition
$A$ of the $n$ vectors $v_i$ into $k$ sets. Observe that
the number of sets equals the number of 1:s on each row in $P$.
We will be mostly concerned with partitions where the sizes of the
sets in the partition differ as little as possible. If $n/k$ is an integer
$w$ this means that every set has $w$ members.
Denote the integer part of $n/k$, the floor function, by
$\lfloor n/k\rfloor $, and the smallest integer greater than or equal
to $n/k$, the ceiling function, by $\lceil n/k\rceil$.
If $n/k$ is not an integer, the sets in a partition where the sizes differ
as little as possible have $\lfloor n/k\rfloor $ or $\lceil n/k\rceil$
members.
Given any partition $A$ of the column vectors in $k$ sets, we form a quantity
$T_{A}(P)$ as follows.
The vectors in each group are added together, making $k$
vectors of nonnegative integers from the
$n$ column vectors. By taking the maximum of these vectors,
componentwise, one single vector of positive integers is obtained.
All vectors here are of course $m$-dimensional.
The sum of the components
of the final vector is the quantity $T_{A}(P)$. In formula:
$$ T_{A}(P)=\sum_{j=1}^{m}\max_{l=1,...,k}\Bigl(\sum_{i\in A_{l}} v_i\Bigr)_{j},$$
where the inner sum is taken over the indices $i$ which belong to the $l$:th
partition set $A_{l}$.
\vspace{4mm}
We want to choose the partition $A$ so that $T_{A}(P)$ is minimal.
The interesting quantity is thus
$$T(P)=\min_{\mbox{{\scriptsize all partitions $A$}}}T_{A}(P).$$
The function $g(n,k)$ is defined by
$$g(n,k)=\max\{\frac{T(P)}{m}, \mbox{all } m,n,k-\mbox{type matrices }P\}.$$
For given $m, n$ and $k$, we will thus be concerned with the problem of
calculating $\max T(P)$, over all $m,n,k$-type matrices $P$.
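For very small parameters these quantities can be checked by exhaustive search. The following Python sketch (function names are our own) computes $T(P)$ by brute force over all assignments of the columns to $k$ sets and, anticipating the optimality result of section 5, evaluates $g(n,k)$ on a balanced matrix:

```python
from itertools import combinations, product

def T(P, k):
    # T(P) = min over all partitions A of the columns into k sets of T_A(P);
    # brute force over all k^n assignments of columns to sets.
    n = len(P[0])
    return min(
        sum(max(sum(row[i] for i in range(n) if assign[i] == s) for s in range(k))
            for row in P)
        for assign in product(range(k), repeat=n)
    )

def g(n, k):
    # anticipating section 5: the maximum of T(P)/m is attained by a
    # balanced matrix, with one row per placement of k 1:s in n slots
    P = [[1 if i in c else 0 for i in range(n)] for c in combinations(range(n), k)]
    return T(P, k) / len(P)
```

For instance $g(3,2)=4/3$: the balanced matrix has the rows $110$, $101$ and $011$, and any grouping of the three columns into two sets leaves exactly one row with both of its 1:s in the same set.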
A natural conjecture is the estimate $T(P)/m\leq g(n,k)\leq 2$. In the case
$n\leq2k$ it is immediately seen to be true, by simply
grouping the column vectors together in pairs. The
conjecture is
not true in the general case, as is immediately visible in Figure 1.
However, for a partition where the sizes of the sets differ as little
as possible, the largest set has $\lceil n/k\rceil$ members. Then the maximum
number of 1:s in a set is $\min(\lceil n/k\rceil,k)$: both
the number of slots in the largest set and the total number of $1$:s
provide bounds.
We thus obtain a crude estimate $g(n,k)\leq \min(\lceil n/k\rceil,k)$.
We will frequently
compare the optimal estimate $g(n,k)$ with this crude estimate.
\vspace{4mm}
Our final preparation before the main result is to introduce and summarize
the notation which is relevant in this situation.
We say that a matrix $P$ is of \boldmath$m,n,k$\unboldmath-{\bf type}
if it has $m$ rows and $n$ columns, all entries are 0:s or 1:s,
and each row has exactly $k$ 1:s, where $1\leq k\leq n$.
We call a matrix $P$ {\bf balanced} if all possible rows,
that is, if all $\bin{n}{k}$ permutations of the
$k$ 1:s, occur equally frequently as rows of $P$. The number of rows
is thus necessarily divisible by $\bin{n}{k}$.
We also need the following three combinatorial functions.
Let $I$ be a finite sequence of nonnegative integers. Then we define:
\begin{description}
\item $b(I)=$ the number of distinct integers in $I$
\item $a(I,j)=$ the number of occurrences of the $j$:th distinct integer in
$I$, enumerated in size order, $1\leq j\leq b(I)$.
\item $\pi(k,w,q,l)$ = the number of permutations of $q$ 1:s distributed
in $kw$ slots, which are divided in $k$ sets with $w$ slots in each,
such that the set with maximum number of 1:s has exactly $l$ 1:s.
\end{description}
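These definitions translate directly into code. In the Python sketch below (names are our own; we take ``size order'' to mean increasing order, an assumption), $\pi$ is computed by brute-force enumeration of the placements:

```python
from itertools import combinations

def b(I):
    # number of distinct integers in the sequence I
    return len(set(I))

def a(I, j):
    # occurrences in I of the j:th distinct integer, in increasing order
    return list(I).count(sorted(set(I))[j - 1])

def pi_bruteforce(k, w, q, l):
    # place q 1:s in k*w slots, slots i*w,...,(i+1)*w-1 forming the i:th set;
    # count the placements whose fullest set holds exactly l 1:s
    return sum(
        1
        for ones in combinations(range(k * w), q)
        if max(sum(1 for s in ones if s // w == i) for i in range(k)) == l
    )
```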
\newpage
\section{The optimal estimate}
With this notation,
the main result is contained in the following theorem.
\begin{thm}
Given positive integers $m, n$ and $k$, $k<n$. If $w=n/k$ is an integer,
the maximal execution time ratio is
$$g(n,k)=\frac{1}{\bin{n}{k}}\sum_{l=1}^{\min(w,k)}l\,\pi(k,w,k,l);$$
the formula for general $n$ and $k$ is obtained, as in the proof below, by
summing over the possible maxima in the sets of sizes $\lceil n/k\rceil$
and $\lfloor n/k\rfloor$ separately.
Here $\pi(k,w,q,l)=0$ if $l>\min(q,w)$ or $q>kl$;
otherwise it is given by
$$\pi(k,w,q,l)=\bin{w}{l}
\sum_I \bin{w}{i_1}\cdot...\cdot\bin{w}{i_{k-1}}\frac{k!}
{\Pi_{j=1}^{b(\{l,i_1,...,i_{k-1}\})}a(\{l,i_1,...,i_{k-1}\},j)!}.$$
Here the sum is taken over
all sequences of nonnegative integers $I=\{i_1,...,i_{k-1}\}$
which are decreasing: $i_j\geq i_{j+1}$ for all $j=1,...,k-2$,
bounded by $l$: $i_1\leq l$,
and have sum $q-l$: $\sum_{j=1}^{k-1}i_j=q-l$.
For each $m,n,k-$type matrix $P$ the minimum
$$T(P)=\min_{\mbox{{\scriptsize all partitions $A$}}}T_{A}(P)$$
is attained for a partition where the
sizes of the sets in the partition differ as little as possible.
The bound is optimal in the sense that if $\bin{n}{k}$ divides $m$,
in which case there exist balanced matrices,
we have $T(P)/m=g(n,k)$ for all balanced matrices $P$.
\end{thm}
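The formula of the theorem can be implemented directly. In the following Python sketch the decreasing sequences are generated by a simple recursion rather than by the algorithm of the lemma below; the function names are our own:

```python
from math import comb, factorial

def dec_sequences(length, total, bound):
    # all decreasing sequences of `length` nonnegative integers,
    # bounded by `bound`, with the given sum
    if length == 0:
        if total == 0:
            yield ()
        return
    for first in range(min(bound, total), -1, -1):
        if total - first <= (length - 1) * first:  # a feasible tail exists
            for rest in dec_sequences(length - 1, total - first, first):
                yield (first,) + rest

def pi(k, w, q, l):
    # pi(k,w,q,l) as in the theorem; zero outside the feasible range
    if l > min(q, w) or q > k * l:
        return 0
    total = 0
    for I in dec_sequences(k - 1, q - l, l):
        counts = [l, *I]
        sets_factor = factorial(k)  # k! divided by the repetition factorials
        for v in set(counts):
            sets_factor //= factorial(counts.count(v))
        slots_factor = 1
        for i in I:
            slots_factor *= comb(w, i)
        total += slots_factor * sets_factor
    return comb(w, l) * total
```

For $n=4$, $k=2$ this gives $g(4,2)=(1\cdot\pi(2,2,2,1)+2\cdot\pi(2,2,2,2))/\bin{4}{2}=(4+4)/6=4/3$.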
In Figure 1 a plot of the function $g(n,k)$ for $1\leq k,n\leq 50$
is presented. Figure 2 shows the detailed structure of $g(n,k)$.
If $k\geq n$ the definition $g(n,k)=1$ is consistent with the allocation problem.
Then each processor has one single process to execute, and there
is no difference between static and dynamic allocations. The case $g(n,1)=1$
is another trivial case.
The sequences in the function $\pi$
can easily be generated by the algorithm described in the following lemma.
We say that the least decreasing sequence of length $\mu$ and sum
$\sigma$ is the sequence
$\{\lceil\frac{\sigma}{\mu}\rceil,...,\lceil\frac{\sigma}{\mu}\rceil,
\lfloor\frac{\sigma}{\mu}\rfloor,...,\lfloor\frac{\sigma}{\mu}\rfloor\}$.
If $\sigma_{\mu}$ is the remainder when $\sigma$ is divided by $\mu$,
the number of $ \lceil\frac{\sigma}{\mu}\rceil$:s is $\sigma_{\mu}$, and
the number of $ \lfloor\frac{\sigma}{\mu}\rfloor$:s is $\mu-\sigma_{\mu}$,
making the sum of the sequence $\sigma$.
\begin{lemma}
Let $\lambda$ and $\sigma$ be nonnegative
integers and $\mu$ be a positive integer such that $\lambda\leq
\sigma\leq \lambda\mu$.
Every sequence of $\mu$ integers in the interval $0\leq i \leq \lambda$
which is decreasing, bounded by $\lambda$ and has
sum $\sigma$ is generated exactly once by the following algorithm:
\begin{enumerate}
\item Take $I$ as the least decreasing sequence of length $\mu$ and sum
$\sigma$.
\item Find the {\bf rightmost} position in $I$, say $j_1$, which fulfills:
\begin{enumerate}
\item $i_{j_1}<\lambda$ if $j_1=1$, and $i_{j_1}<i_{j_1-1}$ if $j_1>1$
\item $\sum_{j=j_1+1}^{\mu}i_j>0$
\end{enumerate}
The algorithm terminates if no such $j_1$ can be found.
\item The next sequence is obtained from $I$
by increasing the entry in position $j_1$ by one, and replacing
the subsequence $\{i_{j_1+1},...,i_{\mu}\}$ with the least
decreasing subsequence of length $\mu-j_1$ and sum $\sum_{j=j_1+1}^{\mu}i_j-1$.
\item Go to step 2.
\end{enumerate}
\end{lemma}
{\em Proof of the lemma:} It is immediately clear that the starting sequence
is decreasing, has sum $\sigma$ and has
entries in the interval $0\leq i \leq \lambda$.
It is also obvious that these properties are preserved by the algorithm.
Consider a sequence $I$ of this kind. If it is not the last one
generated by the algorithm, the next sequence will have its entry
at $j_1$ increased. This entry will not decrease again unless an entry to
the left is increased. By reapplying this argument we find that no
sequence is generated twice by the algorithm.
What is left to prove
is that the algorithm generates all such sequences. This is done
by induction over the length $\mu$ of the sequence.
For $\mu=1$ the only sequence is $\{\sigma\}$, which is the
starting sequence of the algorithm and the only sequence generated by
the algorithm.
Assume now that the claim is true for all
decreasing sequences of length $\mu$.
We want to prove that the algorithm generates
all decreasing sequences of length $\mu+1$,
sum $\sigma$ and bound $\lambda$.
These sequences are $\{\min(\lambda,\sigma),I_1\},...,
\{\lceil\sigma/\mu\rceil,I_{j_0}\}$, where $I_1,...,I_{j_0}$ are decreasing
sequences of length $\mu$, bounded by
$\min(\lambda,\sigma),...,\lceil\sigma/\mu\rceil$ and
with sums $\sigma-\min(\lambda,\sigma),...,\sigma-\lceil\sigma/\mu\rceil$
respectively. Now the algorithm with length $\mu+1$ applied on the
sequences $\{\min(\lambda,\sigma),I_1\},...,
\{\lceil\sigma/\mu\rceil,I_{j_0}\}$ with the bounds
$\min(\lambda,\sigma),...,\lceil\sigma/\mu\rceil$ is, except for the first
entry, the same as the algorithm with length $\mu$ applied on the
sequences $\{I_1,...,I_{j_0}\}$ with the bounds
$\min(\lambda,\sigma),...,\lceil\sigma/\mu\rceil$.
By the induction assumption this generates all decreasing sequences of length
$\mu$. The lemma is proved.
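The algorithm of the lemma can be coded directly; the Python sketch below (names are our own) starts from the least decreasing sequence and repeats steps 2--4:

```python
def least_decreasing(mu, sigma):
    # ceil(sigma/mu) repeated (sigma mod mu) times, then floor(sigma/mu)
    f, r = divmod(sigma, mu)
    return [f + 1] * r + [f] * (mu - r)

def generate(mu, sigma, lam):
    # all decreasing sequences of length mu, entries in [0, lam], sum sigma,
    # produced by the algorithm of the lemma
    I = least_decreasing(mu, sigma)
    out = [tuple(I)]
    while True:
        j1 = None
        for j in range(mu):  # keep the last admissible position: rightmost
            cap = lam if j == 0 else I[j - 1]
            if I[j] < cap and sum(I[j + 1:]) > 0:
                j1 = j
        if j1 is None:  # step 2 fails: terminate
            return out
        tail_sum = sum(I[j1 + 1:]) - 1
        # step 3: increase position j1, rebuild the tail least-decreasing
        I = I[:j1] + [I[j1] + 1] + least_decreasing(mu - j1 - 1, tail_sum)
        out.append(tuple(I))
```

For $\mu=3$, $\sigma=4$, $\lambda=3$ it produces $(2,1,1)$, $(2,2,0)$, $(3,1,0)$ and then terminates.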
\vspace{4mm}
{\em Proof of the theorem:}
Consider an arbitrary matrix $P$ of $m,n,k$-type. Let $A$ be a partition
where the sizes of the sets differ as little as possible. We will later prove that the minimum is attained for such a partition.
Observe that this is equivalent to
$s_{max}-s_{min}\leq1$, if $s_{max}$ is the number of vectors
in a set with most vectors, and $s_{min}$ is the number of vectors
in a set with fewest vectors.
Each row in $P$ can be regarded as a permutation of $k$ 1:s in $n$ slots.
There are of course $\bin{n}{k}$ such permutations.
Further there exist $n!$ permutations of the columns of $P$, where each permutation produces a possibly different
$m,n,k$-type matrix $P_i, i=1,...,n!$. When we permute the column vectors we permute the entries within each row; all possible rows appear if we perform all possible column
permutations. Now partition $A$ applied to $P_i$ is equivalent to a different
partition $A_i$ applied to the original matrix $P$,
in the sense that $T_A(P_i)=T_{A_i}(P)$.
Next we construct a matrix $P^*$ from the matrices $P_i$, which
has $m\,n!$ rows and $n$ columns, using the following duplication
argument. This will provide control of the partitions.
The first $m$ rows of $P^*$ constitute the matrix $P$ itself. The
next $m$ rows constitute the matrix $P$ where the columns are permuted according to a specific permutation, which is not the identity.
The following $n!-2$
blocks of $m$ rows contain all other permutations of the columns of the
matrix $P$.
We know three facts about the matrix $P^*$ which make
this procedure useful:
\begin{enumerate}
\item Every row in $P^*$ occurs exactly as many times as any other row in the matrix $P^*$. Every possible row does appear. That is, $P^*$ is a balanced matrix.
\item Each column permutation represents a partition $A_i$
of the columns of $P$, so that $T_{A}(P_i)=T_{A_i}(P), i=1,...,n!$.
\item The quantities $T_A(P_i), i=1,...,n!$ relate to the quantity
$T_A(P^*)$ as $T_A(P^*)=\sum_{i=1}^{n!}T_A(P_i)$.
\end{enumerate}
(1) is clear since the duplication argument from each row produces all other
$n!-1$ rows, counting all $0$:s and all $1$:s as different. In reality there are
$\bin{n}{k}$ different such rows, hence each is repeated $k!(n-k)!$ times. (2)
follows by an argument above. Observe that the function $T$ by definition is the sum
of an operation made on each row separately. On the right side of the equality
in (3) this summation is carried out separately for the $n!$ parts of $P^*$.
Thus, since $T(P)$ is the minimum of $T_{A_i}(P)$ over all partitions $A_i$,
we have $T(P)\leq T_{A_i}(P)=T_{A}(P_i)$
for all $i=1,...,n!$. Choosing $A$ as a partition which minimizes
$T_{A}(P^*)$, so that $T_{A}(P^*)=T(P^*)$, we obtain from (3) above:
$$T(P)\leq \sum_{i=1}^{n!}T_{A}(P_i)/n!=T_{A}(P^*)/n!=T(P^*)/n!\,.$$
We have thus established that the balanced matrices represent the worst case.
A balanced matrix has a particularly simple structure.
Each permutation in the matrix $P^*$ is repeated $m$ times if we count
every 1 and every 0 as distinct. By releasing this distinction, each permutation
really is repeated $mk!(n-k)!$ times. Since this factor only multiplies
all occurring numbers, what is left to study is the balanced matrix $\tilde{P}$
where each permutation occurs exactly once. We have:
$$T(P^*)=mk!(n-k)!\,T(\tilde{P})\mbox{, so}$$
$$T(P)\leq T(P^*)/n!
=mk!(n-k)!\,T(\tilde{P})/n!=\frac{m}{\bin{n}{k}}T(\tilde{P}).$$
The matrix $\tilde{P}$ has $\bin{n}{k}$ rows and $n$ columns, and contains
each permutation of the $k$ 1:s in the $n$ slots exactly once.
The columns are grouped together in sets with $\lceil n/k\rceil$ or
$\lfloor n/k\rfloor$ members. For the maximum operation we are interested
in the number of 1:s in the set with most 1:s; say that this set has $l$ 1:s.
For the component sum of the final vector we are interested in
the number of permutations for which the set with most 1:s has exactly
$l$ 1:s.
Consider first the case when $n/k=w$ is an integer. Then all sets in
the partition contain $w$ vectors.
For the sake of clarity, we compute a few special
cases of $\pi(k,w,k,l)$ before considering the general situation.
We here assume that $k\leq w$.
\begin{enumerate}
\item Clearly, $\pi(k,w,k,1)=w^k$, since here we have to put exactly one 1
in each set. In the first set the 1 can be put in $w$ different slots,
as it can in each of the other $k-1$ sets.
\item $\pi(k,w,k,k)=\bin{w}{k}k$.
Here the first factor comes from the
ways of putting all the 1:s in one set, and the factor $k$ is the
number of sets where this can be done.
\item $\pi(k,w,k,k-1)=\bin{w}{k-1}w\,k(k-1)$ (for $k\geq3$). The second factor in this case
is the number of slots to put the single 1. The last factor is the number
of ordered ways to choose the two distinguished sets among the $k$ sets.
\item $\pi(k,w,k,k-2)=\bin{w}{k-2}(\bin{w}{2}k(k-1)+
w^{2}\frac{k(k-1)(k-2)}{2})$ (for $k\geq5$). The first term is the number of ways to produce $l=k-2$
with the distribution $k-2,2,0,...,0$ of the 1:s in the $k$ sets.
The other term arises from the distribution $k-2,1,1,0,...,0$.
\end{enumerate}
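The special cases above can be confirmed by direct enumeration. In the Python sketch below (names are our own) the slots $iw,\ldots,(i+1)w-1$ form the $i$:th set:

```python
from itertools import combinations
from math import comb

def pi_bruteforce(k, w, q, l):
    # count placements of q 1:s in k sets of w slots whose fullest set
    # holds exactly l 1:s
    return sum(
        1
        for ones in combinations(range(k * w), q)
        if max(sum(1 for s in ones if s // w == i) for i in range(k)) == l
    )

# case 1: pi(k,w,k,1) = w^k
assert pi_bruteforce(3, 4, 3, 1) == 4 ** 3
# case 2: pi(k,w,k,k) = C(w,k) * k
assert pi_bruteforce(3, 4, 3, 3) == comb(4, 3) * 3
# case 3: pi(k,w,k,k-1) = C(w,k-1) * w * k(k-1)
assert pi_bruteforce(3, 4, 3, 2) == comb(4, 2) * 4 * (3 * 2)
# case 4: pi(k,w,k,k-2) for k=5, w=5
assert pi_bruteforce(5, 5, 5, 3) == comb(5, 3) * (comb(5, 2) * 5 * 4 + 5 ** 2 * (5 * 4 * 3) // 2)
```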
Generally, to start with we have a number of ways to distribute $k$ 1:s in
$k$ sets regardless of order both in each set and between the sets.
These ways are represented by the decreasing sequences. The order
in each set is here disregarded in such a way that only the number of 1:s is
significant. The order between the sets is disregarded
by choosing one specific order: decreasing sequences.
In a set of $i$ 1:s and $w-i$ 0:s there are $\bin{w}{i}$ different
permutations. So, taking the order in the sets into account, if we have
$\{i_1,...,i_k\}$ 1:s in the $k$ sets respectively, there are
$\bin{w}{i_1}\cdot...\cdot\bin{w}{i_k}$ different permutations.
Since the maximum number $l$ must appear, we always get a factor $\bin{w}{l}$,
which can be factored out. The remaining sequence is of length
$k-1$ and has sum $k-l$.
We next consider the order between the sets. There are $k!$ permutations of the sets.
If there are $a(\{l,i_1,...,i_{k-1}\},j)$ sets with
equally many 1:s, we get no new permutations by permuting within this group of
sets, so we divide by $a(\{l,i_1,...,i_{k-1}\},j)!$.
Hence the factor from permuting the sets is
$$\frac{k!}
{\Pi_{j=1}^{b(\{l,i_1,...,i_{k-1}\})}a(\{l,i_1,...,i_{k-1}\},j)!}.$$
The number of permutations which give the number $l$ is thus
$$\pi(l)=\bin{w}{l}\sum_I \bin{w}{i_1}\cdot...\cdot\bin{w}{i_{k-1}}\frac{k!}
{\Pi_{j=1}^{b(\{l,i_1,...,i_{k-1}\})}a(\{l,i_1,...,i_{k-1}\},j)!},$$
summing over the decreasing sequences of length $k-1$, sum $k-l$ and bound
$l$.
In the case when $n/k$ is not an integer, we
work with a partition where the $n_k$ leftmost sets have $w+1=\lceil n/k\rceil$
vectors and the rightmost $k-n_k$ have $w=\lfloor n/k\rfloor$ vectors;
here $n_k$ denotes the remainder when $n$ is divided by $k$.
The formula in this case is derived from the previous formula by
introducing the possibility that the number of sets, $k$, and the number
of 1:s, $q$, are not equal. By
summing over all possible maximums to the left $(l_1)$ and to the
right $(l_2)$, and over all possible numbers of 1:s to the left $(i)$,
the results for general $n$ and $k$ follow.
The bounds of the indices come from the limited number of 1:s for which
there is room to the left or to the right in the different cases, and from
the minimum number of 1:s required by $l_1, l_2$ and $i$.
\vspace{4mm}
We finally prove the optimality result.
If the matrix $P$ is balanced with $\bin{n}{k}$ rows, then $P=\tilde{P}$,
and the above calculation is true with equality if
$\min_{\mbox{{\tiny all partitions $A$}}}T_{A}(P)$ is attained for
a partition where the sizes of the
sets differ as little as possible. We next prove that this is so for balanced
matrices. Thus the bound is attained for programs corresponding to
balanced matrices.
Let $A$ be a partition of the $n$ columns into $k$ sets,
with $i_1,...,i_k$ columns in the respective sets.
Assume that there are two set sizes, say $i_1$ and $i_2$,
such that $i_1\geq i_2+2$; otherwise we already have the above-mentioned type
of partition. From the partition $A$ we
obtain a new partition $A'$ by transferring the $i_1$:th vector
from the first set to the second. We show that the result is never worse
for this partition, i.e.\ $T_{A'}(P)\leq T_{A}(P)$.
By repeating this transformation we eventually arrive at a partition where
the sizes of the sets differ as little as possible,
and the result follows from the inequality.
Consider a row in $\tilde{P}$, that is, a permutation
$p=\{p(i)\}_{i=1}^{n}$ of 1's and 0's. If $p(i_1)=0$ nothing happens on this
row. If $p(i_1)=1$ and $\sum_{i=1}^{i_1}p(i)\leq \sum_{i=i_1+1}^{i_1+i_2}p(i)$, the
maximum taken over this row does increase, unless some other set contains more 1's.
These are the critical permutations. However, to each such permutation
there corresponds a unique permutation $\tilde{p}$ on which the row maximum
will instead decrease. $\tilde{p}$ is defined as a partial mirror image:
$$\tilde{p}(i)=\left\{ \begin{array}{ll}
p(i_1+i_2+1-i) & \mbox{ if }
i=1,...,i_2 \mbox{ or } i=i_1+1,...,i_1+i_2\\
p(i) & \mbox{ all other } i
\end{array}
\right. $$
Since $\sum_{i=1}^{i_2}p(i)<\sum_{i=i_1+1}^{i_1+i_2}p(i)$ for a critical
permutation, we know that $\tilde{p}$ is not critical
if $p$ is. Further, it follows from
$\tilde{\tilde{p}}=p$ that the map $p\mapsto\tilde{p}$ is a bijection. Hence
every critical permutation $p$ can be paired with a unique non-critical
permutation $\tilde{p}$, since $\tilde{P}$ contains each permutation exactly once.
Because $p(i)=\tilde{p}(i)$ for $i=i_1+i_2+1,...,n$
and $\sum_{i=1}^{i_1}\tilde{p}(i)\geq\sum_{i=i_1+1}^{i_1+i_2}p(i)$, it is clear
that if
the partition change causes the maximum on the row with $p$ to increase,
then the maximum on the row with $\tilde{p}$ will certainly decrease. This
completes the proof of the theorem.
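The partial mirror can be made concrete. The sketch below (0-indexed, with all names introduced here) implements $p\mapsto\tilde{p}$ and checks by exhaustive search, for one small admissible choice of $i_1,i_2$, that it is an involution:

```python
from itertools import product

def mirror(p, i1, i2):
    """Partial mirror image of the proof: positions 1..i2 and
    i1+1..i1+i2 (1-indexed in the text) are reflected within the
    range 1..i1+i2; all other positions are left unchanged."""
    q = list(p)
    for j in list(range(i2)) + list(range(i1, i1 + i2)):
        q[j] = p[i1 + i2 - 1 - j]     # text's p(i1+i2+1-i), 0-indexed
    return q

# Involution: applying the mirror twice gives back the original row.
i1, i2 = 3, 1                          # requires i1 >= i2 + 2
for p in product([0, 1], repeat=5):
    assert mirror(mirror(list(p), i1, i2), i1, i2) == list(p)
```

Positions $i_2+1,...,i_1$ sit in the middle of the reflected range and map to themselves, which is why $\tilde{p}(i_1)=p(i_1)$ as the proof uses.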
\vspace{4mm}
Given $n$ and $k$, how many matrices of $\bin{n}{k},n,k$-type are
balanced? Since all balanced matrices are obtained by permuting the
rows of a given balanced matrix, there are $\bin{n}{k}!$ balanced matrices.
In total there are $\bin{n}{k}^{\bin{n}{k}}$ different matrices
of $\bin{n}{k},n,k$-type: for each of the $\bin{n}{k}$ rows there are
$\bin{n}{k}$ possibilities.
The relative number of balanced matrices
can now be estimated by Stirling's formula $n!\approx(\frac{n}{e})^n\sqrt{2\pi n}$:
$$\frac{\bin{n}{k}!}{\bin{n}{k}^{\bin{n}{k}}}
\approx(\frac{1}{e})^{\bin{n}{k}}\sqrt{2\pi\bin{n}{k}},$$
which tends to zero rapidly as $\bin{n}{k}\rightarrow\infty$.
The significance of this is limited, however, since it is clear that
for increasing $n$ and $k$, matrices very close to balanced matrices play a
role increasingly similar to the role of the balanced ones.
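For a small row count the quality of the Stirling estimate is easy to inspect. The sketch below writes $N$ for $\bin{n}{k}$ and compares the exact fraction $N!/N^{N}$ with the approximation (the function name is introduced here):

```python
import math

def balanced_fraction(N):
    """Exact fraction N!/N^N of balanced matrices among all matrices
    with N rows, next to the Stirling estimate e^-N * sqrt(2*pi*N)."""
    exact = math.factorial(N) / N**N
    stirling = math.exp(-N) * math.sqrt(2 * math.pi * N)
    return exact, stirling

exact, approx = balanced_fraction(6)   # N = binom(4,2) = 6
print(exact, approx)
```

Already at $N=6$ the two values agree to within a few percent, and both decay rapidly as $N$ grows.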
\section{Properties of the function $g(n,k)$}
The function $g(n,k)$ can be regarded as a weighted mean value of the integers
$l=1,2,...,\min(\lceil n/k\rceil,k)$, where the weight of $l$ is
the number of permutations which give $l$,
divided by the total number of permutations $\bin{n}{k}$. This fact
is exploited in this section.
The crude estimate can be viewed as estimating this weighted mean value by
the largest integer $\min(\lceil n/k\rceil,k)$.
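This weighted-mean reading suggests a direct brute-force computation of $g(n,k)$ for small arguments; the sketch below assumes $k$ divides $n$ and uses contiguous sets of $w=n/k$ columns, consistent with the partition used earlier (the function name is introduced here):

```python
from itertools import combinations

def g_bruteforce(n, k):
    """g(n,k) as the mean, over all binom(n,k) rows, of the maximum
    number of ones in any of the k sets of w = n/k columns."""
    w = n // k
    total = rows = 0
    for ones in combinations(range(n), k):
        per_set = [0] * k
        for pos in ones:
            per_set[pos // w] += 1
        total += max(per_set)
        rows += 1
    return total / rows

print(g_bruteforce(4, 2))   # (4*1 + 2*2)/6 = 4/3
```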
From the following theorem we will be able to derive a
description of $g(n,k)$ for large $n$ and $k$. We find that $g(n,k)$,
like the crude estimate, has an infinite series of
plateaus: for each positive integer $w$ there is an almost planar
unbounded part, a subset of the domain $(w-1)k<n<(w+1)k$, on which the
function values are, for $n>\nu$, arbitrarily close in maximum norm to $w+1$.
However, for large $w$, $\nu$ may have to be chosen very large, since
$1-w^{-(w+1)}$ is then very close to $1$.
Globally the function $g(n,k)$ thus has a shape resembling an infinite
winding staircase with constant step height, where each step is narrower,
smoother and considerably more distant from the origin than the previous
step. Closer to the origin the plateaus appear as slowly rising ridges.
\vspace{4mm}
{\bf References:}
\begin{description}
\item[{[1]}] R. A. Brualdi, A. J. Hoffman, {\em On the spectral radius of $(0,1)$-matrices}, Linear Algebra and its Applications, vol. 65, (1985) pp. 113-146.
\item[{[2]}] R. A. Brualdi, M. Katz, {\em An extremal problem concerning matrices of $0$'s and $1$'s}, Linear and Multilinear Algebra, vol. 20, no. 4, (1987) pp. 325-331.
\item[{[3]}] R. L. Graham, {\em Bounds on Multiprocessing Timing Anomalies}, SIAM
J. Appl. Math., vol. 17, no. 2, (March 1969).
\item[{[4]}] R. Krahl, {\em Allocation Problems in Distributed Computer Systems},
Inform. Inf. Rep. no. 12, (1990) pp. 101-115.
\item[{[5]}] L. Lundberg, {\em Static Process Allocation Using Information
about Program Behavior}, Proceedings of the 24th Hawaii International
Conference on System Sciences, Hawaii, (1991) pp. 1-9.
\item[{[6]}] L. Lundberg, H. Lennerstad, {\em Comparing the Performance of
Optimal Dynamic and Static Process Allocation}, Technical Report,
University of Karlskrona/Ronneby, Sweden (1993).
\item[{[7]}] M. Naderi, {\em Modelling and Performance Evaluation of Multiprocessor Organisation with Shared Memories}, Comput. Archit. News, vol. 16, no. 4, (Sept 1988) pp. 51-74.
\item[{[8]}] C. C. Price, {\em Task Allocation in Data Flow Multiprocessors: an annotated bibliography}, Comput. Archit. News, vol. 19, no. 1, (March 1991) pp. 128-134.
\item[{[9]}] Proceedings of the Twentieth Southeastern Conference on Combinatorics, Graph Theory and Computing, Boca Raton, FL (1989).
\item[{[10]}] Proceedings of the Twenty-second Southeastern Conference on Combinatorics, Graph Theory and Computing, Baton Rouge, LA (1991).
\item[{[11]}] J. Rost, {\em A Distributed Algorithm for Dynamic Task Scheduling}, CONPAR 90-VAPP, Joint Int. Conf. on Vector and Parallel Processing, Proceedings, Zurich, Switzerland, 10-13 Sept 1990, Springer-Verlag, pp. 628-639.
\item[{[12]}] B. Shirazi, Ming-Fang Wang, {\em Heuristic Functions for Static Task Allocation}, Microprocess. Microprogr. (Netherlands), vol. 26, no. 3, (Oct 1989) pp. 187-194.
\item[{[13]}] J. Zahorjan, C. McCann, {\em Processor Scheduling in Shared Memory Multiprocessors}, Performance Evaluation Review (USA), vol. 18, no. 1, (May 1990).
\end{description}
\end{document}