u$, and have sum $q: \sum_{j=1}^ki_j=q$. If $w=n/k$ is not an integer, we let $w=\lfloor n/k\rfloor $ and denote the remainder of $n$ divided by $k$ by $n_k$. Then we have: $$r(n,k,u,q)=\frac{(k-n_k)!n_k!}{q\bin{n}{q}} \sum_{l=\max(0,q-w(k-n_k))}^{\min(q,(w+1)n_k)}$$ $$\sum_{I_1}\sum_{I_2} \prod_{j=1}^{n_k}\bin{w+1}{i_{1,j}}\prod_{j=1}^{k-n_k}\bin{w}{i_{2,j}} \frac{d(I_1,u)+d(I_2,u)}{\prod_{j=1}^{b(I_1)}a(I_1,j)!\prod_{j=1}^{b(I_2)}a(I_2,j)!}.$$ The sum over $I_1$ is taken over all decreasing sequences $I_1$ of bound $w+1$, length $n_k$ and sum $l$. The sum over $I_2$ is taken over all decreasing sequences $I_2$ of bound $w$, length $k-n_k$ and sum $q-l$. \end{thm} This theorem, and Theorem \ref{formula}, are proved in section \ref{comb}, where a fast algorithm which generates all decreasing sequences is also given. We remark that the formula can alternatively be formulated with multinomial coefficients, taking the sum over certain partitions of the integer $k$. For programs of any number of variables we have the following results: \begin{thm}\label{general} $$\inf_P\frac{h(P,k,u)}{h(P,1,q)}=1-R(k,u,q),$$ where the infimum is taken over all programs $P$ of any number of variables. \end{thm} The function $1-R(k/u,u,k)$ for $u=1,2,4,8$ is plotted in section 11.1, comparing caches with increasing associativity. \begin{thm}\label{formula} If $u\geq q$, then $R(k,u,q)=0$. If $ku<q$, then $R(k,u,q)=1$. Otherwise $R(k,u,q)$ is given by the corresponding sum taken over all decreasing sequences $I=(i_1,...,i_k)$ which have length $k$ and sum $q$: $\sum_{j=1}^ki_j=q$. \end{thm} Note that the sequences in this formula only differ from those appearing in Theorem \ref{nformula} in that here the bound of $w$ or $w+1$ is removed. \vspace{10mm} We will next describe the results for specific programs from which the general results of theorems 4.1 to 4.4 follow. These also have some interest in their own right. To this end we next describe the programs which are extremal for the ratio $\max_P em(P,k,u,q)/h(P,1,q)$: the complete programs.
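The decreasing sequences appearing in these formulas can be enumerated by a simple recursion. The following Python sketch (the function name is ours, and it is only an illustration, not the fast algorithm of section \ref{comb}) generates all non-increasing sequences with a given bound, length and sum:

```python
def decreasing_sequences(bound, length, total):
    """Yield all non-increasing sequences (i_1 >= ... >= i_length) with
    0 <= i_j <= bound and i_1 + ... + i_length == total."""
    def rec(prefix, remaining, hi):
        if len(prefix) == length:
            if remaining == 0:
                yield tuple(prefix)
            return
        slots = length - len(prefix)
        for v in range(min(hi, remaining), -1, -1):
            # prune: the value v chosen now also bounds all later entries
            if v * slots >= remaining:
                yield from rec(prefix + [v], remaining - v, v)
    yield from rec([], total, bound)

# bound 3, length 3, sum 4 gives (3,1,0), (2,2,0) and (2,1,1)
```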
As discussed previously, a program can be represented by a sequence of $m$ integers, each between 1 and $n$, and each integer occurring at least once since each variable is referenced at least once. In this sequence each entry represents one memory reference, and its value denotes the variable which is referenced. When executed using a $(1,q)$-cache the program induces a sequence of states for the cache. Our results are established by analysing this sequence of states; we will next describe its representation. For any given program $P$ we initially add $n$ references, one reference to each of the $n$ variables in the same order as they are initially referenced in the program $P$. This gives the program $P'$ with $m(P')=m(P)+n$. Note that if $n>q$ we add $n$ $(1,q)$-cache misses. For any reference in the program $P'$ after the initial $n$ references, the state of the cache is described by a so-called $q,n$-hit vector. This vector contains one entry for each variable; the entry indicates whether the variable is in the cache or not. There are $n-q$ 0:s, for variables not in the cache. Variables in the cache are denoted by "1", except the least recently referenced variable in the cache, which is represented with "$q$" at the corresponding position. The program $P$ is represented by a matrix $M(P)$ having as rows all induced $q,n$-hit vectors of $P'$ which represent hits. If $n>q$, $P'$ has at least $n+1$ initial misses, i.e. the number of rows of the matrix $M(P)$ is less than $m(P)$. Obviously there exist $q\bin{n}{q}$ different $q,n$-hit vectors. The raison d'être for the $q,n$-hit vectors is that we here represent enough information for our purpose, and that any collection of $q,n$-hit vectors represents a program (Theorem \ref{summarize}). Therefore results about programs can be derived by studying collections of $q,n$-hit vectors.
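The count $q\bin{n}{q}$ of different $q,n$-hit vectors is easy to verify by brute force. The following Python sketch (names ours) enumerates all vectors with one entry $q$, $q-1$ entries 1 and $n-q$ entries 0:

```python
from itertools import combinations
from math import comb

def hit_vectors(n, q):
    """All q,n-hit vectors: choose the q cached variables, then which of
    them is the least recently referenced one (the entry "q")."""
    vecs = []
    for cached in combinations(range(n), q):
        for lru in cached:
            v = [0] * n          # n-q zeros for variables not in the cache
            for i in cached:
                v[i] = 1         # cached variables
            v[lru] = q           # the least recently referenced variable
            vecs.append(tuple(v))
    return vecs

# q * C(n,q) distinct vectors, e.g. 3 * C(5,3) = 30 for q=3, n=5
```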
\begin{defi}\label{hit} A program $P$ is complete if all $q\bin{n}{q}$ possible $q,n$-hit vectors are equally frequent in the matrix $M(P)$. \end{defi} \begin{thm}\label{compl} All complete programs are extremal for the ratio \\ $em(P,k,u,q)/h(P,1,q)$, i.e. for each complete program $P$ we have $$em(P,k,u,q)/h(P,1,q)=\max_{P'}em(P',k,u,q)/h(P',1,q).$$ Here the maximum is taken over all programs $P'$ with the same number of variables as $P$. \end{thm} The main part of the report is devoted to the proof of this theorem. In parallel with this we prove the usefulness of the $q,n$-hit vector description. Concerning optimal mappings we have the following theorem, which is essential in order to compute the formulas of theorems \ref{nformula} and \ref{formula}. \begin{thm}\label{mapp} For any complete program $P$ of $n$ variables there is an optimal mapping for a $(k,u)$-cache where each set in the mapping contains $\lceil n/k\rceil$ or $\lfloor n/k\rfloor$ variables. \end{thm} Clearly, if $n/k=w$ is an integer there is an optimal mapping where each set has exactly $w$ variables. In section \ref{precomb} we prove that complete programs may have any hit ratio: \begin{lemma}\label{any} For any number $H, 0\leq H\leq1$, and for any $\epsilon>0$, there is a complete program $P$ such that $$\abs{\frac{h(P,1,q)}{m(P)}-H}<\epsilon.$$ \end{lemma} We conclude this section by showing how Theorem \ref{ngeneral} and \ref{general} follow once Theorem \ref{compl} and Lemma \ref{any} have been established. 
\begin{thm}\label{nspecific} \begin{description} \item [(i)]For any program $P$ with $n$ variables we have $$h(P,1,q)(1-r(n,k,u,q))\leq h(P,k,u).\phantom{< m(P)-h(P,1,q)r(n,k,u,q)}$$ %\hspace*{51mm}$$ \item [(ii)]For any complete program $P$ of $n$ variables we have $$h(P,1,q)(1-r(n,k,u,q))\leq h(P,k,u)< m(P)-h(P,1,q)r(n,k,u,q).$$ \end{description} \end{thm} \pf Given a program $P$, denote by $A_0$ and $A_1$ optimal mappings for the hit ratio and for the number of extra misses, respectively. From Lemma \ref{hm} we then get $$h(A_0,P,k,u)\geq h(A_1,P,k,u)\geq h(P,1,q)-em(A_1,P,k,u,q).$$ Hence $$h(P,k,u)\geq h(P,1,q)-em(P,k,u,q).$$ By invoking $r(n,k,u,q)\geq em(P,k,u,q)/h(P,1,q)$ the inequality in (i) follows. The right inequality for complete programs follows from the relation $$h(A,P,k,u) + em(A,P,k,u,q) < m(P).$$ To the left we here have all hits and a subset of the misses when using a $(k,u)$-cache; to the right we have all memory references. Only the misses which are also misses with a $(1,q)$-cache are not represented. Since the initial reference is a miss for both caches, a strict inequality is valid. By taking optimal mappings $A_0$ and $A_1$ as before we get $$h(A_0,P,k,u) < m(P) - em(A_0,P,k,u,q)\leq m(P) - em(A_1,P,k,u,q).$$ Since by Theorem \ref{compl} we have $r(n,k,u,q)=em(P,k,u,q)/h(P,1,q)$ for complete programs $P$, the right inequality follows.
\pfend The same proof gives the following $n$-independent version: \begin{thm}\label{specific} \begin{description} \item [(i)]For any program $P$ we have $$h(P,1,q)(1-R(k,u,q))\leq h(P,k,u).\phantom{< m(P)-h(P,1,q)R(k,u,q)}$$ %\hspace*{51mm}$$ \item [(ii)]For any complete program $P$ we have $$h(P,1,q)(1-R(k,u,q))\leq h(P,k,u)< m(P)-h(P,1,q)R(k,u,q).$$ \end{description} \end{thm} {\em Proof of Theorem \ref{ngeneral}(i) and \ref{general}:} By taking $h(P,1,q)/m(P)$ arbitrarily close to 1, Theorem \ref{ngeneral}(i) and Theorem \ref{general} follow immediately from Lemma \ref{any} and from Theorem \ref{nspecific} and \ref{specific}, respectively. {\em Proof of Theorem \ref{ngeneral}(ii):} Since, by Lemma \ref{any}, there are complete programs with hit ratio arbitrarily close to $h(P,1,q)/m(P)$, we can use Theorem \ref{nspecific}(ii). We thus assume that the hit ratio is not measured with infinite accuracy. To establish the optimality we need to prove that for any given program $P$ there is a program $P'$ with hit ratio arbitrarily close to the hit ratio of $P$, and so that the difference $$\frac{h(P',k,u)}{m(P')}-\frac{h(P,1,q)}{m(P)}(1-r(n,k,u,q))$$ is arbitrarily small. By Lemma \ref{any}, for any $\epsilon>0$ there is a complete program $P''$ so that $h(P'',1,q)/m(P'')>1-\epsilon$. Theorem \ref{nspecific}(ii) then gives $h(P'',k,u)/m(P'')>1-r(n,k,u,q)(1-\epsilon)$. Now let $P'''$ be a program with no cache hits; let for example $P'''$ continuously reference the $n$ variables cyclically. Now we choose the relative size of $P''$ and $P'''$ so that the concatenated program $P'=P''+ P'''$ has hit ratio arbitrarily close to the hit ratio of $P$, which makes the difference above arbitrarily small. \pfend \begin{thm} If $ku<q$, then $r(n,k,u,q) = 1$. \end{thm} \pf There are programs $P$ for which $em(P,k,u,q)/h(P,1,q)>1-\epsilon$, for any $\epsilon > 0$. Thus $\sup_P em(P,k,u,q)/h(P,1,q) = 1$. \pfend \begin{thm} If $u\geq q$, then $r(n,k,u,q) = 0$. \end{thm} \pf If $u\geq q$, then each set in the $(k,u)$-cache is at least as large as the set in the $(1,q)$-cache.
Consequently, using the LRU replacement policy, no extra misses can occur.\pfend \vspace{4mm} Having isolated the trivial cases above, unless stated otherwise we assume in the rest of this paper that $n > ku\geq q > u$. Further, "extremal programs" will unless stated otherwise denote programs which maximize the quantity $em(P,k,u,q)/h(P,1,q)$. \vspace{5mm} \section{From programs to $q,n$-hit vectors} \begin{thm} For any values of $n, k, u$ and $q$, there is at least one extremal program such that each variable occupies exactly one block. \end{thm} \pf The only difference of the statement here compared to the one in Lemma \ref{one} is that we here want to maximize $em(P,k,u,q)/h(P,1,q)$, while we in that proof want to minimize $h(P,k,u)/h(P,1,q)$. Since maximizing the number of misses is equivalent to minimizing the number of hits, it is easily seen that the same proof \newpage \noindent works also for this theorem. \pfend Hence there are extremal programs which have room for exactly one variable in each block; from now on we consider only those. Since there are $n$ variables in the program, we know that there will be at least $n$ cache misses, because each variable will be referenced at least once. The first reference to a variable, which always results in a miss, is referred to as the initial reference. \begin{thm} Consider an arbitrary extremal program $P$, and denote the first referenced variable by $i_1$, the second referenced variable by $i_2$, and so on. Then consider a program $P'$ which is identical to $P$ with the exception that a sequence of $n$ memory references to variables $i_1, i_2,..., i_n$ in this order has been added at the start of the program. Then $P'$ is also an extremal program. \end{thm} \pf Since we consider the case when $n > q$, the number of hits using a $(1,q)$-cache is the same for $P$ and $P'$. That is,
memory reference number $r$ in $P$ results in a miss using a $(1,q)$-cache if and only if memory reference number $r+n$ in $P'$ results in a miss using a $(1,q)$-cache. Recall that we use the LRU replacement policy. Hence $h(P,1,q) = h(P',1,q)$. Consider memory reference number $r_2$ in program $P$, and denote the referenced variable by $i$. If $r_2$ results in a hit, using a $(1,q)$-cache, then there is another reference $r_1$ which also accesses variable $i$, such that $r_1 < r_2$ and there are no references to variable $i$ between $r_1$ and $r_2$. The sequence of memory references between $r_1$ and $r_2$ completely determines if $r_2$ will result in a miss using a $(k,u)$-cache, i.e. this sequence determines if an extra miss will occur. Obviously this sequence corresponds to the sequence $r_1+n$ to $r_2+n$ in $P'$. Consequently, the number of extra misses is the same for $P$ and $P'$, i.e. $em(P,k,u,q)/h(P,1,q) = em(P',k,u,q)/h(P',1,q)$. Consequently, if $P$ is an extremal program, $P'$ is also an extremal program. \pfend Hence there are extremal programs which have no hits during the $n$ first memory references using a $(1,q)$-cache; from now on we consider only those. A $q,n$-state vector $V$ defines the variables currently stored in a $(1,q)$-cache, and the order in which they have been referenced. $V$ has one entry for each variable. If variable $i$ is not in the cache then $V(i)=0$. Otherwise $V(i)$ denotes the number of variables referenced since variable $i$ was last referenced, including variable $i$. Hence if variable $i$ is the most recently accessed variable, $V(i) =1$. If it is the second most recently accessed variable, $V(i) = 2$, and so on (see Figure 4). Hence after the first $n$ memory references the $q,n$-state vector consists of the $q$ first positive integers and $n-q$ zeros. As discussed before, the LRU replacement policy is used.
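The $q,n$-state vector can be maintained by a direct simulation of the LRU policy. A minimal Python sketch (the function name is ours):

```python
def reference(V, j, q):
    """Return the q,n-state vector after a reference to variable j,
    starting from state vector V (a list of length n)."""
    V = list(V)
    if V[j] == 0:                    # miss
        if q in V:                   # cache full: evict the LRU variable
            V[V.index(q)] = 0
        for i, x in enumerate(V):    # age all cached variables
            if x > 0:
                V[i] = x + 1
        V[j] = 1                     # j is now the most recently referenced
    else:                            # hit
        vj = V[j]
        for i, x in enumerate(V):    # age only variables referenced after j
            if 0 < x < vj:
                V[i] = x + 1
        V[j] = 1
    return V

# After the first n references (all misses) the vector consists of the
# q first positive integers and n-q zeros, as stated above:
V = [0] * 5
for j in range(5):
    V = reference(V, j, q=3)
# V == [0, 0, 3, 2, 1]
```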
\vspace*{1.95in} %\special{psfile=cfig4.ps vsize=130 voffset=-59 hoffset=-75 %vscale=90 hscale=90} Clearly, a $q,n$-state vector has $n$ entries and contains each of the first $q$ positive integers exactly once, and $n-q$ zeros. If a reference to variable $j$ is a cache miss, and $V(i) = q$, then variable $i$ is replaced in the cache. We then put $V(i) =0$, increment all positive integers and put $V(j) =1$. If the reference to variable $j$ is a hit we increment all entries which are smaller than $V(j)$, and then put $V(j)=1$. Figure 4 shows a $q,n$-state vector for the case when $q = 4$ and $n = 7$. We are interested in the ratio $em(P,k,u,q)/h(P,1,q)$ for extremal programs. Using a $(1,q)$-cache there are no hits during the first $n$ memory references. Therefore, we do not consider the $n$ first memory references, i.e. we assume that there are always $q$ non-zero entries in a $q,n$-state vector. The numbering of the $q,n$-state vectors is chosen so that the $q,n$-state vector number $r$ describes the state of the cache immediately before reference $r$ in the program. Now consider a mapping $A$ of the $n$ variables to $k$ sets. Each mapping can be regarded as a partition of the first $n$ positive integers into $k$ sets $A_j$; a mapping $A$ is thus given by the sets $A_1,...,A_k$. \begin{lemma}\label{ku} Consider a $(k,u)$-cache, a mapping $A$ and a $q,n$-state vector $V$ corresponding to reference $r$ in a program $P$. Assume that $i_0\in A_{\alpha}$. Then the variable $i_0$ is not in the $(k,u)$-cache at reference $r$ in program $P$ if and only if there are $j$ variables $i_j$, which are also mapped to $A_{\alpha}$, such that $0 < V(i_j) < V(i_0)$ and $j \geq u$. \end{lemma} \pf This is a direct consequence of the definition of the $q,n$-state vector and of the fact that the LRU replacement policy is used. \pfend The next theorem states that a $q,n$-state vector contains enough information about the history of the program for our purpose.
\begin{thm}\label{enough} If the $q,n$-state vector after $r$ references and the mapping of variables to sets are known, then it is possible to determine from this information whether reference $r$ is an extra miss or not. \end{thm} \pf Consider a $q,n$-state vector $V$, a mapping $A$ and a reference to a variable $i_0$. If $V(i_0) > 0$, no miss will occur using a $(1,q)$-cache. This is a direct consequence of the definition of the $q,n$-state vector. Assume that $i_0\in A_{\alpha}$ using the $(k,u)$-cache and there are $j$ variables $i_j$, which are also mapped onto $A_{\alpha}$. In that case the previous lemma guarantees that if $0 < V(i_j) < V(i_0)$ and $j\geq u$, a cache miss will occur using the $(k,u)$-cache and mapping $A$. Consequently, if the $q,n$-state vector and the mapping of variables to sets are known, it is possible to determine if an extra miss will occur. \pfend Our next aim is to prove that any sequence of $q,n$-state vectors represents the sequence of $(1,q)$-hits for some program (Theorem \ref{anyqn}). This allows us to consider arbitrary sequences of $q,n$-state vectors. Two $q,n$-state vectors $V$ and $V'$ are said to be {\em congruent} if $V(i) = V'(i)$, for all $i$ such that $V(i) = 0$. \begin{lemma}\label{congr1} For two arbitrary $q,n$-state vectors $V_1$ and $V_2$, it is possible to generate a sequence of memory references, such that, starting from $V_1$, we reach a state vector which is congruent with $V_2$ without having any hits using a $(1,q)$-cache. \end{lemma} \pf Generate a reference to the first variable $i$ (the variables are enumerated in some fixed arbitrary order) for which $V_1(i) = 0$ and $V_2(i)\not= 0$, thus obtaining a new $q,n$-state vector $V'$. If $V'$ and $V_2$ are not congruent we again issue a reference to a variable $i$ where $V'(i) = 0$ and $V_2(i)\not= 0$. This procedure can be repeated until we reach a vector $V'$ which is congruent with $V_2$.
For the sequence of memory references so generated there are no hits using a $(1,q)$-cache. \pfend Consider two congruent $q,n$-state vectors $V$ and $V'$. The $q$ indexes $i_j$ for which $V(i_j)\not=0$ are enumerated in such order that $V(i_1) = 1, V(i_2) = 2,..., V(i_q) = q$. $V$ and $V'$ are {\em modular congruent} if there is an integer $x$: $V(i_j) = ((V'(i_j)+x) \mod\ q)+1$ for all $j: 1\leq j\leq q$. \begin{lemma}\label{modcongr} For two modular congruent $q,n$-state vectors $V_1$ and $V_2$, it is possible to generate a sequence of memory references such that, starting from $V_1$, we reach $V_2$ without having any hits using a $(1,q-1)$-cache. \end{lemma} \pf If we issue a reference to the variable $i_q: V_1(i_q) = q$, we will obtain a new state vector $V_1'$ which is modular congruent to $V_1$. Also this reference will clearly be a miss when using a $(1,q-1)$-cache. By repeating this kind of reference we can generate any vector $V$ which is modular congruent to $V_1$ with no hits when using a $(1,q-1)$-cache. \pfend Two $q,n$-state vectors $V_1$ and $V_2$ are {\em quasi-identical} if $V_1(i) = V_2(i)$ for all $i$ except two indexes $i_a$ and $i_b$ such that $V_1(i_a) = 0, V_1(i_b) > 0, V_1(i_a) = V_2(i_b)$ and $V_1(i_b) = V_2(i_a)$. \begin{lemma}\label{quasi} For two quasi-identical $q,n$-state vectors $V_1$ and $V_2$, it is possible to generate a sequence of memory references, such that, starting from $V_1$, we reach $V_2$ without having any hits using a $(1,q-1)$-cache. \end{lemma} \pf Consider two state vectors $V_1'$ and $V_2'$ which are modular congruent with $V_1$ and $V_2$ respectively. $V_1'$ and $V_2'$ are chosen in such a way that $V_1'(i_b) = q$ and so that $V_1'$ and $V_2'$ are also quasi-identical. By issuing one reference to the variable $i_a$ with $V_2'(i_a)=q$ we obtain $V_2''$ which is modular congruent to $V_2'$.
From Lemma \ref{modcongr} we know that it is possible to generate a sequence of memory references, such that, starting from $V_1$ it is possible to reach $V_1'$ without having any hits using a $(1,q-1)$-cache. From $V_1'$ we get $V_2''$ by issuing a reference to variable $i_a$. Since $V_1'(i_a) = 0$, this reference leads to a miss using a $(1,q-1)$-cache. Again, Lemma \ref{modcongr} guarantees that, starting from $V_2''$, it is possible to generate a sequence of memory references, such that we reach $V_2$ without having any hits using a $(1,q-1)$-cache. \pfend \begin{lemma}\label{congr2} For two congruent $q,n$-state vectors $V_1$ and $V_2$, it is possible to generate a sequence of memory references, such that, starting from $V_1$, we reach $V_2$ without having any hits using a $(1,q-1)$-cache. \end{lemma} \pf Assume that $V_1(z) = V_2(z) = 0$, and that $V_2(i_j) = j$ for all $j: 1\leq j\leq q$. In that case, the algorithm below reaches $V'_1=V_2$, starting from $V_1$, without generating any hit using a $(1,q-1)$-cache. \begin{enumerate} \item Let $j = q$ and let $V'_j$ be modular congruent to $V_1$ such that $V'_q (i_q) = q$. \item If $j=1$ the algorithm terminates, else let $j=j-1$. \item If $V'_{j+1} (i_j) = j$, then let $V'_{j} = V'_{j+1}$ and go to 2. \item Go to state $X_j$, which is quasi-identical with $V'_{j+1}$, i.e. $X_j(i) = V'_{j+1}(i)$ for all $i: 1\leq i\leq n$ except that $X_j(i_j) = V'_{j+1}(z) = 0$ and $X_j(z) = V'_{j+1}(i_j) = t$ $(t < j)$. \item Go to state $X'_j$, which is quasi-identical with $X_j$, i.e. $X'_j(i) = X_j(i)$ for all $i: 1\leq i\leq n$ except that $X'_j(i_y) = X_j(i_j) = 0$ and $X'_j(i_j) = X_j(i_y) = j$, i.e. $i_y$ is the index for which $X_j(i_y) = j$. \item Go to state $V'_j$, which is quasi-identical with $X'_j$, i.e. $V'_j(i) = X'_j(i)$ for all $i: 1\leq i\leq n$ except that $V'_{j}(z) = X'_j(i_y) = 0$ and $V'_{j}(i_y) = X'_j(z) = t$, n.b. $V'_{j}$ is congruent with $V_1$ and $V_2$. \item Go to 2.
\end{enumerate} The algorithm above guarantees that $V'_1(i) = V_2(i)$ for all $i: 1\leq i\leq n$. Moreover, Lemma \ref{quasi} guarantees that there will be no hits using a $(1,q-1)$-cache during steps 4, 5 and 6 in the algorithm above. \pfend \begin{thm}\label{anyqn} If $n > q+1$, it is possible to go from any $q,n$-state vector $V_1$ to any other $q,n$-state vector $V_2$ without having any hits in a $(1,q)$-cache. \end{thm} \pf This follows immediately from Lemma \ref{congr1} and \ref{congr2}. \pfend For the moment we only consider cases in which $n > q+1$. The case when $n = q+1$ will be handled later. As discussed previously, a program can be represented by the sequence of memory references with associated $q,n$-state vectors corresponding to hits using a $(1,q)$-cache. From Theorem \ref{anyqn} we know that if $n > q+1$, each $q,n$-state vector in this sequence is completely independent of any other vector in the sequence. \begin{thm}\label{preex} Consider a mapping $A$ and two memory references $r_1$ and $r_2$, both referring to the same variable $x$. Denote the $q,n$-state vectors corresponding to $r_1$ and $r_2$ respectively by $V_1$ and $V_2$. Assume that these vectors are such that $0 < V_1(x) = p < q, V_2(x) = q, V_2(i) =V_1(i)$, when $V_1(i) < p$, otherwise $V_2(i) = V_1(i)-1$ for all $i\not=x$. Assume also that reference $r_1$ is a miss using a $(k,u)$-cache and a mapping $A$. Then reference $r_2$ is also a miss using a $(k,u)$-cache and the mapping $A$. \end{thm} \pf From Lemma \ref{ku} we know that if reference $r_1$ leads to a miss using a $(k,u)$-cache and mapping $A$, then there are $i$ variables $z_i$, which are mapped to the same set as $x$, such that $0 < V_1(z_i) < V_1(x) = p$ and $i \geq u$. We also know that $V_2(i) =V_1(i)$, when $V_1(i) < p$. Consequently, at reference $r_2$ the same variables $z_i$ are mapped to the same set as $x$ and satisfy $0 < V_2(z_i) < V_2(x)$. Under these conditions, Lemma \ref{ku} tells us that reference $r_2$ will result in a cache miss.
\pfend \begin{thm}\label{ex} If $n > q+1$ there exists an extremal program such that each hit using a $(1,q)$-cache is a reference to the variable for which the corresponding entry in the $q,n$-state vector equals $q$. \end{thm} \pf Consider a program $P$ and its induced sequence of $q,n$-state vectors. If a reference $r$ occurs to a variable $i$ with $V(i) < q$, then by Theorem \ref{preex} it can be replaced by a reference for which the corresponding entry equals $q$, without decreasing the number of extra misses. \pfend Consider a hit, i.e. a reference to the variable $i$ with $V(i) = q$, and the partition set to which $i$ is mapped. An extra miss occurs when the number of variables with entries $>0$ belonging to this partition set is larger than $u$. This is a necessary and sufficient condition for an extra miss. Since we thus are only interested in whether a certain entry in a $q,n$-state vector is $0, q$ or some value between $0$ and $q$, we can further simplify our notation. In a $q,n$-state vector ``0'', ``1'', and ``$q$'' are left unchanged, but all other integers are replaced by ``1''. Hence each vector has one ``$q$'', $q-1$ ``1'' and $n-q$ ``0''. Such a vector is referred to as a $q,n$-hit vector. Also this vector contains enough information to be able to decide whether an extra miss occurs or not. We will in the following often represent a program as a matrix which has $q,n$-hit vectors as rows. Again only the hits are represented; all $q,n$-hit vectors representing misses with a $(1,q)$-cache are deleted. Since a program by definition has $h(P,1,q)$ hits using a $(1,q)$-cache, we obtain an $h(P,1,q)\times n$ matrix. Figure 5 shows a program for which there are five cache hits using a $(1,3)$-cache, i.e. $h(P,1,q) = 5$. The program has six variables. The first cache hit that occurs is a reference to variable number three. When this hit occurs variables number one and two are also in the cache. \vspace*{2.2in} %\special{psfile=cfig5.ps vsize=150 voffset=-45 hoffset=-70 %vscale=80 hscale=80} The example in Figure 5 illustrates that if the mapping of variables to sets is known, it is possible to determine if the memory reference corresponding to a certain $q,n$-hit vector will result in a miss using a $(k,u)$-cache or not.
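The passage from a $q,n$-state vector to a $q,n$-hit vector, and the extra-miss test it supports, can be written out directly. A Python sketch (function names ours; the miss condition is that of Lemma \ref{ku}, and the mapping in the example is of our own choosing):

```python
def to_hit_vector(V, q):
    """Keep 0, 1 and q; replace every other positive entry by 1."""
    return [0 if x == 0 else (q if x == q else 1) for x in V]

def is_extra_miss(H, A, u, q):
    """H: a q,n-hit vector for a (1,q)-hit, i.e. a reference to the
    variable whose entry is q.  A[i]: the set variable i is mapped to.
    The hit is an extra miss iff at least u other cached variables are
    mapped to the same set."""
    i0 = H.index(q)
    in_set = sum(1 for j, x in enumerate(H)
                 if j != i0 and x > 0 and A[j] == A[i0])
    return in_set >= u

# With q=4, n=7 and a hypothetical mapping of the 7 variables to 4 sets:
H = to_hit_vector([4, 3, 1, 2, 0, 0, 0], 4)   # -> [4, 1, 1, 1, 0, 0, 0]
# is_extra_miss(H, [0, 0, 1, 1, 2, 2, 3], u=1, q=4) is True; with u=2, False
```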
Remember, we are interested in the mapping of variables to sets which results in a minimum number of misses using a $(k,u)$-cache. The figure also shows that the number of extra misses depends on the mapping. The problem of calculating $r(n,k,u,q) = \max_P em(P,k,u,q)/h(P,1,q)$ for all programs $P$ of $n$ variables can now be formulated as finding a sequence of $q,n$-hit vectors for which the ratio obtained when dividing the minimum number of misses for the sequence, using a $(k,u)$-cache, by the length of the sequence is maximal. From Theorems \ref{ex} and \ref{enough} the following theorem follows, which fulfils the aim of the present section: to reformulate the problem in terms of arbitrary sequences of $q,n$-hit vectors. \begin{thm}\label{summarize} If $n>q+1$, any sequence of $q,n$-hit vectors represents a program. \end{thm} It is obvious that any program with $n$ variables corresponds to a sequence of $q,n$-hit vectors, as described before Definition \ref{hit}. \section{Complete programs}\label{precomb} There are $q\bin{n}{q}$ distinct $q,n$-hit vectors. A program for which the corresponding sequence of $q,n$-hit vectors contains exactly $x$ copies of each distinct vector is referred to as a complete program; $x$ is referred to as the repetition factor. Such a sequence of $q,n$-hit vectors is referred to as a complete sequence. A set of $q$ $q,n$-hit vectors forms a transformation group if all zeros are positioned at the same entries in all vectors and if the ``$q$'' is positioned differently in all vectors. Consequently, there are $x\bin{n}{q}$ transformation groups in a complete program. \vspace*{2.6in} %\special{psfile=cfig6.ps vsize=180 voffset=-45 hoffset=-65 %vscale=80 hscale=80} Figure 6 shows a $q,n$-state vector $V$, and the corresponding $q,n$-hit vector $V$. The figure also shows the transformation group $T$, of which $V$ is a member.
The arrows in the right part of the figure correspond to a sequence of four memory references, where the first reference is to a variable mapped to block 5, the second reference is to a variable mapped to block 6, the third to block 7 and the fourth to block 2. These four memory references go through all $q,n$-hit vectors in $T$ exactly once, reaching $q,n$-hit vector $V$ again after the fourth memory reference. Using a $(1,4)$-cache, none of the four memory references would result in a cache miss. \vspace{4mm} So far, we have only considered the case when $n > q+1$. The reason for this is that if $n = q+1$, we cannot guarantee that there is a program corresponding to any arbitrary sequence of $q,n$-hit vectors (see Theorem \ref{summarize}). However, in the next theorem we will show that there are complete programs even when $n = q+1$. \begin{thm} If $n = q+1$, then there is a program $P$ such that the sequence of $q,n$-hit vectors corresponding to hits using a $(1,q)$-cache forms a complete sequence. \end{thm} \pf We are going to describe a complete program $P$ with $n = q+1$ variables. The program $P$ starts by generating the $n-1$ memory references $2,3,...,q+1$. This gives cache misses only and the $q,n$-hit vector $V=(0,q,1,...,1,1)$. The following $q$ references are references to the variables $2,...,q,q+1$ in this order. By the definition of transformation groups we then clearly issue cache hits such that we go through all $q,n$-hit vectors in the transformation group $T$ of $V$ exactly once, finally returning to $V$. Having done this we generate a memory reference to variable 1. This reference gives a miss and a new $q,n$-hit vector $V'$ belonging to a different transformation group $T'$, where $V'(2) = 0$. Again we generate a sequence of $q$ memory references, such that we go through all $q,n$-hit vectors in $T'$ exactly once, finally returning to $V'$ again with no misses using a $(1,q)$-cache.
This procedure is repeated until we have gone through all the $q+1$ different transformation groups. Clearly the cache hits in the program $P$ correspond to all possible $q,n$-hit vectors, and each $q,n$-hit vector occurs once. Consequently, $P$ is a complete program for the case when $x$ is one. Note that by repeating the $(q+1)^2$ memory references it is possible to obtain a complete program with any arbitrary value of $x$. Hence there are complete programs when $n = q+1$. \pfend \pfof{any} Consider a complete program $P$, for which the hits occur in such order that $q,n$-hit vectors belonging to the same transformation group are listed consecutively. If $h(P,1,q)/m(P) > H$ we add $y$ cache misses at the end of the program, thus obtaining a new program $P'$ which also is complete since only cache hits are involved in the definition of complete programs. $$\frac{h(P',1,q)}{m(P')} = \frac{h(P,1,q)}{m(P)+y}.$$ If $h(P,1,q)/m(P) < H$, we consider another complete program $P'$ where we have increased the repetition factor from $x$ to $x'$, still with $q,n$-hit vectors belonging to the same transformation group listed consecutively. Since only hits occur within a transformation group, we thus introduce no new misses. Denoting by $b(P)=m(P)-h(P,1,q)$ the misses using a $(1,q)$-cache, and letting $\alpha=\frac{x'}{x}$, it is clear that $$\frac{h(P',1,q)}{m(P')} = \frac{\alpha h(P,1,q)}{\alpha h(P,1,q)+b(P)}.$$ Hence we can bring the ratio arbitrarily close to 1. By combining the two techniques it is possible to find a complete program which has hit ratio arbitrarily close to $H$. \pfend \begin{lemma}\label{only} Consider a complete program $P$ and two of its variables $i_1$ and $i_2$. If $A_1(i_1) =S_1$ and $A_1(i_2) = S_2$ and there is another mapping $A_2$ which is identical to $A_1$ except that $A_2(i_1) =S_2$ and $A_2(i_2) = S_1$, then the number of extra misses is the same for mappings $A_1$ and $A_2$.
\end{lemma} \pf Obviously, the number of misses using a $(k,u)$-cache for vectors $V$, such that $V(i_1) = V(i_2)$, is the same for mappings $A_1$ and $A_2$. The symmetry of complete sequences guarantees that all $q,n$-hit vectors $V$ in $P$ for which $V(i_2) \not=V(i_1)$ can be grouped into pairs $(V,V')$, such that $V(i) = V'(i)$, except that $V(i_1) = V'(i_2)$ and $V(i_2) = V'(i_1)$. If $V$ results in a miss for mapping $A_1$ using a $(k,u)$-cache, then $V'$ results in a miss for mapping $A_2$ using a $(k,u)$-cache. Similarly, if $V$ results in a miss for mapping $A_2$ using a $(k,u)$-cache, then $V'$ results in a miss for mapping $A_1$ using a $(k,u)$-cache. Consequently, the number of extra misses is the same for $A_1$ and $A_2$. \pfend \begin{lemma}\label{prev} Consider a complete program $P$. If there are $n_1$ variables mapped onto set $S_1$ and $n_2$ variables mapped onto set $S_2$ and $n_1 +1 < n_2$, the number of extra misses cannot increase if one variable is remapped from $S_2$ to $S_1$. \end{lemma} \pf Consider a mapping $A_1$, such that there are $n_1$ variables mapped onto set $S_1$ and $n_2$ variables mapped onto set $S_2$, where $n_1 + 1 < n_2$. From Lemma \ref{only} we know that only the number of variables mapped to each set is relevant. In this case the variables are mapped in such a way that $A_1(i_1) = S_1, A_1(i_2) = S_1,..., A_1(i_{n_1}) = S_1, A_1(i_{n_1+1}) = S_2,..., A_1(i_{n_1+n_2}) = S_2, A_1(i_{n_1+n_2+1}) = S_3$ and so on. Consider also another mapping $A_2$ which is identical to $A_1$ except that $A_2(i_{n_1}) = S_2$. We are going to show that the number of extra misses cannot be larger using $A_1$ than $A_2$. Obviously, the number of misses for vectors $V$, such that $V(i_{n_1}) = 0$ is the same for mappings $A_1$ and $A_2$, i.e. we only have to consider vectors $V$ for which $V(i_{n_1})\not= 0$. We now introduce the concept of $(n_1,n_2)$-permutations.
$V'$ is by definition an $(n_1,n_2)$-permutation of $V$ if $V'(i_j) = V(i_{n_1+n_2+1-j})$ and $V'(i_{n_1+n_2+1-j}) = V(i_j)$ for all $j:1\leq j\leq n_1$, and all other entries are equal: $V'(i_j)=V(i_j)$ for all $j: n_1+1\leq j\leq n_2$ and $n_1+n_2+1\leq j\leq n$. If $V(i_{n_1}) = "q"$ and $V(i_j) = "1" $ for all $j: n_1+1\leq j\leq n_2$, then $V$ may, using a $(k,u)$-cache, lead to a miss using mapping $A_2$ even if $V$ did not cause a miss using mapping $A_1$. The symmetry of complete programs guarantees that each such vector $V$ can be grouped together with another vector $V'$, forming a pair $(V,V')$ such that $V'$ is an $(n_1,n_2)$-permutation of $V$. From the definition of $(n_1,n_2)$-permutations we know that if, using a $(k,u)$-cache, vector $V$ leads to a miss using mapping $A_2$ and a hit using mapping $A_1$, then $V'$ leads to a miss using mapping $A_1$ and a hit using mapping $A_2$. We now introduce the function $\#a(V,[x,y])$, which has two parameters: a $q,n$-hit vector $V$ and a range $[x,y]$. The value of this function is a natural number indicating the number of "1":s in $V$ in the interval $V(i_x)...V(i_y)$. If $V(i_{n_1}) = "1"$ and $ \#a(V,[1,n_1]) < \#a(V,[n_1+1,n_1+n_2])$, then $V$ may, using a $(k,u)$-cache, lead to a miss using mapping $A_2$ even if $V$ did not cause a miss using mapping $A_1$. The symmetry of complete programs guarantees that each such vector $V$ can be grouped together with another vector $V'$, forming a pair $(V,V')$ such that $V'$ is an $(n_1,n_2)$-permutation of $V$. From the definition of $(n_1,n_2)$-permutations we know that if, using a $(k,u)$-cache, vector $V$ leads to a miss using mapping $A_2$ and a hit using mapping $A_1$, then, using a $(k,u)$-cache, $V'$ leads to a miss using mapping $A_1$ and a hit using mapping $A_2$. \pfend \pfof{mapp} By repeating the argument of the preceding theorem we finally obtain an optimal mapping of the type described in the theorem.
\pfend \begin{lemma}\label{same} All complete programs $P$ with $n$ variables have the same ratio $em(P,k,u,q)/h(P,1,q)$. \end{lemma} \pf Obviously, the ratio $em(P,k,u,q)/h(P,1,q)$ is not affected by the order in which the $q,n$-hit vectors are listed in the sequence corresponding to the complete program. Therefore, the only potentially significant difference between two complete programs is the repetition factor $x$. Creating $x$ copies of each row increases both the number of extra misses $em(P,k,u,q)$ and the number of hits using a $(1,q)$-cache, $h(P,1,q)$, by the factor $x$. Consequently, the ratio $em(P,k,u,q)/h(P,1,q)$ is not affected by $x$, i.e. all complete programs $P$ with $n$ variables have the same ratio $em(P,k,u,q)/h(P,1,q)$. \pfend \pfof{compl} Consider a sequence of $q,n$-hit vectors corresponding to an arbitrary program $P_0$ (see Figure 7), and take a mapping $A_0$ which is optimal for the program $P_0$. Let further $A_1$ be any mapping where a set with the most variables has at most one more variable than a set with the fewest variables. We then clearly have $$\frac{em(P_0,k,u,q)}{h(P_0,1,q)}=\frac{em(A_0,P_0,k,u,q)}{h(P_0,1,q)} \leq\frac{em(A_1,P_0,k,u,q)}{h(P_0,1,q)}.$$ As discussed previously, the $q,n$-hit vectors form a $h(P_0,1,q)\times n$ matrix; in the example in Figure 7 we have $h(P_0,1,q) = 4$ and $n = 3$. We next duplicate the program $P_0$ in $n!$ copies, where each copy corresponds to one of the $n!$ permutations of the $n$ columns in the $h(P_0,1,q)\times n$ matrix. The copies are concatenated, resulting in a sequence of $q,n$-hit vectors of length $n!*h(P_0,1,q)$. This sequence corresponds to a program $P_1$. It is clear that $P_1$ is a complete program (see the rightmost version of $P_1$ in Figure 7).
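As an illustration (a sketch only; the function name and the placeholder matrix below are ours, not taken from Figure 7), the construction of $P_1$ from $P_0$ by concatenating all $n!$ column permutations of the hit-vector matrix can be expressed as follows, together with a check that the resulting multiset of rows is invariant under any permutation of the columns:

```python
from collections import Counter
from itertools import permutations

def complete_from(matrix):
    """Concatenate one copy of the hit-vector matrix per permutation
    of its columns, as in the construction of P_1 from P_0."""
    n = len(matrix[0])
    rows = []
    for perm in permutations(range(n)):
        rows += [tuple(row[j] for j in perm) for row in matrix]
    return rows

# placeholder 4 x 3 matrix of q,n-hit vectors (q = 2, n = 3;
# "2" marks the referenced variable, "1" a more recently
# referenced variable, "0" a variable not in the cache)
m = [(2, 1, 0), (2, 0, 1), (1, 2, 0), (0, 1, 2)]
p1 = complete_from(m)        # 3! * 4 = 24 rows
multiset = Counter(p1)
# symmetry of the complete program: the multiset of rows is
# unchanged by any permutation of the columns
for perm in permutations(range(3)):
    assert Counter(tuple(r[j] for j in perm) for r in p1) == multiset
```

Note that the number of rows of the concatenation is $n!*h(P_0,1,q)$, in agreement with $h(P_1,1,q)$ above.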
\newpage \vspace*{5.1in} %\special{psfile=cfig7.ps vsize=370 voffset=-40 hoffset=-55 %vscale=80 hscale=80} Now the mapping $A_1$ on each of the $n!$ permutations of the columns of $P_0$ can alternatively be regarded as mappings $A_{1,i}, i=1,...,n!$ on the program $P_0$. All mappings $A_{1,i}$, some of which may be equal, have the property that the sizes of the sets are as equal as possible. Now fix the mapping $A_1$ to be such that $em(A_1,P_0,k,u,q)\leq em(A_{1,i},P_0,k,u,q)$ for all $ i=1,...,n!$. Then $em(A_1,P_0,k,u,q)$ is a lower bound also of the mean value of the quantities $em(A_{1,i},P_0,k,u,q)$. Since the sum of these is $em(A_1,P_1,k,u,q)$, we get $$\frac{em(A_1,P_0,k,u,q)}{h(P_0,1,q)}\leq \frac{1}{n!}\sum_{i=1}^{n!}\frac{em(A_{1,i},P_0,k,u,q)}{h(P_0,1,q)}=$$ $$\frac{em(A_1,P_1,k,u,q)}{n!h(P_0,1,q)}=\frac{em(A_1,P_1,k,u,q)}{h(P_1,1,q)}.$$ Further, from Lemma \ref{same} we know that all complete programs $P$ with $n$ variables have the same ratio $em(P,k,u,q)/h(P,1,q)$. By this and Theorem \ref{mapp}, mappings such as $A_1$ are optimal mappings. Thus, for an arbitrary program $P_0$: $$\frac{em(P_0,k,u,q)}{h(P_0,1,q)}\leq\frac{em(A_1,P_1,k,u,q)}{h(P_1,1,q)}\leq \frac{em(P_1,k,u,q)}{h(P_1,1,q)}$$ where $P_1$ is complete. Hence complete programs are extremal. \pfend \section{Calculating $r(n,k,u,q)$ and $R(k,u,q)$}\label{comb} \pfof{nformula} By Theorems 6.1, 6.2 and 6.3 we know that if $n\leq ku$ or $u\geq q$, then $r(n,k,u,q)=0$. If $n>ku$ and $u<q$, extra misses may occur: a $q,n$-hit vector results in an extra miss precisely when the set of the variable marked "$q$" contains more than $u$ of the $q$ most recently referenced variables, that is, when the corresponding entry of the decreasing sequence satisfies $i_j>u$. Hence, for each permutation in the permutation set corresponding to the decreasing sequence $I$, the ratio of extra misses is $d(I,u)/q$. The number of permutations corresponding to a specific decreasing sequence is computed in two steps. First the 1:s in each set are permuted, which gives rise to a factor $\bin{w}{i_1}\cdot...\cdot\bin{w}{i_k}$. Next the sets are permuted, which gives the factor $k!$. But sets with an equal number of 1:s must not be permuted, in order to count each permutation exactly once.
We thus need to divide by the number of permutations of the sets which have the same number of 1:s. This gives the factor $(\Pi_{j=1}^{b(I)}a(I,j)!)^{-1}$. If $n/k$ is not an integer we assume that the first $n_k$ subsets contain $w+1$ columns, and the other $k-n_k$ contain $w$ columns each. We next consider permutations with a maximum number of 1:s in the sets of size $w+1$, and the remaining 1:s in the sets of size $w$. The quantity $d(I,u)$ is computed and multiplied by the number of such permutations, which is obtained as above for the sets of size $w+1$ and for the sets of size $w$ separately, multiplying the two numbers. Next the same procedure is repeated after having moved one of the 1:s from the sets of size $w+1$ to the smaller sets. The procedure is repeated until either the sets of size $w+1$ run out of 1:s or the sets of size $w$ run out of 0:s. The limits of the sums exclude cases when there are no permutations, i.e. we are avoiding summing zeros. For example, if $\max(0,q-w(k-n_k))>0$, then we have in total $w(k-n_k)$ slots in the sets of size $w$. Thus if $q>w(k-n_k)$, packing this side full of 1:s will still leave $q-w(k-n_k)$ 1:s for the sets of size $w+1$. Since $l$ is the number of 1:s here, $l$ should thus start at $q-w(k-n_k)$. Also, the terms corresponding to sequences when both $I_1$ and $I_2$ have all entries smaller than $u$ can be omitted, since then $d(I_1,u)+d(I_2,u)=0$. \pfend The decreasing sequences can easily be generated by the algorithm described in the following lemma. We say that the least decreasing sequence of length $k$ and sum $q$ is the sequence $\{\lceil\frac{q}{k}\rceil,...,\lceil\frac{q}{k}\rceil, \lfloor\frac{q}{k}\rfloor,...,\lfloor\frac{q}{k}\rfloor\}$. If $q_{k}$ is the remainder when $q$ is divided by $k$, the number of $ \lceil\frac{q}{k}\rceil$:s is $q_{k}$, and the number of $ \lfloor\frac{q}{k}\rfloor$:s is $k-q_{k}$, making the sum of the sequence $q$.
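As an illustration, the least decreasing sequence and the generation algorithm formalized in the lemma below can be sketched in Python (a sketch only; the function names are ours, and the parameters are assumed to satisfy the conditions of the lemma):

```python
def least_decreasing(length, total):
    # ceil(total/length) repeated (total mod length) times,
    # then floor(total/length) for the remaining positions
    if length == 0:
        return []
    r = total % length
    return [total // length + 1] * r + [total // length] * (length - r)

def decreasing_sequences(k, u, q):
    """Generate each decreasing (non-increasing) sequence of length k
    and sum q with at least one entry greater than u, exactly once."""
    seq = [u + 1] + least_decreasing(k - 1, q - u - 1)
    while True:
        yield tuple(seq)
        # rightmost position that can be increased: it must be strictly
        # below its left neighbour (or be the first position), and the
        # entries to its right must have positive sum
        j1 = next((j for j in range(k - 1, -1, -1)
                   if (j == 0 or seq[j] < seq[j - 1]) and sum(seq[j + 1:]) > 0),
                  None)
        if j1 is None:
            return
        seq = (seq[:j1] + [seq[j1] + 1]
               + least_decreasing(k - 1 - j1, sum(seq[j1 + 1:]) - 1))
```

For example, with $k=3$, $u=1$ and $q=5$ the algorithm generates the sequences $(2,2,1)$, $(3,1,1)$, $(3,2,0)$, $(4,1,0)$ and $(5,0,0)$.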
\begin{lemma}\label{alg} Let $q$, $u$ and $k$ be positive integers with $u<q$. Every decreasing sequence of length $k$ which has at least one entry greater than $u$ and has sum $q$ is generated exactly once by the following algorithm: \begin{enumerate} \item As starting sequence take $i_1=u+1$ and take $\{i_2,...,i_k\}$ as the least decreasing sequence of length $k-1$ and sum $q-u-1$. \item Find the rightmost position in $I$, say $j_1$, which fulfills: \begin{enumerate} \item $i_{j_1}<i_{j_1-1}$, or $j_1=1$ \item $\sum_{j=j_1+1}^{k}i_j>0$ \end{enumerate} The algorithm terminates if no such $j_1$ can be found. \item The next sequence is obtained from $I$ by increasing the entry in position $j_1$ by one, and replacing the subsequence $\{i_{j_1+1},...,i_{k}\}$ with the least decreasing subsequence of length $k-j_1$ and sum $\sum_{j=j_1+1}^{k}i_j-1$. \item Go to step 2. \end{enumerate} \end{lemma} The lemma is proved in [7]. \begin{lemma}\label{decreasing} For any $n > 0, r(n,k,u,q) \leq r(n+1,k,u,q)$. \end{lemma} \pf Suppose that $P_0$ is a complete program and $A_0$ is an optimal mapping for $P_0$. Hence $$r(n,k,u,q)=\max_{\mbox{{\scriptsize programs }}P \mbox{{\scriptsize of } } n \mbox{{\scriptsize variables}}}\frac{em(P,k,u,q)}{h(P,1,q)}=$$ $$\frac{em(P_0,k,u,q)}{h(P_0,1,q)}=\frac{em(A_0,P_0,k,u,q)}{h(P_0,1,q)}.$$ We next construct a program $P_1$ by adding to the program $P_0$ a new variable of size exactly one block, as the very last reference. Since this variable has not been referenced before, this reference is a miss with any cache. Hence we can obtain an optimal mapping $A_1$ for the program $P_1$ by adding the new variable to any set. The program $P_1$ has $n+1$ variables but is not necessarily extremal.
Thus: $$\frac{em(P_0,k,u,q)}{h(P_0,1,q)}=\frac{em(P_1,k,u,q)}{h(P_1,1,q)}\leq$$ $$\max_{\mbox{{\scriptsize programs }}P \mbox{{\scriptsize of } } n+1 \mbox{{\scriptsize variables}}}\frac{em(P,k,u,q)}{h(P,1,q)}=r(n+1,k,u,q).$$ \pfend \vspace{4mm} {\em Proof of Theorem \ref{formula}:} By the previous lemma it is clear that $$\sup_nr(n,k,u,q)=\lim_{n\rightarrow\infty}r(n,k,u,q).$$ Further it follows that $$\lim_{n\rightarrow\infty}r(n,k,u,q)=\lim_{w\rightarrow\infty}r(wk,k,u,q).$$ Note also that exactly the same decreasing sequences are generated for all $w\geq q$, if $k$ is fixed. This fact makes it possible to explicitly compute the function $R(k,u,q)$. When letting $w\rightarrow\infty$, it is enough to study the behaviour of the $w$-dependent part of each term, which is the quotient $$\frac{\bin{w}{i_1}\cdot...\cdot\bin{w}{i_{k}}}{\bin{wk}{q}}.$$ We will use the fact that $i_1+...+i_k=q$. From the Stirling formula $n!\approx(\frac{n}{e})^n\sqrt{2\pi n}$ we get for large $n$: $$\bin{n}{k}\approx\frac{e^{-k}}{k!} \sqrt{\frac{n}{n-k}}\frac{n^n}{(n-k)^{n-k}}.$$ Thus the quotient approximately equals $$\frac{e^{-i_1}}{i_1!}\sqrt{\frac{w}{w-i_1}} \frac{w^w}{(w-i_1)^{w-i_1}}\cdot... \cdot\frac{e^{-i_k}}{i_k!}\sqrt{\frac{w}{w-i_k}} \frac{w^w}{(w-i_k)^{w-i_k}}\cdot$$ $$\frac{q!}{e^{-q}}\sqrt{\frac{kw-q}{kw}}\frac{(kw-q)^{kw-q}}{(kw)^{kw}}=$$ $$\frac{q!}{i_1!\cdot...\cdot i_k!}$$ $$\sqrt{\frac{w}{w-i_1}\cdot...\cdot\frac{w}{w-i_k}\frac{kw-q}{kw}}$$ $$(1-\frac{q}{kw})^{kw}\frac{w^{kw}}{(w-i_1)^w\cdot...\cdot(w-i_k)^w} \frac{(kw-q)^{-q}}{(w-i_1)^{-i_1}\cdot...\cdot(w-i_k)^{-i_k}}.$$ The first line after the last equality clearly is independent of $w$. The second line tends to $1$. The first factor of the third line tends to $e^{-q}$.
The second factor can be written as $$(1+\frac{i_1}{w-i_1})^w\cdot...\cdot(1+\frac{i_k}{w-i_k})^w.$$ Now, $$(1+\frac{i}{w-i})^{w}\rightarrow e^{i}\mbox{ as } w\rightarrow\infty,$$ so we get $$e^{i_1}\cdot...\cdot e^{i_k}=e^q$$ as $w\rightarrow\infty$, cancelling the factor $e^{-q}$. From the third factor we get $$(\frac{kw-q}{w-i_1})^{-i_1}\cdot... \cdot(\frac{kw-q}{w-i_k})^{-i_k}\rightarrow k^{-i_1}\cdot...\cdot k^{-i_k}=k^{-q}.$$ Thus the third line above contributes with a factor $k^{-q}$ as $w\rightarrow\infty$. Hence: $$\lim_{w\rightarrow\infty}\frac{\bin{w}{i_1}\cdot...\cdot\bin{w}{i_{k}}} {\bin{wk}{q}} =\frac{q!k^{-q}}{i_1!\cdot...\cdot i_k!}.$$ The proof is completed.\pfend \section{Discussion} In this section we discuss the practical implications of the results as well as the restrictions and assumptions. In subsection 9.1 we demonstrate how the results can be used for obtaining a bound on the worst case performance. In subsection 9.2 we discuss how the results can be used for comparing set associative caches with direct mapped caches or other set associative caches. In the remaining subsections we discuss some of the restrictions and assumptions in the definition of the problem. More specifically, in subsection 9.3 we discuss how replacement policies other than LRU would affect the results. Subsection 9.4 discusses the restriction that each variable must fit in one cache block. This discussion shows that the obtained results can be used also for programs with large variables. In subsection 9.5 we discuss large scientific programs which manipulate large matrices in different ways, e.g. matrix multiplication and addition. \subsection{Applying the results} The results reported here make it possible to obtain optimal worst case bounds on the cache hit ratio for set associative and direct mapped caches. This can be understood by considering the following example.
Consider a program $P$, with $n$ variables, for which we have obtained a certain hit ratio $X$ using a $(1,q)$-cache. Now we are interested in the hit ratio $Y$ for $P$ using an optimal mapping of variables to memory addresses for another cache organization -- a $(k, u)$-cache -- such that $q < k*u$ and $u < q$. We know that $X \leq h(P,1,q)/m(P)$, and that $Y = h(P,k,u)/m(P)$. Consequently, from Theorem 4.1 (ii) we know that $X(1-r(n,k,u,q)) \leq Y$. One practical implication of this is that if we should obtain a hit ratio which is smaller than $X(1-r(n,k,u,q))$ using a $(k,u)$-cache, then we know that the mapping of variables to memory addresses is not optimal. Generally we do not know if $X = h(P,1,q)/m(P)$ or not, and Theorem \ref{ngeneral} (ii) gives an optimal lower bound on $h(P,k,u)/m(P)$. Consequently, based on the information we have, the lower bound on $Y$ is optimal. In the example above the results were used for evaluating the performance (hit ratio) of one particular program $P$. However, it is also possible to use our results when evaluating different cache architectures, e.g. Theorems \ref{general} and \ref{formula} directly tell us how much worse a hit ratio we may obtain for a $(k,u)$-cache compared to a $(1,q)$-cache. However, the results can also be used for more complex scenarios, e.g. consider two cache memory systems: in system one we have a $(1,20)$-cache and in system two we have a $(20,2)$-cache. The main memory access time is the same in both cases, and the access time for the $(1,20)$-cache is 8 times faster than a main memory access whereas the access time for the $(20,2)$-cache is 10 times faster than a main memory access, i.e. the $(20,2)$-cache is slightly faster than the $(1,20)$-cache. Let $M$ be the access time for one memory reference on the main memory, and let $C$ denote the access time on the cache memory.
For a program $P$ with length $L$ and hit ratio $H$ on system 1 we get the total access time $$A_1(P)=L(C_1H+M(1-H))=L(M-(M-C_1)H).$$ Let $H'$ be the hit ratio with system 2. From Theorem \ref{ngeneral} we know that $H(P)(1-r(n,k,u,q))\leq H'(P)\leq1$. From this, lower and upper bounds on the ratio $A_2(P)/A_1(P)$, the access time for system 2 when the access time for system 1 is known, follow: $$\frac{C_2}{M-(M-C_1)H}\leq\frac{A_2(P)}{A_1(P)}\leq \frac{M-(M-C_2)H(1-r(n,k,u,q))}{M-(M-C_1)H}.$$ In section 11.3 these bounds, $f_1(H)$ and $f_2(H)$ respectively, are plotted as a function of the program parameter $H$ when system 1 is a $(1,20)$-cache with $M/C_1=8$, and system 2 is a $(20,2)$-cache with $M/C_2=10$. Here we need $R(20,2,20)=0.2459$. If the number of variables $n$ is known, one can obtain a slightly sharper upper bound by replacing $R(20,2,20)$ by $r(n,20,2,20)$. The functions $f_1(H)$ and $f_2(H)$ are interesting when we want to compare the two cache organizations. \subsection{Comparing set associative and direct mapped caches} Consider two set associative cache memories, $M_1$ = $(k,u*c_1)$-cache and $M_2$ = $(k*c_2,u)$-cache, such that \(c_1 \geq 1\) and \(c_2 \geq 1\). \begin{thm}\label{setvsset} $$\frac{h(P,k*c_2,u)}{h(P,k,u*c_1)} \geq 1-R(c_2,u,c_1),$$ for all programs $P$ of any number of variables. \end{thm} \pfof{setvsset} We first consider an arbitrary program $P$ with $n$ variables and an optimal mapping $A_1$ for $M_1$, i.e. a mapping which results in $h(P,k,u*c_1)$ hits. We then consider another program $P'$ with $n'$ variables (\(n' \geq n\)) and a mapping $A'_1$. $P'$ is identical with $P$ with the exception that $n'-n$ references to the $n'-n$ new variables are added at the end of the program; one reference for each new variable. $A'_1$ is identical to $A_1$ for the $n$ original variables. The remaining $n'-n$ variables are mapped in such a way that the number of variables in each set equals $n'/k$.
Obviously, $h(P,k,u*c_1)$ = $h(P',k,u*c_1)$ using any mapping, i.e. the mapping of the $n'-n$ new variables does not affect the number of hits, since all references to these variables result in cache misses anyway. $P'$ can be viewed as the sum of $k$ interleaved programs $P'_i$, \(i = 1..k\). A program $P'_i$ contains the references to the variables in set $i$ (\(1 \leq i \leq k\)) in $M_1$ using mapping $A'_1$. We then consider a mapping $A_2$ for $M_2$ such that variables which are mapped to set $i$ in $A'_1$ (\(0 \leq i \leq k-1\)) are mapped to one of the $c_2$ sets $i+k*j$ (\(0 \leq i \leq k-1, j = 0,..,c_2-1\)) in $A_2$. For instance, if $k=2$ and $c_2=2$, then all variables mapped to set zero in $A'_1$ are mapped to sets zero and two in $A_2$, and all variables mapped to set one in $A'_1$ are mapped to sets one and three in $A_2$. Using this mapping, each of the $k$ sets in $M_1$ can be treated as a fully associative cache containing $u*c_1$ blocks which is compared to a $(c_2,u)$-cache. When \( n' \rightarrow \infty\), Theorem \ref{general} gives that: \[\frac{h(P'_i,k*c_2,u)}{h(P'_i,k,u*c_1)} \geq 1-R(c_2,u,c_1),\] for all programs $P'_i$ of any number of variables. Obviously, \(h(P',k*c_2,u) = \sum_i h(P'_i,k*c_2,u)\) and \(h(P',k,u*c_1) = \sum_i h(P'_i,k,u*c_1). \) Consequently, using the mappings $A'_1$ and $A_2$ we know that \[\frac{h(P',k*c_2,u)}{h(P',k,u*c_1)} \geq 1-R(c_2,u,c_1),\] for all programs $P$ of any number of variables. The theorem follows from the facts that the relation above is valid for any number of variables $n$, including the case when $n$ is arbitrarily large, that $A_1$ is optimal for this program using a $(k,c_1*u)$-cache, and that $h(P,k,u*c_1)$ = $h(P',k,u*c_1)$ using any mapping. \pfend \subsection{Other replacement policies} In this paper we have assumed the LRU replacement policy. However, in this subsection we will show that the results can be applied to systems with other replacement policies. Consider a fully associative cache, i.e.
a $(1,q)$-cache, and a direct mapped cache, i.e. a $(k,1)$-cache. \begin{thm}\label{repl1} \begin{description} \item [(i)] Using any replacement policy $$\frac{h(P,k,1)}{h(P,1,q)} \geq 1-r(n,k,1,q),$$ for all programs $P$ with $n$ variables. \item [(ii)] Consider a program P with $n$ variables, and suppose that the hit ratio $h(P,1,q)/m(P)$ for $P$ with a $(1,q)$-cache is known. Then for any replacement policy $$\frac{h(P,k,1)}{m(P)} \geq (1-r(n,k,1,q))\frac{h(P,1,q)}{m(P)}.$$ \item [(iii)] $$\inf_P\frac{h(P,k,1)}{h(P,1,q)} \leq 1-R(k,1,q),$$ where the infimum is taken over all programs $P$ of any number of variables, for any replacement policy. \end{description} \end{thm} \pfof{repl1} Consider a hit in the $(1,q)$-cache. We now temporarily redefine the $q,n$-hit vectors somewhat. The $q$ still represents the hit in the $(1,q)$-cache and the $q-1$ ones still represent the $q-1$ last referenced variables. However, depending on the replacement policy, all of these variables may not be in the $(1,q)$-cache. Consequently, some of the $n-q$ zeroes may be in the cache. When using this definition of $q,n$-hit vectors, the same kind of proofs can be used for showing that complete programs represent the worst-case scenario when comparing a direct mapped cache to a fully associative cache. However, we do not know if it is possible to obtain a sequence of $q,n$-hit vectors corresponding to a complete program. Consequently, Theorem \ref{repl1} follows from Theorems \ref{ngeneral} and \ref{general} and the fact that the complete program is still the worst-case program when comparing a fully associative cache to a direct mapped cache. \pfend \subsection{Large variables} In some cases, one variable may be too large to fit within one cache block. Examples of such variables are strings, records and arrays which are accessed sequentially (we will have a separate discussion about matrices in the next subsection).
Accesses to such large variables consist of a number of clustered accesses to the different blocks in which the variable is contained. \begin{thm}\label{largev1} Consider programs $P$ with $n$ large variables with a size of $v$ blocks each. These blocks are mapped to consecutive memory addresses. Each access to one of these variables results in a sequence of accesses to the $v$ blocks. \begin{description} \item [(i)] $$\frac{h(P,k,u)}{h(P,1,q)} \geq 1-r(n,\lfloor k/v \rfloor,u,\lceil q/v \rceil),$$ for all programs $P$ with $n$ variables with a size of $v$ blocks each. \item [(ii)] Suppose that the hit ratio $h(P,1,q)/m(P)$ for $P$ with a $(1,q)$-cache is known. Then $$\frac{h(P,k,u)}{m(P)} \geq (1- r(n,\lfloor\frac{k}{v}\rfloor,u,\lceil\frac{q}{v}\rceil)) \frac{h(P,1,q)}{m(P)}.$$ \item [(iii)] $$\inf_P\frac{h(P,k,u)}{h(P,1,q)} \leq 1-R(\lfloor k/v \rfloor,u,\lceil q/v \rceil),$$ where the infimum is taken over all programs $P$ of any number of variables with a size of $v$ blocks each. \end{description} \end{thm} \pfof{largev1} Consider a $(1,q')$-cache, where \(q' = v*\lceil q/v \rceil\), and a $(k',u)$-cache, where \(k' = v*\lfloor k/v \rfloor\). Since \(q' \geq q\) and \(k' \leq k\), it is clear that \(h(P,k,u)/h(P,1,q) \geq h(P,k',u)/h(P,1,q')\) for any program $P$. Consequently, it is enough to show that $$\frac{h(P,k',u)}{h(P,1,q')} \geq 1-r(n,\lfloor k/v \rfloor,u,\lceil q/v \rceil) = 1-r(n,k'/v,u,q'/v),$$ for all programs $P$ with $n$ variables with a size of $v$ blocks each. Consider the set of mappings where the first block in each variable is mapped to set $i$, where \(i = 0,v,2v,...,k'-v\). Let $h'(P,k',u)$ be the maximal number of cache hits for program $P$ within this set of mappings. Obviously, \(h'(P,k',u) \leq h(P,k',u)\).
If we restrict ourselves to these mappings, we end up in a situation which is identical to having $n$ one block variables and comparing a $(1,q'')$-cache to a $(k'',u)$-cache, where \(q'' = q'/v = \lceil q/v \rceil\) and \(k'' = k'/v = \lfloor k/v \rfloor\). For such systems Theorem \ref{ngeneral} guarantees that: \begin{description} \item [(i)] The worst case ratio of cache hits comparing two caches is given by $$\inf_P\frac{h(P,k'',u)}{h(P,1,q'')}=1-r(n,k'',u,q''),$$ where the infimum is taken over all programs $P$ with $n$ variables. \item [(ii)] Consider a program P with $n$ variables, and suppose that the hit ratio $h(P,1,q'')/m(P)$ for $P$ with a $(1,q'')$-cache is known. Then an optimal estimate of the hit ratio for the program $P$ using a $(k'',u)$-cache is given by $$\inf_P\frac{h(P,k'',u)}{m(P)}=(1-r(n,k'',u,q''))\frac{h(P,1,q'')}{m(P)}.$$ \end{description} Consequently, Theorem \ref{largev1} follows from Theorems \ref{ngeneral} and \ref{general} and the facts that \(h(P,k,u)/h(P,1,q) \geq h(P,k',u)/h(P,1,q') \geq h'(P,k',u)/h(P,1,q') = h(P,k'',u)/h(P,1,q''),\) for any program $P$ of $n$ variables with size $v$ blocks. \pfend \subsection{Scientific programs with large matrices} If the colliding variables are two elements in a large matrix, then the compiler can obviously not map them independently of each other. One scenario where these types of cache collisions occur is scientific applications, where large matrices are manipulated in different ways. In such programs, there are large loops in which different parts of the large matrix are accessed, e.g. rows, columns, diagonals etc. Such loops contain long sequences of memory accesses for which the addresses are separated by a fixed distance called the {\it stride}. Consider a system with $k$ cache sets and a program accessing a ($bk$)$\times$($ck$)-matrix ($b$, $c$ $\geq$ 1), which is mapped row by row starting at memory address zero.
This means that the $i$:th element in each row is mapped to set ($i$ mod $k$). As long as the program works its way through the matrix by accessing each row sequentially, everything works fine, i.e. there is a minimum number of collisions. However, if the program accesses each column sequentially, there will be a collision on every access, i.e. for column $j$ all memory accesses will be made to cache set ($j$ mod $k$). It has been shown that these kinds of collisions can be handled by adding a number of dummy columns (or dummy rows if the matrix is stored column by column). By adding a dummy column in the example above, we can access a column sequentially with a minimum number of collisions. Other access patterns, e.g. diagonals, can be handled in a similar way. However, for some combinations of access patterns some collisions always remain. \section{Conclusions} The average behaviour of set-associative caches has been studied extensively. One rule of thumb is that a $(k,1)$-cache gives about the same hit ratio as a $(k/4,2)$-cache [3]. On page 421 in [3] there is a table where the hit ratio using different cache sizes and different degrees of associativity is shown for a "typical" work load. A smaller version of the same table is also shown in [5]. These tables show that the hit ratio gain of going from a $(k,1)$-cache to a $(k/2,2)$-cache is larger than the gain of going from a $(k/2,2)$-cache to a $(k/4,4)$-cache, and the gain of going from a $(k/4,4)$-cache to a $(k/8,8)$-cache is even smaller. The plot in Section 11.1 supports this. There has also been a great interest in developing models for memory access patterns, which make it possible to evaluate average performance of fully associative, set-associative and direct mapped caches. It is well known that one of the problems with direct mapped caches compared to fully or set associative caches is that they exhibit a terrible worst-case performance [4].
However, before this result it has not been possible to quantitatively compare the worst-case behaviour of direct mapped caches to fully associative caches. Section 11.2 displays a plot of this performance relation. On pages 408 to 425 in [3] there is a very comprehensive discussion of the performance implications of various cache parameters, e.g. the LRU replacement policy outperforms random replacement particularly for caches with relatively few blocks (up to a couple of thousand blocks). Moreover, the ``optimal'' block size increases with increasing cache size. Based on these observations it is reasonable to assume that large caches will have large blocks and that the LRU replacement policy is used within each set. Consequently, the LRU assumption made in this report is reasonable. The assumption that each variable can be mapped to one block becomes more reasonable for large cache blocks. However, large arrays will probably have to be mapped to consecutive parts of the address space exceeding one cache block. There are however options open to the compiler or programmer, e.g. one can either store a two dimensional array row by row or column by column. Another approach is to use non-linear mapping functions. When ordinary linear mapping is used, every $k$:th memory block is mapped to the same set. However, in some cases non-linear mapping functions have been suggested [1][17]. In those cases, the problem of optimally mapping variables to memory addresses is transferred into the problem of defining an optimal mapping function, and in that case different parts of large arrays can be mapped independently of each other. %\newpage %pip3 %\newpage %pip4 %\newpage %pip \newpage {\bf References:} \begin{description} \item[{[1]}] F. Dahlgren, P. Stenstr{\"o}m, {\em On Reconfigurable On-chip Data Caches}, Proceedings of the 24th Annual International Symposium on Microarchitecture, Albuquerque, New Mexico, November 1991. \item[{[2]}] M. R. Garey, D. S.
Johnson, {\em Computers and Intractability -- A Guide to the Theory of NP-Completeness}, W. H. Freeman and Company, 1979, ISBN 0-7167-1044-7. \item[{[3]}] J. L. Hennessy, D. A. Patterson, {\em Computer Architecture: A Quantitative Approach}, Morgan Kaufmann Publishers Inc., 1990, ISBN 1-55880-069-8. \item[{[4]}] M. D. Hill, {\em A Case for Direct-Mapped Caches}, IEEE Computer, December 1988, pp. 25-40. \item[{[5]}] M. D. Hill, A. J. Smith, {\em Evaluating Associativity in CPU Caches}, IEEE Transactions on Computers, Vol. 38, No. 12, December 1989. \item[{[6]}] IBM Corporation, {\em IBM System/370 Model 168 Functional Characteristics (GA22-7010)}, 3rd ed., May 1974, IBM Syst. Products Division, Poughkeepsie, NY. \item[{[7]}] H. Lennerstad, L. Lundberg, {\em An Optimal Execution Time Estimate for Static versus Dynamic Allocation in Multiprocessor Systems}, SIAM Journal on Computing, to appear. \item[{[8]}] H. Lennerstad, L. Lundberg, {\em Optimal Performance Functions Comparing Process Allocation Strategies in Multiprocessor Systems}, Research Report 3/93, H\"ogskolan i Karlskrona/Ronneby, Sweden, 1993. \item[{[9]}] H. Lennerstad, L. Lundberg, {\em Optimal scheduling results for parallel computing}, SIAM News, Vol. 27, No. 7, 1994 (survey article). \item[{[10]}] L. Lundberg, H. Lennerstad, {\em An Optimal Bound on the Gain of Using Dynamic versus Static Allocation in Multiprocessor Computers}, Technical Report, H\"ogskolan i Karlskrona/Ronneby, Sweden, 1992. \item[{[11]}] L. Lundberg, H. Lennerstad, {\em An Optimal Performance Bound on the Gain of Using Dynamic versus Static Allocation in Multiprocessors with Clusters}, Technical Report, H\"ogskolan i Karlskrona/Ronneby, Sweden, 1993. \item[{[12]}] L. Lundberg, H. Lennerstad, {\em An Optimal Upper Bound on the Minimal Completion Time in Distributed Supercomputing}, Proceedings of the 8th ACM Conference on Supercomputing, Manchester, England, July 1994. \item[{[13]}] L. Lundberg, H.
Lennerstad, {\em Bounding the Maximal Gain of Changing the Number of Memory Modules in Multiprocess Computing}, Technical Report, H\"ogskolan i Karlskrona/Ronneby, Sweden, 1994. \item[{[14]}] L. Lundberg, H. Lennerstad, {\em An Optimal Lower Bound on the Maximal Speedup in Multiprocessors with Clusters}, First International Conference on Algorithms and Architectures for Parallel Processing, Brisbane, Australia, April 1995. \item[{[15]}] R. L. Mattson, J. Gecsei, D. R. Slutz, I. L. Traiger, {\em Evaluation techniques for storage hierarchies}, IBM Syst. J., Vol. 9, 1970, pp. 78-117. \item[{[16]}] S. Przybylski, M. Horowitz, J. Hennessy, {\em Performance Tradeoffs in Cache Design}, Proceedings of the International Symposium on Computer Architecture, Honolulu, Hawaii, June 1988. \item[{[17]}] A. Seznec, {\em A case for two-way skewed-associative caches}, Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, California, May 1993. \item[{[18]}] A. J. Smith, {\em A Comparative Study of Set Associative Memory Mapping Algorithms and Their Use for Cache and Main Memory}, IEEE Transactions on Software Engineering, Vol. SE-4, No. 2, March 1978. \item[{[19]}] A. J. Smith, {\em Cache Memories}, ACM Computing Surveys, Vol. 14, No. 3, Sept 1982, pp. 473-530. \end{description} \end{document}