\documentclass[11pt,a4paper,fleqn,twocolumn]{article}
\setlength{\topmargin}{-16mm}
\setlength{\textheight}{255mm}
\setlength{\oddsidemargin}{0mm}
\setlength{\textwidth}{160mm}
\renewcommand{\thepage}{13/\arabic{page}}
%\usepackage[]{a4wide}
\usepackage[dvips]{graphicx}
\usepackage{epsfig}
\usepackage{floatflt}
\usepackage{amsmath}
%\usepackage{times}
\title{On the Characteristics of WWW Traffic and the Relevance to ATM}
%\author{P\"ar Karlsson\thanks{Email: \texttt{pka@itm.hk-r.se}, Phone: \texttt{+46 455 78063}} \ and \AA ke Arvidsson\thanks{Email: \texttt{akear@itm.hk-r.se}, Phone: \texttt{+46 455 78053}}\\
\author{P\"ar Karlsson\thanks{\texttt{pka@itm.hk-r.se}, \texttt{+46 455 78063}} \ and \AA ke Arvidsson\thanks{\texttt{akear@itm.hk-r.se}, \texttt{+46 455 78053}}\\
Department of Telecommunications and Mathematics,\\
University of Karlskrona/Ronneby, \\
S-371 79 Karlskrona, Sweden
}
%\date{April 1997}
\date{}
\begin{document}
\maketitle
\begin{abstract}
%\onecolumn
%This document describes a study of the characteristics of recorded WWW traffic.
%Several parameters of the traffic are investigated,
%The results are used to investigate a scenario where ATM is used as the underlying transport mechanism.
%Problems with the deployment of ATM in the approach taken are considered and suggestions for improvements are made.
The characteristics of HTTP traffic originating from a WWW server are investigated. This traffic has been one of the most important and fastest growing traffic types during the last couple of years, and it can be expected to remain so in the foreseeable future. Several parameters of the resulting TCP/IP traffic are examined.
We put the results in the light of a scenario where ATM is used as the underlying transport technology. Using simulation studies we investigate the properties of the resulting ATM cell arrival stream when the TCP/IP traffic is conveyed over an ATM network. The resulting ATM traffic is highly bursty and exhibits characteristics that indicate self-similarity.
The high variability of the traffic implies that a fixed allocation of bandwidth between the mean and peak rate is an infeasible way to achieve a reasonable utilization of the system, since this results in tremendous buffering demands. This calls for a different view on the way to study the system under consideration. Different properties of the traffic must be taken care of by different methods. Variations over longer time-scales are dealt with by means of capacity allocation and fluctuations with shorter duration are buffered.
Rather than considering a queuing process with a fixed deterministic service time, we have to consider the case where the service time in some way depends on the input to the system. This is, for example, the case when bandwidth is allocated on a per-connection basis. Simulations show that by using simple bandwidth allocation principles the buffer demands can be kept much more modest.
This realistic way of looking at the system might also put different tasks, such as traffic modelling, in a new light. For instance, does a model that is to be used for buffer dimensioning have to capture traffic behavior on time-scales that are longer than reasonably can be buffered anyway?
%\twocolumn
\end{abstract}
\section{WWW Traffic\hfill \\ Measurements}
In the following section several characteristics of the traffic originating from a WWW server are investigated.
The measurements were performed by Ericsson Utvecklings AB, \"Alvsj\"o.
The server under study was connected to a 10~Mbit/s Ethernet segment connected directly to a router.
No other sources of traffic were present on the segment.
A workstation connected to the segment and running the program \texttt{snoop}, available in the Solaris operating system from Sun Microsystems, was used to observe the traffic.
All IP packets to and from the WWW server during a period of approximately 10 hours were captured and selected fields from the header were logged to a file together with a time-stamp.
According to the manual page for \texttt{snoop} the measurements have an
accuracy of 4~$\mu$s.
The statistics reported in the following are all derived from the traffic going out from the server.
This traffic is the most interesting since it is much larger in volume than the incoming traffic.
The number of incoming packets was 163716 and the number of outgoing packets was 156296; the incoming packets are, however, in general much shorter.
\subsection{IP Statistics}
Figure \ref{ipsize} shows a histogram of the size of the outbound IP packets.
Size here means only the size of the IP packet (header and payload); the Ethernet overhead is not included.
As could be expected, the distribution contains a large number of small packets that carry TCP control messages and HTTP requests and responses.
A large peak can also be found at 1500 octets. This peak stems from the maximum possible Ethernet frame size, which is 1514 octets (of which the Ethernet header constitutes 14 octets).
\begin{figure}[!h]
\begin{center}
\includegraphics[width=7cm]{packethist.eps}
\end{center}
\caption{Histogram of IP packet sizes}
\label{ipsize}
\end{figure}
In figure \ref{tcpsize} a histogram of the total amount of data transported over the TCP connections is presented.
Both IP and TCP overhead as well as possible retransmissions are included in figure \ref{tcpsize}.
The HTTP/1.0 \cite{HTTP1.0} implementation used opens a separate TCP connection for every element of an HTML page that a client requests.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=7cm]{tcpsize.eps}
\end{center}
\caption{Histogram of the total amount of data transported over TCP connections}
\label{tcpsize}
\end{figure}
In figure \ref{tcpdur} the duration of the TCP sessions is presented.
The actual duration of each TCP connection had to be extracted from the IP-log with some care.
By inspection of the log it was found that the TCP connections often were closed a long time after the actual data transfer had ceased.
Since we are more interested in the actual time it takes to transfer the data we decided to consider the stop time of each connection to be the time at which the last packet containing data was logged.
While working with the log file, we discovered that clients sometimes tried to use TCP connections that had been closed several hours earlier.
This immediately resulted in an RST response from the server.
This is clearly a bug in the TCP implementation of the clients.
The operating system of the clients with this behavior is unknown.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=7cm]{tcpdur.eps}
\end{center}
\caption{Histogram of the duration of TCP connections}
\label{tcpdur}
\end{figure}
Figure \ref{tcpspeed} presents the mean bit rate, obtained by dividing the total size by the duration of each TCP connection.
As can be seen, the bit rate enjoyed by connections is highly variable, from a few kbit/s to about 1~Mbit/s.
The large differences in bit rate are probably mostly due to bottlenecks outside the local network to which the server is connected.
Unfortunately the location of clients could not be extracted from the log file.
The dependence between the location of users and the ``speed'' they experience when communicating with the server remains to be studied.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=7.5cm]{tcpspeed.eps}
\end{center}
\caption{Histogram of average bit rate of TCP connections}
\label{tcpspeed}
\end{figure}
It should be noted that the HTTP/1.0 practice of opening a separate TCP connection for every element to download is wasteful in several respects.
First of all, it generates a large amount of unnecessary control traffic to set up and tear down connections.
Furthermore, since the TCP flow control mechanism takes some time to find a good rate at which to insert traffic into the network, short connections (in terms of transferred data as well as duration) are at a disadvantage.
These shortcomings are (among other things) targeted in HTTP/1.1 \cite{HTTP1.1}.
\subsection{Session Arrival Statistics}
Since HTTP/1.0, as mentioned above, opens a separate TCP connection for every element on an HTML page, there is a high correlation between the arrivals of TCP connections from a client.
%Due to the fact mentioned above that a separate TCP connection is opened for every element on an HTML page there exists a high correlation between the arrivals of TCP connections from a client.
It would be interesting to instead observe the behavior of human users when ``surfing''.
The strong correlation between TCP connections does not reflect the real behavior of human users; it is created by the protocols involved.
%More interesting is the arrival process of what can be called new sessions.
We now introduce the concept of a session.
A new session is considered to start when a human user chooses to download a new page.
%Typically such a session starts with the download of the HTML file that defines the page.
%When the file is analyzed by the client, download of the other elements on the page follows.
%On another time-scale the behavior of users can be found when the contents of the document are viewed and a decision might be made to follow a link referring to another page on the same, or perhaps another server.
To capture the session arrival process one should preferably look directly at the contents of the actual traffic conveyed over the TCP connections.
Another alternative might be to get this information from the server logs, an issue which remains for further study.
Since this information was not available in our trace we developed a decision mechanism to separate different sessions from each other.
By introducing a limit on how long a certain client must remain silent (no new TCP connections) before the download of a page is considered finished, we were able to separate sessions from each other.
A limit of around 4 seconds was experimentally found to work well.
Of the 17347 TCP connections in our trace, 6022 could be considered as indicating a new session with the rule above and a limit of 4 seconds.
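The separation rule can be written compactly. As a sketch, let $t_{c,i}$ denote the start time of the $i$th TCP connection from client $c$ and let $\tau$ be the silence limit. Both symbols are introduced here for illustration only, and it is assumed that the silence is measured between successive connection start times. Connection $i$ is then taken to start a new session if
\begin{equation}
i = 1 \quad \textrm{or} \quad t_{c,i} - t_{c,i-1} > \tau,
\end{equation}
with $\tau = 4$ seconds. With the counts reported above this corresponds to an average of $17347/6022 \approx 2.9$ TCP connections per detected session.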
The decision to use 4 seconds as a limit is of course somewhat arbitrary.
This length does, however, also make sense as a minimal separation between user activities.
Figure \ref{sessioncount} shows the number of detected sessions resulting from different limits.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=7cm]{sessioncount.eps}
\end{center}
\caption{Number of sessions detected for different session separation limits}
\label{sessioncount}
\end{figure}
\subsubsection{Distribution of Session \hfill \\ Arrival Times}
The start time of each of the TCP connections in our log is presented in figure \ref{tcpstart}.
As can be seen the intensity at which new TCP connections occur decreases slightly at the end of the trace.
In the investigations below the last 2.14 hours of the trace were discarded so as not to include this non-stationarity in the results.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=7cm]{tcpstart2.eps}
\end{center}
\caption{Start time for TCP connections}
\label{tcpstart}
\end{figure}
Figure \ref{tcpdisthist} shows the distribution of the interarrival times between TCP connections.
The use of several TCP connections to retrieve a page is here manifested in the large count of short interarrival times.
This large number of short interarrival times is the main argument against the times being exponentially distributed.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=7cm]{tcpdisthist.eps}
\end{center}
\caption{Histogram of interarrival times between TCP connections}
\label{tcpdisthist}
\end{figure}
In figure \ref{sessiondisthist} the interarrival times between the detected sessions are presented.
The limit used in the detection of sessions was 4 seconds.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=7cm]{sessdisthist.eps}
\end{center}
\caption{Histogram of interarrival times between sessions}
\label{sessiondisthist}
\end{figure}
Clearly figure \ref{sessiondisthist} looks more like an exponential distribution than figure \ref{tcpdisthist}.
The estimated mean session distance is 5.10 seconds and the estimated variance is 26.0 seconds$^2$.
This also seems to agree well with an exponential distribution whose mean is \begin{math}1/\lambda\end{math} and variance is \begin{math}1/\lambda^2\end{math}.
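As a quick check of this agreement, and assuming that the estimated mean is used as the estimate of $1/\lambda$, an exponential distribution with mean 5.10 would have the variance
\begin{equation}
1/\lambda^2 = 5.10^2 \approx 26.0,
\end{equation}
which coincides with the estimated variance to the precision reported.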
To test the hypothesis that the session distances are exponentially distributed a chi-square test was performed.
The result of the test is reported in table \ref{chi2table}.
The threshold values are based on a significance level of 5\%.
\begin{table}
%\begin{tabular}{|p{1.3cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|}
\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
M& N& D& T \\
\hline
\hline
10& 8 & 13.36& 15.51 \\
\hline
20& 18& 18.60& 28.87 \\
\hline
30& 28& 38.48& 41.34 \\
\hline
40& 38& 43.69& 53.38 \\
\hline
50& 48& 44.79& 65.17 \\
\hline
\hline
%%\tiny
\multicolumn{4}{|l|}{M - Number of intervals} \\
\multicolumn{4}{|l|}{N - Degrees of freedom} \\
\multicolumn{4}{|l|}{D - \begin{math}\chi^2\end{math} test statistic} \\
\multicolumn{4}{|l|}{T - \begin{math}\chi^2\end{math} threshold value} \\
%%\normalsize
\hline
\end{tabular}
\end{center}
\caption{Chi-square test of interarrival times}
\label{chi2table}
\end{table}
%\begin{table}[!h]
%\begin{center}
%\begin{tabular}{|l||c|c|c|c|c|}
%\hline
%Intervals& 10& 20& 30& 40& 50 \\
%\hline
%Degrees of freedom& 8& 18& 28& 38& 48 \\
%\hline
%\small \begin{math}\chi^2\end{math}\normalsize\ test statistic& 13.36& 18.60& 38.48& 43.69& 44.79 \\
%\hline
%%\vspace{1mm}
%\small \begin{math}\chi^2\end{math}\normalsize\ threshold value& 15.51& 28.87& 41.34& 53.38& 65.17 \\
%\hline
%\end{tabular}
%\end{center}
%\caption{Chi-square test for exponentially distributed interarrival times}
%\label{chi2table}
%\end{table}
The test was performed with equiprobable intervals.
%The number of of intervals in the tests were chosen according to rules found in the literature.
%According to \ref{chisquaretesting} the number of intervals should be \begin{math}\ln n\end{math} where \begin{math}n\end{math} is the number of samples, this gives 9 intervals.
%Another rule is found in \cite{sim} that recommends \begin{math}\sqrt n\end{math} to \begin{math}n/5\end{math}intervals, this gives 73 to approximately 1000 intervals.
Since the parameter of the exponential distribution was estimated from the data, the number of degrees of freedom of the \begin{math}\chi^2\end{math} distribution was decreased by one as recommended in the literature \cite{probability}.
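For reference, the test can be sketched as follows, assuming the standard Pearson form of the $\chi^2$ test. The notation ($n$ for the number of observed interarrival times, $O_i$ for the number of observations in interval $i$ and $a_i$ for the interval boundaries) is introduced here for illustration and is an assumption about the details of the procedure. With $M$ equiprobable intervals under the fitted exponential distribution, the boundaries are
\begin{equation}
a_i = -\frac{1}{\lambda} \ln\left(1 - \frac{i}{M}\right),\quad i = 0,1,\dots,M-1,
\end{equation}
with $a_M = \infty$, so that each interval has the expected count $n/M$. The test statistic is then
\begin{equation}
D = \sum_{i=1}^{M} \frac{(O_i - n/M)^2}{n/M}.
\end{equation}
With one estimated parameter the number of degrees of freedom is $N = M - 1 - 1 = M - 2$, which agrees with table \ref{chi2table}, \textit{e.g.} $N = 8$ for $M = 10$.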
%reference, perhaps?
The conclusion of the test is that we cannot reject the hypothesis that the interarrival times are exponentially distributed.
\subsubsection{Independence of Arrival Times}
It remains to determine to what degree successive interarrival times are correlated.
Figure \ref{autocorr} shows an unbiased estimate of the autocorrelation of the intervals at different lags.
Clearly the correlation has decreased significantly already at lag 1.
There does, however, seem to exist some correlation up to lag 20.
Since the mean arrival distance here is 5.10 seconds, lag 20 corresponds to about 100 seconds.
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{autocorr.eps}
\end{center}
\caption{Autocorrelation of interarrival times}
\label{autocorr}
\end{figure}
%INSERT DERIVATION HERE
For successive interarrival times $X_n,\ n=1,2,3,\dots,N$, the unbiased estimate of the autocorrelation at lag $k$ takes the following form.
\begin{equation}
R_X(k) = \frac{1}{N-k} \sum_{n=1}^{N-k} X_{n} X_{n+k},\qquad k\ge0
\end{equation}
Now assume the hypothesis that the $X_n$ are independent and identically exponentially distributed with parameter $\lambda$. This gives
\begin{align}
&\textrm{E}[X] = \textrm{E}[X_n] = 1/\lambda \\
&\textrm{E}[X^2] = \textrm{E}[X_n^2] = 2/\lambda^2 \\
&\textrm{V}[X] = \textrm{V}[X_n] = 1/\lambda^2.
\end{align}
Letting $Z_n = X_n X_{n+k}$ with $k \ge 1$, it is easily verified that
\begin{align}
&\textrm{E}[Z] = \textrm{E}[Z_n] =
%\textrm{E}[X_n X_{n+k}] = \textrm{E}[X_n] \textrm{E}[X_{n+k}] =
1/\lambda^2 \\
&\textrm{V}[Z] = \textrm{V}[Z_n] = %\textrm{E}[Z_n^2] - \textrm{E}^2[Z_n] = \\
%= \textrm{E}[(X_n X_{n+k})^2] - \textrm{E}^2[X_n X_{n+k}] = \\
%= \textrm{E}[X_n^2]\textrm{E}[X_{n+k}^2] - \textrm{E}^2[X_n]\textrm{E}^2[X_{n+k}] =
3/\lambda^4.
\end{align}
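The variance follows from the independence of $X_n$ and $X_{n+k}$ for $k \ge 1$. The intermediate steps, spelled out here for completeness, are
\begin{equation}
\begin{split}
\textrm{V}[Z_n] &= \textrm{E}[Z_n^2] - \textrm{E}^2[Z_n] \\
&= \textrm{E}[X_n^2]\textrm{E}[X_{n+k}^2] \\
&\quad - \textrm{E}^2[X_n]\textrm{E}^2[X_{n+k}] \\
&= \frac{4}{\lambda^4} - \frac{1}{\lambda^4} = \frac{3}{\lambda^4}.
\end{split}
\end{equation}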
Now the mean of the autocorrelation estimator is
\begin{equation}
\textrm{E}[R_X(k)] = \frac{1}{N-k} \textrm{E} \left[ \sum_{n=1}^{N-k} Z_n \right] = 1/\lambda^2.
\end{equation}
To find the variance of $R_X(k)$ we first need the covariance between different $Z_n$.
\begin{equation}
\textrm{Cov}[Z_i,Z_j] = \textrm{E}[Z_i Z_j] - \textrm{E}[Z_i]\textrm{E}[Z_j]
\end{equation}
Expanding this reveals
\begin{equation}
\textrm{Cov}[Z_i,Z_j] =
\begin{cases}
\textrm{V}[Z] & i=j \\
\textrm{E}^2[X] \textrm{V}[X] & j=i \pm k \\
0 & \textrm{otherwise}
\end{cases}.
\end{equation}
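The middle case can be checked directly. For $j = i + k$ with $k \ge 1$, the product $Z_i Z_j = X_i X_{i+k}^2 X_{i+2k}$ contains three independent variables, so that
\begin{equation}
\begin{split}
\textrm{Cov}[Z_i,Z_{i+k}] &= \textrm{E}^2[X]\textrm{E}[X^2] - \textrm{E}^4[X] \\
&= \textrm{E}^2[X]\bigl(\textrm{E}[X^2] - \textrm{E}^2[X]\bigr) \\
&= \textrm{E}^2[X]\textrm{V}[X] = 1/\lambda^4.
\end{split}
\end{equation}
The case $j = i - k$ follows by symmetry. For all other $i \neq j$ the products $Z_i$ and $Z_j$ share no factor and are therefore independent.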
Returning to the variance of $R_X(k)$ we find
\begin{equation}
\begin{split}
\textrm{V}[R_X(k)] &= \frac{1}{(N-k)^2} \textrm{V}\left[ \sum_{n=1}^{N-k} Z_n \right] \\
&= \frac{1}{(N-k)^2} \sum_{i=1}^{N-k}\sum_{j=1}^{N-k} \textrm{Cov}[Z_i,Z_j].
\end{split}
\end{equation}
Expanding this using the results above gives
\begin{equation}
\begin{split}
\textrm{V}[R_X(k)] &= \frac{1}{(N-k)^2} \bigl((N-k)\textrm{V}[Z] \\
&\quad + 2(N-2k)\textrm{E}^2[X] \textrm{V}[X] \bigr) \\
&= \frac{5N-7k}{(N-k)^2 \lambda^4}.
\end{split}
\end{equation}
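The two coefficients in this expansion can be traced as follows. The double sum contains $N-k$ diagonal terms with $i=j$ and, assuming $N \ge 2k$, $N-2k$ index pairs with $j=i+k$ together with equally many with $j=i-k$. Inserting $\textrm{V}[Z] = 3/\lambda^4$ and $\textrm{E}^2[X]\textrm{V}[X] = 1/\lambda^4$, the sum of the covariances becomes
\begin{equation}
\frac{3(N-k) + 2(N-2k)}{\lambda^4} = \frac{5N-7k}{\lambda^4},
\end{equation}
which, divided by $(N-k)^2$, gives the variance of $R_X(k)$.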
%----------------------
We can now form a normalized estimate of the autocorrelation that, if our hypothesis is correct, should be approximately normally distributed with zero mean and unit variance for large $N$.
\begin{equation}
N(k) = \left(R_X(k) - 1/\lambda^2\right) / \frac{\sqrt{5N-7k}}{(N-k)\lambda^2}
\end{equation}
In figure \ref{normautocorr} this normalized estimate of the autocorrelation is plotted from lag 1 to lag 100.
The dotted line is the upper bound of a 95\% confidence interval based on a normal distribution with zero mean and unit variance.
Up to lag 15 there seems to be a higher correlation present than could be expected based on the assumptions above.
The conclusion is therefore that some noticeable correlation exists between session arrivals.
This correlation is, however, not very strong, and it can also be expected to be less significant in an environment with more users present, at least when networking and server resources are expanded accordingly and thus do not constitute a limitation.
It can be noted that the correlation remains above zero for all but a few of the lags in figure \ref{normautocorr}, but since all these values fit within the 95\% interval we refrain from drawing any conclusions from this fact.
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{autocorr_norm.eps}
\end{center}
\caption{Normalized autocorrelation}
\label{normautocorr}
\end{figure}
A natural extension of this investigation would be to test the distribution and independence for shorter intervals.
The arrivals in each interval could then be tested against parameters specific to the interval under study.
An even better fit can be expected from such a study.
\section{ATM}
\subsection{IP over ATM}
To investigate the impact of this type of traffic when ATM is used as the underlying transport mechanism, a series of simulations was performed.
A simulation environment in C++ was built to test various ideas.
For the conversion of the captured IP traffic the method presented in figure \ref{ipoveratm} was used.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=2.8cm,angle=-90]{ip_over_atm.eps}
\end{center}
\caption{IP over ATM}
\label{ipoveratm}
\end{figure}
The approach in figure \ref{ipoveratm} is denoted ``Classical IP over ATM''.
RFC 1577, ``Classical IP and ARP over ATM'' \cite{rfc1577} defines how to use ATM as a replacement for standard network technology when using IP.
Large parts of RFC 1577 are concerned with the resolution of IP addresses to ATM addresses; these parts are not included in our simulations.
The actual mapping of IP packets into ATM cells is defined in RFC 1483 \cite{multenc}.
RFC 1483 describes two different methods for carrying connectionless traffic (\textit{e.g.} IP) over ATM AAL5 PDUs.
The first method multiplexes several protocols over one ATM Virtual Circuit.
This is accomplished by prefixing the AAL5 PDUs by an IEEE 802.2 Logical Link Control (LLC) header.
The second method separates different protocols by using separate VCs for different protocols.
The second method is to be preferred, since the first method duplicates functions already found in ATM.
The second method, also denoted ``VC Based Multiplexing'', also has the attractive property of allowing different types of traffic to be separated on different VCs.
The mapping of IP packets into AAL5 PDUs is rather straightforward and the number of cells an IP packet generates can be found from the following formula.
\begin{equation}
N_{cells} = \Biggl\lceil \frac{IP_{size} + 8}{48} \Biggr\rceil
\end{equation}
where \begin{math}N_{cells}\end{math} is the number of ATM cells and \begin{math}IP_{size}\end{math} is the size of the IP packet in octets.
The constant 8 is the size of the AAL5 PDU trailer in octets.
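As an illustration of the formula, the cell counts for a few packet sizes are worked out below. The sizes are chosen as examples only: 1500 octets corresponds to the peak in figure \ref{ipsize}, 40 octets is the size of an IP packet carrying a TCP segment without options or payload, and 41 octets is a hypothetical packet just past a cell boundary.
\begin{align}
IP_{size} = 40:\quad & N_{cells} = \lceil 48/48 \rceil = 1 \\
IP_{size} = 41:\quad & N_{cells} = \lceil 49/48 \rceil = 2 \\
IP_{size} = 1500:\quad & N_{cells} = \lceil 1508/48 \rceil = 32
\end{align}
A single extra octet can thus double the number of cells for a small packet. Since each ATM cell occupies 53 octets on the link, the 1500 octet packet is carried in $32 \cdot 53 = 1696$ octets.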
%<<<<<