Caching on JANET

This report, commissioned by the Advisory Committee on Networking, considers the application of caching to make more efficient use of national and international networks. The main focus is on the World Wide Web, since this now represents the majority of traffic on most networks, but most caching systems also support the FTP and Gopher protocols.

The report is in two parts, covering the theory and practice of caching. Sections 1 to 3 deal with the operation of caches: the strategies of simple caching and cache co-operation, and the problem of document consistency. Section 4 is a survey of some of the software now available to implement caches; section 5 examines some of the major cache installations now in operation. Section 6 concludes with recommendations for the further development of caching within the UK.

Simple caches

Operation

Most web browsers have a very simple approach to networking. Given a URL, containing a host name and an item on that host, they make a TCP connection to the named host and retrieve the specified item. If the host cannot be reached or the item does not exist, the user will receive an error message instead of the page requested. Since browsers, and their users, are independent, there is a huge amount of replication in the information carried over the network: every user gets their own copy of every page they request. Popular sites may have many simultaneous connections transmitting identical copies of a single item over the same trunk routes.

Web caches are an attempt to reduce this wastage; by intercepting parallel requests a cache may be able to serve copies of the same document to a number of users from only a single connection to the remote site. To achieve the greatest reduction in network traffic this merging of individual requests should be performed as close as possible to the users (certainly before the requests reach slow international links), though each cache must still have sufficient users to ensure a reasonable number of duplicate requests.

The interaction between a browser and a cache is the same as between a browser and a host site. A TCP connection is made and an item requested. The only difference is that the request must be for the full URL, not just the item part, as the browser is no longer connecting directly to the host given in the URL. If the connection cannot be made, the browser is likely to display an error message without checking whether the source host may be reached by some other route. When a cache receives a request it compares it against other recent requests. If the same URL has already been fetched from its source, and provided a copy of the resulting document was kept and is still available, then the new request can be satisfied immediately by replying with the local copy. Otherwise the cache will forward the request, either to the host named in the URL or to a parent cache, and return any response to the browser. If the document satisfies the cache's internal rules then a copy will be kept in readiness for future requests. Simple caches make requests through TCP connections in exactly the same way as browsers so, once again, if the connection cannot be made, an error will be returned to the user.
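
The difference between the two kinds of request can be made concrete with a short sketch. The hostnames, port numbers and helper names below are placeholders, and HTTP/1.0 is assumed for simplicity; a real browser or cache would add further headers and error handling.

    import socket

    def read_reply(sock):
        """Read the whole HTTP reply until the server closes the connection."""
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)

    def fetch_direct(host, path, port=80):
        """Direct retrieval: connect to the source host and request the item part only."""
        with socket.create_connection((host, port)) as s:
            s.sendall(("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)).encode())
            return read_reply(s)

    def fetch_via_cache(cache_host, cache_port, url):
        """Cached retrieval: connect to the cache and request the full URL."""
        with socket.create_connection((cache_host, cache_port)) as s:
            s.sendall(("GET %s HTTP/1.0\r\n\r\n" % url).encode())
            return read_reply(s)

For example, fetch_via_cache("wwwcache.example.ac.uk", 3128, "http://www.w3.org/") asks the (placeholder) cache for the whole URL, whereas fetch_direct("www.w3.org", "/") names only the item on the source host.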

Caches can pass on requests to other caches so a hierarchy can be built up with each level serving, indirectly, a wider community of users. However the restriction that each client (whether browser or cache) can have only one parent cache limits the structure to a simple tree, with browsers at the leaves and unsatisfied requests passing directly upwards to the highest level cache and then to the source host. The requirement that each cache must have a sufficient number of clients applies at each level, so a fairly wide, flat, hierarchy is best. The trunk links within JANET are fast and, at present, relatively free from congestion, so there is little networking purpose in having more than two levels of caches - local (or institutional) and national - although where a number of local caches are connected by a particularly high-speed network, such as a MAN, an intermediate regional cache may give a useful reduction in traffic between the MAN and JANET. Conversely, if an institution does not yet have a local cache, its browsers may send requests directly to a regional or national cache since the transaction between a simple cache and its client is the same whether the client is a cache or a browser. The cache hierarchy, and the underlying networks, are shown in figure 1.

Problems

Cache hierarchies can yield successive reductions in network traffic but there is a cost in replication of disk storage. When a request is passed up a branch of the cache hierarchy, the returning response is likely to be saved by each cache it passes through. In a flat hierarchy this is acceptable since there will be only two or three such caches. For simple caches the width of the hierarchy is a more serious problem, since caches at the same level are inaccessible to one another. Popular documents, read by at least one user at each site, are likely to exist as individual copies in every cache in the country.

Cache machines have only finite disk space so must eventually discard old copies to make room for new requests. Most use a policy based on discarding the least recently requested document, as being the least likely to be requested again in future. Given sufficient disk space, documents are discarded around the time when they would, in any case, have been replaced by a more up-to-date version. However if disk space is insufficient then the cache may be forced to discard a current document and make an unnecessary connection to the source host when the document is next requested. The amount of disk space required depends on the number of users served and the breadth of their reading. Ideally a cache should have room to store every document which the users of the cache request more than once during the lifetime of the document. Such a cache would never retrieve a second copy of an unchanged document, so would generate the minimum possible network traffic. To achieve this in practice would, of course, involve storing every document requested since there is no way to predict which documents will be re-read in future.
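
The least-recently-requested policy can be sketched in a few lines. This is an illustration of the eviction rule only, holding documents in memory under an invented byte budget; a real cache stores documents on disk and tracks many more attributes.

    from collections import OrderedDict

    class LRUStore:
        """Discard the least recently requested documents when space runs out."""

        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.docs = OrderedDict()   # URL -> body, least recently requested first

        def get(self, url):
            if url not in self.docs:
                return None              # miss: the caller must fetch from the source or a parent
            self.docs.move_to_end(url)   # a hit makes the document "recent" again
            return self.docs[url]

        def put(self, url, body):
            if url in self.docs:
                self.used -= len(self.docs.pop(url))
            # discard the least recently requested documents until the new one fits
            while self.docs and self.used + len(body) > self.capacity:
                _, old = self.docs.popitem(last=False)
                self.used -= len(old)
            if len(body) <= self.capacity:
                self.docs[url] = body
                self.used += len(body)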

The relationship between number of users and cache disk space means that more disk space is needed at the top of the hierarchy than the bottom. A departmental cache in an institution might well have space for all the pages on its topic, but the parent institutional cache would need to store all the pages for all topics. At the national cache level it becomes impractical for a single computer to manage either the required disk space or the rate of incoming requests so that the service must be split across a number of machines. The load of requests is shared, but the individual caches are now at the same level in a hierarchy of simple caches, so cannot make use of disks connected to other machines. To avoid generating unnecessary requests, each one must have sufficient disk space to store all of its users' requests. The situation is shown in figure 2 where a simple client requests a document from one of the national cache machines. If this particular cache does not have the document, or has discarded its copy, it cannot discover whether the document may be available from one of its fellows so can only obtain it by contacting the source host.

It is estimated that the UK national cache serves 20% of its potential users at present. An increase in the number of users is likely to require additional cache machines to handle the increasing number of connections. To maintain the same level of service each new machine must have at least the same amount of disk as the present caches. Furthermore the number of documents on the World Wide Web is also increasing, so the disk space per machine may also need to grow to match.

Co-operating Caches

Client intelligence

Instead of blindly sending all requests to a single cache, as described above, there are a number of ways for a client to improve its use of the network. Most browsers and caches have a list of patterns to match hosts which should be contacted directly, rather than through a cache. The list would normally include any sites which are either closer than the cache, or at a similar distance. If an institution has web servers and a cache on the same local network, then a direct call to the server should give as quick a response for local browsers as going via the cache. Disk space on the cache server is better used to store documents with a higher cost of retrieval. For the same reason, browsers and institutional caches connected to JANET should normally make direct connections to web servers on that network rather than requesting them from the national cache. This simple form of cache selection based on the URL can also be extended to choose one from a number of available caches which specialise in particular URLs. For example, in America, the NLANR caches on the east coast hold pages from Europe while those on the west coast are used for pages from Asia. This may reduce the amount of duplication of files on the different caches, but works best if clients follow the same geographic rules when selecting a cache. Requesting a page from the wrong cache may result in a copy being stored there as well. Of the popular browsers, Netscape 2 is at present the only one to support full cache selection through its auto-configuration scripts.
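
The kind of URL-based selection described here amounts to a small pattern table. The sketch below is illustrative only: the patterns, cache names and port numbers are invented, and real browsers express the same rules through no-proxy lists or auto-configuration scripts rather than Python.

    from fnmatch import fnmatch
    from urllib.parse import urlsplit

    # Hosts close enough to contact directly (placeholders for local and JANET servers).
    DIRECT_PATTERNS = ["*.example.ac.uk", "*.ja.net"]

    # Caches specialising in particular parts of the URL space (invented names).
    CACHE_BY_HOST = [
        ("*.jp", "cache-west.example.net:3128"),
        ("*.de", "cache-east.example.net:3128"),
    ]
    DEFAULT_CACHE = "wwwcache.example.ac.uk:3128"

    def choose_route(url):
        """Return "DIRECT" or the cache which should receive this request."""
        host = urlsplit(url).hostname or ""
        if any(fnmatch(host, pattern) for pattern in DIRECT_PATTERNS):
            return "DIRECT"
        for pattern, cache in CACHE_BY_HOST:
            if fnmatch(host, pattern):
                return cache
        return DEFAULT_CACHE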

If a cache machine fails, and stops accepting requests, it seems obvious that clients should ignore it and attempt to find another route to the source host instead. However very few browsers provide this resilience, perhaps because cache servers were developed as part of network firewalling where a direct connection would be bound to fail. Netscape's auto-configuration scripts can return an ordered list of caches, with a default option to connect direct if none of these responds. Many other internet applications follow a de facto standard whereby the Domain Name Service provides a list of translations for an address and the client chooses one which is available. This has never been widely supported on the Web, however, and most browsers will only use the first cache on a DNS list. A DNS server can share requests among a group of cache machines by varying the order in which the list is presented, but this gives only a crude division of labour.

Netscape's resilience to cache failure is provided by the normal TCP connection along which the request is made and the document retrieved. If the attempt to connect to the first choice cache is refused then the browser will try to connect to the next cache on the list. An alternative method is to send an initial enquiry to the cache, using a UDP packet, and only initiate the TCP connection once a reply has been received. The UDP packet may be sent to a special port maintained by the cache software, or to the echo port which is a standard part of the TCP/IP suite. Using the echo port merely checks that the machine is alive, and provides no evidence that the cache software is running. As a means of testing whether a single cache is alive, this is inferior to the TCP method since it may give false negatives if a UDP packet is lost for some other reason, and involves an additional exchange of packets if the cache is alive. However if a number of caches are available then a preliminary exchange of UDP packets can be used to determine which one gives the fastest response. Packets are sent in parallel to the caches and the round-trip time measured; the fastest cache can then be chosen to receive the TCP request. The round-trip time will include contributions from the speed and loading of each cache server but will also be affected by network distance and congestion, so is a good overall estimate of the "cache speed" perceived by the user. For the cache servers it also provides better load balancing than the simple division of labour provided by rotating DNS lists.
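
A minimal sketch of the round-trip measurement, assuming that each candidate cache answers a single UDP packet on some agreed port (an echo or enquiry port); retries for lost packets and most error handling are omitted.

    import socket
    import time

    def fastest_cache(caches, payload=b"probe", timeout=2.0):
        """Send one UDP packet to each (host, port) cache and return the first to reply."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            probed = {}
            for host, port in caches:
                address = (socket.gethostbyname(host), port)
                probed[address] = (host, port)
                sock.sendto(payload, address)
            start = time.monotonic()
            deadline = start + timeout
            while True:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None, None        # nobody replied: fall back to a direct connection
                sock.settimeout(remaining)
                try:
                    _, address = sock.recvfrom(65535)
                except socket.timeout:
                    return None, None
                if address in probed:
                    return probed[address], time.monotonic() - start

Because the first reply to arrive is, by definition, from the fastest responder, the function can return as soon as any recognised cache answers.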

Using multiple UDP enquiries before the TCP connection appears to be a considerable waste of network bandwidth. In fact each UDP exchange requires only two IP packets against at least eight for a TCP connection. If the UDP enquiries can improve the chance of a successful TCP connection then they are indeed worthwhile. Resilience and load balancing can be achieved without placing any meaningful information in the UDP packets; by including the request itself in the body of the packet a further improvement can be made with no additional network traffic. The UDP reply from the cache can then indicate not only its presence and its speed, but also whether the requested document is available. The client can then choose a cache which is able to fulfil its request immediately, in preference to one which may have to obtain the document from elsewhere. Furthermore if the document is small enough to fit in a single UDP datagram then it may be possible to include it in the UDP reply from the cache server, thereby eliminating the costly TCP connection entirely. The size of document which can be returned in this way is limited by the TCP/IP implementations, varying between 512 and 65507 bytes, though 8192 is a common value [Stevens 1994].

These five levels of client intelligence are summarised in the following table:

Level  Name            Behaviour
0      simple          uses a single cache; may have patterns for uncached URLs
1      selection       chooses a cache (or not to cache) based on the URL
2      resilience      detects absence of cache and tries alternatives
3      load balancing  chooses a cache based on measured speed of response
4      discovery       chooses a cache which has the requested document

Operation

Load balancing and discovery were introduced by the Harvest project at the University of Colorado, who developed an Internet Cache Protocol (ICP) for this purpose. At present ICP is only supported by cache software which has its origins in the Harvest project, but no alternative protocol has appeared and ICP is now being proposed as an Internet standard for cache inter-communication. Current implementations only use ICP for the initial UDP enquiries; if a document is subsequently obtained by a TCP connection to a cache or the source host, then this uses the normal Hypertext Transfer Protocol (HTTP). The interaction between a group of ICP caches is shown in figure 3. ICP exchanges, using UDP, are shown as dotted lines while HTTP exchanges, using TCP, are solid lines.

Each cache has a list of others to which ICP queries may be sent. Patterns may be assigned to each one allowing an initial selection based on URL. When a request is received which an individual cache cannot satisfy, it sends out parallel ICP enquiries to each of the eligible caches for the requested URL. A UDP echo packet may also be sent to the source host named in the URL, but this is usually disabled since it can only be used for load-balancing, not discovery. Some servers have also reacted badly to receiving large numbers of these echo packets. The client cache then waits for responses to its enquiries and measures the round-trip time for each one. To allow for lost packets, or caches which are down, a limit is placed on this waiting time. If the cache receives 'hit' responses, indicating that the requested document is available, then it will retrieve the document by an HTTP connection to the server with the fastest of these responses. If the document itself was returned in the ICP packet, then no HTTP connection is required. If only 'miss' responses are received then the client will consider only those caches which are defined as 'parent' (rather than 'neighbour') in its configuration, and will make an HTTP connection to the parent which returned the fastest 'miss'. The client already knows that this parent will have to forward the request, but has at least determined that it is the best able to do so. Pure ICP operation assumes that round-trip time is the only relevant measure, but if technical or administrative reasons require that particular caches be favoured, even if they are slower to respond, then weighting factors can be applied to the timings.
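
The selection rules in this paragraph can be summarised as a small decision function. The sketch below works on already-collected replies rather than the ICP wire format, and the names and weighting scheme are illustrative only.

    def select_source(replies, parents, weights=None):
        """Pick where to fetch a document from ICP-style replies.

        replies: list of (cache_name, "HIT" or "MISS", round_trip_seconds)
        parents: the set of cache names configured as parents; the rest are neighbours
        weights: optional multipliers applied to timings to favour particular caches
        """
        weights = weights or {}

        def weighted_time(reply):
            name, _, rtt = reply
            return rtt * weights.get(name, 1.0)

        hits = [r for r in replies if r[1] == "HIT"]
        if hits:
            return min(hits, key=weighted_time)[0]            # fastest cache holding the document

        parent_misses = [r for r in replies if r[0] in parents]
        if parent_misses:
            return min(parent_misses, key=weighted_time)[0]   # fastest parent, which must forward the request

        return "SOURCE"                                       # no usable reply: contact the source host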

The hierarchy of parents and neighbours implemented by ICP is much looser than the rigid tree imposed by simple client behaviour. During the discovery stage there is no distinction between the two types, and only in retrieving unavailable documents does a client assume that its parents are in some sense 'nearer' to the source of documents than it is. The ability to send requests to multiple neighbours also gives access to documents stored by caches at the same level of the hierarchy. A cache may be configured to keep local copies of documents obtained from other caches (as in the simple caching method) or simply to forward them to its clients. The latter option reduces the amount of replication of documents, and hence the disk space required by each cache, but increases the amount of network traffic between the caches. Most caching software allows the choice to be made separately for each neighbour or parent so that disk space and network traffic can be balanced. An even greater benefit is seen at the national level, since if clients can discover in advance which of the national cache machines holds a document then the number of duplicate copies resulting from connections to the 'wrong' machine will be reduced. A group of institutional caches co-operating over a fast Metropolitan Area Network may be as effective as a separate MAN cache and make this intermediate level unnecessary. Such a group of caches would be unlikely to save documents obtained from one another, but would save documents retrieved from hosts or caches beyond the MAN.

This description has assumed that the client making ICP requests is itself a cache and not an individual browser. There is no technical reason why browsers should not use the protocol, though the effort required to maintain configurations for different versions of different browsers would be considerable. Most institutions should need only a single local cache machine to which all requests for external pages should be sent, so the full power of co-operation is not required. A more worthwhile development in browser communications would be to include tolerance of cache faults, either through auto-configuration scripts or proper handling of DNS lists.

Inter-working

Previous sections have described homogeneous systems of simple and co-operating caches. It is unlikely that this could ever be achieved in practice, so the operation of a mixed system needs to be considered.

Simple caches can be referenced by ICP clients but can only usefully be configured as parents. Since a simple cache does not offer an ICP service, enquiry packets are instead directed to the TCP/IP echo port. If the cache host is running then the packet will be returned unaltered, which is interpreted by the client as a 'miss' but allows a round-trip time to be measured. As a neighbour the simple cache would never be used, but if no hits are received then it may be chosen as the fastest parent. A normal HTTP connection will then be made to the simple cache to request the document. The interaction still follows the pattern of figure 3 but the ICP packets provide load balancing only and not discovery. If a client has both simple and co-operating parents, it is possible that the simple parent will always be chosen as it has much less work to do before returning the 'miss' response. To prevent this, co-operating parents should be given a preferential weighting since a document saved on one of these caches will be of more benefit to the client in future.

When a simple client requests a document from one of a group of co-operating caches the interactions are as shown in figure 4. If the cache does not have the requested document, it sends ICP enquiries to its fellows and may then make an HTTP connection to one of them to retrieve the document, which is returned to the client. The document should only be saved to disk by the second cache if the network between the cache machines is known to be slow or heavily loaded. Having multiple copies is wasteful of disk space so it is important to provide sufficient network bandwidth between co-operating caches.

Effects of co-operation

Cache co-operation has no effect when a document is found on a browser's first-choice cache or, assuming the ICP timeout is negligible compared to the delay in fetching a document from the source host, when the document is not cached anywhere on the line from browser to source. Introducing co-operation at any single level of the hierarchy increases the number of caches to which a browser has indirect access, so should increase the chance of finding a copy of the document nearer than the original. For documents which do not fit into an ICP reply packet the amount of network traffic will increase somewhat, though the extra packets use the cheap UDP protocol. For small documents it is possible that the traffic may decrease. Caches connected by fast networks may be configured to share disk space, reducing the need for large disks at all caches, though this will also increase the network traffic between them. When two levels of the hierarchy co-operate the ICP discovery process replaces this extra traffic between caches. The availability of multiple caches gives resilience and load balancing, benefits which are passed on to the user.

Cache Consistency

Whenever copies are made of an original document the problem of maintaining consistency arises. If the original changes after the copies are made then the copies immediately become out of date. This problem applies to web caches as much as to any other document store. If consistency is not maintained then a document can become stale; the staleness of a copy is defined as the length of time since the original became different from the copy and is often expressed as a percentage of the age of the document.

When dealing with on-line documents, such as web pages, it is at least possible to contact the original source to check whether a copy is stale, provided the network and host server have not changed or failed. However the early versions of the Hypertext Transfer Protocol only provided the 'GET' request which retrieves the full text of a document. To ensure that a copy was up to date, a cache had to repeat the whole process of obtaining the original and then compare the text against the copy. The first improvement was a new 'HEAD' request which returns summary information about the requested document such as the time of last change. If a HEAD request showed a copy to be stale then another connection had to be made to the original host to perform a GET. This disadvantage led to the introduction of the conditional GET (also known as the If-Modified-Since GET). An IMS GET includes the date of the copy in the request: the server is expected to reply with either 'unchanged' or the new text of the document. Both the check and the update are thus performed by a single request. Most web servers now support IMS GET requests; those which do not support them treat them as simple GETs and return the text even if it is unaltered. These checks still involve contacting the original host, so may take some time for a distant site, but they reduce the amount of network traffic required when the document is unchanged.
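
A conditional GET is easy to demonstrate with a modern HTTP library; the sketch below is not how a 1996 cache was written, but it shows the exchange described above.

    from email.utils import formatdate
    from urllib.error import HTTPError
    from urllib.request import Request, urlopen

    def revalidate(url, cached_body, fetched_at):
        """Check a cached copy with an If-Modified-Since (conditional) GET.

        fetched_at is the Unix time at which the copy was obtained.
        """
        request = Request(url, headers={
            "If-Modified-Since": formatdate(fetched_at, usegmt=True),
        })
        try:
            with urlopen(request) as reply:
                return reply.read()      # 200: the document has changed, use the new text
        except HTTPError as error:
            if error.code == 304:
                return cached_body       # 304 Not Modified: the existing copy is still current
            raise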

Staleness only becomes a problem when a document is requested from a cache (though if a cache must discard documents to save disk space it might be attractive to discard stale ones first). It is possible for a cache to avoid serving stale documents entirely by checking every request against the original server; however, this may not always be desirable. For many documents, especially small ones from distant sites, establishing the connection to the host site takes longer than transferring the text. In such cases a cache checking for staleness with a conditional GET will take nearly as long as the client making a direct request for the full document and the purpose of caching is lost. For these documents a user may be prepared to accept the possibility of receiving a stale copy in return for the speed of an immediate response from the cache. At present the "acceptable staleness" is set by the operator of the cache, though proposed developments in HTTP will allow users to express their preference for each request. However if users feel the cache staleness is too high then they will either force the cache to refresh the document, possibly when there is no need, or else stop using the cache entirely.

If a cache does not check staleness on every request then some other algorithm must be used to decide when to perform the check. The simplest is to check at regular time intervals (or, in practice, on the first occasion the document is requested after the interval has elapsed). This allows the cache manager to quote an exact maximum staleness, for example that a particular cache will never issue a document which is more than twelve hours out of date.

Another method attempts to reflect the variety of documents on the web: some change hourly, others never. When a document is saved by a cache it is given a time to live (TTL) and will be issued without reference to the original until the TTL has elapsed. The HTTP protocol defines timestamp headers which could be used to set the TTL but unfortunately most are optional and not provided by many servers. The most widely used is Last-Modified, which gives the last occasion on which a document was changed. This allows the TTL to be set as a percentage of the current document age, for example with a TTL percentage of 10% a document which was last changed 10 days ago will not be checked until 1 day later. Two additional values are usually included in the TTL calculation: a maximum, to prevent documents remaining in the cache without checking for an unreasonable time, and a default TTL for documents which arrive without a Last-Modified header. A few documents arrive with an explicit expiry date. Most caches will use this as the TTL and risk serving very stale copies if the document author decides to make a change before the intended date.
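
The TTL calculation described above might look as follows; the parameter names and default figures are invented for illustration and are not those of any particular cache program.

    def time_to_live(now, last_modified=None, expires=None,
                     ttl_fraction=0.1, default_ttl=3600, max_ttl=7 * 24 * 3600):
        """Return a time to live, in seconds, for a newly cached document.

        All timestamps are Unix times.
        """
        if expires is not None:
            return max(0, expires - now)                       # an explicit expiry date takes precedence
        if last_modified is not None:
            age = max(0, now - last_modified)
            return min(max_ttl, age * ttl_fraction)            # e.g. 10% of the current document age
        return default_ttl                                     # no timestamp headers were supplied

With ttl_fraction set to 0.1, a document last changed ten days ago receives a TTL of one day, matching the example in the text.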

Variable TTL methods would seem to be an improvement on fixed staleness in two ways: static documents are not checked unnecessarily and rapidly changing ones do not become excessively stale. However experience has indicated that users prefer to have a guaranteed maximum staleness. A compromise, if bandwidth permits, might be a TTL algorithm with an upper limit of a few hours. This would at least improve the handling of short-lived documents.

Most cache programs allow different staleness parameters to be used for different URLs, normally through matching a list of patterns. This is commonly used to differentiate between text and graphic files. Text files are expected to be small and subject to frequent and significant changes, so should have limited staleness, while images are large and change less often, at least in terms of their information content. Pattern matching can also be used to handle other protocols, such as FTP and gopher, which have no knowledge of the document age.
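
Pattern-based parameters of this kind are essentially a lookup table in front of the TTL calculation sketched earlier. The patterns and figures below are placeholders, not recommendations.

    from fnmatch import fnmatch

    STALENESS_RULES = [
        ("*.html",  {"ttl_fraction": 0.1, "max_ttl": 12 * 3600}),      # text: keep staleness low
        ("*.gif",   {"ttl_fraction": 0.5, "max_ttl": 7 * 24 * 3600}),  # images: change rarely
        ("ftp://*", {"default_ttl": 3 * 24 * 3600}),                   # FTP: no document age available
    ]
    DEFAULT_RULE = {"ttl_fraction": 0.2, "max_ttl": 24 * 3600}

    def staleness_parameters(url):
        """Return the TTL parameters for the first pattern matching this URL."""
        for pattern, parameters in STALENESS_RULES:
            if fnmatch(url, pattern):
                return parameters
        return DEFAULT_RULE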

Cache Software

The World Wide Web is perhaps the ultimate open system, allowing many different systems and programs to work together. It is therefore no surprise to find a variety of programs capable of acting as web caches. Most of the development of the web infrastructure (as opposed to web clients) has been based on the Unix operating system so this is the system on which the majority of caching programs run. Unix computers span the range of sizes needed for institutional, local and national caches though for small institutions without Unix experience, viable alternatives may soon be available.

A survey of cache and browser software was written for the European Community Desire project in early 1996 [Bekker et al. 1996] though other programs and new versions have become available since then. The following sections describe cache programs which are in active use now (August 1996) as well as some new systems which may become popular in future.

WWW Consortium

The World Wide Web Consortium (W3C) maintain a collection of freely-available reference software used to demonstrate and evaluate new developments in WWW technology. This collection now includes two WWW servers, both of which are able to act as caching proxies.

CERN proxy/server [Neilsen 1996]

The original CERN web server was passed on to the W3C and developed to version 3.0A in July 1996, at which point further development ceased in favour of the new Jigsaw server described below. In future new releases will only be made if security problems are discovered. The program is written in C and has been ported to most flavours of Unix. As the first public domain web server it was widely adopted to set up web sites and, when it gained the ability to act as a caching proxy, many sites chose it as a web cache rather than introduce a new and unfamiliar alternative. Both Britain and New Zealand used the CERN server for their national academic cache service before moving to more efficient alternatives as demand grew. However the program is still widely used by institutions as a local cache [Hamilton 1996a].

The server acts as a simple cache with a unique parent for each protocol supported (HTTP, FTP, Gopher and WAIS). It is not resilient to the failure of its parent, returning errors if the parent cannot be contacted. A list of patterns can be given for which the parent should not be used. Both fixed staleness and variable TTL algorithms can be used to determine when to contact the original host; a conditional GET request is issued for document requests when these limits expire. The cache hierarchy is configured by setting environment variables before the program is run; other values and options are set by directives in a text file.

The cache is implemented as a single server process which creates a new child process to handle each request. When the request has been satisfied, the child exits. The continual creation and destruction of child processes places a considerable load on the machine. When files are saved they are written to disk in a file whose name is derived from the URL. This results in a deep directory structure so that retrieving a cached file can involve a large number of disk accesses to read each of the many directories in the path. As a result the CERN server is slow and not suitable for heavily loaded applications. It was not originally designed as a cache server so is much less efficient than more recent programs which are dedicated to the task. [Bekker 1996] found it to be more than an order of magnitude slower than version 1.4 of the Harvest server but concluded that its relatively simple configuration and maintenance might make it suitable for "end user sites with lower demands and with fast and easy access to a more advanced proxy server".

Jigsaw [Baird-Smith 1996]

Jigsaw is a replacement for the CERN server being developed by the W3C, currently available in an experimental alpha version. Although it is still a dual-purpose cache and web server it addresses some of the performance problems of the older program and is reported to be five to ten times faster, without any specific attempts to optimise the code. Of most interest is the fact that the program is written in Java, so should run on any computer and operating system to which the Java environment is ported, allowing the same cache program to be run on both Unix and non-Unix systems.

Apache [Apache 1996]

Apache is a freely-available, modular web server, developed from the NCSA program by a group of volunteers. The program is written in C and runs on most Unix operating systems. An experimental caching module has been written which may be compiled into the basic server. The module supports selection of parents based on the URL, but it is not clear from the documentation whether it is resilient if the chosen parent cannot be contacted. Staleness can be controlled by both fixed time and TTL methods.

Apache is reported to be the most widely-used WWW server, used by 36% of web sites. The addition of caching ability to this popular program may lead to its widespread adoption as a caching proxy as it did for the CERN proxy server.

Spinner [Spinner 1996]

Spinner is another example of a modular web server for which a cache module is available. It does not support co-operation between caches but different parents may be used for URLs matching different patterns. It is not clear, if there is more than one parent for any given URL, how the program chooses between them or whether it is resilient to the failure of the chosen parent. The program is distributed as source code and is known to run on most common flavours of Unix.

Purveyor [Process 1996]

Purveyor is a web server and proxy cache running under Open VMS. It has many features intended for use as a firewall gateway, including restriction by host and URL pattern. It can be configured to use a single parent cache but does not appear to be resilient to the failure of that cache. Each cached document which does not have an explicit expiry time is assigned a fixed time to live and also a shorter expiry time in case it is not re-read. When documents are accessed their staleness may be checked with a conditional GET request. The program is for sale in binary form for VAX and Alpha platforms.

Catapult [Microsoft 1996]

In June 1996 Microsoft announced a beta release to registered Microsoft developers of their Catapult proxy server for Windows NT. The program is intended as a gateway between a corporate or institutional LAN and the Internet and can restrict both the hosts which are allowed to access it and the resources which they can contact. Local clients may use Novell's IPX protocol to communicate with the server so do not need to support TCP/IP. The server can manage a dial-up link so does not need a permanent connection to the Internet. While connected it will automatically fetch fresh copies of the most popular documents and store them in its local cache. These documents can then be served to users even while the gateway is disconnected from the Internet. The server offers a particularly wide range of protocols, including HTTP, FTP, RealAudio, VDOLive, IRC, mail and news. It can be managed remotely using the standard Microsoft server administration tools.

No details are available of the caching methods, or how the server will interact with other caches.

Netscape Proxy Server [Netscape 1996]

The Netscape Proxy Server is a dedicated web cache and proxy server, available commercially since 1995. The program is available for a wide range of operating systems: Digital Unix, HP-UX, AIX, IRIX, SunOS, Solaris, BSDI, Windows 95 and Windows NT. The server is installed, configured and controlled through a web-based manager program; the initial installation of this manager is the only stage at which access is required to the command line, or indeed the server host itself.

For each request the server uses at most a single parent cache, but the parent may be selected according to the URL requested. If the first choice parent is not available then a list of successive alternatives may be tried. If all else fails the cache will contact the original source directly. A Netscape cache is therefore resilient to the failure of other hosts but cannot perform discovery or load balancing among a number of possible document sources. Document staleness is checked using conditional GET requests and may be invoked on every request, at fixed time intervals, or using a TTL based on the document age when it was cached. Documents retrieved using FTP or Gopher, which do not provide the document age, are reloaded after a fixed time. Maximum and minimum sizes may be given: outside these limits a document will be retrieved but not saved in the cache. The server can also pre-emptively fetch groups of linked web pages according to a schedule. This might be used, for example, to load busy documents at off-peak times to ensure that they can be served direct from the cache.

The cache has a variety of filtering options for use as a firewall proxy. As well as the normal blocking based on URL patterns, the cache can also exclude particular MIME types or documents containing the HTML tags for Java, Javascript, or images. The administrator may also add their own choice of unwelcome tags to the list. Filtering can be based on the type of browser in use, as indicated by the user-agent string, though, since a number of browsers allow this signature to be changed by the user, this is of doubtful value.

When the server is started it creates a configurable number of processes, each of which can handle a single request at a time. Incoming requests will be accepted so long as there are idle processes available. When a document is returned to the client the process becomes free to handle another request. This model handles peaks of demand well (provided sufficient processes have been configured) but can lead to intense competition for resources between the processes. The intervals between demand peaks must be sufficient for most of the requests to complete or the scheduler will be overloaded. Cached documents are written to a shallow directory hierarchy so that retrieval is fast. The directories may be placed in one or many disk partitions of the same or different sizes.

Netscape has an extremely attractive and powerful administration interface, provided by a separate, password-protected, server program running on a dedicated TCP/IP port on the server. Any web browser which supports frames may be used to access this server and, through icons and menus, can perform all operations from installing and starting a new cache server to re-configuring an existing cache, monitoring its operation or producing traffic reports. The cache server can also act as an SNMP (Simple Network Management Protocol) agent. Alarm conditions may be defined which will cause alerts to be sent to the local SNMP server, allowing the cache to be monitored centrally along with other network devices. The cache program produces standard access and error logs; these can be inspected through the manager interface or new log file formats designed by selecting the fields which should appear. It is also possible to save the current cache settings to file, either as a reference or as a basis for another cache server.

Harvest project [Hardy et al 1995]

The Harvest software was developed by the Internet Research Task Force Research Group on Resource Discovery (IRTF-RD) with the aim of making effective use of the information available on the Internet. Harvest was designed to make very efficient use of the network and of individual servers; the programs are designed to work with one another and with other instances of themselves so that the load of information gathering and publishing can be shared between many servers. This differs from the normal model of the World Wide Web, where each piece of information is published by a single server, but offers the advantages of resilience and scalability. Harvest introduced the ICP protocol for co-operation between individual caches; this is still only supported by programs derived from the project.

The full software suite includes programs to collect and index information from internet sites and locate information relevant to user queries; however, the most frequently used program is the cache server. The original programs are available as source code and can be freely used under the terms of the IRTF-RD license.

The original project ended in early 1996 with the cache software at version 1.4pl3. Subsequent development has been done by two groups, both including staff of the original project. Harvest 2 is a commercial product of the Harvest Developers Group while a team from the National Laboratory for Applied Network Research (NLANR) have continued to provide a free version under the name Squid.

Harvest 1 [Wessels 1995]

The final version 1.4pl3 of the original Harvest cache software is still used by many sites since, for some time after the end of the project, it was the most stable version of the code available. The program was written for SunOS, Solaris and Digital Unix (formerly OSF/1), but many people contributed source code patches to make it run on a wide range of other Unix systems including AIX, FreeBSD, HP-UX, Linux and IRIX.

The cache supports the HTTP, FTP and Gopher protocols for retrieving documents and uses the ICP protocol to discover whether a document is available from any of its fellow caches. A group of co-operating caches can be set up from which documents may be obtained. If none of these caches has the required document it will be requested from the fastest of those caches nominated as "parents" or, if no parent responds, direct from the original source. This makes the cache resilient to failure of any of the other caches in the group and able to perform load balancing if there is a choice of routes for obtaining a document. Patterns can be used to select which parents will be used for particular URLs. The client machines which are allowed to use the cache may be specified by IP address or pattern; a client which is not permitted to send requests to the cache will receive an "access denied" message if it attempts to do so. The cache is configured by editing a text file which is read when the cache starts. If this configuration file is changed the cache program must be restarted for the changes to take effect.

When a document is retrieved by the cache it will be saved to disk and, if space permits, in virtual memory. The only exceptions, which are not saved at all, are dynamic documents (whose URLs include either /cgi-bin/ or ?), and those for which a password is required. There is also a limit on the largest document which can be served by the cache. When each document is saved it is given a time to live, based on as much timestamp information as the source provides. Default and maximum TTLs may be assigned to different URL patterns. The cache will continue to serve the same copy of the document until the TTL elapses. When this occurs the document is removed from the cache and a fresh copy obtained on the next request. The cache never attempts to check if a copy is stale and has no support for conditional GET requests. It is therefore possible that the cache may unknowingly serve stale copies and, furthermore, the possible staleness cannot be predicted without interrogating the internal tables using the cache manager program. If the cache cannot obtain the document named in a request, either because the source host cannot be contacted or because the document does not exist, then this failure is also cached. Any requests for the same document within five minutes will fail immediately; after that time the cache will make another attempt to contact the source. If the request failed because the source host did not appear in the Domain Name Service then no further attempt will be made for one hour.

The cache is implemented as a single main process which uses non-blocking I/O for all operations over the network or to disk. This makes the most efficient use of CPU time (since there are no other processes to schedule), but doubts have been expressed over its ability to handle peaks in demand. DNS look-ups inevitably involve blocking, so are handled by a group of separate server processes with a cache of past results. FTP documents are also retrieved by spawning a separate process. Documents on disk are saved in a shallow directory hierarchy, so access to disk files is fast, and the virtual memory copy of the most popular documents means that these may never be read from disk files at all. [Bekker 1996] found that the Harvest cache was at least an order of magnitude faster than the CERN program in addition to its advantages of resilience and co-operation with other caches. Version 1.4pl3 of the software could only use a single disk to store cached documents unless the underlying operating system allowed some form of disk mapping.

The Harvest software generates a range of logfiles recording requests made by clients, interaction with other caches and the history of document storage on disk and in virtual memory. There is also a cache manager program, run through a web browser, which displays the contents of the cache and various internal tables. The manager interface also allows an authorised user to close the cache down or force a refresh of a particular document.

Harvest 2 [Harvest 1996]

Harvest 2 is a commercial development of the Harvest 1.4 cache server which has been available since April 1996. It is supplied in binary form with versions available for SunOS, Solaris, Linux, IRIX, HP-UX, AIX, Digital Unix, BSDI, Solaris x86 and FreeBSD. Evaluation copies are available from the company's web site while full licenses are based on the number of requests per hour the site is expected to handle. It is understood that the software will limit its own performance at the licensed level. Harvest 2 is already in use by the SingNet cache (see below) and is being considered as a future upgrade by other major caches currently using the non-commercial versions of Harvest.

Harvest 2 still uses the ICP protocol for cache co-operation so can inter-work with Harvest 1 and Squid caches. An option has been added to stop the cache saving to disk documents obtained from certain caches, thereby sharing their disk space rather than duplicating it. Cached files can also be saved to multiple disks of different sizes if necessary. There are many new options for determining staleness which allow almost any policy to be implemented. Conditional GET requests are used to check for staleness and may be invoked for every request or at regular intervals, as well as when the document expires. The simple caching of failure messages can now be controlled more finely with each failure code having its own rules.

The program is claimed to be three times faster and now runs parallel I/O threads to address the problem of peak loads. Several new controls have been added for use as part of a firewall, including blocking access to undesirable sites by URL, keyword, or using the Webtrack rating system.

Another log file has been added, to record when documents were checked or refreshed, while a signal can be sent to the program to start a new set of logfiles. Changes to the configuration can also be invoked by sending a signal so there is no longer any need to restart the cache. The cache manager program has some new statistical display options and a command to eliminate one (or all) objects from the cache.

Squid [Wessels 1996a]

Development of the Harvest 1 software has also been carried on as part of the NLANR cache project and this version, renamed Squid, continues to be freely available as source code. There is a sizeable community of volunteers testing and improving the program. Like Harvest 2, Squid has fixed the outstanding problems of Harvest 1 and added several new features including the ability to handle connections using the secure HTTPS and SSL protocols.

Squid uses the ICP protocol to co-operate with other caches and can now share disk space by not caching documents obtained from nominated caches. Parent caches may be assigned weights to express a preference in addition to the simple speed of response; this may be used to compensate for known differences in the behaviour of different parents, for example. Multiple disks or partitions may be used to store cached files, though the partitions must all be the same size. The size limit on requests has been removed. Conditional GET requests are now used to check staleness, though this still only occurs on the expiry of the document's assigned time to live. There is finer control of the caching of error replies and DNS lookups.

The underlying single process model is the same as Harvest 1 but FTP requests are now handled by an independent server process. Access controls are considerably more flexible and can now be based on any combination of source and destination address, HTTP method, protocol, domain, port, day, time of day and URL pattern matching. These controls can also be used for other aspects of the cache configuration, for example selecting different parent caches based on the time of day.

There is now a signal to make the cache re-read its configuration and another to do a clean shutdown, giving existing connections a configurable grace period to complete normally. The cache also becomes available for service more rapidly, without waiting for all the existing cached files on disk to be read. The logfiles are more consistent and provide more detail of each transaction.

Existing Cache installations

Most web caches have been installed at the local level serving users within an institution or department, sometimes as a side-effect of a firewall proxy. In the UK a survey [Hamilton 1996a] of 78 sites connected to JANET found that 42 institutions were already running local web caches: 22 using the CERN software, 14 Harvest, 1 Netscape and 1 Purveyor. Most are only available to users within the institution though some, such as Imperial College and Manchester, also act as regional caches, accepting connections from other local sites by agreement. Many of the sites using the Harvest software have also negotiated peering agreements with other co-operating caches; these systems are studied by the Cybercache project at Loughborough [Hamilton 1996b].

National caches are less common, though they offer considerable benefits to countries which suffer from congestion on international links. Some of the best-known examples are described in the following sections.

United Kingdom (HENSA) [Smith 1996]

The UK academic network has always offered high-performance connections within the country but comparatively slow international links. The desire to make best use of the scarce international bandwidth led to an early interest in web caching and, after initial experiments with the Lagoon and CERN caches, a national academic cache was set up by the Higher Education National Software Archive (HENSA) using the Netscape caching proxy in 1995. The cache rapidly outgrew the original Sparc 10 host and a subsequent dual-processor Silicon Graphics machine and now uses six separate computers located at sites in Canterbury and Leeds. The use of multiple sites and hosts is intended to provide resilience against failure of individual machines or local infrastructure problems such as power cuts. The geographic distribution of the cache service should also reduce the congestion on any individual part of the network which might otherwise be caused by the heavy traffic into and out of the cache.

The six cache machines are Silicon Graphics servers with 175MHz R4400 processors, 128Mbytes of memory and around 12Gbytes of disk. At an early stage of the experimental cache service it was found that disk bandwidth could be a bottleneck, so a large number of small 2Gbyte disks are now used with the cached files divided between them. The six cache servers run independently, so each client has access only to the resources of the server it connects to. Connections are shared between the servers by a round-robin Domain Name Server which rotates the conventional name wwwcache.hensa.ac.uk around the available machines. Each client program translates that name into a cache server address when it starts and then continues to use the same machine for the lifetime of the client program. The next client may well be given a different translation for the same name. Over time this gives a reasonably fair division of labour between the cache machines though it cannot adapt to particularly long-lived or demanding clients. Servers which go out of service can be removed from the DNS though their existing clients may have to be restarted to get a new translation.
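
The client side of this arrangement is simply a name looked up once and then reused; the sketch below illustrates that behaviour (the port number is a placeholder).

    import socket

    _chosen_cache = None

    def cache_address(name="wwwcache.hensa.ac.uk", port=8080):
        """Resolve the cache name once and keep using the same machine thereafter.

        The round-robin DNS hands each new client a different address, but an
        individual client sticks with its first answer for its lifetime.
        """
        global _chosen_cache
        if _chosen_cache is None:
            _chosen_cache = (socket.gethostbyname(name), port)
        return _chosen_cache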

The Netscape software creates a number of processes, each of which can handle a single concurrent request. If a request arrives when there is no process available to serve it then it will be refused. Each machine runs 650 of these processes so the complete cache can maintain nearly 4,000 simultaneous connections. Each of these competes with the others for network bandwidth and severe congestion has been encountered at busy periods. In an attempt to prevent this, the number of processes was reduced, but users complained when their attempts to connect were rejected. An attempt to reduce the number of idle connections by setting the timeout between received packets to 90 seconds met with similar complaints, so this was increased to 15 minutes. Apparently a moribund connection is regarded as better than no connection at all.

The HENSA cache receives up to 1.25 million connections a day with peaks up to 100 connections per second. The hit rate is very high, between 55% and 60%. Figures from the New Zealand cache suggest that at least 15% of all requests are for dynamic pages which are inherently uncachable, so this is probably close to the maximum which could ever be achieved.

New Zealand (NZGate) [Neal 1996]

In New Zealand the cost of Internet connectivity has always been borne in full by individual sites and since 1990 charges have been allocated on the basis of the volume of traffic generated. The very high cost of traffic on international links gives a strong economic incentive to use web caches at both local and national levels and caches can often be funded from the traffic savings which they produce. The New Zealand Internet Gateway, based at the University of Waikato, introduced a cache in May 1995, using the CERN server software, to act as a parent to the various institutional caches which subscribed to its international link. Later in 1995 the cache switched to version 1.3 of the Harvest software when it became apparent that development of the CERN program had ceased. After some initial reliability problems a stable service was obtained with version 1.4 of Harvest. The commercial version 2.1 is now in use.

Until May 1996, the cache ran on a single Digital AXP 4000 model 710 server with 128Mbytes of RAM and 10Gbytes of disk, though only about half of the available disk space was used. The national server now acts as parent to co-operating caches at six universities and four commercial Internet Service Providers, mostly using the public domain versions of Harvest and Squid. Some CERN child caches may still be operating at other sites. The NZGate cache also has co-operative links with the NLANR network and with caches in Australia.

In July 1996, the NZGate cache was receiving around 100,000 HTTP requests per day totalling 1.6Gbytes of data of which 254Mbytes came from the cache. This gives a 15% reduction in the international traffic in addition to the saving gained by documents served from the individual local caches. As in all hierarchical caching schemes, the user experiences the cumulative hit rate for all the caches through which requests pass. There were somewhat more ICP requests than HTTP, suggesting that the majority of the lower level caches also use this protocol.

United States of America (NLANR) [Wessels 1996b]

The National Laboratory for Applied Network Research is a collaboration between several American research organisations. One of its projects is investigating caching "to facilitate the evolution of an efficient national architecture for handling highly popular information". To this end they are continuing the development of the Squid (formerly Harvest) caching software and use it on an experimental network of seven co-operating cache servers across America. The servers also have links to co-operating cache projects in other countries.

Each server runs on a DEC AlphaServer 1000/266MHz with 128Mbytes of RAM and 10Gbytes of disk. Individual machines are designated as parents for particular geographical domains as appropriate to their location near the US international links. Servers located on the East Coast are used as parents for Europe and Africa; those on the West Coast for the Pacific, Asia and South America; and those in the mid-West for America and Canada. The caches are linked by a very high speed backbone. Simple clients send requests to their local cache machine, which obtains the document from the relevant parent. Documents are only saved by the parent. This arrangement results in relatively low hit rates on individual cache machines (since each machine will only consider caching pages from a third of the web) and relatively high rates of traffic between the caches. This should be compensated for by the lack of duplication between disks.

The individual cache servers receive between 70,000 and 174,000 HTTP requests a day, with hit rates of around 10%. There are nearly four times as many ICP requests, showing the high level of co-operation between the servers.

Singapore [SingNet 1996]

Singapore Telecom provide a commercial internet service (SingNet) within Singapore including a cache for access to Web pages outside the country. This is intended to reduce the traffic on their international networks and provide faster access for their customers. Customer sites are encouraged to set up their own local caches using the SingNet caches as parents. Instructions are given for using both simple and co-operating caches, though the relatively low number of ICP requests received by the servers (less than half the number of HTTP) suggests that few of the latter are in use.

The service runs on two DEC AlphaServer 4100 computers which co-operate using the commercial Harvest 2.0 cache software. On a typical day the servers each received 650,000 HTTP requests, an average of 53% of which were satisfied from within the cache. During the busiest five minute period there were over 6,000 TCP connections to each machine.

Norway [Uninett 1996] and France [CNRS 1996]

Norway and France have national caches on their respective academic networks, each serving around 100,000 requests a day. Both use the Squid cache software and achieve hit rates between 20% and 30%. The Norwegian cache is well established with 120 clients at various universities; the French cache has recently been transferred from a research project and so far has only 13 client systems. In France, commercial Internet Service Providers have been invited to establish peering arrangements between their own caches and the academic system.

Conclusions

National Cache Service

The UK National cache was one of the first large web caches; it is now among the largest in the world and has one of the highest hit rates. The original system has so far been able to grow to meet the increase in demand but its design may no longer be ideal. There are two sources of growth which a cache must address: as more people use the cache they will generate more requests to it, and as the web grows those requests will be for a wider range of pages. An additional problem is the popularity of graphics on web pages: a typical "page" now involves more, and larger, files than when the service began. Where caches publish statistics divided by file type, it is common to see many more requests for GIF files than HTML.

An increase in requests from users can only be addressed by providing more efficient software or more computing power. Modern cache programs, such as Harvest and Netscape, are already well optimised so increased processing power is the only available solution. To this end the national cache has expanded from the original one machine to six. The current cache is estimated to serve 15% to 20% of the academic community and has some spare capacity, but a national cache of 15 to 20 machines might be necessary to handle just the present community, ignoring future growth in the community and its use of the web. The alternative is to introduce lower level caches as filters which can serve the most popular pages themselves and only pass on uncommon requests to the national service. The reduction in the number of direct connections by browsers to the national cache is likely to reduce its reported hit rate, since many of the hits will now occur at the local caches, but the requests which do reach it will be served more quickly, giving a better overall service to users.

The national cache can only achieve the high observed hit rates by storing popular documents on most, or all, of the independent cache machines. Each machine must therefore have sufficient disk space to store all of these popular documents. As the number of such documents on the web grows, so the disk space on every machine must also increase. Introducing co-operation between the individual machines would reduce the amount of replication, and therefore the amount of disk space which must be provided on each one. If a cache machine receives a request for a document it does not have, it can use the co-operation protocol to query all of its fellows before deciding to retrieve the document from a remote location. The six machines are located at two separate sites, in Leeds and Canterbury, which suggests that the caches might "share" disks within each site. Documents obtained from another machine at the same site would be served to the client without taking a copy, while those obtained from the other site would be written to local disk. Assuming that responses arrive from local machines before those from remote ones, traffic on the inter-site network will be limited to simple UDP enquiries, with TCP connections only being used if the document is available at one site but not the other. Popular documents will then be stored at both sites, a two-fold replication rather than six-fold, so documents would still be available from the national cache if either a single cache machine or a complete cache site were lost through network or power failure.
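The decision a co-operating cache machine makes on each miss might be sketched as follows. This is a simplified stand-in for the ICP exchange, using a plain text query rather than the real binary packet format; the port number and timeout are assumptions made for the sketch:

    import socket

    QUERY_PORT = 3130    # UDP port conventionally used for ICP (assumed here)
    TIMEOUT = 0.5        # how long to wait for a fellow cache's reply (assumed)

    def fellow_has(host, url):
        # Send the URL in a single UDP packet and treat any reply beginning
        # 'HIT' as a positive answer; silence or 'MISS' counts as a miss.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(TIMEOUT)
            try:
                s.sendto(url.encode(), (host, QUERY_PORT))
                reply, _ = s.recvfrom(512)
                return reply.startswith(b'HIT')
            except socket.timeout:
                return False

    def locate(url, local_store, fellows):
        # Decide where to obtain a document: this machine's own disk, a fellow
        # cache machine (fetched over TCP), or the original remote source.
        if url in local_store:
            return 'local'
        for host in fellows:
            if fellow_has(host, url):
                return host
        return 'remote'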

The present arrangement of national and international links in JANET does not give either cache site a significant advantage in contacting particular overseas networks. Unlike the NLANR network there is therefore little geographical reason for making caches specialise in particular Internet domains. With independent caches, specialisation can reduce the individual disk space requirements provided a sufficient number of clients are aware of it. Co-operation between caches gives a similar reduction in disk space without the effort of manually dividing the traffic into equal groupings of domains. Specialisation might reduce the amount of traffic between the co-operating cache machines by increasing the likelihood of a client choosing the right cache server, but carries the risk of "losing" caching for an entire country if the specialist cache machine failed. Co-operation between local and national caches will provide the same benefits and be more resilient to the failure of a cache machine.

If the national cache is to adopt co-operative methods then three types of client need to be considered. For simple clients, the round-robin DNS method is the only automatic way to spread load between the cache machines. The client behaves as at present, obtaining a single cache machine from the DNS and making a TCP connection to it. If the chosen cache machine is unavailable then the connection will fail. Otherwise the cache host will check its own storage, and that of its fellow caches, before obtaining the document from the original source. Simple clients should experience somewhat more cache hits (since they have access to the resources of all the cache machines) at the cost of some additional network traffic between the cache hosts. Resilient clients, which can try another cache if the first choice is unavailable, should still use a round-robin DNS to obtain the first choice cache. Additional DNS lists, following a different sequence to the first, could be used to randomise the second, third and successive choices. Ideally the lists should be arranged to alternate choices between the two cache sites, but this may be beyond the abilities of the DNS system. Apart from their resilience to cache failure, these clients have the same benefits and effects as simple clients on the co-operating caches. The greatest benefit is seen by clients which can themselves join in the co-operation. These are able to locate the correct cache for each request, using the ICP protocol, without causing any traffic between the national caches. These clients should therefore see an increase in the speed with which requests are served, both because there is no need to transfer documents between the national caches and because, in the case of small documents which can be included in the ICP reply, the expensive TCP connection is eliminated entirely. Co-operation between client and cache increases the amount of traffic on the national network, but only by simple UDP exchanges.
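For the resilient clients described above, the essential behaviour is to look up the round-robin name once and then try each returned address in turn. A minimal sketch, assuming a placeholder service name and the proxy port 3128 commonly used by cache software:

    import socket

    CACHE_NAME = 'wwwcache.example.ac.uk'    # placeholder round-robin DNS name
    CACHE_PORT = 3128                        # assumed proxy port

    def connect_to_cache(timeout=5):
        # Try each address returned for the round-robin name in turn, so that
        # the failure of one cache machine does not lose the whole service.
        addresses = socket.gethostbyname_ex(CACHE_NAME)[2]
        for addr in addresses:
            try:
                return socket.create_connection((addr, CACHE_PORT), timeout)
            except OSError:
                continue                     # try the next machine in the list
        raise OSError('no cache machine could be reached')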

Since 1995 the national cache has used Netscape's server software, which does not implement co-operation, just a resilient hierarchy of independent caches. If the cache machines are to be made to co-operate it may be necessary to change the software used, unless Netscape have plans to develop their program in this direction. The attractions of co-operation with client caches also require that the national cache support the ICP protocol used by several existing local caches. The only software available now which meets these requirements is the Harvest family: Harvest 2 and Squid. There has been concern that past versions of the software were insufficiently mature for production use, and that its single-process model might not be able to handle the very high peak connection rates experienced by the national cache service. However, the SingNet cache, which is a commercial service and has a similar load, suggests that the latest Harvest 2 software may have achieved the necessary performance and reliability. The machines in the national cache are independent, which makes a gradual or experimental change of program very easy. Experience of the new software could be gained by switching one, or preferably two, of the existing machines to using it. If the trial system was made available initially to those sites which already have co-operating local caches, then any reliability problems would be less critical since the client caches are resilient. Furthermore, the managers of existing local caches are likely to be the most supportive of a co-operating national cache during its development stage.

Caches act as bandwidth multipliers. For every megabyte of requests arriving at the national cache, only 400 kilobytes of traffic are passed on to further networks. Dedicating links to cache traffic is therefore a very efficient way to use their limited capacity. [Smith 1996a] describes an experiment which provided a dedicated 1Mbps link from the national cache to America and dramatically reduced the time taken to obtain documents which were not in the cache. Under normal circumstances around 60% of all requests to the cache are serviced within 5 seconds; with the dedicated link this increased to 85%. Virtually all requests were completed after 25 seconds, although without the extra link some 20% were still unfinished at that time. A 4Mbps link dedicated to cache use should be equivalent to a 10Mbps direct link, and would allow some cache traffic to be taken off the open access trans-Atlantic links. Much of the current load on those links is direct access to foreign web sites by individual users. This use needs to be reduced if other services, such as remote terminal access, are to be viable, but the offer of better web service through caching does not seem to be persuasive. The existing national cache is already 25 seconds per document faster, on average, and yet the majority of browsers still use the slower, direct, route. It is not clear whether this is due to ignorance or a lack of confidence in caches, but if education cannot persuade users to use the cache then it may be necessary to consider the unattractive option of limiting the amount of web traffic on the general purpose links, merely to allow other services in. This would need to be done with great care. After a previous attempt to improve the use of international bandwidth, [Smith 1996a] concluded that users "are not aware of the techniques which can be used to maximise throughput on a congested link, and they are not prepared to accept these techniques when they are forced upon them".
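The bandwidth multiplier arithmetic is straightforward: with a hit rate h, only a fraction (1 - h) of the requested bytes must cross the external link, so the link behaves as if it were larger by a factor of 1/(1 - h). A short calculation using the figures quoted above:

    hit_rate = 0.60                              # fraction of bytes served from the cache
    passed_on = (1.0 - hit_rate) * 1000          # kilobytes forwarded per megabyte requested
    print(passed_on)                             # 400.0, as quoted above

    dedicated_link = 4                           # Mbps dedicated to cache traffic
    effective = dedicated_link / (1.0 - hit_rate)
    print(effective)                             # 10.0 Mbps equivalent direct capacity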

Local Caches

The introduction of local caches should be encouraged as the best way to reduce the load on both the JANET network and the national cache service. Sites which run their own caches should benefit from quicker access to popular pages, since these will be available on the local network. Recent cache software will also provide resilience to faults in higher-level caches. A local cache gives the opportunity to manage local policy at a single point rather than having to reconfigure many individual browser programs. When problems do occur, users may be more willing to contact a local cache administrator than a national service. Even non-users of the web may benefit as the reduction in web traffic on the external network should leave more bandwidth for other services.

These benefits need not be restricted to large HEIs. A study [Abrams et al 1995] of a departmental web cache found that hit rates of 30% could be achieved with only 25 users and that hit rates increased where browsing was concentrated in a few subject areas. Small sites with indirect connections to JANET are likely to have the greatest difference between internal and external network speeds, so benefit most from the bandwidth multiplication provided by a cache. Local caches do not require large computers - Abrams' example required a disk cache of only 200Mbytes - though sufficient memory is essential. A site with a slow external link may be able to run a small web cache on an existing internet gateway machine. Caching software is now available for systems other than Unix so familiarity with that operating system is no longer essential. If nothing else is possible, small sites may be able to arrange access to web caches at larger institutions nearby, but a cache at the far end of a slow network link loses many of the benefits.

High speed Metropolitan Area Networks are also an excellent location for web caches as there is a reduction in bandwidth between the MAN and the national network. A cache connected to the MAN can make efficient use of the limited external bandwidth. Sites may choose to have their own local caches which co-operate across the MAN, effectively sharing their disk space, or a single cache may be provided centrally as part of the MAN function.

When choosing cache software for a site, management effort may be balanced against the service provided to users. Co-operating caches are somewhat more complicated to configure and manage but should give faster and more reliable access to the national cache service. Linking the cache with others at nearby institutions gives further improvements, though the topology of the JANET network can make the choice of neighbours confusing. A central advisory service could perhaps provide assistance in the choice of neighbours and in identifying lightly used portions of the network. The service could also advise on configuring clients to access the national cache since the methods for resilience and load balancing work best if there is agreement between the cache and its clients.

User Education

An ideal cache would be completely transparent to its users; however, this is unlikely ever to be achieved in practice. A cache is always liable to intrude, especially when errors occur. The simplest case is when a user types in an incorrect URL: without a cache the error will be reported in a familiar, browser-dependent, message as the browser itself detects the error. If the browser passes on the problem to a cache, the user will instead see an error page which is at best unfamiliar, and possibly incomprehensible. Anyone used to "document unobtainable" is unlikely to appreciate that "DNS lookup failed" means the same thing, and will often blame the new error messages on "faults in the cache".

These misunderstandings may be the cause of the apparent widespread opinion that caches are slow and unreliable, and of the small proportion of sites choosing to use the national cache. If so then more information needs to be made available, not simply highlighting the performance benefits of web caching, but explaining the other changes which are likely to be noticed. Most caches have a more advanced view of the world than browsers so, for example, if a source host is lost while a document is being retrieved, many caches will continue to serve the half document until the connection is restored. Anyone used to a simple "connection lost" error is likely to complain about the missing half document rather than welcoming the half which did arrive. The concept of staleness also needs to be explained, as international networks have reached the point where it is no longer feasible to send a conditional GET for every request. Caches with fixed expiry times have the advantage of simplicity here, even if a TTL based on document age is more rational.
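The two expiry policies mentioned here can be sketched side by side: a fixed lifetime for every document, against a lifetime proportional to the document's age when it was fetched. The 10% factor and the one-day limits below are illustrative choices, not those of any particular cache:

    import time

    DAY = 24 * 3600

    def fresh_fixed(fetched_at, lifetime=DAY, now=None):
        # Fixed expiry: every document is trusted for the same period.
        now = now or time.time()
        return (now - fetched_at) < lifetime

    def fresh_by_age(fetched_at, last_modified, factor=0.10, ceiling=DAY, now=None):
        # Age-based TTL: a document unchanged for a long time is trusted for
        # longer (a fraction of its age when fetched), up to a ceiling, so a
        # conditional GET need not be sent for every request.
        now = now or time.time()
        ttl = min(factor * (fetched_at - last_modified), ceiling)
        return (now - fetched_at) < ttl

    # A page last changed five days before it was cached, requested six hours later:
    fetched = time.time() - 6 * 3600
    modified = fetched - 5 * DAY
    print(fresh_by_age(fetched, modified))   # True: six hours elapsed, twelve-hour TTL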

To be of any use, a cache must be trusted by its users. If it ever gains a reputation for serving "old" information then users will either configure it out or simply reload every page, leaving the networks even more congested than they are now.

References

  • M.Abrams, C.Standridge, G.Abdulla, S.Williams, E.Fox 1995, Caching Proxies: Limitations and Potentials, World Wide Web Journal Issue 1, p.119-133 (4th WWW Conference)
  • Apache Project 1996, Apache Project
  • A.Baird-Smith 1996, Jigsaw Web Server
  • H.Bekker, I.Melve, T.Verschuren 1996, Survey of Caching Requirements and Specification of Prototype
  • CNRS 1996, Renater Web Cache
  • M.Hamilton 1996a, Survey of Cache Usage on JANET
  • M.Hamilton 1996b, Cybercache Project
  • D.Hardy, M.Schwartz, D.Wessels 1995, Harvest User's Manual, University of Colorado at Boulder
  • Harvest Development Group 1996, Harvest 2 Cache Server
  • Microsoft 1996, Catapult Proxy Server
  • D.Neal 1996a, The Harvest Object Cache in New Zealand, Computer Networks and ISDN Systems vol.28 nos7-11, p.1415-1430 (5th WWW Conference)
  • D.Neal 1996b, New Zealand Internet Cache
  • H.F.Nielsen 1996, CERN Web Server
  • Netscape 1996, Netscape Proxy Server
  • Process Software Corp. 1996, Purveyor Web Server
  • SingNet 1996, SingNet Cache
  • N.G.Smith 1996a, The UK National Web Cache, Computer Networks and ISDN Systems vol.28 nos7-11, p.1407-1414 (5th WWW Conference)
  • N.G.Smith 1996b, HENSA Cache
  • Spinner 1996, Spinner Web Server
  • W.Richard Stevens 1994, TCP/IP Illustrated Vol.1, p.159-160
  • Uninett 1996, Norwegian Web Cache
  • D.Wessels 1995, Harvest Project
  • D.Wessels 1996a, Squid Cache Server
  • D.Wessels 1996b, NLANR Cache

Author
Andrew Cormack, University of Wales, Cardiff
Publication Date
1 September 1996