Performance Matters: September 2009

This is the first in what is likely to be a multi-post series discussing how HTTP works down at the TCP packet level, opportunities for improvement and a look at how some of the big players perform.

Background

I'm going to assume that if you're reading this you have a fairly good idea of what HTTP and TCP are and what the roles of each are. What I want to focus on is the design goals of each and how those goals impact the web as we know it today.

TCP, being the foundation of most connection-oriented data transfer on the Internet, has to work well for lots of different purposes including (but certainly not limited to):

Web pages
Large file transfers
Remote system management
E-mail transfer
Interactive console access
LOTS of other stuff....

It also does its best to be kind to the underlying network infrastructure.

HTTP is a request/response protocol where there is always a 1:1 mapping between requests from a client and responses from the server (assuming no errors). It is also generally limited to sending one request at a time over a given TCP connection and can't send another request until the previous one completes (HTTP pipelining aside, which was pretty much a colossal failure).

Web pages generally require a bunch of individual requests to load a full page (20-100 is not unusual) but it all starts with a single request for the "base page" which is the HTML code that defines the page and all of the elements required to build it. The web browser (client) doesn't know what other elements it needs to download until the base page loads (at least far enough where it can start identifying elements and parsing the HTML).

There are 2 features of TCP in particular that I want to discuss that have a fairly large impact on the performance of HTTP - Nagle and TCP Slow Start. Both are somewhat sensitive to the end-to-end latency between the server and the client so it's probably useful to first understand what some typical latencies look like. I'll generally refer to the end-to-end latency as the Round Trip Time (RTT).

Typical Latencies

There are generally 2 places latency comes into play when we talk about a consumer's connection to your web servers (we're just talking front-end between the user's browser and your servers).

1 - The latency between the user's PC and their ISP's Internet connection (this is generally referred to as the last mile)

2 - The latency between the ISP's Internet connection and your front-end servers (web servers in the simple case, load balancers/accelerators in more complex setups)

Last time I measured last-mile latencies for various forms of consumer connectivity I ended up with these Round Trip Times:

DSL: 50 ms
Cable: 60 ms
Dial-up: 120-250 ms

These were measured from just a few samples so they may not represent global averages or newer technology (the DSL number in particular may be a little high) but they are good enough for these discussions.

For the backbone latencies (between the ISP and your servers) it is going to come down to the distance between the ISP and the point where your content is being served from (the speed of light sets the floor). In the case of a CDN this can be close to zero but some basic round trips for other cases are:

Across the US (east-coast to west-coast): 75 ms
US east-coast to Europe: 100 ms
US west-coast to Australia: 150 ms
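
As a rough illustration of how the two pieces combine, here is a small sketch that adds a last-mile RTT to a backbone RTT using the numbers above. The function and dictionary names are made up for this example, and simple addition is an assumption (real paths add queuing and routing overhead). Dial-up is left out because it was measured as a range.

```python
# Round trip times in milliseconds, taken from the lists above.
LAST_MILE_RTT_MS = {"DSL": 50, "Cable": 60}
BACKBONE_RTT_MS = {
    "Across the US": 75,
    "US east-coast to Europe": 100,
    "US west-coast to Australia": 150,
}


def end_to_end_rtt_ms(access: str, path: str) -> int:
    """Estimate the total RTT by adding the last-mile and backbone legs."""
    return LAST_MILE_RTT_MS[access] + BACKBONE_RTT_MS[path]


if __name__ == "__main__":
    for access in LAST_MILE_RTT_MS:
        for path in BACKBONE_RTT_MS:
            print(f"{access:5s} + {path:27s} = {end_to_end_rtt_ms(access, path)} ms")
```

So a DSL user on the opposite coast from your servers is looking at something in the neighborhood of 125 ms for every round trip.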

Nagle

Nagle is a feature that tries to gather outbound data into as few packets as possible and can really help applications that do a bunch of small writes. What it essentially does is send out as many full packets as it can and if it ends up with a partial packet at the end, instead of sending that it will hang onto it for a brief amount of time in case the application is going to send more data. In an extreme case where an application writes out a large amount of data one byte at a time, instead of sending a packet for every single byte it would be able to reduce that significantly (over 1000x) which is better for both the network and the application.
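
To put a number on the extreme case, here is a back-of-the-envelope sketch. It assumes a 1460-byte payload per packet (a typical Ethernet MSS, not something measured here) and perfect coalescing, so treat it as an idealized model rather than what any particular TCP stack will do.

```python
import math


def packets_needed(total_bytes: int, mss: int = 1460, coalesce: bool = True) -> int:
    """Count packets for an application that writes one byte at a time.

    coalesce=True models ideal Nagle behavior (every packet is filled),
    coalesce=False models one packet per one-byte write.
    """
    if coalesce:
        return math.ceil(total_bytes / mss)
    return total_bytes


if __name__ == "__main__":
    size = 1_000_000
    small = packets_needed(size, coalesce=True)
    large = packets_needed(size, coalesce=False)
    print(f"coalesced: {small}, uncoalesced: {large}, ratio: {large / small:.0f}x")
```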

It does generally mean that the last bit of data you send is going to get delayed by at least one round-trip's worth of time (it sends a partial frame when all other data transmitted has been ACK'd by the receiver and there is no pending data on the wire).

What this means for HTTP is that with Nagle enabled, the end of every response will generally get held up by one RTT. If you flush a partial response out early, the end of the data that got flushed would also be held up for one RTT or until more of the response is sent back (potentially killing any benefit from the early flush). If the receiver has delayed ACKs enabled then it's possible for an additional 200 ms to get added to the last chunk.

Fortunately, Nagle can be turned off on a per-connection basis by setting the TCP_NODELAY option on the socket and the majority of HTTP servers have it disabled. I do know that it is disabled in Apache, Squid and IIS. I'm a little less sure about lighttpd (there was a ticket opened a long time ago to do it but I haven't actually checked).
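
For reference, this is roughly what turning Nagle off looks like from application code, sketched here with Python's standard socket module. The helper names are made up for the example, and this is not how any of the servers mentioned above actually implement it.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle disabled via TCP_NODELAY."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    s = make_nodelay_socket()
    print("Nagle disabled:", nagle_disabled(s))
    s.close()
```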

With it disabled you do need to be aware that every write will generally result in packets being sent on the wire so don't flush your HTTP pages after every line :-)

Checking whether Nagle is disabled is generally pretty easy in a packet trace. Generally if you see the PSH bit set it means it has been disabled.

TCP Slow Start

TCP Slow Start is the algorithm that TCP uses to probe the network path and determine how much data can be in flight between the server and the client at any given point in time. The client advertises what the upper limit is during the 3-way handshake (receive window) but the server doesn't immediately start by blasting out the full window's worth of data, it ramps up from a very small number of packets and keeps increasing as the client successfully receives data until either the window is fully filled or data starts getting dropped.

In 1999, RFC 2581 set the initial number of packets that TCP should use in slow start to 2 which was actually a fairly significant improvement from the 1 that it was before because in concert with delayed ACK's the slow start could actually stall for 200 ms. In 2002, RFC 3390 increased the initial window to roughly 4 KB and as best as I can tell it hasn't changed since then.
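
To make the ramp-up concrete, here is a simplified model of how many "flights" of packets it takes to deliver a response. The assumptions are mine: a 1460-byte MSS, an initial window of 3 segments (about the 4 KB mentioned above), a window that doubles every round trip, no packet loss, and a cap of 44 segments standing in for a 64 KB receive window. Real stacks (and delayed ACKs) will behave less neatly.

```python
import math


def flights_needed(
    response_bytes: int,
    initial_window_segments: int = 3,
    mss: int = 1460,
    max_window_segments: int = 44,
) -> int:
    """Return how many windows of data are sent before the response is done.

    Each flight after the first costs roughly one extra RTT because the
    sender has to wait for ACKs before it can send more.
    """
    segments_left = math.ceil(response_bytes / mss)
    window = initial_window_segments
    flights = 0
    while segments_left > 0:
        flights += 1
        segments_left -= window
        # Idealized slow start: the window doubles every round trip.
        window = min(window * 2, max_window_segments)
    return flights


if __name__ == "__main__":
    for size in (4_000, 20_000, 100_000):
        print(f"{size:>7} bytes: {flights_needed(size)} flight(s)")
```

In this model a 4,000 byte response goes out in a single flight while a 20,000 byte response takes three, which is two extra round trips spent waiting on ACKs.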

What this means for HTTP is that unless the initial response on a given connection is under 4 KB the full response will be delayed by AT LEAST one RTT (depending on how large the response is it could bump up against the growing transmit window several times with each increase costing an additional RTT). This happens for each new connection which tends to happen fairly frequently for HTTP (and is happening more and more as browsers start opening more connections in parallel to try to retrieve pages faster). HTTP doesn't keep connections open for long enough for

I particularly enjoy this quote from RFC 3390 in 2002:

  The larger initial window specified in this document is not intended
  as encouragement for web browsers to open multiple simultaneous TCP
  connections, all with large initial windows.  When web browsers open
  simultaneous TCP connections to the same destination, they are
  working against TCP's congestion control mechanisms [FF99],
  regardless of the size of the initial window.  Combining this
  behavior with larger initial windows further increases the unfairness
  to other traffic in the network.  We suggest the use of HTTP/1.1
  [RFC2068] (persistent TCP connections and pipelining) as a way to
  achieve better performance of web transfers.

We all know how well pipelining worked out, don't we? And that's exactly what browsers are doing (and more so as they go from 2 up to 6 concurrent connections).

Unlike Nagle, Slow Start can't be tweaked or configured. It is baked into the OS TCP stack.

In looking at packet captures from the big players in search, where every millisecond matters, it looks like they are all seeing a penalty from it. Interestingly, Bing seems to be using a window closer to 6-8 KB but I need to take a closer look and see if that's normal for Windows or if they are doing something special. I was somewhat surprised to see that Google, who are generally the kings of speed and are already running custom kernels, hadn't gone in and tweaked it at all. Google's main search page looks to fit under the 4 KB window, which helps with its delivery speed, but the search results do not (and with some tweaks to slow start you should be able to deliver even bigger pages just about as fast, and certainly faster than they are currently delivered).

Interesting, so what can we do about it?

Up until now, for the most part this has been an exercise in identifying areas that are impacting our ability to deliver web pages as fast as possible. Instead of having all of the intelligence at the lowest levels of the OS I think it would make a lot of sense to bubble some of the configurations and statistics up to the applications so they can do on-the-fly tuning. If I get some free time to work on it I'm going to look at building some patches for the Linux kernel that would expose settings for slow start on a per-connection basis as well as statistics around RTT and packet loss (also on a per-connection basis).

Then applications could be intelligent and tune the settings dynamically. For example, a web server could start out by setting the initial window to an intelligent guess as to what it should be for optimum performance (complex algorithms could help, but say, for example, that it starts out at a size that would satisfy 80% of requests based on historical data). If it starts seeing packet loss at a higher than acceptable rate on the aggregate of its connections it ramps the size down for future connections and every now and then attempts to increase it as packet loss goes away. It basically lets you do a lot of application-specific tuning without having to build a single algorithm that will work for all protocols.
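
Here is a minimal sketch of what that application-side logic might look like. Everything in it is hypothetical: the names, the 2% loss threshold, the halve-on-loss and add-one-on-success policy, and the 1460-byte MSS are all placeholder choices. It only computes a suggested window size; actually applying it to a connection would need the kind of kernel support described above, which is not shown.

```python
import math


def window_for_percentile(response_sizes, percentile=0.8, mss=1460):
    """Pick an initial window (in segments) that covers the given share
    of historical response sizes in a single flight."""
    ordered = sorted(response_sizes)
    index = max(0, math.ceil(percentile * len(ordered)) - 1)
    return math.ceil(ordered[index] / mss)


class InitialWindowTuner:
    """Hypothetical tuner: back off on loss, probe upward when loss is low."""

    def __init__(self, start, floor=2, ceiling=32, max_loss_rate=0.02):
        self.window = start
        self.floor = floor
        self.ceiling = ceiling
        self.max_loss_rate = max_loss_rate

    def update(self, packets_sent, packets_lost):
        """Feed in aggregate stats for a recent interval and return the
        window to use for future connections."""
        loss_rate = packets_lost / packets_sent if packets_sent else 0.0
        if loss_rate > self.max_loss_rate:
            # Too much loss: ramp the size down quickly.
            self.window = max(self.floor, self.window // 2)
        else:
            # Loss is acceptable: cautiously try a slightly bigger window.
            self.window = min(self.ceiling, self.window + 1)
        return self.window


if __name__ == "__main__":
    history = [1000, 2000, 3000, 8000, 40000]
    tuner = InitialWindowTuner(start=window_for_percentile(history))
    print("starting window:", tuner.window)
    print("after a lossy interval:", tuner.update(1000, 50))
    print("after a clean interval:", tuner.update(1000, 0))
```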

At this point it's all mostly theoretical so please feel free to poke giant sticks at it (or me). I'll probably start just by configuring bigger windows statically in the kernel and seeing how that works but for sites or applications looking to squeeze every last bit of performance out of HTTP this looks to be a good way to do it and it can be done without any changes on the client or to the protocols (always a good thing).

One of the things we like to look at is what a page looks like when it is loading, particularly if you are comparing a before and after or multiple sites to each other. This is particularly helpful when talking to the business about performance because sometimes the raw load times don't adequately represent the user experience.

Up until now this has been a pretty manual process where we screen record a video while loading a site then load it up into Premiere and stitch it together.

Starting today, you can have WebPagetest record a video of a page load directly. It's still in a pretty rough form and it's going to be a while before it'll be ready for the masses but what is there is already pretty powerful and extremely flexible.

At its guts, pagetest is really just grabbing screen shots every 0.1 second (10 frames per second) whenever the browser window changes (changes are detected immediately because pagetest hooks the screen drawing code and can tell whenever the browser paints something to the screen). It will capture up to 20 frames at 0.1 second granularity, then fall back to another 20 frames at 1 second and then capture the remainder at 5 second increments. This is done to keep the in-memory image requirements to a minimum but is more than sufficient for most sites (it would take something like a continuously animating image loading very early in a very long page for this to become a problem - at which point you probably have other things to worry about).
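
The tiered capture schedule is easier to see with numbers, so here is a small sketch of it. This is not pagetest's actual code: it assumes the worst case where the screen changes constantly (so every slot produces a frame) and that the first frame lands at 0.1 seconds. Times are tracked in tenths of a second to avoid floating point drift.

```python
def capture_schedule(duration_s: float) -> list:
    """Return the latest possible capture timestamps (in seconds) for a
    page load of the given length, assuming the screen never stops changing."""
    # (interval in tenths of a second, maximum frames at that interval)
    tiers = [(1, 20), (10, 20), (50, None)]
    limit = round(duration_s * 10)
    times = []
    now = 0
    for step, max_frames in tiers:
        count = 0
        while (max_frames is None or count < max_frames) and now + step <= limit:
            now += step
            times.append(now / 10)
            count += 1
    return times


if __name__ == "__main__":
    for seconds in (2, 30, 60):
        frames = capture_schedule(seconds)
        print(f"{seconds:>2}s load: {len(frames)} frames, last at {frames[-1]}s")
```

Under these assumptions a 60 second page load tops out at 47 frames, compared to the 600 you would get by holding 10 frames per second for the whole load.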

As things stand right now, the "video" is a bunch of static images and an Avisynth script that turns it into a video for playback so it only works on Windows (and you need to install Avisynth to be able to use the videos). The scripting is incredibly powerful and becomes even more interesting when you create other scripts that operate on the existing "video" script files (I have a standard one for doing side-by-side comparisons for example). I'll probably set up a forum for sharing the scripts so we can all benefit from each other's work.

Long-term I plan on making it point-and-click for generating actual video files but that will take a while to implement and there was a fair amount of interest in even the basic functionality so I decided to launch it as it is while I work on improving it.

The video recording capability is only exposed if you are "logged in" so if you don't see a "video" tab on the main test screen go over to the forums, log in and then it should show up. I needed to put that in place because the storage requirements for the videos are pretty significant and if I start running into storage problems I may need to reach out to individual users. As with everything else, I plan to keep the videos indefinitely but if storage becomes a problem they'll be the first thing to get pruned.

To capture video, just check the box in the "video" tab on the test screen. The video will be available for download on the screen shot page for a given test (there will be a "Download Video" link at the top).

Update: I wrote up more details on how to use the captured videos here: http://www.webpagetest.org/forums/showthread.php?tid=46

Thursday, September 17, 2009

TCP and HTTP, fighting each other to bring you the web

Tuesday, September 8, 2009

Recording video with WebPagetest