Friday, December 4, 2009

WebPagetest outage

Update 2: All back to fully operational now, sorry for the inconvenience.


Update: I have disabled the forums and authentication integration as well as made some changes to at least bring the main testing online.  The forums will be back once Dreamhost fixes the problems with the MySQL server.

Sorry, Dreamhost seems to be having all sorts of problems trying to keep the server that runs WebPagetest running, so the site has been unavailable for the better part of the day. I've been harassing them on a fairly regular basis so hopefully they will get it cleared up soon.

Ahh, how I miss the days when it was running in my basement and I could just run home and kick it :-)

Thanks,

-Pat

Thursday, September 17, 2009

TCP and HTTP, fighting each other to bring you the web

This is the first in what is likely to be a multi-post series discussing how HTTP works down at the TCP packet level, the opportunities for improvement, and how some of the big players perform.


Background

I'm going to assume that if you're reading this you have a fairly good idea of what HTTP and TCP are and what the roles of each are. What I want to focus on is the design goals of each and how those goals impact the web as we know it today.

TCP, being the foundation of most connection-oriented data transfer on the Internet, has to work well for lots of different purposes including (but certainly not limited to):
  • Web pages
  • Large file transfers
  • Remote system management
  • E-mail transfer
  • Interactive console access
  • LOTS of other stuff....
It also does its best to be kind to the underlying network infrastructure.

HTTP is a request/response protocol where there is always a 1:1 mapping between requests from a client and responses from the server (assuming no errors). It is also generally limited to sending one request at a time over a given TCP connection and can't send another request until the previous one completes (HTTP pipelining aside, which was pretty much a colossal failure).

Web pages generally require a bunch of individual requests to load a full page (20-100 is not unusual) but it all starts with a single request for the "base page", which is the HTML code that defines the page and all of the elements required to build it. The web browser (client) doesn't know what other elements it needs to download until the base page loads (at least far enough along that it can start parsing the HTML and identifying elements).

There are 2 features of TCP in particular that I want to discuss that have a fairly large impact on the performance of HTTP - Nagle and TCP Slow Start. Both are somewhat sensitive to the end-to-end latency between the server and the client so it's probably useful to first understand what some typical latencies look like. I'll generally refer to the end-to-end latency as the Round Trip Time (RTT).


Typical Latencies

There are generally 2 places latency comes into play when we talk about a consumer's connection to your web servers (we're just talking front-end between the user's browser and your servers).

1 - The latency between the user's PC and their ISP's Internet connection (generally referred to as the last mile)

2 - The latency between the ISP's Internet connection and your front-end servers (web servers in the simple case, load balancers/accelerators in more complex setups)

Last time I measured last-mile latencies for various forms of consumer connectivity I ended up with these Round Trip Times:

DSL: 50 ms
Cable: 60 ms
Dial-up: 120-250 ms

These were measured from just a few samples so they may not represent global averages or newer technology (the DSL number in particular may be a little high), but they're good enough for this discussion.

For the backbone latencies (between the ISP and your servers) it is going to come down to the distance (speed of light) between the ISP and the point your content is being served from. In the case of a CDN this can be close to zero, but some basic round trips for other cases are:

Across the US (east-coast to west-coast): 75 ms
US east-coast to Europe: 100 ms
US west-coast to Australia: 150 ms
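
To get a feel for what those numbers mean in practice, here's a rough back-of-the-envelope sketch. It assumes serial requests on fresh connections and ignores the parallel connections browsers actually open, so treat it as an illustration of why round trips matter rather than a prediction of real load times:

    # Rough illustration of why latency matters more than bandwidth for many pages.
    # The numbers are the sample latencies from above, not measurements.
    last_mile_rtt = 0.050    # DSL last mile, seconds
    backbone_rtt = 0.075     # US east coast to west coast, seconds
    rtt = last_mile_rtt + backbone_rtt

    # Assume each request costs roughly 2 round trips on a fresh connection
    # (TCP handshake plus the request/response itself).
    requests = 50
    print("Round trip time: %.0f ms" % (rtt * 1000))
    print("Latency cost for %d serial requests: %.1f s" % (requests, requests * 2 * rtt))

Even though real browsers overlap requests across several connections, the point stands: every round trip you can eliminate pays off directly.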


Nagle

Nagle is a feature that tries to gather outbound data into as few packets as possible and can really help applications that do a bunch of small writes. What it essentially does is send out as many full packets as it can and if it ends up with a partial packet at the end, instead of sending that it will hang onto it for a brief amount of time in case the application is going to send more data. In an extreme case where an application writes out a large amount of data one byte at a time, instead of sending a packet for every single byte it would be able to reduce that significantly (over 1000x) which is better for both the network and the application.

It does generally mean that the last bit of data you send is going to get delayed by at least one round-trip's worth of time (it sends a partial frame when all other data transmitted has been ACK'd by the receiver and there is no pending data on the wire).

What this means for HTTP is that with Nagle enabled, the end of every response will generally get held up by one RTT. In the case of flushing a partial response out early, the end of the data that got flushed would also be held up for one RTT or until more of the response is sent back (potentially killing any benefit from the early flush). If the receiver has delayed ACKs enabled then it's possible for an additional 200 ms to get added to the last chunk.

Fortunately, Nagle can be turned off on a per-connection basis by setting the TCP_NODELAY option on the socket and the majority of HTTP servers have it disabled. I do know that it is disabled in Apache, Squid and IIS. I'm a little less sure about lighttpd (there was a ticket opened a long time ago to do it but I haven't actually checked).

With it disabled you do need to be aware that every write will generally result in packets being sent on the wire so don't flush your HTTP pages after every line :-)
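
For reference, here's a minimal sketch of what disabling Nagle looks like at the socket level (Python just for brevity; servers set the same TCP_NODELAY option on their accepted connections, and the host here is only an example):

    import socket

    # Disable Nagle with TCP_NODELAY -- the option discussed above.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    # With Nagle off, each send() can hit the wire immediately, so build the
    # output in a buffer and write it out in as few calls as possible.
    s.connect(("www.webpagetest.org", 80))
    s.sendall(b"GET / HTTP/1.1\r\nHost: www.webpagetest.org\r\nConnection: close\r\n\r\n")
    print(s.recv(200))
    s.close()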

Checking whether Nagle is disabled is generally pretty easy to do with a packet trace: if you see the PSH bit set it generally means Nagle has been disabled.


TCP Slow Start

TCP Slow Start is the algorithm that TCP uses to probe the network path and determine how much data can be in flight between the server and the client at any given point in time. The client advertises the upper limit during the 3-way handshake (the receive window), but the server doesn't immediately start by blasting out the full window's worth of data; it ramps up from a very small number of packets and keeps increasing as the client successfully receives data until either the window is completely filled or data starts getting dropped.

In 1999, RFC 2581 set the initial number of packets that TCP should use in slow start to 2, which was actually a fairly significant improvement over the previous value of 1 because, in concert with delayed ACKs, slow start could actually stall for 200 ms. In 2002, RFC 3390 increased the initial window to roughly 4 KB and as best as I can tell it hasn't changed since then.

What this means for HTTP is that unless the initial response on a given connection is under 4 KB, the full response will be delayed by AT LEAST one RTT (depending on how large the response is, it could bump up against the growing transmit window several times, with each increase costing an additional RTT). This happens for each new connection, which tends to happen fairly frequently for HTTP (and is happening more and more as browsers open more connections in parallel to try to retrieve pages faster). HTTP connections generally aren't kept open and busy long enough for the congestion window to grow to the point where slow start stops costing round trips.
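
To put some rough numbers on it, here's a quick sketch that estimates the slow-start round trips for a given response size, assuming a ~4 KB initial window that roughly doubles each round trip (real stacks grow the window per ACK and in segment-sized steps, so this is a ballpark only):

    # Ballpark: extra round trips slow start adds to delivering a single response.
    def slow_start_rtts(response_bytes, initial_window=4096):
        window = initial_window
        sent = 0
        rtts = 0
        while sent < response_bytes:
            sent += window
            window *= 2     # rough model of the window doubling each RTT
            rtts += 1
        return rtts

    for size in (4 * 1024, 20 * 1024, 100 * 1024):
        print("%4d KB response: ~%d round trips" % (size // 1024, slow_start_rtts(size)))

On a 125 ms round trip even a couple of extra window-growth cycles is very visible to the user.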

I particularly enjoy this quote from RFC 3390 in 2002:
   The larger initial window specified in this document is not intended
   as encouragement for web browsers to open multiple simultaneous TCP
   connections, all with large initial windows. When web browsers open
   simultaneous TCP connections to the same destination, they are
   working against TCP's congestion control mechanisms [FF99],
   regardless of the size of the initial window. Combining this
   behavior with larger initial windows further increases the unfairness
   to other traffic in the network. We suggest the use of HTTP/1.1
   [RFC2068] (persistent TCP connections and pipelining) as a way to
   achieve better performance of web transfers.

We all know how well pipelining worked out, don't we? And that's exactly what browsers are doing (and more so as they go from 2 up to 6 concurrent connections).

Unlike Nagle, Slow Start can't be tweaked or configured. It is baked into the OS TCP stack.

In looking at packet captures from the big players in search, where every millisecond matters, it looks like they are all seeing a penalty from it. Interestingly, Bing seems to be using a window closer to 6-8 KB but I need to take a closer look and see if that's normal for Windows or if they are doing something special. I was somewhat surprised to see that Google, who are generally the kings of speed and are already running custom kernels, hadn't gone in and tweaked it at all. Google's main search page looks to fit under the 4 KB window, which helps with its delivery speed, but the search results do not (and with some tweaks to slow start you should be able to deliver even bigger pages just about as fast, and certainly faster than they are currently delivered).


Interesting, so what can we do about it?

Up until now, for the most part this has been an exercise in identifying areas that are impacting our ability to deliver web pages as fast as possible. Instead of having all of the intelligence at the lowest levels of the OS, I think it would make a lot of sense to bubble some of the configuration and statistics up to the applications so they can do on-the-fly tuning. If I get some free time to work on it I'm going to look at building some patches for the Linux kernel that would expose settings for slow start on a per-connection basis as well as statistics around RTT and packet loss (also on a per-connection basis).

Then applications could be intelligent and tune the settings dynamically. For example, a web server could start out by setting the initial window to an intelligent guess as to what it should be for optimum performance (complex algorithms could help, but say, for example, it starts out at a size that would satisfy 80% of requests based on historical data). If it starts seeing packet loss at a higher than acceptable rate across the aggregate of its connections, it ramps the size down for future connections and every now and then attempts to increase it as packet loss goes away. It basically lets you do a lot of application-specific tuning without having to build a single algorithm that will work for all protocols.
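
As a sketch of what that tuning loop could look like in an application (the per-connection knobs it assumes don't exist today -- exposing them is exactly what the proposed kernel patches would do -- and the names and thresholds here are made up):

    # Sketch of the adjustment heuristic only; the kernel interface it would
    # drive (a settable initial window plus per-connection loss stats) is the
    # hypothetical part.
    class InitialWindowTuner:
        def __init__(self, start_bytes=16 * 1024, floor=4 * 1024, ceiling=64 * 1024):
            self.window = start_bytes      # e.g. sized to cover ~80% of responses
            self.floor = floor
            self.ceiling = ceiling

        def on_interval(self, loss_rate, acceptable_loss=0.01):
            # Called periodically with the aggregate packet-loss rate observed
            # across recent connections.
            if loss_rate > acceptable_loss:
                self.window = max(self.floor, self.window // 2)          # back off
            else:
                self.window = min(self.ceiling, self.window + 2 * 1024)  # probe upward
            return self.window   # value the server would apply to new connections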

At this point it's all mostly theoretical so please feel free to poke giant sticks at it (or me). I'll probably start just by configuring bigger windows statically in the kernel and seeing how that works, but for sites or applications looking to squeeze every last bit of performance out of HTTP this looks to be a good way to do it, and it can be done without any changes on the client or to the protocols (always a good thing).

Tuesday, September 8, 2009

Recording video with WebPagetest

One of the things we like to look at is what a page looks like while it is loading, particularly if you are comparing a before and after or multiple sites to each other. This is particularly helpful when talking to the business about performance, since the raw load times sometimes don't adequately represent the user experience.

Up until now this has been a pretty manual process where we screen-record a video while loading a site, then load it into Premiere and stitch it together.

Starting today, you can have WebPagetest record a video of a page load directly. It's still in a pretty rough form and it's going to be a while before it'll be ready for the masses but what is there is already pretty powerful and extremely flexible.

At its guts, Pagetest is really just grabbing screen shots every 0.1 seconds (10 frames per second) whenever the browser window changes (changes are detected immediately because Pagetest hooks the screen drawing code and can tell whenever the browser paints something to the screen). It will capture up to 20 frames at 0.1 second granularity, then fall back to another 20 frames at 1 second, and then capture the remainder at 5 second increments. This is done to keep the in-memory image requirements to a minimum but is more than sufficient for most sites (it would take something like a continuously animating image loading very early in a very long page for this to become a problem - at which point you probably have other things to worry about).
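
In other words, the capture schedule looks roughly like this (a sketch of the time buckets only; the real capture is change-driven, so a frame is only stored when the browser actually paints something):

    # Rough model of the frame schedule: 20 frames at 0.1s, 20 at 1s, then 5s steps.
    def capture_times(test_seconds):
        times, t = [], 0.0
        for _ in range(20):                # 0.1 second granularity (first ~2 seconds)
            times.append(round(t, 1)); t += 0.1
        for _ in range(20):                # 1 second granularity (next ~20 seconds)
            times.append(round(t, 1)); t += 1.0
        while t <= test_seconds:           # 5 second granularity for the remainder
            times.append(round(t, 1)); t += 5.0
        return times

    print(capture_times(60)[:5], "...", capture_times(60)[-3:])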

As things stand right now, the "video" is a bunch of static images and an Avisynth script that turns it into a video for playback so it only works on Windows (and you need to install Avisynth to be able to use the videos). The scripting is incredibly powerful and becomes even more interesting when you create other scripts that operate on the existing "video" script files (I have a standard one for doing side-by-side comparisons for example). I'll probably set up a forum for sharing the scripts so we can all benefit from each other's work.

Long-term I plan on making it point-and-click for generating actual video files but that will take a while to implement and there was a fair amount of interest in even the basic functionality so I decided to launch it as it is while I work on improving it.

The video recording capability is only exposed if you are "logged in", so if you don't see a "video" tab on the main test screen, go over to the forums, log in and then it should show up. I needed to put that in place because the storage requirements for the videos are pretty significant and if I start running into storage problems I may need to reach out to individual users. As with everything else, I plan to keep the videos indefinitely but if storage becomes a problem they'll be the first thing to get pruned.

To capture video, just check the box in the "video" tab on the test screen. The video will be available for download on the screen shot page for a given test (there will be a "Download Video" link at the top).


Update:
I wrote up more details on how to use the captured videos here: http://www.webpagetest.org/forums/showthread.php?tid=46

Friday, July 24, 2009

Traffic Shaping

One of the really important features that WebPagetest offers is the ability to test on various different connection types. This is critical when you're testing the end-user experience because the difference in performance between a fast Ethernet connection to a web site and a consumer connection is HUGE (like orders of magnitude huge). The main differences boil down to bandwidth (how fast your connection is - in each direction) and latency (usually referred to as "ping times" by gamers). Bandwidth is interesting, but for most sites the real killer is latency, and consumer connections can easily have 50+ ms of latency before they even get to the Internet. It may not seem like much, but multiplied across tens of requests and the round trips each one requires, the effect adds up really quickly (and is the main reason it is critical to reduce the number of requests for your web site).

Most of the commercial services for testing site performance will offer you either backbone-connected test systems or (for significantly more money) testing over actual ISP connections. The backbone-connected testing is reasonably good for trending but tends to be fairly far off from the consumer experience and can easily hide changes in the number of requests or bytes that would have a large impact on that experience. Using real residential lines, on the other hand, gets to be very cost prohibitive when you're doing high-volume testing.

When we (AOL) built out our internal performance testing systems (to augment the commercial systems we use) we did a bunch of testing and analyzing and decided that a good compromise would be to use traffic-shaping technology to simulate various types of last-mile connections on our high-speed connectivity. This gives us the flexibility to test any type of consumer connection while still keeping costs under control.

There are a bunch of options for "simulating" consumer connections (proxies, browser plug-ins, etc.) but it was very important to me that if we were going to simulate the connectivity, the simulation be accurate, and most of the solutions fail that basic test. Unless the simulation is happening at the packet level you are going to miss a lot of the subtle behaviors that impact performance (TCP slow start is a great example). A bunch of them also work by proxying the browser connection, which changes how the browser makes its requests.

At a packet level there are a bunch of different configurations, from software on the same PC to software on an external PC to dedicated appliances. We started out trying a software solution but it was fairly expensive and tended to be unreliable (sometimes it would throttle and sometimes it wouldn't) so we decided a dedicated external solution would be the way to go. After a bunch of research we narrowed it down to 3 options:
  • dummynet (FreeBSD)
  • NIST Net (Linux)
  • netem (Linux)
Of these, dummynet was the one we liked the most. It was designed for protocol testing and could reliably simulate all of the aspects of a connection that we found interesting, was REALLY easy to configure and could scale really well (we can run several hundred test systems behind a single dummynet system, each getting a dedicated virtual pipe). NIST Net was supposed to be accurate as well but hadn't been kept up to date (it didn't work with more recent kernels) and was unpleasant to configure. The netem solution is probably the most complex, requiring additional configuration with HTB or something similar to shape bandwidth, and after using dummynet it was excruciating to configure for a large number of systems.
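
For anyone curious what the dummynet setup looks like, here's a rough sketch of the kind of per-machine pipe configuration you end up generating (the ipfw syntax is from the dummynet documentation and may need tweaking for your FreeBSD/ipfw version; the IPs and connection profiles here are made-up examples):

    # Sketch: one downstream/upstream pipe pair per test machine, driven via ipfw.
    profiles = {
        "DSL":   {"down_kbps": 1500, "up_kbps": 384, "delay_ms": 25},   # ~50 ms RTT
        "Cable": {"down_kbps": 5000, "up_kbps": 1000, "delay_ms": 30},
    }

    def pipe_commands(pipe_id, client_ip, profile):
        p = profiles[profile]
        return [
            "ipfw pipe %d config bw %dKbit/s delay %dms" % (pipe_id, p["down_kbps"], p["delay_ms"]),
            "ipfw pipe %d config bw %dKbit/s delay %dms" % (pipe_id + 1, p["up_kbps"], p["delay_ms"]),
            "ipfw add pipe %d ip from any to %s out" % (pipe_id, client_ip),
            "ipfw add pipe %d ip from %s to any in" % (pipe_id + 1, client_ip),
        ]

    for cmd in pipe_commands(100, "10.0.0.21", "DSL"):
        print(cmd)

The delay is applied per pipe (one-way), so 25 ms each way gives roughly the 50 ms last-mile RTT discussed earlier.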

This works exceptionally well for large test systems, but if you're trying to deploy a small footprint, the need for the external PC and custom networking causes problems. As we get people offering remote test systems (most recently with Daemon Solutions hosting the UK test location) it is not fair to also ask them to set up the complicated simulation configuration, so I started looking around again to see if the landscape had improved any.

The first thing that grabbed my attention was that dummynet is now available on Linux. This is pretty exciting to me not so much because of the fact that it's on Linux but because Linux ALSO offers a lot of very good virtualization options (which really don't exist for FreeBSD). This opens up the possibility of having a single physical device contain the traffic shaping as well as one or more test systems (different browser versions for example).

Before you say "why don't you just run FreeBSD in a VM", I did a bunch of testing on that a while ago, trying to achieve the same goal, and the timing within a virtual environment isn't reliable enough to do good traffic shaping (it could sort of do it but the consistency just wasn't there). Running a Linux host will let the traffic shaping still happen on a physical host. It does raise the question of how consistent the test data will be for the actual browser running in a VM. I also did a bunch of testing on that (I may write about it at some point because the data was not what I expected) and as long as you only use a single VM it is as reliable as a physical machine (more than one VM, even on a really high-end server tuned for it, introduced enough variability to steer us away from it).

After all that I started wondering whether it would be worthwhile to write my own traffic shaper for Windows that worked at the packet level and did the same things we use dummynet for. I've written Windows networking drivers before and it's not what I'd call a pleasant thing to do, but there are a lot of other things it could offer - particularly packet-level details on a given test in addition to the traffic shaping. Then someone pointed me to this: http://www.akmalabs.com/downloads_netsim.php

It looks like it does most of what I'd need for a stand-alone system and would work right out of the box. I'm going to do some testing with it but if it pans out we may be able to get consistent connectivity types across the various test locations.

Thursday, June 11, 2009

How does your site stack up?

I pulled together the results of the tests that have been run on WebPagetest over the past year and did a bunch of aggregate analysis. There were over 24,000 unique URLs tested in that time so the data is a pretty wide sampling across different types of sites. You can see the full details here: http://www.webpagetest.org/forums/thread-22.html

One of the particularly useful things you can do is look at your own test results and compare them in the distribution graphs to see where you land. For example, are you slower than 95% of the sites that were tested? How do the number of requests and bytes stack up? How about the optimizations?

Looking at straight averages across all of the tests:

Load Time: 10.1 seconds
Time to First Byte: 1.1 seconds
Time to Start Render: 3.8 seconds

Page Size: 510 KB
Number of Requests: 50
Number of Redirects: 1

Perhaps more interesting are observations on the distributions:

Page Measurements

Load Time: 35% of the sites took longer than 10 seconds to load (and there's a pretty long tail that goes out to 60 seconds, with 5% of the sites taking longer than 30 seconds). On the positive side, 33% of the sites loaded in under 5 seconds.

Time to First Byte: Looked surprisingly good with 76% of sites coming in under 500ms. More confirmation that the back-end on most sites works well and the work needs to be done on the front-end (content). That said, 9% took over 2 seconds so there are some sites that still have some back-end work to do.

Time to Start Render: There is a lot of room for improvement and this is probably one of the most useful measurements (and unique to Pagetest). The user doesn't see anything display before this point so it doesn't matter how fast the back-end is if there is a lot of js and css code loading in the head that prevents the page from rendering (even worse is not much code but lots of files). 60% of the sites take over 2 seconds to start rendering with 20% of the sites taking over 5. If you're going to focus on optimizing anything, this is the first number you should be looking at.

Page Size: I feel sorry for anyone still using dial-up. 30% of the sites were over 500KB.

Number of Requests: This is usually the most impactful measurement because most of the time spent making a request is overhead rather than actually downloading content, so the more requests on the page, the more time is being wasted. 33% of the sites have 50 or more requests with 12% having 100 or more (and a really scary tail out to 400).

Number of Redirects: 66% of the sites had no redirects and in general things looked really good. The 2% of sites with over 8 redirects should probably look at reducing them though.

Optimizations

I won't go through all of them but I will hit the high points.

Most sites are doing a good job with persistent connections. Only 5% of the sites are not leveraging keep-alives at all.

On compression, 50% of the sites could save 50% or more of their text bytes by enabling gzip compression. This helps the end user and also saves bytes on the wire, which goes directly to the bandwidth costs the site owners have to pay.
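
If you want to see what your own pages would save, a quick check is easy (a sketch; the file name is a placeholder for one of your own HTML/JS/CSS assets):

    import gzip

    # Quick check of how much gzip would save on a text asset.
    path = "homepage.html"   # placeholder - point this at one of your own files
    data = open(path, "rb").read()
    compressed = gzip.compress(data)
    print("%s: %d -> %d bytes (%.0f%% smaller)" % (
        path, len(data), len(compressed),
        100.0 * (1 - len(compressed) / float(len(data)))))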

The biggest impact for most sites comes from combining several js and css files into a single file (of each type). This goes directly to the start render time and 20-30% of the sites have a large number of files to combine.

The last one I'll touch on is caching of static assets. A full 25% of the sites don't use any Expires or Cache-Control headers at all. That makes the repeat view of the site almost as slow as the first view (and makes a lot of unneeded requests to the site). If you don't want people to keep coming back, this is a sure way to encourage that :-)
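
Checking whether your own static assets come back with caching headers is just as easy, for example (the URL is a placeholder; point it at one of your own images or scripts):

    import urllib.request

    # Quick check of the caching headers on a static asset.
    url = "http://www.example.com/static/logo.png"   # placeholder URL
    resp = urllib.request.urlopen(url)
    for header in ("Cache-Control", "Expires", "Last-Modified", "ETag"):
        print("%s: %s" % (header, resp.headers.get(header, "(not set)")))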

There are a ton of charts and a lot more data in the full analysis so if you want more information head over there. I also have an Excel spreadsheet of the raw data available at the end of the analysis if you want to run any different kinds of analysis on it.

Tuesday, May 5, 2009

Optimization impact

As I look through the tests that come through WebPagetest I've been wondering if all of the optimization checks really make sense, or if we should be focusing on just the top 3 or 4 things that have the largest impact and not worrying about the smaller stuff. A LOT of the pages I see come through don't even have the basics in place, and their owners may be overwhelmed by the checklists, etc. (and by the time you start worrying about a few bytes from cookies, that may not be the real bottleneck for your site and you may be wasting optimization time).

This itch got particularly tweaked when YSlow 2 was released and added more automated checks from the 34 best practices. Are people going to be trying for an A and over-optimizing or optimizing the wrong part of their site? Are people at the beginning of the curve going to be overwhelmed by what they have to do and miss the opportunity for a large payoff for minimal effort?

Ryan Doherty wrote a really good article where he did a step-by-step optimization of a fake social network portal in "Optimizing openSpaceBook". He did miss persistent connections and I'd argue that even minifying probably wasn't necessary but he hit the big hitters and documented the improvement from each. I decided to take that framework and walk through optimizing a real-world site going through the steps from easiest to implement to the most difficult and just focusing on the changes that would result in large gains.

My proposed optimization path that everyone should take at a minimum (and that should be universally beneficial) is:
  1. Enable Persistent Connections: This is a simple configuration setting on most web servers, requires no changes to the site and has almost no risk (the only risk is if you are running a site at close to capacity you may not be able to keep the connections open for long).
  2. Properly Compress your Content: This includes both gzipping your html/javascript/css and properly compressing your images (jpegs can often be saved at a lower quality setting with no visible sacrifice in image quality - we use Photoshop quality level 50 as the baseline at AOL). For the gzip compression it is again usually just a matter of configuration on the web server to enable it. If you pay for bandwidth this can also save you real money on hosting costs.
  3. Allow the Browser to Cache your Static Content: Now we're starting to stray into possibly requiring code changes, and this won't have any impact on the initial load time, but for repeat visits the savings can be significant.
  4. Reduce the Number of HTTP Requests: This one is a bit more ambiguous but the most important cases for this are to collapse your CSS and JS down to a single file of each and to use image sprites for your page element graphics. This definitely requires more work than the other 3 optimizations but the payoff is usually well worth it.
I'd argue that these 4 optimizations will get you 90+% of the improvement for most of the sites and anything left is going to be very specific to each individual site (javascript optimizations, etc).

I decided to take a corporate portal and walk through an optimization exercise much like Ryan did to see what the benefit was for each step. It didn't take long to find one that was in pretty bad shape - I decided to look at a portal for a national web design company and it turns out that their home page pretty much failed everything except for the persistent connections so I cloned their site, broke the persistent connections and started optimizing.

First, the baseline - With everything broken I measured the site and this is what it looked like:

First View



Repeat View

If you ever see one of your repeat view waterfalls with a lot of yellow on it, it means you REALLY need to do a better job of letting the browser cache your page. All of those requests are wasted round trips.

And the numbers from the baseline:
              Load Time   Start Render   Requests   Bytes In
First View    18.446s     8.174s         87         500 KB
Repeat View   13.176s     7.281s         87         20 KB

You'll notice that the repeat view is not much faster than the first view even though it only downloads 20KB of data - that's because of the 87 requests which are really slowing things down.


Step 1: Enable Persistent Connections (keepalives)

A quick tweak to the Apache configuration to turn on the keepalives and we eliminate one round trip from each of the requests. The waterfall essentially looks the same, just without the little orange bits at the beginning of each request but look at what happened to the times:
              Load Time   Start Render   Requests   Bytes In
First View    10.591s     4.922s         87         503 KB
Repeat View   7.431s      4.336s         87         23 KB

That's close to a 50% improvement in load times with 5 minutes worth of work and NO changes to the page itself.
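
If you want a quick sanity check that keep-alives are actually being honored after the configuration change, something like this works (a rough sketch; the host is a placeholder, and the socket-identity check leans on how Python's http.client reuses or replaces its connection):

    import http.client

    # Make two requests on one connection and see whether the socket survives.
    conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
    conn.request("GET", "/"); conn.getresponse().read()
    first_socket = conn.sock                     # None or replaced if the server closed it
    conn.request("GET", "/"); conn.getresponse().read()
    print("Connection reused:", conn.sock is first_socket)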


Step 2: Compression

Again, just a quick tweak to the Apache configuration to enable gzip compression and re-compress a few of the jpeg images and we get:
              Load Time   Start Render   Requests   Bytes In
First View    9.698s      4.610s         87         354 KB
Repeat View   7.558s      4.351s         87         26 KB

We got another second or so back in first view times and saved 150 KB of bandwidth for the site. This particular page did not have a lot of text or javascript, and the images were already in pretty good shape, so the improvement wasn't as big as it would be on several sites I have seen, but the effort required is minimal and there is no downside to doing it.


Step 3: Cache Static Content

This will not have any impact on the first view performance, but if users ever come back to your site it can have a huge impact. We are starting to cross the line into "may require some application work" though, to make sure it is safe for your static assets to be cached for long periods. If so, then actually enabling it is again just a configuration setting on the server.
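
For reference, the headers you're aiming for look something like this (a sketch; in practice you'd have the web server attach them, e.g. via mod_expires in Apache, rather than generating them by hand):

    from datetime import datetime, timedelta

    # Illustration of far-future caching headers for static assets.
    max_age = 365 * 24 * 3600   # one year
    expires = (datetime.utcnow() + timedelta(seconds=max_age)).strftime(
        "%a, %d %b %Y %H:%M:%S GMT")
    print("Cache-Control: public, max-age=%d" % max_age)
    print("Expires: %s" % expires)

Just make sure you have a way to change the URL (versioned file names, for example) when the asset actually does change.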

Here is what the waterfall looks like for the repeat view after we let the browser cache everything:

There are only 2 requests (and one of those is generated by javascript and canceled right away). More importantly, here is what the times look like:
              Load Time   Start Render   Requests   Bytes In
First View    9.751s      4.533s         87         361 KB
Repeat View   0.788s      0.753s         2          0 KB

As expected, no impact to the first view times, but the repeat view times got 90% faster.


Step 4: Reduce the number of HTTP requests

Now we're finally into the realm of having to do actual development work on the page. This page only had a few javascript files (5) but there were a TON of individual images for the various page elements. Combining the javascript and css files is a pretty trivial effort (and there are even modules that can do it for you). Changing the page to use image sprites instead of discrete images is a lot more work but WELL worth it (best if you can just plan to do this before you build a site but also worth it when retrofitting).
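
The combining itself is about as simple as build steps get, something like this (a sketch with made-up file names; real build steps usually minify and version the output too):

    # Concatenate the individual js/css files into one of each type.
    def combine(sources, target):
        with open(target, "wb") as out:
            for path in sources:
                out.write(open(path, "rb").read())
                out.write(b"\n")   # guard against files missing a trailing newline/semicolon

    combine(["nav.js", "tracking.js", "carousel.js", "forms.js", "utils.js"], "site-combined.js")
    combine(["layout.css", "theme.css", "print.css"], "site-combined.css")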

Here is what the waterfall looked like after combining the files together:


And the numbers:
              Load Time   Start Render   Requests   Bytes In
First View    3.910s      1.079s         15         344 KB
Repeat View   0.773s      0.819s         2          0 KB

That's another 60% improvement in the first view times (and I only did the easy combining - it could have been refined even more).


Wrap-Up

As you can see, with just 4 core rules you can take a page from 18 seconds to load all the way down to 4 seconds (an 80% improvement). If that doesn't demonstrate the 80/20 rule, I don't know what does. Are the other best practices worth implementing/checking? I'd strongly contend that it's good to know them, but once you actually implement even these 4 basic rules you're either going to be fast enough or you're going to be doing some more advanced testing and analysis to see what is making the site slow (probably by manually looking at the waterfalls and looking for the bottlenecks).

I'm not making any changes yet but I'm strongly considering changing the checklist on WebPagetest to focus on these 4 rules as critical to implement and then provide the other details more as informational checks but not present them as prominently as they currently are.

Thoughts? Leave a comment here or discuss them in the WebPagetest Forums