Friday, July 12, 2013

Measuring performance of the user experience

TL;DR: WebPagetest will now expose any User Timing marks that a page records so you can use the same custom events for your synthetic test measurement as well as your Real User Measurement (and you can use WebPagetest to validate your RUM measurement points).

Before kicking off an optimization effort it is important to have good measurements in place.  If you haven't already read Steve Souders' blog post on Moving beyond window.onload(), stop now, go read it and come back.

The Page Load time (start of navigation to the onload event) is the cornerstone metric for most web performance measurement, and it is a fundamentally broken measurement that can end up doing more harm than good by getting developers to focus on the wrong thing.  Take two static pages from WebPagetest as examples:

The first is the main test results page that you see after running a test.  Fundamentally it consists of the data table and several thumbnail images (waterfalls and screen shots).  There are a bunch of other things that make up the page but they aren't the critical parts of the page for the user: ads, social buttons (Twitter and G+), the partner logos at the bottom of the page, etc.

Here is what it looks like when it loads:

The parts of the page that the user (and I) care about have completely finished loading in 500ms but the reported page load time is 3 seconds.  If I were optimizing for the page load time I would probably remove the ads, the social widgets, the partner logos and the analytics.  The reported onload time would be better but the actual performance for the user would not change at all, so it would be completely throw-away work (not to mention detrimental to the site itself).

The second is the domains breakdown page which uses the Google visualization libraries to draw pie charts of the bytes and requests by serving domain:

In this case the pie charts actually load after the onload event and measuring the page load time is really just measuring a blank white page.

If you were to compare the load times of both pages using the traditional metrics they would appear to perform about the same but the page with the pie charts has a significantly worse user experience.

This isn't really new information.  The work I have been doing on the Speed Index has largely been about providing a neutral way to measure the actual experience and to do it consistently across sites.  However, if you own the site you are measuring, you can do a LOT better since you know the parts of the page that your users care about.

Instrumenting your pages

There are a bunch of Real User Measurement libraries and services available (Google Analytics, SOASTA mPulse, Torbit Insight, Boomerang, Episodes) and most monitoring services also have real-user beacons available as part of their offerings.  Out of the box they will usually record the onload time but they usually also have options for custom measurements.  Unfortunately they all have their own APIs right now but there is a W3C standard that the performance group nailed down last year for User Timing.  It is a very simple API that lets you record point-in-time measurements or events and provides a way to query and clear the list of events.  Hopefully everyone will move to leveraging the user timing interfaces and provide a standard way for marking "interesting" events but it's easy enough to build a bridge that takes the user timing events and reports them to whatever you are using for your Real User Measurement (RUM). 
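
As a rough illustration of how small the User Timing interface is, here is a minimal sketch that records two marks, queries them and clears the list.  The function name `demoUserTiming` and the `perf` parameter are made up for this example; `mark()`, `getEntriesByType()` and `clearMarks()` are the calls defined by the spec.

```javascript
// Record, query, and clear marks through the User Timing interface.
// `perf` defaults to the global `performance` object; passing it in is only
// a convenience for this sketch (it makes the function easy to try out).
function demoUserTiming(perf = globalThis.performance) {
  // Record two point-in-time events.
  perf.mark("aft.Header Finished");
  perf.mark("aft.First Waterfall");

  // Query: each mark is an entry with a name and a startTime in
  // milliseconds, measured from the page's time origin.
  const marks = perf.getEntriesByType("mark").map((entry) => ({
    name: entry.name,
    time: entry.startTime,
  }));

  // Clear the list of recorded marks.
  perf.clearMarks();
  return marks;
}
```

A RUM bridge would take the `marks` array and hand it to whatever beacon you already use.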

As part of working on this for WebPagetest itself I threw together a shim that takes the user timing events and reports them as custom events to Google Analytics and SOASTA's mPulse or Boomerang.  If you throw it at the end of your page or load it asynchronously, it will report aggregated user timing events automatically.  The "aggregated" part is key because when you are instrumenting a page you can identify when individual elements load but what you really care about is when they have ALL loaded (or all of a particular class of events have happened).  The snippet will report the time of the last event that fired and it will also take any period-separated names (group.event) and report the last time for each group.  In the case of WebPagetest's result page I have "aft.Header Finished", "aft.First Waterfall" and "aft.Screen Shot" (aft being short for "above-the-fold").  The library will record an aggregate "aft" time that is the point when everything that I consider critical as above-the-fold has loaded.
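
The aggregation described above can be sketched in a few lines.  This is not the actual shim, just an illustration of the idea: `aggregateMarks` is a hypothetical name, the input is assumed to be a list of `{ name, time }` objects, and the times in the usage example are invented.

```javascript
// Aggregate user timing marks: the latest mark overall, plus the latest
// mark within each period-separated group ("group.event").
function aggregateMarks(marks) {
  const groups = Object.create(null);
  let last = 0; // stays 0 when there are no marks
  for (const { name, time } of marks) {
    last = Math.max(last, time);
    const dot = name.indexOf(".");
    if (dot > 0) {
      const group = name.slice(0, dot);
      groups[group] = Math.max(groups[group] || 0, time);
    }
  }
  return { last, groups };
}

// Illustrative values only.
const summary = aggregateMarks([
  { name: "aft.Header Finished", time: 120 },
  { name: "aft.First Waterfall", time: 480 },
  { name: "aft.Screen Shot", time: 450 },
  { name: "ads.loaded", time: 2900 },
]);
// summary.groups.aft is 480 and summary.last is 2900
```

Reporting the group time rather than each individual mark keeps the RUM data focused on "when was everything I care about done".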

The results paint a VERY different view of performance than you get from just looking at the onload time and match the filmstrip much better.  Here is what the performance of all visitors from the US to the test results page looks like in mPulse.

Page Load (onload):

aft (above-the-fold):

That's a pretty radical difference, particularly in the long-tail.  A 13 second 98th percentile is something that I might have freaked out about but 4 seconds is quite a bit more reasonable and actually better represents the user experience.

One of the cool things about the user timing spec is that the interface is REALLY easy to polyfill so you can use it across all browsers.  I threw together a quick polyfill (feel free to improve on it - it's really basic) as well as a wrapper that makes it easier to do the actual instrumentation.  
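
To give a feel for how little a polyfill needs, here is a minimal sketch (not the polyfill mentioned above).  `createUserTimingShim` and `pickUserTiming` are hypothetical names, and the clock is passed in so the shim can be exercised outside a browser; in a page you would normally measure from the start of navigation rather than from when the shim was created.

```javascript
// A tiny stand-in for the mark-related parts of the User Timing interface.
// `now` returns the current time in ms; `origin` is the zero point.
function createUserTimingShim(now = () => Date.now(), origin = now()) {
  let marks = [];
  return {
    mark(name) {
      marks.push({ name, entryType: "mark", startTime: now() - origin, duration: 0 });
    },
    getEntriesByType(type) {
      return marks.filter((m) => m.entryType === type);
    },
    clearMarks(name) {
      // No argument clears everything; a name clears only matching marks.
      marks = name === undefined ? [] : marks.filter((m) => m.name !== name);
    },
  };
}

// Prefer the native implementation when the browser has one.
function pickUserTiming(perf) {
  return perf && typeof perf.mark === "function" ? perf : createUserTimingShim();
}
```

Falling back to a separate object instead of patching `performance` avoids mixing native and shimmed methods on browsers that only implement part of the interface.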

Instrumenting your page with the helper is basically just a matter of throwing calls to markUserTime() at points of interest on the page.  You can do it with inline script for text blocks:

or more interestingly, as onload handlers for images to record when they loaded:

If you can get away with just using image onload handlers that would be the safest bet because inline scripts can have unintended blocking effects where the browser has to wait for previous CSS files to load and process before executing.  It's probably not an issue for an inline script block well into the body of a page but it is something to be aware of.
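
For illustration, the two instrumentation styles could look roughly like this.  The real helper's signature isn't shown here, so treat this `markUserTime` as a guess at its shape; `markWhenLoaded` and the `waterfall.png` file name are invented for the example.

```javascript
// Hypothetical shape of the helper: record a mark if User Timing
// (native or polyfilled) is available, otherwise do nothing.
function markUserTime(name, perf = globalThis.performance) {
  if (perf && typeof perf.mark === "function") {
    perf.mark(name);
    return true;
  }
  return false;
}

// Inline use, placed right after the markup being measured:
//   <script>markUserTime("aft.Header Finished");</script>
//
// Image use, as an onload handler:
//   <img src="waterfall.png" onload="markUserTime('aft.First Waterfall')">

// The same image instrumentation wired up from script.  The listener has
// to be attached before the image finishes loading or the event is missed.
function markWhenLoaded(img, name, perf) {
  img.addEventListener("load", () => markUserTime(name, perf), { once: true });
}
```

The onload attribute version has the advantage that the handler is in place as soon as the parser creates the image element.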

Bringing some RUM to synthetic testing

Now that you have gone and instrumented your page so that you have good, actionable metrics from your users, it would be great if you could get the same data from your synthetic testing.  The latest WebPagetest release will extract the user timing marks from pages being tested and expose them as additional metrics:

At a top-level, there is a new "User Time" metric that reports the latest of all of the user timing marks on the page (this example is from the breakdown pie chart page above where the pie chart shows up just after 3 seconds and after the load event).  All of the individual marks are also exposed and they are drawn on the waterfall as vertical purple lines.  If you hover over the marker at the top of the lines you can also see details about the mark.

The times are also exposed in the XML and JSON interfaces so you can extract them as part of automated testing (the XML version has the event names normalized):
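
If you are pulling the marks out of the JSON results in a script, something along these lines should work.  The field names are an assumption on my part for this sketch (a `userTime` total plus one `userTime.<mark name>` entry per mark on each view's data), so check them against an actual response before relying on it; the sample values are invented.

```javascript
// Pull user timing values out of one view's result data.
// ASSUMPTION: marks appear as "userTime.<name>" keys next to a "userTime"
// total; verify the field names against a real JSON response.
function extractUserTimes(view) {
  const prefix = "userTime.";
  const marks = {};
  for (const [key, value] of Object.entries(view)) {
    if (key.startsWith(prefix) && typeof value === "number") {
      marks[key.slice(prefix.length)] = value;
    }
  }
  const total = typeof view.userTime === "number" ? view.userTime : null;
  return { userTime: total, marks };
}

// Invented sample data in the assumed shape.
const sampleView = { loadTime: 2950, userTime: 3100, "userTime.pie.Bytes": 3100 };
const userTimes = extractUserTimes(sampleView);
```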

This works both as a great way to expose custom metrics for your synthetic testing and as a way to debug your RUM measurements to make sure your instrumentation is working as expected (comparing the marks with the filmstrip for example).

Tuesday, June 4, 2013

Progressive JPEGs FTW!

TL;DR: Progressive JPEGs are one of the easiest improvements you can make to the user experience and the penetration is a shockingly-low 7%.  WebPagetest now warns you for any JPEGs that are not progressive and provides some tools to get a lot more visibility into the image bytes you are serving.

I was a bit surprised when Ann Robson measured the penetration of progressive JPEGs at 7% in her 2012 Performance Calendar article.  Instead of a 1,000 image sample, I crawled all 7 million JPEG images that were served by the top 300k websites in the May 1st HTTP Archive crawl and came out with....wait for it.... still only 7% (I have a lot of other cool stats from that image crawl to share but that will be in a later post).

Is The User Experience Measurably Better?

Before setting out and recommending that everyone serve progressive JPEGs I wanted to get some hard numbers on how much of an impact it would have on the user experience.  I put together a pretty simple transparent proxy that could serve arbitrary pages, caching resources locally and transcoding images for various optimizations.  Depending on the request headers it would:

  • Serve the unmodified original image (but from cache so the results can be compared).
  • Serve a baseline-optimized version of the original image (jpegtran -optimize -copy none).
  • Serve a progressive optimized version (jpegtran -progressive -optimize -copy none).
  • Serve a truncated version of the progressive image where only the first 1/2 of the scan lines are returned (more on this later).

I then ran a suite of the Alexa top 2,000 e-commerce pages through WebPagetest comparing all of the different modes on a 5Mbps Cable and 1.5Mbps DSL connection.  I first did a warm-up pass to populate the proxy caches and then each permutation was run 5 times to reduce variability.

The full test results are available as Google docs spreadsheets for the DSL and Cable tests.  I encourage you to look through the raw results and if you click on the different tabs you can get links for filmstrip comparisons for all of the URLs tested (like this one).

Since we are serving the same bytes, just changing HOW they are delivered, the full time to load the page won't change (assuming an optimized baseline image as a comparison point).  Looking at the Speed Index, we saw median improvements of 7% on Cable and 15% on DSL.  That's a pretty huge jump for a fairly simple serving optimization (and since the exact same pixels get served there should be no question about quality changes or anything else).

Here is what it actually looks like:

Some people may be concerned about the extremely fuzzy first pass in the progressive case.  This test was done using the default jpegtran scans.  I have a TODO to experiment with different configurations to deliver more bits in the first scan and skip the extremely fuzzy passes.  By the time you get to 1/2 of the passes, most images are almost indistinguishable from the final image so there is a lot of room for improving the experience.

What this means in WebPagetest

Starting today, WebPagetest will be checking every JPEG that is loaded to see if it is progressive and it will be exposing an overall grade for progressive JPEGs:

The grade weights the images by their size so larger images will have more of an influence.  Clicking on the grade will bring you to a list of the images that were not served progressively as well as their sizes.
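
If you want to run a similar check on your own images, the sketch below shows one way to do it.  It is not WebPagetest's implementation: it simply walks the JPEG header segments until it finds the start-of-frame marker, which identifies the encoding (SOF2 is the usual progressive one).  The byte-weighted share is my reading of "weights the images by their size"; the exact grading formula is an assumption, and both function names are made up.

```javascript
// Start-of-frame markers that indicate a progressive encoding
// (SOF2, SOF6, SOF10, SOF14 in the JPEG spec).
const PROGRESSIVE_SOF = new Set([0xc2, 0xc6, 0xca, 0xce]);

// Returns true when the first frame header found is a progressive one.
function isProgressiveJpeg(bytes) {
  if (bytes.length < 4 || bytes[0] !== 0xff || bytes[1] !== 0xd8) return false; // no SOI
  let i = 2;
  while (i + 3 < bytes.length) {
    if (bytes[i] !== 0xff) return false;         // not at a marker, give up
    const marker = bytes[i + 1];
    if (marker === 0xff) { i += 1; continue; }   // fill byte before a marker
    const isSof = marker >= 0xc0 && marker <= 0xcf &&
      marker !== 0xc4 && marker !== 0xc8 && marker !== 0xcc;
    if (isSof) return PROGRESSIVE_SOF.has(marker);
    if (marker === 0xda) return false;           // image data began with no frame header
    if (marker === 0x01 || (marker >= 0xd0 && marker <= 0xd7)) { i += 2; continue; }
    const length = (bytes[i + 2] << 8) | bytes[i + 3]; // includes the 2 length bytes
    if (length < 2) return false;
    i += 2 + length;
  }
  return false;
}

// ASSUMPTION: share of JPEG bytes served progressively, so larger images
// count for more.  `images` is a list of { bytes, progressive }.
function progressiveByteShare(images) {
  let total = 0;
  let progressive = 0;
  for (const image of images) {
    total += image.bytes;
    if (image.progressive) progressive += image.bytes;
  }
  return total === 0 ? 1 : progressive / total;
}
```

With this weighting, one large baseline hero image hurts the score far more than a handful of small baseline thumbnails.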

Another somewhat hidden feature that will now give you a lot more information about the images is the "View All Images" link right below the waterfall:

It has been beefed up and now displays optimization information for all of the JPEGs, including how much smaller each would be when optimized and compressed at quality level 85, whether it is progressive and, if so, how many scans it has:

The "Analyze JPEG" link takes you to a view where it shows you optimized versions of the image as well as dumps all of the meta-data in the image so you can see what else is included.

What's next?

With more advanced scheduling capabilities coming in HTTP 2.0 (and already here with SPDY), sites can be even smarter about delivering the image bits: they can re-prioritize a progressive image once enough data has been sent to render a "good" version and deliver the rest of it after other images on the page have had a chance to display as well.  That's a pretty advanced optimization but it will only be possible if the images are progressive to start with (and the 7% number does not look good).

Most image optimization pipelines right now are not generating progressive JPEGs (and aren't stripping out the meta-data because of copyright concerns) so there is still quite a bit we can do there (and that's an area I'll be focusing on).

Progressive JPEGs can be built with almost arbitrary control over the separate scans.  The first scan in the default libjpeg/jpegtran setting is extremely blocky and I think we can find a much better balance.

At the end of the day, I'd love to see CDNs automatically apply lossless image optimizations and progressive encoding for their customers while maintaining copyright information.  A lot of optimization services already do this and more but since the resulting images are identical to what came from the origin site I'm hoping we can do better and make it more automatic (with an opt-out for the few cases where someone NEEDS to serve the exact bits).

Tuesday, May 28, 2013

What makes for a good talk at a tech conference?

I have the pleasure of helping select the talks for a couple of the Velocity conferences this year. After looking at several hundred proposals, it is clear that submitters have widely varying opinions on what makes for a good talk, and in a lot of cases the topic may be good but the submitter has the wrong focus. I'm certainly not an expert on the topic but I think that if you keep one point in mind when submitting a talk for a tech conference (any tech conference) your odds of getting it accepted will go up dramatically:

It is all about the attendees! Period!

When you're submitting a talk, try to frame it in such a way that each attendee will get enough value out of your talk to justify the expense of attending the conference (conference costs, travel, opportunity cost, etc). If all of the talks meet that criterion then you end up with a really awesome conference.

If you are talking about a technique or toolchain, make sure that attendees will be able to go back to their daily lives and implement what you talked about. More often than not that means the tools need to be readily available (bonus points for open source) and you need to provide enough information that what you did can be replicated. These kinds of talks are also a lot better if they are presented by the team that implemented the "thing" and not by the vendor providing the toolchain. For most tech conferences, the attendees are hands-on so hearing from the actual dev/ops teams that did the work is optimal.

Make sure you understand the target audience as well and make the talks generally applicable. For something like Velocity where the attendees are largely web dev/ops with a focus on scaling and performance, make sure your talk is broadly applicable to them. A talk on implementing low-level networking stacks will not work as well as a talk about how networking stack decisions and tuning impact higher-level applications for example.

What doesn't work?

  • Product pitches (there are usually sponsored tracks and exhibit halls for that kind of thing)
  • PR. This is not about getting you exposure; it is about educating the attendees.