Tuesday, October 11, 2011

Testing for Frontend SPOF

Steve Souders had a great blog post last year that talked about Frontend Single Points Of Failure (SPOF).  Given the continuing rise of 3rd-party widgets on pages, the issue is becoming increasingly important, and I realized that there weren't any good tools for testing for it.  It seemed like the perfect opportunity to piece something together, so that's exactly what I did.

Probably the most critical part of testing a failure of a 3rd-party widget is to make sure you get the failure mode correct.  When these things fail, the servers usually become unreachable and requests time out.  It is important to replicate that behavior and not have the requests fail quickly, otherwise you will see what your site looks like without the content but the experience won't be right (the real experience is Sooooooo much worse).

I looked around for a well-known public blackhole server but couldn't find one so I went ahead and set one up (feel free to use it for your testing as well):

blackhole.webpagetest.org (aka 72.66.115.13)

A blackhole server is a server that can be routed to but all traffic gets dropped on the floor so it behaves exactly like we want when testing the failure mode for 3rd-party widgets.
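
If you want to double-check that a host is behaving like a blackhole (hanging until a timeout rather than refusing the connection quickly), a quick connection test is enough.  Here's a minimal sketch in TypeScript/Node - the host name is the blackhole above, everything else is just illustration:

// check-blackhole.ts - confirm that connections hang and time out instead of failing fast
import * as net from "net";

const HOST = "blackhole.webpagetest.org";
const TIMEOUT_MS = 10000; // give up after 10 seconds

const started = Date.now();
const socket = net.connect({ host: HOST, port: 80, timeout: TIMEOUT_MS });

socket.on("connect", () => {
  console.log("Unexpected: the connection succeeded");
  socket.destroy();
});

socket.on("timeout", () => {
  // This is the failure mode we want to reproduce: the request just hangs.
  console.log(`Timed out after ${Date.now() - started}ms - traffic is being dropped`);
  socket.destroy();
});

socket.on("error", (err) => {
  // A fast "connection refused" here would mean the failure mode is wrong.
  console.log(`Failed quickly (${Date.now() - started}ms): ${err.message}`);
});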

With the blackhole server up and running you can now use it for testing manually or through tools like WebPagetest.

Browsing the broken web

For the purposes of this example I'll be "breaking" the Twitter, Facebook and Google buttons as well as the Google API server (jQuery, etc.) and Google Analytics.

Now that we have a blackhole server, breaking the web is just a matter of populating some entries in your hosts file (C:\Windows\System32\drivers\etc\hosts on Windows, /etc/hosts on Linux and OS X).  Go ahead and add these entries and save the updated hosts file:


72.66.115.13 ajax.googleapis.com     
72.66.115.13 apis.google.com         
72.66.115.13 www.google-analytics.com
72.66.115.13 connect.facebook.net    
72.66.115.13 platform.twitter.com    

...and go browse the web.  It shouldn't take you long to find a site that is infuriatingly painful to browse.  Congratulations, you just experienced a Frontend SPOF - now go fix it so your users don't have to feel the same pain (assuming it is a site you control, otherwise just yell at the owner).

Testing it with WebPagetest

It's a lot easier to discover broken content just by browsing using the hosts file method, but if you find something and need to make the case to someone to get it fixed, nothing works better than a WebPagetest video.

First, test the site as you normally would but make sure to check the "capture video" option (and it's probably not a bad idea to also give it a friendly label).

Next, to capture the broken version of the site you will need to use a script (largely just copy and paste).  You need to send the broken domains to the blackhole and then visit the page you are trying to test:


setDnsName ajax.googleapis.com blackhole.webpagetest.org
setDnsName apis.google.com blackhole.webpagetest.org
setDnsName www.google-analytics.com blackhole.webpagetest.org
setDnsName connect.facebook.net blackhole.webpagetest.org
setDnsName platform.twitter.com blackhole.webpagetest.org
navigate your.url.com

Just paste the script into the script box (with the correct URL to be tested), make sure capture video is checked and that you have a friendly label on the test.
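
If you are scripting a lot of these, the same test can also be submitted through WebPagetest's runtest.php API instead of the web form.  A rough sketch follows - the parameter names (url, script, video, label, f) are my understanding of the public API, so check the documentation for your instance and add an API key if it requires one:

// submit-spof-test.ts - submit the blackhole script to WebPagetest programmatically
const script = [
  "setDnsName ajax.googleapis.com blackhole.webpagetest.org",
  "setDnsName apis.google.com blackhole.webpagetest.org",
  "setDnsName www.google-analytics.com blackhole.webpagetest.org",
  "setDnsName connect.facebook.net blackhole.webpagetest.org",
  "setDnsName platform.twitter.com blackhole.webpagetest.org",
  "navigate your.url.com",
].join("\n");

const params = new URLSearchParams({
  url: "http://your.url.com/",   // the page being tested
  script,                        // the blackhole script from above
  video: "1",                    // capture video so the runs can be compared
  label: "Broken widgets",       // friendly label, same as in the UI
  f: "json",                     // ask for a machine-readable response
});

async function submit(): Promise<void> {
  const response = await fetch("http://www.webpagetest.org/runtest.php", {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: params.toString(),
  });
  console.log(await response.text());
}

submit().catch((err) => console.error(err));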

Finally, go look at the test history, select the tests that you ran and click compare (the history works best if you log into the site before submitting your tests).

And what would be the fun in it without an example?  Here is what happens to Business Insider when Twitter goes down (yeah, THAT never happens): http://www.webpagetest.org/video/view.php?id=111011_4e0708d3caa23b21a798cc01d0fdb7882a735a7d

Yeah, so it's normally pretty slow, but when Twitter goes down the user stares at a blank white screen for 20 seconds!  At that point, Business Insider itself may as well be down.  Luckily it can easily be solved just by loading the Twitter button asynchronously.
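
For reference, the asynchronous fix is the standard async-snippet pattern: inject the widget script with a dynamically created script element so a hung 3rd-party server can't block rendering.  A minimal sketch (the widgets.js URL is the one the Twitter button embed used; the same pattern works for any widget):

// load-widget-async.ts - load a 3rd-party widget script without blocking rendering
function loadScriptAsync(src: string): void {
  const script = document.createElement("script");
  script.async = true;
  script.src = src;
  // Insert before the first script on the page so it starts downloading early
  const first = document.getElementsByTagName("script")[0];
  first.parentNode?.insertBefore(script, first);
}

// Example: the Twitter button
loadScriptAsync("//platform.twitter.com/widgets.js");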

Monday, October 3, 2011

Anycast and what it means for web performance


Every now and then the topic of Anycast comes up in the context of web performance so I thought I’d take a stab at explaining what it is and the benefits.

tl;dr – DNS servers should always be Anycast (and even some of the largest CDNs are not, so don't just assume you are covered).  Anycast for the web servers/CDN is great if you can pull it off but it's a lot less common than DNS.

Anycast – the basics

Each server on a network (like the Internet) is usually assigned an address and each address is usually assigned to a single server.  Anycast is when you assign the same address to multiple servers and use routing configurations to make sure traffic is routed to the correct server.  On private networks where there is no overlap this is pretty easy to manage (just don't route the Anycast addresses out of the closed network).  On the public Internet things are somewhat more complicated since routes change regularly, so a given machine could end up talking to different servers at different points in time as routing changes happen on the Internet (congested links, outages, and hundreds of other reasons).

The routing behavior on a network as large as the Internet means Anycast is not a good fit for stateful long-lived connections but stateless protocols or protocols that recover well can still work.  Luckily for the web, the two foundational protocols for web traffic are largely stateless (DNS and HTTP).

DNS Anycast

By far, the most common use for Anycast on the Internet is for DNS (servers and relays).  To provide fast DNS response times for users across the globe you need to distribute your authoritative DNS servers (and users need to use DNS relays/servers close to them).

One way to distribute your servers is to give each one a unique address and just list them all as authoritative servers for your domain.  Intermediate servers running BIND 8 will try them all and favor the fastest ones, but they will still use the slower ones for some percentage of traffic.  BIND 9 (last I checked anyway) changed the behavior and no longer favors the fastest, so you will end up with a mix of slow and fast responses for all users.

Using Anycast, you would distribute your servers globally, give them all the same IP address, and list a single address (or a couple of Anycast addresses for redundancy) as the authoritative servers for your domain.  When a user goes to look up your domain, their DNS relay/server would always get routed to your best authoritative server (by network path, not necessarily physical geography).  Since DNS is just a request/response protocol over UDP, it really doesn't matter if they end up talking to different physical servers for different requests.

So, as long as the routing is managed correctly, DNS Anycast is ALWAYS better than other solutions for a distributed DNS serving infrastructure (at least for performance reasons).  You should make sure that you are using Anycast DNS for both your own records as well as any CDNs you might leverage.  It works for both the authoritative servers as well as the DNS relays that users might use.  Google's public DNS servers for end users are globally distributed but use the Anycast addresses of 8.8.8.8 and 8.8.4.4, so you will always get the fastest DNS performance regardless of where you are and what network you are on.
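
If you want to see that in action, point a resolver at the 8.8.8.8 anycast address and time a lookup - wherever you run it, the same address answers, and the latency reflects whichever Google location the routing picked.  A minimal sketch in Node (the hostname is just an example):

// anycast-dns.ts - time a lookup against Google's anycast resolver (8.8.8.8)
import { promises as dns } from "dns";

async function timeLookup(hostname: string): Promise<void> {
  const resolver = new dns.Resolver();
  resolver.setServers(["8.8.8.8"]); // the anycast address - the same everywhere
  const started = Date.now();
  const records = await resolver.resolve4(hostname, { ttl: true });
  const elapsed = Date.now() - started;
  for (const record of records) {
    console.log(`${hostname} -> ${record.address} (TTL ${record.ttl}s)`);
  }
  console.log(`Answered in ${elapsed}ms by whichever server the routing sent us to`);
}

timeLookup("www.webpagetest.org").catch((err) => console.error(err));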

HTTP Anycast

Even though HTTP is not as stateless as DNS (TCP connections need to be negotiated and maintained), the connections live for a short enough time that Anycast can also work really well for HTTP – though it requires more control over the network to keep routing changes to a minimum.

Typically, geo-distribution of web servers is done by assigning them different IP addresses and then relying on geo-locating DNS to route users to the server closest to them.  It usually works well enough but there are some fairly big gotchas:

  • The geo-locating DNS server actually sees the address of the user's DNS server, not the user themselves, so it can only provide the server closest to the user's DNS – not necessarily the user (there is a spec update, edns-client-subnet, to relay the user's actual subnet in DNS requests so this can be done more accurately).
  • The geo-locating is only as good as the knowledge that the service has about which web servers are closest to the user’s DNS servers.  It usually works well but it’s not uncommon to see traffic routed to servers that are far away.
  • The Time To Live (TTL) on the DNS responses is usually really short (60 seconds) so that dead or overloaded servers can be pulled out as needed.  This effectively means that the DNS records can’t be cached by the user’s DNS servers and the requests all have to go back to the authoritative servers.


With Anycast, servers can be deployed globally with the same IP address.  When it works well it addresses all of the issues that using DNS to geo-locate has:

  • DNS can reply with the same IP address for all users and the address can have a long TTL and be cached by intermediate DNS resolvers.
  • In the case of a CDN, you can even assign the Anycast address directly as an A record and avoid the extra step of a CNAME lookup.
  • You don’t need to know where the user is.  Routing will take care of bringing the user to the closest server regardless of where they or their DNS server are located.
  • If you need to take a server offline, you adjust the routing so that traffic goes to the next best physical server.


I’m glossing over a LOT of the complexity in actually managing an Anycast network on the public internet but assuming you (or your provider) can pull it off, Anycast can be a huge win for HTTP performance as well.

All that said, there are only a few implementations that I am aware of for using Anycast for HTTP (and they are all CDN providers).  Anycast for HTTP should not be the main focus when picking a CDN since there are a lot of other important factors – the most important of which is to make sure they actually have edge nodes near your users (if you have a lot of users in Australia then pick a provider with edge nodes in Australia FIRST, then compare other features).

Friday, September 2, 2011

Firefox - Comparing WebPagetest to Firebug

We recently launched experimental support for testing with Firefox on WebPagetest and I've already had some interesting questions as people started to see things on WebPagetest that they were not used to seeing when they use the Firebug Net Panel (of which I'm a HUGE fan and user).

When we built the Chrome test agent for WebPagetest we made the decision to try to make it cross-browser and re-use as much code as possible (the re-use is easily over 90% between Chrome and Firefox).  One of the side effects is that we try to get our data from as low as possible in the application stack so we can re-use as much of it as possible and not rely on browser-specific features.  For all practical purposes this means that we get the networking data as the browser sends/retrieves it from the OS.  We don't get visibility into the browser cache, but we get really good visibility into the data that is actually sent on the wire.

For everything I'm talking about here I'll be using a test page that is part of a Firebug bug report (http://jeroenhoek.nl/jquery/twice.html) and talking specifically about just the Net panel.

Favicon

One of the first pings I got was about favicon.ico errors showing up in WebPagetest but not in Firebug (the requests aren't displayed in Firebug at all):



It's not necessarily a big deal but it is traffic you should be aware of.  I was particularly surprised to see that Firefox tried to download the favicon 3 times.  Just more reason to make sure you always have a valid one - eliminates 3 hits to your server in the case of Firefox.

Socket Connect

Another interesting difference is the socket connect behavior.  When we implemented the Chrome agent we had to deal with the initial connections not necessarily being right before a request because of Chrome's pre-connect logic.  The Firefox agent inherited the capability and it looks like it's a good thing it did because Firefox 6 appears to also pre-connect (it looks like it opens 2 connections to the base domain immediately).


In Firebug the connection is right up next to the request (I expect this is a limitation in the waterfall and not the measurement itself).

Interestingly, it looks like Firefox pre-connected 2 additional connections in my Firebug test but that's possibly because Firefox had learned that that page needed more than 2 connections (WebPagetest uses a completely clean profile for every test and my Firebug test was on my desktop with a clear cache but not a new profile).

Duplicate Requests

Finally, the actual topic of the bug report: Firebug shows an aborted request for the web font.  If the aborted request is real, it was completely internal to Firefox and never hit the wire, because the WebPagetest trace shows just the single request.

Finally

These are just the things that have jumped out at me over the last couple of days.  Let me know if you see any other interesting differences.  As always, it's great to have tools that measure things differently so we can validate them against each other (and confirm behaviors we are seeing to make sure it's not a measurement problem).

Friday, May 20, 2011

The Demise of Active Website Monitoring

It hasn't happened yet, but it's a question of when, not if, active monitoring of websites for availability and performance will become obsolete.  My prediction is that it will happen in the next 5 years, though if everything lines up it could be as soon as 2 years away.

By active monitoring I am referring to testing a website on a regular interval and potentially from several locations to see if it is working and how long it takes to load (and bundled in with that the alarming, reporting, etc. that goes with it).

Active monitoring has some pretty strong benefits over any alternatives right now:
  • Rich debugging information (resource-level timing, full access to headers and network-level diagnostics)
  • Consistency - the test conditions do not vary from one test to the next so there is minimal "noise"
  • Predictability - you control the frequency and timing of the tests
  • Low-latency alerting - you can get notified within minutes of an event/issue (assuming it is detected)
But it's not all sunshine and roses:
  • You only have visibility into the systems/pages that you test (which is usually a TINY fraction of what your users actually use)
  • It's expensive.  You usually end up picking a few key pages/systems to monitor to keep costs under control
  • The more you test, the more load you put on the systems you are monitoring (capacity that should be going to serve your users)
  • You can only test from a "representative" set of locations, not everywhere your users actually visit from.  This may not seem important if you only serve content from one location, but do you use a CDN?  Do you serve ads or use 3rd-party widgets that are served from a CDN?  If so then there is no way that you are actually able to test every path your users use to get your content
  • The performance is never representative of what the users see.  Usually monitoring is done from backbone connections that are close to CDN POPs.  Even if you spring for testing on real end-user connections ($$$$) you have to pick a small subset of connection types.  Your users visit from office connections, home ISP connections, mobile, satellite, and over various different connections even within the house.
So, none of this is new - why now, and what is going to replace it?

There are several advances converging that will make it possible to collect, report, and act on REAL end user data (Real User Monitoring - RUM).  There are issues with using data from the field, but between advances in the browsers and Big Data they are on the verge of being solved:

First off, getting the rich diagnostic information from the field.  Monitoring is useless if you can't identify the cause of a problem, and historically you have had very little insight into what is going on inside of a browser.  That all started to change last year when the W3C Web Performance Working Group formed.  They are working on defining standards for browsers to expose rich diagnostic/timing information to the pages.  The first spec that has been implemented is the Navigation Timing standard, which exposes page-level information about the timings of various browser actions.  The Navigation Timing spec has already been implemented in IE9 and Chrome and will be coming soon in the other major browsers.  Even more interesting will be the Resource Timing standard, which will expose information about every resource that is loaded.
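
To give a sense of what that looks like in practice, here is a minimal sketch of reading the Navigation Timing data and beaconing it back.  The /rum endpoint is made up for illustration; the timing fields come from the window.performance.timing interface the spec defines, and navigator.sendBeacon is a later addition - an XHR works just as well:

// rum-beacon.ts - collect Navigation Timing data and report it back to your servers
window.addEventListener("load", () => {
  // Wait a tick so loadEventEnd has been populated
  setTimeout(() => {
    const t = performance.timing;
    const metrics = {
      dns: t.domainLookupEnd - t.domainLookupStart,
      connect: t.connectEnd - t.connectStart,
      ttfb: t.responseStart - t.navigationStart,
      domContentLoaded: t.domContentLoadedEventEnd - t.navigationStart,
      load: t.loadEventEnd - t.navigationStart,
    };
    navigator.sendBeacon("/rum", JSON.stringify(metrics)); // hypothetical collection endpoint
  }, 0);
});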

HTML5 also opens up the possibility of storing data in local storage when a failure can't be reported (so it can be reported later), and for pages that leverage the Application Cache you can even run completely offline and detect failures to reach the site in the first place.
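
A sketch of that idea: when a report can't be delivered, stash it in localStorage and retry on the next page view (the /errors endpoint and the storage key are hypothetical):

// offline-errors.ts - queue failure reports in localStorage and flush them later
const QUEUE_KEY = "pendingErrorReports"; // hypothetical storage key

function queueReport(report: object): void {
  // Called when a report can't be sent right now - keep it for later
  const pending = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? "[]");
  pending.push(report);
  localStorage.setItem(QUEUE_KEY, JSON.stringify(pending));
}

async function flushQueue(): Promise<void> {
  const pending: object[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? "[]");
  if (pending.length === 0) return;
  try {
    await fetch("/errors", { method: "POST", body: JSON.stringify(pending) });
    localStorage.removeItem(QUEUE_KEY); // delivered - clear the backlog
  } catch {
    // Still offline or the site is unreachable - keep the reports for next time
  }
}

// Try to deliver anything left over from earlier failed sessions
window.addEventListener("load", () => { void flushQueue(); });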

OK, so we will be able to get the rich diagnostics from the field (and collect data from real user sessions, so you get coverage on everything the users do on your site, from everywhere they visit, etc.) - that's a lot of data, so what do you do with it?

Big Data to the rescue.  Data storage and analysis for huge data sets has started to explode, primarily driven by Hadoop, but there are tons of commercial companies and services entering the space.  It's already possible to store, process and analyze petabytes (and more) very efficiently and things are only going to improve.  We are quickly evolving towards a world where we collect everything and then slice and dice it later.  That's pretty much exactly what you need to do with field data to investigate trends or dig into issues.

Why 2-5 years and not now?

Adoption.  Browsers that support the new standards will take a while to reach critical mass (and the Resource Timing spec isn't defined yet).  There are also no services or toolkits yet that do good field monitoring, so it will take a while for those to evolve and mature.

I'm curious to see if the traditional beacon services (Omniture, comScore, etc.) step into this space, if the traditional monitoring providers adapt, or if a new breed of startups catches them all off guard.  I am a little surprised that most of the participation in the standards process is coming from the browser vendors themselves trying to anticipate how the data would be used - I'd expect to see more of the monitoring services playing an active role if it were on their radar.

Friday, January 7, 2011

Tour of the "Meenan Data Center"

Well, I've been promising to do it for a while and it's finally ready.  If you've ever wondered what the facility looks like that runs the Dulles test location for WebPagetest (and soon the Web Server), here you go...

First up, here is the secure entrance to the cage securely below ground level in case of tornados or other such craziness (yes, in case you're wondering - my basement).


Here are the physical machines grinding away day and night running your tests.  They are co-located with the "climate and humidity control system" (furnace).  The unRAID file server is completely unrelated to WebPagetest; it just happens to be sitting there (and I'm a huge fan of the technology - I can RAID a massive array of disks but only the disk being accessed at a given time spins up, so it's great for power consumption).

 
Around the corner we have the brand-spanking-new web server that will be running WebPagetest.org (among other random personal sites).  The VoIP converter just happens to be there because that's where my phone line comes in and it's great for blocking telemarketers and keeping the phone from ringing when the kids are sleeping.


Finally, we have the heart of the data center tour, the network that pulls it all together.  I've been completely spoiled by FiOS.  Seriously low-latency, high-bandwidth connectivity right to the home.  The bulk of that wiring is really just my house and random devices (yes, all of the plugged-in ports are live - everything has a network connection these days).  The BSD router is overkill these days.  It was there originally because the traffic shaping was done there, but now that it has moved into the testers themselves I need to get around to replacing it with something a little less power-hungry.


And there you have it.  The Meenan Data Center in all its glory!