Friday, August 20, 2010

Passive vs Active performance monitoring

One of the things that has always bothered me about actively monitoring a site's performance (hitting it on a regular interval from an automated browser) is that you only get results for the specific page(s) you are monitoring, from the locations and browsers you use for the monitoring.  To get better coverage you need to do more testing, which increases the amount of artificial traffic hitting your site (and still doesn't end up being very realistic coverage of what your end users are seeing).

Passive monitoring, on the other hand, involves putting a beacon of some kind on your page that reports the performance of every page that every visitor loads (with no artificial traffic).  You get complete coverage of what real users are doing on your pages and what their real experiences are.
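To make the idea concrete, here is a minimal sketch of what a passive beacon can look like (TypeScript; the /beacon collection URL is hypothetical, it leans on the Navigation Timing API which not every browser supports, and Boomerang does considerably more than this):

```typescript
// Report the load time of every real page view to a collection endpoint.
window.addEventListener("load", () => {
  const t = window.performance && performance.timing;
  if (!t) return; // older browsers without Navigation Timing

  const qs = new URLSearchParams({
    page: location.pathname,
    loadTime: String(t.loadEventStart - t.navigationStart), // full page load (ms)
    ttfb: String(t.responseStart - t.navigationStart),      // network + back-end (ms)
  });

  // A simple image GET keeps the beacon compatible with any collector.
  new Image().src = "/beacon?" + qs.toString();
});
```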

There are some real benefits to active testing, particularly the controlled environment, which produces consistent results (while passive monitoring requires a lot of traffic, otherwise individual user configurations will skew the results on a given day).  Active monitoring also gives you a wealth of information that you can't get from field data (information on every request and details on exactly what is causing a problem).

Active testing is easier - you just find a company that offers the service, subscribe and start receiving reports and alerts.  For passive monitoring you need to instrument your pages and build the infrastructure to collect and analyze the results (or find a company that will do it for you, but then you are potentially adding another external Frontend SPOF to your page). Boomerang is a great place to start for passive monitoring, but you still need the reporting infrastructure behind it.
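For the collection side, a rough sketch of the bare minimum is something like the following (TypeScript on Node; the log file name, port and field handling are placeholders, and the real work is in the aggregation and reporting you build on top of it):

```typescript
import * as http from "http";
import * as fs from "fs";
import * as url from "url";

// Accept beacon GETs and append them to a log for an offline job to aggregate.
http.createServer((req, res) => {
  const parsed = url.parse(req.url || "", true);
  if (parsed.pathname === "/beacon") {
    const record = { ts: Date.now(), ip: req.socket.remoteAddress, ...parsed.query };
    fs.appendFile("beacons.log", JSON.stringify(record) + "\n", () => {});
    res.writeHead(204); // fire-and-forget: the beacon never needs a body back
    res.end();
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);
```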

Can we do better?  Would a mix of passive and active monitoring work better, where active tests are initiated based on information collected from the passive monitoring (like the top pages for that day, or pages that are showing slower or faster performance than "normal")?
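A rough sketch of that selection logic, assuming the passive data has already been rolled up per page (the data structure and the submitActiveTest() call are hypothetical; the tests could be submitted to a private WebPagetest instance):

```typescript
interface PageStats {
  page: string;
  views: number;            // today's page views from the passive beacon
  medianLoadMs: number;     // today's median load time
  baselineMedianMs: number; // e.g. trailing 7-day median for the same page
}

// Pick the day's most-viewed pages plus any page that looks unusually
// slow or fast compared to its own baseline.
function pickPagesForActiveTesting(stats: PageStats[], topN = 10, changeRatio = 1.25): string[] {
  const topPages = [...stats]
    .sort((a, b) => b.views - a.views)
    .slice(0, topN)
    .map(s => s.page);

  const anomalies = stats
    .filter(s => s.medianLoadMs > s.baselineMedianMs * changeRatio ||
                 s.medianLoadMs < s.baselineMedianMs / changeRatio)
    .map(s => s.page);

  return Array.from(new Set([...topPages, ...anomalies]));
}

// pickPagesForActiveTesting(todaysStats).forEach(page => submitActiveTest(page));
```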

Several people have asked for WebPagetest to be able to do recurring, automated testing and I'm debating adding the capability (particularly for private instances), but I'm not convinced it is the right way to go (for performance monitoring, as opposed to availability monitoring). Is the amount of artificial traffic generated (and testing infrastructure) worth it?  Are the results meaningful on a regular basis, or will it just end up being another set of reports that people stop paying attention to after a week?

I'd love to hear from other users on how they monitor their sites and what they have found works well, so shoot some comments back and let's get a discussion going.

8 comments:

  1. We use an Oracle (ex Moniforce) product called RUEI.

    It gets a feed of traffic from a tap port on the switch in front of our webfarm and then pushes it into a set of OLAP cubes.

    It's great up to a point: it'll measure real page load times passively as it examines traffic at both the HTTP and TCP/IP levels. It'll extract custom dimensions out of the data it sees too.

    Where it falls down is in the way it rolls up data; often the interesting detail at the edges gets lost, and there are other issues with the way it warehouses the data.

    As an example, I've been playing with Apdex and wanted to see what difference reducing the threshold for a satisfied page load would make, but the stats aren't available at a low enough level to play "what if" scenarios.
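    With the raw samples it would be trivial to replay the standard Apdex calculation for any candidate threshold, which is exactly the kind of thing I want to do (rough sketch, made-up sample data):

    ```typescript
    // Standard Apdex: satisfied if t <= T, tolerating if t <= 4T.
    function apdex(samplesMs: number[], thresholdMs: number): number {
      const satisfied = samplesMs.filter(t => t <= thresholdMs).length;
      const tolerating = samplesMs.filter(t => t > thresholdMs && t <= 4 * thresholdMs).length;
      return (satisfied + tolerating / 2) / samplesMs.length;
    }

    // "What if" across candidate thresholds, e.g. 2s vs 3s vs 4s:
    const loadTimes = [800, 1200, 2500, 3100, 6400, 9000]; // made-up samples
    [2000, 3000, 4000].forEach(t => console.log(t, apdex(loadTimes, t)));
    ```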

    What I'd really like to be able to do is get the same sort of feed and store the important information in a way that would let us use Hadoop (or similar) to analyse it, so I could go back over the data and explore it in different ways.

    After reading about it in Theo Schlossnagle's book, I've been thinking about where mod_spread could fit in.

    The other truly lush thing RUEI does is measure average throughput (though I think its approach is a bit borked); you can then dial the figures into Charles to see what experience your users are really getting.

    @andydavies

  2. I had forgotten about tap-based solutions, largely because the sites I'm used to looking at rely heavily on 3rd party content and CDNs. How accurate have you found it to be for capturing the full page load times that the users see?

  3. From our sampling, it's reasonably accurate, but we still serve all our own files, i.e. no CDN. Where they do fall down is that they can only look at network traffic, not actual browser events, e.g. when unload gets fired.

    Third party beacons of course alter the dynamics of the page load, so they run into different issues!

    There isn't an easy way of measuring what the real user experience is...

  4. Don't get me wrong, I'm a huge proponent of active testing (as well as instrumenting the crap out of your systems) for operational purposes, but that is a different beast from full-page performance testing where you use real browsers. In those situations you need a mix of active testing (some form of HTTP GET automation with Nagios, a service, or any of a bunch of tools that can do it) as well as stats out the wazoo from your applications, hosts, access logs, network gear and load balancers.

    I'm more questioning the need to do the same thing with full browsers to actively monitor a subset of your pages. From an ops perspective you end up with a bunch of alarms that are outside of the ops team's control because of 3rd party issues (ads, beacons, tracking, agent-location problems, etc.) and someone has to parse through them to figure out who to call, whereas targeted ops monitoring can monitor each piece separately and send alarms directly to the responsible party.

    You also need to do the active testing to make sure the site is actually up (availability testing), because the passive monitoring won't give you any data if users can't get to your front door.

    What I'm questioning is the usefulness of doing what WebPagetest does (with full browsers from multiple locations) for ongoing performance monitoring (usually of front-end performance, because base-page performance can be measured much better with other tools). The resource needs for running full-page testing with real browsers are orders of magnitude higher than what it takes to do HTTP monitoring (somewhere between 100:1 and 1000:1 based on my experience doing both at scale) - is it worth it?

  5. Definitely need both passive and active.

    Definitely need a free, open source place to store passive data (active too, for that matter).

    Also need to add more to passive than just page load time. You can't run YSlow or Page Speed on real users, but you could easily measure things like the number of images and size of JS.
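    For example, a beacon could scrape a few stats from the DOM on every real page view (counts are easy; actual byte sizes would need something like Resource Timing data or matching against server logs) - a rough sketch:

    ```typescript
    // Extra per-page stats a passive beacon could report alongside load time.
    function pageStats() {
      return {
        imageCount: document.images.length,
        scriptCount: document.getElementsByTagName("script").length,
        stylesheetCount: document.querySelectorAll("link[rel=stylesheet]").length,
        domNodes: document.getElementsByTagName("*").length,
      };
    }
    ```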

  6. @Patrick - So I think we are in agreement then that the problem with the passive solutions right now is "what do you do with the data", since "you end up with a bunch of alarms that are outside of the ops team's control because of 3rd party issues"?

    So, to invert the question: "What features would a passive monitoring solution need in order to make the data it collects operationally useful?" i.e. turning data into knowledge, etc.

    To answer your specific question "is a 'real browser' monitoring solution worth the hassle versus 'synthetic' active monitoring agents", my 2c worth says probably not. Real browsers crash, hang, leak, constantly need upgrading and have variable performance - all the things you DON'T want from a reliable, robust monitoring agent. Sure, you can architect a solution around all those issues, but at what cost, and for what marginal value?

    That said, cross-browser testing solutions like Cloudtesting.com, crossbrowsertesting.com or browsershots.org definitely have their place in your performance, functional and regression testing arsenal and having basic scheduling functionality to run a suite of tests once or twice a day, or after every release, also has value.

    @Steve - Surely extending Piwik to house the passive performance data would be feasible without a huge amount of effort?

    What other performance data needs to be collected, and what are the plans for Boomerang to be extended to collect it?

  7. The first three charts on our RUEI dashboard are

    - average page load time
    - average server response time
    - average throughput

    (all based on five minute intervals)

    Putting aside some of the issues with averages, those three charts give some indication of where the problem may be e.g.

    if page load time rises but server response time remains the same we often see that throughput has dropped.

    What I'd like to be able to do is chart the metrics against each other, or chart the Apdex of page load time vs server response time, but I can't do that in RUEI, so I'm looking at other ways of trying to achieve it.

    (we've also got nagios and cacti whirring away for alarms)

  8. I'd love passive monitoring as part of testing our continuous build process, especially for key pages. That way we can check that the latest build hasn't killed page performance.

    I see far too many developers who don't optimise until they _have_ to. It should be part of the normal dev process.


