By active monitoring I am referring to testing a website at a regular interval, potentially from several locations, to see if it is working and how long it takes to load (bundled with the alarming, reporting, etc. that goes with it).
Active monitoring has some pretty strong benefits over any alternatives right now:
- Rich debugging information (resource-level timing, full access to headers and network-level diagnostics)
- Consistency - the test conditions do not vary from one test to the next so there is minimal "noise"
- Predictability - you control the frequency and timing of the tests
- Low-latency alerting - you can get notified within minutes of an event/issue (assuming it is detected)
But it also has some significant downsides:
- You only have visibility into the systems/pages that you test (which is usually a TINY fraction of what your users actually use)
- It's expensive. You usually end up picking a few key pages/systems to monitor to keep costs under control
- The more you test, the more load you put on the systems you are monitoring (capacity that should be going to serve your users)
- You can only test from a "representative" set of locations, not everywhere your users actually visit from. This may not seem important if you only serve content from one location, but do you use a CDN? Do you serve ads or use 3rd-party widgets that are served from a CDN? If so then there is no way that you are actually able to test every path your users use to get your content
- The performance is never representative of what the users see. Usually monitoring is done from backbone connections that are close to CDN POPs. Even if you spring for testing on real end-user connections ($$$$) you have to pick a small subset of connection types. Your users visit from office connections, home ISP connections, mobile, satellite and over varying connection quality even within the same house.
Several different advances are converging that will make it possible to collect, report and act on REAL end user data (Real User Monitoring - RUM). There are several issues with using data from the field, but between advances in the browsers and Big Data they are on the verge of being solved:
First off, getting the rich diagnostic information from the field. Monitoring is useless if you can't identify the cause of a problem, and historically you have had very little insight into what is going on inside of a browser. That all started to change last year when the W3C Web Performance Working Group formed. They are working on defining standards for browsers to expose rich diagnostic/timing information to the pages. The first spec to be implemented is the Navigation Timing standard, which exposes page-level information about the timings of various browser actions. Navigation Timing has already been implemented in IE9 and Chrome and will be coming soon in the other major browsers. Even more interesting will be the Resource Timing standard, which will expose timing information for every resource that is loaded.
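As a rough sketch of what that page-level data looks like (assuming a browser that implements Navigation Timing), the key milestones can be read straight from script and beaconed back:

```typescript
// Minimal sketch: read Navigation Timing milestones for the current page.
// Works only in browsers that implement the spec (IE9 and Chrome as of this post),
// and must run after the load event so loadEventEnd is populated.
window.addEventListener("load", () => {
  setTimeout(() => {
    const t = performance.timing;
    const metrics = {
      dnsMs: t.domainLookupEnd - t.domainLookupStart, // DNS lookup
      connectMs: t.connectEnd - t.connectStart,       // TCP connect
      ttfbMs: t.responseStart - t.navigationStart,    // time to first byte
      loadMs: t.loadEventEnd - t.navigationStart      // full page load
    };
    // In a real RUM setup these would be sent to a collection endpoint.
    console.log(metrics);
  }, 0);
});
```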
HTML5 also opens up the possibility of storing data in local storage when a failure can't be reported (so it can be reported later), and for pages that leverage the Application Cache you can even run completely offline and detect failures to reach the site in the first place.
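A hedged sketch of the first idea (the beacon endpoint and storage key are made up for illustration): if a failure report can't be delivered, queue it in localStorage and retry on the next page view.

```typescript
// Illustrative only: the beacon endpoint and storage key are assumptions.
const QUEUE_KEY = "pendingBeacons";

// Fire a beacon via an image request; call onFail if it can't be delivered
// (assumes the endpoint answers with a 1x1 pixel, so onerror means "not delivered").
function sendBeacon(url: string, onFail: () => void): void {
  const img = new Image();
  img.onerror = onFail;
  img.src = url;
}

// Report a failure now, or queue it locally if the report itself can't get out.
function reportFailure(detail: string): void {
  const url = "https://beacons.example.com/fail?d=" + encodeURIComponent(detail);
  sendBeacon(url, () => {
    const queued: string[] = JSON.parse(localStorage.getItem(QUEUE_KEY) || "[]");
    queued.push(url);
    localStorage.setItem(QUEUE_KEY, JSON.stringify(queued));
  });
}

// On the next successful page view, flush anything queued while offline.
function flushQueuedBeacons(): void {
  const queued: string[] = JSON.parse(localStorage.getItem(QUEUE_KEY) || "[]");
  localStorage.removeItem(QUEUE_KEY);
  queued.forEach((url) => sendBeacon(url, () => {
    // Still unreachable - put it back in the queue for later.
    const requeued: string[] = JSON.parse(localStorage.getItem(QUEUE_KEY) || "[]");
    requeued.push(url);
    localStorage.setItem(QUEUE_KEY, JSON.stringify(requeued));
  }));
}
```

(As noted further down in the comments, the beacon collection obviously has to live on completely separate infrastructure from the site being monitored.)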
OK, so we will be able to get the rich diagnostics from the field (and collect data from real user sessions so you get coverage of everything the users do on your site, from everywhere they visit, etc.). That's a lot of data, so what do you do with it?
Big Data to the rescue. Data storage and analysis for huge data sets has started to explode, primarily driven by Hadoop, but there are tons of commercial companies and services entering the space. It's already possible to store, process and analyze petabytes (and more) very efficiently, and things are only going to improve. We are quickly evolving towards a world where we collect everything and then slice and dice it later. That's pretty much exactly what you need to do with field data to investigate trends or dig into issues.
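To make "slice and dice" concrete, here is a toy sketch (the record shape and field names are assumptions, not a real schema): raw field beacons go in, and you aggregate along whatever dimension you care about after the fact - per page, per geography, per browser and so on.

```typescript
// Illustrative only: the beacon record shape is an assumption.
interface Beacon {
  page: string;
  geo: string;
  browser: string;
  loadMs: number;
}

// Group raw beacons by any dimension and compute the median load time for each
// group - the "collect everything now, slice it later" model in miniature.
function medianLoadBy(beacons: Beacon[], dim: (b: Beacon) => string): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const b of beacons) {
    const key = dim(b);
    const list = groups.get(key) ?? [];
    list.push(b.loadMs);
    groups.set(key, list);
  }
  const medians = new Map<string, number>();
  for (const [key, times] of groups) {
    times.sort((a, b) => a - b);
    medians.set(key, times[Math.floor(times.length / 2)]);
  }
  return medians;
}

// Example: the same raw data answers per-page and per-geo questions.
// medianLoadBy(beacons, (b) => b.page);
// medianLoadBy(beacons, (b) => b.geo);
```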
Why 2-5 years and not now?
Adoption. Browsers that support the new standards will take a while to reach critical mass (and the Resource Timing spec isn't defined yet). There are also no services or toolkits yet that do good field monitoring, so it will take a while for those to evolve and mature.
I'm curious to see if the traditional beacon services (Omniture, comScore, etc.) step into this space, if the traditional monitoring providers adapt, or if a new breed of startups catches them all off guard. I am a little surprised that most of the participation in the standards process is coming from the browser vendors themselves trying to anticipate how the data would be used - I'd expect to see more of the monitoring services playing an active role if it was on their radar.
Pat,
Great blog post. One minor nit - you wrote, "You can only test from a 'representative' set of locations." That's actually incorrect. You can now measure, in real time, HTTP traffic from any geo location on any carrier using nothing more than an Android phone.
Cheers,
Peter
Yes, but you actually need to DEPLOY those phones (which makes it a "representative" sample, not full coverage). To measure from EVERY location on EVERY carrier and EVERY device that your users use requires RUM monitoring of your actual users.
Great post. I'm sure we will start to see hybrid models from the traditional players. Gomez already makes an attempt based on their own tagging solution, but misses a big chunk of the response time. With GA reporting page speed, I'm sure Omniture will be close behind. However, with minimal investment and the right know-how, there is less reason to pay for it.
Hi Pat,
Thought-provoking stuff... my thoughts are over here - http://www.seriticonsulting.com/blog/2011/5/21/you-can-have-my-active-monitoring-when-you-pry-it-from-my-co.html
(with apologies to Charlton Heston for the blog title)!
@Patrick – agreed.
Solution:
Step 1: Employ uTest. They have 37,000 beta testers in 177 countries. Deploy a "Mobile App" that integrates Device, OS, Carrier and real-time GPS data into the mobile browser. Then have them test 100 web sites from everywhere they are. You will now have representative performance data from every location on the globe. Included in that will be precise performance metrics on how the page performed on every single carrier network. (Use an Episodes-like capability for your mobile page.)
Step 2: Integrate that mobile app (a proxy server on the device) into the customers' existing mobile apps. Send this performance data directly to your own site.
@DogTown -
It's worth noting that Omniture has had this capability for a while; you just needed to do a bit of custom coding -
http://www.webanalyticscentral.com/2010/08/02/omniture-sitecatalyst-helpful-tip-measuring-time-spent-on-previous-page/
http://webanalyticsland.com/sitecatalyst-implementation/additional-methods-to-measure-interaction-using-the-get-time-to-complete-plug-in/
Agree completely... RUM is the only way forward. I've been monitoring this with a custom script on a couple of sites and it provides huge insight into the effect of third-party scripts on page load. It also helps when tuning performance.
Pat, agree on the general premise that RUM is the long-term path. As you point out, it requires Resource Timing support in browsers, and I believe that spec needs additional thinking around error conditions if it is to obviate the need for active monitoring (its current focus is more on performance). Also agree that it requires Big Data - with real-time processing so that alarm conditions can be detected within ~1 min.
For any actively monitored site, the active monitoring will typically visit a site once every 1 to 5 minutes. If this isn't a negligible proportion of the site's traffic then the site isn't worth monitoring, so I dismiss your argument about the increased system load. It's theoretically true, but for all practical purposes irrelevant.
Also, there are two kinds of value to monitoring. One is performance statistics, and your case for using passive monitoring for this is well argued. The second use of monitoring is notification of site failures. How, if at all, does passive browsing produce active alerting? Without a solution to this issue the value proposition of active monitoring will remain high.
@Alan, It's usually a lot more complicated than 1 page every X minutes....
How many different types of pages do you need to test to sufficiently cover all of the relevant back-ends (authentication, search, etc.)? For a reasonably large site, do you need to test every front-end? How about the load balancers? GSLB? (for every page). As the complexity of the system grows, so does the weight of the monitoring (and the costs), and you STILL can't get 100% coverage.
With HTML5 Offline Applications behaving more like installed applications you can start doing things like detecting when you can't perform actions and firing off beacons/reporting (though the reporting would obviously have to be on completely separate infrastructure). You can also do good old-fashioned log-count monitoring, particularly if you instrument your systems.
A lot of this really starts to blur the lines between full-browser application testing and back-end systems monitoring but ultimately I think they will both start to converge on systems that can mine the live data from increasingly well-instrumented systems.
I do think that full-browser front-end testing will be the first to fall because it's an easier transition from where we are but eventually even the low-level ops monitoring will start to leverage the capabilities of real-time big-data mining. The bonus is that the measured SLA will be the actual user SLA, not an approximation.
Patrick - great post!
Real user monitoring is great but also just part of the puzzle.
We really need the ability to monitor in real time at all levels of the stack, from the infrastructure level all the way up to the user level. More importantly we need the ability to correlate all the data so that we can not only identify problems with our service but also drill down into the root cause.
If RUM triggers an alert that a page is loading slower than acceptable then we need to know in real time whether it's an application problem (a bug?), network related (could be geo), a resource problem (a server capacity issue?), and so on - you get the point.
So not only do we need real-time big-data mining - we need advanced analytics that can tie everything together and help us make sense of it.
In my work with RUM, I've found it a struggle to properly segment the data in order to determine causation.
For example, when examining the load time of a page over time, or comparing different pages: are differences due to page content differences/changes, or a change in the mix of user connectivity speeds, or a change in the mix of browser cache 'freshness', or user geography, etc.? Or all of the above?
All these things are neatly controlled with synthetic testing. To gain a similar level of insight from RUM, you must collect the data needed to allow proper segmentation.
I've seen a few attempts at this (Philip Tellis's Boomerang, for example). But nothing that provides the complete picture.
I'm in total agreement that RUM is the right direction. But it may be a long road.
-Eric
@Eric, right, that's why Resource Timing is critical, with its insight into individual requests. Up until now (well, soon anyway, if it can get agreed upon and implemented) that visibility just wasn't there for field metrics.
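As a forward-looking sketch (the Resource Timing spec was not final at this point, so the exact API shown here is an assumption based on how it has since taken shape), per-request data would look something like this:

```typescript
// Speculative sketch: uses the Resource Timing API as it eventually shipped,
// which was still being defined at the time of this discussion.
const resources = performance.getEntriesByType("resource") as PerformanceResourceTiming[];
for (const r of resources) {
  // One entry per fetched resource: URL, how it was initiated, and how long it took.
  console.log(r.name, r.initiatorType, Math.round(r.duration) + "ms");
}
```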