Friday, December 28, 2012

Motivation and Incentive

My favorite web performance article of 2012 was this one from Kyle Rush about the work done on the Obama campaign's site during the 2012 election cycle.  They did some cool things, but it wasn't the technical achievements that got my attention, it was the effort they put into it.  They re-architected the platform to serve as efficiently as possible by serving directly from the edge and ran 240 A/B tests to evolve the site from its initial look and feel to the final result at the end of the campaign (with a huge impact on donations as a result of both efforts).

Contrast that with the Romney tech team, which appears to have contracted a lot of the development out and spent quite a bit more to do it (I wish there were an easy way to compare the impact on funds raised, but donation patterns across the parties are normally very different).

What I like most is that it demonstrates very clearly how critical it is to have people's motivations aligned with the "business" goals - those are the situations where you usually see the innovative work and the larger efforts.  I see this time and time again in the tech industry, and I'm sure it applies elsewhere, but it is absolutely critical to be aware of in tech.

Fundamentally that is what DevOps is all about and why the classical waterfall development model is broken:

  • Business identifies a "product need"
  • Product team specs-out a product to fill that need
  • Dev team builds what was specified by the product team (usually as exactly to the requirements as possible, including fussing about pixel-perfect matching the mock-up designs)
  • Dev team throws the resulting product over the wall to QA to test and verify against the requirements
  • QA team throws the final product over the wall to the Ops team to run
    • Usually forever and long after the dev and product teams have moved on
    • Usually doing all sorts of crazy things to keep the system running (automatic restarts, etc)

By bringing the various teams together and giving them skin in the game, they are incented to produce a product that is easier to implement, scales, and runs reliably (getting developers on pager duty is easily the fastest way to get server code and architectures fixed).

As you look across your deployment, small site or large, what are the motivating factors for each of the teams responsible for a given component?

The Hosting

If you are not running your own servers, there is a good chance that the company running them isn't incentivized to optimize for your needs.

In the case of shared hosting, the hosting provider makes their money by running as many customers on as little hardware as possible.  Their goal is to find the point at which people start quitting because things perform so badly, and to stay as close to that point as possible without going beyond it.  When I see back-end performance issues with sites, they are almost always on shared hosting, and at times it can be absolutely abysmal.

With VPS or dedicated hosting they usually get more money as you need more compute resources.  Their incentive is to spend as little time as possible supporting you and certainly not to spend time tuning the server to make it as fast as possible.

If you are running on someone else's infrastructure (which includes the various cloud services, so it is increasingly likely that you are), I HIGHLY recommend having the in-house skills necessary to tune and manage the servers and serving platforms.  You need remote hands-and-eyes to deal with things like hardware failures, but outsourcing the management is hardly ever a good idea.  Having someone on your team who is incented to get as much out of the platform as possible will save you a ton of money in the long term and result in a much better system.

Site Development

You should have the skills and teams in-house to build your sites.  Period.  If you contract the work out, the company you work with is usually trying to do as little work as possible to deliver exactly what you asked for in the requirements document.  Yes, they will probably work with you a bit to make sure it makes sense, but they are not motivated by how successful the resulting product will be for your business - once they get paid they are on to the next contract.

I see it all too often.  Someone will be looking at the performance of their site and there are huge issues, even with some of the basics, but they can't fix it.  They contracted the site out, and what was delivered "looks" like what they asked for and functions perfectly well, but architecturally it is a mess.

There are great tools available to help you tune your sites (front- and back-end), but you need to have the skills in-house to use them.  Just like the Obama campaign: they focused on continuously optimizing the site for the duration of the campaign because they were part of the team and were motivated by the ultimate business goals, not by some requirements document with boxes to check.

Maybe I'm a bit biased since I'm a software guy who also likes to do end-to-end architectures and system tuning, but I absolutely believe that these are skills you need to have or develop on your actual team in order to be successful.  Contracting out for expertise can also make sense, as long as the contractors are educating your team as you go and the engagement is more about education and getting you on the right track.


The CDN

Maybe it's my tinfoil hat getting a bit tight, but given that CDNs usually bill you for the number of bits they serve on your behalf, it doesn't feel like they are particularly motivated to make sure you are serving only as many bits as you need to.  Gzipping content where appropriate is one of the biggest surprises.  It seems like a no-brainer, but most CDNs will just pass through whatever your server responds with and won't do the simple optimization of gzipping as much as possible (most of them have it as an available setting, but it is not enabled by default).
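One way to spot-check this yourself is to request a text resource through your CDN while offering gzip and see what comes back.  This is a minimal sketch I'm adding for illustration (the helper names are mine, and it only inspects the Content-Encoding header, nothing CDN-specific):

```python
# Spot-check whether a URL comes back compressed when gzip is offered.
from urllib.request import Request, urlopen

def content_encoding(url):
    """Fetch the URL advertising gzip support and report the encoding used."""
    req = Request(url, headers={"Accept-Encoding": "gzip"})
    with urlopen(req) as resp:
        return resp.headers.get("Content-Encoding", "identity")

def is_compressed(encoding):
    """True if the reported encoding is an actual compression scheme."""
    return encoding.lower() in ("gzip", "br", "deflate")
```

If `is_compressed(content_encoding(...))` comes back False for your HTML, CSS, or JavaScript, the CDN (or origin) configuration is worth a look.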

Certainly you don't want to build your own CDN, but you should pay very careful attention to the configuration of your CDN(s) to make sure the content they are serving is optimized for your needs.


Finally, just because you have the resources in-house doesn't mean that their motivations are aligned with the business.  In the classic waterfall example, the dev teams are not normally motivated to make sure the systems they build are easy to operate (resilient, self-healing, etc).  In a really small company where the tech people are also founders, it is pretty much a given that their incentives are well aligned, but as your company gets larger it becomes a lot harder to maintain that alignment.  Product dogfooding, DevOps, and equity sharing are all common techniques for maintaining alignment, which is why you see them so often in the technical space.

OK, time to put away the soapbox - I'd love to hear how other people feel about this, particularly counter arguments where it does make sense to completely hand-off responsibility to a third-party.

Tuesday, November 20, 2012

Clearing IE's Caches - Not as simple as it appears

I've spent the last week or so getting the IE testing in WebPagetest up to snuff for IE 10.  I didn't want to launch the testing until everything was complete because there were some issues that impacted the overall timings and I didn't want people to start drawing conclusions about browser comparisons until the data was actually accurate.

The good news is that all of the kinks have been ironed out and I will be bringing up some Windows 8 + IE 10 VMs over the Thanksgiving holidays (I have some new hardware on the way because the current servers are running at capacity).

In the hopes that it helps other people doing browser testing I wanted to document the hoops that WebPagetest goes through to ensure that "First View" (uncached) tests are as accurate as possible.

Clearing The Caches

It's pretty obvious, but the first thing you need to do for first view tests is clear the browser caches.  In the good old days this pretty much just meant the history, cookies, and object caches, but browsers have evolved a lot over the years and now store all sorts of other data and heuristic information that helps them load pages faster.  To properly test first view page loads you need to nuke all of it.

For Chrome, Firefox and Safari it is actually pretty easy to clear out all of the data.  You can just delete the contents of the profile directory which is where each browser stores all of the per-user data and you essentially get a clean slate.  There are a few shared caches that you also want to make sure to clear out:

  • DNS Cache - WebPagetest clears this by calling DnsFlushResolverCache in dnsapi.dll and falling back to running "ipconfig /flushdns" from a shell.
  • Flash Storage - Delete the "\Macromedia\Flash Player\#SharedObjects" directory
  • Silverlight Storage - Delete the "\Microsoft\Silverlight" directory
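The shared-cache cleanup can be sketched roughly as follows.  This is a simplified illustration, not WebPagetest's actual code, and it assumes the Flash and Silverlight directories live under the user's APPDATA/LOCALAPPDATA roots (the post above doesn't spell out the roots):

```python
# Rough sketch of clearing the shared browser caches on Windows.
# Directory roots are assumptions; not WebPagetest's real implementation.
import os
import shutil
import subprocess

def shared_cache_dirs(appdata, local_appdata):
    """Shared caches that persist outside the per-browser profile dirs."""
    return [
        os.path.join(appdata, "Macromedia", "Flash Player", "#SharedObjects"),
        os.path.join(local_appdata, "Microsoft", "Silverlight"),
    ]

def clear_shared_caches():
    for d in shared_cache_dirs(os.environ.get("APPDATA", ""),
                               os.environ.get("LOCALAPPDATA", "")):
        shutil.rmtree(d, ignore_errors=True)  # ignore if already gone
    # Command-line fallback for the DNS cache, as described above.
    subprocess.call(["ipconfig", "/flushdns"])
```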

That will be enough to get the non-IE browsers into a clean state but IE is a little more difficult since it is pretty tightly interwoven into the OS as we learned a few years back.

The first one to be aware of is the OS certificate store.  Up until a few months ago WebPagetest wasn't clearing it out, which made IE's HTTPS negotiations faster than they would be in a truly first view scenario.  On Windows 7, all versions of IE will do CRL and/or OCSP validation of certificates used for SSL/TLS negotiation.  That validation can be EXTREMELY expensive (several round trips for each validation), and the results were being cached in the OS certificate store, making IE's HTTPS performance appear faster than it really is for true first view situations.

To clear the OS certificate stores we run a pair of commands:

certutil.exe -urlcache * delete
certutil.exe -setreg chain\ChainCacheResyncFiletime @now

IE 10 introduced another cache where it keeps track of the different domains that a given page references so it can pre-resolve and pre-connect to them (Chrome has similar logic but it gets cleared when you nuke the profile directory).  No matter how you clear the browser caches (even through the UI), the heuristic information persists and the browser would pre-connect for resources on a first view.

When I was testing out the IE 10 implementation the very first run of a given URL would look as you would expect (ignore the really long DNS times - that's just an artifact of my dev VM):

But EVERY subsequent test for the same URL, even across manual cache clears, system reboots, etc would look like this:

That's all well and good (great, actually) for web performance, but a bit unfortunate if you are trying to test the uncached experience, because DNS, socket connect (and, I assume, SSL/TLS negotiation) are basically free and removed from the equation.  It's also really unfortunate if you are comparing browsers and not clearing it out, because it will give IE an advantage (unless you are also maintaining the heuristic caches in the other browsers).

Clearing out this cache is what has been delaying the IE 10 deployment on WebPagetest and I'm happy to say that I finally have it under control.  The data is being stored in a couple of files under "\Microsoft\Windows\WebCache".  It would be great if we could just delete the files but they are kept persistently locked by some shared COM service that IE leverages.

My current solution to this is to terminate the processes that host the COM service (dllhost.exe and taskhostex.exe) and then delete the files.  If you are doing it manually then you also need to suspend the parent process or stop the COM+ service before terminating the processes because they will re-spawn almost immediately.  If anyone has a better way to do it I'd love feedback (the files are mapped into memory so NtDeleteFile doesn't work either).
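On Windows the workaround looks roughly like this.  It's a hedged sketch of the manual steps above, not WebPagetest's actual code; the process names and the COM+ System Application service name ("COMSysApp") are the pieces I believe are involved:

```python
# Sketch of forcibly clearing IE 10's WebCache heuristic data.
# The COM+ host processes keep the files locked, so stop/kill them first.
import glob
import os
import subprocess

def webcache_files(local_appdata):
    """The WebCache database files described above."""
    return glob.glob(os.path.join(local_appdata, "Microsoft", "Windows",
                                  "WebCache", "*"))

def clear_webcache():
    # Stop the COM+ service first so the host processes don't re-spawn.
    subprocess.call(["net", "stop", "COMSysApp"])
    for proc in ("dllhost.exe", "taskhostex.exe"):
        subprocess.call(["taskkill", "/F", "/IM", proc])
    for path in webcache_files(os.environ.get("LOCALAPPDATA", "")):
        try:
            os.remove(path)
        except OSError:
            pass  # a file may still be mapped/locked; nothing more to do
```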

Browser Initialization

Once you have everything in a pristine state with completely cleared profiles and caches you still have a bit more work to do because you want to test the browser's "first view" performance, not "first run" performance.  Each of the browsers will do some initialization work to set up their caches for the first time and you want to make sure that doesn't impact your page performance testing.  

Some of the initialization happens on first access, not browser start-up, so you can't just launch the browser and assume that everything is finished.  WebPagetest used to start with about:blank and then navigate to the page being tested, but we found that some browsers would pay a penalty for initializing their caches when they parsed the first HTML that came in, and they would block.  I believe Sam Saffron was the first to point out the issue, when Chrome was not fetching sub-resources as early as it should have been (on a page where the head was being flushed out early).  In the case of the IE connection heuristics, it would also pay a particularly expensive penalty at the start of the page load when it realized that I had trashed the cache.

In order to warm up the various browser engines and make sure that everything is initialized before a page gets tested WebPagetest navigates to a custom blank HTML page at startup.  In the WebPagetest case that page is served from a local server on the test machine but it is also up on if you want to see what it does.  It's a pretty empty html page that has a style and a script block just to make sure everything is warmed up.


Hopefully this information will be helpful to others who are doing browser performance testing.  

You should also be careful about taking browser-to-browser comparisons as gospel.  As you can see, there are a lot of things you need to do to get an apples-to-apples comparison, and even then it isn't necessarily what users experience.  Browsers are adding more heuristics, pre-connecting, and even pre-rendering of pages into the mix, and most of the work in getting to a clean "first view" defeats a lot of those techniques.

Wednesday, August 22, 2012

FCC Broadband Progress Report

The FCC released their eighth broadband progress report yesterday.
The most interesting part for me starts on page 45, where they talk about actual adoption (in the US) - the speeds that people are actually subscribing to, not what is available or offered.  The buckets aren't all that granular, and the data used to build the report comes from June 2011, but they give you a good idea of what the spread looks like:
64.0% - At Least 768 kbps/200 kbps
40.4% - At Least 3 Mbps/768 kbps
27.6% - At Least 6 Mbps/1.5 Mbps

Effectively that means that 36% of the households where broadband is available do not subscribe to fixed-line broadband.  If we use the 64% that subscribe to at least some form of fixed-line broadband offering we get:
37% - Less than 3 Mbps/768 kbps
63% - At Least 3 Mbps/768 kbps
43% - At Least 6 Mbps/1.5 Mbps
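The re-based numbers above are just the adoption tiers divided by the 64% of households that subscribe at all - my arithmetic, reproduced here:

```python
# Re-base the FCC adoption tiers against households that subscribe at all.
total = 64.0           # % of households with at least 768 kbps/200 kbps
at_least_3mbps = 40.4  # % with at least 3 Mbps/768 kbps
at_least_6mbps = 27.6  # % with at least 6 Mbps/1.5 Mbps

share_3mbps = round(at_least_3mbps / total * 100)  # 63
share_6mbps = round(at_least_6mbps / total * 100)  # 43
below_3mbps = 100 - share_3mbps                    # 37
```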

With WebPagetest's default 1.5 Mbps/768 kbps DSL profile falling in the bottom 37% of the population, it is probably sitting somewhere around the 75th percentile.  Time to increase it to something closer to the median (say, switch to the 5/1 Mbps Cable profile)?
I've generally been a fan of skewing lower because you will be making things faster for more of your users and you might be missing big problems if you don't test at the slower speeds but I'm open to being convinced otherwise.

Monday, June 25, 2012

WebPagetest Mobile Agents

Hot off of the Velocity Conference presses, iOS and Android 2.3 agents are now available from the Dulles, VA location on WebPagetest.

We have been working with Akamai on their recently open-sourced Mobitest agents and are happy to announce that the agents are now up and running and are also available for people to use on private instances.

Testing is limited to loading URLs and capturing video (and waterfalls) for now, with some of the more advanced testing capabilities that you are used to from WebPagetest coming soon.

Thursday, May 31, 2012

EC2 Performance

WebPagetest makes EC2 AMIs available for people to use for running private instances and makes fairly extensive use of them for the Page Speed Service comparison testing.  We have verified that the m1.small instances produce consistent results, but we aren't necessarily sure they are representative of real end-user machines, so I decided to do some testing and see.

This is a very specific test that is just looking to compare the raw CPU performance for web browsing (single threaded) of various EC2 instance sizes against physical machines.  It is not meant to be a browser comparison or a statement about EC2 performance beyond this very-specific use case.

Testing Methodology

I ran the SunSpider 0.9.1 benchmark 5 times on each of the different machines using Chrome 19 (it is important to keep the browser and version consistent since changes to the JavaScript engine will affect the results).


As you can see, the m1.small instances are significantly slower than desktop systems from the last 5 years or so, but somewhat faster than more recent low-end laptops.  Netbooks and tablets are significantly slower still, with times typically in the 1000+ range.


Unfortunately there isn't a clear-cut answer of what you should use to test if you are trying to test on a "representative" system because both the smaller and larger instances are representative of different ends of the computing spectrum.

My general feeling is that websites should not be CPU-constrained, and if they are then things will look exponentially worse on the tablets, Chromebooks, and other cheap devices that are starting to flood the market.  If you test using the larger instances then you will be testing on systems more representative of desktops, which might be a good thing to do if that is specifically what you are targeting.  The small instances are more likely to expose CPU constraints that will crop up in your user base (much like Twitter noticed).

Call for Help

The systems I tested were machines that I had easy access to but are probably not representative of a lot of systems people have at home.  If you could run SunSpider using Chrome 19 on any systems you have lying around and share the results as well as the system specs in the comments below I'll update the chart and see if we can build a more representative picture.

*update - chart has been updated with the user-submitted results, thank you

Tuesday, April 3, 2012

Anatomy of a MyBB Forum Hack

It has been an exciting 2 days.  Yesterday I discovered that over the weekend the forums at had been hacked and that someone had installed a back door.  I traced the source of the entry pretty quickly and locked out the exploit he had used, but I wanted to make sure he hadn't done anything more damaging while he was there, so I spent the last day poring over the access logs to trace back his activities.

The hack involved uploading a custom image file that was both a valid JPEG and contained PHP code that the PHP interpreter would execute, and then tricking the server into executing the image as if it were PHP.  I have the actual image as well as the command-and-control PHP he installed if anyone is interested (and by anyone I mean anyone I know who will do good things with it).

I thought it would be valuable to share (and somewhat entertaining) what I gleaned from the logs so here is the timeline of activities that I managed to piece together (all times are GMT):

March 23, 2012 

- Registered for account in the WebPagetest forums and uploaded executable profile pic
- used (presumably throw-away) yahoo mail account:
- from (also logged in to the account from but there has been no recent activity from that IP)

March 30, 2012 (probably automated bot/process)

08:48 - Back door is installed and first accessed (install method is highlighted later).  Hidden IFrame is added to the forums page.

Periodically - main page is loaded (presumably to check the status of the IFrame) (appears to be manual activity)

08:58 - Loads the main page (presumably checking the IFrame)

April 1, 2012

08:49 - Installs adobe.jar (unfortunately I deleted it and didn't keep a copy for analysis), presumably for distribution or more access (no Java on the server though so not much point)

09:16 - Accessed the installed adobe.jar (presumably testing to make sure it installed)

April 2, 2012


~14:00 - Observed unexpected requests loading and found the IFrame (and quickly deleted it)
17:58 - Tracked down the location of the code that was used to install the IFrame (and unfortunately deleted it in my panic)
18:16 - Secured the hole that was used to execute php in the uploads directory

April 3, 2012

05:54 - Checked the main page for the iframe
06:02 - attempted to access gs.php (the back door php code) (manual debugging/activity)

06:03 - Started probing to see what broke - attempted to access:

06:04 - Manually browsed the forums, presumably checking to see if everything was down or just his hack, and did some more probing:

/forums/images/on.gif/.php to see if the php interpreter hole was still open (was at the time but he couldn't get any code placed there to execute - this has since been closed)
/forums/uploads/avatars/tileeeee.html (404 - already cleaned up)

06:05 - Tried the avatar hack again:
06:06 - Tries other avatar files to see if php hack is blocked on uploads
/forums/uploads/avatars/avatar_1.jpg/.php (yep - 403)
06:07 - Tries other back door commands he had installed:
06:10 - More frustration:
06:14 - Went through the registration UI and actually registered to the forum again (throw-away)
06:21 - Activated his registration (switched to another IP to continue manual debugging/activity)

06:29 - Accesses forum using new registration (normal forum browsing, a couple of failed post attempts)
06:31 - Tries to access the admin control panel /forums/admincp (404)
06:35 - Logs out of the forum
06:38 - Tries manually loading various attachments with different attempts to obfuscate the path
06:42 - Tries (unsuccessfully) php execution for attachments /forums/attachment.php?aid=175/.php
06:44 - Tries the old avatar routine again /forums/uploads/avatars/avatar_9.jpg/.php
06:48 - Attempts various probings to see if any other extensions will potentially execute:
06:50 - Tries some of the old files again for some reason:
06:52 - Tries to use the avatar hack to download his payload again
06:54 - More futile attempts to probe the avatars directory and understand why things aren't working:
/forums/uploads/avatars/.php (ding, if he didn't know by now, nothing with .php anywhere inside of uploads will load)
06:55 - Seriously, he is expecting different results?
06:56 - He does some MORE poking around to see if the trailing .php hack is universally blocked (it is now)
07:00 - Last trace of access for today