Really nice post. A bit long, but a great view of A/B testing and how to look at business cycle impact on tests. This gets forgotten in most discussions.
Originally posted on The Daily Egg: http://blog.crazyegg.com/2016/03/16/ab-test-business-cycle/?awt_l=PRw8c&awt_m=3YAOKQaAx8zJUPA
by Daniel Sims
“I read the news today, oh boy” John Lennon famously wrote almost 50 years ago for the classic Beatles song, A Day in the Life.
In those days, there were no smartphones, websites or blogs from which to get your news, only the good old fashioned paper copy that is becoming a rarity.
When you are “reading the news” each day on your A/B testing or other website testing, you will want to read as much news as you can, in as many ways as you can and for as long as you need to.
But how long do you need to?
To answer that question, you need to thoroughly understand your own individual business cycle. Without that knowledge, there is no way to guarantee that your results will be accurate and free from bias.
What Is A Business Cycle?
In the world of macroeconomics, a business cycle is defined as the variation of an economic system (such as the U.S. economy) over time. Economists have observed that both the global and national economies follow a distinct pattern of expansion, peak, contraction, and trough that has consistently repeated itself over many generations.
How does this impact my website CRO?
On a smaller and more individual scale, you need to understand your own customer behavior over time. This means studying the ups, downs, ebb and flow that this behavior is displaying, and capturing a representative slice of that pattern during the period of time you are performing your A/B testing.
Does this simply mean running your test for a full week? In some cases, yes, but in most others it depends on your product, your traffic volume and your customer demographics, among other things.
To take a closer look at how this works, let’s review a typical (albeit hypothetical) “week in the life” of an A/B test and examine some of the scenarios we might encounter.
Day 1. Let The Testing Begin
Months after your website launch, with analytics data in hand, you have observed a somewhat alarming bounce rate on your order page and have designed a clever variation of your “buy now” button that you are anxious to test.
Of course, you have already studied your traffic volume and calculated the sample size required to run the A/B test…right?
If not, “When Are You Ready For A/B Testing?” provides a quick reference with an example of how to determine your necessary traffic, confidence level and sample size.
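As a rough illustration of that calculation, here is a minimal sketch of one common sample size formula for comparing two conversion rates with a two-sided z-test. The function name, the 10% baseline, the 12% target, and the defaults of 5% significance and 80% power are all assumptions made for the example, not figures from the referenced article.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variant (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for the significance level
    z_beta = NormalDist().inv_cdf(power)           # cutoff for the desired power
    pooled = (baseline_rate + expected_rate) / 2
    numerator = (
        z_alpha * math.sqrt(2 * pooled * (1 - pooled))
        + z_beta * math.sqrt(
            baseline_rate * (1 - baseline_rate)
            + expected_rate * (1 - expected_rate)
        )
    ) ** 2
    return math.ceil(numerator / (expected_rate - baseline_rate) ** 2)


# Hypothetical numbers: a 10% baseline conversion rate and a hoped-for 12%.
visitors_needed = sample_size_per_variant(0.10, 0.12)
print(f"Visitors needed per variant: {visitors_needed}")
```

Note how quickly the requirement grows as the difference you want to detect shrinks: halving the expected lift roughly quadruples the traffic you need.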
Using your calculated sample size as a guide, you launch your test, splitting your traffic evenly between options A and B, and decide to track the results each day to see how the test is progressing.
With that, the clock is officially ticking on your business cycle…. Good Luck!
Day 2. Blue Monday
Nervous with anticipation and wondering how the A/B test is going, you can barely concentrate on your meetings, phone calls or even Junior’s soccer game. You just can’t wait to review your analytics and see what the early trends are showing.
When you finally get around to reviewing the Day 1 results, you are shocked. Not only was the overall traffic lower than expected, your conversion rate on the “B” side (red button) was a big goose egg. Zero? How could this be? Were you wrong about the enhancement? Should you stop the experiment and start over?
Before you change or stop anything, keep in mind that early results are just that – early. That means you almost certainly have not tested an adequate sample size, and these results could just be an anomaly. More on this later.
Just to be sure everything is in order, you might want to spend some time investigating whether or not your test is splitting the traffic equally. If you are not able to do this yourself, you might consult an expert for assistance.
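One way to check the split yourself is to ask whether the visitor counts in A and B differ by more than chance alone would explain. The sketch below uses a simple two-sided z-test against the intended 50/50 split; the function name and the visitor counts are hypothetical, and your testing tool may offer its own check.

```python
import math


def split_check(visitors_a, visitors_b, expected_share_a=0.5):
    """Return (z, p_value) for the observed split versus the intended split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    std_dev = math.sqrt(total * expected_share_a * (1 - expected_share_a))
    z = (visitors_a - expected_a) / std_dev
    # Two-sided p-value from the normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical Day 1 counts: a small imbalance that chance easily explains...
_, p_small_gap = split_check(510, 490)
# ...and a large imbalance that suggests the split itself is broken.
_, p_large_gap = split_check(600, 400)
print(f"510 vs 490: p = {p_small_gap:.3f}")
print(f"600 vs 400: p = {p_large_gap:.6f}")
```

A very small p-value here says nothing about which button is better. It only suggests the traffic is not being divided the way you intended.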
Assuming the test was indeed splitting correctly, you should not change anything yet. The business cycle has just begun, and business cycles tend to resemble roller coasters more often than bullet trains!
Day 3. Super Tuesday
Crisis averted: Your daily traffic level has returned to normal, and you are even seeing a higher conversion rate on the new “buy now” button vs. the original. You also notice something highly unusual: your daytime traffic is much higher than your evening traffic. This is the complete opposite of your typical pattern, since most of your customers are working people who log on to buy during the afternoon and evening hours.
Is this part of your “normal” business cycle? Here’s another way to look at it.
Advanced manufacturers like Toyota, Apple and HP use something called Statistical Process Control to track the performance of important indicators on every product (and piece of a product) they build. This usually means using Control Charts to visually review your performance.
One interesting thing about control charts is that the control limits are not the specification limits or the pass/fail limits for what you are monitoring. They are actually the limits of “expected behavior” based on the patterns you have seen in the past and are the lines within which you would expect your results to continue in the future. That’s why it is no coincidence that a control chart can sometimes resemble the roller coaster ride of a business cycle graph.
When something approaches or falls outside of the control chart limits, there are two potential reasons:
- Common Cause Variation is the normal variation you would expect to see based on the natural patterns of the past.
- Special Cause Variation means something unusual and unexpected happened that threw things out of whack, like the proverbial “fly in the ointment” that escaped the controls you had in place.
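To make the idea of control limits concrete, here is a minimal sketch of a proportion control chart (a "p-chart") applied to a daily conversion rate. It assumes the conventional three-sigma limits around a center line computed from past data; the function names and all of the visitor and conversion counts are hypothetical.

```python
import math


def center_line(history):
    """Overall conversion rate across past days, given (visitors, conversions) pairs."""
    total_visitors = sum(visitors for visitors, _ in history)
    total_conversions = sum(conversions for _, conversions in history)
    return total_conversions / total_visitors


def control_limits(center, visitors):
    """Three-sigma limits of expected behavior for a day with this many visitors."""
    sigma = math.sqrt(center * (1 - center) / visitors)
    return max(0.0, center - 3 * sigma), min(1.0, center + 3 * sigma)


def outside_limits(center, visitors, conversions):
    """True when a day's conversion rate falls outside the control limits."""
    lower, upper = control_limits(center, visitors)
    rate = conversions / visitors
    return rate < lower or rate > upper


# Hypothetical history used to establish "expected behavior".
past_days = [(1000, 50), (1100, 57), (950, 46), (1020, 52)]
center = center_line(past_days)

print(outside_limits(center, 1000, 48))  # an ordinary-looking day
print(outside_limits(center, 400, 40))   # a day worth investigating
```

A point outside the limits does not tell you what happened. It is only a prompt to go looking for a special cause.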
How does this relate to our A/B test example?
Let’s say that later on that same Tuesday evening, you found out that the Presidential debate had pulled record TV ratings while pulling your customer base away from their computers. This would be similar to special cause variation, since it is not part of a typical day or typical business cycle. Does this mean your Tuesday results are invalid? Absolutely not, but when it comes time to define the beginning and end of your business cycle, you need to keep in mind that this particular day was not part of the “usual” pattern.
Day 4. And The Winner Is?
On Wednesday, the tide shifts again. Your original order button is beating the new one by a 2-to-1 ratio. You try using a significance calculator, which shows that “B” is not statistically better than “A”, based on the results so far. Why even bother continuing the test? Isn’t it time to go back to the drawing board and design something new? Once again, the answer is No.
For starters, you have not even come close to completing a full business cycle, with all the variation that entails, and have not had enough traffic yet to fulfill your original sample size requirements.
Does this really matter? This time, the answer is Yes, and the reason is statistical power.
In any hypothesis test (including A/B tests), what is called a Type 2 error happens when you conclude that your enhancement (B) caused no improvement, but it actually did. What can most often lead to this false conclusion is an underpowered test.
The power of your test is defined as the probability of not making a Type 2 error, and therefore of correctly identifying an actual effect when there is one. Power can range from 0% to 100%, with 100% being the best possible. This is somewhat related to, but different from, the p-value, which tells you how likely you would be to see results at least as extreme as yours if the null hypothesis were true (i.e. if there were really no effect).
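For readers who want to see what a significance calculator is doing, here is a minimal sketch of a two-proportion z-test, one common way to get a p-value for an A/B test. The function name and the mid-week counts are hypothetical, and real calculators may use a different test.

```python
import math


def two_proportion_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, p_value) for the difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate: the best single estimate if A and B truly perform the same.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 0.0, 1.0  # no conversions (or all conversions): nothing to compare
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value


# Hypothetical mid-week totals: A is converting at twice the rate of B.
z_score, p_value = two_proportion_test(12, 150, 6, 150)
print(f"z = {z_score:.2f}, p = {p_value:.3f}")
```

With samples this small, even a 2-to-1 gap produces a p-value well above the usual 0.05 threshold, which is why a lopsided early lead is not yet evidence of a real difference.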
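Three things influence the statistical power: your sample size, the size of the effect (change) you are trying to discern, and the significance level you choose.
For a given significance level, the statistical power can be calculated based on your actual sample size and the size of the effect.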
For example, if you ran an A/B test on a sample size of 10 million customers, and you wanted to be able to discern an effect of +5% or greater, your power would be very high.
If you ran the same test on a sample of 100 customers and wanted to discern a +1% increase, your power would be low.
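The two extremes above can be checked with a minimal sketch of an approximate power calculation for a two-sided two-proportion z-test. Several things here are assumptions added for illustration: a 10% baseline conversion rate, reading "+5%" and "+1%" as percentage points, an even split of the sample between A and B, and a 5% significance level.

```python
import math
from statistics import NormalDist


def approximate_power(baseline_rate, variant_rate, visitors_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    n = visitors_per_variant
    effect = abs(variant_rate - baseline_rate)
    pooled = (baseline_rate + variant_rate) / 2
    # Standard error if there were no real difference (sets the significance cutoff).
    se_null = math.sqrt(2 * pooled * (1 - pooled) / n)
    # Standard error when the assumed rates are the true ones.
    se_alt = math.sqrt(
        (baseline_rate * (1 - baseline_rate) + variant_rate * (1 - variant_rate)) / n
    )
    normal = NormalDist()
    cutoff = normal.inv_cdf(1 - alpha / 2) * se_null
    # Probability the observed difference lands beyond the cutoff in either direction.
    return normal.cdf((effect - cutoff) / se_alt) + normal.cdf((-effect - cutoff) / se_alt)


# 10 million customers split evenly, looking for 10% -> 15%.
huge_test = approximate_power(0.10, 0.15, 5_000_000)
# 100 customers split evenly, looking for 10% -> 11%.
tiny_test = approximate_power(0.10, 0.11, 50)
print(f"Large sample, large effect: {huge_test:.1%}")
print(f"Small sample, small effect: {tiny_test:.1%}")
```

The normal approximation is rough for very small samples, so treat the small-sample figure as an indication that power is poor rather than as a precise number.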
Day 5. The Brainstorm
The next morning, you stop for a cappuccino at your local coffee shop and read an interesting article about the psychological effects of colors.
“Red… How could I have made my new order button red?! Everyone knows red means STOP and green means GO. I need to change my test!”
Once again, patience is the word of the day, and it is probably best to let your test continue as planned.
However, if you truly can’t wait to get new test ideas into the mix, want to test continuously, or are concerned about expedience, such as a time-sensitive A/B test related to a promotional or holiday feature, Bandit testing may be something to look into.
The name “Bandit” comes from the old “One Armed Bandit” nickname for casino slot machines. A common slot-playing strategy is to spend most of your time on the machine that is running hot (paying out), but also play one or two other machines for a small percentage of the time, just in case one of them suddenly becomes the new hot machine.
In website Bandit testing, you take whichever test option is converting the best at a given time, then send most (around 80%) of your traffic to this option, while A/B testing other options at a lower (20%) level. The idea is to maximize your conversion rate by favoring the perceived best option, while still gathering data for other options simultaneously.
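The 80/20 rule described above resembles what is often called an epsilon-greedy strategy. Here is a minimal sketch of how traffic might be assigned under that assumption; the function names, the running totals, and the fixed 20% exploration share are hypothetical, and real Bandit tools generally use more sophisticated allocation rules.

```python
import random


def conversion_rate(visitors, conversions):
    """Observed conversion rate, treating an option with no traffic as 0."""
    return conversions / visitors if visitors else 0.0


def choose_variant(stats, explore_share=0.20, rng=random):
    """Pick the current leader most of the time, another option otherwise.

    stats maps each option name to a (visitors, conversions) pair.
    """
    leader = max(stats, key=lambda name: conversion_rate(*stats[name]))
    others = [name for name in stats if name != leader]
    if others and rng.random() < explore_share:
        return rng.choice(others)  # keep gathering data on the trailing options
    return leader                  # favor the perceived best option


# Hypothetical running totals: B is currently converting better than A.
running_totals = {"A": (500, 40), "B": (500, 55)}
rng = random.Random(42)  # seeded so the example is repeatable
picks = [choose_variant(running_totals, rng=rng) for _ in range(10_000)]
share_to_leader = picks.count("B") / len(picks)
print(f"Share of traffic sent to the current leader: {share_to_leader:.1%}")
```

In a live test the running totals would be updated after every visitor, so the leader can change as the data comes in.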
The downside of this method is that you have not given all of your options an equal chance to perform over the course of your business cycle, and the option that was leading early on may not be the true top performer. Despite the ingenious premise behind this testing method, be aware if you decide to try it that your true best option could get lost in the shuffle because it never received a large enough sample to be tested with adequate power.
Day 6. Friday Night Lights
Experts have observed a number of trends in web-based customer behavior related to the onset of the weekend. Many of these trends are related to mobility, including the shift from home-based platforms to phones and tablets during the weekend days.
Chances are you may also see a shift in your A/B test results as the weekend begins, since the user interface indirectly affects all other user behavior.
No worries. As long as the weekend is represented proportionally in your business cycle, this variation will be represented in your results as well. When you define your business cycle beginning and end, remember that an accurate sampling period will need to include an equal number of work weeks and weekends.
Day 7. Saturday Surprise
Wow! Here you are on the last day of your first test week, and suddenly everything changes. First, you see your traffic and conversions go through the roof, almost equaling what you had done over the previous six days combined. Also, you are now seeing the B (red button) option surge back into the lead by a sizable margin.
When you analyze your results over the complete business cycle, you are looking at total visitors and total conversions, regardless of the day. That means the Saturday results will carry more weight in assessing your business cycle than the other days.
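A small arithmetic sketch shows how much one heavy day can dominate the totals. The daily figures below are hypothetical, chosen so that Saturday's traffic equals the previous six days combined.

```python
def pooled_rate(daily_results):
    """Conversion rate from total conversions over total visitors."""
    total_visitors = sum(visitors for visitors, _ in daily_results)
    total_conversions = sum(conversions for _, conversions in daily_results)
    return total_conversions / total_visitors


def mean_of_daily_rates(daily_results):
    """Simple average of each day's rate, which treats every day equally."""
    rates = [conversions / visitors for visitors, conversions in daily_results]
    return sum(rates) / len(rates)


# Hypothetical week for one variant, as (visitors, conversions) per day:
# six quiet days at 5%, then a Saturday with six times the traffic at 10%.
week = [(100, 5)] * 6 + [(600, 60)]

saturday_share = week[-1][0] / sum(visitors for visitors, _ in week)
print(f"Saturday share of all visitors: {saturday_share:.0%}")
print(f"Pooled conversion rate:         {pooled_rate(week):.2%}")
print(f"Average of daily rates:         {mean_of_daily_rates(week):.2%}")
```

The pooled rate sits much closer to Saturday's rate than the simple daily average does, which is exactly why an unusual high-traffic day deserves a closer look.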
But was this a typical Saturday or an anomaly? Was there something else happening this particular day that influenced the results? When you analyze your business cycle, you need to take a close look at this.
With this added emphasis and attention on studying business cycle patterns, you will eventually begin to identify and track trends and anomalies, and know whether or not your normal business cycle has been altered, as in our control chart example.
Day 8. Groundhog Day
In the 1993 movie, Groundhog Day, Bill Murray’s character, Phil, experiences the same day over and over again, with only minor differences. As time goes on and the pattern repeats, Phil learns how to use this dilemma to his advantage by anticipating events and preparing himself to take maximum advantage of them.
You can eventually gain the same kind of advantage by studying your A/B test business cycle. But first, you need to compare Day 8 to Day 1 results to see if they look like a repeat, a sequel, or a different movie altogether. If the latter is true, that is a pretty good signal that your business cycle isn’t consistent from week to week.
Unfortunately, it also signals that you should continue running the A/B test and studying the cycle for at least another week, since one week may not be truly representative of the full cycle. This is true whether or not you have reached your target sample size.
The Week In Review
In just one short week, you can gain a huge amount of insight into your business cycle, if you take a closer look at what the data is telling you.
In our example, we saw the new “buy now” button start slow, then climb the charts with a bullet. We saw the tide shift back towards the original “buy now” button mid-week, considered changing course based on new information, then saw the ‘B’ option surge ahead again over the weekend.
Was this really a complete business cycle? You should consider some of these observations:
- Your Tuesday results were atypical, since an outside factor had a huge impact on your results.
- Your Saturday traffic was greater than all other days, which was unusual based on your past analytics.
- The traffic and A vs. B conversion rate data on the first Sunday was not similar to the second Sunday.
All of these observations point to the conclusion that your business cycle was incomplete. Even if you met your original sample size target, you should consider running the testing for at least another week. You may continue to see quite a bit of day-to-day variation in the following week, but this will only help you to distinguish between common cause and special cause in the future.
One of the most useful byproducts of analytics data is a better understanding of your business cycle. As we have seen, there is no such thing as a “typical” week. Studying trends and variation over time will help you determine the most representative sampling period for your A/B test: long enough to accurately capture the patterns and variation of customer behavior, but not so long as to dilute your results with repeat visitors or lack of relevance.
Despite the increased development and use of tools such as Bandit testing, which can help you get an answer faster, or even search for new answers continuously, there is no substitute for the statistical power of head-to-head A/B testing when it comes to declaring the true winner and loser.
Maybe a business cycle is less like A Day in the Life and more like The Long and Winding Road after all.