Test Big or Go Home

Running lots of small split-tests made my team feel productive and gave us plenty to talk about. But it failed to answer the questions that mattered most.


On a bright winter afternoon last year, I looked out the conference room window. Clean, blue sky stretched over the wall of mid rise buildings across Manhattan’s 23rd Street. I listened passively as the hair-splitting arguments set in.

“Well…I’m not sure we can come to a conclusion on this test—it’s only been running for 12 days.” “Well, I’m not sure it’s the placement of the button that was bad—it might just be the copy we used in this version of it.” “Well, I’m not sure this new page design is what worked—it might just be the image we used in this version of it.”

I was the product manager on the growth team responsible for getting visitors to Betterment’s website to sign up and deposit money with us. We’d host a meeting every Friday afternoon to review split-tests in progress in the hope of calling them—that is, to determine whether recent changes we’d made to the website or signup flows were driving more users to open accounts and make deposits.

Fancying ourselves data-driven, we sought to guard against flawed observations that might lead us to a wrong conclusion. Our approach was rigorous but exhausting.

As I zoned out and looked at the sky, I thought about something Katherine Kornas, a product leader who had joined the company a few months earlier—and to whom I reported at the time—shared at a recent product team meeting.

“It’s not a failure when your split-test loses,” she’d said, “it’s a failure when your split-test is inconclusive.”

When you change your product in the hope of spurring signups and it actually drives signups down, she shared, you’ve at least learned what doesn’t work. When you change your product and nothing happens, you’ve learned almost nothing about users’ needs or behaviors.
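Her point has a statistical face, too. A conclusive call on a split-test usually comes down to something like a two-proportion z-test, and small changes tend to produce differences the test can’t distinguish from noise. The sketch below uses only Python’s standard library and hypothetical conversion numbers (a 2% baseline signup rate over 10,000 visitors per arm); none of the figures come from Betterment.

```python
import math

def z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns the z statistic and two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF, Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# A tiny tweak: 2.0% vs 2.1% conversion on 10,000 visitors per arm
z, p = z_test(200, 10_000, 210, 10_000)
print(f"small change: p = {p:.2f}")  # far above 0.05: inconclusive

# A big swing: 2.0% vs 2.6% on the same traffic
z, p = z_test(200, 10_000, 260, 10_000)
print(f"big change:   p = {p:.3f}")  # below 0.05: conclusive
```

The tiny tweak leaves you arguing in a conference room; the big swing gives you an answer either way.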

At that moment, learning about what users wanted was more important than ever.

We were still figuring out how best to draw site visitors into a new type of product. Smart Saver was an account that was filled with money market and government bond funds and designed to earn a modest but reliable return. Having spent years making our name as the place to invest for the long-term, Betterment had developed a product built for the kind of short-term dollars that one might otherwise store in a savings account. Smart Saver was the precursor to the cash management products we’d go on to launch in the ensuing year: Cash Reserve and Checking.

My team and I wondered how to feature Smart Saver in our signup flows and on our website. Should it sit front-and-center on the homepage? That would be great for those who arrived in search of that product, but it might confuse visitors who arrived in the hope of rolling over a 401(k). How could we talk about this particular product without obscuring the larger story of what Betterment was all about?

No one knew for sure, but Katherine had the expertise to guide us toward the answer.

“We need to think beyond the page,” she told me in one of our weekly one-on-one meetings. Katherine could sense that the split-testing approach I’d managed—with its small tests that changed only a single thing at a time—would never crack the bigger questions around how best to sell our burgeoning cash products. She urged me to think bigger, about entire flows, or even the broader swath of the user experience.

That was tough to hear. I was proud of running dozens of tests and reviewing them ruthlessly. It made me feel like the data-driven product manager I thought I was supposed to be. Even if running numerous tests led to a lot of arguments in our tedious review meetings, it also gave me a lot to talk about—in release notes, at team meetings, and in reports to the broader growth team. As I re-read my notes from the time, though, I can see that for all my talking, I shared little in the way of strategy-shaping insights.

Katherine was right to call for fewer, but better, tests. Instead of starting with an idea for a feature or design change, she said, I should start with a big question: What’s the thing we’re trying to learn? What change to the user’s experience—ideally across multiple stages of their journey—would teach it to us? How might our strategy for the coming months change if the test wins? What about if it loses?

The big test

We went on to test a new, vastly different experience for users who arrived at our site with an interest in Smart Saver. We sent them down a signup flow that was unique for its silence on our investing accounts. We wanted to avoid mentioning any product other than the one the user came for. When a customer finished signing up and arrived at their dashboard, we affirmed to them that they had opened a Smart Saver account and invited them to fund it.

This new approach earned strong results. Strong enough, in fact, to beat back the kind of hair-splitting questions that had arisen in test review meetings of the past. It established a pattern that Betterment still uses today. If a site visitor seems intent on opening a certain kind of account—say, Cash Reserve, an IRA or a general investing account—we draw them into a flow that focuses tightly on getting them settled in. We wait until later to tell them what else about Betterment they might enjoy.

A test is likely big “enough” if it earns buy-in from stakeholders that X result should indicate Y change in your approach. Imagine you’ve replaced your homepage hero image with a picture of a dog. As the test gets underway, you tell your colleagues from marketing, design and engineering that if the dog picture drives signups, you’ll display only pictures of dogs throughout your public-facing pages, your signup flows and even in-app. It would be easy for anyone in the room to raise doubts.

“Well, maybe dogs just work on the homepage, but not on other pages.” “Well, maybe it’s just because this dog is a Labrador, people might not like German shepherds.” “Well, maybe people just like four-legged animals. Can we try a cheetah?”

You’ll stand a better chance of winning the room if you suggest a more comprehensive test: placing pictures of various dogs throughout the user journey—not just in one spot. You might propose that, if the test drives not only sign-ups but also deeper levels of engagement, it will show that dogs are key to acquisition and retention. The key is to strike an agreement with your stakeholders in advance about what result you’re looking for and, more importantly, how it would change your strategy for the months to come.
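There’s a quantitative reason to favor the comprehensive test as well: the sample size a split-test needs grows roughly with the inverse square of the effect you’re trying to detect, so a change big enough to move the metric noticeably gets called far sooner. The sketch below uses the standard normal-approximation power formula and a hypothetical 2% baseline signup rate; the numbers are illustrative, not from the article.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(base_rate, lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm to detect an absolute lift in
    conversion rate, via the two-proportion normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # value for desired power
    p1, p2 = base_rate, base_rate + lift
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / lift ** 2
    return math.ceil(n)

# Hypothetical 2% baseline signup rate
print(sample_size_per_arm(0.02, 0.001))  # tiny tweak: hundreds of thousands per arm
print(sample_size_per_arm(0.02, 0.010))  # big swing: a few thousand per arm
```

A lift ten times larger needs roughly a hundredth of the traffic, which is why a bold, journey-wide change can settle in weeks what a button tweak might never settle at all.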

In thinking bigger about split-tests, my team and I learned how best to draw users into our service. Personally, I learned something even more important. While one should certainly be afraid of drawing the wrong conclusion from a change to the user experience, one must also beware of making changes that are so small as to yield no conclusion at all.