
The Mistake That Almost Half of Product Managers Make

Nov 21, 2022

Nobody knows better than product managers that user behavior is a fickle thing. Whatever kind of product it is that you’re building, it’s a given that users will behave in unexpected ways: they’ll fall prey to cognitive biases, contradict their own stated preferences, and so on. That’s why data and user testing are the bread and butter of product design.

But after countless engagements spent infusing behavioral insights into digital products, I’ve noticed that product managers themselves are just as prone to cognitive errors. 

In fact, when TDL looked into this question recently, we found that four out of ten product managers make the same irrational choice about how to interpret user testing data.

What choice would that be? Read on…

Imagine this

We recently conducted an informal poll on our LinkedIn page, to gauge how product managers would respond to the following hypothetical scenario:

“Imagine you and your team have just launched v1 of a product. Initial development was a whirlwind — you were sprinting to make sure the product would ship on time, and weren't able to do all the testing you would have liked. Now your team is considering making a design change, and you do some quick user testing to see whether it makes a significant difference for your KPIs. 

When you've collected all your data, you run the statistical analyses and find that the design change you're considering clocks in at a p-value of 0.3. In other words, it's not a significant effect at 95% confidence, but there's a positive trend. What do you do?”

Respondents could choose from three options:

  1. Decide not to implement the change
  2. Implement the change anyways
  3. Something else

In the end, we received 90 responses to our survey.

Results

We found that ~40% of respondents chose an action that I would deem irrational. As a recovering academic turned practitioner, I find this situation a perennial thorn in my side: time and time again, I've seen people instinctively behave irrationally when they encounter it.

But before we dive into either my rant or our results, let’s first take a step back and review what p-values, as well as the broader concept of statistical significance, actually mean.

What are p-values?

As a general description, if we find that a result is statistically significant, we are, at a particular level of confidence, ruling out random chance as the driver of an interesting observation and instead attributing it to our intervention (e.g. a drug, a message, or a restructured landing page). In other words, we aren’t communicating certainty; we’re simply saying that if, for example, p<0.05 in an A/B test, we’re more than 95% sure that one option is superior to another.

Let’s also reflect on what it means if a result is not deemed statistically significant. Think about the following: if we run an experiment and miss our p<0.05 cutoff, what does that mean?

Does it mean that we are, at some other level of certainty, confident that one option is worse than the other? 

Of course not! Coming in below our target level of confidence means just that: we are less confident in the superiority of one option over another. On its own, it doesn’t mean that we think one option is worse, just that we lack sufficient confidence to label anything better.
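To make this concrete, here is a minimal sketch of how such a p-value might be computed for a hypothetical A/B test, using a standard two-proportion z-test. The conversion counts are invented purely for illustration.

```python
# A minimal sketch of a two-proportion z-test for a hypothetical A/B test.
# The conversion counts below are invented purely for illustration.
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (observed lift, two-sided p-value) for H0: equal conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return p_b - p_a, 2 * norm.sf(abs(z))

lift, p = two_proportion_z_test(conv_a=120, n_a=1000, conv_b=138, n_b=1000)
print(f"lift = {lift:+.3f}, p = {p:.3f}")   # positive lift, but p > 0.05
```

With these made-up numbers, the variant shows a positive lift but misses the conventional 0.05 cutoff: exactly the kind of result we're talking about here.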

Statistical significance in academia

Now let’s move on to the next piece of the puzzle. What’s so special about p<0.05? Why do we care about 95% confidence?

Here, context matters. If I am an academic researcher, I have (hopefully) dedicated my life to the pursuit of something resembling truth. My area of focus may be anything from the mating behaviors of African rain frogs (They are very cute. Do yourself a favor and Google them — ideally when not mating) to two-sided competition in online shopping environments.

Regardless, my purpose is ultimately to demonstrate something novel and interesting about the world and attribute it to something other than chance.

So, is there a divine scholarly text that posits that 95% confidence is the one true benchmark for academic certainty? Of course not! We could just as easily, and almost as justifiably, have selected 96% or 94% as our gold standard. With the above goal in mind, however, it follows that we want to be very particular about which results we are willing to accept, and 95% confidence sounds pretty reasonable.

Now, for the purpose of the rest of this article, let’s take the following for granted: In the context of pure scientific research, high thresholds for certainty make a considerable amount of sense. In practice, this is more controversial than you might assume, but that’s a conversation for another day!1

To summarize

  • Missing a confidence threshold does not mean one option is worse. It means that we are not sufficiently confident that one option is superior.
  • 95% confidence is ultimately arbitrary, but makes quite a bit of sense if you’re hunting for truth.

Why product managers should care less about p-values**

** p<.01

Now, let’s move away from researchers and shift our focus to practitioners: specifically, product managers. Is the goal of a product manager to search for some objective truth? I’d argue that it isn’t. That may be an aspiration, but ultimately a product manager is responsible for ensuring that the product team’s efforts translate into a product that better meets strategic goals and satisfies user needs (or whatever your preferred definition is, so long as we can agree that academics and PMs have different goals).

Let’s revisit our hypothetical: you, the product manager, tested a change, saw a trend of improved performance, but ultimately found that p = 0.3. This means you’re roughly 70% confident that the improvement you observed is the result of your change, and not random chance.

In our survey, we saw the following breakdown of responses.

Survey breakdown: Roughly 38% indicated they would not ship the change, while 42% indicated that they would.

Roughly 38% indicated they would not ship the change, while 42% indicated that they would. So, who is right? Is anyone?

Let’s break this down as follows.

  • We have two options:
    ◦ Implement (ship) the change
    ◦ Don’t implement it
  • We have two possible statistical determinations, each with an associated probability:
    ◦ The change genuinely improves the product (in our scenario, ~70% confidence)
    ◦ The change makes no real difference (~30%)

We can thus visualize possible outcomes as follows:

If your responsibility is to ensure that you’re shipping the best possible version of your product, should we implement the change, even though we missed the supposedly magic threshold of p<0.05? I argue that the correct answer is an emphatic and enthusiastic yes. So long as we have a positive trend, the only harmful outcome in our matrix occurs if we don’t ship. If we ship, then the worst case scenario is that we’re left with a product indistinguishable from what we had prior.
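Spelling that matrix out as a simple lookup (a minimal sketch; the labels are just my shorthand for the outcomes described above):

```python
# The four quadrants: our decision (ship or not) crossed with whether the
# change is genuinely better. Labels are illustrative shorthand only.
outcomes = {
    ("ship",       "change is better"):    "improved product",
    ("ship",       "no real difference"):  "product indistinguishable from before",
    ("don't ship", "change is better"):    "missed opportunity (the one harmful cell)",
    ("don't ship", "no real difference"):  "status quo, nothing lost",
}

for (decision, reality), result in outcomes.items():
    print(f"{decision:10} | {reality:18} -> {result}")
```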

To illustrate another point, let me now scale this diagram to reflect our level of confidence in the possible outcomes.

The above visual illustrates that in practice, given a confidence level of 70%, the most likely results are that we either improved our product or missed an opportunity to do so. In this case, of course we still want to ship! 

What if we’re only 30% confident? What happens then?

The makeup of possible product implications is unchanged. All that has changed is the probability that we land in any one particular quadrant. This reinforces the idea that, so long as we have a positive trend, the only choice that we can anticipate would lead to a harmful outcome (in this case, a missed opportunity) is to not ship the change.
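As a toy illustration of that logic, treat our confidence as the probability that the change is genuinely better, and assign invented payoffs of +1 for an improvement and 0 otherwise (a sketch under those assumptions, nothing more):

```python
# Toy expected-value comparison. `confidence` is treated as the probability the
# change is genuinely better; payoffs (+1 improvement, 0 otherwise) are invented.
def expected_payoff(ship: bool, confidence: float) -> float:
    if ship:
        # Improved product (+1) with prob. `confidence`; indistinguishable product (0) otherwise.
        return confidence * 1.0 + (1 - confidence) * 0.0
    # Not shipping keeps the status quo either way: payoff 0
    # (the "change is better" case becomes a missed opportunity).
    return 0.0

for confidence in (0.7, 0.3):
    print(f"confidence={confidence:.0%}: ship={expected_payoff(True, confidence):.2f}, "
          f"don't ship={expected_payoff(False, confidence):.2f}")
```

With implementation cost assumed to be zero, shipping comes out ahead at any positive level of confidence, which brings us to the caveat below.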

I’ll reel my excitement in a bit by conceding that I’m not factoring in other crucial considerations (e.g. opportunity cost of allocating development resources to a low-confidence change), but I think the point stands that p-values in and of themselves are never sufficient to disqualify a change that generates a positive trend. 

Now, obviously the change could be worse, and our data is just insufficient to demonstrate that. As practitioners we are often faced with information that is incomplete or of lesser quality than we might like. In those cases, our pulse survey suggests that 38% of product managers may feel like the prudent thing to do is nothing, but that is irrational. It is our responsibility to make the best decision we can with the information that we’re given.

Why do we get stuck on p-values?

So why do we do this? One way we can understand this behavior is through the lens of the anchoring bias. This is our tendency to rely too much on one piece of information when making a decision, using it as a reference point against which to judge all subsequent data points. Research has shown that we can even become anchored to numbers that have nothing to do with the decision we’re trying to make. 

When it comes to user testing, we can think of that p-value of 0.05 as an unhelpful anchor. If we set aside the concept of statistical significance and focus on what is most likely to yield the best outcome for our product, a p-value of something like 0.7 should be more than enough to justify shipping a change. But when product managers are fixated on clearing the arbitrary threshold of 95% confidence, they end up missing the bigger picture.

Unanchoring ourselves

Though we can’t completely avoid the anchoring bias, there are a number of evidence-based strategies we can use to help diminish it. Research shows that pausing to assess how relevant an anchor really is to the situation at hand2 and contemplating other options that may work better3 can help reduce the effects of this bias. 

Before even launching a user test, product teams should meet to discuss what thresholds of statistical significance they believe should be met before moving ahead with a change, and establish a plan for how to proceed if a proposed change falls short of those benchmarks. These targets could vary wildly depending on things like sample size, the level of effort required to implement a change, anticipated effects on the user experience, and so on. 

Ultimately, what’s important is that teams don’t attach themselves to standards that are unrealistic or unhelpful in practice, and instead come to a shared understanding of what would constitute a meaningful result in their particular context. 
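Such an agreement might be as simple as writing down something like this before any data comes in (a hypothetical sketch; every metric name and threshold here is invented):

```python
# A hypothetical decision plan a team might agree on before running the test.
# All metric names and thresholds are invented for illustration.
decision_plan = {
    "metric": "signup_conversion",
    "minimum_sample_size": 2000,
    "lift_worth_acting_on": 0.01,          # absolute lift in conversion rate
    "ship_if": "positive lift and p < 0.20",
    "discuss_further_if": "positive lift and p >= 0.20",
    "do_not_ship_if": "negative lift",
}
```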

A significant change

I realize that to some readers, it may sound like I am arguing against maintaining a certain standard of scientific rigor in product design. That’s true in a sense — but only because I believe that it is unrealistic and a little silly to import the standards of academic research into an applied environment. 

If product managers have reasonable certainty that implementing a change will improve user outcomes, and they have no reason to believe said change will do harm, we should not throw away a perfectly good idea just because it fell short of some arbitrary benchmark. 

After all, in our hypothetical scenario above, the original version of the product was put together in a rush and shipped with minimal user testing. Even if v2 isn’t perfect, if it’s backed up by more data than v1, odds are that it’s better. In fact, they’re significantly better.

References

  1. Cowles, M., & Davis, C. (1982). On the origins of the .05 level of statistical significance. American Psychologist, 37(5), 553-558. https://doi.org/10.1037/0003-066x.37.5.553
  2. Mussweiler, T., Strack, F., & Pfeiffer, T. (2000). Overcoming the Inevitable Anchoring Effect: Considering the Opposite Compensates for Selective Accessibility. Personality and Social Psychology Bulletin, 26(9), 1142–1150. https://doi.org/10.1177/01461672002611010
  3. Zenko, M. (2018, October 19). Leaders Can Make Really Dumb Decisions. This Exercise Can Fix That. Fortune. https://fortune.com/2018/10/19/red-teams-decision-making-leadership/amp/ 
