Human bias in A/B testing

When we perform an A/B test, there are four possible outcomes: true positive, true negative, false positive, and false negative. The nomenclature itself causes some issues: a reader who is not familiar with A/B testing may conclude that the test “failed” if it did not give a true positive result.

Table of Contents

  1. The problem with true negative
  2. The problem with “corporate A/B testing”

The problem with true negative

Let’s begin by clarifying one thing. Even though we may be biased against a true negative, it is not a wrong result. It means we have succeeded in discovering that the control group does not differ from the treatment group.
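To make that concrete, here is a minimal sketch in Python of a test where the groups genuinely do not differ (it uses statsmodels; the 10% conversion rate and the group size are made-up numbers for illustration):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)

# Both groups truly convert at 10%: there is no real difference to find.
n = 5_000
control = rng.binomial(1, 0.10, n)
treatment = rng.binomial(1, 0.10, n)

stat, p_value = proportions_ztest(
    count=[control.sum(), treatment.sum()],
    nobs=[n, n],
)
print(f"p-value: {p_value:.3f}")
# A p-value above 0.05 is a true negative here: the test correctly
# reports that control and treatment do not differ.
```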

We can counter that bias by making it clear to everyone that the real purpose of an A/B test is to discover the truth, not to confirm that the new version of the website/product/whatever is better than the old one.

After all, in business, you can only make suggestions. Your clients make the decisions. They either like what you do, or they don’t.

The problem with “corporate A/B testing”

Unfortunately, I have seen A/B tests that were performed only to justify a decision that had already been made.

During such fake tests, many metrics were tracked to make sure that at least one of them gave a positive result (whether it was a true positive or a false positive did not matter). Naturally, that one metric was proclaimed the most important one, and success was announced. The trick works because of the multiple comparisons problem: the more metrics you compare, the more likely it is that at least one of them looks “significant” by pure chance, as the simulation below shows.
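Here is a rough sketch of that effect, with assumed numbers (20 metrics per test, α = 0.05, and no real difference between the groups):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_tests = 1_000   # simulated A/B tests
n_metrics = 20    # metrics tracked per test
n_users = 500     # users per group
alpha = 0.05

fake_wins = 0
for _ in range(n_tests):
    for _ in range(n_metrics):
        # No real effect: both groups come from the same distribution.
        a = rng.normal(0, 1, n_users)
        b = rng.normal(0, 1, n_users)
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            fake_wins += 1
            break  # one "significant" metric is enough to declare success

print(f"Tests with at least one 'significant' metric: {fake_wins / n_tests:.0%}")
# Close to 1 - 0.95**20, i.e. roughly 64% of these A/A tests
# can announce a false success on some metric.
```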

After all, if someone’s bonuses depend on the result of a test, the test is going to pass. For such people, the truth does not matter. They will declare success no matter what, even if their decision destroys the company in the long run.

The only way to prevent that is to explain what an “underpowered” test is and why we should wait until the end of the test before we start looking at the results (or, at the very least, before we make a decision based on those results).
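A test is underpowered when it involves too few users to reliably detect the effect we care about, so a negative result tells us almost nothing. One way to ground that conversation is to compute the required sample size before the test starts. A minimal sketch using statsmodels (the 10% baseline conversion rate, the one-percentage-point minimum detectable effect, and the conventional α = 0.05 with 80% power are placeholder assumptions):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Placeholder assumptions: 10% baseline conversion, and we want to
# detect an absolute lift of one percentage point (10% -> 11%).
effect_size = proportion_effectsize(0.10, 0.11)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # significance level
    power=0.8,    # probability of detecting the effect if it is real
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:,.0f}")
# Stopping earlier than this makes the test underpowered, and peeking
# at intermediate results inflates the false positive rate.
```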

We should also make sure that there is only one metric used to compare the results. We may track many of them, but the test should use only one, and we must decide which one it is before we begin collecting the data.
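In code, that discipline can be as simple as hard-coding the primary metric before any data arrive and running the statistical test on that metric alone. A sketch with hypothetical numbers (the metric names and counts below are made up):

```python
from statsmodels.stats.proportion import proportions_ztest

# Decided before the test started: conversion is the only decision metric.
PRIMARY_METRIC = "conversion"

# Hypothetical results as (successes, group size); the secondary metric
# is tracked for context but never used to declare success.
results = {
    "conversion": {"control": (510, 5000), "treatment": (565, 5000)},
    "click_rate": {"control": (1200, 5000), "treatment": (1190, 5000)},
}

control = results[PRIMARY_METRIC]["control"]
treatment = results[PRIMARY_METRIC]["treatment"]
_, p_value = proportions_ztest(
    count=[control[0], treatment[0]],
    nobs=[control[1], treatment[1]],
)
print(f"Primary metric '{PRIMARY_METRIC}': p-value = {p_value:.3f}")
# The go/no-go decision depends on this single, pre-registered metric.
```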

Obviously, that only works if you can impose such rules without negative consequences. If you work in a place where you may get fired for doing your job correctly, I can’t help you.
