Statistical power is an integral part of A/B testing, and in this article you will learn what it is and how it is applied in A/B testing.
A/B testing, in a nutshell, compares two versions of an application using hypothesis testing and statistical power, and tells us which version performs better. Statistical power and A/B testing have many practical applications in our daily lives: clinical trials, advertisements, product manufacturing, and more. In today’s article, we’ll use website design as our example. Feel free to dive in here to explore some exciting trends in A/B testing.
Do you ever feel like you need to conduct A/B testing to improve your website design but aren’t sure whether the results you’re getting are statistically significant? Or do your results just seem too good to be true? Well, you’re not alone. Often, A/B testing gives us results that are nothing more than illusory. In such cases, it’s crucial to understand the statistical power, or ‘sensitivity’, of a test.
So, let’s start without any further ado.
What Is Statistical Power?
A study’s statistical power, sometimes called sensitivity, is the probability that it distinguishes an actual effect from one caused by chance. In other words, it is the probability that the test correctly rejects a false null hypothesis. A study with 60% power, for example, has a 60% chance of detecting an effect that is really there.
An educated assumption about something in the world around you is referred to as a hypothesis. It should be able to be put to the test, either through experiment or observation. In statistics, hypothesis testing is a method of determining whether or not the results of a survey or experiment are relevant.
Type I And Type II Errors – How Are They Relevant?
Before we move any further, let’s go through Type I and Type II errors, since understanding them is crucial to fully understanding the power of statistical tests.
There are two types of errors that practitioners should be concerned about in null-hypothesis statistical testing — the approach most commonly used in A/B tests. A Type I error is the probability that the test procedure incorrectly rejects a true null hypothesis. A Type II error is the probability that the test procedure fails to reject a false null hypothesis.
A Type I error, often known as a false positive, occurs when you reject a null hypothesis that is actually true: your test detects a difference between variations that does not exist in reality. The apparent discrepancy between the test and control treatments is deceptive and due to chance or error. The probability of a Type I error is the significance level of your A/B test, denoted by the Greek letter alpha (α). When you test at a 90% confidence level, you have a 10% chance of making a Type I error. Alpha and beta are inversely related: as one is reduced, the other increases.
The statistical power of your test is reduced when you lower your alpha. The critical region shrinks as alpha decreases, and a smaller critical region means a lower probability of rejecting the null, hence a reduced power level. If you need extra power, increasing your alpha is one option.
A Type II error, often known as a false negative, occurs when you fail to reject a null hypothesis that is actually false. When your test fails to detect a genuine improvement over your existing variation, you have made a Type II error. The probability of committing a Type II error is beta (β), and statistical power is its complement (1 − β). Managing beta is about limiting the risk of false negatives: we want to keep the danger of Type I errors to a minimum while still having enough power to identify improvements when the test treatments are genuinely better.
An adequately powered test increases the likelihood of detecting a real improvement. If your test is underpowered, you run an unacceptably high risk of failing to reject a false null hypothesis. If the probability of making a Type II error is 20%, then your power level is 80% (1.0 − 0.2 = 0.8). At 90% and 95% power levels, you reduce the chance of a false negative to 10% and 5%, respectively.
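The relationship between alpha, sample size, and power can be sketched numerically. The function below is a minimal, standard-library-only approximation of the power of a two-sided two-proportion z-test; the conversion rates (10% baseline, 12% variant) and per-group sample sizes in the usage note are made-up numbers for illustration:

```python
from statistics import NormalDist

def ab_power(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    for detecting a change from baseline rate p_a to rate p_b."""
    nd = NormalDist()
    # critical z-value for the chosen significance level (two-sided)
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # standard error of the difference in proportions under the alternative
    se = ((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_group) ** 0.5
    # probability that the observed difference clears the critical threshold
    return nd.cdf(abs(p_b - p_a) / se - z_crit)

print(ab_power(0.10, 0.12, 3000))              # power at alpha = 0.05
print(ab_power(0.10, 0.12, 3000, alpha=0.01))  # stricter alpha -> less power
print(ab_power(0.10, 0.12, 5000))              # more samples -> more power
```

Note how the output illustrates both trade-offs discussed above: tightening alpha from 0.05 to 0.01 lowers power, while increasing the sample size raises it.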
To put it simply:
- A type I error is declaring something that isn’t there, whereas a type II error is failing to declare something that is.
- Type II mistakes are influenced by the power level you select: the higher the power level, the less likely a Type II error will occur.
How To Calculate the Statistical Power?
Finally, let’s go through the method of calculating statistical power. We’ll walk through a step-by-step guide that you can easily follow to calculate it yourself.
Power analysis is a method for determining statistical power: the likelihood of discovering an effect, provided that the effect exists. To put it another way, power is the likelihood of rejecting a false null hypothesis. It’s important to distinguish power from a Type II error, which occurs when you fail to reject a false null hypothesis; power is therefore the likelihood of not committing a Type II error. Results from a test with high statistical power are more likely to be valid, because the probability of a Type II error decreases as power increases. Low statistical power indicates questionable test results.
We can find the statistical power in three simple steps:
1. Defining the Region of Acceptance
A researcher gathers sample data for a hypothesis test and computes a test statistic from it. If the statistic falls within a specific range of values, the researcher cannot reject the null hypothesis. This range of values is called the region of acceptance.
2. Specifying the Critical Parameter Value
The critical parameter value is a substitute for the null hypothesis value. The difference between the null hypothesis and critical parameter values is called the effect size. To put it another way, the effect size is equal to the critical parameter value minus the null hypothesis value.
3. Computing the Power of the Test
Let’s assume that the correct population parameter is equal to the critical parameter value rather than the null hypothesis’s value. Evaluate the likelihood that the sample estimate of the population parameter will lie outside the acceptable range based on that assumption. That likelihood is the power of the test.
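The three steps above can be made concrete with a small, hypothetical example: a two-sided test of H0: μ = 100 with a known standard deviation of 15, a sample of 50, and a critical parameter value of 105 (all of these numbers are invented for illustration):

```python
from statistics import NormalDist

# Hypothetical setup: H0: mu = 100, known sigma = 15, n = 50, alpha = 0.05
mu0, sigma, n, alpha = 100.0, 15.0, 50, 0.05
se = sigma / n ** 0.5          # standard error of the sample mean
nd = NormalDist()

# Step 1: define the region of acceptance for the sample mean (two-sided)
z = nd.inv_cdf(1 - alpha / 2)
lower, upper = mu0 - z * se, mu0 + z * se

# Step 2: specify the critical parameter value; effect size is its
# difference from the null value
mu_crit = 105.0
effect_size = mu_crit - mu0    # 5.0

# Step 3: assume the true mean equals the critical value and compute
# the probability that the sample mean falls OUTSIDE the acceptance region
alt = NormalDist(mu_crit, se)
power = alt.cdf(lower) + (1 - alt.cdf(upper))
print(round(power, 3))
```

With these made-up numbers, the test has roughly 65% power, i.e., about a 35% chance of a Type II error.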
That’s all! Not that hard, right? Let’s move to the A/B testing section now so you can see the relevance and also go through some important approaches you can take.
What Is A/B Testing?
An A/B test is used to determine which version of a webpage is the most effective. If your goal is to convert browsers into buyers, A/B testing will show you which page yields more sales. However, before redesigning your website, double-check that your findings are reliable. That is why A/B testing statistics are an essential component of web or app optimization.
When we run an A/B test, a type of hypothesis test, we create two competing versions of a webpage and show them to two sets of randomly selected visitors. The new version, page B, may differ in its buttons, web forms, notifications, or any other element we can think of. The difference in conversion rates between pages A and B indicates which version performed better during the test. After that, we must determine whether the results are credible and whether they provide us with any useful information.
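Checking whether a difference in conversion rates is credible typically comes down to a significance test. Here is a minimal sketch of a two-sided two-proportion z-test; the conversion and visitor counts in the usage line are invented for illustration:

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.
    conv_a/conv_b are conversion counts; n_a/n_b are visitor counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # pooled conversion rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: A converts 200 of 2,000 visitors, B converts 250 of 2,000
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(z, p_value)
```

If the p-value falls below your chosen alpha (e.g., 0.05), you reject the null hypothesis that the two pages convert at the same rate.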
A/B testing, in essence, removes all of the guesswork from app optimization and allows experienced optimizers to make data-driven judgments. The ‘control’ or original testing variable is A in A/B testing. ‘Variation,’ or a revised version of the initial testing variable, is denoted by B.
Bayesian Vs. Frequentist Approach – Which One Is Better?
Two of the most common approaches to A/B testing are Bayesian and Frequentist. Neither is best in all situations; there are different scenarios where one outperforms the other.
So, let’s compare both these approaches briefly so you know which one to use in your use case and why.
| Bayesian | Frequentist |
| --- | --- |
| Uses the concepts of ‘Probability as Degree of Belief’ and ‘Logical Probability’. | Defines probability as ‘Probability as Long-Term Frequency’. |
| Applies prior knowledge from earlier trials to your present data and tries to assimilate it; inferences are derived from all available evidence. | Uses only data from the current experiment; conclusions are drawn from the tests you conduct. |
| Considers the probability of A defeating B as well as the range of improvements that can be expected. | Provides an estimated mean and standard deviation for cases where A beats B, but ignores cases in which B beats A. |
| Gives you more control over the testing process: you can plan better, have a more precise reason to halt tests, and see how close A and B are to each other. | Requires the test to run for a set amount of time to obtain accurate results, cannot tell you how close or far A and B are, and cannot give you the probability of A defeating B. |
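To make the Bayesian column concrete, here is a minimal Monte Carlo sketch of the quantity the frequentist approach cannot give you directly: the probability that B beats A. It assumes flat Beta(1, 1) priors, and the conversion counts in the usage line are invented for illustration:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors.
    The posterior for each rate is Beta(1 + conversions, 1 + non-conversions)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical data: A converts 200 of 2,000 visitors, B converts 250 of 2,000
print(prob_b_beats_a(200, 2000, 250, 2000))
```

Instead of a binary reject/fail-to-reject decision, this yields a direct probability statement about the variants, which is why the Bayesian approach gives you finer-grained control over when to stop a test.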
We conclude that statistical power is critical throughout the design stage of an A/B test. Pay extra attention to it to avoid wasting the time and money spent planning, building, deploying, monitoring, and evaluating A/B tests. As a core component of A/B testing, it gives you control over errors and increases your chances of finding significant effects. A/B testing is crucial if you want to improve, enhance, or build a better version of something you’re working on, most likely a webpage.
There are two statistical approaches to running an A/B test: Bayesian and Frequentist. Now it is up to you to decide which one is better suited to your situation. A/B testing will help you find and eliminate your weak points.