fixyour.page

A/B Test Calculator

Calculate sample size for statistically significant A/B tests.

Your data stays in your browser

Set your parameters and hit Calculate. We'll tell you how many visitors you need.

How to Calculate A/B Test Sample Size

Use the Sample Size Planner to figure out how many visitors you need before you start your test. Enter your current conversion rate, the smallest improvement you'd want to detect, and your desired confidence level. The calculator uses a two-proportion z-test formula to estimate the sample size you need per variation. Run your test until you hit that number. Stopping early is the fastest way to get a false positive.
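The page doesn't spell out the formula the planner uses, so the following is only a sketch of one common textbook form of the two-proportion z-test sample size calculation. The function name, the default 80% power, and the two-sided test are assumptions for illustration, not the calculator's confirmed settings.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation (two-sided two-proportion z-test).

    baseline_rate: current conversion rate, e.g. 0.05 for 5%
    relative_mde:  smallest relative lift worth detecting, e.g. 0.20 for 20%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Example: 5% baseline, 20% relative MDE (detecting a shift to 6%).
visitors_needed = sample_size_per_variation(0.05, 0.20)
print(visitors_needed)
```

Other calculators may return slightly different numbers for the same inputs, because they pool variances differently or use a one-sided test.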

Once your test is complete, switch to the Results Evaluator. Enter your visitor counts and conversions for both variations. You'll get a p-value, confidence level, relative lift, and a clear verdict on whether you have a statistically significant winner.
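As a rough illustration of what an evaluation like this involves, here is a minimal sketch of a pooled two-proportion z-test. The function name, the returned keys, and the two-sided test are assumptions; the Results Evaluator's actual implementation isn't documented on this page.

```python
import math
from statistics import NormalDist


def evaluate_ab_test(visitors_a, conversions_a, visitors_b, conversions_b, alpha=0.05):
    """Two-sided pooled z-test comparing variation B against control A."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b

    # Pooled rate: the best estimate of the shared rate if there is no difference.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if se == 0:
        raise ValueError("No variation in the data; the test is undefined.")

    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "relative_lift": (rate_b - rate_a) / rate_a,
        "z": z,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


result = evaluate_ab_test(10_000, 500, 10_000, 600)
print(result["relative_lift"], result["p_value"], result["significant"])
```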

What Is Statistical Significance?

Statistical significance tells you whether the difference between your control and variation is likely to be real rather than random noise. A p-value below 0.05 means that, if there were truly no difference between the variations, you'd see a gap this large less than 5% of the time. That's the standard threshold most teams use. It doesn't guarantee the result is correct, but it means you can be reasonably confident you're not chasing phantom improvements.

Understanding P-Values and Confidence Levels

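The p-value is the probability of seeing a result at least as extreme as yours if there were actually no difference between variations. A p-value of 0.03 means that, with no real difference, a result this extreme would turn up about 3% of the time; it is not the probability that your result is a false positive. The confidence level reported alongside it is simply 1 minus the p-value, expressed as a percentage, so a p-value of 0.03 is shown as 97% confidence. Treat it as a restatement of the p-value, not as a 97% chance that the variation is truly better. Neither number tells you how large the effect is; that's what relative lift and the confidence interval are for.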
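To make the confidence interval concrete, here is a minimal sketch of a normal-approximation (Wald) interval for the absolute difference in conversion rates. The function name and the choice of the Wald method are assumptions for illustration; other interval methods exist and behave better with very small counts.

```python
import math
from statistics import NormalDist


def difference_confidence_interval(visitors_a, conversions_a,
                                   visitors_b, conversions_b, confidence=0.95):
    """Wald interval for (rate_b - rate_a), in absolute terms (e.g. 0.01 = 1 point)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b

    # Unpooled standard error: each variation keeps its own variance estimate.
    se = math.sqrt(rate_a * (1 - rate_a) / visitors_a
                   + rate_b * (1 - rate_b) / visitors_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    diff = rate_b - rate_a
    return diff - z * se, diff + z * se


low, high = difference_confidence_interval(10_000, 500, 10_000, 600)
print(f"B beats A by between {low:.4f} and {high:.4f} (absolute)")
```

An interval that excludes zero agrees with a significant result; its width shows how precisely the size of the effect is known.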

Common A/B Testing Mistakes

Stopping tests early when one variant looks like it's winning.
Peeking at results repeatedly and calling a winner the first time p dips below 0.05.
Running tests on too little traffic and declaring a 50% lift that was really just noise.
Not accounting for seasonal traffic patterns.
Testing too many variations at once without adjusting for multiple comparisons.
And the classic: running a test with no hypothesis about why the change should work, then reverse-engineering a story to fit whatever result you got.
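For the multiple-comparisons mistake, one simple adjustment is the Bonferroni correction, which divides the significance threshold by the number of comparisons. The sketch below assumes each variation is compared against the same control; the function name is hypothetical, and Bonferroni is only one of several possible corrections (and a conservative one).

```python
def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which comparisons stay significant after a Bonferroni correction."""
    # Each comparison must clear a stricter threshold: alpha / number of comparisons.
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Three variations tested against one control.
print(significant_after_bonferroni([0.03, 0.01, 0.20]))
# -> [False, True, False]
```

With three comparisons the threshold drops to about 0.0167, so a p-value of 0.03 that would pass on its own no longer counts as a win.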

How long should I run an A/B test?
Until you reach your required sample size, not a day sooner. Use the Sample Size Planner to calculate the number per variation, multiply it by the number of variations, then divide by your daily traffic to estimate how many days you need. A common rule of thumb is to run for at least two full business cycles (typically two weeks) to account for day-of-week effects, even if you hit your sample size earlier.
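The duration arithmetic can be sketched as follows. The function name and parameters are hypothetical, and rounding up to whole weeks is an assumption added here so that every day of the week is sampled equally.

```python
import math


def estimated_test_days(sample_size_per_variation, variations, daily_visitors, min_days=14):
    """Estimate test length in days, never shorter than min_days."""
    total_visitors = sample_size_per_variation * variations
    days_for_sample = math.ceil(total_visitors / daily_visitors)

    # Respect the two-week rule of thumb even when traffic is high.
    days = max(days_for_sample, min_days)

    # Assumption: round up to whole weeks to balance day-of-week effects.
    return math.ceil(days / 7) * 7


# 8,158 visitors per variation, 2 variations, 1,000 visitors per day in the test.
print(estimated_test_days(8158, 2, 1000))  # -> 21
```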
What's a good minimum detectable effect?
It depends on your traffic and how much lift would actually matter to your business. A 20% relative MDE is a common starting point: if your conversion rate is 5%, you're looking to detect a shift to at least 6%. Smaller effects require disproportionately more traffic, since halving the MDE roughly quadruples the sample size you need. If you'd need 500,000 visitors to detect a 2% lift and you only get 10,000 a month, that's 50 months of traffic, and the test isn't worth running.
Can I stop a test early if one variant is winning?
No. Early stopping massively inflates your false positive rate. If you check your test every day and stop the first time you see significance, your actual false positive rate can be 20-30% instead of the 5% you planned for. Decide your sample size upfront and commit to it. If you need the ability to stop early, look into sequential testing methods, which use adjusted significance thresholds.
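A small simulation can show the inflation. The sketch below models an A/A test (no real difference) under the normal approximation, assuming equal daily batches of traffic, so the cumulative z-statistic after each day is a running sum of standard normal draws divided by the square root of the number of days. The function name, seed, and simulation count are illustrative choices.

```python
import math
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks, simulations=5000, alpha=0.05, seed=0):
    """Share of A/A tests declared 'significant' when checked after every look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0

    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            # Each equal-sized batch adds one standard-normal increment under the null.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)
            if abs(z) > z_crit:
                false_positives += 1  # stopped early on a fluke
                break

    return false_positives / simulations


print("one look at the end:", peeking_false_positive_rate(looks=1))
print("daily looks for 14 days:", peeking_false_positive_rate(looks=14))
```

A single look at the end should land near the planned 5%, while checking daily for two weeks should come out several times higher.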
What does "80% power" mean?
Power is your test's ability to detect a real effect when one exists. At 80% power, if Variation B truly converts better than Control A by at least your MDE, you have an 80% chance of detecting that difference. The remaining 20% is the false negative rate — the chance you'll miss a real winner and call the test inconclusive. Higher power (90%) requires more traffic but reduces the risk of missing real improvements.
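To see how power depends on traffic, here is a minimal sketch that approximates the power of a two-sided two-proportion z-test for a given sample size. The function name is hypothetical, and the calculation ignores the negligible chance of a significant result in the wrong direction.

```python
import math
from statistics import NormalDist


def achieved_power(baseline_rate, relative_lift, visitors_per_variation, alpha=0.05):
    """Approximate chance of detecting a true lift of the given size."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    n = visitors_per_variation

    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p2 - p1) / se  # how many standard errors the true effect spans

    return NormalDist().cdf(z_effect - z_crit)


# 5% baseline, true 20% relative lift.
print(round(achieved_power(0.05, 0.20, 8158), 2))  # close to 0.80
print(round(achieved_power(0.05, 0.20, 2000), 2))  # well below 0.80
```

Running the same test on a quarter of the traffic leaves you more likely to miss the real improvement than to find it.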
Why do I need so many visitors for small effects?
Small effects are harder to distinguish from random variation. If your conversion rate is 5% and you're trying to detect a 2% relative lift (5.0% to 5.1%), the signal is tiny compared to the noise in conversion data. You need a large enough sample for the math to confidently say that a 0.1 percentage point difference isn't just luck. The sample size grows roughly with the inverse square of the effect size: halve the effect you want to detect, and you need roughly four times the traffic.
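The scaling can be checked with a simplified approximation that uses only the baseline variance, n ≈ 2(z_α + z_β)² p(1 − p) / δ², where δ is the absolute difference to detect. This is a rough sketch for illustration, not the planner's formula; the function name and the 5% significance and 80% power defaults are assumptions.

```python
import math
from statistics import NormalDist


def approx_sample_size(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Simplified per-variation sample size using the baseline variance only."""
    delta = baseline_rate * relative_mde  # absolute difference to detect
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)
    return math.ceil(2 * z_total ** 2 * variance / delta ** 2)


# Halving the relative MDE each time at a 5% baseline.
for mde in (0.20, 0.10, 0.05, 0.02):
    print(f"{mde:.0%} relative MDE -> {approx_sample_size(0.05, mde):,} per variation")
```

Each halving of the MDE multiplies the requirement by about four, which is why a 2% relative lift at a 5% baseline needs several hundred thousand visitors per variation.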