Let me ask you a question that will probably make you uncomfortable. How many calls does your QA team score each month? And more importantly, is that number based on math or on the size of your QA team?

Most contact centers score somewhere between 4 and 10 calls per agent per month. Some do fewer. A handful of enterprise operations are brute-force scoring 100% of calls using AI tools that charge them for it. Almost none of them are scoring the right number.

Here is the truth nobody in the QA software industry wants you to hear: scoring every call does not improve QA accuracy. It increases cost without adding insight. The math proves it. And the math also proves that scoring 4 calls per agent per month is completely useless.

3%
That is all you need. For a 60,000-call monthly volume, scoring just 3% of calls (1,789 calls) tells you your center-wide quality rate to within ±3% at 99% confidence. That is all the precision an operational decision needs. Everything beyond that is money you are setting on fire.

The Statistics Behind QA Sampling

The same statistical principles that allow political pollsters to predict elections from 1,000 responses, pharmaceutical companies to approve drugs without testing every human on earth, and Netflix to know what you want to watch next apply to your QA program.

There is a foundational concept in statistics called the Law of Large Numbers. In plain English: once you have enough samples, adding more does not meaningfully change your results. The sample average converges to the true average.

Think about it this way. If you wanted to know the average temperature of a swimming pool, you would not need to measure every water molecule. A few well-placed thermometer readings give you the answer. The same logic applies to call quality.

After a certain sample size, scoring additional calls adds almost nothing to your confidence in the result. You are just burning money and QA hours.
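To see the diminishing returns for yourself, here is a minimal simulation sketch. It assumes a hypothetical center where 82% of calls truly pass QA, then "scores" calls one at a time and reports the observed pass rate at a few checkpoints. The function name, the 82% figure, and the checkpoints are illustrative assumptions, not anything taken from a real QA tool.

```python
import random

def running_estimates(true_rate, checkpoints, seed=0):
    """Simulate scoring calls one by one; return the observed pass rate at each checkpoint."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    passed = 0
    results = {}
    for n in range(1, max(checkpoints) + 1):
        if rng.random() < true_rate:  # this simulated call passes QA
            passed += 1
        if n in checkpoints:
            results[n] = passed / n
    return results

# Hypothetical center: 82% of calls truly pass.
estimates = running_estimates(true_rate=0.82, checkpoints={5, 50, 500, 2000, 20000})
for n, rate in sorted(estimates.items()):
    print(f"{n:>6} calls scored -> observed pass rate {rate:.3f}")
```

Run it with a few different seeds. The small samples jump around, while the large ones settle close to 0.82 and barely move as you add more calls.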

The Two Variables That Matter

When you calculate sample size for QA, two things determine how many calls you need: confidence level and margin of error. Understanding both is the difference between a defensible QA program and one that is based on vibes.

Confidence Level: How Sure Do You Need to Be?

A 95% confidence level means that if you repeated this sampling process 100 times, about 95 of those samples would capture the true quality level of your center, within the margin of error. A 99% confidence level is the gold standard. It means about 99 out of 100 samples would capture the truth.

Most academic research uses 95%. Regulated industries (pharma, finance) use 99%. For most contact centers, 95% is plenty. For compliance-critical operations like healthcare, financial services, or collections, go to 99%.

Margin of Error: How Close to the Real Number?

A ±3% margin of error means if your sample shows 82% quality compliance, the true number is between 79% and 85%. For operational decisions, that is more than precise enough.

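Tighter margins (±1% or ±2%) require far more samples. The required sample size grows with the inverse square of the margin, so cutting the margin in half roughly quadruples the number of calls you need to score. For QA, ±3% is the sweet spot. Going tighter does not change how you coach agents or where you invest your time.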
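Here is a rough sketch of how margin of error and sample size trade off. It uses the standard normal-approximation formula for a proportion, assumes 95% confidence (z = 1.96) and the worst-case split p = 0.5, and ignores the finite population correction. The function names are hypothetical.

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Half-width of the confidence interval for a proportion (large population)."""
    return z * math.sqrt(p * (1 - p) / n)

def required_sample(margin, z=1.96, p=0.5):
    """Calls needed for a target margin, before any finite population correction."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Each step halves the margin, which roughly quadruples the sample.
for margin in (0.06, 0.03, 0.015):
    print(f"±{margin:.1%} margin -> {required_sample(margin)} calls")
```

Every time the margin is cut in half, the call count goes up about fourfold. That is why chasing ±1% gets expensive fast.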

The Sample Size Table for a 60,000-Call Center

Here is the reference table for a 60,000-call monthly volume. These numbers use the standard statistical formula with a finite population correction. This is the same methodology used in academic research and regulated industries.

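| Confidence Level | Margin of Error | Calls to Score | % of Total Volume |
| --- | --- | --- | --- |
| 90% | ±5% | 270 | 0.45% |
| 90% | ±3% | 748 | 1.25% |
| 95% | ±5% | 382 | 0.64% |
| 95% | ±3% | 1,056 | 1.76% |
| 99% | ±3% | 1,789 | 2.98% |
| 99% | ±1% | 12,456 | 20.76% |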
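If you want to run the numbers for your own volume, here is a minimal sketch of the usual sample size calculation for a proportion with a finite population correction. It assumes the worst-case split p = 0.5 and takes the z-score from Python's standard library. The function name is hypothetical, and this is one common form of the formula, not necessarily the exact calculator behind the table. It reproduces the 270, 382, and 1,789 rows, but other rows may not match exactly, depending on how the z-score was rounded and which correction variant was used.

```python
import math
from statistics import NormalDist

def sample_size(population, confidence, margin, p=0.5):
    """Calls to score for a given confidence level and margin of error."""
    # Two-sided z-score, e.g. about 1.96 for 95% and 2.576 for 99%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z**2 * p * (1 - p) / margin**2
    # Finite population correction shrinks n0 when the population is limited.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

monthly_volume = 60_000
recommended = sample_size(monthly_volume, confidence=0.99, margin=0.03)
print(recommended, f"{recommended / monthly_volume:.2%}")
```

Swap in your own monthly volume. Once the population is in the tens of thousands, the required call count barely moves as volume grows, which is why the percentage you need to score keeps shrinking for larger centers.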

As you can see, even at the highest reasonable confidence levels you are looking at a small fraction of your total call volume. The 99% confidence, ±3% row is the sweet spot for most operations.

For this 60,000-call example, we recommend 99% confidence with ±3% margin of error. That means scoring 1,789 calls per month. About 60 calls per day. That is just 3% of your total volume, giving you the same statistical validity as scoring everything.

Why Scoring 100% Is a Waste of Money

Legacy AI QA tools charge per call scored. For a 60,000-call operation, scoring 100% of calls can run $15,000 to $25,000 per month depending on the vendor. Scoring 1,789 calls with statistically valid sampling costs a fraction of that.

$4,000+
Saved per month by scoring 1,789 calls instead of 60,000. Same statistical confidence. Better margins. No long-term contract. The legacy tools are charging you to score calls that add zero statistical value.

Why 5 Calls Per Agent Per Month Is Useless

Most contact centers score 4 to 10 calls per agent per month. Here is why that is statistically garbage.

Suppose an agent handles 70 calls per day. Five days per week and four weeks per month means that agent handles 1,400 conversations per month. To draw defensible conclusions about that agent at a 90% confidence level with ±10% margin of error, you would need to score 65 calls per agent per month. At 95% confidence with ±5% margin, you need over 300 calls per agent per month.
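You can also turn the question around: given the number of calls you actually score, what margin of error are you living with? The sketch below assumes the same normal-approximation formula with a finite population correction and the worst-case split p = 0.5. The function name is hypothetical, and at very small samples like 5 calls the approximation is rough, so treat those results as ballpark figures.

```python
import math
from statistics import NormalDist

def margin_for_sample(n, population, confidence, p=0.5):
    """Approximate margin of error from scoring n of `population` calls."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p * (1 - p) / n)
    # Finite population correction: sampling a big share of the calls helps a little.
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc

monthly_calls = 70 * 5 * 4  # 70 calls/day, 5 days/week, 4 weeks -> 1,400
for scored in (5, 10, 65):
    moe = margin_for_sample(scored, monthly_calls, confidence=0.90)
    print(f"{scored:>3} calls scored -> roughly ±{moe:.0%} at 90% confidence")
```

Under these assumptions, 5 scored calls leave you with a margin in the neighborhood of ±37% at 90% confidence, while 65 calls bring it just under ±10%.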

Nobody is doing that manually. Which is exactly why manual QA at 5-10 calls per agent is barely better than guessing. You are making coaching decisions, performance reviews, and potentially termination decisions based on less than 1% of what the agent actually does.

When you score 5 calls, you do not have data. You have anecdotes.

One bad call out of 5 makes an agent look like they fail 20% of the time. But if you scored 100 calls and found 4 bad ones, that is a 4% failure rate. The agent is actually performing well. You just got unlucky with your sample. This is not theoretical. It happens every day.
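To put numbers on that, here is a sketch that puts a confidence interval around each failure rate. It uses the Wilson score interval, a standard method that behaves better than the simple formula on small samples. That is a choice made for this illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def wilson_interval(failures, n, confidence=0.95):
    """Wilson score interval for a failure rate observed as failures out of n calls."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = failures / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

for failures, n in ((1, 5), (4, 100)):
    low, high = wilson_interval(failures, n)
    print(f"{failures} bad out of {n}: plausible failure rate {low:.1%} to {high:.1%}")
```

With 1 bad call out of 5, the plausible failure rate runs from under 4% to over 60%. With 4 out of 100, it narrows to roughly 2% to 10%. The first range is too wide to coach on. The second is something you can act on.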

Why Random Sampling Alone Is Not Enough

Random sampling only works if it is truly random AND proportional. If your sample accidentally overrepresents certain agents, call types, or time periods, your results will be skewed. This is where most QA programs fall apart.

Stratified Sampling: The Fix

Instead of pulling calls randomly from the entire pool, divide your calls into groups (by agent, call type, time of day, campaign, queue) and sample proportionally from each group. If Agent A handles 12% of your calls, 12% of your sample should come from Agent A. If morning shift handles 40% of volume, 40% of your sample should be morning calls.
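The idea can be sketched in a few lines. This is an illustrative example under simple assumptions, not OttoQA's actual algorithm: calls are plain dictionaries, the `agent` field and the function name are hypothetical, and each group's quota is just its share of volume rounded to the nearest whole call.

```python
import random
from collections import defaultdict

def stratified_sample(calls, key, sample_size, seed=None):
    """Randomly pick about `sample_size` calls, proportionally from each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for call in calls:
        strata[key(call)].append(call)  # group calls by agent, queue, shift, etc.
    total = len(calls)
    sample = []
    for members in strata.values():
        # Quota proportional to the group's share of volume; rounding means
        # the final total can differ from sample_size by a call or two.
        quota = min(round(sample_size * len(members) / total), len(members))
        sample.extend(rng.sample(members, quota))  # random pick within the group
    return sample

# Hypothetical month: agent A handles 12% of calls, B 48%, C 40%.
calls = [{"id": i, "agent": "A" if i < 120 else "B" if i < 600 else "C"}
         for i in range(1000)]
sample = stratified_sample(calls, key=lambda c: c["agent"], sample_size=100, seed=7)
counts = {a: sum(1 for c in sample if c["agent"] == a) for a in "ABC"}
print(counts)
```

To stratify on more than one dimension, pass a key that returns a tuple, such as agent plus shift. Just watch for groups so small that their quota rounds to zero.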

Automated Selection: The Non-Negotiable

Let your QA software handle the randomization algorithmically. Human selection, even with the best intentions, introduces bias. "I will just grab a few calls from the morning shift" is how skewed data happens. So is "let me pull that long call the customer complained about." Both destroy statistical validity.

The Three Mistakes That Kill QA Accuracy

Mistake #1: Cherry-picking calls. Selecting calls based on duration, customer complaints, or supervisor hunches destroys statistical validity. Your sample must be random within your defined strata.

Mistake #2: Inconsistent scoring criteria. If different evaluators interpret your rubric differently, you are adding noise to your data. Statistical precision is meaningless if your measurement instrument is unreliable. This is where AI-powered scoring has a significant advantage. It applies criteria identically every time.

Mistake #3: Scoring too few calls per agent. If you are coaching an agent based on 5 scored calls, you do not have statistically meaningful data. You have anecdotes. For individual agent coaching, you need at least 30-50 calls per agent per month to see real patterns.

What OttoQA Does Differently

This is exactly what OttoQA does for you. Our algorithm automatically handles stratified random sampling, ensuring proportional, unbiased selection across agents, call types, and time periods. You get statistically valid results without having to think about the math.

OttoQA CAN score 100% of calls. But you do not HAVE to. We help you score the right amount. That is smarter and more cost-effective than brute-force scoring.

Every competitor pushes 100% scoring because it costs you more. We built our platform to help you score smart. Fewer calls, better data, lower cost, same confidence.

The Bottom Line

If you are scoring 5 calls per agent per month, you have no data. You have guesses dressed up as metrics.

If you are scoring 100% of calls with a legacy tool, you are burning money on statistical noise.

The sweet spot is a statistically valid sample using stratified random selection. For most centers that is 2-5% of calls at 99% confidence with ±3% margin. The math does not lie. The legacy vendors just do not want you to look at it.