Cold email A/B testing – what to test, how to measure, when to call it

March 24, 2026

I ran 14 A/B tests on cold emails last year. 9 of them gave me a clear winner. 3 were inconclusive. 2 told me I was testing the wrong thing entirely.

Those 9 winners improved my overall reply rate from 7% to 11% over 6 months. A/B testing cold email isn’t optional – it’s how you stop guessing and start compounding small improvements.

Here’s exactly what to test, how to measure it, and the minimum numbers you need before making a call.

What to test (in priority order)

Not all variables are equal. Test them in this order – each one has a bigger impact than the next:

1. Subject lines. This is where A/B testing gives the fastest, clearest results. Subject lines control open rate, and open rate is the top of the funnel. A 10% improvement in opens cascades through everything downstream. My subject line guide covers the formats that work – A/B testing tells you which format works for your audience.

2. First lines. The opener determines whether someone reads past the first sentence or deletes. Test personalized vs. direct, question vs. statement, observation vs. compliment. More on first line approaches.

3. CTAs. The call to action determines whether a reader becomes a reply. Test question vs. statement, specific time ask vs. open-ended, low-commitment vs. direct. Examples in my CTA post.

4. Send time. Day of week and time of day. This has a smaller impact than the other 3 but it’s easy to test and compounds over time. My data on send timing.

5. Email length. Short (40-60 words) vs. medium (80-100 words). I don’t test long emails because they consistently underperform. The length data.

Minimum sample sizes

This is where most people mess up. They send 20 emails with subject line A and 20 with subject line B, see a difference, and declare a winner.

20 sends per variant isn’t enough. Here’s the minimum I use:

Subject line tests: 50 sends per variant minimum. 100 is better. You’re measuring open rate, which has natural variance of 10-15% between small batches.
First line / body tests: 75 sends per variant minimum. Reply rate is a smaller number than open rate, so you need more volume to see a real signal.
CTA tests: 75 sends per variant minimum. Same logic as body tests.
Send time tests: 100 sends per variant minimum. Time-of-day effects are real but subtle.

The math: If your baseline reply rate is 10%, a sample of 50 sends gives you 5 expected replies. The difference between 4 and 6 replies could be noise. At 100 sends, 8 vs. 12 replies starts to tell you something real.

When in doubt, keep testing. Calling a winner too early based on small numbers is worse than not testing at all – it gives you false confidence.

Real A/B test examples

Test 1: Subject line – question vs. statement

Variant A: “Quick question about [company]‘s content strategy” Variant B: “[Company]‘s content strategy”

A: 62% open rate (124 sends)
B: 49% open rate (124 sends)

Winner: A. The question format outperformed by 13 percentage points. This was consistent across 3 follow-up tests. Questions in subject lines create a curiosity loop that statements don’t.

Test 2: CTA – specific vs. open-ended

Variant A: “Worth a 10-minute call this week?” Variant B: “Open to chatting about this?”

A: 11% reply rate (100 sends)
B: 8% reply rate (100 sends)

Winner: A. The specific time commitment (“10-minute”) reduced the perceived cost of saying yes. “Open to chatting” is vague – it could mean a 10-minute call or a 45-minute demo. People avoid ambiguity.

Test 3: First line – observation vs. compliment

Variant A: “Saw that [company] just launched [specific product]. Curious how the rollout’s going.” Variant B: “Love what [company] is doing with [specific product].”

A: 14% reply rate (80 sends)
B: 9% reply rate (80 sends)

Winner: A. Observations feel genuine. Compliments from strangers feel like sales tactics. The observation also opens a conversation – it asks a question implicitly. The compliment is a dead end.

Test 4: Send time – morning vs. afternoon

Variant A: 8:30 AM recipient’s timezone Variant B: 2:00 PM recipient’s timezone

A: 57% open rate, 10% reply rate (150 sends)
B: 52% open rate, 9% reply rate (150 sends)

Winner: A, but barely. The morning edge was small enough that I’d call this “morning is slightly better, not dramatically better.” I default to morning sends but I don’t stress about it.

Test 5: The one that taught me the most

Variant A: 4-sentence email with personalized first line Variant B: 2-sentence email with generic first line but stronger CTA

A: 12% reply rate (100 sends)
B: 7% reply rate (100 sends)

Winner: A by a mile. I ran this test thinking maybe shorter-with-generic would beat longer-with-personal. It didn’t. Personalization in the first line is not optional – it’s the highest-leverage element in the email. This test killed any temptation I had to sacrifice personalization for speed.

The iteration loop

Here’s the process I run every 2 weeks:

Week 1-2: Identify the weakest metric. Check your outreach metrics. Is the problem opens, replies, or call bookings?

Week 1-2: Design the test. Pick 1 variable that affects that metric. Write 2 variants. Don’t change multiple things at once – you won’t know what caused the difference.

Week 1-2: Run the test. Split your send list evenly. Alternate sends (A, B, A, B) rather than sending all A’s Monday and all B’s Tuesday – you want to control for day-of-week effects.

End of Week 2: Evaluate. Did you hit minimum sample size? Is the difference meaningful (5+ percentage points for open rate, 3+ percentage points for reply rate)? If yes, adopt the winner. If no, keep testing or call it a tie and move on.

Then start the next test. One test per 2-week cycle. 26 tests per year. Even if only half produce a clear winner, that’s 13 improvements compounding on each other.

What not to test

Don’t test your offer. If you’re A/B testing “we do SEO” vs. “we do content marketing,” you don’t have a testing problem – you have a positioning problem. Fix your ICP first.

Don’t test formatting. Bold vs. no bold. Bullet points vs. paragraphs. These make almost no difference in cold email. I’ve tested them. Save your sends for variables that actually move the needle.

Don’t test tone. Casual vs. formal is a brand decision, not a test variable. Pick your voice and test the mechanics within it.

Start here

If you’ve never A/B tested cold email:

Pick your current best template.
Write 2 subject line variants.
Send 50 emails with each.
Compare open rates.
Keep the winner. Write a new challenger. Repeat.

That’s the whole system. Test one variable at a time. Hit minimum sample sizes. Adopt winners. Compound over months. The teams that test consistently outperform the teams with “better” copywriters every time.