Bro, do you even test?
By Neil T Stacey
If I flip a coin 10 times and Heads comes up 6 times, is it fair for me to say that the coin shows Heads 60% of the time? Not really; a fair coin comes up Heads at least 6 times in 10 more than a third of the time. And yet, I have seen a Magic player claim that his deck is 60% in a matchup because he won 6 games from 10 in testing, a result that could just as easily be the product of random chance.
I have no intention of turning this article into a lesson in Statistics. Instead, I’m aiming to offer a basic intuitive grasp of some of the principles of hypothesis testing and how they apply to deck-testing. Those interested in more detail can get in touch with me (@NeilTStacey on Twitter) or do some googling on hypothesis testing and in particular, p-value analysis. For most of you, however, an intuitive sense of the basic principles is more useful than a string of equations, so I’ll stick to using examples like the one I started with.
Going back to that example, what if a deck wins 8 from 10 games? I would be far from convinced that it has a win rate of 80%, but I’d be pretty confident that something is going on. But what can I really be confident of? For instance, if a deck’s win rate is 70% it could still easily pick up 8 wins from 10 games with a bit of luck, so I can’t be confident that it’s as high as 80%. Even 60% is a bit high as an initial estimate. As a matter of interest, there is about a 16.7% chance of a 60% deck winning at least 8 from 10 games (I won’t show my work, but I calculated this figure using the binomial distribution).
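For readers who would like to check figures like this themselves, here is a minimal Python sketch of the binomial calculation. It assumes every game is independent and that the deck's true win rate stays fixed from game to game; the function name prob_at_least is made up for this illustration.

```python
from math import comb

def prob_at_least(wins: int, games: int, win_rate: float) -> float:
    """Chance of winning `wins` or more out of `games` independent games
    when every game is won with probability `win_rate` (an assumption)."""
    return sum(
        comb(games, k) * win_rate**k * (1 - win_rate) ** (games - k)
        for k in range(wins, games + 1)
    )

# How often does a deck with a given true win rate pick up 8+ wins from 10?
for rate in (0.5, 0.6, 0.7, 0.8):
    print(f"true win rate {rate:.0%}: {prob_at_least(8, 10, rate):.1%}")
```

The 60% line comes out at roughly 16.7%. The same function handles the opening coin example: prob_at_least(6, 10, 0.5) gives the chance of a fair coin showing Heads at least 6 times in 10.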
So even winning 8 games from 10 doesn’t tell us all that much. We can conclude with reasonable confidence that the deck is at least a little bit favoured, since an evenly matched deck would win 8 or more from 10 only about 5.5% of the time, but that’s about the extent of it. My confidence might be even lower, depending on how I approached the testing. If I’m examining both decks and I don’t care which one wins, that changes things a bit. The odds of either deck winning a lot at random are higher than the odds of one specific deck doing it. When there are multiple possible results that you would consider significant, then you have to adjust for the increased chance of a seemingly significant result occurring at random.
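To put rough numbers on that adjustment, here is a small Python sketch. It assumes the matchup is truly 50/50 and that games are independent; the helper name tail_at_evens is hypothetical.

```python
from math import comb

def tail_at_evens(wins: int, games: int) -> float:
    """Chance that one specific deck wins `wins` or more of `games`
    when the matchup is truly 50/50 (an assumption for this sketch)."""
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2**games

one_specific_deck = tail_at_evens(8, 10)
# Both decks cannot win 8+ of the same 10 games, so the two chances simply add.
either_deck = 2 * tail_at_evens(8, 10)

print(f"one named deck wins 8+: {one_specific_deck:.1%}")
print(f"either deck wins 8+:    {either_deck:.1%}")
```

Under those assumptions, an 8-2 result in favour of a deck you picked in advance shows up about 5.5% of the time, while an 8-2 result in favour of whichever deck happens to win shows up about 10.9% of the time.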
This brings us to another major pitfall of deck-testing. Imagine if, instead of a single coin, I flipped 100 different coins. Imagine I pick one of them up and tell you “This coin came up Heads 8 times from 10. It is the best coin, you should play it at the PTQ.” It would be positively charitable of you to even test that coin yourself instead of just smacking some sense into me.
An equivalent scenario comes up when tweaking a deck. Let’s say you test your newest brew against Abzan Midrange, and find that it loses. Since Abzan Midrange is everywhere, that’s not acceptable, so you change your deck a bit. It still loses. You make another change. Still no good. You make another change and voilà, suddenly you win some games. You might decide that your new configuration is favourable against the bogeyman of the format, but you might also be wrong.
In fact, if the change you were making each time was to swap one Mountain for a Mountain with different art, you would eventually get a positive result, just as if you flip enough coins, one of them will eventually come up 80% Heads. Now, your deck might indeed be favoured, just as some coins are unevenly weighted. The issue is that because you tested so many different options, the odds of one of them succeeding by accident are much higher than if you tested just one, and your confidence in that result must be commensurately reduced. So do some more testing from scratch with that new build before you decide anything.
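The effect of testing many options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a model of real testing: every coin (or build) is taken to be truly 50/50, each one gets its own independent set of 10 flips (or games), and the function name chance_of_fluke is invented for the example.

```python
from math import comb

def chance_of_fluke(options: int, wins: int = 8, games: int = 10) -> float:
    """Chance that at least one of `options` truly 50/50 coins or builds
    scores `wins` or more out of `games` purely by luck.
    Assumes each option is tested in its own independent set of games."""
    p_single = sum(comb(games, k) for k in range(wins, games + 1)) / 2**games
    return 1 - (1 - p_single) ** options

# One coin, a handful of deck tweaks, and the full pile of 100 coins.
for n in (1, 5, 10, 100):
    print(f"{n:>3} options tested: {chance_of_fluke(n):.1%}")
```

With a single option the chance of a lucky 8 from 10 is about 5.5%. With five options it is already close to one in four, and with 100 coins it is better than 99%, which is why the winning coin from that pile tells you almost nothing on its own.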
This logic doesn’t just apply to whether or not a particular deck is favoured in a matchup or which build of a deck is better. It applies to any question you might think to ask about a format. All of this adds up to a LOT of test games, so it’s an absolute necessity to take some shortcuts to get past the basics and onto the really important questions. Possible shortcuts include drawing on your own judgment, asking advice from other players, reading articles or looking at tournament results. No one’s judgment is perfect, however. Remember that a lot of top pro players arrived at Pro Tour M15 believing that the best thing Mono-Red could do with three mana was bestow Mogis’s Warhound. Goblin Rabblemaster was a $1 bulk rare at the time.
Deck-testing is a big part of finding success in Constructed formats. Some would argue that it is in fact the biggest part, and it’s tough to fault that logic. The process of deck-testing doesn’t only determine what cards you’ll show up with on the day, which is pretty important in itself; it also lets you figure out your sideboarding plans for each matchup and serves as practice for all the matchups that you test.
Deck-testing is how we decide what deck to play, it’s how we decide on a build of that deck, it’s how we decide what sideboard cards to take and how to use them, and it’s how we figure out how our deck works, as well as how most of the opposing decks work.
Considering all this, it’s worth spending a little bit of time thinking about whether or not you’re doing it right, and whether you can really be confident in the results you come up with.
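One way to put a number on that confidence is a confidence interval for the win rate. The article itself does not prescribe a method, so treat the following as one possible sketch: it uses the Wilson score interval at roughly 95% confidence (z = 1.96), assumes independent games with a fixed win rate, and the function name wilson_interval is chosen for this example.

```python
from math import sqrt

def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate; z = 1.96 is roughly 95% confidence.
    Assumes independent games and a fixed underlying win rate."""
    p = wins / games
    denom = 1 + z**2 / games
    centre = (p + z**2 / (2 * games)) / denom
    half_width = z * sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return centre - half_width, centre + half_width

# The same 60% record at three different sample sizes.
for wins, games in ((6, 10), (60, 100), (600, 1000)):
    low, high = wilson_interval(wins, games)
    print(f"{wins}/{games}: plausible win rate between {low:.1%} and {high:.1%}")
```

For 6 wins from 10 the interval runs from about 31% to about 83%, which is another way of saying the result is compatible with almost anything. Under this method it takes something like 60 wins from 100 before the interval only just clears 50%.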