Many have declared Modern Eldrazi to be a broken deck based on its overpowering performance at Pro Tour Oath of the Gatewatch. Others, most recently seen playing the Eldrazi deck or buying cards for it, argue that we need more data to draw a definitive conclusion. My personal suspicion, however, is that this latter group may be guilty of applying the internet definition of statistical validity, which is “agrees with my beliefs” as opposed to a more objective benchmark such as the “p<0.05” commonly used in science.

For those not familiar with hypothesis testing, p-value analysis examines a hypothesis by taking a given set of results and working out how likely it is that results at least that extreme could arise if the hypothesis were false. The assumption that the hypothesis is false is known as the null hypothesis. If that probability is sufficiently remote, then we can reject the null hypothesis, which counts as support for the original hypothesis.

We can do this for the results of the Pro Tour by determining the probability of a non-broken deck showing up and recording the results achieved by the Eldrazi deck. In effect, we are not asking whether the deck is broken; we are instead asking if we can state confidently that it is broken based upon the data of the Pro Tour.

First, we need to figure out what data we actually have to work with. The most convenient scenario would be to have access to the total number of wins and losses for the deck. Then we could hopefully subtract the number of mirror matches from each of those numbers and be left with an overall win percentage against the rest of the field, along with the necessary data to conduct p-value analysis.

Sadly, we don’t have data for all of the Eldrazi decks, so we have a little less to work with. What we do know is that 19 out of 32 Eldrazi decks achieved a record of 6-4 or better in Modern. Our p-value analysis can take the form of determining how likely it is for a non-broken deck to log that kind of record.

The first thing to do is frame our null hypothesis in terms of a win percentage. If we only wanted to determine whether a deck is good or not, we would assess its results against the baseline of a deck with a 50% win rate. We want to know if the deck is busted, however, so we’ll use a null hypothesis based on a win rate of 55%. If a deck genuinely has a 55% win rate against a diverse field, then you’d need a compelling reason to play something else and you’d find the metagame warping around that deck. 55% is a reasonable baseline for a deck that is only kind of broken, so that’s what we’ll work with.

For the probability of a deck going 6-4 or better in 10 rounds we use the binomial distribution with p=0.55 and n=10, coming up with a probability of 50.4405%. We can use the binomial distribution again, with p=0.504405 and n=32 to determine the probability of 19 or more such decks reaching this record, giving us a probability of 20.24%, or p=0.2024.
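The two-step calculation above can be reproduced in a few lines of Python. This is a minimal sketch rather than the original working: the helper name `binom_tail` is made up for illustration, and it relies only on `math.comb` from the standard library (Python 3.8 or later).

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p). Hypothetical helper, not from the article."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Step 1: chance that one pilot with a 55% win rate wins at least 6 of 10 rounds.
p_record = binom_tail(10, 6, 0.55)

# Step 2: chance that at least 19 of 32 such pilots reach that record,
# treating the pilots as independent (an assumption of this sketch).
p_value = binom_tail(32, 19, p_record)

print(f"P(6-4 or better)      = {p_record:.6f}")
print(f"P(19+ of 32 get there) = {p_value:.4f}")
```

Run as written, the two printed values should agree with the 50.4405% and 20.24% figures quoted above.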

In other words, if 32 players show up with a deck that wins 55% of its matches, there is a 20.24% chance that 19 or more of them will reach 6-4 or better. That is a long way short of the p<0.05 threshold considered acceptable in a scientific context. Then again, we’re playing cards here, not prescribing drugs, so we’re not after the same level of rigor.

Moreover, this approach neglects several permutations where a player who could otherwise have reached a 6-4 record is instead eliminated on Day 1. Even among 32 exceptional drafters, some of them will wind up with lousy draft records that could see them eliminated early despite a strong Modern performance. We also have no way to effectively eliminate mirror matches from our analysis. We can assume a fair number of Eldrazi mirror matches took place over the course of the weekend, so that is not an insignificant effect. Both omissions would tend to understate the deck’s performance against the rest of the field, so it would seem that we can be modestly confident that the Eldrazi deck is indeed wandering around with a win rate in excess of 55%.

There is, however, a wrinkle to deal with. We would have gotten equally excited if we had seen the same results from another deck. In other words, the number we need is the probability of ANY deck achieving those results, rather than that of any PARTICULAR deck doing so. This is a simple adjustment to make, once again using the binomial distribution. There are typically around 3 decks that are heavily represented at a Pro Tour, any one of which might wind up dominating by way of luck or bustedness, so let’s re-phrase our question around those 3 decks. There’s a lot of flexibility in how we do this; hypothesis testing is less of an exact science than it appears at first glance. Tinkering with how this part is framed does change the final results, but not by much.

If 3 different decks arrive at the Pro Tour, each with a 55% win rate, then by applying the binomial distribution a bunch more times we can conclude that there is a 49.26% chance that at least one of them would manage to give 19 or more of its 32 players a record of 6-4 or better.
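The "any of 3 decks" adjustment can be sketched the same way. The code below assumes that the three decks perform independently and that each is played by 32 pilots; neither assumption is spelled out above, and `binom_tail` and `N_DECKS` are illustrative names only.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p). Hypothetical helper, not from the article."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

N_DECKS = 3  # heavily represented decks, per the estimate in the text

# Probability that one particular 55% deck puts 19+ of 32 pilots at 6-4 or better.
p_single = binom_tail(32, 19, binom_tail(10, 6, 0.55))

# Probability that at least one of the three decks does so (independence assumed).
p_any = 1 - (1 - p_single) ** N_DECKS

# The same number via the binomial distribution: at least 1 success in 3 trials.
p_any_binomial = binom_tail(N_DECKS, 1, p_single)

print(f"single deck: {p_single:.4f}")
print(f"any of {N_DECKS}:    {p_any:.4f}")
```

The complement form and the binomial form are the same quantity; the first is simply quicker to write down.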

Our confidence in our original hypothesis has now dwindled to “really not confident at all.”

More data from subsequent tournaments could very well confirm that Eldrazi is indeed totally broken; in fact, a more complete dataset from the PT itself might be sufficient to support that argument. That is, however, just supposition. The results we have from the Pro Tour are well within the range of what could arise by accident with a deck that is just very good. We can take some solace in the fact that if we repeat the analysis, three-deck adjustment included, with a null hypothesis based on a win percentage of 50%, then we arrive at a p-value of 0.03112, which represents considerable confidence that it is a good deck at the very least.
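To compare the two null hypotheses side by side, the whole chain can be wrapped in one function. This is a rough sketch under the same assumptions as the analysis above (independent pilots, independent decks, 10 Modern rounds, 32 pilots per deck); the function name `eldrazi_p_value` and its parameters are invented for illustration.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p). Hypothetical helper, not from the article."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def eldrazi_p_value(win_rate, n_decks=1):
    """p-value for '19+ of 32 pilots at 6-4 or better' under a given null win rate.

    n_decks > 1 applies the 'any of n_decks decks' adjustment, assuming independence.
    """
    p_record = binom_tail(10, 6, win_rate)    # one pilot reaches 6-4 or better
    p_single = binom_tail(32, 19, p_record)   # 19+ of 32 pilots do so
    return 1 - (1 - p_single) ** n_decks      # at least one of n_decks decks does so

for win_rate in (0.50, 0.55):
    for n_decks in (1, 3):
        p = eldrazi_p_value(win_rate, n_decks)
        print(f"null win rate {win_rate:.2f}, decks considered {n_decks}: p = {p:.4f}")
```

Under these assumptions the 50% null hypothesis lands below 0.05 with or without the three-deck adjustment, while the 55% null hypothesis does not come close in either case.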

However, by the normal standards of hypothesis testing our conclusions are as follows: based upon our available data, we cannot yet reject the hypothesis that Eldrazi is only slightly busted.

By Neil T Stacey