Author: Kurt B. Johnson
I was looking at one of the latest CNN political polls (August 15-18, 2019), conducted by the polling firm SSRS (https://www.cnn.com/2019/08/20/politics/cnn-poll-democrats-2020-biden-rebound/index.html), and noticed that my favorite candidate, Pete Buttigieg, garnered 5% of the vote. The reported precision for the CNN poll, also called the margin of error, was 6.1%. Does this mean Pete’s support is somewhere between -1.1% and 11.1%? It can’t be, because a percentage cannot be negative. So it’s between 0% and 11.1%, right? That’s not correct either, because Pete Buttigieg’s support cannot truly be 0%. It has to be somewhere north of that. Why? Simply because people chose him in CNN’s poll of 402 Democratic or Democratic-leaning independent registered voters. Even if his true support isn’t exactly 5%, the poll is enough evidence that some real support for Pete Buttigieg exists, and it’s greater than 0%. The same goes for Kamala Harris, who also received 5% of the vote in the CNN poll.
Bernie Sanders and Elizabeth Warren got 15% and 14%, respectively. The top vote getter, Joe Biden, received 29% of the voters’ support. Setting aside the complications with Kamala and Pete and applying the +/- 6.1% margin of error, we would say with 95% confidence that Joe’s voter support is between 22.9% and 35.1%, a range known as a confidence interval. The only problem is that this isn’t correct either!
The margin of error reported in survey sampling, the method used to conduct political polls, is the plus-minus percentage that applies to a hypothetical candidate with exactly 50% of the vote. So, if Joe Biden had received exactly 50% of the vote in the CNN survey, his confidence interval would be 43.9% to 56.1%. The confidence interval narrows as the percentage moves away from 50% in either direction. At 29%, Joe Biden’s margin of error is actually +/- 5.54%, resulting in a confidence interval of 23.46% to 34.54%. Pete Buttigieg and Kamala Harris, at 5%, each have a margin of error of +/- 2.66% and a confidence interval of 2.34% to 7.66%. As you can see, reporting a +/- 6.1% margin of error when the highest vote share is 29% can be very misleading!
This article uses results from three recent political polls as the backdrop for comparing survey sampling and attribute sampling. My hope is that, by seeing attribute sampling applied to political polls, auditors will gain insight into how to apply it in their own audit sampling projects. The article first addresses how the margin of error is calculated in survey sampling, then calculates the margins of error using attribute sampling. The two sets of margins turn out to be very similar. The article also shows that bringing a survey sampling concept known as the Design Effect into attribute sampling generates comparable answers.
Current Survey Sampling Techniques
The Quinnipiac University, Monmouth University and CNN/SSRS polls will be used to illustrate the comparisons between survey sampling and attribute sampling. It is not my intention to single out these polls as any better or worse than any other polls. Nevertheless, we will discuss one critical difference between the Monmouth poll and both the Quinnipiac and CNN polls, namely that the Quinnipiac and CNN polls weight the responses to match Census demographics. Below are the precision and design effects reported by each of the polls. This article will be focusing on the section of the table having to do with the Democratic Voters Polled. Democratic-leaning independents are included by the pollsters in these numbers.
The formula used to calculate the Margin of Error (E) is:
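In the symbols defined in this section (the critical value zα/2, the proportion P, the sample size n and the Design Effect Deff), the normal-approximation form below reproduces the 6.1%, 5.54% and 2.66% figures quoted in this article, so it is presumably the intended equation 1.0:

```latex
E = z_{\alpha/2}\,\sqrt{\mathrm{Deff}\cdot\frac{P\,(1-P)}{n}} \tag{1.0}
```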
The Margin of Error formula (1.0) was constructed from the formula for the sample size (n):
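Assuming the margin of error has the form E = zα/2 · sqrt(Deff · P(1-P) / n), squaring both sides and solving for n gives the corresponding sample size formula. This is an algebraic rearrangement under that assumption:

```latex
n = \mathrm{Deff}\cdot\frac{z_{\alpha/2}^{2}\,P\,(1-P)}{E^{2}}
```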
These polls didn’t actually report all of these Design Effects (Deff). They were computed using this formula:
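Assuming again that the margin of error has the form E = zα/2 · sqrt(Deff · P(1-P) / n), solving for Deff gives the relationship below. As a rough check, Monmouth's reported 5.7% with 298 respondents and P = 0.5 gives a Deff of about 1.0, which is the value expected for an unweighted simple random sample:

```latex
\mathrm{Deff} = \frac{n\,E^{2}}{z_{\alpha/2}^{2}\,P\,(1-P)}
\qquad\text{e.g.}\qquad
\frac{298 \times 0.057^{2}}{1.96^{2} \times 0.5 \times 0.5} \approx 1.0
```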
In all of these formulas, the value for zα/2 is 1.96, which is the critical value for a 95% confidence interval. The value used for P is 0.5. Plugging 402 for n and 1.56 for Deff into equation 1.0 yields 6.1% for the margin of error. That is why the introduction said the margin of error reported for the poll really applies to a hypothetical candidate garnering exactly 50% of the vote: 50% was used to obtain the number. Why is 50% used? It generates the largest possible margin of error. It is also used by all the polls, which gives us one comparable number for gauging each poll’s precision relative to the others. Polls with a smaller reported margin of error are more precise. However, what most people don’t know is that this reported plus-minus cannot be applied directly to a candidate’s vote share. In almost all cases, readers must perform their own calculations. Equation 1.0 was used to calculate the +/- 5.54% for Joe Biden and the +/- 2.66% for Kamala Harris and Pete Buttigieg: simply replace 0.5 with 0.29 and 0.05, respectively, to obtain those answers.
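The arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration that assumes equation 1.0 has the usual normal-approximation form scaled by the Design Effect; the function name `margin_of_error` is made up for this example and is not taken from any polling software.

```python
from math import sqrt

Z_95 = 1.96  # critical value for a 95% two-sided interval, as stated in the article


def margin_of_error(p, n, deff=1.0, z=Z_95):
    """Normal-approximation margin of error for a share p, inflated by the design effect.

    Assumed form: E = z * sqrt(deff * p * (1 - p) / n).
    """
    return z * sqrt(deff * p * (1.0 - p) / n)


# CNN/SSRS poll: n = 402 respondents, Deff = 1.56 (figures from the article)
rows = [("Reported (P = 50%)", 0.50), ("Biden", 0.29), ("Buttigieg / Harris", 0.05)]
for label, share in rows:
    e = margin_of_error(share, n=402, deff=1.56)
    print(f"{label:20s} {share:4.0%} +/- {e:.2%}  ->  ({share - e:.2%}, {share + e:.2%})")
```

Changing only the value of P moves the plus-minus from roughly 6.1% down to roughly 2.7%, which is the point of the paragraph above: the headline margin of error describes a 50% candidate, not the candidates actually in the poll.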
One may access the in-depth reports for the polling results using the following links:
Attribute Sampling Applied to Political Polling
Very similar to acceptance sampling, attribute sampling is primarily used by auditors to test the goodness of controls in their organization. The password on your computer or cell phone is an example of a control. It controls who can access your device. By granting or denying permissions to the user IDs in the company’s computer network, IT administrators are able to control what people are able to do with their computers at work. Let’s say a company like Amazon has the objective that their shipments are authorized and accurate. Examples of pertinent controls they may put in place to achieve this objective are: 1) sales orders are approved by the correct subset of employees and they provide their initials for approval and 2) sales orders are checked against shipping orders for price and quantity accuracy and initialed by a different subset of employees before shipping. These two controls should be executed for each shipping transaction; they control whether the shipments are authorized and accurate or not. Each control is considered an attribute. In attribute sampling, a random sample of shipping transactions is selected. Then, each transaction is audited for correctness. For each transaction the performance of each of the two controls is evaluated on a yes/no basis. Either the control was performed correctly or it wasn’t. If it was performed correctly, the transaction was conducted in compliance with the control objective. That’s why these are sometimes referred to as compliance tests.
Instead of shipping transactions, take a random sample of registered voters. Each voter’s stance on a candidate is an attribute. Since there are about 20 candidates plus the possible answers of No One, Other and Undecided, each voter selected for the sample has 23 attributes. Each of these 23 attributes has a yes/no answer. Of course, only one attribute has a yes answer (the rest have a no answer) and that is the candidate for whom the voter says they intend to vote. Polls are conducted by asking “who are you going to vote for” and the voter gives their answer. This approach has the advantage of being more expedient than the following method. Equivalently, the pollster could iterate through each of the candidate names and get a yes/no answer for each of them. Once viewed this way, the process mirrors the typical process for sampling for controls using audit attribute sampling. Think of each candidate as a control being performed correctly or not and you’re mentally there. With attribute sampling we can test multiple controls with the same sample. The controls do not have to be independent of each other. Indeed, the 23 voting attributes for each individual voter are not independent; a yes for one means necessarily a no for all the others.
In audit attribute sampling, the auditor is concerned with identifying the rate of control deviations/errors/violations. Going back to our Amazon shipments example, if the sales order was not initialed or initialed by a person who should not have been approving the sales order, then that would count as a control violation or control error. If the check between shipping quantity and sales quantity was not performed, that would also be a control violation. These could happen together on the same transaction or not. Let’s say Amazon has a population of 500 million shipping transactions and we sample 100 of them to audit. Let’s say we find 5 violations of the first control and 7 violations of the second control. Using ADA we can get 95% confidence intervals of (1.64%, 11.28%) and (2.86%, 13.89%) as the ranges of population error rates for each of these controls.
We can use attribute sampling to get good confidence intervals for the poll results. In the polling analogy, each voter in the sample registers exactly one “violation”: the candidate they’ll vote for, even if that answer is No One, Other or Undecided. There are 153 million registered voters in the United States (https://www.statista.com/statistics/273743/number-of-registered-voters-in-the-united-states/) and roughly 30% identify themselves as Democrats (https://news.gallup.com/poll/15370/party-affiliation.aspx). That means there are roughly 46 million Democratic or Democratic-leaning independents in the United States. We will use this as the population size. These methods are fairly robust to changes in the population size, so using 10 million, 20 million or 75 million instead would not materially change the numbers.
We’ll start with the Monmouth poll and its 298 respondents because it does not incorporate response weighting, so vote counts can be recovered directly from the reported percentages. For example, Pete Buttigieg got 4% of the vote, which means he got 12 votes out of the 298 Democratic voters polled (0.04 x 298 = 11.92, rounded to 12).
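One way to see why the population size barely matters is the finite population correction, sqrt((N - n) / (N - 1)), the textbook factor by which sampling without replacement shrinks the standard error. It is brought in here only as an illustration and is not something the polls report. The sketch below evaluates it for a sample of 298 (the Monmouth poll's size) against the population sizes mentioned above.

```python
from math import sqrt


def finite_population_correction(population, sample):
    """Standard-error shrink factor for sampling without replacement."""
    return sqrt((population - sample) / (population - 1))


n = 298  # Monmouth poll respondents
for N in (10_000_000, 20_000_000, 46_000_000, 75_000_000):
    fpc = finite_population_correction(N, n)
    print(f"N = {N:>11,d}   correction factor = {fpc:.6f}")
```

For every one of these population sizes the factor is indistinguishable from 1 at the precision polls are reported, so the confidence limits are driven almost entirely by the sample size.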
Here are the results for the five candidates. The attribute sampling 95% confidence intervals are provided along with the lower and upper margins of error, which measure the distances from the Vote Share to the lower and upper confidence interval endpoints. The Computed +/- Margin of Error gives the result from employing equation 1.0 with the Vote Share as the value for P, 298 for n and 1.0 for Deff. Because the vote shares for all five candidates are far less than 50%, all of the margins of error are far less than the 5.7% reported by Monmouth University.
Monmouth University Results
Notice that the margin of error for the attribute sampling confidence intervals is not symmetric around the sample percentage (i.e. the Vote Share). In other words, the lower margin of error is not the same as the upper margin of error. The Computed +/- Margin of Error, on the other hand, is symmetric. Compared with survey sampling, attribute sampling therefore leaves less room for a candidate’s true support to fall below the vote share received in the poll. For example, the lower margin of error for Joe Biden is 4.18%, while the survey sampling margin of error is 4.45%. On the flip side, attribute sampling leaves more room for a candidate’s true support to exceed the poll result: 5.06% vs. 4.45% for Joe Biden. The survey sampling result therefore projects an upper bound of 23.45% for Joe Biden while attribute sampling projects 24.06%.
Which is preferable? That depends on what you are testing for. The preferred method is generally the one that gives the more conservative result, defined as the result less favorable to the objective you are testing. If your goal is to see optimistically how many votes a candidate could potentially get, then survey sampling is the preferred method: it is more conservative because its upper bound of 23.45% is less than attribute sampling’s 24.06% and therefore less favorable. If your goal is to test whether a candidate’s support is below a critical percentage, then attribute sampling is the preferred method. Let’s say this critical percentage, which auditors call materiality, was chosen ahead of time to be 24%. Survey sampling would say the sample passes, because its upper bound of 23.45% is below 24%, but attribute sampling would say it does not, because its upper bound of 24.06% is above 24%. Being less favorable to the outcome, attribute sampling is more conservative and the preferred method in this case.
The survey sampling results can be a bit misleading when the sample percentages are small. Control violations can be a very rare occurrence. Sometimes they do not occur at all in a sample. This is the same as a candidate getting no votes in the poll. John Delaney and Tim Ryan are examples of candidates who got no votes in the Monmouth poll. If P is zero, equation 1.0 returns a margin of error equal to zero as well. However, just because someone receives no votes in a poll of 298 people doesn’t mean they wouldn’t receive votes from some of the 46 million registered Democratic voters in the U.S. In fact, the one-sided 95% confidence interval calculated using attribute sampling finds that up to 1.0% of the population could support one of these two candidates.
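The zero-vote case can be checked with a short calculation. The sketch below uses a binomial approximation to the hypergeometric distribution, which is reasonable here because 298 respondents is tiny relative to 46 million voters; the function name `upper_limit_zero_hits` is hypothetical. With zero hits in n draws, the one-sided upper limit is the rate p at which seeing no hits has probability 1 minus the confidence level, that is, (1 - p)^n = alpha.

```python
def upper_limit_zero_hits(n, confidence=0.95):
    """One-sided upper limit on the population rate after 0 hits in n draws.

    Binomial approximation: solves (1 - p) ** n = 1 - confidence for p.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


upper = upper_limit_zero_hits(298)
print(f"0 votes out of 298: support could be as high as {upper:.2%} (one-sided 95%)")
```

The result comes out at about 1.0%, in line with the figure quoted above, whereas equation 1.0 with P = 0 would report no uncertainty at all.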
Indeed, equation 1.0 is not valid when the sample percentage P is very small and the sample size n is not extremely large. Equation 1.0 is valid when nP ≥ 10 and n(1-P) ≥ 10. Pete Buttigieg barely makes this validity cut for the Monmouth poll. For Pete P is .04 and n is 298, so .04 x 298 = 11.92. For candidates like Amy Klobuchar and Marianne Williamson, who garnered 1% and 2% of the vote respectively, equation 1.0 will not work. Instead, one must employ these equations to get the survey sampling confidence intervals (where z = zα/2 = 1.96):
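The validity rule quoted above is easy to apply mechanically. The following sketch checks nP ≥ 10 and n(1-P) ≥ 10 for the three Monmouth vote shares mentioned in this paragraph; the helper name `normal_approx_ok` is invented for the illustration.

```python
def normal_approx_ok(p, n, threshold=10):
    """True when both n*p and n*(1-p) reach the threshold used in the article."""
    return n * p >= threshold and n * (1 - p) >= threshold


n = 298  # Monmouth poll respondents
for name, share in [("Buttigieg", 0.04), ("Williamson", 0.02), ("Klobuchar", 0.01)]:
    verdict = "valid" if normal_approx_ok(share, n) else "not valid"
    print(f"{name:10s} nP = {n * share:5.2f}  ->  equation 1.0 {verdict}")
```

Only the 4% share clears the threshold, and only barely, which is why the smaller candidates need a different interval formula.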
Once equations 1.4 and 1.5 are employed, we can see that the confidence intervals and margins of error for attribute sampling and survey sampling track even closer together. In fact, the survey sampling results are closer to the attribute sampling results than to the Computed +/- Margin of Error calculated using equation 1.0. One could easily substitute attribute sampling for survey sampling and get almost identical results:
Equations 1.4 and 1.5 are for two-sided confidence intervals, so with the critical value 1.96, equation 1.5 assigns only 2.5% of the allowable error to the upper tail. To get an equivalent attribute sampling confidence interval for the zero percenters, one must therefore use 97.5% as the confidence level for the one-sided upper limit. This set of inputs creates the equivalent situation:
And remember our hypothetical entity with 50% of the vote? That is about what it takes to create the plus-minus symmetry inherent in equation 1.0. For hypothetical 40% and 30% vote-recipients, the numbers have already moved away from that perfect symmetry.
The beauty of attribute sampling lies in its simplicity. It uses the hypergeometric distribution, which is the appropriate distribution when sampling without replacement (i.e. a voter can only be picked for the sample one time). For each attribute (remember, each candidate is an attribute) we create a cumulative distribution function (CDF) and find the points that leave 2.5% of the probability in the left tail and 2.5% in the right tail, as shown in the graph of vote distributions by candidate for the Monmouth University poll. The 2.5% on the left plus the 2.5% on the right gives us 5%, which is the amount unaccounted for in a 95% confidence interval. The graph gives the distribution of voter support by candidate along with the 95% confidence intervals.
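The tail-cutting idea can be sketched with nothing but the Python standard library. This is an illustrative implementation under stated assumptions, not the code behind ADA: it searches for the population counts K whose hypergeometric tail probabilities equal 2.5%, and all function names are hypothetical. Log-gamma is used so that populations in the tens or hundreds of millions do not overflow.

```python
from math import lgamma, exp


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_cdf(x, N, K, n):
    """P(X <= x) when n items are drawn without replacement from N, K of which are 'yes'."""
    lo = max(0, n - (N - K))          # fewest 'yes' answers the sample can contain
    hi = min(x, K, n)
    if hi < lo:
        return 0.0
    log_total = log_comb(N, n)
    total = sum(exp(log_comb(K, k) + log_comb(N - K, n - k) - log_total)
                for k in range(lo, hi + 1))
    return min(1.0, total)


def largest_true(pred, lo, hi):
    """Largest integer in [lo, hi] satisfying pred; pred must be True at lo
    and switch from True to False at most once as its argument grows."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if pred(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def attribute_interval(x, n, N, confidence=0.95):
    """Two-sided limits on the population rate K/N, given x 'yes' answers in a sample of n."""
    tail = (1.0 - confidence) / 2.0
    # Upper limit: largest K for which seeing x or fewer still has probability >= tail.
    upper_K = largest_true(lambda K: hypergeom_cdf(x, N, K, n) >= tail, x, N - (n - x))
    # Lower limit: smallest K for which seeing x or more has probability >= tail.
    if x == 0:
        lower_K = 0
    else:
        below = largest_true(lambda K: 1.0 - hypergeom_cdf(x - 1, N, K, n) < tail, 0, N)
        lower_K = below + 1
    return lower_K / N, upper_K / N


# Amazon example from earlier in the article: 5 violations in a sample of 100
low, high = attribute_interval(5, 100, 500_000_000)
print(f"5 of 100: ({low:.2%}, {high:.2%})")
```

Run on the Amazon example, the limits should land close to the (1.64%, 11.28%) interval quoted earlier. The same function can be pointed at a candidate's vote count, the poll's sample size and the 46 million population to get an interval of the kind discussed here.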
Before I can present the Quinnipiac and CNN/SSRS results, I must discuss the Design Effect (Deff). The Design Effect quantifies the relative change in variance caused by sample designs that are more complex than a simple, unrestricted random sample of the same size.
The Design Effect for a simple random sample is 1.0, which is the value for the Monmouth poll. Certain stratification designs can yield a Design Effect less than 1, indicating the design is more efficient than a simple random sample. Design Effects greater than 1 indicate that the more complex design has made the sample less efficient, which typically occurs with cluster sampling and weighting of responses. Both the Quinnipiac and CNN/SSRS polls weighted their entire sample of responses to reflect national Census figures for gender, race, age, education, region of country and telephone usage. Weighting means that some votes in the poll counted more than others. For example, 49.2% of the U.S. population is male. If only 40% of the poll respondents were men, then the votes of the men were made to count more than the votes of the women. The cost of doing this is a less efficient poll, which is why the Design Effects for these polls are 1.43 and 1.56. Besides creating inefficiency, weighting also means we do not know the raw counts (the share of Quinnipiac respondents who actually chose Biden may not have been 32%), making the aggregation of poll results more difficult. On top of all that is the implicit assumption that these demographics are truly what define our views, rather than trusting the randomness of the samples to fairly represent our views.
The Design Effect allows us to construct the Effective Sample Size, which is the sample size an unrestricted random sample would need to be to attain the same level of precision as found in the more complex sample design. Using the Design Effect, one can construct the Effective Sample Size (neff) as shown:
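The relationship consistent with the figures used in this article (648 respondents and a Deff of 1.43 giving an effective sample of 453) is the usual definition, written here in the article's symbols:

```latex
n_{\mathrm{eff}} = \frac{n}{\mathrm{Deff}}
\qquad\text{e.g.}\qquad
\frac{648}{1.43} \approx 453
```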
The Quinnipiac poll had 648 respondents. After adjusting for the Deff of 1.43, the Effective Sample Size is 453. The weighting effectively threw out the votes of almost 200 people to change the answers. Using the Effective Sample Size and the Effective Vote Count constructed by multiplying the voting share by the effective sample size (e.g. 453 x .32 = 145 effective votes for Joe Biden), we have new inputs to use in computing the CDFs for attribute sampling evaluation.
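The Quinnipiac adjustment can be reproduced in a few lines. This sketch assumes the effective sample size is the actual sample size divided by the Design Effect, and it rounds to whole respondents and whole votes as the text does; the function name is illustrative only.

```python
def effective_sample_size(n, deff):
    """Sample size a simple random sample would need for the same precision."""
    return n / deff


n, deff, biden_share = 648, 1.43, 0.32   # Quinnipiac figures from the article

n_eff = round(effective_sample_size(n, deff))
effective_votes = round(n_eff * biden_share)

print(f"Effective sample size: {n_eff}")
print(f"Respondents effectively lost to weighting: {n - n_eff}")
print(f"Effective votes for Biden: {effective_votes}")
```

These two whole numbers, the effective sample size and the effective vote count, are the inputs that replace the raw sample size and raw vote count in the attribute sampling calculation.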
Quinnipiac University Results
When comparing the attribute margins of error with the survey sampling computed margin of error, we can see that the same pattern observed for the Monmouth poll holds here. The survey sampling computed margin of error is slightly greater than the attribute lower margin of error and slightly less than the attribute upper margin of error. This is evidence that the Design Effect has been properly incorporated into the attribute sampling calculations.
Here’s how we would calculate confidence intervals in ADA using the Effective Sample Size and the Effective Vote Count using Joe Biden’s numbers:
If no weighting had been applied, then the Design Effect would be 1.0 and the confidence interval would be narrower. Instead of (27.73%, 36.52%), the confidence interval would be (28.37%, 35.69%) for Joe Biden. One can compare the two provided graphs for Quinnipiac to see the impact the weighting has on the confidence intervals for the different candidates. Finally, the same results and graphs are presented for the CNN/SSRS poll.
CNN/SSRS Poll Results
Now what we’d really like to do is combine the results from the three polls to get a more precise view of the vote distributions for the candidates.
Aggregate of Three Poll Results
Hopefully, this article has shed some light on the statistics of political survey samples. More than merely an academic exercise, this article has shown that attribute sampling and survey sampling basically produce equivalent answers. This article has also shown that we can correctly and successfully apply the Design Effect to attribute sampling. There will be more to come on this in the next few months at Audit Data Analytics.