## How to Evaluate a Variable Sample with ADA

Introduction

With CVS Evaluation, the objective is to project the findings from the sample back to the population. After creating the sample and its high/low value items file(s) (if applicable) using CVS Sample Planning, the user is expected to export the data file(s) to Excel. After auditing the file(s) for correctness, the user should have two columns: one with the recorded amounts and one with the true, audited amounts. The user may choose to take advantage of the RECORDED_AMT and AUDIT_AMT columns that are created for the user’s convenience. Once the auditing of the sample and the high/low value file(s) is complete, the user can import the data file(s) back into ADA to perform CVS Evaluation. ADA’s Classical Variable Sampling Planning and Evaluation routines implement the methodologies laid out in Statistical Auditing (1977) by Donald M. Roberts for variable sampling to create the most accurate and faithful rendering possible.

CVS Evaluation can be used on high error rate samples. A sample must have at least one more error than there are strata to obtain estimates other than the Mean (aka Mean-Per-Unit) estimator. Furthermore, samples with a low number of errors overall and within individual strata cannot provide reliable estimates of the error variance. In cases where this occurs, the program computes conservative upper bounds utilizing the population standard deviations. These procedures are outlined in detail in Roberts’ Statistical Auditing. When 3 or fewer errors are observed in a stratum, the population standard deviations for the stratum’s recorded amounts are used instead of the sample standard deviations. When more than 3 errors exist for a stratum, a calculation known as the “Conservative Approximation” is performed that calculates a composite of both the sample standard deviations and the population standard deviations. The user can choose the upper limit for the number of errors for which the Conservative Approximation is performed. For example, the default of 20 means this approximation is calculated for strata with error counts in the 4 – 19 range and the sample standard deviations from the sample are used without an approximation calculation for strata with error counts of 20 or more. The user can set this threshold as low as 5. This procedure is done because when error counts are this low, the sample standard deviations will underestimate the true stratum standard deviations.

To get started, open the Spec file created when you extracted the sample and click Monetary Unit > Evaluate. A Spec file is required to perform CVS Sample Evaluation because the population standard deviations and other population statistics by stratum are required m to implement the conservative upper bounds if needed. If you do not have a Spec file, you can create one in Excel and import it; please see the section below on creating a Spec file by hand. The program will use the information in the Spec file to search for the sample file and high/low value file(s) and automatically fill in all the information to perform the evaluation. It will look for a sample file and high/low value file(s) in the current project folder that have the same names as the files created in planning and named in the Spec file except it will be looking for a hyphen (-) after those names, a phenomenon created when importing an Excel worksheet. For example, if during planning the sample name (Output Filename entry) is “CVSS Sample” and a high values file was created, which would be automatically named “CVS Sample_HV”, then exporting these to Excel will result in Excel files “CVS Sample.xlsx” and “CVS Sample_HV.xlsx”. The worksheet names will be “Sheet1” for this example. After importing, the files will be called “CVS Sample-Sheet1” and “CVS Sample_HV-Sheet1”. The sheet names do not matter, just that the hyphen is there. Please note that for this to work, the user must not change the Output Filename Prefix entry and simply accept the default when performing Excel import.

Formats Supported by CVS Sample Evaluation

ADA’s CVS Sample Evaluation utility allows you to evaluate a sample that is Parquet data. Parquet is the format ADA uses for storing data files. Using the Dialog

The dialog box consists of seven sections: Sample Specification File, Sample File, High Values File, Low Values File, Confidence Level, Conservative Approximation Error Number Threshold and Output Name.

Sample Specification File. This file is the _Spec file created when using ADA’s CVS Sample Planning module. This file is mandatory to run the module.

Sample File. This is required. Once provided, the user will need to choose the appropriate recorded amount and audited amount columns.

High Values File. This is optional and whether one exists depends on how the sample was planned and extracted in CVS Sample Planning. If provided, the user will need to choose the appropriate recorded amount and audited amount columns.

Low Values File. This is optional and whether one exists depends on how the sample was planned and extracted in CVS Sample Planning. If provided, the user will need to choose the appropriate recorded amount and audited amount columns.

Confidence Level. Defines the Beta risk for the sample evaluation, which is the risk of not finding material error with the sample when it exists in the population. The Beta risk is the complement of the Confidence Level (i.e. 100% – C.L.%). The confidence level defines the expected likelihood that the conclusions drawn with the sample will be accurate.

Two-Sided. Creates upper and lower confidence interval limits. The Beta risk is split equally between the upper and lower sides of the confidence interval.

Upper. Creates a one-sided upper confidence interval limit. The Beta risk is assigned completely to the upper side of the confidence interval.

Lower. Creates a one-sided lower confidence interval limit. The Beta risk is assigned completely to the lower side of the confidence interval.

Conservative Approximation Error Number Threshold. For any stratum in the sample, this threshold sets the lower bound of the number of observed differences/errors at which the sample’s errors are used solely to calculate the standard errors used in the projections. The sample’s errors are used solely to calculate the standard errors when the number of errors in the stratum are equal to this amount or greater. The Conservative Approximation is implemented for a stratum whenever the amount of errors observed for the stratum are less than this amount and greater than 3. The user can choose 20, 15, 10 and 5 errors as the threshold.

Output Name. Name your results data file. It will be saved in the project folder with a .parquet extension. Here, the output file will be named CVS Demo Evaluation.parquet. # Creating a Spec File

If the user used CVS Sample Planning to create their sample, then a Spec file was already created at that time and the user should use that file and disregard this section. If the user created their sample through some other means, then a basic Spec file will need to be prepared. One good method is to create a Spec file by planning a sample that you do not intend to use and then export the Spec file to Excel for the appropriate changes.

Here is a Spec file created using CVS Sample Planning. It has superfluous information at the top that you do not need. This screenshot shows the relevant format in Excel. One must provide how the population was stratified with strata statistics for the population. Requirements:

• The first row should have the letters A through I because when importing you will need to choose that the first row has column names as the Spec sheet in ADA after import will need to have column names A, B, C, D, E, F, G, H and I.

• The first column must have “Stratum Number” in it

• If there are high values in the sampling process, then a row must exist with its statistics providing the Population Items, Population Value, Population Mean, Boundary (the largest value of the high value items) and the Sample Size. Column A for this row must say “High Values”. Please note that high values are not kept as part of the sample file and should be provided as a separate file.

• If there are low values in the sampling process, then a row must exist with its statistics providing the Population Items, Population Value, Population Mean, Boundary (the largest value of the low value items) and the Sample Size. Column A for this row must say “Low Values”. Please note that low values are not kept as part of the sample file and should be provided as a separate file.

After import, the file looks like this in ADA and one can use it to perform an evaluation: Interpretation of Results

After generating the results files, it is expected that the user will export them to Excel, which can easily be done with Quick Export, where the user will have complete control over presentation of their results and be able to easily include the output in their reports and papers. The following views of the output are in Excel with no additional formatting applied other than zooming them down to be able to fit on screen before taking the screenshot.

An evaluation will produce two files: Results Overview and Detailed Calculations. The first one uses the Output Name exactly and represents the overview of the results. The second file uses the Output Name and follows it with a _DetailedCalcs suffix. The first one we will call the Results Overview and the second one we will call the Detailed Calculations.

Here is an example of a Results Overview: The first part contains information on the evaluation of the sample and includes some information obtained from the Spec file: The Summary Details section lists the results for all the estimators for the sample, the high and low values accounting and the results for all the estimators for the sample and high/low values together. These results tell us the population is overstated by almost \$4 million as a point estimate. It could be overstated by about as much as \$9 million and understated by about as much as \$1.2 to \$1.3 million according to the confidence intervals for the error. Since the Error Lower Limit and Error Upper Limit are different signs, this sample does not provide conclusive statistical evidence that the population error is significantly different from zero. The sample of size 202 had 28 errors, but only 4 occurred in the last two strata (2 each). As can be seen in the Detailed Calculations below, the population variances were used in place of the sample variances formed from these 2 errors, resulting in much wider confidence intervals. This sample design was overstratified and would have benefited from fewer strata (4 strata at the most) and a larger sample size.

One can quickly rank the estimator performance by Achieved Precision. Smaller numbers are better. All else being equal, the estimator with the smallest achieved precision is usually the best one to use. That being said, all other things usually are not equal. The Achieved Precision is the amount used to determine the width of each side of the confidence intervals. The Most Likely Error provides the point estimate. If negative, the population is overstated and should be adjusted down by the Most Likely Error amount. If positive, the population is understated and should be adjusted up according to the Most Likely Error amount. This is performed to generate the Most Likely Audited Amount statistics. The Achieved Precision is applied on both sides of the Most Likely Error and Most Likely Audited Amount estimates to produce Upper and Lower limits for both when the user selects a Two-Sided evaluation when setting the Confidence Level. If Upper or Lower is chosen, then only the appropriate limits are calculated.

The following paragraphs provide rough descriptions of the estimators.

The Mean per Unit Estimator, or just Mean Estimator, is less efficient than the other estimators. So, it tends to be (a lot) less accurate for the same sample size. The Mean estimator projects based solely on the audited amounts. It works by taking the average of the sample audited amounts and multiplying the average by the population size for that stratum. The Mean estimator is used when a sample must be pulled with no access to recorded amounts.

The Difference Estimator calculates the average sample difference per stratum and multiplies the average by the population size for that stratum. Differences, or errors, are calculated by subtracting the recorded amount from the audited amount (i.e. Audit_Amt – Recorded_Amt). The difference therefore reflects the amount one would have to add to the recorded amount to get the true, audited amount. Being the average, it is an unbiased estimator of the sample differences. The Difference estimator requires a good number of errors per stratum and must have errors in every stratum to be effective. The Combined Ratio and Combined Regression estimators are usually preferred when at least one stratum has no errors.

The Ratio Estimator calculates the sample ratio of the audited amounts divided by the recorded amounts. With the Combined Ratio Estimator only one ratio is calculated from the strata data as the audited amounts and recorded amounts are totalled across the strata. With the Separate Ratio Estimator a ratio of total audit amount divided by total recorded amount is calculated for each stratum. The Combined Ratio estimator is almost always preferred to the Separate Ratio estimator except when the ratios vary widely across strata and the sample has several errors in each stratum.

The Regression Estimator basically performs a simple regression of the audited amounts on the recorded amounts, forming a slightly more flexible ratio estimator. It is usually more efficient than the Ratio Estimator and is preferred when sample sizes are small. Like with the Ratio Estimator, the Combined Regression estimator is almost always preferred to the Separate Regression estimator except when the regression coefficients vary widely across strata and the sample has several errors in each stratum.

The Detailed Calculations provide all the calculations made by stratum to generate the numbers in the Results Overview. The column names are mathematical variable names that closely follow the nomenclature in Statistical Auditing by Donald M. Roberts, enabling an accurate match against the text. Descriptions and totals (if applicable/used/helpful) are provided below the columns.         Questions