## Applied Machine Learning in Audit: Clustering with k-Modes and k-Prototypes

by Kurt B. Johnson

# Introduction

Application of machine learning, a branch of artificial intelligence (AI), and AI itself has been talked about frequently by audit managers the past few years. Many audit analytics professionals preach that applying AI tools is on their to-do list and is what they will be or should be doing in the near future. Professional audit analytics consultants have penned many articles talking about the challenges, issues and theory of applying machine learning in audit. However, the landscape is barren when it comes to actual, applied machine learning methods that work for audit data and deliver compelling reasons to apply it. Take for example the most recent article on machine learning in the *Internal Auditor* August 2019 edition called *Stronger Assurance Through Machine Learning *by Ying-Choong Lee (subscription required – __https://iaonline.theiia.org/2019/Pages/Stronger-Assurance-Through-Machine-Learning.aspx__). The article does a good job of introducing machine learning, its basic concepts, differentiating between supervised and unsupervised learning and highlighting key challenges that make machine learning difficult to apply in practice. Yet, no true way forward is offered about how to make machine learning in audit a reality. Instead, it – like many others – ends by talking about promise. After reading it, an auditor is probably no closer to making machine learning work for them. One is more likely to be discouraged from trying it since no clear reward is offered for the trouble.

The goal of this article is different. This article will demonstrate how ADA can be used to prepare real invoice accounts payable data like that shown below to find outliers using k-Modes and k-Prototypes clustering analysis. The data and scripts, like ADA, will be freely distributed. After reading this, an auditor, now armed with the necessary tools, is likely to be able to make machine learning work for them…and want to.

# The Promise of Machine Learning and AI

Both internal and external audit functions have increased their implementations of rule-based analytics to test controls and identify data irregularities over the past decade in audit. These analytics can be very effective but have one obvious drawback: the rule must be defined ahead of time. One must know what one is looking for before it can be found. The need for precognition of the issue in order to test for it with rule-based analytics means of course that one simply cannot know what they are missing, a form of blissful ignorance. Machine Learning and AI can create rules that you never knew existed. These techniques can see patterns and create grouping that may not be intuitively clear, obvious or even remotely possible from traditional mining of the data. They can create a path to new rules.

A typical example commonly given to demonstrate machine learning in audit is one that identifies payment outliers based on a couple of other numerical features available in payment datasets. By comparing the date and time when an invoice was submitted to the date and time when it was paid, a timing difference can be calculated. In this example, the invoices can be processed by a varying number of approvers, although this rarely occurs in practice. One can see a graph of this situation in Lee’s August 2019 *Internal Auditor* article. That graph is concerned with identifying two almost-$250,000 payments with a low number of approvers very much like the graph shown below (Graph 1).

Graph 1 shows the result of applying k-Means clustering to the provided *K-Means Sample Data *using twenty-five clusters. Twenty of the clusters are colored grey, but the data composing the five clusters that are the most different from the others are highlighted. These clusters represent different levels of outliers in the data.

Another view of the data can be seen in Graph 2 below. One can see the two outliers in dark green near the bottom right hand corner of the graph. There can be two, three, four or five approvers of an invoice. This variance in the number of approvers is unlikely to be seen in practice.

One can see these two outliers at the bottom of the data.

While this analysis does allow one to make pretty pictures, it does not offer a compelling reason to perform it. Indeed, an auditor would likely have audited these two payments as a rule simply because they were the biggest ones. One could easily identify these by simply sorting or summarizing the data.

However, the other four clusters offer some insights that are not immediately obvious from sorting and summarizing. The centroids – the centers, or midpoint averages – of each cluster are shown in Graph 3 below.

Along with this other viewpoint shown in Graph 4, it easy to see that there are payments besides the two largest ones that would be worthy of closer inspection. The highlighted cluster centroids are clearly not part of the grey central group.

Running the supplied script *K-Means Example.py* will generate results data in ADA like the ones shown in Table 3 below. While different cluster numbers may be assigned, the same groupings should occur. This script also shows how easy it is to work with data from ADA in python. Along with a standard installation of Anaconda (__https://www.anaconda.com/distribution/__) with Python 3.7+, one only needs to install fastparquet (__https://fastparquet.readthedocs.io/en/latest/__) to load data and make new data for ADA as demonstrated in the script.

It is easy to load the data:

It is easy to write the data out to be seen in ADA:

After running the script, make sure you go to Project > Refresh in ADA to refresh the File Explorer view so that the output file will appear there.

One can then summarize the output data in ADA as shown below:

The summarized results can be sorted by count from smallest to largest to find the clusters with the outlier items:

# Accounts Payable Clustering with k-Modes and k-Prototypes

The payment outlier example for k-Means can come across as a bit contrived because it is. The approver level situation has already been discussed. Many AP departments do not operate in excellent clockwork fashion. Therefore, the invoice-to-payment processing time can be a misleading figure with weekends, holidays, errors and other irregularities rendering it practically unusable. Furthermore, k-Means only can be used to process continuous numerical data like invoice amounts. Indeed, outside of invoice amounts there is little in an accounts payable or receivable dataset that qualifies for consideration in a k-Means analysis. That is because almost all the data columns in audit datasets are categorical. Vendors, employees, transaction types, the company employees that enter transactions into the AP/ AR systems and even the text descriptions recorded from invoices that describe the service or item paid for all can be considered categorical features in a clustering algorithm.

In 1997 Zhexue Huang introduced the k-Modes and k-Prototypes algorithms as extensions of k-Means that focus on clustering categories while maintaining the k-Means ability to process large datasets. While k-Means works by calculating the best sets of clusters using mathematical distances between data points (which can be done when the data consists of only continuous numbers), k-Modes works by calculating the distance between two data points by simply using the number of equal attributes they contain. k-Prototypes combines k-Means and k-Modes together where the attributes, or features, in the analysis can be both continuous numeric data and categorical data. This situation is called “mixed attributes” and more can be read about it here:

__https://medium.com/datadriveninvestor/k-prototype-in-clustering-mixed-attributes-e6907db91914__

The data used here is provided and called *Invoices Data with Text Categories*. It is shown in Table 1. It represents an example of AP for a construction company. The categorical features that come ready-made for the analysis are the following columns: COMPANY_CODE, VENDOR_ID and DOC_CREATOR. We will focus on the Vendor ID and Doc Creators in this article. The Doc Creators are the identifiers of the company employees responsible for the transactions in the AP system.

The INV_TEXT column data is too diverse to make for a good categorical data column that could serve as a feature in categorical clustering. It mostly consists of innocuous text describing standard items needed by a construction company: dumpsters, tape, rocks, hauling services, port-a-potties, etc. Nevertheless, a column that can work as a feature input for clustering is provided as the right-most column in the data as shown in Table 5. The categories were created by employing *fuzzy duplicates* in ADA using an 80% similarity threshold to group items with similar text descriptions into categories. The original data only contains the columns up through the invoice posting date. The Appendix at the end of this article will illustrate how fuzzy duplicates was used to create and add the last three columns.

In addition to the full dataset of 10,000 records, another dataset called Dups Only is provided consisting of the 6,029 records that have an actual text category assigned. In other words, Dups Only contains the items with TEXT_CATEGORY not equal to FKDCOMBO.

Two scripts are provided to run the analysis: *K-Modes Example.py* and *K-Prototypes Example.py*. We will focus on the k-Modes analysis first and then do k-Prototypes. One can install the *kmodes* python package (__https://github.com/nicodv/kmodes__) in standard fashion. It is not part of the standard Anaconda installation.

In his paper *A Fast Clustering Algorithm to Cluster Very Large Categorical Datasets in Data Mining*, Huang (1997) introduced a dissimilarity measure called the *chi-square distance*. Instead of just counting the number of dissimilarities between two records, it weights the dissimilarity amount based on the count frequency of the category values in the overall dataset. Huang states that “This dissimilarity measure gives more importance to rare categories than frequent ones” and “is useful in discovering under-represented object clusters such as fraudulent claims in insurance databases.” This stands in contrast with the standard dissimilarity measurement employed by the *kmodes* python package, which gives equal importance to each category of an attribute. Given its relevance to audit analysis, the *chi-square distance *dissimilarity measurement has been added to both provided scripts. One can employ it by setting the *useChiSquare* variable equal to True in the input section near the top of each script.

## k-Modes: Exploring Vendor-Doc Creator Attribute Combinations

The *K-Modes Example.py* script is employed in this case because the columns involved – also called features – are entirely categorical. Here the Vendor ID and Doc Creator columns are the attributes. Each vendor is a category. Each document creator is a category. In this case, we will analyze all 10,000 records. The *columnsToAnalyze* variable is set to [1, 2] – which selects the Vendor ID and Doc Creator columns – and *useChiSquare* is set to False. By doing so, one-time and small volume vendors do not receive greater weight in the analysis. The same goes for the people creating the invoice documents in the system. The clustering is not quite as good if the chi-square distance is used in this analysis; it produces lower silhouette scores.

Like with the *K-Means Example.py* script, the program will return the original input data with a new column called “Cluster” with the assigned cluster number.

Like with k-Means, one can then summarize by cluster:

One can then see that while all the other attribute combinations were able to be clustered together into large clusters, one cluster has only 8 items in it.

A quick glance at the payment invoice amounts, which were not involved in the clustering analysis, shows that something fishy is going on with port-a-potties that should be audited for correctness.

The algorithm calculated 8 cluster centroids. One centroid consists of only the 8 VID682-DC1 transactions. The other 7 centroids are from the top-8 most frequent vendor-creator combinations. From the first 8 combinations in Table 9, all are centroids except for the VID250-DC1 combination. The clustering algorithm chose VID682-DC1 instead.

## k-Modes: Exploring Vendor-Text Attribute Combinations

The *K-Modes Example.py* script is again employed in this case because the columns involved are entirely categorical. Here the Vendor ID and Text Category columns are the attributes. Each vendor is a category. Each text category is a category. In this case, we will analyze only the 6,029 records. One can include all 10,000 records, but the non-duplicate records indicated by the value FDKCOMBO tend to dominate the cluster centroids. The *columnsToAnalyze* variable is set to [1, 12] – which selects the Vendor ID and Text Category columns – and *useChiSquare* is set to True. The rarer text categories will receive more weight in the analysis as a result. The clustering is better if the chi-square distance is used in this analysis; it produces higher silhouette scores.

Summarizing the data in Table 10 reveals almost a handful of clusters that could be audited.

As shown in Table 11, the transactions in these four clusters display different prices for the same items from different vendors, an odd debit/credit transaction and different prices for the same amount of port-a-potties from the same vendor.

Table 12 shows the 10 cluster centroids for the analysis. These four vendor-text category combinations are the cluster centroids for the clusters in Table 11: VID785-FDK583, VID146-FDK790, VID63-FDK1494 and VID693-FDK13. One would want to take a hard look at these vendors and the kinds of materials they provide to the company. One can see that the centroids in general do not all come from the combinations with the most occurrences; only two ranked in the top 10.

For reference purposes, Table 13 shows the summary statistics for the Top 15 Vendor-Text Category combinations in the Dups Only dataset of 6,029 records.

## k-Prototypes: Exploring Amount-Vendor-Text Attribute Combinations

With k-Prototypes we can extend the analysis for Vendor-Text Category attributes to include the Invoice Amount. The *K-Prototypes Example.py* script is employed in this case because some of the columns involved – the Vendor ID and Text Category – are categorical and another column – Invoice Amount – is a continuous numeric. Hence, these are mixed attributes. Each vendor is a category. Each text category is a category. In this case, we will analyze only the 6,029 records. One can include all 10,000 records, but the non-duplicate records indicated by the value FDKCOMBO tend to dominate the cluster centroids. The *columnsToAnalyze* variable is set to [1, 3, 12] – which selects the Vendor ID, Invoice Amount and Text Category columns – and *useChiSquare* is set to True. The rarer text categories will receive more weight in the analysis as a result. The clustering is better if the chi-square distance is used in this analysis; it produces higher silhouette scores.

Summarizing the results in Table 14 illustrates that there are 3 clusters of interest that could be audited more closely.

Again, in Table 15 we see strange variances in the invoice amounts for plywood, replacing tires and wackers. I have no idea what a wacker is.

Table 16 shows the 12 cluster centroids for the analysis. The Invoice Amount centroid averages define midpoints between the various invoice amounts contained in their respective cluster and this is calculated using Euclidean distance between the points. So, for Cluster 1 256.76 represents that midpoint between the values 589.65 and 115.00. One can see that this is not strictly an average. These three vendor-text category combinations are the categorical cluster centroids for the clusters in Table 15: VID665-FDK1073, VID264-FDK1181 and VID505-FDK1492.

# Additional Script Considerations

The *K-Means Example.py* script runs a fairly standard implementation of k-Means. It loops through a list of possible cluster amounts (i.e. *clusterList*) and chooses the number that generates the highest silhouette score. It also contains the code for the graphs.

The *K-Modes Example.py* script and the *K-Prototypes Example.py *script both loop for a number of iterations. Each iteration uses a new random seed number to initialize the k-Modes clustering algorithm if you use Cao, Huang or random as the initialization method. The purpose is to find the initialization that does the best job of clustering over all the cluster amounts in the list based on the silhouette scores. This is done by averaging the silhouette scores for an iteration run for each cluster number in the *clusterList*. With each iteration, both scripts will – like *K-Means Example.py* – loop through the list of possible clusters. The cluster amount that wins out is the one with the highest silhouette score from the iteration with the highest average of its silhouette scores. If the Cao, Huang and random initializations all fail to initialize, one may consider initializing manually. Setting *initializationChoice *as manual will result in the program initializing the centroids to the attribute combinations that occur most frequently in the data.

The *K-Prototypes Example.py* script requires two additional considerations. The dissimilarity measure for the k-Means portion is calculated using Euclidean distances for the numerical features. The dissimilarity measure for the k-Modes portion is calculated by either counting the number of dissimilarities (the default) or by weighting the counts by weight values less than one that reduce the amounts further (the chi-square distance). These two measures are added together. Because of the nature of the k-Modes dissimilarity measure, I have found the k-Prototypes algorithm works best if the numerical features are subjected to Box-Cox transformations. By transforming to a standard normal, the Euclidean distances are better suited to being added to the k-Modes dissimilarity measures. Otherwise, the numerical features tend to dominate the clustering. The k-Prototypes algorithm does have the *gamma* input that acts to weight how much the k-Modes dissimilarity impacts the overall dissimilarity measurement. However, I found that it did not do enough achieve effective results without the Box-Cox transformations. Huang (1997) states that he thinks the value for *gamma* should be between one-third and two-thirds of the standard deviation of the numerical features, but provides no logical justification for this, leaving this as a future area of research. In practice, one can try different values for the *gammaRatio* input. The python *k-modes* package uses 0.5 as the default. Values at or near 1.0 will emphasize the categorical features more. Values closer to zero will emphasize the numerical features more.

## Silhouette Scores in Clustering Analysis

Silhouette scores range in value from -1 to +1. Larger values indicate better clustering in general. Wikipedia provides a good discussion here: __https://en.wikipedia.org/wiki/Silhouette_(clustering)__.

There has been some debate about how to apply the silhouette scores to determine the optimal number of clusters for k-Modes. These issues are discussed here (__https://github.com/nicodv/kmodes/issues/8__) and here (__https://github.com/nicodv/kmodes/issues/46__). There have been some suggestions that the silhouette score calculations should be modified to use modes instead of averages. I believe this is a bad approach for two reasons. First, it is important to understand that the calculation of silhouette scores is independent of the method used to perform the clustering. Just because you use modes to perform your clustering does not mean you should use modes to calculate silhouette scores. Second, the silhouette score employs the average of the silhouette values for each of the records. This effectively weights each silhouette value equally (i.e. by 1/n). Using a mode would only count the silhouette scores of the records with the most common occurrence. Using a mode means you would not be considering the impact of the clustering on all the data points. Since all the data points are important, the average is clearly the better metric.

To calculate the silhouette score for the k-Modes algorithm, the Hamming distance (__https://en.wikipedia.org/wiki/Hamming_distance__) provides an appropriate measure. As the Wikipedia page on Silhouette (clustering) states, “the silhouette can be calculated with any distance metric…”. One simply must find the most appropriate one. For the numerical features employed by k-Means, the Euclidean distance is a good distance to use. The Hamming distance calculates the distance using the percentage of items that are dissimilar. This is actually quite similar to the approach used in k-Modes and thus provides the best measure to use in calculating the silhouette scores. To combine the silhouette scores from the k-Means and k-Modes algorithms, an average is computed so that each algorithm gets equal weight, which works well when employing Box-Cox transformations.

While silhouette scores are an important consideration when clustering, they do not represent the be-all and end-all of what number of clusters is truly best. Indeed, in audit the objective is not necessarily to come up with the set of clusters that achieve the best score, but to come up with the set of clusters that can potentially identify outliers and oddball records in the data. A higher number of clusters should be considered even if they generate lower silhouette scores. When there are more clusters, the sizes of the clusters will be smaller. An auditor will need to obtain clusters with sizes that are feasible to audit. Clusters with more than 20 records are not likely to meet this objective. For example, perhaps using 5 clusters represents the best clustering based on the silhouette scores. However, these 5 clusters all have 100+ records in them. In this case it would be better to look at using 10+ clusters and find an amount for *k* that yields clusters with small sizes so those outliers can be identified and audited.

# Concluding Summary

This article motivates the use of AI machine learning in the world of audit data analytics. While supervised learning is unlikely to ever be a reality, unsupervised techniques like clustering hold more than promise; they can and should be employed by auditors on at least a once-a-year routine basis to find new rules, outlier records outside the purview offered by rule-based analytics. Able to cluster categorical data, the k-Modes and k-Prototypes algorithms can be used to analyze many audit datasets in the areas of general ledger, AP, AR, purchase card expenditures, payroll and travel/entertainment expenses. The ADA python environment along with the included scripts and sample data enable the practicing audit data analyst to hit the ground running in the area of applied machine learning in audit. This article also demonstrates how fuzzy duplicates can be used to create categories for a column that is not already categorized. The scripts also provide the *chi-square distance* dissimilarity measurement that improves the performance of k-Modes in many audit applications.

It is important to recognize what AI machine learning is not. It is not a tool to be used in a continuous monitoring environment. It is a tool to use occasionally for exploratory analysis with the hope of gaining a few key insights into fraudulent or error-ridden activities the auditor may not have identified as an area of risk. Given privacy rights and the lack of in-depth data on employee financials, auditors do not have access to the kind of data necessary to make AI a powerful force in audit. While it should be a tool utilized by all auditors, it is not poised to reshape audit. It is an extremely useful tool for identifying oddities one would not have thought to test for, and an auditor should strongly consider augmenting their capabilities to include it in their regular data analytics program.

Indeed, probably the best use of machine learning in audit is to augment and enhance rule-based analytics. When the application of fuzzy duplicates renders too many items to audit, following up with an application of clustering to narrow down the results to outlier clusters which can be audited would be advantageous. As another example, a rule-based analytic that filters out purchase order, invoice or payment transactions below approval thresholds will typically yield an un-auditable number of items. A subsequent application of k-Modes categorical clustering would be able to isolate auditable clusters/subsets from that initial application of Filter.

# References

Huang Z. (1997): *Clustering Large Data Sets with Mixed Numeric and Categorical Values*, In: *KDD: Techniques and Applications *(H.Lu, H. Motoda and H. Luu, Eds.). – Singapore: World Scientific, pp. 21-34.

Huang Z. (1997): *A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining*, In: *Proc. SIGMOD Workshop Research Issues on Data Mining and Knowledge Discovery*, pp. 1-8.

Huang Z. (1998): *Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values*, In: *Data Mining and Knowledge Discovery*, vol. 2, no. 3, pp. 283-304.

Cao F., Liang J. and Bai L.: *A New Initialization Method for Categorical Data Clustering*, In: *Expert Systems with Applications* 36(7), pp. 10223-10228.

# Appendix: Categories from Invoice Text Descriptions

The invoice text descriptions provided in accounting data are not suitable for categorical clustering with k-Modes and k-Prototypes as-is. We use fuzzy duplicates to lump similar text descriptions together into categories. In this way, the fuzzy duplicates technique is not performed for the sake of being a test itself, but rather to prepare the data for the actual test: a k-Modes or k-Prototypes implementation. This appendix will show how the TEXT_CATEGORY column was created.

A unique identifier must exist in the original dataset. If not, one must be created. The unique identifier for the provided test data is the invoice number, column INV_NUMBER. The unique identifier does not have to be one single column like in this example. In most accounting data a collection of 2-3 columns, sometimes involving the invoice number, usually make up the unique identifier.

First, fuzzy duplicates is run on the invoice text column using a 80% match threshold with the option to include exact duplicates. We use 3 for the exact match constraint to keep the number of comparisons small. The first three characters of the INV_TEXT will have to match exactly for two records to be compared.

This creates *Fuzzy Dups Inv Text*. It will have a column called *FuzzyDupKey*. Each value of *FuzzyDupKey* is a label for a text category.

Next, we sort the resulting data by INV_NUMBER (ascending) and Similarity (descending). This creates *Duplicates Sorted by Inv No and Similarity*.

Then we dedupe the data, keeping the first occurrence. We do this based on the INV_NUMBER unique identifier field. We sort and dedupe the data because an invoice number can now be part of multiple categories. We want to find the best category match based on the similarity for each invoice number that is part of a *FuzzyDupKey* category.

Next, we need a shorthand for the *FuzzyDupKey*. So, we summarize by *FuzzyDupKey* as shown in Dialog 6. This creates *List of FuzzyDupKeys*.

We then use Column Management to add a column called FDKSHORT to *List of FuzzyDupKeys*. By double-clicking in the Formula section, you can bring up the Criteria Editor. The formula “FDK” + STR(ilevel_0) will work. In the next version of ADA, one can use the function RECNUM() in place of ilevel_0 in the formula.

Now we add the FDKSHORT column onto the *Deduped Invoices*. This creates the file *Add FDKShort*.

The newly created *Add FDKShort* can be joined back against the original data using the INV_NUMBER unique identifier. For the Right Data File, we only want to include two columns: the FUZZYDUPKEY as shown highlighted in Dialog 9 and the FDKSHORT as shown highlighted in Dialog 10. This creates *Invoice Data with Text Categories*.

The FDKSHORT column cannot be used in the analysis because it contains None for some records. So, we make a new column called TEXT_CATEGORY using Column Management and employ the formula “FDKCOMBO” if FDKSHORT is None else FDKSHORT.

Ty for sharing about AI..

This article motivates the use of AI machine learning in the world of audit data analytics. While supervised learning is unlikely to ever be a reality, unsupervised techniques like clustering hold more than promise; they can and should be employed by auditors on at least a once-a-year routine basis to find new rules, outlier records outside the purview offered by rule-based analytics.