How to Detect Duplicates with ADA

Introduction

ADA (Audit Data Assistant) allows you to detect duplicates in your data.

Formats Supported by Duplicates

ADA’s Duplicates utility allows you to filter Parquet data. Parquet is the format ADA uses for storing data files.

Using the Detect Duplicates Dialog

Duplicate Columns. Column(s) that will be checked for exact duplicate values across rows. At least one column is required for every Duplicate Type chosen except for Fuzzy Duplicates. Rows that have the same values for all the Duplicate Columns are considered to form Duplicate Column Groups. A row that has a unique combination of values for the Duplicate Columns specified is not considered to have a duplicate and cannot form a Duplicate Column Group.

Difference Columns. Column(s) where values across rows must be different within each Duplicate Column Group. Only one column is allowed for Duplicate Difference Date Range and Fuzzy Duplicates.

Duplicate Type. You have many options for the type of duplicate detection you perform on your data sets. Here are the types of duplicates that can be detected and the kind of results you can expect from each.

All Unique Records. This type eliminates all records that have a duplicate and only returns unique records. Output records cannot form a Duplicate Column Group with the Duplicate Columns specified.

Required Inputs: Duplicate Column(s) and Output Name

Deduplicate Keep First. Deduplicates records and keeps the first occurrence of the Duplicate Column values in the data. Output records include the first record from each Duplicate Column Group plus All Unique Records.

Required Inputs: Duplicate Column(s) and Output Name

Deduplicate Keep Last. Deduplicates records and keeps last occurrence of Duplicate Column values in the data. Output records include the last record from each Duplicate Column Group plus All Unique Records.

Required Inputs: Duplicate Column(s) and Output Name

Exact Duplicates. Returns only exact matches of the Duplicate Column values. Output contains all records that form a Duplicate Column Group given the Duplicate Columns specified. Returns all Duplicate Column Groups.

Required Inputs: Duplicate Column(s) and Output Name

Duplicate Difference Unique Values. Only Duplicate Column Groups with all unique values for the Difference Column(s) are returned.

Required Inputs: Duplicate Column(s), Difference Column(s) and Output Name

Duplicate Difference First Values. Returns all Duplicate Column Groups that contain at least two different values for the Difference Column(s) but pares down the output so that each Duplicate Column Group contains only the first instance of each Difference Column value. Equivalent to running Exact Duplicates to obtain all Duplicate Column Groups and then running Deduplicate Keep First on each group separately using the Difference Column(s).

Required Inputs: Duplicate Column(s), Difference Column(s) and Output Name

Duplicate Difference Last Values. Returns all Duplicate Column Groups that contain at least two different values for the Difference Column(s) but pares down the output so that each Duplicate Column Group contains only the last instance of each Difference Column value. Equivalent to running Exact Duplicates to obtain all Duplicate Column Groups and then running Deduplicate Keep Last on each group separately using the Difference Column(s).

Required Inputs: Duplicate Column(s), Difference Column(s) and Output Name

Duplicate Difference All Values. Returns all Duplicate Column Groups that contain at least two different values for the Difference Column(s).

Required Inputs: Duplicate Column(s), Difference Column(s) and Output Name

Duplicate Difference Date Range. Returns records from Duplicate Column Groups where the date values for the Difference Column are inclusively within the range of days set using the Threshold. See the Threshold section for more information. Only ONE column can be chosen for the Difference Column.

Required Inputs: Duplicate Column(s), Difference Column (must be a Datetime field), Threshold and Output Name

Fuzzy Duplicates. Returns rows where Difference Column values are inclusively greater in similarity than the Threshold percentage. See the Threshold section for more information. If Duplicate Columns are specified, the fuzzy duplicate matches will be calculated from the resulting Duplicate Column Groups. Only ONE column can be chosen for the Difference Column.

Required Inputs: Difference Column, Threshold and Output Name

First Character Exact Match Constraint. This setting is specifically for Fuzzy Duplicates. The default here is None, which means that every possible row is investigated for a fuzzy duplicate. However, you can choose to further narrow the rows investigated by initially selecting rows where only the first 1, 2, or 3 characters match exactly for the Difference Column. For example, if 1 is selected the words feather and farther would be considered for a potential match because their first characters match. The words feather and heather would not be considered for a potential match, even though they have more characters in common, because their first characters do not match. Use of values other than None could result in false negatives where fuzzy duplicates that should have been identified are simply missed because the first 1, 2 or 3 characters do not match. This setting is particularly useful if you experience memory errors or you wish to improve performance lag caused by working with very large data sets.

Threshold. This setting is for Fuzzy Duplicates and Duplicate Difference Date Range.

For Fuzzy Duplicates, the Threshold is the minimum percentage of match of duplicate values. For example, meat and mea1 are a 75% match because three out of their four letters match. If you select a threshold of 75%, they will be within the threshold and will be selected as a fuzzy match. However, if you specify a threshold of 80%, they will not be selected because they will fall below the threshold. To specify your desired threshold for Fuzzy Duplicates, enter the number as a decimal between 0 and 1. For the above example, to ensure that the threshold is low enough for meat and meal to be a match at 75%, you would enter 0.75. Damerau-Levenshtein distance similarity values greater than or equal to the Threshold are considered matches.

Specify a non-negative integer for Duplicate Difference Date Range. If two dates are within the Threshold number of days, they are considered a match. For example, the dates January 1, 2018 and January 15, 2018 represent 14 days of difference. They would be considered a match for Thresholds of 14 and higher. Thresholds of 13 and lower would not result in a match.

Exclude exact matches. This setting is specifically for Fuzzy Duplicates. By default, your resulting data set will include both exact and fuzzy duplicates. You can exclude the exact duplicates and only return fuzzy duplicates by checking the box.

Case-sensitive. This setting is specifically for Fuzzy Duplicates. You can specify that your data is case-sensitive, and the fuzzy duplicates process will treat the same values as different values if the capitalization is different. For instance, P.O. Box would not be the same as p.o. box.

Output Name. Name your duplicates data file. It will be saved in the project folder with a .parquet extension. Here, the output file will be named Duplicate Vendors.parquet.

Questions

If you have questions about ADA software or you would like to know about purchasing custom ADA analytics, wonderful! Please call us at (864) 625 – 2524, and we’ll be happy to help.