How to Join Data with ADA

Introduction

ADA (Audit Data Assistant) provides a utility for joining two data files together using common key columns. Many times, the desired result is a dataset with the columns from both data files together side by side. Sometimes, joins are used to determine if key column values from one dataset exist in the other.

Using the Join Dialog

Join’s dialog lets you select the files you’d like to join together, set options for how to perform the join as well as specify a name for the output file:

Join Dialog

Left and Right Key Columns

These are columns that both data sets have in common and on which the data will be joined. The values from these columns must match exactly for the comparison to be considered a match. An exact match of all the values for the Left Key Columns and the Right Key Columns will result in a linkage of the rows associated with those values from both the Left Data File and the Right Data File if using the Left Right Matches option. For example, if you wished to combine employee data with payroll data to get a complete file of information about each employee (name, SSN, etc.) with related information about each of the employee’s paychecks, you would most likely join the data files on a field like employee ID. If the user does not select a column from the Left Key Columns list, does not select a column from the Right Key Columns list or selects no column from either list, then a combinatorial join, also known as a cross join, will be performed. The result will contain all the combinations from all the records in both data files. For example, conducting a Left Right Matches combinatorial join on a Left Data File of 100 records and a Right Data File of 20 records will result in a data file of 2,000 records.

As you check the boxes to select columns, those columns will be moved to the top of the list. The order matters for the checked columns at the top of the list. The key columns will be ordered left to right in the output dataset as they are ordered top to bottom in the Left Key Columns and Right Key Columns lists. You want the columns you select in the Left Key Columns list to be in an order that corresponds with the columns you select for the Right Key Columns. Starting from the top, the first checked column listed in Left Key Columns will be matched with the first checked column listed in Right Key Columns. The second checked column in each will be matched and similarly for all other checked columns. You can hold down the left mouse button and perform click and drag to reorder the columns in the list.

Left and Right Included Columns

By default, all of the columns in your data set are selected. If there is a column you don’t wish to include in the joined data, simply click that column to deselect it. Accidentally deselected a column that you actually want to include in the joined data? Click it again to select. To select or deselect a block of columns, click one and drag to highlight the rest. At least one column from the Left Data File or the Right Data File must be selected.

Sort by Keys (Asc)

This option allows you to sort the data by the key columns you selected in ascending order in the resulting output.

Ignore Warnings

With certain types of joins, you may experience data loss, also known as attrition, when not using a many-to-many join because of duplicate key value combinations for the key columns chosen for both the left and right datasets. For example, if Vendor Master has three duplicate vendor names with different vendor IDs and it is being joined to AP Detail as the right data file in a many-to-one join by vendor name, it will produce only the first instance of the vendor IDs that matched rather than the three that could have matched if a many-to-many match had been performed. ADA will warn you when this happens and give you an explanation of the attrition, but you can turn off these warning messages if you choose.

Join Type

You have many options for the type of join you perform on your data sets. Here are the types of joins and the kind of results you can expect from each.

Left Right Matches. Only records where the left and right key column values match exactly will be included. For example, to match AP data with PO data and only match the invoices to those that have a purchase order, Left Right Matches is the option to use. Invoices without a PO will not be included in the output.

SQL: Inner Join

Left No Right. Records from the Left Data File where the values for the Left Key Column(s) have no corresponding match to the values of the Right Key Column(s). Typically, only the columns from the Left Data File will be desired as the columns from the Right Data File will be blank (NaNs). If the AP invoices data is the Left Data File, to find all the AP invoices with no POs, Left No Right is the option to use.

SQL: Left Join If Null

Right No Left. Records from the Rightt Data File where the values for the Right Key Column(s) have no corresponding match to the values of the Left Key Column(s). Typically, only the columns from the Right Data File will be desired as the columns from the Leftt Data File will be blank (NaNs). If AP invoices data is again the Left Data File, to find the POs without an AP invoice, Right No Left is the option to use.

SQL: Right Join If Null

All in Left. All the records from Left Data File are included in the result. Records from the Left Data File with corresponding value matches between the Left Key Columns and Right Key Columns will contain the values found in the Right Data File for any column chosen in the Right Included Columns list. Records from the Left Data File with no corresponding value matches between the Left Key Columns and Right Key Columns will have blanks (NaNs) for the Right Included Columns. To get all AP invoices along with the PO information, whether a PO exists or not, in the result, then All in Left is the option to use if AP invoices data is the Left Data File and PO is the Right Data File.

SQL: Left Join

All in Right. All the records from Right Data File are included in the result. Records from the Right Data File with corresponding value matches between the Left Key Columns and Right Key Columns will contain the values found in the Leftt Data File for any column chosen in the Left Included Columns list. Records from the Right Data File with no corresponding value matches between the Left Key Columns and Right Key Columns will have blanks (NaNs) for the Left Included Columns. To get all POs with the AP invoices data, whether an invoice exists or not, in the result, then All in Right is the option to use if AP invoices data is the Left Data File and PO is the Right Data File.

SQL: Right Join

All in Both. All records from both files are included regardless of match. This is the same as performing Left Right Matches, Left No Right and Right No Left successively on the same data and then appending the results of all three together.

SQL: Full Outer Join

Fuzzy Join. Matches left and right keys that are a close match but not necessarily an exact one. First, if Left Key Columns and Right Key Columns are specified (they are optional), exact matches using the columns specified in Left Key Columns and Right Key Columns are found by performing a Left Right Matches join. Otherwise, all pairwise combinations of the two files (a combinatorial join) are used. Then, close matches are found for the columns specified in the Fuzzy Options dialog.

Fuzzy Options. If you select Fuzzy Join from the Join Type dropdown, the Fuzzy Options button will become available. When you click this button you’ll be presented with the Fuzzy Options dialog:

Left Fuzzy Column and Right Fuzzy Column. Choose the column from the Left Data File to compare in fuzzy fashion to the column from the Right Data File. These are the key columns for the fuzzy match. Note that only character fields can be used for fuzzy matches. You can only pick one column from each list. If you need to involve more than one column, first use Column Management to create a new column and use string concatenation with the plus (+) operator to combine the columns together to form one column. Do this for both the left and right data files.

First Character Exact Match Constraint. The default here is None, which means that every possible combination of rows is investigated for a fuzzy match after performing any exact match join specified using the Left Key Columns and Right Key Columns. However, you can choose to further narrow the combination of rows investigated by initially selecting rows where only the first 1, 2, or 3 characters match exactly for the Left Fuzzy Column and Right Fuzzy Column. For example, if 1 is selected the words feather and farther would be considered for a potential match because their first characters match. The words feather and heather would not be considered for a potential match, even though they have more characters in common, because their first characters do not match. Use of values other than None could result in false negatives where fuzzy matches that should have been identified are simply missed because the first 1, 2 or 3 characters do not match. This setting is particularly useful if you experience memory errors or you wish to improve performance lag caused by working with very large data sets.

Threshold. The threshold is the minimum percentage of match. For example, meat and mea1 are a 75% match because three out of their four letters match. If you select a threshold of 75%, they will be within the threshold and will be selected as a fuzzy match. However, if you specify a threshold of 80%, they will not be selected because they will fall below the threshold. To specify your desired threshold, enter the number as a decimal between 0 and 1. For the above example, to ensure that the threshold is low enough for meat and meal to be a match at 75%, you would enter 0.75. Damerau-Levenshtein distance similarity values greater than or equal to the Threshold are considered matches.

Exclude Exact Matches. By default, your resulting data set will include both exact and fuzzy matches. You can exclude the exact matches and only return fuzzy matches.

Case-sensitive. You can specify that your data is case-sensitive, and the fuzzy matching process will treat the same values as different values if the capitalization is different. For instance, P.O. Box would not be the same as p.o. box.

Relationship Type. The relationship type tells join how to handle duplicate key column value combinations in the Left Data File and Right Data File. If the columns specified as key columns in both the left and right data files do not have any duplicate values across their rows, then the Relationship Type chosen is immaterial and will not affect the join outcome. However, if you are trying to join Employee Master to Payroll payments where there are duplicate employees in the Employee Master and employees occur multiple times in Payroll due to monthly or bi-monthly salary payments, then the Relationship Type will matter. Let’s say the Employee Master has two records with the same employee ID and employee name, but with different addresses. Since employees receive multiple payroll checks, the employee ID occurs three times in the Payroll data. You perform a join using the employee ID in the Employee Master and the employee ID in the Payroll data. Here is an explanation of the relationship types using this example so you can choose the right one for your join:

Many-to-Many. Results provide all the possible combinations of matches for the values contained in the Left Key Columns and Right Key Columns. For the employee example, the result will contain six matches. The three payroll check rows will be matched with the first employee ID listed and then all three matched again with the second employee ID. Another example of when this would be useful is when comparing addresses in Vendor Master with addresses in Employee Master.

Many-to-One. Results will keep all the duplicate records for the key columns in the Left Data File, but only the first observation of the matches from the Right Data File will be included. For the employee example, the result will contain two matches. The two duplicate employee records with different addresses will be matched with only the first payroll check record. This Relationship Type is useful when joining AP transactions as the left file to Vendor Master detail as the right file when either knowing there are no duplicates in the vendor master or desiring to keep the number of AP transactions static as usage of many-to-many match would create false duplicates of the AP transactions if duplicate vendor master records exist.

One-to-many. This is the reverse of Many-to-One. Results will keep all the duplicate records for the key columns in the Right Data File, but only the first observation of the matches from the Left Data File will be included. For the employee example, the result will contain three matches. Only the first of the two employee records will be matched with the three payroll check records for the employee ID.

One-to-one. Only one record in your left data file will match only one record in your right data file. For the employee example, the result will contain one match. The first of the two employee records will be matched with the first payroll check record for the employee ID. This relationship is useful for comparing unique key data with summarized data. If you wished to join Vendor Master on VendorID with AP data summarized by VendorID, one-to-one would be the correct relationship to choose.

Output Name. Name your joined data file. It will be saved in the project folder with a .parquet extension. Here, the output file will be named Vendor Payments 2017.parquet.

Formats Supported by Join

ADA’s Join utility allows you to join Parquet data. Parquet is the format ADA uses for storing data files.

Questions

If you have questions about ADA software or you would like to know about purchasing custom ADA analytics, wonderful! Please call us at (864) 625 – 2524, and we’ll be happy to help.