4 Ways to Compare Data Sets in Excel
Comparing two data sets in Excel is quite easy using Power Query and the Merge operation. The Merge operation can perform inner and outer join operations on your data. This article will explain the different types of operations.
I will use the list of countries from Wikipedia. the link to the exact page is provided below. I will show how to import this data using Excel tools. A second list of countries will be used from the WorldMeter site which we will also download using Excel using the link below.
Get the Data
Let’s start with the Wikipedia list. Follow these steps to retrieve the list from the Wikipedia page
- From Excel Data tab, select “From Web” in the “Get & Transform Data” group
- An import panel will display on screen with two options “Basic” and “Advanced” (figure 1). Leave it at “Basic” and paste the following Wikipedia URL in the field and click “OK”: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
- Next, you will need to select the appropriate access for this site. Since it is Wikipedia and a public website, we can "Anonymous” web access (figure 2) and from the dropdown field, select the top level, www.wikipedia.org and connect to the website
- In the next panel (or page), Excel will present the different sections of the web page. For our purposes, we need to select “Sovereign States and Dependencies by Population” which is the table title (figure 3). Here, you have two options: either “Load” or “Transform”. If you want to load the data as is into Excel, select the first option, otherwise, select “Transform”, which we need to do as we need to delete columns that are not needed.
- Once the data is loaded in the Power Query editor, select all the columns (ctrl+click) except the country name column. Then click “Remove Columns” and finally “Close and Load To” (figure 4) to transfer the focus back to Excel. In Excel, a panel will offer the option to create a table, pivot table, a connection as well as where to display the results. Opt to create a connection only. This will create a connection with the source, which is the Wikipedia page in this example. What is great about this option is that you can continuously refresh the data from the source as needed. Of course, in this example, it is not really needed as the world is not actively creating new countries every other week. Once the connection is created, it will appear in the side panel on the right. You can rename it as needed. It won't affect the connection with the source.
Now repeat the same operation for the WorldMeter country list: https://www.worldometers.info/geography/alphabetical-list-of-countries/. To import only the list of countries, use the “Table 0” selection and select “Transform” as per the previous. Remove all the necessary columns, keeping only the column with the country names.
Select the "Close and Load To" option to transfer the data set back to Excel, creating a Connection as before. Once both lists of countries have been imported into Power Query and the connections have been created, we will perform a series of Merge operations. The objective is to list the names of the countries according to the Merge operation we use. There are six types of merge operations that can be performed on the data sets.
- Left Outer (All from first, matching from second)
- Right Outer (All from second, matching from first)
- Full Outer (All matching from both)
- Inner (only matching rows)
- Left Anti (Rows only from first)
- Right Anti (Rows only from second)
I will compare the data sets using each of the previous Merge operations, loading the results in a separate Excel Worksheet.
Left Outer (All from first, matching from second)
This type will keep all the items in the first list (left side) and those items in the second list that are found in the first. To perform the Merge operation follow these instructions (figure 6):
- Right-click on the Wikipedia connection in the right panel and select Merge
- A panel will appear with the Wikipedia data set already selected
- Select the WorldMeter data set from the second list
- Select the Left Outer from the Join Kind field
- Optional select the "Use fuzzy matching...." when comparing. If you opt for this option, you will need to expand the Fuzzy search option section.
- At the bottom a preview of the merge operation list the number of matches.
- Click "Ok" to perform the operation
Once the merge is completed, the Power Query will appear with the data from both list. On the left is the Wikipedia country list. On the right is a column with the value of "Table" (figure 7 above). This is because the resulting will contain the collection of columns from the second table that can be expanded into the first table. Click on the icon with two arrays next to the column header (figure 8) which will reveal all the columns in the table (figure 7). In our example, there is only one column "Country" (figure 8). Once expanded, the list countries from the WorldMeter will be displayed with the missing values (figure 9).
Select "Close and Load" to transfer the data back to Excel.
Right Outer (All from second, matching from first)
The Compare operations will select all rows from the second list and those items matching in the first list. Performing this merge is the same as the previous; right-click the Wikipedia connection and select Merge from the shortcut menu. The Wikipedia list will already be selected in the Merge panel. Select the WorldMeter as the second list and the Right Outer join type. The result count will be displayed at the bottom. Click ok to perform the operation (figure 10).
As I mentioned before, the second column of data will be displayed as "Table". You need to expand the selection and select the available column to display. In our case, "Countries" is the only choice. You will notice the dataset is different from the last with the data on the right showing the complete list and the only the ones in the first column that match the outer column.
Click "Close and Load to" to transfer the merged data back to Excel. This time, select to copy the resulting data set into an Excel Worksheet
Full Outer (All matching from both)
This operation will select all rows from both list. This is like a Cross Join in SQL. All countries from both list will be merged together. To perform this merge operation, select the Merge command from the context menu. In the Merge interface select both data sets like before. In the Join Type field, select the Full Outer type. Perform the Merge operation (figure 13).
Inner (only matching rows)
To select only rows unique to both lists, use this option. This merge operation will compare data in both list and return only the values that are present in both list. Perform in the like manner as the others, except choose the Inner (only matching rows).
Once done, return the filtered data from both list to a new worksheet in excel.
Left Anti (Rows only from first)
This Merge option will return a list of countries that are unique to the first list. In the sample data, this would be the list of countries from the Wikipedia list only. The Merge operation is identical to the others, except that you must choose the "Left Anti (Rows only from first)" option.
Right Anti (Rows only from second)
Right Anti (Rows only from second) is the reverse of the Left Anti, meaning this Merge option will return a list of countries that are unique to the selected outer list or second list. To perform this operation, select the "Right Anti" in the Join Type field after you launch the Merge operation from the context menu of a selected connection.
All possible compare operations are provided as demonstrated. The Merge can be performed on multiple columns as needed. Also, you can transform the data as needed like any other data set .
The Excel file containing the Merge results can be downloaded from GitHub at this address https://github.com/kevlangdo/excel_compare_lists . To clone the GitHub repo from command line or bash git clone https://github.com/kevlangdo/excel_compare_lists.git
© 2020 Kevin Languedoc