Updated date:

Data Engineering Using Excel

Kevin is a data engineer and advanced analytics developer. He has over 20 years experience in the field.

Basic ETL Model

Basic ETL Model

Data Engineering is the science of researching and extracting data from various sources and architecting data models and data sets which are stored in data warehouses for use with Machine Learning, Advanced Analytics, other formats of data science endeavors. Data is either extracted in batches or streaming or even on demand. Data Engineers design and build the data pipelines that data science runs on.

Regardless what many voices are saying, Data Engineering can be done using almost every programming language if the language has libraries, frameworks or other APIs that allow it to connect to a data source and a data target.

Most sources are either business systems like SAP or Oracle’s eSuite, or database systems (corporate or external), web services, files. Most often, the target is a data warehouse which can consist of one database or multiple databases called data marts, depending on your design.

Python of course comes to mind for its simplistic style and numerous data manipulation frameworks. And SQL (and all variants) is synonymous with data extraction and manipulation (wrangling). The other usual suspects are Java and Scala (Spark, Hadoop and the related eco-system of technologies) and C# of course and Perl. There are many new kids on the block like Julia. JavaScript, by far the most popular programming language, is a real Swiss Army pocketknife of the programming world. If you want or need to do something in programming, you can be sure JavaScript has a library or framework. However, there is one tool that is ubiquitous in Data Engineering and Data Sciences; it is Excel. You can use Excel as a data source (xlsx or csv files), or as a data analysis, profiling and cleansing tool and even, to a certain extent, as a data target. You can include Excel files as part of your data flow DAG (Directed-Acyclic-Graph, which allows a data engineer (developer) to orchestrate to flow and transformation and storage of the data to be used by data analyst, scientists, LoB people.

DAG diagram

DAG diagram

Excel Data Connectors

Besides SQL and Python, I use Excel almost daily for quick profiling and data analysis on data samples. The Power Pivot (Power Query, Dax, M language) tool, and pivot tables are very powerful tools that are used constantly for ad hoc analysis, profiling and wrangling.

You can get data into Excel (or Power Pivot) using various embedded connectors such as:

  • Text or CSV files
  • A table on a web page
  • From a table or range
  • From files like Json, Xml or another spreadsheet
  • From various database systems including SQL Server, Oracle, Terradata, Sybase, SAP Hana, MySQL, PostgreSQL
  • From Azure data sources
  • From online services like Facebook, Twitter, Salesforce to name a few of the more popular
  • And other data source like OData, Hadoop, OLEDB, ODBC and many others

Once the data is in Excel, it can be inspected, analyzed, profiled and cleansed since most raw data is dirty as far as Data Engineering is concerned.

OData Connectivity Example

For example, you can connect and extract data using OData. To demonstrate, you can connect to the following endpoint using the OData connect in Excel

https://services.odata.org/TripPinRESTierService/People
OData URL

OData URL

You will instantly get back the results of the OData Feed query.

OData Results in Excel

OData Results in Excel

Web Scraping Example

Using a similar connector, you can extract “small” or “medium” data from a web page by simply pasting the URL of the page in the Web connector as follows using the following URL:

https://www.worldometers.info/geography/alphabetical-list-of-countries/

Web Scraping Example

Web Scraping Example

The next screen will provide a page to allow you to select what portion of the page. In this instance we want the “Table 0” which provides a preview of the data. At the bottom, you can “Load” or “Load To” (see dropdown) to load the data into an Excel Worksheet, or “Transform” to open the data in Power Pivot, allowing you to further transform the data before loading the data in an Excel Worksheet.

Data Preview

Data Preview

Database Connectivity Example

Another way to extract data would involve connecting to a database system, either on-premise or online. There are many database systems on the market. Some are relational while others on NoSQL which is based on a key/value pairing structure. To use the Excel connectors to connect to a relational database like SQL Server, look at the following tutorials:

Many NoSQL databases are based on the JSON format (short for JavaScript Object Notation and pronounced Jaysone, albeit in a binary format known as BSON (Binary JSON and pronounced Bison). A popular NoSQL database is MongoDB. You can connect to many NoSQL databases using an ODBC driver. ODBC (Open Database Connectivity) is a mature and widely implemented database connection technology.

For the next example, you will need to either download and install MongoDB or create a developer account (free) at https://www.mongodb.com/cloud/atlas. Just follow the instructions to setup an account, create cluster, add a user and add some sample data. You will also need to install the following MongoDB ODBC driver for your system and the MongoDB Connector. You can also install a local database on your computer or server.

Once your MongoDB developer account is setup, connect to the database using Excel MongoDB connector using the following example as a guide. You can also use the following quick start guide to install and run a local instance of MongoDB for free.

Following the information in the quick start guide above, you should get a result like the following:

Zip Code Collection

Zip Code Collection

Before you can connect to MongoDB using Excel, you will need to install one of the following ODBC drivers for MongoDB depending on your system. Since Microsoft Excel is a 32-bit software, it is best to use the 32-bit MongoDB ODBC driver

Create 32 bit Data Connection

Create 32 bit Data Connection

32 bit ODBC Config

32 bit ODBC Config

Now we have the system defined ODBC data source ready that we can use to connect between Excel and MongoDB. You can either use the ODBC connector from “Get Data” (Excel 2016 and later) -> “From Other Sources” -> “From ODBC” or use the “From Microsoft Query” connector. The latter provides an interface to help you choose multiple tables, columns and add filters and sorting to graphically build a SQL query.

Select Data Connection

Select Data Connection

After you select the DSN data source, you will be able to select which tables and columns to include in your query as follows:

Select required tables and columns

Select required tables and columns

Finally, the data is loaded into Excel for further analysis, profiling and cleansing as previously discussed.

Conclusion

We have looked at how you can use Excel as part of a Data Engineering solution either as a data analysis tool and/or as part of a DAG data flow. In the next article, I will provide tutorials on using MSHTML.HTMLDocument API or SHDocVw.InternetExplorer API to connect and scrape or extract data as well as MSXML2.XMLHTTP.

This article is accurate and true to the best of the author’s knowledge. Content is for informational or entertainment purposes only and does not substitute for personal counsel or professional advice in business, financial, legal, or technical matters.

© 2020 Kevin Languedoc