Kevin is a data engineer and advanced analytics developer. He has over 20 years experience in the field.
Data Engineering is the science of researching and extracting data from various sources and architecting data models and data sets which are stored in data warehouses for use with Machine Learning, Advanced Analytics, other formats of data science endeavors. Data is either extracted in batches or streaming or even on demand. Data Engineers design and build the data pipelines that data science runs on.
Regardless what many voices are saying, Data Engineering can be done using almost every programming language if the language has libraries, frameworks or other APIs that allow it to connect to a data source and a data target.
Most sources are either business systems like SAP or Oracle’s eSuite, or database systems (corporate or external), web services, files. Most often, the target is a data warehouse which can consist of one database or multiple databases called data marts, depending on your design.
Excel Data Connectors
Besides SQL and Python, I use Excel almost daily for quick profiling and data analysis on data samples. The Power Pivot (Power Query, Dax, M language) tool, and pivot tables are very powerful tools that are used constantly for ad hoc analysis, profiling and wrangling.
You can get data into Excel (or Power Pivot) using various embedded connectors such as:
- Text or CSV files
- A table on a web page
- From a table or range
- From files like Json, Xml or another spreadsheet
- From various database systems including SQL Server, Oracle, Terradata, Sybase, SAP Hana, MySQL, PostgreSQL
- From Azure data sources
- From online services like Facebook, Twitter, Salesforce to name a few of the more popular
- And other data source like OData, Hadoop, OLEDB, ODBC and many others
Once the data is in Excel, it can be inspected, analyzed, profiled and cleansed since most raw data is dirty as far as Data Engineering is concerned.
OData Connectivity Example
For example, you can connect and extract data using OData. To demonstrate, you can connect to the following endpoint using the OData connect in Excel
You will instantly get back the results of the OData Feed query.
Web Scraping Example
Using a similar connector, you can extract “small” or “medium” data from a web page by simply pasting the URL of the page in the Web connector as follows using the following URL:
The next screen will provide a page to allow you to select what portion of the page. In this instance we want the “Table 0” which provides a preview of the data. At the bottom, you can “Load” or “Load To” (see dropdown) to load the data into an Excel Worksheet, or “Transform” to open the data in Power Pivot, allowing you to further transform the data before loading the data in an Excel Worksheet.
Database Connectivity Example
Another way to extract data would involve connecting to a database system, either on-premise or online. There are many database systems on the market. Some are relational while others on NoSQL which is based on a key/value pairing structure. To use the Excel connectors to connect to a relational database like SQL Server, look at the following tutorials:
- Easily Import and Export Excel Data to SQL Server
Adding a few lines of VBA code to Excel will allow you to dynamically export your data in a Worksheet to a Microsoft SQL Server without the need of a programmer.
- How-To Synchronize Data Between Excel and SQL Server
SQL Server offers many different technologies for system integration. This tutorial demonstrates how to link an Excel 2007 file with SQL Server 2008.
For the next example, you will need to either download and install MongoDB or create a developer account (free) at https://www.mongodb.com/cloud/atlas. Just follow the instructions to setup an account, create cluster, add a user and add some sample data. You will also need to install the following MongoDB ODBC driver for your system and the MongoDB Connector. You can also install a local database on your computer or server.
Once your MongoDB developer account is setup, connect to the database using Excel MongoDB connector using the following example as a guide. You can also use the following quick start guide to install and run a local instance of MongoDB for free.
Following the information in the quick start guide above, you should get a result like the following:
Before you can connect to MongoDB using Excel, you will need to install one of the following ODBC drivers for MongoDB depending on your system. Since Microsoft Excel is a 32-bit software, it is best to use the 32-bit MongoDB ODBC driver
Now we have the system defined ODBC data source ready that we can use to connect between Excel and MongoDB. You can either use the ODBC connector from “Get Data” (Excel 2016 and later) -> “From Other Sources” -> “From ODBC” or use the “From Microsoft Query” connector. The latter provides an interface to help you choose multiple tables, columns and add filters and sorting to graphically build a SQL query.
After you select the DSN data source, you will be able to select which tables and columns to include in your query as follows:
Finally, the data is loaded into Excel for further analysis, profiling and cleansing as previously discussed.
We have looked at how you can use Excel as part of a Data Engineering solution either as a data analysis tool and/or as part of a DAG data flow. In the next article, I will provide tutorials on using MSHTML.HTMLDocument API or SHDocVw.InternetExplorer API to connect and scrape or extract data as well as MSXML2.XMLHTTP.
This article is accurate and true to the best of the author’s knowledge. Content is for informational or entertainment purposes only and does not substitute for personal counsel or professional advice in business, financial, legal, or technical matters.
© 2020 Kevin Languedoc