A Deeper Dive into ETL and Data Validation
Our last post focused on the role technology plays in creating a high-quality student experience in higher ed. Today’s students want a centralized, online system for tracking information and managing their activities. Colleges and universities that provide this resource can gain competitive advantage, while increasing the efficiency of their operations.

Trouble is, disparate applications and data sources — both electronic and paper-based — often present a significant roadblock. When data is stored in different database platforms and formats, integrating it is not a simple process.

As we discussed in our last post, organizations employ extract, transform and load (ETL) tools to consolidate the information. Here is a more detailed look at what ETL entails.

The extract process involves reading data from a source and downloading a desired subset. While that may sound straightforward, there are a number of issues to consider. First, the extract process must be designed so that the data is retrieved without impacting system performance or application response time. If the application is actively being used, you will also have to decide how often to extract the records to keep the data in sync. And how do you handle subsequent changes to the data? If the data source doesn’t track modifications, you’ll need to compare each extract against the previous one to detect changes.
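One common way to handle a source that doesn’t track modifications is to fingerprint each record and compare fingerprints between runs. Here is a minimal sketch in Python; the `student_id` key and the record shape are illustrative assumptions, not a reference to any particular system:

```python
import hashlib
import json

def row_hash(row: dict) -> str:
    """Stable fingerprint of a row's contents."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def detect_changes(extracted_rows, previous_hashes):
    """Compare a fresh extract against hashes saved from the prior run.

    Returns the rows that are new or modified, plus the updated hash
    map to persist for the next extract cycle.
    """
    changed = []
    current_hashes = {}
    for row in extracted_rows:
        key = row["student_id"]  # assumed primary key
        h = row_hash(row)
        current_hashes[key] = h
        if previous_hashes.get(key) != h:
            changed.append(row)
    return changed, current_hashes
```

On the first run every row is “changed”; on later runs only new or modified rows come back, which keeps the downstream transform and load steps small.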

The transform process modifies the extracted information so that it matches the attributes of the target database. Because information may have been entered in a variety of ways, the first step is to clean the data to ensure consistency. Do all phone numbers have dashes separating sets of numerals? Are “empty” fields truly empty, or do some contain “n/a” or another placeholder? Once the data is cleaned, rules and lookups are applied to convert it to the target format. The transform process may also incorporate custom business logic (such as determining residency status), and require the generation of identification keys or calculated values, the parsing of strings into fields, the joining of data from multiple sources, and other processes.
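The cleaning steps above can be sketched in a few lines of Python. The placeholder values, the phone format, and the state lookup table are all illustrative assumptions; a real transform would draw its rules from the target schema:

```python
import re

# Placeholder values that should be treated as empty (illustrative list)
EMPTYISH = {"", "n/a", "na", "none", "null", "-"}

def normalize_empty(value):
    """Treat 'n/a' and similar placeholders as truly empty."""
    if value is None or value.strip().lower() in EMPTYISH:
        return None
    return value.strip()

def clean_phone(raw):
    """Normalize phone numbers to a single dashed format."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return None  # couldn't normalize; flag for manual review

# Illustrative lookup converting free-text entries to a target code
STATE_LOOKUP = {"virginia": "VA", "va": "VA", "va.": "VA"}

def transform_record(rec):
    """Apply cleaning and lookups to one extracted record."""
    state = normalize_empty(rec.get("state"))
    return {
        "phone": clean_phone(rec.get("phone")),
        "state": STATE_LOOKUP.get(state.lower()) if state else None,
    }
```

Anything that returns `None` here — an unparseable phone number, an unrecognized state — is exactly the kind of record the validation step later needs to catch.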

As the name implies, the load process moves the transformed data into the target database. If the target database supports an application that is actively being used, care must be taken to ensure that the load process doesn’t impact system performance or corrupt existing data.
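One way to limit the load’s impact on a live system is to commit in small batches, so the target table is never locked for the whole run and a mid-load failure can’t leave a half-written transaction behind. A sketch using SQLite for illustration (the `students` table and its columns are assumptions):

```python
import sqlite3

def load_in_batches(conn, rows, batch_size=500):
    """Insert transformed rows in small committed batches.

    Each batch is its own transaction: a failure rolls back only the
    current batch, leaving previously committed batches intact.
    """
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        try:
            cur.executemany(
                "INSERT INTO students (student_id, name) VALUES (?, ?)",
                [(r["student_id"], r["name"]) for r in batch],
            )
            conn.commit()
        except sqlite3.Error:
            conn.rollback()
            raise
```

The same pattern applies to any database that supports transactions; batch size is a tuning knob traded off against lock duration on the live system.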

All of this must be carefully planned, keeping in mind that problems may occur at any phase. Connectivity might be lost or data corrupted. Therefore, data should be moved to a staging area after each phase so the process can resume from the last successful step rather than starting over.
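A staging area can be as simple as writing each phase’s output to disk before the next phase begins. A minimal sketch, assuming a local `staging` directory and JSON-serializable rows:

```python
import json
from pathlib import Path

STAGING = Path("staging")  # hypothetical staging directory

def save_stage(name, rows):
    """Persist one phase's output so a later failure can resume here."""
    STAGING.mkdir(exist_ok=True)
    (STAGING / f"{name}.json").write_text(json.dumps(rows))

def load_stage(name):
    """Reload a phase's output, or None if that phase never completed."""
    path = STAGING / f"{name}.json"
    return json.loads(path.read_text()) if path.exists() else None
```

On restart, the pipeline checks for the latest completed stage and picks up from there instead of re-extracting from the live source.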

Validation is also a critical part of any ETL process. The data must be tested to ensure that it contains the expected values and conforms to the proper patterns. Any data that fails the validation test should be identified for analysis and correction.
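Pattern-based validation like this is straightforward to express as a set of rules applied to each row. The specific fields and regular expressions below are illustrative:

```python
import re

# Illustrative validation rules: field name -> required pattern
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "zip": re.compile(r"^\d{5}(-\d{4})?$"),
}

def validate(rows):
    """Split rows into those passing all checks and those flagged
    for analysis and correction, with the failing fields recorded."""
    valid, rejected = [], []
    for row in rows:
        errors = [field for field, pattern in RULES.items()
                  if not pattern.match(str(row.get(field, "")))]
        if errors:
            rejected.append({"row": row, "errors": errors})
        else:
            valid.append(row)
    return valid, rejected
```

Keeping the failing field names alongside each rejected row makes the later analysis-and-correction pass much easier than a bare pass/fail flag would.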

There are off-the-shelf ETL tools that automate some of these functions and provide connections to common databases and file formats. Nevertheless, setting up an ETL process is a complex undertaking that requires significant database expertise. Axiom has developed a suite of tools that integrate with commonly used higher-ed applications, and can customize the ETL process for your particular environment and needs. Contact us to schedule a demo.