Structure discovery is all about consistency and format. For example, for a group of phone number inputs, you may want to check if any of them contain symbols or letters rather than just numbers.
Structure discovery also employs basic statistical analysis to gather information such as standard deviation, mean, and mode. This can help you notice patterns and correct issues.
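To make this concrete, here is a minimal sketch of a structure check in Python. It assumes a small, made-up list of phone number inputs and treats "just numbers" as the digits 0 through 9. The function names and sample values are hypothetical, not part of any particular tool.

```python
import re
import statistics

# Hypothetical sample inputs; in practice these would come from your own data.
phone_numbers = ["4145550123", "414-555-0199", "41455501AB", "6085550142"]


def non_digit_entries(values):
    """Return the entries that contain anything other than the digits 0-9."""
    return [v for v in values if not re.fullmatch(r"[0-9]+", v)]


def length_stats(values):
    """Summarize entry lengths with the basic statistics mentioned above."""
    lengths = [len(v) for v in values]
    return {
        "mean": statistics.mean(lengths),
        "mode": statistics.mode(lengths),
        # Population standard deviation, since we measure every entry we have.
        "stdev": statistics.pstdev(lengths),
    }


flagged = non_digit_entries(phone_numbers)
stats = length_stats(phone_numbers)

print("Entries with symbols or letters:", flagged)
print("Length statistics:", stats)
```

If most entries share one length and a few do not, the mode and standard deviation make those outliers easy to spot.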
Content discovery looks for errors in individual data records. This type of data profiling catches data quality issues such as missing values and ambiguous information.
Content discovery is an important check when you manage your data, especially when you're dealing with data fields that require accuracy. An address, for example, is not complete unless it has a corresponding zip code. Inconsistent abbreviations, such as using "St." for "Street" and "WI" for "Wisconsin," may also affect mail carrier systems. These issues may seem minor, but they can affect whether mail is delivered correctly.
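As a rough illustration, the sketch below checks a few made-up address records for missing values and for the abbreviations mentioned above. The field names, the list of required fields, and the abbreviation table are all assumptions for the example; your own data will need its own rules.

```python
# Hypothetical address records; None and "" both count as missing here.
records = [
    {"street": "123 Main Street", "state": "Wisconsin", "zip": "53703"},
    {"street": "45 Oak St.", "state": "WI", "zip": ""},
    {"street": "9 Elm Street", "state": "WI", "zip": None},
]

# Assumed rules for this example only.
REQUIRED_FIELDS = ("street", "state", "zip")
ABBREVIATIONS = {"St.": "Street", "WI": "Wisconsin"}


def missing_fields(record):
    """Return the required fields that are absent, None, or blank."""
    return [f for f in REQUIRED_FIELDS if not (record.get(f) or "").strip()]


def abbreviated_fields(record):
    """Return the fields whose value contains a known abbreviation."""
    found = []
    for field, value in record.items():
        if value and any(token in ABBREVIATIONS for token in value.split()):
            found.append(field)
    return found


issues = [
    {
        "row": index,
        "missing": missing_fields(record),
        "abbreviated": abbreviated_fields(record),
    }
    for index, record in enumerate(records)
]

for issue in issues:
    print(issue)
```

The check only reports problems. Deciding whether to expand "St." or leave it alone is a separate cleanup step.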
Relationship discovery catalogs the connections between different data sets, including similarities and differences. Sometimes, two data sets need to be combined to create value. For example, a customer's name needs to be matched with their correct address in order to ensure product delivery. Relationship discovery is also integral for sampling, duplicating, and transferring data so data integrity can be maintained.
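Here is one way the customer and address example might look in Python. It is a sketch that assumes both data sets share a `customer_id` field, which is a hypothetical name chosen for the example, as are the sample rows.

```python
# Two hypothetical data sets that should line up on customer_id.
customers = [
    {"customer_id": 1, "name": "Ada Park"},
    {"customer_id": 2, "name": "Luis Ortega"},
    {"customer_id": 3, "name": "Mei Chen"},
]
addresses = [
    {"customer_id": 1, "address": "123 Main Street"},
    {"customer_id": 3, "address": "9 Elm Street"},
    {"customer_id": 4, "address": "77 Pine Street"},
]


def compare_keys(left, right, key):
    """Catalog which key values the two data sets share and which they don't."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return {
        "matched": sorted(left_keys & right_keys),
        "only_left": sorted(left_keys - right_keys),
        "only_right": sorted(right_keys - left_keys),
    }


overlap = compare_keys(customers, addresses, "customer_id")

# Combine the two data sets for the customers that do have an address.
address_by_id = {row["customer_id"]: row["address"] for row in addresses}
deliverable = [
    {"name": c["name"], "address": address_by_id[c["customer_id"]]}
    for c in customers
    if c["customer_id"] in address_by_id
]

print(overlap)
print(deliverable)
```

Customers without an address and addresses without a customer both show up in the overlap report, so you can fix them before they block a delivery.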
There are a variety of data profiling tools and techniques available to enhance data quality and solve data quality problems. Here are some data profiling techniques to get you started:
- Column profiling: First and foremost, column profiling conducts frequency analysis. It counts the number of times a value appears in a single column. Then, it uses this information to uncover patterns and produce statistics. For numerical columns, minimum value, mean, and standard deviation are typically calculated.
- Cross-column profiling: Cross-column profiling deals with key analysis and dependency analysis. Key analysis looks for primary keys, the values that uniquely identify each record in a data set, such as a customer ID, product number, or license plate number. Dependency analysis searches for relationships between columns within a data set, such as a zip code that determines a city.
- Cross-table profiling: Cross-table profiling is more complex. It analyzes multiple columns across different tables in order to locate broader relationships and dependencies. Stray data and discrepancies are often discovered in this process.
- Data rule validation: Data rule validation checks the data you collected against established rules and standards, and notes where data quality can be improved.
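The techniques above can be sketched in a few short Python functions. This is a simplified illustration, not how any specific profiling tool works. The `orders` table, its column names, and the five-digit zip code rule are assumptions made up for the example.

```python
from collections import Counter, defaultdict
import re
import statistics

# Hypothetical table of orders with a few deliberate problems.
orders = [
    {"order_id": 101, "zip": "53703", "city": "Madison", "amount": 20.0},
    {"order_id": 102, "zip": "53703", "city": "Madison", "amount": 35.0},
    {"order_id": 103, "zip": "53202", "city": "Milwaukee", "amount": 20.0},
    {"order_id": 104, "zip": "53202", "city": "Madison", "amount": 45.0},
    {"order_id": 105, "zip": "5370", "city": "Madison", "amount": 30.0},
]


def profile_column(rows, column):
    """Column profiling: value frequencies, plus statistics for numeric columns."""
    values = [row[column] for row in rows]
    profile = {"frequencies": Counter(values)}
    if all(isinstance(v, (int, float)) for v in values):
        profile["min"] = min(values)
        profile["mean"] = statistics.mean(values)
        profile["stdev"] = statistics.stdev(values)  # sample standard deviation
    return profile


def is_candidate_key(rows, column):
    """Key analysis: a column can serve as a key only if no value repeats."""
    values = [row[column] for row in rows]
    return len(set(values)) == len(values)


def dependency_violations(rows, determinant, dependent):
    """Dependency analysis: find determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


def rule_violations(rows, column, pattern):
    """Data rule validation: return rows whose value fails the pattern."""
    return [row for row in rows if not re.fullmatch(pattern, str(row[column]))]


amount_profile = profile_column(orders, "amount")
order_id_is_key = is_candidate_key(orders, "order_id")
zip_is_key = is_candidate_key(orders, "zip")
zip_city_conflicts = dependency_violations(orders, "zip", "city")
bad_zips = rule_violations(orders, "zip", r"[0-9]{5}")  # assumed rule: exactly five digits

print(amount_profile)
print(order_id_is_key, zip_is_key)
print(zip_city_conflicts)
print(bad_zips)
```

Cross-table profiling applies the same ideas across more than one table, for example by checking that every key referenced in one table exists in another.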
Now that you understand the basic techniques for profiling, let's take a look at a few data profiling tools:
- Informatica Data Quality: Informatica Data Quality is a data profiling tool that allows you to automate your data quality assessment. It finds relationships and flags problems within your data, and it supports the transformation of data with standardization, validation, enrichment, and more.
- Aggregate Profiler: Aggregate Profiler is an open-source data quality and data profiling tool that supports data generation, data preparation, and data masking. It also features real-time alerts for data issues and changes.
- Oracle Enterprise Data Quality: This tool is integrated with Oracle Master Data Management and provides data profiling, auditing, cleansing, and matching for a range of data types such as customer, product, financial, and operational data.
Get the most out of your data
Get the most out of your data with data profiling. Ensure the best data quality, so you can make data-driven decisions that take your business to the next level.
Working with data can be daunting, but Mailchimp is here to help. Check out our Marketing Library for more resources on how to use data to forecast, plan, and track your company’s performance and success, including Google Analytics tutorials and how to protect your customers’ data.
Need more ways to format and simplify your raw data? Take a look at our data reporting best practices. Let Mailchimp guide you in improving every step of your customer journey, from prospect to purchase.