What is data quality?
Data quality refers to the accuracy, reliability, and suitability of data for its intended use. High-quality data is essential for making better decisions and drawing meaningful insights. Following data quality guidelines also reduces the time spent cleaning datasets when they are shared between departments.
Key aspects of data quality
- Accuracy: Accurate data is correct and precise, free from errors.
- Completeness: Complete data contains all the necessary information required for analysis or decision-making. Empty columns or rows should be removed, when applicable.
- Consistency: Consistent data maintains uniformity across different datasets.
- Relevance: Relevant data is aligned with the specific purpose or goal.
- Validity: Valid data conforms to the required formats, ranges, and business rules defined for the dataset.
- Timeliness: Timely data is up-to-date and relevant within the required timeframe.
Steps to perform data quality assessment
- Accuracy
Accuracy is a vital data quality metric that evaluates whether data is correct and free from error. Several methods can help you ensure accuracy, as mentioned in the table below.
| Method | Description |
| --- | --- |
| Equality check | Compare the original and transformed data field by field. The values should match. |
| Validation rules | Set conditions that data must meet; for example, an age field can’t exceed 120 or be negative. |
| Data profiling | Use statistical methods to find errors within the data. |
| Reference data check | Cross-check data values against a trusted external source to ensure they are correct and consistent. |
| Completeness check | Verify that all expected data is present. The absence of data can lead to inaccurate results. |
| Consistency check | Ensure that data is consistent across all systems. Inconsistent data can lead to wrong conclusions. |
| Uniqueness check | Make sure there are no unnecessary duplicate records in the dataset. Duplicate data can lead to misleading analytics. |
| Timeliness check | Make sure the data is relevant and up to date. Outdated data may not reflect current trends or situations. |
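To make the equality check and validation rules concrete, here is a minimal pandas sketch. The column names (`id`, `amount`, `age`) and the age limits are hypothetical, chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical source and transformed copies of the same records.
source = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 250.5, 80.0], "age": [34, 41, 29]})
transformed = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 250.5, 80.0], "age": [34, 41, 290]})

# Equality check: compare the datasets field by field (pandas 1.1+).
mismatches = source.compare(transformed)
print("Field-level mismatches:\n", mismatches)

# Validation rule: an age field can't exceed 120 or be negative.
invalid_age = transformed[(transformed["age"] < 0) | (transformed["age"] > 120)]
print("Rows violating the age rule:\n", invalid_age)
```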
- Completeness
Completeness refers to the degree to which all required data is present in the data asset. In other words, it checks that every expected or necessary field in a dataset is filled with a valid entry, leaving no blanks or null values.
Completeness is important as missing data can create a significant bias, leading to skewed results and ultimately impacting the credibility of your data analysis.
Here is a table listing methods that ensure completeness.
| Method | Description |
| --- | --- |
| Null check | Find and fill empty or null data points in the dataset. |
| Coverage check | Make sure your data covers all necessary dimensions of the entity it represents. |
| Missing value analysis | Identify patterns in missing data to find systematic data collection issues. |
| Data imputation | Fill in missing data based on various strategies like mean, median, mode, or predictive modeling. |
| Cross-reference check | Compare your data with a trusted source to identify any missing elements. |
| Cardinality check | Assess if the number of unique values in a field matches expectations. |
| Data sufficiency verification | Ensure you have enough data to support your analysis and conclusions. |
| Business rule confirmation | Verify that all business rules or conditions are met in the data collection process. |
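The sketch below illustrates a null check followed by a simple imputation with pandas. The column names (`age`, `city`) and the mean/mode strategies are hypothetical; the right strategy depends on the column and the analysis.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "city": ["Austin", "Denver", None, "Denver"],
})

# Null check: count missing values per column.
print(df.isna().sum())

# Data imputation: fill numeric gaps with the mean, categorical gaps with the mode.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```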
- Consistency
Consistency is about making sure your data is standardized across different platforms, systems, and even within the same dataset. Consistency is not just about maintaining a uniform format or removing duplicates. It is about establishing an environment where your data is reliable, trustworthy, and primed for accurate analysis.
The following table covers the methods typically employed to ensure consistency.
| Method | Description |
| --- | --- |
| Cross-system check | Compare data across different systems. They should match. |
| Standardization | Maintain uniform data formats. For instance, date fields should follow one format throughout. |
| Data deduplication | Remove duplicate data entries to avoid confusion and inconsistency. |
| Business rule check | Ensure data complies with the rules or constraints defined by your business requirements. |
| Harmonization | Align disparate data representations to achieve uniformity. |
| Entity resolution | Identify and link different representations of the same entity within or across datasets. |
| Temporal consistency check | Check if data maintains logical order and sequencing over time. |
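As a rough illustration of standardization and deduplication, here is a pandas sketch that parses mixed date formats into one format and then drops the rows that become identical. The `customer` and `signup_date` columns are hypothetical, and `format="mixed"` assumes pandas 2.0 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ana", "Ana", "Bo"],
    "signup_date": ["2024-01-05", "01/05/2024", "2024-02-10"],
})

# Standardization: parse mixed date strings into one uniform format (pandas 2.0+).
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed").dt.strftime("%Y-%m-%d")

# Data deduplication: drop rows that became identical after standardizing.
df = df.drop_duplicates()
print(df)
```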
- Validity
Validity checks if data follows set rules, like a specific format or range.
Let’s say a field needs a date. Validity checks if that field has a date in the right format (for instance, mm/dd/yyyy).
The methods listed in the table below help you perform validity checks.
| Method | Description |
| --- | --- |
| Format checks | Check whether the data matches the expected format. |
| Range checks | Confirm that data falls within a specific range. |
| Existence checks | Make sure data is present where required. |
| Consistency checks | Verify that data is uniform across all sources. |
| Cross-reference checks | Compare data with another reliable source for confirmation. |
| Logical checks | Review data to see if it makes sense. For example, a customer’s age can’t be negative. |
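Here is a minimal sketch of a format check and a range check in pandas, assuming a hypothetical `order_date` column expected in mm/dd/yyyy format and an `age` column expected to stay between 0 and 120.

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": ["03/15/2024", "2024-03-16", "04/01/2024"],
    "age": [34, -2, 56],
})

# Format check: does order_date match the expected mm/dd/yyyy pattern?
format_ok = df["order_date"].str.fullmatch(r"\d{2}/\d{2}/\d{4}")

# Range check: age must fall between 0 and 120.
range_ok = df["age"].between(0, 120)

print(df[~format_ok])   # rows with an unexpected date format
print(df[~range_ok])    # rows with an out-of-range age
```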
- Uniqueness
Uniqueness ensures that each data point exists only once in the system. This property is crucial, especially when test data lingers in production or a failed data migration leaves incomplete entries. For example, a job might run twice because no mechanism is in place to prevent duplicate data flow, a problem that is common in workflow engines, data sources, and targets.
Uniqueness checks can mitigate this issue by identifying and preventing duplicates.
Here’s a table covering popular methods that ensure data uniqueness.
| Method | Description |
| --- | --- |
| Deduplication | Removes identical entries from the dataset. |
| Key constraint | Enforces unique keys in a database to prevent duplicate entries. |
| Record matching | Finds and merges similar records based on set rules. |
| Data cleansing | Removes duplicates through a process of checking and correcting data. |
| Normalization | Minimizes data duplication by arranging data in tables. |
| Fuzzy matching | Uses logic that looks for patterns to detect non-identical duplicates. |
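The following pandas sketch shows a key-constraint style check and a deduplication step. The `customer_id` key and the sample rows are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name": ["Ana", "Bo", "Bo", "Cy"],
})

# Key constraint check: flag customer_id values that appear more than once.
dupes = df[df.duplicated(subset="customer_id", keep=False)]
print("Duplicate keys:\n", dupes)

# Deduplication: keep the first occurrence of each key.
df = df.drop_duplicates(subset="customer_id", keep="first")
print(df)
```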
- Timeliness
Timeliness checks if your data is up-to-date and ready when needed.
Timeliness keeps your data fresh and relevant. Think of a weather forecast. If it is a day late, it is not of much use.
The following table lists the most popular methodologies for ensuring timeliness.
| Methodology | Description |
| --- | --- |
| Real-time monitoring | Allows instant tracking of data as it moves through pipelines. |
| Automated alerts | Sends notifications when there are significant delays or failures. |
| Scheduled jobs | Runs data jobs at optimal times to avoid bottlenecks and improve flow. |
| Load balancing | Distributes data jobs across systems to prevent overload and ensure swift processing. |
| Parallel processing | Uses multiple cores or servers to process data simultaneously, improving speed. |
| Data partitioning | Divides data into smaller, more manageable parts, speeding up processing time. |
| Late arrival handling | Implements strategies to manage late-arriving data, such as using default placeholders. |
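As one possible freshness check, the sketch below flags data feeds that have not been refreshed within the last 24 hours and prints a simple alert. The `last_updated` column, the feed names, and the 24-hour threshold are all assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "source": ["sales_feed", "weather_feed"],
    "last_updated": ["2024-06-01 08:00", "2024-05-28 08:00"],
})
df["last_updated"] = pd.to_datetime(df["last_updated"])

# Timeliness check: flag feeds not refreshed within the last 24 hours.
now = pd.Timestamp("2024-06-01 12:00")  # fixed "current" time so the example is reproducible
stale = df[now - df["last_updated"] > pd.Timedelta(hours=24)]

# Automated alert: in practice this might notify a team; here we just print.
if not stale.empty:
    print("Stale feeds detected:\n", stale)
```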
Tidy Data
Tidy data means organizing your data in a simple, clean way. The main benefit of using a tidy data structure for tabular data is that it allows for easy analysis.
Follow these rules:
- Columns for variables: Each variable (like age or height) gets its own column.
- Rows for observations: Each row describes one thing (a person, a product).
- A table for each type: Different kinds of records go in different tables (customers, sales).
- One value per cell: Keep rows and columns doing one job each, so every value sits in its own cell.
Reference: https://byuidatascience.github.io/python4ds/tidy-data.html
Common Errors
- Spreading a category across several columns
| Country | Population_2024 | Population_2023 | Population_2022 |
| --- | --- | --- | --- |
| United States | 341,000,000 | 332,000,000 | 333,300,000 |
| Canada | 39,000,000 | 38,000,000 | 37,000,000 |
| Mexico | 123,100,000 | 128,400,000 | 127,500,000 |
Instead, the year columns should be collapsed into two columns (Population and Year), so each row holds one observation. This allows for easier analysis without having to rearrange the data first. In the example above, time and resources would be wasted cleaning the data into a machine-readable format.
| Country | Population | Year |
| --- | --- | --- |
| United States | 341,000,000 | 2024 |
| United States | 332,000,000 | 2023 |
| United States | 333,300,000 | 2022 |
| Canada | 39,000,000 | 2024 |
| Canada | 38,000,000 | 2023 |
| Canada | 37,000,000 | 2022 |
| Mexico | 123,100,000 | 2024 |
| Mexico | 128,400,000 | 2023 |
| Mexico | 127,500,000 | 2022 |
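A wide-to-long reshape like this can be done in one step with pandas. The sketch below assumes the wide table has been loaded into a DataFrame with the column names shown above.

```python
import pandas as pd

wide = pd.DataFrame({
    "Country": ["United States", "Canada", "Mexico"],
    "Population_2024": [341_000_000, 39_000_000, 123_100_000],
    "Population_2023": [332_000_000, 38_000_000, 128_400_000],
    "Population_2022": [333_300_000, 37_000_000, 127_500_000],
})

# Melt the year columns into tidy (Country, Population, Year) rows.
tidy = wide.melt(id_vars="Country", var_name="Year", value_name="Population")
tidy["Year"] = tidy["Year"].str.replace("Population_", "", regex=False).astype(int)
print(tidy)
```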
- Putting two variables into a single column
| Parks | Latitude and Longitude |
| --- | --- |
| Liberty Park | (40.7459898, -111.8743378) |
| Sugarhouse Park | (40.7238758, -111.8492732) |
| Glendale Park | (40.7326094, -111.9249696) |
While latitude and longitude are related location attributes, they are separate variables and should be stored in separate columns. Keeping each variable in its own column makes it easy to analyze. Another common example of this error is storing a person’s gender and age in one column.
| Parks | Latitude | Longitude |
| --- | --- | --- |
| Liberty Park | 40.7459898 | -111.8743378 |
| Sugarhouse Park | 40.7238758 | -111.8492732 |
| Glendale Park | 40.7326094 | -111.9249696 |
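One way to split such a combined column is with a regular expression in pandas. The sketch below assumes the coordinates are stored as "(lat, lon)" strings, as in the table above.

```python
import pandas as pd

df = pd.DataFrame({
    "Parks": ["Liberty Park", "Sugarhouse Park", "Glendale Park"],
    "Latitude and Longitude": ["(40.7459898, -111.8743378)",
                               "(40.7238758, -111.8492732)",
                               "(40.7326094, -111.9249696)"],
})

# Extract the two numbers from the "(lat, lon)" string into separate columns.
coords = df["Latitude and Longitude"].str.extract(r"\(([-\d.]+),\s*([-\d.]+)\)")
df["Latitude"] = coords[0].astype(float)
df["Longitude"] = coords[1].astype(float)
df = df.drop(columns="Latitude and Longitude")
print(df)
```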