What is data quality?
Data quality refers to the accuracy, reliability, and suitability of data for its intended use. High-quality data is essential for making better decisions and drawing meaningful insights. Following data quality guidelines also reduces the time spent cleaning datasets when they are shared between departments.
Key aspects of data quality
- Accuracy: Accurate data is correct and precise, free from errors.
- Completeness: Complete data contains all the necessary information required for analysis or decision-making. Empty columns or rows should be removed, when applicable.
- Consistency: Consistent data maintains uniformity across different datasets.
- Relevance: Relevant data is aligned with the specific purpose or goal.
- Validity: Valid data conforms to the required formats, ranges, and business rules defined for the dataset.
- Timeliness: Timely data is up-to-date and relevant within the required timeframe.
Steps to perform data quality assessment
- Accuracy
Accuracy is a vital data quality metric that evaluates whether data is correct and free from error. Several methods can help you ensure accuracy, as mentioned in the table below.
| Method | Description |
| --- | --- |
| Equality check | Compare the original and transformed data field by field. The values should match. |
| Validation rules | Set conditions that data must meet; for example, an age field can’t exceed 120 or be negative. |
| Data profiling | Use statistical methods to find errors within the data. |
| Reference data check | Cross-check data values against a trusted external source to ensure they are correct and consistent. |
| Completeness check | Verify that all expected data is present. The absence of data can lead to inaccurate results. |
| Consistency check | Ensure that data is consistent across all systems. Inconsistent data can lead to wrong conclusions. |
| Uniqueness check | Make sure there are no unnecessary duplicate records in the dataset. Duplicate data can lead to misleading analytics. |
| Timeliness check | Make sure the data is relevant and up to date. Outdated data may not reflect current trends or situations. |
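To make the equality check and validation rules concrete, here is a minimal pandas sketch. The column names (`id`, `amount`, `age`) and the age limits are hypothetical, chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical source and transformed copies of the same records.
source = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 250.5, 80.0], "age": [34, 41, 29]})
transformed = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 250.5, 80.0], "age": [34, 41, 290]})

# Equality check: compare the datasets field by field (pandas 1.1+).
mismatches = source.compare(transformed)
print("Field-level mismatches:\n", mismatches)

# Validation rule: an age field can't exceed 120 or be negative.
invalid_age = transformed[(transformed["age"] < 0) | (transformed["age"] > 120)]
print("Rows violating the age rule:\n", invalid_age)
```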
- Completeness
Completeness refers to the degree to which all required data is present in the data asset. In other words, it checks that every expected or necessary field in a dataset is filled with a valid entry, leaving no blanks or null values.
Completeness is important as missing data can create a significant bias, leading to skewed results and ultimately impacting the credibility of your data analysis.
Here is a table listing methods that ensure completeness.
| Method | Description |
| --- | --- |
| Null check | Find and fill empty or null data points in the dataset. |
| Coverage check | Make sure your data covers all necessary dimensions of the entity it represents. |
| Missing value analysis | Identify patterns in missing data to find systematic data collection issues. |
| Data imputation | Fill in missing data based on various strategies like mean, median, mode, or predictive modeling. |
| Cross-reference check | Compare your data with a trusted source to identify any missing elements. |
| Cardinality check | Assess if the number of unique values in a field matches expectations. |
| Data sufficiency verification | Ensure you have enough data to support your analysis and conclusions. |
| Business rule confirmation | Verify that all business rules or conditions are met in the data collection process. |
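The sketch below illustrates a null check followed by a simple imputation with pandas. The column names (`age`, `city`) and the mean/mode strategies are hypothetical; the right strategy depends on the column and the analysis.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "city": ["Austin", "Denver", None, "Denver"],
})

# Null check: count missing values per column.
print(df.isna().sum())

# Data imputation: fill numeric gaps with the mean, categorical gaps with the mode.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```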
- Consistency
Consistency is about making sure your data is standardized across different platforms, systems, and even within the same dataset. Consistency is not just about maintaining a uniform format or removing duplicates. It is about establishing an environment where your data is reliable, trustworthy, and primed for accurate analysis.
The following table covers the methods typically employed to ensure consistency.
| Method | Description |
| --- | --- |
| Cross-system check | Compare data across different systems. They should match. |
| Standardization | Maintain uniform data formats. For instance, date fields should follow one format throughout. |
| Data deduplication | Remove duplicate data entries to avoid confusion and inconsistency. |
| Business rule check | Ensure data complies with the rules or constraints defined by your business requirements. |
| Harmonization | Align disparate data representations to achieve uniformity. |
| Entity resolution | Identify and link different representations of the same entity within or across datasets. |
| Temporal consistency check | Check if data maintains logical order and sequencing over time. |
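As a rough illustration of standardization and deduplication, here is a pandas sketch that parses mixed date formats into one format and then drops the rows that become identical. The `customer` and `signup_date` columns are hypothetical, and `format="mixed"` assumes pandas 2.0 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ana", "Ana", "Bo"],
    "signup_date": ["2024-01-05", "01/05/2024", "2024-02-10"],
})

# Standardization: parse mixed date strings into one uniform format (pandas 2.0+).
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed").dt.strftime("%Y-%m-%d")

# Data deduplication: drop rows that became identical after standardizing.
df = df.drop_duplicates()
print(df)
```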
- Validity
Validity checks if data follows set rules, like a specific format or range.
Let’s say a field needs a date. Validity checks if that field has a date in the right format (for instance, mm/dd/yyyy).
The methods listed in the table below help you perform validity checks.
| Method | Description |
| --- | --- |
| Format checks | Check whether the data matches the expected format. |
| Range checks | Confirm that data falls within a specific range. |
| Existence checks | Make sure data is present where required. |
| Consistency checks | Verify that data is uniform across all sources. |
| Cross-reference checks | Compare data with another reliable source for confirmation. |
| Logical checks | Review data to see if it makes sense. For example, a customer’s age can’t be negative. |
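Here is a minimal sketch of a format check and a range check in pandas, assuming a hypothetical `order_date` column expected in mm/dd/yyyy format and an `age` column expected to stay between 0 and 120.

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": ["03/15/2024", "2024-03-16", "04/01/2024"],
    "age": [34, -2, 56],
})

# Format check: does order_date match the expected mm/dd/yyyy pattern?
format_ok = df["order_date"].str.fullmatch(r"\d{2}/\d{2}/\d{4}")

# Range check: age must fall between 0 and 120.
range_ok = df["age"].between(0, 120)

print(df[~format_ok])   # rows with an unexpected date format
print(df[~range_ok])    # rows with an out-of-range age
```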
- Uniqueness
Uniqueness ensures that each data point exists only once in the system. This property is crucial, especially when test data lingers in production or a failed data migration leaves incomplete entries. For example, a job might run twice because no mechanism is in place to prevent duplicate data flow, a problem that is common in workflow engines, data sources, and targets.
Uniqueness checks can mitigate this issue by identifying and preventing duplicates.
Here’s a table covering popular methods that ensure data uniqueness.
| Method | Description |
| --- | --- |
| Deduplication | Removes identical entries from the dataset. |
| Key constraint | Enforces unique keys in a database to prevent duplicate entries. |
| Record matching | Finds and merges similar records based on set rules. |
| Data cleansing | Removes duplicates through a process of checking and correcting data. |
| Normalization | Minimizes data duplication by arranging data in tables. |
| Fuzzy matching | Uses logic that looks for patterns to detect non-identical duplicates. |
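The following pandas sketch shows a key-constraint style check and a deduplication step. The `customer_id` key and the sample rows are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name": ["Ana", "Bo", "Bo", "Cy"],
})

# Key constraint check: flag customer_id values that appear more than once.
dupes = df[df.duplicated(subset="customer_id", keep=False)]
print("Duplicate keys:\n", dupes)

# Deduplication: keep the first occurrence of each key.
df = df.drop_duplicates(subset="customer_id", keep="first")
print(df)
```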
- Timeliness
Timeliness checks if your data is up-to-date and ready when needed.
Timeliness keeps your data fresh and relevant. Think of a weather forecast. If it is a day late, it is not of much use.
The following table lists the most popular methodologies for ensuring timeliness.
| Methodology | Description |
| --- | --- |
| Real-time monitoring | Allows instant tracking of data as it moves through pipelines. |
| Automated alerts | Sends notifications when there are significant delays or failures. |
| Scheduled jobs | Runs data jobs at optimal times to avoid bottlenecks and improve flow. |
| Load balancing | Distributes data jobs across systems to prevent overload and ensure swift processing. |
| Parallel processing | Uses multiple cores or servers to process data simultaneously, improving speed. |
| Data partitioning | Divides data into smaller, more manageable parts, speeding up processing time. |
| Late arrival handling | Implements strategies to manage late-arriving data, such as using default placeholders. |
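As one possible freshness check, the sketch below flags data feeds that have not been refreshed within the last 24 hours and prints a simple alert. The `last_updated` column, the feed names, and the 24-hour threshold are all assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "source": ["sales_feed", "weather_feed"],
    "last_updated": ["2024-06-01 08:00", "2024-05-28 08:00"],
})
df["last_updated"] = pd.to_datetime(df["last_updated"])

# Timeliness check: flag feeds not refreshed within the last 24 hours.
now = pd.Timestamp("2024-06-01 12:00")  # fixed "current" time so the example is reproducible
stale = df[now - df["last_updated"] > pd.Timedelta(hours=24)]

# Automated alert: in practice this might notify a team; here we just print.
if not stale.empty:
    print("Stale feeds detected:\n", stale)
```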
Tidy Data
Tidy data means organizing your data in a simple, clean way. The main benefit of using a tidy data structure for tabular data is that it allows for easy analysis.
Follow these rules:
- Columns for variables: Each variable (like age or height) gets its own column.
- Rows for observations: Each row describes one thing (a person, a product).
- A table for each type: Different kinds of records go in different tables (customers, sales).
- One value per cell: Keep rows and columns doing one job each, so every value sits in its own cell.
Reference: https://byuidatascience.github.io/python4ds/tidy-data.html
Common Errors
- Spreading a category across several columns
| Country | Population_2024 | Population_2023 | Population_2022 |
| --- | --- | --- | --- |
| United States | 341,000,000 | 332,000,000 | 333,300,000 |
| Canada | 39,000,000 | 38,000,000 | 37,000,000 |
| Mexico | 123,100,000 | 128,400,000 | 127,500,000 |
Instead, the year columns should be collapsed into two columns (Population and Year), so each row holds one observation. This allows for easier analysis without having to rearrange the data first. In the example above, time and resources would be wasted cleaning the data into a machine-readable format.
| Country | Population | Year |
| --- | --- | --- |
| United States | 341,000,000 | 2024 |
| United States | 332,000,000 | 2023 |
| United States | 333,300,000 | 2022 |
| Canada | 39,000,000 | 2024 |
| Canada | 38,000,000 | 2023 |
| Canada | 37,000,000 | 2022 |
| Mexico | 123,100,000 | 2024 |
| Mexico | 128,400,000 | 2023 |
| Mexico | 127,500,000 | 2022 |
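A wide-to-long reshape like this can be done in one step with pandas. The sketch below assumes the wide table has been loaded into a DataFrame with the column names shown above.

```python
import pandas as pd

wide = pd.DataFrame({
    "Country": ["United States", "Canada", "Mexico"],
    "Population_2024": [341_000_000, 39_000_000, 123_100_000],
    "Population_2023": [332_000_000, 38_000_000, 128_400_000],
    "Population_2022": [333_300_000, 37_000_000, 127_500_000],
})

# Melt the year columns into tidy (Country, Population, Year) rows.
tidy = wide.melt(id_vars="Country", var_name="Year", value_name="Population")
tidy["Year"] = tidy["Year"].str.replace("Population_", "", regex=False).astype(int)
print(tidy)
```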
- Putting two variables into a single column
| Parks | Latitude and Longitude |
| --- | --- |
| Liberty Park | (40.7459898, -111.8743378) |
| Sugarhouse Park | (40.7238758, -111.8492732) |
| Glendale Park | (40.7326094, -111.9249696) |
While latitude and longitude are related location attributes, they are separate variables and should be stored in separate columns. Keeping each variable in its own column makes it easy to analyze. Another common example of this error is storing a person’s gender and age in one column.
| Parks | Latitude | Longitude |
| --- | --- | --- |
| Liberty Park | 40.7459898 | -111.8743378 |
| Sugarhouse Park | 40.7238758 | -111.8492732 |
| Glendale Park | 40.7326094 | -111.9249696 |
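One way to split such a combined column is with a regular expression in pandas. The sketch below assumes the coordinates are stored as "(lat, lon)" strings, as in the table above.

```python
import pandas as pd

df = pd.DataFrame({
    "Parks": ["Liberty Park", "Sugarhouse Park", "Glendale Park"],
    "Latitude and Longitude": ["(40.7459898, -111.8743378)",
                               "(40.7238758, -111.8492732)",
                               "(40.7326094, -111.9249696)"],
})

# Extract the two numbers from the "(lat, lon)" string into separate columns.
coords = df["Latitude and Longitude"].str.extract(r"\(([-\d.]+),\s*([-\d.]+)\)")
df["Latitude"] = coords[0].astype(float)
df["Longitude"] = coords[1].astype(float)
df = df.drop(columns="Latitude and Longitude")
print(df)
```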