
What is data quality?

Data quality refers to the accuracy, reliability, and suitability of data for its intended use. High-quality data is essential for better decision-making and for drawing meaningful insights. Following data quality guidelines also reduces the time spent cleaning datasets when they are shared between departments.

Key aspects of data quality

  • Accuracy: Accurate data is correct and precise, free from errors.
  • Completeness: Complete data contains all the necessary information required for analysis or decision-making. Empty columns or rows should be removed, when applicable.
  • Consistency: Consistent data maintains uniformity across different datasets.
  • Relevance: Relevant data is aligned with the specific purpose or goal.
  • Validity: Valid data conforms to defined rules, such as a required format, type, or range.
  • Timeliness: Timely data is up-to-date and relevant within the required timeframe.

Steps to perform data quality assessment

  • Accuracy

Accuracy is a vital data quality metric that evaluates whether data is correct and free from error. Several methods can help you ensure accuracy, as mentioned in the table below.

| Method | Description |
| --- | --- |
| Equality check | Compare the original and transformed data field by field. The values should match. |
| Validation rules | Set conditions that data must meet; for example, an age field can't exceed 120 or be negative. |
| Data profiling | Use statistical methods to find errors within the data. |
| Reference data check | Cross-check data values with a trusted external source to ensure they are correct and consistent. |
| Completeness check | Verify that all expected data is present. The absence of data can lead to inaccurate results. |
| Consistency check | Ensure that data is consistent across all systems. Inconsistent data can lead to wrong conclusions. |
| Uniqueness check | Make sure there are no unnecessary data duplications in the dataset. Duplicate data can lead to misleading analytics. |
| Timeliness check | Make sure the data is relevant and up to date. Outdated data may not reflect current trends or situations. |
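
Below is a minimal sketch of two of these checks, an equality check and a validation rule, using pandas. The DataFrames and column names (customer_id, age) are illustrative, not part of any particular pipeline.

```python
import pandas as pd

# Illustrative data: "original" is the source, "transformed" is the loaded copy.
original = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 58, 47]})
transformed = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 58, 130]})

# Equality check: compare the original and transformed data field by field.
mismatches = original.compare(transformed)
print(f"Fields that do not match:\n{mismatches}")

# Validation rule: an age field can't exceed 120 or be negative.
invalid_age = transformed[(transformed["age"] < 0) | (transformed["age"] > 120)]
print(f"Rows violating the age rule:\n{invalid_age}")
```
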
  • Completeness

Completeness refers to the degree to which all required data is available in the data asset. It checks whether all the expected or necessary fields in a dataset are filled with valid entries, leaving no blanks or null values.

Completeness is important as missing data can create a significant bias, leading to skewed results and ultimately impacting the credibility of your data analysis.

Here is a table listing methods that ensure completeness.

| Method | Description |
| --- | --- |
| Null check | Find and fill empty or null data points in the dataset. |
| Coverage check | Make sure your data covers all necessary dimensions of the entity it represents. |
| Missing value analysis | Identify patterns in missing data to find systematic data collection issues. |
| Data imputation | Fill in missing data based on various strategies like mean, median, mode, or predictive modeling. |
| Cross-reference check | Compare your data with a trusted source to identify any missing elements. |
| Cardinality check | Assess if the number of unique values in a field matches expectations. |
| Data sufficiency verification | Ensure you have enough data to support your analysis and conclusions. |
| Business rule confirmation | Verify that all business rules or conditions are met in the data collection process. |
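
A minimal sketch of a null check followed by simple imputation, assuming pandas and an illustrative orders dataset:

```python
import pandas as pd

# Illustrative dataset with missing values.
df = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "amount": [250.0, None, 80.0, None],
    "region": ["West", "East", None, "East"],
})

# Null check: count empty or null data points per column.
print(df.isna().sum())

# Data imputation: fill numeric gaps with the median and categorical gaps with the mode.
df["amount"] = df["amount"].fillna(df["amount"].median())
df["region"] = df["region"].fillna(df["region"].mode()[0])
```
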
  • Consistency

Consistency is about making sure your data is standardized across different platforms, systems, and even within the same dataset. Consistency is not just about maintaining a uniform format or removing duplicates. It is about establishing an environment where your data is reliable, trustworthy, and primed for accurate analysis.

The following table covers the methods typically employed to ensure consistency.

| Method | Description |
| --- | --- |
| Cross-system check | Compare data across different systems. They should match. |
| Standardization | Maintain uniform data formats. For instance, date fields should follow one format throughout. |
| Data deduplication | Remove duplicate data entries to avoid confusion and inconsistency. |
| Business rule check | Ensure data complies with the rules or constraints defined by your business requirements. |
| Harmonization | Align disparate data representations to achieve uniformity. |
| Entity resolution | Identify and link different representations of the same entity within or across datasets. |
| Temporal consistency check | Check if data maintains logical order and sequencing over time. |
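
A minimal sketch of standardization and deduplication with pandas; the customer records and date formats are illustrative, and format="mixed" assumes pandas 2.x:

```python
import pandas as pd

# Illustrative dataset with mixed date formats and a duplicated row.
df = pd.DataFrame({
    "customer": ["Ana", "Ben", "Ben"],
    "signup_date": ["2024-01-05", "01/07/2024", "01/07/2024"],
})

# Standardization: parse every date, then store one format throughout.
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed").dt.strftime("%Y-%m-%d")

# Data deduplication: remove duplicate entries.
df = df.drop_duplicates()
```
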
  • Validity

Validity checks if data follows set rules, like a specific format or range.

Let’s say a field needs a date. Validity checks if that field has a date in the right format (for instance, mm/dd/yyyy).

The methods listed in the table below help you perform validity checks.

| Method | Description |
| --- | --- |
| Format checks | Check if the data matches the expected format. |
| Range checks | Confirm that data falls within a specific range. |
| Existence checks | Make sure data is present where required. |
| Consistency checks | Verify that data is uniform across all sources. |
| Cross-reference checks | Compare data with another reliable source for confirmation. |
| Logical checks | Review data to see if it makes sense. For example, a customer's age can't be negative. |
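
A minimal sketch of a format check and a range check with pandas; the birth_date and age fields are hypothetical, with birth_date expected in mm/dd/yyyy:

```python
import pandas as pd

# Illustrative dataset.
df = pd.DataFrame({
    "birth_date": ["04/21/1990", "1990-04-21", "07/03/1985"],
    "age": [34, -2, 121],
})

# Format check: does the value match mm/dd/yyyy?
format_ok = df["birth_date"].str.match(r"^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$")

# Range / logical check: a customer's age can't be negative or exceed 120.
range_ok = df["age"].between(0, 120)

# Rows failing either check need correction before analysis.
print(df[~format_ok | ~range_ok])
```
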
  • Uniqueness

Uniqueness ensures that each data point exists only once in the system. This property is crucial, especially when test data lingers in production or failed data migrations leave incomplete entries. For example, a job might run twice with no safeguard in place to prevent the duplicate data flow. This problem is common in workflow engines, data sources, and targets.

Uniqueness checks can mitigate this issue by identifying and preventing duplicates.

Here’s a table covering popular methods that ensure data uniqueness.

| Method | Description |
| --- | --- |
| Deduplication | Removes identical entries from the dataset. |
| Key constraint | Enforces unique keys in a database to prevent duplicate entries. |
| Record matching | Finds and merges similar records based on set rules. |
| Data cleansing | Removes duplicates through a process of checking and correcting data. |
| Normalization | Minimizes data duplication by arranging data in tables. |
| Fuzzy matching | Uses logic that looks for patterns to detect non-identical duplicates. |
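
A minimal sketch of a key-based uniqueness check plus a simple fuzzy match using Python's standard difflib; the customer records are illustrative:

```python
import pandas as pd
from difflib import SequenceMatcher

# Illustrative dataset with a repeated key and a near-duplicate name.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["Acme Ltd", "Globex", "Globex", "Globex Inc"],
})

# Key constraint / deduplication: flag and drop rows that repeat the key column.
duplicates = df[df.duplicated(subset="customer_id", keep=False)]
deduped = df.drop_duplicates(subset="customer_id")

# Fuzzy matching: string similarity can surface non-identical duplicates.
similarity = SequenceMatcher(None, "Globex", "Globex Inc").ratio()
print(f"Similarity between the two Globex names: {similarity:.2f}")
```
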
  • Timeliness

Timeliness checks if your data is up-to-date and ready when needed.

Timeliness keeps your data fresh and relevant. Think of a weather forecast. If it is a day late, it is not of much use.

The following table lists the most popular methodologies for ensuring timeliness.

| Methodology | Description |
| --- | --- |
| Real-time monitoring | Allows instant tracking of data as it moves through pipelines. |
| Automated alerts | Sends notifications when there are significant delays or failures. |
| Scheduled jobs | Runs data jobs at optimal times to avoid bottlenecks and improve flow. |
| Load balancing | Distributes data jobs across systems to prevent overload and ensure swift processing. |
| Parallel processing | Uses multiple cores or servers to process data simultaneously, improving speed. |
| Data partitioning | Divides data into smaller, more manageable parts, speeding up processing time. |
| Late arrival handling | Implements strategies to manage late-arriving data, such as using default placeholders. |
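
A minimal sketch of a freshness check with an alert stub; the 24-hour window and the hard-coded timestamp are assumptions, since in practice the timestamp would come from pipeline metadata:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness rule: data older than 24 hours counts as stale.
FRESHNESS_WINDOW = timedelta(hours=24)

# Stand-in for a timestamp read from load metadata.
last_updated = datetime(2024, 6, 1, 8, 0, tzinfo=timezone.utc)

age = datetime.now(timezone.utc) - last_updated
if age > FRESHNESS_WINDOW:
    # Automated alert: in practice this would notify the on-call team rather than print.
    print(f"ALERT: data is {age - FRESHNESS_WINDOW} past the freshness window")
else:
    print("Data is within the freshness window.")
```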

Tidy Data

Tidy data means organizing your data in a simple, clean way. The main benefit of using a tidy data structure for tabular data is that it allows for easy analysis.

Follow these rules:

  • Columns for Variables: Each variable (like age or height) gets its own column.
  • Rows for Observations: Each row describes one observation (a person, a product).
  • Table for Each Type: Each type of observational unit (customers, sales) gets its own table.
  • One Value per Cell: Each cell holds a single value, so rows and columns stay easy to read.

Reference: https://byuidatascience.github.io/python4ds/tidy-data.html

Common Errors

  1. Spreading a single variable across several columns
| Country | Population_2024 | Population_2023 | Population_2022 |
| --- | --- | --- | --- |
| United States | 341,000,000 | 332,000,000 | 333,300,000 |
| Canada | 39,000,000 | 38,000,000 | 37,000,000 |
| Mexico | 123,100,000 | 128,400,000 | 127,500,000 |

Instead, only two columns are needed (Population and Year). This allows for easier analysis without having to rearrange the existing data. With the wide format above, time and resources would be wasted cleaning the data into a machine-readable layout.

| Country | Population | Year |
| --- | --- | --- |
| United States | 341,000,000 | 2024 |
| United States | 332,000,000 | 2023 |
| United States | 333,300,000 | 2022 |
| Canada | 39,000,000 | 2024 |
| Canada | 38,000,000 | 2023 |
| Canada | 37,000,000 | 2022 |
| Mexico | 123,100,000 | 2024 |
| Mexico | 128,400,000 | 2023 |
| Mexico | 127,500,000 | 2022 |
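
A minimal sketch of reshaping the wide table into the tidy one with pandas.melt, using the figures from the example above:

```python
import pandas as pd

# The wide table from the example above.
wide = pd.DataFrame({
    "Country": ["United States", "Canada", "Mexico"],
    "Population_2024": [341_000_000, 39_000_000, 123_100_000],
    "Population_2023": [332_000_000, 38_000_000, 128_400_000],
    "Population_2022": [333_300_000, 37_000_000, 127_500_000],
})

# Melt the year columns into a single Year/Population pair.
tidy = wide.melt(id_vars="Country", var_name="Year", value_name="Population")
tidy["Year"] = tidy["Year"].str.replace("Population_", "", regex=False).astype(int)
print(tidy)
```
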
  2. Putting two variables into a single column
| Parks | Latitude and Longitude |
| --- | --- |
| Liberty Park | (40.7459898, -111.8743378) |
| Sugarhouse Park | (40.7238758, -111.8492732) |
| Glendale Park | (40.7326094, -111.9249696) |

While the variables might be similar location attributes, they should be stored in separate columns. The primary purpose is to make each variable easy to analyze. Another common example of this error is having the gender and age of a person stored in one column.

| Parks | Latitude | Longitude |
| --- | --- | --- |
| Liberty Park | 40.7459898 | -111.8743378 |
| Sugarhouse Park | 40.7238758 | -111.8492732 |
| Glendale Park | 40.7326094 | -111.9249696 |
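
A minimal sketch of splitting the combined column into Latitude and Longitude with pandas; the regular expression assumes the "(lat, lon)" layout shown above:

```python
import pandas as pd

# The combined-column table from the example above.
df = pd.DataFrame({
    "Parks": ["Liberty Park", "Sugarhouse Park", "Glendale Park"],
    "Latitude and Longitude": [
        "(40.7459898, -111.8743378)",
        "(40.7238758, -111.8492732)",
        "(40.7326094, -111.9249696)",
    ],
})

# Extract the two coordinates into their own columns, then drop the combined one.
coords = df["Latitude and Longitude"].str.extract(r"\(([-\d.]+),\s*([-\d.]+)\)")
df["Latitude"] = coords[0].astype(float)
df["Longitude"] = coords[1].astype(float)
df = df.drop(columns="Latitude and Longitude")
print(df)
```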