What is Data Cleaning
Data cleaning is, in a nutshell, the process of preparing the data in a storage resource, mainly by detecting and removing irrelevant data (corrupted information, inaccurate records, wrongly entered data, etc.), to make sure we end up with a correct and accurate dataset.
Processing incorrect data can be extremely costly in this age of big data: it can lead to false conclusions, and we might only realize our mistake after spending a significant amount of resources and time.
“Garbage in, garbage out” has been the mantra of data analytics and big data: no matter how good the data processing technology and algorithms are, if we feed them garbage data, we will only get garbage as the result. On the other hand, even a simple analytics tool with a basic algorithm can perform well when it is given high-quality, properly processed data.
In this guide, we will walk you through the data cleansing process step by step, with a checklist of the necessary tasks.
However, let us begin by discussing the key concepts behind data cleansing.
How To Measure Data Quality
Obviously, measuring the quality of data is somewhat subjective and depends on many different factors. However, here are some of the common data quality criteria and how to approach them:
Data Accuracy
“Accuracy” refers to how close the dataset is to the true values.
It’s important to note that a “valid” value is not always accurate (we will discuss data validity further below). For example, black is a valid color, but if the object is actually red, then black does not represent reality and is not the accurate value.
We also have to consider the precision of the data. For example, we can say a customer lives in New York, which may be true but is not very precise. If we can cleanse the data so that it captures the exact street address where this customer lives, the data becomes more precise.
Data Validity
Data validity refers to how well the data complies with the rules, regulations, and constraints predefined by the organization. These are some of the common data constraints (a minimal sketch of how a few of them can be checked follows the list):
- Data-Type: the data must be of a particular type (e.g., only numeric data or only textual data)
- Mandatory: for example, certain fields cannot be empty
- Range: for example, dates should fall within a certain range.
- Unique: a field or certain fields must be unique throughout the dataset
- Foreign-key: a foreign key column can’t contain a value that does not exist in the referenced primary key column.
- Cross-field validation: certain conditions must hold across multiple fields. For example, the expiry date of a license plate cannot be earlier than the issue date.
- Set membership: the value of the data must come from a set of discrete values. For example, there can only be two sides of a pair of shoes, left and right.
- Regular expression: textual data must follow a certain pattern, for example the common pattern for social security numbers or phone numbers.
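As a rough illustration, a few of these constraints can be checked with pandas. This is only a sketch; the column names and rules below are hypothetical:

import pandas as pd

# Hypothetical example data; the column names and rules are illustrative only.
df = pd.DataFrame({
    "age": [34, -2, 51],
    "ssn": ["123-45-6789", "987654321", "111-22-3333"],
    "side": ["left", "right", "upper"],
})

# Range constraint: ages must fall between 0 and 120.
bad_range = df[~df["age"].between(0, 120)]

# Regular-expression constraint: SSNs must match the NNN-NN-NNNN pattern.
bad_pattern = df[~df["ssn"].str.match(r"^\d{3}-\d{2}-\d{4}$")]

# Set-membership constraint: only "left" and "right" are allowed values.
bad_member = df[~df["side"].isin(["left", "right"])]

print(bad_range, bad_pattern, bad_member, sep="\n\n")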
Data Consistency
This one is fairly self-explanatory: it refers to how consistent the data is within a dataset, or across several datasets.
We call two values inconsistent when they contradict each other. For example, the same customer might be recorded in two different sheets, but with two different phone numbers.
Similarly, an age of 8 is not consistent with a record showing that the same person holds a driver’s license.
Data Completeness
Completeness refers to the degree to which all required data has been collected.
Missing data is common during the collection, storage, and analysis processes, and can happen for various reasons.
Data Uniformity
Some data requires the same units or format throughout the dataset. For example, we should use the same unit of measure throughout the same set.
During collection, however, lengths might be recorded in both meters and inches, and we might receive monetary values in both USD and GBP.
Throughout the data cleansing process, all such data should be converted into a single, consistent unit of measure.
The Data Cleansing Sequence
There are four key steps in a data cleansing process, with the goal of producing high-quality data based on the criteria above:
- Inspection process: parse the data to detect missing data, incorrect input, inconsistencies, and non-uniform data, among other issues
- Cleaning process: simply put, the process of removing or correcting the discovered errors and anomalies
- Verification process: re-checking the result to verify quality and correctness
- Documentation process: all changes made should be recorded properly to avoid the need to repeat all the processes when mistakes are made, and also to help with future sequences.
It’s important to understand that data cleansing is not a one-off task but an iterative, potentially endless process. New errors and inconsistencies might be discovered, and you might need to repeat everything from inspection to verification yet again.
Let us discuss these processes one by one:
Data Cleansing: Inspection Process
Inspection is the first and most important process, and often also the most difficult and time-consuming.
Here are some methods and techniques we can use to carry out this inspection process effectively:
Data Visualization
Data visualization techniques, combined with statistical measures (e.g., standard deviation, range, mean), can be used to find errors and incorrect values within the data. Here, we use graphs, scatter plots, charts, and other visualization techniques to find outliers.
For example, if we have a dataset of the average life expectancy across different countries, we might spot unexpected values when we plot this data in a histogram.
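As a small sketch of this idea, a histogram makes such a value easy to spot (the numbers here are made up):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: average life expectancy per country, with one suspicious value.
life_expectancy = pd.Series([71.3, 68.9, 82.1, 75.4, 79.0, 7.9])

life_expectancy.plot.hist(bins=20)  # the unexpected value shows up as an isolated bar
plt.xlabel("Average life expectancy (years)")
plt.show()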
Using Software Tools
Various tools are available to assist with data inspection, and most of them work by allowing us to specify certain constraints, and then check the data for constraint violations.
For example, you can set the constraint that height shouldn’t be below 3’ and can’t be negative. Typically these tools can also generate a report so you can easily visualize the data.
Data Profiling Methods
Data profiling is the process of collecting statistics and informative summaries about the data, which gives us an overview of its quality.
For example, we can check whether a column follows a specific pattern, how many values are missing, how many unique values a group contains, how the data is distributed, and so on.
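In pandas, a rough profile along these lines might look like this minimal sketch (the example data is hypothetical):

import pandas as pd

# Hypothetical example data; in practice this would be your loaded dataset.
df = pd.DataFrame({"country": ["US", "UK", None, "US"], "age": [34, 51, 29, 290]})

print(df.describe(include="all"))  # summary statistics for every column
print(df.isna().sum())             # number of missing values per column
print(df.nunique())                # number of unique values per column
print(df.dtypes)                   # data type of each column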
Data Cleaning Techniques
Different techniques can be used in the cleaning process, depending on the type of data and the specific problem.
Each of these methods has its own advantages and disadvantages, but the goal remains the same: removing or correcting incorrect data.
Cleaning Data Duplicates
Duplicates, or repeated records in your dataset, can often happen when:
- Two or more datasets are combined, especially data from different sources
- Duplicate input by the user, for example when a user submitted the same form twice
- The input form was resubmitted with corrections to wrong information, creating duplicates of the information that did not change
In general, we can remove one of the duplicates to “clean” the data. However, in the third case above, we will need to figure out which form is the correct one (generally the later submission is the correct one).
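A minimal pandas sketch of this, with hypothetical column names and assuming the later submission is the corrected one:

import pandas as pd

# Hypothetical data: the same customer submitted the form twice, correcting the phone number.
df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "b@example.com"],
    "phone": ["555-0100", "555-0199", "555-0123"],
})

# Remove exact duplicates across all columns.
df = df.drop_duplicates()

# For corrected resubmissions, keep the last record per customer,
# assuming the later submission is the correct one.
df = df.drop_duplicates(subset=["email"], keep="last")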
Cleaning Irrelevant Data
“Irrelevant data” refers to information that is not actually needed for the purpose we are using the data for.
For example, if we want to analyze data about the life expectancy of the population in a specific area, then a Twitter handle is probably not necessary, so we can declare it irrelevant data.
On the other hand, you might only want to analyze data from a specific country, in which case all other countries in the dataset can be treated as irrelevant data.
Remember, however, that data that seems irrelevant at first might actually be correlated with your relevant data. Explore the correlation matrix, and only remove data when you are sure it is truly irrelevant.
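A minimal sketch of both cases, with hypothetical column names:

import pandas as pd

# Hypothetical data: the twitter_handle column looks irrelevant to a life-expectancy analysis.
df = pd.DataFrame({
    "country": ["US", "UK", "FR"],
    "life_expectancy": [77.2, 80.9, 82.5],
    "population_m": [331, 67, 68],
    "twitter_handle": ["@a", "@b", "@c"],
})

print(df.corr(numeric_only=True))         # check correlations among numeric columns first
df = df.drop(columns=["twitter_handle"])  # drop the column judged irrelevant
df = df[df["country"] == "US"]            # or keep only the rows relevant to the analysis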
Type Conversion
Convert each column to its desired data type. For example, numerical data might initially be stored as text, and vice versa.
We can use data profiling (as discussed above) to get a summary of the whole dataset and check the data type of each column.
In some cases, it might be impossible to convert a value to the specified type, for one reason or another (e.g., software error, corrupted data); in that case, we can set it to an NA value and display a warning.
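A minimal pandas sketch of this kind of conversion, with a hypothetical height column:

import pandas as pd

# Hypothetical data: height stored as text, with one value that cannot be parsed.
df = pd.DataFrame({"height": ["180", "175", "corrupted"]})

# Convert to numeric; unparseable values become NaN instead of raising an error.
df["height"] = pd.to_numeric(df["height"], errors="coerce")

failed = int(df["height"].isna().sum())
if failed:
    print(f"Warning: {failed} value(s) could not be converted and were set to NaN")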
Fixing Syntax Errors
Fixing differences and typos: for example, some people might input “st.” instead of “street”, and some might incorrectly input “stret” instead.
In this case, the variable will have more classes than expected, all with different values.
The common approach here is to manually map all the common typos to the correct value; for example, we can map “stret” and “st.” to “street” with:
dataframe['address'].replace({'stret': 'street', 'st.': 'street', ...})
The second possible solution (though not applicable in all cases) is to use pattern matching: for example, we can use a regular expression to find values that are just “st” or “St” and replace them with “street”:
dataframe['address'].str.replace(r"^st\.?$", "street", case=False, regex=True)
Another option is to use fuzzy matching, which essentially measures how many edits (the edit distance) are needed to turn one string into another. For example, several edits are needed to change “st.” into “street”, while only one is needed to change “stret” into “street”.
The idea is to replace all the possible variants with one agreed value, in this case “street”.
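One simple way to sketch this in Python is with the standard library's difflib, which scores string similarity; dedicated fuzzy-matching tools offer more control, so treat this only as an illustration with made-up values:

from difflib import get_close_matches

# Hypothetical canonical values we want every variant mapped to.
canonical = ["street", "avenue", "road"]

for value in ["stret", "avnue", "streeet"]:
    match = get_close_matches(value, canonical, n=1, cutoff=0.6)
    print(value, "->", match[0] if match else "no close match")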
Watch out for values like “0”, “None”, or “NA”, among others, which may actually indicate missing values.
Removing extra spaces: extra spaces at the end of a string, or at the beginning, should be removed, for example:
" text " to "text"
Padded strings: strings might be padded with characters or spaces to a certain width. For example, some numeric data is preceded with zeros to ensure a standard number of digits.
For example 0000121 instead of 121 to maintain a 7-digit format.
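A minimal pandas sketch of both fixes, with hypothetical column names:

import pandas as pd

# Hypothetical columns with extra spaces and zero-padded identifiers.
df = pd.DataFrame({"name": ["  text  ", " Alice "], "account_id": ["121", "4589"]})

df["name"] = df["name"].str.strip()               # "  text  " -> "text"
df["account_id"] = df["account_id"].str.zfill(7)  # "121" -> "0000121", standardizing the width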
Data Standardization
The idea here is to make sure all values follow a consistent, standardized format.
For example, some string values might need a standardized lower case format, while some others will require all values to be in upper case.
For numerical data, the important thing is to maintain a similar measurement unit throughout all values. For example, height can be in feet or meters. We should convert all heights to one system to avoid confusion.
Also, a date recorded as a timestamp is not the same as a date object and will need conversion. In addition, the US and Europe use different date formats.
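A minimal sketch of date standardization in pandas; the formats and the dayfirst assumption are only illustrative:

import pandas as pd

# US-style dates (month first) vs. European-style dates (day first).
us_dates = pd.to_datetime(pd.Series(["03/14/2021", "07/04/2021"]), format="%m/%d/%Y")
eu_dates = pd.to_datetime(pd.Series(["14/03/2021", "04/07/2021"]), dayfirst=True)

# Unix timestamps (in seconds) converted into proper datetime objects.
stamps = pd.to_datetime(pd.Series([1615680000, 1625366400]), unit="s")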
Data Transformation
Transformation, or scaling, refers to changing your data’s values so they fit within a specific scale, for example a 0-100 or 0-10 scale.
We can, for example, express the same data on a 5-star rating scale or a percentage (0-100) scale.
Scaling is also needed when data uses different measurement units: for example, weight data in kilograms can’t be compared directly to data in pounds, so we need to rescale the values to a standardized unit.
The idea is, with scaling, we can compare different values.
Data Normalization
Normalization is a type of transformation or scaling in which data values are rescaled, commonly into a range of 0 to 1 (min-max normalization).
Many statistical calculations assume normally distributed data, and various normalization and transformation methods can help bring the data closer to that assumption. Depending on the method used, normalizing the data might change the shape of the data distribution when visualized.
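A minimal sketch of min-max normalization, one common way to rescale values into the 0-to-1 range:

import pandas as pd

values = pd.Series([10, 20, 35, 50])

# Min-max normalization: the smallest value becomes 0 and the largest becomes 1.
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)

Note that min-max scaling only changes the range of the values; nonlinear transforms such as the logarithm are what actually change the shape of the distribution.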
Dealing With Missing Values
We will always encounter missing values in any dataset, and there are at least three possible approaches in dealing with missing data:
- Ignore
This is the approach commonly taken when the missing values: a) occur at random and b) occur relatively rarely. In this case, we can simply drop the rows that have missing values.
On the other hand, if most of a column’s values are missing at random, we can drop the whole column.
In statistical analysis, filling in otherwise missing values can lead to biased results, so this approach is often preferred.
- Attempt to ‘restore’ missing values via imputation
Here, our approach is to estimate the missing values from the other observed values, using several possible methods:
- Linear regression: fit a model to the existing data to predict the missing values. Note that linear regression is relatively sensitive to unexpected outliers
- Statistical substitutes such as the mean or median of the other observations. When there are many missing values, however, this approach can lead to bias
- Hot deck: when we have enough available data, we can copy values from similar records
There are also many other methods we can use to impute the missing values with substitute data (a short sketch of the drop, impute, and flag options appears at the end of this section).
- Flagging the data
No matter what imputation method we use, and no matter how advanced it is, it will always introduce some bias or loss of information.
Missing data is informative in itself, and replacing it with any other value will affect the algorithm’s calculations.
This is especially true when the data is not missing at random, for example when survey respondents of a certain gender refuse to answer certain questions.
Based on this, we can “flag” the data, which means filling it with, for example, “NA”, and excluding it when calculating any statistical value.
Remember that missing values are not the same as default values. They are also not the same as unknown values: when someone forgot their birthday or doesn’t know their actual age, that is an unknown value rather than a missing value.
Every time we ignore or impute a value, we are essentially losing information. This is why flagging is, in most cases, the better approach.
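A minimal sketch of the three approaches discussed above, using a hypothetical age column:

import pandas as pd

# Hypothetical data with missing ages.
df = pd.DataFrame({"age": [34, None, 29, None, 41]})

# 1. Ignore: drop the rows (or, in other cases, the whole column) with missing values.
dropped = df.dropna(subset=["age"])

# 2. Impute: fill the gaps with a statistic such as the median.
imputed = df.assign(age=df["age"].fillna(df["age"].median()))

# 3. Flag: keep the gap visible and record that the value was missing.
flagged = df.assign(age_missing=df["age"].isna())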
Handling Outliers
Outliers are, simply put, data values that are unexpected and significantly different from all the other values.
Mathematically, any data value that lies more than 1.5 times the interquartile range (IQR) below the first quartile (Q1) or above the third quartile (Q3) is an outlier.
Let’s dig deeper into that statement. First, the interquartile range (IQR) measures where the bulk of the values lie, and is calculated by subtracting the first quartile from the third quartile:
IQR = Q3 – Q1
When we cut all the available data into four equal parts, we get three quartiles (the quartiles are the “cuts”). The first and third quartiles are especially important here. For example, suppose we have seven values in a dataset: 1, 2, 3, 4, 5, 6, 7.
Then 2 is Q1, 4 is Q2, and 6 is Q3, so the IQR is 6 - 2 = 4. Any number more than 1.5 × 4 = 6 below Q1 (2) or above Q3 (6), that is, below -4 or above 12, is an outlier. For example, 13 would be an outlier.
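A minimal sketch of the same rule in Python, using the values from the example above:

import pandas as pd

# The values from the example above, plus the suspected outlier 13.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 13])

# Note: pandas interpolates quantiles, so Q1 and Q3 may differ slightly
# from the hand calculation above, but the logic is the same.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(values[(values < lower) | (values > upper)])  # 13 is flagged as an outlier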
It’s important to note that unless there’s a clear reason to remove outliers, outliers are innocent until proven guilty. It’s worth investigating the reason behind the outlier occurrences before we remove the data.
Some statistical methods, like linear regression, are especially sensitive to outliers.
Data Contradictions
Two or more values, whether across different datasets (cross-dataset errors) or within the same row (in-record errors), can contradict each other.
For example, someone under 10 years old can’t be married, a black SUV shouldn’t be also listed as a motorcycle, and so on.
Data Cleaning: Verification Process
When all cleaning is deemed done, it’s important to re-check everything and re-inspect the data to verify that it’s totally free from errors and conforms to the regulations and constraints.
For example, data imputation might violate some constraints, and so we might need to figure out a different solution.
End Words
It’s important to look at data cleaning as a continuous, ongoing process instead of a one-off task. Why? Because no matter how good the data validation and cleaning process is, new issues will always appear as new data comes in.
Recording everything related to the data cleaning is very important, so we know how these errors happened in the first place and can prevent similar issues in the future with better validation.