Data cleansing is the process of removing or correcting data that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated. The quality of the data directly affects the results of any analysis built on it. In many real-world scenarios data is incomplete or missing, and missing or sparse data can lead to highly misleading results.
Correcting the dirty data:
✓ Statistical methods – Statistical validation can be used to handle missing values. One strategy for dealing with missing data is imputation: replacing missing values with a statistic computed from the observed data rather than removing the affected records entirely.
(a) For categorical data, the missing value can be imputed with the most frequent category.
(b) For numerical data, the sample mean or median can be used to impute missing values.
In general, k-nearest-neighbour imputation is considered superior to replacing missing data with the overall sample mean. Older scikit-learn releases provided an Imputer class in the sklearn.preprocessing module to handle missing data; that class has since been removed, and current releases provide SimpleImputer and KNNImputer in the sklearn.impute module instead.
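The simple statistical strategies in (a) and (b) can be sketched with the Python standard library alone. This is a minimal illustration, not a library API: the helper name impute, the strategy names, and the sample columns are all assumptions made for the example, and missing values are assumed to be represented as None.

```python
from statistics import mean, median, mode

# Hypothetical helper: maps a strategy name to the statistic used as the fill value.
STRATEGIES = {"mean": mean, "median": median, "most_frequent": mode}

def impute(values, strategy):
    """Return a copy of `values` with every None replaced by one statistic
    computed from the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [23, 31, None, 27, 95]            # numerical column with one gap
colours = ["red", "blue", None, "red"]   # categorical column with one gap

print(impute(ages, "mean"))              # gap filled with the mean, 44
print(impute(ages, "median"))            # gap filled with the median, 29.0
print(impute(colours, "most_frequent"))  # gap filled with "red"
```

Note how the single large value 95 pulls the mean well above the median, which is one reason the median is often the safer fill value for skewed numerical data.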
✓ Text parsing – Parsing text fields against an expected format can be used to validate the data and catch syntax errors.
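As a rough sketch of validation by parsing, the example below checks one comma-separated record. The record layout (name, email, date), the function name parse_record, and the deliberately loose email pattern are all hypothetical choices for illustration; real data would need rules suited to its own format.

```python
import re
from datetime import datetime

# Deliberately loose pattern (an assumption): something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def parse_record(raw):
    """Parse one 'name,email,YYYY-MM-DD' line and return (record, errors)."""
    fields = [f.strip() for f in raw.split(",")]
    if len(fields) != 3:
        return None, [f"expected 3 fields, got {len(fields)}"]
    name, email, joined = fields
    errors = []
    if not EMAIL_RE.match(email):
        errors.append(f"bad email: {email!r}")
    try:
        # strptime rejects malformed dates and impossible ones such as 30 February.
        joined = datetime.strptime(joined, "%Y-%m-%d").date()
    except ValueError:
        errors.append(f"bad date: {joined!r}")  # the raw string is kept in the record
    return {"name": name, "email": email, "joined": joined}, errors

print(parse_record("Ann, ann@example.com, 2021-03-15"))
print(parse_record("Bob, bob(at)example.com, 2021-02-30"))
```

Returning the list of errors instead of raising on the first problem lets a cleansing pass report every defect in a record at once.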
✓ Detecting outliers – Unknowingly including outliers can lead some algorithms to wrong results or conclusions. It is important to account for outliers properly and to choose algorithms that can handle them:
(a) Mean plus or minus three standard deviations – This rule is used to detect outliers in univariate data. For Gaussian data, 68.27 percent of the data lies within one standard deviation of the mean, 95.45 percent within two, and 99.73 percent within three. Under this rule, any point that is more than three standard deviations from the mean is classified as an outlier. The weakness of the rule is that the mean and the standard deviation are themselves distorted by outliers. The finite sample breakdown point is defined as the proportion of the observations in a sample that can be replaced before the estimator fails to describe the data accurately; for the mean it is effectively 0 percent, because a single extreme value can shift the mean arbitrarily far.
(b) Median absolute deviation – The median is a more robust estimate. It is the middle observation in a finite set of observations sorted in ascending order. For the median to change drastically, half of the observations have to be replaced with values far away from it, which gives the median a 50 percent finite sample breakdown point. The median absolute deviation (MAD), the median of the absolute deviations from the median, is the correspondingly robust measure of spread, and points that lie many MADs away from the median are treated as outliers.
(c) Discovering outliers using the local outlier factor (LOF) method – LOF detects outliers by comparing the local density of a data instance with the local densities of its neighbours; an instance whose density is much lower than that of its neighbours is considered an outlier. The method is inspired by the KNN (K-Nearest Neighbours) algorithm and is widely used.
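The three detection methods above can be compared in a short standard-library sketch. The function names, the cut-off k=3, the sample data, and the brute-force LOF are illustrative assumptions. The LOF version takes exactly k neighbours per point and assumes there are no duplicate points, which simplifies the full method. The constant 1.4826 is the usual scaling that makes the MAD comparable to the standard deviation for Gaussian data.

```python
from math import dist
from statistics import mean, median, stdev

def three_sigma_outliers(values, k=3.0):
    """Flag values more than k sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]

def mad_outliers(values, k=3.0):
    """Flag values more than k scaled MADs from the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # If more than half the values are identical, mad is 0 and every
    # value that differs from the median is flagged.
    return [v for v in values if abs(v - med) > k * 1.4826 * mad]

def lof_scores(points, k=2):
    """Brute-force local outlier factor; scores near 1 mean 'as dense as
    the neighbours', scores well above 1 suggest an outlier."""
    n = len(points)
    neighbours, k_dist = [], []
    for i in range(n):
        ranked = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbours.append([j for _, j in ranked[:k]])  # indices of the k nearest
        k_dist.append(ranked[k - 1][0])                # distance to the k-th nearest

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_dist[j], dist(points[i], points[j])) for j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]
    return [sum(density[j] for j in neighbours[i]) / (k * density[i]) for i in range(n)]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 1000]
print(three_sigma_outliers(readings))  # [] : the outlier inflates sigma and hides itself
print(mad_outliers(readings))          # [1000]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(lof_scores(points, k=2))         # the isolated point scores far above 1
```

The first result illustrates the breakdown-point argument: with only ten observations, a single extreme value inflates the sample standard deviation so much that it can never lie more than three standard deviations from the mean, whereas the median-based rule still isolates it.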
✓ OpenRefine – OpenRefine is a tool for data cleansing, data exploration, and data transformation.
(a) Text facet – A text facet is a very useful tool, similar to a filter in a spreadsheet. It groups the rows by their unique text values.
(b) Clustering – From a text facet we can cluster all the similar values, which allows us to find the dirty data.
(c) Numeric facets – A numeric facet groups numbers into numeric range bins.
(d) Transforming data – If the data is in different formats, transformations can bring all of it into the same format.
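To make the clustering and transformation ideas concrete, here is a small Python sketch of key-collision clustering: each value is transformed into a normalised key, and values that share a key are grouped as likely variants of one another. This is only an approximation of the idea and is not OpenRefine's own code; the function names, the normalisation steps, and the sample values are assumptions for illustration.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Reduce a string to a normalised key: strip accents, lowercase,
    drop punctuation, then sort and de-duplicate the words."""
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text.strip().lower())
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values whose keys collide; keep only groups with real variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

cities = ["New York", "new york ", "York, New", "Boston", "NEW YORK.", "Bostón"]
for group in cluster(cities):
    print(group)
```

Once a cluster has been reviewed, every member can be rewritten to a single chosen spelling, which is the step that brings the data into one format.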