Handling missing values in the dataset
Let us imagine that we have gigabytes of real-world data. Because the data comes from the real world, it will almost certainly contain a lot of missing values.
There are several ways in which we can handle missing values in our data:
Ignore the tuple with missing values
In this method, we remove every row that contains missing values from the dataset. This method is only effective when most of the attributes of the tuple are missing, because otherwise the observed values in the row are discarded along with the missing ones.
It is also used in classification when the class label is itself missing.
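As a minimal sketch of this method, the snippet below drops incomplete rows using plain Python. The toy dataset, the column names and the helper `drop_incomplete` are hypothetical and only serve as an illustration; `None` is assumed to mark a missing value.

```python
# Hypothetical toy dataset; None marks a missing value.
customers = [
    {"age": 25, "income": 50000, "label": "yes"},
    {"age": None, "income": 62000, "label": "no"},
    {"age": 41, "income": None, "label": None},  # class label is missing
    {"age": 33, "income": 58000, "label": "yes"},
]


def drop_incomplete(rows, required=None):
    """Keep only the rows that have no missing value in `required` columns.

    If `required` is None, every column of the row is checked.
    """
    kept = []
    for row in rows:
        columns = required if required is not None else row.keys()
        if all(row[col] is not None for col in columns):
            kept.append(row)
    return kept


# Drop every row that has at least one missing value.
complete_rows = drop_incomplete(customers)

# Drop only the rows whose class label is missing (the classification case).
labelled_rows = drop_incomplete(customers, required=["label"])
```

Here `complete_rows` keeps only the two fully observed rows, while `labelled_rows` also keeps the row whose `age` is missing, because only the label is required.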
Use a measure of central tendency for the attribute (e.g. the mean or median) to fill in the missing value
In this method, we employ the mean or median for filling in the missing values.
If the distribution of the attribute values is normal or symmetric, we can replace the missing values with the mean of that attribute.
If the distribution of the attribute values is asymmetric or skewed, replacing the missing values with the median is a better idea than replacing them with the mean, because the median is less affected by extreme values.
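The choice between mean and median can be sketched with Python's built-in `statistics` module. The helper `impute` and the two small attribute lists are hypothetical examples, and `None` is again assumed to mark a missing value.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return [fill if v is None else v for v in values]


# Roughly symmetric attribute: the mean is a reasonable fill value.
heights = [160, 170, None, 180]
heights_filled = impute(heights, strategy="mean")

# Skewed attribute with one extreme value: the median is more robust.
incomes = [30, 32, 35, None, 400]
incomes_filled = impute(incomes, strategy="median")
```

For the skewed `incomes` list the median fill value is 33.5, whereas the mean would be 124.25, far from most of the observed values.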
Use prediction techniques to fill in the missing value
In this method, we use machine learning techniques such as linear regression or decision trees to predict the missing values from the other attributes. This method can prove more useful than the other methods when a large number of values are missing.
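A minimal sketch of prediction-based filling is shown below, using simple linear regression with a single predictor written in plain Python. The helpers `fit_line` and `impute_by_regression`, the toy data and the column names are hypothetical; in practice a machine learning library would usually be used instead.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_by_regression(rows, predictor, target):
    """Fill missing `target` values with predictions based on `predictor`."""
    # Train only on rows where both attributes are observed.
    known = [
        (r[predictor], r[target])
        for r in rows
        if r[predictor] is not None and r[target] is not None
    ]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])

    filled = []
    for r in rows:
        r = dict(r)  # copy so the original rows are left unchanged
        if r[target] is None and r[predictor] is not None:
            r[target] = slope * r[predictor] + intercept
        filled.append(r)
    return filled


# Hypothetical toy data: salary (in thousands) depends on experience.
employees = [
    {"years_experience": 1, "salary": 40},
    {"years_experience": 2, "salary": 50},
    {"years_experience": 3, "salary": None},  # value to be predicted
    {"years_experience": 4, "salary": 70},
]
employees_filled = impute_by_regression(employees, "years_experience", "salary")
```

The model is fitted only on the rows where `salary` is observed, and the fitted line is then used to estimate the missing salary from `years_experience`.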