Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
What do we understand by the Dispersion of Data?
The dispersion of data describes how spread out the values are, that is, how much the observations differ from one another.
How do we measure the dispersion or spread of data?
Let x1, x2, ..., xN be a set of observations for some numeric attribute, X. The following measures describe the dispersion of the data:
Range: It is defined as the difference between the largest and smallest values in the set.
Quantiles: These are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q-k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are q-1 q-quantiles in total.
For example, the 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median of the set of values. The 4-quantiles are the three data points that split the data distribution into four equal parts, where each part represents one-fourth of the data distribution. They are commonly called quartiles and written Q1, Q2, and Q3.
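The quantile definitions above can be illustrated with a short Python sketch. It uses the standard library's statistics.quantiles to compute the q-1 cut points of a small sample. The helper name q_quantiles and the sample values are made up for illustration, and other interpolation conventions can give slightly different cut points.

```python
import statistics

def q_quantiles(data, q):
    """Return the q - 1 cut points that split `data` into q roughly equal parts.

    Hypothetical helper: relies on statistics.quantiles (Python 3.8+) with its
    default 'exclusive' interpolation method.
    """
    return statistics.quantiles(data, n=q)

# Made-up sample, for illustration only.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

median_cut = q_quantiles(sample, 2)  # one cut point: the median
quartiles = q_quantiles(sample, 4)   # three cut points: Q1, Q2, Q3

print(median_cut)  # [54.0]
print(quartiles)   # [47.75, 54.0, 68.25]
```

Note that the second quartile equals the single 2-quantile, which is the median of the sample.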
Interquartile Range (IQR): The distance between the first and third quartiles is a simple measure of the spread that gives the range covered by the middle half of the data. This distance is called the Interquartile range.
IQR = Q3 - Q1
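As a rough sketch, the range and the interquartile range can be computed side by side in Python. The function names and the sample values are assumptions chosen for illustration, and the quartiles come from statistics.quantiles with its default method, so a different quartile convention may give a slightly different IQR.

```python
import statistics

def value_range(data):
    """Range: difference between the largest and smallest values."""
    return max(data) - min(data)

def interquartile_range(data):
    """IQR = Q3 - Q1, using the default quartile method of statistics.quantiles."""
    q1, _q2, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

# Made-up sample, for illustration only.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(value_range(sample))          # 80
print(interquartile_range(sample))  # 20.5
```

For this sample the range is driven by the single largest value, 110, while the IQR only reflects the middle half of the data.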
Variance & Standard Deviation: Both measure how spread out a data distribution is around its mean.
A low standard deviation means that the data observations tend to be very close to the mean.
A high standard deviation means that the data are spread out over a large range of values.
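The variance of N observations, x1, x2, ..., xN, for a numeric attribute X is:
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 \]
where \bar{x} is the mean of the observations.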
Mathematically, the standard deviation is defined as the square root of the variance.
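A minimal Python sketch of both measures follows. It assumes the population form of the variance, which averages the squared deviations from the mean over all N observations; the sample form, which divides by N - 1, is a different convention and is not shown here. The function names and the sample values are illustrative assumptions.

```python
import math
import statistics

def population_variance(data):
    """Average of the squared deviations from the mean (divides by N)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

def population_std(data):
    """Standard deviation: square root of the variance."""
    return math.sqrt(population_variance(data))

# Made-up sample with mean 5, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(sample))  # 4.0
print(population_std(sample))       # 2.0

# Cross-check against the standard library's population variance.
print(math.isclose(population_variance(sample), statistics.pvariance(sample)))  # True
```

A sample whose values are all equal would give a variance and standard deviation of 0, the extreme case of observations lying close to the mean.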