In this article, we discuss outliers, outlier detection, and why they matter in data science, using a real-world example.

Suppose you are a data scientist at a credit card company, where you manage a large volume of customer data. Each day, you examine:

  • The amount of each purchase a customer makes.
  • The location of each purchase.

While examining the data, you may notice patterns in a customer's purchases. For example, suppose a customer spends around $50 per day at a restaurant in Las Vegas. Perhaps the customer is a regular at that restaurant.

But suppose that, after a few months of this regular pattern, the customer suddenly makes a purchase of $50,000 at a place very far from Las Vegas, or very far from where the customer lives.

When this happens, you might suspect that the activity is unusual, since it is unlike the previous purchases of the same customer or of similar customers.

This activity might be credit card fraud, which you need to act on as soon as possible.

Data points like this, which differ greatly from the expected pattern, are known as Outliers or Anomalies, and the process of finding them is known as Outlier Detection.
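As a rough illustration of the credit card scenario, the sketch below checks a new purchase against a customer's past purchases using a simple z-score rule. The function name `is_unusual_purchase`, the purchase amounts, and the threshold of 3 standard deviations are all assumptions made for this example; a real fraud system would use many more signals, such as the purchase location.

```python
from statistics import mean, stdev

def is_unusual_purchase(history, new_amount, threshold=3.0):
    """Return True if new_amount lies more than `threshold` standard
    deviations away from the mean of the past purchases in `history`.

    `history` must contain at least two amounts.
    The threshold of 3.0 is an assumed, commonly used rule of thumb.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past purchases are identical: any different amount stands out.
        return new_amount != mu
    z_score = abs(new_amount - mu) / sigma
    return z_score > threshold

# Hypothetical daily purchase amounts (in dollars) for one customer.
history = [48.0, 52.5, 50.0, 47.25, 55.0, 49.5, 51.0]

print(is_unusual_purchase(history, 53.0))      # a typical restaurant bill
print(is_unusual_purchase(history, 50_000.0))  # the sudden large purchase
```

Here the history is assumed to contain only regular purchases, so its mean and standard deviation describe the expected pattern that a new purchase is compared against.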

Assume that a given statistical process is used to generate a set of data objects.

An Outlier is a data object that deviates significantly from the rest of the objects, as if it were generated by a different mechanism.
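One possible way to turn "deviates significantly from the rest of the objects" into something computable is the modified z-score, which measures each value's distance from the median in units of the median absolute deviation (MAD). This is only one of many techniques and is not prescribed by the definition above. The function name `find_outliers` and the sample data are hypothetical; the constant 0.6745 and the cutoff of 3.5 are conventional choices for this score.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is
    the median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values equal the median:
        # treat every value that differs from it as an outlier.
        return [v for v in values if v != med]
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical purchase amounts: seven regular ones and one extreme one.
amounts = [48.0, 52.5, 50.0, 47.25, 55.0, 49.5, 51.0, 50_000.0]

print(find_outliers(amounts))
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme value can shift the mean and inflate the standard deviation, which may hide the very object we are trying to find.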

All data objects that are expected, that is, all objects other than outliers, are referred to as “Normal” data. Similarly, we may refer to outliers as “Abnormal” data.

Note: In many cases, outliers also occur because of errors in the process by which data is generated or measured.