Why is the Identification of Outliers So Important for Data Analytics?
August 21, 2018

Data analytics predominantly involves working with very large data sets and using automated tools to find patterns and relationships in the data that make sense. One really important task when working with large data sets is the identification of outliers. An outlier can be defined as an event or sample that is not consistent with the rest of the dataset. Its presence could be due to multiple factors, ranging from faulty measurement equipment to human error, any of which can leave the observed value distant from the other values in the dataset.

The fact is that machine learning algorithms are very sensitive to the range as well as the distribution of attribute values. Outliers can spoil or mislead the training process, which results in longer training times, less accurate models and ultimately poor results.

The Impact of Outliers on a Data Set:

Outliers are capable of changing the results of data analysis drastically, with numerous unfavorable impacts such as:
• They increase the error variance and reduce the power of statistical tests
• When they are distributed non-randomly, they can make the data less normally distributed
• They can bias or influence estimates that may be of substantive interest

Coming to the core question: why is the identification of outliers so important? The answer lies in how widely outlier detection applies across industries. A plethora of fields, such as medical diagnosis, sensors (particularly IoT), fraud detection (credit cards etc.) and intrusion detection systems, depend on the detection of an anomaly/outlier. For example, credit card fraud ends up costing all types of companies billions of dollars each year; identifying a fraudulent transaction accurately and in a timely fashion can help reduce costs for everyone.
Though the science behind the identification of outliers is complex, in fraud detection both the speed and the accuracy of the algorithms run over a given dataset are quite important. In fact, fraud detection is a great example of outlier identification, because fraudulent transactions make up much less than even 1% of all transactions. The latest outlier analytics solutions can help assess complex and vast data sets faster and detect the anomalies in them more accurately. Thus, detecting outliers and determining which data is important, as well as reliable, can help data scientists make better predictions in a wide range of industries.
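
To make the impact described above a little more concrete, here is a minimal Python sketch. It is not taken from any particular analytics product: the transaction amounts are invented for illustration, the helper name `iqr_outliers` is hypothetical, and the interquartile-range (Tukey) rule is only one of many possible detection techniques. The sketch shows how a single extreme value inflates the mean and standard deviation of a small sample, and how a simple rule can flag that value.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Invented transaction amounts; the last value is a deliberately injected anomaly.
amounts = [12.5, 14.0, 13.2, 15.1, 12.9, 14.4, 13.7, 950.0]
clean = amounts[:-1]

# One extreme value is enough to distort the summary statistics.
print("without outlier:", statistics.mean(clean), statistics.stdev(clean))
print("with outlier:   ", statistics.mean(amounts), statistics.stdev(amounts))

# The IQR rule flags only the injected value.
print("flagged:", iqr_outliers(amounts))  # -> [950.0]
```

The quartiles themselves are barely moved by the extreme value, which is why a rule built on them can still spot it, whereas the mean and standard deviation are pulled strongly toward it. Real fraud detection systems work with far larger and more complex data, so this should be read as an illustration of the idea rather than a production approach.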