Why is the Identification of Outliers So Important for Data Analytics?

Data analytics is largely about working with very large data sets: using automated tools to draw observations from the data and to find patterns and relationships that make sense. One important task when working with large data sets is identifying outliers. An outlier can be defined as an event or sample that is not consistent with the rest of the dataset. Outliers can arise from many causes, ranging from faulty measurement equipment to human error, and they leave an observation distant from the other values in the dataset.

Many machine learning algorithms are sensitive to the range and distribution of attribute values. Outliers can spoil or mislead the training process, resulting in longer training times, less accurate models and ultimately poor results.

The Impact of Outliers on a Data Set

Outliers can change the results of data analysis drastically, with unfavorable effects such as:

• They increase the error variance and reduce the power of statistical tests
• If they are not randomly distributed, they can reduce the normality of the data
• They can bias or influence estimates that may be of substantive interest

So why is the identification of outliers so important? The answer lies in how widely outlier detection applies across industries. Medical diagnosis, sensors (particularly IoT), fraud detection (credit cards, for example) and intrusion detection systems all depend on detecting anomalies or outliers. For example, credit card fraud costs companies of all types billions of dollars each year; identifying a fraudulent transaction accurately and in a timely fashion can help reduce costs for everyone.
Though the science behind outlier identification is complex, what matters most in fraud detection is the performance and accuracy of the algorithms run over a given dataset. Fraud detection is in fact a great example of outlier identification, because fraudulent transactions make up far less than 1% of all transactions. Modern outlier analytics solutions can help assess complex, vast data sets faster and detect data anomalies more accurately. Thus, detecting outliers and determining which data is important and reliable can help data scientists make better predictions across a wide range of industries.
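
As a small illustration of the points above, the Python sketch below flags an extreme value and shows how much it distorts the mean and standard deviation. It is only a sketch under stated assumptions: the interquartile-range rule (Tukey's fences) is used here as one simple, common technique, not as the method of any particular fraud detection product, and the transaction amounts, the function name iqr_outliers and the multiplier k=1.5 are all hypothetical choices made for the example.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences.

    The fences are [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is a conventional
    default, assumed here for illustration.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical transaction amounts; the last one is deliberately extreme.
amounts = [12.5, 15.0, 14.2, 13.8, 16.1, 15.5, 14.9, 13.2, 950.0]

flagged = iqr_outliers(amounts)
cleaned = [v for v in amounts if v not in flagged]

# Compare summary statistics with and without the flagged values.
mean_with = statistics.mean(amounts)
mean_without = statistics.mean(cleaned)
stdev_with = statistics.stdev(amounts)
stdev_without = statistics.stdev(cleaned)

print("Flagged as outliers:", flagged)
print("Mean with / without outliers:", mean_with, mean_without)
print("Std dev with / without outliers:", stdev_with, stdev_without)
```

In this made-up sample only the 950.0 transaction is flagged, and removing it brings the mean and the spread back in line with the typical amounts. Real fraud detection would need far more than a single-variable rule like this, but the example shows why one unchecked outlier can bias estimates and inflate variance.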

Author: Rajiv Diwan, Practice Head – Advanced Analytics

Rajiv Diwan heads the Advanced Analytics Practice at ITL and is responsible for both customer acquisition and defining the Practice's solution offerings. He set up the CoE on Machine Learning at ITL from scratch and has been instrumental in taking ITL into new verticals, including BFSI. Rajiv is an engineering graduate from BIT, Bangalore, with a specialization in Computer Science, and has over 18 years of experience in Analytics, Data Warehousing, BI and Large Program Management.
