An anomaly, also known as an outlier, is a data point that lies so far from the other data points that it raises suspicion about the authenticity or truthfulness of the dataset. Hawkins (1980) defines an outlier as:

“An observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”

**Types of Outliers**

Depending upon the feature space, outliers can be of two kinds: **Univariate** and **Multivariate**. A univariate outlier is a data point with an extreme value in a single feature. Univariate outliers are usually visible to the naked eye when plotted in a one-dimensional or two-dimensional feature space. A multivariate outlier is a data point with an unusual combination of values across multiple features, even if each value looks ordinary on its own.
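To make the distinction concrete, here is a minimal sketch on synthetic data. The data, the seed, and the query point are all made up for illustration. It checks one point with per-feature z-scores (a univariate view) and with the Mahalanobis distance (a multivariate view). The Mahalanobis distance is just one common choice here, not something this article relies on later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated synthetic features: b is roughly equal to a.
a = rng.normal(0.0, 1.0, size=500)
b = a + rng.normal(0.0, 0.1, size=500)
data = np.column_stack([a, b])

# Hypothetical query point: each coordinate is ordinary on its own,
# but the combination (a high, b low) contradicts the correlation.
point = np.array([1.5, -1.5])

mean = data.mean(axis=0)
std = data.std(axis=0)

# Univariate view: z-score of each feature taken separately.
z_scores = np.abs((point - mean) / std)

# Multivariate view: Mahalanobis distance, which accounts for correlation.
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = point - mean
mahalanobis = np.sqrt(diff @ cov_inv @ diff)

print("per-feature |z|:", z_scores)
print("Mahalanobis distance:", mahalanobis)
```

Under these assumptions, neither feature looks extreme by itself, while the joint distance is large. That is the signature of a multivariate outlier.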

In addition to categorizing outliers by feature space, we can also group them by type. There are three major types of outliers:

###### 1. **Point Outliers**

A point outlier is an observation or data point that is too far from the other data points in n-dimensional feature space. These are the simplest type of outlier.

###### 2. **Contextual Outliers**

Contextual outliers are outliers that depend upon the context. For instance, a temperature of -5 degrees in the north of Africa during summer (June/July) is considered an anomaly, while a temperature of -5 degrees in Norway during December is considered normal.
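The temperature example can be sketched in code by scoring each reading only against readings from the same context. The readings below are invented for illustration, and the cutoff of 2 standard deviations is an assumption chosen for this tiny sample rather than a general rule.

```python
import pandas as pd

# Made-up temperature readings (degrees) for two contexts.
readings = pd.DataFrame({
    "context": ["North Africa, July"] * 7 + ["Norway, December"] * 7,
    "temperature": [38, 40, 37, 39, 41, 36, -5,
                    -4, -6, -5, -3, -7, -5, -5],
})

# z-score of each reading relative to its own context only.
grouped = readings.groupby("context")["temperature"]
readings["z"] = (
    readings["temperature"] - grouped.transform("mean")
) / grouped.transform("std")

# Assumed cutoff for this tiny sample; not a universal rule.
THRESHOLD = 2.0
contextual_outliers = readings[readings["z"].abs() > THRESHOLD]
print(contextual_outliers)
```

The same value of -5 appears in both groups, but only the North African summer reading is flagged, because it is unusual relative to its own context.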

###### 3. **Collective Outliers**

Collective outliers are a group of data points that lie close to each other but far away from the mean of the rest of the data points.

**Reasons for Outliers**

The presence of outliers in a dataset can be attributed to several reasons. Some of them are listed below:

- Errors during data entry. If the data is entered by a human, the chance of error remains high.
- Errors in experimentation or measurement.
- Errors introduced during the data preprocessing phase.
- Natural outliers, which arise from the behaviour of the data itself and are not the result of any error. These are the outliers that should be retained in the dataset.

**Why Outlier Detection is Important**

Outlier detection is important for two reasons. First, outliers correspond to aberrations in the dataset, so outlier detection can help detect events such as fraudulent bank transactions. Consider a scenario where most of a particular customer's bank transactions take place from a certain geographical location. If a transaction by that customer takes place from a different location, it will be detected as an outlier. In such cases, further checks, such as a one-time PIN sent to the customer's cell phone, can be used to ensure that the actual user is executing the transaction.

Second, outliers strongly affect the mean and standard deviation of the dataset, which can increase classification or regression error. To train a prediction algorithm that generalizes well to unseen data, outliers are often removed from the training data.
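The effect on the mean and standard deviation is easy to see on a handful of numbers. The sketch below uses illustrative values (eighteen typical readings plus two extreme ones) and a hand-picked cutoff of 50 to separate them. Both are assumptions made for this example only.

```python
import numpy as np

# Illustrative values: eighteen typical readings plus two extreme ones.
values = np.array([9, 10, 9, 11, 12, 10, 12, 13, 10, 12,
                   9, 14, 90, 92, 15, 13, 13, 14, 13, 15], dtype=float)

# Hand-picked cutoff for this toy data, used only to drop the two extremes.
typical = values[values < 50]

for name, sample in [("with outliers", values), ("without outliers", typical)]:
    print(f"{name}: mean={sample.mean():.2f}, "
          f"std={sample.std():.2f}, median={np.median(sample):.2f}")
```

The two extreme values pull the mean and inflate the standard deviation considerably, while the median barely moves. This is why statistics based on the mean are fragile in the presence of outliers.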

**Outlier Detection Using Isolation Forest**

In this section, we will see how outlier detection can be performed using Isolation Forest, which is one of the most widely used algorithms for outlier detection.

**A Simple Example**

We will first see a very simple and intuitive example of isolation forest before moving to a more advanced example where we will see how isolation forest can be used for predicting fraudulent transactions.

We will start by importing the required libraries. Execute the following script:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.ensemble import IsolationForest

Next, we need to create a two-dimensional array that will contain our dummy dataset. Execute the following script:

X = np.array( [ [9,17], [10,15],[9,16],[11,17],[12,17],

[10,21],[12,18],[13,20],[10,21],[12,13],

[9,15],[14,14],[90,30],[92,28],[15,15],

[13,14],[13,16],[14,16],[13,16],[15,17], ] )

After that, we will create a pandas dataframe from the two-dimensional array. The dataframe will contain two columns A and B. Run the script below:

new_data = pd.DataFrame(X, columns=['A', 'B'])

Let’s plot our dataset and see if we can find any outliers with the naked eye. In the script below, we increase the size of our plot and then plot the columns A and B against each other on a two-dimensional space.

print(plt.rcParams.get('figure.figsize'))

plt.rcParams['figure.figsize'] = [10, 8]

new_data.plot(x='A', y='B', style='o')

In the output, you will see the following figure:

With the naked eye, we can see that the data points at the top right, i.e. points (90, 30) and (92, 28), are the outliers. Let’s see whether the isolation forest algorithm also declares these points as outliers. Look at the following script:

iso_forest = IsolationForest(n_estimators=300, contamination=0.10)

iso_forest = iso_forest.fit(new_data)

In the script above, we create an object of “IsolationForest” class and pass it our dataset. The “fit” method trains the algorithm and finds the outliers from our dataset. To find the outliers, we need to again pass our dataset to the “predict” method as shown below:

isof_outliers = iso_forest.predict(new_data)

The outliers are assigned a value of -1, so we can retrieve the actual data points by using the result of the “predict” method as a boolean mask on our dataset, as shown below:

isoF_outliers_values = new_data[iso_forest.predict(new_data) == -1]

isoF_outliers_values

In the output, you should see the following result:

The result shows that the outlier data points predicted by the isolation forest are indeed (90, 30) and (92, 28) as we discussed earlier.
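Besides the hard -1/1 labels, scikit-learn's `IsolationForest` also exposes `decision_function`, which returns a continuous score where lower values mean more anomalous and negative values correspond to a prediction of -1. The following self-contained sketch repeats the toy dataset and ranks the points by that score. The `random_state=0` argument is an addition made here for reproducibility and is not part of the script above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same toy dataset as above, repeated so the snippet runs on its own.
X = np.array([[9, 17], [10, 15], [9, 16], [11, 17], [12, 17],
              [10, 21], [12, 18], [13, 20], [10, 21], [12, 13],
              [9, 15], [14, 14], [90, 30], [92, 28], [15, 15],
              [13, 14], [13, 16], [14, 16], [13, 16], [15, 17]])

# random_state is an assumption added for reproducible scores.
model = IsolationForest(n_estimators=300, contamination=0.10, random_state=0)
model.fit(X)

# Lower score = more anomalous; negative scores are predicted as outliers.
scores = model.decision_function(X)

# Rank points from most to least anomalous and show the top three.
order = np.argsort(scores)
for idx in order[:3]:
    print(X[idx], round(float(scores[idx]), 3))
```

Looking at the scores rather than only the labels shows how far a point sits from the decision threshold. That helps when choosing a value for `contamination`.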

**Removing Outliers Can Improve Algorithm Performance**

Removing outliers from the dataset can improve the performance of the algorithm in some cases. Let’s now compare the performance of a machine learning algorithm that predicts the value in column B given the value in column A, with and without the outliers. Since the values in column B are continuous, this is a regression problem.

Execute the following script to divide the data into feature and label set:

X = new_data.drop(['B'], axis=1)

y = new_data[['B']]

Next, we need to divide our data into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We will use the random forest algorithm to predict the values. You can choose any algorithm and see if you achieve better results:

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)

regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

Next, let’s see how well the algorithm performs:

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In the output, you should see the following results:

Mean Absolute Error: 2.2758333333333343

Mean Squared Error: 6.115945833333335

Root Mean Squared Error: 2.4730438397515995

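Let’s now remove the outliers from our training set and see if we can get better results. Passing errors='ignore' skips any outlier index that ended up in the test split instead of raising an error:

X_train = X_train.drop(isoF_outliers_values.index, errors='ignore')

y_train = y_train.drop(isoF_outliers_values.index, errors='ignore')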

Now, train the algorithm again on the training set and evaluate it on the test set, as shown below:

regressor = RandomForestRegressor(n_estimators=20, random_state=0)

regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In the output, you should see the following results:

Mean Absolute Error: 2.1366666666666663

Mean Squared Error: 5.925653287981859

Root Mean Squared Error: 2.434266478424632

The results show that the algorithm performs slightly better after removing the outliers: the Mean Absolute Error, Mean Squared Error and Root Mean Squared Error have all decreased.

**Detecting Fraudulent Credit Card Transactions**

One of the most common examples of anomaly detection is the detection of fraudulent credit card transactions. In this section, we will see how the isolation forest algorithm can be used for detecting fraudulent transactions.

The dataset for this section can be downloaded from this Kaggle link.

As a first step we need to import our dataset and drop the time column. The following script does that:

card_data = pd.read_csv(r'E:\Datasets\creditcard.csv')

card_data = card_data.drop(['Time'], axis=1)

Next, we will divide our dataset into normal transactions and fraudulent transactions. All the normal transactions have 0 as the value for class column, while fraudulent transactions have class 1:

fraudulent_transactions = card_data.loc[card_data['Class'] == 1]

normal_transactions = card_data.loc[card_data['Class'] == 0]

Since anomaly detection with isolation forest is an unsupervised learning technique, we do not need the class labels for training. The following script removes the class labels:

fraudulent_transactions = fraudulent_transactions.drop(['Class'], axis=1)

normal_transactions = normal_transactions.drop(['Class'], axis=1)

Next, we need to divide our data into three sets: a training set which will be used for training the isolation forest, a development set of normal transactions, and a test set of fraudulent transactions. The following script does that:

from sklearn.model_selection import train_test_split

train_set, dev_set = train_test_split(normal_transactions, test_size=0.5, random_state=42)

test_set = np.array(fraudulent_transactions)

The next step is to train the isolation forest algorithm on the training set:

classifier = IsolationForest(max_samples=100)

classifier.fit(train_set)

Finally, we evaluate the performance of our algorithm for detecting normal and fraudulent transactions:

train_predictions = classifier.predict(train_set)

dev_predictions = classifier.predict(dev_set)

test_predictions = classifier.predict(test_set)

print("Normal Detection Accuracy:", list(train_predictions).count(1)/train_predictions.shape[0])

print("Fraudulent Detection Accuracy:", list(test_predictions).count(-1)/test_predictions.shape[0])

In the output, you should see the following results:

Normal Detection Accuracy: 0.89999788965721

Fraudulent Detection Accuracy: 0.8821138211382114

The results show that the isolation forest classifies about 90% of normal transactions as normal (measured on the training set) and flags 88.21% of fraudulent transactions as outliers, which is pretty decent.
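The two per-class rates above do not show how many of the flagged transactions are actually fraudulent. A confusion matrix on held-out data makes that visible. The sketch below is a hedged illustration on synthetic data, not on the credit card dataset. The Gaussian clusters, the seeds, and `contamination=0.05` are all assumptions chosen so the snippet runs on its own.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)

# Synthetic stand-ins for the real data (assumption for illustration only).
normal_train = rng.normal(0.0, 1.0, size=(2000, 5))
normal_test = rng.normal(0.0, 1.0, size=(1000, 5))
fraud_test = rng.normal(6.0, 1.0, size=(50, 5))

# contamination=0.05 is an assumed setting, not the article's configuration.
model = IsolationForest(max_samples=100, contamination=0.05, random_state=42)
model.fit(normal_train)

X_eval = np.vstack([normal_test, fraud_test])
y_true = np.array([0] * len(normal_test) + [1] * len(fraud_test))  # 1 = fraud

# Map IsolationForest output (-1 outlier, 1 inlier) to fraud labels (1, 0).
y_pred = (model.predict(X_eval) == -1).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)
precision = tp / (tp + fp)
false_positive_rate = fp / (fp + tn)

print("Recall on fraud:", recall)
print("Precision on fraud:", precision)
print("False positive rate on normal:", false_positive_rate)
```

Because fraud is rare, even a small false positive rate on normal transactions can mean that many flagged transactions are legitimate. It is therefore worth reporting precision alongside the detection rates.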

**Conclusion**

Anomaly or outlier detection is one of the most important machine learning tasks. It has a variety of applications, ranging from spotting suspicious website logins to detecting fraudulent credit card transactions. This article explained the theory of outlier detection, walked through a simple isolation forest example, and used fraudulent transaction detection as a practical example.