Using Isolation Forest for Outlier Detection In Python

Fraudulent Transaction Detection

An anomaly, also known as an outlier, is a data point that lies so far away from the other data points that suspicions arise over the authenticity or truthfulness of the dataset. Hawkins (1980) defines an outlier as:

“An observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”

Types of Outliers

Depending upon the feature space, outliers can be of two kinds: univariate and multivariate. A univariate outlier is an extreme value in a single feature, and is often visible to the naked eye when the data is plotted in a one- or two-dimensional feature space. A multivariate outlier is an unusual combination of values across multiple features: each individual value may look normal, but taken together the values deviate from the rest of the data.
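To make the distinction concrete, here is a minimal sketch with made-up height/weight data: the last point looks normal in each feature on its own, so per-feature checks miss it, but the combination of values makes it a multivariate outlier.

import numpy as np

# Hypothetical data: heights (cm) and weights (kg)
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0, 155.0])
weights = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 88.0])

# Per-feature z-scores do not flag the last point: 155 cm and 88 kg are
# each well inside the range of their own feature
z_h = (heights - heights.mean()) / heights.std()
z_w = (weights - weights.mean()) / weights.std()
print(z_h[-1], z_w[-1])  # both far below a typical threshold such as 3

# Yet a short height combined with a high weight breaks the joint trend
# between the two features, making the point a multivariate outlier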

In addition to categorizing outliers by feature space, we can also group them by type. There are three major types of outliers:

1. Point Outliers

A point outlier is an observation or data point that lies too far from the other data points in the n-dimensional feature space. These are the simplest type of outlier.

2. Contextual Outliers

Contextual outliers are outliers that depend upon the context. For instance, a temperature of -5 degrees in the north of Africa during summer (June/July) is considered an anomaly, while a temperature of -5 degrees in Norway during December is considered normal. A code sketch of this idea follows the list below.

3. Collective Outliers

Collective outliers are a group of data points that occur close together but lie far from the mean of the rest of the data points.
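As promised above, here is a minimal sketch of context-dependent flagging, using hypothetical monthly temperature readings. The same value, -5 degrees, is flagged as an outlier in July but treated as normal in December, because each reading is judged against the statistics of its own month:

import pandas as pd

# Hypothetical daily temperature readings (°C) with the month as context
temps = pd.DataFrame({
    'month': ['Jul', 'Jul', 'Jul', 'Jul', 'Jul', 'Jul',
              'Dec', 'Dec', 'Dec', 'Dec', 'Dec', 'Dec'],
    'temp':  [30, 31, 32, 29, 30, -5,
              -4, -6, -5, -5, -4, -6],
})

# Compute a z-score for each reading relative to its own month
z = temps.groupby('month')['temp'].transform(lambda s: (s - s.mean()) / s.std())

# -5 °C is a contextual outlier in July (|z| > 2) but not in December
temps['is_outlier'] = z.abs() > 2
print(temps)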

Reasons for Outliers

The presence of outliers in a dataset can be attributed to several reasons. Some of them are listed below:

  1. Errors while performing data entry. Especially if the data is entered by a human, the chance of human error remains high.
  2. Errors during experimentation.
  3. Errors introduced during the data preprocessing phase.
  4. Natural outliers, which arise due to the behaviour of the data and are not the result of any error. These are the outliers that should be retained in the dataset.
Why Outlier Detection is Important

Outlier detection is important for two reasons. First, since outliers correspond to aberrations in the dataset, outlier detection can help catch fraudulent bank transactions. Consider the scenario where most of the bank transactions of a particular customer take place from a certain geographical location. If a transaction for that customer then takes place from a different geographical location, it will be detected as an outlier. In such cases, further checks, such as a one-time PIN sent to the customer's cell phone, can be used to ensure that the actual user is executing the transaction.

Second, outliers can heavily skew the mean and standard deviation of the dataset, which can result in increased classification or regression error. To train a prediction algorithm that generalizes well on unseen data, outliers are therefore often removed from the training data.
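As a quick illustration of this second point, a minimal sketch with made-up numbers shows how a single extreme value can shift the mean and inflate the standard deviation:

import numpy as np

values = np.array([10, 11, 9, 12, 10, 11])
with_outlier = np.append(values, 90)  # add a single extreme value

print(values.mean(), values.std())              # roughly 10.5 and 1.0
print(with_outlier.mean(), with_outlier.std())  # roughly 21.9 and 27.8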

Outlier Detection Using Isolation Forest

In this section, we will see how outlier detection can be performed using isolation forest, one of the most widely used algorithms for the purpose. The algorithm builds an ensemble of random trees that recursively split the data on randomly chosen features and split values; because outliers are few and different, they are isolated in fewer splits on average, and this shorter average path length is what marks them as anomalous.

A Simple Example

We will first look at a very simple and intuitive example of isolation forest before moving to a more advanced example where isolation forest is used to detect fraudulent transactions.

We will start by importing the required libraries. Execute the following script:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.ensemble import IsolationForest

Next, we need to create a two-dimensional array that will contain our dummy dataset. Execute the following script:

X = np.array([[9, 17], [10, 15], [9, 16], [11, 17], [12, 17],
              [10, 21], [12, 18], [13, 20], [10, 21], [12, 13],
              [9, 15], [14, 14], [90, 30], [92, 28], [15, 15],
              [13, 14], [13, 16], [14, 16], [13, 16], [15, 17]])

After that, we will create a pandas dataframe from the two-dimensional array. The dataframe will contain two columns A and B. Run the script below:

new_data = pd.DataFrame(X, columns=['A', 'B'])

Let’s plot our dataset and see if we can find any outliers with the naked eye. In the script below, we increase the size of our plot and then plot the columns A and B against each other in two-dimensional space.

print(plt.rcParams.get('figure.figsize'))

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

new_data.plot(x='A', y='B', style='o')

In the output, you will see the following figure:

[Figure: scatter plot of columns A and B, with two points far to the top right of the main cluster]

To the naked eye, the data points at the top right, i.e. the points (90, 30) and (92, 28), are the outliers. Let’s see whether the isolation forest algorithm also declares these points outliers. Look at the following script:

iso_forest = IsolationForest(n_estimators=300, contamination=0.10)
iso_forest = iso_forest.fit(new_data)

In the script above, we create an object of the IsolationForest class and fit it on our dataset. The fit method trains the algorithm. To find the outliers, we then pass our dataset to the predict method as shown below:

isof_outliers = iso_forest.predict(new_data)

The outliers are assigned a value of -1; therefore we can get the actual outlier data points by using the result of the predict method as a boolean mask on our dataset, as shown below:

isoF_outliers_values = new_data[isof_outliers == -1]
isoF_outliers_values

In the output, you should see the following result:

[Table: the rows of new_data flagged as outliers, containing the points (90, 30) and (92, 28)]

The result shows that the outlier data points predicted by the isolation forest are indeed (90, 30) and (92, 28) as we discussed earlier.
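Beyond the -1/1 labels, scikit-learn's IsolationForest also exposes continuous anomaly scores through its decision_function method, where negative scores correspond to outliers and more negative means more anomalous. A short sketch that ranks our points by score:

scores = iso_forest.decision_function(new_data)

# Sort the points by anomaly score: the two outliers should appear first
# with negative scores, followed by the inliers with positive scores
print(new_data.assign(score=scores).sort_values('score').head())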

Removing Outliers Can Improve Algorithm Performance

Removing outliers from the dataset can improve the performance of a learning algorithm in some cases. Let’s now compare the performance of a machine learning algorithm at predicting the value in column B, given the value in column A, with and without the outliers. Since the values in column B are continuous, this is a regression problem.

Execute the following script to divide the data into feature and label set:

X = new_data.drop(['B'], axis=1)
y = new_data[['B']]

Next, we need to divide our data into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We will use the random forest algorithm to predict the values. You can choose any algorithm and see if you achieve better results:

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

Next, let’s see how well the algorithm performs:

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In the output, you should see the following results:

Mean Absolute Error: 2.2758333333333343
Mean Squared Error: 6.115945833333335
Root Mean Squared Error: 2.4730438397515995

Let’s now remove the outliers from our dataset and see if we can get better results. We pass errors='ignore' to drop so that any outlier rows that happened to land in the test split are simply skipped:

X_train = X_train.drop(isoF_outliers_values.index, errors='ignore')
y_train = y_train.drop(isoF_outliers_values.index, errors='ignore')

Now train the algorithm again on the training set and evaluate it on the test set, as shown below:

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In the output, you should see the following results:

Mean Absolute Error: 2.1366666666666663
Mean Squared Error: 5.925653287981859
Root Mean Squared Error: 2.434266478424632

The results show that the algorithm performs better after removing the outliers: the mean absolute error, mean squared error, and root mean squared error have all decreased.

Detecting Fraudulent Credit Card Transactions

One of the most common applications of anomaly detection is the detection of fraudulent credit card transactions. In this section, we will see how the isolation forest algorithm can be used for detecting fraudulent transactions.

The dataset for this section (the Credit Card Fraud Detection dataset) can be downloaded from Kaggle.

As a first step, we need to import our dataset and drop the Time column. The following script does that:

card_data = pd.read_csv(r'E:\Datasets\creditcard.csv')
card_data = card_data.drop(['Time'], axis=1)

Next, we will divide our dataset into normal and fraudulent transactions. All the normal transactions have 0 as the value of the Class column, while fraudulent transactions have the value 1:

fraudulent_transactions = card_data.loc[card_data['Class'] == 1]
normal_transactions = card_data.loc[card_data['Class'] == 0]
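Before going further, it is worth checking how imbalanced the two groups are; in this well-known Kaggle dataset, fraudulent transactions make up only a tiny fraction of the total (on the order of five hundred frauds among several hundred thousand transactions):

# Fraudulent transactions (Class == 1) are a tiny minority of the data
print(card_data['Class'].value_counts())
print(len(normal_transactions), len(fraudulent_transactions))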

Since isolation forest is an unsupervised learning technique, we do not need the class labels for training. The following script removes them:

fraudulent_transactions = fraudulent_transactions.drop(['Class'], axis=1)
normal_transactions = normal_transactions.drop(['Class'], axis=1)

Next, we need to divide our data into three sets: a training set of normal transactions that will be used to train the isolation forest, a held-out dev set of normal transactions, and a test set of fraudulent transactions. The following script does that:

from sklearn.model_selection import train_test_split

train_set, dev_set = train_test_split(normal_transactions, test_size=0.5, random_state=42)
test_set = np.array(fraudulent_transactions)

The next step is to train the isolation forest algorithm on the training set:

classifier = IsolationForest(max_samples=100)
classifier.fit(train_set)
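One detail worth knowing: the fraction of training points the model labels as outliers is governed by the contamination parameter. Older scikit-learn versions default to 0.1, which is consistent with the roughly 90% normal-detection accuracy reported below, while newer versions default to 'auto'. A hypothetical sketch of tightening it (not the setting used for the results below):

# With contamination=0.01, only about 1% of training points get flagged
classifier_strict = IsolationForest(max_samples=100, contamination=0.01)
classifier_strict.fit(train_set)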

Finally, we evaluate the performance of our algorithm for detecting normal and fraudulent transactions:

train_predictions = classifier.predict(train_set)
dev_predictions = classifier.predict(dev_set)
test_predictions = classifier.predict(test_set)

print("Normal Detection Accuracy:", list(train_predictions).count(1)/train_predictions.shape[0])
print("Fraudulent Detection Accuracy:", list(test_predictions).count(-1)/test_predictions.shape[0])
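The dev set of held-out normal transactions, whose predictions we computed above but did not report, can be scored the same way (a sketch; the exact number will vary from run to run since the forest is randomized):

# Fraction of held-out normal transactions correctly labeled as inliers (1)
print("Dev Set Normal Detection Accuracy:", list(dev_predictions).count(1)/dev_predictions.shape[0])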

In the output, you should see the following results:

Normal Detection Accuracy: 0.89999788965721
Fraudulent Detection Accuracy: 0.8821138211382114

The result shows that isolation forest has an accuracy of 89.99% for detecting normal transactions and an accuracy of 88.21% for detecting fraudulent transactions, which is pretty decent.

Conclusion

Anomaly or outlier detection is one of the most important machine learning tasks. Anomaly detection has a variety of applications, ranging from flagging suspicious website logins to catching fraudulent credit card transactions. In this article, the theory of outlier detection has been explained, and fraudulent transaction detection has been demonstrated as a practical example.
