Static Data Visualizations in Python


With the advent of high-performance hardware and the availability of huge amounts of data, the field of data science has taken a huge leap, both in research and in application. It is for this reason that the phrase “Data is the new oil” has been doing the rounds in the scientific and financial sectors.

Depending upon your objective, a data science project goes through various steps. However, one step that is common to almost all data science projects is data visualization, which is part of exploratory data analysis. Data visualization refers to presenting useful information from the data graphically via different types of plots. It is particularly important if you want to present data trends to a non-technical audience.

According to Stephen Few (CEO of a data visualization company):

“Graphs reveal more than a collection of individual values. Because of their visual nature, they show the overall shape of your data.”

In this article, we will see how to extract useful information from data and present it as static visualizations with the help of Python. Static visualizations, as the name suggests, are visualizations that the user cannot interact with. You will see non-static data visualization in the next blog post.

Types of Plots for Data Visualization

In this section, you will see different types of data visualizations with the help of Python.

Importing Libraries

Different Python libraries exist for static data visualization. However, we will be using the Matplotlib and the Seaborn libraries. With these two libraries, you can plot almost any type of data.

Execute the following script to import the required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

The Dataset

The dataset that we will be using to plot our visualization can be downloaded from this Kaggle Link.

The dataset contains customer information for all the transactions made on Black Friday at a retail store. Download the dataset and place it in your local directory.

The following script imports the dataset into the Python program.

dataset = pd.read_csv(r"E:\Datasets\BlackFriday.csv")

Let’s see what the dataset actually looks like. The following script prints the first five rows of the dataset:

dataset.head()

The output looks like this:


You can see that for each transaction in the dataset, you have information such as the customer’s gender, age group, city and marital status, along with the amount spent (Purchase). Let’s now see how many records we have in our dataset.

dataset.shape

In the output, you will see that the dataset has 537577 rows and 12 columns, which means we have 537577 records in our dataset.

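Let’s now remove the records that have a missing value in any column. Missing values can distort data visualizations, so it is usually better to remove them. The following script returns a copy of the dataset without the incomplete records. Note that dropna() does not modify dataset in place: to use the cleaned data in the plots that follow, assign the result back with dataset = dataset.dropna().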

dataset.dropna()
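Before dropping anything, it can help to check how many values are missing in each column and to confirm what dropna() actually returns. The following is a minimal sketch on a made-up toy DataFrame (the values and the choice of columns are illustrative assumptions, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Product_Category_2": [2.0, np.nan, 8.0, np.nan],
    "Purchase": [8370, 15200, 1422, 7969],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() returns a new DataFrame; `toy` itself is left unchanged.
toy_clean = toy.dropna()
print(toy.shape, toy_clean.shape)
```

The same two calls, isna().sum() and dropna(), can be applied to the real dataset in the same way.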

Finally, before we plot any graph, let’s increase the default figure size so that the plots are easier to read.

import matplotlib.pyplot as plt
print(plt.rcParams.get('figure.figsize'))
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size
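If you do not need to inspect the old value first, the same setting can be written more compactly. A minimal sketch, in which the non-interactive Agg backend is selected only so that the snippet also runs on a machine without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt

# Set the default figure width and height (in inches) in one assignment.
plt.rcParams["figure.figsize"] = (10, 8)
print(plt.rcParams["figure.figsize"])
```

Every figure created after this assignment uses the new default size unless a size is passed explicitly.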

We are all set to see different types of visualizations now.

Distribution Plot

A distribution plot, as the name suggests, is used to plot the distribution of the numeric values in a particular column. Take a look at this article to learn more about distributions.

A typical use case for a distribution plot in our dataset is to view the distribution of the amount spent by customers per transaction. Let’s draw the distribution plot for the “Purchase” column, which contains the amount spent per transaction.

sns.distplot(dataset['Purchase'])

To plot the distribution plot, we can use the distplot function and pass it the column to plot. The output of the above script looks like this:


The output clearly shows that the majority of transactions have amounts between 5000 and 10000. You can also notice that customers tend to spend amounts in multiples of 5000. For instance, you can see peaks at 5000, 10000, 15000 and 20000.

You can also decrease or increase the number of bins in order to have a more general or more refined view of the distribution. For instance, the following script plots the distribution with 10 bins and with the density curve turned off (kde=False).

sns.distplot(dataset['Purchase'], kde=False, bins=10)

The output looks like this:


Now you can see more clearly that the majority of customers spend between 5000 and 10000 per transaction.
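To see what the bins parameter does, it can help to compute the bin counts directly. The sketch below uses NumPy on synthetic amounts (made up for illustration) to split the value range into 10 equal-width bins, which is the kind of grouping a 10-bin histogram displays:

```python
import numpy as np

# Synthetic purchase amounts, made up for illustration only.
rng = np.random.default_rng(0)
amounts = rng.integers(low=0, high=24000, size=1000)

# Split the value range into 10 equal-width bins and count the values in each.
counts, bin_edges = np.histogram(amounts, bins=10)

# Each bin is described by its left edge, its right edge and its count.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:8.0f} to {right:8.0f}: {count}")
```

Fewer bins give a coarser summary, while more bins reveal finer structure such as the peaks mentioned earlier.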

Count Plot

The count plot counts the frequency of occurrence of each of the unique values in a column. For instance, if you want to count the transactions per unique age group, you can use the count plot as follows:

sns.countplot(x='Age', data=dataset)

The output looks like this:


The output clearly shows that more than 200000 transactions were made by people in the 26-35 age group. This information can be very useful. For instance, a company that focuses its marketing campaigns on people between the ages of 26 and 35 may be able to increase its profits.

We can also group count plots with respect to a specific column. For instance, if we want to display the age group along with the gender, we can use count plot as follows:

sns.countplot(x='Age', hue='Gender', data=dataset)


We can see that males are in the majority in every age group.
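The numbers behind both count plots can also be computed directly with pandas, which is handy when you need exact counts rather than bar heights. A minimal sketch on a made-up toy DataFrame (the rows are illustrative, not real transactions):

```python
import pandas as pd

# Toy stand-in for the real dataset; rows are made up for illustration.
toy = pd.DataFrame({
    "Age": ["26-35", "26-35", "18-25", "36-45", "26-35", "18-25"],
    "Gender": ["M", "F", "M", "M", "M", "F"],
})

# Counts per age group: the bar heights of a plain count plot.
age_counts = toy["Age"].value_counts()
print(age_counts)

# Counts per age group split by gender: the bar heights of a grouped count plot.
age_by_gender = pd.crosstab(toy["Age"], toy["Gender"])
print(age_by_gender)
```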

Bar Plot

A bar plot can be used to plot the average value of a numeric column with respect to a categorical column. For instance, if you want to display the average amount spent by each age group, you can use a bar plot as follows:

sns.barplot(x='Age', y='Purchase', data=dataset)


From the output, it can be seen that the average amount per transaction is similar across age groups, between 8000 and 10000.
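By default, the bar heights in such a plot are the per-group means, so the same numbers can be obtained with a pandas groupby. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy stand-in for the real dataset; values are made up for illustration.
toy = pd.DataFrame({
    "Age": ["18-25", "18-25", "26-35", "26-35", "26-35"],
    "Purchase": [6000, 8000, 9000, 10000, 11000],
})

# Mean purchase amount per age group: the bar heights of the bar plot.
mean_purchase = toy.groupby("Age")["Purchase"].mean()
print(mean_purchase)
```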

Viewing Count and Bar Plot Together

Sometimes, a count plot or a bar plot on its own does not convey very useful information. In such cases, you need to look at both plots together to better understand the relationship.

Let’s first use a count plot to see the number of transactions per city:

sns.countplot(x='City_Category', data=dataset)

The output looks like this:


The output shows that city B has the highest number of transactions. There can be multiple reasons for that: the retail store might be located in or close to city B, or the population of city B might be higher than that of the other cities.

Let’s now plot the average amount of money spent per transaction by the customers from three cities using the Bar plot.

sns.barplot(x='City_Category', y='Purchase', data=dataset)


The output is very interesting. From the count plot we saw that most of the transactions were made by people from city B; however, the bar plot above shows that people from city C spent a higher average amount per transaction. There can be several reasons for that too. For instance, people in city C might be richer than those in the other cities.
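The two views can also be placed side by side in a single table by aggregating both the count and the mean per city. A minimal sketch on made-up values, chosen so that one city has the most transactions while another has the highest average:

```python
import pandas as pd

# Toy stand-in for the real dataset; values are made up for illustration.
toy = pd.DataFrame({
    "City_Category": ["A", "B", "B", "B", "C", "C"],
    "Purchase": [8000, 9000, 9500, 8500, 10000, 11000],
})

# Number of transactions and mean purchase amount per city, in one table.
summary = toy.groupby("City_Category")["Purchase"].agg(["count", "mean"])
print(summary)

# The city with the most transactions need not have the highest average.
print(summary["count"].idxmax(), summary["mean"].idxmax())
```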

Box Plot

The box plot is similar to the bar plot. However, instead of the average, the box plot depicts the distribution of a numeric column in the form of quartiles against a categorical column. Let’s see an example of the box plot.

sns.boxplot(x='Gender', y='Purchase', data=dataset)

The output looks like this:


The output shows that, for both genders, the lowest 25% of transactions fall between 0 and roughly 5000. The next 25% fall between 5000 and around 7500, with the median (the line inside the box) slightly higher for males. The third quarter of transactions lies between approximately 7500 and 12000 for females, and between approximately 7500 and 13000 for males.
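The quartiles that a box plot draws can be computed directly with pandas, which makes the box boundaries easier to read off than estimating them from the figure. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy stand-in for the real dataset; values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["F"] * 5 + ["M"] * 5,
    "Purchase": [2000, 4000, 6000, 8000, 10000,
                 3000, 5000, 7000, 9000, 11000],
})

# First quartile, median and third quartile of Purchase for each gender.
# In a box plot these are the bottom of the box, the line inside it, and its top.
quartiles = (
    toy.groupby("Gender")["Purchase"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```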

The box plot can also be grouped with respect to another column. For instance, we can show the amount spent by both genders, grouped by their marital status.
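The script for this grouped plot does not appear in the text. A minimal sketch of how it might look is given below, assuming the marital status column is named Marital_Status and is passed through the hue parameter, as in the grouped count plot earlier. The data is synthetic so that the snippet runs on its own:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the real dataset; values are made up for illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Gender": rng.choice(["M", "F"], size=200),
    "Marital_Status": rng.choice([0, 1], size=200),  # assumed column name
    "Purchase": rng.integers(0, 24000, size=200),
})

# One box per (Gender, Marital_Status) combination.
ax = sns.boxplot(x="Gender", y="Purchase", hue="Marital_Status", data=toy)
```

With the real data, you would pass data=dataset instead of the toy DataFrame.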


The output clearly shows that single men (blue box on the right) spent the highest average amount per transaction on Black Friday.

Pie Chart

A pie chart presents the share of each unique value in a categorical column with respect to a numeric column.

A typical use case for a pie chart in our dataset is to present the share of the total amount spent by customers belonging to different occupations. Execute the following script:

from matplotlib.pyplot import pie, axis, show
# Total amount spent per occupation
sums = dataset.Purchase.groupby(dataset.Occupation).sum()
axis('equal')  # equal aspect ratio so that the pie is drawn as a circle
pie(sums, labels=sums.index, autopct='%1.1f%%')
show()

The output of the above script looks like this:


The output shows that customers with an occupation ID of 4 have the highest share of the total amount spent. Similarly, unemployed people account for 12.5% of the total amount spent on Black Friday.
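The percentages printed on the pie slices are each group’s sum divided by the overall total. If you need the exact figures, they can be computed directly, as in this minimal sketch on made-up values:

```python
import pandas as pd

# Toy stand-in for the real dataset; values are made up for illustration.
toy = pd.DataFrame({
    "Occupation": [0, 0, 4, 4, 7],
    "Purchase": [1000, 2000, 3000, 3000, 1000],
})

# Total amount spent per occupation, then each occupation's share in percent.
sums = toy.groupby("Occupation")["Purchase"].sum()
shares = (sums / sums.sum() * 100).round(1)
print(shares)
```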

Conclusion

Data visualization is one of the most important tasks in data science. Visualizing data before performing any other operation can help identify the trends in the data. In this article, you saw different types of static data visualization, along with their use cases, applied to the Black Friday transactions of a retail store.
