Exploratory Data Analysis Cheatsheet (everything you might need)

Datasans
8 min read · Dec 18, 2022
https://www.exploringdata.org/

Exploratory Data Analysis (EDA) is a crucial step in the data science process. It involves analyzing and summarizing a dataset in order to understand its properties and relationships. EDA allows data scientists to uncover patterns, trends, and anomalies in the data, and to generate hypotheses for further investigation. It also helps to identify any missing or incorrect data, and to determine the most appropriate statistical methods and visualizations for the data. EDA is an iterative process, with data scientists constantly reviewing and refining their understanding of the data as they explore it. It is an essential tool for understanding and communicating the insights that can be derived from data, and for informing data-driven decision making.

EDA with Python

So what might you use in the EDA process?
Here's a simple cheatsheet of snippets that are useful in most cases. The examples use the Titanic dataset (https://www.kaggle.com/c/titanic/data), saved locally as titanic.csv.

# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# load the dataset
df = pd.read_csv("titanic.csv")

# print the first few rows of the DataFrame
print(df.head())

# print the DataFrame's shape
print(df.shape)

# print the DataFrame's data types
print(df.dtypes)

# check for missing values
print(df.isnull().sum())

# visualize the distribution of a numeric column
plt.hist(df['Age'].dropna())  # drop missing ages before binning
plt.show()

# visualize the distribution of a categorical column
df['Sex'].value_counts().plot(kind='bar')
plt.show()

# calculate basic statistics for a numeric column
print(df['Fare'].describe())

# calculate the correlation between two numeric columns
print(df['Fare'].corr(df['Survived']))

# group the data by a categorical column and calculate statistics
grouped_df = df.groupby('Pclass')['Survived'].mean()
print(grouped_df)

# create a scatter plot to visualize the relationship between two numeric columns
plt.scatter(df['Age'], df['Fare'])
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()

# create a box plot to visualize the distribution of a numeric column
plt.boxplot(df['Fare'])
plt.ylabel('Fare')
plt.show()

# create a bar plot to visualize the mean of a numeric column for each category of a categorical column
df.groupby('Sex')['Age'].mean().plot(kind='bar')
plt.ylabel('Average Age')
plt.show()

# create a pivot table to summarize the data
pivot_table = df.pivot_table(index='Sex', columns='Pclass', values='Fare', aggfunc='mean')
print(pivot_table)

# create a heatmap to visualize the pivot table
plt.pcolor(pivot_table, cmap='Reds')
plt.colorbar()
plt.show()

# create a pairplot to visualize the relationships between multiple numeric columns
sns.pairplot(df, vars=['Age', 'Fare', 'SibSp'])
plt.show()

# create a bar plot to visualize the count of a categorical column
df['Embarked'].value_counts().plot(kind='bar')
plt.ylabel('Count')
plt.show()

# create a countplot to visualize the count of a categorical column by the categories of another categorical column
sns.countplot(x='Sex', hue='Pclass', data=df)
plt.show()

# create a point plot to visualize the mean of a numeric column by the categories of a categorical column
sns.pointplot(x='Sex', y='Age', data=df)
plt.ylabel('Average Age')
plt.show()

# create a violin plot to visualize the distribution of a numeric column by the categories of a categorical column
sns.violinplot(x='Sex', y='Age', data=df)
plt.ylabel('Age')
plt.show()

# create a box plot to visualize the distribution of a numeric column by the categories of a categorical column
sns.boxplot(x='Sex', y='Age', data=df)
plt.ylabel('Age')
plt.show()

# create a swarm plot to visualize the distribution of a numeric column by the categories of a categorical column
sns.swarmplot(x='Sex', y='Age', data=df)
plt.ylabel('Age')
plt.show()

# create a faceting grid to visualize the distribution of a numeric column for each category of a categorical column
g = sns.FacetGrid(df, col='Sex')
g.map(plt.hist, 'Age')
plt.show()

# create a heatmap to visualize the correlation between multiple numeric columns
plt.figure(figsize=(12, 8))
sns.heatmap(df.select_dtypes('number').corr(), cmap='RdYlGn', annot=True)  # correlate numeric columns only
plt.show()

# create a lag plot to check for autocorrelation in a numeric column (only meaningful when the rows are in a time or sequence order)
from pandas.plotting import lag_plot
lag_plot(df['Fare'])
plt.show()

# create an autocorrelation plot to visualize the autocorrelation in a numeric column
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(df['Fare'])
plt.show()

# create a scatter plot matrix to visualize the relationships between multiple numeric columns
from pandas.plotting import scatter_matrix
scatter_matrix(df[['Age', 'Fare', 'SibSp']], alpha=0.2, figsize=(6, 6))
plt.show()

# create a regression plot to visualize the relationship between two numeric columns
sns.regplot(x='Age', y='Fare', data=df)
plt.show()

# create a barplot to visualize the mean of a numeric column by the categories of a categorical column
sns.barplot(x='Sex', y='Age', data=df)
plt.ylabel('Average Age')
plt.show()

# create a pointplot to visualize the mean and confidence interval of a numeric column by the categories of a categorical column
# (seaborn >= 0.12 uses errorbar=('ci', 95); older versions use ci=95)
sns.pointplot(x='Sex', y='Age', data=df, errorbar=('ci', 95))
plt.ylabel('Average Age')
plt.show()

# create a lmplot to visualize the relationship between two numeric columns and the categories of a categorical column
sns.lmplot(x='Age', y='Fare', hue='Sex', data=df)
plt.show()

# create a catplot (called factorplot in old seaborn versions) to visualize the mean of a numeric column by the categories of a categorical column
sns.catplot(x='Sex', y='Age', data=df, kind='point')
plt.ylabel('Average Age')
plt.show()

# create a boxenplot to visualize the distribution of a numeric column by the categories of a categorical column
sns.boxenplot(x='Sex', y='Age', data=df)
plt.ylabel('Age')
plt.show()

# create a histogram with a density curve to visualize the distribution of a numeric column
# (histplot replaces the deprecated distplot)
sns.histplot(df['Fare'], kde=True)
plt.show()

# create a kdeplot to visualize the kernel density estimate of a numeric column
sns.kdeplot(df['Fare'])
plt.show()

# create a rugplot to visualize the distribution of a numeric column
sns.rugplot(df['Fare'])
plt.show()

# create a jointplot to visualize the relationship between two numeric columns and their distributions
sns.jointplot(x='Age', y='Fare', data=df)
plt.show()

Data Preprocessing

Here are some steps for data preprocessing that might be useful:

Handling missing values: Missing values can be filled with the mean, median, or mode of the column, or the affected rows can be dropped. The appropriate method depends on the dataset, on how much data is missing, and on the goal of the analysis.
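To make the options concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are illustrative assumptions, not taken from the Titanic file). It shows the share of missing values per column, a median fill, a mode fill, and row dropping side by side.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic file)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "port": ["S", "C", None, "S", "S"],
})

# Share of missing values per column helps decide between filling and dropping
missing_share = toy.isna().mean()

# Numeric column: fill with the median (less affected by outliers than the mean)
age_filled = toy["age"].fillna(toy["age"].median())

# Categorical column: fill with the most frequent value (the mode)
port_filled = toy["port"].fillna(toy["port"].mode()[0])

# Alternative: drop every row that has at least one missing value
complete_rows = toy.dropna()

print(missing_share)
print(age_filled.tolist())
print(port_filled.tolist())
print(len(complete_rows))
```

Dropping rows is the simplest choice, but in this toy example it discards three of the five rows, which is why filling is often preferred when many values are missing.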

Encoding categorical variables: This technique is used when the dataset contains categorical variables, which are variables that can take on a limited number of categories. One-hot encoding is a common method for encoding categorical variables, which creates a new binary column for each category. This is useful for inputting categorical variables into machine learning models, which typically only accept numerical input.
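As a quick illustration of what one-hot encoding produces, the sketch below encodes a tiny made-up table (the values are assumptions for illustration only) with pandas' get_dummies.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "fare": [7.25, 71.28, 8.05],
})

# One binary column per category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["sex"], prefix="sex", dtype=int)

# drop_first=True keeps one column fewer per variable, which avoids
# perfectly redundant columns in linear models
reduced = pd.get_dummies(toy, columns=["sex"], prefix="sex", drop_first=True, dtype=int)

print(encoded)
print(reduced)
```

The original sex column disappears and is replaced by sex_female and sex_male; with drop_first=True only sex_male remains, since the other category is implied.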

Standardizing numeric columns: This technique is used to scale the values of a numeric column so that they have zero mean and unit variance. This is often useful when the numeric columns have different scales and the machine learning model will be sensitive to this difference in scales.
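The arithmetic behind standardization is short enough to write by hand. The sketch below uses made-up ages (an assumption for illustration) and plain NumPy to compute z-scores.

```python
import numpy as np

# Hypothetical ages, chosen only for illustration
ages = np.array([22.0, 38.0, 26.0, 35.0, 54.0])

# z-score: subtract the mean, then divide by the population standard deviation
# (np.std uses ddof=0 by default, which is also what StandardScaler uses)
z_scores = (ages - ages.mean()) / ages.std()

print(z_scores.round(3))
print(z_scores.mean(), z_scores.std())
```

Note that pandas' Series.std() defaults to the sample standard deviation (ddof=1), so standardizing by hand with pandas gives slightly different numbers than StandardScaler unless you pass ddof=0.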

Normalizing: The word is often used loosely for any rescaling, but scikit-learn's Normalizer rescales each row (sample) to unit norm rather than rescaling a column. This is useful when only the direction of a feature vector matters, for example with cosine similarity. To bring a single column into the range 0 to 1, use min-max scaling, described below.

Binning numeric columns: This technique divides the values of a numeric column into bins, turning a continuous column into a categorical one. This can help with certain types of analysis or machine learning models.
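Binning can also be done directly in pandas. The sketch below uses made-up ages, and the bin edges and labels are arbitrary assumptions chosen for illustration; pd.cut uses fixed edges, while pd.qcut picks edges so that each bin holds roughly the same number of rows.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([4, 15, 22, 37, 45, 63, 71])

# Fixed-width style binning with hand-picked edges: (0, 18], (18, 60], (60, 120]
age_group = pd.cut(ages, bins=[0, 18, 60, 120], labels=["child", "adult", "senior"])

# Equal-frequency binning: edges are chosen from the data's quantiles
age_tercile = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(age_group.tolist())
print(age_group.value_counts().sort_index())
print(age_tercile.tolist())
```

Hand-picked edges are easier to explain, while quantile-based edges avoid nearly empty bins when the data is skewed.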

Applying min-max scaling: This technique is used to scale the values of a numeric column so that they have a minimum value of 0 and a maximum value of 1. This is often useful when the numeric columns have different scales and the machine learning model will be sensitive to this difference in scales.

Applying robust scaling: This technique is used to scale the values of a numeric column using the median and interquartile range. This is often useful when the data contains outliers, as it is less sensitive to the influence of outliers compared to other scaling methods.
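To see why robust scaling copes better with outliers, the sketch below applies both min-max and robust scaling by hand to a few made-up fares, one of which is an extreme value (the numbers are assumptions for illustration).

```python
import numpy as np

# Hypothetical fares; the last value is a deliberate outlier
fares = np.array([7.25, 8.05, 13.0, 26.0, 512.33])

# Min-max scaling: the outlier defines the top of the range,
# so every other value is squeezed close to 0
minmax_scaled = (fares - fares.min()) / (fares.max() - fares.min())

# Robust scaling: center on the median and divide by the interquartile range
q1, median, q3 = np.percentile(fares, [25, 50, 75])
robust_scaled = (fares - median) / (q3 - q1)

print(minmax_scaled.round(3))
print(robust_scaled.round(3))
```

With min-max scaling the four ordinary fares all land below 0.04, whereas robust scaling keeps them spread out around 0 and leaves the outlier as a clearly large value.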

Applying power transformations: Power transformations are a family of functions (such as Box-Cox and Yeo-Johnson) that transform the values of a numeric column to stabilize variance and make the distribution closer to normal. They are useful for correcting skewness, since strongly skewed distributions can cause problems when fitting certain types of models.
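A simple way to check whether a transformation helps is to compare skewness before and after. The sketch below uses synthetic right-skewed data (a log-normal sample, an assumption standing in for a column such as a fare) and two hand-applied transformations: a square root (power 1/2) and a logarithm.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (assumption: stands in for a skewed numeric column)
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=1000))

raw_skew = values.skew()
sqrt_skew = np.sqrt(values).skew()   # mild correction
log_skew = np.log1p(values).skew()   # stronger correction, log(1 + x)

print(raw_skew, sqrt_skew, log_skew)
```

A skewness near 0 indicates a roughly symmetric distribution, so the transformation that brings the value closest to 0 is usually the better candidate.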

Applying quantile transformations: This technique maps the values of a numeric column onto a uniform or normal distribution based on their quantiles (ranks). This can help with models that work better when the predictor variables are roughly normally distributed, and it reduces the influence of outliers.
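The idea can be sketched with ranks alone. The code below is a simplified rank-based illustration on made-up fares, not the exact algorithm that QuantileTransformer uses (which interpolates between a set of reference quantiles).

```python
import numpy as np
from scipy.stats import norm, rankdata

# Hypothetical fares with one extreme value
fares = np.array([7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33])

# Step 1: replace each value by its rank (1 = smallest)
ranks = rankdata(fares)

# Step 2: turn ranks into evenly spaced scores strictly between 0 and 1
uniform_scores = (ranks - 0.5) / len(fares)

# Step 3: map the uniform scores through the inverse normal CDF
normal_scores = norm.ppf(uniform_scores)

print(uniform_scores.round(3))
print(normal_scores.round(3))
```

Because only the ordering of the values matters, the extreme fare ends up just one step above its neighbor instead of dominating the scale.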

Applying box-cox transformations: Box-Cox is a specific power transformation that makes the values of a numeric column approximately normally distributed. It requires strictly positive values. This can help with models that work better when the predictor variables are roughly normally distributed.
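Two practical details of Box-Cox are worth seeing in code: SciPy fits the lambda parameter for you, and the transformation can be undone. The sketch below uses synthetic strictly positive data (an assumption for illustration).

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Synthetic strictly positive data (Box-Cox fails on zeros or negatives)
rng = np.random.default_rng(0)
values = rng.lognormal(mean=2.0, sigma=0.5, size=500)

# boxcox returns the transformed values and the fitted lambda
transformed, fitted_lambda = boxcox(values)

# inv_boxcox reverses the transformation, given the same lambda
restored = inv_boxcox(transformed, fitted_lambda)

print(fitted_lambda)
print(np.allclose(restored, values))
```

For log-normal data like this the fitted lambda tends to land near 0, where Box-Cox reduces to a plain logarithm. Columns that contain zeros or negative values need a shift first, or the Yeo-Johnson variant instead.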

# create a copy of the original DataFrame
df_preprocessed = df.copy()

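# handle missing values in the DataFrame
df_preprocessed['Age'] = df_preprocessed['Age'].fillna(df_preprocessed['Age'].median())
# drop rows with a missing port only; a bare dropna() would also drop every row
# with a missing Cabin, which is most of the Kaggle training file
df_preprocessed = df_preprocessed.dropna(subset=['Embarked'])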

# encode categorical variables using one-hot encoding
df_preprocessed = pd.get_dummies(df_preprocessed, columns=['Sex', 'Pclass'], prefix=['sex', 'pclass'])

# standardize the values of a numeric column
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_preprocessed['Age_scaled'] = scaler.fit_transform(df_preprocessed[['Age']])

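# normalize each row to unit norm
# (Normalizer works row-wise, so apply it to several columns at once;
# for column-wise 0-1 scaling use MinMaxScaler)
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalized = normalizer.fit_transform(df_preprocessed[['Age', 'Fare']])
df_preprocessed['Age_normalized'] = normalized[:, 0]
df_preprocessed['Fare_normalized'] = normalized[:, 1]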

# bin the values of a numeric column
from sklearn.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal')
df_preprocessed['Age_binned'] = discretizer.fit_transform(df_preprocessed[['Age']])

# apply a min-max scaling to a numeric column
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_preprocessed['Age_minmax'] = scaler.fit_transform(df_preprocessed[['Age']])

# apply a robust scaling to a numeric column
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
df_preprocessed['Age_robust'] = scaler.fit_transform(df_preprocessed[['Age']])

# apply a power transformation to a numeric column
from sklearn.preprocessing import PowerTransformer

transformer = PowerTransformer(method='yeo-johnson')
df_preprocessed['Age_yeojohnson'] = transformer.fit_transform(df_preprocessed[['Age']])

# apply a quantile transformation to a numeric column
from sklearn.preprocessing import QuantileTransformer

# n_quantiles must not exceed the number of rows (the default is 1000)
transformer = QuantileTransformer(n_quantiles=100, output_distribution='normal')
df_preprocessed['Age_quantile'] = transformer.fit_transform(df_preprocessed[['Age']])

# apply a box-cox transformation to a numeric column (values must be strictly positive)
from scipy.stats import boxcox

df_preprocessed['Age_boxcox'], lambda_ = boxcox(df_preprocessed['Age'])

And several statistical analysis methods…

Mann-Whitney U test: This nonparametric test compares two independent samples, for example the values of a numeric column in two different groups. It tests the null hypothesis that both samples come from the same distribution, without assuming normality.

Kruskal-Wallis H test: This test extends the Mann-Whitney U test to two or more independent samples. It tests the null hypothesis that all samples come from the same distribution.

Wilcoxon signed-rank test: This test is the paired counterpart of the Mann-Whitney U test. It is used when the two samples are paired, for example two measurements on the same subjects, and tests the null hypothesis that the differences between pairs are symmetric about zero.
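To show which test fits which data layout, here is a small sketch on synthetic samples (all values are generated, so the group names and effect sizes are assumptions for illustration): two independent groups, three independent groups, and one paired before/after measurement.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu, wilcoxon

rng = np.random.default_rng(0)

# Three independent groups with shifted centers (synthetic)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=1.5, scale=1.0, size=50)
group_c = rng.normal(loc=3.0, scale=1.0, size=50)

# Two independent samples -> Mann-Whitney U
_, p_mannwhitney = mannwhitneyu(group_a, group_b, alternative="two-sided")

# Two or more independent samples -> Kruskal-Wallis H
_, p_kruskal = kruskal(group_a, group_b, group_c)

# Paired samples (same subjects measured twice) -> Wilcoxon signed-rank
before = rng.normal(loc=10.0, scale=2.0, size=50)
after = before + rng.normal(loc=1.0, scale=0.5, size=50)
_, p_wilcoxon = wilcoxon(before, after)

print(p_mannwhitney, p_kruskal, p_wilcoxon)
```

A small p-value (commonly below 0.05) suggests that the samples do not come from the same distribution. The shifts here are built into the synthetic data, so all three tests should reject the null hypothesis.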

# calculate summary statistics for a numeric column
print(df_preprocessed['Age'].describe())

# calculate the skewness and kurtosis of a numeric column
print(df_preprocessed['Age'].skew())
print(df_preprocessed['Age'].kurtosis())

# calculate the correlation between two numeric columns
print(df_preprocessed['Age'].corr(df_preprocessed['Fare']))

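# perform a t-test to compare the mean of a numeric column between two groups
from scipy.stats import ttest_ind

survived = df_preprocessed['Survived'] == 1
# equal_var=False runs Welch's t-test, which does not assume equal variances
t, p = ttest_ind(df_preprocessed.loc[survived, 'Fare'], df_preprocessed.loc[~survived, 'Fare'], equal_var=False)
print(t, p)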

# perform a one-way ANOVA to compare the mean of a numeric column across two or more groups
from scipy.stats import f_oneway

fare_by_port = [fares for _, fares in df_preprocessed.groupby('Embarked')['Fare']]
f, p = f_oneway(*fare_by_port)
print(f, p)

# perform a Mann-Whitney U test to compare the distribution of a numeric column between two groups
from scipy.stats import mannwhitneyu

survived = df_preprocessed['Survived'] == 1
u, p = mannwhitneyu(df_preprocessed.loc[survived, 'Fare'], df_preprocessed.loc[~survived, 'Fare'])
print(u, p)

# perform a Kruskal-Wallis H test to compare the distribution of a numeric column across two or more groups
from scipy.stats import kruskal

fare_by_port = [fares for _, fares in df_preprocessed.groupby('Embarked')['Fare']]
h, p = kruskal(*fare_by_port)
print(h, p)

# perform a Wilcoxon signed-rank test to compare two paired numeric columns
# (SibSp and Parch are both per-passenger counts of relatives, so they are paired and share a unit)
from scipy.stats import wilcoxon

w, p = wilcoxon(df_preprocessed['SibSp'], df_preprocessed['Parch'])
print(w, p)

Update: Download FULL VERSION PDF!

Now you can get the full version of the EDA cheatsheet (73 pages) for only $2!

Buy on Gumroad: https://datasansid.gumroad.com

Table of Contents:
Chapter 1: Introduction to Exploratory Data Analysis with Python
Chapter 2: Data Cleaning and Preprocessing Techniques
Chapter 3: Essential Python Libraries for EDA: NumPy, pandas, and matplotlib
Chapter 4: Descriptive Statistics and Data Visualization with Python
Chapter 5: Advanced Data Visualization Techniques with seaborn and Plotly
Chapter 6: Handling Missing Data and Outliers in Python
Chapter 7: Dimensionality Reduction Techniques: PCA and t-SNE
Chapter 8: Time Series Analysis and Forecasting with Python
Chapter 9: Text Data Exploration and Natural Language Processing
Chapter 10: Case Studies: Real-World Applications of EDA with Python

You will get a straightforward cheatsheet about EDA. The book offers hands-on code examples with essential Python libraries, enabling readers to effectively apply statistical analysis and visualization tools. By covering topics such as dimensionality reduction, time series forecasting, and text data exploration, this guide equips readers with practical knowledge to uncover hidden insights and make data-driven decisions across various industries.
