Project: Marketing Campaign Analysis¶


In this project, I analyze marketing campaign data and address several important business problems/questions.


Context¶


Marketing Analytics broadly refers to the practice of using analytical methods and techniques to understand the effectiveness of marketing activities and to make data-driven decisions that optimize ROI and conversion rates. It typically involves analyzing metrics around customer engagement with various marketing activities, including but not limited to ATL (above-the-line) activities, BTL (below-the-line) campaigns, and targeted personalized offers. The variables of interest are typically customer profiles, campaign conversion rates, and the costs associated with the various marketing channels. These can generate valuable insights that help an organization form better marketing strategies, optimize or innovate on delivery, and achieve overall growth.


Problem Statement¶


The Chief Marketing Officer has shared that the recent marketing campaigns have not been as effective as expected and that the conversion rate is very low. My task is to analyze the related data, understand the problem, and identify key insights and recommendations for the CMO to potentially implement.

The dataset marketing_data.csv consists of 2,240 customers of the company Atqlik, with data on:

  • Campaign successes/failures
  • Product preferences
  • Channel performance
  • Customer profiles based on spending habits

Data Dictionary¶


  • ID : Unique ID of each customer
  • Year_Birth : Birth year of the customer
  • Education : Customer's level of education
  • Marital_Status : Customer's marital status
  • Kidhome : Number of small children in customer's household
  • Teenhome : Number of teenagers in customer's household
  • Income : Customer's yearly household income
  • Recency : Number of days since the last purchase
  • MntFishProducts : The amount spent on fish products in the last 2 years
  • MntMeatProducts : The amount spent on meat products in the last 2 years
  • MntFruits : The amount spent on fruit products in the last 2 years
  • MntSweetProducts : Amount spent on sweet products in the last 2 years
  • MntWines : The amount spent on wine products in the last 2 years
  • MntGoldProds : The amount spent on gold products in the last 2 years
  • NumDealsPurchases : Number of purchases made with discount
  • NumCatalogPurchases : Number of purchases made using catalog (buying goods to be shipped through the mail)
  • NumStorePurchases : Number of purchases made directly in stores
  • NumWebPurchases : Number of purchases made through the company's website
  • NumWebVisitsMonth : Number of visits to company's website in the last month
  • AcceptedCmp1 : 1 if customer accepted the offer in the first campaign, 0 otherwise
  • AcceptedCmp2 : 1 if customer accepted the offer in the second campaign, 0 otherwise
  • AcceptedCmp3 : 1 if customer accepted the offer in the third campaign, 0 otherwise
  • AcceptedCmp4 : 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  • AcceptedCmp5 : 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  • AcceptedCmp6 : 1 if customer accepted the offer in the last campaign, 0 otherwise
  • Complain : 1 If the customer complained in the last 2 years, 0 otherwise
  • Country: Country customer belongs to

Importing libraries and overview of the dataset¶

In [2]:
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings('ignore')

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [3]:
# from google.colab import files
# uploaded = files.upload()

Load the dataset¶

In [6]:
# loading the dataset

df = pd.read_csv("C:/Users/ND/Downloads/Marketing+data (1).csv")
df.head()
Out[6]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits ... NumStorePurchases NumWebVisitsMonth AcceptedCmp1 AcceptedCmp2 AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp6 Complain Country
0 1826 1970 Graduation Divorced 84835.0 0 0 0 189 104 ... 6 1 0 1 0 0 0 0 0 SP
1 1 1961 Graduation Single 57091.0 0 0 0 464 5 ... 7 5 0 1 0 0 0 1 0 CA
2 10476 1958 Graduation Married 67267.0 0 1 0 134 11 ... 5 2 0 0 0 0 0 0 0 US
3 1386 1967 Graduation Together 32474.0 1 1 0 10 0 ... 2 7 0 0 0 0 0 0 0 AUS
4 5371 1989 Graduation Single 21474.0 1 0 0 6 16 ... 2 7 1 1 0 0 0 0 0 SP

5 rows × 27 columns

Check info of the dataset¶

In [7]:
#Checking the info

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Recency              2240 non-null   int64  
 8   MntWines             2240 non-null   int64  
 9   MntFruits            2240 non-null   int64  
 10  MntMeatProducts      2240 non-null   int64  
 11  MntFishProducts      2240 non-null   int64  
 12  MntSweetProducts     2240 non-null   int64  
 13  MntGoldProds         2240 non-null   int64  
 14  NumDealsPurchases    2240 non-null   int64  
 15  NumWebPurchases      2240 non-null   int64  
 16  NumCatalogPurchases  2240 non-null   int64  
 17  NumStorePurchases    2240 non-null   int64  
 18  NumWebVisitsMonth    2240 non-null   int64  
 19  AcceptedCmp1         2240 non-null   int64  
 20  AcceptedCmp2         2240 non-null   int64  
 21  AcceptedCmp3         2240 non-null   int64  
 22  AcceptedCmp4         2240 non-null   int64  
 23  AcceptedCmp5         2240 non-null   int64  
 24  AcceptedCmp6         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Country              2240 non-null   object 
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB

Observations:

  • There are a total of 27 columns and 2,240 observations in the dataset
  • We can see that the Income column has fewer than 2,240 non-null values, i.e., it contains missing values. We'll explore this further

Check the percentage of missing values for the Income column.¶

In [8]:
# % Null values in the Income column

(df.isnull().sum()/df.shape[0]*100)['Income']
Out[8]:
1.0714285714285714

Observations:

  • Income has ~1.07% missing values.

Create a list for numerical columns in the dataset and check the summary statistics¶

Summary statistics for numerical columns¶

In [9]:
# num_cols contains the numerical variables
num_cols=['Year_Birth','Income','Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth','Kidhome',
       'Teenhome']
In [10]:
# printing descriptive statistics of numerical columns

df[num_cols].describe().T
Out[10]:
count mean std min 25% 50% 75% max
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00 51381.5 68522.00 666666.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0

Observations:¶

1️⃣ Potential Outliers in Product Purchases: A significant gap between the 3rd quartile and the maximum values for all products purchased in the last two years suggests the presence of right-skewed outliers in these variables.

2️⃣ Customer Household Structure: Each customer has at most two kids (Kidhome) and at most two teenagers (Teenhome) in the household.

3️⃣ Spending Behavior on Wine: Customers spent more on wine than on any other product category, making it the top product by average spend.

4️⃣ Income Distribution Anomaly: A high-income outlier is observed on the right, suggesting the presence of a few customers with exceptionally high earnings compared to the rest of the dataset.
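As a quick numeric check of the skewness noted above, a minimal sketch (reusing df and num_cols from the cells above; not executed as part of the original analysis) could be:

In [ ]:
# Skewness of the numerical columns - large positive values confirm the right skew noted above
# (sketch; assumes df and num_cols are defined as in the cells above)
df[num_cols].skew().sort_values(ascending=False)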

Create a list for categorical columns in the dataset and check the count of each category¶

In [11]:
#cat_cols contain categorical variables
cat_cols=['Education', 'Marital_Status', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'AcceptedCmp6', 'Complain', 'Country']
In [12]:
# Printing the proportion of each unique value in each categorical column

for column in cat_cols:
    print(df[column].value_counts(normalize=True))
    print("-" * 40)
Graduation    0.503125
PhD           0.216964
Master        0.165179
2n Cycle      0.090625
Basic         0.024107
Name: Education, dtype: float64
----------------------------------------
Married     0.385714
Together    0.258929
Single      0.214286
Divorced    0.103571
Widow       0.034375
Alone       0.001339
YOLO        0.000893
Absurd      0.000893
Name: Marital_Status, dtype: float64
----------------------------------------
0    0.927232
1    0.072768
Name: AcceptedCmp3, dtype: float64
----------------------------------------
0    0.935714
1    0.064286
Name: AcceptedCmp4, dtype: float64
----------------------------------------
0    0.925446
1    0.074554
Name: AcceptedCmp5, dtype: float64
----------------------------------------
0    0.927232
1    0.072768
Name: AcceptedCmp1, dtype: float64
----------------------------------------
0    0.850893
1    0.149107
Name: AcceptedCmp2, dtype: float64
----------------------------------------
0    0.986607
1    0.013393
Name: AcceptedCmp6, dtype: float64
----------------------------------------
0    0.990625
1    0.009375
Name: Complain, dtype: float64
----------------------------------------
SP     0.488839
SA     0.150446
CA     0.119643
AUS    0.071429
IND    0.066071
GER    0.053571
US     0.048661
ME     0.001339
Name: Country, dtype: float64
----------------------------------------

Observations:

1️⃣ Education Categories Simplification: The categories "2n Cycle" and "Master" represent the same education level and should be combined into a single category.

2️⃣ Marital Status Consolidation: The "Alone" category can be merged with "Single" to streamline the data.

3️⃣ Unclear Categories in Marital Status: The meanings of "Absurd" and "YOLO" are not clear from the data. These can be grouped into a new category labeled "Others" for better analysis.

4️⃣ Customer Complaints: Only 21 customers have filed complaints in the last two years, indicating a low complaint rate.

5️⃣ Customer Distribution by Country: Most customers (~49%) are from Spain, while Mexico has the lowest representation.

6️⃣ Most Common Educational Level: The most frequent education level among customers is Graduation.

7️⃣ Most Common Marital Status: Married is the most common marital status among customers.
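To make the counts behind these notes explicit, a small verification sketch (assuming df as loaded above) could be:

In [ ]:
# Absolute complaint count and country shares behind the observations above
# (sketch; assumes df from the cells above)
print(df['Complain'].sum(), "customers complained in the last 2 years")
print((df['Country'].value_counts(normalize=True) * 100).round(2))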

Data Preprocessing and Exploratory Data Analysis¶

  • Fixing the categories
  • Creating new columns for the total amount spent, total purchases made, total children at home, and total campaigns accepted
  • Dealing with missing values and outliers
  • Extract key insights from the data

Replacing the "2n Cycle" category with "Master" in Education and "Alone" with "Single" in Marital_Status and "Absurd" and "YOLO" categories with "Others" in Marital_Status¶

In [13]:
# Replacing 2n Cycle with Master

df["Education"].replace("2n Cycle", "Master", inplace=True)
In [14]:
# Replacing Alone with Single

df["Marital_Status"].replace(["Alone",], "Single", inplace=True)
In [15]:
# Replacing YOLO, Absurd with Others

df['Marital_Status'].replace(["Absurd", "YOLO"], "Others", inplace=True)

We have fixed the categories in Marital_Status. Check the distribution of counts across the marital status categories.

In [16]:
df.Marital_Status.value_counts()
Out[16]:
Married     864
Together    580
Single      483
Divorced    232
Widow        77
Others        4
Name: Marital_Status, dtype: int64

Observation:

Married is the most common category, while the Others category has only four observations.

Creating new features from the existing features¶

In [17]:
# creating new features to get an overall picture of each customer: how much they have spent,
# how many children they have, total campaigns accepted, etc.


# total spending by a customer
spending_col = [col for col in df.columns if 'Mnt' in col]
df['Total_Spending'] = df[spending_col].sum(axis = 1) 

#total purchases made by a customer
platform_col = [col for col in df.columns if 'Purchases' in col]
df['Total_Purchase'] = df[platform_col].sum(axis = 1) 

#total no. of children
df['NumberofChildren'] = df['Kidhome'] + df['Teenhome']  

# Total no. of campaign accepted by a customer
campaigns_cols = [col for col in df.columns if 'Cmp' in col]
df['TotalCampaignsAcc'] = df[campaigns_cols].sum(axis=1)

Check outliers for the new variables Total_Spending and Total_Purchase, and also analyze the Year_Birth column, which we observed above has a minimum value of 1893.¶

In [18]:
# Plotting boxplot for Year_Birth, Total_Spending, Total_Purchase

cols=['Year_Birth','Total_Spending','Total_Purchase']
for i in cols:
    sns.boxplot(x=df[i])
    plt.show()
[Image: boxplots of Year_Birth, Total_Spending, and Total_Purchase]

Observations:

1️⃣ Potential Data Entry Errors in Birth Year: Some customers have a reported birth year before 1900, while the current year is 2021. It is highly unlikely that these individuals are still alive, suggesting a data entry or reporting error.

2️⃣ Presence of Outliers in Total Spending and Total Purchases: The dataset contains outliers in both total spending and total purchases. However, these observations are close to the upper whisker of the distribution, and extreme values are expected for variables like total spending.

3️⃣ Decision on Outlier Treatment: Given that some extreme values are naturally expected, these outliers do not require immediate removal or treatment. Keeping them in the dataset ensures that we retain high-value customers or significant purchasing patterns.
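To quantify how many points sit beyond the upper whisker for the new totals, a short IQR-based check (a sketch; reuses df and the columns created above) could look like this:

In [ ]:
# Counting observations above the upper whisker (Q3 + 1.5*IQR) for the new totals
# (sketch; assumes df, Total_Spending, and Total_Purchase from the cells above)
for col in ['Total_Spending', 'Total_Purchase']:
    q1, q3 = df[col].quantile([0.25, 0.75])
    upper = q3 + 1.5 * (q3 - q1)
    print(col, ':', int((df[col] > upper).sum()), 'observations above the upper whisker')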

Check the number of observations for which the birth year is less than 1900.

In [19]:
df[df['Year_Birth'] < 1900]
Out[19]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits ... AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp6 Complain Country Total_Spending Total_Purchase NumberofChildren TotalCampaignsAcc
513 11004 1893 Master Single 60182.0 0 1 23 8 0 ... 0 0 0 0 0 SA 22 4 1 0
827 1150 1899 PhD Together 83532.0 0 0 36 755 144 ... 1 0 0 0 0 SP 1853 15 0 1

2 rows × 31 columns

Observation:

  • There are only 2 observations for which birth year is less than 1900. We can drop these observations.
In [20]:
#keeping data for customers having birth year >1900

df = df[df['Year_Birth'] > 1900]

Check the outliers and impute the missing values for the Income variable¶

In [21]:
#plotting Boxplot for income

plt.figure(figsize=(10,4))
sns.boxplot(df['Income'])
plt.title('Income boxplot', size=16)
plt.show()
[Image: boxplot of Income]

Observations:

  • The boxplot indicates the presence of outliers in the Income variable.
  • To quantify them, we calculate the upper whisker value and determine how many observations exceed this threshold.

In [22]:
#Calculating the upper whisker for the Income variable

Q1 = df.quantile(q=0.25) #First quartile
Q3 = df.quantile(q=0.75) #Third quartile
IQR = Q3 - Q1            #Inter Quartile Range

upper_whisker = (Q3 + 1.5*IQR)['Income']   #Upper Whisker
print(upper_whisker)
118348.5
In [23]:
#Checking the observations marked as outliers
df[df.Income>upper_whisker]
Out[23]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits ... AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp6 Complain Country Total_Spending Total_Purchase NumberofChildren TotalCampaignsAcc
325 4931 1977 Graduation Together 157146.0 0 0 13 1 0 ... 0 0 0 0 0 SA 1730 28 0 0
497 1501 1982 PhD Married 160803.0 0 0 21 55 16 ... 0 0 0 0 0 US 1717 44 0 0
527 9432 1977 Graduation Together 666666.0 1 0 23 9 14 ... 0 0 0 0 0 SA 62 11 1 0
731 1503 1976 PhD Together 162397.0 1 1 31 85 1 ... 0 0 0 0 0 SP 107 1 2 0
853 5336 1971 Master Together 157733.0 1 0 37 39 1 ... 0 0 0 0 0 SP 59 2 1 0
1826 5555 1975 Graduation Divorced 153924.0 0 0 81 1 1 ... 0 0 0 0 0 SP 6 0 0 0
1925 11181 1949 PhD Married 156924.0 0 0 85 2 1 ... 0 0 0 0 0 CA 8 0 0 0
2204 8475 1973 PhD Married 157243.0 0 1 98 20 2 ... 0 0 0 0 0 IND 1608 37 1 0

8 rows × 31 columns

Observations:

  • Only 8 customers have an income above the upper whisker (outliers).
  • Of these, only 3 (ID: 4931, 1501, 8475) made more than 11 purchases in the last 2 years.
  • The remaining 5 outliers have minimal total spending, indicating that higher income does not always correlate with higher purchases.

Compare the summary statistics for these observations with those for observations below the upper whisker.

In [24]:
#Checking the summary statistics for observations marked as outliers
df[df.Income>upper_whisker].describe().T
Out[24]:
count mean std min 25% 50% 75% max
ID 8.0 5989.250 3525.251308 1501.0 4074.00 5445.5 8714.25 11181.0
Year_Birth 8.0 1972.500 10.028531 1949.0 1972.50 1975.5 1977.00 1982.0
Income 8.0 221604.500 179850.404431 153924.0 157090.50 157488.0 161201.50 666666.0
Kidhome 8.0 0.375 0.517549 0.0 0.00 0.0 1.00 1.0
Teenhome 8.0 0.250 0.462910 0.0 0.00 0.0 0.25 1.0
Recency 8.0 48.625 33.687376 13.0 22.50 34.0 82.00 98.0
MntWines 8.0 26.500 30.798887 1.0 1.75 14.5 43.00 85.0
MntFruits 8.0 4.500 6.524678 0.0 1.00 1.0 5.00 16.0
MntMeatProducts 8.0 621.875 846.511402 1.0 7.25 17.0 1592.00 1725.0
MntFishProducts 8.0 4.250 5.650537 1.0 1.00 2.0 3.50 17.0
MntSweetProducts 8.0 1.250 0.886405 0.0 1.00 1.0 1.25 3.0
MntGoldProds 8.0 3.750 4.131759 1.0 1.00 1.5 5.00 12.0
NumDealsPurchases 8.0 4.250 6.777062 0.0 0.00 0.0 6.75 15.0
NumWebPurchases 8.0 0.500 1.069045 0.0 0.00 0.0 0.25 3.0
NumCatalogPurchases 8.0 9.875 13.484780 0.0 0.00 0.5 23.50 28.0
NumStorePurchases 8.0 0.750 1.035098 0.0 0.00 0.5 1.00 3.0
NumWebVisitsMonth 8.0 1.125 2.031010 0.0 0.00 0.5 1.00 6.0
AcceptedCmp1 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
AcceptedCmp2 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
AcceptedCmp3 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
AcceptedCmp4 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
AcceptedCmp5 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
AcceptedCmp6 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
Complain 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
Total_Spending 8.0 662.125 848.380884 6.0 46.25 84.5 1635.25 1730.0
Total_Purchase 8.0 15.375 18.220377 0.0 0.75 6.5 30.25 44.0
NumberofChildren 8.0 0.625 0.744024 0.0 0.00 0.5 1.00 2.0
TotalCampaignsAcc 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
In [25]:
#Checking the summary statistics for observations not marked as outliers
df[df.Income<upper_whisker].describe().T
Out[25]:
count mean std min 25% 50% 75% max
ID 2205.0 5585.439456 3247.546423 0.0 2815.0 5455.0 8418.0 11191.0
Year_Birth 2205.0 1968.904308 11.705801 1940.0 1959.0 1970.0 1977.0 1996.0
Income 2205.0 51622.094785 20713.063826 1730.0 35196.0 51287.0 68281.0 113734.0
Kidhome 2205.0 0.442177 0.537132 0.0 0.0 0.0 1.0 2.0
Teenhome 2205.0 0.506576 0.544380 0.0 0.0 0.0 1.0 2.0
Recency 2205.0 49.009070 28.932111 0.0 24.0 49.0 74.0 99.0
MntWines 2205.0 306.164626 337.493839 0.0 24.0 178.0 507.0 1493.0
MntFruits 2205.0 26.403175 39.784484 0.0 2.0 8.0 33.0 199.0
MntMeatProducts 2205.0 165.312018 217.784507 0.0 16.0 68.0 232.0 1725.0
MntFishProducts 2205.0 37.756463 54.824635 0.0 3.0 12.0 50.0 259.0
MntSweetProducts 2205.0 27.128345 41.130468 0.0 1.0 8.0 34.0 262.0
MntGoldProds 2205.0 44.057143 51.736211 0.0 9.0 25.0 56.0 321.0
NumDealsPurchases 2205.0 2.318367 1.886107 0.0 1.0 2.0 3.0 15.0
NumWebPurchases 2205.0 4.100680 2.737424 0.0 2.0 4.0 6.0 27.0
NumCatalogPurchases 2205.0 2.645351 2.798647 0.0 0.0 2.0 4.0 28.0
NumStorePurchases 2205.0 5.823583 3.241796 0.0 3.0 5.0 8.0 13.0
NumWebVisitsMonth 2205.0 5.336961 2.413535 0.0 3.0 6.0 7.0 20.0
AcceptedCmp1 2205.0 0.073923 0.261705 0.0 0.0 0.0 0.0 1.0
AcceptedCmp2 2205.0 0.151020 0.358150 0.0 0.0 0.0 0.0 1.0
AcceptedCmp3 2205.0 0.073016 0.260222 0.0 0.0 0.0 0.0 1.0
AcceptedCmp4 2205.0 0.064399 0.245518 0.0 0.0 0.0 0.0 1.0
AcceptedCmp5 2205.0 0.074376 0.262442 0.0 0.0 0.0 0.0 1.0
AcceptedCmp6 2205.0 0.013605 0.115872 0.0 0.0 0.0 0.0 1.0
Complain 2205.0 0.009070 0.094827 0.0 0.0 0.0 0.0 1.0
Total_Spending 2205.0 606.821769 601.675284 5.0 69.0 397.0 1047.0 2525.0
Total_Purchase 2205.0 14.887982 7.615277 0.0 8.0 15.0 21.0 43.0
NumberofChildren 2205.0 0.948753 0.749231 0.0 0.0 1.0 1.0 3.0
TotalCampaignsAcc 2205.0 0.450340 0.894075 0.0 0.0 0.0 1.0 5.0

Observations:

1️⃣ No Engagement with Campaigns or Complaints: Outliers did not accept any campaigns or submit complaints in the last two years.

2️⃣ Lower Mean Expenditure Across Products: Outliers spend less on all products except meat products.

3️⃣ Distinct Purchase Behavior: They prefer catalog purchases but make very few web purchases.

4️⃣ Recommendation: Remove Outliers: The five observations at indices [527, 731, 853, 1826, 1925] provide no value to the analysis and should be dropped.

In [26]:
#Dropping 5 observations at indices 527, 731, 853, 1826, 1925
df.drop(index=[527, 731, 853, 1826, 1925], inplace=True)

Check the distribution for Income¶

In [27]:
#plotting displot for income

sns.displot(df['Income'], kde=True, height=5, aspect=2)
plt.title('Income distribution', size=16, )
plt.ylabel('count');
[Image: distribution plot of Income]

Observations:

  • After treating outliers, the income variable follows a nearly normal distribution, with only a few extreme values on the right. Given its slight right skew, missing values should be imputed using the median rather than the mean to prevent distortion from outliers.
In [28]:
#filling null values with median

df['Income'].fillna(df.Income.median(), inplace=True)

Analyzing all the campaigns¶

Find out the acceptance rate for each campaign¶

In [29]:
# Plotting the % acceptance for every campaign

Camp_cols=['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp6']

success_campaign=(df[Camp_cols].sum()/df.shape[0])*100

# plot
success_campaign.plot(kind='bar', figsize=(6,6))
plt.ylabel("Perentage")
plt.show()
[Image: bar chart of acceptance percentage per campaign]

Observations:¶

  • Campaign 2 had the highest acceptance rate, while Campaign 6 had the lowest, indicating a significant difference in strategy effectiveness.
  • Campaigns 1, 3, 4, and 5 had similar acceptance rates, suggesting a consistent but moderate impact.
  • Further analysis is needed to understand why Campaign 2 was successful and why Campaign 6 underperformed, so strategies can be optimized across all campaigns.
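To rank the campaigns precisely rather than reading values off the bar chart, a small sketch (reusing success_campaign from the cell above) could be:

In [ ]:
# Exact acceptance rates per campaign, sorted (sketch; reuses success_campaign from above)
success_campaign.sort_values(ascending=False).round(2)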

Analyze what kind of customers are accepting campaigns¶

In [30]:
plt.figure(figsize=(8,8))
sns.swarmplot(x='TotalCampaignsAcc', y='Income', data=df)
plt.show()
[Image: swarm plot of Income vs. TotalCampaignsAcc]

Observations:

  • The higher the income, the higher the number of campaigns accepted (a quick numeric check follows below).
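A minimal sketch of this relationship (assumes df from the cells above):

In [ ]:
# Median income per number of campaigns accepted (sketch; assumes df from above)
df.groupby('TotalCampaignsAcc')['Income'].median()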
In [31]:
# Let's see the mean income of customers
df.Income.mean()
Out[31]:
51762.59811827957
  • The average customer income is approximately 52K. To analyze campaign effectiveness, we can segment customers into two income groups: those earning above 52K and those earning below 52K. This will help us compare campaign acceptance rates across different income levels and identify any trends or anomalies in customer behavior.
In [32]:
# making dataframes of customers having income <52k and >52K
df1=df[df.Income<52000]
df2=df[df.Income>52000]

Camp_cols=['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp6']

#Calculating success rate of each campaign for both segments 
success_campaign1=pd.DataFrame((df1[Camp_cols].sum()/df1.shape[0])*100, columns=['Income <52K'])

success_campaign2=pd.DataFrame((df2[Camp_cols].sum()/df2.shape[0])*100, columns=['Income >52K'])

new_df=pd.concat([success_campaign1, success_campaign2], axis=1)

# plot
plt.figure(figsize=(8,8))
sns.lineplot(data=new_df)
plt.title("Percentage Acceptance of each campaign")
plt.ylabel("Percentage Acceptance of a campaign")
plt.show()
[Image: line plot of acceptance percentage per campaign for the two income segments]

Observations:

  • Campaign 1: Customers earning below 52K had a higher acceptance rate than those earning above 52K, an anomaly that requires further investigation.
  • Campaigns 2 to 6: Customers with higher income (above 52K) accepted more offers, while those earning below 52K had lower acceptance rates.

Find out who has accepted the last campaign and what could be the reason

In [33]:
df[df['AcceptedCmp6']==1].shape
Out[33]:
(30, 31)
  • Only 30 customers accepted the last campaign.
  • Next Step: Analyze whether these customers are new or have previously accepted other campaigns.
In [34]:
grouped2=df.groupby('AcceptedCmp6').mean()['TotalCampaignsAcc']
grouped2
Out[34]:
AcceptedCmp6
0    0.404632
1    3.633333
Name: TotalCampaignsAcc, dtype: float64

Observations:

  • In this data, no customer accepted more than 5 of the 6 campaigns, and those who accepted the last campaign have an average TotalCampaignsAcc of ~3.6. This suggests that these 30 customers are highly loyal, consistently engaging with most campaigns (see the cross-tabulation sketch below).
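To see how these 30 customers overlap with the earlier campaigns, a small cross-tabulation (a sketch; assumes df from above) could be used:

In [ ]:
# Cross-tab of last-campaign acceptance vs. total campaigns accepted
# (sketch; note TotalCampaignsAcc includes AcceptedCmp6 itself)
pd.crosstab(df['AcceptedCmp6'], df['TotalCampaignsAcc'])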

It could be that different campaigns focus on different sets of products. Check whether the product preferences of those who accepted the campaigns differ from those who didn't, using amount spent and number of purchases¶

Define a function that takes a product column name as input and plots, for every campaign, the average amount spent on that product by customers who accepted versus rejected the campaign.

In [35]:
def amount_per_campaign(columns_name):
    p1=pd.DataFrame(df.groupby(['AcceptedCmp1']).mean()[columns_name]).T
    p2=pd.DataFrame(df.groupby(['AcceptedCmp2']).mean()[columns_name]).T
    p3=pd.DataFrame(df.groupby(['AcceptedCmp3']).mean()[columns_name]).T
    p4=pd.DataFrame(df.groupby(['AcceptedCmp4']).mean()[columns_name]).T
    p5=pd.DataFrame(df.groupby(['AcceptedCmp5']).mean()[columns_name]).T
    p6=pd.DataFrame(df.groupby(['AcceptedCmp6']).mean()[columns_name]).T
    pd.concat([p1,p2,p3,p4,p5,p6],axis=0).set_index([Camp_cols]).plot(kind='line', figsize=(8,8))
    plt.ylabel('Average amount spent on' + ' ' + columns_name)
    plt.show()

Use the function defined above to generate plots for the different products¶

In [36]:
#here is an example showing how to use this function on the column MntWines
amount_per_campaign('MntWines')
[Image: average amount spent on MntWines per campaign]

Observations:

  • Customers who accepted Campaigns 3, 4, 5, and 6 exhibited a high average spending on wine, indicating a strong preference for this product. This trend suggests that wine promotions may be a key driver of campaign success and should be leveraged in future marketing strategies.
In [37]:
#meat products

#call the function amount_per_campaign for MntMeatProducts
amount_per_campaign('MntMeatProducts')
[Image: average amount spent on MntMeatProducts per campaign]
In [38]:
# Fruit products

#call the function amount_per_campaign for MntFruits
amount_per_campaign('MntFruits')
[Image: average amount spent on MntFruits per campaign]
In [39]:
# gold products

#call the function amount_per_campaign for MntGoldProds
amount_per_campaign('MntGoldProds')
[Image: average amount spent on MntGoldProds per campaign]
In [40]:
#sweet products

#call the function amount_per_campaign for MntSweetProducts
amount_per_campaign('MntSweetProducts')
[Image: average amount spent on MntSweetProducts per campaign]

Observations¶

1️⃣ Meat Products – Customers who accepted Campaigns 2, 3, and 4 spent significantly more on meat compared to those in Campaigns 5 and 6.

2️⃣ Fruit Products – Campaign 6 had a higher average spending among non-acceptors than acceptors, while Campaigns 2, 3, and 4 drove higher fruit purchases among acceptors.

3️⃣ Gold Products – The difference in spending between acceptors and non-acceptors was high for Campaigns 1, 2, 3, 4, and 6, but low for Campaign 5.

4️⃣ Sweet Products – Campaign 1 showed minimal difference between accepted and rejected purchases. However, spending increased for acceptors in Campaigns 2, 3, and 4, while Campaign 5 saw a sharp decline in accepted purchases.

5️⃣ Overall Trend – Customers who accepted campaigns consistently spent more across all product categories compared to those who rejected them.
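The overall trend can be summarized numerically with a short sketch (assuming df and the derived columns from the cells above):

In [ ]:
# Average total spending for customers who accepted at least one campaign vs. none
# (sketch; assumes df, TotalCampaignsAcc, and Total_Spending from the cells above)
df.groupby(df['TotalCampaignsAcc'] > 0)['Total_Spending'].mean()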

Check the relationship of campaigns with different purchasing channels.¶

We have defined a function that takes a channel column name as input and plots, for every campaign, the average number of purchases made through that channel by customers who accepted versus rejected the campaign.

In [41]:
def Purchases_per_campaign(columns_name):
    dp1=pd.DataFrame(df.groupby(['AcceptedCmp1']).mean()[columns_name]).T
    dp2=pd.DataFrame(df.groupby(['AcceptedCmp2']).mean()[columns_name]).T
    dp3=pd.DataFrame(df.groupby(['AcceptedCmp3']).mean()[columns_name]).T
    dp4=pd.DataFrame(df.groupby(['AcceptedCmp4']).mean()[columns_name]).T
    dp5=pd.DataFrame(df.groupby(['AcceptedCmp5']).mean()[columns_name]).T
    dp6=pd.DataFrame(df.groupby(['AcceptedCmp6']).mean()[columns_name]).T
    pd.concat([dp1,dp2,dp3,dp4,dp5,dp6],axis=0).set_index([Camp_cols]).plot(kind='line', figsize=(8,8))
    plt.ylabel('Average' + ' ' + columns_name)
    plt.show()
In [42]:
#here is an example showing how to use this function on the column NumDealsPurchases
Purchases_per_campaign('NumDealsPurchases')
[Image: average NumDealsPurchases per campaign]

Observations:

  • For customers accepting Campaigns 3, 4, and 6, the average number of deal purchases is quite low.
In [64]:
# store purchase
Purchases_per_campaign('NumStorePurchases')
[Image: average NumStorePurchases per campaign]
In [65]:
#Catalog purchase
Purchases_per_campaign('NumCatalogPurchases')
[Image: average NumCatalogPurchases per campaign]
In [67]:
#Web purchases
Purchases_per_campaign('NumWebPurchases')
[Image: average NumWebPurchases per campaign]

Observations:¶

1️⃣ Store Purchases: Customers who accepted Campaign 1 had a lower average number of store purchases compared to those who rejected it. This trend differs from Campaigns 2, 3, 4, 5, and 6, where accepted customers had higher store purchases.

2️⃣ Catalog Purchases: While rejected campaigns had lower catalog purchases overall, accepted Campaigns 3 and 4 had a higher average number of catalog purchases than Campaigns 1, 2, 5, and 6.

3️⃣ Web Purchases: There was a steady increase in web purchases from Campaign 1 to 4, followed by a gradual decline in Campaigns 5 and 6.

4️⃣ Spending on Products:

  • Gold Products: The difference in spending between accepted and rejected campaigns was high for Campaigns 1, 2, 3, 4, and 6, but low for Campaign 5.
  • Sweet Products: Campaign 1 had nearly identical spending for accepted and rejected offers. Spending increased in Campaigns 2, 3, and 4 but sharply declined in Campaign 5 despite acceptance.
  • Overall Spending: Customers who rejected campaigns consistently spent less across all product categories compared to those who accepted them.
In [68]:
#Recency

Purchases_per_campaign('Recency')
[Image: average Recency per campaign]

Observations:

  • Customers who accepted Campaign 2 had a low recency, meaning they had recently made a purchase before engaging with the campaign. This suggests that Campaign 2 was more effective for active, recently engaged customers, highlighting the importance of timing and customer activity in campaign success.
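A quick check of the recency gap behind this observation (a sketch; assumes df from the cells above):

In [ ]:
# Mean recency for acceptors vs. non-acceptors of campaign 2 (sketch; assumes df from above)
df.groupby('AcceptedCmp2')['Recency'].mean()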

Check the relationship of campaigns with different categorical variables¶

Check the percentage acceptance of each campaign with respect to each category in the categorical variable. The percentage acceptance is calculated as the ratio of the number of customers who accepted the campaign to the total number of customers in that category.

In [69]:
def Cat_Campaign_Relation(df, column_name):
    e1=(df.groupby([column_name]).sum()['AcceptedCmp1']/df.groupby([column_name]).count()['AcceptedCmp1'])
    e2=(df.groupby([column_name]).sum()['AcceptedCmp2']/df.groupby([column_name]).count()['AcceptedCmp2'])
    e3=(df.groupby([column_name]).sum()['AcceptedCmp3']/df.groupby([column_name]).count()['AcceptedCmp3'])
    e4=(df.groupby([column_name]).sum()['AcceptedCmp4']/df.groupby([column_name]).count()['AcceptedCmp4'])
    e5=(df.groupby([column_name]).sum()['AcceptedCmp5']/df.groupby([column_name]).count()['AcceptedCmp5'])
    e6=(df.groupby([column_name]).sum()['AcceptedCmp6']/df.groupby([column_name]).count()['AcceptedCmp6'])
    df_new=pd.concat([e1,e2,e3,e4,e5,e6],axis=1).T
    plt.figure(figsize=(8,8))
    sns.lineplot(data=df_new, markers=True, linewidth=2)
    plt.ylabel('Percentage Acceptance')
    plt.show()
In [70]:
#here is an example showing how to use this function on the column Education
Cat_Campaign_Relation(df, 'Education')
[Image: acceptance percentage per campaign by Education]

Observations:

  • Campaign 2 had strong engagement, with over 20% of Ph.D. holders accepting the offer.
  • Customers with basic education only accepted Campaigns 1 and 2, showing limited engagement with other campaigns.
  • All other education levels followed a consistent trend in campaign acceptance, except for those with basic education.
In [71]:
#NumberofChildren

#call the function Cat_Campaign_Relation for NumberofChildren
Cat_Campaign_Relation(df, 'NumberofChildren')
[Image: acceptance percentage per campaign by NumberofChildren]
In [72]:
#Let's filter out the observations in the 'Others' category as there are only 4 such observations
df_rest=df[df.Marital_Status!='Others']

#call the function Cat_Campaign_Relation for Marital_Status with dataframe df_rest
Cat_Campaign_Relation(df_rest, 'Marital_Status')
[Image: acceptance percentage per campaign by Marital_Status]
In [73]:
#Let's filter out the observations for the 'ME' country as there are only 3 such observations
df_not_mexico=df[df.Country!='ME']

#Plot
plt.figure(figsize=(8,8))
sns.heatmap((df_not_mexico.groupby('Country').sum()[Camp_cols]/df_not_mexico.groupby('Country').count()[Camp_cols])*100, annot=True, fmt='0.2f', cmap="YlGnBu")
Out[73]:
<AxesSubplot:ylabel='Country'>
[Image: heatmap of acceptance percentage per campaign by Country]

Observation:¶

📌 Number of Children: More than 25% of customers with no children accepted Campaign 2. For Campaign 1, acceptance rates are similar for customers with 0 to 2 children. Customers with 1 to 3 children show a consistent acceptance trend across campaigns.

📌 Marital Status: Over 20% of widowed, divorced, and single customers accepted Campaign 2. Campaign 6 had the lowest acceptance rate across all marital statuses. All campaigns follow a similar acceptance trend regardless of marital status.

📌 Country-Based Trends: Customers from all countries (except India) accepted Campaign 2 the most. Campaign 6 had the lowest acceptance percentage across all countries.

Check the product preferences by customers¶

In [74]:
#creating a list which contains name of all products

mnt_cols = [col for col in df.columns if 'Mnt' in col]

spending=df[mnt_cols].mean(axis=0)
spending.plot(kind='bar', figsize=(10,5))
plt.ylabel("Average spend by customers")
plt.show()
[Image: average spend per product category]

Observations:

📌 Spending Trends: Over the last two years, customers spent the most on wines, followed by meat products.

Check if the product preferences are similar for different types of customers. Then calculate the percentage amount spent by customers on a product for each category with respect to the total spending by customers belonging to that category.

In [75]:
def amount_per_category(df, column_name):
    df_new1=((df.groupby([column_name]).sum()[mnt_cols].T)/df.groupby([column_name]).sum()['Total_Spending'])
    plt.figure(figsize=(10,8))
    sns.heatmap(df_new1.T, annot=True, cmap="YlGnBu")
    plt.show()
In [76]:
# plot showing the percentage of total spending of different products by a group of customers having the same education level

amount_per_category(df, 'Education')
[Image: heatmap of spending share per product by Education]

Observations:¶

📌 Higher Education → Higher Spending on Wine & Meat: PhD holders allocate ~60% of their total spending to wine and ~25% to meat. Graduates and Master's degree holders spend ~45-50% on wine and ~27-29% on meat. Customers with a Master's or PhD dedicate ~80% of their spending to wine and meat combined.

📌 Basic Education → Higher Spending on Other Products: Customers with basic education prefer fruits, fish, sweets, and gold products over wine and meat.

In [77]:
#call the function amount_per_category for Marital_Status with dataframe df_rest
amount_per_category( df_rest, 'Marital_Status')
[Image: heatmap of spending share per product by Marital_Status]
In [78]:
#call the function amount_per_category for Country with dataframe df_not_mexico
amount_per_category( df_not_mexico, 'Country')
[Image: heatmap of spending share per product by Country]

Observations:¶

📌 Marital Status: Across all marital statuses, customers spent the most on wine (48% to 53% of their total spending), followed by meat products. Sweet products had the lowest spending across all categories in the last two years.

📌 Country: Spanish customers spent more on wine than any other product compared to other countries. Across all seven countries, wine had the highest spending, while sweet products consistently had the lowest spending.

Check different channel performances¶

Calculate the percentage of purchases for all the channels.

In [79]:
# list of cols for channels

channel_cols = [col for col in df.columns if 'Purchases' in col] 

#making dataframe of columns having purchase and taking sum of them.
channels = pd.DataFrame(df[channel_cols].sum()/df.Total_Purchase.sum(), columns=['NumberofPurchases'])

# plot
channels.plot(kind='bar', figsize=(6,6))
plt.ylabel("Percentage Purchases")
plt.show()
[Image: percentage of purchases per channel]

Observations:

  • Most purchases occur in stores, followed by web purchases.
  • Deal and catalog purchases remain low, requiring strategy improvements.

Check how number of purchases via different channels varies for different income bins.

In [58]:
#Binning the income column
df['income_cat']=pd.qcut(df.Income, q=[0, 0.25, 0.50, 0.75, 1], labels=['low', 'medium', 'high', 'very_high'])
In [59]:
group=df.groupby('income_cat').sum()[channel_cols]
(group/group.sum()).plot(kind='bar', figsize=(8,8))
plt.ylabel("Percentage Purchases")
plt.show()
[Image: percentage of purchases per channel by income bin]
Customer Income & Purchase Behavior Insights¶
  • High-income customers had the highest purchase percentage across all channels.
  • Low-income customers had the lowest purchase percentage across all channels.
  • High-income customers distributed their purchases more evenly across different channels, showing a balanced spending pattern.

Visualize the correlation between purchases from different channels and the income of the customer.¶

In [60]:
corr=df[['Income', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases' ]].corr()

sns.heatmap(corr, annot = True)
Out[60]:
<AxesSubplot:>
[Image: correlation heatmap of Income and purchase channels]

Observations:¶

  • The plot indicates a moderate positive correlation (59%) between Income and Number of Catalog Purchases, followed by a 53% correlation between Income and Number of Store Purchases. Additionally, there is a strong negative correlation between Income and Number of Deals Purchases, suggesting that higher-income customers prefer catalog and in-store purchases over discounted deals.

  • From our analysis, the three key factors influencing campaign acceptance are: 1️⃣ Income Level 2️⃣ Number of Children 3️⃣ Spending on Wine

  • To create a new customer profile, we can segment customers based on these attributes and predict their likelihood of campaign acceptance.

In [61]:
df3=df[df.Income>52000]
df4=df3[df3.MntWines>df3.MntWines.mean()]
new_profile=df4[df4.NumberofChildren==0]
In [62]:
#Calculating success rate of each campaign for both segments 
success_campaign3=pd.DataFrame(success_campaign, columns=['Overall Acceptance'])

success_campaign4=pd.DataFrame((new_profile[Camp_cols].sum()/new_profile.shape[0])*100, columns=['New Customer Profile Acceptance '])

# plot
pd.concat([success_campaign3, success_campaign4], axis=1).plot(kind='bar', figsize=(8,6))
plt.title("")
plt.ylabel("")
plt.show()
[Image: acceptance percentage per campaign, overall vs. new customer profile]

Observations:

  • The orange bars in the plot indicate that the acceptance rate would have been higher for the new customer profile, i.e., customers with income greater than the mean, no children at home, and wine spending greater than the mean (a sketch sizing this segment follows).
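To size this proposed segment and confirm its acceptance rates numerically, a small sketch (reusing new_profile and Camp_cols from the cells above) could be:

In [ ]:
# Size of the proposed high-income, no-children, wine-preferring segment and its acceptance rates
# (sketch; reuses new_profile and Camp_cols from the cells above)
print(new_profile.shape[0], "customers match the new profile")
print(((new_profile[Camp_cols].sum() / new_profile.shape[0]) * 100).round(1))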

Conclusion¶

1️⃣ Spending Trends Across Countries: Customers from different countries consistently spend more on wine than any other product.

2️⃣ Campaign Performance: Campaign 2 had the highest acceptance rate, indicating a successful strategy. Campaign 6 had the lowest acceptance rate, requiring further investigation. The effectiveness of Campaign 2's strategy should be analyzed to see if it can be applied to improve acceptance rates for Campaigns 1, 3, 4, 5, and 6. A detailed review of Campaign 6 is necessary to understand why customers did not accept many offers.

3️⃣ Income-Based Anomalies in Campaign Acceptance: An anomaly was observed in Campaign 1, where customers earning below 52K had a higher acceptance rate than those earning above 52K. This deviation should be investigated to understand why lower-income earners responded better to Campaign 1 compared to other campaigns.

4️⃣ Campaign 5, Low Spending Variation: For Campaign 5, the difference in average spending between acceptors and non-acceptors was minimal across all products. This suggests that Campaign 5 may not be significantly influencing purchasing decisions and should either be optimized or discontinued.

5️⃣ Impact of Campaign Strategies on Spending: Campaigns 3 and 4 led to high average spending across all products. The successful strategies from Campaigns 3 and 4 should be evaluated and potentially adapted to Campaigns 1, 2, 5, and 6 to drive higher spending.

6️⃣ Effect of Children on Campaign Acceptance: Customers without children accepted more offers than those with at least one child. New targeted campaign strategies should be developed to improve engagement among customers with children.

7️⃣ Campaign Effectiveness by Marital Status and Country: Campaign 2 was effective across different marital statuses and countries, while Campaign 6 was the least effective. These trends should be further analyzed to enhance future campaign personalization.

8️⃣ Customer Behavior Based on Income & Purchase Channels: A negative correlation between income and purchases made with discounts indicates that lower-income customers rely more on discounts. A positive correlation between income and catalog purchases suggests that higher-income customers prefer catalog-based orders for mail delivery.

Recommendations¶

📌 1. Investigate the Success of Campaign 2: Identify key drivers behind its high acceptance rate and apply those insights to improve Campaigns 1, 3, 4, 5, and 6.

📌 2. Address the Poor Performance of Campaign 6: Conduct a detailed analysis to understand why it had the lowest acceptance rate. Adjust marketing strategies, promotional timing, or product offerings accordingly.

📌 3. Investigate the Income-Based Anomaly in Campaign 1: Identify factors influencing higher acceptance among customers earning below 52K. Consider whether discounts, messaging, or specific product offers played a role.

📌 4. Evaluate the Viability of Campaign 5: Since spending did not significantly differ between accepted and rejected offers, determine whether Campaign 5 needs strategy improvements or discontinuation.

📌 5. Adapt Strategies from Campaigns 3 & 4: These campaigns successfully increased average spending. Apply similar pricing, promotions, or engagement techniques to Campaigns 1, 2, 5, and 6.

📌 6. Improve Engagement Among Customers with Children: Develop specialized campaigns tailored to the needs and preferences of families to boost acceptance rates.

📌 7. Align Marketing Strategies with Income-Based Purchase Preferences: Promote discount-based offers to lower-income customers to increase conversions. Enhance catalog-based promotions for higher-income customers to optimize revenue.

📌 8. Identify Factors Behind the Increasing Trend in Purchases from Campaigns 1 to 4: The CMO should analyze what drove the consistent increase in purchases across all channels. This insight can be leveraged to further enhance future campaigns.

📌 9. Optimize Product-Specific Campaigns: The CMO should develop strategies to increase sales of sweet, gold, and fish products without negatively impacting other product categories. These strategies should be effective across all customer demographics (countries & marital statuses).