KPMG

Table of Contents

1. Data Introduction

2. Data Quality Assessment

a. Null Data & Data Consistency
b. Exploratory Data Analysis (EDA)

3. Conclusion


In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from datetime import datetime
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
In [2]:
xls = pd.ExcelFile('KPMG_VI_New_raw_data_update_final.xlsx')
transactions = pd.read_excel(xls, 'Transactions', header=1)
demographic = pd.read_excel(xls, 'CustomerDemographic', header=1)
address = pd.read_excel(xls, 'CustomerAddress', header=1)

1. Data Introduction

In [3]:
transactions.shape
Out[3]:
(20000, 13)
In [4]:
transactions.head(5)
Out[4]:
transaction_id product_id customer_id transaction_date online_order order_status brand product_line product_class product_size list_price standard_cost product_first_sold_date
0 1 2 2950 2017-02-25 0.0 Approved Solex Standard medium medium 71.49 53.62 41245.0
1 2 3 3120 2017-05-21 1.0 Approved Trek Bicycles Standard medium large 2091.47 388.92 41701.0
2 3 37 402 2017-10-16 0.0 Approved OHM Cycles Standard low medium 1793.43 248.82 36361.0
3 4 88 3135 2017-08-31 0.0 Approved Norco Bicycles Standard medium medium 1198.46 381.10 36145.0
4 5 78 787 2017-10-01 1.0 Approved Giant Bicycles Standard medium large 1765.30 709.48 42226.0
In [5]:
demographic.shape
Out[5]:
(4000, 13)
In [6]:
demographic.head(5)
Out[6]:
customer_id first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator default owns_car tenure
0 1 Laraine Medendorp F 93 1953-10-12 Executive Secretary Health Mass Customer N "' Yes 11.0
1 2 Eli Bockman Male 81 1980-12-16 Administrative Officer Financial Services Mass Customer N <script>alert('hi')</script> Yes 16.0
2 3 Arlin Dearle Male 61 1954-01-20 Recruiting Manager Property Mass Customer N 2018-02-01 00:00:00 Yes 15.0
3 4 Talbot NaN Male 33 1961-10-03 NaN IT Mass Customer N () { _; } >_[$($())] { touch /tmp/blns.shellsh... No 7.0
4 5 Sheila-kathryn Calton Female 56 1977-05-13 Senior Editor NaN Affluent Customer N NIL Yes 8.0
In [7]:
address.shape
Out[7]:
(3999, 6)
In [8]:
address.head(5)
Out[8]:
customer_id address postcode state country property_valuation
0 1 060 Morning Avenue 2016 New South Wales Australia 10
1 2 6 Meadow Vale Court 2153 New South Wales Australia 10
2 4 0 Holy Cross Court 4211 QLD Australia 9
3 5 17979 Del Mar Point 2448 New South Wales Australia 4
4 6 9 Oakridge Court 3216 VIC Australia 9

2. Data Quality Assessment

a. Null Data & Data Consistency

Transactions

In [9]:
headerColor = '#dade9b'
rowSpecialColor = '#ffeb74'
rowColor = 'white'

# Per-column null counts, totals, and null percentages; highlight the column with the most nulls
nullValues = [transactions[item].isna().sum() for item in transactions.columns]
allValues =  [transactions.shape[0] for item in transactions.columns]
percentages = [round((transactions[item].isna().sum() / transactions.shape[0])*100, 2) \
                                                 for item in transactions.columns]
colors = [rowSpecialColor if x == max(nullValues) else rowColor for x in nullValues]

fig = go.Figure(data=[go.Table(header=dict(values=['<b>Columns</b>','<b>NULL</b>',
                                                   '<b>All Values</b>','<b> Percentage (%)</b>'],
                                           line_color='darkslategray', fill_color=headerColor,
                                           align=['left', 'center'], font=dict(color='darkslategray', size=12)),
                               cells=dict(values=[transactions.columns, nullValues, allValues, percentages],
                                          line_color='darkslategray',
                                          fill_color = [colors],
                                          align = ['left', 'center'], font=dict(color='black', size=11)))])

fig.update_layout(title='Transactions')
fig.show()

The Online Order column has the most NULL values.

First, we will fill the NULL values in the Online Order column with 1.0 if the order status is Approved and 0.0 if it is Cancelled, because most online orders in the data carry an Approved status.

In [10]:
transactions.loc[transactions['order_status'].eq('Approved') & transactions['online_order'].isnull(), 'online_order'] = 1.0
transactions.loc[transactions['order_status'].eq('Cancelled') & transactions['online_order'].isnull(), 'online_order'] = 0.0
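
Orders whose status is neither Approved nor Cancelled would still be null after this fill; a quick sanity check (a minimal sketch):

transactions['online_order'].isna().sum()  # 0 after the fill (see transactions.info() below)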

Next, for the Brand, Product Line, Product Class, and Product Size columns, we will fill missing values with the most frequent value (the mode). As the charts below show, Solex is the best-selling brand, so NaN brand values are replaced with Solex; the other three columns are filled the same way.

In [11]:
brands = transactions['brand'].value_counts().to_frame().reset_index()\
        .rename(columns={'index': 'Brand', 'brand': 'Counts'})
product_line = transactions['product_line'].value_counts().to_frame().reset_index()\
        .rename(columns={'index': 'Product Line', 'product_line': 'Counts'})
product_class = transactions['product_class'].value_counts().to_frame().reset_index()\
        .rename(columns={'index': 'Product Class', 'product_class': 'Counts'})
product_size = transactions['product_size'].value_counts().to_frame().reset_index()\
        .rename(columns={'index': 'Product Size', 'product_size': 'Counts'})
In [12]:
fig = make_subplots(rows=4, cols=1)
fig.add_trace(go.Bar(x=brands['Brand'], y=brands['Counts'], name='Brand',
                    marker=dict(color='#ffdfd3')), 1, 1)
fig.add_trace(go.Bar(x=product_line['Product Line'], y=product_line['Counts'], name='Product Line',
                    marker=dict(color='#957dad')), 2, 1)
fig.add_trace(go.Bar(x=product_class['Product Class'], y=product_class['Counts'], name='Product Class',
                    marker=dict(color='#fec8d8')), 3, 1)
fig.add_trace(go.Bar(x=product_size['Product Size'], y=product_size['Counts'], name='Product Size',
                    marker=dict(color='#e0bbe4')), 4, 1)

fig.update_layout(template='none', title='The Most Frequent Values')
fig.update_yaxes(showgrid=False, showticklabels=False)
fig.show()
In [13]:
transactions['brand'] = transactions['brand'].fillna(transactions['brand'].mode().iloc[0])
transactions['product_line'] = transactions['product_line'].fillna(transactions['product_line'].mode().iloc[0])
transactions['product_class'] = transactions['product_class'].fillna(transactions['product_class'].mode().iloc[0])
transactions['product_size'] = transactions['product_size'].fillna(transactions['product_size'].mode().iloc[0])

For the Standard Cost column, we will fill the NULL values with the average cost.

In [14]:
transactions['standard_cost'] = transactions['standard_cost'].fillna(transactions['standard_cost'].mean())
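
If the cost distribution is skewed (the boxplots later in the EDA give a feel for it), the median is a more outlier-robust alternative; a minimal sketch:

# Alternative (not applied here): the median is robust to a skewed tail
transactions['standard_cost'].fillna(transactions['standard_cost'].median())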

Unfortunately, the values in the Product First Sold Date column are unclear, and it does not make sense to guess the dates, so we will drop the entire column. If more information becomes available, we can reconstruct it as a new column; for now, we drop it to keep the data clean.
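
That said, the values look like they could be spreadsheet serial day numbers. If that were ever confirmed, a purely hypothetical decoding sketch (assuming Excel's 1899-12-30 epoch):

# Hypothetical: only valid if the column really holds Excel serial days
pd.to_datetime(transactions['product_first_sold_date'], unit='D', origin='1899-12-30')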

In [15]:
transactions.drop(columns='product_first_sold_date', inplace=True)

Finally, the cleaned data:

In [16]:
transactions.head()
Out[16]:
transaction_id product_id customer_id transaction_date online_order order_status brand product_line product_class product_size list_price standard_cost
0 1 2 2950 2017-02-25 0.0 Approved Solex Standard medium medium 71.49 53.62
1 2 3 3120 2017-05-21 1.0 Approved Trek Bicycles Standard medium large 2091.47 388.92
2 3 37 402 2017-10-16 0.0 Approved OHM Cycles Standard low medium 1793.43 248.82
3 4 88 3135 2017-08-31 0.0 Approved Norco Bicycles Standard medium medium 1198.46 381.10
4 5 78 787 2017-10-01 1.0 Approved Giant Bicycles Standard medium large 1765.30 709.48
In [17]:
transactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   transaction_id    20000 non-null  int64         
 1   product_id        20000 non-null  int64         
 2   customer_id       20000 non-null  int64         
 3   transaction_date  20000 non-null  datetime64[ns]
 4   online_order      20000 non-null  float64       
 5   order_status      20000 non-null  object        
 6   brand             20000 non-null  object        
 7   product_line      20000 non-null  object        
 8   product_class     20000 non-null  object        
 9   product_size      20000 non-null  object        
 10  list_price        20000 non-null  float64       
 11  standard_cost     20000 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(3), object(5)
memory usage: 1.8+ MB

Demographic

In [18]:
headerColor = '#dade9b'
rowSpecialColor = '#ffeb74'
rowColor = 'white'

nullValues = [demographic[item].isna().sum() for item in demographic.columns]
allValues =  [demographic.shape[0] for item in demographic.columns]
percentages = [round((demographic[item].isna().sum() / demographic.shape[0])*100, 2) \
                                                 for item in demographic.columns]
colors = [rowSpecialColor if x == max(nullValues) else rowColor for x in nullValues]

fig = go.Figure(data=[go.Table(header=dict(values=['<b>Columns</b>','<b>NULL</b>',
                                                   '<b>All Values</b>','<b> Percentage (%)</b>'],
                                           line_color='darkslategray', fill_color=headerColor,
                                           align=['left', 'center'], font=dict(color='darkslategray', size=12)),
                               cells=dict(values=[demographic.columns, nullValues, allValues, percentages],
                                          line_color='darkslategray',
                                          fill_color = [colors],
                                          align = ['left', 'center'], font=dict(color='black', size=11)))])

fig.update_layout(title='Demographic')
fig.show()

Here we replace each null job industry category with the most frequent industry for that job title. For example, if customers with the job title 'Account Coordinator' most often work in the 'Property' industry, we fill the null industry values of Account Coordinators with 'Property'.

In [19]:
# Customer counts for each (job title, industry) pair
frequent_jobs = demographic.groupby(['job_title', 'job_industry_category'], as_index=False)['customer_id'].count()
# For each job title, keep the industry with the largest customer count
freq_lookup = frequent_jobs.groupby('job_title').apply(lambda x: x['job_industry_category'][x['customer_id'].idxmax()])\
                .to_frame().reset_index().rename(columns={0: 'job_industry_category'})
In [20]:
missing = demographic['job_industry_category'].isna()
demographic.loc[missing, 'job_industry_category'] = demographic.loc[missing, 
                                                'job_title'].map(freq_lookup.set_index('job_title')['job_industry_category'])
In [21]:
demographic.loc[demographic['job_industry_category'].isna() == True].shape
Out[21]:
(105, 13)
In [22]:
freq_lookup_ind = frequent_jobs.groupby('job_industry_category').apply(lambda x: x['job_title'][x['customer_id'].idxmax()])\
                    .to_frame().reset_index().rename(columns={0: 'job_title'})
In [23]:
missing = demographic['job_title'].isna()
demographic.loc[missing, 'job_title'] = demographic.loc[missing, 
                                'job_industry_category'].map(freq_lookup_ind.set_index('job_industry_category')['job_title'])
In [24]:
demographic.loc[demographic['job_title'].isna() == True].shape
Out[24]:
(105, 13)

We have reduced the NULL values in job industry category and job title from 656 to 105. The remaining 105 rows are missing both fields, so we cannot infer those customers' jobs and will leave them as NaN for now. Last names and birthdays likewise cannot be inferred, so those stay as NaN as well.
The default column contains only meaningless values (stray quotes, scripts, and shell fragments), so we will drop it.

In [25]:
demographic.drop(columns='default', inplace=True)
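
Before moving on, a quick confirmation (sketch) that the remaining 105 rows are indeed missing both job fields:

demographic[demographic['job_title'].isna() & demographic['job_industry_category'].isna()].shape  # expected (105, 12) after dropping 'default'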
In [26]:
demographic.head(5)
Out[26]:
customer_id first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator owns_car tenure
0 1 Laraine Medendorp F 93 1953-10-12 Executive Secretary Health Mass Customer N Yes 11.0
1 2 Eli Bockman Male 81 1980-12-16 Administrative Officer Financial Services Mass Customer N Yes 16.0
2 3 Arlin Dearle Male 61 1954-01-20 Recruiting Manager Property Mass Customer N Yes 15.0
3 4 Talbot NaN Male 33 1961-10-03 Business Systems Development Analyst IT Mass Customer N No 7.0
4 5 Sheila-kathryn Calton Female 56 1977-05-13 Senior Editor Manufacturing Affluent Customer N Yes 8.0

For the tenure column, we will compute the mean tenure for each job title, since tenure usually depends on the job, and fill the null values with those means.

In [27]:
mean_tenure_lookup = demographic.groupby('job_title').mean()['tenure']
In [28]:
missing = demographic['tenure'].isna()
demographic.loc[missing, 'tenure'] = demographic.loc[missing, 'job_title'].map(mean_tenure_lookup)
In [29]:
demographic.loc[demographic['tenure'].isna() == True].shape
Out[29]:
(0, 12)

We will clean up the gender column.

In [30]:
demographic['gender'].value_counts()
Out[30]:
Female    2037
Male      1872
U           88
M            1
F            1
Femal        1
Name: gender, dtype: int64
In [31]:
demographic['gender'].replace({'F':'Female', 'M':'Male', 'Femal':'Female', 'U':np.NaN}, inplace=True)
In [32]:
demographic['gender'].value_counts()
Out[32]:
Female    2039
Male      1873
Name: gender, dtype: int64

For the birthday, we will derive the age and add it as a new column called age, then check the minimum and maximum age.

In [33]:
now = pd.to_datetime('now')
demographic['age'] = (now - demographic['DOB']).astype('<m8[Y]')
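
A side note: the '<m8[Y]' cast above was removed in pandas 2.x. On newer versions, an equivalent sketch (approximating a year as 365.2425 days, the same average NumPy uses):

# Equivalent age computation without the removed timedelta-to-years cast
demographic['age'] = (pd.Timestamp.now() - demographic['DOB']).dt.days // 365.2425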
In [34]:
demographic['age'].min()
Out[34]:
18.0
In [35]:
demographic['age'].max()
Out[35]:
176.0

Since an age of 176 is clearly an outlier, we will replace it with the average age of the customers.

In [36]:
demographic.loc[demographic['age'] == 176.0, 'age'] = round(demographic['age'].mean())

Although there are still null values, we will proceed.

In [37]:
demographic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   customer_id                          4000 non-null   int64         
 1   first_name                           4000 non-null   object        
 2   last_name                            3875 non-null   object        
 3   gender                               3912 non-null   object        
 4   past_3_years_bike_related_purchases  4000 non-null   int64         
 5   DOB                                  3913 non-null   datetime64[ns]
 6   job_title                            3895 non-null   object        
 7   job_industry_category                3895 non-null   object        
 8   wealth_segment                       4000 non-null   object        
 9   deceased_indicator                   4000 non-null   object        
 10  owns_car                             4000 non-null   object        
 11  tenure                               4000 non-null   float64       
 12  age                                  3913 non-null   float64       
dtypes: datetime64[ns](1), float64(2), int64(2), object(8)
memory usage: 406.4+ KB

Address

In [38]:
headerColor = '#dade9b'
rowColor = 'white'

nullValues = [address[item].isna().sum() for item in address.columns]
allValues =  [address.shape[0] for item in address.columns]
percentages = [round((address[item].isna().sum() / address.shape[0])*100, 2) \
                                                 for item in address.columns]


fig = go.Figure(data=[go.Table(header=dict(values=['<b>Columns</b>','<b>NULL</b>',
                                                   '<b>All Values</b>','<b> Percentage (%)</b>'],
                                           line_color='darkslategray', fill_color=headerColor,
                                           align=['left', 'center'], font=dict(color='darkslategray', size=12)),
                               cells=dict(values=[address.columns, nullValues, allValues, percentages],
                                          line_color='darkslategray',
                                          fill_color = [rowColor],
                                          align = ['left', 'center'], font=dict(color='black', size=11)))])

fig.update_layout(title='Address')
fig.show()
In [39]:
address.replace({'QLD': 'Queensland', 'VIC': 'Victoria', 'NSW': 'New South Wales'}, inplace=True)
In [40]:
address['state'].value_counts()
Out[40]:
New South Wales    2140
Victoria           1021
Queensland          838
Name: state, dtype: int64

b. Exploratory Data Analysis (EDA)

Transactions

1) Transaction Dates with a 20-Day Moving Average

In [41]:
trans_dates = transactions.groupby('transaction_date').count()['transaction_id']\
                .to_frame().reset_index().rename(columns={'transaction_id':'counts'})
trans_dates['MA20'] = trans_dates['counts'].rolling(window=20).mean()
In [42]:
fig = px.line(trans_dates, x='transaction_date', y="MA20", template='none',
             title='Daily Transactions (20-Day Moving Average)')
fig.update_yaxes(title='Counts', showgrid=False)
fig.update_xaxes(title='', showgrid=False)
fig.show()

2) Online Order and Order Status

In [43]:
trans_online = transactions['online_order'].value_counts().to_frame().reset_index()\
                .rename(columns={'index':'Online?', 'online_order':'Counts'}).replace({1.0:'Yes', 0.0:'No'})
In [44]:
trans_status = transactions['order_status'].value_counts().to_frame().reset_index()\
                .rename(columns={'index':'Status', 'order_status':'Counts'})
In [45]:
fig = make_subplots(rows=1, cols=2, shared_yaxes=True,
                   subplot_titles=('Online Order?', 'Status'))

fig.add_trace(go.Bar(x=trans_online['Online?'] ,y=trans_online['Counts'], name=''),
              row=1, col=1)

fig.add_trace(go.Bar(x=trans_status['Status'], y=trans_status['Counts'], name=''),
              row=1, col=2)
fig.update_layout(template='none')
fig.update_yaxes(showgrid=False)
fig.show()
3) The Most Popular Brand

In [46]:
most_pop_brand = transactions['brand'].value_counts().to_frame().reset_index()\
                    .rename(columns={'index':'Brand', 'brand':'Counts'})
In [47]:
fig = px.pie(most_pop_brand, values='Counts', names='Brand', 
             color_discrete_sequence=px.colors.sequential.Purp,
             title='The Most Popular Brand')
fig.show()

4) List Price and Standard Cost

In [48]:
fig = go.Figure()
fig.add_trace(go.Box(y=transactions['list_price'], notched=True, name='List Price'))
fig.add_trace(go.Box(y=transactions['standard_cost'], notched=True, name='Standard Cost'))

fig.update_layout(template='none', title='List Price and Standard Cost')
fig.show()

Demographic

1) Gender

In [49]:
fig = px.histogram(demographic, x="gender", opacity=0.5,
                   color_discrete_sequence=['indianred'],
                   template='none', title='Gender of the Customers')
fig.update_yaxes(range=[0, 2300])
fig.show()

2) Age

In [50]:
fig = px.histogram(demographic, x='age', opacity=0.5,
                   color_discrete_sequence=['indianred'],
                   template='none', title='Age of the Customers')

fig.show()

3) Job Industry Category

In [51]:
job_ind_cat = demographic['job_industry_category'].value_counts().to_frame()\
                .reset_index().rename(columns={'index':'Job Industry Category', 'job_industry_category':'Counts'})
In [52]:
fig = px.pie(job_ind_cat, values='Counts', names='Job Industry Category', 
             color_discrete_sequence=px.colors.sequential.Purp,
             title='Job Industry')
fig.show()

4) Wealth Segment and Car Ownership

In [53]:
wealth = demographic['wealth_segment'].value_counts().to_frame()\
        .reset_index().rename(columns={'index':'Wealth Segment', 'wealth_segment':'Counts'})
car = demographic['owns_car'].value_counts().to_frame()\
        .reset_index().rename(columns={'index':'Car Ownership', 'owns_car':'Counts'})
In [54]:
car
Out[54]:
Car Ownership Counts
0 Yes 2024
1 No 1976
In [55]:
colors = ['darkorange', 'mediumturquoise']
fig = make_subplots(rows=1, cols=2, specs=[[{"type": "xy"}, {"type": "domain"}]],
                   subplot_titles=('Wealth Segment', 'Car Ownership'))

fig.add_trace(go.Bar(x=wealth['Wealth Segment'], y=wealth['Counts'],
                       name='Wealth Segment', marker_color='indianred'), row=1, col=1)
fig.add_trace(go.Pie(labels=car['Car Ownership'],values=car['Counts'],
                            name='Car Ownership', marker=dict(colors=colors, line=dict(color='#000000', width=2))), 
              row=1, col=2)

fig.update_layout(template='none', showlegend=False)
fig.show()

5) Tenure

In [56]:
fig = go.Figure(data=go.Violin(y=demographic['tenure'], box_visible=True, line_color='black',
                               meanline_visible=True, fillcolor='lightseagreen', opacity=0.6,
                               x0='Tenure'))

fig.update_layout(yaxis_zeroline=False, template='none', title='Tenure')
fig.update_yaxes(showgrid=False)
fig.show()

Address

In [57]:
postCode = pd.read_csv('AUS_postcode.csv')
postCode.drop(columns=['suburb', 'state', 'dc', 'type'], inplace=True)
postCode = postCode.drop_duplicates(subset='postcode')
In [58]:
address = address.merge(postCode, on='postcode', how='left')
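
A left join can leave null coordinates for any postcode absent from the reference file; a quick check (sketch):

address['lat'].isna().sum()  # addresses whose postcode had no match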

In [68]:
u_states = np.unique(address['state'].values)
# Map each state name to an integer so markers can be colored by state
d = dict(zip(u_states, np.arange(len(u_states))))
         
fig = go.Figure(data=go.Scattergeo(lon=address['lon'], lat=address['lat'],
                                   mode='markers', marker_color=[d[s] for s in address['state']]))

fig.update_layout(title='The Customer Locations', 
                  geo=dict(projection_scale=5,
                           center=dict(lat=-28.892778, lon=153.2268)))
fig.show()

3. Conclusion

In [72]:
transactions.head(5)
Out[72]:
transaction_id product_id customer_id transaction_date online_order order_status brand product_line product_class product_size list_price standard_cost
0 1 2 2950 2017-02-25 0.0 Approved Solex Standard medium medium 71.49 53.62
1 2 3 3120 2017-05-21 1.0 Approved Trek Bicycles Standard medium large 2091.47 388.92
2 3 37 402 2017-10-16 0.0 Approved OHM Cycles Standard low medium 1793.43 248.82
3 4 88 3135 2017-08-31 0.0 Approved Norco Bicycles Standard medium medium 1198.46 381.10
4 5 78 787 2017-10-01 1.0 Approved Giant Bicycles Standard medium large 1765.30 709.48
In [73]:
demographic.head(5)
Out[73]:
customer_id first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator owns_car tenure age
0 1 Laraine Medendorp Female 93 1953-10-12 Executive Secretary Health Mass Customer N Yes 11.0 67.0
1 2 Eli Bockman Male 81 1980-12-16 Administrative Officer Financial Services Mass Customer N Yes 16.0 39.0
2 3 Arlin Dearle Male 61 1954-01-20 Recruiting Manager Property Mass Customer N Yes 15.0 66.0
3 4 Talbot NaN Male 33 1961-10-03 Business Systems Development Analyst IT Mass Customer N No 7.0 59.0
4 5 Sheila-kathryn Calton Female 56 1977-05-13 Senior Editor Manufacturing Affluent Customer N Yes 8.0 43.0
In [74]:
address.head(5)
Out[74]:
customer_id address postcode state country property_valuation lat lon
0 1 060 Morning Avenue 2016 New South Wales Australia 10 -33.892778 151.203901
1 2 6 Meadow Vale Court 2153 New South Wales Australia 10 -33.758601 150.992887
2 4 0 Holy Cross Court 4211 Queensland Australia 9 -28.033052 153.279625
3 5 17979 Del Mar Point 2448 New South Wales Australia 4 -30.615511 152.999909
4 6 9 Oakridge Court 3216 Victoria Australia 9 -38.175587 144.342666

Overall, the datasets are incomplete. The main problems include:

  • Null values: many columns contained nulls, which I filled with the most frequent value or the mean, as appropriate. Values such as last names or birthdays cannot be guessed, so they were left null to preserve the rows.
  • Inconsistent values: some fields, such as state, were recorded inconsistently; for example, some rows used the full name 'Victoria' while others used the abbreviation 'VIC'. These were all standardized to the full state name.
  • Data types: some columns had inconsistent data types, but they have all been unified.