Udemy Course Analysis

Table of Contents

  1. Data Cleaning
  2. Exploratory Data Analysis
    2.1 Subject Categories
    2.2 The Most Famouse Course
    2.2.1 All subjects
    2.2.2 Each subjects
    
    2.3 The Level of Courses under Each Categories
    2.4 Price of the Courses
    2.5 Free vs Paid Courses
     2.5.1 Number of free vs paid courses
     2.5.2 Different factors in free vs paid courses
    
    2.6 Time Series
    2.7 Relationships Between Numberical Values
  3. Machine Learning
    3.1 Linear Regression
    3.2 Random Forest
In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
In [2]:
df = pd.read_csv('udemy_courses.csv')
df.head()
Out[2]:
course_id course_title url is_paid price num_subscribers num_reviews num_lectures level content_duration published_timestamp subject
0 1070968 Ultimate Investment Banking Course https://www.udemy.com/ultimate-investment-bank... True 200 2147 23 51 All Levels 1.5 hours 2017-01-18T20:58:58Z Business Finance
1 1113822 Complete GST Course & Certification - Grow You... https://www.udemy.com/goods-and-services-tax/ True 75 2792 923 274 All Levels 39 hours 2017-03-09T16:34:20Z Business Finance
2 1006314 Financial Modeling for Business Analysts and C... https://www.udemy.com/financial-modeling-for-b... True 45 2174 74 51 Intermediate Level 2.5 hours 2016-12-19T19:26:30Z Business Finance
3 1210588 Beginner to Pro - Financial Analysis in Excel ... https://www.udemy.com/complete-excel-finance-c... True 95 2451 11 36 All Levels 3 hours 2017-05-30T20:07:24Z Business Finance
4 1011058 How To Maximize Your Profits Trading Options https://www.udemy.com/how-to-maximize-your-pro... True 200 1276 45 26 Intermediate Level 2 hours 2016-12-13T14:57:18Z Business Finance
In [3]:
df.describe()
Out[3]:
course_id num_subscribers num_reviews num_lectures
count 3.683000e+03 3683.000000 3683.000000 3683.000000
mean 6.764546e+05 3193.371165 156.448004 40.062178
std 3.437217e+05 9498.231406 935.078241 50.366788
min 8.324000e+03 0.000000 0.000000 0.000000
25% 4.077270e+05 110.000000 4.000000 15.000000
50% 6.882440e+05 911.000000 18.000000 25.000000
75% 9.617290e+05 2537.500000 67.000000 45.000000
max 1.282064e+06 268923.000000 27445.000000 779.000000
In [4]:
df.dtypes
Out[4]:
course_id               int64
course_title           object
url                    object
is_paid                object
price                  object
num_subscribers         int64
num_reviews             int64
num_lectures            int64
level                  object
content_duration       object
published_timestamp    object
subject                object
dtype: object

Data Cleaning

'content_duration' is in string, but we will simplify to float, indicating 1.5 hours == 1.5 and 35 mins == 0.35.

In [6]:
# First we will take care of all hours and hour.
df['content_duration'] = df['content_duration'].str.replace('hours', '').str.replace('hour', '')
In [7]:
# Then we will change minutes into decimals.
df['content_duration'] = df['content_duration'].apply(lambda x: int(x.replace('mins', '').strip()) * 1/100 if 'mins' in x else x)
In [8]:
# There is a one data that mixed data with level and content duration. So we fix it here.
df.loc[df['level'] == '52'] = df.loc[df['level'] == '52'].replace({'52':'Beginner Level', 'Beginner Level': 0.52})

Also, we will change the date type of Price to int, rather than string. "Free" and "TRUE" items will have 0 value.

In [9]:
df['price'] = df['price'].replace('Free', 0).replace('TRUE', 0).astype(int)

_We will clean the 'ispaid' column as there are duplicated values such as True/TRUE and False/False. Also we will drop the one with incorrect value.

In [10]:
df['is_paid'] = df['is_paid'].replace("TRUE", "True").replace("FALSE", "False")
df.drop(df.loc[df['is_paid'] == 'https://www.udemy.com/learnguitartoworship/'].index, inplace=True)

There is a data with wrong content duration or NaN value, so we will drop it as we are not sure what it means.

In [11]:
df.drop(df.loc[df['content_duration'].str.contains('questions', na=False)].index, inplace=True)
df['content_duration'] = df['content_duration'].astype(float)

Lastly, we will change the time stamps to just date (month/date/year) to simplify.

In [12]:
df['published_timestamp'] = pd.to_datetime(df['published_timestamp']).dt.strftime('%m/%d/%Y')

EDA

1. Subject Categories

In [13]:
# The number of courses under the categories
course_categories = df['subject'].value_counts().to_frame().reset_index().rename({'subject':'Number of Courses', 
                                                                                  'index':'Categories'}, axis='columns')

fig = px.pie(course_categories, values='Number of Courses', names='Categories', 
             title='Number of Courses under Different Categories', color='Categories',
             color_discrete_map={'Web Development':'darkblue', 'Business Finance':'royalblue', 
                                'Musical Instruments':'cyan', 'Graphic Design':'lightcyan'})
fig.show()

Majority of the courses are either in Web Development and Business Finance. However, we did not see any Business Finance courses in Top 10 the most subscribed course. Learning musicla instruments is quite famouse in Udemy.

2. The Most Famous Course

We will be using number of suscribers of the courses to measure which ones are the most famous courses in Udemy.

a) All Subjects

In [14]:
# Most subscribed courses
most_sub_courses = df.sort_values(by='num_subscribers', ascending=False)[:10]

fig = px.bar(most_sub_courses, x='course_title', y='num_subscribers', color='course_title', height=600)
fig.update_layout(autosize=True)
fig.show()

So all top 10 courses with the most subscribers are mostly in Web Development except one course in "Free Beginner Electric Guitar Lessons" (obvioulsy, it is free!). Especially HTML5 and CSS3 is hot!

b) Each Subjects

In [15]:
bf_all = df.loc[df['subject'] == 'Business Finance']
bf_top10 = bf_all.sort_values(by='num_subscribers', ascending=False)[:10]
bf_top10 = bf_top10.sort_values(by='num_subscribers')

wd_all = df.loc[df['subject'] == 'Web Development']
wd_top10 = wd_all.sort_values(by='num_subscribers', ascending=False)[:10]
wd_top10 = wd_top10.sort_values(by='num_subscribers')

mi_all = df.loc[df['subject'] == 'Musical Instruments']
mi_top10 = mi_all.sort_values(by='num_subscribers', ascending=False)[:10]
mi_top10 = mi_top10.sort_values(by='num_subscribers')

gd_all = df.loc[df['subject'] == 'Graphic Design']
gd_top10 = gd_all.sort_values(by='num_subscribers', ascending=False)[:10]
gd_top10 = gd_top10.sort_values(by='num_subscribers')
In [16]:
fig = make_subplots(rows=4, cols=1,
                   subplot_titles=("Business Finance", "Web Development", 
                                   "Musical Instruments", "Graphic Designs"))

fig.add_trace(go.Bar(y=bf_top10['course_title'], x=bf_top10['num_subscribers'],
                     orientation='h', name='Business Finance'), row=1, col=1)
fig.add_trace(go.Bar(y=wd_top10['course_title'], x=wd_top10['num_subscribers'],
                     orientation='h', name='Web Development'), row=2, col=1)
fig.add_trace(go.Bar(y=mi_top10['course_title'], x=mi_top10['num_subscribers'],
                     orientation='h', name='Musical Instruments'), row=3, col=1)
fig.add_trace(go.Bar(y=gd_top10['course_title'], x=gd_top10['num_subscribers'],
                     orientation='h', name='Graphic Designs'), row=4, col=1)

fig.update_layout(height=1300, title="Famouse Courses in Each Subject by the Number of Subscribers")

fig.show()

3. The level of courses under each categories

Here we will look at the content level of courses in each categories. There are 4 major levles such as: all levels, bigenner, intermediate and expert level.

In [17]:
# Level and the number of courses under each level
level_courses = df['level'].value_counts().reset_index().rename(columns={'index':'Levels', 'level':'Number of Courses'})

fig = px.pie(level_courses, values='Number of Courses', names='Levels',
             title='Number of Courses under Different Categories', color='Levels',
             color_discrete_map={'All Levels':'lavender', 'Beginner Level':'lightsalmon', 
                                'Intermediate Level':'lightcoral', 'Expert Level':'indigo'})
fig.show()

All level courses are more than half. And also lots of beginner level courses. Less than 2% is expert level courses. Now we will look at the levels of courses under the different categories.

In [18]:
# Dataframes for each subject categories
bf = df.groupby(['subject', 'level']).count()['course_id']['Business Finance'].to_frame().reset_index().rename(columns={'level':'Level', 
                                                                                       'course_id':'Number of Courses'})
wd = df.groupby(['subject', 'level']).count()['course_id']['Web Development'].to_frame().reset_index().rename(columns={'level':'Level', 
                                                                                       'course_id':'Number of Courses'})
mi = df.groupby(['subject', 'level']).count()['course_id']['Musical Instruments'].to_frame().reset_index().rename(columns={'level':'Level', 
                                                                                       'course_id':'Number of Courses'})
gd = df.groupby(['subject', 'level']).count()['course_id']['Graphic Design'].to_frame().reset_index().rename(columns={'level':'Level', 
                                                                                       'course_id':'Number of Courses'})
In [19]:
# Levles and the number of courses under each categories

fig = make_subplots(rows=2, cols=2, specs=[[{"type": "pie"}, {"type": "pie"}],
                                           [{"type": "pie"}, {"type": "pie"}]], 
                    subplot_titles=("Business Finance","Wep Development", "Musical Instruments", "Graphic Design"))

colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen']

fig.add_trace(go.Pie(values=bf['Number of Courses'], labels=bf['Level']),
              row=1, col=1)

fig.add_trace(go.Pie(values=wd['Number of Courses'], labels=wd['Level']),
              row=1, col=2)

fig.add_trace(go.Pie(values=mi['Number of Courses'], labels=mi['Level']),
              row=2, col=1)

fig.add_trace(go.Pie(values=gd['Number of Courses'], labels=gd['Level']),
              row=2, col=2)

fig.update_layout(height=900, showlegend=True,  title_text="The number of Courses with Different Levels Under Each Categories")
fig.update_traces(hoverinfo='label+percent', textinfo='label', textfont_size=13,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))

fig.show()

Most of the subjects has the most courses in all and beginner level. Especially for graphic design courses, almost 90% of its courses are for beginner or all levels. The courses in musical instruments has the most courses in intermediate level.

4. Price of the Courses

In [20]:
# Mean price and subject categories
mean_price_category = df.groupby('subject')['price'].mean().to_frame().reset_index()

fig = px.bar(mean_price_category, x='subject', y='price', color='subject', text='price', 
             height=500, color_discrete_sequence=px.colors.sequential.Plasma_r)
fig.update_layout(autosize=True, title="Mean Price of Each Subjects", 
                  yaxis_title="Mean Price", xaxis_title="Subjects")
fig.update_traces(textposition='outside')
fig.update_yaxes(range=[0, 100])
fig.show()

Web Development has the highest mean price, and Musical Instruments has the cheapest.

5. Free vs Paid Courses

Here we will be comparing two types of courses, free and paid. We will try to find out if paid materials has the higher number of letures and longer hours, or if free materials has the higher number of subscribers and reviews.

a) Number of Free vs Paid courses

In [21]:
free_count = df['is_paid'].value_counts().to_frame().reset_index().rename(columns={'index':'Is Free?', 'is_paid':'Count'})
free_count.replace({'True':'Free', 'False':'Paid'}, inplace=True)

fig = px.pie(free_count, values='Count', names='Is Free?', title='Free vs Paid Courses', color='Count',
            color_discrete_sequence=['darksalmon', 'olive'])

fig.show()

b) Different factors in free vs paid courses

In [22]:
free_v_paid = df.groupby('is_paid')[['content_duration', 
                       'num_lectures', 
                       'num_subscribers', 
                       'num_reviews']].sum().reset_index().replace({'False': 'Free', 'True':'Paid'})
In [23]:
fig = make_subplots(rows=2, cols=2, specs=[[{"type": "bar"}, {"type": "bar"}], [{"type": "bar"}, {"type": "bar"}]],
                   subplot_titles=("Content Duration", "Number of Lectures", "Number of Subscribers", "Number of Reviews"))

fig.add_trace(go.Bar(x=free_v_paid['is_paid'], y=free_v_paid['content_duration'],
                    marker=dict(color=free_v_paid['content_duration'])),
              row=1, col=1)

fig.add_trace(go.Bar(x=free_v_paid['is_paid'], y=free_v_paid['num_lectures'],
                    marker=dict(color=free_v_paid['num_lectures'])),
              row=1, col=2)

fig.add_trace(go.Bar(x=free_v_paid['is_paid'], y=free_v_paid['num_subscribers'],
                    marker=dict(color=free_v_paid['num_subscribers'])),
              row=2, col=1)

fig.add_trace(go.Bar(x=free_v_paid['is_paid'], y=free_v_paid['num_reviews'],
                    marker=dict(color=free_v_paid['num_reviews'])),
              row=2, col=2)

fig.update_layout(showlegend=False, height=700, title_text="Free vs Paid Couses")

fig.show()

As you can see from here, Paid courses has higher numbers of lectures, subscribers and reveiws than Free courses.

6. Time Series

_We will look at the date publisehd for each course subjects, this visualization was originally generated by user Sayar_Banner in Kaggle.

In [24]:
subjects = df['subject'].unique()
subset = df[['published_timestamp','subject']]
subset = subset.sort_values('published_timestamp')
time_series = subset['published_timestamp'].value_counts().reset_index()
time_series.columns = ['Date', 'Counts']
time_series = time_series.sort_values('Date')
time_series['Cum Count'] = time_series['Counts'].cumsum()
dummies = pd.get_dummies(subset['subject'])
subset = subset.join(dummies)

subset['Cum Business'] = subset['Business Finance'].cumsum()
subset['Cum Software'] = subset['Web Development'].cumsum()
subset['Cum Music'] = subset['Musical Instruments'].cumsum()
subset['Cum Design'] = subset['Graphic Design'].cumsum()
subset_melt = subset.melt(id_vars='date_published', value_vars=['Cum Business', 'Cum Software', 'Cum Design', 'Cum Music'])
C:\Users\Muffin\Anaconda3\lib\site-packages\pandas\core\indexing.py:1494: FutureWarning:


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike

In [25]:
fig = make_subplots(rows=2, cols=1, subplot_titles=("Time series plot of number of courses",
                                                    "Time series plot of number of courses by subject"))
df.sort_values('published_timestamp', inplace=True)

fig.append_trace(go.Scatter(x=time_series['Date'], y=time_series['Cum Count'], name="All",
                            mode='lines'), row=1, col=1)

fig.append_trace(go.Scatter(x=subset['published_timestamp'], y=subset['Cum Business'], mode="lines",
                            name="Business", line=dict(color="#617C58")), row=2, col=1)
fig.append_trace(go.Scatter(x=subset['published_timestamp'], y=subset['Cum Software'], mode="lines",
                            name="Software", line=dict(color="#74597D", dash="longdashdot")),
                 row=2, col=1)

fig.append_trace(go.Scatter(x=subset['published_timestamp'], y=subset['Cum Design'], mode="lines",
                            name="Design", line=dict(color="#C85A17", dash="dash")),
                 row=2, col=1)

fig.append_trace(go.Scatter(x=subset['published_timestamp'], y=subset['Cum Music'], mode="lines",
                            name="Music", line=dict(color="#1884C7", dash="dashdot")),
                 row=2, col=1)

fig.update_layout(width=900, height=800)
fig.show()

7. Relationships between numerical values

We will look at some scatter plots for numerical values and if there is any relationships

In [26]:
fig = make_subplots(rows=2, cols=2,
                   subplot_titles=("Number of Subscrb. and Number of Reviews", 
                                   "Number of Subscrb. and Price", 
                                   "Number of Lectures and Price",
                                   "Content Duration and Price"))

fig.add_trace(go.Scatter(x=df['num_subscribers'], y=df['num_reviews'], mode='markers'), row=1, col=1)
fig.add_trace(go.Scatter(x=df['num_subscribers'], y=df['price'], mode='markers'), row=1, col=2)
fig.add_trace(go.Scatter(x=df['num_lectures'], y=df['price'], mode='markers'), row=2, col=1)
fig.add_trace(go.Scatter(x=df['content_duration'], y=df['price'], mode='markers'), row=2, col=2)

fig.update_xaxes(title_text="Number of Subscribers", row=1, col=1)
fig.update_xaxes(title_text="Number of Subscribers", row=1, col=2)
fig.update_xaxes(title_text="Number of Lectures", row=2, col=1)
fig.update_xaxes(title_text="Content Duration", row=2, col=2)

fig.update_yaxes(title_text="Number of Reviews", row=1, col=1)
fig.update_yaxes(title_text="Price", row=1, col=2)
fig.update_yaxes(title_text="Price", row=2, col=1)
fig.update_yaxes(title_text="Price", row=2, col=2)

fig.update_layout(height=800, width=900, showlegend=False)
fig.show()

We can see that there are no speical relationships except the first plot, number of subscribers and number of reviews. It seems apparent that if the course has high number of subscribers, it will most likely high number of reviews. Other than that, it seems no relationships between other variables.

Machine Learning

We will try to predict the price based on the number of subscribers, and other factors

a) Linear Regression (number of subscribers and price)

In [38]:
from sklearn.linear_model import LinearRegression
df_linear = df[['course_id', 'course_title', 'price']]
In [31]:
skl_reg = LinearRegression().fit(df[['num_subscribers']].values, df['price'].values)
skl_reg.score(df[['num_subscribers']].values, df['price'].values)
Out[31]:
0.0025775267802027324
In [40]:
df_linear['pred_price'] = skl_reg.predict(df[['price']].values)
C:\Users\Muffin\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [43]:
df_linear.head()
Out[43]:
course_id course_title price pred_price
1966 347104 Learn to Play the Oboe: Beginner to Pro in Und... 50 65.023883
2205 344348 Master the Clarinet: Intermediate Instruction ... 50 65.023883
2206 373556 Advanced Clarinet Studies - Learn to Master th... 50 65.023883
2285 349142 Master Oboe Playing: Intermediate Instruction ... 50 65.023883
1844 338926 Learn to Play Clarinet: Beginner to Pro in Und... 50 65.023883
In [54]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
In [45]:
num_cols = ['price', 'num_reviews', 'num_lectures', 'content_duration']
cat_cols = ['is_paid', 'level', 'subject']
X_data, y_data = df[num_cols].merge(pd.get_dummies(df[cat_cols]), left_index=True, right_index=True), df['num_subscribers']
In [48]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=42)
col_names = X_train.columns
In [52]:
# Applying Standard Scaler
scaler = StandardScaler()
scaler = scaler.fit(X_train) 
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
In [55]:
rf_model = RandomForestRegressor(n_estimators=500, random_state=42)
In [56]:
rf_model.fit(X_train, y_train)
Out[56]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=500,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)
In [58]:
y_train_preds = rf_model.predict(X_train)
In [63]:
print("Mean Squared Error on training data is: {:.2f}".format(mean_squared_error(y_train_preds, y_train)))
Mean Squared Error on training data is: 5978532.78
In [61]:
y_pred = rf_model.predict(X_test)
In [64]:
print("Mean Squared Error on testing data is: {:.2f}".format(mean_squared_error(y_pred, y_test)))
Mean Squared Error on testing data is: 23191721.85
In [66]:
# Feature Importance
imp_features = pd.Series(rf_model.feature_importances_, index=col_names).nlargest(5)
px.bar(x=imp_features.index, y=imp_features.values,
       labels={'x':"Features", 'y':"Importance Criterion"},
       color=imp_features.index,
       color_discrete_sequence=px.colors.qualitative.T10,
       title="Feature Importance")