YouTube Video Analysis in the US and Canada¶

Here we analyze trending YouTube videos in the United States and Canada. Additionally, we will explore Korean YouTube trending videos.

Table of Contents¶

1. Data Cleaning¶

2. Data Analysis and Visualization¶

2.1 Most trending categories in YouTube
2.2 Top 20 Most Liked Videos
2.3 How long does it take to start trending?
2.4 Published Time of Trending Videos
2.5 Most Viewed Videos in Each Category
2.6 Most Disliked, Controversial Videos

3. Word Cloud¶

3.1 Titles from People & Blogs

4. Conclusion¶

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import re

us = pd.read_csv('USvideos.csv')
ca = pd.read_csv('CAvideos.csv')

us.head(2)

us.describe()

us.columns

Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')

1. Data Cleaning¶

Change all date columns to datetime objects.

# trending_date is stored as yy.dd.mm (for example 17.14.11)
us['trending_date'] = pd.to_datetime(us['trending_date'], format='%y.%d.%m')
us['publish_time'] = pd.to_datetime(us['publish_time'])

# Split the publish timestamp into a date column (datetime64) and a time-of-day column
us['publish_date'] = pd.to_datetime(us['publish_time'].dt.date)
us['publish_time'] = us['publish_time'].dt.time

# Same conversions for the Canadian data
ca['trending_date'] = pd.to_datetime(ca['trending_date'], format='%y.%d.%m')
ca['publish_time'] = pd.to_datetime(ca['publish_time'])

ca['publish_date'] = pd.to_datetime(ca['publish_time'].dt.date)
ca['publish_time'] = ca['publish_time'].dt.time
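
The conversions above can be checked on a couple of values. The following is a minimal sketch rather than part of the original notebook: the `sample` frame is a hypothetical stand-in built from the two rows that `us.head(2)` displays, and it assumes those strings are representative of the full columns.

```python
import pandas as pd

# Two sample rows; the strings are copied from the us.head(2) output.
sample = pd.DataFrame({
    'trending_date': ['17.14.11', '17.14.11'],
    'publish_time': ['2017-11-13T17:13:01.000Z', '2017-11-13T07:30:00.000Z'],
})

# '%y.%d.%m' reads the string as two-digit year, day, month.
trending = pd.to_datetime(sample['trending_date'], format='%y.%d.%m')

# The trailing 'Z' makes the parsed timestamps timezone-aware (UTC).
published = pd.to_datetime(sample['publish_time'])

# Going through .dt.date drops the timezone, so the result can be
# subtracted from the timezone-naive trending dates.
publish_date = pd.to_datetime(published.dt.date)
publish_hour = published.dt.hour

print(trending.dt.strftime('%Y-%m-%d').tolist())
print(publish_hour.tolist())
```

Swapping `%d` and `%m` in the format string would fail on this sample, because 14 is not a valid month.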

We will change category id numbers to categories.

categories = {2: 'Autos & Vehicles', 1: 'Film & Animation', 10: 'Music', 15: 'Pets & Animals', 
              17: 'Sports', 18: 'Short Movies', 19: 'Travel & Events', 20: 'Gaming', 21: 'Videoblogging',
              22: 'People & Blogs', 23: 'Comedy', 24: 'Entertainment', 25: 'News & Politics', 26: 'Howto & Style',
              27: 'Education', 28: 'Science & Technology', 29: 'Nonprofits & Activism', 30: 'Movies', 
              31: 'Anime/Animation', 32: 'Action/Adventure', 33: 'Classics', 34: 'Comedy', 35: 'Documentary', 
              36: 'Drama', 37: 'Family', 38: 'Foreign', 39: 'Horror', 40: 'Sci-Fi/Fantasy', 41: 'Thriller',
              42: 'Shorts', 43: 'Shows', 44: 'Trailers'}

us['category_id'] = us['category_id'].map(categories)
ca['category_id'] = ca['category_id'].map(categories)
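
One thing worth checking after a dictionary-based `map` is whether any ID was missing from the lookup, since such rows silently become `NaN`. A small sketch under stated assumptions: the two-entry dictionary and the ID `99` are hypothetical, chosen only to show the behaviour.

```python
import pandas as pd

# Hypothetical lookup with only two entries; the notebook's dictionary is larger.
categories = {10: 'Music', 22: 'People & Blogs'}

# 99 is a made-up ID that is deliberately missing from the lookup.
category_id = pd.Series([22, 10, 99])
mapped = category_id.map(categories)

# IDs without a dictionary entry come back as NaN.
unmapped_ids = category_id[mapped.isna()].unique().tolist()

print(mapped.tolist())
print('IDs without a name:', unmapped_ids)
```

Running the same check on the raw `category_id` column before it is overwritten would show whether the dictionary covers every ID in the data.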

2. Data Analysis and Visualization¶

1) Most trending categories in YouTube¶

# Count trending entries per category and keep the ten most frequent.
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas.
most_trending_cat_us = us['category_id'].value_counts().rename_axis('Category')\
                    .reset_index(name='Counts')[:10]
most_trending_cat_ca = ca['category_id'].value_counts().rename_axis('Category')\
                    .reset_index(name='Counts')[:10]

fig = go.Figure()
fig.add_trace(go.Bar(x=most_trending_cat_us['Category'], y=most_trending_cat_us['Counts'],
                     name='United States', marker_color='lightskyblue'))
fig.add_trace(go.Bar(x=most_trending_cat_ca['Category'], y=most_trending_cat_ca['Counts'],
                     name='Canada', marker_color='crimson'))

fig.update_layout(barmode='group', xaxis_tickangle=-45, template='none',
                 title='Most Trending Categories in US and CA')
fig.show()
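
The column names that `value_counts()` produces after `reset_index()` changed in pandas 2.0, so a count table is easier to keep stable when both names are set explicitly. The sketch below illustrates one way to do that; the `labels` series is hypothetical data standing in for the mapped category column.

```python
import pandas as pd

# Hypothetical category labels standing in for us['category_id'] after mapping.
labels = pd.Series(['Music', 'Comedy', 'Music', 'Sports', 'Music', 'Comedy'])

# rename_axis names the index and reset_index(name=...) names the count column,
# so the result does not depend on the default names value_counts assigns.
counts = labels.value_counts().rename_axis('Category').reset_index(name='Counts')

# value_counts sorts by frequency, so slicing keeps the most common categories.
top2 = counts[:2]
print(top2)
```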

2) Top 20 Most Liked Videos¶

a. US¶

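# Each video has one row per trending day, so keep only its most-liked row before taking the top 20
top20_likes_us = us.sort_values(by='likes', ascending=False).drop_duplicates(subset='video_id')[:20]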

fig = px.scatter(top20_likes_us, x="likes", y="views", size="comment_count", color="title",
                 hover_name="title", log_x=True, size_max=50, labels={'title':''})
fig.update_layout(template='none', title='Top 20 Most Liked Videos in the US')
fig.show()

b. Canada¶

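# Each video has one row per trending day, so keep only its most-liked row before taking the top 20
top20_likes_ca = ca.sort_values(by='likes', ascending=False).drop_duplicates(subset='video_id')[:20]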

fig = px.scatter(top20_likes_ca, x="likes", y="views", size="comment_count", color="title",
                 hover_name="title", log_x=True, size_max=50, labels={'title':''})
fig.update_layout(template='none', title='Top 20 Most Liked Videos in Canada')
fig.show()

Interestingly, the Korean boy band BTS has the most liked videos in both the US and Canada. We will also check Korean YouTube to see whether the same holds there.
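
Because a video appears once for every day it trends, a "top N" taken directly from the sorted rows can list the same video several times. The sketch below shows the difference on hypothetical data; the `video_id` and `likes` column names follow the dataset, while the values are invented.

```python
import pandas as pd

# Hypothetical data: video 'a' trends on two days, so it occupies two rows.
df = pd.DataFrame({
    'video_id': ['a', 'a', 'b', 'c'],
    'likes':    [900, 950, 800, 100],
})

# Taking the top rows directly can return the same video more than once.
top2_rows = df.sort_values(by='likes', ascending=False)[:2]

# Dropping duplicate IDs after sorting keeps each video's most-liked row.
top2_videos = (df.sort_values(by='likes', ascending=False)
                 .drop_duplicates(subset='video_id')[:2])

print(top2_rows['video_id'].tolist())
print(top2_videos['video_id'].tolist())
```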

3) How long does it take to start trending?¶

a. US¶

us['days_to_trend'] = us['trending_date'] - us['publish_date']
us['days_to_trend'] = us['days_to_trend'].dt.days

# Keep the 15 most common day counts, then order them by day for plotting
days_to_trend = us.groupby('days_to_trend').count()['video_id'].to_frame()\
                .reset_index().sort_values('video_id', ascending=False)[:15]\
                .rename(columns={'days_to_trend':'Days Taken', 'video_id': 'Counts'}).reset_index(drop=True)
days_to_trend = days_to_trend.sort_values('Days Taken')

fig = px.bar(days_to_trend, x='Days Taken', y='Counts', color='Counts', height=500,
            labels={'Counts':'Number of Trending Videos'}, color_continuous_scale=px.colors.sequential.Blues)
fig.update_layout(template='none', title='Number of Days Taken for Videos to Trend in the US')
fig.update_yaxes(showgrid=False)
fig.show()

b. Canada¶

ca['days_to_trend'] = ca['trending_date'] - ca['publish_date']
ca['days_to_trend'] = ca['days_to_trend'].dt.days

# Keep the 10 most common day counts, then order them by day for plotting
days_to_trend_ca = ca.groupby('days_to_trend').count()['video_id'].to_frame()\
                .reset_index().sort_values('video_id', ascending=False)[:10]\
                .rename(columns={'days_to_trend':'Days Taken', 'video_id': 'Counts'}).reset_index(drop=True)
days_to_trend_ca = days_to_trend_ca.sort_values('Days Taken')

fig = px.bar(days_to_trend_ca, x='Days Taken', y='Counts', color='Counts', height=500,
            labels={'Counts':'Number of Trending Videos'}, color_continuous_scale=px.colors.sequential.OrRd)
fig.update_layout(template='none', title='Number of Days Taken for Videos to Trend in Canada')
fig.update_yaxes(showgrid=False, title='')
fig.show()

The charts show that it usually takes 2 to 6 days for videos to trend in the US. Interestingly, in Canada, although the number of trending videos is much higher than in the US, it usually takes only 1 to 2 days for videos to trend.
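
As a worked example of the days-to-trend calculation, the sketch below uses only the standard library and the values shown in the first row of the `us.head(2)` output. It assumes the publish timestamp always carries fractional seconds and a trailing `Z`, as that row does.

```python
from datetime import datetime

# Values copied from the first row of the us.head(2) output.
trending_raw = '17.14.11'                    # yy.dd.mm
published_raw = '2017-11-13T17:13:01.000Z'   # ISO 8601 timestamp in UTC

trending_date = datetime.strptime(trending_raw, '%y.%d.%m').date()
publish_date = datetime.strptime(published_raw, '%Y-%m-%dT%H:%M:%S.%fZ').date()

# Whole days between the publish date and the trending date.
days_to_trend = (trending_date - publish_date).days
print(days_to_trend)
```

The time of day is discarded before subtracting, so a video published late on one day and trending the next counts as one day.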

4) Published Time of Trending Videos¶

# Hour of day (0-23) at which each video was published; publish_time holds time objects in UTC
us['published_hour'] = us['publish_time'].apply(lambda t: t.hour)
published_hour_us = us.groupby('published_hour').count()['video_id'].to_frame().reset_index()
# Format each hour as a 12-hour clock label, e.g. 05:00 PM
published_hour_us['published_hour'] = pd.to_datetime(published_hour_us['published_hour'], unit='h').dt.strftime('%I:%M %p')
published_hour_us = published_hour_us.rename(columns={'published_hour': 'Published Time', 'video_id': 'Counts'})

# Same hourly counts for the Canadian data
ca['published_hour'] = ca['publish_time'].apply(lambda t: t.hour)
published_hour_ca = ca.groupby('published_hour').count()['video_id'].to_frame().reset_index()
published_hour_ca['published_hour'] = pd.to_datetime(published_hour_ca['published_hour'], unit='h').dt.strftime('%I:%M %p')
published_hour_ca = published_hour_ca.rename(columns={'published_hour': 'Published Time', 'video_id': 'Counts'})

fig = go.Figure()
fig.add_trace(go.Scatter(x=published_hour_us['Published Time'], y=published_hour_us['Counts'],
                         mode='lines', name='US', line=dict(color='royalblue'),
                         hovertemplate='Counts: %{y}'))
fig.add_trace(go.Scatter(x=published_hour_ca['Published Time'], y=published_hour_ca['Counts'],
                         mode='lines', name='Canada', line=dict(color='firebrick'), 
                         hovertemplate='Counts: %{y}'))
fig.update_layout(template='none', title='Published Time of Trending Videos in US and Canada',
                 yaxis_title='Number of Videos')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()
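
The chart above groups videos by the hour at which they were published. Since the raw timestamps end in `Z`, those hours are in UTC rather than local time. The sketch below shows the counting and labelling steps with the standard library only; the `publish_times` list is hypothetical and stands in for the `publish_time` column.

```python
from collections import Counter
from datetime import time

# Hypothetical publish times (UTC), standing in for the publish_time column.
publish_times = [time(17, 13, 1), time(7, 30, 0), time(17, 45, 0), time(0, 5, 0)]

# Count videos per hour of day, ignoring minutes and seconds.
per_hour = Counter(t.hour for t in publish_times)

# Turn each hour into a 12-hour clock label, in chronological order.
labels = {time(hour).strftime('%I:%M %p'): n for hour, n in sorted(per_hour.items())}
print(labels)
```

Sorting by the numeric hour before formatting matters, because the formatted labels do not sort chronologically as strings.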

5) Most Viewed Videos in Each Category¶

a. US¶

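# Select 'views' before aggregating so sum() only touches that numeric column.
# A video that trended on several days has one row per day, and those rows are added together.
us_grouped = us.groupby(['category_id', 'title'])[['views']].sum()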
us_grouped = us_grouped.reset_index().sort_values(['views']).drop_duplicates('category_id', keep='last')

fig = px.sunburst(us_grouped, path=['category_id', 'title'], values='views', height=700,
                 color='views', color_continuous_scale='PuBu')
fig.update_layout(title='Most Viewed Videos in the US')
fig.show()

b. Canada¶

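# Select 'views' before aggregating so sum() only touches that numeric column.
# A video that trended on several days has one row per day, and those rows are added together.
ca_grouped = ca.groupby(['category_id', 'title'])[['views']].sum()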
ca_grouped = ca_grouped.reset_index().sort_values(['views']).drop_duplicates('category_id', keep='last')

fig = px.sunburst(ca_grouped, path=['category_id', 'title'], values='views', height=700,
                 color='views', color_continuous_scale='PuRd')
fig.update_layout(title='Most Viewed Videos in Canada')
fig.show()

Interestingly, in both countries, Childish Gambino's "This is America" music video has the most views in the Music category.
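
Picking the single most viewed title per category can be written in more than one way. The sketch below compares the sort-then-drop-duplicates approach used in this section with a `groupby(...).idxmax()` alternative. The `grouped` frame is hypothetical data, and the alternative is offered as an illustration, not as the notebook's method.

```python
import pandas as pd

# Hypothetical per-title view totals, standing in for the grouped frame.
grouped = pd.DataFrame({
    'category_id': ['Music', 'Music', 'Comedy', 'Comedy'],
    'title':       ['song A', 'song B', 'sketch A', 'sketch B'],
    'views':       [500, 900, 300, 200],
})

# Approach used above: sort ascending, then keep the last row of each category.
by_sort = grouped.sort_values('views').drop_duplicates('category_id', keep='last')

# Alternative: idxmax returns, per category, the row label of the largest views value.
by_idxmax = grouped.loc[grouped.groupby('category_id')['views'].idxmax()]

top_by_category = dict(zip(by_idxmax['category_id'], by_idxmax['title']))
print(top_by_category)
```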

6) Most Disliked, Controversial Videos¶

a. US¶

most_disliked = us.sort_values(by='dislikes', ascending=False)[:50].drop_duplicates(subset='video_id')
most_disliked = most_disliked.rename(columns={'title': 'Title', 'channel_title': 'Channel Title',
                              'dislikes': 'Number of Dislikes'})

fig = go.Figure(data=[go.Table(header=dict(values=list(most_disliked[['Channel Title', 'Title', 'Number of Dislikes']]),
                                           fill_color='gainsboro', align='left'),
                               cells=dict(values=[most_disliked['Channel Title'], most_disliked['Title'],
                                                  most_disliked['Number of Dislikes']],
                                          fill_color='aliceblue', align='left'))])
fig.update_layout(title='Most Disliked Videos on US YouTube')
fig.show()

By dislike percentage, calculated as dislikes / (likes + dislikes) * 100¶

us['dislike_percentage'] = us['dislikes']/(us['likes'] + us['dislikes'])
us_over_million_view = us.loc[us['views'] > 1000000]
videos_disliked = us_over_million_view.sort_values(by='dislike_percentage', ascending=False)
videos_disliked = videos_disliked.dropna(subset=['dislike_percentage'])

top10_videos_disliked = videos_disliked.drop_duplicates(subset='title')[:10]
top10_videos_disliked = top10_videos_disliked.rename(columns={'title': 'Title', 'channel_title': 'Channel Title', 
                                                              'category_id': 'Category',
                                                              'dislike_percentage': 'Percentage of Dislikes'})
# Format the ratio as a whole-number percentage string, e.g. 0.57 -> '57%'
top10_videos_disliked['Percentage of Dislikes'] = top10_videos_disliked['Percentage of Dislikes'].map('{:.0%}'.format)

fig = go.Figure(data=[go.Table(header=dict(values=list(top10_videos_disliked[['Category', 'Channel Title', 'Title', 
                                                                              'Percentage of Dislikes']]),
                                           fill_color='gainsboro', align='left'),
                               cells=dict(values=[top10_videos_disliked['Category'], top10_videos_disliked['Channel Title'],
                                                  top10_videos_disliked['Title'], top10_videos_disliked['Percentage of Dislikes']],
                                          fill_color='aliceblue', align='left'))])
fig.update_layout(title='Videos with the Highest Dislike Percentage on US YouTube')
fig.show()
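
To make the dislike-percentage arithmetic concrete, here is a small worked example using the like and dislike counts from the first row of the `us.head(2)` output. The helper name `dislike_ratio` is hypothetical, and returning `None` for videos with no ratings is an assumption made for this sketch. The notebook itself drops those rows with `dropna`.

```python
def dislike_ratio(likes, dislikes):
    """Share of dislikes among all ratings, or None when there are no ratings."""
    total = likes + dislikes
    return dislikes / total if total else None

# Values copied from the first row of the us.head(2) output.
ratio = dislike_ratio(57527, 2966)

# The '%' format spec multiplies by 100 itself and avoids floating-point residue
# that can appear when a rounded ratio is multiplied by 100 by hand.
label = '{:.1%}'.format(ratio)
print(label)
```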

b. Canada¶

most_disliked_ca = ca.sort_values(by='dislikes', ascending=False)[:30].drop_duplicates(subset='video_id')
most_disliked_ca = most_disliked_ca.rename(columns={'title': 'Title', 'channel_title': 'Channel Title',
                              'dislikes': 'Number of Dislikes'})

fig = go.Figure(data=[go.Table(header=dict(values=list(most_disliked_ca[['Channel Title', 'Title', 'Number of Dislikes']]),
                                           fill_color='gainsboro', align='left'),
                               cells=dict(values=[most_disliked_ca['Channel Title'], most_disliked_ca['Title'],
                                                  most_disliked_ca['Number of Dislikes']],
                                          fill_color='#FAD5D9', align='left'))])
fig.update_layout(title='Most Disliked Videos on Canadian YouTube')
fig.show()

By dislike percentage, calculated as dislikes / (likes + dislikes) * 100¶

ca['dislike_percentage'] = ca['dislikes']/(ca['likes'] + ca['dislikes'])
ca_over_million_view = ca.loc[ca['views'] > 1000000]
videos_disliked_ca = ca_over_million_view.sort_values(by='dislike_percentage', ascending=False)
videos_disliked_ca = videos_disliked_ca.dropna(subset=['dislike_percentage'])

top10_videos_disliked_ca = videos_disliked_ca.drop_duplicates(subset='title')[:10]
top10_videos_disliked_ca = top10_videos_disliked_ca.rename(columns={'title': 'Title', 'channel_title': 'Channel Title', 
                                                              'category_id': 'Category',
                                                              'dislike_percentage': 'Percentage of Dislikes'})
# Format the ratio as a whole-number percentage string, e.g. 0.57 -> '57%'
top10_videos_disliked_ca['Percentage of Dislikes'] = top10_videos_disliked_ca['Percentage of Dislikes'].map('{:.0%}'.format)

fig = go.Figure(data=[go.Table(header=dict(values=list(top10_videos_disliked_ca[['Category', 'Channel Title', 'Title', 
                                                                              'Percentage of Dislikes']]),
                                           fill_color='gainsboro', align='left'),
                               cells=dict(values=[top10_videos_disliked_ca['Category'], top10_videos_disliked_ca['Channel Title'],
                                                  top10_videos_disliked_ca['Title'], top10_videos_disliked_ca['Percentage of Dislikes']],
                                          fill_color='#FAD5D9', align='left'))])
fig.update_layout(title='Videos with the Highest Dislike Percentage on Canadian YouTube')
fig.show()

3. Word Cloud¶

People & Blogs¶

people_title_us = us.loc[us['category_id'] == 'People & Blogs']['title']

stopwords = set(STOPWORDS)
# Lowercase every title and join them into one space-separated string for the word cloud
comment_words = ' '.join(str(val).lower() for val in people_title_us)

wordcloud_us = WordCloud(width=700, height=700, background_color ='white', 
                      stopwords=stopwords, min_font_size=10).generate(comment_words)

people_title_ca = ca.loc[ca['category_id'] == 'People & Blogs']['title']

stopwords_ca = set(STOPWORDS)
# Lowercase every title and join them into one space-separated string for the word cloud
comment_words_ca = ' '.join(str(val).lower() for val in people_title_ca)

wordcloud_ca = WordCloud(width=700, height=700, background_color ='white', 
                      stopwords=stopwords_ca, min_font_size=10).generate(comment_words_ca)

fig, axs = plt.subplots(1, 2, figsize=(20, 10))
fig.suptitle('Most Frequent Words in Titles in the People & Blogs Category', fontsize=30)

axs[0].imshow(wordcloud_us) 
axs[0].axis("off")
axs[0].set_title('US', fontsize=15)
axs[1].imshow(wordcloud_ca) 
axs[1].axis("off")
axs[1].set_title('Canada', fontsize=15)
  
plt.show()

Interestingly, Canadian titles mention US politics more than US titles do! In particular, TRUMP appears in large letters on the Canadian side, along with the names of comedians such as Joe Rogan. In both charts, we can find Frontal On, which is a news satire TV program.
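
A word cloud draws more frequent words larger, so the same comparison can be sanity-checked with plain word counts. The sketch below uses only the standard library. The `titles` list and the tiny `stopwords` set are hypothetical, whereas the notebook uses the People & Blogs titles and the `STOPWORDS` set from `wordcloud`.

```python
from collections import Counter

# Hypothetical titles and a tiny hypothetical stopword list.
titles = ['Trump Reacts to the News', 'The News Tonight', 'My Morning Routine']
stopwords = {'the', 'to', 'my'}

# Lowercase, split on whitespace, and drop stopwords before counting.
words = [w for title in titles for w in title.lower().split() if w not in stopwords]
frequencies = Counter(words)

print(frequencies.most_common(3))
```

`WordCloud` also counts two-word collocations by default, which may be why a phrase can show up as a single unit in the image.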

4. Conclusion¶

This analysis turned up a few exciting findings. The most viewed and most liked videos are mostly music videos. Music videos by BTS, a Korean boy band, went viral in both the US and Canada.

Additionally, it was fascinating to see that the most disliked videos on US YouTube were mainly daily-life videos, whereas the most disliked videos in Canada were mainly about politics and news. I also found Korean news among the top 10 most disliked videos in Canada.

Lastly, the word cloud was very fun. It seems that Canadians are very interested in American politics (I mean, who would not be? The 2020 presidential debate was so much fun!)

	category_id	views	likes	dislikes	comment_count
count	40949.000000	4.094900e+04	4.094900e+04	4.094900e+04	4.094900e+04
mean	19.972429	2.360785e+06	7.426670e+04	3.711401e+03	8.446804e+03
std	7.568327	7.394114e+06	2.288853e+05	2.902971e+04	3.743049e+04
min	1.000000	5.490000e+02	0.000000e+00	0.000000e+00	0.000000e+00
25%	17.000000	2.423290e+05	5.424000e+03	2.020000e+02	6.140000e+02
50%	24.000000	6.818610e+05	1.809100e+04	6.310000e+02	1.856000e+03
75%	25.000000	1.823157e+06	5.541700e+04	1.938000e+03	5.755000e+03
max	43.000000	2.252119e+08	5.613827e+06	1.674420e+06	1.361580e+06

	video_id	trending_date	title	channel_title	category_id	publish_time	tags	views	likes	dislikes	comment_count	thumbnail_link	comments_disabled	ratings_disabled	video_error_or_removed	description
0	2kyS6SvSYSE	17.14.11	WE WANT TO TALK ABOUT OUR MARRIAGE	CaseyNeistat	22	2017-11-13T17:13:01.000Z	SHANtell martin	748374	57527	2966	15954	https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg	False	False	False	SHANTELL'S CHANNEL - https://www.youtube.com/s...
1	1ZAPwfrtAFY	17.14.11	The Trump Presidency: Last Week Tonight with J...	LastWeekTonight	24	2017-11-13T07:30:00.000Z	last week tonight trump presidency\|"last week ...	2418783	97185	6146	12703	https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg	False	False	False	One year after the presidential election, John...
