title

Table of Contents

1. Data Cleaning

2. Visualization and Analysis

2.1 Movies and shows by released years
2.2 Countries with the most movies/shows
2.3 Casts with the most movies/shows
2.4 Movie Ratings
2.5 Movie Ratings in IMDb
2.6 Duration of Movies
2.7 Oldest/Newest Shows
2.8 Word Cloud

3. Machine Learning: Recommendation System

4. Conclusion




1. Data Cleaning

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
In [79]:
netflix = pd.read_csv('netflix.csv')
In [3]:
netflix.head(2)
Out[3]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
In [4]:
netflix.shape
Out[4]:
(6234, 12)
In [5]:
netflix.describe()
Out[5]:
show_id release_year
count 6.234000e+03 6234.00000
mean 7.670368e+07 2013.35932
std 1.094296e+07 8.81162
min 2.477470e+05 1925.00000
25% 8.003580e+07 2013.00000
50% 8.016337e+07 2016.00000
75% 8.024489e+07 2018.00000
max 8.123573e+07 2020.00000
In [6]:
netflix.columns
Out[6]:
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
In [7]:
netflix.dtypes
Out[7]:
show_id          int64
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

Let's drop the show_id column, since the DataFrame index already identifies each row. The description column is kept because the recommendation system in Section 3 needs it.

In [80]:
netflix.drop(columns=['show_id'], inplace=True)

Next, we will split the DataFrame into movies and shows to simplify the analysis. Here is how many of each there are.

In [9]:
netflix['type'].value_counts()
Out[9]:
Movie      4265
TV Show    1969
Name: type, dtype: int64

Then we convert date_added to a datetime type for later use.

In [10]:
netflix['date_added'] = pd.to_datetime(netflix['date_added'])
In [11]:
# Take explicit copies so later in-place changes do not raise SettingWithCopyWarning
movies = netflix.loc[netflix['type'] == 'Movie'].copy()
shows = netflix.loc[netflix['type'] == 'TV Show'].copy()
In [12]:
movies.drop(columns=['type'], inplace=True)
shows.drop(columns=['type'], inplace=True)
In [13]:
movies.reset_index(drop=True, inplace=True)
shows.reset_index(drop=True, inplace=True)
In [14]:
movies['duration'] = movies['duration'].str.replace('min', '')
movies['duration'] = movies['duration'].astype(int)
In [15]:
movies.head(2)
Out[15]:
title director cast country date_added release_year rating duration listed_in
0 Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China 2019-09-09 2019 TV-PG 90 Children & Family Movies, Comedies
1 Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom 2016-09-09 2016 TV-MA 94 Stand-Up Comedy
In [16]:
shows.head(2)
Out[16]:
title director cast country date_added release_year rating duration listed_in
0 Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States 2018-09-08 2013 TV-Y7-FV 1 Season Kids' TV
1 Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States 2018-09-08 2016 TV-Y7 1 Season Kids' TV
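
The cleaning steps above can be sketched end to end on a few made-up rows (the `toy` frame and its values are illustrative, not taken from the dataset). This is only a sketch under the assumption that movie durations always start with a number of minutes; it takes explicit copies when splitting, which is one common way to avoid pandas' SettingWithCopyWarning.

```python
import pandas as pd

# Made-up rows that mimic the dataset's columns (not real data).
toy = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie'],
    'title': ['A', 'B', 'C'],
    'date_added': ['September 9, 2019', 'not a date', 'January 1, 2020'],
    'duration': ['90 min', '1 Season', '94 min'],
})

# Unparseable dates become NaT instead of raising an error.
toy['date_added'] = pd.to_datetime(toy['date_added'], errors='coerce')

# .copy() gives independent frames, so later assignments do not warn.
toy_movies = toy.loc[toy['type'] == 'Movie'].drop(columns=['type']).copy()
toy_shows = toy.loc[toy['type'] == 'TV Show'].drop(columns=['type']).copy()

# Pull out the leading number of minutes and store it as an integer.
toy_movies['duration'] = (
    toy_movies['duration'].str.extract(r'(\d+)', expand=False).astype(int)
)
toy_movies = toy_movies.reset_index(drop=True)
```

Extracting the digits with a regular expression is slightly more forgiving than replacing the literal text 'min', because it does not depend on the exact spacing around the unit.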

2. Visualization and Analysis

1) Movies and Shows by Release Year and Added Date

a. Movies and shows by release year

In [17]:
movies_release_year = movies.groupby('release_year').count()['title'].to_frame()\
                .reset_index().rename(columns={'release_year':'Release Year', 'title':'Count'})
shows_release_year = shows.groupby('release_year').count()['title'].to_frame()\
                .reset_index().rename(columns={'release_year':'Release Year', 'title':'Count'})
In [18]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=movies_release_year['Release Year'], y=movies_release_year['Count'],
                    mode='lines',
                    name='Movies'))
fig.add_trace(go.Scatter(x=shows_release_year['Release Year'], y=shows_release_year['Count'],
                    mode='lines',
                    name='Shows'))

fig.update_layout(template='none', title='Movies and Shows by Release Year',
                 width=950, height=500)
fig.update_yaxes(showgrid=False)
fig.update_xaxes(showgrid=False)

fig.show()

Most of the movies and shows on Netflix were released between 2000 and 2020. There are far fewer titles released before 1980.
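
To put a number on an observation like this, the share of titles in a given period can be computed directly. The sketch below uses made-up release years (the `release_year` Series is a stand-in, not the real column), so the resulting shares say nothing about the actual catalogue.

```python
import pandas as pd

# Made-up release years standing in for netflix['release_year'].
release_year = pd.Series([1975, 1999, 2005, 2016, 2018, 2019, 2019, 2020])

# Titles per year in chronological order (the quantity a line chart would plot).
per_year = release_year.value_counts().sort_index()

# Fraction of titles released in 2000 or later, and before 1980.
share_since_2000 = (release_year >= 2000).mean()
share_before_1980 = (release_year < 1980).mean()
```

Taking the mean of a boolean Series works because True counts as 1 and False as 0.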

b. Movies and shows by added date

In [19]:
movies['date_added'] = pd.to_datetime(movies['date_added'], errors='coerce').dt.strftime('%Y-%m')
shows['date_added'] = pd.to_datetime(shows['date_added'], errors='coerce').dt.strftime('%Y-%m')
In [20]:
movies_date_added = movies.groupby('date_added').count()['title']\
            .to_frame().reset_index().rename(columns = {'date_added':'Date Added', 'title':'Count'})
shows_date_added = shows.groupby('date_added').count()['title']\
            .to_frame().reset_index().rename(columns = {'date_added':'Date Added', 'title':'Count'})
In [21]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=movies_date_added['Date Added'], y=movies_date_added['Count'],
                    mode='lines',
                    name='Movies'))
fig.add_trace(go.Scatter(x=shows_date_added['Date Added'], y=shows_date_added['Count'],
                    mode='lines',
                    name='Shows'))

fig.update_layout(template='none', title='Movies and Shows by Added Date',
                 width=950, height=500)
fig.update_yaxes(showgrid=False)
fig.update_xaxes(showgrid=False)

fig.show()

Netflix has added a large number of shows and movies since 2016, even though, according to Netflix, it started streaming in 2007.
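
Counting titles per month can also be done without turning the dates into strings. The following is a minimal sketch on made-up dates (the `date_added` Series is a stand-in for the real column); it keeps a real date type on the axis, which is an alternative to the `strftime('%Y-%m')` approach used above rather than a correction of it.

```python
import pandas as pd

# Made-up added dates standing in for movies['date_added'].
date_added = pd.to_datetime(pd.Series(
    ['2016-01-03', '2016-01-20', '2016-02-11', None, '2019-09-09']
))

# Bucket each date into its calendar month; missing dates are dropped first.
month = date_added.dropna().dt.to_period('M')
per_month = month.value_counts().sort_index()

# Convert the period index back to timestamps (first day of each month).
per_month.index = per_month.index.to_timestamp()
```

Months with no additions do not appear in `per_month`, so a line chart will connect the neighbouring months directly.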

2) Countries with the Most Movies and Shows

In [22]:
# rename_axis/reset_index(name=...) yields the same column names on old and new pandas versions
netflix_country = netflix['country'].str.split(', ').explode()\
                .value_counts()[:15].rename_axis('Country').reset_index(name='Count')
In [23]:
fig = px.scatter_geo(netflix_country, locations="Country", color="Count",
                     locationmode='country names', size_max=50,
                     hover_name="Country", size="Count",
                     projection="natural earth", color_continuous_scale=px.colors.diverging.BrBG,
                     title='Top 15 Countries with the Most Movies and Shows on Netflix')
fig.show()

a. Countries with the most movies

In [24]:
movies_country = movies['country'].str.split(', ').explode().value_counts()[:15]\
            .rename_axis('Country').reset_index(name='Number of Movies')
In [25]:
fig = px.scatter_geo(movies_country, locations="Country", color="Number of Movies",
                     locationmode='country names', size_max=50,
                     hover_name="Country", size="Number of Movies",
                     projection="natural earth", color_continuous_scale=px.colors.diverging.BrBG,
                     title='Top 15 Countries with the Most Movies on Netflix')


fig.show()

b. Countries with the most shows

In [26]:
shows_country = shows['country'].str.split(', ').explode().value_counts()[:15]\
            .rename_axis('Country').reset_index(name='Number of Shows')
In [27]:
fig = px.scatter_geo(shows_country, locations="Country", color="Number of Shows",
                     locationmode='country names', size_max=50,
                     hover_name="Country", size="Number of Shows",
                     projection="natural earth", color_continuous_scale=px.colors.diverging.BrBG,
                     title='Top 15 Countries with the Most Shows on Netflix')
fig.show()

The United States has the most shows and movies on Netflix, but India has the second most, which I did not expect.
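
Because one title can list several countries, the counts above come from splitting each entry and counting every country separately. A small sketch on made-up values (the `country` Series is illustrative only) shows what each step does; note that a co-production is counted once for every country it lists.

```python
import pandas as pd

# Made-up rows; a title can list several countries or none at all.
country = pd.Series(
    ['United States, India', 'India', None, 'United States', 'Japan, United States']
)

country_counts = (
    country.str.split(', ')      # 'United States, India' -> ['United States', 'India']
           .explode()            # one row per (title, country) pair
           .value_counts()       # missing values are ignored by default
           .rename_axis('Country')
           .reset_index(name='Count')
)
```

The same pattern applies to the cast column, since it is also a comma-separated string.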

3) Casts with the Most Movies and Shows on Netflix

a. Casts with the most movies

In [28]:
movies_casts = movies['cast'].str.split(', ').explode().value_counts()[:10]\
            .rename_axis('Cast').reset_index(name='Number of Movies')
In [29]:
fig = px.bar(movies_casts, x="Number of Movies", y="Cast", color='Cast', orientation='h',
            template='none', width=900, height=400, color_discrete_sequence=px.colors.qualitative.Vivid,
            title='Top 10 Cast Members with the Most Movies on Netflix')
fig.update_layout(yaxis={'categoryorder':'total ascending'},
                  margin=dict(l=130, r=10, t=100))
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showticklabels=True, title_text=None)
fig.show()

b. Casts with the most shows

In [30]:
shows_casts = shows['cast'].str.split(', ').explode().value_counts()[:10]\
            .rename_axis('Cast').reset_index(name='Number of Shows')
In [31]:
fig = px.bar(shows_casts, x="Number of Shows", y="Cast", color='Cast', orientation='h',
            template='none', width=900, height=400, color_discrete_sequence=px.colors.qualitative.Vivid,
            title='Top 10 Cast Members with the Most Shows on Netflix')
fig.update_layout(yaxis={'categoryorder':'total ascending'},
                 margin=dict(l=130, r=10, t=100))
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showticklabels=True, title_text=None)
fig.show()

Although the United States has the most shows and movies on Netflix, the cast member with the most movies is Indian. This may suggest that Indian films on Netflix draw on a smaller pool of actors than U.S. films do. Additionally, the cast member with the most shows is Japanese.

4) Movie Ratings

In [32]:
movie_ratings = movies.groupby('rating').count()['title'].sort_values(ascending=False)\
                .to_frame().reset_index().rename(columns = {'rating':'Rating', 'title':'Count'})
In [33]:
fig = px.pie(movie_ratings, values='Count', names='Rating', title='Netflix Movie Ratings',
            color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

TV-MA and TV-14 are the most common ratings for movies on Netflix.
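
The shares behind a pie chart like this can be read off as numbers with `value_counts(normalize=True)`. This is a sketch on made-up ratings (the `rating` Series is a stand-in for the real column), so the shares are illustrative only.

```python
import pandas as pd

# Made-up ratings standing in for movies['rating'].
rating = pd.Series(['TV-MA', 'TV-14', 'TV-MA', 'R', 'TV-14', 'TV-MA', 'PG-13', 'TV-PG'])

# Fraction of titles per rating, largest first.
rating_share = rating.value_counts(normalize=True)

# Combined share of the two most common ratings.
top_two_share = rating_share.head(2).sum()
```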

5) Best Rated Movies in Netflix using IMDb Ratings

In [34]:
ratings = pd.read_csv('IMDb ratings.csv', usecols=['weighted_average_vote'])
movie_ratings = pd.read_csv('IMDb movies.csv', usecols=['title', 'year', 'genre'])
imdb = pd.DataFrame({'title':movie_ratings.title, 'release_year':movie_ratings.year,
                    'rating':ratings.weighted_average_vote, 'listed_in':movie_ratings.genre})
imdb.drop_duplicates(subset=['title','release_year','rating'], inplace=True)
In [35]:
movies_rated = movies.merge(imdb, on='title', how='inner')
movies_rated.drop(columns=['release_year_y', 'listed_in_y'], inplace=True)
In [36]:
movies_rated = movies_rated.rename(columns={'title': 'Title', 'director': 'Director', 'cast': 'Cast',
                             'date_added': 'Date Added', 'release_year_x': 'Release Year', 'country': 'Country',
                             'rating_x': 'TV Rating', 'duration': 'Duration',
                             'listed_in_x': 'Genre', 'rating_y':'Rating'})
top_10 = movies_rated.sort_values(by='Rating', ascending=False)[:10]
In [37]:
fig = px.sunburst(top_10, path=['Title','Country'], values='Rating', color='Rating',
                  color_continuous_scale='YlGn', title='Highest Rated Movies on Netflix by Country',
                  hover_data=['Title', 'Country', 'Rating'])

fig.update_traces(hovertemplate='Title: %{customdata[0]} <br>Country: %{customdata[1]} <br>Rating: %{customdata[2]}')
fig.update_layout(legend_title_text='Rating')
fig.show()
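
The join in this section matches Netflix and IMDb entries on title alone, so two different films that share a title can be paired with each other's rating. A possible refinement, sketched here on made-up rows (both toy frames and their values are hypothetical), is to join on title and release year together. It assumes the year columns in both frames have the same integer type, which may not hold for the raw IMDb file without cleaning.

```python
import pandas as pd

# Made-up rows: two different films share the title 'Hero'.
toy_movies = pd.DataFrame({
    'title': ['Hero', 'Solo Film'],
    'release_year': [2019, 2015],
})
toy_imdb = pd.DataFrame({
    'title': ['Hero', 'Hero', 'Solo Film'],
    'release_year': [2002, 2019, 2015],
    'rating': [7.9, 5.1, 6.4],
})

# Joining on title alone pairs the 2019 film with both IMDb entries.
by_title = toy_movies.merge(toy_imdb, on='title', how='inner')

# Joining on title and year keeps only the entry from the same year.
by_title_year = toy_movies.merge(toy_imdb, on=['title', 'release_year'], how='inner')
```

The stricter join trades some coverage for accuracy: titles whose release year differs between the two sources would be dropped.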

6) Duration of Movies

In [38]:
fig = px.histogram(movies, x='duration', nbins=22, template='none', 
                   title='Netflix Movie Duration')
fig.update_yaxes(showgrid=False)

fig.show()

The most common movie duration is 80 to 90 minutes, though slightly longer movies are also common.
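
To check which duration range is the most common without reading it off the chart, the durations can be grouped into fixed-width bins. The sketch below uses made-up durations and 10-minute bins, both of which are assumptions for illustration; the histogram above chooses its own bin edges.

```python
import pandas as pd

# Made-up durations in minutes standing in for movies['duration'].
duration = pd.Series([45, 82, 85, 88, 91, 95, 104, 120, 150])

# 10-minute bins that include the left edge: [0, 10), [10, 20), ...
bins = list(range(0, 181, 10))
duration_bin = pd.cut(duration, bins=bins, right=False)

# Titles per bin in bin order, and the bin holding the most titles.
bin_counts = duration_bin.value_counts().sort_index()
most_common_bin = bin_counts.idxmax()
```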

7) Oldest/Newest Shows

In [69]:
us_shows = shows.loc[shows['country'] == 'United States']
us_shows_old = us_shows.sort_values(by='release_year')[:15]
us_shows_new = us_shows.sort_values(by='release_year', ascending=False)[:15]
In [64]:
fig = go.Figure(data=[go.Table(header=dict(values=['Title', 'Release Year'], 
                                           fill_color='paleturquoise'), 
                               cells=dict(values=[us_shows_old['title'], us_shows_old['release_year']], 
                                          fill_color='lavender'))])
fig.update_layout(title='Oldest American Shows on Netflix')
fig.show()
In [74]:
fig = go.Figure(data=[go.Table(header=dict(values=['Title', 'Release Year'], 
                                           fill_color='paleturquoise'), 
                               cells=dict(values=[us_shows_new['title'], us_shows_new['release_year']], 
                                          fill_color='lavenderblush'))])
fig.update_layout(title='Newest American Shows on Netflix')
fig.show()

8) Word Cloud

In [39]:
genres = list(movies['listed_in'])
genre = []

for i in genres:
    i = list(i.split(','))
    for j in i:
        genre.append(j.replace(' ', ''))
gen = Counter(genre)
In [43]:
text = list(set(genre))
plt.rcParams['figure.figsize'] = (15, 10)
wordcloud = WordCloud(max_font_size=50, max_words=50,background_color="white").generate(str(text))

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

The most prominent genres in the word cloud are independent movies, thrillers, stand-up comedy and classic movies. Children's movies are also quite prominent.
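
The word cloud above is generated from the set of unique genre labels, so each label contributes once regardless of how many titles carry it. To see how often each genre actually occurs, the labels can be counted directly. This is a sketch on made-up `listed_in` values (not rows from the dataset).

```python
from collections import Counter

# Made-up 'listed_in' values (not real rows from the dataset).
listed_in = [
    'Dramas, International Movies',
    'Comedies, International Movies',
    'Dramas, Independent Movies',
    'Stand-Up Comedy',
]

# Split on commas and trim whitespace, keeping multi-word labels intact.
genre_counts = Counter(
    label.strip()
    for row in listed_in
    for label in row.split(',')
)

# The two most frequent genres with their counts.
top_genres = genre_counts.most_common(2)
```

A Counter like this could be passed to `WordCloud.generate_from_frequencies` so that font size reflects how often a genre occurs.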

3. Machine Learning

Recommendation System

I had not built a recommendation system with scikit-learn before, so I decided to learn how. This section follows Niharika Pandit's notebook: https://www.kaggle.com/niharika41298/netflix-visualizations-recommendation-eda

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [81]:
tfidf = TfidfVectorizer(stop_words='english')
netflix['description'] = netflix['description'].fillna('')
tfidf_matrix = tfidf.fit_transform(netflix['description'])
tfidf_matrix.shape
Out[81]:
(6234, 16151)

More than 16,000 distinct words are used to describe the 6,234 movies and shows.
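
To make the matrix less of a black box, here is a hand-rolled sketch of TF-IDF weighting on three made-up descriptions. It uses the smoothed idf formula and unit-length row scaling that, to my understanding, match scikit-learn's default settings, but it is simplified: it splits on whitespace and does not remove stop words or short tokens the way `TfidfVectorizer` does, so the numbers will not match the real matrix.

```python
import math
from collections import Counter

import numpy as np

# Made-up descriptions (not from the dataset).
docs = [
    'a detective hunts a killer',
    'a detective falls in love',
    'a comedian talks about love',
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: number of descriptions that contain each word.
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

# Term count times idf, then scale every row to unit length.
tfidf = np.array([[doc.count(w) * idf[w] for w in vocab] for doc in tokenized])
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
```

Each row is one description and each column one word, which is why the real matrix has one row per title and one column per distinct word.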

In [82]:
from sklearn.metrics.pairwise import linear_kernel
cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)
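# Map each title to its row position, keeping the first row when a title is repeated
indices = pd.Series(netflix.index, index=netflix['title'])
indices = indices[~indices.index.duplicated(keep='first')]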
In [87]:
def get_recommendations(title, cosine_sim=cosine_similarity):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]

    movie_indices = [i[0] for i in sim_scores]

    return netflix['title'].iloc[movie_indices]
In [88]:
get_recommendations('Jane The Virgin')
Out[88]:
1036                                           The Lovers
2993                                          How It Ends
5144                                           Love Storm
5441    Club Friday To Be Continued - My Beautiful Tomboy
1938                                            Champions
4841                                       One Last Thing
1564                                         Kaleidoscope
2715                                          B.A. Pass 2
4436                                           White Girl
2702                                 Life in the Doghouse
Name: title, dtype: object
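
The lookup above works because TF-IDF rows are scaled to unit length, so a plain dot product between two rows equals their cosine similarity. The sketch below reproduces that idea with NumPy on a made-up feature matrix (the titles, values and the `most_similar` helper are all hypothetical). It excludes the queried row by index rather than by skipping the first result, which is a small design choice that stays correct even if another row ties with a similarity of 1.

```python
import numpy as np

titles = np.array(['Show A', 'Show B', 'Show C', 'Show D'])

# Made-up feature rows standing in for the TF-IDF matrix.
features = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.9, 0.1],
    [0.0, 0.2, 1.0],
    [0.0, 0.0, 1.0],
])

# Scale rows to unit length; the dot product is then the cosine similarity.
unit = features / np.linalg.norm(features, axis=1, keepdims=True)
similarity = unit @ unit.T


def most_similar(idx, k=2):
    """Return the k titles most similar to row idx, excluding idx itself."""
    order = np.argsort(-similarity[idx])
    order = order[order != idx][:k]
    return titles[order].tolist()
```

For example, `most_similar(0)` ranks the other rows by how closely their feature vectors point in the same direction as the first row.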

4. Conclusion

As someone who watches Netflix regularly, I expected some of these results, but others surprised me. For example, I did not know there were so many Indian movies and shows, as I only watch shows and movies in English. Additionally, although Netflix started streaming in 2007, few titles were added at that time; most were added from 2015 onward. This has been a fun analysis, and I'm looking forward to watching more Netflix shows and movies!