Data Information
Column | Description |
---|---|
Rank | Ranking of overall sales |
Name | The game's name |
Platform | Platform of the game's release (e.g., PC, PS4) |
Year | Year of the game's release |
Genre | Genre of the game |
Publisher | Publisher of the game |
NA_Sales | Sales in North America (in millions) |
EU_Sales | Sales in Europe (in millions) |
JP_Sales | Sales in Japan (in millions) |
Other_Sales | Sales in the rest of the world (in millions) |
Global_Sales | Total worldwide sales (in millions) |
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from collections import Counter
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
import tensorflow as tf
vg_2016 = pd.read_csv('vgsales.csv')
vg_2019 = pd.read_csv('vgsales_2019.csv')
ps4 = pd.read_csv('ps4.csv', header=0, encoding='unicode_escape')
xbox = pd.read_csv('xboxone.csv', header=0, encoding='unicode_escape')
vg_2016.head()
vg_2016.describe()
vg_2016.columns
vg_2016.corr()
vg_2016.isna().sum()
There are lots of NULL values in Year and Publisher. Let's check the 2019 data as well.
vg_2019.isna().sum()
What we can do is take the average release year of the games on each platform. For example, if the PS2 launched in 2000 and the PS3 launched in 2006, then a PS2 game was most likely released between 2000 and 2006. So we will fill the missing years with the average year of all games on that platform.
platformYear = vg_2016.groupby('Platform').agg({'Year':'mean'}).reset_index().round(0)
platformYear['Year'] = platformYear['Year'].astype(int)
platformYear.head()
fillDict = platformYear.set_index('Platform')['Year'].to_dict()
vg_2016['Year'] = vg_2016['Year'].replace(np.nan, vg_2016['Platform'].map(fillDict))
vg_2019['Year'] = vg_2019['Year'].replace(np.nan, vg_2019['Platform'].map(fillDict))
Because the 2019 dataset is very large, filling the null values this way takes a long time. In the future I need to find a better way to fill them to reduce the processing time.
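A likely faster approach, sketched here as an assumption rather than something verified on this data, is to reuse the per-platform mapping and fill the missing values with a single vectorised fillna call:
# Hypothetical faster fill: fillna with an index-aligned Series built from fillDict
# (assumption: equivalent result to the replace calls above, but vectorised).
vg_2016['Year'] = vg_2016['Year'].fillna(vg_2016['Platform'].map(fillDict))
vg_2019['Year'] = vg_2019['Year'].fillna(vg_2019['Platform'].map(fillDict))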
vg_2016['Year'].isna().sum()
vg_2019['Year'].isna().sum()
vg_2019['Year'] = vg_2019['Year'].replace(np.nan, 0)
vg_2016['Year'] = vg_2016['Year'].astype(int)
vg_2019['Year'] = vg_2019['Year'].astype(int)
Unfortunately, there is not enough publisher data to work with. Since there is no good way to guess the publisher from the other columns, unlike the year, we will have to keep it as null for now.
vg_2016.loc[vg_2016['Publisher'].isna() == True].head()
I will drop the columns with many null values, as well as the columns that are not needed.
vg_2019.drop(columns=['basename', 'ESRB_Rating', 'Developer', 'Critic_Score', 'User_Score', 'Total_Shipped',
'VGChartz_Score', 'PAL_Sales', 'Last_Update', 'url', 'status', 'Vgchartzscore',
'img_url'], inplace=True)
vg_2019.head()
Because the 2016 and 2019 data use two different rank scales, we will bin each dataset separately into [Very High, High, Medium, Low, Very Low]. The two dataframes do not contain the same number of games, so this will not be very accurate, but we will continue with this method until I find a better way (one possible alternative is sketched after the code below).
vg_2016['Class'] = pd.cut(vg_2016['Rank'], bins=5, labels=['Very High', 'High', 'Medium', 'Low', 'Very Low'])
vg_2019['Class'] = pd.cut(vg_2019['Rank'], bins=5, labels=['Very High', 'High', 'Medium', 'Low', 'Very Low'])
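One possible better way, sketched here as an assumption and not used in the rest of this analysis, is to rank each dataset by Global_Sales and cut the percentile rank into five equal bands, so both datasets share the same 0-1 scale:
# Hedged alternative: derive the class from the Global_Sales percentile rank.
alt_labels = ['Very High', 'High', 'Medium', 'Low', 'Very Low']
for df in (vg_2016, vg_2019):
    pct = df['Global_Sales'].rank(pct=True, ascending=False)  # best sellers get the smallest percentiles
    df['Class_alt'] = pd.cut(pct, bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=alt_labels)
    # rows with a missing Global_Sales simply stay NaN here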
vg_2016.head()
vg_2019.head()
Create a new dataframe by appending 2019 data to 2016 data.
vg = pd.concat([vg_2016, vg_2019])
vg.head()
Now we will drop any duplicates using Name, Platform, Year, and Genre to avoid any mistakes.
vg = vg.drop_duplicates(subset=['Name', 'Platform', 'Year', 'Genre'])
vg.shape
vg.head()
Now we will add two more datasets: PS4 and Xbox One video games. I will clean these datasets for analysis.
ps4.head()
xbox.head()
First, we will rename the columns to match the video game sales data.
ps4.rename(columns={'Game':'Name', 'North America':'NA_Sales', 'Europe':'EU_Sales',
'Japan':'JP_Sales', 'Rest of World':'Other_Sales', 'Global':'Global_Sales'}, inplace=True)
xbox.rename(columns={'Pos':'Rank', 'Game':'Name', 'North America':'NA_Sales', 'Europe':'EU_Sales',
'Japan':'JP_Sales', 'Rest of World':'Other_Sales', 'Global':'Global_Sales'}, inplace=True)
Also, the PS4 data does not have a Rank column, but we can assume it is already sorted by rank: as you can see in the Xbox data, the Pos column is the rank, so this is a reasonable guess. Then we will add the class level based on this new rank.
ps4['Rank'] = ps4.index + 1
We also need to add a Platform column to both datasets.
ps4['Platform'] = 'PS4'
xbox['Platform'] = 'XOne'
We will fill the NULL values in Year using the platform year, as before. Unfortunately, we cannot guess the publisher, so we will leave it as null for now.
ps4.isna().sum()
xbox.isna().sum()
ps4['Year'] = ps4['Year'].replace(np.nan, ps4['Platform'].map(fillDict))
xbox['Year'] = xbox['Year'].replace(np.nan, xbox['Platform'].map(fillDict))
ps4['Year'] = ps4['Year'].astype(int)
xbox['Year'] = xbox['Year'].astype(int)
Now we will add the class to each dataset using the same method as for the video game sales dataset.
ps4['Class'] = pd.cut(ps4['Rank'], bins=5, labels=['Very High', 'High', 'Medium', 'Low', 'Very Low'])
xbox['Class'] = pd.cut(xbox['Rank'], bins=5, labels=['Very High', 'High', 'Medium', 'Low', 'Very Low'])
ps4.head()
xbox.head()
First, let's combine these two datasets.
consoles = pd.concat([ps4, xbox])
consoles.head()
consoles.shape
Now on to the total dataset.
total = pd.concat([vg, consoles])
total.head()
total = total.drop_duplicates(subset=['Name', 'Platform', 'Year', 'Genre'])
The final dataset is here!
total.head()
total.shape
total.isna().sum()
percent_missing = total.isnull().sum() * 100 / len(total)
missing_values = pd.DataFrame({'column_name': total.columns,
'percent_missing': percent_missing})
missing_values
total.sort_values(['Global_Sales', 'Rank'], ascending=[False, True], inplace=True)
corr = total.corr()
fig = go.Figure(data=go.Heatmap(z=corr, x=corr.index, y=corr.columns,
hoverongaps=False, colorscale="Reds"))
fig.update_layout(title='Correlations Between Columns')
fig.show()
fig = px.histogram(total, x="Platform",
title='Histogram of Platform',
opacity=0.8,
color_discrete_sequence=['indianred'])
fig.update_layout(template=None)
fig.show()
fig = px.histogram(total, x="Year",
                   title='Histogram of Year',
opacity=0.8,
color_discrete_sequence=['paleturquoise'])
fig.update_layout(template=None)
fig.update_xaxes(range=[1960, 2020])
fig.show()
fig = px.histogram(total, x="Genre",
title='Histogram of Genre',
opacity=0.8,
color_discrete_sequence=['mediumpurple'])
fig.update_layout(template=None)
fig.show()
top10all = total[:10]
fig = px.bar(top10all, x='Global_Sales', y='Name', orientation='h', color='Global_Sales',
color_continuous_scale="RdBu", labels={'Global_Sales':'Global Sales'})
fig.update_layout(template='none', yaxis={'categoryorder':'total ascending'},
title='Top 10 Video Games')
fig.update_xaxes(automargin=True, title_text='Global Sales',
showline=True, linewidth=2, mirror=True, showgrid=False)
fig.update_yaxes(automargin=True, title_text='', showline=True, linewidth=2, mirror=True)
fig.show()
fig = go.Figure()
fig.add_trace(go.Violin(y=total['NA_Sales'], name='North America'))
fig.add_trace(go.Violin(y=total['EU_Sales'], name='Europe'))
fig.add_trace(go.Violin(y=total['JP_Sales'], name='Japan'))
fig.add_trace(go.Violin(y=total['Other_Sales'], name='Others'))
fig.update_traces(meanline_visible=True)
fig.update_layout(template=None, violingap=0, violinmode='overlay',
title='Sales by Regions')
fig.show()
NA_top5 = total.sort_values(by='NA_Sales', ascending=False)[:5]
EU_top5 = total.sort_values(by='EU_Sales', ascending=False)[:5]
JP_top5 = total.sort_values(by='JP_Sales', ascending=False)[:5]
OT_top5 = total.sort_values(by='Other_Sales', ascending=False)[:5]
fig = make_subplots(rows=2, cols=2, shared_yaxes=True,
subplot_titles=("North America", "Europe", "Japan", "Others"))
fig.add_trace(go.Bar(x=NA_top5['Name'], y=NA_top5['NA_Sales'],
marker=dict(color=NA_top5['NA_Sales'], coloraxis="coloraxis")),
row=1, col=1)
fig.add_trace(go.Bar(x=EU_top5['Name'], y=EU_top5['EU_Sales'],
marker=dict(color=EU_top5['EU_Sales'], coloraxis="coloraxis")),
row=1, col=2)
fig.add_trace(go.Bar(x=JP_top5['Name'], y=JP_top5['JP_Sales'],
marker=dict(color=JP_top5['JP_Sales'], coloraxis="coloraxis")),
row=2, col=1)
fig.add_trace(go.Bar(x=OT_top5['Name'], y=OT_top5['Other_Sales'],
marker=dict(color=OT_top5['Other_Sales'], coloraxis="coloraxis")),
row=2, col=2)
fig.update_layout(coloraxis=dict(colorscale='Purples'),
showlegend=False, template=None, height=600,
title="Top 5 Video Games By Reigons")
fig.show()
genre_top5 = total.groupby('Genre')[['Name', 'Global_Sales']].apply(lambda x: x.nlargest(5, columns=['Global_Sales']))\
.reset_index().drop(columns='level_1')
fig = px.bar(genre_top5, x="Genre", y="Global_Sales", color="Name",
color_discrete_sequence=px.colors.qualitative.Safe)
fig.update_layout(template=None, xaxis={'categoryorder':'total descending'},
title='Top 5 Games by Genre')
fig.update_xaxes(title=None)
fig.update_yaxes(title=None)
fig.show()
top10_platforms = total['Platform'].value_counts()[:10].index.to_list()
top10_platforms_vg = total[total['Platform'].isin(top10_platforms)]
top5_genre = total['Genre'].value_counts()[1:6].index.to_list()
top10_platforms_vg = top10_platforms_vg[top10_platforms_vg['Genre'].isin(top5_genre)]
top5_vgs = top10_platforms_vg.groupby(['Platform','Genre']).apply(lambda x: x.nlargest(1, columns='Global_Sales')\
[['Name', 'Global_Sales']]).reset_index().drop(columns='level_2')
fig = px.sunburst(top5_vgs, path=['Platform', 'Genre', 'Name'], values='Global_Sales',
color='Global_Sales', color_continuous_scale='RdBu',
color_continuous_midpoint=np.average(top5_vgs['Global_Sales'], weights=top5_vgs['Global_Sales']),
height=800)
fig.update_layout(title='Top 5 Games by Platform and Genre')
fig.show()
We will refer to a generation chart I found online to separate the time periods.
Generation by Year
Generation Name | Births Start | Births End |
---|---|---|
Generation X | 1965 | 1979 |
Xennials | 1975 | 1985 |
Millennials | 1980 | 1994 |
Gen Z | 1995 | 2012 |
Gen Alpha | 2013 | 2025 |
def set_gen(row):
    value = '0'
    if 1965 <= row['Year'] <= 1979:
        value = 'Generation X'
    elif 1980 <= row['Year'] <= 1985:
        value = 'Xennials'
    elif 1986 <= row['Year'] <= 1994:
        value = 'Millennials'
    elif 1995 <= row['Year'] <= 2012:
        value = 'Gen Z'
    elif 2013 <= row['Year'] <= 2025:
        value = 'Gen Alpha'
    return value
total['Generation'] = total.apply(set_gen, axis=1)
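The row-wise apply works, but on a dataset this size a vectorised pd.cut with the same year boundaries should be much faster. A small equivalent sketch (my assumption, mirroring set_gen above):
# Hedged vectorised equivalent of set_gen, using the same generation boundaries.
gen_bins = [1964, 1979, 1985, 1994, 2012, 2025]
gen_labels = ['Generation X', 'Xennials', 'Millennials', 'Gen Z', 'Gen Alpha']
gen_alt = pd.cut(total['Year'], bins=gen_bins, labels=gen_labels)
total['Generation_alt'] = gen_alt.cat.add_categories(['0']).fillna('0')  # years outside the ranges fall back to '0'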
generation_sales = total.groupby('Generation').agg({'Global_Sales':'sum'}).reset_index()
generation_sales = generation_sales.reindex([3, 5, 4, 2, 1, 0])
generation_sales.reset_index(drop=True, inplace=True)
colors = ['lightslategray'] * 6
colors[3] = 'crimson'
genres=['Role-Playing', 'Action', 'Platform', 'Misc', 'Misc', 'Action']
fig = go.Figure(data=[go.Bar(x=generation_sales['Generation'],
y=generation_sales['Global_Sales'],
marker_color=colors,
hovertemplate='Years: %{text} <br>Total Sales:%{y}',
                             text=['1965-1979', '1980-1985', '1986-1994', '1995-2012', '2013-', 'Unknown'],
                             name='')])
fig.update_layout(title_text='Total Sales By Generations', template=None)
fig.show()
grouped = total.groupby(['Generation', 'Genre'])['Rank'].count().to_frame()
generations_genres = grouped.loc[grouped.groupby(level='Generation')['Rank'].idxmax()]
generations_genres = generations_genres.reset_index().reindex([3, 5, 4, 2, 1, 0])
generations_genres.reset_index(drop=True, inplace=True)
fig = px.sunburst(generations_genres, path=['Generation', 'Genre'],
color='Generation', color_discrete_sequence=px.colors.qualitative.Dark2,
height=700)
fig.update_layout(title='The most famous genre in each generation')
fig.show()
grouped_games = total.groupby(['Generation', 'Name']).agg({'Global_Sales':'max'}).sort_values(by='Global_Sales',
ascending=False)
generations_games = grouped_games.loc[grouped_games.groupby(level='Generation')['Global_Sales'].idxmax()]\
.reset_index().reindex([3, 5, 4, 2, 1, 0])
generations_games.reset_index(drop=True, inplace=True)
fig = px.sunburst(generations_games, path=['Generation', 'Name', 'Global_Sales'],
color='Generation', color_discrete_sequence=px.colors.qualitative.Dark2,
height=700)
fig.update_layout(title='The most famous game in each generation')
fig.show()
top600NA = total.sort_values(by='NA_Sales', ascending=False)[:600]
top600NA = top600NA.groupby('Genre', as_index=False).agg({'NA_Sales':'mean'})\
.sort_values(by='NA_Sales', ascending=False)[:1].reset_index(drop=True)
top600EU = total.sort_values(by='EU_Sales', ascending=False)[:600]
top600EU = top600EU.groupby('Genre', as_index=False).agg({'EU_Sales':'mean'})\
.sort_values(by='EU_Sales', ascending=False)[:1].reset_index(drop=True)
top600JP = total.sort_values(by='JP_Sales', ascending=False)[:600]
top600JP = top600JP.groupby('Genre', as_index=False).agg({'JP_Sales':'mean'})\
.sort_values(by='JP_Sales', ascending=False)[:1].reset_index(drop=True)
top_genre_region_dt = {'Region':['North America', 'Europe', 'Japan'],
'Genre':[top600NA['Genre'][0], top600EU['Genre'][0], top600JP['Genre'][0]],
'Sales':[top600NA['NA_Sales'][0], top600EU['EU_Sales'][0], top600JP['JP_Sales'][0]]}
top_genre_region = pd.DataFrame(data=top_genre_region_dt)
top_genre_region['Sales'] = top_genre_region['Sales'].round(2)
fig = go.Figure(data=[go.Table(header=dict(values=list(top_genre_region.columns),
fill_color='white',
line_color='darkslategray',
align=['right', 'center', 'center'],
font=dict(color='black', size=15)),
cells=dict(values=[top_genre_region.Region, top_genre_region.Genre, top_genre_region.Sales],
fill_color = [['lightcoral', 'peachpuff', 'lightyellow']],
align = ['right', 'center', 'center'],
line_color='darkslategray',
font = dict(color = 'darkslategray', size = 13)))])
fig.update_layout(width=800, height=300, title="The Most Popular Genre In Each Region")
fig.show()
NA_platforms = total.groupby('Platform', as_index=False).agg({'NA_Sales':'sum'})\
.sort_values(by='NA_Sales', ascending=False)[:5].reset_index(drop=True)
EU_platforms = total.groupby('Platform', as_index=False).agg({'EU_Sales':'sum'})\
.sort_values(by='EU_Sales', ascending=False)[:5].reset_index(drop=True)
JP_platforms = total.groupby('Platform', as_index=False).agg({'JP_Sales':'sum'})\
.sort_values(by='JP_Sales', ascending=False)[:5].reset_index(drop=True)
fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
subplot_titles=['North America', 'Europe', 'Japan'])
fig.add_trace(go.Pie(labels=NA_platforms['Platform'], values=NA_platforms['NA_Sales'],
hole=.3, pull=[0.2, 0, 0, 0, 0]), 1, 1)
fig.add_trace(go.Pie(labels=EU_platforms['Platform'], values=EU_platforms['EU_Sales'],
hole=.3, pull=[0.2, 0, 0, 0, 0]), 1, 2)
fig.add_trace(go.Pie(labels=JP_platforms['Platform'], values=JP_platforms['JP_Sales'],
hole=.3, pull=[0.2, 0, 0, 0, 0]), 1, 3)
fig.update_traces(hoverinfo='label', textinfo='value', textfont_size=15,
marker=dict(line=dict(color='#000000', width=2)))
fig.update_layout(title_text='The Most Popular Platform in Each Region',
annotations=[dict(text='NA', x=0.145, y=0.465, font_size=20, showarrow=False),
dict(text='EU', x=0.50, y=0.465, font_size=20, showarrow=False),
dict(text='JP', x=0.858, y=0.465, font_size=20, showarrow=False)])
fig.show()
top10_pub = total.groupby('Publisher').count()['Name'].sort_values(ascending=False)[1:11]\
.to_frame().reset_index().rename(columns={'Name':'Number of Games'})
fig = px.bar(top10_pub, x='Number of Games', y='Publisher', orientation='h', color='Number of Games',
color_continuous_scale="PuBu")
fig.update_layout(template='none', yaxis={'categoryorder':'total ascending'},
title='Top 10 Publishers with the Most Games')
fig.update_xaxes(automargin=True, title_text='Number of Games',
showline=True, linewidth=2, mirror=True, showgrid=False)
fig.update_yaxes(automargin=True, title_text='', showline=True, linewidth=2, mirror=True)
fig.show()
very_highs = total.loc[total['Class'] == 'Very High']
pub_highs = very_highs.groupby(['Publisher'], as_index=False).agg({'Class':'count'})\
.sort_values(by='Class', ascending=False)[:10].reset_index(drop=True)\
.rename(columns={'Class':'Number of High Ranking Games'})
fig = px.bar(pub_highs, x='Number of High Ranking Games', y='Publisher', orientation='h', color='Number of High Ranking Games',
color_continuous_scale="Burg")
fig.update_layout(template='none', yaxis={'categoryorder':'total ascending'},
title='Top 10 Publishers with the High Ranking Games')
fig.update_xaxes(automargin=True, title_text='Number of Games',
showline=True, linewidth=2, mirror=True, showgrid=False)
fig.update_yaxes(automargin=True, title_text='', showline=True, linewidth=2, mirror=True)
fig.show()
pub_high_sales = total.groupby('Publisher', as_index=False).agg({'Global_Sales':'sum'})\
.sort_values(by='Global_Sales', ascending=False)[:10].reset_index(drop=True)
fig = px.bar(pub_high_sales, x='Global_Sales', y='Publisher', orientation='h', color='Global_Sales',
color_continuous_scale="Mint")
fig.update_layout(template='none', yaxis={'categoryorder':'total ascending'},
title='Top 10 Publishers with the Highest Sales')
fig.update_xaxes(automargin=True, title_text='Sales in Millions',
showline=True, linewidth=2, mirror=True, showgrid=False)
fig.update_yaxes(automargin=True, title_text='', showline=True, linewidth=2, mirror=True)
fig.show()
First, let's try to find outliers.
def outlier(data, attributes):
    outlier_list = []
    for item in attributes:
        # nanpercentile so that missing sales values do not distort the quartiles
        Q1 = np.nanpercentile(data[item], 25)
        Q3 = np.nanpercentile(data[item], 75)
        IQR = Q3 - Q1
        outlier_step = 1.5 * IQR
        outlier_det = data[(data[item] < Q1 - outlier_step) | (data[item] > Q3 + outlier_step)].index
        outlier_list.extend(outlier_det)
    outlier_list = Counter(outlier_list)
    # keep rows that are outliers in more than two of the given columns
    multiple_outliers = [i for i, v in outlier_list.items() if v > 2]
    return multiple_outliers
total.loc[outlier(total, ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales"])]
Because many rows have a sales value of 0, all of them are considered outliers.
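One hedged way to reduce this effect, not part of the original analysis, is to run the detection only on rows with at least one non-zero sales figure:
# Sketch: drop all-zero (or all-missing) sales rows before applying the IQR-based outlier() above.
sales_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales"]
nonzero = total[(total[sales_cols] > 0).any(axis=1)]
nonzero.loc[outlier(nonzero, sales_cols)]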
Let's do linear regression between global sales and NA sales.
fig = px.scatter(total, x="Global_Sales", y="NA_Sales")
fig.update_layout(template=None)
fig.show()
total.dropna(subset=['NA_Sales', 'Global_Sales'], inplace=True)
x = total.NA_Sales.values.reshape(-1, 1)
Y = total.Global_Sales.values.reshape(-1, 1)
linear = LinearRegression()
linear.fit(x, Y)
x_ = np.arange(min(x), max(x), 0.1).reshape(-1,1)
pred = linear.predict(x_)
Yhat = linear.predict(x)
print("r score: ", r2_score(Y, Yhat))
fig = px.scatter(total, x="NA_Sales", y="Global_Sales", trendline="ols",
trendline_color_override="red")
fig.update_layout(template=None, title='Linear Regression')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()
top10pub = total.groupby(["Publisher"]).sum().filter(["Global_Sales"]).sort_values(by=["Global_Sales"],ascending=False).head(10)
pub_dict = {top10pub.index.values[i]:i for i in range(len(top10pub.index.values))}
vg = total.copy()
vg["Publisher"].replace(pub_dict, inplace=True)
top10val = [i for i in range(len(top10pub.index.values))]
vg = vg.drop(vg.query("Publisher not in @top10val").index)
vg = vg.reset_index(drop=True)
unique_platforms = vg["Platform"].unique()
plat_dict = {unique_platforms[i]:i for i in range(len(unique_platforms))}
vg["Platform"].replace(plat_dict, inplace=True)
unique_genres = vg["Genre"].unique()
genre_dict = {unique_genres[i]:i for i in range(len(unique_genres))}
vg["Genre"].replace(genre_dict, inplace=True)
vg.dropna(inplace=True)
def normalization(column):
    return (column - column.mean()) / column.std()
vg['Year'] = normalization(vg['Year'])
vg['NA_Sales'] = normalization(vg['NA_Sales'])
vg['EU_Sales'] = normalization(vg['EU_Sales'])
vg['JP_Sales'] = normalization(vg['JP_Sales'])
vg['Other_Sales'] = normalization(vg['Other_Sales'])
vg['Global_Sales'] = normalization(vg['Global_Sales'])
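For reference, this is z-score standardisation, and scikit-learn's StandardScaler does the same thing; a hedged equivalent sketch (it would replace the manual calls above, not run in addition to them):
from sklearn.preprocessing import StandardScaler

# Hypothetical equivalent of the manual normalization() calls above (do not run both).
# StandardScaler uses the population std, so values differ marginally from the pandas version.
num_cols = ['Year', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
vg[num_cols] = StandardScaler().fit_transform(vg[num_cols])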
vg.head()
X = vg.reset_index().filter(["Year", "Genre", "Publisher", "NA_Sales", "EU_Sales",
"JP_Sales", "Other_Sales", "Global_Sales"]).to_numpy()
y = vg.reset_index().filter(["Platform"]).to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y)
print("Separation from {} elements to : train = {} ; test = {}.".format(X.shape[0], X_train.shape[0], X_test.shape[0]))
We will try a K-Nearest Neighbours classifier.
knn = KNeighborsClassifier(len(unique_platforms))
knn.fit(X_train, y_train.ravel())
score = knn.score(X_test, y_test)
print("Score :", score)
Let's do a quick test with Nintendo as the publisher.
test = np.array([5, 1, 0, 1, 2, 2, 1, 2])
test = test.reshape((1, 8))
print("For " + str(top10pub.index.values[test[0, 2]]) + " as a publisher and "
+ str(unique_genres[test[0,1]]) + " as a genre, the platform predicted is : \n"
+ str(unique_platforms[np.argmax(knn.predict(test))]))
According to Wikipedia, a multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). Additionally, the Deep Learning Book by Goodfellow states, "A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input."
model_sk = MLPClassifier(hidden_layer_sizes=(8, 16, 32, 64))
model_sk.fit(X_train, y_train.ravel())
score = model_sk.score(X_test, y_test)
print("Score :", score)
Let's run the same test as above.
test = np.array([5, 1, 0, 1, 2, 2, 1, 2])
test = test.reshape((1, 8))
print("For " + str(top10pub.index.values[test[0, 2]]) + " as a publisher and "
+ str(unique_genres[test[0,1]]) + " as a genre, the platform predicted is : \n"
+ str(unique_platforms[np.argmax(model_sk.predict(test))]))
X_train = np.array(X_train).astype('float32')
X_test = np.array(X_test).astype('float32')
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(8,)))
model.add(tf.keras.layers.Dense(8, activation="relu"))
model.add(tf.keras.layers.Dense(16, activation="relu"))
model.add(tf.keras.layers.Dense(32, activation="relu"))
model.add(tf.keras.layers.Dense(64, activation="relu"))
model.add(tf.keras.layers.Dense(len(unique_platforms), activation="softmax"))
model.build(X[0].shape)
model.summary()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
model.compile(loss=loss_fn, optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=150, validation_data=(X_test, y_test))
loss_curve = history.history["loss"]
acc_curve = history.history["accuracy"]
loss_val_curve = history.history["val_loss"]
acc_val_curve = history.history["val_accuracy"]
plt.plot(loss_curve,label="Train")
plt.plot(loss_val_curve,label="Validation")
plt.legend(loc='upper right')
plt.title("Loss")
plt.show()
plt.plot(acc_curve,label="Train")
plt.plot(acc_val_curve,label="Validation")
plt.legend(loc='lower right')
plt.title("Accuracy")
plt.show()
Let's do the test again with TensorFlow.
test = np.array([5, 1, 0, 1, 2, 2, 1, 2])
test = test.reshape((1, 8))
print("For " + str(top10pub.index.values[test[0, 2]]) + " as a publisher and "
+ str(unique_genres[test[0,1]]) + " as a genre, the platform predicted is : \n"
+ str(unique_platforms[np.argmax(model.predict(test))]))
From these results, I will need to find a better model for prediction. There are so many different machine learning models, and even after going through tutorials, I found it very difficult to pick and use the right ones. If I continue to learn, I hope to build good models! (Thanks to the tutorial at https://www.kaggle.com/aurbcd/simple-look-at-vgsales-data-viz-science)
Completing this project took a while, but I had lots of fun doing this analysis. As a person who loves video games and online games, it surprised me that not all of my favorite games made any of the top lists.
Also, I combined a few datasets from different time periods, but I wish I had a fuller dataset with fewer missing values. In some parts, I realized I need to use better functions to shorten the processing time. Since the data was pretty big, I need to find better ways to clean and process it.
Learning new machine learning models was also a fun part of this project. Although I do not have in-depth knowledge of machine learning, this project helped me understand the models better. It was challenging but also very interesting.
In conclusion, I still have lots to learn and improve, but I'm very excited to finish this project and move on to new ones!