Data Information
Column | Description |
---|---|
Rank | Ranking of overall sales |
Name | The game's name |
Platform | Platform of the game's release (e.g., PC, PS4) |
Year | Year of the game's release |
Genre | Genre of the game |
Publisher | Publisher of the game |
NA_Sales | Sales in North America (in millions) |
EU_Sales | Sales in Europe (in millions) |
JP_Sales | Sales in Japan (in millions) |
Other_Sales | Sales in the rest of the world (in millions) |
Global_Sales | Total worldwide sales (in millions) |
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from collections import Counter
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
import tensorflow as tf
vg_2016 = pd.read_csv('vgsales.csv')
vg_2019 = pd.read_csv('vgsales_2019.csv')
ps4 = pd.read_csv('ps4.csv', header=0, encoding='unicode_escape')
xbox = pd.read_csv('xboxone.csv', header=0, encoding='unicode_escape')
vg_2016.head()
vg_2016.describe()
vg_2016.columns
vg_2016.corr()
vg_2016.isna().sum()
There are lots of NULL values in Year and Publisher. Let's check the 2019 data as well.
vg_2019.isna().sum()
What we can do is take the average release year of the games on each platform. For example, if the PS2 launched in 2000 and the PS3 launched in 2006, then a PS2 game was most likely released between 2000 and 2006. So we will fill the missing years with the average year of all games on that platform.
platformYear = vg_2016.groupby('Platform').agg({'Year':'mean'}).reset_index().round(0)
platformYear['Year'] = platformYear['Year'].astype(int)
platformYear.head()
fillDict = platformYear.set_index('Platform')['Year'].to_dict()
vg_2016['Year'] = vg_2016['Year'].replace(np.nan, vg_2016['Platform'].map(fillDict))
vg_2019['Year'] = vg_2019['Year'].replace(np.nan, vg_2019['Platform'].map(fillDict))
Because the 2019 dataset is very large, filling the null values this way takes a long time. In the future I need to find a better way to fill them to reduce the processing time.
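A likely faster approach, sketched here as an assumption rather than something verified on this data, is to reuse the per-platform mapping and fill the missing values with a single vectorised fillna call:
# Hypothetical faster fill: fillna with an index-aligned Series built from fillDict
# (assumption: equivalent result to the replace calls above, but vectorised).
vg_2016['Year'] = vg_2016['Year'].fillna(vg_2016['Platform'].map(fillDict))
vg_2019['Year'] = vg_2019['Year'].fillna(vg_2019['Platform'].map(fillDict))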
vg_2016['Year'].isna().sum()
vg_2019['Year'].isna().sum()
vg_2019['Year'] = vg_2019['Year'].replace(np.nan, 0)
vg_2016['Year'] = vg_2016['Year'].astype(int)
vg_2019['Year'] = vg_2019['Year'].astype(int)
Unfortunately, there is not enough publisher data to work with. Since there is no good way to guess the publisher from the other columns, unlike the year, we will have to keep it as null for now.
vg_2016.loc[vg_2016['Publisher'].isna() == True].head()
I will drop the columns with many null values, as well as the columns that are not needed.
vg_2019.drop(columns=['basename', 'ESRB_Rating', 'Developer', 'Critic_Score', 'User_Score', 'Total_Shipped',
'VGChartz_Score', 'PAL_Sales', 'Last_Update', 'url', 'status', 'Vgchartzscore',
'img_url'], inplace=True)
vg_2019.head()
Because the 2016 and 2019 data use two different rank scales, we will bin each dataset separately into [Very High, High, Medium, Low, Very Low]. The two dataframes do not contain the same number of games, so this will not be very accurate, but we will continue with this method until I find a better way (one possible alternative is sketched after the code below).
vg_2016['Class'] = pd.cut(vg_2016['Rank'], bins=5, labels=['Very High', 'High', 'Medium', 'Low', 'Very Low'])
vg_2019['Class'] = pd.cut(vg_2019['Rank'], bins=5, labels=['Very High', 'High', 'Medium', 'Low', 'Very Low'])
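One possible better way, sketched here as an assumption and not used in the rest of this analysis, is to rank each dataset by Global_Sales and cut the percentile rank into five equal bands, so both datasets share the same 0-1 scale:
# Hedged alternative: derive the class from the Global_Sales percentile rank.
alt_labels = ['Very High', 'High', 'Medium', 'Low', 'Very Low']
for df in (vg_2016, vg_2019):
    pct = df['Global_Sales'].rank(pct=True, ascending=False)  # best sellers get the smallest percentiles
    df['Class_alt'] = pd.cut(pct, bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=alt_labels)
    # rows with a missing Global_Sales simply stay NaN here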
vg_2016.head()
vg_2019.head()
Create a new dataframe by appending 2019 data to 2016 data.
vg = pd.concat([vg_2016, vg_2019])
vg.head()
Now we will drop any duplicates using Name, Platform, Year, and Genre to avoid any mistakes.
vg = vg.drop_duplicates(subset=['Name', 'Platform', 'Year', 'Genre'])
vg.shape
vg.head()
Now we will add two more datasets: PS4 and Xbox One video games. I will clean these datasets for analysis.
ps4.head()
xbox.head()
First, we will rename the columns to match the video game sales data.
ps4.rename(columns={'Game':'Name', 'North America':'NA_Sales', 'Europe':'EU_Sales',
'Japan':'JP_Sales', 'Rest of World':'Other_Sales', 'Global':'Global_Sales'}, inplace=True)
xbox.rename(columns={'Pos':'Rank', 'Game':'Name', 'North America':'NA_Sales', 'Europe':'EU_Sales',
'Japan':'JP_Sales', 'Rest of World':'Other_Sales', 'Global':'Global_Sales'}, inplace=True)
Also, the PS4 data does not have a Rank column, but we can assume it is already sorted by rank: as you can see in the Xbox data, the Pos column is the rank, so this is a reasonable guess. Then we will add the class level based on this new rank.
ps4['Rank'] = ps4.index + 1
We also need to add a Platform column to both datasets.
ps4['Platform'] = 'PS4'
xbox['Platform'] = 'XOne'
We will fill the NULL values in Year using the platform year, as before. Unfortunately, we cannot guess the publisher, so we will leave it as null for now.
ps4.isna().sum()
xbox.isna().sum()
ps4['Year'] = ps4['Year'].replace(np.nan, ps4['Platform'].map(fillDict))
xbox['Year'] = xbox['Year'].replace(np.nan, xbox['Platform'].map(fillDict))
ps4['Year'] = ps4['Year'].astype(int)
xbox['Year'] = xbox['Year'].astype(int)
Now we will add the class to each dataset using the same method as for the video game sales dataset.
ps4['Class'] = pd.cut(ps4['Rank'], bins=5, labels=['Very High', 'High', 'Medium', 'Low', 'Very Low'])
xbox['Class'] = pd.cut(xbox['Rank'], bins=5, labels=['Very High', 'High', 'Medium', 'Low', 'Very Low'])
ps4.head()
xbox.head()
First, let's combine these two datasets.
consoles = pd.concat([ps4, xbox])
consoles.head()
consoles.shape
Now on to the total dataset.
total = pd.concat([vg, consoles])
total.head()
total = total.drop_duplicates(subset=['Name', 'Platform', 'Year', 'Genre'])
The final dataset is here!
total.head()
total.shape
total.isna().sum()
percent_missing = total.isnull().sum() * 100 / len(total)
missing_values = pd.DataFrame({'column_name': total.columns,
'percent_missing': percent_missing})
missing_values
total.sort_values(['Global_Sales', 'Rank'], ascending=[False, True], inplace=True)
corr = total.corr()
fig = go.Figure(data=go.Heatmap(z=corr, x=corr.index, y=corr.columns,
hoverongaps=False, colorscale="Reds"))
fig.update_layout(title='Correlations Between Columns')
fig.show()
fig = px.histogram(total, x="Platform",
title='Histogram of Platform',
opacity=0.8,
color_discrete_sequence=['indianred'])
fig.update_layout(template=None)
fig.show()
fig = px.histogram(total, x="Year",
                   title='Histogram of Year',
opacity=0.8,
color_discrete_sequence=['paleturquoise'])
fig.update_layout(template=None)
fig.update_xaxes(range=[1960, 2020])
fig.show()
fig = px.histogram(total, x="Genre",
title='Histogram of Genre',
opacity=0.8,
color_discrete_sequence=['mediumpurple'])
fig.update_layout(template=None)
fig.show()
top10all = total[:10]
fig = px.bar(top10all, x='Global_Sales', y='Name', orientation='h', color='Global_Sales',
color_continuous_scale="RdBu", labels={'Global_Sales':'Global Sales'})
fig.update_layout(template='none', yaxis={'categoryorder':'total ascending'},
title='Top 10 Video Games')
fig.update_xaxes(automargin=True, title_text='Global Sales',
showline=True, linewidth=2, mirror=True, showgrid=False)
fig.update_yaxes(automargin=True, title_text='', showline=True, linewidth=2, mirror=True)
fig.show()
fig = go.Figure()
fig.add_trace(go.Violin(y=total['NA_Sales'], name='North America'))
fig.add_trace(go.Violin(y=total['EU_Sales'], name='Europe'))
fig.add_trace(go.Violin(y=total['JP_Sales'], name='Japan'))
fig.add_trace(go.Violin(y=total['Other_Sales'], name='Others'))
fig.update_traces(meanline_visible=True)
fig.update_layout(template=None, violingap=0, violinmode='overlay',
title='Sales by Regions')
fig.show()
NA_top5 = total.sort_values(by='NA_Sales', ascending=False)[:5]
EU_top5 = total.sort_values(by='EU_Sales', ascending=False)[:5]
JP_top5 = total.sort_values(by='JP_Sales', ascending=False)[:5]
OT_top5 = total.sort_values(by='Other_Sales', ascending=False)[:5]
fig = make_subplots(rows=2, cols=2, shared_yaxes=True,
subplot_titles=("North America", "Europe", "Japan", "Others"))
fig.add_trace(go.Bar(x=NA_top5['Name'], y=NA_top5['NA_Sales'],
marker=dict(color=NA_top5['NA_Sales'], coloraxis="coloraxis")),
row=1, col=1)
fig.add_trace(go.Bar(x=EU_top5['Name'], y=EU_top5['EU_Sales'],
marker=dict(color=EU_top5['EU_Sales'], coloraxis="coloraxis")),
row=1, col=2)
fig.add_trace(go.Bar(x=JP_top5['Name'], y=JP_top5['JP_Sales'],
marker=dict(color=JP_top5['JP_Sales'], coloraxis="coloraxis")),
row=2, col=1)
fig.add_trace(go.Bar(x=OT_top5['Name'], y=OT_top5['Other_Sales'],
marker=dict(color=OT_top5['Other_Sales'], coloraxis="coloraxis")),
row=2, col=2)
fig.update_layout(coloraxis=dict(colorscale='Purples'),
showlegend=False, template=None, height=600,
title="Top 5 Video Games By Reigons")
fig.show()
genre_top5 = total.groupby('Genre')[['Name', 'Global_Sales']].apply(lambda x: x.nlargest(5, columns=['Global_Sales']))\
.reset_index().drop(columns='level_1')
fig = px.bar(genre_top5, x="Genre", y="Global_Sales", color="Name",
color_discrete_sequence=px.colors.qualitative.Safe)
fig.update_layout(template=None, xaxis={'categoryorder':'total descending'},
title='Top 5 Games by Genre')
fig.update_xaxes(title=None)
fig.update_yaxes(title=None)
fig.show()
top10_platforms = total['Platform'].value_counts()[:10].index.to_list()
top10_platforms_vg = total[total['Platform'].isin(top10_platforms)]
top5_genre = total['Genre'].value_counts()[1:6].index.to_list()
top10_platforms_vg = top10_platforms_vg[top10_platforms_vg['Genre'].isin(top5_genre)]
top5_vgs = top10_platforms_vg.groupby(['Platform','Genre']).apply(lambda x: x.nlargest(1, columns='Global_Sales')\
[['Name', 'Global_Sales']]).reset_index().drop(columns='level_2')
fig = px.sunburst(top5_vgs, path=['Platform', 'Genre', 'Name'], values='Global_Sales',
color='Global_Sales', color_continuous_scale='RdBu',
color_continuous_midpoint=np.average(top5_vgs['Global_Sales'], weights=top5_vgs['Global_Sales']),
height=800)
fig.update_layout(title='Top 5 Games by Platform and Genre')
fig.show()
We will refer to a generation chart I found online to separate the time periods.
Generation by Year
Generation Name | Births Start | Births End |
---|---|---|
Generation X | 1965 | 1979 |
Xennials | 1975 | 1985 |
Millennials | 1980 | 1994 |
Gen Z | 1995 | 2012 |
Gen Alpha | 2013 | 2025 |
def set_gen(row):
    value = '0'
    if 1965 <= row['Year'] <= 1979:
        value = 'Generation X'
    elif 1980 <= row['Year'] <= 1985:
        value = 'Xennials'
    elif 1986 <= row['Year'] <= 1994:
        value = 'Millennials'
    elif 1995 <= row['Year'] <= 2012:
        value = 'Gen Z'
    elif 2013 <= row['Year'] <= 2025:
        value = 'Gen Alpha'
    return value
total['Generation'] = total.apply(set_gen, axis=1)
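The row-wise apply works, but on a dataset this size a vectorised pd.cut with the same year boundaries should be much faster. A small equivalent sketch (my assumption, mirroring set_gen above):
# Hedged vectorised equivalent of set_gen, using the same generation boundaries.
gen_bins = [1964, 1979, 1985, 1994, 2012, 2025]
gen_labels = ['Generation X', 'Xennials', 'Millennials', 'Gen Z', 'Gen Alpha']
gen_alt = pd.cut(total['Year'], bins=gen_bins, labels=gen_labels)
total['Generation_alt'] = gen_alt.cat.add_categories(['0']).fillna('0')  # years outside the ranges fall back to '0'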
generation_sales = total.groupby('Generation').agg({'Global_Sales':'sum'}).reset_index()
generation_sales = generation_sales.reindex([3, 5, 4, 2, 1, 0])
generation_sales.reset_index(drop=True, inplace=True)
colors = ['lightslategray'] * 6
colors[3] = 'crimson'
genres=['Role-Playing', 'Action', 'Platform', 'Misc', 'Misc', 'Action']
fig = go.Figure(data=[go.Bar(x=generation_sales['Generation'],
y=generation_sales['Global_Sales'],
marker_color=colors,
hovertemplate='Years: %{text} <br>Total Sales:%{y}',
                             text=['1965-1979', '1980-1985', '1986-1994', '1995-2012', '2013-', 'Unknown'],
                             name='')])
fig.update_layout(title_text='Total Sales By Generations', template=None)
fig.show()
grouped = total.groupby(['Generation', 'Genre'])['Rank'].count().to_frame()
generations_genres = grouped.loc[grouped.groupby(level='Generation')['Rank'].idxmax()]
generations_genres = generations_genres.reset_index().reindex([3, 5, 4, 2, 1, 0])
generations_genres.reset_index(drop=True, inplace=True)
fig = px.sunburst(generations_genres, path=['Generation', 'Genre'],
color='Generation', color_discrete_sequence=px.colors.qualitative.Dark2,
height=700)
fig.update_layout(title='The most famous genre in each generation')
fig.show()
grouped_games = total.groupby(['Generation', 'Name']).agg({'Global_Sales':'max'}).sort_values(by='Global_Sales',
ascending=False)
generations_games = grouped_games.loc[grouped_games.groupby(level='Generation')['Global_Sales'].idxmax()]\
.reset_index().reindex([3, 5, 4, 2, 1, 0])
generations_games.reset_index(drop=True, inplace=True)
fig = px.sunburst(generations_games, path=['Generation', 'Name', 'Global_Sales'],
color='Generation', color_discrete_sequence=px.colors.qualitative.Dark2,
height=700)
fig.update_layout(title='The most famous game in each generation')
fig.show()
top600NA = total.sort_values(by='NA_Sales', ascending=False)[:600]
top600NA = top600NA.groupby('Genre', as_index=False).agg({'NA_Sales':'mean'})\
.sort_values(by='NA_Sales', ascending=False)[:1].reset_index(drop=True)
top600EU = total.sort_values(by='EU_Sales', ascending=False)[:600]
top600EU = top600EU.groupby('Genre', as_index=False).agg({'EU_Sales':'mean'})\
.sort_values(by='EU_Sales', ascending=False)[:1].reset_index(drop=True)
top600JP = total.sort_values(by='JP_Sales', ascending=False)[:600]
top600JP = top600JP.groupby('Genre', as_index=False).agg({'JP_Sales':'mean'})\
.sort_values(by='JP_Sales', ascending=False)[:1].reset_index(drop=True)
top_genre_region_dt = {'Region':['North America', 'Europe', 'Japan'],
'Genre':[top600NA['Genre'][0], top600EU['Genre'][0], top600JP['Genre'][0]],
'Sales':[top600NA['NA_Sales'][0], top600EU['EU_Sales'][0], top600JP['JP_Sales'][0]]}
top_genre_region = pd.DataFrame(data=top_genre_region_dt)
top_genre_region['Sales'] = top_genre_region['Sales'].round(2)
fig = go.Figure(data=[go.Table(header=dict(values=list(top_genre_region.columns),
fill_color='white',
line_color='darkslategray',
align=['right', 'center', 'center'],
font=dict(color='black', size=15)),
cells=dict(values=[top_genre_region.Region, top_genre_region.Genre, top_genre_region.Sales],
fill_color = [['lightcoral', 'peachpuff', 'lightyellow']],
align = ['right', 'center', 'center'],
line_color='darkslategray',
font = dict(color = 'darkslategray', size = 13)))])
fig.update_layout(width=800, height=300, title="The Most Popular Genre In Each Region")
fig.show()
NA_platforms = total.groupby('Platform', as_index=False).agg({'NA_Sales':'sum'})\
.sort_values(by='NA_Sales', ascending=False)[:5].reset_index(drop=True)
EU_platforms = total.groupby('Platform', as_index=False).agg({'EU_Sales':'sum'})\
.sort_values(by='EU_Sales', ascending=False)[:5].reset_index(drop=True)
JP_platforms = total.groupby('Platform', as_index=False).agg({'JP_Sales':'sum'})\
.sort_values(by='JP_Sales', ascending=False)[:5].reset_index(drop=True)
fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
subplot_titles=['North America', 'Europe', 'Japan'])
fig.add_trace(go.Pie(labels=NA_platforms['Platform'], values=NA_platforms['NA_Sales'],
hole=.3, pull=[0.2, 0, 0, 0, 0]), 1, 1)
fig.add_trace(go.Pie(labels=EU_platforms['Platform'], values=EU_platforms['EU_Sales'],
hole=.3, pull=[0.2, 0, 0, 0, 0]), 1, 2)
fig.add_trace(go.Pie(labels=JP_platforms['Platform'], values=JP_platforms['JP_Sales'],
hole=.3, pull=[0.2, 0, 0, 0, 0]), 1, 3)
fig.update_traces(hoverinfo='label', textinfo='value', textfont_size=15,
marker=dict(line=dict(color='#000000', width=2)))
fig.update_layout(title_text='The Most Popular Platform in Each Region',
annotations=[dict(text='NA', x=0.145, y=0.465, font_size=20, showarrow=False),
dict(text='EU', x=0.50, y=0.465, font_size=20, showarrow=False),
dict(text='JP', x=0.858, y=0.465, font_size=20, showarrow=False)])
fig.show()
top10_pub = total.groupby('Publisher').count()['Name'].sort_values(ascending=False)[1:11]\
.to_frame().reset_index().rename(columns={'Name':'Number of Games'})
fig = px.bar(top10_pub, x='Number of Games', y='Publisher', orientation='h', color='Number of Games',
color_continuous_scale="PuBu")
fig.update_layout(template='none', yaxis={'categoryorder':'total ascending'},
title='Top 10 Publishers with the Most Games')
fig.update_xaxes(automargin=True, title_text='Number of Games',
showline=True, linewidth=2, mirror=True, showgrid=False)
fig.update_yaxes(automargin=True, title_text='', showline=True, linewidth=2, mirror=True)
fig.show()
very_highs = total.loc[total['Class'] == 'Very High']
pub_highs = very_highs.groupby(['Publisher'], as_index=False).agg({'Class':'count'})\
.sort_values(by='Class', ascending=False)[:10].reset_index(drop=True)\
.rename(columns={'Class':'Number of High Ranking Games'})
fig = px.bar(pub_highs, x='Number of High Ranking Games', y='Publisher', orientation='h', color='Number of High Ranking Games',
color_continuous_scale="Burg")
fig.update_layout(template='none', yaxis={'categoryorder':'total ascending'},
title='Top 10 Publishers with the High Ranking Games')
fig.update_xaxes(automargin=True, title_text='Number of Games',
showline=True, linewidth=2, mirror=True, showgrid=False)
fig.update_yaxes(automargin=True, title_text='', showline=True, linewidth=2, mirror=True)
fig.show()
pub_high_sales = total.groupby('Publisher', as_index=False).agg({'Global_Sales':'sum'})\
.sort_values(by='Global_Sales', ascending=False)[:10].reset_index(drop=True)
fig = px.bar(pub_high_sales, x='Global_Sales', y='Publisher', orientation='h', color='Global_Sales',
color_continuous_scale="Mint")
fig.update_layout(template='none', yaxis={'categoryorder':'total ascending'},
title='Top 10 Publishers with the Highest Sales')
fig.update_xaxes(automargin=True, title_text='Sales in Millions',
showline=True, linewidth=2, mirror=True, showgrid=False)
fig.update_yaxes(automargin=True, title_text='', showline=True, linewidth=2, mirror=True)
fig.show()
First, let's try to find outliers.
def outlier(data, attributes):
    outlier_list = []
    for item in attributes:
        # nanpercentile so that missing sales values do not distort the quartiles
        Q1 = np.nanpercentile(data[item], 25)
        Q3 = np.nanpercentile(data[item], 75)
        IQR = Q3 - Q1
        outlier_step = 1.5 * IQR
        outlier_det = data[(data[item] < Q1 - outlier_step) | (data[item] > Q3 + outlier_step)].index
        outlier_list.extend(outlier_det)
    outlier_list = Counter(outlier_list)
    # keep rows that are outliers in more than two of the given columns
    multiple_outliers = [i for i, v in outlier_list.items() if v > 2]
    return multiple_outliers
total.loc[outlier(total, ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales"])]
Because many rows have a sales value of 0, all of them are considered outliers.
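One hedged way to reduce this effect, not part of the original analysis, is to run the detection only on rows with at least one non-zero sales figure:
# Sketch: drop all-zero (or all-missing) sales rows before applying the IQR-based outlier() above.
sales_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales"]
nonzero = total[(total[sales_cols] > 0).any(axis=1)]
nonzero.loc[outlier(nonzero, sales_cols)]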
Let's do linear regression between global sales and NA sales.
fig = px.scatter(total, x="Global_Sales", y="NA_Sales")
fig.update_layout(template=None)
fig.show()
total.dropna(subset=['NA_Sales', 'Global_Sales'], inplace=True)
x = total.NA_Sales.values.reshape(-1, 1)
Y = total.Global_Sales.values.reshape(-1, 1)
linear = LinearRegression()
linear.fit(x, Y)
x_ = np.arange(min(x), max(x), 0.1).reshape(-1,1)
pred = linear.predict(x_)
Yhat = linear.predict(x)
print("r score: ", r2_score(Y, Yhat))
fig = px.scatter(total, x="NA_Sales", y="Global_Sales", trendline="ols",
trendline_color_override="red")
fig.update_layout(template=None, title='Linear Regression')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()
top10pub = total.groupby(["Publisher"]).sum().filter(["Global_Sales"]).sort_values(by=["Global_Sales"],ascending=False).head(10)
pub_dict = {top10pub.index.values[i]:i for i in range(len(top10pub.index.values))}
vg = total.copy()
vg["Publisher"].replace(pub_dict, inplace=True)
top10val = [i for i in range(len(top10pub.index.values))]
vg = vg.drop(vg.query("Publisher not in @top10val").index)
vg = vg.reset_index(drop=True)
unique_platforms = vg["Platform"].unique()
plat_dict = {unique_platforms[i]:i for i in range(len(unique_platforms))}
vg["Platform"].replace(plat_dict, inplace=True)
unique_genres = vg["Genre"].unique()
genre_dict = {unique_genres[i]:i for i in range(len(unique_genres))}
vg["Genre"].replace(genre_dict, inplace=True)
vg.dropna(inplace=True)
def normalization(column):
    return (column - column.mean()) / column.std()
vg['Year'] = normalization(vg['Year'])
vg['NA_Sales'] = normalization(vg['NA_Sales'])
vg['EU_Sales'] = normalization(vg['EU_Sales'])
vg['JP_Sales'] = normalization(vg['JP_Sales'])
vg['Other_Sales'] = normalization(vg['Other_Sales'])
vg['Global_Sales'] = normalization(vg['Global_Sales'])
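For reference, this is z-score standardisation, and scikit-learn's StandardScaler does the same thing; a hedged equivalent sketch (it would replace the manual calls above, not run in addition to them):
from sklearn.preprocessing import StandardScaler

# Hypothetical equivalent of the manual normalization() calls above (do not run both).
# StandardScaler uses the population std, so values differ marginally from the pandas version.
num_cols = ['Year', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
vg[num_cols] = StandardScaler().fit_transform(vg[num_cols])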
vg.head()
X = vg.reset_index().filter(["Year", "Genre", "Publisher", "NA_Sales", "EU_Sales",
"JP_Sales", "Other_Sales", "Global_Sales"]).to_numpy()
y = vg.reset_index().filter(["Platform"]).to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y)
print("Separation from {} elements to : train = {} ; test = {}.".format(X.shape[0], X_train.shape[0], X_test.shape[0]))
We will try a K-Nearest Neighbours classifier.
knn = KNeighborsClassifier(len(unique_platforms))
knn.fit(X_train, y_train.ravel())
score = knn.score(X_test, y_test)
print("Score :", score)
Let's do a quick test with Nintendo as the publisher.
test = np.array([5, 1, 0, 1, 2, 2, 1, 2])
test = test.reshape((1, 8))
print("For " + str(top10pub.index.values[test[0, 2]]) + " as a publisher and "
+ str(unique_genres[test[0,1]]) + " as a genre, the platform predicted is : \n"
+ str(unique_platforms[np.argmax(knn.predict(test))]))
According to Wikipedia, a multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). Additionally, the Deep Learning Book by Goodfellow states, "A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input."
model_sk = MLPClassifier(hidden_layer_sizes=(8, 16, 32, 64))
model_sk.fit(X_train, y_train.ravel())
score = model_sk.score(X_test, y_test)
print("Score :", score)
Let's run the same test as above.
test = np.array([5, 1, 0, 1, 2, 2, 1, 2])
test = test.reshape((1, 8))
print("For " + str(top10pub.index.values[test[0, 2]]) + " as a publisher and "
+ str(unique_genres[test[0,1]]) + " as a genre, the platform predicted is : \n"
+ str(unique_platforms[np.argmax(model_sk.predict(test))]))
X_train = np.array(X_train).astype('float32')
X_test = np.array(X_test).astype('float32')
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(8,)))
model.add(tf.keras.layers.Dense(8, activation="relu"))
model.add(tf.keras.layers.Dense(16, activation="relu"))
model.add(tf.keras.layers.Dense(32, activation="relu"))
model.add(tf.keras.layers.Dense(64, activation="relu"))
model.add(tf.keras.layers.Dense(len(unique_platforms), activation="softmax"))
model.build(X[0].shape)
model.summary()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
model.compile(loss=loss_fn, optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=150, validation_data=(X_test, y_test))
loss_curve = history.history["loss"]
acc_curve = history.history["accuracy"]
loss_val_curve = history.history["val_loss"]
acc_val_curve = history.history["val_accuracy"]
plt.plot(loss_curve,label="Train")
plt.plot(loss_val_curve,label="Validation")
plt.legend(loc='upper right')
plt.title("Loss")
plt.show()
plt.plot(acc_curve,label="Train")
plt.plot(acc_val_curve,label="Validation")
plt.legend(loc='lower right')
plt.title("Accuracy")
plt.show()
Let's do the test again with TensorFlow.
test = np.array([5, 1, 0, 1, 2, 2, 1, 2])
test = test.reshape((1, 8))
print("For " + str(top10pub.index.values[test[0, 2]]) + " as a publisher and "
+ str(unique_genres[test[0,1]]) + " as a genre, the platform predicted is : \n"
+ str(unique_platforms[np.argmax(model.predict(test))]))
From these results, I will need to find a better model for prediction. There are so many different machine learning models, and even after going through tutorials, I found it very difficult to pick and use the right ones. If I continue to learn, I hope to build good models! (Thanks to the tutorial at https://www.kaggle.com/aurbcd/simple-look-at-vgsales-data-viz-science)
Completing this project took a while, but I had lots of fun doing this analysis. As a person who loves video games and online games, it surprised me that not all of my favorite games made any of the top lists.
Also, I combined a few datasets from different time periods, but I wish I had a fuller dataset with fewer missing values. In some parts, I realized I need to use better functions to shorten the processing time. Since the data was pretty big, I need to find better ways to clean and process it.
Learning new machine learning models was also a fun part of this project. Although I do not have in-depth knowledge of machine learning, this project helped me understand the models better. It was challenging but also very interesting.
In conclusion, I still have lots to learn and improve, but I'm very excited to finish this project and move on to new ones!