Table of Contents¶

Data Cleaning
Data Visualization and Analysis
Machine Learning
Conclusion

1. Data Cleaning¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go

trump = pd.read_csv('https://raw.githubusercontent.com/mwaugh0328/Data_Bootcamp_Fall_2017/master/data_bootcamp_1127/trump_data.csv', 
                    encoding='latin-1')

trump.head()

trump.shape

(3112, 17)

trump.describe()

trump.corr()

trump.dtypes

Unnamed: 0           int64
population         float64
income             float64
NAME                object
county               int64
state                int64
FIPS               float64
StateCode           object
StateName           object
CountyFips           int64
CountyName          object
CountyTotalVote      int64
Party               object
Candidate           object
VoteCount          float64
_merge              object
trump_share        float64
dtype: object

Drop columns that are not needed for the analysis, such as Unnamed: 0.

trump.drop(columns=['Unnamed: 0', 'NAME', 'Party', 'StateName', 'FIPS', '_merge', 'Candidate',
                    'county', 'state', 'CountyFips'], inplace=True)

trump.rename(columns={'population': 'Population', 'income': 'Income', 'StateCode': 'State',
                      'CountyName': 'County', 'CountyTotalVote': 'County Total Vote',
                      'VoteCount': 'Vote Count', 'trump_share': 'Trump Share'}, inplace=True)

Some county names are written in all uppercase, so we convert every name to title case for consistency.

trump['County'] = trump['County'].str.title()
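To see what `str.title()` does to the county names, here is a small sketch on made-up values (the sample names are illustrative assumptions, not raw rows from the dataset). It capitalizes the first letter of each word and lowercases the rest, so a name such as `MCLEAN` would come out as `Mclean`.

```python
import pandas as pd

# Illustrative county names in mixed formats (assumed values, not from the dataset).
sample = pd.Series(['AUTAUGA', 'baldwin', 'Roberts', 'DE KALB', 'MCLEAN'])

# str.title() capitalizes the first letter of every word and lowercases the rest.
normalized = sample.str.title()
print(normalized.tolist())
```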

2. Visualization¶

top10_trump_support = trump.sort_values(by='Trump Share', ascending=False)[:10]

fig = px.sunburst(top10_trump_support, path=['State','County'], values='Trump Share', color='Trump Share',
                  color_continuous_scale='OrRd', title='Top 10 Trump Support County and State',
                  hover_data=['Trump Share'])
fig.update_layout(coloraxis_colorbar_title='Trump Share')
fig.update_traces(hovertemplate="Trump Share: %{customdata[0]}")
fig.show()

In the sunburst of the ten counties with the highest Trump share, Texas is the leading state for Trump support, and Roberts County has the highest Trump share among the Texas counties.

fig = px.density_heatmap(trump, x="Income", y="Trump Share",
                        marginal_x="histogram", marginal_y="histogram",
                        title='Income and Trump Share')
fig.show()

3) Population and Trump Share¶

trump['Population_ln'] = np.log(trump['Population'])

fig = px.scatter(trump, x='Population_ln', y='Trump Share',
                labels={'Population_ln': 'Population (ln)'},
                title='Relationship Between Population and Trump Share',
                template='none')

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

a. Ordinary Least Squares¶

Ordinary Least Squares (OLS) fits a function to the data by minimizing the sum of squared errors between the observed and predicted values. It is the standard method for estimating the unknown parameters of a linear regression model.
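As a minimal sketch of the idea, the one-variable case has a closed-form solution: the slope is the covariance of x and y divided by the variance of x. The helper name `ols_fit` and the toy data are assumptions made for illustration; the notebook itself uses statsmodels.

```python
import numpy as np

def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors for y ~ a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Closed-form solution for simple (one-variable) linear regression.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Toy data lying exactly on y = 2 + 3x, so the fit should recover those values.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x
intercept, slope = ols_fit(x, y)
print(intercept, slope)
```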

trump['Income_ln'] = np.log(trump['Income'])
# One county has a missing income value; forward-fill it from the previous row.
trump['Income_ln'] = trump['Income_ln'].ffill()

import statsmodels.formula.api as smf

ols_reg = smf.ols('Q("Trump Share") ~ Income_ln', trump).fit()
trump['pred_ols_reg'] = ols_reg.predict()

print(ols_reg.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:       Q("Trump Share")   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                  0.019
Method:                 Least Squares   F-statistic:                     61.72
Date:                Tue, 11 Aug 2020   Prob (F-statistic):           5.40e-15
Time:                        22:33:59   Log-Likelihood:                 1389.9
No. Observations:                3112   AIC:                            -2776.
Df Residuals:                    3110   BIC:                            -2764.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.5917      0.122     13.082      0.000       1.353       1.830
Income_ln     -0.0891      0.011     -7.856      0.000      -0.111      -0.067
==============================================================================
Omnibus:                      335.316   Durbin-Watson:                   1.399
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              451.754
Skew:                          -0.881   Prob(JB):                     7.99e-99
Kurtosis:                       3.615   Cond. No.                         474.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

trump['pred_ols_reg'].head()

0    0.624882
1    0.626685
2    0.664276
3    0.650025
4    0.634933
Name: pred_ols_reg, dtype: float64
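The fitted values can be roughly reproduced by hand from the coefficients in the summary. The sketch below plugs in the rounded intercept and slope together with the income of the first county (51281, Autauga County); it gives about 0.625, close to the 0.624882 shown above, with the small gap coming from coefficient rounding. It also shows how the slope is read in this level-log model: the change in predicted share for a 10% rise in income is the slope times ln(1.10).

```python
import math

# Rounded coefficients copied from the regression summary.
intercept = 1.5917
slope = -0.0891

# Income of the first county in the dataset (Autauga County, Alabama).
income = 51281.0

# Level-log model: predicted share = intercept + slope * ln(income).
predicted_share = intercept + slope * math.log(income)
print(round(predicted_share, 3))

# Approximate change in predicted share for a 10% increase in income.
change_for_10_percent = slope * math.log(1.10)
print(round(change_for_10_percent, 4))
```

Under this reading, a 10% higher income is associated with a Trump share that is a little under one percentage point lower.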

We can also plot the OLS fit directly using Plotly's trendline option.

fig = px.scatter(trump, x='Income_ln', y='Trump Share', trendline='ols',
                labels={'Income_ln': 'Income (ln)'},
                title='Relationship Between Income and Trump Share',
                template='none', trendline_color_override='red')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

b. Random Forest Regressor¶

A random forest (RF) is an ensemble technique that handles regression and classification tasks by combining multiple decision trees. For regression, the forest's prediction is the average of the individual trees' predictions.
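To make the "ensemble" part concrete, the sketch below fits a small forest on synthetic data (an assumption for illustration, not the election dataset) and checks that averaging the predictions of the individual trees in `estimators_` matches what the forest itself predicts.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic one-feature data for illustration (not the election dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Each fitted tree makes its own prediction; the forest averages them.
tree_predictions = np.array([tree.predict(X) for tree in forest.estimators_])
averaged = tree_predictions.mean(axis=0)

print(np.allclose(averaged, forest.predict(X)))
```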

trump['Income_ln'] = np.log(trump['Income'])
# One county has a missing income value; forward-fill it from the previous row.
trump['Income_ln'] = trump['Income_ln'].ffill()

from sklearn.ensemble import RandomForestRegressor as rf
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(trump[['Income_ln']].values, trump['Trump Share'].values)

skl_rf = rf(n_estimators=100).fit(x_train, y_train)
skl_rf.score(x_test, y_test)

-0.31806535406376835
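The number above is what `score` returns for a scikit-learn regressor: the coefficient of determination R², here measured on the held-out test set. A minimal sketch, on made-up values, of how R² is computed and why it drops below zero when predictions are worse than simply predicting the mean (the helper name `r_squared` is an assumption for illustration):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up vote shares for illustration only.
y_true = np.array([0.50, 0.60, 0.70, 0.80])

# Predicting the mean everywhere gives an R^2 of zero.
mean_baseline = np.full_like(y_true, y_true.mean())
print(r_squared(y_true, mean_baseline))

# Predictions that miss by more than the mean does give a negative R^2.
poor_predictions = np.array([0.80, 0.50, 0.85, 0.55])
print(r_squared(y_true, poor_predictions))
```

A negative test score therefore means the forest generalizes worse than a constant prediction of the average Trump share, which is a sign of overfitting when log income is the only feature.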

trump['pred_skl_rf'] = skl_rf.predict(trump[['Income_ln']].values)

trump['pred_skl_rf'].head()

0    0.707976
1    0.772681
2    0.616012
3    0.679038
4    0.750440
Name: pred_skl_rf, dtype: float64

fig = go.Figure()

fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['Trump Share'],
                    mode='markers',
                    name='Trump Share'))
fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['pred_skl_rf'],
                    mode='markers',
                    name='RF Predicted'))

fig.update_layout(template='none', title='Income and Trump Share Predicted by Random Forest',
                  xaxis_title='Income (ln)', yaxis_title='Trump Share')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

c. K-Nearest Neighbor (KNN)¶

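K-Nearest Neighbors (KNN) is a simple supervised machine learning technique that predicts each case from the cases most similar to it. In classification it assigns the most common label among the k nearest neighbors; in regression, as used here, it predicts the average target value of the k nearest neighbors.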
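The regression variant can be sketched in a few lines of NumPy. The function name `knn_predict` and the data are assumptions for illustration; the notebook itself relies on scikit-learn's `KNeighborsRegressor`.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict each query point as the mean target of its k nearest training points (1-D feature)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    predictions = []
    for q in np.atleast_1d(x_query):
        # Indices of the k training points closest to q.
        nearest = np.argsort(np.abs(x_train - q))[:k]
        predictions.append(y_train[nearest].mean())
    return np.array(predictions)

# Made-up log-income values and vote shares for illustration.
x_train = np.array([10.0, 10.2, 10.4, 11.0, 11.2])
y_train = np.array([0.80, 0.70, 0.75, 0.50, 0.40])

pred = knn_predict(x_train, y_train, x_query=[10.1, 11.1], k=2)
print(pred)
```

A larger k (the notebook uses 100) averages over more neighbors, which smooths the predictions at the cost of local detail.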

from sklearn.neighbors import KNeighborsRegressor as knn
skl_knn = knn(n_neighbors=100).fit(trump[['Income_ln']].values, trump['Trump Share'].values)
trump['pred_skl_knn'] = skl_knn.predict(trump[['Income_ln']].values)

trump['pred_skl_knn'].head()

0    0.648736
1    0.611395
2    0.650770
3    0.694707
4    0.656258
Name: pred_skl_knn, dtype: float64

fig = go.Figure()

fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['Trump Share'],
                    mode='markers',
                    name='Trump Share'))
fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['pred_skl_knn'],
                    mode='markers',
                    name='KNN Predicted'))

fig.update_layout(template='none', title='Income and Trump Share Predicted by K-Nearest Neighbor',
                  xaxis_title='Income (ln)', yaxis_title='Trump Share')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

4. Conclusion¶

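These are a few very simple ML models that I learned in the NYU Data Bootcamp class. They are not the most accurate models (and may be quite wrong). On the plot, the random forest predictions look very close to the actual Trump vote share, but those predictions include the rows the model was trained on; on the held-out test set its R² was about -0.32, which points to overfitting rather than real predictive power. The data also suggest that locations with a higher Trump vote share tend to have lower income: the OLS coefficient on log income is negative (-0.0891), although income alone explains only about 2% of the variation (R² = 0.019). Even though I am not a very political person, I found this vote analysis quite interesting!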

Output of trump.head():

	Unnamed: 0	population	income	NAME	county	state	FIPS	StateCode	StateName	CountyFips	CountyName	CountyTotalVote	Party	Candidate	VoteCount	_merge	trump_share
0	0	55221.0	51281.0	Autauga County, Alabama	1	1	1001.0	AL	alabama	1001	Autauga	24661	GOP	Trump	18110.0	both	0.734358
1	5	195121.0	50254.0	Baldwin County, Alabama	3	1	1003.0	AL	alabama	1003	Baldwin	94090	GOP	Trump	72780.0	both	0.773515
2	10	26932.0	32964.0	Barbour County, Alabama	5	1	1005.0	AL	alabama	1005	Barbour	10390	GOP	Trump	5431.0	both	0.522714
3	15	22604.0	38678.0	Bibb County, Alabama	7	1	1007.0	AL	alabama	1007	Bibb	8748	GOP	Trump	6733.0	both	0.769662
4	20	57710.0	45813.0	Blount County, Alabama	9	1	1009.0	AL	alabama	1009	Blount	25384	GOP	Trump	22808.0	both	0.898519

Output of trump.describe():

	Unnamed: 0	population	income	county	state	FIPS	CountyFips	CountyTotalVote	VoteCount	trump_share
count	3112.000000	3.112000e+03	3111.000000	3112.000000	3112.000000	3112.000000	3112.000000	3.112000e+03	3112.000000	3112.000000
mean	7777.500000	1.014722e+05	46662.262295	103.175129	30.548522	30651.696979	30651.696979	4.031575e+04	19139.641067	0.636029
std	4492.506724	3.244532e+05	12124.417511	107.834596	14.965305	14984.651238	14984.651238	1.059490e+05	38486.948628	0.156363
min	0.000000	1.170000e+02	19328.000000	1.000000	1.000000	1001.000000	1001.000000	6.400000e+01	57.000000	0.041221
25%	3888.750000	1.120550e+04	38747.500000	35.000000	19.000000	19038.500000	19038.500000	4.807000e+03	3193.000000	0.549716
50%	7777.500000	2.601000e+04	45023.000000	78.500000	29.000000	29208.000000	29208.000000	1.091050e+04	7087.500000	0.667110
75%	11666.250000	6.793625e+04	52109.500000	133.000000	46.000000	46005.500000	46005.500000	2.843350e+04	17321.250000	0.750448
max	15555.000000	1.003839e+07	123453.000000	840.000000	56.000000	56045.000000	56045.000000	2.240323e+06	566019.000000	0.952727

Output of trump.corr():

	Unnamed: 0	population	income	county	state	FIPS	CountyFips	CountyTotalVote	VoteCount	trump_share
Unnamed: 0	1.000000	-0.061086	0.092661	0.204040	0.995888	0.996071	0.996071	-0.055698	-0.054335	0.050111
population	-0.061086	1.000000	0.247073	-0.048813	-0.061003	-0.061276	-0.061276	0.951308	0.853867	-0.346788
income	0.092661	0.247073	1.000000	-0.055744	0.095195	0.094672	0.094672	0.308981	0.343856	-0.189405
county	0.204040	-0.048813	-0.055744	1.000000	0.175918	0.182888	0.182888	-0.059105	-0.061169	0.030484
state	0.995888	-0.061003	0.095195	0.175918	1.000000	0.999975	0.999975	-0.052594	-0.050709	0.050352
FIPS	0.996071	-0.061276	0.094672	0.182888	0.999975	1.000000	1.000000	-0.052951	-0.051084	0.050506
CountyFips	0.996071	-0.061276	0.094672	0.182888	0.999975	1.000000	1.000000	-0.052951	-0.051084	0.050506
CountyTotalVote	-0.055698	0.951308	0.308981	-0.059105	-0.052594	-0.052951	-0.052951	1.000000	0.930217	-0.392625
VoteCount	-0.054335	0.853867	0.343856	-0.061169	-0.050709	-0.051084	-0.051084	0.930217	1.000000	-0.319443
trump_share	0.050111	-0.346788	-0.189405	0.030484	0.050352	0.050506	0.050506	-0.392625	-0.319443	1.000000

Trump Vote Share - Income Analysis¶