Trump Vote Share - Income Analysis

Table of Contents

  1. Data Cleaning
  2. Data Visualization and Analysis
  3. Machine Learning
  4. Conclusion

1. Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go
In [2]:
trump = pd.read_csv('https://raw.githubusercontent.com/mwaugh0328/Data_Bootcamp_Fall_2017/master/data_bootcamp_1127/trump_data.csv', 
                    encoding='latin-1')
In [3]:
trump.head()
Out[3]:
   Unnamed: 0  population   income                     NAME  county  state    FIPS  StateCode  StateName  CountyFips  CountyName  CountyTotalVote  Party  Candidate  VoteCount  _merge  trump_share
0           0     55221.0  51281.0  Autauga County, Alabama       1      1  1001.0         AL    alabama        1001     Autauga            24661    GOP      Trump    18110.0    both     0.734358
1           5    195121.0  50254.0  Baldwin County, Alabama       3      1  1003.0         AL    alabama        1003     Baldwin            94090    GOP      Trump    72780.0    both     0.773515
2          10     26932.0  32964.0  Barbour County, Alabama       5      1  1005.0         AL    alabama        1005     Barbour            10390    GOP      Trump     5431.0    both     0.522714
3          15     22604.0  38678.0     Bibb County, Alabama       7      1  1007.0         AL    alabama        1007        Bibb             8748    GOP      Trump     6733.0    both     0.769662
4          20     57710.0  45813.0   Blount County, Alabama       9      1  1009.0         AL    alabama        1009      Blount            25384    GOP      Trump    22808.0    both     0.898519
In [4]:
trump.shape
Out[4]:
(3112, 17)
In [5]:
trump.describe()
Out[5]:
         Unnamed: 0    population         income       county        state          FIPS    CountyFips  CountyTotalVote      VoteCount  trump_share
count   3112.000000  3.112000e+03    3111.000000  3112.000000  3112.000000   3112.000000   3112.000000     3.112000e+03    3112.000000  3112.000000
mean    7777.500000  1.014722e+05   46662.262295   103.175129    30.548522  30651.696979  30651.696979     4.031575e+04   19139.641067     0.636029
std     4492.506724  3.244532e+05   12124.417511   107.834596    14.965305  14984.651238  14984.651238     1.059490e+05   38486.948628     0.156363
min        0.000000  1.170000e+02   19328.000000     1.000000     1.000000   1001.000000   1001.000000     6.400000e+01      57.000000     0.041221
25%     3888.750000  1.120550e+04   38747.500000    35.000000    19.000000  19038.500000  19038.500000     4.807000e+03    3193.000000     0.549716
50%     7777.500000  2.601000e+04   45023.000000    78.500000    29.000000  29208.000000  29208.000000     1.091050e+04    7087.500000     0.667110
75%    11666.250000  6.793625e+04   52109.500000   133.000000    46.000000  46005.500000  46005.500000     2.843350e+04   17321.250000     0.750448
max    15555.000000  1.003839e+07  123453.000000   840.000000    56.000000  56045.000000  56045.000000     2.240323e+06  566019.000000     0.952727
In [6]:
trump.corr()
Out[6]:
                 Unnamed: 0  population     income     county      state       FIPS  CountyFips  CountyTotalVote  VoteCount  trump_share
Unnamed: 0         1.000000   -0.061086   0.092661   0.204040   0.995888   0.996071    0.996071        -0.055698  -0.054335     0.050111
population        -0.061086    1.000000   0.247073  -0.048813  -0.061003  -0.061276   -0.061276         0.951308   0.853867    -0.346788
income             0.092661    0.247073   1.000000  -0.055744   0.095195   0.094672    0.094672         0.308981   0.343856    -0.189405
county             0.204040   -0.048813  -0.055744   1.000000   0.175918   0.182888    0.182888        -0.059105  -0.061169     0.030484
state              0.995888   -0.061003   0.095195   0.175918   1.000000   0.999975    0.999975        -0.052594  -0.050709     0.050352
FIPS               0.996071   -0.061276   0.094672   0.182888   0.999975   1.000000    1.000000        -0.052951  -0.051084     0.050506
CountyFips         0.996071   -0.061276   0.094672   0.182888   0.999975   1.000000    1.000000        -0.052951  -0.051084     0.050506
CountyTotalVote   -0.055698    0.951308   0.308981  -0.059105  -0.052594  -0.052951   -0.052951         1.000000   0.930217    -0.392625
VoteCount         -0.054335    0.853867   0.343856  -0.061169  -0.050709  -0.051084   -0.051084         0.930217   1.000000    -0.319443
trump_share        0.050111   -0.346788  -0.189405   0.030484   0.050352   0.050506    0.050506        -0.392625  -0.319443     1.000000
In [7]:
trump.dtypes
Out[7]:
Unnamed: 0           int64
population         float64
income             float64
NAME                object
county               int64
state                int64
FIPS               float64
StateCode           object
StateName           object
CountyFips           int64
CountyName          object
CountyTotalVote      int64
Party               object
Candidate           object
VoteCount          float64
_merge              object
trump_share        float64
dtype: object

Drop columns that are not needed for the analysis, such as Unnamed: 0.

In [8]:
trump.drop(columns=['Unnamed: 0', 'NAME', 'Party', 'StateName', 'FIPS', '_merge', 'Candidate',
                    'county', 'state', 'CountyFips'], inplace=True)
In [9]:
trump.rename(columns={'population': 'Population', 'income': 'Income', 'StateCode': 'State',
                      'CountyName': 'County', 'CountyTotalVote': 'County Total Vote',
                      'VoteCount': 'Vote Count', 'trump_share': 'Trump Share'}, inplace=True)

Some county names are in all uppercase, so we convert every name to title case for a consistent format.

In [10]:
trump['County'] = trump['County'].str.title()

2. Data Visualization and Analysis

1) Trump Share by County and State

In [11]:
top10_trump_support = trump.sort_values(by='Trump Share', ascending=False)[:10]
In [12]:
fig = px.sunburst(top10_trump_support, path=['State','County'], values='Trump Share', color='Trump Share',
                  color_continuous_scale='OrRd', title='Top 10 Trump Support County and State',
                  hover_data=['Trump Share'])
fig.update_layout(coloraxis_colorbar_title='Trump Share')
fig.update_traces(hovertemplate="Trump Share: %{customdata[0]}")
fig.show()

Among the ten counties with the highest Trump share, Texas is the most prominent state, and Roberts County has the highest Trump share among the Texas counties.

2) Relationship between Income and Trump Share

In [13]:
fig = px.density_heatmap(trump, x="Income", y="Trump Share",
                        marginal_x="histogram", marginal_y="histogram",
                        title='Income and Trump Share')
fig.show()

3) Population and Trump Share

In [14]:
trump['Population_ln'] = np.log(trump['Population'])
In [15]:
fig = px.scatter(trump, x='Population_ln', y='Trump Share',
                labels={'Population_ln': 'Population (ln)'},
                title='Relationship Between Population and Trump Share',
                template='none')

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

3. Machine Learning

a. Ordinary Least Squares

Ordinary Least Squares (OLS) fits a function to the data by minimizing the sum of squared errors. It is used to estimate the unknown parameters of a linear regression model.
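
As a minimal sketch of what minimizing the sum of squared errors means with a single predictor, the intercept and slope have a closed form. The helper names fit_simple_ols and sse and the toy numbers below are illustrative assumptions, not part of the notebook; statsmodels performs the equivalent fit (and much more) in the cells that follow.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing sum((y - a - b*x)**2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared errors of the line intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data (made up): log income vs. vote share for five hypothetical counties.
x = [10.2, 10.5, 10.8, 11.1, 11.4]
y = [0.72, 0.66, 0.69, 0.61, 0.58]

a, b = fit_simple_ols(x, y)
print(f"intercept={a:.3f}, slope={b:.3f}, SSE={sse(x, y, a, b):.5f}")
```

Any other intercept or slope gives a larger SSE on the same data, which is the sense in which the fitted line is "best".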

In [16]:
trump['Income_ln'] = np.log(trump['Income'])
# One county has no income value (describe() shows count 3111), so forward-fill it from the previous row.
trump['Income_ln'] = trump['Income_ln'].ffill()
In [17]:
import statsmodels.formula.api as smf

ols_reg = smf.ols('Q("Trump Share") ~ Income_ln', trump).fit()
trump['pred_ols_reg'] = ols_reg.predict()
In [18]:
print(ols_reg.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:       Q("Trump Share")   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                  0.019
Method:                 Least Squares   F-statistic:                     61.72
Date:                Tue, 11 Aug 2020   Prob (F-statistic):           5.40e-15
Time:                        22:33:59   Log-Likelihood:                 1389.9
No. Observations:                3112   AIC:                            -2776.
Df Residuals:                    3110   BIC:                            -2764.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.5917      0.122     13.082      0.000       1.353       1.830
Income_ln     -0.0891      0.011     -7.856      0.000      -0.111      -0.067
==============================================================================
Omnibus:                      335.316   Durbin-Watson:                   1.399
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              451.754
Skew:                          -0.881   Prob(JB):                     7.99e-99
Kurtosis:                       3.615   Cond. No.                         474.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [19]:
trump['pred_ols_reg'].head()
Out[19]:
0    0.624882
1    0.626685
2    0.664276
3    0.650025
4    0.634933
Name: pred_ols_reg, dtype: float64

We can also plot the OLS fit directly with Plotly's built-in trendline option.

In [20]:
fig = px.scatter(trump, x='Income_ln', y='Trump Share', trendline='ols',
                labels={'Income_ln': 'Income (ln)'},
                title='Relationship Between Income and Trump Share',
                template='none', trendline_color_override='red')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

b. Random Forest Regressor

A random forest (RF) regressor is an ensemble technique that fits multiple decision trees and averages their predictions. The same idea is used for classification tasks with a random forest classifier.
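
To make the idea of combining multiple decision trees concrete, here is a toy sketch in plain Python: each "tree" is a one-split stump fitted on a bootstrap resample of the data, and the ensemble prediction is the average over all stumps. This is a deliberately simplified stand-in, not scikit-learn's RandomForestRegressor (real forests grow deeper trees and also subsample features). All function names and numbers are illustrative assumptions.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature.

    Returns (threshold, left_mean, right_mean) with the lowest squared error.
    threshold is None when no split is possible (all x values equal).
    """
    pairs = sorted(zip(xs, ys))
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)  # (sse, threshold, left, right)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if err < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (err, threshold, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None:
        return left_mean
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit n_trees stumps, each on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Toy data (made up): log income vs. vote share.
xs = [10.0, 10.2, 10.4, 10.6, 10.8, 11.0, 11.2, 11.4]
ys = [0.78, 0.74, 0.75, 0.70, 0.62, 0.60, 0.57, 0.55]

forest = fit_forest(xs, ys)
low_income_pred = predict_forest(forest, 10.1)
high_income_pred = predict_forest(forest, 11.3)
print(low_income_pred, high_income_pred)
```

Averaging many trees trained on different resamples smooths out the noise of any single tree, which is the main reason for using an ensemble.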

In [21]:
trump['Income_ln'] = np.log(trump['Income'])
# Forward-fill the single missing income value so the model receives no NaN.
trump['Income_ln'] = trump['Income_ln'].ffill()
In [22]:
from sklearn.ensemble import RandomForestRegressor as rf
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(trump[['Income_ln']].values, trump['Trump Share'].values)

skl_rf = rf(n_estimators=100).fit(x_train, y_train)
skl_rf.score(x_test, y_test)
Out[22]:
-0.31806535406376835
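
For a scikit-learn regressor, score returns the coefficient of determination, R² = 1 - SS_res / SS_tot. A negative value such as the one above therefore means the model did worse on the held-out counties than simply predicting their mean Trump share. A minimal sketch with made-up numbers (the helper r_squared and the data are illustrative assumptions):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [0.60, 0.70, 0.65, 0.75]

# Always predicting the mean gives R² of (approximately) zero.
mean_pred = [0.675] * 4
r2_mean = r_squared(y_true, mean_pred)

# Predictions that miss by more than the mean does give a negative R².
noisy_pred = [0.75, 0.58, 0.78, 0.62]
r2_noisy = r_squared(y_true, noisy_pred)

print(r2_mean, r2_noisy)
```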
In [23]:
trump['pred_skl_rf'] = skl_rf.predict(trump[['Income_ln']].values)
In [24]:
trump['pred_skl_rf'].head()
Out[24]:
0    0.707976
1    0.772681
2    0.616012
3    0.679038
4    0.750440
Name: pred_skl_rf, dtype: float64
In [25]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['Trump Share'],
                    mode='markers',
                    name='Trump Share'))
fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['pred_skl_rf'],
                    mode='markers',
                    name='RF Predicted'))

fig.update_layout(template='none', title='Income and Trump Share Predicted by Random Forest',
                  xaxis_title='Income (ln)', yaxis_title='Trump Share')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

c. K-Nearest Neighbor (KNN)

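K-Nearest Neighbor (KNN) is a simple supervised machine learning technique that predicts each case from the cases most similar to it, its nearest neighbors. For regression, as used here, the prediction is the average Trump share of the k counties closest in log income.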
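
When KNN is used for regression, as in the cell below, the prediction is the average target value of the k closest training points. A minimal plain-Python sketch on one feature (the function knn_predict and the toy numbers are illustrative assumptions; the notebook itself uses scikit-learn's KNeighborsRegressor):

```python
def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    by_distance = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - query))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Toy data (made up): log income vs. vote share.
train_x = [10.0, 10.3, 10.6, 10.9, 11.2, 11.5]
train_y = [0.80, 0.74, 0.71, 0.63, 0.58, 0.50]

# Small k follows the local neighborhood of the query point.
pred_k3 = knn_predict(train_x, train_y, query=10.5, k=3)

# With k equal to the training-set size, every query gets the overall mean.
pred_all = knn_predict(train_x, train_y, query=10.5, k=len(train_x))

print(pred_k3, pred_all)
```

A larger k averages over more counties and produces a smoother prediction curve, at the cost of ignoring local structure.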

In [26]:
from sklearn.neighbors import KNeighborsRegressor as knn
skl_knn = knn(n_neighbors=100).fit(trump[['Income_ln']].values, trump['Trump Share'].values)
trump['pred_skl_knn'] = skl_knn.predict(trump[['Income_ln']].values)
In [27]:
trump['pred_skl_knn'].head()
Out[27]:
0    0.648736
1    0.611395
2    0.650770
3    0.694707
4    0.656258
Name: pred_skl_knn, dtype: float64
In [28]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['Trump Share'],
                    mode='markers',
                    name='Trump Share'))
fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['pred_skl_knn'],
                    mode='markers',
                    name='KNN Predicted'))

fig.update_layout(template='none', title='Income and Trump Share Predicted by K-Nearest Neighbor',
                  xaxis_title='Income (ln)', yaxis_title='Trump Share')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

4. Conclusion

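These are a few very simple ML models that I learned in the NYU Data Bootcamp class. They are not the most accurate models (and may be quite wrong). In the plot, the random forest predictions look very close to the actual Trump vote share, but most of those counties were also used to train the model, and its R-squared score on the held-out test set was negative (about -0.32), so that closeness likely reflects overfitting rather than real predictive power. The OLS fit does suggest that locations with a higher Trump vote share tend to have lower income than locations with a lower Trump vote share, although log income explains only about 2% of the variation (R-squared = 0.019). Even though I am not a very political person, I found this vote analysis quite interesting!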