2014 Mental Health Survey in Tech Industry

Table of Contents

1. Data Introduction

2. Data Cleaning

2.1 Null Values

2.1.1 comments
2.1.2 states
2.1.3 work interfere
2.1.4 self employed

2.2 Data Consistency

2.2.1 Gender
2.2.2 Age

3. Data Analysis & Visualizations

3.1 Age
3.2 Gender
3.3 Country
3.4 Work Interference with Mental Health based on Gender
3.5 Work Interference with Mental Health based on the Company Type (Tech/Non-Tech)
3.6 Remote Work Environment and Work Interference
3.7 Mental Health Benefits
3.8 Mental Health vs. Physical Health
    3.8.1 Do you think that discussing a health issue with your employer would have negative consequences?
    3.8.2 Have you heard of or observed negative consequences with mental health conditions in your workplace?
3.9 Mental/Physical Health Issues in Interviews
    3.9.1 Would you bring up a mental/physical health issue with a potential employer in an interview?
3.10 Willingness to Discuss a Mental Health Issue
3.11 Word Cloud on Comments

4. Machine Learning

- Cleaning
- Encoding
- Scaling & Fitting
- Logistic Regression
- K-Neighbors Classifier
- Decision Tree Classifier
- Random Forest
- Accuracy Scores

5. Conclusion



In [171]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from collections import Counter
import matplotlib.pyplot as plt
import re
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
pd.set_option('display.max_columns', None)
In [2]:
survey = pd.read_csv('survey.csv')

Data Descriptions

1. Data Introduction

In [3]:
survey.head()
Out[3]:
Column Name Description
family_history Do you have a family history of mental illness?
treatment Have you sought treatment for a mental health condition?
work_interfere If you have a mental health condition, do you feel that it interferes with your work?
no_employees How many employees does your company or organization have?
remote_work Do you work remotely (outside of an office) at least 50% of the time?
tech_company Is your employer primarily a tech company/organization?
benefits Does your employer provide mental health benefits?
care_options Do you know the options for mental health care your employer provides?
wellness_program Has your employer ever discussed mental health as part of an employee wellness program?
seek_help Does your employer provide resources to learn more about mental health issues and how to seek help?
anonymity Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
leave How easy is it for you to take medical leave for a mental health condition?
mental_health_consequence Do you think that discussing a mental health issue with your employer would have negative consequences?
phys_health_consequence Do you think that discussing a physical health issue with your employer would have negative consequences?
coworkers Would you be willing to discuss a mental health issue with your coworkers?
supervisor Would you be willing to discuss a mental health issue with your direct supervisor(s)?
mental_health_interview Would you bring up a mental health issue with a potential employer in an interview?
phys_health_interview Would you bring up a physical health issue with a potential employer in an interview?
mental_vs_physical Do you feel that your employer takes mental health as seriously as physical health?
obs_consequence Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
comments Any additional notes or comments
Timestamp Age Gender Country state self_employed family_history treatment work_interfere no_employees remote_work tech_company benefits care_options wellness_program seek_help anonymity leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence comments
0 2014-08-27 11:29:31 37 Female United States IL NaN No Yes Often 6-25 No Yes Yes Not sure No Yes Yes Somewhat easy No No Some of them Yes No Maybe Yes No NaN
1 2014-08-27 11:29:37 44 M United States IN NaN No No Rarely More than 1000 No No Don't know No Don't know Don't know Don't know Don't know Maybe No No No No No Don't know No NaN
2 2014-08-27 11:29:44 32 Male Canada NaN NaN No No Rarely 6-25 No Yes No No No No Don't know Somewhat difficult No No Yes Yes Yes Yes No No NaN
3 2014-08-27 11:29:46 31 Male United Kingdom NaN NaN Yes Yes Often 26-100 No Yes No Yes No No No Somewhat difficult Yes Yes Some of them No Maybe Maybe No Yes NaN
4 2014-08-27 11:30:22 31 Male United States TX NaN No No Never 100-500 Yes Yes Yes No Don't know Don't know Don't know Don't know No No Some of them Yes Yes Yes Don't know No NaN
In [4]:
survey.shape
Out[4]:
(1259, 27)
In [5]:
survey.columns
Out[5]:
Index(['Timestamp', 'Age', 'Gender', 'Country', 'state', 'self_employed',
       'family_history', 'treatment', 'work_interfere', 'no_employees',
       'remote_work', 'tech_company', 'benefits', 'care_options',
       'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'comments'],
      dtype='object')
In [6]:
survey.describe()
Out[6]:
Age
count 1.259000e+03
mean 7.942815e+07
std 2.818299e+09
min -1.726000e+03
25% 2.700000e+01
50% 3.100000e+01
75% 3.600000e+01
max 1.000000e+11
In [7]:
survey.isna().sum()
Out[7]:
Timestamp                       0
Age                             0
Gender                          0
Country                         0
state                         515
self_employed                  18
family_history                  0
treatment                       0
work_interfere                264
no_employees                    0
remote_work                     0
tech_company                    0
benefits                        0
care_options                    0
wellness_program                0
seek_help                       0
anonymity                       0
leave                           0
mental_health_consequence       0
phys_health_consequence         0
coworkers                       0
supervisor                      0
mental_health_interview         0
phys_health_interview           0
mental_vs_physical              0
obs_consequence                 0
comments                     1095
dtype: int64
In [8]:
percent_missing = survey.isnull().sum() * 100 / len(survey)
null_percentage = pd.DataFrame({'Column Name': survey.columns,
                                 'Missing Percentage': percent_missing})
In [9]:
null_percentage.sort_values('Missing Percentage', ascending=False, inplace=True)
null_percentage
Out[9]:
Column Name Missing Percentage
comments comments 86.973789
state state 40.905481
work_interfere work_interfere 20.969023
self_employed self_employed 1.429706
seek_help seek_help 0.000000
obs_consequence obs_consequence 0.000000
mental_vs_physical mental_vs_physical 0.000000
phys_health_interview phys_health_interview 0.000000
mental_health_interview mental_health_interview 0.000000
supervisor supervisor 0.000000
coworkers coworkers 0.000000
phys_health_consequence phys_health_consequence 0.000000
mental_health_consequence mental_health_consequence 0.000000
leave leave 0.000000
anonymity anonymity 0.000000
Timestamp Timestamp 0.000000
wellness_program wellness_program 0.000000
Age Age 0.000000
benefits benefits 0.000000
tech_company tech_company 0.000000
remote_work remote_work 0.000000
no_employees no_employees 0.000000
treatment treatment 0.000000
family_history family_history 0.000000
Country Country 0.000000
Gender Gender 0.000000
care_options care_options 0.000000

2. Data Cleaning

1) Null Values

# comments

In [10]:
survey.loc[survey['comments'].isna() == False].head()
Out[10]:
Timestamp Age Gender Country state self_employed family_history treatment work_interfere no_employees remote_work tech_company benefits care_options wellness_program seek_help anonymity leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence comments
13 2014-08-27 11:33:26 36 Male United States CT NaN Yes No Never 500-1000 No Yes Don't know Not sure No Don't know Don't know Don't know No No Yes Yes No No Don't know No I'm not on my company's health insurance which...
15 2014-08-27 11:34:00 29 female United States IL NaN Yes Yes Rarely 26-100 No Yes Yes Not sure No No Don't know Somewhat easy No No Yes Some of them Maybe Maybe Don't know No I have chronic low-level neurological issues t...
16 2014-08-27 11:34:20 23 Male United Kingdom NaN NaN No Yes Sometimes 26-100 Yes Yes Don't know No Don't know Don't know Don't know Very easy Maybe No Some of them No Maybe Maybe No No My company does provide healthcare but not to ...
24 2014-08-27 11:36:48 33 male United States CA No Yes Yes Rarely 26-100 No Yes Yes Not sure Don't know Yes Yes Don't know No No Yes Yes No Yes Don't know No Relatively new job. Ask again later
25 2014-08-27 11:37:08 35 male United States TN No Yes Yes Sometimes More than 1000 No No Yes Yes No Don't know No Very easy Yes No Some of them Yes No Yes No No Sometimes I think about using drugs for my me...

Although about 87% of the 'comments' column is missing, which would normally justify dropping it, the remaining comments can feed a word cloud of the most common words. We will leave the column as it is for now.

# state

In [11]:
survey.loc[survey['state'].isna() == True].shape
Out[11]:
(515, 27)
In [12]:
survey.loc[(survey['state'].isna() == True) & (survey['Country'] != 'United States')].shape
Out[12]:
(504, 27)

Of the 515 rows with a NULL state, 504 belong to non-US countries such as the UK and Australia; only 11 are US rows. Since we can still filter the data to the United States and look at individual states, we will keep this column as it is.
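
The two counts can also be computed side by side. The sketch below uses a small made-up frame (the rows are illustrative, not taken from survey.csv) and assumes only the 'Country' and 'state' columns.

```python
import pandas as pd

# Toy rows for illustration only; not taken from survey.csv.
toy = pd.DataFrame({
    'Country': ['United States', 'United States', 'Canada', 'United Kingdom'],
    'state':   ['IL', None, None, None],
})

# Boolean masks: rows with a missing state, and rows from the United States.
missing_state = toy['state'].isna()
is_us = toy['Country'].eq('United States')

# Count missing states inside and outside the United States.
us_missing = int((missing_state & is_us).sum())
non_us_missing = int((missing_state & ~is_us).sum())
print(us_missing, non_us_missing)
```

On the real data the same two expressions should reproduce the 11 and 504 reported above.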

# work_interfere

In [13]:
survey['work_interfere'].value_counts()
Out[13]:
Sometimes    465
Never        213
Rarely       173
Often        144
Name: work_interfere, dtype: int64
In [14]:
survey.loc[survey['work_interfere'].isna() == True].head()
Out[14]:
Timestamp Age Gender Country state self_employed family_history treatment work_interfere no_employees remote_work tech_company benefits care_options wellness_program seek_help anonymity leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence comments
19 2014-08-27 11:35:08 36 Male France NaN Yes Yes No NaN 6-25 Yes Yes No No Yes No Yes Somewhat easy No No Some of them Some of them Maybe Maybe Don't know No NaN
26 2014-08-27 11:37:23 33 male United States TN No No No NaN 1-5 No Yes Don't know Not sure No Don't know Don't know Don't know Maybe Maybe Some of them No No No Don't know No NaN
37 2014-08-27 11:41:50 38 Male Portugal NaN No No No NaN 100-500 No Yes No Yes No No Don't know Somewhat easy Maybe No Some of them Some of them No Maybe No No NaN
38 2014-08-27 11:42:08 50 M United States IN No No No NaN 100-500 No Yes Yes Yes No Don't know Don't know Don't know No No Some of them Yes No Maybe Don't know No NaN
41 2014-08-27 11:42:31 35 Male United States MI No No No NaN More than 1000 Yes Yes Yes Not sure Don't know Yes Don't know Somewhat difficult Yes Yes Some of them No No Maybe Don't know No NaN

We will fill the NULL values with the mode of the 'work_interfere' column, which is 'Sometimes'.

In [15]:
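# Assign the result back instead of calling fillna(inplace=True) on a column selection.
survey['work_interfere'] = survey['work_interfere'].fillna(survey['work_interfere'].mode()[0])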
In [16]:
survey['work_interfere'].isna().sum()
Out[16]:
0

# self_employed

In [17]:
survey.loc[survey['self_employed'] == 'Yes']['no_employees'].value_counts()
Out[17]:
1-5               98
6-25              31
26-100             8
100-500            5
More than 1000     4
Name: no_employees, dtype: int64
In [18]:
survey.loc[survey['self_employed'] == 'No']['no_employees'].value_counts()
Out[18]:
More than 1000    277
26-100            276
6-25              253
100-500           168
1-5                62
500-1000           59
Name: no_employees, dtype: int64
In [19]:
survey.loc[survey['self_employed'].isna() == True].head()
Out[19]:
Timestamp Age Gender Country state self_employed family_history treatment work_interfere no_employees remote_work tech_company benefits care_options wellness_program seek_help anonymity leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence comments
0 2014-08-27 11:29:31 37 Female United States IL NaN No Yes Often 6-25 No Yes Yes Not sure No Yes Yes Somewhat easy No No Some of them Yes No Maybe Yes No NaN
1 2014-08-27 11:29:37 44 M United States IN NaN No No Rarely More than 1000 No No Don't know No Don't know Don't know Don't know Don't know Maybe No No No No No Don't know No NaN
2 2014-08-27 11:29:44 32 Male Canada NaN NaN No No Rarely 6-25 No Yes No No No No Don't know Somewhat difficult No No Yes Yes Yes Yes No No NaN
3 2014-08-27 11:29:46 31 Male United Kingdom NaN NaN Yes Yes Often 26-100 No Yes No Yes No No No Somewhat difficult Yes Yes Some of them No Maybe Maybe No Yes NaN
4 2014-08-27 11:30:22 31 Male United States TX NaN No No Never 100-500 Yes Yes Yes No Don't know Don't know Don't know Don't know No No Some of them Yes Yes Yes Don't know No NaN

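Based on the counts above, most self-employed respondents (98 of 146) work at companies with 1-5 employees, and 1-5 is the only size band where self-employed respondents outnumber the rest (98 vs. 62). Although this rule is not fully accurate, we will temporarily fill the NULL values with 'Yes' when the company has 1-5 employees and 'No' otherwise.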

In [20]:
survey['self_employed'].value_counts()
Out[20]:
No     1095
Yes     146
Name: self_employed, dtype: int64
In [21]:
values = survey['no_employees'].eq('1-5').map({False: 'No', True: 'Yes'})
survey['self_employed'] = survey['self_employed'].fillna(values)
In [22]:
survey['self_employed'].isna().sum()
Out[22]:
0
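
The fill relies on `fillna` accepting a Series aligned on the index, so that only the missing entries are replaced. A minimal sketch on made-up rows (not taken from the survey):

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    'self_employed': ['Yes', None, None, 'No'],
    'no_employees':  ['6-25', '1-5', '26-100', '1-5'],
})

# Guess 'Yes' for 1-5 employees and 'No' otherwise.
guess = toy['no_employees'].eq('1-5').map({True: 'Yes', False: 'No'})

# fillna with an aligned Series only touches the rows that were missing.
toy['self_employed'] = toy['self_employed'].fillna(guess)
print(toy)
```

Rows 0 and 3 keep their original answers even though the guess disagrees with them; only rows 1 and 2 take the guessed value.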

2) Data Consistency

# Gender

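The free-text Gender column contains many spellings and variants. We group the values into four categories (LGBT, Others, Female, Male) using substring matches, which also folds variants such as 'M', 'F', and 'f' into Male and Female.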

In [23]:
survey['Gender'].value_counts().to_frame().sample(2)
Out[23]:
Gender
Neuter 1
Genderqueer 1
In [24]:
LGBT = survey['Gender'].str.contains('Trans|Neuter|queer|andro|Andro|Enby|binary|trans')
survey.loc[LGBT, 'Gender'] = 'LGBT'
In [25]:
Others = survey['Gender'].str.contains('N|A|p')
survey.loc[Others, 'Gender'] = 'Others'
In [26]:
Female = survey['Gender'].str.contains('F|Wo|f|wo')
survey.loc[Female, 'Gender'] = 'Female'
In [27]:
Male = ~survey['Gender'].isin(['Female', 'Others', 'LGBT'])
survey.loc[Male, 'Gender'] = 'Male'
In [28]:
survey['Gender'].value_counts()
Out[28]:
Male      994
Female    248
LGBT       12
Others      5
Name: Gender, dtype: int64
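
Substring patterns such as 'N|A|p' match any value that contains one of those characters, so they depend on the exact spellings present in this data. An alternative sketch is an explicit lookup that leaves unknown answers for manual review. The lookup sets and the `normalize_gender` helper below are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical lookup sets; extend them after inspecting value_counts().
MALE = {'m', 'male', 'man', 'cis male', 'cis man'}
FEMALE = {'f', 'female', 'woman', 'cis female'}

def normalize_gender(raw):
    """Map a free-text answer to 'Male', 'Female', or 'Unmapped'."""
    key = str(raw).strip().lower()
    if key in MALE:
        return 'Male'
    if key in FEMALE:
        return 'Female'
    return 'Unmapped'  # left for manual review instead of guessing

toy = pd.Series(['M', 'female', ' Male ', 'Genderqueer'])
cleaned = toy.map(normalize_gender)
print(cleaned.value_counts())
```

The trade-off is more typing up front in exchange for a mapping that does not change meaning when a new spelling appears.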

# Age

In [29]:
min(survey['Age'])
Out[29]:
-1726
In [30]:
max(survey['Age'])
Out[30]:
99999999999
In [31]:
survey['Age'].mean()
Out[31]:
79428148.31135821
In [32]:
survey.loc[survey['Age'] > 80, 'Age'] = np.nan
survey.loc[survey['Age'] < 18, 'Age'] = np.nan
In [33]:
survey['Age'].mean()
Out[33]:
32.07673860911271
In [34]:
survey['Age'] = survey['Age'].fillna(value=survey['Age'].mean())
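
The same range check can be written as a single mask. This sketch runs on a made-up series that includes the two extreme values seen above; `between` is inclusive on both ends, which matches the 18 to 80 bounds used here.

```python
import pandas as pd

# Toy ages, including the two extreme values reported by min() and max().
ages = pd.Series([37, -1726, 44, 99999999999, 5, 31], dtype='float64')

# Keep ages in [18, 80]; everything else becomes NaN.
valid = ages.between(18, 80)
cleaned = ages.where(valid)

# Fill the gaps with the mean of the remaining valid ages.
filled = cleaned.fillna(cleaned.mean())
print(filled.tolist())
```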

3. Data Analysis & Visualizations

1) Age

In [35]:
fig = px.histogram(survey, x="Age")
fig.update_layout(template='none')
fig.show()

2) Gender

In [36]:
gender_df = survey.groupby('Gender').count()['Timestamp'].to_frame()\
            .reset_index().rename(columns={'Timestamp':'Count'}).sort_values(by='Count', ascending=False)
In [37]:
colors = ['mediumturquoise', 'darkorange', 'lightgreen', 'gold']

fig = go.Figure(data=[go.Pie(labels=gender_df['Gender'],
                             values=gender_df['Count'])])
fig.update_traces(hoverinfo='percent', textinfo='label', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(title='Gender')
fig.show()

3) Country

In [38]:
country_df = survey.groupby('Country').count()['Timestamp'].to_frame().reset_index().rename(columns={'Timestamp':'Count'})
In [39]:
codes = pd.read_csv('codes.csv')
codes.drop(columns='GDP (BILLIONS)', inplace=True)
code_dict = codes.set_index('COUNTRY')['CODE'].to_dict()
In [40]:
country_df['Code'] = country_df['Country'].map(code_dict)
In [41]:
fig = px.choropleth(country_df, locations="Code",
                    color="Count", 
                    hover_name="Country",
                    color_continuous_scale=px.colors.sequential.OrRd)
fig.update_layout(title='Countries')
fig.show()

4) Work Interference with Mental Health based on Gender

In [42]:
work_gender = survey.groupby(['work_interfere', 'Gender']).count()['Timestamp']\
            .to_frame().reset_index().rename(columns={'work_interfere':'Work Interference', 'Timestamp':'Count'})
In [43]:
fig = px.sunburst(work_gender, path=['Work Interference', 'Gender'], values='Count',
                 color='Work Interference', color_discrete_sequence=px.colors.qualitative.G10)
fig.update_layout(height=700, title='Work Interference Level')
fig.show()
In [44]:
work_gender_percent = work_gender.groupby(['Gender','Work Interference'])\
                    .agg({'Count':'sum'}).groupby(level=0).apply(lambda x: round(100 * x / float(x.sum()), 2))\
                    .reset_index()
work_gender_percent = work_gender_percent.rename(columns={'Count':'Percentage'})
In [45]:
fig = px.sunburst(work_gender_percent, path=['Gender', 'Work Interference'], values='Percentage',
                 color='Percentage', color_continuous_scale='PuRd')
fig.update_layout(height=700, title='Work Interference Level based on Gender')
fig.show()
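
The per-group percentages can also be computed without `apply`, by dividing each count by its group total from `transform('sum')`. This is a sketch on made-up counts, not the survey figures; the result of `groupby(...).apply(...)` may be shaped differently across pandas versions, which is why this form can be easier to keep stable.

```python
import pandas as pd

# Made-up counts for illustration only.
counts = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Male', 'Male'],
    'Work Interference': ['Often', 'Sometimes', 'Often', 'Sometimes'],
    'Count': [1, 3, 2, 2],
})

# transform('sum') returns one group total per row, aligned with the frame.
group_total = counts.groupby('Gender')['Count'].transform('sum')
counts['Percentage'] = (100 * counts['Count'] / group_total).round(2)
print(counts)
```

The same two lines would apply to the tech-company and remote-work breakdowns by swapping the grouping column.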

5) Work Interference with Mental Health based on the Company Type (Tech/Non-Tech)

In [46]:
tech_work = survey.groupby(['tech_company', 'work_interfere']).count()['Timestamp']\
            .to_frame().reset_index().rename(columns={'tech_company': 'Tech Company?', 
                                                      'work_interfere': 'Work Interference', 
                                                      'Timestamp': 'Count'})
In [47]:
tech_work = tech_work.groupby(['Tech Company?','Work Interference'])\
                    .agg({'Count':'sum'}).groupby(level=0).apply(lambda x: round(100 * x / float(x.sum()), 2))\
                    .reset_index()
tech_work = tech_work.rename(columns={'Count':'Percentage'})
In [48]:
fig = px.bar(tech_work, x="Percentage", y="Tech Company?", color='Work Interference', orientation='h',
             height=400, title='Is company directly related to Tech?', text='Percentage',
             color_discrete_sequence=px.colors.qualitative.G10)
fig.update_layout(template='none', hovermode='closest')
fig.update_traces(texttemplate='%{text}%')
fig.show()

6) Remote Work Environment and Work Interference

In [49]:
remote_work = survey.groupby(['remote_work', 'work_interfere']).count()['Timestamp']\
            .to_frame().reset_index().rename(columns={'remote_work': 'Work From Home?', 
                                                      'work_interfere': 'Work Interference', 
                                                      'Timestamp': 'Count'})
In [50]:
remote_work = remote_work.groupby(['Work From Home?','Work Interference'])\
                    .agg({'Count':'sum'}).groupby(level=0).apply(lambda x: round(100 * x / float(x.sum()), 2))\
                    .reset_index()
remote_work = remote_work.rename(columns={'Count':'Percentage'})
In [51]:
fig = px.bar(remote_work, x="Percentage", y="Work From Home?", color='Work Interference', orientation='h',
             height=400, title='Remote Work Environment', text='Percentage',
             color_discrete_sequence=px.colors.qualitative.G10)
fig.update_layout(template='none', hovermode='closest')
fig.update_traces(texttemplate='%{text}%')
fig.show()

7) Mental Health Benefits

In [52]:
benefits = survey.groupby('benefits').count()['Timestamp'].to_frame().\
            reset_index().rename(columns={'benefits': 'Benefits', 'Timestamp': 'Count'})
care_options = survey.groupby('care_options').count()['Timestamp'].to_frame().\
            reset_index().rename(columns={'care_options': 'Care Options', 'Timestamp': 'Count'})
wellness_program = survey.groupby('wellness_program').count()['Timestamp'].to_frame().\
            reset_index().rename(columns={'wellness_program': 'Wellness Program', 'Timestamp': 'Count'})
help_resource = survey.groupby('seek_help').count()['Timestamp'].to_frame().\
            reset_index().rename(columns={'seek_help': 'Help Resource', 'Timestamp': 'Count'})
In [53]:
colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen']
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]]
fig = make_subplots(rows=2, cols=2, specs=specs)

fig.add_trace(go.Pie(labels=benefits['Benefits'], 
                     values=benefits['Count'], 
                     name="Benefits"), 1, 1)
fig.add_trace(go.Pie(labels=care_options['Care Options'], 
                     values=care_options['Count'], 
                     name="Care Options"), 1, 2)
fig.add_trace(go.Pie(labels=wellness_program['Wellness Program'], 
                     values=wellness_program['Count'], 
                     name="Wellness Program"), 2, 1)
fig.add_trace(go.Pie(labels=help_resource['Help Resource'], 
                     values=help_resource['Count'], 
                     name="Help Resource"), 2, 2)

fig.update_traces(hole=.5, hoverinfo="label+percent+name",
                  marker=dict(line=dict(color='#000000', width=2)))
fig.update_layout(title_text="Mental Health Benefits", height=800, width=950,
                  annotations=[dict(text='Benefits', x=0.168, y=0.815, font_size=20, showarrow=False),
                               dict(text='Care<br>Options', x=0.828, y=0.822, font_size=20, showarrow=False),
                               dict(text='Wellness<br>Program', x=0.168, y=0.18, font_size=20, showarrow=False),
                               dict(text='Help<br>Resource', x=0.833, y=0.18, font_size=20, showarrow=False)])
fig.show()

8) Mental Health vs. Physical Health

Do you think that discussing a mental/physical health issue with your employer would have negative consequences?

In [54]:
mental = survey['mental_health_consequence'].value_counts().to_frame().reset_index()\
        .rename(columns={'index': 'Mental', 'mental_health_consequence': 'Count'})
physical = survey['phys_health_consequence'].value_counts().to_frame().reset_index()\
        .rename(columns={'index': 'Physical', 'phys_health_consequence': 'Count'})
In [55]:
fig = go.Figure()
fig.add_trace(go.Bar(y=mental['Mental'], x=mental['Count'], 
                     name='Mental', orientation='h',
                     marker=dict(color='rgba(246, 78, 139, 0.6)',
                                 line=dict(color='rgba(246, 78, 139, 1.0)', width=3))))
fig.add_trace(go.Bar(y=physical['Physical'], x=physical['Count'],
                     name='Physical', orientation='h',
                     marker=dict(color='rgba(58, 71, 80, 0.6)', 
                                 line=dict(color='rgba(58, 71, 80, 1.0)', width=3))))

fig.update_layout(barmode='stack', template='none', hovermode='closest',
                 title='Mental or Physical Health results in Negative Consequences?')
fig.update_xaxes(showgrid=False)
fig.show()

Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

In [56]:
observations = survey['obs_consequence'].value_counts().to_frame().reset_index()\
                .rename(columns={'index':'Observations', 'obs_consequence': 'Count'})
In [57]:
fig = go.Figure()
fig.add_trace(go.Bar(y=mental['Mental'], x=mental['Count'], 
                     name='Assumptions', orientation='h',
                     marker=dict(color='rgba(161, 191, 133, 0.6)',
                                 line=dict(color='rgba(161, 191, 133, 1.0)', width=3))))
fig.add_trace(go.Bar(y=observations['Observations'], x=observations['Count'],
                     name='Consequences', orientation='h',
                     marker=dict(color='rgba(35, 34, 45, 0.6)', 
                                 line=dict(color='rgba(35, 34, 45, 1.0)', width=3))))

fig.update_layout(barmode='stack', template='none', hovermode='closest',
                 title='Mental Health Condition results in Negative Consequences?')
fig.update_xaxes(showgrid=False)
fig.show()

9) Mental/Physical Health Issues in Interviews

Would you bring up a mental/physical health issue with a potential employer in an interview?

In [58]:
mental_int = survey['mental_health_interview'].value_counts().to_frame().reset_index()\
        .rename(columns={'index': 'Mental', 'mental_health_interview': 'Count'})
physical_int = survey['phys_health_interview'].value_counts().to_frame().reset_index()\
        .rename(columns={'index': 'Physical', 'phys_health_interview': 'Count'})
In [59]:
fig = go.Figure()
fig.add_trace(go.Bar(y=mental_int['Mental'], x=mental_int['Count'], 
                     name='Mental', orientation='h',
                     marker=dict(color='rgba(246, 78, 139, 0.6)',
                                 line=dict(color='rgba(246, 78, 139, 1.0)', width=3))))
fig.add_trace(go.Bar(y=physical_int['Physical'], x=physical_int['Count'],
                     name='Physical', orientation='h',
                     marker=dict(color='rgba(58, 71, 80, 0.6)', 
                                 line=dict(color='rgba(58, 71, 80, 1.0)', width=3))))

fig.update_layout(barmode='stack', template='none', hovermode='closest',
                 title='Mental or Physical Health discussion in Interview Process?')
fig.update_xaxes(showgrid=False)
fig.show()

10) Willingness to Discuss a Mental Health Issue

Would you be willing to discuss a mental health issue with your coworkers?

In [60]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=survey['coworkers'], name='Co-workers'))
fig.add_trace(go.Histogram(x=survey['supervisor'], name='Supervisors'))
fig.update_traces(opacity=0.75)
fig.update_layout(template='none', hovermode='closest', 
                  title='Willingness to Discuss a Mental Health Issue')
fig.show()

11) Word Cloud on Comments

In [61]:
comments = list(survey['comments'].dropna())
In [62]:
comment = []
for i in comments:
    i = list(i.split(' '))
    for j in i:
        comment.append(j.replace(' ', ''))
        
com = Counter(comment)
lower_com =  {k.lower(): v for k, v in com.items()}
In [63]:
conjunction = ['Although', 'As if', 'Because', 'Even', 'Even though', 'If then', 'In order that', 'Now when', 'Rather than']
conjunction = [i.lower() for i in conjunction]
# Note: the comma after 'lot' was missing, which silently merged 'lot' and "i've" into one string.
other_words = ['i\'m', 'not', 'on', 'the', 'my', 'be', 'when', 'which', 'could', 'would', 'a', 'an', 'to', 'too',
              'so', 'many', 'of', 'don\'t', 'have', 'that', 'also', 'did', 'get', 'may', 'lot', 'i\'ve', 'i',
              'this', 'however', 'based', 'doesn\'t', 'it\'', 'than', 'including', 'non', 'however\'', 'than\'',
              'including\'', 'my\'', 'them', 'does', 'though', 'they', 'we', 'know']
all_words = conjunction + other_words
In [64]:
formatted_comments = []
for key, value in lower_com.items():
    if len(key) > 3 and key not in all_words:
        formatted_comments.append(key)
In [65]:
formatted_comments_apo = []
for comment in formatted_comments:
    formatted_comments_apo.append(re.sub(r"'[^.\s]*", "", comment))
In [66]:
plt.rcParams['figure.figsize'] = (15, 10)
wordcloud = WordCloud(max_font_size=50, 
                      max_words=60,
                      stopwords=all_words,
                      background_color="white").generate(' '.join(formatted_comments_apo))  # join words into plain text rather than str(list)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
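
The tokenizing and filtering steps above can be folded into one helper that returns word frequencies directly. This is a sketch using only the standard library; `count_words` and `STOPWORDS_DEMO` are hypothetical names, and the two comments are made up.

```python
import re
from collections import Counter

# Illustrative stopword subset; the notebook's all_words list could be used instead.
STOPWORDS_DEMO = {'the', 'my', 'not', 'with', 'that'}

def count_words(comments, min_len=4):
    """Count lowercase words of at least min_len letters, skipping stopwords."""
    counts = Counter()
    for text in comments:
        for token in re.findall(r"[a-z']+", text.lower()):
            word = token.split("'")[0]  # drop anything after an apostrophe
            if len(word) >= min_len and word not in STOPWORDS_DEMO:
                counts[word] += 1
    return counts

demo = [
    "My company does not cover therapy.",
    "Therapy helped, but company insurance is limited.",
]
word_counts = count_words(demo)
print(word_counts.most_common(3))
```

A frequency mapping like this can be passed to `WordCloud.generate_from_frequencies`, which sizes words by how often they occur rather than by one appearance per unique word.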

4. Machine Learning

In [67]:
train_df = pd.read_csv('survey.csv')
In [68]:
train_df = train_df.drop(['comments'], axis= 1)
train_df = train_df.drop(['state'], axis= 1)
train_df = train_df.drop(['Timestamp'], axis= 1)

Cleaning all NaN values

In [69]:
train_df.columns
Out[69]:
Index(['Age', 'Gender', 'Country', 'self_employed', 'family_history',
       'treatment', 'work_interfere', 'no_employees', 'remote_work',
       'tech_company', 'benefits', 'care_options', 'wellness_program',
       'seek_help', 'anonymity', 'leave', 'mental_health_consequence',
       'phys_health_consequence', 'coworkers', 'supervisor',
       'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence'],
      dtype='object')
In [70]:
int_features = []
float_features = [] 
string_features = []

for column in train_df.columns:
    if isinstance(train_df[column][0], np.integer):
        int_features.append(column)
    # np.float was removed in NumPy 1.24; np.floating matches NumPy float scalars.
    elif isinstance(train_df[column][0], np.floating):
        float_features.append(column)
    else:
        string_features.append(column)    
In [71]:
null_int = 0
null_string = 'NaN'
null_float = 0

for feature in train_df:
    if feature in int_features:
        train_df[feature] = train_df[feature].fillna(null_int)
    elif feature in string_features:
        train_df[feature] = train_df[feature].fillna(null_string)
    elif feature in float_features:
        train_df[feature] = train_df[feature].fillna(null_float)
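
A shorter way to split columns by type is `select_dtypes`, which inspects the column dtype instead of the first value. The sketch below uses a made-up frame; the 'score' column is hypothetical and only there to show a float column.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; 'score' is a hypothetical float column.
toy = pd.DataFrame({
    'Age': [37, 44, 32],
    'score': [0.5, np.nan, 1.5],
    'Gender': ['Female', None, 'Male'],
})

# Split the columns by dtype rather than by inspecting row 0.
numeric_cols = toy.select_dtypes(include='number').columns
text_cols = toy.select_dtypes(exclude='number').columns

# Same placeholders as above: 0 for numbers, the string 'NaN' for text.
toy[numeric_cols] = toy[numeric_cols].fillna(0)
toy[text_cols] = toy[text_cols].fillna('NaN')
print(toy)
```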
In [72]:
train_df['Gender'] = survey['Gender']
genders = ['Female', 'Male']
trans = ~train_df['Gender'].isin(genders)
train_df.loc[trans, 'Gender'] = 'trans'
train_df['Gender'] = train_df['Gender'].str.lower()
In [76]:
# Replace missing and implausible ages (under 18 or over 120) with the median.
median_age = train_df['Age'].median()
train_df['Age'] = train_df['Age'].fillna(median_age)
train_df.loc[(train_df['Age'] < 18) | (train_df['Age'] > 120), 'Age'] = median_age

train_df['age_range'] = pd.cut(train_df['Age'], [0,20,30,65,100], 
                               labels=["0-20", "21-30", "31-65", "66-100"], 
                               include_lowest=True)
In [79]:
train_df['self_employed'] = train_df['self_employed'].replace(np.nan, 'No')
train_df['self_employed'] = train_df['self_employed'].replace(0, 'No')
In [83]:
train_df['self_employed'].value_counts()
Out[83]:
No     1113
Yes     146
Name: self_employed, dtype: int64
In [88]:
train_df['work_interfere'] = train_df['work_interfere'].replace(np.nan, 'Don\'t know')
train_df['work_interfere'] = train_df['work_interfere'].replace('NaN', 'Don\'t know')

Encoding Data

In [89]:
label_dict = {}
for feature in train_df:
    
    encoder = preprocessing.LabelEncoder()
    encoder.fit(train_df[feature])
    name_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
    train_df[feature] = encoder.transform(train_df[feature])
    
    label_key = 'label_' + feature
    label_value = [*name_mapping]
    
    label_dict[label_key] = label_value
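
It helps to look at what `LabelEncoder` does to a single column. The sketch below encodes a few made-up 'work_interfere' answers: the classes are sorted alphabetically, so the integer codes do not follow the natural order of the answers (Never, Rarely, Sometimes, Often).

```python
from sklearn.preprocessing import LabelEncoder

# Made-up answers for illustration only.
answers = ['Often', 'Never', 'Sometimes', 'Never']

encoder = LabelEncoder()
codes = encoder.fit_transform(answers)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoder.inverse_transform([2]))
```

For tree-based models this ordering usually matters little, but for logistic regression or KNN an explicit ordinal mapping may be worth considering.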
In [103]:
train_df = train_df.drop(['Country'], axis= 1)
In [110]:
corr = train_df.corr()

fig = go.Figure(data=go.Heatmap(z=corr, x=corr.index, y=corr.columns, 
                                hoverongaps=False))
fig.update_layout(title='Correlations Between Columns', width=800, height=800)
fig.show()

Scaling and Fitting

In [112]:
scaler = MinMaxScaler()
train_df['Age'] = scaler.fit_transform(train_df[['Age']])
train_df.head()
Out[112]:
Age Gender self_employed family_history treatment work_interfere no_employees remote_work tech_company benefits care_options wellness_program seek_help anonymity leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence age_range
0 0.431818 0 0 0 1 2 4 0 1 2 1 1 2 2 2 1 1 1 2 1 0 2 0 2
1 0.590909 1 0 0 0 3 5 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 2
2 0.318182 1 0 0 0 3 4 0 1 1 0 1 1 0 1 1 1 2 2 2 2 1 0 2
3 0.295455 1 0 1 1 2 2 0 1 1 2 1 1 1 1 2 2 1 0 0 0 1 1 2
4 0.295455 1 0 0 0 1 1 1 1 2 0 0 0 0 0 1 1 1 2 2 2 0 0 2
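
For reference, min-max scaling maps a column onto the range 0 to 1 by subtracting the minimum and dividing by the range. Here is a small sketch with hypothetical values (the `ages` array is made up for illustration) that reproduces what MinMaxScaler does with its default `feature_range=(0, 1)`:

```python
import numpy as np

# Hypothetical encoded ages; not taken from the survey.
ages = np.array([0.0, 11.0, 19.0, 26.0, 44.0])

# Min-max scaling: subtract the minimum, then divide by the range.
scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(scaled.round(6))
```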
In [113]:
features = ['Age', 'Gender', 'family_history', 'benefits', 'care_options', 'anonymity', 'leave', 'work_interfere']
X = train_df[features]
y = train_df.treatment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

method_dict = {}
rmse_dict = {}

Feature Importances (Extra Trees)

In [114]:
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Feature names ordered to match the sorted importances
labels = [features[i] for i in indices]
In [150]:
fig = go.Figure()
fig.add_trace(go.Bar(x=X.columns[indices], y=importances[indices],
                     error_y=dict(type='data', color='olive', array=std[indices])))
fig.update_traces(marker_color='darkred', marker_line_color='darkred',
                  marker_line_width=1.5, opacity=0.6)
fig.update_layout(barmode='group', template='none', title='Feature Importances')
fig.show()
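
When importances are sorted with np.argsort, the feature names need the same reordering, otherwise each bar ends up under the wrong label. Here is a minimal sketch with made-up numbers (both arrays are hypothetical) that shows the pairing:

```python
import numpy as np

# Hypothetical feature names and importances, for illustration only.
names = np.array(["Age", "Gender", "family_history", "work_interfere"])
importances = np.array([0.30, 0.05, 0.15, 0.50])

# argsort gives ascending order; reversing it ranks from most to least important.
order = np.argsort(importances)[::-1]

# Apply the same index array to both so each value keeps its own label.
ranked = list(zip(names[order].tolist(), importances[order].tolist()))
for name, value in ranked:
    print(f"{name}: {value:.2f}")
```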

Logistic Regression

In [153]:
# liblinear was the default solver before scikit-learn 0.22; naming it
# explicitly keeps the result unchanged and silences the FutureWarning.
log_reg = LogisticRegression(solver='liblinear')
log_reg.fit(X_train, y_train)
y_pred_class = log_reg.predict(X_test)

log_reg_accuracy_score = metrics.accuracy_score(y_test, y_pred_class)
log_reg_accuracy_score
Out[153]:
0.7883597883597884

KNeighbors Classifier

In [157]:
knn = KNeighborsClassifier(n_neighbors=27, weights='uniform')
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)

knn_accuracy_score = metrics.accuracy_score(y_test, y_pred_class)
knn_accuracy_score
Out[157]:
0.8068783068783069

Decision Tree Classifier

In [158]:
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=8, max_features=6, 
                              criterion='entropy', min_samples_leaf=7)
tree.fit(X_train, y_train)
y_pred_class = tree.predict(X_test)

tree_accuracy_score = metrics.accuracy_score(y_test, y_pred_class)
tree_accuracy_score
Out[158]:
0.8095238095238095

Random Forest

In [159]:
rf = RandomForestClassifier(max_depth=None, min_samples_leaf=8, min_samples_split=2,
                            n_estimators=20, random_state=1)
rf.fit(X_train, y_train)
y_pred_class = rf.predict(X_test)

rf_accuracy_score = metrics.accuracy_score(y_test, y_pred_class)
rf_accuracy_score
Out[159]:
0.8148148148148148

Visualization of Accuracy Scores of All Models

In [161]:
scores = {'Name': ['Logistic Regression', 'KNeighbors Classifier', 'Decision Tree Classifier', 'Random Forest'], 
         'Score': [0.788, 0.807, 0.81, 0.815]}
scores = pd.DataFrame(data=scores)
In [170]:
fig = px.bar(scores, x='Name', y='Score', color='Score',
            color_continuous_scale=px.colors.sequential.Purples)
fig.update_layout(template='none', title='Accuracy Scores')
fig.update_xaxes(title='')
fig.show()
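
The four scores are close together, so it may be worth checking how much of the gap could be sampling noise from a single test split. The sketch below computes a rough normal-approximation interval for each accuracy. The helper `accuracy_interval` is hypothetical, and the test-set size of 378 is an assumption inferred from the outputs above (for example, 0.8148148... equals 308 / 378).

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_test samples."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return accuracy - z * se, accuracy + z * se

# Assumed test-set size, inferred from the printed accuracies; adjust if the split differs.
n_test = 378
model_scores = {
    "Logistic Regression": 0.788,
    "KNeighbors Classifier": 0.807,
    "Decision Tree Classifier": 0.810,
    "Random Forest": 0.815,
}
intervals = {name: accuracy_interval(acc, n_test) for name, acc in model_scores.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.3f} to {high:.3f}")
```

Under these assumptions the intervals overlap, which suggests the ranking of the models should be read with caution. Averaging over several splits, for instance with the `cross_val_score` function imported at the top, would likely give a steadier comparison.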

5. Conclusion

Well, I'm very excited to have finished this analysis of the mental health survey in the technology industry. As someone who expects to work in this industry for many years to come, I have always been curious about people's mental health. I could also see that there were more male respondents than female or LGBT respondents in the field, although since the data is from 2014, this may no longer hold. I found it interesting that people identified themselves as cisgender, non-binary, and so on, because I only recently started learning about different gender identities. Another interesting finding about gender was that respondents who did not belong to the male, female, or LGBT groups answered that their mental health interfered with work (often or sometimes) more than the other groups did. I will need a larger dataset to check this, though.

Also, whether or not the company was directly related to technology, and whether or not people worked remotely, respondents answered similarly. It also turned out that many employees were not aware of the wellness programs or mental health resources offered by their company.

I tried using WordCloud on the comments, but I need to find a better way to deal with long sentences. I also need to keep studying machine learning, so that I have a better understanding when I follow tutorials or build models.
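
One possible way to handle long comments is to split them into individual words and count those, rather than passing whole sentences along. This is only a sketch: the `word_frequencies` helper, the small stop-word set, and the example comments are all made up for illustration.

```python
import re
from collections import Counter

# A small hypothetical stop-word list; a real run might use wordcloud.STOPWORDS instead.
STOP = {"i", "the", "a", "to", "and", "my", "is", "it", "of", "at", "in"}

def word_frequencies(comments):
    """Count lower-cased words across all comments, skipping stop words."""
    counts = Counter()
    for text in comments:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP)
    return counts

# Made-up comments, not taken from the survey.
comments = [
    "My employer is supportive and the insurance covers therapy.",
    "I would not discuss therapy at work.",
]
freqs = word_frequencies(comments)
print(freqs.most_common(3))
```

A counter like this could then be handed to `WordCloud.generate_from_frequencies`, so the cloud is built from word counts instead of raw sentences.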

In conclusion, this was a very interesting analysis, and I hope to do it again once I get more recent data on this topic!