1. Introduction
According to Workpermit.com, The US H-1B visa is a non-immigrant visa that allows US companies to employ graduate-level workers in specialty occupations that require theoretical or technical expertise in specialized fields such as in IT, finance, accounting, architecture, engineering, mathematics, science, medicine, etc.
As for this report, we would like to show the trend of the H1B petition over the years from 2008 to 2018 and potentia opportunities for international students to focus when they are applying for jobs in the United States.
Due to the Coronavirus, companies are laying off, and job searching is hard, but we still need to stay strong and be hopeful!
2. Data Cleaning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_columns', None)
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
fy2008 = pd.read_csv('H-1B_Case_Data_FY2008.csv')
fy2008 = fy2008[['CASE_NO','APPROVAL_STATUS','SUBMITTED_DATE','DOL_DECISION_DATE','NAME',
'CITY','OCCUPATIONAL_TITLE',
'STATE','POSTAL_CODE','JOB_CODE','JOB_TITLE',
'RATE_PER_1','PART_TIME_1',
'PREVAILING_WAGE_1']]
#Convert Part-time data to full-time to align with other data
fy2008['PART_TIME_1'] = fy2008['PART_TIME_1'].replace({'N': 'Y', 'Y': 'N'})
newnames08 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'DECISION_DATE','EMPLOYER_NAME',
'CITY','SOC_NAME','STATE','POSTAL_CODE','SOC_CODE','JOB_TITLE',
'WAGE_UNIT','FULL_TIME','PREVAILING_WAGE']
fy2008.columns = newnames08
fy2009 = pd.read_csv('Icert_ LCA_ FY2009.csv',dtype='object')
fy2009 = fy2009[['LCA_CASE_NUMBER','STATUS','LCA_CASE_SUBMIT','Decision_Date',
'LCA_CASE_EMPLOYER_NAME', 'LCA_CASE_EMPLOYER_CITY',
'LCA_CASE_EMPLOYER_STATE', 'LCA_CASE_EMPLOYER_POSTAL_CODE','LCA_CASE_SOC_CODE','LCA_CASE_SOC_NAME',
'LCA_CASE_JOB_TITLE','FULL_TIME_POS', 'PW_1',
'PW_UNIT_1' ]]
newnames09 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'DECISION_DATE','EMPLOYER_NAME',
'CITY','STATE','POSTAL_CODE','SOC_CODE','SOC_NAME','JOB_TITLE',
'FULL_TIME','PREVAILING_WAGE','WAGE_UNIT']
fy2009.columns = newnames09
fy2010 = pd.read_csv('H-1B_FY2010.csv',dtype='object')
fy2010 = fy2010[['LCA_CASE_NUMBER', 'STATUS', 'LCA_CASE_SUBMIT','DECISION_DATE','LCA_CASE_EMPLOYER_NAME',
'LCA_CASE_EMPLOYER_CITY','LCA_CASE_EMPLOYER_STATE', 'LCA_CASE_EMPLOYER_POSTAL_CODE','LCA_CASE_SOC_CODE','LCA_CASE_SOC_NAME',
'LCA_CASE_JOB_TITLE','PW_1','PW_UNIT_1' ]]
#Missing full_time information here
newnames10 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'DECISION_DATE','EMPLOYER_NAME',
'CITY','STATE','POSTAL_CODE','SOC_CODE','SOC_NAME','JOB_TITLE',
'PREVAILING_WAGE','WAGE_UNIT']
fy2010.columns = newnames10
H1B08_10 = pd.concat([fy2008,fy2009,fy2010],sort=False)
H1B11_14 = pd.concat(map(pd.read_csv, ['H-1B_iCert_LCA_FY2011_Q4 (1).csv','LCA_FY2012_Q4.csv','LCA_FY2013.csv','H-1B_FY2014.csv']))
H1B11_14 = H1B11_14[['LCA_CASE_NUMBER','STATUS','LCA_CASE_SUBMIT','LCA_CASE_JOB_TITLE','DECISION_DATE',
'LCA_CASE_EMPLOYER_NAME','LCA_CASE_EMPLOYER_CITY','LCA_CASE_EMPLOYER_STATE','LCA_CASE_EMPLOYER_POSTAL_CODE',
'FULL_TIME_POS','LCA_CASE_SOC_CODE','LCA_CASE_SOC_NAME','PW_1',
'LCA_CASE_WAGE_RATE_UNIT']]
newnames11 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'JOB_TITLE','DECISION_DATE','EMPLOYER_NAME',
'CITY','STATE','POSTAL_CODE', 'FULL_TIME','SOC_CODE','SOC_NAME',
'PREVAILING_WAGE','WAGE_UNIT']
H1B11_14.columns = newnames11
H1B15_18 = pd.concat(map(pd.read_csv, ['FY2015.csv','FY2016.csv','FY2017.csv','FY2018.csv']))
H1B15_18 = H1B15_18[['CASE_NUMBER','CASE_STATUS', 'CASE_SUBMITTED','JOB_TITLE','DECISION_DATE',
'EMPLOYER_NAME','EMPLOYER_CITY','EMPLOYER_STATE','EMPLOYER_POSTAL_CODE',
'FULL_TIME_POSITION','SOC_CODE','SOC_NAME','PREVAILING_WAGE',
'WAGE_UNIT_OF_PAY']]
newnames15 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'JOB_TITLE','DECISION_DATE','EMPLOYER_NAME',
'CITY','STATE','POSTAL_CODE','FULL_TIME','SOC_CODE','SOC_NAME',
'PREVAILING_WAGE','WAGE_UNIT']
H1B15_18.columns = newnames15
h1b = pd.concat([H1B08_10,H1B11_14,H1B15_18])
h1b
h1b.dtypes
h1b.shape
At this point we combined all the H1B data for 11 years, then we will start data cleaning.
h1b['CASE_STATUS'].value_counts()
h1b.drop(h1b.loc[h1b['CASE_STATUS'] == 'PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED'].index, inplace=True)
h1b.loc[h1b['CASE_STATUS']=='REJECTED','CASE_STATUS'] = 'DENIED'
h1b.loc[h1b['CASE_STATUS']=='INVALIDATED','CASE_STATUS'] = 'DENIED'
h1b['WAGE_UNIT'].value_counts()
h1b.loc[h1b['WAGE_UNIT']=='yr','WAGE_UNIT'] = 'Year'
h1b.loc[h1b['WAGE_UNIT']=='hr','WAGE_UNIT'] = 'Hour'
h1b.loc[h1b['WAGE_UNIT']=='mth','WAGE_UNIT'] = 'Month'
h1b.loc[h1b['WAGE_UNIT']=='wk','WAGE_UNIT'] = 'Week'
h1b.loc[h1b['WAGE_UNIT']=='bi','WAGE_UNIT'] = 'Bi-Weekly'
h1b.loc[h1b['WAGE_UNIT']== 'Select Pay Range']
h1b.drop(h1b.loc[h1b['WAGE_UNIT']== 'Select Pay Range'].index, inplace=True)
h1b.drop(h1b.loc[h1b['PREVAILING_WAGE'] == '20-70'].index, inplace=True)
h1b.drop(h1b.loc[h1b['PREVAILING_WAGE'] == '.'].index, inplace=True)
h1b['PREVAILING_WAGE'] = h1b['PREVAILING_WAGE'].astype('str').str.replace(',','').str.replace('$','').str.replace(' ','')
# Change hourly wage into yearly wage
h1b.at[h1b['WAGE_UNIT'] == 'Hour', 'PREVAILING_WAGE'] = h1b.loc[h1b[
'WAGE_UNIT'] == 'Hour']['PREVAILING_WAGE'].astype(float).apply(lambda x: x * 37.5 * 52)
# Change montly wage into yearly wage
h1b.at[h1b['WAGE_UNIT'] == 'Month', 'PREVAILING_WAGE'] = h1b.loc[h1b[
'WAGE_UNIT'] == 'Month']['PREVAILING_WAGE'].astype(float).apply(lambda x: x * 12)
# Change weekly into yearly wage
h1b.at[h1b['WAGE_UNIT'] == 'Week', 'PREVAILING_WAGE'] = h1b.loc[h1b[
'WAGE_UNIT'] == 'Week']['PREVAILING_WAGE'].astype(float).apply(lambda x: x * 52)
# Change bi-weekly into yearly wage
h1b.at[h1b['WAGE_UNIT'] == 'Bi-Weekly', 'PREVAILING_WAGE'] = h1b.loc[h1b[
'WAGE_UNIT'] == 'Bi-Weekly']['PREVAILING_WAGE'].astype(float).apply(lambda x: x * 26)
# Change all columns into float
h1b['PREVAILING_WAGE'] = h1b['PREVAILING_WAGE'].apply(lambda x: float(x.replace(',', '')) if isinstance(x, str) else x)
# Drop the wage unit column
h1b.drop(columns='WAGE_UNIT', inplace=True)
#drop wages not inbetween 10,000 and 2,000,000 as these were put with wrong wage units
h1b.drop(h1b.loc[(h1b['PREVAILING_WAGE'] > 2000000) | (h1b['PREVAILING_WAGE'] < 10000)].index, inplace=True)
h1b['STATE']
h1b['FSTATE'] = h1b['STATE']
usStateAbbr = {
'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California',
'CO': 'Colorado', 'CT': 'Connecticut', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia',
'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa', 'KS': 'Kansas',
'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland', 'MA': 'Massachusetts',
'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi', 'MO': 'Missouri', 'MT': 'Montana',
'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico',
'NY': 'New York', 'NC': 'North Carolina', 'ND': 'North Dakota', 'OH': 'Ohio', 'OK': 'Oklahoma',
'OR': 'Oregon', 'PA': 'Pennsylvania', 'RI': 'Rhode Island', 'SC': 'South Carolina',
'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VT': 'Vermont',
'VA': 'Virginia', 'WA': 'Washington', 'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming'}
h1b['FSTATE'] = h1b['FSTATE'].map(usStateAbbr).str.upper()
h1b['DECISION_DATE'] = pd.to_datetime(h1b['DECISION_DATE'])
h1b['CASE_SUBMITTED'] = pd.to_datetime(h1b['CASE_SUBMITTED'])
h1b = h1b.dropna(subset=['DECISION_DATE','CASE_SUBMITTED'])
# Find one erro in SOC Name and cleaned here
h1b.loc[h1b['SOC_NAME'] == '15-1121', 'SOC_NAME'] = 'COMPUTER SYSTEMS ANALYSTS'
h1b['SOC_NAME'].value_counts().head(20)
h1b.loc[h1b['SOC_NAME']=='Computer Programmers','SOC_NAME'] = 'COMPUTER PROGRAMMERS'
h1b.loc[h1b['SOC_NAME']=='Computer Systems Analysts','SOC_NAME'] = 'COMPUTER SYSTEMS ANALYSTS'
h1b.loc[h1b['SOC_NAME']=='COMPUTER SYSTEMS ANALYST','SOC_NAME'] = 'COMPUTER SYSTEMS ANALYSTS'
h1b.loc[h1b['SOC_NAME']=='Computer Software Engineers, Applications','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='SOFTWARE DEVELOPERS, SYSTEMS SOFTWARE','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='Software Developers, Applications','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='SOFTWARE DEVELOPERS, APPLICATIONS ','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='Software Developers, Systems Software','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='Computer Software Engineers, Systems Software','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='Computer Occupations, All Other','SOC_NAME'] = 'COMPUTER OCCUPATIONS, ALL OTHER'
h1b.loc[h1b['SOC_NAME']=='Financial Analysts','SOC_NAME'] = 'FINANCIAL ANALYSTS'
h1b
h1b.dtypes
3. Data Analysis and Visualization
Since 2017, the Trump administration has announced series reformation on H1B visa policy, it changes the landscape a lot. In the following analysis, we will compare the 2017 to 2018 data to visualize these changes.
decisionDaysTook = h1b['DECISION_DATE'] - h1b['CASE_SUBMITTED']
decisionDaysTook = decisionDaysTook.dropna().dt.days
days_over_30 = len(decisionDaysTook[(decisionDaysTook > 30) & (decisionDaysTook < 90)])
days_over_90 = len(decisionDaysTook[(decisionDaysTook > 90)])
top3_dec = decisionDaysTook.value_counts().iloc[:3]
top5 = pd.Series(days_over_30, index =['30 to 90'])
top4 = pd.Series(days_over_90, index =['90 or more'])
top4n5 = top4.append(top5)
top_days = top3_dec.append(top4n5)
fig, ax = plt.subplots(figsize=(12, 4))
sns.barplot(y=top_days.index, x=top_days.values);
plt.title("Number of Days Taken for the Result", fontsize=15, fontweight='bold')
plt.ylabel("Days")
plt.xlabel("Number of Cases")
plt.legend(['Cases']);
Here we used case submitted date and decision date to calculate the number of days taken for the decision. The majority of decisions were made less than a week.
h1b['YEAR'] = h1b['DECISION_DATE'].dt.year.astype('str')
H1B08_16 = h1b[~h1b["YEAR"].isin(["2017","2018"]) ]
H1B17_18 = h1b[h1b["YEAR"].isin(["2017","2018"]) ]
fig,ax = plt.subplots(nrows=1,ncols=3,sharey=False)
caseStatusSum = h1b.groupby(['CASE_STATUS'])[['CASE_NO']].count()
caseStatusb17 = H1B08_16.groupby(['CASE_STATUS'])[['CASE_NO']].count()
caseStatusa17 = H1B17_18.groupby(['CASE_STATUS'])[['CASE_NO']].count()
caseStatusSum.plot.pie(ax=ax[0],figsize=(15,10),y='CASE_NO', explode=(0.1, 0.1, 0.1, 0.1), autopct='%1.1f%%', shadow=True)
caseStatusb17.plot.pie(ax=ax[1],figsize=(15,10),y='CASE_NO', explode=(0.1, 0.1, 0.1, 0.1), autopct='%1.1f%%', shadow=True)
caseStatusa17.plot.pie(ax=ax[2],figsize=(15,10),y='CASE_NO', explode=(0.1, 0.1, 0.1, 0.1), autopct='%1.1f%%', shadow=True)
fig.tight_layout()
ax[0].set_title('Overall Decision on the Cases', fontsize=15, fontweight='bold')
ax[1].set_title('Before 2017 Decision on the Cases', fontsize=15, fontweight='bold')
ax[2].set_title('After 2017 Decision on the Cases', fontsize=15, fontweight='bold')
plt.legend(bbox_to_anchor=(1.4, 1), loc='upper right', ncol=1);
As the abrove pictures show, from the percentage perspective, there are no big changes before 2017, after 2017 and the overall data. The certified cases are around 88%, which is slightly increased after 2017.There is a 2% decrease shows in the denied cases and 1% increase on Certified-Withdrawn, which is interesting.
The results are different from our expection because we think the certified cases might decrease after the policy change in 2017. The next step, we will figure out if the case number is droped, which means less people apply for H1B visa per year after 2017.
ycase = h1b.groupby('YEAR').agg('count')
ycase1 = ycase.iloc[:,:1]
ycase1
#Shows how many H1B cases was submitted through out the years
fig, ax = plt.subplots(figsize=(10,5))
ycase1.plot(ax=ax)
plt.title('H1B Applications Over Years',fontsize=15,fontweight='bold')
plt.ylabel('Number of Applications')
ax.legend(['Jobs'])
In year 2013, the H1B application hit the lowest number which is 74897 and it's increasing after that. We can also see a tendency of dropping after year 2017, which matched with our expectation.
a = H1B08_10.groupby('STATE').agg('count')
b = a.iloc[:,:1].reset_index().rename(columns={'CASE_NO': 'Cases'})
fig = px.choropleth(b,
locations="STATE",
color="Cases",
hover_name="STATE" ,
locationmode = 'USA-states')
fig.update_layout(
title_text = 'Year 2008 to 2010 H1B Job Rankings in States of US',
geo_scope='usa',
)
fig.show()
a = h1b.groupby('STATE').agg('count')
b = a.iloc[:,:1].reset_index().rename(columns={'CASE_NO': 'Cases'})
fig = px.choropleth(b,
locations="STATE",
color="Cases",
hover_name="STATE" ,
locationmode = 'USA-states')
fig.update_layout(
title_text = 'H1B Job Rankings in States of US',
geo_scope='usa',
)
fig.show()
certified = h1b.loc[h1b['CASE_STATUS'] == 'CERTIFIED']
denied = h1b.loc[h1b['CASE_STATUS'] == 'DENIED']
certified18 = H1B17_18.loc[H1B17_18['CASE_STATUS'] == 'CERTIFIED']
denied18 = H1B17_18.loc[H1B17_18['CASE_STATUS'] == 'DENIED']
fig, ax = plt.subplots(2, 1, figsize=(13, 10))
sns.countplot(y=certified['FSTATE'], order=certified['FSTATE'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('States with the Most Certified H1B Applicants', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of Certified Cases')
ax[0].set_ylabel('City')
ax[0].legend(['Certified Cases'])
sns.countplot(y=certified18['FSTATE'], order=certified18['FSTATE'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('2017 to 2018 States with the Most Certified H1B Applicants', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of Certified Cases')
ax[1].set_ylabel('City')
ax[1].legend(['Certified Cases'])
plt.subplots_adjust(hspace=0.5);
fig, ax = plt.subplots(2, 1, figsize=(13, 10))
sns.countplot(y=denied['FSTATE'], order=denied['FSTATE'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('States with the Most denied H1B Applicants', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of denied Cases')
ax[0].set_ylabel('City')
ax[0].legend(['Denied Cases'])
sns.countplot(y=denied18['FSTATE'], order=denied18['FSTATE'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('2017 to 2018 States with the Most denied H1B Applicants', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of denied Cases')
ax[1].set_ylabel('City')
ax[1].legend(['Denied Cases'])
plt.subplots_adjust(hspace=0.5);
As the above pictures show, the number of H1B applications of states don't change that much. But we do see that Texas is poping up to offer more H1B visa and New York is slowing down it's visa application. According to a report from Business in Texas,Texas'IT services workforce is the second largest in the nation.
Number wise, California has the most number of applicants in both certified and denied. Interestingly, New Jersey has the second highest number of certified applicatns with the forth denied application. On the contrary, New York has higher denied rate with lower certified rate.
International students can focus on CA, NY, NJ and TX when looking for full-time jobs. Compare to the overall data with the data from 2017 to 2018,we suggest international students to pay extra attention on Texas and New Jersey are great opportunities.
fig, ax = plt.subplots(2,1, figsize=(13, 10))
sns.countplot(y=certified['CITY'], order=certified['CITY'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('Most Certified H1B Cities', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of Certified Cases')
ax[0].set_ylabel('City')
ax[0].legend(['Certified Cases'])
sns.countplot(y=denied['CITY'], order=denied['CITY'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('Most denied H1B Cities', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of denied Cases')
ax[1].set_ylabel('City')
ax[1].legend(['Denied Cases'])
plt.subplots_adjust(hspace=0.5);
From this chart, we can find that New York City is the highest numbers in certified and the most highest numbers in denied applications. And, Plano in Texas is a great opportunity with the second highest number of certified cases.
fig, ax = plt.subplots(figsize=(12, 5))
sns.countplot(y=h1b['FULL_TIME'])
plt.title("Full/Part Time Position of Applicants", fontsize=15, fontweight='bold')
plt.xlabel("Numbers of Cases")
plt.ylabel("Position")
ax.set_yticklabels(['Full Time', 'Part Time'])
plt.legend(['Cases']);
Most of the applicants are working as a full-time rather than a part-time. Over 1.8 millions of applicatns were full-time.
fig, ax = plt.subplots(2,1, figsize=(13, 10))
sns.countplot(y=certified['SOC_NAME'], order=certified['SOC_NAME'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('Overall Most Certified H1B Occupations', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of Certified Cases')
ax[0].set_ylabel('Occupations')
ax[0].legend(['Certified Cases'])
sns.countplot(y=certified18['SOC_NAME'], order=certified18['SOC_NAME'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('2017 to 2018 Most Certified H1B Occupations', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of Certified Cases')
ax[1].set_ylabel('Occupations')
ax[1].legend(['Certified Cases'])
plt.subplots_adjust(hspace=0.5);
fig, ax = plt.subplots(2,1, figsize=(13, 10))
sns.countplot(y=denied['SOC_NAME'], order=denied['SOC_NAME'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('Overall Most Denied H1B Occupations', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of Denied Cases')
ax[0].set_ylabel('Occupations')
ax[0].legend(['Denied Cases'])
sns.countplot(y=denied18['SOC_NAME'], order=denied18['SOC_NAME'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('2017 to 2018 Most Denied H1B Occupations', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of Denied Cases')
ax[1].set_ylabel('Occupations')
ax[1].legend(['Denied Cases'])
plt.subplots_adjust(hspace=0.5);
The occupations with the highest certified rate are all in IT field and we do see diversified results of denied applications. Besides IT, there are also opportunities in analytics, accounting and engineering roles
Software developers have the highest certified rate and the highest denied rate.
fig, ax = plt.subplots(2,1,figsize=(13, 10))
sns.countplot(y=h1b['EMPLOYER_NAME'], order=h1b['EMPLOYER_NAME'].value_counts().iloc[:10].index, ax=ax[0]);
ax[0].set_title("Overall H1B Popular Comapnies", fontsize=15, fontweight='bold')
ax[0].set_ylabel("Companies")
ax[0].set_xlabel("Number")
ax[0].legend(['Cases'])
sns.countplot(y=H1B17_18['EMPLOYER_NAME'], order=H1B17_18['EMPLOYER_NAME'].value_counts().iloc[:10].index, ax=ax[1]);
ax[1].set_title("2017 to 2018 H1B Popular Comapnies", fontsize=15, fontweight='bold')
ax[1].set_ylabel("Companies")
ax[1].set_xlabel("Number")
ax[1].legend(['Cases'])
plt.subplots_adjust(hspace=0.5);
As we can see the trend, the companies with most H1B applicants are technology consulting firms. It's interesting to see all of the big tech firms, such as Apply, Google, Microsoft and Amazon were not in the list of top 10 sponsored companies.
top10_company = h1b['EMPLOYER_NAME'].value_counts().head(10).index.to_list()
Company_job = h1b[h1b['EMPLOYER_NAME'].isin(top10_company)].groupby(['EMPLOYER_NAME'])['JOB_TITLE'].value_counts(ascending=False).to_frame()
Company_job.rename(columns={'JOB_TITLE': 'Count'})
Company_job.drop(Company_job.loc[(Company_job['JOB_TITLE'] <2000)].index,inplace=True)
Company_job
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
sns.countplot(y=certified['EMPLOYER_NAME'], order=certified['EMPLOYER_NAME'].value_counts().iloc[:10].index, ax=ax[0]);
ax[0].set_title("Comapnies with Most Certified H1B Applicants", fontsize=15, fontweight='bold')
ax[0].set_ylabel("Company")
ax[0].set_xlabel("Number")
ax[0].legend(['Cases']);
sns.countplot(y=denied['EMPLOYER_NAME'], order=denied['EMPLOYER_NAME'].value_counts().iloc[:10].index, ax=ax[1]);
ax[1].set_title("20017 Comapnies with Most Denied H1B Applicants", fontsize=15, fontweight='bold')
ax[1].set_ylabel("Company")
ax[1].set_xlabel("Number")
ax[1].legend(['Cases'])
plt.subplots_adjust(hspace=0.5);
Infosys limited, the one with the most H1B application has the highest certified cases. However, Deloitte consulting LLP has the highest denied cases. Also, those companies like IBM and Ernst&Young LLP, that relatively had lower numbers in H1B applications, has the higher denied cases that the companies like Infosys and Tata consulting services limited.
h1b_agg1=h1b.groupby('EMPLOYER_NAME').agg({'PREVAILING_WAGE':[np.mean,np.median,np.min,np.max]})
h1b_agg1.columns = ['wage_mean','wage_median','wage_min','wage_max']
h1b_agg1.sort_values('wage_mean',ascending = False).head(10)
h1b_agg=h1b.groupby('FSTATE').agg({'PREVAILING_WAGE':[np.mean,np.median,np.min,np.max]})
h1b_agg.columns = ['wage_mean','wage_median','wage_min','wage_max']
h1b_agg.sort_values('wage_mean',ascending = False).head(10)
top10_state = h1b['FSTATE'].value_counts().head(10).index.to_list()
state_wage = h1b.groupby(['FSTATE','YEAR'],as_index=False)['PREVAILING_WAGE'].mean()
top10_state1 = H1B17_18['FSTATE'].value_counts().head(10).index.to_list()
state_wage1 = H1B17_18.groupby(['FSTATE','YEAR'],as_index=False)['PREVAILING_WAGE'].mean()
fig, ax = plt.subplots(2,1,figsize=(12,15))
for i in top10_state:
tmp = state_wage.loc[state_wage['FSTATE']==i]
ax[0].plot(tmp['YEAR'], tmp['PREVAILING_WAGE'],
label = i,
linewidth = 2)
for i in top10_state1:
tmp1 = state_wage1.loc[state_wage1['FSTATE']==i]
ax[1].plot(tmp1['YEAR'], tmp1['PREVAILING_WAGE'],
label = i,
linewidth = 2)
ax[0].set_title('Overall Trend of H1B Job Wages in Top 10 States',fontsize=15,fontweight='bold')
ax[1].set_title('2017 to 2018 Trend of H1B Job Wages in Top 10 States',fontsize=15,fontweight='bold')
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Wages')
ax[0].legend(bbox_to_anchor=(1.1, 1.05),loc = 'best')
ax[1].set_xlabel('Year')
ax[1].set_ylabel('Wages')
ax[1].legend(bbox_to_anchor=(1.1, 1.05),loc = 'best')
plt.subplots_adjust(hspace=0.5);
The overall trend for the wages of H1B applicants are increasing, and we see a big increase in 2013.California and New York have relatively high wages than other states. Also, it's interesting to see that washington has high wages almost close to California in year 2017 and 2018
4.Descriptive Data Mining
pd.set_option('display.max_columns', None)
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
import seaborn as sns
h1b = pd.concat([H1B08_10,H1B11_14,H1B15_18])
h1b.dropna(how = 'all')
h1b02 = h1b[['CASE_STATUS','PREVAILING_WAGE']]
h1b02['PREVAILING_WAGE'] = h1b02['PREVAILING_WAGE'].astype('str').str.replace(',','').str.replace('$','').str.replace('-','')
h1b02['PREVAILING_WAGE']= pd.to_numeric(h1b02['PREVAILING_WAGE'],errors = 'coerce')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('CERTIFIED-WITHDRAWN','DENIED')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('DENIED','0').str.replace('CERTIFIED','1').str.replace('WITHDRAWN','0')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED','0')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('REJECTED','0')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('INVALIDATED','0')
h1b02.dropna(inplace=True)
h1b02
from sklearn.linear_model import LogisticRegression
skl_reg1 = LogisticRegression().fit(h1b02[['PREVAILING_WAGE']].values,h1b02['CASE_STATUS'].values)
skl_reg1.score(h1b02[['PREVAILING_WAGE']].values,h1b02['CASE_STATUS'].values)
Then we got the result: 87.3%, which proves that 'Status' does have strong connection with income.
According to the H1b official report published in April 2017,there will be more H1b visas captured by STEM workers, and generally STEM workers who got H1b visas earned more more on average (US Dollar 76356) than American workers with a bachelor’s degree (US Dollar 67301), within the same age group and occupation. This is why H1b visas still go to workers with higher income, because most of them are able to get a job at high-tech companies such as Google and Facebook, and some of them are sponsored by consulting firms.
5.Conclusion
In this rerport, we analyzed the H1B petition data from year 2008 to year 2018. There are some interesting findings:
1.In year 2013, the case submission has a big drop.And the H1B wages has a slightly increase in the same year.
2.The Overall certified HIB rate is around 88% and the processing time is around a week.
3.California is the state that has the highest H1B case submission and average wage, which around 8k
4.New York, New Jerseay, and Texas are three states that have high H1B submission. Among these states,the application number and denied rate are increaing in New York, but Texsas is a growing places that deserve extra attention.
5.Besides, Washington has the highest average wages
6.IT is the most popular area that get H1B sponsored. The occupations are related to programming and software development. Beside IT, there are opportunities in accounting and analytics (financial, marketing, and management)
7.IT consulting firms are most welcome H1B sponsorship, but we do see the big tech companies such as Apply, Amazon, Google, and Microsoft are missing in the top 10 list
Thanks for reading. Hope you find it interesting and helpful!
#h1b.to_csv('Databootcamp_finalreport.csv')