DATA BOOTCAMP FINAL PROJECT


Title: Analysis of H1B Petition from year 2008 to 2018

Data Source: United States Department of Labor - LCA Programs

Authors: Sarah Kim, Yilin Zhao, Jiaran Gu

Submission Date: May 15, 2020

In [ ]:
 

TABLE OF CONTENT


1.Introduction


2.Data Cleaning


3.Data Analysis and Visualization


4.Descriptive Data Mining


5.Conclusion

1. Introduction

According to Workpermit.com, The US H-1B visa is a non-immigrant visa that allows US companies to employ graduate-level workers in specialty occupations that require theoretical or technical expertise in specialized fields such as in IT, finance, accounting, architecture, engineering, mathematics, science, medicine, etc.

As for this report, we would like to show the trend of the H1B petition over the years from 2008 to 2018 and potentia opportunities for international students to focus when they are applying for jobs in the United States.

Due to the Coronavirus, companies are laying off, and job searching is hard, but we still need to stay strong and be hopeful!

2. Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_columns', None)
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

Read in and Combine Data

In [2]:
fy2008 = pd.read_csv('H-1B_Case_Data_FY2008.csv')
In [3]:
fy2008 = fy2008[['CASE_NO','APPROVAL_STATUS','SUBMITTED_DATE','DOL_DECISION_DATE','NAME',
                                                     'CITY','OCCUPATIONAL_TITLE',
                                                      'STATE','POSTAL_CODE','JOB_CODE','JOB_TITLE',
                                                     'RATE_PER_1','PART_TIME_1',
                                                      'PREVAILING_WAGE_1']]
In [4]:
#Convert Part-time data to full-time to align with other data
fy2008['PART_TIME_1'] = fy2008['PART_TIME_1'].replace({'N': 'Y', 'Y': 'N'})
In [5]:
newnames08 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'DECISION_DATE','EMPLOYER_NAME',
              'CITY','SOC_NAME','STATE','POSTAL_CODE','SOC_CODE','JOB_TITLE',
             'WAGE_UNIT','FULL_TIME','PREVAILING_WAGE']
fy2008.columns = newnames08  
In [6]:
fy2009 = pd.read_csv('Icert_ LCA_ FY2009.csv',dtype='object')
In [7]:
fy2009 = fy2009[['LCA_CASE_NUMBER','STATUS','LCA_CASE_SUBMIT','Decision_Date', 
                 'LCA_CASE_EMPLOYER_NAME',  'LCA_CASE_EMPLOYER_CITY',
                 'LCA_CASE_EMPLOYER_STATE', 'LCA_CASE_EMPLOYER_POSTAL_CODE','LCA_CASE_SOC_CODE','LCA_CASE_SOC_NAME',
                'LCA_CASE_JOB_TITLE','FULL_TIME_POS', 'PW_1',
                 'PW_UNIT_1' ]]
In [8]:
newnames09 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'DECISION_DATE','EMPLOYER_NAME',
              'CITY','STATE','POSTAL_CODE','SOC_CODE','SOC_NAME','JOB_TITLE',
             'FULL_TIME','PREVAILING_WAGE','WAGE_UNIT']
fy2009.columns = newnames09  
In [9]:
fy2010 = pd.read_csv('H-1B_FY2010.csv',dtype='object')
In [10]:
fy2010 = fy2010[['LCA_CASE_NUMBER', 'STATUS', 'LCA_CASE_SUBMIT','DECISION_DATE','LCA_CASE_EMPLOYER_NAME',
                  'LCA_CASE_EMPLOYER_CITY','LCA_CASE_EMPLOYER_STATE', 'LCA_CASE_EMPLOYER_POSTAL_CODE','LCA_CASE_SOC_CODE','LCA_CASE_SOC_NAME',
                   'LCA_CASE_JOB_TITLE','PW_1','PW_UNIT_1' ]]
#Missing full_time information here
In [11]:
newnames10 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'DECISION_DATE','EMPLOYER_NAME',
              'CITY','STATE','POSTAL_CODE','SOC_CODE','SOC_NAME','JOB_TITLE',
            'PREVAILING_WAGE','WAGE_UNIT']
fy2010.columns = newnames10  
In [12]:
H1B08_10 =  pd.concat([fy2008,fy2009,fy2010],sort=False)
In [13]:
H1B11_14 = pd.concat(map(pd.read_csv, ['H-1B_iCert_LCA_FY2011_Q4 (1).csv','LCA_FY2012_Q4.csv','LCA_FY2013.csv','H-1B_FY2014.csv']))
C:\Data_Bootcamp\lib\site-packages\IPython\core\interactiveshell.py:3326: DtypeWarning:

Columns (22) have mixed types. Specify dtype option on import or set low_memory=False.

C:\Data_Bootcamp\lib\site-packages\IPython\core\interactiveshell.py:3326: DtypeWarning:

Columns (22,26) have mixed types. Specify dtype option on import or set low_memory=False.

C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:1: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.


In [14]:
H1B11_14 = H1B11_14[['LCA_CASE_NUMBER','STATUS','LCA_CASE_SUBMIT','LCA_CASE_JOB_TITLE','DECISION_DATE',
                 'LCA_CASE_EMPLOYER_NAME','LCA_CASE_EMPLOYER_CITY','LCA_CASE_EMPLOYER_STATE','LCA_CASE_EMPLOYER_POSTAL_CODE',
                 'FULL_TIME_POS','LCA_CASE_SOC_CODE','LCA_CASE_SOC_NAME','PW_1',
                 'LCA_CASE_WAGE_RATE_UNIT']]
In [15]:
newnames11 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'JOB_TITLE','DECISION_DATE','EMPLOYER_NAME',
              'CITY','STATE','POSTAL_CODE', 'FULL_TIME','SOC_CODE','SOC_NAME',
            'PREVAILING_WAGE','WAGE_UNIT']
H1B11_14.columns = newnames11  
In [16]:
H1B15_18 = pd.concat(map(pd.read_csv, ['FY2015.csv','FY2016.csv','FY2017.csv','FY2018.csv']))
C:\Data_Bootcamp\lib\site-packages\IPython\core\interactiveshell.py:3326: DtypeWarning:

Columns (14,15) have mixed types. Specify dtype option on import or set low_memory=False.

C:\Data_Bootcamp\lib\site-packages\IPython\core\interactiveshell.py:3326: DtypeWarning:

Columns (14,15,16,24) have mixed types. Specify dtype option on import or set low_memory=False.

C:\Data_Bootcamp\lib\site-packages\IPython\core\interactiveshell.py:3326: DtypeWarning:

Columns (15,16) have mixed types. Specify dtype option on import or set low_memory=False.

C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:1: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.


In [18]:
H1B15_18 = H1B15_18[['CASE_NUMBER','CASE_STATUS', 'CASE_SUBMITTED','JOB_TITLE','DECISION_DATE',
                 'EMPLOYER_NAME','EMPLOYER_CITY','EMPLOYER_STATE','EMPLOYER_POSTAL_CODE',
                'FULL_TIME_POSITION','SOC_CODE','SOC_NAME','PREVAILING_WAGE',
                 'WAGE_UNIT_OF_PAY']]
In [19]:
newnames15 = ['CASE_NO','CASE_STATUS','CASE_SUBMITTED', 'JOB_TITLE','DECISION_DATE','EMPLOYER_NAME',
              'CITY','STATE','POSTAL_CODE','FULL_TIME','SOC_CODE','SOC_NAME',
            'PREVAILING_WAGE','WAGE_UNIT']
H1B15_18.columns = newnames15  
In [20]:
h1b =  pd.concat([H1B08_10,H1B11_14,H1B15_18])
C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:1: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.


In [21]:
h1b
Out[21]:
CASE_NO CASE_STATUS CASE_SUBMITTED CITY DECISION_DATE EMPLOYER_NAME FULL_TIME JOB_TITLE POSTAL_CODE PREVAILING_WAGE SOC_CODE SOC_NAME STATE WAGE_UNIT
0 I-08090-4182485 CERTIFIED 2008/3/30 BURLINGTON 2008/3/30 SPI RESEARCH AND DEVELOPMENT SERVICES, LLC Y PROGRAMMER ANALYST 1803 44298 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING MA yr
1 I-08090-4182487 CERTIFIED 2008/3/30 Horsham 2008/3/30 Nexus Software Solutions,Inc. Y JAVA/J2EE DEVELOPER 19044 43597 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING PA yr
2 I-08090-4182493 CERTIFIED 2008/3/30 FREMONT 2008/3/30 GREYCELL INC Y COMPUTER PROGRAMMER 94536 52437 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING CA yr
3 I-08090-4182495 CERTIFIED 2008/3/30 NEW YORK 2008/3/30 DATA PIONEER INC Y PROGRAMMER ANALYST 10005 60000 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING NY yr
4 I-08090-4182497 CERTIFIED 2008/3/30 Horsham 2008/3/30 Nexus Software Solutions,Inc. Y ASP.NET DEVELOPER 19044 43597 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING PA yr
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
654355 I-200-18080-658767 CERTIFIED 3/21/2018 IRVING 3/27/2018 CROCKETT TECHNICAL LLC Y ASSOCIATE SOFTWARE ENGINEER 75062 90,646.00 15-1132 SOFTWARE DEVELOPERS, APPLICATIONS TX Year
654356 I-200-18061-209836 CERTIFIED 3/5/2018 METTAWA 3/9/2018 BRUNSWICK CORPORATION Y MANAGER GLOBAL SUPPLY CHAIN BUSINESS ANALYTICS 60045 68,598.00 17-2112 INDUSTRIAL ENGINEERS IL Year
654357 I-201-17315-187517 CERTIFIED 11/13/2017 DORAL 11/17/2017 JOFEMAR USA INC. Y BUSINESS RESEARCH AND DEVELOPMENT 33172 65,936.00 13-1161 MARKET RESEARCH ANALYSTS AND MARKETING SPECIAL... FL Year
654358 I-200-18080-929219 CERTIFIED 3/21/2018 DURHAM 3/27/2018 IBM INDIA PRIVATE LIMITED Y PACKAGE SOLUTION CONSULTANT 27709 78,416.00 15-1121 COMPUTER SYSTEMS ANALYSTS NC Year
654359 I-200-18261-995519 CERTIFIED 9/18/2018 REDMOND 9/24/2018 MICROSOFT CORPORATION Y SOFTWARE ENGINEER 98052 133,806.00 15-1132 SOFTWARE DEVELOPERS, APPLICATIONS WA Year

5118351 rows × 14 columns

In [22]:
h1b.dtypes
Out[22]:
CASE_NO            object
CASE_STATUS        object
CASE_SUBMITTED     object
CITY               object
DECISION_DATE      object
EMPLOYER_NAME      object
FULL_TIME          object
JOB_TITLE          object
POSTAL_CODE        object
PREVAILING_WAGE    object
SOC_CODE           object
SOC_NAME           object
STATE              object
WAGE_UNIT          object
dtype: object
In [23]:
h1b.shape
Out[23]:
(5118351, 14)

At this point we combined all the H1B data for 11 years, then we will start data cleaning.

Unifying the Case Status

In [24]:
h1b['CASE_STATUS'].value_counts()
Out[24]:
CERTIFIED                                             4468166
CERTIFIED-WITHDRAWN                                    330444
DENIED                                                 171409
WITHDRAWN                                              148314
PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED         15
REJECTED                                                    2
INVALIDATED                                                 1
Name: CASE_STATUS, dtype: int64
In [25]:
h1b.drop(h1b.loc[h1b['CASE_STATUS'] == 'PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED'].index, inplace=True)
h1b.loc[h1b['CASE_STATUS']=='REJECTED','CASE_STATUS'] = 'DENIED'
h1b.loc[h1b['CASE_STATUS']=='INVALIDATED','CASE_STATUS'] = 'DENIED'

Unifying the Wage

In [26]:
h1b['WAGE_UNIT'].value_counts()
Out[26]:
Year                4333031
yr                   369577
Hour                 367444
hr                    33441
Month                  6882
Week                   3172
mth                    1710
Bi-Weekly              1422
wk                      737
Select Pay Range        359
bi                      161
Name: WAGE_UNIT, dtype: int64
In [27]:
h1b.loc[h1b['WAGE_UNIT']=='yr','WAGE_UNIT'] = 'Year'
h1b.loc[h1b['WAGE_UNIT']=='hr','WAGE_UNIT'] = 'Hour'
h1b.loc[h1b['WAGE_UNIT']=='mth','WAGE_UNIT'] = 'Month'
h1b.loc[h1b['WAGE_UNIT']=='wk','WAGE_UNIT'] = 'Week'
h1b.loc[h1b['WAGE_UNIT']=='bi','WAGE_UNIT'] = 'Bi-Weekly'
In [28]:
h1b.loc[h1b['WAGE_UNIT']== 'Select Pay Range']
Out[28]:
CASE_NO CASE_STATUS CASE_SUBMITTED CITY DECISION_DATE EMPLOYER_NAME FULL_TIME JOB_TITLE POSTAL_CODE PREVAILING_WAGE SOC_CODE SOC_NAME STATE WAGE_UNIT
192 I-200-09205-336218 WITHDRAWN 2010/11/23 NaN 2010/11/23 NaN NaN NaN NaN NaN NaN NaN NaN Select Pay Range
301 I-200-09279-405989 WITHDRAWN 2011/2/10 NaN 2011/2/10 NaN NaN NaN NaN NaN NaN NaN NaN Select Pay Range
386 I-200-09301-178420 WITHDRAWN 2011/3/25 OKLAHOMA CITY 2011/3/25 SKY FOUNDATION NaN BUSINESS MANAGER 73106- NaN NaN NaN OK Select Pay Range
411 I-200-09315-488549 WITHDRAWN 2011/3/25 SAN ANTONIO 2011/3/25 RIVERWALK FOUNDATION Y ASSISTANT PRINCIPAL 78209- NaN 11-9032.00 Education Administrators, Elementary and Secon... TX Select Pay Range
415 I-200-09320-788518 WITHDRAWN 2011/3/25 NaN 2011/3/25 NaN NaN NaN NaN NaN NaN NaN NaN Select Pay Range
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
379159 I-200-12131-574691 WITHDRAWN 2013/7/24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Select Pay Range
379162 I-200-12236-316830 WITHDRAWN 2013/7/24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Select Pay Range
379164 I-200-12257-907806 WITHDRAWN 2013/7/24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Select Pay Range
382387 I-200-13158-209616 WITHDRAWN 2013/7/26 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Select Pay Range
422949 I-200-13252-421149 WITHDRAWN 2013/9/9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Select Pay Range

359 rows × 14 columns

In [29]:
h1b.drop(h1b.loc[h1b['WAGE_UNIT']== 'Select Pay Range'].index, inplace=True)

Only save the yearly income.

In [30]:
h1b.drop(h1b.loc[h1b['PREVAILING_WAGE'] == '20-70'].index, inplace=True)
In [31]:
h1b.drop(h1b.loc[h1b['PREVAILING_WAGE'] == '.'].index, inplace=True)
In [32]:
h1b['PREVAILING_WAGE'] = h1b['PREVAILING_WAGE'].astype('str').str.replace(',','').str.replace('$','').str.replace(' ','')
In [33]:
# Change hourly wage into yearly wage
h1b.at[h1b['WAGE_UNIT'] == 'Hour', 'PREVAILING_WAGE'] = h1b.loc[h1b[
    'WAGE_UNIT'] == 'Hour']['PREVAILING_WAGE'].astype(float).apply(lambda x: x * 37.5 * 52)
# Change montly wage into yearly wage
h1b.at[h1b['WAGE_UNIT'] == 'Month', 'PREVAILING_WAGE'] = h1b.loc[h1b[
    'WAGE_UNIT'] == 'Month']['PREVAILING_WAGE'].astype(float).apply(lambda x: x * 12) 
# Change weekly into yearly wage
h1b.at[h1b['WAGE_UNIT'] == 'Week', 'PREVAILING_WAGE'] = h1b.loc[h1b[
    'WAGE_UNIT'] == 'Week']['PREVAILING_WAGE'].astype(float).apply(lambda x: x * 52)
# Change bi-weekly into yearly wage 
h1b.at[h1b['WAGE_UNIT'] == 'Bi-Weekly', 'PREVAILING_WAGE'] = h1b.loc[h1b[
    'WAGE_UNIT'] == 'Bi-Weekly']['PREVAILING_WAGE'].astype(float).apply(lambda x: x * 26) 
# Change all columns into float
h1b['PREVAILING_WAGE'] = h1b['PREVAILING_WAGE'].apply(lambda x: float(x.replace(',', '')) if isinstance(x, str) else x)
# Drop the wage unit column
h1b.drop(columns='WAGE_UNIT', inplace=True)
In [34]:
#drop wages not inbetween 10,000 and 2,000,000 as these were put with wrong wage units
h1b.drop(h1b.loc[(h1b['PREVAILING_WAGE'] > 2000000) | (h1b['PREVAILING_WAGE'] < 10000)].index, inplace=True)

Adding a Column with Full State Names.

In [35]:
h1b['STATE']
Out[35]:
0         MA
1         PA
2         CA
3         NY
4         PA
          ..
654355    TX
654356    IL
654357    FL
654358    NC
654359    WA
Name: STATE, Length: 5018088, dtype: object
In [36]:
h1b['FSTATE'] = h1b['STATE']
In [37]:
usStateAbbr = {
    'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California',
    'CO': 'Colorado', 'CT': 'Connecticut', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia',
    'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa', 'KS': 'Kansas',
    'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland', 'MA': 'Massachusetts',
    'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi', 'MO': 'Missouri', 'MT': 'Montana',
    'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico',
    'NY': 'New York', 'NC': 'North Carolina', 'ND': 'North Dakota', 'OH': 'Ohio', 'OK': 'Oklahoma',
    'OR': 'Oregon', 'PA': 'Pennsylvania', 'RI': 'Rhode Island', 'SC': 'South Carolina',
    'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VT': 'Vermont',
    'VA': 'Virginia', 'WA': 'Washington', 'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming'}
In [38]:
h1b['FSTATE'] = h1b['FSTATE'].map(usStateAbbr).str.upper()

Cleaning the Date

In [39]:
h1b['DECISION_DATE'] = pd.to_datetime(h1b['DECISION_DATE'])
h1b['CASE_SUBMITTED'] = pd.to_datetime(h1b['CASE_SUBMITTED'])
In [40]:
h1b = h1b.dropna(subset=['DECISION_DATE','CASE_SUBMITTED'])

Cleaning the SOC_NAME

In [41]:
# Find one erro in SOC Name and cleaned here
h1b.loc[h1b['SOC_NAME'] == '15-1121', 'SOC_NAME'] = 'COMPUTER SYSTEMS ANALYSTS'
C:\Data_Bootcamp\lib\site-packages\pandas\core\indexing.py:494: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [42]:
h1b['SOC_NAME'].value_counts().head(20)
Out[42]:
SOFTWARE DEVELOPERS, APPLICATIONS                  531434
COMPUTER SYSTEMS ANALYSTS                          363295
COMPUTER PROGRAMMERS                               260430
Computer Systems Analysts                          248901
COMPUTER OCCUPATIONS, ALL OTHER                    220031
Computer Programmers                               215482
OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING    179713
Software Developers, Applications                  133185
SOFTWARE DEVELOPERS, SYSTEMS SOFTWARE               83599
Computer Software Engineers, Applications           70784
MANAGEMENT ANALYSTS                                 52958
COMPUTER SYSTEMS ANALYST                            47798
ACCOUNTANTS AND AUDITORS                            44274
NETWORK AND COMPUTER SYSTEMS ADMINISTRATORS         41618
Computer Occupations, All Other                     38874
FINANCIAL ANALYSTS                                  38561
MECHANICAL ENGINEERS                                38277
OPERATIONS RESEARCH ANALYSTS                        37414
Management Analysts                                 36218
Financial Analysts                                  35654
Name: SOC_NAME, dtype: int64
In [84]:
h1b.loc[h1b['SOC_NAME']=='Computer Programmers','SOC_NAME'] = 'COMPUTER PROGRAMMERS'
h1b.loc[h1b['SOC_NAME']=='Computer Systems Analysts','SOC_NAME'] = 'COMPUTER SYSTEMS ANALYSTS'
h1b.loc[h1b['SOC_NAME']=='COMPUTER SYSTEMS ANALYST','SOC_NAME'] = 'COMPUTER SYSTEMS ANALYSTS'
h1b.loc[h1b['SOC_NAME']=='Computer Software Engineers, Applications','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='SOFTWARE DEVELOPERS, SYSTEMS SOFTWARE','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='Software Developers, Applications','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='SOFTWARE DEVELOPERS, APPLICATIONS ','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='Software Developers, Systems Software','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='Computer Software Engineers, Systems Software','SOC_NAME'] = 'SOFTWARE DEVELOPERS, APPLICATIONS'
h1b.loc[h1b['SOC_NAME']=='Computer Occupations, All Other','SOC_NAME'] = 'COMPUTER OCCUPATIONS, ALL OTHER'
h1b.loc[h1b['SOC_NAME']=='Financial Analysts','SOC_NAME'] = 'FINANCIAL ANALYSTS'

Final Dataset

In [110]:
h1b
Out[110]:
CASE_NO CASE_STATUS CASE_SUBMITTED CITY DECISION_DATE EMPLOYER_NAME FULL_TIME JOB_TITLE POSTAL_CODE PREVAILING_WAGE SOC_CODE SOC_NAME STATE FSTATE YEAR
0 I-08090-4182485 CERTIFIED 2008-03-30 BURLINGTON 2008-03-30 SPI RESEARCH AND DEVELOPMENT SERVICES, LLC Y PROGRAMMER ANALYST 1803 44298.0 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING MA MASSACHUSETTS 2008
1 I-08090-4182487 CERTIFIED 2008-03-30 Horsham 2008-03-30 Nexus Software Solutions,Inc. Y JAVA/J2EE DEVELOPER 19044 43597.0 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING PA PENNSYLVANIA 2008
2 I-08090-4182493 CERTIFIED 2008-03-30 FREMONT 2008-03-30 GREYCELL INC Y COMPUTER PROGRAMMER 94536 52437.0 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING CA CALIFORNIA 2008
3 I-08090-4182495 CERTIFIED 2008-03-30 NEW YORK 2008-03-30 DATA PIONEER INC Y PROGRAMMER ANALYST 10005 60000.0 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING NY NEW YORK 2008
4 I-08090-4182497 CERTIFIED 2008-03-30 Horsham 2008-03-30 Nexus Software Solutions,Inc. Y ASP.NET DEVELOPER 19044 43597.0 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING PA PENNSYLVANIA 2008
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
654355 I-200-18080-658767 CERTIFIED 2018-03-21 IRVING 2018-03-27 CROCKETT TECHNICAL LLC Y ASSOCIATE SOFTWARE ENGINEER 75062 90646.0 15-1132 SOFTWARE DEVELOPERS, APPLICATIONS TX TEXAS 2018
654356 I-200-18061-209836 CERTIFIED 2018-03-05 METTAWA 2018-03-09 BRUNSWICK CORPORATION Y MANAGER GLOBAL SUPPLY CHAIN BUSINESS ANALYTICS 60045 68598.0 17-2112 INDUSTRIAL ENGINEERS IL ILLINOIS 2018
654357 I-201-17315-187517 CERTIFIED 2017-11-13 DORAL 2017-11-17 JOFEMAR USA INC. Y BUSINESS RESEARCH AND DEVELOPMENT 33172 65936.0 13-1161 MARKET RESEARCH ANALYSTS AND MARKETING SPECIAL... FL FLORIDA 2017
654358 I-200-18080-929219 CERTIFIED 2018-03-21 DURHAM 2018-03-27 IBM INDIA PRIVATE LIMITED Y PACKAGE SOLUTION CONSULTANT 27709 78416.0 15-1121 COMPUTER SYSTEMS ANALYSTS NC NORTH CAROLINA 2018
654359 I-200-18261-995519 CERTIFIED 2018-09-18 REDMOND 2018-09-24 MICROSOFT CORPORATION Y SOFTWARE ENGINEER 98052 133806.0 15-1132 SOFTWARE DEVELOPERS, APPLICATIONS WA WASHINGTON 2018

4585400 rows × 15 columns

In [44]:
h1b.dtypes
Out[44]:
CASE_NO                    object
CASE_STATUS                object
CASE_SUBMITTED     datetime64[ns]
CITY                       object
DECISION_DATE      datetime64[ns]
EMPLOYER_NAME              object
FULL_TIME                  object
JOB_TITLE                  object
POSTAL_CODE                object
PREVAILING_WAGE           float64
SOC_CODE                   object
SOC_NAME                   object
STATE                      object
FSTATE                     object
dtype: object

3. Data Analysis and Visualization

Since 2017, the Trump administration has announced series reformation on H1B visa policy, it changes the landscape a lot. In the following analysis, we will compare the 2017 to 2018 data to visualize these changes.

H1B Processing Time

In [45]:
decisionDaysTook = h1b['DECISION_DATE'] - h1b['CASE_SUBMITTED']
decisionDaysTook = decisionDaysTook.dropna().dt.days

days_over_30 = len(decisionDaysTook[(decisionDaysTook > 30) & (decisionDaysTook < 90)])
days_over_90 = len(decisionDaysTook[(decisionDaysTook > 90)])
top3_dec = decisionDaysTook.value_counts().iloc[:3]
top5 = pd.Series(days_over_30, index =['30 to 90'])
top4 = pd.Series(days_over_90, index =['90 or more'])
top4n5 = top4.append(top5)
top_days = top3_dec.append(top4n5)

fig, ax = plt.subplots(figsize=(12, 4))

sns.barplot(y=top_days.index, x=top_days.values);
plt.title("Number of Days Taken for the Result", fontsize=15, fontweight='bold')
plt.ylabel("Days")
plt.xlabel("Number of Cases")
plt.legend(['Cases']);

Here we used case submitted date and decision date to calculate the number of days taken for the decision. The majority of decisions were made less than a week.

Final Application Decisions

In [85]:
h1b['YEAR'] = h1b['DECISION_DATE'].dt.year.astype('str')
C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [86]:
H1B08_16 = h1b[~h1b["YEAR"].isin(["2017","2018"]) ]
H1B17_18 = h1b[h1b["YEAR"].isin(["2017","2018"]) ]
In [51]:
fig,ax = plt.subplots(nrows=1,ncols=3,sharey=False)
caseStatusSum = h1b.groupby(['CASE_STATUS'])[['CASE_NO']].count()
caseStatusb17 = H1B08_16.groupby(['CASE_STATUS'])[['CASE_NO']].count()
caseStatusa17 = H1B17_18.groupby(['CASE_STATUS'])[['CASE_NO']].count()
caseStatusSum.plot.pie(ax=ax[0],figsize=(15,10),y='CASE_NO', explode=(0.1, 0.1, 0.1, 0.1), autopct='%1.1f%%', shadow=True)
caseStatusb17.plot.pie(ax=ax[1],figsize=(15,10),y='CASE_NO', explode=(0.1, 0.1, 0.1, 0.1), autopct='%1.1f%%', shadow=True)
caseStatusa17.plot.pie(ax=ax[2],figsize=(15,10),y='CASE_NO', explode=(0.1, 0.1, 0.1, 0.1), autopct='%1.1f%%', shadow=True)
fig.tight_layout()
ax[0].set_title('Overall Decision on the Cases', fontsize=15, fontweight='bold')
ax[1].set_title('Before 2017 Decision on the Cases', fontsize=15, fontweight='bold')
ax[2].set_title('After 2017 Decision on the Cases', fontsize=15, fontweight='bold')
plt.legend(bbox_to_anchor=(1.4, 1), loc='upper right', ncol=1);

As the abrove pictures show, from the percentage perspective, there are no big changes before 2017, after 2017 and the overall data. The certified cases are around 88%, which is slightly increased after 2017.There is a 2% decrease shows in the denied cases and 1% increase on Certified-Withdrawn, which is interesting.

The results are different from our expection because we think the certified cases might decrease after the policy change in 2017. The next step, we will figure out if the case number is droped, which means less people apply for H1B visa per year after 2017.

H1B Applications Over Years

In [52]:
ycase = h1b.groupby('YEAR').agg('count')
ycase1 = ycase.iloc[:,:1]
ycase1
#Shows how many H1B cases was submitted through out the years
Out[52]:
CASE_NO
YEAR
2007 49669
2008 346735
2009 154337
2010 333679
2011 370950
2012 317824
2013 74897
2014 519375
2015 612865
2016 635996
2017 616599
2018 552474
In [53]:
fig, ax = plt.subplots(figsize=(10,5))
ycase1.plot(ax=ax)
plt.title('H1B Applications Over Years',fontsize=15,fontweight='bold')
plt.ylabel('Number of Applications')
ax.legend(['Jobs'])
Out[53]:
<matplotlib.legend.Legend at 0x21c9fb55a88>

In year 2013, the H1B application hit the lowest number which is 74897 and it's increasing after that. We can also see a tendency of dropping after year 2017, which matched with our expectation.

In [54]:
a = H1B08_10.groupby('STATE').agg('count')
b = a.iloc[:,:1].reset_index().rename(columns={'CASE_NO': 'Cases'})
fig = px.choropleth(b, 
                    locations="STATE", 
                    color="Cases", 
                    hover_name="STATE" , 
                    locationmode = 'USA-states') 
fig.update_layout(
    title_text = 'Year 2008 to 2010 H1B Job Rankings in States of US', 
    geo_scope='usa', 
    
)
fig.show()  
In [55]:
a = h1b.groupby('STATE').agg('count')
b = a.iloc[:,:1].reset_index().rename(columns={'CASE_NO': 'Cases'})
fig = px.choropleth(b, 
                    locations="STATE", 
                    color="Cases", 
                    hover_name="STATE" , 
                    locationmode = 'USA-states') 
fig.update_layout(
    title_text = 'H1B Job Rankings in States of US', 
    geo_scope='usa', 
    
)
fig.show()  
In [91]:
certified = h1b.loc[h1b['CASE_STATUS'] == 'CERTIFIED']
denied = h1b.loc[h1b['CASE_STATUS'] == 'DENIED']
In [92]:
certified18 = H1B17_18.loc[H1B17_18['CASE_STATUS'] == 'CERTIFIED']
denied18 = H1B17_18.loc[H1B17_18['CASE_STATUS'] == 'DENIED']
In [58]:
fig, ax = plt.subplots(2, 1, figsize=(13, 10))

sns.countplot(y=certified['FSTATE'], order=certified['FSTATE'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('States with the Most Certified H1B Applicants', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of Certified Cases')
ax[0].set_ylabel('City')
ax[0].legend(['Certified Cases'])

sns.countplot(y=certified18['FSTATE'], order=certified18['FSTATE'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('2017 to 2018 States with the Most Certified H1B Applicants', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of Certified Cases')
ax[1].set_ylabel('City')
ax[1].legend(['Certified Cases'])

plt.subplots_adjust(hspace=0.5);
In [59]:
fig, ax = plt.subplots(2, 1, figsize=(13, 10))

sns.countplot(y=denied['FSTATE'], order=denied['FSTATE'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('States with the Most denied H1B Applicants', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of denied Cases')
ax[0].set_ylabel('City')
ax[0].legend(['Denied Cases'])

sns.countplot(y=denied18['FSTATE'], order=denied18['FSTATE'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('2017 to 2018 States with the Most denied H1B Applicants', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of denied Cases')
ax[1].set_ylabel('City')
ax[1].legend(['Denied Cases'])

plt.subplots_adjust(hspace=0.5);

As the above pictures show, the number of H1B applications of states don't change that much. But we do see that Texas is poping up to offer more H1B visa and New York is slowing down it's visa application. According to a report from Business in Texas,Texas'IT services workforce is the second largest in the nation.

Number wise, California has the most number of applicants in both certified and denied. Interestingly, New Jersey has the second highest number of certified applicatns with the forth denied application. On the contrary, New York has higher denied rate with lower certified rate.


International students can focus on CA, NY, NJ and TX when looking for full-time jobs. Compare to the overall data with the data from 2017 to 2018,we suggest international students to pay extra attention on Texas and New Jersey are great opportunities.

In [61]:
fig, ax = plt.subplots(2,1, figsize=(13, 10))

sns.countplot(y=certified['CITY'], order=certified['CITY'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('Most Certified H1B Cities', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of Certified Cases')
ax[0].set_ylabel('City')
ax[0].legend(['Certified Cases'])

sns.countplot(y=denied['CITY'], order=denied['CITY'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('Most denied H1B Cities', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of denied Cases')
ax[1].set_ylabel('City')
ax[1].legend(['Denied Cases'])

plt.subplots_adjust(hspace=0.5);

From this chart, we can find that New York City is the highest numbers in certified and the most highest numbers in denied applications. And, Plano in Texas is a great opportunity with the second highest number of certified cases.

In [62]:
fig, ax = plt.subplots(figsize=(12, 5))

sns.countplot(y=h1b['FULL_TIME'])
plt.title("Full/Part Time Position of Applicants", fontsize=15, fontweight='bold')
plt.xlabel("Numbers of Cases")
plt.ylabel("Position")
ax.set_yticklabels(['Full Time', 'Part Time'])
plt.legend(['Cases']);

Most of the applicants are working as a full-time rather than a part-time. Over 1.8 millions of applicatns were full-time.

In [93]:
fig, ax = plt.subplots(2,1, figsize=(13, 10))

sns.countplot(y=certified['SOC_NAME'], order=certified['SOC_NAME'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('Overall Most Certified H1B Occupations', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of Certified Cases')
ax[0].set_ylabel('Occupations')
ax[0].legend(['Certified Cases'])

sns.countplot(y=certified18['SOC_NAME'], order=certified18['SOC_NAME'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('2017 to 2018 Most Certified H1B Occupations', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of Certified Cases')
ax[1].set_ylabel('Occupations')
ax[1].legend(['Certified Cases'])

plt.subplots_adjust(hspace=0.5);
In [94]:
fig, ax = plt.subplots(2,1, figsize=(13, 10))

sns.countplot(y=denied['SOC_NAME'], order=denied['SOC_NAME'].value_counts().iloc[:10].index, ax=ax[0])
ax[0].set_title('Overall Most Denied H1B Occupations', fontsize=15, fontweight='bold')
ax[0].set_xlabel('Number of Denied Cases')
ax[0].set_ylabel('Occupations')
ax[0].legend(['Denied Cases'])

sns.countplot(y=denied18['SOC_NAME'], order=denied18['SOC_NAME'].value_counts().iloc[:10].index, ax=ax[1])
ax[1].set_title('2017 to 2018 Most Denied H1B Occupations', fontsize=15, fontweight='bold')
ax[1].set_xlabel('Number of Denied Cases')
ax[1].set_ylabel('Occupations')
ax[1].legend(['Denied Cases'])

plt.subplots_adjust(hspace=0.5);

The occupations with the highest certified rate are all in IT field and we do see diversified results of denied applications. Besides IT, there are also opportunities in analytics, accounting and engineering roles

Software developers have the highest certified rate and the highest denied rate.

In [95]:
fig, ax = plt.subplots(2,1,figsize=(13, 10))

sns.countplot(y=h1b['EMPLOYER_NAME'], order=h1b['EMPLOYER_NAME'].value_counts().iloc[:10].index, ax=ax[0]);
ax[0].set_title("Overall H1B Popular Comapnies", fontsize=15, fontweight='bold')
ax[0].set_ylabel("Companies")
ax[0].set_xlabel("Number")
ax[0].legend(['Cases'])

sns.countplot(y=H1B17_18['EMPLOYER_NAME'], order=H1B17_18['EMPLOYER_NAME'].value_counts().iloc[:10].index, ax=ax[1]);
ax[1].set_title("2017 to 2018 H1B Popular Comapnies", fontsize=15, fontweight='bold')
ax[1].set_ylabel("Companies")
ax[1].set_xlabel("Number")
ax[1].legend(['Cases'])

plt.subplots_adjust(hspace=0.5);

As we can see the trend, the companies with most H1B applicants are technology consulting firms. It's interesting to see all of the big tech firms, such as Apply, Google, Microsoft and Amazon were not in the list of top 10 sponsored companies.

In [145]:
top10_company = h1b['EMPLOYER_NAME'].value_counts().head(10).index.to_list()
Company_job = h1b[h1b['EMPLOYER_NAME'].isin(top10_company)].groupby(['EMPLOYER_NAME'])['JOB_TITLE'].value_counts(ascending=False).to_frame()
Company_job.rename(columns={'JOB_TITLE': 'Count'})
Company_job.drop(Company_job.loc[(Company_job['JOB_TITLE'] <2000)].index,inplace=True)
In [146]:
Company_job
Out[146]:
JOB_TITLE
EMPLOYER_NAME JOB_TITLE
ACCENTURE LLP COMPUTER PROGRAMMER/CONFIGURER 2 6899
COMPUTER SYSTEMS ANALYST 2 5229
COMPUTER SPECIALIST/TESTING AND QUALITY ANALYST 2 4304
COMPUTER SYSTEMS ANALYST 3 2900
CAPGEMINI AMERICA INC SOFTWARE ENGINEER 2 2664
PROGRAMMER/DEVELOPER 2 2031
DELOITTE CONSULTING LLP SENIOR CONSULTANT 16665
CONSULTANT 14000
MANAGER 6596
SPECIALIST SENIOR 2342
ERNST & YOUNG U.S. LLP ADVISORY SENIOR 6301
ADVISORY MANAGER 3121
ADVISORY STAFF 2601
ASSURANCE STAFF 2203
HCL AMERICA, INC. PROGRAMMER ANALYST - II 4476
PROJECT MANAGER - III 2315
IBM INDIA PRIVATE LIMITED APPLICATION DEVELOPER 4392
IT SPECIALIST 2734
INFOSYS LIMITED TECHNOLOGY LEAD - US 25095
TECHNOLOGY ANALYST - US 19974
SYSTEMS ENGINEER - US 7491
CONSULTANT - US 6970
PROJECT MANAGER - US 6036
TECHNICAL TEST LEAD - US 4905
TECHNOLOGY ARCHITECT - US 4628
TECHNOLOGY LEAD - US - PRACTITIONER 4489
TEST ANALYST - US 4172
ASSOCIATE CONSULTANT - US 3675
LEAD CONSULTANT - US 3547
SENIOR PROJECT MANAGER - US 2602
MICROSOFT CORPORATION SOFTWARE ENGINEER 7239
SOFTWARE DEVELOPMENT ENGINEER 4384
SOFTWARE DEVELOPMENT ENGINEER IN TEST 4348
SENIOR SOFTWARE ENGINEER 2704
TATA CONSULTANCY SERVICES LIMITED DEVELOPER 18874
COMPUTER PROGRAMMER 12348
DEVELOPER USER INTERFACE 11394
ARCHITECT 5643
WEB DEVELOPER 4520
ANALYST 3283
COMPUTER SYSTEMS ENGINEER 2775
WIPRO LIMITED LEAD ENGINEER 12452
PROGRAMMER ANALYST 9932
PROJECT MANAGER 5493
CONSULTANT 4490
BUSINESS SYSTEM ANALYST 3463
ARCHITECT LEVEL 2 3191
SYSTEM ADMINISTRATOR 2799
TEST ENGINEER LEVEL 2 2668
In [97]:
fig, ax = plt.subplots(2, 1, figsize=(12, 8))

sns.countplot(y=certified['EMPLOYER_NAME'], order=certified['EMPLOYER_NAME'].value_counts().iloc[:10].index, ax=ax[0]);
ax[0].set_title("Comapnies with Most Certified H1B Applicants", fontsize=15, fontweight='bold')
ax[0].set_ylabel("Company")
ax[0].set_xlabel("Number")
ax[0].legend(['Cases']);

sns.countplot(y=denied['EMPLOYER_NAME'], order=denied['EMPLOYER_NAME'].value_counts().iloc[:10].index, ax=ax[1]);
ax[1].set_title("20017 Comapnies with Most Denied H1B Applicants", fontsize=15, fontweight='bold')
ax[1].set_ylabel("Company")
ax[1].set_xlabel("Number")
ax[1].legend(['Cases'])

plt.subplots_adjust(hspace=0.5);

Infosys limited, the one with the most H1B application has the highest certified cases. However, Deloitte consulting LLP has the highest denied cases. Also, those companies like IBM and Ernst&Young LLP, that relatively had lower numbers in H1B applications, has the higher denied cases that the companies like Infosys and Tata consulting services limited.

H1B Wages Aanalysis

In [98]:
h1b_agg1=h1b.groupby('EMPLOYER_NAME').agg({'PREVAILING_WAGE':[np.mean,np.median,np.min,np.max]})
h1b_agg1.columns = ['wage_mean','wage_median','wage_min','wage_max']
h1b_agg1.sort_values('wage_mean',ascending = False).head(10)
Out[98]:
wage_mean wage_median wage_min wage_max
EMPLOYER_NAME
NEVADA ORTHOPEDIC AND SPINE CENTER, LLP 1996800.0 1996800.0 1996800.0 1996800.0
RENEGADE 83, LLC 1978782.0 1978782.0 1978782.0 1978782.0
STANTON OPTICAL FLORIDA, LLC 1960946.0 1960946.0 1960946.0 1960946.0
Shadowmachine, LLC 1954680.0 1954680.0 1954680.0 1954680.0
UNIVERSAL METAL PRODUCTS 1951196.0 1951196.0 1951196.0 1951196.0
george 1950000.0 1950000.0 1950000.0 1950000.0
MARTELL CAPITAL GROUP LLC 1946880.0 1946880.0 1946880.0 1946880.0
JLB PARTNERS LLC 1944176.0 1989598.0 1853332.0 1989598.0
ROSALYN YALOW CHARTER SCHOOL 1941472.0 1941472.0 1941472.0 1941472.0
SUN RIPE CO. 1940380.0 1940380.0 1940380.0 1940380.0
In [140]:
h1b_agg=h1b.groupby('FSTATE').agg({'PREVAILING_WAGE':[np.mean,np.median,np.min,np.max]})
h1b_agg.columns = ['wage_mean','wage_median','wage_min','wage_max']
h1b_agg.sort_values('wage_mean',ascending = False).head(10)
Out[140]:
wage_mean wage_median wage_min wage_max
FSTATE
WASHINGTON 86848.904402 84739.0 15966.0 1908480.0
NORTH DAKOTA 83574.505889 55020.0 16952.0 523713.0
CALIFORNIA 83414.816024 78811.0 11072.0 1995006.0
RHODE ISLAND 81873.326776 83098.0 13000.0 1300000.0
ARKANSAS 77551.986265 65686.0 10000.0 1781936.0
MAINE 77451.106632 60986.0 14137.5 558060.0
SOUTH DAKOTA 76587.626971 51235.0 15080.0 1551680.0
WEST VIRGINIA 74987.434427 53935.0 12780.0 414007.0
MINNESOTA 74616.088051 68518.0 13591.5 1973400.0
NEW YORK 74009.406464 65874.0 10144.0 1996644.0
In [100]:
top10_state = h1b['FSTATE'].value_counts().head(10).index.to_list()
state_wage = h1b.groupby(['FSTATE','YEAR'],as_index=False)['PREVAILING_WAGE'].mean()
In [108]:
top10_state1 = H1B17_18['FSTATE'].value_counts().head(10).index.to_list()
state_wage1 = H1B17_18.groupby(['FSTATE','YEAR'],as_index=False)['PREVAILING_WAGE'].mean()
In [109]:
fig, ax = plt.subplots(2,1,figsize=(12,15))

for i in top10_state:
    tmp = state_wage.loc[state_wage['FSTATE']==i]
    
    ax[0].plot(tmp['YEAR'], tmp['PREVAILING_WAGE'], 
            label = i,
            linewidth = 2)
for i in top10_state1:
    tmp1 = state_wage1.loc[state_wage1['FSTATE']==i]
    
    ax[1].plot(tmp1['YEAR'], tmp1['PREVAILING_WAGE'], 
            label = i,
            linewidth = 2)


ax[0].set_title('Overall Trend of H1B Job Wages in Top 10 States',fontsize=15,fontweight='bold')
ax[1].set_title('2017 to 2018 Trend of H1B Job Wages in Top 10 States',fontsize=15,fontweight='bold')
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Wages')
ax[0].legend(bbox_to_anchor=(1.1, 1.05),loc = 'best')
ax[1].set_xlabel('Year')
ax[1].set_ylabel('Wages')
ax[1].legend(bbox_to_anchor=(1.1, 1.05),loc = 'best')

plt.subplots_adjust(hspace=0.5);

The overall trend for the wages of H1B applicants are increasing, and we see a big increase in 2013.California and New York have relatively high wages than other states. Also, it's interesting to see that washington has high wages almost close to California in year 2017 and 2018

4.Descriptive Data Mining

In [241]:
pd.set_option('display.max_columns', None)
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
import seaborn as sns
h1b =  pd.concat([H1B08_10,H1B11_14,H1B15_18])
h1b.dropna(how = 'all')
C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:5: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.


Out[241]:
CASE_NO CASE_STATUS CASE_SUBMITTED CITY DECISION_DATE EMPLOYER_NAME FULL_TIME JOB_TITLE POSTAL_CODE PREVAILING_WAGE SOC_CODE SOC_NAME START_DATE STATE WAGE_UNIT
0 I-08090-4182485 CERTIFIED 2008/3/30 BURLINGTON 2008/3/30 SPI RESEARCH AND DEVELOPMENT SERVICES, LLC Y PROGRAMMER ANALYST 1803 44298 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING 2008/9/30 MA yr
1 I-08090-4182487 CERTIFIED 2008/3/30 Horsham 2008/3/30 Nexus Software Solutions,Inc. Y JAVA/J2EE DEVELOPER 19044 43597 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING 2008/9/29 PA yr
2 I-08090-4182493 CERTIFIED 2008/3/30 FREMONT 2008/3/30 GREYCELL INC Y COMPUTER PROGRAMMER 94536 52437 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING 2008/9/29 CA yr
3 I-08090-4182495 CERTIFIED 2008/3/30 NEW YORK 2008/3/30 DATA PIONEER INC Y PROGRAMMER ANALYST 10005 60000 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING 2008/9/30 NY yr
4 I-08090-4182497 CERTIFIED 2008/3/30 Horsham 2008/3/30 Nexus Software Solutions,Inc. Y ASP.NET DEVELOPER 19044 43597 30 OCCUPATIONS IN SYSTEMS ANALYSIS AND PROGRAMMING 2008/9/29 PA yr
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
654355 I-200-18080-658767 CERTIFIED 3/21/2018 IRVING 3/27/2018 CROCKETT TECHNICAL LLC Y ASSOCIATE SOFTWARE ENGINEER 75062 90,646.00 15-1132 SOFTWARE DEVELOPERS, APPLICATIONS 9/9/2018 TX Year
654356 I-200-18061-209836 CERTIFIED 3/5/2018 METTAWA 3/9/2018 BRUNSWICK CORPORATION Y MANAGER GLOBAL SUPPLY CHAIN BUSINESS ANALYTICS 60045 68,598.00 17-2112 INDUSTRIAL ENGINEERS 8/27/2018 IL Year
654357 I-201-17315-187517 CERTIFIED 11/13/2017 DORAL 11/17/2017 JOFEMAR USA INC. Y BUSINESS RESEARCH AND DEVELOPMENT 33172 65,936.00 13-1161 MARKET RESEARCH ANALYSTS AND MARKETING SPECIAL... 12/1/2017 FL Year
654358 I-200-18080-929219 CERTIFIED 3/21/2018 DURHAM 3/27/2018 IBM INDIA PRIVATE LIMITED Y PACKAGE SOLUTION CONSULTANT 27709 78,416.00 15-1121 COMPUTER SYSTEMS ANALYSTS 4/1/2018 NC Year
654359 I-200-18261-995519 CERTIFIED 9/18/2018 REDMOND 9/24/2018 MICROSOFT CORPORATION Y SOFTWARE ENGINEER 98052 133,806.00 15-1132 SOFTWARE DEVELOPERS, APPLICATIONS 3/20/2019 WA Year

5118351 rows × 15 columns

Logistic Regression

In [242]:
h1b02 = h1b[['CASE_STATUS','PREVAILING_WAGE']]
In [243]:
h1b02['PREVAILING_WAGE'] = h1b02['PREVAILING_WAGE'].astype('str').str.replace(',','').str.replace('$','').str.replace('-','')
h1b02['PREVAILING_WAGE']= pd.to_numeric(h1b02['PREVAILING_WAGE'],errors = 'coerce')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('CERTIFIED-WITHDRAWN','DENIED')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('DENIED','0').str.replace('CERTIFIED','1').str.replace('WITHDRAWN','0')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED','0')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('REJECTED','0')
h1b02['CASE_STATUS']=h1b02['CASE_STATUS'].astype('str').str.replace('INVALIDATED','0')
h1b02.dropna(inplace=True)
h1b02
C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:7: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Data_Bootcamp\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[243]:
CASE_STATUS PREVAILING_WAGE
0 1 44298.0
1 1 43597.0
2 1 52437.0
3 1 60000.0
4 1 43597.0
... ... ...
654355 1 90646.0
654356 1 68598.0
654357 1 65936.0
654358 1 78416.0
654359 1 133806.0

5117497 rows × 2 columns

In [244]:
from sklearn.linear_model import LogisticRegression
skl_reg1 = LogisticRegression().fit(h1b02[['PREVAILING_WAGE']].values,h1b02['CASE_STATUS'].values)
skl_reg1.score(h1b02[['PREVAILING_WAGE']].values,h1b02['CASE_STATUS'].values)
C:\Data_Bootcamp\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning:

Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

Out[244]:
0.8731155093984423

Then we got the result: 87.3%, which proves that 'Status' does have strong connection with income.


According to the H1b official report published in April 2017,there will be more H1b visas captured by STEM workers, and generally STEM workers who got H1b visas earned more more on average (US Dollar 76356) than American workers with a bachelor’s degree (US Dollar 67301), within the same age group and occupation. This is why H1b visas still go to workers with higher income, because most of them are able to get a job at high-tech companies such as Google and Facebook, and some of them are sponsored by consulting firms.

5.Conclusion

In this rerport, we analyzed the H1B petition data from year 2008 to year 2018. There are some interesting findings:


1.In year 2013, the case submission has a big drop.And the H1B wages has a slightly increase in the same year.

2.The Overall certified HIB rate is around 88% and the processing time is around a week.

3.California is the state that has the highest H1B case submission and average wage, which around 8k

4.New York, New Jerseay, and Texas are three states that have high H1B submission. Among these states,the application number and denied rate are increaing in New York, but Texsas is a growing places that deserve extra attention.

5.Besides, Washington has the highest average wages

6.IT is the most popular area that get H1B sponsored. The occupations are related to programming and software development. Beside IT, there are opportunities in accounting and analytics (financial, marketing, and management)

7.IT consulting firms are most welcome H1B sponsorship, but we do see the big tech companies such as Apply, Amazon, Google, and Microsoft are missing in the top 10 list

Thanks for reading. Hope you find it interesting and helpful!

In [141]:
#h1b.to_csv('Databootcamp_finalreport.csv')