Childhood is a developmental period in which students learn many skills, both academic and practical, that will serve them throughout their lives. One of the skills that begins to develop at a young age is basic math and reading literacy: most of the math that adults use day to day, such as basic algebra with money, is taught by middle school, and reading comprehension is key to navigating much of adult life - understanding forms, learning new information, searching for housing, and so on. It is therefore important that all children in this developmental stage have equitable opportunities, so that during such a key growth period they all have the tools and education necessary to build long-lasting skills like math and reading comprehension.
However, not all students are given such opportunities equally. The US education system has long been known to have varying standards of education, with differences in quality beginning as early as pre-kindergarten, but relatively little documentation exists to confirm how large these variations are. If these differences in education quality exist and persist over time, it is imperative that they be resolved at an institutional level.
The focus of this tutorial is to examine whether education inequality is reflected in differences on national math and reading examinations, whether that inequality is continuing over time, and which factors, such as gender or state, play significant roles in it (if they exist), and then to use this analysis to predict what education inequality will look like in future years if the current education system is maintained.
Our null hypothesis is that gender and state have no relationship with or impact on math or reading literacy in children at this developmental stage. Our alternative hypothesis is that gender and state have some relationship with or impact on math or reading literacy in children at this developmental stage. We define math literacy and reading literacy by the mathematics and reading scores earned by students on the NAEP Test - the National Assessment of Educational Progress Test, a standardized national educational assessment that tracks math and reading progress across US states.
The data that we will be examining can be found here - specifically, the states_all_extended.csv file found on the site. The dataset contains compiled US education data, such as the number of students enrolled in 1st through 12th grade for all states, average math scores for gendered groups by state and year, and more. We decided to focus specifically on the average math and average reading scores that 4th graders earned on the NAEP test, as research finds that 4th to 5th grade is a key developmental period for children.
First, we have to import the necessary libraries to load the dataset. We are using pandas, numpy, matplotlib.pyplot, geopandas, and statsmodels. Pandas is used for its DataFrame object, an easy way to store tabular data. Numpy is used for its math functionality, and matplotlib.pyplot is used to plot graphs demonstrating relationships between variables in our data. Geopandas will be used to create heat maps of our data, and statsmodels will be used for hypothesis testing and finding our p-values.
As an FYI, we used Jupyter Notebook through the Anaconda distribution for our work. To install and import geopandas, we had to create a new Anaconda environment with a lower Python version, install geopandas onto our machine in that environment, and then import geopandas from there. Credit for this process goes to mick-d here: https://github.com/conda/conda/issues/9367#issuecomment-628231987
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import statsmodels.api as sm
Because we had to use an Anaconda environment with a lower Python version, we got warnings on certain commands written for our previous, higher Python version. We chose to suppress them for now - we will come back later to update those commands for this Python version!
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
Now we have to load the data. The data is stored in the "states_all_extended.csv" file and so we have to load it into a DataFrame. This can be done using pandas "read_csv" method. We will store this data in a variable called "school_data".
school_data = pd.read_csv("states_all_extended.csv")
# display first few rows
school_data.head()
PRIMARY_KEY | STATE | YEAR | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1992_ALABAMA | ALABAMA | 1992 | NaN | 2678885.0 | 304177.0 | 1659028.0 | 715680.0 | 2653798.0 | 1481703.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1992_ALASKA | ALASKA | 1992 | NaN | 1049591.0 | 106780.0 | 720711.0 | 222100.0 | 972488.0 | 498362.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1992_ARIZONA | ARIZONA | 1992 | NaN | 3258079.0 | 297888.0 | 1369815.0 | 1590376.0 | 3401580.0 | 1435908.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1992_ARKANSAS | ARKANSAS | 1992 | NaN | 1711959.0 | 178571.0 | 958785.0 | 574603.0 | 1743022.0 | 964323.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1992_CALIFORNIA | CALIFORNIA | 1992 | NaN | 26260025.0 | 2072470.0 | 16546514.0 | 7641041.0 | 27138832.0 | 14358922.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 266 columns
Looking at the data, we can see that there are a few columns we will not need. For example, PRIMARY_KEY isn't a data point we need to consider when testing our hypothesis, so we can get rid of it. We can use the DataFrame method drop and specify the columns we want to drop.
school_data = school_data.drop(columns=['PRIMARY_KEY'])
We only want to use data from 2009 onward because, before 2009, our dataset did not record gender demographics for the NAEP test result averages.
# get previous number of rows
prev_rows = len(school_data.index)
school_data = school_data[school_data['YEAR'] >= 2009]
# get current number of rows
curr_rows = len(school_data.index)
print(str(prev_rows - curr_rows) + " rows were dropped.")
school_data.head()
1193 rows were dropped.
STATE | YEAR | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | SUPPORT_SERVICES_EXPENDITURE | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
867 | ALABAMA | 2009 | 745668.0 | 7186390.0 | 728795.0 | 4161103.0 | 2296492.0 | 7815467.0 | 3836398.0 | 2331552.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
868 | ALASKA | 2009 | 130236.0 | 2158970.0 | 312667.0 | 1357747.0 | 488556.0 | 2396412.0 | 1129756.0 | 832783.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
869 | ARIZONA | 2009 | 981303.0 | 8802515.0 | 1044140.0 | 3806064.0 | 3952311.0 | 9580393.0 | 4296503.0 | 2983729.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
870 | ARKANSAS | 2009 | 474423.0 | 4753142.0 | 534510.0 | 3530487.0 | 688145.0 | 5017352.0 | 2417974.0 | 1492691.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
871 | CALIFORNIA | 2009 | 6234155.0 | 73958896.0 | 9745250.0 | 40084244.0 | 24129402.0 | 74766086.0 | 35617964.0 | 21693675.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 265 columns
Since the column names are a little tricky to figure out, we are going to outline how to read them here:
G## - This signifies which grade this value is talking about; for example G04 is referring to grade 4.
G##_A_A - This refers to the number of all students in that grade, across all races and genders.
G##_x_g - This is read as the number of students of race x and gender g in grade ##; for example, G06_AS_M is the number of Asian male students in grade 6.
G##_x_g_test - This is the average test score for students of race x and gender g in grade ## on that test; for example, G06_AS_A_MATHEMATICS is the average math score of all Asian students in grade 6.
A in place of a gender or race signifies all genders or all races.
The different race codes provided by the dataset are defined as AM - American Indian or Alaska Native, AS - Asian, HI - Hispanic/Latino, BL - Black, WH - White, HP - Hawaiian Native/Pacific Islander and TR - two or more races. (We will not be analyzing individual races in our data analysis, but this information is useful to know for future work.)
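As a quick, optional sanity check of this convention (a minimal sketch; it assumes the column names follow the pattern described above), we can programmatically pull out the grade 4, all-race, all-gender score columns that we will rely on later:
# Illustrative check of the column naming convention described above
g04_score_cols = [col for col in school_data.columns
                  if col.startswith("G04_A_A_") and col.endswith(("READING", "MATHEMATICS"))]
print(g04_score_cols)  # we expect ['G04_A_A_READING', 'G04_A_A_MATHEMATICS']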
Our exploratory data analysis will have two parts: state-level heat maps of score changes from 2009 to 2015, and gender-based comparisons of those changes.
In order to create our heat map, we will need the grid coordinates of the United States map with state outlines, because geopandas, the geospatial data mapping Python library we will be using, requires geometric coordinates for mapping. We pulled our US states map shapefile and the other necessary mapping files from the US Census, which can be found in this zip file: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_500k.zip , and the description of the mapping files can be found here: https://www.census.gov/programs-surveys/geography/technical-documentation/naming-convention/cartographic-boundary-file.html . We manually unzipped the files and added them to the folder our notebook is in, since geopandas needs access to the files from there, but this step can be automated.
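If you would rather not unzip the files by hand, a minimal sketch of automating the download and extraction (assuming the notebook's working directory is where geopandas will look for the files) could be:
import urllib.request
import zipfile
# download the Census cartographic boundary zip and extract it next to the notebook
shp_url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_500k.zip"
urllib.request.urlretrieve(shp_url, "cb_2018_us_state_500k.zip")
with zipfile.ZipFile("cb_2018_us_state_500k.zip") as zf:
    zf.extractall(".")  # the .shp file and its companion files must stay together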
Specifically, we use the read_file function from the geopandas library, which takes the name of the shapefile (essentially a set of geometric coordinates) and also reads the supporting index and attribute files that come along with it.
usa_states = gpd.read_file("cb_2018_us_state_500k.shp")
usa_states.head()
STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | NAME | LSAD | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 28 | 01779790 | 0400000US28 | 28 | MS | Mississippi | 00 | 121533519481 | 3926919758 | MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ... |
1 | 37 | 01027616 | 0400000US37 | 37 | NC | North Carolina | 00 | 125923656064 | 13466071395 | MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ... |
2 | 40 | 01102857 | 0400000US40 | 40 | OK | Oklahoma | 00 | 177662925723 | 3374587997 | POLYGON ((-103.00257 36.52659, -103.00219 36.6... |
3 | 51 | 01779803 | 0400000US51 | 51 | VA | Virginia | 00 | 102257717110 | 8528531774 | MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ... |
4 | 54 | 01779805 | 0400000US54 | 54 | WV | West Virginia | 00 | 62266474513 | 489028543 | POLYGON ((-82.64320 38.16909, -82.64300 38.169... |
We will be matching our grid coordinates for each state provided in the usa_states dataframe with state educational information in our school_data dataframe. Let's see what states or territories are provided in both dataframes.
print("We have grid coordinates for: {}".format(list(usa_states["NAME"])))
We have grid coordinates for: ['Mississippi', 'North Carolina', 'Oklahoma', 'Virginia', 'West Virginia', 'Louisiana', 'Michigan', 'Massachusetts', 'Idaho', 'Florida', 'Nebraska', 'Washington', 'New Mexico', 'Puerto Rico', 'South Dakota', 'Texas', 'California', 'Alabama', 'Georgia', 'Pennsylvania', 'Missouri', 'Colorado', 'Utah', 'Tennessee', 'Wyoming', 'New York', 'Kansas', 'Alaska', 'Nevada', 'Illinois', 'Vermont', 'Montana', 'Iowa', 'South Carolina', 'New Hampshire', 'Arizona', 'District of Columbia', 'American Samoa', 'United States Virgin Islands', 'New Jersey', 'Maryland', 'Maine', 'Hawaii', 'Delaware', 'Guam', 'Commonwealth of the Northern Mariana Islands', 'Rhode Island', 'Kentucky', 'Ohio', 'Wisconsin', 'Oregon', 'North Dakota', 'Arkansas', 'Indiana', 'Minnesota', 'Connecticut']
print("We have state math and reading score information for: {}".format(list(school_data["STATE"].unique())))
We have state math and reading score information for: ['ALABAMA', 'ALASKA', 'ARIZONA', 'ARKANSAS', 'CALIFORNIA', 'COLORADO', 'CONNECTICUT', 'DELAWARE', 'DISTRICT_OF_COLUMBIA', 'FLORIDA', 'GEORGIA', 'HAWAII', 'IDAHO', 'ILLINOIS', 'INDIANA', 'IOWA', 'KANSAS', 'KENTUCKY', 'LOUISIANA', 'MAINE', 'MARYLAND', 'MASSACHUSETTS', 'MICHIGAN', 'MINNESOTA', 'MISSISSIPPI', 'MISSOURI', 'MONTANA', 'NEBRASKA', 'NEVADA', 'NEW_HAMPSHIRE', 'NEW_JERSEY', 'NEW_MEXICO', 'NEW_YORK', 'NORTH_CAROLINA', 'NORTH_DAKOTA', 'OHIO', 'OKLAHOMA', 'OREGON', 'PENNSYLVANIA', 'RHODE_ISLAND', 'SOUTH_CAROLINA', 'SOUTH_DAKOTA', 'TENNESSEE', 'TEXAS', 'UTAH', 'VERMONT', 'VIRGINIA', 'WASHINGTON', 'WEST_VIRGINIA', 'WISCONSIN', 'WYOMING', 'DODEA', 'NATIONAL']
Looking at our State values for both the grid coordinate data (usa_states) and the math/reading score data (school_data), we can see that the school_data dataframe does not contain information on the United States Virgin Islands, American Samoa, Puerto Rico, Guam, or the Commonwealth of the Northern Mariana Islands. Therefore, we will need to drop those territories' information from our grid data, and create a final grid data dataframe called 'states_coords.'
# Dropping territories that do not have data in the education data dataset
states_coords = usa_states[usa_states.NAME != "United States Virgin Islands"]
states_coords = states_coords[states_coords.NAME != "American Samoa"]
states_coords = states_coords[states_coords.NAME != "Puerto Rico"]
states_coords = states_coords[states_coords.NAME != "Guam"]
states_coords = states_coords[states_coords.NAME != "Commonwealth of the Northern Mariana Islands"]
states_coords = states_coords.sort_values(by = ['NAME']) # For readability
We can also notice that the formatting of the state names in the two dataframes differs. Since our geopandas library will need our mapping and statistical information to be merged together, we need to merge them on a common key value, which can be the state names. Therefore, we update the formatting of the NAME variable in states_coords to match the STATE variable formatting in school_data.
# Matching formatting of NAME variable in states_coords with STATE variable formatting in education data
for row, curr_state in states_coords.iterrows():
    states_coords.at[row, 'NAME'] = curr_state['NAME'].replace(' ', '_')
    states_coords.at[row, 'NAME'] = states_coords.at[row, 'NAME'].upper()
states_coords = states_coords.rename(columns = {"NAME": "STATE"})
states_coords.head()
STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | STATE | LSAD | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|---|
17 | 01 | 01779775 | 0400000US01 | 01 | AL | ALABAMA | 00 | 131174048583 | 4593327154 | MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ... |
27 | 02 | 01785533 | 0400000US02 | 02 | AK | ALASKA | 00 | 1478839695958 | 245481577452 | MULTIPOLYGON (((179.48246 51.98283, 179.48656 ... |
35 | 04 | 01779777 | 0400000US04 | 04 | AZ | ARIZONA | 00 | 294198551143 | 1027337603 | POLYGON ((-114.81629 32.50804, -114.81432 32.5... |
52 | 05 | 00068085 | 0400000US05 | 05 | AR | ARKANSAS | 00 | 134768872727 | 2962859592 | POLYGON ((-94.61783 36.49941, -94.61765 36.499... |
16 | 06 | 01779778 | 0400000US06 | 06 | CA | CALIFORNIA | 00 | 403503931312 | 20463871877 | MULTIPOLYGON (((-118.60442 33.47855, -118.5987... |
Vice versa, we can also see that the grid coordinate dataset does not have any DODEA or NATIONAL values (which are present in school_data), so we can drop those values from the school_data dataframe.
We can also isolate the 2009 and 2015 rows from the main school_data frame now, as we will use this information soon. Since we only care about the grade 4 average score values, we can drop extra columns not relevant to us, such as federal and state revenues, enrollment counts, etc.
# Dropping state values not relevant in education dataset for 2009
data_2009 = school_data[school_data['YEAR'] == 2009]
data_2009 = data_2009[data_2009.STATE != "DODEA"]
data_2009 = data_2009[data_2009.STATE != "NATIONAL"]
# Dropping revenue, expenditure, and enrollment columns we will not be analyzing
data_2009 = data_2009.drop(columns=['TOTAL_REVENUE', 'FEDERAL_REVENUE', 'STATE_REVENUE',
                                    'LOCAL_REVENUE', 'ENROLL', 'TOTAL_EXPENDITURE',
                                    'CAPITAL_OUTLAY_EXPENDITURE', 'INSTRUCTION_EXPENDITURE',
                                    'SUPPORT_SERVICES_EXPENDITURE', 'OTHER_EXPENDITURE'])
data_2009 = data_2009.sort_values(by = ['STATE']) # For readability
data_2009 = data_2009.reset_index(drop=True)
data_2009.head()
STATE | YEAR | A_A_A | G01_A_A | G02_A_A | G03_A_A | G04_A_A | G05_A_A | G06_A_A | G07_A_A | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ALABAMA | 2009 | 748889.0 | 57821.0 | 56628.0 | 58608.0 | 59512.0 | 58656.0 | 58231.0 | 58118.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | ALASKA | 2009 | 131661.0 | 9926.0 | 9827.0 | 10032.0 | 10046.0 | 9864.0 | 9567.0 | 9657.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | ARIZONA | 2009 | 1077831.0 | 85725.0 | 84033.0 | 84060.0 | 83686.0 | 83193.0 | 81987.0 | 82050.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | ARKANSAS | 2009 | 480559.0 | 37665.0 | 36934.0 | 36903.0 | 36479.0 | 36489.0 | 35958.0 | 36113.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | CALIFORNIA | 2009 | 6263438.0 | 470783.0 | 459334.0 | 459813.0 | 465866.0 | 460248.0 | 461373.0 | 466893.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 255 columns
And we do the same for the 2015 education dataset.
# Dropping state values not relevant in education dataset for 2015
data_2015 = school_data[school_data['YEAR'] == 2015]
data_2015 = data_2015[data_2015.STATE != "DODEA"]
data_2015 = data_2015[data_2015.STATE != "NATIONAL"]
# Dropping revenue, expenditure, and enrollment columns we will not be analyzing
data_2015 = data_2015.drop(columns=['TOTAL_REVENUE', 'FEDERAL_REVENUE', 'STATE_REVENUE',
                                    'LOCAL_REVENUE', 'ENROLL', 'TOTAL_EXPENDITURE',
                                    'CAPITAL_OUTLAY_EXPENDITURE', 'INSTRUCTION_EXPENDITURE',
                                    'SUPPORT_SERVICES_EXPENDITURE', 'OTHER_EXPENDITURE'])
data_2015 = data_2015.sort_values(by = ['STATE']) # For readability
data_2015 = data_2015.reset_index()
data_2015.head()
index | STATE | YEAR | A_A_A | G01_A_A | G02_A_A | G03_A_A | G04_A_A | G05_A_A | G06_A_A | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1173 | ALABAMA | 2015 | 743789.0 | 59023.0 | 58766.0 | 57963.0 | 55808.0 | 55340.0 | 54900.0 | ... | 252.0 | 260.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1174 | ALASKA | 2015 | 132477.0 | 10587.0 | 10512.0 | 10441.0 | 10118.0 | 9793.0 | 9648.0 | ... | 263.0 | 279.0 | 260.0 | 282.0 | 231.0 | 257.0 | 242.0 | NaN | 270.0 | 285.0 |
2 | 1175 | ARIZONA | 2015 | 1109040.0 | 84804.0 | 87325.0 | 88194.0 | 86594.0 | 85719.0 | 85202.0 | ... | 254.0 | 273.0 | NaN | NaN | 244.0 | 260.0 | NaN | NaN | NaN | NaN |
3 | 1176 | ARKANSAS | 2015 | 492132.0 | 38160.0 | 38590.0 | 38410.0 | 35893.0 | 35850.0 | 36020.0 | ... | 255.0 | 269.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1177 | CALIFORNIA | 2015 | 6226737.0 | 444573.0 | 463881.0 | 470157.0 | 485885.0 | 476427.0 | 471467.0 | ... | 249.0 | 263.0 | 279.0 | 304.0 | NaN | NaN | NaN | NaN | 263.0 | 289.0 |
5 rows × 256 columns
Now, we will create a dataframe to hold the calculated differences between grade 4 reading and math scores from 2009 to 2015 (2015 minus 2009) per state, which we will soon use for our visualization. To do so, we will traverse the 2009 and 2015 dataframes concurrently, pull out each state's math and reading scores for all students, and subtract one from the other. These differences in average math and reading scores will be recorded for each state for the map visualization.
# Adding the states to our empty dataframe
g04_15to09_states = pd.DataFrame()
g04_15to09_states[['STATE']] = data_2015[['STATE']]
# Traversing through 2009 and 2015 score dataset and tracking differences
for ind, currstate in data_2009.iterrows():
    read_val2015 = float(data_2015.at[ind, "G04_A_A_READING"])
    read_val2009 = float(data_2009.at[ind, "G04_A_A_READING"])
    read_diff = read_val2015 - read_val2009
    g04_15to09_states.at[ind, "READ_GROWTH"] = read_diff
    math_val2015 = float(data_2015.at[ind, "G04_A_A_MATHEMATICS"])
    math_val2009 = float(data_2009.at[ind, "G04_A_A_MATHEMATICS"])
    math_diff = math_val2015 - math_val2009
    g04_15to09_states.at[ind, "MATH_GROWTH"] = math_diff
g04_15to09_states.head()
STATE | READ_GROWTH | MATH_GROWTH | |
---|---|---|---|
0 | ALABAMA | 1.0 | 3.0 |
1 | ALASKA | 2.0 | -1.0 |
2 | ARIZONA | 5.0 | 8.0 |
3 | ARKANSAS | 2.0 | -3.0 |
4 | CALIFORNIA | 3.0 | 0.0 |
Now, we will need to merge the dataframe containing the grid coordinates and the dataframe containing the math/reading difference/growth variables together for our future heat map plotting.
map_and_stats = states_coords.merge(g04_15to09_states, on="STATE")
map_and_stats.head()
STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | STATE | LSAD | ALAND | AWATER | geometry | READ_GROWTH | MATH_GROWTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 01779775 | 0400000US01 | 01 | AL | ALABAMA | 00 | 131174048583 | 4593327154 | MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ... | 1.0 | 3.0 |
1 | 02 | 01785533 | 0400000US02 | 02 | AK | ALASKA | 00 | 1478839695958 | 245481577452 | MULTIPOLYGON (((179.48246 51.98283, 179.48656 ... | 2.0 | -1.0 |
2 | 04 | 01779777 | 0400000US04 | 04 | AZ | ARIZONA | 00 | 294198551143 | 1027337603 | POLYGON ((-114.81629 32.50804, -114.81432 32.5... | 5.0 | 8.0 |
3 | 05 | 00068085 | 0400000US05 | 05 | AR | ARKANSAS | 00 | 134768872727 | 2962859592 | POLYGON ((-94.61783 36.49941, -94.61765 36.499... | 2.0 | -3.0 |
4 | 06 | 01779778 | 0400000US06 | 06 | CA | CALIFORNIA | 00 | 403503931312 | 20463871877 | MULTIPOLYGON (((-118.60442 33.47855, -118.5987... | 3.0 | 0.0 |
We can now plot heat maps of the 50 US states showing their changes in average math and reading scores on the NAEP test from 2009 to 2015. We will first plot a heat map of just the reading score changes for analysis.
fig, ax = plt.subplots(1, figsize=(14, 14))
plt.xticks(rotation=90)
map_and_stats.plot(column = "READ_GROWTH", cmap = "RdYlGn",
linewidth = 0.4, ax = ax, edgecolor = ".4")
ax.set_title('Heat Map of Growth in Reading Literacy for 4th Graders from 2009 to 2016')
bar_info = plt.cm.ScalarMappable(cmap="RdYlGn", norm=plt.Normalize(vmin= -6, vmax=6))
bar_info._A = []
#https://stackoverflow.com/questions/18195758/set-matplotlib-colorbar-size-to-match-graph
cax = fig.add_axes([ax.get_position().x1 + 0.01,ax.get_position().y0,0.05,ax.get_position().height])
cbar = fig.colorbar(bar_info, cax = cax)
ax.set_xlim(-130, -60)
ax.set_ylim(25, 50)
ax.axis("off")
(-130.0, -60.0, 25.0, 50.0)
(Our map excludes Hawaii and Alaska due to some mapping configuration issues - but just as an FYI, Alaska had a growth of 1.0 and Hawaii had a growth of 2.0).
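If you want to confirm the values for the states that are not drawn on the map, a quick lookup in the merged dataframe (a small sketch using the columns we built above) would be:
# look up the growth values for the states excluded from the plot area
print(map_and_stats.loc[map_and_stats['STATE'].isin(['ALASKA', 'HAWAII']),
                        ['STATE', 'READ_GROWTH', 'MATH_GROWTH']])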
Well, this is interesting! If education were being delivered equally across states, we would expect to see the same amount of growth, whether positive, negative, or none, in the reading literacy of 4th graders from 2009 to 2015. As a reminder, we measure reading literacy growth by taking a state's average score for fourth graders on the NAEP reading exam in 2015 and subtracting that state's average from 2009. In terms of our heat map, if education were equal, we would expect to see the same color, i.e. the same growth amount, across all states. However, as the heat map above shows, not all states are experiencing the same change in reading literacy - states such as Louisiana and North Carolina have gone up 3 to 6 points in average NAEP reading scores over the 6 years. This is the kind of positive result we would hope to see - states' reading scores improving over the years! However, other states, such as Kansas and our state, Maryland, have gone down by 6 points in reading literacy over the same 6-year span. Looking at the map, we can see that the remaining states also fluctuate in their reading literacy patterns - an indicator that reading literacy growth can depend on the state over time, which is suggestive of education inequality by state.
Let's create a similar heat map for growth in math literacy for 4th graders across the states from 2009 to 2015, and see what the results look like there.
fig, ax = plt.subplots(1, figsize=(14, 14))
plt.xticks(rotation=90)
map_and_stats.plot(column = "MATH_GROWTH", cmap = "RdYlBu",
linewidth = 0.4, ax = ax, edgecolor = ".4")
ax.set_title('Heat Map of Growth in Mathematics Literacy for 4th Graders from 2009 to 2016')
bar_info = plt.cm.ScalarMappable(cmap="RdYlBu", norm=plt.Normalize(vmin= -6, vmax=6))
bar_info._A = []
#https://stackoverflow.com/questions/18195758/set-matplotlib-colorbar-size-to-match-graph
cax = fig.add_axes([ax.get_position().x1 + 0.01,ax.get_position().y0,0.05,ax.get_position().height])
cbar = fig.colorbar(bar_info, cax = cax)
ax.set_xlim(-130, -60)
ax.set_ylim(25, 50)
ax.axis("off")
(-130.0, -60.0, 25.0, 50.0)
Interesting! Regarding math literacy growth, 4th grade students in New York, Vermont, Maryland, New Jersey, and Kansas seem to have gone down by almost 5 to 6 points in their average NAEP mathematics score from 2009 to 2015, which is a bit disheartening to see. Maryland and Kansas also had the most negative growth in reading literacy over the same period, whereas New York, Vermont, and New Jersey, though still negative, had slightly smaller decreases in reading literacy - only about 3 to 4 points downward.
Comparing the heat maps for change in math literacy versus change in reading literacy across states from 2009 to 2015, we can also notice that states such as Tennessee, Texas, and Alabama had no growth to little positive growth (0 to 2 points) in mathematics literacy over the 6 years, but had negative growth of 2 to 6 points in reading literacy - which might indicate that, even within a state's educational standards, reading standards and math standards may differ state by state.
Now, students within a state are not all one type of person. We can look at how much gendered experiences, within the dataset's bounds of "MALE" and "FEMALE", affect 4th grade students' math and reading literacy growth over the years. To compare whether gendered experiences affect growth, we can create stacked bar plots that visualize the changes in reading and mathematics scores for each state, split by gender.
The code below traverses through the STATE indices of the 2009 education dataset and the 2015 education dataset concurrently, and pulls out the average math or reading score for 4th graders on the NAEP Test for each state and gender (Female or Male) combination possible. After pulling out those values for 2009 and 2015, we subtract the 2009 average from the 2015 average, and we record these differences for each subject (math or reading) and gender (male or female) combination for graphing purposes later on.
states = []
female_math_change = []
male_math_change = []
female_read_change = []
male_read_change = []
for ind, currstate in data_2009.iterrows():
    states.append(map_and_stats["STUSPS"][ind])
    # READING CHANGE for FEMALE
    f_read_val2015 = float(data_2015.at[ind, "G04_A_F_READING"])
    f_read_val2009 = float(data_2009.at[ind, "G04_A_F_READING"])
    f_read_diff = f_read_val2015 - f_read_val2009
    female_read_change.append(f_read_diff)
    # MATH CHANGE for FEMALE
    f_math_val2015 = float(data_2015.at[ind, "G04_A_F_MATHEMATICS"])
    f_math_val2009 = float(data_2009.at[ind, "G04_A_F_MATHEMATICS"])
    f_math_diff = f_math_val2015 - f_math_val2009
    female_math_change.append(f_math_diff)
    # READING CHANGE for MALE
    m_read_val2015 = float(data_2015.at[ind, "G04_A_M_READING"])
    m_read_val2009 = float(data_2009.at[ind, "G04_A_M_READING"])
    m_read_diff = m_read_val2015 - m_read_val2009
    male_read_change.append(m_read_diff)
    # MATH CHANGE for MALE
    m_math_val2015 = float(data_2015.at[ind, "G04_A_M_MATHEMATICS"])
    m_math_val2009 = float(data_2009.at[ind, "G04_A_M_MATHEMATICS"])
    m_math_diff = m_math_val2015 - m_math_val2009
    male_math_change.append(m_math_diff)
Great! Now we can plot our growth values. The following code produces a stacked bar plot for the 50 states and the changes between the average mathematics scores from 2009 to 2015 for male and female students per state.
# MATH GROWTH
state_locs = np.arange(len(states)) # the state label locations
bar_width = 0.55 # the width of the bars
fig, ax = plt.subplots(1, figsize=(15, 10))
male_bars = ax.bar(states, male_math_change, bar_width, label = 'MALE', color = 'magenta')
female_bars = ax.bar(states, female_math_change, bar_width,
                     bottom = male_math_change, label = 'FEMALE',
                     color = (0.2, 0.7, 0.9, 0.5))
ax.bar_label(male_bars, padding = 4)
ax.bar_label(female_bars, padding = 4)
ax.set_xlabel('State', fontsize = 15)
ax.set_ylabel('Change in Average Math Score', fontsize = 15)
ax.set_title('Mathematics Growth from \'09 to \'15 in NAEP Test by State and Gender', fontsize = 15)
ax.set_xticks(state_locs, states)
ax.legend()
fig.tight_layout()
plt.show()
The following code produces a stacked bar plot for the 50 states and the changes between the average reading scores from 2009 to 2015 for male and female students per state.
# READING
state_locs = np.arange(len(states)) # the state label locations
bar_width = 0.55 # the width of the bars
fig, ax = plt.subplots(1, figsize=(15, 10))
male_bars = ax.bar(states, male_read_change, bar_width, label = 'MALE', color = 'orange')
female_bars = ax.bar(states, female_read_change, bar_width,
                     bottom = male_read_change, label = 'FEMALE',
                     color = (0.2, 0.7, 0.9, 0.5))
ax.bar_label(male_bars, padding = 4)
ax.bar_label(female_bars, padding = 4)
ax.set_xlabel('State', fontsize = 15)
ax.set_ylabel('Change in Average Reading Score', fontsize = 15)
ax.set_title('Reading Growth from \'09 to \'15 in NAEP Test by State and Gender', fontsize = 15)
ax.set_xticks(state_locs, states)
ax.legend()
fig.tight_layout()
plt.show()
Let's analyze our math growth plot first. Looking at the change in average math scores per state by gender, we can notice that female students tended to be at either extreme of math score changes, whereas male students tended to fluctuate less between extreme jumps in score from 2009 to 2015. In 38 of the 50 states/territories, female students had a larger difference - whether negative or positive - in average math scores from 2009 to 2015. This could be explained by two pressures often pushed onto students gendered as female: 1) the stereotype of being 'dumb' or 'not smart enough', which communities or individuals may respond to by pushing expectations onto such students to do better on these examinations, and 2) the expectation that they should not worry about education compared to other ideals such as homemaking, which can be pushed onto such children at an early age.
Next, let's compare that math plot with our reading growth plot. We can notice that only 15 states had at least one gender (FEMALE or MALE) with a negative change in the state's average reading score on the NAEP test for 4th graders from 2009 to 2015, which is a smaller number than the 29 states that had at least one gender with a negative change in average math scores for the same population. Specifically, the differences between FEMALE and MALE reading score changes across states seem to be a bit smaller than the spread of differences for math scores, but the positive FEMALE point jumps in reading still seem to be larger than the positive MALE point jumps. We can also notice that DC seems to have the largest positive point increase from 2009 in both MALE and FEMALE groups for both average math and average reading scores, indicating that DC's math and reading teaching standards may be a good model for other states. Still, it would be interesting to see why FEMALE students had such a higher jump compared to MALE students.
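If you would rather double-check counts like these than eyeball the plots, a small sketch over the change lists we built above (the exact numbers depend on the data) could look like this:
# count states where at least one gender had a negative change, and where the female change was larger in magnitude
read_neg = sum(1 for f, m in zip(female_read_change, male_read_change) if f < 0 or m < 0)
math_neg = sum(1 for f, m in zip(female_math_change, male_math_change) if f < 0 or m < 0)
female_larger_math = sum(1 for f, m in zip(female_math_change, male_math_change) if abs(f) > abs(m))
print("States with a negative reading change for at least one gender:", read_neg)
print("States with a negative math change for at least one gender:", math_neg)
print("States where the female math change was larger in magnitude:", female_larger_math)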
The exploratory data analysis has given us some idea of what to expect when developing our predictive models below: the FEMALE gender may carry more weight in driving the more drastic changes in average math scores for each state, and there may be fewer negative growth values for reading than for math.
The first predictive models we are going to create will predict the change in grade 4 average test scores based on previous years' data for each state. First, we are going to remove all the columns except for state, year, and the average math and reading test scores, and drop the rows that are not individual states (the DODEA and NATIONAL rows).
# get columns needed
state_avg = school_data[['STATE', 'YEAR', 'G04_A_A_READING', 'G04_A_A_MATHEMATICS']]
state_avg = state_avg[state_avg.STATE != "DODEA"] # dropping rows that are not relevant states
state_avg = state_avg[state_avg.STATE != "NATIONAL"] # dropping rows that are not relevant states
state_avg.head()
STATE | YEAR | G04_A_A_READING | G04_A_A_MATHEMATICS | |
---|---|---|---|---|
867 | ALABAMA | 2009 | 216.0 | 228.0 |
868 | ALASKA | 2009 | 211.0 | 237.0 |
869 | ARIZONA | 2009 | 210.0 | 230.0 |
870 | ARKANSAS | 2009 | 216.0 | 238.0 |
871 | CALIFORNIA | 2009 | 210.0 | 232.0 |
Similar to our exploratory data analysis, to create a metric for how the test scores have improved or worsened, we subtract each state's 2009 average reading and math scores from that state's averages in another year, defining how much the average math and reading test scores have changed since 2009. We store this metric in two new columns, "READING_GROWTH" and "MATHEMATICS_GROWTH".
# set reading growth to NaN first
state_avg['READING_GROWTH'] = np.NaN
# method to process a row and return that state's reading average in 2009
def process_reading(row):
    state = row['STATE']
    new = state_avg.loc[state_avg['STATE'] == state]
    new = new.loc[new['YEAR'] == 2009]
    return new['G04_A_A_READING'].iloc[0]
# in each row, update the reading growth value with the difference between this year's value and the 2009 value
for i, row in state_avg.iterrows():
    state_avg.at[i, 'READING_GROWTH'] = row['G04_A_A_READING'] - process_reading(row)
state_avg['MATHEMATICS_GROWTH'] = np.NaN
# similar function as reading, but for mathematics
def process_math(row):
    state = row['STATE']
    new = state_avg.loc[state_avg['STATE'] == state]
    new = new.loc[new['YEAR'] == 2009]
    return new['G04_A_A_MATHEMATICS'].iloc[0]
for i, row in state_avg.iterrows():
    state_avg.at[i, 'MATHEMATICS_GROWTH'] = row['G04_A_A_MATHEMATICS'] - process_math(row)
state_avg.head()
STATE | YEAR | G04_A_A_READING | G04_A_A_MATHEMATICS | READING_GROWTH | MATHEMATICS_GROWTH | |
---|---|---|---|---|---|---|
867 | ALABAMA | 2009 | 216.0 | 228.0 | 0.0 | 0.0 |
868 | ALASKA | 2009 | 211.0 | 237.0 | 0.0 | 0.0 |
869 | ARIZONA | 2009 | 210.0 | 230.0 | 0.0 | 0.0 |
870 | ARKANSAS | 2009 | 216.0 | 238.0 | 0.0 | 0.0 |
871 | CALIFORNIA | 2009 | 210.0 | 232.0 | 0.0 | 0.0 |
Since each state counts as a unique independent variable, we can use the pandas method get_dummies to create a dataframe where each state is represented by either 1 or 0, 1 if the data value is in that state and 0 if the data value is not in that state. Then we will drop the Alabama column because if all the other states are 0 we can assume that the data value must be in Alabama.
# get dummies
state_avg = pd.get_dummies(state_avg, columns=['STATE'])
# drop alabama and reading and mathematics averages since we no longer need them
state_avg = state_avg.drop(columns=['STATE_ALABAMA', 'G04_A_A_READING', 'G04_A_A_MATHEMATICS'])
state_avg.head()
YEAR | READING_GROWTH | MATHEMATICS_GROWTH | STATE_ALASKA | STATE_ARIZONA | STATE_ARKANSAS | STATE_CALIFORNIA | STATE_COLORADO | STATE_CONNECTICUT | STATE_DELAWARE | ... | STATE_SOUTH_DAKOTA | STATE_TENNESSEE | STATE_TEXAS | STATE_UTAH | STATE_VERMONT | STATE_VIRGINIA | STATE_WASHINGTON | STATE_WEST_VIRGINIA | STATE_WISCONSIN | STATE_WYOMING | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
867 | 2009 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
868 | 2009 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
869 | 2009 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
870 | 2009 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
871 | 2009 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 53 columns
Now we have to split the dataset into a training and a test set so that we can use the test set to determine how accurate our predictor is. We are going to predict the years 2017, 2018, and 2019, so we will use those rows as our test data and the rest as our training data. Note that not all years between 2009 and 2019 have recorded average reading and math scores for every state. Therefore, when splitting our data, we will keep only the rows with actual numeric values, not NaN values, for both our training and test sets.
# drop the NaN rows
train_data = state_avg[state_avg['YEAR'] < 2017].dropna()
test_data = state_avg[state_avg['YEAR'] >= 2017].dropna()
We can use LinearRegression from sklearn to make a regression model that will predict the expected increase or decrease in average scores for math and reading. We will make one model for reading and one for mathematics. But first, we have to separate our independent variables (state and year) from our dependent (growth) variable.
from sklearn.linear_model import LinearRegression
X_reading = [] # independent values for reading
y_reading = [] # dependent values for reading
X_math = [] # independent values for math
y_math = [] # dependent values for math
# iterate through each row and add the year and state to the X variables and the growths to the y variables
for i, row in train_data.iterrows():
    add = row[3:].tolist()
    add.insert(0, row['YEAR'])
    X_reading.append(add)
    y_reading.append(row['READING_GROWTH'])
    X_math.append(add)
    y_math.append(row['MATHEMATICS_GROWTH'])
We fit the Linear Regression models on the X (state and year) and y (growth increase/decrease) variables to create prediction models. Then, we add the predicted growths for our test dataset as separate columns in the test_data DataFrame so we can easily compare the values.
# create reading regression and math regression
reading_regr = LinearRegression().fit(X_reading, y_reading)
mathematics_regr = LinearRegression().fit(X_math, y_math)
X_test_reading = []
X_test_math = []
# accumulate X values for reading and math
for i, row in test_data.iterrows():
    add = row[3:].tolist()
    add.insert(0, row['YEAR'])
    X_test_reading.append(add)
    X_test_math.append(add)
# predict based of X values
test_data['PREDICT_READING'] = reading_regr.predict(X_test_reading)
test_data['PREDICT_MATH'] = mathematics_regr.predict(X_test_math)
test_data.head()
YEAR | READING_GROWTH | MATHEMATICS_GROWTH | STATE_ALASKA | STATE_ARIZONA | STATE_ARKANSAS | STATE_CALIFORNIA | STATE_COLORADO | STATE_CONNECTICUT | STATE_DELAWARE | ... | STATE_TEXAS | STATE_UTAH | STATE_VERMONT | STATE_VIRGINIA | STATE_WASHINGTON | STATE_WEST_VIRGINIA | STATE_WISCONSIN | STATE_WYOMING | PREDICT_READING | PREDICT_MATH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1281 | 2017 | 0.0 | 4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.642157 | 3.666667 |
1288 | 2017 | -4.0 | -7.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.892157 | 0.166667 |
1295 | 2017 | 5.0 | 4.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.142157 | 6.666667 |
1302 | 2017 | 0.0 | -4.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.142157 | 0.666667 |
1309 | 2017 | 5.0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.392157 | 1.916667 |
5 rows × 55 columns
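As an optional follow-up (not part of the original comparison, just a sketch using the columns created above), we could summarize how far off the predictions are with a mean absolute error over the held-out years:
# average absolute gap between actual and predicted growth on the test years
read_mae = (test_data['READING_GROWTH'] - test_data['PREDICT_READING']).abs().mean()
math_mae = (test_data['MATHEMATICS_GROWTH'] - test_data['PREDICT_MATH']).abs().mean()
print("Reading MAE: {:.2f} points, Math MAE: {:.2f} points".format(read_mae, math_mae))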
To analyze whether being in a particular state affects the growth of reading and math test scores, we can use statsmodels to see the p-value of each coefficient (the states and the year) we pass into the model.
# create statsmodel for reading data
p_reading = sm.OLS(train_data['READING_GROWTH'].tolist(), sm.add_constant(X_reading)).fit()
p_reading.summary()
Dep. Variable: | y | R-squared: | 0.501 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.334 |
Method: | Least Squares | F-statistic: | 2.997 |
Date: | Mon, 16 May 2022 | Prob (F-statistic): | 1.12e-07 |
Time: | 15:54:25 | Log-Likelihood: | -381.05 |
No. Observations: | 204 | AIC: | 866.1 |
Df Residuals: | 152 | BIC: | 1039. |
Df Model: | 51 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | -658.8039 | 114.346 | -5.762 | 0.000 | -884.716 | -432.892 |
x1 | 0.3284 | 0.057 | 5.779 | 0.000 | 0.216 | 0.441 |
x2 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
x3 | 0.5000 | 1.283 | 0.390 | 0.697 | -2.036 | 3.036 |
x4 | -0.5000 | 1.283 | -0.390 | 0.697 | -3.036 | 2.036 |
x5 | -0.2500 | 1.283 | -0.195 | 0.846 | -2.786 | 2.286 |
x6 | -3.0000 | 1.283 | -2.338 | 0.021 | -5.536 | -0.464 |
x7 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
x8 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
x9 | 1.2500 | 1.283 | 0.974 | 0.332 | -1.286 | 3.786 |
x10 | -1.7500 | 1.283 | -1.364 | 0.175 | -4.286 | 0.786 |
x11 | 0.7500 | 1.283 | 0.584 | 0.560 | -1.786 | 3.286 |
x12 | 0.7500 | 1.283 | 0.584 | 0.560 | -1.786 | 3.286 |
x13 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
x14 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
x15 | -1.0000 | 1.283 | -0.779 | 0.437 | -3.536 | 1.536 |
x16 | -0.5000 | 1.283 | -0.390 | 0.697 | -3.036 | 2.036 |
x17 | -3.0000 | 1.283 | -2.338 | 0.021 | -5.536 | -0.464 |
x18 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
x19 | 1.7500 | 1.283 | 1.364 | 0.175 | -0.786 | 4.286 |
x20 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
x21 | 4.125e-12 | 1.283 | 3.21e-12 | 1.000 | -2.536 | 2.536 |
x22 | -1.5000 | 1.283 | -1.169 | 0.244 | -4.036 | 1.036 |
x23 | -2.5000 | 1.283 | -1.948 | 0.053 | -5.036 | 0.036 |
x24 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
x25 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
x26 | -3.7500 | 1.283 | -2.922 | 0.004 | -6.286 | -1.214 |
x27 | -2.5000 | 1.283 | -1.948 | 0.053 | -5.036 | 0.036 |
x28 | -1.0000 | 1.283 | -0.779 | 0.437 | -3.536 | 1.536 |
x29 | 4.121e-12 | 1.283 | 3.21e-12 | 1.000 | -2.536 | 2.536 |
x30 | -0.2500 | 1.283 | -0.195 | 0.846 | -2.786 | 2.286 |
x31 | -1.5000 | 1.283 | -1.169 | 0.244 | -4.036 | 1.036 |
x32 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
x33 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
x34 | 1.0000 | 1.283 | 0.779 | 0.437 | -1.536 | 3.536 |
x35 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
x36 | -2.5000 | 1.283 | -1.948 | 0.053 | -5.036 | 0.036 |
x37 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
x38 | -1.7500 | 1.283 | -1.364 | 0.175 | -4.286 | 0.786 |
x39 | 4.125e-12 | 1.283 | 3.21e-12 | 1.000 | -2.536 | 2.536 |
x40 | -1.7500 | 1.283 | -1.364 | 0.175 | -4.286 | 0.786 |
x41 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
x42 | -4.0000 | 1.283 | -3.117 | 0.002 | -6.536 | -1.464 |
x43 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
x44 | -3.0000 | 1.283 | -2.338 | 0.021 | -5.536 | -0.464 |
x45 | 1.0000 | 1.283 | 0.779 | 0.437 | -1.536 | 3.536 |
x46 | -2.5000 | 1.283 | -1.948 | 0.053 | -5.036 | 0.036 |
x47 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
x48 | 0.2500 | 1.283 | 0.195 | 0.846 | -2.286 | 2.786 |
x49 | -2.0000 | 1.283 | -1.558 | 0.121 | -4.536 | 0.536 |
x50 | -0.7500 | 1.283 | -0.584 | 0.560 | -3.286 | 1.786 |
x51 | 0.2500 | 1.283 | 0.195 | 0.846 | -2.286 | 2.786 |
Omnibus: | 6.101 | Durbin-Watson: | 1.995 |
---|---|---|---|
Prob(Omnibus): | 0.047 | Jarque-Bera (JB): | 8.918 |
Skew: | 0.113 | Prob(JB): | 0.0116 |
Kurtosis: | 3.999 | Cond. No. | 1.81e+06 |
We can do the same for the math data.
# create statsmodel for math data
p_math = sm.OLS(train_data['MATHEMATICS_GROWTH'].tolist(), sm.add_constant(X_math)).fit()
p_math.summary()
Dep. Variable: | y | R-squared: | 0.511 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.347 |
Method: | Least Squares | F-statistic: | 3.114 |
Date: | Mon, 16 May 2022 | Prob (F-statistic): | 4.02e-08 |
Time: | 15:54:25 | Log-Likelihood: | -403.42 |
No. Observations: | 204 | AIC: | 910.8 |
Df Residuals: | 152 | BIC: | 1083. |
Df Model: | 51 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | -366.1167 | 127.593 | -2.869 | 0.005 | -618.201 | -114.032 |
x1 | 0.1833 | 0.063 | 2.891 | 0.004 | 0.058 | 0.309 |
x2 | -3.5000 | 1.432 | -2.444 | 0.016 | -6.329 | -0.671 |
x3 | 3.0000 | 1.432 | 2.095 | 0.038 | 0.171 | 5.829 |
x4 | -3.0000 | 1.432 | -2.095 | 0.038 | -5.829 | -0.171 |
x5 | -1.7500 | 1.432 | -1.222 | 0.224 | -4.579 | 1.079 |
x6 | -1.7500 | 1.432 | -1.222 | 0.224 | -4.579 | 1.079 |
x7 | -5.2500 | 1.432 | -3.666 | 0.000 | -8.079 | -2.421 |
x8 | -1.5000 | 1.432 | -1.047 | 0.297 | -4.329 | 1.329 |
x9 | 3.5000 | 1.432 | 2.444 | 0.016 | 0.671 | 6.329 |
x10 | -3.0000 | 1.432 | -2.095 | 0.038 | -5.829 | -0.171 |
x11 | -1.2500 | 1.432 | -0.873 | 0.384 | -4.079 | 1.579 |
x12 | 0.2500 | 1.432 | 0.175 | 0.862 | -2.579 | 3.079 |
x13 | -3.5000 | 1.432 | -2.444 | 0.016 | -6.329 | -0.671 |
x14 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
x15 | 0.2500 | 1.432 | 0.175 | 0.862 | -2.579 | 3.079 |
x16 | -2.0000 | 1.432 | -1.397 | 0.165 | -4.829 | 0.829 |
x17 | -3.2500 | 1.432 | -2.269 | 0.025 | -6.079 | -0.421 |
x18 | -1.0000 | 1.432 | -0.698 | 0.486 | -3.829 | 1.829 |
x19 | -0.5000 | 1.432 | -0.349 | 0.727 | -3.329 | 2.329 |
x20 | -2.7500 | 1.432 | -1.920 | 0.057 | -5.579 | 0.079 |
x21 | -3.0000 | 1.432 | -2.095 | 0.038 | -5.829 | -0.171 |
x22 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
x23 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
x24 | -1.5000 | 1.432 | -1.047 | 0.297 | -4.329 | 1.329 |
x25 | 0.7500 | 1.432 | 0.524 | 0.601 | -2.079 | 3.579 |
x26 | -3.7500 | 1.432 | -2.619 | 0.010 | -6.579 | -0.921 |
x27 | -3.5000 | 1.432 | -2.444 | 0.016 | -6.329 | -0.671 |
x28 | -0.2500 | 1.432 | -0.175 | 0.862 | -3.079 | 2.579 |
x29 | -2.2500 | 1.432 | -1.571 | 0.118 | -5.079 | 0.579 |
x30 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
x31 | -3.0000 | 1.432 | -2.095 | 0.038 | -5.829 | -0.171 |
x32 | -1.0000 | 1.432 | -0.698 | 0.486 | -3.829 | 1.829 |
x33 | -4.7500 | 1.432 | -3.317 | 0.001 | -7.579 | -1.921 |
x34 | -2.2500 | 1.432 | -1.571 | 0.118 | -5.079 | 0.579 |
x35 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
x36 | -2.2500 | 1.432 | -1.571 | 0.118 | -5.079 | 0.579 |
x37 | -1.5000 | 1.432 | -1.047 | 0.297 | -4.329 | 1.329 |
x38 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
x39 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
x40 | -1.7500 | 1.432 | -1.222 | 0.224 | -4.579 | 1.079 |
x41 | -2.0000 | 1.432 | -1.397 | 0.165 | -4.829 | 0.829 |
x42 | -3.7500 | 1.432 | -2.619 | 0.010 | -6.579 | -0.921 |
x43 | 1.7500 | 1.432 | 1.222 | 0.224 | -1.079 | 4.579 |
x44 | -1.0000 | 1.432 | -0.698 | 0.486 | -3.829 | 1.829 |
x45 | -0.5000 | 1.432 | -0.349 | 0.727 | -3.329 | 2.329 |
x46 | -4.2500 | 1.432 | -2.968 | 0.003 | -7.079 | -1.421 |
x47 | -0.5000 | 1.432 | -0.349 | 0.727 | -3.329 | 2.329 |
x48 | -0.7500 | 1.432 | -0.524 | 0.601 | -3.579 | 2.079 |
x49 | -0.7500 | 1.432 | -0.524 | 0.601 | -3.579 | 2.079 |
x50 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
x51 | 0.2500 | 1.432 | 0.175 | 0.862 | -2.579 | 3.079 |
Omnibus: | 7.342 | Durbin-Watson: | 1.715 |
---|---|---|---|
Prob(Omnibus): | 0.025 | Jarque-Bera (JB): | 8.179 |
Skew: | -0.316 | Prob(JB): | 0.0167 |
Kurtosis: | 3.750 | Cond. No. | 1.81e+06 |
Looking at the statsmodels summary, we can analyze which coefficients had a notable effect on the predicted value. We can list them by looking at the P>|t| column; if a coefficient has a value less than our chosen bound of 0.05, we consider it significant.
# accumulate constant names
const_names = train_data.columns[3:].tolist()
const_names.insert(0, 'YEAR')
reading_significant = []
# iterate through p values and if less than 0.05 add const name to reading_significant
for i in range(len(p_reading.pvalues) - 1):
    if p_reading.pvalues[i + 1] < 0.05:
        reading_significant.append(const_names[i])
print("Significant reading test constants: " + str(reading_significant))
print()
math_significant = []
# iterate through p values and if less than 0.05 add const name to math_significant
for i in range(len(p_math.pvalues) - 1):
    if p_math.pvalues[i + 1] < 0.05:
        math_significant.append(const_names[i])
print("Significant math test constants: " + str(math_significant))
Significant reading test constants: ['YEAR', 'STATE_ALASKA', 'STATE_COLORADO', 'STATE_DELAWARE', 'STATE_KANSAS', 'STATE_MISSOURI', 'STATE_NEW_MEXICO', 'STATE_NEW_YORK', 'STATE_NORTH_DAKOTA', 'STATE_SOUTH_DAKOTA', 'STATE_TEXAS'] Significant math test constants: ['YEAR', 'STATE_ALASKA', 'STATE_ARIZONA', 'STATE_ARKANSAS', 'STATE_CONNECTICUT', 'STATE_DISTRICT_OF_COLUMBIA', 'STATE_FLORIDA', 'STATE_IDAHO', 'STATE_KANSAS', 'STATE_MARYLAND', 'STATE_MISSOURI', 'STATE_MONTANA', 'STATE_NEW_JERSEY', 'STATE_NEW_YORK', 'STATE_SOUTH_DAKOTA', 'STATE_VERMONT']
Based on just the p-values, we can see that these states (along with the year) are significant factors in the growth of a child's reading and math test scores.
One application of this model is to identify states that are predicted to perform poorly, so that legislation can be changed and more funding provided to help their students grow. A state is predicted to perform poorly if it has a negative coefficient in the LinearRegression model. If that coefficient is also statistically significant, the state should try to improve its education system.
print("States that have a predicted negative growth in reading:")
for i in range(len(reading_regr.coef_)):
    if(i > 1):
        if reading_regr.coef_[i] < 0 and p_reading.pvalues[i + 1] < 0.05:
            print(const_names[i])
print()
print("States that have a predicted negative growth in math:")
for i in range(len(mathematics_regr.coef_)):
    if(i > 1):
        if mathematics_regr.coef_[i] < 0 and p_math.pvalues[i + 1] < 0.05:
            print(const_names[i])
States that have a predicted negative growth in reading: STATE_COLORADO STATE_DELAWARE STATE_KANSAS STATE_MISSOURI STATE_NEW_MEXICO STATE_NEW_YORK STATE_NORTH_DAKOTA STATE_SOUTH_DAKOTA STATE_TEXAS States that have a predicted negative growth in math: STATE_ARKANSAS STATE_CONNECTICUT STATE_FLORIDA STATE_IDAHO STATE_KANSAS STATE_MARYLAND STATE_MISSOURI STATE_MONTANA STATE_NEW_JERSEY STATE_NEW_YORK STATE_SOUTH_DAKOTA STATE_VERMONT
From these lists, we can tell that a notable number of states have both a negative coefficient and a p-value below 0.05 - and are therefore predicted to have significant decreases in average points for 4th graders on the NAEP test in reading or math. Notably, Kansas, Missouri, New York, and South Dakota appear on both lists, meaning they are predicted to have significant negative growth in both average math and average reading scores for 4th graders.
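To collect these flagged states programmatically instead of reading them off the printout, a sketch that mirrors the loops above (including their index offset) could gather each list into a set and intersect them:
# states flagged as significant with a negative coefficient in each model
neg_reading = {const_names[i] for i in range(len(reading_regr.coef_))
               if i > 1 and reading_regr.coef_[i] < 0 and p_reading.pvalues[i + 1] < 0.05}
neg_math = {const_names[i] for i in range(len(mathematics_regr.coef_))
            if i > 1 and mathematics_regr.coef_[i] < 0 and p_math.pvalues[i + 1] < 0.05}
print("Predicted significant negative growth in both subjects:", sorted(neg_reading & neg_math))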
Another set of predictive models we are creating will predict the change in grade 4 average test scores based on previous years' data for each state, per gender. We will create 4 models: one each for MALE math changes per state, FEMALE math changes per state, MALE reading changes per state, and FEMALE reading changes per state. First, we are going to remove all the columns except for state, year, and the gender-specific average math and reading test scores.
# get gender-based data
gender_avg = school_data[['STATE', 'YEAR', 'G04_A_M_READING', 'G04_A_M_MATHEMATICS', 'G04_A_F_READING', 'G04_A_F_MATHEMATICS']]
gender_avg = gender_avg[gender_avg.STATE != "DODEA"] # dropping rows that are not relevant states
gender_avg = gender_avg[gender_avg.STATE != "NATIONAL"] # dropping rows that are not relevant states
# display the first few results
gender_avg.head()
STATE | YEAR | G04_A_M_READING | G04_A_M_MATHEMATICS | G04_A_F_READING | G04_A_F_MATHEMATICS | |
---|---|---|---|---|---|---|
867 | ALABAMA | 2009 | 212.0 | 228.0 | 221.0 | 228.0 |
868 | ALASKA | 2009 | 207.0 | 238.0 | 216.0 | 236.0 |
869 | ARIZONA | 2009 | 207.0 | 230.0 | 213.0 | 230.0 |
870 | ARKANSAS | 2009 | 211.0 | 239.0 | 222.0 | 236.0 |
871 | CALIFORNIA | 2009 | 207.0 | 233.0 | 213.0 | 231.0 |
To create a metric for how the test scores have improved, we will, similar to before, subtract each state's 2009 FEMALE and MALE reading and math averages from the corresponding averages in another year, defining how much the average math and reading test scores have changed since 2009. We are storing this metric in these new columns: "READING_GROWTH_M", "READING_GROWTH_F", "MATHEMATICS_GROWTH_M", and "MATHEMATICS_GROWTH_F".
First, we will create these columns and use numpy to set their values to empty (np.NaN). Typically, NaN denotes missing values, but we can also use it when we don't know the values up front.
# define columns in dataframe and set to empty
gender_avg['MATHEMATICS_GROWTH_M'] = np.NaN
gender_avg['MATHEMATICS_GROWTH_F'] = np.NaN
gender_avg['READING_GROWTH_M'] = np.NaN
gender_avg['READING_GROWTH_F'] = np.NaN
# display the first few results
gender_avg.head()
STATE | YEAR | G04_A_M_READING | G04_A_M_MATHEMATICS | G04_A_F_READING | G04_A_F_MATHEMATICS | MATHEMATICS_GROWTH_M | MATHEMATICS_GROWTH_F | READING_GROWTH_M | READING_GROWTH_F | |
---|---|---|---|---|---|---|---|---|---|---|
867 | ALABAMA | 2009 | 212.0 | 228.0 | 221.0 | 228.0 | NaN | NaN | NaN | NaN |
868 | ALASKA | 2009 | 207.0 | 238.0 | 216.0 | 236.0 | NaN | NaN | NaN | NaN |
869 | ARIZONA | 2009 | 207.0 | 230.0 | 213.0 | 230.0 | NaN | NaN | NaN | NaN |
870 | ARKANSAS | 2009 | 211.0 | 239.0 | 222.0 | 236.0 | NaN | NaN | NaN | NaN |
871 | CALIFORNIA | 2009 | 207.0 | 233.0 | 213.0 | 231.0 | NaN | NaN | NaN | NaN |
Next, we define a get_growth() function that calculates the growth from 2009 to the row's year for the specified subject and gender (FEMALE or MALE). We will use this function to assign a growth value for each combination of state, year, and gender.
# method to get growth
def get_growth(row, subject, gender):
    # get the 2009 average score data for this row's state
    state = row['STATE']
    new = gender_avg.loc[gender_avg['STATE'] == state]
    new = new.loc[new['YEAR'] == 2009]
    # get the data based on subject and gender and subtract the 2009 average score from it
    if subject == 'MATH':
        if gender == 'M':
            return row['G04_A_M_MATHEMATICS'] - new['G04_A_M_MATHEMATICS'].iloc[0]
        else:
            return row['G04_A_F_MATHEMATICS'] - new['G04_A_F_MATHEMATICS'].iloc[0]
    else:
        if gender == 'M':
            return row['G04_A_M_READING'] - new['G04_A_M_READING'].iloc[0]
        else:
            return row['G04_A_F_READING'] - new['G04_A_F_READING'].iloc[0]
# iterate through each row in our dataframe and get the growth based off of subject and gender
for i, row in gender_avg.iterrows():
    gender_avg.at[i, 'READING_GROWTH_M'] = get_growth(row, 'READ', 'M')
    gender_avg.at[i, 'READING_GROWTH_F'] = get_growth(row, 'READ', 'F')
    gender_avg.at[i, 'MATHEMATICS_GROWTH_M'] = get_growth(row, 'MATH', 'M')
    gender_avg.at[i, 'MATHEMATICS_GROWTH_F'] = get_growth(row, 'MATH', 'F')
gender_avg.head()
STATE | YEAR | G04_A_M_READING | G04_A_M_MATHEMATICS | G04_A_F_READING | G04_A_F_MATHEMATICS | MATHEMATICS_GROWTH_M | MATHEMATICS_GROWTH_F | READING_GROWTH_M | READING_GROWTH_F | |
---|---|---|---|---|---|---|---|---|---|---|
867 | ALABAMA | 2009 | 212.0 | 228.0 | 221.0 | 228.0 | 0.0 | 0.0 | 0.0 | 0.0 |
868 | ALASKA | 2009 | 207.0 | 238.0 | 216.0 | 236.0 | 0.0 | 0.0 | 0.0 | 0.0 |
869 | ARIZONA | 2009 | 207.0 | 230.0 | 213.0 | 230.0 | 0.0 | 0.0 | 0.0 | 0.0 |
870 | ARKANSAS | 2009 | 211.0 | 239.0 | 222.0 | 236.0 | 0.0 | 0.0 | 0.0 | 0.0 |
871 | CALIFORNIA | 2009 | 207.0 | 233.0 | 213.0 | 231.0 | 0.0 | 0.0 | 0.0 | 0.0 |
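As an aside, the same growth columns could also be computed without a row-by-row loop. The sketch below (not what we ran above; baseline and merged are just illustrative names, and the merge produces a fresh frame with a reset index) merges each state's 2009 row back onto the full table and subtracts:
# keep only the 2009 rows as the per-state baseline and tag their score columns with a suffix
score_cols = ['G04_A_M_MATHEMATICS', 'G04_A_F_MATHEMATICS', 'G04_A_M_READING', 'G04_A_F_READING']
baseline = gender_avg.loc[gender_avg['YEAR'] == 2009, ['STATE'] + score_cols]
baseline = baseline.rename(columns={c: c + '_2009' for c in score_cols})
# attach the 2009 baseline to every row of the same state, then subtract to get the growth
merged = gender_avg.merge(baseline, on='STATE', how='left')
merged['MATHEMATICS_GROWTH_M'] = merged['G04_A_M_MATHEMATICS'] - merged['G04_A_M_MATHEMATICS_2009']
merged['MATHEMATICS_GROWTH_F'] = merged['G04_A_F_MATHEMATICS'] - merged['G04_A_F_MATHEMATICS_2009']
merged['READING_GROWTH_M'] = merged['G04_A_M_READING'] - merged['G04_A_M_READING_2009']
merged['READING_GROWTH_F'] = merged['G04_A_F_READING'] - merged['G04_A_F_READING_2009']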
We will be creating separate models for STATE+FEMALE math growth, STATE+MALE math growth, STATE+FEMALE reading growth, and STATE+MALE reading growth. Because of this, gender will not be an independent variable; only the state will be, since each model already assumes which gender it is trained on. Since state is a categorical variable, we can use the pandas method get_dummies on the STATE column to create a dataframe where each state gets its own indicator column, which is 1 if the row belongs to that state and 0 if it does not. We will then drop the Alabama column from our dataset, because if all of the other state indicators are 0 we can infer that the row must be an Alabama value.
# get dummies
gender_avg = pd.get_dummies(gender_avg, columns=['STATE'])
# drop alabama and reading and mathematics averages since we no longer need them
gender_avg = gender_avg.drop(columns=['STATE_ALABAMA', 'G04_A_M_READING', 'G04_A_F_READING', 'G04_A_M_MATHEMATICS', 'G04_A_F_MATHEMATICS'])
# display the first few results
gender_avg.head()
YEAR | MATHEMATICS_GROWTH_M | MATHEMATICS_GROWTH_F | READING_GROWTH_M | READING_GROWTH_F | STATE_ALASKA | STATE_ARIZONA | STATE_ARKANSAS | STATE_CALIFORNIA | STATE_COLORADO | ... | STATE_SOUTH_DAKOTA | STATE_TENNESSEE | STATE_TEXAS | STATE_UTAH | STATE_VERMONT | STATE_VIRGINIA | STATE_WASHINGTON | STATE_WEST_VIRGINIA | STATE_WISCONSIN | STATE_WYOMING | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
867 | 2009 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
868 | 2009 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
869 | 2009 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
870 | 2009 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
871 | 2009 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 55 columns
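For reference, pandas can do this reference-category drop for us: get_dummies accepts drop_first=True, which removes the alphabetically first indicator column (here, ALABAMA). A sketch, assuming gender_avg_pre (a hypothetical name) holds the frame as it looked before we called get_dummies above:
# equivalent to get_dummies followed by manually dropping STATE_ALABAMA
gender_avg_alt = pd.get_dummies(gender_avg_pre, columns=['STATE'], drop_first=True)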
When creating predictive models, it is standard practice to split our dataset into a train set and a test set. The test set contains data the model has never seen before, so we can get a more accurate picture of how it performs on real-world data. A common convention is to split datasets into 60% train and 40% test (more info here: how to split a dataset).
Because our models are being trained to predict growth from 2009 to later years, we will split our dataset with a different convention. As before, we will predict on the years 2017, 2018, and 2019, so we will use those rows as our test data and the rest as our train data. We will also drop any rows that have NaN growth values, which correspond to years where a state has no recorded FEMALE and MALE average reading and math scores.
train_data = gender_avg[gender_avg['YEAR'] < 2017].dropna()
# .copy() so that adding prediction columns to test_data later does not raise a SettingWithCopyWarning
test_data = gender_avg[gender_avg['YEAR'] >= 2017].dropna().copy()
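For comparison, if we did want the conventional random split described above instead of a year-based split, sklearn provides a one-liner (a sketch, not used in the rest of this tutorial):
from sklearn.model_selection import train_test_split
# randomly hold out 40% of the rows as a test set; random_state makes the split reproducible
rand_train, rand_test = train_test_split(gender_avg.dropna(), test_size=0.4, random_state=42)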
For our regression model, we will use Linear Regression from sklearn (link for more information: sklearn library). We will make one model per subject, per gender for a total of four models.
First, we will define our X (state with year) and y values (growth for assumed gender) by iterating through each row and adding the year and state to the X variable and the growths to the y variables.
from sklearn.linear_model import LinearRegression
# create X and y lists by respective subject and gender
X_reading_M = []
X_reading_F = []
y_reading_M = []
y_reading_F = []
X_math_M = []
X_math_F = []
y_math_M = []
y_math_F = []
# iterate through each row and update X and y
for i, row in train_data.iterrows():
    # X holds the year followed by every column from index 3 onward
    # (the reading growth columns and the state indicator columns)
    X = row[3:].tolist()
    X.insert(0, row['YEAR'])
    # for each y (growth for the assumed gender), append its respective value
    X_reading_M.append(X)
    y_reading_M.append(row['READING_GROWTH_M'])
    X_reading_F.append(X)
    y_reading_F.append(row['READING_GROWTH_F'])
    X_math_M.append(X)
    y_math_M.append(row['MATHEMATICS_GROWTH_M'])
    X_math_F.append(X)
    y_math_F.append(row['MATHEMATICS_GROWTH_F'])
Using the LinearRegression model, we will fit each model on the X and y variables we defined above to create a prediction model. Then, we will test our models using the test dataset we defined above.
# create models based on subject and assumed gender and then fit the data
reading_M = LinearRegression().fit(X_reading_M, y_reading_M)
reading_F = LinearRegression().fit(X_reading_F, y_reading_F)
mathematics_M = LinearRegression().fit(X_math_M, y_math_M)
mathematics_F = LinearRegression().fit(X_math_F, y_math_F)
# create X values for state and subject
X_test_reading_M = []
X_test_reading_F = []
X_test_math_M = []
X_test_math_F = []
# accumulate X values for reading and math, built the same way as the training features
for i, row in test_data.iterrows():
    X = row[3:].tolist()
    X.insert(0, row['YEAR'])
    X_test_reading_M.append(X)
    X_test_reading_F.append(X)
    X_test_math_M.append(X)
    X_test_math_F.append(X)
# predict based on the X values
test_data['PREDICT_READING_M'] = reading_M.predict(X_test_reading_M)
test_data['PREDICT_READING_F'] = reading_F.predict(X_test_reading_F)
test_data['PREDICT_MATH_M'] = mathematics_M.predict(X_test_math_M)
test_data['PREDICT_MATH_F'] = mathematics_F.predict(X_test_math_F)
test_data.head()
YEAR | MATHEMATICS_GROWTH_M | MATHEMATICS_GROWTH_F | READING_GROWTH_M | READING_GROWTH_F | STATE_ALASKA | STATE_ARIZONA | STATE_ARKANSAS | STATE_CALIFORNIA | STATE_COLORADO | ... | STATE_VERMONT | STATE_VIRGINIA | STATE_WASHINGTON | STATE_WEST_VIRGINIA | STATE_WISCONSIN | STATE_WYOMING | PREDICT_READING_M | PREDICT_READING_F | PREDICT_MATH_M | PREDICT_MATH_F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1281 | 2017 | 4.0 | 4.0 | 1.0 | -1.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | -1.0 | 1.349624 | 1.955009 |
1288 | 2017 | -8.0 | -6.0 | -4.0 | -5.0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -4.0 | -5.0 | -2.933476 | -2.043989 |
1295 | 2017 | 6.0 | 3.0 | 6.0 | 5.0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 6.0 | 5.0 | 7.960281 | 6.425050 |
1302 | 2017 | -4.0 | -3.0 | 1.0 | -1.0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | -1.0 | -2.173749 | 0.917987 |
1309 | 2017 | 0.0 | 0.0 | 4.0 | 7.0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 4.0 | 7.0 | 3.015859 | 3.018895 |
5 rows × 59 columns
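Before digging into the p-values, it is worth quantifying how close these predictions are to the actual 2017-2019 growth values. A quick sketch using sklearn's built-in metrics (not part of our original analysis):
from sklearn.metrics import mean_absolute_error, r2_score
# compare each model's predictions against the observed growth in the held-out years
pairs = [('READING_GROWTH_M', 'PREDICT_READING_M'),
         ('READING_GROWTH_F', 'PREDICT_READING_F'),
         ('MATHEMATICS_GROWTH_M', 'PREDICT_MATH_M'),
         ('MATHEMATICS_GROWTH_F', 'PREDICT_MATH_F')]
for actual, predicted in pairs:
    mae = mean_absolute_error(test_data[actual], test_data[predicted])
    r2 = r2_score(test_data[actual], test_data[predicted])
    print(actual + ": MAE = " + str(round(mae, 2)) + " points, R^2 = " + str(round(r2, 2)))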
To analyze whether being in a particular state affects the growth of reading and math test scores for the MALE and FEMALE groups, we can use statsmodels to look at the p-value of each coefficient in our fitted models.
import statsmodels.api as sm
# create statsmodel for reading data male
p_reading_M = sm.OLS(train_data['READING_GROWTH_M'].tolist(), sm.add_constant(X_reading_M)).fit()
p_reading_M.summary()
Dep. Variable: | y | R-squared: | 1.000 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 1.000 |
Method: | Least Squares | F-statistic: | 2.040e+27 |
Date: | Mon, 16 May 2022 | Prob (F-statistic): | 0.00 |
Time: | 15:54:27 | Log-Likelihood: | 5840.7 |
No. Observations: | 204 | AIC: | -1.157e+04 |
Df Residuals: | 150 | BIC: | -1.139e+04 |
Df Model: | 53 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 1.08e-12 | 7.26e-12 | 0.149 | 0.882 | -1.33e-11 | 1.54e-11 |
x1 | -4.927e-16 | 3.61e-15 | -0.136 | 0.892 | -7.63e-15 | 6.64e-15 |
x2 | 1.0000 | 5.31e-15 | 1.88e+14 | 0.000 | 1.000 | 1.000 |
x3 | -6.87e-16 | 5.48e-15 | -0.125 | 0.900 | -1.15e-14 | 1.01e-14 |
x4 | -2.526e-15 | 7.47e-14 | -0.034 | 0.973 | -1.5e-13 | 1.45e-13 |
x5 | -5.052e-15 | 7.46e-14 | -0.068 | 0.946 | -1.52e-13 | 1.42e-13 |
x6 | 1.887e-15 | 7.37e-14 | 0.026 | 0.980 | -1.44e-13 | 1.47e-13 |
x7 | 1.665e-15 | 7.37e-14 | 0.023 | 0.982 | -1.44e-13 | 1.47e-13 |
x8 | -2.22e-16 | 7.48e-14 | -0.003 | 0.998 | -1.48e-13 | 1.48e-13 |
x9 | 1.665e-16 | 7.42e-14 | 0.002 | 0.998 | -1.46e-13 | 1.47e-13 |
x10 | -7.216e-16 | 7.43e-14 | -0.010 | 0.992 | -1.48e-13 | 1.46e-13 |
x11 | 1.332e-15 | 7.77e-14 | 0.017 | 0.986 | -1.52e-13 | 1.55e-13 |
x12 | -9.714e-16 | 7.45e-14 | -0.013 | 0.990 | -1.48e-13 | 1.46e-13 |
x13 | 4.441e-16 | 7.42e-14 | 0.006 | 0.995 | -1.46e-13 | 1.47e-13 |
x14 | 1.11e-15 | 7.36e-14 | 0.015 | 0.988 | -1.44e-13 | 1.47e-13 |
x15 | -3.608e-16 | 7.44e-14 | -0.005 | 0.996 | -1.47e-13 | 1.47e-13 |
x16 | 0 | 7.38e-14 | 0 | 1.000 | -1.46e-13 | 1.46e-13 |
x17 | 6.106e-16 | 7.34e-14 | 0.008 | 0.993 | -1.44e-13 | 1.46e-13 |
x18 | 5.551e-16 | 7.37e-14 | 0.008 | 0.994 | -1.45e-13 | 1.46e-13 |
x19 | -3.331e-16 | 7.75e-14 | -0.004 | 0.997 | -1.54e-13 | 1.53e-13 |
x20 | -5.551e-16 | 7.37e-14 | -0.008 | 0.994 | -1.46e-13 | 1.45e-13 |
x21 | 4.58e-16 | 7.46e-14 | 0.006 | 0.995 | -1.47e-13 | 1.48e-13 |
x22 | -1.416e-15 | 7.42e-14 | -0.019 | 0.985 | -1.48e-13 | 1.45e-13 |
x23 | -1.11e-15 | 7.51e-14 | -0.015 | 0.988 | -1.5e-13 | 1.47e-13 |
x24 | -2.22e-16 | 7.4e-14 | -0.003 | 0.998 | -1.46e-13 | 1.46e-13 |
x25 | 0 | 7.4e-14 | 0 | 1.000 | -1.46e-13 | 1.46e-13 |
x26 | 9.992e-16 | 7.46e-14 | 0.013 | 0.989 | -1.46e-13 | 1.48e-13 |
x27 | -9.298e-16 | 7.47e-14 | -0.012 | 0.990 | -1.49e-13 | 1.47e-13 |
x28 | -1.998e-15 | 7.52e-14 | -0.027 | 0.979 | -1.51e-13 | 1.47e-13 |
x29 | 5.412e-16 | 7.48e-14 | 0.007 | 0.994 | -1.47e-13 | 1.48e-13 |
x30 | -9.437e-16 | 7.44e-14 | -0.013 | 0.990 | -1.48e-13 | 1.46e-13 |
x31 | 7.91e-16 | 7.37e-14 | 0.011 | 0.991 | -1.45e-13 | 1.46e-13 |
x32 | 9.021e-16 | 7.46e-14 | 0.012 | 0.990 | -1.47e-13 | 1.48e-13 |
x33 | 8.327e-17 | 7.43e-14 | 0.001 | 0.999 | -1.47e-13 | 1.47e-13 |
x34 | -3.886e-16 | 7.43e-14 | -0.005 | 0.996 | -1.47e-13 | 1.46e-13 |
x35 | -1.776e-15 | 7.48e-14 | -0.024 | 0.981 | -1.5e-13 | 1.46e-13 |
x36 | 1.554e-15 | 7.36e-14 | 0.021 | 0.983 | -1.44e-13 | 1.47e-13 |
x37 | 1.221e-15 | 7.45e-14 | 0.016 | 0.987 | -1.46e-13 | 1.48e-13 |
x38 | -4.996e-16 | 7.44e-14 | -0.007 | 0.995 | -1.47e-13 | 1.46e-13 |
x39 | 1.166e-15 | 7.41e-14 | 0.016 | 0.987 | -1.45e-13 | 1.48e-13 |
x40 | 1.665e-16 | 7.39e-14 | 0.002 | 0.998 | -1.46e-13 | 1.46e-13 |
x41 | -3.886e-16 | 7.67e-14 | -0.005 | 0.996 | -1.52e-13 | 1.51e-13 |
x42 | 0 | 7.36e-14 | 0 | 1.000 | -1.46e-13 | 1.46e-13 |
x43 | 2.776e-17 | 7.68e-14 | 0.000 | 1.000 | -1.52e-13 | 1.52e-13 |
x44 | -1.166e-15 | 7.66e-14 | -0.015 | 0.988 | -1.53e-13 | 1.5e-13 |
x45 | -1.11e-16 | 7.52e-14 | -0.001 | 0.999 | -1.49e-13 | 1.49e-13 |
x46 | 1.178e-15 | 7.46e-14 | 0.016 | 0.987 | -1.46e-13 | 1.48e-13 |
x47 | 8.882e-16 | 7.5e-14 | 0.012 | 0.991 | -1.47e-13 | 1.49e-13 |
x48 | -2.581e-15 | 7.61e-14 | -0.034 | 0.973 | -1.53e-13 | 1.48e-13 |
x49 | -2.914e-16 | 7.36e-14 | -0.004 | 0.997 | -1.46e-13 | 1.45e-13 |
x50 | 8.327e-16 | 7.42e-14 | 0.011 | 0.991 | -1.46e-13 | 1.48e-13 |
x51 | -6.661e-16 | 7.46e-14 | -0.009 | 0.993 | -1.48e-13 | 1.47e-13 |
x52 | -2.22e-16 | 7.38e-14 | -0.003 | 0.998 | -1.46e-13 | 1.46e-13 |
x53 | 7.008e-16 | 7.37e-14 | 0.010 | 0.992 | -1.45e-13 | 1.46e-13 |
Omnibus: | 7.634 | Durbin-Watson: | 0.002 |
---|---|---|---|
Prob(Omnibus): | 0.022 | Jarque-Bera (JB): | 7.586 |
Skew: | 0.469 | Prob(JB): | 0.0225 |
Kurtosis: | 3.117 | Cond. No. | 2.01e+06 |
# create statsmodel for reading data female
p_reading_F = sm.OLS(train_data['READING_GROWTH_F'].tolist(), sm.add_constant(X_reading_F)).fit()
p_reading_F.summary()
Dep. Variable: | y | R-squared: | 1.000 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 1.000 |
Method: | Least Squares | F-statistic: | 6.839e+27 |
Date: | Mon, 16 May 2022 | Prob (F-statistic): | 0.00 |
Time: | 15:54:27 | Log-Likelihood: | 5952.7 |
No. Observations: | 204 | AIC: | -1.180e+04 |
Df Residuals: | 150 | BIC: | -1.162e+04 |
Df Model: | 53 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 1.734e-12 | 4.2e-12 | 0.413 | 0.680 | -6.56e-12 | 1e-11 |
x1 | -8.882e-16 | 2.09e-15 | -0.426 | 0.671 | -5.01e-15 | 3.23e-15 |
x2 | 5.135e-16 | 3.07e-15 | 0.167 | 0.867 | -5.55e-15 | 6.58e-15 |
x3 | 1.0000 | 3.16e-15 | 3.16e+14 | 0.000 | 1.000 | 1.000 |
x4 | 2.054e-15 | 4.31e-14 | 0.048 | 0.962 | -8.32e-14 | 8.73e-14 |
x5 | -7.772e-16 | 4.31e-14 | -0.018 | 0.986 | -8.59e-14 | 8.44e-14 |
x6 | -2.776e-16 | 4.26e-14 | -0.007 | 0.995 | -8.44e-14 | 8.38e-14 |
x7 | 1.499e-15 | 4.26e-14 | 0.035 | 0.972 | -8.26e-14 | 8.56e-14 |
x8 | 9.992e-16 | 4.32e-14 | 0.023 | 0.982 | -8.44e-14 | 8.64e-14 |
x9 | 2.359e-15 | 4.28e-14 | 0.055 | 0.956 | -8.23e-14 | 8.7e-14 |
x10 | 1.776e-15 | 4.29e-14 | 0.041 | 0.967 | -8.3e-14 | 8.66e-14 |
x11 | 4.441e-16 | 4.49e-14 | 0.010 | 0.992 | -8.83e-14 | 8.92e-14 |
x12 | -2.22e-16 | 4.3e-14 | -0.005 | 0.996 | -8.52e-14 | 8.48e-14 |
x13 | -1.221e-15 | 4.29e-14 | -0.028 | 0.977 | -8.59e-14 | 8.35e-14 |
x14 | -7.772e-16 | 4.25e-14 | -0.018 | 0.985 | -8.48e-14 | 8.32e-14 |
x15 | 1.554e-15 | 4.3e-14 | 0.036 | 0.971 | -8.33e-14 | 8.64e-14 |
x16 | 9.714e-16 | 4.26e-14 | 0.023 | 0.982 | -8.33e-14 | 8.52e-14 |
x17 | 7.355e-16 | 4.24e-14 | 0.017 | 0.986 | -8.31e-14 | 8.45e-14 |
x18 | 9.992e-16 | 4.26e-14 | 0.023 | 0.981 | -8.31e-14 | 8.51e-14 |
x19 | 8.882e-16 | 4.48e-14 | 0.020 | 0.984 | -8.76e-14 | 8.94e-14 |
x20 | 1.138e-15 | 4.26e-14 | 0.027 | 0.979 | -8.3e-14 | 8.53e-14 |
x21 | -1.665e-15 | 4.31e-14 | -0.039 | 0.969 | -8.68e-14 | 8.35e-14 |
x22 | 0 | 4.28e-14 | 0 | 1.000 | -8.46e-14 | 8.46e-14 |
x23 | -1.11e-15 | 4.34e-14 | -0.026 | 0.980 | -8.68e-14 | 8.46e-14 |
x24 | 0 | 4.27e-14 | 0 | 1.000 | -8.45e-14 | 8.45e-14 |
x25 | 4.025e-16 | 4.28e-14 | 0.009 | 0.993 | -8.41e-14 | 8.49e-14 |
x26 | 1.665e-15 | 4.31e-14 | 0.039 | 0.969 | -8.35e-14 | 8.68e-14 |
x27 | 1.11e-15 | 4.32e-14 | 0.026 | 0.980 | -8.42e-14 | 8.64e-14 |
x28 | 1.554e-15 | 4.35e-14 | 0.036 | 0.972 | -8.43e-14 | 8.74e-14 |
x29 | 1.332e-15 | 4.32e-14 | 0.031 | 0.975 | -8.41e-14 | 8.67e-14 |
x30 | -7.772e-16 | 4.3e-14 | -0.018 | 0.986 | -8.57e-14 | 8.42e-14 |
x31 | 5.412e-16 | 4.26e-14 | 0.013 | 0.990 | -8.36e-14 | 8.47e-14 |
x32 | 1.443e-15 | 4.31e-14 | 0.033 | 0.973 | -8.37e-14 | 8.66e-14 |
x33 | 9.992e-16 | 4.29e-14 | 0.023 | 0.981 | -8.38e-14 | 8.58e-14 |
x34 | 1.61e-15 | 4.29e-14 | 0.038 | 0.970 | -8.31e-14 | 8.64e-14 |
x35 | 1.055e-15 | 4.32e-14 | 0.024 | 0.981 | -8.44e-14 | 8.65e-14 |
x36 | -5.551e-16 | 4.25e-14 | -0.013 | 0.990 | -8.46e-14 | 8.35e-14 |
x37 | 1.069e-15 | 4.3e-14 | 0.025 | 0.980 | -8.4e-14 | 8.61e-14 |
x38 | 4.996e-16 | 4.3e-14 | 0.012 | 0.991 | -8.44e-14 | 8.54e-14 |
x39 | 1.11e-15 | 4.28e-14 | 0.026 | 0.979 | -8.35e-14 | 8.57e-14 |
x40 | 8.882e-16 | 4.27e-14 | 0.021 | 0.983 | -8.34e-14 | 8.52e-14 |
x41 | -1.776e-15 | 4.43e-14 | -0.040 | 0.968 | -8.93e-14 | 8.58e-14 |
x42 | 3.331e-16 | 4.25e-14 | 0.008 | 0.994 | -8.37e-14 | 8.44e-14 |
x43 | 8.882e-16 | 4.44e-14 | 0.020 | 0.984 | -8.68e-14 | 8.86e-14 |
x44 | 1.776e-15 | 4.43e-14 | 0.040 | 0.968 | -8.57e-14 | 8.92e-14 |
x45 | 8.327e-16 | 4.35e-14 | 0.019 | 0.985 | -8.5e-14 | 8.67e-14 |
x46 | 1.887e-15 | 4.31e-14 | 0.044 | 0.965 | -8.32e-14 | 8.7e-14 |
x47 | -1.665e-16 | 4.33e-14 | -0.004 | 0.997 | -8.58e-14 | 8.55e-14 |
x48 | 1.11e-16 | 4.39e-14 | 0.003 | 0.998 | -8.67e-14 | 8.69e-14 |
x49 | 2.359e-16 | 4.25e-14 | 0.006 | 0.996 | -8.38e-14 | 8.42e-14 |
x50 | -1.11e-15 | 4.29e-14 | -0.026 | 0.979 | -8.58e-14 | 8.36e-14 |
x51 | 4.441e-16 | 4.31e-14 | 0.010 | 0.992 | -8.47e-14 | 8.56e-14 |
x52 | 4.857e-17 | 4.26e-14 | 0.001 | 0.999 | -8.41e-14 | 8.42e-14 |
x53 | 4.441e-16 | 4.26e-14 | 0.010 | 0.992 | -8.37e-14 | 8.46e-14 |
Omnibus: | 21.730 | Durbin-Watson: | 0.014 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 48.054 |
Skew: | -0.479 | Prob(JB): | 3.67e-11 |
Kurtosis: | 5.176 | Cond. No. | 2.01e+06 |
Creating the statsmodels model for the MALE math data:
p_math_M = sm.OLS(train_data['MATHEMATICS_GROWTH_M'].tolist(), sm.add_constant(X_math_M)).fit()
p_math_M.summary()
Dep. Variable: | y | R-squared: | 0.670 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.554 |
Method: | Least Squares | F-statistic: | 5.752 |
Date: | Mon, 16 May 2022 | Prob (F-statistic): | 1.56e-17 |
Time: | 15:54:27 | Log-Likelihood: | -376.61 |
No. Observations: | 204 | AIC: | 861.2 |
Df Residuals: | 150 | BIC: | 1040. |
Df Model: | 53 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 117.5397 | 125.080 | 0.940 | 0.349 | -129.607 | 364.686 |
x1 | -0.0576 | 0.062 | -0.926 | 0.356 | -0.180 | 0.065 |
x2 | 0.3180 | 0.091 | 3.476 | 0.001 | 0.137 | 0.499 |
x3 | 0.3177 | 0.094 | 3.369 | 0.001 | 0.131 | 0.504 |
x4 | -1.4225 | 1.285 | -1.107 | 0.270 | -3.962 | 1.117 |
x5 | 3.1146 | 1.284 | 2.425 | 0.016 | 0.577 | 5.652 |
x6 | -3.5234 | 1.268 | -2.778 | 0.006 | -6.030 | -1.017 |
x7 | -1.8293 | 1.268 | -1.442 | 0.151 | -4.336 | 0.677 |
x8 | -0.1490 | 1.288 | -0.116 | 0.908 | -2.695 | 2.397 |
x9 | -3.8198 | 1.277 | -2.991 | 0.003 | -6.343 | -1.296 |
x10 | -1.1491 | 1.279 | -0.898 | 0.371 | -3.677 | 1.379 |
x11 | 2.2971 | 1.339 | 1.716 | 0.088 | -0.348 | 4.942 |
x12 | -2.8757 | 1.282 | -2.243 | 0.026 | -5.409 | -0.342 |
x13 | -2.3738 | 1.278 | -1.858 | 0.065 | -4.899 | 0.151 |
x14 | -0.1356 | 1.267 | -0.107 | 0.915 | -2.640 | 2.369 |
x15 | -1.9904 | 1.280 | -1.555 | 0.122 | -4.520 | 0.540 |
x16 | -2.6261 | 1.271 | -2.067 | 0.040 | -5.137 | -0.115 |
x17 | 0.6589 | 1.264 | 0.521 | 0.603 | -1.839 | 3.157 |
x18 | -0.7734 | 1.268 | -0.610 | 0.543 | -3.280 | 1.733 |
x19 | -1.5014 | 1.335 | -1.125 | 0.263 | -4.139 | 1.137 |
x20 | -0.0465 | 1.270 | -0.037 | 0.971 | -2.556 | 2.463 |
x21 | -1.8506 | 1.285 | -1.440 | 0.152 | -4.389 | 0.688 |
x22 | -2.6491 | 1.277 | -2.075 | 0.040 | -5.172 | -0.126 |
x23 | -2.3291 | 1.293 | -1.801 | 0.074 | -4.884 | 0.226 |
x24 | -1.6936 | 1.274 | -1.329 | 0.186 | -4.211 | 0.824 |
x25 | -1.9787 | 1.274 | -1.553 | 0.123 | -4.497 | 0.539 |
x26 | -1.7051 | 1.285 | -1.327 | 0.186 | -4.243 | 0.833 |
x27 | 1.4538 | 1.286 | 1.130 | 0.260 | -1.088 | 3.995 |
x28 | -1.0252 | 1.295 | -0.792 | 0.430 | -3.584 | 1.534 |
x29 | -3.3990 | 1.288 | -2.638 | 0.009 | -5.945 | -0.853 |
x30 | 0.8181 | 1.282 | 0.638 | 0.524 | -1.714 | 3.351 |
x31 | -2.7382 | 1.269 | -2.158 | 0.033 | -5.245 | -0.231 |
x32 | -2.4202 | 1.285 | -1.884 | 0.062 | -4.959 | 0.118 |
x33 | -1.7168 | 1.279 | -1.343 | 0.181 | -4.243 | 0.810 |
x34 | 0.5097 | 1.279 | 0.399 | 0.691 | -2.017 | 3.036 |
x35 | -3.1723 | 1.289 | -2.462 | 0.015 | -5.719 | -0.626 |
x36 | -2.6356 | 1.267 | -2.079 | 0.039 | -5.140 | -0.131 |
x37 | -1.5813 | 1.283 | -1.233 | 0.220 | -4.116 | 0.953 |
x38 | -1.5579 | 1.281 | -1.216 | 0.226 | -4.089 | 0.973 |
x39 | -1.0230 | 1.276 | -0.802 | 0.424 | -3.545 | 1.499 |
x40 | -2.5464 | 1.272 | -2.001 | 0.047 | -5.060 | -0.032 |
x41 | -3.2262 | 1.321 | -2.442 | 0.016 | -5.836 | -0.616 |
x42 | -1.4554 | 1.268 | -1.148 | 0.253 | -3.961 | 1.050 |
x43 | -0.6487 | 1.323 | -0.491 | 0.624 | -3.262 | 1.964 |
x44 | -1.7071 | 1.319 | -1.294 | 0.198 | -4.314 | 0.900 |
x45 | 2.4655 | 1.295 | 1.903 | 0.059 | -0.094 | 5.025 |
x46 | 0.2481 | 1.284 | 0.193 | 0.847 | -2.288 | 2.785 |
x47 | -1.1353 | 1.292 | -0.879 | 0.381 | -3.688 | 1.417 |
x48 | -3.3077 | 1.310 | -2.525 | 0.013 | -5.896 | -0.720 |
x49 | -0.9320 | 1.267 | -0.735 | 0.463 | -3.436 | 1.572 |
x50 | -0.3292 | 1.278 | -0.258 | 0.797 | -2.855 | 2.196 |
x51 | -0.4551 | 1.285 | -0.354 | 0.724 | -2.993 | 2.083 |
x52 | -2.4436 | 1.270 | -1.924 | 0.056 | -4.953 | 0.066 |
x53 | -0.7150 | 1.269 | -0.563 | 0.574 | -3.223 | 1.793 |
Omnibus: | 8.941 | Durbin-Watson: | 1.507 |
---|---|---|---|
Prob(Omnibus): | 0.011 | Jarque-Bera (JB): | 9.440 |
Skew: | -0.408 | Prob(JB): | 0.00891 |
Kurtosis: | 3.667 | Cond. No. | 2.01e+06 |
Creating the statsmodels model for the FEMALE math data:
p_math_F = sm.OLS(train_data['MATHEMATICS_GROWTH_F'].tolist(), sm.add_constant(X_math_F)).fit()
p_math_F.summary()
Dep. Variable: | y | R-squared: | 0.559 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.403 |
Method: | Least Squares | F-statistic: | 3.585 |
Date: | Mon, 16 May 2022 | Prob (F-statistic): | 5.14e-10 |
Time: | 15:54:27 | Log-Likelihood: | -388.09 |
No. Observations: | 204 | AIC: | 884.2 |
Df Residuals: | 150 | BIC: | 1063. |
Df Model: | 53 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | -57.9757 | 132.324 | -0.438 | 0.662 | -319.436 | 203.484 |
x1 | 0.0298 | 0.066 | 0.453 | 0.651 | -0.100 | 0.160 |
x2 | 0.1847 | 0.097 | 1.909 | 0.058 | -0.007 | 0.376 |
x3 | 0.3334 | 0.100 | 3.343 | 0.001 | 0.136 | 0.531 |
x4 | -1.7418 | 1.360 | -1.281 | 0.202 | -4.429 | 0.945 |
x5 | 1.5459 | 1.359 | 1.138 | 0.257 | -1.139 | 4.231 |
x6 | -1.0370 | 1.342 | -0.773 | 0.441 | -3.688 | 1.614 |
x7 | -2.1577 | 1.342 | -1.608 | 0.110 | -4.809 | 0.494 |
x8 | -1.0663 | 1.363 | -0.782 | 0.435 | -3.760 | 1.627 |
x9 | -3.0842 | 1.351 | -2.283 | 0.024 | -5.754 | -0.415 |
x10 | -0.2419 | 1.353 | -0.179 | 0.858 | -2.916 | 2.432 |
x11 | 1.6009 | 1.416 | 1.131 | 0.260 | -1.197 | 4.399 |
x12 | -1.9549 | 1.356 | -1.441 | 0.152 | -4.635 | 0.725 |
x13 | -2.0926 | 1.352 | -1.548 | 0.124 | -4.764 | 0.579 |
x14 | 0.1947 | 1.341 | 0.145 | 0.885 | -2.455 | 2.844 |
x15 | -2.7136 | 1.355 | -2.003 | 0.047 | -5.390 | -0.037 |
x16 | -1.2318 | 1.344 | -0.916 | 0.361 | -3.888 | 1.424 |
x17 | 0.1295 | 1.337 | 0.097 | 0.923 | -2.513 | 2.772 |
x18 | -1.5370 | 1.342 | -1.145 | 0.254 | -4.188 | 1.114 |
x19 | -1.9098 | 1.412 | -1.352 | 0.178 | -4.701 | 0.881 |
x20 | -0.2228 | 1.343 | -0.166 | 0.868 | -2.877 | 2.432 |
x21 | -1.9812 | 1.359 | -1.458 | 0.147 | -4.667 | 0.704 |
x22 | -0.7047 | 1.351 | -0.522 | 0.603 | -3.374 | 1.964 |
x23 | -3.0565 | 1.368 | -2.234 | 0.027 | -5.760 | -0.353 |
x24 | -2.4268 | 1.348 | -1.800 | 0.074 | -5.090 | 0.237 |
x25 | -1.4637 | 1.348 | -1.086 | 0.279 | -4.127 | 1.200 |
x26 | -0.7883 | 1.359 | -0.580 | 0.563 | -3.474 | 1.897 |
x27 | 1.0913 | 1.361 | 0.802 | 0.424 | -1.597 | 3.780 |
x28 | -1.6865 | 1.370 | -1.231 | 0.220 | -4.394 | 1.021 |
x29 | -2.5663 | 1.363 | -1.883 | 0.062 | -5.260 | 0.127 |
x30 | -0.9269 | 1.356 | -0.684 | 0.495 | -3.606 | 1.752 |
x31 | -2.5373 | 1.342 | -1.890 | 0.061 | -5.190 | 0.115 |
x32 | -2.8898 | 1.359 | -2.126 | 0.035 | -5.576 | -0.204 |
x33 | -1.5382 | 1.353 | -1.137 | 0.257 | -4.211 | 1.135 |
x34 | 0.2120 | 1.353 | 0.157 | 0.876 | -2.461 | 2.885 |
x35 | -2.3533 | 1.363 | -1.726 | 0.086 | -5.047 | 0.340 |
x36 | -3.5553 | 1.341 | -2.652 | 0.009 | -6.205 | -0.906 |
x37 | -1.1957 | 1.357 | -0.881 | 0.380 | -3.877 | 1.486 |
x38 | -0.6587 | 1.355 | -0.486 | 0.628 | -3.336 | 2.019 |
x39 | -1.0102 | 1.350 | -0.748 | 0.456 | -3.678 | 1.658 |
x40 | -0.7972 | 1.346 | -0.592 | 0.555 | -3.457 | 1.862 |
x41 | -2.2233 | 1.398 | -1.591 | 0.114 | -4.985 | 0.538 |
x42 | 0.3977 | 1.342 | 0.296 | 0.767 | -2.253 | 3.048 |
x43 | 0.0349 | 1.399 | 0.025 | 0.980 | -2.730 | 2.799 |
x44 | -1.6134 | 1.396 | -1.156 | 0.250 | -4.371 | 1.144 |
x45 | 2.3412 | 1.370 | 1.708 | 0.090 | -0.367 | 5.049 |
x46 | 0.1377 | 1.358 | 0.101 | 0.919 | -2.546 | 2.821 |
x47 | -1.9913 | 1.367 | -1.457 | 0.147 | -4.692 | 0.709 |
x48 | -3.3074 | 1.386 | -2.387 | 0.018 | -6.045 | -0.569 |
x49 | 0.6847 | 1.341 | 0.511 | 0.610 | -1.965 | 3.334 |
x50 | -1.2321 | 1.352 | -0.911 | 0.364 | -3.904 | 1.440 |
x51 | 0.4617 | 1.359 | 0.340 | 0.735 | -2.224 | 3.147 |
x52 | -1.1396 | 1.344 | -0.848 | 0.398 | -3.795 | 1.515 |
x53 | -0.3887 | 1.343 | -0.290 | 0.773 | -3.041 | 2.264 |
Omnibus: | 0.595 | Durbin-Watson: | 1.221 |
---|---|---|---|
Prob(Omnibus): | 0.743 | Jarque-Bera (JB): | 0.729 |
Skew: | -0.083 | Prob(JB): | 0.695 |
Kurtosis: | 2.758 | Cond. No. | 2.01e+06 |
Okay - we now have our four models, each with its statistical summary. We can interpret these statistics by checking, for each gender-specific model, which states had p-values below our chosen significance level of 0.05, as those indicate that for that FEMALE or MALE classification, the state carries significant weight in the change in growth of the average reading or math scores.
# accumulate the coefficient names, in the same order as the X columns
const_names = train_data.columns[3:].tolist()
const_names.insert(0, 'YEAR')
M_reading_significant = []
F_reading_significant = []
# iterate through the p-values and, if below 0.05, add the coefficient name to the significant list
for i in range(len(p_reading_M.pvalues) - 1):
    if p_reading_M.pvalues[i + 1] < 0.05:
        M_reading_significant.append(const_names[i])
for i in range(len(p_reading_F.pvalues) - 1):
    if p_reading_F.pvalues[i + 1] < 0.05:
        F_reading_significant.append(const_names[i])
print("Significant Male reading test constants: " + str(M_reading_significant))
print()
print("Significant Female reading test constants: " + str(F_reading_significant))
print()
M_math_significant = []
F_math_significant = []
# iterate through the p-values and, if below 0.05, add the coefficient name to the significant list
for i in range(len(p_math_M.pvalues) - 1):
    if p_math_M.pvalues[i + 1] < 0.05:
        M_math_significant.append(const_names[i])
for i in range(len(p_math_F.pvalues) - 1):
    if p_math_F.pvalues[i + 1] < 0.05:
        F_math_significant.append(const_names[i])
print("Significant Male math test constants: " + str(M_math_significant))
print()
print("Significant Female math test constants: " + str(F_math_significant))
Significant Male reading test constants: ['READING_GROWTH_M'] Significant Female reading test constants: ['READING_GROWTH_F'] Significant Male math test constants: ['READING_GROWTH_M', 'READING_GROWTH_F', 'STATE_ARIZONA', 'STATE_ARKANSAS', 'STATE_CONNECTICUT', 'STATE_FLORIDA', 'STATE_ILLINOIS', 'STATE_MAINE', 'STATE_MONTANA', 'STATE_NEVADA', 'STATE_NEW_YORK', 'STATE_NORTH_CAROLINA', 'STATE_OREGON', 'STATE_PENNSYLVANIA', 'STATE_VERMONT'] Significant Female math test constants: ['READING_GROWTH_F', 'STATE_CONNECTICUT', 'STATE_IDAHO', 'STATE_MARYLAND', 'STATE_NEW_HAMPSHIRE', 'STATE_NORTH_CAROLINA', 'STATE_VERMONT']
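Since several states show up in both gendered math lists, we can pull out the overlap directly as a quick cross-check of the printed output above:
# states flagged as significant in both the MALE and FEMALE math models
both_math = sorted(name for name in set(M_math_significant) & set(F_math_significant)
                   if name.startswith('STATE_'))
print(both_math)  # from the output above: STATE_CONNECTICUT, STATE_NORTH_CAROLINA, STATE_VERMONT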
It seems that for reading, state does not have a significant effect for either the FEMALE or MALE group in any state. For the math models, certain states within both the FEMALE and MALE models have a significant impact on math literacy growth in the United States.
As a reminder, we define growth in math literacy as the change in average math scores for 4th graders on the NAEP test from 2009 to some later year, and we define growth in reading literacy analogously. We wanted to explore how state and gender play a role in math and reading literacy education standards, which our growth metric measures for the different subsets of our population defined by state and gender.
Based on our exploratory analysis of the data, we found that both math and reading literacy are affected by the state, and by the gender and state, in which a student is located, which indicates that education inequality has been persisting since 2009, as the growth metric differs significantly depending on the state, or on the assumed gender (FEMALE or MALE) within a state. It is important to note that this dataset also omits people outside of the male/female classification, so it may not accurately represent their gendered experiences in the American education system.
Analyzing our predictive models based on state for math and reading literacy, our results indicate that the state where a student resides plays a role in determining growth in reading and math literacy. We found that students in states such as Arkansas, Indiana, and Colorado are predicted to have significant negative growth in reading and math, based on the p-values for those states being below 0.05 for both math and reading literacy growth. Hence, we can be 95% confident in rejecting our null hypothesis of there being no relationship between state and math/reading literacy growth, and we can conclude that state has played some significant role in differentiating the growth of literacy in these subjects across the US - indicating that education inequality is persisting across the United States, state by state. This may be because these states contain more rural areas, which have less access to the opportunities that more urban states typically have. Performing this type of analysis can be useful when determining curriculums (especially Common Core, which applies nationwide), because there are vast differences in the quality of education administered (see more here). If students in a particular state are less likely to perform well on math and reading tests, state-specific policies could and should be introduced to change the current standards of education to push toward significant positive growth.
Analyzing our predictive models based on state and gender for math/reading literacy, our results indicate that there is no significant state effect on reading literacy growth for either the FEMALE or MALE group in the United States. Since the p-values for the state coefficients were all above the 0.05 significance level for reading, we cannot reject the null hypothesis that state (within each gender) has no relationship with reading literacy growth - indicating that the state and gender interaction may not play a large role in reading differences in education inequality. This does differ slightly from our exploratory data analysis, which suggested otherwise. For our math models, we found that under both the MALE and FEMALE classifications certain states have a significant impact on math literacy growth, which can be in either direction. Since the p-values for multiple states fell below the 0.05 significance level in both gendered math models, we can reject the null hypothesis that state (within each gender) has no relationship with math literacy growth - once again indicating that education inequality in math literacy still exists within gender and state interactions.
We hope that the models we have curated, and the exploratory data analysis we have done, can be used by policymakers when determining which states, and which gendered demographics within states, need more support in developing math and reading curriculums and hiring educators. State, and gender within state, play roles in math and reading literacy that must be addressed to provide equitable education to everyone.