Early childhood is a developmental period in which students learn many tasks and skills, both academic and practical, that will serve them throughout life. Among the skills that begin to develop at a young age is basic literacy in math and reading: most of the math an adult deals with day to day, such as basic algebra with money, is taught by middle school, and reading comprehension is key to navigating much of adult life - understanding forms, learning new information, searching for housing, etc. It is therefore important that all children in this developmental stage have the equitable opportunities they deserve, so that during such a key growth period they all have the tools and education necessary to build long-lasting skills like math and reading comprehension.

However, not all students are given such equal opportunities. The US education system has long been known for varying standards of education, with differences in quality beginning as early as pre-kindergarten, yet little large-scale documentation exists to confirm how wide that variation is. If these differences in education quality exist, and continue to exist over time, it is imperative that they be resolved at an institutional level.

The focus of this tutorial, then, is to examine whether education inequality is reflected in differences on national math and reading examinations, to see whether that inequality is continuing over time, to identify factors such as gender or state that may play significant roles (if they exist), and to use this analysis to predict what education inequality will look like in future years if the current education system is maintained.

Our null hypothesis is that gender and state have no relationship with or impact on math or reading literacy in children at developmental stages. Our alternative hypothesis is that gender and state have some relationship with or impact on math or reading literacy in children at developmental stages. We define math literacy and reading literacy by the mathematics and reading scores earned by students on the NAEP Test - the National Assessment of Educational Progress Test, a standardized national educational assessment that tracks math and reading progress across the states of the US.

The data that we will be examining can be found here: - specifically, the states_all_extended.csv file on the site. The dataset contains compiled US education data, such as the number of students enrolled in 1st through 12th grade for each state, average math scores for gender groups by state and year, and more. We decided to focus specifically on the average math and reading scores that 4th graders earned on the NAEP test, as research suggests that 4th to 5th grade is a key developmental period for children.

In [240]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import statsmodels.api as sm
```

In [241]:

```
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn
```
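As an aside, the standard way to silence warnings is `warnings.filterwarnings`, rather than monkey-patching `warnings.warn`; a minimal sketch:

```python
import warnings

# Suppress all warnings globally; message or category filters can narrow this.
warnings.filterwarnings("ignore")
```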

In [242]:

```
school_data = pd.read_csv("states_all_extended.csv")
# display first few rows
school_data.head()
```

Out[242]:

PRIMARY_KEY | STATE | YEAR | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1992_ALABAMA | ALABAMA | 1992 | NaN | 2678885.0 | 304177.0 | 1659028.0 | 715680.0 | 2653798.0 | 1481703.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

1 | 1992_ALASKA | ALASKA | 1992 | NaN | 1049591.0 | 106780.0 | 720711.0 | 222100.0 | 972488.0 | 498362.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

2 | 1992_ARIZONA | ARIZONA | 1992 | NaN | 3258079.0 | 297888.0 | 1369815.0 | 1590376.0 | 3401580.0 | 1435908.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

3 | 1992_ARKANSAS | ARKANSAS | 1992 | NaN | 1711959.0 | 178571.0 | 958785.0 | 574603.0 | 1743022.0 | 964323.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

4 | 1992_CALIFORNIA | CALIFORNIA | 1992 | NaN | 26260025.0 | 2072470.0 | 16546514.0 | 7641041.0 | 27138832.0 | 14358922.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 266 columns

In [243]:

```
school_data = school_data.drop(columns=['PRIMARY_KEY'])
```

In [244]:

```
# get previous number of rows
prev_rows = len(school_data.index)
school_data = school_data[school_data['YEAR'] >= 2009]
# get current number of rows
curr_rows = len(school_data.index)
print(str(prev_rows - curr_rows) + " rows were dropped.")
school_data.head()
```

1193 rows were dropped.

Out[244]:

STATE | YEAR | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | SUPPORT_SERVICES_EXPENDITURE | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

867 | ALABAMA | 2009 | 745668.0 | 7186390.0 | 728795.0 | 4161103.0 | 2296492.0 | 7815467.0 | 3836398.0 | 2331552.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

868 | ALASKA | 2009 | 130236.0 | 2158970.0 | 312667.0 | 1357747.0 | 488556.0 | 2396412.0 | 1129756.0 | 832783.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

869 | ARIZONA | 2009 | 981303.0 | 8802515.0 | 1044140.0 | 3806064.0 | 3952311.0 | 9580393.0 | 4296503.0 | 2983729.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

870 | ARKANSAS | 2009 | 474423.0 | 4753142.0 | 534510.0 | 3530487.0 | 688145.0 | 5017352.0 | 2417974.0 | 1492691.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

871 | CALIFORNIA | 2009 | 6234155.0 | 73958896.0 | 9745250.0 | 40084244.0 | 24129402.0 | 74766086.0 | 35617964.0 | 21693675.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 265 columns

Since the column names are a little tricky to figure out, here is how to read them:

G## - This signifies which grade this value is talking about; for example G04 is referring to grade 4.

G##_A_A - This refers to all students in that grade, across all races.

G##_x_g - This is read as the number of students of race x and gender g in grade ##; for example, G06_AS_M is the number of Asian male students in grade 6.

G##_x_g_test - This is the average test score for race x and gender g in grade ##; for example, G06_AS_A_MATHEMATICS is the average math score of all Asian students in grade 6.

*A* in place of a gender or race signifies all genders or all races.

The different race codes provided by the dataset are defined as *AM* - American Indian or Alaska Native, *AS* - Asian, *HI* - Hispanic/Latino, *BL* - Black, *WH* - White, *HP* - Hawaiian Native/Pacific Islander, and *TR* - two or more races. (We will not be analyzing individual races in our data analysis, but this information is useful to know for future work.)
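To make the naming convention concrete, here is a small hypothetical helper (not part of the dataset or our analysis) that splits a column name into its components:

```python
def parse_column(col):
    """Split a column like 'G06_AS_M' or 'G04_A_A_READING' into its parts.

    'A' in the race or gender slot means all races or all genders; the test
    component is absent for pure enrollment-count columns.
    """
    parts = col.split("_")
    grade, race, gender = parts[0], parts[1], parts[2]
    test = parts[3] if len(parts) > 3 else None
    return grade, race, gender, test

parse_column("G04_A_A_READING")  # ('G04', 'A', 'A', 'READING')
```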

Our exploratory data analysis will have two parts.

- 1) First, we will look for noticeable trends between states in average mathematics \& reading scores for 4th grade students over the years - indicators of early mathematical and reading literacy skills. To do so, we generate a 'GROWTH' variable that calculates the difference between the average 4th grade mathematics/reading score in 2015 and that in 2009. If there were no education inequality among states, we would expect every state to show the same growth between 2009 and 2015 - whether positive, negative, or stagnant/zero - i.e., the distribution should be the same. To visualize these comparisons, we generate heat maps that compare all 50 US states with one another at once, and analyze state differences within reading and math.

- 2) Second, we will explore whether gender plays a role in mathematics and reading literacy. For this analysis, we also create a 'GROWTH' variable for each female/male and reading/mathematics combination per state, which we then plot as stacked bar plots to compare the effects of state and gender on scores.

In order to create our heat map, we need the grid coordinates of a United States map with state outlines, because *geopandas*, the geospatial data mapping Python library we will be using, requires geometrical coordinates for mapping. We pulled our US states map shapefile and the other necessary mapping files from the US Census, available in this zip file: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_500k.zip , with a description of the mapping files here: https://www.census.gov/programs-surveys/geography/technical-documentation/naming-convention/cartographic-boundary-file.html . We manually unzipped the files and placed them in the same folder as this notebook, since geopandas needs direct access to them - though this step can be automated.

Specifically, we use the read_file function from the geopandas library, which takes the name of the shapefile - essentially a set of geometrical coordinates - and automatically reads the accompanying index and attribute files that come along with it.

In [245]:

```
usa_states = gpd.read_file("cb_2018_us_state_500k.shp")
usa_states.head()
```

Out[245]:

STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | NAME | LSAD | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|---|

0 | 28 | 01779790 | 0400000US28 | 28 | MS | Mississippi | 00 | 121533519481 | 3926919758 | MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ... |

1 | 37 | 01027616 | 0400000US37 | 37 | NC | North Carolina | 00 | 125923656064 | 13466071395 | MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ... |

2 | 40 | 01102857 | 0400000US40 | 40 | OK | Oklahoma | 00 | 177662925723 | 3374587997 | POLYGON ((-103.00257 36.52659, -103.00219 36.6... |

3 | 51 | 01779803 | 0400000US51 | 51 | VA | Virginia | 00 | 102257717110 | 8528531774 | MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ... |

4 | 54 | 01779805 | 0400000US54 | 54 | WV | West Virginia | 00 | 62266474513 | 489028543 | POLYGON ((-82.64320 38.16909, -82.64300 38.169... |

In [246]:

```
print("We have grid coordinates for: {}".format(list(usa_states["NAME"])))
```

In [247]:

```
print("We have state math and reading score information for: {}".format(list(school_data["STATE"].unique())))
```

In [248]:

```
# Dropping territories that do not have data in the education data dataset
states_coords = usa_states[usa_states.NAME != "United States Virgin Islands"]
states_coords = states_coords[states_coords.NAME != "American Samoa"]
states_coords = states_coords[states_coords.NAME != "Puerto Rico"]
states_coords = states_coords[states_coords.NAME != "Guam"]
states_coords = states_coords[states_coords.NAME != "Commonwealth of the Northern Mariana Islands"]
states_coords = states_coords.sort_values(by = ['NAME']) # For readability
```

In [249]:

```
# Matching formatting of NAME variable in states_coords with STATE variable formatting in education data
for row, curr_state in states_coords.iterrows():
    states_coords.at[row, 'NAME'] = curr_state['NAME'].replace(' ', '_')
    states_coords.at[row, 'NAME'] = states_coords.at[row, 'NAME'].upper()
states_coords = states_coords.rename(columns = {"NAME": "STATE"})
states_coords.head()
```

Out[249]:

STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | STATE | LSAD | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|---|

17 | 01 | 01779775 | 0400000US01 | 01 | AL | ALABAMA | 00 | 131174048583 | 4593327154 | MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ... |

27 | 02 | 01785533 | 0400000US02 | 02 | AK | ALASKA | 00 | 1478839695958 | 245481577452 | MULTIPOLYGON (((179.48246 51.98283, 179.48656 ... |

35 | 04 | 01779777 | 0400000US04 | 04 | AZ | ARIZONA | 00 | 294198551143 | 1027337603 | POLYGON ((-114.81629 32.50804, -114.81432 32.5... |

52 | 05 | 00068085 | 0400000US05 | 05 | AR | ARKANSAS | 00 | 134768872727 | 2962859592 | POLYGON ((-94.61783 36.49941, -94.61765 36.499... |

16 | 06 | 01779778 | 0400000US06 | 06 | CA | CALIFORNIA | 00 | 403503931312 | 20463871877 | MULTIPOLYGON (((-118.60442 33.47855, -118.5987... |

We can also isolate the 2009 and 2015 values from the main school_data frame now, as we will use this information shortly. Since we only care about the grade 4 average score values, we can drop columns not relevant to us, such as federal- and state-level revenues, enrollment amounts, etc.

In [250]:

```
# Dropping state values not relevant in education dataset for 2009
data_2009 = school_data[school_data['YEAR'] == 2009]
data_2009 = data_2009[data_2009.STATE != "DODEA"]
data_2009 = data_2009[data_2009.STATE != "NATIONAL"]
data_2009 = data_2009.drop(columns=['TOTAL_REVENUE', 'FEDERAL_REVENUE', 'STATE_REVENUE',
                                    'LOCAL_REVENUE', 'ENROLL', 'TOTAL_EXPENDITURE',
                                    'CAPITAL_OUTLAY_EXPENDITURE', 'INSTRUCTION_EXPENDITURE',
                                    'SUPPORT_SERVICES_EXPENDITURE', 'OTHER_EXPENDITURE'])
data_2009 = data_2009.sort_values(by = ['STATE']) # For readability
data_2009 = data_2009.reset_index(drop=True)
data_2009.head()
```

Out[250]:

STATE | YEAR | A_A_A | G01_A_A | G02_A_A | G03_A_A | G04_A_A | G05_A_A | G06_A_A | G07_A_A | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | ALABAMA | 2009 | 748889.0 | 57821.0 | 56628.0 | 58608.0 | 59512.0 | 58656.0 | 58231.0 | 58118.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

1 | ALASKA | 2009 | 131661.0 | 9926.0 | 9827.0 | 10032.0 | 10046.0 | 9864.0 | 9567.0 | 9657.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

2 | ARIZONA | 2009 | 1077831.0 | 85725.0 | 84033.0 | 84060.0 | 83686.0 | 83193.0 | 81987.0 | 82050.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

3 | ARKANSAS | 2009 | 480559.0 | 37665.0 | 36934.0 | 36903.0 | 36479.0 | 36489.0 | 35958.0 | 36113.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

4 | CALIFORNIA | 2009 | 6263438.0 | 470783.0 | 459334.0 | 459813.0 | 465866.0 | 460248.0 | 461373.0 | 466893.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 255 columns

And we do the same for the 2015 education dataset.

In [251]:

```
# Dropping state values not relevant in education dataset for 2015
data_2015 = school_data[school_data['YEAR'] == 2015]
data_2015 = data_2015[data_2015.STATE != "DODEA"]
data_2015 = data_2015[data_2015.STATE != "NATIONAL"]
data_2015 = data_2015.drop(columns=['TOTAL_REVENUE', 'FEDERAL_REVENUE', 'STATE_REVENUE',
                                    'LOCAL_REVENUE', 'ENROLL', 'TOTAL_EXPENDITURE',
                                    'CAPITAL_OUTLAY_EXPENDITURE', 'INSTRUCTION_EXPENDITURE',
                                    'SUPPORT_SERVICES_EXPENDITURE', 'OTHER_EXPENDITURE'])
data_2015 = data_2015.sort_values(by = ['STATE']) # For readability
data_2015 = data_2015.reset_index()
data_2015.head()
```

Out[251]:

index | STATE | YEAR | A_A_A | G01_A_A | G02_A_A | G03_A_A | G04_A_A | G05_A_A | G06_A_A | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1173 | ALABAMA | 2015 | 743789.0 | 59023.0 | 58766.0 | 57963.0 | 55808.0 | 55340.0 | 54900.0 | ... | 252.0 | 260.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

1 | 1174 | ALASKA | 2015 | 132477.0 | 10587.0 | 10512.0 | 10441.0 | 10118.0 | 9793.0 | 9648.0 | ... | 263.0 | 279.0 | 260.0 | 282.0 | 231.0 | 257.0 | 242.0 | NaN | 270.0 | 285.0 |

2 | 1175 | ARIZONA | 2015 | 1109040.0 | 84804.0 | 87325.0 | 88194.0 | 86594.0 | 85719.0 | 85202.0 | ... | 254.0 | 273.0 | NaN | NaN | 244.0 | 260.0 | NaN | NaN | NaN | NaN |

3 | 1176 | ARKANSAS | 2015 | 492132.0 | 38160.0 | 38590.0 | 38410.0 | 35893.0 | 35850.0 | 36020.0 | ... | 255.0 | 269.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

4 | 1177 | CALIFORNIA | 2015 | 6226737.0 | 444573.0 | 463881.0 | 470157.0 | 485885.0 | 476427.0 | 471467.0 | ... | 249.0 | 263.0 | 279.0 | 304.0 | NaN | NaN | NaN | NaN | 263.0 | 289.0 |

5 rows × 256 columns

In [252]:

```
# Adding the states to our empty dataframe
g04_15to09_states = pd.DataFrame()
g04_15to09_states[['STATE']] = data_2015[['STATE']]
# Traversing through 2009 and 2015 score dataset and tracking differences
for ind, currstate in data_2009.iterrows():
    read_val2015 = float(data_2015.at[ind, "G04_A_A_READING"])
    read_val2009 = float(data_2009.at[ind, "G04_A_A_READING"])
    read_diff = read_val2015 - read_val2009
    g04_15to09_states.at[ind, "READ_GROWTH"] = read_diff
    math_val2015 = float(data_2015.at[ind, "G04_A_A_MATHEMATICS"])
    math_val2009 = float(data_2009.at[ind, "G04_A_A_MATHEMATICS"])
    math_diff = math_val2015 - math_val2009
    g04_15to09_states.at[ind, "MATH_GROWTH"] = math_diff
```
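The loop above can also be written without row-by-row iteration. A vectorized sketch (assuming, as the sort and reset steps above guarantee, that both frames share the same state-sorted row order):

```python
import pandas as pd

def growth_table(d2009, d2015):
    """Difference in grade 4 average scores, 2015 minus 2009, per state."""
    out = pd.DataFrame({"STATE": d2015["STATE"].values})
    out["READ_GROWTH"] = d2015["G04_A_A_READING"].values - d2009["G04_A_A_READING"].values
    out["MATH_GROWTH"] = d2015["G04_A_A_MATHEMATICS"].values - d2009["G04_A_A_MATHEMATICS"].values
    return out
```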

In [253]:

```
g04_15to09_states.head()
```

Out[253]:

STATE | READ_GROWTH | MATH_GROWTH | |
---|---|---|---|

0 | ALABAMA | 1.0 | 3.0 |

1 | ALASKA | 2.0 | -1.0 |

2 | ARIZONA | 5.0 | 8.0 |

3 | ARKANSAS | 2.0 | -3.0 |

4 | CALIFORNIA | 3.0 | 0.0 |

In [254]:

```
map_and_stats = states_coords.merge(g04_15to09_states, on="STATE")
map_and_stats.head()
```

Out[254]:

STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | STATE | LSAD | ALAND | AWATER | geometry | READ_GROWTH | MATH_GROWTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 01 | 01779775 | 0400000US01 | 01 | AL | ALABAMA | 00 | 131174048583 | 4593327154 | MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ... | 1.0 | 3.0 |

1 | 02 | 01785533 | 0400000US02 | 02 | AK | ALASKA | 00 | 1478839695958 | 245481577452 | MULTIPOLYGON (((179.48246 51.98283, 179.48656 ... | 2.0 | -1.0 |

2 | 04 | 01779777 | 0400000US04 | 04 | AZ | ARIZONA | 00 | 294198551143 | 1027337603 | POLYGON ((-114.81629 32.50804, -114.81432 32.5... | 5.0 | 8.0 |

3 | 05 | 00068085 | 0400000US05 | 05 | AR | ARKANSAS | 00 | 134768872727 | 2962859592 | POLYGON ((-94.61783 36.49941, -94.61765 36.499... | 2.0 | -3.0 |

4 | 06 | 01779778 | 0400000US06 | 06 | CA | CALIFORNIA | 00 | 403503931312 | 20463871877 | MULTIPOLYGON (((-118.60442 33.47855, -118.5987... | 3.0 | 0.0 |

In [255]:

```
fig, ax = plt.subplots(1, figsize=(14, 14))
plt.xticks(rotation=90)
map_and_stats.plot(column = "READ_GROWTH", cmap = "RdYlGn",
                   linewidth = 0.4, ax = ax, edgecolor = ".4")
ax.set_title('Heat Map of Growth in Reading Literacy for 4th Graders from 2009 to 2015')
bar_info = plt.cm.ScalarMappable(cmap="RdYlGn", norm=plt.Normalize(vmin= -6, vmax=6))
bar_info._A = []
#https://stackoverflow.com/questions/18195758/set-matplotlib-colorbar-size-to-match-graph
cax = fig.add_axes([ax.get_position().x1 + 0.01,ax.get_position().y0,0.05,ax.get_position().height])
cbar = fig.colorbar(bar_info, cax = cax)
ax.set_xlim(-130, -60)
ax.set_ylim(25, 50)
ax.axis("off")
```

Out[255]:

(-130.0, -60.0, 25.0, 50.0)

(Our map excludes Hawaii and Alaska due to some mapping configuration issues - but just as an FYI, Alaska had a growth of 1.0 and Hawaii had a growth of 2.0).

Well, this is interesting! If education were being distributed equally across states, we would expect to see the same amount of growth - positive, negative, or none - in the reading literacy of 4th graders from 2009 to 2015. As a reminder, we measure reading literacy growth by subtracting a state's average score for fourth graders on the 2009 NAEP reading exam from that state's average in 2015. In terms of our heat map, under the expectation of equal education we would expect to see the same color, i.e. the same growth amount, across all states. However, as the heat map above shows, not all states experienced the same change in reading literacy: states such as Louisiana and North Carolina went up 3 to 6 points in average NAEP reading scores over six years. This is the kind of positive change we would hope to see - states' reading scores improving over the years! However, other states, such as Kansas and our own state of Maryland, went down by as much as 6 points over the same six-year span. Looking at the map, the remaining states also fluctuate in their reading literacy patterns - an indicator that reading literacy growth can depend on state over time, which is suggestive of education inequality by state.

Let's create a similar heat map for growth in math literacy in 4th graders across the states from 2009 to 2015, and see what the results are there.

In [256]:

```
fig, ax = plt.subplots(1, figsize=(14, 14))
plt.xticks(rotation=90)
map_and_stats.plot(column = "MATH_GROWTH", cmap = "RdYlBu",
                   linewidth = 0.4, ax = ax, edgecolor = ".4")
ax.set_title('Heat Map of Growth in Mathematics Literacy for 4th Graders from 2009 to 2015')
bar_info = plt.cm.ScalarMappable(cmap="RdYlBu", norm=plt.Normalize(vmin= -6, vmax=6))
bar_info._A = []
#https://stackoverflow.com/questions/18195758/set-matplotlib-colorbar-size-to-match-graph
cax = fig.add_axes([ax.get_position().x1 + 0.01,ax.get_position().y0,0.05,ax.get_position().height])
cbar = fig.colorbar(bar_info, cax = cax)
ax.set_xlim(-130, -60)
ax.set_ylim(25, 50)
ax.axis("off")
```

Out[256]:

(-130.0, -60.0, 25.0, 50.0)

Interesting! Regarding math literacy growth, 4th grade students in New York, Vermont, Maryland, New Jersey, and Kansas went down by roughly 5 to 6 points in their average NAEP mathematics scores from 2009 to 2015, which is a bit disheartening to see. Notably, Maryland and Kansas also had the largest negative growth in reading literacy over the same period, while New York, Vermont, and New Jersey, though also negative, showed a slightly smaller decrease in reading literacy - only about 3 to 4 points.

In comparing the heat maps for change in math literacy versus change in reading literacy across states from 2009 to 2015, we can also notice that states such as Tennessee, Texas, and Alabama had no growth to slightly positive growth of 0 to 2 points in mathematics literacy over the six years, but negative growth of 2 to 6 points in reading literacy - which may indicate that, even within a state's educational standards, reading standards and math standards can differ state by state.

The code below traverses the *STATE* indices of the 2009 and 2015 education datasets concurrently, and pulls out the average math or reading score of 4th graders on the NAEP Test for each possible state and gender (female or male) combination. After pulling out those values for 2009 and 2015, we subtract the 2009 average from the 2015 average, and record these differences for each subject (math or reading) and gender (male or female) combination for graphing purposes later on.

In [257]:

```
states = []
female_math_change = []
male_math_change = []
female_read_change = []
male_read_change = []
for ind, currstate in data_2009.iterrows():
    states.append(map_and_stats["STUSPS"][ind])
    # READING CHANGE for FEMALE
    f_read_val2015 = float(data_2015.at[ind, "G04_A_F_READING"])
    f_read_val2009 = float(data_2009.at[ind, "G04_A_F_READING"])
    f_read_diff = f_read_val2015 - f_read_val2009
    female_read_change.append(f_read_diff)
    # MATH CHANGE for FEMALE
    f_math_val2015 = float(data_2015.at[ind, "G04_A_F_MATHEMATICS"])
    f_math_val2009 = float(data_2009.at[ind, "G04_A_F_MATHEMATICS"])
    f_math_diff = f_math_val2015 - f_math_val2009
    female_math_change.append(f_math_diff)
    # READING CHANGE for MALE
    m_read_val2015 = float(data_2015.at[ind, "G04_A_M_READING"])
    m_read_val2009 = float(data_2009.at[ind, "G04_A_M_READING"])
    m_read_diff = m_read_val2015 - m_read_val2009
    male_read_change.append(m_read_diff)
    # MATH CHANGE for MALE
    m_math_val2015 = float(data_2015.at[ind, "G04_A_M_MATHEMATICS"])
    m_math_val2009 = float(data_2009.at[ind, "G04_A_M_MATHEMATICS"])
    m_math_diff = m_math_val2015 - m_math_val2009
    male_math_change.append(m_math_diff)
```

Below, we plot the change in average *mathematics* scores from 2009 to 2015 for male and female students per state.

In [258]:

```
# MATH GROWTH
state_locs = np.arange(len(states)) # the state label locations
bar_width = 0.55 # the width of the bars
fig, ax = plt.subplots(1, figsize=(15, 10))
male_bars = ax.bar(states, male_math_change, bar_width, label = 'MALE', color = 'magenta')
female_bars = ax.bar(states, female_math_change, bar_width,
                     bottom = male_math_change, label = 'FEMALE',
                     color = (0.2, 0.7, 0.9, 0.5))
ax.bar_label(male_bars, padding = 4)
ax.bar_label(female_bars, padding = 4)
ax.set_xlabel('State', fontsize = 15)
ax.set_ylabel('Change in Average Math Score', fontsize = 15)
ax.set_title('Mathematics Growth from \'09 to \'15 in NAEP Test by State and Gender', fontsize = 15)
ax.set_xticks(state_locs, states)
ax.legend()
fig.tight_layout()
plt.show()
```

And here is the same plot for the change in average *reading* scores from 2009 to 2015 for male and female students per state.

In [259]:

```
# READING
state_locs = np.arange(len(states)) # the state label locations
bar_width = 0.55 # the width of the bars
fig, ax = plt.subplots(1, figsize=(15, 10))
male_bars = ax.bar(states, male_read_change, bar_width, label = 'MALE', color = 'orange')
female_bars = ax.bar(states, female_read_change, bar_width,
                     bottom = male_read_change, label = 'FEMALE',
                     color = (0.2, 0.7, 0.9, 0.5))
ax.bar_label(male_bars, padding = 4)
ax.bar_label(female_bars, padding = 4)
ax.set_xlabel('State', fontsize = 15)
ax.set_ylabel('Change in Average Reading Score', fontsize = 15)
ax.set_title('Reading Growth from \'09 to \'15 in NAEP Test by State and Gender', fontsize = 15)
ax.set_xticks(state_locs, states)
ax.legend()
fig.tight_layout()
plt.show()
```

Let's analyze our math growth plot first. Looking at the change in average math scores per state by gender, we can see that female students tended to sit at either extreme of math score changes, whereas male students fluctuated less between extreme jumps in score from 2009 to 2015. In 38 of the 50 states/territories, female students had the larger difference - whether negative or positive - in average math scores from 2009 to 2015. This could be explained by two situations often pushed onto students gendered as female: 1) the 'not smart enough' stereotype, which may drive communities or individuals to place added expectations on such students to do better on these examinations, and 2) the expectation that they should worry less about education than about other ideals, such as homemaking, which can be imposed on children at an early age.

Next, let's compare that math plot to our reading growth plot. Only 15 states had at least one gender (FEMALE or MALE) with a decrease in the state's average 4th grade NAEP reading score from 2009 to 2015 - a smaller number than the 29 states in which at least one gender's average math score decreased over the same period. Specifically, the differences between FEMALE and MALE reading scores across states appear somewhat smaller than the corresponding differences in math scores, but positive FEMALE point jumps in reading still tend to be larger than those for MALE students. We can also see that DC had the largest positive point increase from 2009 in both MALE and FEMALE groups for both average math and average reading scores, suggesting that DC's math and reading teaching standards may be a good model for other states. Still, it would be interesting to see why FEMALE students there had such a larger jump than MALE students.

The exploratory data analysis has given us some idea of what to expect when developing our predictive models below: the FEMALE gender may carry more weight in driving drastic changes in average math scores for each state, and negative growth may be rarer in reading than in math.

In [260]:

```
# get columns needed
state_avg = school_data[['STATE', 'YEAR', 'G04_A_A_READING', 'G04_A_A_MATHEMATICS']]
state_avg = state_avg[state_avg.STATE != "DODEA"] # dropping rows that are not relevant states
state_avg = state_avg[state_avg.STATE != "NATIONAL"] # dropping rows that are not relevant states
state_avg.head()
```

Out[260]:

STATE | YEAR | G04_A_A_READING | G04_A_A_MATHEMATICS | |
---|---|---|---|---|

867 | ALABAMA | 2009 | 216.0 | 228.0 |

868 | ALASKA | 2009 | 211.0 | 237.0 |

869 | ARIZONA | 2009 | 210.0 | 230.0 |

870 | ARKANSAS | 2009 | 216.0 | 238.0 |

871 | CALIFORNIA | 2009 | 210.0 | 232.0 |

In [261]:

```
# set reading growth to NaN first
state_avg['READING_GROWTH'] = np.nan

# return a state's grade 4 reading average from 2009 (the baseline year)
def baseline_reading(row):
    state = row['STATE']
    base = state_avg.loc[(state_avg['STATE'] == state) & (state_avg['YEAR'] == 2009)]
    return float(base['G04_A_A_READING'].iloc[0])

# in each row, record the difference between this value and the 2009 value
for i, row in state_avg.iterrows():
    state_avg.at[i, 'READING_GROWTH'] = row['G04_A_A_READING'] - baseline_reading(row)

state_avg['MATHEMATICS_GROWTH'] = np.nan

# same as above, but for mathematics
def baseline_math(row):
    state = row['STATE']
    base = state_avg.loc[(state_avg['STATE'] == state) & (state_avg['YEAR'] == 2009)]
    return float(base['G04_A_A_MATHEMATICS'].iloc[0])

for i, row in state_avg.iterrows():
    state_avg.at[i, 'MATHEMATICS_GROWTH'] = row['G04_A_A_MATHEMATICS'] - baseline_math(row)

state_avg.head()
```
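The same baseline-relative growth can be computed without an explicit Python loop, by mapping each state's 2009 score onto every row. A sketch (the helper name is our own):

```python
import pandas as pd

def add_growth(df, score_col, growth_col, base_year=2009):
    """Subtract each state's base-year score from its score in every year."""
    baseline = df[df["YEAR"] == base_year].set_index("STATE")[score_col]
    df[growth_col] = df[score_col] - df["STATE"].map(baseline)
    return df
```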

Out[261]:

STATE | YEAR | G04_A_A_READING | G04_A_A_MATHEMATICS | READING_GROWTH | MATHEMATICS_GROWTH | |
---|---|---|---|---|---|---|

867 | ALABAMA | 2009 | 216.0 | 228.0 | 0.0 | 0.0 |

868 | ALASKA | 2009 | 211.0 | 237.0 | 0.0 | 0.0 |

869 | ARIZONA | 2009 | 210.0 | 230.0 | 0.0 | 0.0 |

870 | ARKANSAS | 2009 | 216.0 | 238.0 | 0.0 | 0.0 |

871 | CALIFORNIA | 2009 | 210.0 | 232.0 | 0.0 | 0.0 |

In [262]:

```
# get dummies
state_avg = pd.get_dummies(state_avg, columns=['STATE'])
# drop alabama and reading and mathematics averages since we no longer need them
state_avg = state_avg.drop(columns=['STATE_ALABAMA', 'G04_A_A_READING', 'G04_A_A_MATHEMATICS'])
state_avg.head()
```

Out[262]:

| | YEAR | READING_GROWTH | MATHEMATICS_GROWTH | STATE_ALASKA | STATE_ARIZONA | STATE_ARKANSAS | STATE_CALIFORNIA | STATE_COLORADO | STATE_CONNECTICUT | STATE_DELAWARE | ... | STATE_SOUTH_DAKOTA | STATE_TENNESSEE | STATE_TEXAS | STATE_UTAH | STATE_VERMONT | STATE_VIRGINIA | STATE_WASHINGTON | STATE_WEST_VIRGINIA | STATE_WISCONSIN | STATE_WYOMING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 867 | 2009 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 868 | 2009 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 869 | 2009 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 870 | 2009 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 871 | 2009 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 53 columns

In [263]:

```
# split into training (pre-2017) and test (2017 onward) sets, dropping the NaN rows
train_data = state_avg[state_avg['YEAR'] < 2017].dropna()
test_data = state_avg[state_avg['YEAR'] >= 2017].dropna()
```

In [264]:

```
from sklearn.linear_model import LinearRegression
X_reading = [] # independent values for reading
y_reading = [] # dependent values for reading
X_math = [] # independent values for math
y_math = [] # dependent values for math

# iterate through each row, adding the year and state dummies to the X variables
# and the growth values to the y variables
for i, row in train_data.iterrows():
    add = row[3:].tolist()
    add.insert(0, row['YEAR'])
    X_reading.append(add)
    y_reading.append(row['READING_GROWTH'])
    X_math.append(add)
    y_math.append(row['MATHEMATICS_GROWTH'])
```
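Building `X` and `y` with `iterrows` also has a vectorized equivalent: select the feature columns directly and convert them to arrays. Here is a sketch on a toy frame laid out like `train_data` (YEAR, the two growth targets, then state dummies; the numbers are invented):

```python
import pandas as pd

# toy stand-in for train_data (hypothetical values)
toy = pd.DataFrame({
    'YEAR': [2009, 2011, 2009, 2011],
    'READING_GROWTH': [0.0, 4.0, 0.0, -3.0],
    'MATHEMATICS_GROWTH': [0.0, 2.0, 0.0, 1.0],
    'STATE_ALASKA': [0, 0, 1, 1],
})

# same features the loop builds (YEAR first, then the dummies), without iterrows
feature_cols = ['YEAR'] + [c for c in toy.columns if c.startswith('STATE_')]
X = toy[feature_cols].to_numpy()
y_reading = toy['READING_GROWTH'].to_numpy()
```

Both routes feed `LinearRegression.fit` equally well; the column-selection version just avoids the per-row Python loop.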

In [265]:

```
# create reading regression and math regression
reading_regr = LinearRegression().fit(X_reading, y_reading)
mathematics_regr = LinearRegression().fit(X_math, y_math)

X_test_reading = []
X_test_math = []
# accumulate X values for reading and math from the test set
for i, row in test_data.iterrows():
    add = row[3:].tolist()
    add.insert(0, row['YEAR'])
    X_test_reading.append(add)
    X_test_math.append(add)

# predict based on the X values
test_data['PREDICT_READING'] = reading_regr.predict(X_test_reading)
test_data['PREDICT_MATH'] = mathematics_regr.predict(X_test_math)
test_data.head()
```

Out[265]:

| | YEAR | READING_GROWTH | MATHEMATICS_GROWTH | STATE_ALASKA | STATE_ARIZONA | STATE_ARKANSAS | STATE_CALIFORNIA | STATE_COLORADO | STATE_CONNECTICUT | STATE_DELAWARE | ... | STATE_TEXAS | STATE_UTAH | STATE_VERMONT | STATE_VIRGINIA | STATE_WASHINGTON | STATE_WEST_VIRGINIA | STATE_WISCONSIN | STATE_WYOMING | PREDICT_READING | PREDICT_MATH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1281 | 2017 | 0.0 | 4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.642157 | 3.666667 |
| 1288 | 2017 | -4.0 | -7.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.892157 | 0.166667 |
| 1295 | 2017 | 5.0 | 4.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.142157 | 6.666667 |
| 1302 | 2017 | 0.0 | -4.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.142157 | 0.666667 |
| 1309 | 2017 | 5.0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.392157 | 1.916667 |

5 rows × 55 columns
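With predictions and actual growth now side by side, a natural next step is to quantify how well the model does on the held-out years, for example with mean absolute error and R². A small sketch using `sklearn.metrics` on hypothetical actual/predicted values (not the real `test_data`):

```python
from sklearn.metrics import mean_absolute_error, r2_score

# hypothetical actual vs. predicted growth values for a handful of states
actual    = [0.0, -4.0, 5.0, 0.0, 5.0]
predicted = [3.6,  0.9, 4.1, 3.1, 3.4]

mae = mean_absolute_error(actual, predicted)  # average size of the miss, in score points
r2 = r2_score(actual, predicted)              # share of variance in growth the model explains
```

On the real frame this would be `mean_absolute_error(test_data['READING_GROWTH'], test_data['PREDICT_READING'])`, and similarly for math.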

In [266]:

```
# create statsmodel for reading data
p_reading = sm.OLS(train_data['READING_GROWTH'].tolist(), sm.add_constant(X_reading)).fit()
p_reading.summary()
```

Out[266]:

| Dep. Variable: | y | R-squared: | 0.501 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.334 |
| Method: | Least Squares | F-statistic: | 2.997 |
| Date: | Mon, 16 May 2022 | Prob (F-statistic): | 1.12e-07 |
| Time: | 15:54:25 | Log-Likelihood: | -381.05 |
| No. Observations: | 204 | AIC: | 866.1 |
| Df Residuals: | 152 | BIC: | 1039. |
| Df Model: | 51 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -658.8039 | 114.346 | -5.762 | 0.000 | -884.716 | -432.892 |
| x1 | 0.3284 | 0.057 | 5.779 | 0.000 | 0.216 | 0.441 |
| x2 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
| x3 | 0.5000 | 1.283 | 0.390 | 0.697 | -2.036 | 3.036 |
| x4 | -0.5000 | 1.283 | -0.390 | 0.697 | -3.036 | 2.036 |
| x5 | -0.2500 | 1.283 | -0.195 | 0.846 | -2.786 | 2.286 |
| x6 | -3.0000 | 1.283 | -2.338 | 0.021 | -5.536 | -0.464 |
| x7 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
| x8 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
| x9 | 1.2500 | 1.283 | 0.974 | 0.332 | -1.286 | 3.786 |
| x10 | -1.7500 | 1.283 | -1.364 | 0.175 | -4.286 | 0.786 |
| x11 | 0.7500 | 1.283 | 0.584 | 0.560 | -1.786 | 3.286 |
| x12 | 0.7500 | 1.283 | 0.584 | 0.560 | -1.786 | 3.286 |
| x13 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
| x14 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
| x15 | -1.0000 | 1.283 | -0.779 | 0.437 | -3.536 | 1.536 |
| x16 | -0.5000 | 1.283 | -0.390 | 0.697 | -3.036 | 2.036 |
| x17 | -3.0000 | 1.283 | -2.338 | 0.021 | -5.536 | -0.464 |
| x18 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
| x19 | 1.7500 | 1.283 | 1.364 | 0.175 | -0.786 | 4.286 |
| x20 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
| x21 | 4.125e-12 | 1.283 | 3.21e-12 | 1.000 | -2.536 | 2.536 |
| x22 | -1.5000 | 1.283 | -1.169 | 0.244 | -4.036 | 1.036 |
| x23 | -2.5000 | 1.283 | -1.948 | 0.053 | -5.036 | 0.036 |
| x24 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
| x25 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
| x26 | -3.7500 | 1.283 | -2.922 | 0.004 | -6.286 | -1.214 |
| x27 | -2.5000 | 1.283 | -1.948 | 0.053 | -5.036 | 0.036 |
| x28 | -1.0000 | 1.283 | -0.779 | 0.437 | -3.536 | 1.536 |
| x29 | 4.121e-12 | 1.283 | 3.21e-12 | 1.000 | -2.536 | 2.536 |
| x30 | -0.2500 | 1.283 | -0.195 | 0.846 | -2.786 | 2.286 |
| x31 | -1.5000 | 1.283 | -1.169 | 0.244 | -4.036 | 1.036 |
| x32 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
| x33 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
| x34 | 1.0000 | 1.283 | 0.779 | 0.437 | -1.536 | 3.536 |
| x35 | -2.7500 | 1.283 | -2.143 | 0.034 | -5.286 | -0.214 |
| x36 | -2.5000 | 1.283 | -1.948 | 0.053 | -5.036 | 0.036 |
| x37 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
| x38 | -1.7500 | 1.283 | -1.364 | 0.175 | -4.286 | 0.786 |
| x39 | 4.125e-12 | 1.283 | 3.21e-12 | 1.000 | -2.536 | 2.536 |
| x40 | -1.7500 | 1.283 | -1.364 | 0.175 | -4.286 | 0.786 |
| x41 | -2.2500 | 1.283 | -1.753 | 0.082 | -4.786 | 0.286 |
| x42 | -4.0000 | 1.283 | -3.117 | 0.002 | -6.536 | -1.464 |
| x43 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
| x44 | -3.0000 | 1.283 | -2.338 | 0.021 | -5.536 | -0.464 |
| x45 | 1.0000 | 1.283 | 0.779 | 0.437 | -1.536 | 3.536 |
| x46 | -2.5000 | 1.283 | -1.948 | 0.053 | -5.036 | 0.036 |
| x47 | -1.2500 | 1.283 | -0.974 | 0.332 | -3.786 | 1.286 |
| x48 | 0.2500 | 1.283 | 0.195 | 0.846 | -2.286 | 2.786 |
| x49 | -2.0000 | 1.283 | -1.558 | 0.121 | -4.536 | 0.536 |
| x50 | -0.7500 | 1.283 | -0.584 | 0.560 | -3.286 | 1.786 |
| x51 | 0.2500 | 1.283 | 0.195 | 0.846 | -2.286 | 2.786 |

| Omnibus: | 6.101 | Durbin-Watson: | 1.995 |
|---|---|---|---|
| Prob(Omnibus): | 0.047 | Jarque-Bera (JB): | 8.918 |
| Skew: | 0.113 | Prob(JB): | 0.0116 |
| Kurtosis: | 3.999 | Cond. No. | 1.81e+06 |

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 1.81e+06. This might indicate that there are strong multicollinearity or other numerical problems.
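The large condition number flagged in the notes is expected with this design. Two things drive it: the YEAR column (values around 2009–2015) is on a completely different scale from the 0/1 state dummies, and a full set of one-hot dummies is collinear with the intercept, which is exactly why the STATE_ALABAMA dummy was dropped earlier. A small numpy sketch with made-up numbers shows how centering YEAR alone tames the conditioning without changing the fit:

```python
import numpy as np

# design matrix shaped like ours: an intercept, a YEAR column, and a 0/1 state dummy
years = np.array([2009, 2011, 2013, 2015, 2009, 2011, 2013, 2015], dtype=float)
dummy = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
X_raw = np.column_stack([np.ones_like(years), years, dummy])

# centering YEAR makes it near-orthogonal to the intercept column
X_centered = np.column_stack([np.ones_like(years), years - years.mean(), dummy])

cond_raw = np.linalg.cond(X_raw)            # large: raw years dwarf the 0/1 columns
cond_centered = np.linalg.cond(X_centered)  # small: columns are on comparable scales
```

This is purely a numerical-stability observation; the coefficients and p-values above are still interpretable as reported.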

We can do the same for the math data.

In [267]:

```
# create statsmodel for math data
p_math = sm.OLS(train_data['MATHEMATICS_GROWTH'].tolist(), sm.add_constant(X_math)).fit()
p_math.summary()
```

Out[267]:

| Dep. Variable: | y | R-squared: | 0.511 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.347 |
| Method: | Least Squares | F-statistic: | 3.114 |
| Date: | Mon, 16 May 2022 | Prob (F-statistic): | 4.02e-08 |
| Time: | 15:54:25 | Log-Likelihood: | -403.42 |
| No. Observations: | 204 | AIC: | 910.8 |
| Df Residuals: | 152 | BIC: | 1083. |
| Df Model: | 51 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -366.1167 | 127.593 | -2.869 | 0.005 | -618.201 | -114.032 |
| x1 | 0.1833 | 0.063 | 2.891 | 0.004 | 0.058 | 0.309 |
| x2 | -3.5000 | 1.432 | -2.444 | 0.016 | -6.329 | -0.671 |
| x3 | 3.0000 | 1.432 | 2.095 | 0.038 | 0.171 | 5.829 |
| x4 | -3.0000 | 1.432 | -2.095 | 0.038 | -5.829 | -0.171 |
| x5 | -1.7500 | 1.432 | -1.222 | 0.224 | -4.579 | 1.079 |
| x6 | -1.7500 | 1.432 | -1.222 | 0.224 | -4.579 | 1.079 |
| x7 | -5.2500 | 1.432 | -3.666 | 0.000 | -8.079 | -2.421 |
| x8 | -1.5000 | 1.432 | -1.047 | 0.297 | -4.329 | 1.329 |
| x9 | 3.5000 | 1.432 | 2.444 | 0.016 | 0.671 | 6.329 |
| x10 | -3.0000 | 1.432 | -2.095 | 0.038 | -5.829 | -0.171 |
| x11 | -1.2500 | 1.432 | -0.873 | 0.384 | -4.079 | 1.579 |
| x12 | 0.2500 | 1.432 | 0.175 | 0.862 | -2.579 | 3.079 |
| x13 | -3.5000 | 1.432 | -2.444 | 0.016 | -6.329 | -0.671 |
| x14 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
| x15 | 0.2500 | 1.432 | 0.175 | 0.862 | -2.579 | 3.079 |
| x16 | -2.0000 | 1.432 | -1.397 | 0.165 | -4.829 | 0.829 |
| x17 | -3.2500 | 1.432 | -2.269 | 0.025 | -6.079 | -0.421 |
| x18 | -1.0000 | 1.432 | -0.698 | 0.486 | -3.829 | 1.829 |
| x19 | -0.5000 | 1.432 | -0.349 | 0.727 | -3.329 | 2.329 |
| x20 | -2.7500 | 1.432 | -1.920 | 0.057 | -5.579 | 0.079 |
| x21 | -3.0000 | 1.432 | -2.095 | 0.038 | -5.829 | -0.171 |
| x22 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
| x23 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
| x24 | -1.5000 | 1.432 | -1.047 | 0.297 | -4.329 | 1.329 |
| x25 | 0.7500 | 1.432 | 0.524 | 0.601 | -2.079 | 3.579 |
| x26 | -3.7500 | 1.432 | -2.619 | 0.010 | -6.579 | -0.921 |
| x27 | -3.5000 | 1.432 | -2.444 | 0.016 | -6.329 | -0.671 |
| x28 | -0.2500 | 1.432 | -0.175 | 0.862 | -3.079 | 2.579 |
| x29 | -2.2500 | 1.432 | -1.571 | 0.118 | -5.079 | 0.579 |
| x30 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
| x31 | -3.0000 | 1.432 | -2.095 | 0.038 | -5.829 | -0.171 |
| x32 | -1.0000 | 1.432 | -0.698 | 0.486 | -3.829 | 1.829 |
| x33 | -4.7500 | 1.432 | -3.317 | 0.001 | -7.579 | -1.921 |
| x34 | -2.2500 | 1.432 | -1.571 | 0.118 | -5.079 | 0.579 |
| x35 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
| x36 | -2.2500 | 1.432 | -1.571 | 0.118 | -5.079 | 0.579 |
| x37 | -1.5000 | 1.432 | -1.047 | 0.297 | -4.329 | 1.329 |
| x38 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
| x39 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
| x40 | -1.7500 | 1.432 | -1.222 | 0.224 | -4.579 | 1.079 |
| x41 | -2.0000 | 1.432 | -1.397 | 0.165 | -4.829 | 0.829 |
| x42 | -3.7500 | 1.432 | -2.619 | 0.010 | -6.579 | -0.921 |
| x43 | 1.7500 | 1.432 | 1.222 | 0.224 | -1.079 | 4.579 |
| x44 | -1.0000 | 1.432 | -0.698 | 0.486 | -3.829 | 1.829 |
| x45 | -0.5000 | 1.432 | -0.349 | 0.727 | -3.329 | 2.329 |
| x46 | -4.2500 | 1.432 | -2.968 | 0.003 | -7.079 | -1.421 |
| x47 | -0.5000 | 1.432 | -0.349 | 0.727 | -3.329 | 2.329 |
| x48 | -0.7500 | 1.432 | -0.524 | 0.601 | -3.579 | 2.079 |
| x49 | -0.7500 | 1.432 | -0.524 | 0.601 | -3.579 | 2.079 |
| x50 | -2.5000 | 1.432 | -1.746 | 0.083 | -5.329 | 0.329 |
| x51 | 0.2500 | 1.432 | 0.175 | 0.862 | -2.579 | 3.079 |

| Omnibus: | 7.342 | Durbin-Watson: | 1.715 |
|---|---|---|---|
| Prob(Omnibus): | 0.025 | Jarque-Bera (JB): | 8.179 |
| Skew: | -0.316 | Prob(JB): | 0.0167 |
| Kurtosis: | 3.750 | Cond. No. | 1.81e+06 |

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 1.81e+06. This might indicate that there are strong multicollinearity or other numerical problems.

In [268]:

```
# accumulate the names that correspond to the x1..x51 labels in the summaries
const_names = train_data.columns[3:].tolist()
const_names.insert(0, 'YEAR')

reading_significant = []
# iterate through the p-values; if less than 0.05, add the constant's name to reading_significant
for i in range(len(p_reading.pvalues) - 1):
    if p_reading.pvalues[i + 1] < 0.05:
        reading_significant.append(const_names[i])
print("Significant reading test constants: " + str(reading_significant))
print()

math_significant = []
# iterate through the p-values; if less than 0.05, add the constant's name to math_significant
for i in range(len(p_math.pvalues) - 1):
    if p_math.pvalues[i + 1] < 0.05:
        math_significant.append(const_names[i])
print("Significant math test constants: " + str(math_significant))
```

Based on the p-values alone, we can see that these states are significant factors in the growth of a child's reading and math test scores.

One application of this model is flagging states that are predicted to perform poorly, so that legislation and funding can be directed toward helping their students grow. A state is predicted to perform poorly when its coefficient in the LinearRegression model is negative; if that coefficient is also statistically significant, the state should look to improve its education system.

In [269]:

```
print("States that have a predicted negative growth in reading:")
for i in range(len(reading_regr.coef_)):
    if i > 0:  # skip index 0, the YEAR coefficient; the rest are state dummies
        if reading_regr.coef_[i] < 0 and p_reading.pvalues[i + 1] < 0.05:
            print(const_names[i])
print()
print("States that have a predicted negative growth in math:")
for i in range(len(mathematics_regr.coef_)):
    if i > 0:  # skip the YEAR coefficient here as well
        if mathematics_regr.coef_[i] < 0 and p_math.pvalues[i + 1] < 0.05:
            print(const_names[i])
```

In [270]:

```
# get gender-based data
gender_avg = school_data[['STATE', 'YEAR', 'G04_A_M_READING', 'G04_A_M_MATHEMATICS', 'G04_A_F_READING', 'G04_A_F_MATHEMATICS']]
gender_avg = gender_avg[gender_avg.STATE != "DODEA"] # dropping rows that are not relevant states
gender_avg = gender_avg[gender_avg.STATE != "NATIONAL"] # dropping rows that are not relevant states
# display the first few results
gender_avg.head()
```

Out[270]:

| | STATE | YEAR | G04_A_M_READING | G04_A_M_MATHEMATICS | G04_A_F_READING | G04_A_F_MATHEMATICS |
|---|---|---|---|---|---|---|
| 867 | ALABAMA | 2009 | 212.0 | 228.0 | 221.0 | 228.0 |
| 868 | ALASKA | 2009 | 207.0 | 238.0 | 216.0 | 236.0 |
| 869 | ARIZONA | 2009 | 207.0 | 230.0 | 213.0 | 230.0 |
| 870 | ARKANSAS | 2009 | 211.0 | 239.0 | 222.0 | 236.0 |
| 871 | CALIFORNIA | 2009 | 207.0 | 233.0 | 213.0 | 231.0 |

To create a metric for how the test scores have improved, we will, as before, subtract each state's 2009 MALE and FEMALE reading and math averages from those of each later year, defining how much the average test scores have changed since 2009. We store this metric in four new columns: "READING_GROWTH_M", "READING_GROWTH_F", "MATHEMATICS_GROWTH_M", and "MATHEMATICS_GROWTH_F".

First, we will create these columns and use numpy to set the values in those columns to empty (np.NaN). Typically, NaN denotes missing values, but it also works as a placeholder when we don't know the values up front.

In [271]:

```
# define columns in dataframe and set to empty
gender_avg['MATHEMATICS_GROWTH_M'] = np.NaN
gender_avg['MATHEMATICS_GROWTH_F'] = np.NaN
gender_avg['READING_GROWTH_M'] = np.NaN
gender_avg['READING_GROWTH_F'] = np.NaN
# display the first few results
gender_avg.head()
```
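The growth computation that follows can mirror the state-level version, but since there are now four score columns, one loop over (score column, growth column) pairs keeps it compact. A sketch on a toy one-state frame (hypothetical scores, and only the two reading columns, to stay short):

```python
import pandas as pd

# toy stand-in for gender_avg (hypothetical scores for one state)
toy = pd.DataFrame({
    'STATE': ['ALABAMA', 'ALABAMA'],
    'YEAR': [2009, 2011],
    'G04_A_M_READING': [212.0, 215.0],
    'G04_A_F_READING': [221.0, 220.0],
})

# one pass computes growth-since-2009 for any number of score columns
pairs = {'G04_A_M_READING': 'READING_GROWTH_M',
         'G04_A_F_READING': 'READING_GROWTH_F'}
for score_col, growth_col in pairs.items():
    # broadcast each state's 2009 score to all of its rows, then subtract
    baseline = (toy[score_col].where(toy['YEAR'] == 2009)
                .groupby(toy['STATE']).transform('first'))
    toy[growth_col] = toy[score_col] - baseline
```

Extending `pairs` with the two mathematics columns would fill all four growth columns declared above in the same pass.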