Summary


I was given two datasets: One about 15 different schools, and one about all of the students in them, including their math and reading scores. The two csv datasets were sized at 1kb and 2,074kb respectively. There was no missing data. I was asked to identify the most effective schools, and provide other insights.


Here, I used Pandas in a Jupyter Notebook. The requirements of this task required me to learn about formatting intermediate datatypes, especially the int64.

Solution


School Dataset Conclusions

  • As a whole, schools with higher budgets, did not yield better test results. Schools with higher spending per student (645-675 dollars) actually underperformed compared to schools with smaller budgets (585 dollars per student).

  • Smaller and medium sized schools dramatically out-performed large sized schools on passing math performances (89-91% passing vs 67%).

  • Charter schools out-performed the public district schools across all metrics.

In [1]:
# Dependencies and Setup
import pandas as pd
import numpy as np

# File to Load
school_data_to_load = "Resources/schools_complete.csv"
student_data_to_load = "Resources/students_complete.csv"

# Read School and Student Data File and store into Pandas Data Frames
school_data = pd.read_csv(school_data_to_load)
student_data = pd.read_csv(student_data_to_load)

# Combine the data into a single dataset
school_data_complete = pd.merge(student_data, school_data, how="left", on=["school_name", "school_name"])
In [6]:
school_data.head()
Out[6]:
School ID school_name type size budget
0 0 Huang High School District 2917 1910635
1 1 Figueroa High School District 2949 1884411
2 2 Shelton High School Charter 1761 1056600
3 3 Hernandez High School District 4635 3022020
4 4 Griffin High School Charter 1468 917500
In [8]:
school_data_complete.head()
Out[8]:
Student ID student_name gender grade school_name reading_score math_score School ID type size budget
0 0 Paul Bradley M 9th Huang High School 66 79 0 District 2917 1910635
1 1 Victor Smith M 12th Huang High School 94 61 0 District 2917 1910635
2 2 Kevin Rodriguez M 12th Huang High School 90 60 0 District 2917 1910635
3 3 Dr. Richard Scott M 12th Huang High School 67 58 0 District 2917 1910635
4 4 Bonnie Ray F 9th Huang High School 97 84 0 District 2917 1910635

District Summary

  • Calculate the total number of schools

  • Calculate the total number of students

  • Calculate the total budget

  • Calculate the average math score

  • Calculate the average reading score

  • Calculate the overall passing rate (overall average score), i.e. (avg. math score + avg. reading score)/2

  • Calculate the percentage of students with a passing math score (70 or greater)

  • Calculate the percentage of students with a passing reading score (70 or greater)

  • Create a dataframe to hold the above results

  • Optional: give the displayed data cleaner formatting

In [3]:
Total_Schools_int = len(school_data_complete['school_name'].unique())
Total_Students_int = len(school_data_complete['student_name'])
Total_Budget_int = sum(school_data_complete['budget'].unique())
Average_Math_int = school_data_complete['math_score'].mean()
Average_Reading_int= school_data_complete['reading_score'].mean()
Passing_Math_count_float = school_data_complete[(school_data_complete["math_score"] >= 70)].count()["student_name"]
Passing_Math_percentage_float = (Passing_Math_count_float / float(Total_Students_int)) * 100
Passing_Reading_count_float = school_data_complete[(school_data_complete["reading_score"] >= 70)].count()["student_name"]
Passing_Reading_percentage_float = (Passing_Reading_count_float / float(Total_Students_int)) * 100
Overall_Passing_percentage_float = (Passing_Math_percentage_float + Passing_Reading_percentage_float)/2
df1 = pd.DataFrame({
    "Total Schools": [Total_Schools_int],
    "Total Students": [Total_Students_int],
    "Total Budget": [Total_Budget_int],
    "Average Math Score": [Average_Math_int],
    "Average Reading Score": [Average_Reading_int],
    "% Passing Math": [Passing_Math_percentage_float],
    "% Passing Reading": [Passing_Reading_percentage_float],
    "% Overall Passing Rate": [Overall_Passing_percentage_float]
})
df1['Total Students'] = df1['Total Students'].map("{:,}".format)
df1['Total Budget'] = df1['Total Budget'].map("${:,.2f}".format)
df1
Out[3]:
Total Schools Total Students Total Budget Average Math Score Average Reading Score % Passing Math % Passing Reading % Overall Passing Rate
0 15 39,170 $24,649,428.00 78.985371 81.87784 74.980853 85.805463 80.393158

School Summary

  • Create an overview table that summarizes key metrics about each school, including:

    • School Name
    • School Type
    • Total Students
    • Total School Budget
    • Per Student Budget
    • Average Math Score
    • Average Reading Score
    • % Passing Math
    • % Passing Reading
    • Overall Passing Rate (Average of the above two)
  • Create a dataframe to hold the above results

In [4]:
school_types = school_data.set_index(["school_name"])["type"]
per_school_counts = school_data_complete["school_name"].value_counts()
per_school_budget = school_data_complete.groupby(["school_name"]).mean()["budget"]
per_school_capita = per_school_budget / per_school_counts
per_school_math = school_data_complete.groupby(["school_name"]).mean()["math_score"]
per_school_reading = school_data_complete.groupby(["school_name"]).mean()["reading_score"]

school_passing_math = school_data_complete[(school_data_complete["math_score"] >= 70)]
school_passing_reading = school_data_complete[(school_data_complete["reading_score"] >= 70)]
per_school_passing_math = school_passing_math.groupby(["school_name"]).count()["student_name"] / per_school_counts * 100
per_school_passing_reading = school_passing_reading.groupby(["school_name"]).count()["student_name"] / per_school_counts * 100

overall_passing_rate = (per_school_passing_math + per_school_passing_reading) / 2

per_school_summary = pd.DataFrame({"School Type": school_types,
                                   "Total Students": per_school_counts,
                                   "Total School Budget": per_school_budget,
                                   "Per Student Budget": per_school_capita,
                                   "Average Math Score": per_school_math,
                                   "Average Reading Score": per_school_reading,
                                   "% Passing Math": per_school_passing_math,
                                   "% Passing Reading": per_school_passing_reading,
                                   "% Overall Passing Rate": overall_passing_rate})

per_school_summary = per_school_summary[["School Type", "Total Students", "Total School Budget", "Per Student Budget",
                                         "Average Math Score", "Average Reading Score", 
                                         "% Passing Math", "% Passing Reading", 
                                         "% Overall Passing Rate"]]
per_school_summary["Total School Budget"] = per_school_summary["Total School Budget"].map("${:,.2f}".format)
per_school_summary["Per Student Budget"] = per_school_summary["Per Student Budget"].map("${:,.2f}".format)

per_school_summary
Out[4]:
School Type Total Students Total School Budget Per Student Budget Average Math Score Average Reading Score % Passing Math % Passing Reading % Overall Passing Rate
Bailey High School District 4976 $3,124,928.00 $628.00 77.048432 81.033963 66.680064 81.933280 74.306672
Cabrera High School Charter 1858 $1,081,356.00 $582.00 83.061895 83.975780 94.133477 97.039828 95.586652
Figueroa High School District 2949 $1,884,411.00 $639.00 76.711767 81.158020 65.988471 80.739234 73.363852
Ford High School District 2739 $1,763,916.00 $644.00 77.102592 80.746258 68.309602 79.299014 73.804308
Griffin High School Charter 1468 $917,500.00 $625.00 83.351499 83.816757 93.392371 97.138965 95.265668
Hernandez High School District 4635 $3,022,020.00 $652.00 77.289752 80.934412 66.752967 80.862999 73.807983
Holden High School Charter 427 $248,087.00 $581.00 83.803279 83.814988 92.505855 96.252927 94.379391
Huang High School District 2917 $1,910,635.00 $655.00 76.629414 81.182722 65.683922 81.316421 73.500171
Johnson High School District 4761 $3,094,650.00 $650.00 77.072464 80.966394 66.057551 81.222432 73.639992
Pena High School Charter 962 $585,858.00 $609.00 83.839917 84.044699 94.594595 95.945946 95.270270
Rodriguez High School District 3999 $2,547,363.00 $637.00 76.842711 80.744686 66.366592 80.220055 73.293323
Shelton High School Charter 1761 $1,056,600.00 $600.00 83.359455 83.725724 93.867121 95.854628 94.860875
Thomas High School Charter 1635 $1,043,130.00 $638.00 83.418349 83.848930 93.272171 97.308869 95.290520
Wilson High School Charter 2283 $1,319,574.00 $578.00 83.274201 83.989488 93.867718 96.539641 95.203679
Wright High School Charter 1800 $1,049,400.00 $583.00 83.682222 83.955000 93.333333 96.611111 94.972222

Top Performing Schools (By Passing Rate)

  • Sort and display the five best-performing schools
In [5]:
top_schools = per_school_summary.sort_values(["% Overall Passing Rate"], ascending=False)
top_schools.head(5)
Out[5]:
School Type Total Students Total School Budget Per Student Budget Average Math Score Average Reading Score % Passing Math % Passing Reading % Overall Passing Rate
Cabrera High School Charter 1858 $1,081,356.00 $582.00 83.061895 83.975780 94.133477 97.039828 95.586652
Thomas High School Charter 1635 $1,043,130.00 $638.00 83.418349 83.848930 93.272171 97.308869 95.290520
Pena High School Charter 962 $585,858.00 $609.00 83.839917 84.044699 94.594595 95.945946 95.270270
Griffin High School Charter 1468 $917,500.00 $625.00 83.351499 83.816757 93.392371 97.138965 95.265668
Wilson High School Charter 2283 $1,319,574.00 $578.00 83.274201 83.989488 93.867718 96.539641 95.203679

Bottom Performing Schools (By Passing Rate)

  • Sort and display the five worst-performing schools
In [6]:
bottom_schools = per_school_summary.sort_values(["% Overall Passing Rate"], ascending=True)
bottom_schools.head(5)
Out[6]:
School Type Total Students Total School Budget Per Student Budget Average Math Score Average Reading Score % Passing Math % Passing Reading % Overall Passing Rate
Rodriguez High School District 3999 $2,547,363.00 $637.00 76.842711 80.744686 66.366592 80.220055 73.293323
Figueroa High School District 2949 $1,884,411.00 $639.00 76.711767 81.158020 65.988471 80.739234 73.363852
Huang High School District 2917 $1,910,635.00 $655.00 76.629414 81.182722 65.683922 81.316421 73.500171
Johnson High School District 4761 $3,094,650.00 $650.00 77.072464 80.966394 66.057551 81.222432 73.639992
Ford High School District 2739 $1,763,916.00 $644.00 77.102592 80.746258 68.309602 79.299014 73.804308

Math Scores by Grade

  • Create a table that lists the average Reading Score for students of each grade level (9th, 10th, 11th, 12th) at each school.

    • Create a pandas series for each grade. Hint: use a conditional statement.

    • Group each series by school

    • Combine the series into a dataframe

    • Optional: give the displayed data cleaner formatting

In [7]:
# Create data series of scores by grade levels using conditionals
ninth_graders = school_data_complete[(school_data_complete["grade"] == "9th")]
tenth_graders = school_data_complete[(school_data_complete["grade"] == "10th")]
eleventh_graders = school_data_complete[(school_data_complete["grade"] == "11th")]
twelfth_graders = school_data_complete[(school_data_complete["grade"] == "12th")]

# Group each by school name
ninth_graders_scores = ninth_graders.groupby(["school_name"]).mean()["math_score"]
tenth_graders_scores = tenth_graders.groupby(["school_name"]).mean()["math_score"]
eleventh_graders_scores = eleventh_graders.groupby(["school_name"]).mean()["math_score"]
twelfth_graders_scores = twelfth_graders.groupby(["school_name"]).mean()["math_score"]

# Combine series into single data frame
scores_by_grade = pd.DataFrame({"9th": ninth_graders_scores, "10th": tenth_graders_scores,
                                "11th": eleventh_graders_scores, "12th": twelfth_graders_scores})

# Minor data munging
scores_by_grade = scores_by_grade[["9th", "10th", "11th", "12th"]]
scores_by_grade.index.name = None

# Display the data frame
scores_by_grade
Out[7]:
9th 10th 11th 12th
Bailey High School 77.083676 76.996772 77.515588 76.492218
Cabrera High School 83.094697 83.154506 82.765560 83.277487
Figueroa High School 76.403037 76.539974 76.884344 77.151369
Ford High School 77.361345 77.672316 76.918058 76.179963
Griffin High School 82.044010 84.229064 83.842105 83.356164
Hernandez High School 77.438495 77.337408 77.136029 77.186567
Holden High School 83.787402 83.429825 85.000000 82.855422
Huang High School 77.027251 75.908735 76.446602 77.225641
Johnson High School 77.187857 76.691117 77.491653 76.863248
Pena High School 83.625455 83.372000 84.328125 84.121547
Rodriguez High School 76.859966 76.612500 76.395626 77.690748
Shelton High School 83.420755 82.917411 83.383495 83.778976
Thomas High School 83.590022 83.087886 83.498795 83.497041
Wilson High School 83.085578 83.724422 83.195326 83.035794
Wright High School 83.264706 84.010288 83.836782 83.644986

Reading Score by Grade

  • Perform the same operations as above for reading scores
In [8]:
# Create data series of scores by grade levels using conditionals
ninth_graders = school_data_complete[(school_data_complete["grade"] == "9th")]
tenth_graders = school_data_complete[(school_data_complete["grade"] == "10th")]
eleventh_graders = school_data_complete[(school_data_complete["grade"] == "11th")]
twelfth_graders = school_data_complete[(school_data_complete["grade"] == "12th")]

# Group each by school name
ninth_graders_scores = ninth_graders.groupby(["school_name"]).mean()["reading_score"]
tenth_graders_scores = tenth_graders.groupby(["school_name"]).mean()["reading_score"]
eleventh_graders_scores = eleventh_graders.groupby(["school_name"]).mean()["reading_score"]
twelfth_graders_scores = twelfth_graders.groupby(["school_name"]).mean()["reading_score"]

# Combine series into single data frame
scores_by_grade = pd.DataFrame({"9th": ninth_graders_scores, "10th": tenth_graders_scores,
                                "11th": eleventh_graders_scores, "12th": twelfth_graders_scores})

# Minor data munging
scores_by_grade = scores_by_grade[["9th", "10th", "11th", "12th"]]
scores_by_grade.index.name = None

# Display the data frame
scores_by_grade
Out[8]:
9th 10th 11th 12th
Bailey High School 81.303155 80.907183 80.945643 80.912451
Cabrera High School 83.676136 84.253219 83.788382 84.287958
Figueroa High School 81.198598 81.408912 80.640339 81.384863
Ford High School 80.632653 81.262712 80.403642 80.662338
Griffin High School 83.369193 83.706897 84.288089 84.013699
Hernandez High School 80.866860 80.660147 81.396140 80.857143
Holden High School 83.677165 83.324561 83.815534 84.698795
Huang High School 81.290284 81.512386 81.417476 80.305983
Johnson High School 81.260714 80.773431 80.616027 81.227564
Pena High School 83.807273 83.612000 84.335938 84.591160
Rodriguez High School 80.993127 80.629808 80.864811 80.376426
Shelton High School 84.122642 83.441964 84.373786 82.781671
Thomas High School 83.728850 84.254157 83.585542 83.831361
Wilson High School 83.939778 84.021452 83.764608 84.317673
Wright High School 83.833333 83.812757 84.156322 84.073171

Scores by School Spending

  • Create a table that breaks down school performances based on average Spending Ranges (Per Student). Use 4 reasonable bins to group school spending. Include in the table each of the following:
    • Average Math Score
    • Average Reading Score
    • % Passing Math
    • % Passing Reading
    • Overall Passing Rate (Average of the above two)
In [9]:
# Establish the bins 
spending_bins = [0, 585, 615, 645, 675]
group_names = ["<$585", "$585-615", "$615-645", "$645-675"]

# Categorize the spending based on the bins
per_school_summary["Spending Ranges (Per Student)"] = pd.cut(per_school_capita, spending_bins, labels=group_names)

spending_math_scores = per_school_summary.groupby(["Spending Ranges (Per Student)"]).mean()["Average Math Score"]
spending_reading_scores = per_school_summary.groupby(["Spending Ranges (Per Student)"]).mean()["Average Reading Score"]
spending_passing_math = per_school_summary.groupby(["Spending Ranges (Per Student)"]).mean()["% Passing Math"]
spending_passing_reading = per_school_summary.groupby(["Spending Ranges (Per Student)"]).mean()["% Passing Reading"]
overall_passing_rate = (spending_passing_math + spending_passing_reading) / 2

# Assemble into data frame
spending_summary = pd.DataFrame({"Average Math Score" : spending_math_scores,
                                 "Average Reading Score": spending_reading_scores,
                                 "% Passing Math": spending_passing_math,
                                 "% Passing Reading": spending_passing_reading,
                                 "% Overall Passing Rate": overall_passing_rate})

# Minor data munging
spending_summary = spending_summary[["Average Math Score", 
                                     "Average Reading Score", 
                                     "% Passing Math", "% Passing Reading",
                                     "% Overall Passing Rate"]]

# Display results
spending_summary
Out[9]:
Average Math Score Average Reading Score % Passing Math % Passing Reading % Overall Passing Rate
Spending Ranges (Per Student)
<$585 83.455399 83.933814 93.460096 96.610877 95.035486
$585-615 83.599686 83.885211 94.230858 95.900287 95.065572
$615-645 79.079225 81.891436 75.668212 86.106569 80.887391
$645-675 76.997210 81.027843 66.164813 81.133951 73.649382

Scores by School Size

  • Perform the same operations as above, based on school size.
In [10]:
# Establish the bins 
size_bins = [0, 1000, 2000, 5000]
group_names = ["Small (<1000)", "Medium (1000-2000)", "Large (2000-5000)"]

# Categorize the spending based on the bins
per_school_summary["School Size"] = pd.cut(per_school_summary["Total Students"], size_bins, labels=group_names)

# Calculate the scores based on bins
size_math_scores = per_school_summary.groupby(["School Size"]).mean()["Average Math Score"]
size_reading_scores = per_school_summary.groupby(["School Size"]).mean()["Average Reading Score"]
size_passing_math = per_school_summary.groupby(["School Size"]).mean()["% Passing Math"]
size_passing_reading = per_school_summary.groupby(["School Size"]).mean()["% Passing Reading"]
overall_passing_rate = (size_passing_math + size_passing_reading) / 2

# Assemble into data frame
size_summary = pd.DataFrame({"Average Math Score" : size_math_scores,
                             "Average Reading Score": size_reading_scores,
                             "% Passing Math": size_passing_math,
                             "% Passing Reading": size_passing_reading,
                             "% Overall Passing Rate": overall_passing_rate})

# Minor data munging
size_summary = size_summary[["Average Math Score", 
                             "Average Reading Score", 
                             "% Passing Math", "% Passing Reading",
                             "% Overall Passing Rate"]]

# Display results
size_summary
Out[10]:
Average Math Score Average Reading Score % Passing Math % Passing Reading % Overall Passing Rate
School Size
Small (<1000) 83.821598 83.929843 93.550225 96.099437 94.824831
Medium (1000-2000) 83.374684 83.864438 93.599695 96.790680 95.195187
Large (2000-5000) 77.746417 81.344493 69.963361 82.766634 76.364998

Scores by School Type

  • Perform the same operations as above, based on school type.
In [11]:
# Type | Average Math Score | Average Reading Score | % Passing Math | % Passing Reading | % Overall Passing Rate

type_math_scores = per_school_summary.groupby(["School Type"]).mean()["Average Math Score"]
type_reading_scores = per_school_summary.groupby(["School Type"]).mean()["Average Reading Score"]
type_passing_math = per_school_summary.groupby(["School Type"]).mean()["% Passing Math"]
type_passing_reading = per_school_summary.groupby(["School Type"]).mean()["% Passing Reading"]
overall_passing_rate = (type_passing_math + type_passing_reading) / 2

# Assemble into data frame
type_summary = pd.DataFrame({"Average Math Score" : type_math_scores,
                             "Average Reading Score": type_reading_scores,
                             "% Passing Math": type_passing_math,
                             "% Passing Reading": type_passing_reading,
                             "% Overall Passing Rate": overall_passing_rate})

# Minor data munging
type_summary = type_summary[["Average Math Score", 
                             "Average Reading Score",
                             "% Passing Math",
                             "% Passing Reading",
                             "% Overall Passing Rate"]]

# Display results
type_summary
Out[11]:
Average Math Score Average Reading Score % Passing Math % Passing Reading % Overall Passing Rate
School Type
Charter 83.473852 83.896421 93.620830 96.586489 95.103660
District 76.956733 80.966636 66.548453 80.799062 73.673757
In [ ]: