I was given two datasets: One about 15 different schools, and one about all of the students in them, including their math and reading scores. The two csv datasets were sized at 1kb and 2,074kb respectively. There was no missing data. I was asked to identify the most effective schools, and provide other insights.
Here, I used Pandas in a Jupyter Notebook. The requirements of this task required me to learn about formatting intermediate datatypes, especially the int64.
As a whole, schools with higher budgets, did not yield better test results. Schools with higher spending per student (645-675 dollars) actually underperformed compared to schools with smaller budgets (585 dollars per student).
Smaller and medium sized schools dramatically out-performed large sized schools on passing math performances (89-91% passing vs 67%).
Charter schools out-performed the public district schools across all metrics.
# Dependencies and Setup
import pandas as pd
import numpy as np
# File to Load
school_data_to_load = "Resources/schools_complete.csv"
student_data_to_load = "Resources/students_complete.csv"
# Read School and Student Data File and store into Pandas Data Frames
school_data = pd.read_csv(school_data_to_load)
student_data = pd.read_csv(student_data_to_load)
# Combine the data into a single dataset
school_data_complete = pd.merge(student_data, school_data, how="left", on=["school_name", "school_name"])
school_data.head()
school_data_complete.head()
Calculate the total number of schools
Calculate the total number of students
Calculate the total budget
Calculate the average math score
Calculate the average reading score
Calculate the overall passing rate (overall average score), i.e. (avg. math score + avg. reading score)/2
Calculate the percentage of students with a passing math score (70 or greater)
Calculate the percentage of students with a passing reading score (70 or greater)
Create a dataframe to hold the above results
Optional: give the displayed data cleaner formatting
Total_Schools_int = len(school_data_complete['school_name'].unique())
Total_Students_int = len(school_data_complete['student_name'])
Total_Budget_int = sum(school_data_complete['budget'].unique())
Average_Math_int = school_data_complete['math_score'].mean()
Average_Reading_int= school_data_complete['reading_score'].mean()
Passing_Math_count_float = school_data_complete[(school_data_complete["math_score"] >= 70)].count()["student_name"]
Passing_Math_percentage_float = (Passing_Math_count_float / float(Total_Students_int)) * 100
Passing_Reading_count_float = school_data_complete[(school_data_complete["reading_score"] >= 70)].count()["student_name"]
Passing_Reading_percentage_float = (Passing_Reading_count_float / float(Total_Students_int)) * 100
Overall_Passing_percentage_float = (Passing_Math_percentage_float + Passing_Reading_percentage_float)/2
df1 = pd.DataFrame({
"Total Schools": [Total_Schools_int],
"Total Students": [Total_Students_int],
"Total Budget": [Total_Budget_int],
"Average Math Score": [Average_Math_int],
"Average Reading Score": [Average_Reading_int],
"% Passing Math": [Passing_Math_percentage_float],
"% Passing Reading": [Passing_Reading_percentage_float],
"% Overall Passing Rate": [Overall_Passing_percentage_float]
})
df1['Total Students'] = df1['Total Students'].map("{:,}".format)
df1['Total Budget'] = df1['Total Budget'].map("${:,.2f}".format)
df1
Create an overview table that summarizes key metrics about each school, including:
Create a dataframe to hold the above results
school_types = school_data.set_index(["school_name"])["type"]
per_school_counts = school_data_complete["school_name"].value_counts()
per_school_budget = school_data_complete.groupby(["school_name"]).mean()["budget"]
per_school_capita = per_school_budget / per_school_counts
per_school_math = school_data_complete.groupby(["school_name"]).mean()["math_score"]
per_school_reading = school_data_complete.groupby(["school_name"]).mean()["reading_score"]
school_passing_math = school_data_complete[(school_data_complete["math_score"] >= 70)]
school_passing_reading = school_data_complete[(school_data_complete["reading_score"] >= 70)]
per_school_passing_math = school_passing_math.groupby(["school_name"]).count()["student_name"] / per_school_counts * 100
per_school_passing_reading = school_passing_reading.groupby(["school_name"]).count()["student_name"] / per_school_counts * 100
overall_passing_rate = (per_school_passing_math + per_school_passing_reading) / 2
per_school_summary = pd.DataFrame({"School Type": school_types,
"Total Students": per_school_counts,
"Total School Budget": per_school_budget,
"Per Student Budget": per_school_capita,
"Average Math Score": per_school_math,
"Average Reading Score": per_school_reading,
"% Passing Math": per_school_passing_math,
"% Passing Reading": per_school_passing_reading,
"% Overall Passing Rate": overall_passing_rate})
per_school_summary = per_school_summary[["School Type", "Total Students", "Total School Budget", "Per Student Budget",
"Average Math Score", "Average Reading Score",
"% Passing Math", "% Passing Reading",
"% Overall Passing Rate"]]
per_school_summary["Total School Budget"] = per_school_summary["Total School Budget"].map("${:,.2f}".format)
per_school_summary["Per Student Budget"] = per_school_summary["Per Student Budget"].map("${:,.2f}".format)
per_school_summary
top_schools = per_school_summary.sort_values(["% Overall Passing Rate"], ascending=False)
top_schools.head(5)
bottom_schools = per_school_summary.sort_values(["% Overall Passing Rate"], ascending=True)
bottom_schools.head(5)
Create a table that lists the average Reading Score for students of each grade level (9th, 10th, 11th, 12th) at each school.
Create a pandas series for each grade. Hint: use a conditional statement.
Group each series by school
Combine the series into a dataframe
Optional: give the displayed data cleaner formatting
# Create data series of scores by grade levels using conditionals
ninth_graders = school_data_complete[(school_data_complete["grade"] == "9th")]
tenth_graders = school_data_complete[(school_data_complete["grade"] == "10th")]
eleventh_graders = school_data_complete[(school_data_complete["grade"] == "11th")]
twelfth_graders = school_data_complete[(school_data_complete["grade"] == "12th")]
# Group each by school name
ninth_graders_scores = ninth_graders.groupby(["school_name"]).mean()["math_score"]
tenth_graders_scores = tenth_graders.groupby(["school_name"]).mean()["math_score"]
eleventh_graders_scores = eleventh_graders.groupby(["school_name"]).mean()["math_score"]
twelfth_graders_scores = twelfth_graders.groupby(["school_name"]).mean()["math_score"]
# Combine series into single data frame
scores_by_grade = pd.DataFrame({"9th": ninth_graders_scores, "10th": tenth_graders_scores,
"11th": eleventh_graders_scores, "12th": twelfth_graders_scores})
# Minor data munging
scores_by_grade = scores_by_grade[["9th", "10th", "11th", "12th"]]
scores_by_grade.index.name = None
# Display the data frame
scores_by_grade
# Create data series of scores by grade levels using conditionals
ninth_graders = school_data_complete[(school_data_complete["grade"] == "9th")]
tenth_graders = school_data_complete[(school_data_complete["grade"] == "10th")]
eleventh_graders = school_data_complete[(school_data_complete["grade"] == "11th")]
twelfth_graders = school_data_complete[(school_data_complete["grade"] == "12th")]
# Group each by school name
ninth_graders_scores = ninth_graders.groupby(["school_name"]).mean()["reading_score"]
tenth_graders_scores = tenth_graders.groupby(["school_name"]).mean()["reading_score"]
eleventh_graders_scores = eleventh_graders.groupby(["school_name"]).mean()["reading_score"]
twelfth_graders_scores = twelfth_graders.groupby(["school_name"]).mean()["reading_score"]
# Combine series into single data frame
scores_by_grade = pd.DataFrame({"9th": ninth_graders_scores, "10th": tenth_graders_scores,
"11th": eleventh_graders_scores, "12th": twelfth_graders_scores})
# Minor data munging
scores_by_grade = scores_by_grade[["9th", "10th", "11th", "12th"]]
scores_by_grade.index.name = None
# Display the data frame
scores_by_grade
# Establish the bins
spending_bins = [0, 585, 615, 645, 675]
group_names = ["<$585", "$585-615", "$615-645", "$645-675"]
# Categorize the spending based on the bins
per_school_summary["Spending Ranges (Per Student)"] = pd.cut(per_school_capita, spending_bins, labels=group_names)
spending_math_scores = per_school_summary.groupby(["Spending Ranges (Per Student)"]).mean()["Average Math Score"]
spending_reading_scores = per_school_summary.groupby(["Spending Ranges (Per Student)"]).mean()["Average Reading Score"]
spending_passing_math = per_school_summary.groupby(["Spending Ranges (Per Student)"]).mean()["% Passing Math"]
spending_passing_reading = per_school_summary.groupby(["Spending Ranges (Per Student)"]).mean()["% Passing Reading"]
overall_passing_rate = (spending_passing_math + spending_passing_reading) / 2
# Assemble into data frame
spending_summary = pd.DataFrame({"Average Math Score" : spending_math_scores,
"Average Reading Score": spending_reading_scores,
"% Passing Math": spending_passing_math,
"% Passing Reading": spending_passing_reading,
"% Overall Passing Rate": overall_passing_rate})
# Minor data munging
spending_summary = spending_summary[["Average Math Score",
"Average Reading Score",
"% Passing Math", "% Passing Reading",
"% Overall Passing Rate"]]
# Display results
spending_summary
# Establish the bins
size_bins = [0, 1000, 2000, 5000]
group_names = ["Small (<1000)", "Medium (1000-2000)", "Large (2000-5000)"]
# Categorize the spending based on the bins
per_school_summary["School Size"] = pd.cut(per_school_summary["Total Students"], size_bins, labels=group_names)
# Calculate the scores based on bins
size_math_scores = per_school_summary.groupby(["School Size"]).mean()["Average Math Score"]
size_reading_scores = per_school_summary.groupby(["School Size"]).mean()["Average Reading Score"]
size_passing_math = per_school_summary.groupby(["School Size"]).mean()["% Passing Math"]
size_passing_reading = per_school_summary.groupby(["School Size"]).mean()["% Passing Reading"]
overall_passing_rate = (size_passing_math + size_passing_reading) / 2
# Assemble into data frame
size_summary = pd.DataFrame({"Average Math Score" : size_math_scores,
"Average Reading Score": size_reading_scores,
"% Passing Math": size_passing_math,
"% Passing Reading": size_passing_reading,
"% Overall Passing Rate": overall_passing_rate})
# Minor data munging
size_summary = size_summary[["Average Math Score",
"Average Reading Score",
"% Passing Math", "% Passing Reading",
"% Overall Passing Rate"]]
# Display results
size_summary
# Type | Average Math Score | Average Reading Score | % Passing Math | % Passing Reading | % Overall Passing Rate
type_math_scores = per_school_summary.groupby(["School Type"]).mean()["Average Math Score"]
type_reading_scores = per_school_summary.groupby(["School Type"]).mean()["Average Reading Score"]
type_passing_math = per_school_summary.groupby(["School Type"]).mean()["% Passing Math"]
type_passing_reading = per_school_summary.groupby(["School Type"]).mean()["% Passing Reading"]
overall_passing_rate = (type_passing_math + type_passing_reading) / 2
# Assemble into data frame
type_summary = pd.DataFrame({"Average Math Score" : type_math_scores,
"Average Reading Score": type_reading_scores,
"% Passing Math": type_passing_math,
"% Passing Reading": type_passing_reading,
"% Overall Passing Rate": overall_passing_rate})
# Minor data munging
type_summary = type_summary[["Average Math Score",
"Average Reading Score",
"% Passing Math",
"% Passing Reading",
"% Overall Passing Rate"]]
# Display results
type_summary