We begin with a SQLite database that contains a few tables of specific interest. For this first part, we want the 'date' and 'prcp' columns from the Measurement table. First, we use the datetime module to create an object holding the reference date we want to filter against. (I used a SQLite DB browser to find out what the relevant date would be.) Then we use SQLAlchemy to query the database, filtering by that reference date.
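To make the shape of that query concrete, here is a minimal sketch run against a throwaway in-memory SQLite database. The table name and columns mirror the real Measurement table, but the rows are made up, and `Base.prepare(autoload_with=engine)` is the newer SQLAlchemy 1.4+ spelling of the reflection step (older versions used `reflect=True`).

```python
# Sketch of the date-filtered query, against a toy in-memory database.
import datetime as dt
from sqlalchemy import create_engine, text
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    # A stand-in for the real measurement table (values are invented).
    conn.execute(text(
        "CREATE TABLE measurement (id INTEGER PRIMARY KEY, "
        "station TEXT, date TEXT, prcp FLOAT)"))
    conn.execute(text(
        "INSERT INTO measurement (station, date, prcp) VALUES "
        "('USC00519281', '2016-08-22', 0.00), "
        "('USC00519281', '2016-08-24', 0.15), "
        "('USC00519397', '2017-01-05', 0.02)"))

Base = automap_base()
Base.prepare(autoload_with=engine)  # reflect the tables (SQLAlchemy 1.4+)
Measurement = Base.classes.measurement

session = Session(engine)
prev_year_date = dt.date(2017, 8, 23) - dt.timedelta(days=365)  # 2016-08-23
results = (session.query(Measurement.date, Measurement.prcp)
           .filter(Measurement.date >= str(prev_year_date))
           .all())
print(results)  # only rows on or after 2016-08-23 survive the filter
```

Because the dates are stored as 'YYYY-MM-DD' strings, the string comparison in the filter sorts the same way the dates do, which is why comparing against `str(prev_year_date)` works.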
The output of our SQLAlchemy query is a list of rows, each containing two separate pieces of information that we want to parse. Thankfully (because of automap), each piece retains its 'date' or 'prcp' identifying attribute. So I use a loop that appends each half of every row to its own list (to make sure the data is coming out right, I print every 500th result). Then I combine the two lists into a dictionary. Finally, I use pd.DataFrame() to turn that dictionary into a dataframe, from which I can create the first plot.
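The lists-to-dict-to-dataframe step looks like this in miniature, with a few hard-coded tuples standing in for the real query output (the numbers are invented, and plain index access replaces the attribute access the real rows allow):

```python
# Sketch of turning query rows into two lists, a dict, and a DataFrame.
import pandas as pd

query_results = [("2016-08-24", 0.15), ("2016-08-25", 0.08), ("2016-08-26", 0.00)]

date_list = []
prcp_list = []
for counter, row in enumerate(query_results, start=1):
    date_list.append(row[0])   # the 'date' half of each row
    prcp_list.append(row[1])   # the 'prcp' half of each row
    if counter % 500 == 0:     # spot-check progress, as described above
        print(counter, row)

precip_dict = {"date": date_list, "prcp": prcp_list}
precip_df = pd.DataFrame(precip_dict)
print(precip_df.shape)  # (3, 2)
```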
Note: Since multiple stations report readings on each day, the values must be aggregated by day somehow. I used sum() because it produced a result identical to the objective, but a median would arguably be more representative.
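The difference between the two aggregation choices, on toy numbers rather than the real dataset:

```python
# Collapsing several same-day station readings with sum() vs. median().
import pandas as pd

df = pd.DataFrame({
    "date": ["2016-08-24"] * 3 + ["2016-08-25"] * 2,
    "prcp": [0.1, 0.3, 0.2, 0.0, 0.4],
})
by_day_sum = df.groupby("date")["prcp"].sum()        # total across stations
by_day_median = df.groupby("date")["prcp"].median()  # middle reading per day
print(by_day_sum["2016-08-24"])     # roughly 0.6, the island-wide total
print(by_day_median["2016-08-24"])  # 0.2, a more typical single reading
```

sum() inflates days where many stations reported; median() describes what a typical station saw, which is why the note above leans toward it.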
Note: Once a plot object is created, it can be modified with Matplotlib even if the plot was not created through Matplotlib directly.
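That note in code form: pandas draws the plot, but subsequent `plt.*` calls still apply, because pandas plots onto the current Matplotlib Axes. (The Agg backend is selected here only so the sketch runs headlessly; the data is invented.)

```python
# A pandas-created plot adjusted afterwards through pyplot.
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"date": ["2016-08-24", "2016-08-25"], "prcp": [0.6, 0.2]})
ax = df.plot(x="date", y="prcp", kind="line")  # pandas draws the plot...
plt.ylabel("precipitation")                    # ...then pyplot calls modify it
plt.tick_params(axis="x", labelbottom=False)
plt.title("Date vs Precipitation")
```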
Next I was interested in different information. I used SQLAlchemy queries to find out how many stations are in the Station table and how active each station is (measured by how many rows it contributes). I filtered the dataset down to the most active station, then went through the same process as before to create two separate lists (via a for-loop), a dictionary, and then a dataframe I could use to plot the results.
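The station-activity ranking can be reproduced in pandas once the data is in a dataframe; this sketch uses invented rows in place of the SQLAlchemy `func.count()`/`group_by()` query itself:

```python
# Ranking stations by how many rows each contributes (toy data).
import pandas as pd

df = pd.DataFrame({"station": ["USC00519281"] * 3 + ["USC00519397"],
                   "tobs": [77, 80, 79, 75]})
counts = df["station"].value_counts()  # rows per station, already descending
most_active = counts.index[0]          # the station with the most rows
print(len(counts), most_active)
```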
%matplotlib inline
from matplotlib import style
style.use('fivethirtyeight')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime as dt
# Python SQL toolkit and Object Relational Mapper
import sqlalchemy
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine, func
from sqlalchemy.inspection import inspect
from sqlalchemy import desc
engine = create_engine("sqlite:///Resources/hawaii.sqlite")
# reflect an existing database into a new model
Base = automap_base()
# reflect the tables
Base.prepare(engine, reflect=True)
# We can view all of the classes that automap found
Base.classes.keys()
# Save references to each table
Measurement = Base.classes.measurement
Station = Base.classes.station
# Create our session (link) from Python to the DB
session = Session(engine)
We return to the same SQLite database. This time, we import the entire dataset, building one larger dataframe that contains ALL of the information in the SQL database, and then carry out most of the pruning, munging, etc. on that dataframe.
I'll pull each table individually, then join on the data they share. The common key is the station id, so the join is on Measurement.station and Station.station.
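A sketch of that join, done in pandas after both tables have been pulled into dataframes; the station ids and names here are illustrative stand-ins for the real tables:

```python
# Joining the two tables on their shared station-id column.
import pandas as pd

measurement_df = pd.DataFrame({"station": ["USC00519281", "USC00519397"],
                               "date": ["2016-08-24", "2016-08-24"],
                               "prcp": [0.15, 0.02]})
station_df = pd.DataFrame({"station": ["USC00519281", "USC00519397"],
                           "name": ["Station A", "Station B"]})
joined = pd.merge(measurement_df, station_df, on="station", how="inner")
print(joined.columns.tolist())  # measurement columns plus the station name
```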
prev_year_date = dt.date(2017, 8, 23) - dt.timedelta(days=365)
measurement_full_data = session.query(Measurement.station, Measurement.date, Measurement.prcp, Measurement.tobs).all()
station_full_data = session.query(Station.station, Station.name, Station.latitude, Station.longitude, Station.elevation).all()
meas_station_list = []
meas_date_list = []
meas_precip_list = []
meas_tobs_list = []
counter1 = 0
for row in measurement_full_data:
    counter1 += 1
    meas_station_list.append(row.station)
    meas_date_list.append(row.date)
    meas_precip_list.append(row.prcp)
    meas_tobs_list.append(row.tobs)
measurement_full_dict = {"station": meas_station_list, "date": meas_date_list, "precipitation": meas_precip_list, "temp": meas_tobs_list}
measurement_full_df = pd.DataFrame(measurement_full_dict)
measurement_full_df.head()
measurement_df_mean_by_station_funky = measurement_full_df.groupby("station").mean(numeric_only=True)  # skip the non-numeric 'date' column (required in newer pandas)
measurement_df_mean_by_station = measurement_df_mean_by_station_funky.reset_index()
measurement_df_mean_by_station.head()
measurement_df_median_by_station_funky = measurement_full_df.groupby("station").median(numeric_only=True)
measurement_df_median_by_station = measurement_df_median_by_station_funky.reset_index()
measurement_df_median_by_station.head()
measurement_df_sum_by_station_funky = measurement_full_df.groupby("station").sum(numeric_only=True).drop('temp', axis=1)
measurement_df_sum_by_station = measurement_df_sum_by_station_funky.reset_index()
measurement_df_sum_by_station.head()
measurement_df_mean_by_date_funky = measurement_full_df.groupby("date").mean(numeric_only=True)  # skip the non-numeric 'station' column
measurement_df_mean_by_date = measurement_df_mean_by_date_funky.reset_index()
measurement_df_mean_by_date.head()
measurement_df_median_by_date_funky = measurement_full_df.groupby("date").median(numeric_only=True)
measurement_df_median_by_date = measurement_df_median_by_date_funky.reset_index()
measurement_df_median_by_date.head()
measurement_df_sum_by_date_funky = measurement_full_df.groupby("date").sum(numeric_only=True).drop('temp', axis=1)
measurement_df_sum_by_date = measurement_df_sum_by_date_funky.reset_index()
measurement_df_sum_by_date.head()
measurement_df_mean_by_date['date'] = pd.to_datetime(measurement_df_mean_by_date['date'], format='%Y-%m-%d')
measurement_df_median_by_date['date'] = pd.to_datetime(measurement_df_median_by_date['date'], format='%Y-%m-%d')
measurement_df_sum_by_date['date'] = pd.to_datetime(measurement_df_sum_by_date['date'], format='%Y-%m-%d')
At this point we have the following dataframes to work with:
measurement_full_df
measurement_df_mean_by_station
measurement_df_median_by_station
measurement_df_sum_by_station
measurement_df_mean_by_date (with the 'date' column in proper datetime format)
measurement_df_median_by_date (with the 'date' column in proper datetime format)
measurement_df_sum_by_date (with the 'date' column in proper datetime format)
# Restrict the dataframe to the final year of data (2016-08-23 through 2017-08-23)
measurement_df_sum_by_date = measurement_df_sum_by_date.set_index(['date']).loc['2016-08-23':'2017-08-23']
measurement_df_sum_by_date = measurement_df_sum_by_date.reset_index()
measurement_df_sum_by_date.plot(x="date", y="precipitation", kind="line", title="Date vs Precipitation (Sum)")
plt.tick_params(
axis='x', # changes apply to the x-axis
which='both', # both major and minor ticks are affected
bottom=False, # ticks along the bottom edge are off
top=False, # ticks along the top edge are off
labelbottom=False)
plt.ylabel("precipitation")
measurement_df_median_by_date = measurement_df_median_by_date.set_index(['date']).loc['2016-08-23':'2017-08-23']
measurement_df_median_by_date = measurement_df_median_by_date.reset_index()
measurement_df_median_by_date.plot(x="date", y="precipitation", kind="line", title="Date vs Precipitation (Median)")
plt.tick_params(
axis='x', # changes apply to the x-axis
which='both', # both major and minor ticks are affected
bottom=False, # ticks along the bottom edge are off
top=False, # ticks along the top edge are off
labelbottom=False)
plt.ylabel("precipitation")
measurement_df_mean_by_date = measurement_df_mean_by_date.set_index(['date']).loc['2016-08-23':'2017-08-23']
measurement_df_mean_by_date = measurement_df_mean_by_date.reset_index()
measurement_df_mean_by_date.plot(x="date", y="precipitation", kind="line", title="Date vs Precipitation (Mean)")
plt.tick_params(
axis='x', # changes apply to the x-axis
which='both', # both major and minor ticks are affected
bottom=False, # ticks along the bottom edge are off
top=False, # ticks along the top edge are off
labelbottom=False)
plt.ylabel("precipitation")
# List the stations and the counts in descending order.
measurement_full_df['station'].value_counts()
# How many stations are available in this dataset?
# (each column's non-null count in the per-station dataframe equals the number of stations)
measurement_df_mean_by_station.count()
# Using the station id from the previous query, calculate the lowest temperature,
# highest temperature, and average temperature recorded at the most active station.
USC00519281_median_df = measurement_full_df.loc[measurement_full_df['station'] == 'USC00519281']
USC00519281_snapshot = [USC00519281_median_df['temp'].min(), USC00519281_median_df['temp'].max(), USC00519281_median_df['temp'].median()]
USC00519281_snapshot
measurement_df_mean_temp_by_date = measurement_df_mean_by_date.drop('precipitation', axis=1).set_index(['date']).loc['2016-08-23':'2017-08-23']
# Choose the station with the highest number of temperature observations. (USC00519281)
# Query the last 12 months of temperature observation data for this station and plot the results as a histogram
measurement_df_mean_temp_by_date.plot(kind="hist", bins = 12)
plt.title("Temperature by Station Histogram (mean)")
plt.xlabel("Temperature (in ºF)")
# (df_4 was a leftover from an earlier draft; rebuild the station-filtered
# last-12-months temperature data and plot it directly instead)
USC00519281_last_year = USC00519281_median_df.loc[USC00519281_median_df['date'] >= '2016-08-23']
USC00519281_last_year['temp'].plot(kind="hist", bins=12)
plt.title("Temperature Histogram for USC00519281")
plt.xlabel("Temperature (in ºF)")
station_full_data = session.query(Station.station, Station.name, Station.latitude, Station.longitude, Station.elevation).all()
sta_station_list = []
sta_name_list = []
sta_latitude_list = []
sta_longitude_list = []
sta_elevation_list = []
counter2 = 0
for row in station_full_data:
    counter2 += 1
    sta_station_list.append(row.station)
    sta_name_list.append(row.name)
    sta_latitude_list.append(row.latitude)
    sta_longitude_list.append(row.longitude)
    sta_elevation_list.append(row.elevation)
station_full_dict = {"station": sta_station_list, "name": sta_name_list, "latitude": sta_latitude_list, "longitude": sta_longitude_list, "elevation": sta_elevation_list}
station_full_df = pd.DataFrame(station_full_dict)
station_full_df.head()