Summary


I was given two datasets from mouse trials of anti-cancer drugs. The data is similar to what a medical research lab would produce. I was asked to interpret the data and to reproduce exactly three line plots and one custom bar chart.


I used pandas and Matplotlib in a Jupyter Notebook.

Solution


Pymaceuticals Inc.


Analysis

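  • Overall, Capomulin clearly outperforms the other treatments highlighted in the plots (Infubinol, Ketapril, and Placebo).
  • Of the four plotted treatments, Capomulin was the only one to reduce tumor volume. It held to a 19% reduction in tumor volume over the course of the trial, whereas Infubinol, Ketapril, and Placebo were associated with increases of roughly 46-57%. Across the full screen, Ramicane also reduced tumor volume (by about 22%), and the remaining drugs showed increases of roughly 40-60%.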
  • Capomulin greatly limited the spread of the tumor compared to other treatment options. By study end, the average mouse on Capomulin had only 1 new metastatic site, as opposed to the average 2-3 found in mice of other treatment options.
  • Lastly, mice on the Capomulin treatment had the highest survival rate of any treatment in the screen. Over 90% of mice treated by Capomulin survived the full duration of the trial, compared to only 35-45% of mice on other treatment options.
In [1]:
# Dependencies and Setup
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Hide warning messages in notebook
import warnings
warnings.filterwarnings('ignore')

# File to Load (Remember to Change These)
mouse_drug_data = pd.read_csv("data/mouse_drug_data.csv")
clinical_trial_data = pd.read_csv("data/clinicaltrial_data.csv")
# Left-join the drug assignment onto every clinical measurement, matching on the mouse identifier
df = pd.merge(clinical_trial_data, mouse_drug_data, how="left", on="Mouse ID")
df.head()
Out[1]:
Mouse ID Timepoint Tumor Volume (mm3) Metastatic Sites Drug
0 b128 0 45.0 0 Capomulin
1 f932 0 45.0 0 Ketapril
2 g107 0 45.0 0 Ketapril
3 a457 0 45.0 0 Ketapril
4 c819 0 45.0 0 Ketapril
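
A left merge silently repeats rows when a key appears more than once in the right-hand table. The sketch below shows one way to guard against that with the `validate` argument of `pd.merge`. It assumes that each mouse is meant to be assigned to exactly one drug, which the write-up does not state explicitly, and it uses small hypothetical frames (`trial`, `drugs`, `dup_drugs`) rather than the real files.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (not the real data).
trial = pd.DataFrame({"Mouse ID": ["m1", "m1", "m2"], "Timepoint": [0, 5, 0]})
drugs = pd.DataFrame({"Mouse ID": ["m1", "m2"], "Drug": ["A", "B"]})

# validate="many_to_one" asserts that "Mouse ID" is unique in the right-hand table.
merged = pd.merge(trial, drugs, how="left", on="Mouse ID", validate="many_to_one")

# A drug table that lists the same mouse twice makes the same call fail loudly.
dup_drugs = pd.DataFrame({"Mouse ID": ["m1", "m1", "m2"], "Drug": ["A", "B", "B"]})
try:
    pd.merge(trial, dup_drugs, how="left", on="Mouse ID", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

With the check in place, the merged frame has exactly one row per clinical measurement, so later counts per drug are not inflated by duplicated keys.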

Tumor Response to Treatment

  • We are tasked with creating a time series line plot that tracks tumor volume mean with error bars. To do this, we must obtain means and standard errors for each drug at each timepoint.
  • First, we use groupby() on drug and timepoint to get one mean and one standard error per drug per timepoint.
  • Next, we must munge the data so that each column represents a drug and each row represents a timepoint ("wide format").
  • Finally, we must generate the plot for the drugs that have been pre-specified as important (namely, Capomulin, Infubinol, Ketapril, and placebo).
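
As a minimal sketch of the first two steps, the example below runs the same groupby on a tiny hypothetical table (`toy`, with made-up drugs "A" and "B") and checks that the standard error pandas reports is the sample standard deviation divided by the square root of the group size. The values are illustrative only and are not taken from the trial data.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: two mice per drug per timepoint.
toy = pd.DataFrame({
    "Drug": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Timepoint": [0, 0, 5, 5, 0, 0, 5, 5],
    "Tumor Volume (mm3)": [45.0, 45.0, 44.0, 46.0, 45.0, 45.0, 47.0, 49.0],
})

# Select the numeric column first, then aggregate per (drug, timepoint) group.
grouped = toy.groupby(["Drug", "Timepoint"])["Tumor Volume (mm3)"]
toy_mean = grouped.mean()
toy_sem = grouped.sem()

# Standard error of the mean = sample standard deviation (ddof=1) / sqrt(n).
manual_sem = grouped.std() / np.sqrt(grouped.count())

# Wide format: one row per timepoint, one column per drug.
toy_mean_wide = toy_mean.unstack("Drug")
```

Here `unstack("Drug")` does the same job as the `reset_index()` plus `pivot()` combination used in the cells that follow; either route should give the same wide table.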
In [2]:
# Store the Mean Tumor Volume Data Grouped by Drug and Timepoint 
# (select the column before aggregating so the non-numeric "Mouse ID" column is never averaged)
tumor_vols_mean = df.groupby(["Drug", "Timepoint"])["Tumor Volume (mm3)"].mean()
# Convert to DataFrame
tumor_vols_mean_df = pd.DataFrame(tumor_vols_mean)
tumor_vols_mean_df = tumor_vols_mean_df.reset_index()
# Preview DataFrame
tumor_vols_mean_df.head()
Out[2]:
Drug Timepoint Tumor Volume (mm3)
0 Capomulin 0 45.000000
1 Capomulin 5 44.266086
2 Capomulin 10 43.084291
3 Capomulin 15 42.064317
4 Capomulin 20 40.716325
In [3]:
# Store the Standard Error of Tumor Volumes Grouped by Drug and Timepoint
tumor_vols_se = df.groupby(["Drug", "Timepoint"])["Tumor Volume (mm3)"].sem()
# Convert to DataFrame
tumor_vols_se_df = pd.DataFrame(tumor_vols_se)
tumor_vols_se_df = tumor_vols_se_df.reset_index()
# Preview DataFrame
tumor_vols_se_df.head()
Out[3]:
Drug Timepoint Tumor Volume (mm3)
0 Capomulin 0 0.000000
1 Capomulin 5 0.448593
2 Capomulin 10 0.702684
3 Capomulin 15 0.838617
4 Capomulin 20 0.909731
In [4]:
# Convert data from long to wide format
tumor_vols_mean_df_wide = tumor_vols_mean_df.pivot(index="Timepoint", columns="Drug")["Tumor Volume (mm3)"]
tumor_vols_se_df_wide = tumor_vols_se_df.pivot(index="Timepoint", columns="Drug")["Tumor Volume (mm3)"]
# Preview that Reformatting worked
tumor_vols_mean_df_wide.head()
Out[4]:
Drug Capomulin Ceftamin Infubinol Ketapril Naftisol Placebo Propriva Ramicane Stelasyn Zoniferol
Timepoint
0 45.000000 45.000000 45.000000 45.000000 45.000000 45.000000 45.000000 45.000000 45.000000 45.000000
5 44.266086 46.503051 47.062001 47.389175 46.796098 47.125589 47.248967 43.944859 47.527452 46.851818
10 43.084291 48.285125 49.403909 49.582269 48.694210 49.423329 49.101541 42.531957 49.463844 48.689881
15 42.064317 50.094055 51.296397 52.399974 50.933018 51.359742 51.067318 41.495061 51.529409 50.779059
20 40.716325 52.157049 53.197691 54.920935 53.644087 54.364417 53.346737 40.238325 54.067395 53.170334
In [5]:
# Generate the Plot (with Error Bars)
# Since we set the index to timepoint, we can use that as our x value.
# Pass label= to each series so that plt.legend() below has entries to show.
plt.errorbar(tumor_vols_mean_df_wide.index, tumor_vols_mean_df_wide["Capomulin"], yerr=tumor_vols_se_df_wide["Capomulin"], label="Capomulin", color="r", marker="o", markersize=5, linestyle="dashed", linewidth=0.50)
plt.errorbar(tumor_vols_mean_df_wide.index, tumor_vols_mean_df_wide["Infubinol"], yerr=tumor_vols_se_df_wide["Infubinol"], label="Infubinol", color="b", marker="^", markersize=5, linestyle="dashed", linewidth=0.50)
plt.errorbar(tumor_vols_mean_df_wide.index, tumor_vols_mean_df_wide["Ketapril"], yerr=tumor_vols_se_df_wide["Ketapril"], label="Ketapril", color="g", marker="s", markersize=5, linestyle="dashed", linewidth=0.50)
plt.errorbar(tumor_vols_mean_df_wide.index, tumor_vols_mean_df_wide["Placebo"], yerr=tumor_vols_se_df_wide["Placebo"], label="Placebo", color="k", marker="d", markersize=5, linestyle="dashed", linewidth=0.50)

plt.title("Tumor Response to Treatment")
plt.ylabel("Tumor Volume (mm3)")
plt.xlabel("Time (Days)")
plt.grid(True)
plt.legend(loc="best", fontsize="small", fancybox=True)
# Save the Figure
plt.savefig("analysis/Fig1.png")

# Show the Figure
plt.show()
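
The four `plt.errorbar` calls above differ only in the drug name, color, and marker, so they could also be written as a loop over a small style table. The sketch below illustrates the idea on hypothetical wide-format frames (`means`, `sems`) with made-up values; the `styles` dictionary is an illustrative name, not part of the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical wide-format tables: index = timepoint, columns = drug.
means = pd.DataFrame({"Capomulin": [45.0, 44.0], "Placebo": [45.0, 47.0]}, index=[0, 5])
sems = pd.DataFrame({"Capomulin": [0.0, 0.5], "Placebo": [0.0, 0.4]}, index=[0, 5])

# One (color, marker) pair per drug, mirroring the styling used above.
styles = {"Capomulin": ("r", "o"), "Placebo": ("k", "d")}

fig, ax = plt.subplots()
for drug, (color, marker) in styles.items():
    # label= is what makes each series appear in the legend.
    ax.errorbar(means.index, means[drug], yerr=sems[drug], label=drug,
                color=color, marker=marker, markersize=5,
                linestyle="dashed", linewidth=0.50)
ax.legend(loc="best", fontsize="small", fancybox=True)
```

Adding a drug to the plot then means adding one entry to `styles` rather than copying another long line.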

Metastatic Response to Treatment

  • This ask is identical to the previous one, except that the variable of interest is the number of metastatic sites. It is treated the same way: we take the mean and the standard error for each drug at each timepoint.
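
Because only the column of interest changes, the group-then-reshape steps could be wrapped in a small helper. The sketch below is one possible shape for it; `summarize_wide` is a hypothetical name that does not appear in the notebook, and the `toy` frame holds made-up values.

```python
import pandas as pd

def summarize_wide(frame, value_col):
    """Return (mean, sem) of value_col as wide tables: rows = timepoints, columns = drugs."""
    grouped = frame.groupby(["Drug", "Timepoint"])[value_col]
    return grouped.mean().unstack("Drug"), grouped.sem().unstack("Drug")

# Hypothetical data: two mice per drug at a single timepoint.
toy = pd.DataFrame({
    "Drug": ["A", "A", "B", "B"],
    "Timepoint": [5, 5, 5, 5],
    "Metastatic Sites": [0, 1, 1, 3],
})

met_mean, met_sem = summarize_wide(toy, "Metastatic Sites")
```

The same call with `"Tumor Volume (mm3)"` as the second argument would produce the tables used for the tumor volume plot.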
In [6]:
# Store the Mean Met. Site Data Grouped by Drug and Timepoint 
metastatic_response_mean = df.groupby(["Drug", "Timepoint"])["Metastatic Sites"].mean()
# Convert to DataFrame
metastatic_response_mean_df = pd.DataFrame(metastatic_response_mean)
# Preview DataFrame
metastatic_response_mean_df.head()
Out[6]:
Metastatic Sites
Drug Timepoint
Capomulin 0 0.000000
5 0.160000
10 0.320000
15 0.375000
20 0.652174
In [7]:
# Store the Standard Error associated with Met. Sites Grouped by Drug and Timepoint 
metastatic_response_se = df.groupby(["Drug", "Timepoint"])["Metastatic Sites"].sem()

# Convert to DataFrame
metastatic_response_se_df = pd.DataFrame(metastatic_response_se)
# Preview DataFrame
metastatic_response_se_df.head()
Out[7]:
Metastatic Sites
Drug Timepoint
Capomulin 0 0.000000
5 0.074833
10 0.125433
15 0.132048
20 0.161621
In [8]:
# Minor Data Munging to Re-Format the Data Frames
metastatic_response_mean_df2 = metastatic_response_mean_df.reset_index()
metastatic_response_mean_df_wide = metastatic_response_mean_df2.pivot(index="Timepoint", columns="Drug")["Metastatic Sites"]

metastatic_response_se_df2 = metastatic_response_se_df.reset_index()
metastatic_response_se_df_wide = metastatic_response_se_df2.pivot(index="Timepoint", columns="Drug")["Metastatic Sites"]

# Preview that Reformatting worked
metastatic_response_mean_df_wide.head()
Out[8]:
Drug Capomulin Ceftamin Infubinol Ketapril Naftisol Placebo Propriva Ramicane Stelasyn Zoniferol
Timepoint
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.160000 0.380952 0.280000 0.304348 0.260870 0.375000 0.320000 0.120000 0.240000 0.166667
10 0.320000 0.600000 0.666667 0.590909 0.523810 0.833333 0.565217 0.250000 0.478261 0.500000
15 0.375000 0.789474 0.904762 0.842105 0.857143 1.250000 0.764706 0.333333 0.782609 0.809524
20 0.652174 1.111111 1.050000 1.210526 1.150000 1.526316 1.000000 0.347826 0.952381 1.294118
In [9]:
# Generate the Plot (with Error Bars); label= gives plt.legend() its entries
plt.errorbar(metastatic_response_mean_df_wide.index, metastatic_response_mean_df_wide["Capomulin"], yerr=metastatic_response_se_df_wide["Capomulin"], label="Capomulin", color="r", marker="o", markersize=5, linestyle="dashed", linewidth=0.50)
plt.errorbar(metastatic_response_mean_df_wide.index, metastatic_response_mean_df_wide["Infubinol"], yerr=metastatic_response_se_df_wide["Infubinol"], label="Infubinol", color="b", marker="^", markersize=5, linestyle="dashed", linewidth=0.50)
plt.errorbar(metastatic_response_mean_df_wide.index, metastatic_response_mean_df_wide["Ketapril"], yerr=metastatic_response_se_df_wide["Ketapril"], label="Ketapril", color="g", marker="s", markersize=5, linestyle="dashed", linewidth=0.50)
plt.errorbar(metastatic_response_mean_df_wide.index, metastatic_response_mean_df_wide["Placebo"], yerr=metastatic_response_se_df_wide["Placebo"], label="Placebo", color="k", marker="d", markersize=5, linestyle="dashed", linewidth=0.50)

plt.title("Metastatic Spread During Treatment")
plt.ylabel("Met. Sites")
plt.xlabel("Time (Days)")
plt.grid(True)
plt.legend(loc="best", fontsize="small", fancybox=True)
# Save the Figure
plt.savefig("analysis/Fig2.png")

# Show the Figure
plt.show()

Survival Rates

  • This ask is similar to the previous two, with a couple of differences.
  • We need to count the measurements at each timepoint (fewer measurements means fewer surviving mice).
  • We need to convert each count to a proportion by dividing it by the number of mice at the start of the trial.
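
The proportion step can be sketched without hard-coding the starting group size: dividing the wide count table by its first row scales every drug by its own timepoint-0 count. This matters when groups do not all start at the same size (in Out[11], Propriva and Stelasyn start at 26 rather than 25). The `counts_wide` frame below is hypothetical and uses made-up drugs "A" and "B".

```python
import pandas as pd

# Hypothetical surviving-mouse counts: rows = timepoints, columns = drugs.
counts_wide = pd.DataFrame(
    {"A": [25, 24, 20], "B": [26, 25, 13]},
    index=pd.Index([0, 5, 10], name="Timepoint"),
)

# Dividing by the first row aligns on column names, so each drug is
# scaled by its own starting count.
survival_pct = 100 * counts_wide / counts_wide.iloc[0]
```

Every column of `survival_pct` starts at 100, which makes curves for groups of different sizes directly comparable.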
In [10]:
# Store the Count of Mice Grouped by Drug and Timepoint (any fully populated column can be counted)
mice_still_alive = df.groupby(["Drug", "Timepoint"])["Tumor Volume (mm3)"].count()
# Convert to DataFrame
mice_still_alive_df = pd.DataFrame(mice_still_alive)


# Note: Resetting the index turns "Drug" and "Timepoint" back into ordinary columns, repeating the drug name on every row.
# Otherwise, the frame keeps the (Drug, Timepoint) MultiIndex produced by groupby.
mice_still_alive_df.head().reset_index()
Out[10]:
Drug Timepoint Tumor Volume (mm3)
0 Capomulin 0 25
1 Capomulin 5 25
2 Capomulin 10 25
3 Capomulin 15 24
4 Capomulin 20 23
In [11]:
# Minor Data Munging to Re-Format the Data Frames
mice_still_alive_df2 = mice_still_alive_df.reset_index()
mice_still_alive_df_wide = mice_still_alive_df2.pivot(index="Timepoint", columns="Drug")["Tumor Volume (mm3)"]
# Preview the Data Frame
mice_still_alive_df_wide.head()
Out[11]:
Drug Capomulin Ceftamin Infubinol Ketapril Naftisol Placebo Propriva Ramicane Stelasyn Zoniferol
Timepoint
0 25 25 25 25 25 25 26 25 26 25
5 25 21 25 23 23 24 25 25 25 24
10 25 20 21 22 21 24 23 24 23 22
15 24 19 21 19 21 20 17 24 23 21
20 23 18 20 19 20 19 17 23 21 17
In [12]:
# Generate the Plot (Accounting for percentages)
# Each of these four drugs starts with 25 mice (see Out[11]); label= gives plt.legend() its entries
plt.plot(100 * mice_still_alive_df_wide["Capomulin"] / 25, "ro", linestyle="dashed", markersize=5, linewidth=0.50, label="Capomulin")
plt.plot(100 * mice_still_alive_df_wide["Infubinol"] / 25, "b^", linestyle="dashed", markersize=5, linewidth=0.50, label="Infubinol")
plt.plot(100 * mice_still_alive_df_wide["Ketapril"] / 25, "gs", linestyle="dashed", markersize=5, linewidth=0.50, label="Ketapril")
plt.plot(100 * mice_still_alive_df_wide["Placebo"] / 25, "kd", linestyle="dashed", markersize=6, linewidth=0.50, label="Placebo")
plt.title("Mice Survival Rates During Treatment")
plt.ylabel("Survival Rate (%)")
plt.xlabel("Time (Days)")
plt.grid(True)
plt.legend(loc="best", fontsize="small", fancybox=True)

# Save the Figure
plt.savefig("analysis/Fig3.png")

# Show the Figure
plt.show()

Summary Bar Graph

  • This ask requires calculating the difference between the first and last values for each drug as a percentage.
  • Then, we must convert the answers into a tuple, which can be used in conjunction with user-defined functions to produce the desired graph.
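
The percentage calculation and the label formatting can be sketched in plain Python. The helper names `percent_change` and `format_label` are hypothetical, and the volumes passed in are made-up round numbers rather than values from the trial.

```python
def percent_change(first, last):
    """Change from first to last, expressed as a percentage of first."""
    return 100 * (last - first) / first

def format_label(pct):
    """Format a percentage as a whole-number label; int() truncates toward zero."""
    return "%d%%" % int(pct)

# Hypothetical example: a mean volume falling from 45.0 to 36.0 mm3.
change = percent_change(45.0, 36.0)

shrink_label = format_label(change)
growth_label = format_label(57.028795)
```

Because `int()` truncates toward zero, a change of -19.475303 would be labeled "-19%" and a change of 57.028795 would be labeled "57%".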
In [13]:
# Calculate the percent changes for each drug
tumor_pct_change =  100 * (tumor_vols_mean_df_wide.iloc[-1] - tumor_vols_mean_df_wide.iloc[0]) / tumor_vols_mean_df_wide.iloc[0]
# Display the data to confirm
tumor_pct_change
Out[13]:
Drug
Capomulin   -19.475303
Ceftamin     42.516492
Infubinol    46.123472
Ketapril     57.028795
Naftisol     53.923347
Placebo      51.297960
Propriva     47.241175
Ramicane    -22.320900
Stelasyn     52.085134
Zoniferol    46.579751
dtype: float64
In [14]:
# Store all Relevant Percent Changes into a Tuple
pct_changes = (tumor_pct_change["Capomulin"], 
               tumor_pct_change["Infubinol"], 
               tumor_pct_change["Ketapril"], 
               tumor_pct_change["Placebo"])

# Split the data between passing and failing drugs
fig, ax = plt.subplots()
ind = np.arange(len(pct_changes))  
width = 1
rectsPass = ax.bar(ind[0], pct_changes[0], width, color='green')
rectsFail = ax.bar(ind[1:], pct_changes[1:], width, color='red')

# Orient widths. Add labels, tick marks, etc. 
ax.set_ylabel('% Tumor Volume Change')
ax.set_title('Tumor Change Over 45 Day Treatment')
ax.set_xticks(ind)  # bars are center-aligned by default in current Matplotlib, so ticks go at the bar positions
ax.set_xticklabels(('Capomulin', 'Infubinol', 'Ketapril', 'Placebo'))
ax.set_autoscaley_on(False)
ax.set_ylim([-30,70])
ax.grid(True)

# Use functions to label the percentages of changes
def autolabelFail(rects):
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 3,
                '%d%%' % int(height),
                ha='center', va='bottom', color="white")
        
def autolabelPass(rects):
    for rect in rects:
        height = rect.get_height()
        # abs() keeps a single minus sign whether get_height() reports
        # the height of a downward bar as negative or positive.
        ax.text(rect.get_x() + rect.get_width()/2., -8,
                '-%d%%' % abs(int(height)),
                ha='center', va='bottom', color="white")

# Label the passing and failing bars
autolabelPass(rectsPass)
autolabelFail(rectsFail)

# Save the Figure
fig.savefig("analysis/Fig4.png")

# Show the Figure
plt.show()