Final project

Due date: Sunday, May 10th at 11:59 PM.

Description of the default project

The four deliverables of the default project:

  1. Conduct exploratory data analysis (EDA) of data from a clinical trial.

  2. Conduct hypothesis tests to assess the quality of randomization of patients to a drug versus a placebo.

  3. Fit a linear regression model to the number of days a patient survives from the time of registration in the clinical study (n_days).

  4. Fit a logistic regression model to a binary outcome that you construct using n_days.

The default project may be completed in groups of 1-3 students.

  • Teams of 4 or more students will not be approved, so please do not ask for an exception!

  • If you need help finding group members, please open a private post on Ed. The course staff can help with introductions.

Each default project team will prepare a 6-10 page memo summarizing their key findings.

  • Your memo should not contain any code. It should contain only text and visualizations.

  • See the custom project FAQs below for a sense of how you might structure your memo. You can also check out sample project memos on other topics.

Finally, keep in mind that the course staff is aware of past projects from C131A floating around the internet.

  • Copying from an existing project is considered cheating and will be addressed in the manner outlined in the syllabus.

Dataset

The dataset (cholangitis.csv) comes from a randomized clinical trial of the immunosuppressive drug D-penicillamine at the Mayo Clinic.

  • The study consisted of patients living with primary biliary cholangitis, a fatal chronic autoimmune disease of unknown cause affecting the liver. The study lasted about 12 years.

  • Patients were randomly given either D-penicillamine or a placebo, and neither the patient nor the health care providers knew which one the patient received (“double-blinded”).

There are 418 observations of 20 variables, both numeric and categorical.
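As a first sanity check, it can help to confirm the row count and see which columns look numeric and which look categorical. The sketch below is shown in Python using only the standard library; the same idea carries over to RStudio. The helper name summarize_columns, the treatment of "NA" and empty strings as missing, and every column name other than n_days are assumptions made for illustration, and the three sample rows are made up.

```python
import csv
import io


def _is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize_columns(handle):
    """Return (n_rows, {column: 'numeric' | 'categorical'}) for an open CSV file."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    kinds = {}
    for name in reader.fieldnames or []:
        # Assumption: missing values are encoded as "NA" or as empty strings.
        values = [row[name] for row in rows if row[name] not in ("", "NA")]
        is_numeric = bool(values) and all(_is_number(v) for v in values)
        kinds[name] = "numeric" if is_numeric else "categorical"
    return len(rows), kinds


# Made-up rows for demonstration only; "drug" and "age" are hypothetical column names.
sample = io.StringIO(
    "n_days,drug,age\n"
    "400,Placebo,58.7\n"
    "4500,D-penicillamine,56.4\n"
    "1012,NA,70.0\n"
)
n_rows, kinds = summarize_columns(sample)
print(n_rows, kinds)
```

To run this on the real file, pass it an open file handle, for example `with open("cholangitis.csv", newline="") as f: summarize_columns(f)`. If the file matches the description above, it should report 418 rows and 20 columns.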

Main guiding question

Does the effect of the drug D-penicillamine on patients with primary biliary cholangitis differ from the effect of a placebo?

  • All of your analyses should be related to this research question in some way.

Visualizing the data (30% of the default project grade)

Perform exploratory data analysis of the data, using any appropriate tools we have learned. Note any interesting features of the data.

  • As you work on this part, you will produce dozens of plots and tables in RStudio.

  • However, your final submission should include only 3-5 professionally formatted visualizations that each showcase a non-trivial finding from the data.

  • Each visualization should be accompanied by a compelling narrative caption that describes the key takeaways.

  • N.B. Your visualizations alone (i.e., without any other part of the submission) should be able to tell a compelling story about whether the drug D-penicillamine appears to be effective.

Effect estimation and hypothesis testing (40% of the default project grade)

Conduct one hypothesis test for each of the columns in the dataset, excluding the response variable n_days and the treatment variable.

  • Each hypothesis test should assess whether the distribution of the covariate differs between the treatment and control groups.

  • Important: State and discuss the hypotheses you are making about the data and, as a consequence, which testing method you are using.

  • Are your results consistent with fully randomized assignment of the drug? Explain.

  • Make sure to consider the fact that you are testing many different hypotheses at the same time, or, if you prefer, that your overall hypothesis is made up of several sub-hypotheses. This means that the overall probability of making at least one Type I error is much larger than the probability for each hypothesis separately. This is called the problem of multiple testing. Look up online how to compensate for this issue.

  • Make sure to write one or more functions to avoid rewriting too much code. You should not need to copy-paste all of your code for each hypothesis test.

  • You are free to show your hypothesis test results using a visualization and/or a nicely formatted table.
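To illustrate the points above about reusing code and about multiple testing, here is a minimal sketch in Python (standard library only; the same structure carries over to R). It assumes a permutation test on the difference in means as the test for numeric covariates and the Holm step-down procedure as the multiple-testing correction. Both are choices made for this example, not required methods, and categorical covariates would need a different test statistic. The function names, covariate names, and numbers are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_perm):
        # Reassign group labels at random, as they would be under randomization.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_perm + 1)


def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, keyed like the input dict."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    adjusted = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * p))
        adjusted[name] = running_max
    return adjusted


# Made-up values, not taken from cholangitis.csv; covariate names are hypothetical.
covariates = {
    "age": ([50, 61, 47, 58, 66], [52, 59, 49, 63, 60]),
    "bilirubin": ([1.1, 0.8, 3.2, 1.4, 2.0], [0.9, 1.3, 2.7, 1.0, 1.8]),
}
raw = {name: permutation_p_value(t, c) for name, (t, c) in covariates.items()}
adjusted = holm_adjust(raw)
for name in raw:
    print(f"{name}: raw p = {raw[name]:.3f}, Holm-adjusted p = {adjusted[name]:.3f}")
```

Because the test lives in one function, adding another covariate only means adding an entry to the dictionary, and the adjusted p-values can go straight into a results table.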

Model fitting (30% of the default project grade)

Perform (1) a linear regression analysis of the response (n_days) on the explanatory variables, and (2) a logistic regression analysis of a binary outcome that you will construct based on n_days.

  • In your submission, describe whether and how you transformed your data or covariates to fit your models, or excluded any observations, and why.

  • You should use cross-validation to select the best-performing hyperparameters. You are free to choose any error metric to validate on, as long as you motivate your choice. For example, you might choose RMSPE, TPR, TNR, or another metric that you think is most appropriate, given the dataset and your assumptions about the scenario.

  • Based on your models, what conclusion can you draw about the effect of the drug D-penicillamine on patient survival? Does the effect appear to differ across groups? Be sure to draw on the tools of inference in your answer.

  • Hint: The model choices you made contain parameters. What is the meaning of these parameters? Which hypotheses or estimates would express or measure the effect of the drug on the outcome?
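The cross-validation step described above can be sketched as follows, in Python with only the standard library (the same logic carries over to R). This is an illustration under stated assumptions, not a prescribed approach: the model is a one-predictor ridge regression whose penalty is the hyperparameter, RMSPE is read as root mean squared prediction error, and the data are synthetic. All function names and the candidate penalties are hypothetical.

```python
import math
import random


def fit_ridge_1d(xs, ys, penalty):
    """Fit y ~ intercept + slope * x with an L2 penalty on the slope only."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # Assumes sxx + penalty > 0, i.e. x varies or the penalty is positive.
    slope = sxy / (sxx + penalty)
    return y_bar - slope * x_bar, slope


def rmspe(actual, predicted):
    """Root mean squared prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def cross_validate(xs, ys, penalty, k=5, seed=0):
    """Average held-out RMSPE over k folds for one penalty value."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        intercept, slope = fit_ridge_1d(
            [xs[i] for i in train], [ys[i] for i in train], penalty
        )
        predictions = [intercept + slope * xs[i] for i in held_out]
        scores.append(rmspe([ys[i] for i in held_out], predictions))
    return sum(scores) / k


# Synthetic data for demonstration only; not related to cholangitis.csv.
noise = random.Random(1)
xs = [float(x) for x in range(20)]
ys = [3.0 + 2.0 * x + noise.gauss(0.0, 1.0) for x in xs]

candidate_penalties = [0.0, 1.0, 10.0, 100.0]
cv_scores = {lam: cross_validate(xs, ys, lam) for lam in candidate_penalties}
best_penalty = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_penalty)
```

The same loop works with any error metric: swapping rmspe for a function that computes TPR or TNR on a binary outcome changes what "best" means, which is why the choice of metric needs to be motivated.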