Final project

Due date: Sunday, December 15 at midnight.

Options

You have two choices for the STAT 131A final project:

  • The default project. The default project is a lot like an open-ended homework assignment. You will use tools from the course to conduct an open-ended exploration of a cleaned dataset.

  • A custom project. The custom project is an opportunity to explore a topic of your choosing. You will need to identify and clean your own dataset(s), and you will have more flexibility in the tools you use to analyze the data.

The total expected time required for a proficient grade on the default project versus a custom project is the same.

  • However, a custom project may require more time devoted to formulating a research question and identifying/cleaning data, while the default project may require more time devoted to coding.

  • If you want to produce an “above and beyond” project that can be showcased on your resume or personal website, or a project that can be described in a future letter of recommendation, we recommend that you consider a custom project.

  • But, please keep in mind that you can still receive up to 100% on the final project grade by completing the default project.

Description of the default project

The four deliverables of the default project:

  1. Conduct extensive exploratory data analysis (EDA) of data from a clinical trial

  2. Conduct hypothesis tests to assess the quality of randomization of patients to a drug versus a placebo.

  3. Fit a linear regression model to the number of days a patient survives from the time of registration in the clinical study (n_days).

  4. Fit a logistic regression model to a binary outcome that you construct using n_days.

The default project may be completed in groups of 1-3 students.

  • Teams of 4 or more students will not be approved, so please do not ask for an exception!

  • If you need help finding group members, please open a private post on Ed. The course staff can help with introductions.

Each default project team will prepare a 8-10 page memo summarizing their key findings.

  • Your memo should not contain any code. It should contain only text and visualizations.

  • See the custom project FAQs below for a sense of how you might structure your memo. You can also check out sample project memos on other topics.

  • Keep in mind that the custom project memo is only 3 pages, but custom project teams must also record a 5-7 minute lightning talk of their findings.

Finally, keep in mind that the course staff is aware of past projects from 131A floating around the internet.

  • Copying from an existing project is considered cheating, and will be addressed in a manner outlined in the syllabus.

Dataset

The dataset (cholangitis.csv) comes from a randomized, clinical trial of the immunosuppressive drug D-penicillamine at the Mayo Clinic.

  • The study consisted of patients living with primary biliary cholangitis, a fatal chronic autoimmune disease of unknown cause affecting the liver. The study lasted about 12 years.

  • Patients were randomly given either D-penicillamine or a placebo, and both the patient and the health care providers were unaware as to which the patient received (“double-blinded”).

There are 418 observations of 20 variables, both numeric and categorical.

Main guiding question

Does the effect of the drug D-penicillamine on patients with primary biliary cholangitis differ from the effect of a placebo?

  • All of your analyses should be related to this research question in some way.

Visualizing the data (50% of the default project grade)

Perform exploratory data analysis of the data, using any appropriate tools we have learned. Note any interesting features of the data.

  • As you work on this part, you will produce dozens of plots and tables in RStudio.

  • However, in the final submission, you will submit 3-5 professionally-formatted visualizations that each showcase a non-trivial finding from the data.

  • Each visualization should be accompanied by a compelling narrative caption that describes the key takeaways.

  • Note that EDA is half of the default project grade. There is a reason for this: Many interesting findings primarily originate from careful EDA, and statistical modeling supports or adds depth to these main findings.

  • So, you should be spending at least half of your project time making compelling visualizations!

  • Your visualizations alone (i.e, without any other part of the submission) should be able to tell a compelling story about whether the drug D-penicillamine appears to be effective.

Hypothesis testing (20% of the default project grade)

Conduct one hypothesis test for each of the columns in the dataset, excluding the response variable n_days and the treatment variable.

  • Each hypothesis test should assess whether the distribution of the covariate differs between the treatment and control group.

  • Are your results consistent with fully randomized assignment of the drug? Explain.

  • Be sure to draw on the lessons from the multiple testing lecture when assessing statistical significance.

  • Make sure to write one or more functions to avoid rewriting too much code. You should not need to copy-paste all of your code for each hypothesis test.

  • You are free show your hypothesis test results using a visualization, and/or a nicely-formatted table.

Model fitting (30% of the default project grade)

Perform (1) a linear regression analysis of the response (n_days) on the explanatory variables, and (2) a logistic regression analysis of a binary outcome that you construct based on n_days.

  • In your submission, describe whether and how you transformed your data or covariates to fit your models, or excluded any observations, and why. Tell a story.

  • There should be evidence that you considered interactions and polynomial terms when identifying the best model specification, though they do not necessarily have to be contained in your final models.

  • You should use cross-validation to select the best performing model specifications. You are free to choose any error metric to validate on. For example, you might choose RMSE, TPR, TNR, or another metric that you think is most appropriate given the dataset and your assumptions about the scenario.

  • Based on your models, what conclusion can you draw about the effect of the drug D-penicillamine on patient survival? Does the effect appear to differ across groups? Be sure to draw on the tools of inference in your answer.

Description of the custom project

A custom project provides hands-on experience with key steps of the data science pipeline:

  • Asking research questions
  • Identifying or creating dataset(s) to help you answer your questions
  • Cleaning, exploring, and analyzing datasets using tools from 131A and beyond
  • Synthesizing and compiling your results in a short report
  • Presenting your results to an audience

You are free to pursue any topic related to applied statistics.

  • In previous years, project teams have considered athletic performance, gender inequality, farming practices, restaurant quality, music success, gentrification, and standardized testing, just to name a few. Any data-driven investigation is fair game.

  • You can some examples of successful final projects here. Note that these projects are from a different class and have a different format, but the spirit is the same.

Each custom project team of 3 students will prepare a 3-page memo containing 3-5 visualizations and text that summarizes their key findings, and record a lightning talk of approximately 5-7 minutes to be reviewed by a member of the teaching staff.

  • Your memo should not contain any code. It should contain only text and visualizations.

Important note: Suppose you spend 20 total hours on your project. As it turns out, 5 of those 20 hours were spent cleaning a dataset. Even though cleaning the dataset took 25% of your time, you probably should not devote 25% of your memo and presentation to describing the data cleaning process. Your memo and presentation should highlight the key takeaways of your analysis.

Custom project teams of 1 or 2 students will only be approved in exceptional circumstances (e.g., proprietary research data that can only be viewed by one or two students).

  • Teams of 4 or more students will not be approved, so please do not ask for an exception!

Along the way, each custom project team will receive feedback from the course staff as part of two mandatory sub-components:

  • 15-minute project meeting with the course staff
  • HW6 check-in submission

Custom project grading

The custom project is intentionally open-ended and graded holistically.

  • Given the unique challenges faced by each team, there isn’t a one-size-fits-all rubric. Some teams will spend more time collecting complex data and have simpler analyses, while others will pursue more complex analyses of data that have already been cleaned.

  • As long as there’s evidence that your team has spent time sufficiently collecting, cleaning, exploring, and analyzing your data, and has taken into consideration feedback from the course staff, you should receive high marks.

  • If you have concerns about the specific directions of your project, please see a member of the teaching staff during office hours. We’re happy to lead you in the right direction!

Only one team member needs to submit the project via Gradescope. Please add all team members to your submission group.

Custom project FAQs

We may update this section with new questions from Ed and office hours. If you do not see your question answered below, be sure to ask!

Can the memo be longer than 3 pages?

You’re welcome to include an appendix of additional relevant results, but we can’t guarantee that the teaching team will review anything beyond 3 pages. Please make sure not to just dump all of your extraneous findings and plots in an appendix unless there is a good reason to include them. While research papers often have just 3-5 main plots, authors will often produce hundreds of plots over the course of a project that the public never sees.

Can presentations be longer than 7 minutes?

No. The member of the teaching staff reviewing your presentation will stop watching at 7 minutes. Practice your presentation several times to ensure that you stay within the time limit.

How should I record the presentation?

You should record yourself in front of a large screen (e.g., a flat screen TV in a conference room) with your slides/plots/diagrams visible in the same frame as your face. Do not read off of a script, do not present via screen share, and do not keep your camera off. In other words, treat the presentation as though you are presenting to work colleagues live and in-person.

Your key plots should be the main attraction of your presentation. All plot elements should be clearly visible. You are welcome to include additional information aside from just your plots. Slides with a lot of text should be avoided.

Each team member should present in front of the same screen. If it impossible for you to coordinate with your team members to be in the same location at the same time to record the presentation, please let the teaching staff know ahead of time. In these cases, you should each present separately in front of different screens and splice together your videos.

If you are unable to access a space on or off campus with a large screen, or there is something else preventing you from presenting in the format described above, please get in touch with the course staff about finding an alternative solution.

Can reports be double spaced?

No. The report should be single-spaced with a reasonably sized font and standard margins. Keep in mind that your 3-5 figures will take up a lot of space in your memo, and thoughtfully-designed plots+captions are arguably more important than the main text.

Can we collect our own data?

Yes! Many past students have used surveys to answer their research questions.

If you plan to create a survey, be sure to receive approval of your survey from the course staff before publishing it. You’ll also want to publish your survey well before the project deadline.

Effective survey design can take much longer than expected, so it’s not a good option for a last-minute project!

Do we have to use R for our analysis?

In general, yes. However, if you’re using a technique that goes beyond the techniques taught in this course, like web scraping, you are free to use another language like Python.

Can I use data from another class or research project?

Absolutely! Of course, don’t just turn in the same project that you’ve turned in for another class. That’s a recipe for disaster. But, we are fine with double- or triple-dipping topics across your classes. It’s never bad practice to dig deeper into one dataset or topic.

Is there a rough outline of what you’re looking for in the memo and presentation?

As mentioned above, the project is intentionally open-ended and doesn’t have a fixed rubric. That being said, here is a sample outline that has worked for many projects in the past. Keep in mind that your outline may differ!

Introduction and motivation

  • What is/are your research question(s)?

  • Why is each question interesting?

  • What’s your hypothesis?

  • What’s the brief summary of your results?

Relevant work

  • Who else has tried to answer your question(s)?

  • Were they successful?

  • How does your project relate to or build on existing work?

Data and methods

  • How will you go about answering your research question(s)?

  • What data sources will you use?

  • What methods will you use, and how will they answer your research question(s)?

  • For most groups, the ideal methods will be extensive exploratory data analysis, followed by linear and/or a logistic regression(s). If this is the case for your project, you would use this section to explain how your plots and regression(s) will answer your research questions.

Results and discussion

  • What are your findings?

  • How do you interpret those findings?

  • This will probably be your longest section, so spend the most time here.

  • You shouldn’t have more than 3-5 total plots+figures in your memo. The presentation should also include no more than 3-5 plots. Only include plots if they are critically relevant to your story.

  • Spend sufficient time making your plots pretty! With tools like ChatGPT, it’s a lot easier to figure out how to clean up plots. Aim to make your plots as clean as a plot you might see in a professional news outlet. You’re welcome to discuss potential improvements to your plots during office hours.

Conclusion

  • To what extent did you answer your research question?

  • Was your hypothesis correct?

  • With infinite time and resources, how would you go about better answering your research question(s)?