๐ HW6
Due date: Monday, December 9 at midnight.
โณ We recommend reading through each problem ASAP so you can accurately estimate the time needed to complete the assignment.
- This is not an assignment to start the night before the due date!
Unless otherwise stated, assignments in STAT 131A are to be done individually.
As stated in the syllabus, the use of any LLM other than PingPong is considered cheating, unless otherwise indicated.
See the syllabus for the full course collaboration policy.
Some components of this assignment have not been seen by a previous cohort of STAT 131A students, so there may be some unforeseen hiccups.
- If anything seems confusing or unclear, please post in the HW3 thread on Ed. We are here to support you!
๐ฎ Submission
Submit your assignment via Gradescope.
The Gradescope should be live at least a few days before the HW deadline. Feel free to remind us on Ed if the Gradescope isnโt live in time!
Make sure to tag your answers properly on Gradescope, or else you may be docked points for making the grading process more time-consuming.
You must submit a PDF of any PingPong chats that include code you used in your submission.
This should take the form of one long PDF. One way to do this is to copy all of your relevant PingPong chats into a Google Doc, and then print the doc as a PDF.
You are responsible for understanding all the code you submit, regardless of whether or not you used PingPong for help.
For multiple choice problems, submit your answers and justifications as a PDF.
- You are free to use any text-editing software to write your answers.
For the project component, you will produce both (1) a .Rmd file with your code and (2) an PDF file containing the code and output.
Make sure to clearly indicate your group members at the top of your submission. You should each submit the same version of your project component.
On Gradescope, you will submit a single ZIP file containing both the .Rmd and PDF files.
To generate a PDF of your code and output, do not knit to PDF. Instead, knit your .Rmd file as HTML, open the HTML file in a web browser, and then print the HTML as a PDF, making sure that none of your code or output is cut off. You can generate an HTML file in RStudio by pressing
Knit
and thenKnit to HTML
.The knitting process will not work if there are errors in your code, so be sure to leave plenty of time to knit your lab notebooks before the deadline.
Proofread your PDF to make sure all of your answers and plots are visible. If your PDF file is really long, it is possible that your code is printing out a large dataset or a really long vector. Make sure to comment out any code that prints more information than each question asks you for.
๐ Forms to complete
Complete a 5-minute research survey on PingPong!
You are one of the first classes in the history of human civilization to have access to a course-customized large-language model. Your responses to this survey are very valuable to the scientific community.
If you donโt like PingPong, or youโve rarely used it, we still want to hear from you!
As a thank you for completing the survey, I will share a summary and analysis of the responses once the survey is closed.
Take 5 minutes to complete the STAT 131A course evaluation.
Van and Josh would be really grateful to hear your feedback about the course.
We want to make 131A as effective as possible for as many students as possible, and we want to make sure your voice is heard!
๐ฏ Getting started on final exam prep (25% of the HW6 grade)
Answer the following multiple choice questions.
- For each question, provide a complete justification of your answer. In other words, you should state why option A is correct/incorrect, and why option B is correct/incorrect.
1. Linear regression is considered a parametric method because:
A. Linear combinations can take on any value between \(-\infty\) and \(\infty\).
B. The residuals in the assumed data-generating process for linear regression are normally distributed.
C. Both A and B are correct.
D. Neither A nor B is correct.
2. Logistic regression is considered a classification algorithm because:
A. Outcomes in the training data are binary.
B. Log odds are modeled as a linear combination of coefficients and covariates.
C. Both A and B are correct.
D. Neither A nor B is correct.
3. Principal component analysis (PCA) is considered an unsupervised method because:
A. It changes the frame of reference of the data to maximimize variance across each orthogonal dimension.
B. It predicts outcomes conditional on the inputted covariates.
C. Both A and B are correct.
D. Neither A nor B is correct.
4. LOESS is considered a supervised method because:
A. Unlike linear regression, it can model non-linear relationships.
B. Unlike logistic regression, it requires the outcome variable to be real-valued and not binary.
C. Both A and B are correct.
D. Neither A nor B is correct.
5. Bootstrap resampling of the observed data is considered a non-parametric approach because:
A. It can be used to estimate the standard error of the sampling distribution of a sample statistic.
B. It requires us to assume how the input data was generated.
C. Both A and B are correct.
D. Neither A nor B is correct.
6. The permutation test is considered a non-parametric approach because:
A. It cannot be implemented in a reasonable amount of time without using a computer.
B. It assumes that the sampling distribution of the permutation test statistic is approximately normal.
C. Both A and B are correct.
D. Neither A nor B is correct.
7. The \(k\)-means clustering algorithm is considered an unsupervised method because:
A. We can compare the identified centroids to a set of true centroids.
B. It identifies the optimal number of clusters for a given dataset, without assuming how many points should be assigned to each cluster.
C. Both A and B are correct.
D. Neither A nor B is correct.
๐๏ธ Getting started on the course project (75% of the HW6 grade)
Please read the course project page.
If you choose to pursue the default project, you must take a first pass at the visualization component.
Specifically, you must submit 3-5 professionally-formatted visualizations that each showcase a non-trivial finding from the data.
Each visualization should be accompanied by a compelling narrative caption that describes its key takeaway(s).
The course staff will provide comments on your submission so that you can improve your visualizations and captions for the final project submission.
If youโd like, you may also complete additional parts of the default project, and the course staff will provide feedback. But, this is not required, and no extra credit will be awarded.
If you would like to pursue a custom project, you must schedule a meeting the course staff to discuss your project, and also submit a 1-2 page project proposal with the following components:
A research question
A description of the dataset(s) that you will use for your project. This must be a real dataset, not a hypothetical dataset that you have not yet identified.
An analysis plan
At least one plot that showcases a non-trivial preliminary finding from the data.
For structuring your proposal, you may find it helpful to read the last FAQ on the project page.