(1) Chapter 2: 8+, 9* (Due 01/24) Filename: 3880hw-2-1.pdf
+8 Only submit an answer to 2.8.c.vi.
*9 Only submit an answer to 2.9.e (comments only, no plots) & 2.9.f.
Submit your code (or Rmarkdown code) for each entire problem as an appendix (at the end, after all answers).
Keep your answers to one page plus whatever you need for the code appendix.
(2) Chapter 3: 14+ (Due 02/07) Filename: 3880hw-3-1.pdf
+[g] Do not re-do parts (c)-(e); just use the 3 different models from those parts and answer whether the new obs is an outlier and/or a high-leverage point.
NOTE: Remove this new obs for parts (h)-(k) below (or reload the data starting again with set.seed(1) ...)
(h) For the model predicting y with x1 & x2, compute the variance inflation factors (VIFs) and state an interpretation of the values.
(i) Compute the VIFs "by hand" in R based on VIF_j = 1/[1 - R^2_{x_j|x_(-j)}] (see p. 102). Show your code for this.
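As a starting point for part (i), here is a minimal sketch of the "by hand" VIF computation, using the simulated predictors x1 and x2 from problem 3.14; the regression of x1 on x2 gives the R^2 that plugs into the formula.

```r
# Sketch: VIF "by hand" for x1 in a model with predictors x1 and x2.
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10        # correlated predictors, as in 3.14

r2 <- summary(lm(x1 ~ x2))$r.squared    # R^2 of x_j regressed on the others
vif_x1 <- 1 / (1 - r2)                  # VIF_j = 1/(1 - R^2_{x_j|x_(-j)})
vif_x1
```

Compare this value against the output of a packaged VIF function (e.g., car::vif) to check your work.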
(j) For the model predicting y with x1 & x2, how many obs may be outliers in the "x-direction"? Which observations do these correspond to? Identify these observations in the column space of the design matrix (the x1 vs. x2 plane) and describe their location.
(k) For the model predicting y with x1 alone, add a new data point that has high leverage but low Cook's distance.
(l) For the model predicting y with x1 alone, add a new data point that has moderate leverage but high Cook's distance (larger than any other observed value).
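For parts (k) and (l), a helper like the following can check a candidate point's leverage and Cook's distance before you commit to it. The function name and the candidate coordinates are placeholders you tune by trial and error; y and x1 are assumed to come from the problem's simulation.

```r
# Sketch: append a candidate point (x_new, y_new) to the x1-only model
# and report that point's leverage and Cook's distance.
check_point <- function(x1, y, x_new, y_new) {
  x <- c(x1, x_new)
  yy <- c(y, y_new)
  fit <- lm(yy ~ x)
  n <- length(x)
  c(leverage = hatvalues(fit)[n], cooks = cooks.distance(fit)[n])
}

# Part (k): a far-out x that sits near the fitted line gives high leverage
# but a small residual, hence low Cook's distance.
# Part (l): a moderately unusual x with a large residual gives moderate
# leverage but high Cook's distance.
```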
SUBMIT for 3.14: a (question answer only), f (no plots), g, h, i, j. (Do not submit anything for parts b-e.)
SUBMIT for 3.14: k & l (give the coordinates of the point, leverage, Cook's dist, and show a plot of Y vs X1 with the point highlighted)
SUBMIT an appendix of code for all parts (at the end)
Please keep your answers to 2 pages plus whatever you need for the code appendix.
(HW#3) Logistic regression and LDA (Due 02/21) Filename: 3880hw-4-1.pdf
(HW#4) Naive Bayes 1 (Due 03/21) Filename: 3880hw-4-2.pdf
(HW#5) Chapter 5: 5+, 6* (Due 03/28) Filename: 3880hw-5-1.pdf
+* See the modifications here
NOTE: boot.ci() can throw an error related to the BCa version. If this causes problems
when knitting an .Rmd file, you can use the argument type=c("norm","basic","perc"), or any
subset, for the Normal, Basic, and Percentile versions, avoiding the BCa version.
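A minimal sketch of that workaround, with a placeholder statistic (the mean of simulated data) standing in for whatever the problem asks you to bootstrap:

```r
library(boot)

# Placeholder statistic: replace with the one the problem requires.
boot.fn <- function(data, index) mean(data[index])

set.seed(1)
b <- boot(rnorm(100), boot.fn, R = 1000)

# Request only the Normal, Basic, and Percentile intervals, skipping BCa.
boot.ci(b, type = c("norm", "basic", "perc"))
```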
Chapter 8: X, 8+ (Due 04/14 - Monday) Filename: 3880hw-8-1.pdf
(X) Give a brief explanation (in your own words) of the cost complexity pruning referred to in step 2 of algorithm 8.1 in the text.
Recall that we have visualized results for sequences of trees as a function of T (R calls this "size") rather than of alpha (R calls this "k").
NOTE: Use a 50% split for your train/test set and set.seed(1) before calling sample().
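The prescribed split can be sketched as below; `dat` is a placeholder data frame, so substitute the chapter's data set.

```r
# Sketch: 50% train/test split with set.seed(1) before sample(), per the note.
dat <- data.frame(x = rnorm(100), y = rnorm(100))  # placeholder data

set.seed(1)
train <- sample(1:nrow(dat), nrow(dat) / 2)
test  <- setdiff(1:nrow(dat), train)
```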
NOTE: use set.seed(1) before each gbm() call
NOTE: if gbm fails to load, try resizing the console window (larger)
(f) Use boosting on the training set with a depth of 3 splits and the default shrinkage to find the test MSE.
(g) Repeat part (f) with a shrinkage of .01 and .02, reporting the test MSE in each case.
(h) Repeat part (g) using "stumps" and compare the test MSE to parts (f) & (g).
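Parts (f)-(h) can be sketched as follows. The data frame `dat`, the response `y`, and the formula are placeholders; swap in the chapter's data set, and remember the note above about calling set.seed(1) before each gbm() call.

```r
library(gbm)

# Placeholder data; replace with the assigned data set and train/test split.
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- dat$x1 + rnorm(200)
train <- sample(1:200, 100)

# Part (f): depth-3 trees with the default shrinkage (0.1 in gbm).
set.seed(1)
fit <- gbm(y ~ ., data = dat[train, ], distribution = "gaussian",
           n.trees = 1000, interaction.depth = 3)
pred <- predict(fit, dat[-train, ], n.trees = 1000)
test_mse <- mean((pred - dat$y[-train])^2)

# Part (g): refit with shrinkage = 0.01 and shrinkage = 0.02.
# Part (h): refit with interaction.depth = 1 ("stumps") and compare.
```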
Submit HWs at the tinyurl until there is a fix for Brightspace.