ENGR200 Term Project — Excel Demo (Statics Grade)

Instructor: Dr. Maggie Foley • Course: Engineering Statistics • Individual Assignment.
This page mirrors the Term Project Guide using the official dataset ENGR200_Term_Project_Data.xlsx (Statics course). It shows how to do the project in Excel using built-in functions and (optionally) the Analysis Toolpak.

You may choose either: (A) Analysis Toolpak for descriptive stats & regression, or (B) Excel functions/matrix formulas as shown here.
If you use Excel functions/matrix, you must submit the Excel file (see Section 9).

0) Open the Statics dataset

Download and open the Excel file ENGR200_Term_Project_Data.xlsx from Blackboard. You should see 150 students with the following columns:

A: Student_ID
B: HS_GPA
C: SAT_Math
D: SAT_Verbal
E: Study_Hours_per_week
F: Attendance_%
G: Class_Participation
H: Statics_Grade (your Y)

Snippet of the first few rows (for a quick sanity check only):

Student_ID,HS_GPA,SAT_Math,SAT_Verbal,Study_Hours_per_week,Attendance_%,Class_Participation,Statics_Grade
1,3.4,665,562,11.7,93.8,7.3,75
2,3.14,671,581,15.8,85.4,6.1,71
3,3.46,609,672,8.1,94.3,7.2,74
4,3.81,664,663,14.0,96.8,8.1,90
5,3.11,668,619,9.9,92.1,7.1,74
6,3.11,607,628,8.8,99.4,7.1,100
7,3.83,762,709,11.6,86.1,5.6,86
8,3.51,678,579,7.9,83.8,5.7,83
9,3.01,579,658,9.8,81.1,4.2,75
10,3.42,689,606,7.2,97.5,10.0,88

Expected quick checks for the full dataset (n = 150):
Mean HS_GPA ≈ 3.17 • Mean Study_Hours_per_week ≈ 11.61
Mean Attendance_% ≈ 89.79 • Mean Statics_Grade ≈ 83.54.
If your means are wildly different, your ranges are probably wrong.

1) Descriptive statistics (mean, median, stdev, count)

Compute these for every numeric variable B:H. Example layout below uses rows 2..151 as data and puts summaries in columns P:V.

P2 (Count HS_GPA): =COUNT($B$2:$B$151) P3 (Mean HS_GPA): =AVERAGE($B$2:$B$151) P4 (Median HS_GPA): =MEDIAN($B$2:$B$151) P5 (StdDev HS_GPA): =STDEV.S($B$2:$B$151) Q2 (Count SAT_Math): =COUNT($C$2:$C$151) Q3 (Mean SAT_Math): =AVERAGE($C$2:$C$151) Q4 (Median SAT_Math): =MEDIAN($C$2:$C$151) Q5 (StdDev SAT_Math): =STDEV.S($C$2:$C$151) R2 (Count SAT_Verbal): =COUNT($D$2:$D$151) R3 (Mean SAT_Verbal): =AVERAGE($D$2:$D$151) R4 (Median SAT_Verbal): =MEDIAN($D$2:$D$151) R5 (StdDev SAT_Verbal): =STDEV.S($D$2:$D$151) S2 (Count Study_Hours): =COUNT($E$2:$E$151) S3 (Mean Study_Hours): =AVERAGE($E$2:$E$151) S4 (Median Study_Hours): =MEDIAN($E$2:$E$151) S5 (StdDev Study_Hours): =STDEV.S($E$2:$E$151) T2 (Count Attendance_%): =COUNT($F$2:$F$151) T3 (Mean Attendance_%): =AVERAGE($F$2:$F$151) T4 (Median Attendance_%): =MEDIAN($F$2:$F$151) T5 (StdDev Attendance_%): =STDEV.S($F$2:$F$151) U2 (Count Participation): =COUNT($G$2:$G$151) U3 (Mean Participation): =AVERAGE($G$2:$G$151) U4 (Median Participation): =MEDIAN($G$2:$G$151) U5 (StdDev Participation): =STDEV.S($G$2:$G$151) V2 (Count Statics_Grade): =COUNT($H$2:$H$151) V3 (Mean Statics_Grade): =AVERAGE($H$2:$H$151) V4 (Median Statics_Grade): =MEDIAN($H$2:$H$151) V5 (StdDev Statics_Grade): =STDEV.S($H$2:$H$151)

Expected (approx): Mean Statics_Grade ≈ 83.54, Mean HS_GPA ≈ 3.17, Mean Study_Hours ≈ 11.61, Mean Attendance_% ≈ 89.79.

2) Histograms (Toolpak or FREQUENCY)

Make at least one histogram for Statics_Grade and one for a predictor (e.g., HS_GPA or SAT_Math).

Option A — Analysis Toolpak:
Data → Data Analysis → Histogram
• Input Range: for Statics_Grade use $H$2:$H$151
• Bin Range: choose a column of bin cut points (e.g., 50, 60, 70, …, 100)
• Check Labels if you included the header
• Choose an Output Range and check Chart Output.

Option B — FREQUENCY (no Toolpak):
Put Statics_Grade bins in U2:U12 (e.g., 50, 60, 65, …, 100), then use FREQUENCY.

(Assume bins for Statics_Grade in U2:U12) V2: =FREQUENCY($H$2:$H$151,$U$2:$U$12) → In Excel 365 this spills automatically. In older Excel: • Select V2:V12 • Type the formula • Press Ctrl+Shift+Enter (array formula) Then insert a Column Chart on U2:V12.

3) Scatter plots (Statics_Grade vs predictors)

Make scatterplots with Statics_Grade (H) as the vertical axis and a predictor as the horizontal axis. Examples:

Statics_Grade vs HS_GPA (B)
Statics_Grade vs SAT_Math (C)
Statics_Grade vs SAT_Verbal (D)
Statics_Grade vs Study_Hours_per_week (E)

One method: select the X and Y columns (e.g., E1:E151 and H1:H151 while holding Ctrl), then Insert → Scatter → first option. Add a trendline if you want a visual slope.

4) Correlation matrix (numeric variables)

Build a correlation matrix for numeric variables B:H using CORREL. Example layout: headers B:H across the top and down the left; fill with CORREL formulas.

(Example: headers in X1:AD1 and again in W2:W8: HS_GPA, SAT_Math, SAT_Verbal, Study_Hours, Attendance, Participation, Statics_Grade) X2 (HS_GPA, HS_GPA): =CORREL($B$2:$B$151,$B$2:$B$151) Y2 (HS_GPA, SAT_Math): =CORREL($B$2:$B$151,$C$2:$C$151) Z2 (HS_GPA, SAT_Verbal): =CORREL($B$2:$B$151,$D$2:$D$151) AA2 (HS_GPA, Study_Hours): =CORREL($B$2:$B$151,$E$2:$E$151) AB2 (HS_GPA, Attendance): =CORREL($B$2:$B$151,$F$2:$F$151) AC2 (HS_GPA, Participation):=CORREL($B$2:$B$151,$G$2:$G$151) AD2 (HS_GPA, Statics): =CORREL($B$2:$B$151,$H$2:$H$151) (Repeat the pattern, swapping the ranges to fill the rest of the matrix.)

Expected (approx) correlations with Statics_Grade:
corr(Statics, SAT_Verbal) ≈ 0.14 • corr(Statics, Study_Hours) ≈ −0.10 • corr(Statics, Attendance) ≈ 0.09
All are fairly weak in this demo dataset.

5) Simple Linear Regression: Statics_Grade ~ Study_Hours

Here we regress Y = Statics_Grade (H) on X = Study_Hours_per_week (E). We use direct Excel functions for β, R², and t.

A) Direct Excel functions (recommended)

(1) Slope, intercept, R² — put these somewhere (e.g., P2:P4): P2 (Slope β1): =SLOPE($H$2:$H$151,$E$2:$E$151) P3 (Intercept β0): =INTERCEPT($H$2:$H$151,$E$2:$E$151) P4 (R^2): =RSQ($H$2:$H$151,$E$2:$E$151) (2) Fitted values ŷ and residuals — new columns I and J: I1: "Yhat_Study" I2: =$P$3 + $P$2*E2 ← fill I2 down to I151 J1: "Residual_Study" J2: =H2 - I2 ← fill J2 down to J151 (3) Error variance, SE(β1), t, p: P5 (MSE): =SUMSQ($J$2:$J$151)/(COUNT($J$2:$J$151)-2) P6 (SE β1): =SQRT(P5 / SUMXMY2($E$2:$E$151,AVERAGE($E$2:$E$151))) P7 (t β1): =P2/P6 P8 (p two-sided): =T.DIST.2T(ABS(P7),COUNT($J$2:$J$151)-2)

Expected (approx) for this dataset:
β₀ ≈ 85.80 • β₁ (Study_Hours) ≈ −0.195
R² ≈ 0.010 (almost no explanatory power) • t(β₁) ≈ −1.24, p ≈ 0.22.

B) Toolpak alternative (no extra formulas)

Data → Data Analysis → Regression
• Input Y Range: $H$2:$H$151 (Statics_Grade)
• Input X Range: $E$2:$E$151 (Study_Hours_per_week)
• Check Labels, choose Output Range, check Residuals if desired
→ Toolpak will give β₀, β₁, SE, t, p, R² directly.

6) Multiple Linear Regression (Statics_Grade on several predictors)

Now build a multiple regression model with Y = Statics_Grade (H) and several X’s: HS_GPA (B), SAT_Math (C), SAT_Verbal (D), Study_Hours (E), Attendance_% (F), Class_Participation (G). You may choose a subset that makes sense for your project.

Option 1 — Analysis Toolpak (easiest, strongly recommended)

1. Data → Data Analysis → Regression
2. Input Y Range: $H$2:$H$151 (Statics_Grade)
3. Input X Range: $B$2:$G$151 (all six predictors) or a subset
4. Check Labels; choose an Output Range on a new sheet (e.g., "MLR_Toolpak")
5. Under Residuals, check Residuals (and optionally Residual Plots)
6. Click OK — you now have β’s, SE, t, p, R², Adjusted R², F, etc.

If you choose this route, you can interpret coefficients and model fit directly from the Toolpak regression table. See Section 9 for what to submit.

Option 2 — Excel functions + matrix method (for transparency)

This option reproduces Toolpak’s regression using matrix formulas. It’s more work but shows the mechanics. Use if you want to see the “inside” of OLS or if Toolpak is unavailable.

6.1 Build the design matrix X (7 columns)

Let Y = Statics_Grade in H2:H151. Create 7 columns J:P for X = [1, HS_GPA, SAT_Math, SAT_Verbal, Study_Hours, Attendance, Participation]. J1: "Intercept" K1: "HS_GPA" L1: "SAT_Math" M1: "SAT_Verbal" N1: "Study_Hours" O1: "Attendance" P1: "Participation" J2: =1 K2: =B2 L2: =C2 M2: =D2 N2: =E2 O2: =F2 P2: =G2 ↓ Fill J2:P2 down to row 151 (so X is J2:P151).

6.2 Estimate β̂, ŷ, residuals, R², F, etc.

β̂ (R2:R8): =LET(X,$J$2:$P$151,Y,$H$2:$H$151, MMULT(MINVERSE(MMULT(TRANSPOSE(X),X)),MMULT(TRANSPOSE(X),Y))) (Select R2:R8 first, then enter the formula. Excel 365: just Enter (spills). Older Excel: Ctrl+Shift+Enter.) Interpret R2:R8 as: R2 = β0 (Intercept) R3 = β_HS_GPA R4 = β_SAT_Math R5 = β_SAT_Verbal R6 = β_Study_Hours R7 = β_Attendance R8 = β_Participation Ŷ (S2:S151): = $R$2 + $R$3*B2 + $R$4*C2 + $R$5*D2 + $R$6*E2 + $R$7*F2 + $R$8*G2 (Fill S2 down to S151.) Residuals (T2:T151): =H2 - S2 (fill down) ȳ (R10): =AVERAGE($H$2:$H$151) SST (R11): =SUMPRODUCT(($H$2:$H$151-$R$10)^2) SSE (R12): =SUMSQ($T$2:$T$151) SSR (R13): =R11-R12 #slopes (R14): =COLUMNS($J$2:$P$2)-1 df error (R15):=ROWS($H$2:$H$151)-ROWS($R$2:$R$8) MSE (R16): =R12/R15 R² (R17): =R13/R11 Adj R² (R18): =1-(1-R17)*(ROWS($H$2:$H$151)-1)/R15 F (R19): =(R13/R14)/R16 F p (R20): =F.DIST.RT(R19,R14,R15)

6.3 Standard errors, t-tests, and confidence intervals

Compute (X'X)^{-1} in V2:AB8: V2:AB8: =MINVERSE(MMULT(TRANSPOSE($J$2:$P$151),$J$2:$P$151)) (Select V2:AB8 first, then enter; Excel 365: Enter, older: Ctrl+Shift+Enter.) SE β̂_j (S2:S8): S2: =SQRT($R$16*INDEX($V$2:$AB$8,ROW(A1),ROW(A1))) ↓ fill S2 down to S8 (these are SE for β0..β6) t (T2:T8): T2: =R2/S2 ↓ fill T2 down to T8 p-values (U2:U8): U2: =T.DIST.2T(ABS(T2),$R$15) ↓ fill U2 down to U8 Set α (U1): 0.05 95% CI for each β (V2:W8): V2 (CI lo): =R2 - T.INV.2T($U$1,$R$15)*S2 W2 (CI hi): =R2 + T.INV.2T($U$1,$R$15)*S2 ↓ fill V2:W2 down to V8:W8

Expected (approx) for the full model with all six predictors:
β̂: Intercept ≈ 57.6, HS_GPA ≈ 4.10, SAT_Math ≈ 0.018, SAT_Verbal ≈ 0.037,
Study_Hours ≈ −0.162, Attendance ≈ 0.144, Participation ≈ 0.180.
R² ≈ 0.206, adj-R² ≈ 0.174, df error ≈ 143, MSE ≈ 93.20.

7) OPTIONAL — Residual analysis, leverage & Cook’s D Optional

Optional but great for students aiming for distinction. Use these only if you want to check for outliers/influential points and normality more rigorously.

A) Leverage hᵢ, standardized residuals, Cook’s D (click to expand)

Assume: • X rows are J2:P151 (Option 2 design matrix) • (X'X)^{-1} is in V2:AB8 • Residuals e_i are in T2:T151 • MSE is in R16 Leverage h_i (AE2): AE2: =INDEX(MMULT(MMULT($J2:$P2,$V$2:$AB$8),TRANSPOSE($J2:$P2)),1,1) Standardized residual (AF2): AF2: =$T2/SQRT($R$16*(1-AE2)) Cook's D (AG2): AG2: =($T2^2/($R$16*COLUMNS($J$2:$P$2)))*(AE2/(1-AE2)^2) Flag line (AH1): AH1: =4/ROWS($H$2:$H$151) ↓ Fill AE2:AG2 down to row 151. Points with Cook’s D > 4/n (≈ 4/150 ≈ 0.027) or > 1 deserve closer inspection.

B) QQ plot of residuals (click to expand)

Make a QQ plot for the multiple regression residuals (T2:T151): 1. Copy residuals T2:T151 to a new column AG2:AG151. 2. Sort AG2:AG151 from smallest to largest. 3. Ranks (AH2:AH151): AH2: =ROW()-1 ← fill AH2 down to AH151 4. Probabilities p_i (AI2:AI151): AI2: =(AH2-0.5)/ROWS($AG$2:$AG$151) ← fill down 5. Normal quantiles z_i (AJ2:AJ151): AJ2: =NORM.S.INV(AI2) ← fill down Then insert a Scatter plot of (AJ2:AJ151) on the x-axis vs (AG2:AG151) on the y-axis. If points lie roughly on a straight line, residuals are roughly normal.

8) Multicollinearity (VIF)

VIF_j = 1/(1 − R²_j) where R²_j comes from regressing X_j on the other predictors (with an intercept). We can use LINEST for each predictor in turn.

Let the six predictors be B:G (HS_GPA, SAT_Math, SAT_Verbal, Study_Hours, Attendance, Participation). Place VIFs in AD2:AD7, one row per predictor: HS_GPA (AD2): =1/(1-INDEX(LINEST($B$2:$B$151, CHOOSE({1,2,3,4,5},$C$2:$C$151,$D$2:$D$151,$E$2:$E$151,$F$2:$F$151,$G$2:$G$151), TRUE,TRUE),3,1)) SAT_Math (AD3): =1/(1-INDEX(LINEST($C$2:$C$151, CHOOSE({1,2,3,4,5},$B$2:$B$151,$D$2:$D$151,$E$2:$E$151,$F$2:$F$151,$G$2:$G$151), TRUE,TRUE),3,1)) SAT_Verbal (AD4): =1/(1-INDEX(LINEST($D$2:$D$151, CHOOSE({1,2,3,4,5},$B$2:$B$151,$C$2:$C$151,$E$2:$E$151,$F$2:$F$151,$G$2:$G$151), TRUE,TRUE),3,1)) Study_Hours (AD5): =1/(1-INDEX(LINEST($E$2:$E$151, CHOOSE({1,2,3,4,5},$B$2:$B$151,$C$2:$C$151,$D$2:$D$151,$F$2:$F$151,$G$2:$G$151), TRUE,TRUE),3,1)) Attendance (AD6): =1/(1-INDEX(LINEST($F$2:$F$151, CHOOSE({1,2,3,4,5},$B$2:$B$151,$C$2:$C$151,$D$2:$D$151,$E$2:$E$151,$G$2:$G$151), TRUE,TRUE),3,1)) Participation (AD7): =1/(1-INDEX(LINEST($G$2:$G$151, CHOOSE({1,2,3,4,5},$B$2:$B$151,$C$2:$C$151,$D$2:$D$151,$E$2:$E$151,$F$2:$F$151), TRUE,TRUE),3,1))

Expected (approx): VIFs for this dataset are all near 1.0 (≈1.01–1.07),
so multicollinearity is not a serious issue in the full model.

Special Note — Using SPSS to Get VIF

Excel’s Analysis Toolpak does not report VIF directly. You can either use the Excel VIF formulas in this section, or you can run the same regression in SPSS and let SPSS calculate VIF for you.

SPSS menu path for VIF:

1. Analyze → Regression → Linear… 2. In the Linear Regression dialog: • Move your Y (e.g., Statics_Grade) into the Dependent box. • Move your predictors (e.g., HS_GPA, SAT_Math, SAT_Verbal, Study_Hours_per_week, Attendance_%, Class_Participation) into Independent(s). • Make sure Method is: Enter. 3. Click Statistics… • Under Regression Coefficients / Model, check: Estimates and Model fit. • Under Collinearity diagnostics, check: Collinearity diagnostics. • Click Continue. 4. Click OK. 5. In the Output Viewer, open the Coefficients table and look at the right-most columns: Tolerance and VIF (one row per predictor).

Practical shortcut: run the full model in Excel, then replicate the same Y and X’s in SPSS once, just to read off the Tolerance and VIF values and copy them into your report.

9) What to submit (Term Project)

Word report (5–8 pages) — required for everyone.
- Describe the dataset and selected variables.
- Summarize descriptive stats, histograms, correlation matrix.
- Present your regression model(s) with tables (coefficients, SE, t, p, R², F).
- Interpret one numeric coefficient and (if you used any) one dummy coefficient.
- Discuss multicollinearity using your VIFs.
- Comment briefly on residuals/diagnostics if you used Section 7.
If you used the Analysis Toolpak for regression/descriptives (Regression, Descriptive Statistics, Histogram):
- Include the original Toolpak output tables/figures (pasted or screenshot) in your Word report.
- You do not have to submit the Excel file, but you may attach it if you wish.
If you used Excel functions / matrix formulas (SLOPE, INTERCEPT, RSQ, CORREL, MMULT, MINVERSE, etc.):
- Submit your Excel workbook showing all steps (descriptives, histograms, scatterplots, simple regression, multiple regression, VIF, and any diagnostics you used).
- Submit your Word report as above.

Practical strategy: master this Statics demo first (follow steps 1–8), then repeat the same workflow on your own approved model for the project.

10) (Optional) Toolpak shortcuts vs functions

Descriptive Statistics (Toolpak): quick table of mean, median, SD, min, max.
Histogram (Toolpak): automatic bins + chart.
Regression (Toolpak): β, SE, t, p, R², adjusted R², ANOVA all at once.
Functions path: COUNT, AVERAGE, MEDIAN, STDEV.S, CORREL, SLOPE, INTERCEPT, RSQ, T.DIST.2T, F.DIST.RT, MMULT, MINVERSE, etc., give the same results but let you see each step.

11) Why Excel here? When to use Stata / R / Python instead?

Excel is for learning transparency. In this project you see every step: building X, computing β̂, residuals, MSE, standard errors, t-tests, confidence intervals, and VIF. For serious professional work, you would normally use Stata, R, or Python: they handle big datasets, robust/clustered SEs, better graphics, and fully reproducible code. This assignment is a bridge so that when you move to those tools, you already understand exactly what each command is doing.