Recently Published

Gerrymandering
On November 4, 2025, Californians will vote on Proposition 50, which would adopt a congressional district map drawn by the state legislature (bill AB 604) for use until after the 2030 census. Would adopting the new map increase partisan advantage relative to 2024? If so, by how much?
Plot
Pruebas Paramétricas_informe
Statistical analysis of parametric comparisons, answering a hypothesis with one and two samples.
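The one- and two-sample parametric comparisons the report describes reduce to t statistics. Below is a minimal, purely illustrative Python sketch (not the report's own code; the helper names `t_one_sample` and `t_two_sample` are invented for this example):

```python
# Illustrative sketch of one- and two-sample t statistics (stdlib only).
from statistics import mean, stdev
from math import sqrt

def t_one_sample(x, mu0):
    # t = (x_bar - mu0) / (s / sqrt(n))
    return (mean(x) - mu0) / (stdev(x) / sqrt(len(x)))

def t_two_sample(x, y):
    # Pooled-variance (equal-variance) two-sample t statistic
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

print(t_one_sample([1, 2, 3, 4, 5], 3.0))          # sample mean equals mu0 -> t = 0
print(t_two_sample([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

The t statistic is then compared against the t distribution with n - 1 (one-sample) or nx + ny - 2 (pooled two-sample) degrees of freedom to obtain a p-value.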
Regression Trees and Rule-Based Modeling
This analysis explores tree-based regression methods and variable importance metrics across four comprehensive exercises in R. The project demonstrates machine learning techniques for handling correlated features, optimizing model complexity through bias-variance tradeoff analysis, and deploying interpretable models for production manufacturing optimization. Using the Friedman simulation dataset and real-world chemical manufacturing data, the analysis compares traditional and conditional variable importance methods, evaluates hyperparameter effects on model generalization, and showcases the strategic value of combining interpretable single trees with high-performance ensemble methods for business decision-making.

Key Accomplishments:
• Variable Importance Methods: Compared traditional Random Forest importance against conditional importance (Strobl et al., 2007), demonstrating that conditional methods correctly penalize redundant correlated features while traditional methods artificially split importance, a distinction critical for feature selection in production ML systems
• Model Comparison: Evaluated five tree-based methods (single tree, bagged trees, Random Forest, GBM, Cubist) on manufacturing yield prediction, achieving the best test R² = 0.62 with Random Forest while identifying 10x variation in importance scores across methods
• Bias-Variance Optimization: Simulated gradient boosting across six interaction depths (1-10), confirming that an optimal depth of 4-6 balances complexity and generalization; shallow trees underfit, deep trees overfit
• Hyperparameter Analysis: Analyzed the GBM exploration-exploitation tradeoff, demonstrating that conservative parameters (learning rate = 0.1, bagging fraction = 0.1) produce distributed importance and better generalization, whereas aggressive settings concentrate importance on 2-3 features
• Production Interpretability: Deployed an interpretable regression tree revealing ManufacturingProcess32 as a critical control parameter (threshold: 159.5, +2.5 yield impact) and identifying that 60% of production operates sub-optimally, providing actionable operational targets that ensemble models cannot offer

Technical Stack: R (caret, randomForest, gbm, Cubist, party), 10-fold cross-validation, conditional inference forests, MARS, ensemble methods
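The importance-splitting effect among correlated features can be reproduced outside R. The toy Python sketch below (illustrative only; the project itself used R's randomForest and party packages) mimics a forest's per-tree feature subsampling with two linear sub-models, each seeing only one of two nearly identical features, and shows permutation importance diluting credit across the pair:

```python
# Toy demo: permutation importance splits credit across correlated features
# when different sub-models rely on different copies of the same signal.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)
y = 2.0 * x1 + 2.0 * x3 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, x3])

def ols_fit(cols):
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return cols, beta

def ols_predict(model, M):
    cols, beta = model
    return beta[0] + M[:, cols] @ beta[1:]

# Two sub-models, each using one of the correlated copies: a stand-in
# for a forest where feature subsampling spreads the signal over trees.
m_a = ols_fit([0, 2])   # uses x1, x3
m_b = ols_fit([1, 2])   # uses x2, x3
ensemble = lambda M: 0.5 * (ols_predict(m_a, M) + ols_predict(m_b, M))

def r2(y_true, y_pred):
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Permutation importance: drop in R^2 when one column is shuffled
base = r2(y, ensemble(X))
imps = []
for j in range(3):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    imps.append(base - r2(y, ensemble(Xp)))

print([round(v, 2) for v in imps])  # x1 and x2 share credit; x3 keeps full importance
```

Conditional importance (as in party's `varimp(..., conditional = TRUE)`) addresses this by permuting a feature within strata of its correlated covariates, so a redundant copy carries little extra information and its importance drops.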
PD4
Comparative Codon Usage: Human vs Worm
Comparing human RSCU (relative synonymous codon usage) with the RSCU of worms.
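RSCU divides each codon's observed count by the mean count across its synonymous codons, so values above 1 mark preferred codons. A minimal Python sketch of the calculation, assuming a small illustrative fragment of the standard codon table (the `SYNONYMS` dict and `rscu` function are hypothetical names, not from the published analysis):

```python
# RSCU for codon c of amino acid a = count(c) / mean(count over a's synonyms)
from collections import Counter

# Illustrative fragment of the standard codon table (Phe, Lys, Gly)
SYNONYMS = {
    "F": ["TTT", "TTC"],
    "K": ["AAA", "AAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
}

def rscu(seq):
    # Split the coding sequence into in-frame codons, dropping any remainder
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(codons)
    out = {}
    for aa, syns in SYNONYMS.items():
        total = sum(counts[c] for c in syns)
        if total == 0:
            continue  # amino acid absent from this sequence
        mean = total / len(syns)
        for c in syns:
            out[c] = counts[c] / mean
    return out

print(rscu("TTTTTTTTCAAAAAGGGTGGT"))
```

On this toy sequence (codons TTT, TTT, TTC, AAA, AAG, GGT, GGT), TTT scores above 1 and TTC below 1 for Phe, the two Lys codons are balanced at 1.0, and GGT takes the entire Gly signal.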
prueba_v.i
Try to make a folder