RPubs

Final Project Link

This report analyzes data from 187 countries to understand what drives overall statistical performance. Using the World Bank Statistical Performance Indicators (SPI) dataset from 2023, we explore which data capability dimensions -- such as data use, data products, and data infrastructure -- are most strongly associated with a country's overall score.

about 2 months ago

Final Project (Final Version)

This report analyzes data from 187 countries to understand what drives overall statistical performance. Using the World Bank Statistical Performance Indicators (SPI) dataset from 2023, we explore which data capability dimensions – such as data use, data products, and data infrastructure – are most strongly associated with a country’s overall score.

about 2 months ago

Final Project (Final)

This report analyzes data from 187 countries to understand what drives overall statistical performance. Using the World Bank Statistical Performance Indicators (SPI) dataset from 2023, we explore which data capability dimensions -- such as data use, data products, and data infrastructure -- are most strongly associated with a country's overall score.

about 2 months ago

Final Project

This report analyzes data from 187 countries to understand what drives overall statistical performance. Using the World Bank Statistical Performance Indicators (SPI) dataset from 2023, we explore which data capability dimensions – such as data use, data products, and data infrastructure – are most strongly associated with a country’s overall score. The analysis includes exploratory data visualizations, a hypothesis test comparing income groups, and regression models to identify the strongest predictors. The findings are intended to help international development organizations decide where to focus their investments when supporting countries with weak data systems.

about 2 months ago

Final Project (Part 2)

This report analyzes data from 187 countries to understand what drives overall statistical performance. Using the World Bank Statistical Performance Indicators (SPI) dataset from 2023, we explore which data capability dimensions -- such as data use, data products, and data infrastructure -- are most strongly associated with a country's overall score. The analysis includes exploratory data visualizations, a hypothesis test comparing income groups, and regression models to identify the strongest predictors. The findings are intended to help international development organizations decide where to focus their investments when supporting countries with weak data systems.

about 2 months ago

Week 14 - Model Critique

This notebook critiques the analysis from Data Dive 8 - Regression Modeling, which used the World Bank Statistical Performance Indicators (SPI) dataset. The dataset covers 217 countries from 2004 to 2023 and measures statistical capacity across five sub-dimensions: data use, data products, data services, data sources, and data infrastructure. Two models were used in Week 8: a one-way ANOVA asking whether income group explains differences in overall statistical performance, and a simple linear regression asking whether data_products_score can predict overall_score. The goal is not to redo the analysis but to critique it: identifying what was missing, which assumptions were not checked, and what risks would arise if these models were used in practice. The original Week 8 analysis is published as Data Dive 8 - Regression Modeling.

about 2 months ago

Final Project (Part 1)

2 months ago

Data Dive 13 _ Communicating Statistics

2 months ago

Data Dive 12 - Time Series Modeling

This notebook explores the time dimension of the statistical capacity dataset, which tracks country-level scores across indicators of data quality and accessibility from 2004 to 2023. The focus here is on how one of those indicators, data_products_score, which measures the quality and availability of statistical outputs produced by a country, has evolved globally over nearly two decades. Because the dataset records one observation per country per year, the analysis works at the level of global yearly averages, which produces a clean annual time series from 2005 to 2023.

2 months ago
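The aggregation step described for Data Dive 12 (collapsing country-year rows into one global average per year) can be sketched roughly as follows. This is a minimal illustration in standard-library Python, not the notebook's own code; the rows and the helper name `yearly_global_mean` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical country-year rows: (country, year, data_products_score).
rows = [
    ("A", 2021, 60.0), ("B", 2021, 80.0),
    ("A", 2022, 64.0), ("B", 2022, 82.0),
    ("A", 2023, 70.0), ("B", 2023, 84.0),
]

def yearly_global_mean(records):
    """Collapse country-year rows into one mean score per year."""
    by_year = defaultdict(list)
    for _country, year, score in records:
        if score is not None:  # skip countries with no score that year
            by_year[year].append(score)
    # Sort by year so the result reads as an annual time series.
    return {year: mean(scores) for year, scores in sorted(by_year.items())}

series = yearly_global_mean(rows)
print(series)  # {2021: 70.0, 2022: 73.0, 2023: 77.0}
```

Averaging across countries gives one value per year, which is what makes the series "clean", at the cost of hiding differences between countries.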

Week 11 Data Dive - Generalized Linear Models (Part 2)

This notebook builds on the logistic regression model from Week 10 by returning to a continuous outcome and applying tools from this week's lab: model comparison (AIC, BIC, ANOVA) and multicollinearity diagnostics (VIF). The response variable is data_products_score, which measures the quality and availability of statistical outputs produced by a country. Two sub-indicators are used as predictors: data_use_score and data_services_score. Both reflect distinct but related dimensions of statistical capacity, making them natural candidates for a multiple linear regression model. To simplify the analysis, only the 2023 cross-sectional snapshot is used. This avoids the repeated-measures complexity that would arise from using multiple years and ensures the independence assumption for linear regression is more plausible.

3 months ago
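For the multicollinearity check mentioned in the Week 11 notebook, a model with exactly two predictors has a simple special case: each predictor's VIF equals 1 / (1 - r^2), where r is the Pearson correlation between the two predictors. The sketch below assumes that two-predictor setting; the score values are made up and the function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical 2023 values for data_use_score and data_services_score.
data_use = [50.0, 60.0, 70.0, 80.0, 90.0]
data_services = [55.0, 58.0, 75.0, 77.0, 95.0]
print(round(vif_two_predictors(data_use, data_services), 2))
```

A VIF of 1 means the predictors are uncorrelated; larger values mean more of one predictor's variance is explained by the other. With more than two predictors this shortcut no longer applies, and each VIF comes from regressing one predictor on all the others.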

Data Dive 10 — Generalized Linear Models

This notebook extends the regression analysis from previous weeks by introducing a generalized linear model (GLM), specifically logistic regression, to model a binary outcome. Prior analyses focused on continuous outcomes using linear regression — predicting overall_score from sub-indicator scores and income group. Logistic regression is needed here because the response variable is binary. Rather than modeling a score, the goal is to model the probability that a country is high-performing, defined as having an overall_score above 85. This threshold was chosen because it represents a meaningfully high level of statistical performance, well above the dataset mean, and results in a reasonably balanced split: 48 high-performing countries and 138 that fall below the threshold. The three predictors used are data_use_score, data_services_score, and data_infrastructure_score — sub-indicators that reflect distinct dimensions of a country’s statistical capacity and were not directly used as outcome variables in prior models. The dataset is the World Bank Statistical Performance Indicators dataset, covering 217 countries from 2004 to 2023. The analysis uses only the 2023 cross-sectional snapshot.

3 months ago
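The binary outcome described for Data Dive 10 can be illustrated with a short sketch. The threshold of 85 and the counts 48 and 138 come from the description above; the example scores and the function name are hypothetical, and "above 85" is read here as a strict inequality.

```python
THRESHOLD = 85  # cut-off stated in the notebook description

def is_high_performing(overall_score):
    """Return 1 if the score is strictly above the threshold, else 0."""
    return 1 if overall_score > THRESHOLD else 0

# Hypothetical overall_score values for five countries.
scores = [91.2, 84.9, 85.0, 62.3, 88.7]
labels = [is_high_performing(s) for s in scores]
print(labels)  # [1, 0, 0, 0, 1]

# Class share implied by the counts reported in the description.
high, low = 48, 138
share_high = high / (high + low)
print(round(share_high, 3))  # 0.258
```

On the reported counts, roughly 26% of countries fall in the high-performing class, which is the baseline rate a fitted logistic model's predicted probabilities would be compared against.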

Week 9 Data Dive — Regression Diagnostics

This notebook extends the regression analysis from Week 8 by expanding the simple linear regression model and applying diagnostic tools to evaluate whether the model's assumptions are reasonably met. In Week 8, a simple linear regression was fit using data_products_score as the sole predictor of overall_score. That model explained approximately 67% of the variation in overall statistical performance across countries. This week, 1–3 additional variables are introduced to improve the model, and the 5 diagnostic plots discussed in class are used to assess its validity. The central goal is to ensure that any conclusions drawn from the model rest on a valid foundation. The dataset is the World Bank Statistical Performance Indicators dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. The analysis continues to use the 2023 cross-sectional snapshot.

3 months ago

Data Dive 8 — Regression Modeling

This notebook continues the analysis of the World Bank Statistical Performance Indicators (SPI) dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. Each row represents one country-year observation and includes multiple measures of statistical capacity, such as data use, data products, and infrastructure. Building on the hypothesis testing work from Week 7, this analysis shifts toward modeling. Two questions are addressed. First, does income group explain differences in overall statistical performance across countries? This is tested using a one-way ANOVA. Second, can a single continuous sub-indicator, data_products_score, predict a country's overall performance? This is explored through simple linear regression. These two approaches move beyond comparing group means toward understanding the structure of the relationship between statistical capacity indicators.

3 months ago
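The simple linear regression in Data Dive 8 has a closed-form least-squares solution, sketched below in standard-library Python. This is an illustration of the technique rather than the notebook's code; the function name and the toy data are hypothetical, with x standing in for data_products_score and y for overall_score.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 is the share of variation in y explained by the fitted line.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Toy data lying exactly on y = 1 + 2x, so the fit is perfect.
slope, intercept, r_squared = fit_simple_ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(slope, intercept, r_squared)  # 2.0 1.0 1.0
```

The R^2 returned here is the same quantity the Week 9 description refers to when it says the model explained approximately 67% of the variation in overall_score.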

Week 7 Data Dive — Hypothesis Testing

This notebook continues the analysis of the World Bank Statistical Performance Indicators (SPI) dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. Each row represents one country-year observation and includes multiple measures of statistical capacity, such as data use, production, and infrastructure. This week, hypothesis testing is used to examine whether meaningful differences in statistical performance exist between income groups. Specifically, an A/B comparison is made between High income countries (Group A) and Low income countries (Group B) across two performance indicators. Two hypothesis testing frameworks are applied. Hypothesis 1 uses the Neyman–Pearson framework, which involves pre-specified error rates, power analysis, and a reject or fail-to-reject decision. Hypothesis 2 uses Fisher’s significance testing framework, which focuses on interpreting the p-value and assessing the strength of evidence against the null hypothesis. Understanding the relationship between income level and statistical capacity has policy relevance, as it may inform decisions related to development funding, technical assistance, and governance priorities.

4 months ago
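The Week 7 description does not say which test statistic was used to compare the two income groups. One common choice for comparing two group means with possibly unequal variances is Welch's t statistic, sketched below as an assumption rather than as the notebook's method; the group values and the function name are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means (A minus B)."""
    # statistics.variance is the sample variance (n - 1 denominator).
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical overall scores for High income (A) and Low income (B) countries.
high_income = [88.0, 92.0, 85.0, 90.0]
low_income = [55.0, 61.0, 48.0, 66.0]
print(round(welch_t(high_income, low_income), 2))
```

Under the Neyman–Pearson framework this statistic would be compared against a critical value fixed in advance; under Fisher's framework it would be converted to a p-value and read as a measure of evidence against the null hypothesis.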

Week 6 Data Dive — Confidence Intervals

In this data dive, I explore the Statistical Performance Indicators (SPI) dataset from the World Bank, accessed via TidyTuesday. Each row in this dataset represents a country–year observation, tracking how well countries manage and use statistical data across multiple dimensions over the years 2004–2023.

4 months ago

Data Dive 5 - Documentation

In this data dive, I examine the Statistical Performance Indicators dataset from the World Bank to identify unclear elements in the data and documentation. The dataset contains 4,340 rows and 12 columns, with each row representing a country in a specific year. The goal is to critically evaluate what's clear, what's unclear, and what issues might affect analysis.

4 months ago

mtazike

Mahya

Recently Published

Final Project Link

Final Project Link

Final Project (Final Version)

Final Project (Final)

Final Project

Final Project (Part 2)

Week 14 - Model Critique

Final Project (Part 1)

Data Dive 13 _ Communicating Statistics

Data Dive 12 - Time Series Modeling

Week 11 Data Dive - Generalized Linear Models (Part 2)

Data Dive 10 — Generalized Linear Models

Week 9 Data Dive — Regression Diagnostics

Data Dive 8 — Regression Modeling

Week 7 Data Dive — Hypothesis Testing

Week 6 Data Dive — Confidence Intervals

Data Dive 5 - Documentation

Data Dive 4 - Sampling and Drawing Conclusions

Week 3 Data Dive - Group By and Probabilities

Data Dive
