Recently Published
Data Dive 10 — Generalized Linear Models
This notebook extends the regression analysis from previous weeks by introducing a generalized linear model (GLM), specifically logistic regression, to model a binary outcome.
Prior analyses focused on continuous outcomes using linear regression — predicting overall_score from sub-indicator scores and income group. Logistic regression is needed here because the response variable is binary. Rather than modeling a score, the goal is to model the probability that a country is high-performing, defined as having an overall_score above 85.
This threshold was chosen because it represents a meaningfully high level of statistical performance, well above the dataset mean, and still leaves enough cases in each class to fit a model: 48 high-performing countries and 138 that fall below the threshold (roughly a 26/74 split). The three predictors used are data_use_score, data_services_score, and data_infrastructure_score, sub-indicators that reflect distinct dimensions of a country’s statistical capacity and were not directly used as outcome variables in prior models.
The dataset is the World Bank Statistical Performance Indicators dataset, covering 217 countries from 2004 to 2023. The analysis uses only the 2023 cross-sectional snapshot.
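The setup described above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the data are synthetic stand-ins rather than SPI values, the predictors are standardized for numerical stability, and `fit_logistic` is a hypothetical helper that fits the model by plain gradient ascent instead of a statistics library.

```python
import math
import random

THRESHOLD = 85  # cutoff stated in the text: high-performing means overall_score > 85

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(beta, x):
    """P(high-performing) for one row of standardized predictors."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Estimate intercept and slopes by gradient ascent on the mean log-likelihood."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(beta, xi)
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / sd for v in column]

# Synthetic stand-in data (NOT the SPI values): 186 rows, three sub-scores each.
rng = random.Random(0)
rows = []
for _ in range(186):
    base = rng.uniform(40, 100)
    subs = [min(100.0, max(0.0, base + rng.gauss(0, 8))) for _ in range(3)]
    overall = sum(subs) / 3 + rng.gauss(0, 3)
    rows.append((subs, overall))

# Binary response: 1 if the (synthetic) overall score exceeds the threshold.
y = [1 if overall > THRESHOLD else 0 for _, overall in rows]
cols = [standardize([subs[j] for subs, _ in rows]) for j in range(3)]
X = [list(r) for r in zip(*cols)]

beta = fit_logistic(X, y)
odds_ratios = [math.exp(b) for b in beta[1:]]  # per one-SD increase in a predictor
print("coefficients:", [round(b, 2) for b in beta])
print("odds ratios:", [round(o, 2) for o in odds_ratios])
```

Exponentiating a slope gives the multiplicative change in the odds of being high-performing for a one-standard-deviation increase in that predictor, which is usually easier to report than the raw log-odds coefficient.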
Week 9 Data Dive — Regression Diagnostics
This notebook extends the regression analysis from Week 8 by expanding the simple linear regression model and applying diagnostic tools to evaluate whether the model's assumptions are reasonably met.
In Week 8, a simple linear regression was fit using data_products_score as the sole predictor of overall_score. That model explained approximately 67% of the variation in overall statistical performance across countries. This week, one to three additional predictors are introduced to improve the model, and the five diagnostic plots discussed in class are used to assess its validity. The central goal is to ensure that any conclusions drawn from the model rest on a valid foundation.
The dataset is the World Bank Statistical Performance Indicators dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. The analysis continues to use the 2023 cross-sectional snapshot.
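As a rough illustration of what diagnostic plots are built from, the sketch below computes residuals, leverage, standardized residuals, and Cook's distance by hand. It assumes a single predictor so the algebra stays closed-form (the notebook's model has more), the data are made up, and `simple_ols_diagnostics` is a hypothetical helper name.

```python
import math

def simple_ols_diagnostics(x, y):
    """Fit y = a + b*x by least squares and return per-observation diagnostics."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]

    p = 2  # estimated parameters: intercept and slope
    mse = sum(e ** 2 for e in resid) / (n - p)

    # Leverage: how far each x value sits from the mean of x.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Internally studentized residuals.
    std_resid = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, leverage)]
    # Cook's distance combines residual size and leverage.
    cooks_d = [(r ** 2 / p) * h / (1 - h) for r, h in zip(std_resid, leverage)]

    return {
        "fitted": fitted,
        "resid": resid,
        "leverage": leverage,
        "std_resid": std_resid,
        "cooks_d": cooks_d,
    }

# Made-up scores; the last point has an extreme x and sits far below the trend.
x = [50, 55, 60, 65, 70, 75, 80, 95]
y = [52, 58, 57, 66, 71, 73, 82, 70]

diag = simple_ols_diagnostics(x, y)
most_influential = max(range(len(x)), key=lambda i: diag["cooks_d"][i])
print("most influential observation index:", most_influential)
```

Plotting `resid` against `fitted` checks linearity and constant variance, while `std_resid` against `leverage` highlights observations that pull the fit toward themselves.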
Data Dive 8 — Regression Modeling
This notebook continues the analysis of the World Bank Statistical Performance Indicators (SPI) dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. Each row represents one country-year observation and includes multiple measures of statistical capacity, such as data use, data products, and infrastructure.
Building on the hypothesis testing work from Week 7, this analysis shifts toward modeling. Two questions are addressed. First, does income group explain differences in overall statistical performance across countries? This is tested using a one-way ANOVA. Second, can a single continuous sub-indicator, data_products_score, predict a country's overall performance? This is explored through simple linear regression.
These two approaches move beyond comparing group means toward understanding the structure of the relationship between statistical capacity indicators.
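The two statistics behind these questions can be computed directly. The following is a minimal sketch with made-up numbers: `one_way_anova_f` returns the F statistic for a one-way ANOVA, and `r_squared` returns the share of variance explained by a simple linear regression. Both names are illustrative and not taken from the notebook.

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def r_squared(x, y):
    """R^2 of a simple linear regression, equal to the squared correlation."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

# Made-up overall scores for three hypothetical income groups.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_anova_f(groups)

# Made-up predictor and outcome values.
r2 = r_squared([1, 2, 3, 4], [1, 3, 2, 4])
print("F =", f_stat, "R^2 =", r2)
```

A large F indicates that group means differ by more than within-group noise would suggest, and an R^2 near 1 indicates that the single predictor accounts for most of the variation in the outcome.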
Week 7 Data Dive — Hypothesis Testing
This notebook continues the analysis of the World Bank Statistical Performance Indicators (SPI) dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. Each row represents one country-year observation and includes multiple measures of statistical capacity, such as data use, data products, and infrastructure.
This week, hypothesis testing is used to examine whether meaningful differences in statistical performance exist between income groups. Specifically, A/B testing compares high-income countries (Group A) and low-income countries (Group B) across two performance indicators.
Two hypothesis testing frameworks are applied. Hypothesis 1 uses the Neyman–Pearson framework, which involves pre-specified error rates, power analysis, and a reject or fail-to-reject decision. Hypothesis 2 uses Fisher’s significance testing framework, which focuses on interpreting the p-value and assessing the strength of evidence against the null hypothesis.
Understanding the relationship between income level and statistical capacity has policy relevance, as it may inform decisions related to development funding, technical assistance, and governance priorities.
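A sketch of the two ingredients of a Neyman–Pearson style comparison follows. It rests on assumptions the text does not state: a Welch two-sample t statistic for the group comparison, a normal approximation for the sample-size calculation, and illustrative values for alpha, power, and effect size.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# Made-up scores for Group A and Group B.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_stat, df = welch_t(group_a, group_b)

# Assumed planning values: medium effect (d = 0.5), alpha = 0.05, power = 0.80.
needed = n_per_group(0.5)
print("t =", t_stat, "df =", df, "n per group =", needed)
```

Under the Neyman–Pearson framework the error rates and the required sample size are fixed before looking at the data, and the test statistic is then compared against the critical value for a reject or fail-to-reject decision. Under Fisher's framework the same statistic would instead be converted to a p-value and read as a measure of evidence.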
Week 6 Data Dive — Confidence Intervals
In this data dive, I explore the Statistical Performance Indicators (SPI) dataset from the World Bank, accessed via TidyTuesday. Each row in this dataset represents a country–year observation, tracking how well countries manage and use statistical data across multiple dimensions over the years 2004–2023.
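For illustration, a confidence interval for a mean can be sketched as below. This is a normal-approximation interval on made-up values, not the notebook's method, and `mean_ci` is a hypothetical helper. With small samples a t-based or bootstrap interval would usually be preferred.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(values) / math.sqrt(len(values))
    center = mean(values)
    return center - half_width, center + half_width

# Made-up scores, not SPI values.
scores = [60, 70, 80, 90]
low95, high95 = mean_ci(scores)
low99, high99 = mean_ci(scores, level=0.99)
print("95% CI:", (round(low95, 2), round(high95, 2)))
print("99% CI:", (round(low99, 2), round(high99, 2)))
```

Raising the confidence level widens the interval, since a larger range is needed to capture the true mean more often.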
Data Dive 5 - Documentation
In this data dive, I examine the Statistical Performance Indicators dataset from the World Bank to identify unclear elements in the data and documentation. The dataset contains 4,340 rows and 12 columns, with each row representing a country in a specific year. The goal is to critically evaluate what's clear, what's unclear, and what issues might affect analysis.
Data Dive 4 - Sampling and Drawing Conclusions
In this data dive, I will explore how different random samples from the same dataset can produce varying results. This helps demonstrate how sampling variability can influence the conclusions we draw from data.
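The idea can be sketched with repeated random samples from a synthetic population. The population below is randomly generated and only borrows the 4,340-row size mentioned elsewhere on this page. The sample sizes and number of draws are arbitrary choices for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)

# Synthetic population of scores (NOT the real dataset).
population = [rng.uniform(20, 100) for _ in range(4340)]

def sample_means(values, n, draws, rng):
    """Mean of each of `draws` simple random samples of size n (without replacement)."""
    return [mean(rng.sample(values, n)) for _ in range(draws)]

def spread(means):
    """Range of the sample means: a crude measure of sampling variability."""
    return max(means) - min(means)

small = sample_means(population, n=25, draws=200, rng=rng)
large = sample_means(population, n=400, draws=200, rng=rng)

print("population mean:", round(mean(population), 2))
print("spread of means, n=25: ", round(spread(small), 2))
print("spread of means, n=400:", round(spread(large), 2))
```

Every sample gives a slightly different mean, and the small samples scatter much more widely around the population mean than the large ones, which is why conclusions from a single small sample deserve caution.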
Week 3 Data Dive - Group By and Probabilities
This notebook explores group-by analysis and probability concepts using a country–year dataset. It examines regions, income levels, population summaries, and region–income combinations to identify rare, common, and missing patterns.
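A minimal sketch of the group-by and probability step, using made-up (region, income) records instead of the real data: counts per combination are turned into empirical probabilities, and combinations that never occur are listed separately.

```python
from collections import Counter
from itertools import product

# Made-up (region, income group) records, one per country-year row.
records = (
    [("Europe", "High")] * 4
    + [("Africa", "Low")] * 3
    + [("Africa", "High")] * 1
    + [("Asia", "High")] * 2
    + [("Asia", "Low")] * 2
)

# Group by the (region, income) pair and count rows in each group.
counts = Counter(records)
total = sum(counts.values())

# Empirical probability of drawing a row from each combination.
probs = {combo: c / total for combo, c in counts.items()}

# Combinations that are possible in principle but absent from the data.
regions = sorted({region for region, _ in records})
incomes = sorted({income for _, income in records})
missing = [combo for combo in product(regions, incomes) if combo not in counts]

rarest = min(probs, key=probs.get)
most_common = max(probs, key=probs.get)
print("rarest:", rarest, "most common:", most_common, "missing:", missing)
```

Separating rare combinations from missing ones matters because a rare group has a small but nonzero probability, while a missing group cannot be estimated from the data at all.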
Data Dive
Summary statistics and visual exploration of the dataset