Recently Published
Combined Statistical Approaches for the Analysis of EgaEnEstellaQts Streamflow: From ROC Curves to Goodness-of-Fit Measures
This script illustrates a combined approach to evaluating hydrological data (here, EgaEnEstellaQts, a series of streamflow measurements on the Ega River at the Estella station in Spain). First, it applies a ROC (Receiver Operating Characteristic) analysis to test the discriminative capacity of the predictions. It then computes goodness-of-fit measures to compare observations and simulations, supported by graphical visualizations such as fitted curves, scatter plots, and time series. Finally, the code provides descriptive statistics and an analysis of the data distribution. The purpose of this script is to combine statistical and graphical tools in order to assess the quality of the simulations, detect potential biases, and give an overall view of model performance when applied to environmental or hydrological datasets.
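The two evaluation steps described above can be sketched roughly as follows. This is a minimal Python illustration, not the original R script: the flow values, the high-flow threshold of 20, the choice of metrics (RMSE, Nash-Sutcliffe efficiency, percent bias), and all function names are assumptions made for the example.

```python
import math

def roc_auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rmse(obs, sim):
    """Root-mean-square error between observations and simulations."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / spread

def pbias(obs, sim):
    """Percent bias; with this sign convention, positive means overestimation."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)

# Hypothetical streamflow values, for illustration only.
observed = [12.0, 30.0, 18.0, 45.0, 9.0, 27.0]
simulated = [10.0, 33.0, 20.0, 40.0, 11.0, 25.0]

# Assumed binary event for the ROC step: observed flow above 20.
high_flow = [1 if q > 20.0 else 0 for q in observed]

print("AUC:", roc_auc(high_flow, simulated))
print("RMSE:", round(rmse(observed, simulated), 3))
print("NSE:", round(nse(observed, simulated), 3))
print("PBIAS (%):", round(pbias(observed, simulated), 2))
```

The ROC step needs a binary outcome, so a continuous series such as streamflow has to be thresholded first; the threshold used here is arbitrary.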
Performance Evaluation of Climate and Precipitation Models Using Taylor Diagrams in R
Project Objective
This analysis evaluates the performance of various precipitation and climate models (CHIRPS, PERSIANN, TAMSATV3, ARCV2, ERA5) against ground-based observations using Taylor diagrams in R. Taylor diagrams provide a concise visualization of model performance by comparing correlation, standard deviation, and centered root-mean-square error (RMSE). The analysis uses synthetic data representing annual and seasonal (JF, MAM, JJAS, OND) precipitation measurements. By leveraging the plotrix and openair packages, we assess the models’ ability to replicate observed precipitation patterns, offering insights into their accuracy and reliability for climate studies.
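A Taylor diagram plots three statistics that are linked by a law-of-cosines relation, which is what allows them to share one figure. The Python sketch below computes them for one reference series and one model series and checks that relation numerically. The precipitation values and the function name `taylor_stats` are hypothetical; the original analysis relies on the plotrix and openair packages in R.

```python
import math

def taylor_stats(reference, model):
    """Return the standard deviations, correlation, and centered RMSE."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_mod = sum(model) / n
    ref_anom = [r - mean_ref for r in reference]
    mod_anom = [m - mean_mod for m in model]
    # Population (divide-by-n) standard deviations keep the identity exact.
    sd_ref = math.sqrt(sum(a * a for a in ref_anom) / n)
    sd_mod = math.sqrt(sum(b * b for b in mod_anom) / n)
    corr = sum(a * b for a, b in zip(ref_anom, mod_anom)) / (n * sd_ref * sd_mod)
    # Centered RMSE: means are removed before differencing.
    crmse = math.sqrt(sum((b - a) ** 2 for a, b in zip(ref_anom, mod_anom)) / n)
    return {"sd_ref": sd_ref, "sd_model": sd_mod, "corr": corr, "crmse": crmse}

# Hypothetical annual precipitation totals (mm), for illustration only.
observed = [80.0, 120.0, 95.0, 140.0, 110.0]
product = [70.0, 130.0, 100.0, 150.0, 90.0]

stats = taylor_stats(observed, product)

# Identity behind the diagram geometry:
# crmse^2 = sd_ref^2 + sd_model^2 - 2 * sd_ref * sd_model * corr
rhs = (stats["sd_ref"] ** 2 + stats["sd_model"] ** 2
       - 2 * stats["sd_ref"] * stats["sd_model"] * stats["corr"])
print(stats)
print("identity holds:", math.isclose(stats["crmse"] ** 2, rhs))
```

Because the centered RMSE removes the means, a Taylor diagram says nothing about overall bias, which has to be reported separately.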
Comprehensive Data Visualization: A Multidimensional Exploration
This R report uses R’s plotting libraries to explore built-in datasets (iris, mtcars, Titanic, volcano) through scatter plots, boxplots, histograms, 3D scatter plots, and interactive visualizations. It highlights patterns such as species clustering in iris and correlations among performance variables in mtcars. Using ggplot2, plotly, and Chart.js, it produces publication-ready HTML output for data scientists, analysts, and educators who need to communicate results clearly and interactively.
Bayesian Analysis of Precipitation Fluxes in a Climatic Context Using MixSIAR
This module introduces the application of stable water isotopes (δ¹⁸O and δD) to quantify the origin of precipitation and hydrological inputs under varying climatic conditions. Using Bayesian mixing models implemented in MixSIAR (R package), participants will learn to:
Prepare and structure isotopic datasets from precipitation, potential water sources, and discrimination factors.
Apply a Bayesian mixing model to estimate the relative contributions of different hydrological sources (rainfall, snowmelt, groundwater).
Evaluate model outputs through statistical summaries and convergence diagnostics (e.g., MCMC trace plots, Gelman–Rubin statistics) to ensure robustness.
Visualize estimated source proportions and their posterior distributions, enabling clear interpretation in both climatic and hydrological studies.
By the end of the module, participants will be able to trace water fluxes within a watershed, assess the climatic influence on precipitation sources, and communicate their findings effectively using results derived from Bayesian analysis.
In this training context, simulated datasets are employed to provide order-of-magnitude examples and hands-on practice with MixSIAR.
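δ¹⁸O (delta-O-18): The deviation, in per mil (‰), of the oxygen-18 to oxygen-16 ratio (¹⁸O/¹⁶O) in a water sample from the same ratio in a reference standard (typically VSMOW).
δD (δ²H, delta-deuterium): The corresponding per mil deviation of the hydrogen-2 to hydrogen-1 ratio (²H/¹H) in a water sample from the reference standard.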
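To make the delta notation and the idea of source apportionment concrete, here is a small Python sketch. It is deliberately simpler than MixSIAR: it solves a deterministic two-source mass balance for a single tracer, with no priors, no uncertainty, and no MCMC. The VSMOW ratios are commonly quoted reference values, and the δ¹⁸O values for rainfall, snowmelt, and stream water are hypothetical.

```python
# Commonly quoted isotope ratios of the VSMOW reference standard.
VSMOW_18O_16O = 2005.20e-6
VSMOW_2H_1H = 155.76e-6

def delta_permil(r_sample, r_standard):
    """Delta value in per mil: relative deviation of a ratio from the standard."""
    return (r_sample / r_standard - 1.0) * 1000.0

def two_source_fraction(delta_mix, delta_a, delta_b):
    """Fraction of source A in a mixture of A and B (single-tracer mass balance)."""
    if delta_a == delta_b:
        raise ValueError("sources must differ in isotopic composition")
    return (delta_mix - delta_b) / (delta_a - delta_b)

# Hypothetical delta-18O values (per mil), for illustration only.
rainfall, snowmelt, stream = -5.0, -15.0, -9.0

f_rain = two_source_fraction(stream, rainfall, snowmelt)
print("rainfall fraction:", f_rain)
print("snowmelt fraction:", 1.0 - f_rain)
```

With three sources such as rainfall, snowmelt, and groundwater, a single tracer no longer determines the fractions uniquely. This is one reason to combine δ¹⁸O with δD and to estimate the proportions probabilistically.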
Machine Learning Approaches for Predicting Temperature using the New York AirQuality Dataset (May–September 1973)
The airquality dataset in R contains daily air quality measurements collected in New York City from May to September 1973. It includes ozone concentration, solar radiation, wind speed, and daily temperature. In this analysis, we focus on predicting temperature, a key climatic variable with implications for environmental studies, health impacts, and energy demand forecasting. We apply two machine learning models: Random Forest, an ensemble method that can capture non-linear relationships between predictors, and a shallow neural network, which provides an alternative regression approach based on layers of interconnected units. Comparing the two models lets us assess their predictive performance and the relative importance of the meteorological variables in explaining temperature variations over the study period.
Comparative Analysis of Predictor Importance for Rainfall in Climatic Data: Relative Weights Analysis (RWA), Machine Learning, and Statistical Methods
This code evaluates and compares the influence of various climatic variables (temperature, pressure, humidity, wind characteristics, sunshine, cloud cover, evapotranspiration, soil moisture) on rainfall. By applying Relative Weights Analysis (RWA), iopsych relative weights, relimp (relative importance in linear regression), and Random Forest variable importance, it identifies which predictors contribute most to rainfall variability. The results indicate the dominant climatic drivers, allowing researchers to prioritize variables for predictive modeling and better interpret their impact on rainfall patterns. Comparing several methods helps check whether the variable importance rankings are consistent rather than an artifact of one technique.
Linear and Machine Learning Models for Rainfall Prediction (M5P Trees)
The objective of this code is to predict rainfall using simulated climate variables (temperature, pressure, humidity, wind) through various modeling approaches, ranging from linear and generalized regression to advanced models like M5P regression trees. The focus is on building and comparing predictive models, validating their performance using train/test splits and cross-validation, and quantitatively evaluating predictions with metrics such as RMSE and R². This code enables the exploration of model robustness, identification of the most influential variables, and visualization of model fit through plots comparing observed and predicted values, making the analysis both educational and applicable to real climate datasets.
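The validation workflow described above (fit, predict on held-out data, score with RMSE and R²) can be sketched in Python as follows. For brevity the sketch uses a single-predictor least-squares line rather than M5P trees, and the simulated humidity and rainfall values, the seeds, and the helper names are all assumptions made for illustration.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r_squared(obs, pred):
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def k_fold_rmse(x, y, k=5, seed=42):
    """Mean held-out RMSE over k folds built from a shuffled index."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        pred = [intercept + slope * x[i] for i in fold]
        scores.append(rmse([y[i] for i in fold], pred))
    return sum(scores) / k

# Simulated data (hypothetical): rainfall rises with humidity, plus noise (sd = 3).
rng = random.Random(1)
humidity = [rng.uniform(40.0, 95.0) for _ in range(60)]
rain = [0.8 * h - 20.0 + rng.gauss(0.0, 3.0) for h in humidity]

slope, intercept = fit_line(humidity, rain)
fitted = [intercept + slope * h for h in humidity]
print("in-sample R2:", round(r_squared(rain, fitted), 3))
print("5-fold CV RMSE:", round(k_fold_rmse(humidity, rain), 3))
```

The cross-validated RMSE is the more honest of the two numbers, since the in-sample R² is computed on the same data used to fit the line.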
Meteorological Variable Relationships and Regression Analysis
The objective of this project is to identify relationships between meteorological variables. It has two main goals: first, to explore correlations between climatic factors and test their significance; second, to build and compare multiple linear regression models to determine the most relevant predictors of rainfall. This approach allows testing variable selection methods and performance criteria (adjusted R², MSE, AIC, Mallows’ Cp, etc.) in a controlled setting, providing a pedagogical exercise and a methodological foundation transferable to real-world climate data analysis.
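The model comparison criteria named above can be written out in a few lines. The Python sketch below uses the standard textbook formulas; the function names and the numeric inputs are hypothetical, and the AIC is given up to an additive constant, so its absolute value differs from what R's AIC() reports while differences between models fitted to the same data are unaffected.

```python
import math

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def aic_gaussian(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant.

    k is the number of estimated coefficients, intercept included.
    """
    return n * math.log(rss / n) + 2 * k

def mallows_cp(rss_sub, sigma2_full, n, k_sub):
    """Mallows' Cp for a sub-model with k_sub coefficients (intercept included).

    sigma2_full is the residual variance estimated from the full model.
    """
    return rss_sub / sigma2_full - n + 2 * k_sub

# Hypothetical numbers: 50 observations, full model with 4 coefficients.
n = 50
rss_full, k_full = 92.0, 4
sigma2_full = rss_full / (n - k_full)

# A hypothetical sub-model that drops one predictor and fits slightly worse.
rss_sub, k_sub = 97.0, 3

print("adjusted R2 (R2 = 0.80, 3 predictors):", round(adjusted_r2(0.80, n, 3), 4))
print("AIC full vs sub:", round(aic_gaussian(rss_full, n, k_full), 2),
      round(aic_gaussian(rss_sub, n, k_sub), 2))
print("Cp full vs sub:", mallows_cp(rss_full, sigma2_full, n, k_full),
      round(mallows_cp(rss_sub, sigma2_full, n, k_sub), 2))
```

By construction the full model always has Cp equal to its number of coefficients, so the criterion is only informative for sub-models, where values close to `k_sub` suggest little bias from the omitted predictors.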
Multi-Objective Optimization Analysis with NSGA-II
This analysis employs the Non-Dominated Sorting Genetic Algorithm II (NSGA-II), a robust multi-objective evolutionary algorithm, to address complex optimization problems. The study includes two scientifically relevant case studies:
1. Car Example: Optimizes fuel consumption and maximum speed based on vehicle weight and power. This is critical in automotive engineering for designing sustainable vehicles, balancing environmental impact (fuel efficiency) with performance (speed), which influences market competitiveness and regulatory compliance.
2. DRASTIC Index Example: Optimizes the weights of the DRASTIC parameters (Depth to water, net Recharge, Aquifer media, Soil media, Topography, Impact of vadose zone, and hydraulic Conductivity) to maximize correlation with nitrate concentration (NO₃) while minimizing Root Mean Square Error (RMSE). This enhances groundwater vulnerability assessment, providing valuable insights for environmental management and policy-making in regions prone to contamination.
The analysis generates Pareto fronts to visualize trade-offs, computes optimal solutions, and exports results as CSV files for further scientific evaluation.
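The Pareto fronts mentioned above rest on the notion of dominance, which the following Python sketch illustrates for the car example. It covers only the non-dominated filtering step; NSGA-II additionally ranks successive fronts, uses crowding distance, and evolves the population through selection, crossover, and mutation. The candidate designs are hypothetical, and the objective directions (minimize fuel consumption, maximize top speed) are an assumption about how the trade-off is framed.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one.

    All objectives are expressed as quantities to minimize.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical designs: (fuel consumption in L/100 km, top speed in km/h).
designs = [(5.0, 150.0), (6.0, 180.0), (7.0, 170.0), (8.0, 210.0), (5.5, 140.0)]

# Speed is to be maximized, so negate it to obtain a minimization problem.
objectives = [(fuel, -speed) for fuel, speed in designs]

front = [(fuel, -neg_speed) for fuel, neg_speed in pareto_front(objectives)]
print(front)
```

In this toy set, the design (7.0, 170.0) is dropped because (6.0, 180.0) is both more economical and faster; the remaining designs each represent a different compromise.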
Spatial Autocorrelation Patterns and Inequality Analysis
This analysis examines spatial autocorrelation patterns in synthetic raster data using the Getis-Ord Gi* statistic to identify hotspots and coldspots, and measures spatial inequality with the Gini index. It involves generating a synthetic raster (representing NDVI), visualizing it as an NDVI map, creating a shapefile, extracting raster values, computing spatial neighbors, calculating Gi* statistics, classifying hotspots, visualizing hotspots with a thematic map, and exporting outputs as shapefiles and rasters. The analysis highlights spatial clustering and inequality in a simulated dataset.
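Both statistics used in this analysis are short enough to write out directly. The Python sketch below computes the Gini index of a set of values and Gi* z-scores on a small grid. The 5 x 5 grid of NDVI-like values is hypothetical, and the neighborhood definition (binary weights over the 3 x 3 window around each cell, the cell itself included) is an assumption; the original analysis may define neighbors differently.

```python
import math

def gini(values):
    """Gini index from the mean absolute difference between all pairs."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    total_diff = sum(abs(a - b) for a in values for b in values)
    return total_diff / (2 * n * n * mean)

def gi_star(grid):
    """Getis-Ord Gi* z-scores with binary 3x3-window weights (cell included)."""
    rows, cols = len(grid), len(grid[0])
    flat = [v for row in grid for v in row]
    n = len(flat)
    mean = sum(flat) / n
    s = math.sqrt(sum(v * v for v in flat) / n - mean ** 2)
    if s == 0:
        raise ValueError("constant grid: Gi* is undefined")
    z = []
    for r in range(rows):
        z_row = []
        for c in range(cols):
            window = [grid[i][j]
                      for i in range(max(0, r - 1), min(rows, r + 2))
                      for j in range(max(0, c - 1), min(cols, c + 2))]
            w = len(window)  # binary weights: sum(w) equals sum(w^2)
            numerator = sum(window) - mean * w
            denominator = s * math.sqrt((n * w - w * w) / (n - 1))
            z_row.append(numerator / denominator)
        z.append(z_row)
    return z

# Hypothetical NDVI-like raster with high values clustered in the top-left corner.
ndvi = [
    [0.8, 0.8, 0.7, 0.2, 0.2],
    [0.8, 0.9, 0.7, 0.2, 0.1],
    [0.7, 0.7, 0.5, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.1],
    [0.2, 0.1, 0.2, 0.1, 0.1],
]

z_scores = gi_star(ndvi)
print("Gini index:", round(gini([v for row in ndvi for v in row]), 3))
print("z at (1, 1):", round(z_scores[1][1], 2))
print("z at (3, 3):", round(z_scores[3][3], 2))
```

Cells are then typically classified by comparing each z-score with normal quantiles, for example |z| > 1.96 for the 95% level. The Gini index summarizes inequality in a single number but carries no information about where the high values sit.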
Monte Carlo Simulation and Bootstrap
This analysis estimates the levelized cost of energy (LCOE) for wind power, using Monte Carlo simulation to account for uncertainties in key parameters (investment cost, operation and maintenance cost, interest rate, capacity factor, and lifetime) and bootstrap methods to compute confidence intervals for the mean LCOE. It includes wind speed modeling, vertical extrapolation, power density calculation, Weibull distribution fitting, and correlation analysis.
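A compact Python sketch of the Monte Carlo and bootstrap steps is given below. It assumes a simplified annualized-cost form of the LCOE (capital cost spread with a capital recovery factor, plus yearly operation and maintenance, divided by yearly energy per installed kW). The parameter ranges, distributions, seeds, and function names are hypothetical and are not taken from the original analysis.

```python
import random
import statistics

def capital_recovery_factor(rate, years):
    """Annuity factor converting an upfront cost into equal yearly payments."""
    growth = (1.0 + rate) ** years
    return rate * growth / (growth - 1.0)

def lcoe(capex_per_kw, om_per_kw_year, rate, capacity_factor, years):
    """Levelized cost per kWh for 1 kW of installed capacity."""
    annual_cost = capex_per_kw * capital_recovery_factor(rate, years) + om_per_kw_year
    annual_energy_kwh = 8760.0 * capacity_factor
    return annual_cost / annual_energy_kwh

def simulate_lcoe(n_draws=2000, seed=1):
    """Monte Carlo draws of LCOE under assumed input distributions."""
    rng = random.Random(seed)
    return [
        lcoe(
            capex_per_kw=rng.triangular(1200.0, 1800.0, 1500.0),  # low, high, mode
            om_per_kw_year=rng.uniform(30.0, 50.0),
            rate=rng.uniform(0.04, 0.08),
            capacity_factor=rng.uniform(0.25, 0.40),
            years=rng.choice([20, 25]),
        )
        for _ in range(n_draws)
    ]

def bootstrap_ci_mean(sample, n_boot=500, alpha=0.05, seed=2):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot))
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

draws = simulate_lcoe()
mean_lcoe = statistics.fmean(draws)
ci_low, ci_high = bootstrap_ci_mean(draws)
print("mean LCOE per kWh:", round(mean_lcoe, 4))
print("95% bootstrap CI:", round(ci_low, 4), round(ci_high, 4))
```

The bootstrap interval describes the sampling uncertainty of the mean over the simulated draws. It is much narrower than the spread of the draws themselves, which reflects the uncertainty in the inputs.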
Hierarchical Cluster Analysis with Dendrogram for Optimal Class Selection
This analysis applies hierarchical clustering to a simulated dataset of 50 observations with 5 variables using Ward's method and Euclidean distance, visualizes the results with a dendrogram, and extracts cluster assignments.
Hierarchical Cluster Analysis with Dendrogram
The hierarchical clustering analysis with dendrogram presented in this document is a statistical method for grouping similar observations into clusters based on their characteristics. It begins by standardizing the data to remove scale effects and then computes a Euclidean distance matrix between observations. Ward's method (ward.D2) is used to build the dendrogram by minimizing the increase in intra-cluster variance at each merging step. The optimal number of clusters is determined with NbClust, which evaluates indices such as the silhouette and the gap statistic to identify a robust partition (here, 3 clusters). A principal component analysis (PCA) is then performed to reduce dimensionality, followed by hierarchical clustering on principal components (HCPC) to refine the results. Visualizations, particularly via fviz_dend, help interpret the groupings, with colored rectangles highlighting the clusters in the dendrogram. The results are exported as tables and files for further analysis.
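The first steps of this workflow (standardize, merge with Ward's criterion, cut at a chosen number of clusters) can be sketched in Python as follows. This naive implementation is meant only to show the merging rule and does not scale beyond small datasets. The nine two-variable observations are hypothetical, and the choice of three clusters is fixed by hand rather than selected with NbClust.

```python
import statistics

def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def ward_clusters(rows, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters.

    At each step the pair of clusters whose merge gives the smallest increase
    in within-cluster sum of squares is merged.
    """
    dims = range(len(rows[0]))
    clusters = [[i] for i in range(len(rows))]

    def centroid(members):
        return [statistics.fmean(rows[i][d] for i in members) for d in dims]

    def merge_cost(a, b):
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical observations forming three well-separated groups.
data = [
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],
    [9.0, 1.0], [9.2, 1.1], [8.9, 0.8],
]

groups = ward_clusters(standardize(data), k=3)
print(sorted(sorted(g) for g in groups))
```

Recording the merge cost at each step would give the heights needed to draw a dendrogram; cutting that tree at three clusters corresponds to stopping the loop at `k=3`.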