Recently Published
Shared code
Open-source software dominates in certain areas. Much of data science relies on thousands of open-source packages that are continually improved, in part because anyone can see how they work. Yet the open-source model has not taken off in academia. Much of the publicly available code in accounting research relates to two seemingly obscure topics: Fama-French industries and winsorization. I discuss both in this note.
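For readers unfamiliar with the second topic, winsorization replaces values beyond chosen percentiles with the percentile values themselves, rather than dropping them. The following is a minimal, dependency-free sketch and not the code discussed in the note. The function name `winsorize`, the default 1%/99% cut-offs, and the linear-interpolation percentile rule are all assumptions made for illustration; packages differ on exactly these details, which is one reason shared code is useful.

```python
import math


def winsorize(values, lower=0.01, upper=0.99):
    """Clamp values outside the [lower, upper] quantiles to those quantiles.

    Hypothetical helper for illustration; the quantile rule (linear
    interpolation between order statistics) is an assumption.
    """
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)

    def quantile(p):
        # Position of the p-th quantile among the sorted values.
        pos = p * (n - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    lo_cut, hi_cut = quantile(lower), quantile(upper)
    # Extreme observations are pulled in to the cut-offs, not removed.
    return [min(max(v, lo_cut), hi_cut) for v in values]


sample = [1, 2, 3, 4, 100]
print(winsorize(sample, lower=0.25, upper=0.75))
```

The output has the same length as the input, which distinguishes winsorization from truncation. With pandas, a similar effect can be had by combining `Series.quantile` and `Series.clip`.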
Some benchmarks with comp.g_secd
I use the WRDS data set `comp.g_secd` to do some benchmarking. A representative query that takes 6 minutes using SAS on the WRDS servers takes about 1 minute using the WRDS PostgreSQL server and about 0.2 seconds using a local parquet file. The parquet file occupies less than 4 GB on my hard drive, compared with about 145 GB for the SAS file on the WRDS server. While creating the parquet file takes 45 minutes, this may be a reasonable trade-off for a researcher who analyses `comp.g_secd` frequently and does not need the very latest iteration of the data for research purposes.
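The trade-off can be made concrete with a little arithmetic. The sketch below takes the timings quoted above at face value and asks how many queries it takes before the 45-minute build pays for itself. It assumes every query costs about the same as the representative one and ignores the cost of refreshing the file; the function name `break_even_queries` is hypothetical.

```python
import math

# Timings quoted in the summary, converted to seconds.
SAS_QUERY = 6 * 60
POSTGRES_QUERY = 1 * 60
PARQUET_QUERY = 0.2
PARQUET_BUILD = 45 * 60


def break_even_queries(remote_seconds,
                       local_seconds=PARQUET_QUERY,
                       build_seconds=PARQUET_BUILD):
    """Smallest number of queries for which building the parquet file
    first is faster overall than querying the remote source each time."""
    saving_per_query = remote_seconds - local_seconds
    return math.ceil(build_seconds / saving_per_query)


print(break_even_queries(SAS_QUERY))       # 8
print(break_even_queries(POSTGRES_QUERY))  # 46
```

Under these assumptions the local file is ahead after 8 queries relative to SAS and after 46 relative to PostgreSQL.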
The best of both worlds: Using modern data frame libraries to create pandas data
A number of modern data frame libraries have emerged that address weaknesses of pandas. In this note, I use polars and Ibis to show how one can use these libraries to get the data into a form in which pandas can shine.
Using SAS to create pandas data
SAS might be another approach to manipulating data for pandas. My Python package `wrds2pg` offers a `sas_to_pandas()` function that can run code on the WRDS server and return the results as a pandas dataframe. While not quite as fast as using Ibis with the PostgreSQL server, SAS performs pretty well with this task.
Workshop: Introduction to R Statistics for Insect Ecology
Welcome to the digital home of our workshop!
Insect ecology is uniquely messy—zero-inflated counts, overdispersed populations, and more variables than a centipede has legs.
This guide is designed to take you from "R-anxiety" to "R-competence," focusing on the specific statistical hurdles we face as entomologists.