Democratizing Data Science

Recently Published

Interactive Word Cloud Visualization of Affirmative Action Comments
Describe the steps, data source [data storytelling]
Post Announcement Key Actors' Ascriptions
This interactive sociogram displays the available information on the top 1 percent of most influential actors in the #CancelStudentDebt movement after the announcement of the Biden-Harris plan. These actors were classified into four groups: Politicians, Individual Activists, Non-profits (NPOs), and News Media. The color of the links captures the level of influence of the actor. High accounts are among the most influential players in the movement; middle and low accounts are less influential, as discussed in the study. *The U.S. Department of Education is not a politician; it is a governmental entity.
Peer Effects
Identification strategy for the paper "A Complex Systems Network Approach to Quantifying Peer Effects: Evidence From Ghanaian Preprimary Classrooms," by Sharon Wolf, Manuel S. Gonzalez Canche, and Kristen Coe. First published: 16 June 2021. https://doi.org/10.1111/cdev.13608
GeoStoryTelling [GST]
This is the output of 'GeoStoryTelling,' a software tool designed to merge GIS with qualitative analyses to enhance our storytelling projects. Here, researchers may link their stories and their participants' stories with the places where such stories or narratives take place. The visuals support video, images, and music or sounds hosted on the web. Moreover, images or pictures may also be read from researchers' local computers. The software is available for Mac and Windows operating systems. YouTube videos are not displayed when the embedded video contains copyrighted information, like music or images. However, if researchers record and upload their own videos or music to YouTube, Vimeo, SoundCloud, or any other streaming platform that allows embedding, they can rely on GeoStoryTelling to share these video or audio recordings to tell more powerful geostories. The legend key reveals more information about each point based on attributes of the name column. Users may decide to add categories instead of unique names, also under this column. For visualization purposes, geostories located close to one another are grouped together by a clustering algorithm that uses a radius of 80 pixels, or roughly 1.3152E-5 miles. That is, the number "3" in this map indicates that 3 events were clustered together. Clicking on a cluster will automatically zoom in, revealing the points located in that cluster. Users may refresh the browser to reset the HTML or hit the reset button located in the top left side of the map. Note that one of the stories shared deals with the opioid epidemic, which may be triggering; accordingly, viewer discretion is advised (as indicated in the legend).
Net Cost of Attendance and Net Tuition Revenue
How Much Room for Net Cost of Attendance Variation Exists in the Community College? Community colleges are public institutions and as such they are regulated by state legislatures (Education Commission of the States, 2012). As of 2020, 46 states have a statute that outlines tuition-setting authority and ten states have adopted state policies to cap or freeze tuition growth in the two-year public sector (Education Commission of the States, 2012). Prior studies on tuition-setting and price competition that included the public four-year sector were able to do so by focusing on non-resident tuition, because this attribute is not regulated by state legislatures, state boards of education, or state systems (González Canché, 2014, 2017). These regulations complicate the possibility that tuition variation may be impacted by contextualized or place-based factors, as we aim to explore in our study. There are, however, important conceptual and empirical differences in our efforts to model affordability prospects, as we discuss next. First, let us recall that net price of college attendance is the annual total enrollment cost, including tuition and fees, books and supplies, and living expenses, after accounting for all forms of grant and/or scholarship aid, that students must pay to maintain enrollment. Accordingly, since only one of these costs (i.e., tuition) is regulated by state legislatures, state boards of education, or state systems, this agency with respect to the other costs opens the possibility that college affordability is otherwise driven, or at least impacted, by local or place-based contextualized factors.
If this logic holds true, we should empirically observe the following interrelated patterns: when considering both net price of college attendance and net tuition revenue per full-time equivalent (FTE) student, the former should show more within-state variation than the latter, precisely given the state regulations regarding tuition price setting. To assess the feasibility of this argument, we present Figure 1, which was built to empirically assess whether net price of college attendance varies enough to be statistically modeled. This figure contains two indicators and four sub-figures. The indicators are net price of college attendance and net tuition revenue per FTE. Sub-figures A and C show the distribution of net price of college attendance (our main outcome of interest) and sub-figures B and D show the distribution of net tuition revenue per FTE in different states of the United States.
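The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, field names, and dollar amounts are hypothetical, and the population standard deviation is used here as one possible measure of within-state variation, not necessarily the one used in the study.

```python
from collections import defaultdict
from statistics import pstdev


def net_price(tuition_fees, books_supplies, living_expenses, grant_aid):
    """Annual total cost of enrollment after subtracting grant/scholarship aid."""
    return tuition_fees + books_supplies + living_expenses - grant_aid


def within_state_spread(records, key):
    """Population standard deviation of `key` among the colleges of each state."""
    by_state = defaultdict(list)
    for record in records:
        by_state[record["state"]].append(record[key])
    return {state: pstdev(values) for state, values in by_state.items()}


# Hypothetical colleges in one hypothetical state (illustrative numbers only).
colleges = [
    {"state": "A", "net_price": 8000, "net_tuition_per_fte": 3000},
    {"state": "A", "net_price": 12000, "net_tuition_per_fte": 3200},
]

example_price = net_price(4000, 1200, 9000, 3500)
price_spread = within_state_spread(colleges, "net_price")
tuition_spread = within_state_spread(colleges, "net_tuition_per_fte")
print(example_price, price_spread, tuition_spread)
```

If the argument holds, the spread of net price within a state should exceed the spread of net tuition revenue per FTE, as it does in this toy input.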
Price Heterogeneity
This interactive plot shows variation in net prices across and within sectors, as well as across the Contiguous United States. Clicking on each dot shows the institution's name, its net price, and the state where it is located.
Spatial Concentration of CC with partnerships located within 90 miles
This Moran's I plot corroborates that community colleges' participation intensity is spatially dependent or autocorrelated. That is, community colleges have participation rates more similar to those of their neighboring community colleges than we would expect under a random spatial distribution or random spatial process. This plot also tests whether outlier units (i.e., community colleges) have a statistically significant influence on these variations. The plot is separated into four quadrants: (a) top right is high-high, where community colleges with high participation rates neighbor other community colleges with high participation rates; (b) top left is low-high, where community colleges with low participation rates neighbor other community colleges with high participation rates; (c) bottom right is high-low, where community colleges with high participation rates neighbor other community colleges with low participation rates; and (d) bottom left is low-low, where community colleges with low participation rates neighbor other community colleges with low participation rates. In this plot, only community colleges in pink were identified as statistically significant outliers. The test identifies the observed values of each neighboring structure and conducts simulations that shuffle those values with the values of other units in the network that are not neighbors. If observed differences have consistently higher absolute magnitudes than shuffled differences, such a difference is deemed more stable than one found by chance alone.
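The logic of the quadrants and of the shuffling test can be sketched as follows. This is a simplified illustration, not the code behind the plot: it assumes binary neighbor weights, computes a global (rather than unit-level) Moran's I, and shuffles all values across units. The function names and the toy participation rates are hypothetical.

```python
import random


def morans_i(values, neighbors):
    """Global Moran's I with binary weights; `neighbors` maps unit -> neighbor list."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(len(nbs) for nbs in neighbors.values())
    cross = sum(dev[i] * dev[j] for i, nbs in neighbors.items() for j in nbs)
    return (n / total_weight) * (cross / sum(d * d for d in dev))


def permutation_test(values, neighbors, n_sim=999, seed=0):
    """Share of shuffled maps whose Moran's I is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_sim + 1)


def quadrant(own_deviation, neighbors_deviation):
    """Moran scatterplot quadrant from a unit's deviation and its neighbors' mean deviation."""
    own = "high" if own_deviation >= 0 else "low"
    lag = "high" if neighbors_deviation >= 0 else "low"
    return f"{own}-{lag}"


# Six hypothetical colleges along a line; each neighbors the adjacent ones.
rates = [1, 2, 3, 10, 11, 12]
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
observed, p_value = permutation_test(rates, chain)
print(round(observed, 3), p_value)
```

A positive observed value paired with a small pseudo p-value indicates that similar rates sit next to each other more often than random placement would produce.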
Spatial Concentration of CC with partnerships located within 30 miles
This Moran's I plot corroborates that community colleges' participation intensity is spatially dependent or autocorrelated. That is, community colleges have participation rates more similar to those of their neighboring community colleges than we would expect under a random spatial distribution or random spatial process. This plot also tests whether outlier units (i.e., community colleges) have a statistically significant influence on these variations. The plot is separated into four quadrants: (a) top right is high-high, where community colleges with high participation rates neighbor other community colleges with high participation rates; (b) top left is low-high, where community colleges with low participation rates neighbor other community colleges with high participation rates; (c) bottom right is high-low, where community colleges with high participation rates neighbor other community colleges with low participation rates; and (d) bottom left is low-low, where community colleges with low participation rates neighbor other community colleges with low participation rates. In this plot, only community colleges in pink were identified as statistically significant outliers. The test identifies the observed values of each neighboring structure and conducts simulations that shuffle those values with the values of other units in the network that are not neighbors. If observed differences have consistently higher absolute magnitudes than shuffled differences, such a difference is deemed more stable than one found by chance alone.
Spatial Concentration of CC with partnerships located within 60 miles
This Moran's I plot corroborates that community colleges' participation intensity is spatially dependent or autocorrelated. That is, community colleges have participation rates more similar to those of their neighboring community colleges than we would expect under a random spatial distribution or random spatial process. This plot also tests whether outlier units (i.e., community colleges) have a statistically significant influence on these variations. The plot is separated into four quadrants: (a) top right is high-high, where community colleges with high participation rates neighbor other community colleges with high participation rates; (b) top left is low-high, where community colleges with low participation rates neighbor other community colleges with high participation rates; (c) bottom right is high-low, where community colleges with high participation rates neighbor other community colleges with low participation rates; and (d) bottom left is low-low, where community colleges with low participation rates neighbor other community colleges with low participation rates. In this plot, only community colleges in pink were identified as statistically significant outliers. The test identifies the observed values of each neighboring structure and conducts simulations that shuffle those values with the values of other units in the network that are not neighbors. If observed differences have consistently higher absolute magnitudes than shuffled differences, such a difference is deemed more stable than one found by chance alone.
States' influence on community college partnerships
This Moran's I plot corroborates that community colleges' participation rates in a state are significantly related to community colleges' participation rates in neighboring states. Moreover, it tests whether outlier units (i.e., states) have a statistically significant influence on these variations. The plot is separated into four quadrants: (a) top right is high-high, where states with high participation rates are surrounded by other states with high participation rates; (b) top left is low-high, where states with low participation rates are surrounded by other states with high participation rates; (c) bottom right is high-low, where states with high participation rates are surrounded by other states with low participation rates; and (d) bottom left is low-low, where states with low participation rates are surrounded by other states with low participation rates. In this case, only states in pink were identified as statistically significant outliers. The test identifies the observed values of each neighboring structure and conducts simulations that shuffle those values with the values of other units in the network that are not neighbors. If observed differences have consistently higher absolute magnitudes than shuffled differences, such a difference is deemed more stable than one found by chance alone.
States' clustering based on Medoids
Results of the states' partitioning around medoids (PAM) clustering yielded an optimal classification of states into five categories or classes (see Figure 11(a)). These classes are the following:
1: Rev. trans. 10%, Assoc. deg. 0%, Core low 0%, Common # 10%, Avg = 0.2 (8 states)
2: Rev. trans. 0%, Assoc. deg. 60%, Core low 100%, Common # 0%, Avg = 1.6 (8 states)
3: Rev. trans. 100%, Assoc. deg. 90%, Core low 80%, Common # 0%, Avg = 2.7 (15 states)
4: Rev. trans. 0%, Assoc. deg. 80%, Core low 90%, Common # 100%, Avg = 2.7 (10 states)
5: Rev. trans. 100%, Assoc. deg. 90%, Core low 100%, Common # 100%, Avg = 3.9 (9 states)
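For readers unfamiliar with partitioning around medoids, the sketch below shows the swap step of the technique on toy data. It is an illustration under assumptions, not the analysis reported above: each unit is assumed to be coded with four 0/1 policy indicators, distance is assumed to be Hamming distance, the medoids are naively initialized to the first k units, and all names and data are hypothetical.

```python
def hamming(a, b):
    """Number of policy indicators on which two units differ."""
    return sum(x != y for x, y in zip(a, b))


def total_cost(points, medoids):
    """Sum of each unit's distance to its closest medoid."""
    return sum(min(hamming(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Swap phase of PAM: replace a medoid whenever doing so lowers the total cost."""
    medoids = list(range(k))  # naive start: the first k units
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for slot in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                trial = medoids[:slot] + [candidate] + medoids[slot + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    labels = [
        min(range(k), key=lambda c: hamming(p, points[medoids[c]]))
        for p in points
    ]
    return medoids, labels, best


# Six hypothetical units, each coded on four hypothetical 0/1 policy indicators.
units = [
    (0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1),
    (1, 1, 1, 0), (0, 0, 0, 0), (1, 1, 1, 1),
]
medoids, labels, cost = pam(units, k=2)
print(medoids, labels, cost)
```

Unlike k-means, the cluster centers here are actual units in the data, which keeps each class interpretable as "states that resemble this particular state."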
Pre Announcement Key Actors' Ascriptions
This interactive sociogram displays the available information on the top 1 percent of most influential actors in the #CancelStudentDebt movement prior to the announcement of the Biden-Harris plan. These actors were classified into four groups: Politicians, Individual Activists, Non-profits (NPOs), and News Media. The color of the links captures the level of influence of the actor. High accounts are among the most influential players in the movement; middle and low accounts are less influential, as discussed in the study.
Factors Impacting Diasporic Collaborations by Discipline and Location
This interactive sociogram displays the top six* factors and processes that faculty members deemed most relevant for the establishment and maintenance of diasporic collaborations, by location and discipline. *Six codes for each of the groups analyzed (STEM, Non-STEM, Border, No Border) capture the top 25% of factors per group. We selected this number to render plots that are clear to read rather than crowded with too many nodes.
Concept map of diasporic academic collaborations
This interactive figure shows the concept map of the voices of our participants, illustrating the diasporic nature of their collaborations. The concept map comprises two levels of elements. Five blue dots (previous education in the USA, bilingual proficiency, fostering institutional links, cross-cultural understanding, and postgraduate connections in Mexico) represent larger concepts linked to detailed ideas represented by orange dots. These orange concepts are not unique to blue concepts, but rather shared across them. For instance, "diasporic sensibility: a built, defined, and reimagined identity as a Mexican scholar" is included in and related to previous education in the USA, bilingual proficiency, fostering institutional links, and postgraduate connections in Mexico. Also linked to these blue concepts is the notion of tender connections with home, friends, and childhood (an orange dot). This idea aligns with the yearning and ambivalence observed in diasporic intellectuals in our literature review (e.g., Chambers, 2008). Despite variations in the interviewees' narratives, when considered collectively, these elements outline prominent characteristics of academics' connections with their home country. Consequently, diasporic collaboration emerges as a crucial collective undertaking in which the fruits of their labor belong to and benefit a larger community.
Diversifying Efforts: Outstanding Achievers in Standardized Math SAT
As demonstrated in this study, standardized tests are spatially dependent, wherein test-takers tend to have outcomes similar to those of their neighbors. These two plots serve to identify high achievers in standardized Math as a function of performing statistically significantly higher than their closest peers in mathematical performance. That is, the purpose of these visuals is to offer researchers and admission officers a tool that enables the easy identification of outstanding and qualified test-takers who, despite experiencing life in places with high levels of socioeconomic hardship, consistently and statistically significantly mastered these tests, outperforming in doing so their peers who grew up in the same neighborhoods. From this perspective, our proposed framework offers a feasible solution to the challenge of locating “The Hidden Supply of High-Achieving, Low-Income Students” (Hoxby & Avery, 2012). Cases shown in pink in the top plot are located in the High-Low quadrant of the Moran's I plot and indicate that a student's own mathematical score is significantly higher than the performance of their neighbors living within one mile of that test-taker. When placing the cursor on each dot, we display their score, their peers' score, and the high, mid, or low indicators in poverty, unemployment, minority status (i.e., concentration of African American, Native American, and Hispanic inhabitants), and family structure, all measured at the ZIP Code Tabulation Area (ZCTA) level. The map showcases the same outstanding cases (excluding cases in the other three quadrants of the Moran's I plot, shown in grey in the aforementioned plot) but, in addition, places them in their specific geographical context. This map also dynamically displays the distribution of poverty, unemployment, and minority concentration at the ZCTA level when placing the cursor on each ZCTA.
Finally, when placing the cursor on each dot, we display students' scores, their peers' scores, and the high, mid, or low indicators in poverty, unemployment, minority status, and family composition measured at the ZCTA level, as well. Truly exceptional cases are students living in areas with high poverty, high unemployment, high minority concentration, and a high proportion of single-mother households; their acceptance may truly contribute to diversifying entering college cohorts.
Interactive Wordcloud Analysis of Reasons to Enroll in a Data Science Course
This is a wordcloud representation of the most frequent words used by participants of a data science course. Since this is an HTML representation, the frequency of each word may be retrieved by placing the cursor on that word.
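The word frequencies that feed a word cloud like this one can be computed with a simple count. The sketch below is a minimal illustration: the responses, the stop-word list, and the function name are hypothetical and are not the course data or the code used for this visualization.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list.
STOPWORDS = {"the", "a", "an", "to", "and", "of", "in", "i", "my", "for", "is"}


def word_frequencies(responses, stopwords=STOPWORDS):
    """Count lowercase word tokens across responses, skipping stop words."""
    counts = Counter()
    for text in responses:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts


# Hypothetical answers to "why did you enroll?"
responses = [
    "I want to learn data science for my research",
    "Data visualization and data analysis",
]
frequencies = word_frequencies(responses)
print(frequencies.most_common(3))
```

Each word's count is the number a viewer would see when hovering over that word; the count also drives the size at which the word is drawn.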
ChatGPT co-authorship network
Since December 2022, there have been 897 academic articles published on the topic of ChatGPT. These articles have been co-authored by 2,118 academics, representing 8,173 collaborations. This network highlights the most influential authors as a function of connecting different articles together (i.e., betweenness centrality). The original data are available at https://cutt.ly/rwaClF9g.
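To make the idea of betweenness centrality concrete, the sketch below builds a small co-authorship graph from article author lists and scores each author with Brandes' algorithm for unweighted, undirected graphs. The author names and articles are hypothetical, and this is not the code or data behind the network shown here.

```python
from collections import defaultdict, deque
from itertools import combinations


def coauthor_graph(articles):
    """Link every pair of authors who share at least one article."""
    graph = defaultdict(set)
    for authors in articles:
        for author in authors:
            graph.setdefault(author, set())  # keep solo authors as nodes
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness(graph):
    """Brandes' algorithm: shortest paths passing through each node (unnormalized)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order, preds = [], {v: [] for v in graph}
        n_paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:  # breadth-first search from the source
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in graph}
        while order:  # accumulate dependencies from the farthest nodes back
            w = order.pop()
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    return {v: s / 2 for v, s in score.items()}  # undirected: pairs counted twice


# Three hypothetical articles with hypothetical authors.
articles = [["Ana", "Ben", "Cruz"], ["Cruz", "Dee"], ["Dee", "Eli"]]
centrality = betweenness(coauthor_graph(articles))
print(centrality)
```

In this toy network, Cruz and Dee score highest because every shortest path between the first article's authors and Eli runs through them, which is the sense in which an author "connects different articles together."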
matrix_sentiments_data_science
Explain in detail what does this do, what are the data, what is the question, the purpose...
Sentiment Analysis Social Media Data Analytics
This is a test
Example From Class Summer 2023
Desc...
Identification Strategy Multilevel SAR Models
This map shows how schools nested within ZCTAs with differing levels of poverty may (1) perform in mathematical proficiency and (2) form connections with other schools in close proximity, regardless of whether those neighboring institutions are located within or outside their ZCTA. Additionally, the Moran's I test shows that school performance is subject to spatial dependence (I = 0.268, p < .001). That is, since schools' performance is not independent from the performance of their neighboring units, models should account for that form of dependence. The places where those schools are located also show spatial dependence (I = 0.401, p < .001), which confirms the need to model that form of variation as well. In sum, since both school-level and place-based spatial autocorrelation are present, the use of multilevel SAR models is justified.
Variable Importance Via Boruta and Mathematical Proficiency in New York State (NYSED 2017-2018)
Feature Selection: A Machine Learning Strategy to Deal with Place-Based Multicollinearity. The tenets of geography of advantage/disadvantage suggest that the geographical indicators selected may be highly correlated; for example, zones with high crime are likely to have high poverty levels. This correlation, which is typically observed in studies modeling environmental factors (Li et al., 2016), may affect the observed variable importance of the predictors. Following Li et al. (2016), before model estimation, variable inclusion criteria relied on a feature selection algorithm (Kursa & Rudnicki, 2010) to detect all non-redundant variables to predict mathematical proficiency variation via machine learning. This process effectively addresses multicollinearity issues by identifying and easing the exclusion of redundant features. This non-redundant feature selection was implemented using the Boruta function, a Random Forest regression procedure. Boruta is a wrapper algorithm that subsets features, here the school (Xs) and place-based (Zs) indicators, and trains a model using them to try to capture all the relevant indicators with respect to an outcome variable.
References:
Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1-13. https://www.jstatsoft.org/article/view/v036i11
Li, J., Tran, M., & Siwabessy, J. (2016). Selecting optimal random forest predictive models: A case study on predicting the spatial distribution of seabed hardness. PLoS ONE, 11(2). https://doi.org/10.1371/journal.pone.0149089
Variable Importance Via Boruta and Mathematical Proficiency in New York State (NYSED 2018-2019)
Feature Selection: A Machine Learning Strategy to Deal with Place-Based Multicollinearity. The tenets of geography of advantage/disadvantage suggest that the geographical indicators selected may be highly correlated; for example, zones with high crime are likely to have high poverty levels. This correlation, which is typically observed in studies modeling environmental factors (Li et al., 2016), may affect the observed variable importance of the predictors. Following Li et al. (2016), before model estimation, variable inclusion criteria relied on a feature selection algorithm (Kursa & Rudnicki, 2010) to detect all non-redundant variables to predict mathematical proficiency variation via machine learning. This process effectively addresses multicollinearity issues by identifying and easing the exclusion of redundant features. This non-redundant feature selection was implemented using the Boruta function, a Random Forest regression procedure. Boruta is a wrapper algorithm that subsets features, here the school (Xs) and place-based (Zs) indicators, and trains a model using them to try to capture all the relevant indicators with respect to an outcome variable.
References:
Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1-13. https://www.jstatsoft.org/article/view/v036i11
Li, J., Tran, M., & Siwabessy, J. (2016). Selecting optimal random forest predictive models: A case study on predicting the spatial distribution of seabed hardness. PLoS ONE, 11(2). https://doi.org/10.1371/journal.pone.0149089
Variable Importance Via Boruta and Mathematical Proficiency in New York State (NYSED 2017-2019)
Feature Selection: A Machine Learning Strategy to Deal with Place-Based Multicollinearity. The tenets of geography of advantage/disadvantage suggest that the geographical indicators selected may be highly correlated; for example, zones with high crime are likely to have high poverty levels. This correlation, which is typically observed in studies modeling environmental factors (Li et al., 2016), may affect the observed variable importance of the predictors. Following Li et al. (2016), before model estimation, variable inclusion criteria relied on a feature selection algorithm (Kursa & Rudnicki, 2010) to detect all non-redundant variables to predict mathematical proficiency variation via machine learning. This process effectively addresses multicollinearity issues by identifying and easing the exclusion of redundant features. This non-redundant feature selection was implemented using the Boruta function, a Random Forest regression procedure. Boruta is a wrapper algorithm that subsets features, here the school (Xs) and place-based (Zs) indicators, and trains a model using them to try to capture all the relevant indicators with respect to an outcome variable.
References:
Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1-13. https://www.jstatsoft.org/article/view/v036i11
Li, J., Tran, M., & Siwabessy, J. (2016). Selecting optimal random forest predictive models: A case study on predicting the spatial distribution of seabed hardness. PLoS ONE, 11(2). https://doi.org/10.1371/journal.pone.0149089
Example of Visuals Plotly
data taken from... to ...
Program-level Within-state Representation of Institutionally-driven Agreements
This graph shows a network analysis of central programs, disciplines, and policies regarding within-state transfer and articulation legislation across the Continental United States. Node size is based on strength. Colors are based on regions. Hovering over each line or node offers more details. Data source: https://www.collegetransfer.net/Search/Search-for-Transfer-Articulation-Agreements
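In network analysis, a node's strength is conventionally the sum of the weights of the edges attached to it. The sketch below illustrates that convention, assuming it is the measure used to size the nodes here; the program names and agreement counts are hypothetical.

```python
from collections import defaultdict


def node_strength(weighted_edges):
    """Sum the weights of all edges touching each node of an undirected network."""
    strength = defaultdict(float)
    for source, target, weight in weighted_edges:
        strength[source] += weight
        strength[target] += weight
    return dict(strength)


# Hypothetical programs linked by hypothetical counts of shared agreements.
agreements = [
    ("Nursing", "Biology", 3),
    ("Nursing", "Business", 2),
    ("Biology", "Business", 1),
]
strengths = node_strength(agreements)
print(strengths)
```

A node with few but heavily weighted ties can therefore be drawn larger than a node with many weak ties, which is the difference between strength and a plain count of connections.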
Program-level Out-of-state Representation of Institutionally-driven Agreements
This graph shows a network analysis of central programs, disciplines, and policies regarding out-of-state transfer and articulation legislation across the Continental United States. Node size is based on strength. Colors are based on regions. Hovering over each line or node offers more details. Data source: https://www.collegetransfer.net/Search/Search-for-Transfer-Articulation-Agreements
Program-level Aggregated Representation of Institutionally-driven Agreements
This graph shows a network analysis of central programs, disciplines, and policies regarding transfer and articulation legislation across the Continental United States. Node size is based on strength. Colors are based on regions. Hovering over each line or node offers more details. Data source: https://www.collegetransfer.net/Search/Search-for-Transfer-Articulation-Agreements
State-level Out-of-state Representation of Institutionally-driven Agreements
This graph shows a network analysis of central states and policies regarding out-of-state transfer and articulation legislation across the Continental United States. Node size is based on strength. Colors are based on regions. Hovering over each line or node offers more details. Data source: https://www.collegetransfer.net/Search/Search-for-Transfer-Articulation-Agreements
State-level Within-state Representation of Institutionally-driven Agreements
This graph shows a network analysis of central states and policies regarding within-state transfer and articulation legislation across the Continental United States. Node size is based on strength. Colors are based on regions. Hovering over each line or node offers more details. Data source: https://www.collegetransfer.net/Search/Search-for-Transfer-Articulation-Agreements
State-level Aggregated Representation of Institutionally-driven Agreements
This graph shows a network analysis of central states and policies regarding transfer and articulation legislation across the Continental United States. Node size is based on strength. Colors are based on regions. Hovering over each line or node offers more details. Data source: https://www.collegetransfer.net/Search/Search-for-Transfer-Articulation-Agreements
Out-of-State Spatial Network Rendering of Institutionally-Driven Transfer and Articulation Agreements
This graph shows a spatial network analysis of institutions participating in out-of-state transfer and articulation agreements across the Continental United States. Colors are based on institutions' control (public, private for-profit, and private not-for-profit) and level (2- or 4-year). Hovering over a state or connection offers more details. Main data source: https://www.collegetransfer.net/Search/Search-for-Transfer-Articulation-Agreements
Within-State Spatial Network Rendering of Institutionally-Driven Transfer and Articulation Agreements
This graph shows a spatial network analysis of institutions participating in in-state transfer and articulation agreements across the Continental United States. Colors are based on institutions' control (public, private for-profit, and private not-for-profit) and level (2- or 4-year). Hovering over a state or connection offers more details. Main data source: https://www.collegetransfer.net/Search/Search-for-Transfer-Articulation-Agreements
Key Actor Analysis of Central Institutions
This graph shows a network analysis of central institutions and policies regarding transfer and articulation legislation across the Continental United States. Node size is based on betweenness centrality. Colors are based on eigenvector centrality. Hovering over each line or node offers more details. Data source: https://www.collegetransfer.net/Search/Search-for-Transfer-Articulation-Agreements
Spatial Network Rendering of Statewide Transfer and Articulation Policies
This graph shows a spatial network analysis of central states and policies regarding transfer and articulation legislation across the Continental United States. Colors are based on eigenvector centrality. Hovering over a state or connection offers more details. Data source: Education Commission of the States https://www.ecs.org/50-state-comparison-transfer-and-articulation/
Key Actor Analysis of Central States and Policies
This graph shows a network analysis of central states and policies regarding transfer and articulation legislation across the Continental United States. Node size is based on betweenness centrality. Colors are based on eigenvector centrality. Hovering over each line or node offers more details. Data source: Education Commission of the States https://www.ecs.org/50-state-comparison-transfer-and-articulation/