Karuitha

John Karuitha

Recently Published

House Prices and Access to Parks in London
In this analysis, we examine the connection between access to local parks and housing prices in London. The results reveal a noteworthy association between house prices and access to local parks and metropolitan parks. However, district parks and open spaces show a negative relationship with house prices, although it is not statistically significant. Notably, Westminster consistently stands out as the priciest area. The regression models employed in the analysis show that accessibility to parks plays a limited role in influencing housing prices. Instead, the borough is of primary importance in pricing. Nonetheless, access to open spaces, local parks, and district parks has a significant relationship with house prices. Metropolitan parks and regional parks have no significant relationship with prices.
Enhancing Anti-Churn Strategies: Leveraging Advanced Machine Learning for Targeted Intervention in Telecommunications
This study aims to improve a telecommunications company’s anti-churn efforts using advanced machine learning. Six models (Decision Tree, Random Forest, Bagged Decision Tree, Extreme Gradient Boosting, Ridge, and Lasso) were used to identify 2000 customers for urgent anti-churn actions. The logistic regression model successfully pinpointed 2740 clients, offering targeted insights to enhance the anti-churn campaign’s effectiveness. The findings highlight the importance of using sophisticated machine learning for precise customer churn prediction in the telecommunications sector.
Predicting Customer Churn with Precision: Unleashing the Power of Machine Learning and Logistic Regression in Telecommunication Analytics
This study focuses on optimizing the anti-churn campaign for a telecommunications company through advanced machine learning techniques. The analysis uses logistic regression to pinpoint 2000 consumers for urgent contact in the anti-churn initiative. By applying a logistic regression model, a total of 2262 clients were successfully identified, providing targeted insights to significantly enhance the efficacy of the anti-churn campaign. The findings underscore the value of employing sophisticated machine learning methodologies for precise customer churn prediction and strategic intervention in the telecommunications sector.
Analysis of Health Data in R
Analysis of health data: the health status of citizens
Department Store Sales Time-Series Analysis
I analyze time series data from a department store and attempt to make forecasts. I also discuss the various terms associated with time series: trend, seasonality, and noise. I then compare the performance of the models in forecasting. The results show that a polynomial model with seasonal effects does better than a pure linear model.
Analyzing Sales Data in Python and Pandas: Unveiling Regional Leaders, Top Sales Representatives, Best-selling Items, and Peak Sales Month
This analysis, conducted using Python and Pandas, delves into a dataset comprising 43 observations and 8 variables related to sales transactions. The primary objectives include identifying the region with the highest sales, pinpointing top-performing sales representatives, determining the best-selling item, and uncovering monthly variations in sales. The results highlight the exceptional dominance of the Central region, contributing over 60% of total sales. Sales representatives Kivell and Parent emerge as frontrunners, while binders stand out as the most popular item. Furthermore, the analysis unveils seasonal variations, with December and July recording peak sales, and March experiencing a notable dip. These findings offer actionable insights for strategic decision-making, emphasizing the significance of regional, individual, and temporal considerations in optimizing sales efforts and business performance. The use of Python and Pandas demonstrates the efficacy of data-driven approaches in extracting meaningful patterns and trends from complex datasets, paving the way for informed decision-making in dynamic business environments.
Financial Development and Educational Attainment: A Cross Country Comparison
In this analysis, I examine the relationship between financial development and educational attainment. My premise is that people who attain higher education are better placed to enter the formal labor market and hence demand financial services such as bank accounts. Education also raises awareness, even among people in the informal and semi-formal sectors, of how to better manage and access finance from formal financial intermediaries. The findings of the analysis confirm these hypotheses.
Assessing the Influence of 2000s Trade with China on Manufacturing Employment Across US Census Zones
We examine the variation in changes in manufacturing jobs across census regions in the United States. We find significant regional variation, with the satl region experiencing the least reduction and wncen the most. Education, share of routine jobs, and manufacturing (share of employment in manufacturing at the start of the decade) also have a statistically significant relationship with changes in manufacturing employment.
Hypothesis Testing in R
In examining a dataset featuring exam scores and corresponding hours of study, this analysis endeavors to unravel the intricate relationship between academic performance and study efforts.
Analyzing Deaths of Climbers in Mt. Everest
In this analysis, I explore data from Wikipedia on the recorded number of deaths among climbers of Mt. Everest.
What Is The State of Food Security and Nutrition in the US?
The United Nations Food and Agriculture Organization’s report, “The State of Food Security and Nutrition in the World 2022,” often makes people think that food insecurity is a problem only in other parts of the world, not in the United States.
Admission Delays for Emergency Patients
This project uses machine learning to predict potential admission delays for emergency patients.
Natural Language Processing (NLP): Analyzing Social Media Data Using the ‘Bag of Words’ Technique in R
The proliferation of social media platforms has revolutionized the way people communicate, share information, and express opinions in the digital age. These platforms have become an invaluable source of data for researchers, businesses, and policymakers seeking to gain insights into public sentiment, behavior, and trends. Analyzing social media data, however, presents unique challenges due to the unstructured and often noisy nature of text-based content.
Happiness and Television Consumption
In this analysis, I use data from SOEP to explore the link between TV consumption and happiness. I explore descriptive statistics and plot a heat map. The analysis shows that people who consume more TV are, on average, happier than those who watch less TV. However, the observed link does not imply causality due to possible confounding factors.
Earnings Differentials Between US Born and Migrant Workers
In this analysis, I examine the drivers of the differential in earnings between US born and migrant workers in the United States. The analysis shows that migrants have the lowest earnings in their first year of arrival. Earnings increase steeply in the first five years, followed by a gradual decline. The drivers of income differentials include gender, age, race, education, certification, hours worked, job location (rural vs urban), and occupation. The dataset has many missing values, which makes the analysis challenging. After including more control variables, it appears that, all else remaining the same, foreign born workers are likely to earn more than US born workers. There is potential for omitted variable bias.
Analysis of Africa’s 250 Largest Companies Using LibreOffice Calc and Python
Recently, I published a data analysis project titled Navigating Africa’s Business Landscape Using Python on my RPubs site. The analysis provoked quite a bit of interest, which motivated me to go a little deeper into analyzing the corporate landscape in Africa. In this more comprehensive analysis project, I use updated data from the Africa Business to explore the 250 largest formal corporations in Africa. The data captures the revenue, net income, and market valuation of the 250 largest companies in Africa for the years 2022 and 2023 (Kaggle, 2023).
Navigating Africa's Business Landscape Using Python
In this analysis, we use data from Kaggle to illustrate the use of Python and Pandas in data analysis. The data captures the top 2000 companies in the world and is available for free upon registration on the Kaggle website. We filter the data to only include companies from Africa.
Unveiling Corporate Titans: Navigating Global Business Landscapes through Python Data Analysis
In this analysis, we use data from Kaggle to illustrate the use of Python and Pandas in data analysis. The data captures the top 2000 companies in the world and is available for free upon registration on the Kaggle website.
Filtering and Sorting with Pokemon Data in Python
In this project, I analyze Pokemon data to illustrate data filtering using pandas.
Analyzing Store Food Sales Using Python Programming Language
In this analysis, I analyse sales data from different cities and regions in the United States. The objective of this analysis is to illustrate basic data analysis using the Python programming language. Python is the main competitor to R, the other leading programming language for data science.
Analyzing Common English Names Using Python
In this project, I analyze data about popular English language names. The data is available on this [site](https://github.com/dolph/dictionary/blob/master/popular.txt). The project was part of a course created by [FreeCodeCamp](https://www.freecodecamp.org/). The course is available on YouTube at this [link](https://www.youtube.com/watch?v=r-uOLxNrNk8&t=224s). The course is project-based. The purpose is to illustrate the basics of data analysis using the Python language.
Scraping Multiple Pages of Text Using R and rvest
In many cases, data is not available in a ready-to-load format. Often, data is not available at all. In this case, we may have to collect the data ourselves. One way to do this is to get data from websites through a process referred to as web scraping. In this section, we examine the scraping of data from the web using R.
Mining Text and Exploring Sentiments in Leo Tolstoy's 'How Much Land Does a Man Need?'
In conducting sentiment analysis on Leo Tolstoy's short story titled "How much land does a man need?" [@tolstoy1905much], the primary objective is to illustrate automated text mining in R. The secondary objective is to examine the underlying sentiments conveyed within the text by applying a quantitative approach. By analyzing the story through this lens, we aim to gain a deeper understanding of the characters, themes, and overall message conveyed by Tolstoy.
QnA Analysis of the Drivers of Loans defaults
Analysis of loan default data
Scraping & Analyzing World Population Data Using Python
In this analysis, I scrape and analyze population and country data from three sites.
Using Machine Learning to Predict Flight Delays : Decision Trees and Random Forests
Flight delays are a significant concern in the airline industry. Apart from the inconvenience caused to travelers, delays also affect the reputation of airlines, negatively impacting market share. In this analysis, I utilize data for flights between New York and Washington DC. The central questions in the analysis are: (1) Which factors have a significant relationship to flight delays? (2) Can machine learning be useful in predicting flight delays?
Which Beauty Product Combinations do Customers Often Buy Together?
In this mini-project, I explore association rules using data from a beauty product shop. Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected (Kotsiantis and Kanellopoulos 2006).
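As a rough illustration of the "measures of interestingness" mentioned above, the following minimal Python sketch computes support, confidence, and lift for a single rule. The baskets and product names are hypothetical and are not the shop's data; the function names are my own and do not come from the project.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent), from transaction counts."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's baseline support."""
    base = support(transactions, consequent)
    if base == 0:
        return 0.0
    return confidence(transactions, antecedent, consequent) / base


# Hypothetical baskets, for illustration only.
baskets = [
    {"lipstick", "mascara"},
    {"lipstick", "mascara", "foundation"},
    {"foundation"},
    {"lipstick"},
]

rule = ({"lipstick"}, {"mascara"})
print("support:", support(baskets, rule[0] | rule[1]))
print("confidence:", confidence(baskets, *rule))
print("lift:", lift(baskets, *rule))
```

A lift above 1 suggests the two items appear together more often than they would if purchases were independent.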
Estimating Systematic Risk Using the Capital Asset Pricing Model, CAPM
In this analysis, I use stock prices for Microsoft and General Motors (GM) to estimate the systematic risk of the stocks using the Capital Asset Pricing Model (CAPM). CAPM, developed by William Sharpe, Jack Treynor, John Lintner and Jan Mossin (Perold, 2004), quantifies the systematic risk and the expected return on an asset, particularly stocks.
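In CAPM, systematic risk is summarised by beta, the covariance of the asset's returns with the market's returns divided by the variance of the market's returns; the expected return is then the risk-free rate plus beta times the market risk premium. The sketch below shows one way this could be computed in plain Python. The price series and rates are hypothetical, not the Microsoft or GM data, and the function names are illustrative assumptions.

```python
from statistics import mean


def simple_returns(prices):
    """Period-over-period simple returns from a list of prices."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]


def capm_beta(asset_returns, market_returns):
    """Beta = Cov(asset, market) / Var(market)."""
    if len(asset_returns) != len(market_returns) or len(asset_returns) < 2:
        raise ValueError("need two return series of equal length (>= 2)")
    mean_a, mean_m = mean(asset_returns), mean(market_returns)
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    if var == 0:
        raise ValueError("market returns have zero variance")
    return cov / var


def capm_expected_return(beta, risk_free, market_return):
    """E[r] = r_f + beta * (E[r_m] - r_f)."""
    return risk_free + beta * (market_return - risk_free)


# Hypothetical price series, for illustration only.
market_prices = [100.0, 102.0, 101.0, 105.0, 107.0]
stock_prices = [50.0, 52.0, 51.0, 55.0, 57.0]

beta = capm_beta(simple_returns(stock_prices), simple_returns(market_prices))
print("beta:", beta)
print("expected return:", capm_expected_return(beta, 0.02, 0.08))
```

The same beta can be obtained as the slope of a regression of the stock's returns on the market's returns.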
Becoming a Geographer: The Art of Creating Maps in R
In this mini-project, I demonstrate how to make presentable maps using R. Often, researchers may need to visualise their data using maps. For instance, finance researchers and professionals may wish to visualise the extent of financial inclusion in different countries.
Plasma Ferritin Concentration Study
Which factors affect plasma ferritin concentration (Ferr) among Australian athletes? In this article, we assess the effect of a collection of explanatory variables on the plasma ferritin concentration (Ferr) in 202 Australian athletes. The file Sports Data CW 2021.csv contains the data on the plasma ferritin concentration as well as a selection of demographic variables of 202 male and female athletes.
Visualizing World Happiness in 4 Charts
In this flex dashboard, I use 4 charts to visualize world happiness in 2021.
Weight and Sleeping Patterns in the UK
With sedentary lifestyles, obesity has become a significant health issue globally. Overweight individuals have a higher risk of developing heart disease, stroke, cancer, and kidney disease, among others (Shah et al. 2021). These health issues place additional strain on health facilities and state financial resources. Consequently, much research goes into tracking obesity, mapping out possible health complications associated with obesity, and establishing the factors contributing to obesity. Critically, many resources go to the design of measures to minimise the prevalence of obesity (Fruh 2017; Malik, Willett, and Hu 2013; Lopez 2007). In this project, we explore the link between age, sleep, and body mass index in a sample of individuals from the United Kingdom.
Determinants of Body Fat in Individuals
The task is to build a regression model that can be useful in explaining and predicting the body fat of individuals.
Sentiment Analysis of Kenya’s Star Newspaper on Friday July 15, 2022
In this analysis, I scrape data from the Star Newspaper for Friday July 15, 2022 and evaluate the sentiment and topics that dominate the news.
Functional Programming in R Using Purrr
In this article, I highlight the use of map functions from the purrr package in R. Additional information is available in the R help pages and the resources cited in the references section.
Access to Finance: A Global Comparison
In this project, I use financial access data from the IMF to map the current state and the trends in access to finance globally.
Extracting Data from 200 Nested Excel Files Using R
In this project, I demonstrate how to extract data from multiple Excel files nested in different folders and sub-folders. Ordinarily, a person would have to open each folder and sub-folder, open each MS Excel file inside it, and then copy and paste the content of each file into a master Excel file. Even for a very efficient MS Excel user, copying and pasting data from 200 MS Excel files to create one data set is a tall order. Fortunately, the R programming language makes such tasks easy and fast.
Regression analysis of YouTube dataset
Regression analysis of YouTube dataset
Analysis of Global Homicide/Murder Rates
In this project, I use data from the World Bank to develop a dashboard that captures the world's homicide/murder hotspots. Overall, homicides are concentrated in Latin America and the Caribbean, followed by Africa. No country from Asia or Europe is in the top 20 in homicide/murder rates.
Beer debate: Part 1
In this article, I used data from BeerAdvocate.com and Wikipedia to examine drivers of average ratings of beer. In addition, I examined the countries with the highest beer consumption both in absolute terms and in per capita terms. The significant takeaways are as follows.
Who’s the Fastest of All? Analyzing the 100 Metres Men’s Sprint Data
I use data from World Athletics on the best times posted by male athletes in the 100 metres sprint from 1958 to the present.
Reducing the Number of High Fatality Accidents in the UK
In this project, I use data from the Department for Transport in the United Kingdom (UK) to derive insights to reduce fatalities from major accidents. Specifically, the project aims to identify factors associated with road accident fatalities. The key findings are: (1) Accident casualties peak during weekends, starting from Friday and falling on Sunday. (2) Accident casualties vary by time of day. (3) Major accidents and accident casualties mainly happen when the weather is fine. (4) Major accidents and accident casualties mainly occur in road stretches with speed limits of 30 mph and 60 mph.
John Karuitha: Scraping Data from Websites Using R
I illustrate how to scrape data from websites using R.
KARUITHA: Getting Started With tidymodels
In this exercise, I introduce tidymodels using health insurance data to fit a linear regression model.
Report for the Cars and Pressure Datasets
A short report based on the built-in R datasets cars and pressure.