
chidemannie

Emmanuel Benyeogor

Recently Published

Swiftkey_Nextword_Prediction_Pitch
This project presents a next-word prediction application developed as part of the Johns Hopkins Data Science Capstone (Coursera). The objective was to build a lightweight and efficient predictive text model similar to those used in smart mobile keyboards. Using the HC Corpora English datasets (blogs, news, and Twitter), I performed exploratory data analysis, constructed n-gram frequency tables (2-gram, 3-gram, and 4-gram models), and implemented a backoff strategy to handle unseen word combinations. The final product is a deployed Shiny application that predicts the top three most likely next words given an input phrase. The model is optimized for performance and memory efficiency to ensure responsiveness in a web environment.

Key features:
- Frequency-based n-gram language model
- Backoff prediction logic (4-gram → 3-gram → 2-gram → fallback; sketched below)
- Efficient storage using serialized RDS tables
- Deployed via shinyapps.io

Live application: https://chidemannie.shinyapps.io/Swiftkey_Next_Word_Predictor/
Source code: https://github.com/chidemannie/swiftkey-capstone

This slide deck summarizes the modeling approach and performance considerations, and demonstrates how the application works.
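As a minimal sketch of the backoff logic listed above (the function, file, table, and column names here are illustrative assumptions, not the app's actual code), each n-gram table can be pictured as a data frame mapping an (n-1)-word prefix to candidate next words with counts, loaded from serialized RDS files:

```r
# Hypothetical file and column names; each table has columns prefix, word, count.
ngrams4 <- readRDS("ngram4.rds")
ngrams3 <- readRDS("ngram3.rds")
ngrams2 <- readRDS("ngram2.rds")

predict_next <- function(phrase, top_n = 3) {
  tokens <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  lookup <- function(tbl, n) {
    if (length(tokens) < n - 1) return(character(0))
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    hits <- tbl[tbl$prefix == prefix, ]
    head(hits$word[order(-hits$count)], top_n)   # top candidates by count
  }
  # Back off: 4-gram -> 3-gram -> 2-gram -> unigram fallback
  for (spec in list(list(ngrams4, 4), list(ngrams3, 3), list(ngrams2, 2))) {
    result <- lookup(spec[[1]], spec[[2]])
    if (length(result) > 0) return(result)
  }
  c("the", "to", "and")  # fallback: common unigrams
}
```

Keeping the tables as plain frequency counts rather than a full probability model is what makes the in-memory footprint small enough for a responsive Shiny deployment.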
SwiftKey Capstone: Exploratory Data Analysis of HC Corpora
This project is part of the Johns Hopkins University Data Science Specialization Capstone. The objective is to explore large-scale English text datasets (blogs, news, and Twitter) and build the foundation for a next-word prediction model similar to those used in mobile smart keyboards.

The analysis includes:
- Basic dataset summaries (file size, line counts, maximum line length)
- Exploratory analysis of text structure
- Sampling and cleaning strategies suitable for large corpora
- Word frequency analysis (unigrams and bigrams)
- Distribution of words per line
- Vocabulary coverage analysis (50% and 90% token coverage; see the sketch after this list)
- Preliminary modeling strategy for n-gram backoff prediction

The results highlight differences between text sources (short Twitter messages vs. long blog entries), motivate efficient sampling techniques, and inform the design of a responsive Shiny application for deployment. The next phase of the capstone will implement an optimized n-gram model with a backoff strategy and deploy it via Shiny for real-time next-word prediction.
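The coverage calculation can be sketched as follows, assuming `tokens` is a character vector of words from a cleaned sample of the corpora (a placeholder name, not the report's actual variable):

```r
# How many distinct words are needed to cover a given share of all tokens?
coverage_words <- function(tokens, threshold) {
  freq <- sort(table(tokens), decreasing = TRUE)        # word frequencies, descending
  cum_share <- cumsum(as.numeric(freq)) / length(tokens)
  which(cum_share >= threshold)[1]                      # rank of first word reaching threshold
}

coverage_words(tokens, 0.50)  # unique words needed for 50% coverage
coverage_words(tokens, 0.90)  # unique words needed for 90% coverage
```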
Data Science Capstone - Milestone Report Wk2
A large training dataset comprising text in four languages is downloaded; the English database is used for this report.

Aim: to become familiar with the databases and do the necessary cleaning (for example, removing offensive and profane words). Using the techniques learnt, we will be:
- Identifying appropriate words, punctuation, and numbers, referred to as tokens. This is achieved through the process called tokenization.
- Getting rid of profane and other words we do not want included in our prediction, through filtering (both steps are sketched below).
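A minimal sketch of these two steps, assuming `lines` holds the raw English text and `profanity` is a character vector of banned words (both placeholders; the report's actual word list is not reproduced here). This simplified version keeps only alphabetic word tokens:

```r
# Tokenization: lower-case the text and split it into word tokens.
tokenize <- function(lines) {
  text <- tolower(lines)
  text <- gsub("[^a-z' ]", " ", text)       # replace digits/punctuation with spaces
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens[nchar(tokens) > 0]                 # drop empty strings left by splitting
}

tokens <- tokenize(lines)
tokens <- tokens[!tokens %in% profanity]    # filtering: drop profane words
```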
GeyserEruption
It is not possible to predict more than one eruption in advance. Old Faithful is currently bimodal: it has two eruption durations, either long (over 4 minutes) or, more rarely, short (about 2.5 minutes). Short eruptions lead to an interval of just over an hour, and long eruptions lead to an interval of about 1.5 hours.
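As an illustrative sketch only (an assumption about the approach, not the app's actual code), the duration-to-interval relationship described above can be fitted on R's built-in faithful dataset, which records eruption durations and waiting times in minutes:

```r
# Fit waiting time between eruptions as a function of eruption duration.
fit <- lm(waiting ~ eruptions, data = faithful)

# Predicted intervals for a short (~2.5 min) and a long (~4.5 min) eruption;
# these land roughly near the "just over an hour" and "about 1.5 hours" figures above.
predict(fit, newdata = data.frame(eruptions = c(2.5, 4.5)))
```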
test
ytest
Geyser APP
Old Faithful is a cone-type geyser. Since 2000 its intervals have varied from 44 to 125 minutes, averaging about 90 to 92 minutes; its eruptions last 1.5 to 5 minutes and reach heights of 90 to 184 feet.