Recently Published
Coursera R Capstone: A Word Prediction Shiny App
We design a language model based on a dataset provided by SwiftKey, which is now part of Microsoft (https://www.microsoft.com/en-us/swiftkey?activetab=pivot_1%3aprimaryr2). The dataset contains subsets in four languages (Finnish, Russian, German, and English); we use the English data here. After a detailed analysis, we decided to build a model using N-grams of one, two, three, four, and five words drawn from all of the data provided, supplemented with 6-grams from the blogs subset of the data. We built a Shiny app based on this language model.
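At its core, a model like this rests on counting N-grams (runs of N consecutive words) in the training text. A minimal sketch of that counting step, in Python for illustration (the app itself is written in R, and the function name here is ours, not the app's):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (tuples of n consecutive tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
bigrams = ngram_counts(tokens, 2)
# A 6-token sentence yields 5 bigrams, e.g. ('the', 'cat') and ('on', 'the').
```

Running the same function with n from 1 to 5 over the full corpus, and with n = 6 over the blogs subset, gives the frequency tables the model is built from.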
The test version is located at:
https://rubiera.shinyapps.io/capstone_test/
Exploratory N-Gram Model and Draft Next Word Prediction Model
We analyze here a dataset provided by SwiftKey, which is now part of Microsoft (https://www.microsoft.com/en-us/swiftkey?activetab=pivot_1%3aprimaryr2). The dataset contains subsets in four languages (Finnish, Russian, German, and English); we analyze the English data here. The steps in this analysis are:
Task 2: Exploratory Data Analysis of the SwiftKey English Dataset as an N-Gram Model
Explore 1-grams, 2-grams, and 3-grams in a 70 percent training sample of the dataset to:
- Understand the distribution of words and the relationships between words in the corpora.
- Understand frequencies of words and word pairs.
- Assess how many unique words are needed in a frequency-sorted dictionary to cover 50% and 90% of all word instances in the English dataset.
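The coverage question in the last bullet amounts to walking down the frequency-sorted word list and accumulating counts until the target fraction is reached. A small sketch of that computation (Python for illustration; the function name is ours):

```python
from collections import Counter

def words_needed_for_coverage(word_counts, coverage):
    """Smallest number of most-frequent unique words whose combined
    counts cover at least the given fraction of all word instances."""
    total = sum(word_counts.values())
    running, needed = 0, 0
    for _, count in word_counts.most_common():
        running += count
        needed += 1
        if running / total >= coverage:
            return needed
    return needed

counts = Counter("a a a a b b c d".split())
words_needed_for_coverage(counts, 0.5)  # -> 1 ('a' alone covers 4 of 8 instances)
words_needed_for_coverage(counts, 0.9)  # -> 4 (all four words are needed)
```

On natural language data this curve rises steeply at first and then flattens, which is why far fewer words are needed for 50% coverage than for 90%.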
Use WordNet (https://wordnet.princeton.edu/) to:
- Evaluate whether a word is in the English language; by inference, a word absent from WordNet can be flagged as likely coming from another language.
- Explore ways to increase coverage using synonyms.
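Both bullets reduce to lexicon lookups: membership to flag non-English words, and synonym expansion to grow coverage. A toy sketch of the idea, with a tiny hand-built stand-in lexicon in place of WordNet itself (the real analysis queries WordNet; all names here are illustrative):

```python
# Toy stand-in for a WordNet lookup: each known English word maps to
# its synonyms. The actual analysis uses WordNet, not this table.
LEXICON = {
    "happy": {"glad", "content"},
    "glad": {"happy"},
    "car": {"automobile"},
}

def is_english(word):
    """Treat any word absent from the lexicon as possibly non-English."""
    return word in LEXICON

def expand_with_synonyms(words):
    """Grow a word set with the synonyms of its members to increase coverage."""
    expanded = set(words)
    for w in words:
        expanded |= LEXICON.get(w, set())
    return expanded

is_english("gato")               # -> False (not in the English lexicon)
expand_with_synonyms({"happy"})  # -> {'happy', 'glad', 'content'}
```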
Task 3: Draft "Next Word" Prediction Model
- Explore predictions based on the (N-1)-gram, comparing backing off to the (N-1)-gram with combining multiple lower-order N-grams.
- Explore techniques to handle unseen N-Grams.
- Explore N-grams in the testing dataset to predict the next word given the previous 1, 2, or 3 words.
- Explore how large an N our N-gram model needs in order to maximize correct predictions while minimizing response time to the user and storage requirements.
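The back-off idea in the first bullet can be sketched as follows: look for the longest training context matching the user's last words, and if nothing is found, drop the leftmost word and retry with a shorter context. This is a simplified, unsmoothed sketch in the style of "stupid backoff", not the app's actual scoring:

```python
from collections import Counter

def predict_next(context, ngram_counts):
    """Back off from the longest available context to shorter ones,
    returning the most frequent continuation seen in training,
    or None if no context of any length matches."""
    for start in range(len(context)):
        suffix = tuple(context[start:])
        candidates = Counter()
        for gram, count in ngram_counts.items():
            if gram[:-1] == suffix:       # gram extends this context by one word
                candidates[gram[-1]] += count
        if candidates:
            return candidates.most_common(1)[0][0]
    return None

counts = {("on", "the", "mat"): 3, ("the", "mat"): 3, ("the", "cat"): 1}
predict_next(["sat", "on", "the"], counts)  # -> 'mat' (backs off to the bigram context)
```

Handling unseen N-grams (second bullet) is exactly the `return None` branch here; a fuller model would instead fall back to the overall unigram distribution or apply smoothing.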
Natural Language Processing of SEC Reports
The Securities and Exchange Commission requires U.S. stock-issuing (publicly listed) companies to file a large number of reports. These reports contain financial data annotated with text of varying lengths. In this Shiny app, we have collected a small sample of recent annotations from the financial reports of three companies with different types of operations and different styles of text annotation.
Machine Learning Modeling of Selected Weight-lifting Activities
In this analysis, we evaluate the data used by Velloso et al. in their paper "Qualitative Activity Recognition of Weight Lifting Exercises" [ACM SIGCHI 2013]. The data were collected from six young men performing weight-lifting exercises with a light dumbbell (1.25 kg) in five pre-determined sequences (recorded as the variable 'classe' in the dataset). Sequence A is the correct execution; sequences B, C, D, and E are specific ways of performing the exercise incorrectly, which means the four wrong sequences should be just as separable from each other as they are from A.
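The separability claim above is the heart of the classification task: each of the five sequences should occupy its own region of sensor-feature space. A deliberately tiny nearest-centroid sketch of that idea (Python for illustration; the actual analysis uses the full accelerometer data and a stronger model in R, and all names here are ours):

```python
# Toy nearest-centroid classifier: each class is summarized by the mean
# of its feature vectors, and a new point takes the label of the
# closest centroid (squared Euclidean distance).
def centroid(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train(samples):
    """samples: dict mapping class label -> list of feature vectors."""
    return {label: centroid(rows) for label, rows in samples.items()}

def classify(model, x):
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist2(model[label]))

samples = {
    "A": [[1.0, 0.1], [0.9, 0.0]],   # toy "correct execution" readings
    "B": [[0.0, 1.0], [0.1, 0.9]],   # toy "incorrect execution" readings
}
model = train(samples)
classify(model, [0.8, 0.2])  # -> 'A'
```

If two incorrect sequences overlapped in feature space, their centroids (or, in the real model, their decision regions) would collide, which is why mutual separability of B through E matters as much as separating them from A.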