Antonio Rubiera

Recently Published

Coursera R Capstone: A Word Prediction Shiny App
We design a language model based on a dataset provided by SwiftKey, now part of Microsoft (https://www.microsoft.com/en-us/swiftkey?activetab=pivot_1%3aprimaryr2). The dataset contains subsets in four languages (Finnish, Russian, German, and English); we use the English data here. After a detailed analysis, we built a model using N-grams of one through five words drawn from all of the data provided, supplemented with 6-grams from the blogs subset of the data. We built a Shiny app based on this language model. The test version is located at: https://rubiera.shinyapps.io/capstone_test/
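The backbone of such a model is a set of N-gram frequency tables. As a minimal illustrative sketch (in Python, not the app's actual R code, and assuming simple whitespace tokenization):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the quick brown fox jumps over the lazy dog".split()
bigrams = ngram_counts(tokens, 2)
bigrams[("the", "quick")]  # -> 1; a 9-token sentence yields 8 bigrams
```

In the real app the same counting is done for n = 1 through 6 over a much larger corpus, so the tables are pruned and stored compactly rather than kept as in-memory counters.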
Exploratory N-Gram Model and Draft Next Word Prediction Model
We analyze here a dataset provided by SwiftKey, now part of Microsoft (https://www.microsoft.com/en-us/swiftkey?activetab=pivot_1%3aprimaryr2). The dataset contains subsets in four languages (Finnish, Russian, German, and English); we analyze the English data here. The analysis proceeds in two tasks.

Task 2: Exploratory data analysis of the SwiftKey English dataset as an N-gram model. Explore 1-grams, 2-grams, and 3-grams in a 70 percent training sample of the dataset to:
- Understand the distribution of words and the relationships between words in the corpora.
- Understand the frequencies of words and word pairs.
- Assess how many unique words a frequency-sorted dictionary needs in order to cover 50% and 90% of all word instances in the English dataset.
Use WordNet (https://wordnet.princeton.edu/) to:
- Evaluate whether a word is in the English language, which, by inference, can be used to flag it as likely belonging to another language.
- Explore ways to increase coverage using synonyms.

Task 3: Draft "next word" prediction model.
- Explore predictions based on the (N-1)-gram, to compare backing off to the (N-1)-gram against combining multiple lower-order N-grams.
- Explore techniques to handle unseen N-grams.
- Explore N-grams in the testing dataset to predict the next word given the previous 1, 2, or 3 words.
- Explore how large an N our N-gram model needs in order to maximize correct predictions while minimizing response time to the user and storage requirements.
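The back-off idea in Task 3 can be sketched as follows: look up the longest observed context first, and fall back to shorter contexts when it is unseen. This is an illustrative Python toy (the real model is in R, and a production model would also apply smoothing or weighted back-off such as "stupid back-off"):

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n):
    """Map each context (tuple of 1..max_n-1 preceding words) to next-word counts."""
    tables = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
            tables[context][nxt] += 1
    return tables

def predict(tables, context):
    """Back off: try the longest context suffix with observations, else shorter ones."""
    for k in range(len(context), 0, -1):
        suffix = tuple(context[-k:])
        if suffix in tables:
            return tables[suffix].most_common(1)[0][0]
    return None  # context entirely unseen

toks = "a b c a b c a b d".split()
tables = build_tables(toks, 3)
predict(tables, ["a", "b"])  # -> 'c' (seen twice after 'a b', vs 'd' once)
```

An unseen two-word context such as ["x", "b"] backs off to the 1-word context ("b",), and a fully unseen context returns None, which is exactly the gap that smoothing techniques address.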
Natural Language Processing of SEC Reports
The Securities and Exchange Commission requires U.S. publicly listed (stock-issuing) companies to file a large number of reports. These reports contain financial data annotated with text of varying lengths. In this Shiny app, we have collected a small sample of recent annotations from the financial reports of three companies with different types of operations and different styles of text annotation.
Machine Learning Modeling of Selected Weight-lifting Activities
In this analysis, we evaluate the data used by Velloso et al. in their paper "Qualitative Activity Recognition of Weight Lifting Exercises" [ACM SIGCHI 2013]. The data was collected from six young men performing a weight-lifting exercise with a light dumbbell (1.25 kg) in five pre-determined ways, recorded as the variable 'classe' in the dataset. Class A is the correct execution, while classes B, C, D, and E are specific, distinct ways of performing the exercise incorrectly; this means the four wrong classes should be just as separable from each other as they are from A.
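With five levels of 'classe', this is a multi-class classification problem. As a toy illustration of the separability argument (a nearest-centroid classifier on made-up 2-D points, not the actual sensor features or the model used in the analysis):

```python
def nearest_centroid(train, point):
    """Classify a point by its closest class centroid (squared Euclidean distance)."""
    centroids = {
        label: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for label, pts in train.items()
    }
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], point))

# Two toy clusters standing in for classes A and B
train = {"A": [(0, 0), (1, 0), (0, 1)], "B": [(5, 5), (6, 5), (5, 6)]}
nearest_centroid(train, (0.5, 0.5))  # -> 'A'
```

If each class forms its own tight cluster in feature space, any such distance-based classifier separates all five classes equally well, which is the intuition behind expecting B through E to be as distinguishable from each other as from A.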