Recently Published
Bioinformatics: Sequence and Phylogenetic Analysis
In this article, I have discussed the fundamentals of sequence analysis, along with multiple sequence alignment and phylogenetic analysis in bioinformatics.
Sequence analysis is the most basic task in bioinformatics. In general, it refers to processing DNA or protein sequence data to extract information about the function, structure, or evolution of the biomolecule.
Multiple sequence alignment is another fundamental task in bioinformatics. It often yields biological insight into the sequence-structure-function relationships of nucleotide or protein sequence families.
NLP: A Novel and Comprehensive Methodology for Text Analytics in R
In this research project, we have developed a state-of-the-art methodology for text analytics that bypasses deep learning architectures, making text analytics feasible for small machines on the go. In this experimental setting, we collected a large corpus of spam and non-spam text for classification. We tokenized and preprocessed the text with the quanteda library in R. We then built a document-feature matrix (DFM) and used it to train our first decision tree model with a cross-validation approach. Next, we wrote a term frequency-inverse document frequency (TF-IDF) function from scratch and applied it to the DFM to normalize the documents while accounting for the weight of each word in making the prediction. Finally, we adopted n-gram modeling to enrich the feature space.
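The preprocessing steps above can be sketched as follows. This is a minimal illustration assuming the quanteda package; the toy texts and labels are invented for the example and are not the project's data.

```r
# Minimal sketch of tokenization -> DFM -> TF-IDF -> n-grams with quanteda.
# The texts and labels below are illustrative stand-ins, not the real corpus.
library(quanteda)

texts  <- c("Win a free prize now", "Meeting moved to 3 pm",
            "Free entry in a prize draw", "Lunch at noon tomorrow")
labels <- factor(c("spam", "ham", "spam", "ham"))

toks <- tokens(texts, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

dfm_mat   <- dfm(toks)            # document-feature matrix
tfidf_mat <- dfm_tfidf(dfm_mat)   # TF-IDF weighting (quanteda's built-in)

# Unigrams plus bigrams enrich the feature space, as in the n-gram step
toks_2 <- tokens_ngrams(toks, n = 1:2)
```

Note that quanteda ships its own `dfm_tfidf()`; the project builds the TF-IDF transformation from scratch instead, but the resulting weighted matrix plays the same role.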
All of these are fairly standard procedures in text analytics, but on big data the feature space explodes to a scale that renders most machine learning models infeasible on a small computer within a limited time frame. To resolve this, we applied Latent Semantic Analysis: using Singular Value Decomposition, we projected the TF-IDF-weighted DFM onto a lower-dimensional vector space. This mitigated the curse of dimensionality while preserving the performance of the decision tree model. The projected dataset retains only a small number of the most important features, which makes it possible to run more sophisticated algorithms such as Random Forest to raise the accuracy. Finally, we engineered new features on the vector space model to boost accuracy on both the training and the test data, keeping the two in balance. Feature engineering proved to be the decisive optimization step once only the vector space model remained.
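The LSA projection can be sketched in base R: decompose the weighted matrix with SVD and keep only the first k singular dimensions. The matrix below is a random stand-in for a real TF-IDF matrix.

```r
# LSA via SVD in base R: project documents onto the top-k singular vectors.
# The 8x5 matrix here is a toy stand-in for a TF-IDF-weighted DFM.
set.seed(42)
tfidf <- matrix(rexp(40), nrow = 8, ncol = 5)  # 8 documents, 5 terms

k   <- 2              # target dimensionality
dec <- svd(tfidf)     # X = U D V'

# Document coordinates in the reduced semantic space (n_docs x k)
docs_lsa <- dec$u[, 1:k] %*% diag(dec$d[1:k])

# New documents fold in via: q %*% dec$v[, 1:k] %*% diag(1 / dec$d[1:k])
dim(docs_lsa)  # 8 documents, now described by only k features
```

The reduced `docs_lsa` matrix is what downstream models such as the decision tree or Random Forest would train on.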
Querying a Cloud Database using MySQL in RStudio: Basic to Advanced
In this project, I have discussed and elaborated on how an R programmer can perform all of the usual SQL operations without leaving the R environment. For this project, I connected to a cloud database hosted on AWS and built queries ranging from the basic level to more advanced ones.
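The DBI workflow behind this can be sketched as below. An in-memory SQLite database stands in so the example runs anywhere; the connection arguments for a cloud MySQL instance are shown as commented placeholders, not real credentials.

```r
# Querying a database from R via DBI. SQLite stands in for a cloud MySQL
# instance; the RMariaDB connection shown in comments uses placeholder values.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
# For a cloud MySQL database on AWS (placeholders only):
# con <- dbConnect(RMariaDB::MariaDB(), host = "your-endpoint",
#                  user = "user", password = "password", dbname = "db")

dbWriteTable(con, "mtcars", mtcars)              # stage a table to query
res <- dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg
                        FROM mtcars GROUP BY cyl ORDER BY cyl")
print(res)
dbDisconnect(con)
```

Because DBI abstracts the backend, the same `dbGetQuery()` calls work unchanged against the AWS-hosted MySQL database.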
Working with APIs (Application Programming Interfaces) in R
In this short project, I have demonstrated how to send a request to an API and parse the content of the response into a data set ready for analysis.
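The request-and-parse pattern looks like this, assuming the httr and jsonlite packages. The endpoint URL is a hypothetical placeholder, so the live call is shown commented out and a literal JSON payload stands in for the response body.

```r
# Request an API and turn the JSON response into a data frame.
# The URL is hypothetical; a literal payload stands in for the response body.
library(jsonlite)

# resp <- httr::GET("https://api.example.com/items")              # live call
# body <- httr::content(resp, as = "text", encoding = "UTF-8")
body <- '[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]'

df <- fromJSON(body)   # JSON array of records -> data frame
df
```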
Mapping Geospatial Data with R
In this short project, I have reproduced the famous cholera death map created by the physician John Snow in 1854.
ETL (Extract, Transform, and Load) with RStudio
RStudio, the most popular IDE for R programming, is a natural home for ETL (Extract, Transform, and Load) alongside data wrangling, statistical analysis, string processing, and machine learning. R's vast collection of libraries makes it possible to execute ETL with high efficiency on this platform.
I have developed this project to demonstrate how to perform ETL in RStudio, using a big dataset compiled as TECA to run through all the experiments.
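An illustrative extract-transform-load pass in R is sketched below, assuming the dplyr and readr packages. The TECA dataset is not reproduced here, so a built-in dataset stands in for the extract step.

```r
# A small ETL pass: extract a table, transform it, load the clean result.
# airquality stands in for the TECA dataset used in the project.
library(dplyr)
library(readr)

# Extract: in the project this would be read_csv("teca.csv")
raw <- as_tibble(datasets::airquality)

# Transform: drop incomplete rows, fix types, derive a feature
clean <- raw |>
  filter(!is.na(Ozone)) |>
  mutate(Month = factor(Month),
         TempC = (Temp - 32) * 5 / 9)

# Load: persist the cleaned table for downstream analysis
write_csv(clean, file.path(tempdir(), "airquality_clean.csv"))
```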
Data Visualization in Practice
In this project, I have elaborated on visualization techniques frequently practiced in industry, academia, and government. The sheer breadth and depth of *ggplot2* and its counterparts are beyond the scope of this project, but it will soon be updated with more sophisticated and interactive visualization techniques.
The R programming language has rich and powerful data visualization capabilities. While tools like Excel, Power BI, and Tableau are often the go-to solutions for data visualization, none of them can compete with R in the sheer breadth of, and control over, crafted data visualizations.
We will mostly focus on the visualization tools provided by the mighty *ggplot2* library. *ggplot2* implements the *grammar of graphics*, which means hundreds of plots can be created by combining a handful of graphical verbs, nouns, and adjectives.
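The grammar-of-graphics idea is easy to see in a small example: the same "nouns" (data and aesthetic mappings) combine with different "verbs" (geoms, smoothers, facets) to build a layered plot. This sketch uses the built-in mtcars data.

```r
# Layered grammar of graphics: data + aesthetics + geoms + facets.
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                          # one layer: raw points
  geom_smooth(method = "lm", se = FALSE) +        # another layer: linear trends
  facet_wrap(~ am, labeller = label_both) +       # split panels by transmission
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

# print(p) renders the figure
```

Swapping a single layer (say, `geom_point()` for `geom_boxplot()` with a discrete x) yields an entirely different chart from the same grammar.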
Neural Network: A Layman's Guide
In this project, I have taken great pains to describe the inner workings of the neural network in easy-to-understand language without dropping the mathematics. Slowly and gradually, one can decipher the complexity of neural networks by sailing through graphical representations and mathematics built from scratch. You are welcome!
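In the same from-scratch spirit, a single forward pass through a one-hidden-layer network can be written in a few lines of base R. The weights below are illustrative numbers, not trained values.

```r
# Toy forward pass: input -> hidden layer -> scalar output, in base R.
sigmoid <- function(z) 1 / (1 + exp(-z))

x  <- c(0.5, -1.2)                           # input vector
W1 <- matrix(c(0.1, 0.4, -0.3, 0.2), 2, 2)   # input -> hidden weights
b1 <- c(0.01, -0.02)                         # hidden biases
W2 <- c(0.7, -0.5)                           # hidden -> output weights
b2 <- 0.1                                    # output bias

h     <- sigmoid(W1 %*% x + b1)              # hidden activations
y_hat <- sigmoid(sum(W2 * h) + b2)           # output in (0, 1)
```

Training consists of repeating this pass, measuring the error of `y_hat`, and nudging the weights via backpropagation.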
Data Collection and Data Management Routines in R
Data collection and data management skills are basic requirements for data analysts, data scientists, and researchers. Drawing on my three years of experience as a data analyst, I have accumulated a comprehensive set of data collection and data management operations in R and assembled them into this project, discussing every topic with a succinct description. Minor mistakes may still lurk inside; readers' discretion is advised.
Extensive Statistical Data Analysis Pipeline
As a data analyst, it was a long-cherished dream of mine to publish an exhaustive statistical methodology for data analysis in practice. Although this project is not intended to be all-encompassing, it provides a good framework for an end-to-end data analysis pipeline with robust, reproducible reporting.
In this statistical data analysis project, we will walk through a comprehensive data analysis pipeline, starting with basic analytical procedures, expanding to more advanced and sophisticated methods, and finishing with hypothesis testing of different kinds. All code chunks are shown in shaded rectangular boxes. Code output appears in unshaded rectangular boxes prefixed with ##, and comments within the code chunks are prefixed with #.
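As a taste of the hypothesis-testing stage, the sketch below runs a two-sample t-test in base R. The ICU data set is not bundled here, so R's built-in sleep data stands in: a numeric outcome compared across two groups.

```r
# Two-sample t-test comparing a numeric outcome across two groups.
# The built-in sleep data stands in for the ICU data used in the project.
tt <- t.test(extra ~ group, data = sleep)

tt$p.value    # evidence against equal group means
tt$conf.int   # 95% confidence interval for the mean difference
```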
For this project, we will use a curated dataset, the ICU data, containing various observations of ICU patients, prepared by the statistician Matthias Kohl from real-life health data.
Statistical Learning Workshops: Linear Regression
Linear regression is the stepping stone for building any statistical or machine learning model. Although it is somewhat dull compared with the more popular and complex learning algorithms, linear regression remains a very useful tool for predicting quantitative responses, and it lets us make sound inferences about which variables or predictors drive the response. The term *linear* does not limit us to linear terms of the variables: we can also fit polynomial versions of the predictors to build a more complex model that captures the behavior of the data.
In this workshop, we will start with *Simple Linear Regression*, where we make predictions based on a single variable. Later, we will introduce multiple linear regression to accommodate many predictors and interaction terms in a single model.
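The progression just described can be compressed into a few `lm()` calls on the built-in mtcars data: a simple fit, a polynomial extension, and a multiple regression with an interaction term.

```r
# Simple, polynomial, and multiple linear regression with base R's lm().
simple <- lm(mpg ~ wt, data = mtcars)          # one predictor
poly2  <- lm(mpg ~ poly(wt, 2), data = mtcars) # still linear in the coefficients
multi  <- lm(mpg ~ wt * hp, data = mtcars)     # main effects + wt:hp interaction

summary(simple)$r.squared   # variance in mpg explained by weight alone
coef(multi)                 # includes the interaction coefficient
```

Note that `poly2` is still a *linear* model: the response is linear in the coefficients even though it is quadratic in `wt`.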