Aryan Singh's World

Data Science, Statistics, ML, Deep Learning

How to choose the Machine Learning algorithm to use for a problem?

Which machine learning algorithm suits a particular problem is hard to determine without trying several algorithms together with hyperparameter optimisation. Still, a few pointers can be kept in mind while narrowing down the candidates:

1. Time Series Data: For data with a single dependent variable observed as a time sequence, classical algorithms like ARIMA and sequence models like LSTM can be benchmarked against each other to find the best solution.

2. Speech/Text Analytics: A deep learning based approach using sequence models like RNN and LSTM, often in a sequence to sequence setup, is probably a good start.

3. Text Classification: Sequence models like HMM, CRF and LSTM can be tried for this problem.

4. Structured Data (Regression): Linear regression can be used as a baseline, followed by SVM regression, first with a linear kernel and then with non-linear kernels like RBF. Tree based ensemble models like Random Forest and XGBoost should also be tried for a more intuitive solution.

5. Structured Data (Classification): We can start with Logistic Regression as a baseline. Its coefficients also indicate the importance of each of the independent variables (features), provided the features are on comparable scales. Furthermore, SVM, Random Forest and XGBoost can be tried. If we have a large number of training examples and better hardware, a deep learning based solution with an appropriate output activation function, softmax (multiclass) or sigmoid (binary), can be used.

6. Image/Video based Data: For image/video based data, a pre-trained deep learning network is a good starting point. In particular, tried and tested architectures like VGG and ResNet-50 trained on the ImageNet dataset can be used, and the final few layers can be retrained to tailor the network to our particular problem. For real-time object detection, YOLO is a very elegant solution and can be given a try.
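As a rough illustration of the "start with a baseline, then benchmark alternatives" idea for structured classification data, here is a minimal sketch using scikit-learn. The synthetic dataset, the two candidate models and their settings are assumptions chosen only for the example; they are not a recommendation for any specific problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a structured classification dataset (an assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Baseline first, then a tree based ensemble to compare against it.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Print the candidates from best to worst score.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The same loop extends to other candidates (for example an SVM) by adding entries to the dictionary, so that every model is scored on identical folds.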

What do NaMo’s speeches convey?

This weekend, while wandering around the labyrinths of the internet, I stumbled upon a corpus of Indian Prime Minister Mr. Narendra Modi’s speeches. I thought it would be interesting to analyse the speeches to see which issues he mainly speaks about and what the overall connotation of the speeches is. In this blog, I present my analysis of the speeches along with visualizations in the form of graphs and plots.

Unigram and Bigram Frequency

I used CountVectorizer from sklearn's feature extraction module to turn the text into frequency vectors, and then summed these over the rows to find the frequency of each word in the overall corpus. I then plotted the top 30 words by frequency on a bar plot to analyse them. Following is the result I got:

[Figure: bar plot of the top 30 words by frequency]

Since Mann Ki Baat is a programme aimed at listening to and addressing the problems of people, the PM's main focus is on issues relating to poverty and water. He also talks about taking action, using words like time, make and great.
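The counting step described above can be sketched without any dependencies. The version below uses `collections.Counter` from the standard library as a stand-in for CountVectorizer, so it is not the code from the notebook; the two-sentence corpus is hypothetical, and stop-word removal is left out.

```python
import re
from collections import Counter


def ngram_counts(documents, n=1):
    """Count n-grams across all documents (lowercased, alphabetic tokens only)."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Slide a window of length n over the token list.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Hypothetical two-document corpus, used only for illustration.
speeches = [
    "Water is life. Save water in every village.",
    "Every village needs clean water.",
]

unigrams = ngram_counts(speeches, n=1)
bigrams = ngram_counts(speeches, n=2)

# The most frequent entries are what would go on the bar plot.
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Taking `most_common(30)` on the real corpus would give the input for a top-30 bar plot like the one above.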

Most Frequent Nouns, Adjective and Verbs

Next up, I thought it would be interesting to do POS tagging of each speech to see which major issues the PM lays stress upon and how positive/willing he is to solve them. For this, I POS tagged the whole corpus using nltk and then found the most common nouns, verbs and adjectives in it. I plotted the 16 most common nouns, adjectives and verbs in the form of a word cloud to visualise them and draw inferences. Here is what I got:

Nouns Cloud:

[Figure: word cloud of the most common nouns]

It is clear from the word cloud that the main issues being highlighted relate to basic amenities and rural life: water, villages, farmers. Interestingly enough, yoga is a frequent part of the conversation. The PM has also addressed issues about black money, but the frequency is on the lower side.

Verbs Cloud:

[Figure: word cloud of the most common verbs]

The verbs mostly have a positive connotation. Words like think, make, started and done indicate an action-oriented approach.

Adjectives Cloud:

[Figure: word cloud of the most common adjectives]

The adjectives do reveal the basic essence of the major fields/issues that the PM is targeting. Youth, the poor and Digital India initiatives are some of the most frequent areas touched upon.
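The counting that feeds the three word clouds can be sketched as follows. This is a minimal sketch, not the notebook's code: it assumes the tokens have already been tagged, in the list of (word, tag) pairs with Penn Treebank tags that `nltk.pos_tag` returns, and the hand-tagged sample and the helper name `top_words_by_pos` are hypothetical.

```python
from collections import Counter

# Penn Treebank tag prefixes for the three word classes of interest.
TAG_PREFIXES = {"nouns": "NN", "verbs": "VB", "adjectives": "JJ"}


def top_words_by_pos(tagged_tokens, k=16):
    """Return the k most common nouns, verbs and adjectives."""
    counters = {name: Counter() for name in TAG_PREFIXES}
    for word, tag in tagged_tokens:
        for name, prefix in TAG_PREFIXES.items():
            # NN also matches NNS/NNP, VB matches VBD/VBP, JJ matches JJR/JJS.
            if tag.startswith(prefix):
                counters[name][word.lower()] += 1
    return {name: counter.most_common(k) for name, counter in counters.items()}


# Hand-tagged, hypothetical sample in the shape nltk.pos_tag produces.
tagged = [
    ("Farmers", "NNS"), ("need", "VBP"), ("clean", "JJ"), ("water", "NN"),
    ("and", "CC"), ("poor", "JJ"), ("villages", "NNS"), ("need", "VBP"),
    ("water", "NN"),
]

top = top_words_by_pos(tagged)
print(top["nouns"])
print(top["verbs"])
print(top["adjectives"])
```

Each of the three lists of (word, count) pairs can then be handed to a word cloud library as a frequency dictionary.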

Sentiment Analysis Of The Speeches

Next up, I analysed each speech for its sentiment score to understand whether the connotation is positive or negative and how it has changed over time. I used the TextBlob library to compute a sentiment score for each speech. Here is what the sentiment score looks like as a time series:

[Figure: time series of the sentiment score of each speech]

Looking at the overall analysis, the speeches don’t seem to be that positive. This might be because they are aimed at addressing the issues people face on a daily basis. Overall, the years 2015 and 2016 are more positive than the other years.
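The year-wise comparison above amounts to averaging the per-speech scores within each year. A minimal sketch of that aggregation follows; it assumes each speech already has a polarity score in the range -1 to 1 (the kind of value TextBlob exposes as `sentiment.polarity`). The dates and scores below are made up for illustration and are not results from the corpus.

```python
from collections import defaultdict
from statistics import mean


def mean_polarity_by_year(scored_speeches):
    """Average polarity per year from (ISO date string, polarity) pairs."""
    by_year = defaultdict(list)
    for date, polarity in scored_speeches:
        # The first four characters of an ISO date are the year.
        by_year[date[:4]].append(polarity)
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Hypothetical (date, polarity) pairs, not real scores from the speeches.
scored = [
    ("2015-01-01", 0.5),
    ("2015-06-01", 0.25),
    ("2016-03-01", 0.0),
    ("2016-11-01", 0.5),
]

yearly = mean_polarity_by_year(scored)
for year, score in yearly.items():
    print(year, score)
```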

Code: https://github.com/aryancodify/NaturalLanguageProcessing/blob/master/modi_speeches.ipynb

Dataset: https://www.kaggle.com/shankarpandala/mann-ki-baat-speech-corpus