Election Analysis (UP Election-2022)

Contents:
- Overview
- The Data
- Preprocessing
- Brute Force Cleaning (All words)
- Minor Tuning (All words)
- Frequent Words (All words)
- Plots (All words)
- Brute Force Cleaning (Focus words)
- Plots (Focus words)
- Code Link
- Future Work
- References
1 — Overview: As the Assembly Election is ongoing in Uttar Pradesh (India), many agencies/news channels try to predict the result through exit polls. To do this, they carry out exit polls using random sampling; some also opt for systematic sampling to predict the actual result. The agencies/news channels ask people from different age groups, genders, castes, religions and regions whom they voted for.
But sometimes this becomes very expensive and may not be accurate. What is the alternative?
As we know, a majority of the population is on social media like Facebook, Twitter, YouTube, etc. What if we could somehow use this data to make an analysis or prediction in a cheap and efficient way?
Well, this blog discusses and demonstrates a sample of work on the proposed idea, which can give us an idea of how to experiment with more data.
2 — The Data: As discussed above, to use data from social media, here I will take data from YouTube, namely the ‘comments’ and ‘likes’ from the top 10 interviews of the last couple of months of the two major candidates, Akhilesh Yadav (Samajwadi Party) and Yogi Aditya Nath (Bhartiya Janta Party).
To do this, we scrape the YouTube comments of the top 10 videos of both candidates.
Below is the code to scrape the comments and likes from the videos of Akhilesh Yadav; the same goes for Yogi Aditya Nath.
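Since the original script is not reproduced here, the following is a minimal sketch, assuming the YouTube Data API v3 via google-api-python-client; API_KEY and VIDEO_IDS are placeholders you would fill in with your own key and the candidate’s video IDs.

```python
# Sketch: pull top-level comments and their like counts for a list of videos
# using the YouTube Data API v3. API_KEY and VIDEO_IDS are placeholders.
import pandas as pd
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"
VIDEO_IDS = ["video_id_1", "video_id_2"]  # top-10 interview videos of Akhilesh Yadav

youtube = build("youtube", "v3", developerKey=API_KEY)

rows = []
for video_id in VIDEO_IDS:
    request = youtube.commentThreads().list(
        part="snippet", videoId=video_id, maxResults=100, textFormat="plainText"
    )
    while request is not None:
        response = request.execute()
        for item in response["items"]:
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            rows.append({"comment": snippet["textDisplay"],
                         "likes": snippet["likeCount"],
                         "video_id": video_id})
        request = youtube.commentThreads().list_next(request, response)

akhilesh_df = pd.DataFrame(rows)
akhilesh_df.to_csv("akhilesh_comments.csv", index=False)
```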
3 — Preprocessing: This is the most crucial task in the analysis, as there won’t be any kind of machine learning or deep learning model/algorithm. Also, the comments will be in multiple languages like Hindi, English or Hinglish (Roman Urdu), which makes our task a bit more complicated. So we do this task in multiple parts —
Part 1 — Get the stopwords:
First, remove the stop words of all the languages (Hindi, English and Hinglish) using pre-defined stop-word lists that I found on the internet.
As you can see, we have also added some more stopwords that were not in the downloaded lists.
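A rough sketch of how such a combined stop-word set might be built; the file names and the extra words added at the end are illustrative, not the exact lists used here.

```python
# Sketch: merge pre-downloaded stop-word lists and add a few custom words.
# File names and the extra words are placeholders.
def load_words(path):
    with open(path, encoding="utf-8") as f:
        return set(f.read().split())

stopwords = (load_words("english_stopwords.txt")
             | load_words("hindi_stopwords.txt")
             | load_words("hinglish_stopwords.txt"))

# Extra words seen in the comments but missing from the downloaded lists
stopwords.update({"bhai", "ji", "video", "plz"})
```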
Part 2 — Abbreviation and Emoji removal:
In this part we will try to convert number abbreviations into numbers, like ‘K’ to 1000; along with that, we will remove the emojis from the sentences using the emoji library.
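A small sketch of the two steps described above; it assumes emoji >= 2.0 (which provides emoji.replace_emoji), and the regex for the ‘K’ abbreviation is my own illustration, not necessarily the exact one used in the notebook.

```python
# Sketch: expand number abbreviations ("12k" -> "12000") and strip emojis.
import re
import emoji  # pip install emoji>=2.0

def expand_number_abbreviations(text):
    def repl(match):
        return str(int(float(match.group(1)) * 1000))
    return re.sub(r"(\d+(?:\.\d+)?)\s*[kK]\b", repl, text)

def remove_emojis(text):
    return emoji.replace_emoji(text, replace="")

print(expand_number_abbreviations("got 12k likes"))  # got 12000 likes
print(remove_emojis("great speech 👏🔥"))              # great speech
```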
Part 3 — Combine:
Let’s combine all the cleaning parts and also remove HTML tags and HTML links using the BeautifulSoup library. Additionally, we decontract words like “can’t” to “can not” using the contractions library.
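Putting the pieces together, a hedged sketch of the combined cleaning function could look like the following (it reuses stopwords, expand_number_abbreviations and remove_emojis from the earlier snippets; the exact order of steps is an assumption).

```python
# Sketch: one cleaning function that strips HTML, links, emojis and punctuation,
# expands contractions and abbreviations, lowercases and drops stop words.
import re
from bs4 import BeautifulSoup
import contractions

def clean_comment(text):
    text = BeautifulSoup(text, "html.parser").get_text()     # strip HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)             # drop links
    text = contractions.fix(text)                             # decontract ("can't" -> "cannot")
    text = expand_number_abbreviations(text)                  # from Part 2
    text = remove_emojis(text)                                # from Part 2
    text = re.sub(r"[^\w\s]", " ", text).lower()              # punctuation and case
    return " ".join(w for w in text.split() if w not in stopwords)
```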
Part 4 — Apply
Here we apply all the preprocessing/cleaning tasks using the .apply function of pandas.
The same goes for the Yogi Aditya Nath data set.
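With the cleaning function in place, applying it to both data frames is a one-liner per column (the column names below are my assumptions).

```python
# Sketch: apply the cleaning function to the "comment" column of both data sets.
akhilesh_df["clean_comment"] = akhilesh_df["comment"].apply(clean_comment)
yogi_df["clean_comment"] = yogi_df["comment"].apply(clean_comment)
```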
4 — Brute Force Cleaning (All words): Now this part is quite tricky, as we go beyond the dataset: we are going to use words/slang that people in India generally use towards particular parties or persons, for example biased media is often referred to by people as “Godi Media”. We make a dictionary of such words and build a pattern to replace these words according to our defined dictionary. The assumption we are making here is that the words in the dictionary will appear in our dataset.
Note: Here we are taking all the words without making special changes, i.e. we will not convert words/sentences to tags like “Jai Yogi” to “Yogi_positive”.
Then we apply this function to the “comment” column using pandas’ .apply function.
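The slang dictionary below is only an illustrative subset (the real one is larger); the replacement itself is a single regex built from the dictionary keys.

```python
# Sketch: replace common slang/phrases with canonical words in one regex pass.
import re

slang_map = {
    "godi media": "biased_media",
    "bjp": "bhartiya_janta_party",
    "sp": "samajwadi_party",
}
slang_pattern = re.compile(r"\b(" + "|".join(map(re.escape, slang_map)) + r")\b")

def replace_slang(text):
    return slang_pattern.sub(lambda m: slang_map[m.group(0)], text)

akhilesh_df["clean_comment"] = akhilesh_df["clean_comment"].apply(replace_slang)
yogi_df["clean_comment"] = yogi_df["clean_comment"].apply(replace_slang)
```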
5 — Minor Tuning (All words): Here we will remove some more words to make our plots/analysis more robust.
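For example, dropping a few extra high-frequency but uninformative words could look like this (the word list is purely illustrative).

```python
# Sketch: drop a handful of extra words that would otherwise dominate the plots.
extra_noise = {"video", "news", "channel", "interview"}

def drop_noise(text):
    return " ".join(w for w in text.split() if w not in extra_noise)

akhilesh_df["clean_comment"] = akhilesh_df["clean_comment"].apply(drop_noise)
yogi_df["clean_comment"] = yogi_df["clean_comment"].apply(drop_noise)
```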
6 — Frequent Words (All words): Here we build a dictionary mapping each word to the number of times it appears across the corpus. Then we keep only the top 50 most frequent words.
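A Counter makes this step short; the sketch below builds the top-50 frequency dictionary for one candidate’s corpus (the same is done for the other).

```python
# Sketch: word -> count dictionary over the corpus, keeping the 50 most frequent.
from collections import Counter

all_words = " ".join(akhilesh_df["clean_comment"]).split()
top_50 = dict(Counter(all_words).most_common(50))
```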
7 — Plots (All words):
Here we plot a word cloud using the above frequency dictionary.
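A minimal word-cloud sketch using the wordcloud library with the frequency dictionary built above (a Devanagari font_path would be needed to render Hindi words correctly).

```python
# Sketch: word cloud from the top-50 frequency dictionary.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(top_50)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```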




8 — Brute Force Cleaning (Focus words): Without making the blog too long to read, let’s skip the common parts: here we do the same cleaning that we did above, but we change the dictionary so that it tags comments as positive or negative.
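The focus dictionary below is only an illustrative subset showing the idea of mapping party/candidate-specific phrases to polarity tags; the real dictionary contains many more (and different) entries.

```python
# Sketch: map candidate-specific phrases to polarity tags like "yogi_positive".
import re

focus_map = {
    "jai yogi": "yogi_positive",
    "yogi hatao": "yogi_negative",          # illustrative negative phrase
    "jai akhilesh": "akhilesh_positive",
    "akhilesh hatao": "akhilesh_negative",  # illustrative negative phrase
}
focus_pattern = re.compile(r"\b(" + "|".join(map(re.escape, focus_map)) + r")\b")

def tag_focus_words(text):
    return focus_pattern.sub(lambda m: focus_map[m.group(0)], text)

akhilesh_df["focus_comment"] = akhilesh_df["clean_comment"].apply(tag_focus_words)
yogi_df["focus_comment"] = yogi_df["clean_comment"].apply(tag_focus_words)
```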
9 — Plots (Focus words):
We now plot in two ways: first we plot the polarity tags per candidate without combining them, and then we combine them per party of the respective candidates.
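A sketch of both views using simple bar charts; the tag and party names follow the assumptions made in the earlier snippets.

```python
# Sketch: count polarity tags, then plot per candidate and combined per party.
import matplotlib.pyplot as plt
from collections import Counter

tags = ["yogi_positive", "yogi_negative", "akhilesh_positive", "akhilesh_negative"]
counts = Counter(w for df in (akhilesh_df, yogi_df)
                 for text in df["focus_comment"] for w in text.split() if w in tags)

# Per-candidate view
plt.bar(tags, [counts[t] for t in tags])
plt.xticks(rotation=45)
plt.title("Polarity tags per candidate")
plt.tight_layout()
plt.show()

# Combined per-party view (candidate tags rolled up to their parties)
party_counts = {
    "BJP_positive": counts["yogi_positive"],
    "BJP_negative": counts["yogi_negative"],
    "SP_positive": counts["akhilesh_positive"],
    "SP_negative": counts["akhilesh_negative"],
}
plt.bar(list(party_counts), list(party_counts.values()))
plt.title("Polarity tags per party")
plt.tight_layout()
plt.show()
```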





10 — Code Link: GitHub
11 — Future Work:
- To make my analysis more accurate, I will use tweets from Twitter.
- Make use of likes more efficiently.
- Build a sentiment analyzer.
12 — References.