Battle Of The Neighbourhood
Last Updated on May 3, 2021
This was a complete data science project whose main idea is to compare how similar or dissimilar two cities are. Here I have compared New York City and Amsterdam. Using the Foursquare API, reviews and ratings of different locations in both cities were fetched. I performed segmenting and clustering on the data of both cities, and used Folium for the mapping.
Spotify Data Analysis
This project was made using Tableau. Tableau is interactive visualization software with which many functions can be performed. Charts can be drawn using single or multiple attributes, and colours can be added to show variation in the charts or the intensity of a particular attribute. Charts/graphs that can be made include:
1. Pie chart
2. Bar graph
3. Line graph
4. Waterfall chart
In my project, I used a dataset from Kaggle. The dataset was about the details of songs from the Spotify app. It had 19 different attributes, out of which 2 were in string format and the rest were numerical. A few attributes were:
1. Song name
2. Artist name
From these 19 attributes I made a total of 13 visualizations based on different factors and assembled them into 6 dashboards.
The first dashboard gives an analysis of danceability. It shows 2 analyses:
1. Artists who provide the most danceability
It is a bar graph with danceability on the y-axis. It shows that the artist Katy Perry had the most danceability in her songs.
2. Artists in the top 10 with the most danceability
It is a bar graph whose colour dims as the bar’s size decreases.
The second dashboard gives an analysis of song genres. It shows 2 analyses:
1. How the proportion of genres has changed in 10 years
Canadian pop was famous in 2009 as well as in 2020, while Detroit hip hop is not as famous now.
2. Least famous artists and the genre of their songs
It is a point chart showing which artist makes songs in which genre.
The third dashboard gives an analysis of popularity. It shows 2 analyses:
1. Most popular artists and their popularity
It shows how the popularity of the artists has changed over the years.
2. Most popular artists and their songs’ popularity
It shows that the artist Sara Bareilles has the highest popularity, with an average popularity of 71.
The fourth dashboard gives an analysis of positivity. It shows 2 analyses:
1. Loudness vs energy with respect to positivity
A colour-changing bar graph which dims as the value decreases.
2. Artist with most positivity
A bar graph showing that the artist Katy Perry has the most positive songs.
The fifth dashboard shows 2 analyses:
1. Song names that start with question-related phrases
Such songs had a popularity index of only 1055.
2. Change in speechiness vs beats
A bar graph that shows the change of speechiness vs beats over the years.
The sixth dashboard gives an analysis of the most popular artist, Katy Perry. It shows 3 analyses:
1. Songs sung over the years
It is in tabular format with 2 columns.
2. Popularity of songs
It shows how popular her songs have been over the years.
3. Popularity and number of times her songs appeared in the top 10
It shows the popularity index of her most popular and hit songs.
Password Checker
This may be the most secure way for you to check whether your password has ever been leaked. It is a password checker that reports whether a password has been seen before and, if it has, the number of times it has been found, making it easy to judge whether your password is strong enough to keep or too weak. Its working is pretty simple: in my terminal I run the Python file containing my code, checkmypass.py, followed by the password to check whether it has ever been compromised; it will check as many passwords as are listed in the terminal. I have used the Pwned Passwords API and the SHA-1 algorithm, which hashes the given password into a complex output that is hard to reverse. For extra privacy, only the first five characters of the hashed password are sent, so the real one stays safe. This uses the concept of k-anonymity, which provides privacy protection by guaranteeing that each record relates to at least k individuals, even if the released records are directly linked (or matched) to external information. I have added this to my GitHub repository.
This can be really effective for personal use.
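The k-anonymity range query described above can be sketched as follows; the helper names are mine, not necessarily those used in checkmypass.py, and only the 5-character hash prefix ever leaves the machine.

```python
import hashlib

def hash_password(password):
    """Return the (prefix, suffix) split of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def count_from_response(body, suffix):
    """Parse the 'SUFFIX:COUNT' lines returned by the range endpoint."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate == suffix:
            return int(count)
    return 0  # suffix not found: password not in any known breach

def pwned_count(password):
    """Query the Pwned Passwords range API with only the 5-char prefix."""
    import requests  # third-party: pip install requests
    prefix, suffix = hash_password(password)
    res = requests.get(f"https://api.pwnedpasswords.com/range/{prefix}")
    res.raise_for_status()
    return count_from_response(res.text, suffix)
```

Because the API returns every suffix sharing the prefix, the server never learns which password was checked; the match against the full hash happens locally.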
Age And Gender Detection
Objective: To build a gender and age detector that can approximately guess the gender and age of the person (face) in a picture or through a webcam.
Description: In this Python project, I used deep learning to identify the gender and age of a person from a single image of a face. I used the models trained by Tal Hassner and Gil Levi. The predicted gender may be one of ‘Male’ or ‘Female’, and the predicted age may be one of the following ranges: (0 – 2), (4 – 6), (8 – 12), (15 – 20), (25 – 32), (38 – 43), (48 – 53), (60 – 100) (8 nodes in the final softmax layer). It is very difficult to accurately guess an exact age from a single image because of factors like makeup, lighting, obstructions, and facial expressions, so I made this a classification problem instead of a regression problem.
For this Python project, I used the Adience dataset, which is available in the public domain. This dataset serves as a benchmark for face photos and covers various real-world imaging conditions like noise, lighting, pose, and appearance. The images were collected from Flickr albums and distributed under the Creative Commons (CC) license. It has a total of 26,580 photos of 2,284 subjects in eight age ranges (as mentioned above) and is about 1GB in size. The models I used had been trained on this dataset.
Working: Open your Command Prompt or Terminal and change directory to the folder where all the files are present.
- To detect the gender and age of a face in an image, use the command:
python detect.py --image image_name
- To detect the gender and age of a face through the webcam, use the command:
python detect.py
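The final classification step can be sketched as follows; the bucket list mirrors the 8 softmax nodes described above, while decode_prediction is an illustrative helper of mine, not the actual code in detect.py.

```python
# the 8 age buckets matching the final softmax layer described above
AGE_BUCKETS = ["(0-2)", "(4-6)", "(8-12)", "(15-20)",
               "(25-32)", "(38-43)", "(48-53)", "(60-100)"]
GENDERS = ["Male", "Female"]

def decode_prediction(scores, labels):
    """Pick the label whose softmax score is highest."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

# In the real pipeline the scores come from OpenCV's DNN module, roughly:
#   net = cv2.dnn.readNet("age_net.caffemodel", "age_deploy.prototxt")
#   blob = cv2.dnn.blobFromImage(face, 1.0, (227, 227), mean_values)
#   net.setInput(blob)
#   scores = net.forward()[0]
```

Treating age as 8 classes rather than a regression target is exactly why the output is a bucket like (25-32) instead of a single number.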
Machine Learning (Heart Disease Prediction Model)
This is a web-based API model which predicts the probability of having heart disease.
Here I had a dataset of a few patients with information such as CRF, Hypothyroidism, HT, and DM.
I split the data so that I could train the models and then test the predictions by computing accuracy with various algorithms in Python.
The libraries used here are numpy, matplotlib, pandas, sklearn, and pickle.
I preprocessed the data and tried various splitting options.
I observed various plots using matplotlib.
I used numpy and pandas to read the data and observe various statistics.
I have used various algorithms like:
Random forest (model file on GitHub as modelRF.py)
Decision tree (modelDT.py)
Naive Bayes (modelNB.py)
For each algorithm I fitted the training data, saved the model to disk, loaded it back using the pickle library, and finally compared the results.
The accuracy was computed for each algorithm, and all of them showed accuracy greater than 85%.
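The fit/save/load cycle described above can be sketched as follows; the toy feature rows, labels, and file name are made up for illustration, not the actual patient dataset.

```python
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# toy stand-in for the patient features (e.g. binary condition flags) and labels
X = [[0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 1], [0, 0, 0]] * 5
y = [0, 1, 1, 0, 1, 0] * 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# save the fitted model to disk, then load it back with pickle
with open("modelRF.pkl", "wb") as f:
    pickle.dump(model, f)
with open("modelRF.pkl", "rb") as f:
    loaded = pickle.load(f)

print(accuracy_score(y_test, loaded.predict(X_test)))
```

The same pattern repeats for each algorithm file (modelDT.py, modelNB.py, …) with only the classifier class swapped out.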
All this model building was done in model.py files named after the algorithm, e.g. modelNB (Naive Bayes) and modelSVM (support vector machine).
After finding the accuracy of every algorithm, I built the final app using the flask library (with request, jsonify, and render_template) and keras, loading the saved model with pickle.
The final model takes the feature values and returns a prediction, and everything is wired together in app.py.
As the model runs on localhost, I also added various HTML tags and styling with CSS to make it more presentable.
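A minimal sketch of how such an app.py might expose the pickled model over HTTP; the route name, JSON field names, and the stub model are illustrative assumptions, not the project's actual code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

class StubModel:
    """Stands in for the pickled classifier; the real app.py would instead do
    model = pickle.load(open('model.pkl', 'rb'))."""
    def predict_proba(self, rows):
        return [[0.3, 0.7] for _ in rows]

model = StubModel()

@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"features": [crf, hypothyroidism, ht, dm]}
    features = request.get_json()["features"]
    probability = model.predict_proba([features])[0][1]
    return jsonify({"heart_disease_probability": probability})

if __name__ == "__main__":
    app.run(debug=True)  # serves on localhost:5000
```

Returning a probability rather than a hard 0/1 label lets the HTML front end display a risk percentage.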
The code is shared freely on GitHub.
Link added below
Covid Tracker On Twitter Using Data Science And AI
Hi folks, I hope you are doing well in these difficult times! We are all going through the unprecedented coronavirus pandemic. Some people lost their lives, but many of us successfully defeated this new strain, i.e. Covid-19. The virus was declared a pandemic by the World Health Organization on 11th March 2020. This article analyzes various types of “Tweets” gathered during pandemic times. The study can be helpful for different stakeholders.
For example, the government can use this information in policymaking, as it can learn how people are reacting to this new strain and what challenges they are facing, such as food scarcity, panic attacks, etc. For-profit organizations can benefit from analyzing various sentiments; for instance, one of the tweets tells us about the scarcity of masks and toilet paper, so these organizations can start producing essential items and thereby make profits. Various NGOs can decide their strategy for rehabilitating people by using pertinent facts and information.
In this project, we are going to predict the sentiments of COVID-19 tweets. The data was gathered from Twitter, and I used a Python environment to implement this project.
The given challenge is to build a classification model to predict the sentiment of Covid-19 tweets. The tweets have been pulled from Twitter and manual tagging has been done. We are given information like Location, Tweet At, Original Tweet, and Sentiment.
Approach To Analyze Various Sentiments
Before we proceed further, one should know what is meant by sentiment analysis. Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic is positive, negative, or neutral. (Oxford Dictionary)
The following is the standard operating procedure to tackle a sentiment analysis project. We will go through this procedure to predict what we are supposed to predict!
- Exploratory Data Analysis.
- Data Preprocessing.
- Classification Models.
Let’s guess some tweets
I will read out a tweet; can you tell me its sentiment, whether it is positive, negative, or neutral? The first tweet is “Still shocked by the number of #Toronto supermarket employees working without some sort of mask. We all know by now, employees can be asymptomatic while spreading #coronavirus”. What’s your guess? Yeah, you are correct. This is a negative tweet because it contains negative words like “shocked”.
If you couldn’t guess the above tweet, don’t worry, I have another one for you. Let’s guess this tweet: “Due to the Covid-19 situation, we have increased demand for all food products. The wait time may be longer for all online orders, particularly beef share and freezer packs. We thank you for your patience during this time”. This time you are absolutely correct in predicting this tweet as positive. Words like “thank you” and “increased demand” are optimistic in nature, hence they categorize the tweet as positive.
The original dataset has 6 columns and 41,157 rows. In order to analyze the various sentiments, we require just two columns, Original Tweet and Sentiment. There are five types of sentiment: Extremely Negative, Negative, Neutral, Positive, and Extremely Positive, as you can see in the following picture.
Summary Of Dataset
Basic Exploratory Data Analysis
The columns “UserName” and “ScreenName” do not give any meaningful insights for our analysis, hence we are not using these features for model building. All the tweet data was collected in March and April 2020. The following bar plot shows the number of unique values in each column.
There are some null values in the Location column, but we don’t need to deal with them as we are only going to use two columns, “Sentiment” and “Original Tweet”. The most tweets came from London (11.7%), as evident from the following figure.
Some words, like ‘coronavirus’ and ‘grocery store’, have the maximum frequency in our dataset, as we can see from the following word cloud. There are various #hashtags in the tweets column, but they are almost the same across all sentiments, hence they do not give us meaningful information.
Word cloud showing the words with the maximum frequency in our Tweet column
When we explore the ‘Sentiment’ column, we come to know that most people have positive sentiments about various issues, which shows their optimism during pandemic times. Very few people have extremely negative thoughts about Covid-19.
Preprocessing the text data is an essential step, as it makes the raw text ready for mining. The objective of this step is to remove noise that is less relevant for finding the sentiment of tweets, such as punctuation (., ?, ” etc.), special characters (@, %, &, $, etc.), numbers (1, 2, 3, etc.), Twitter handles, links (https:/http:), and terms which don’t carry much weight in the context of the text.
Also, we need to remove stop words from tweets. Stop words are words in natural language that have very little meaning, such as “is”, “an”, “the”, etc. To remove stop words from a sentence, divide the text into words and then remove each word if it exists in the list of stop words provided by NLTK.
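The cleaning steps above can be sketched with plain regular expressions; the stop-word set below is a tiny illustrative subset standing in for NLTK's full English list.

```python
import re

# tiny illustrative subset of NLTK's English stop-word list
STOP_WORDS = {"is", "an", "the", "a", "to", "of", "by", "and", "in"}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip links
    text = re.sub(r"@\w+|#", " ", text)         # strip handles, keep hashtag words
    text = re.sub(r"[^a-z\s]", " ", text)       # strip punctuation, numbers, symbols
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_tweet("Still SHOCKED by the #coronavirus! https://t.co/x @user 123"))
```

Note the order matters: links must be stripped before the punctuation pass, or their leftover letters would survive as junk tokens.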
Then we need to normalize the tweets using stemming or lemmatization. Stemming is a rule-based process of stripping suffixes (“ing”, “ly”, “es”, “ed”, “s”, etc.) from a word. For example, “play”, “player”, “played”, “plays”, and “playing” are different variations of the word “play”.
Stemming does not always produce meaningful words: “considered” gets stemmed to “consid”, which has no meaning and looks like a spelling mistake. The better way is to use lemmatization instead of stemming.
Lemmatization is a more powerful operation that takes into consideration the morphological analysis of the words. It returns the lemma, which is the base form of all its inflectional forms.
Here, in the lemmatization process, we convert the word “raising” to its base form “raise”. We also need to convert all tweets to lower case before the normalization process.
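The stemming behaviour described above can be reproduced with NLTK's PorterStemmer; the lemmatizer call is shown only in a comment because it requires the WordNet corpus to be downloaded first.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["play", "played", "playing", "considered"]:
    print(word, "->", stemmer.stem(word))

# Lemmatization (requires nltk.download("wordnet") first):
#   from nltk.stem import WordNetLemmatizer
#   WordNetLemmatizer().lemmatize("raising", pos="v")  # base form "raise"
```

Running this makes the trade-off concrete: the stemmer is fast and dependency-free but happily emits non-words like “consid”.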
We can also include tokenization. In tokenization, we convert a group of sentences into tokens; this is also called text segmentation or lexical analysis. It basically splits the data into small chunks of words. Tokenization in Python can be done with the NLTK library’s word_tokenize() function.
We can use a count vectorizer or a TF-IDF vectorizer. Count Vectorizer will create a sparse matrix of all words and the number of times they are present in a document.
TFIDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The TF–IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. (wiki)
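The two vectorizers can be compared side by side on a toy corpus; the three example “tweets” below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# three toy "tweets"; the real corpus is the Original Tweet column
docs = ["masks sold out everywhere",
        "panic buying of masks",
        "grocery store panic"]

count_matrix = CountVectorizer().fit_transform(docs)   # raw term counts
tfidf_matrix = TfidfVectorizer().fit_transform(docs)   # counts reweighted by IDF

# one row per document, one column per vocabulary word
print(count_matrix.shape, tfidf_matrix.shape)
```

Both produce a sparse document-term matrix of the same shape; TF-IDF simply downweights words like “masks” and “panic” that occur in several documents.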
Building Classification Models
The given problem is ordinal multiclass classification. There are five types of sentiment, so we have to train our models so that they can give the correct label for the test dataset. I am going to build different models: Naive Bayes, Logistic Regression, Random Forest, XGBoost, Support Vector Machines, CatBoost, and Stochastic Gradient Descent.
I first treated the problem as multiclass classification, where the dependent variable has the values Positive, Extremely Positive, Neutral, Negative, and Extremely Negative. I also converted it into binary classification, i.e. I clubbed all tweets into just two types, Positive and Negative. You can also go for three-class classification (Positive, Negative, and Neutral) in order to achieve greater accuracy. In the evaluation phase, we will compare the results of these algorithms.
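A minimal sketch of the binary variant: vectorize, then fit one of the listed classifiers. The four hand-made tweets and their labels are illustrative stand-ins for the real labelled dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny hand-made sample standing in for the labelled tweets
train_tweets = ["thank you for the quick delivery",
                "great support during lockdown",
                "shocked by the panic buying",
                "terrible scarcity of masks"]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

# chain vectorizer and classifier so raw text goes in, labels come out
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_tweets, train_labels)

print(model.predict(["thank you for the great support"]))
```

Swapping LogisticRegression for any of the other listed classifiers (Naive Bayes, SVM, SGD, …) changes only the last pipeline stage, which makes the evaluation comparison straightforward.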
Feature importance (variable importance) describes which features are relevant. It can help with a better understanding of the solved problem and can sometimes lead to model improvements through feature selection. The top three important feature words are panic, crisis, and scam, as we can see from the following graph.
In this way, we can explore more from various textual data and tweets. Our models will try to predict the various sentiments correctly. I trained various models on our dataset; some show greater accuracy while others do not. For multiclass classification, the best model for this dataset is CatBoost. For binary classification, the best model is Stochastic Gradient Descent.