Battle Of Neighborhoods
Last Updated on May 3, 2021
Key Skills: Web Scraping, Foursquare API, k-means, Pandas, folium, Data Cleaning, Clustering
GOAL: To find the most profitable location to open an authentic Indian restaurant in Manhattan, NYC.
- Retrieved information about venues in each neighborhood using the Foursquare API
- Clustered similar neighborhoods in Manhattan based on the types of venues present in each neighborhood
- Analyzed the clusters and the neighborhoods to decide on the most profitable location for the restaurant
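The clustering step described above can be sketched as follows. This is a minimal sketch with made-up venue data; in the real project the venues per neighborhood come from the Foursquare API, and the neighborhood names and categories below are purely illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue data; the real project retrieves this via the Foursquare API.
venues = pd.DataFrame({
    "Neighborhood": ["Chelsea", "Chelsea", "Harlem", "Harlem", "Soho", "Soho"],
    "Venue Category": ["Coffee Shop", "Art Gallery", "Jazz Club", "Coffee Shop",
                       "Boutique", "Art Gallery"],
})

# One-hot encode venue categories and take the mean frequency per neighborhood,
# giving each neighborhood a "venue profile" vector.
onehot = pd.get_dummies(venues["Venue Category"])
onehot["Neighborhood"] = venues["Neighborhood"]
grouped = onehot.groupby("Neighborhood").mean()

# Cluster neighborhoods with similar venue profiles together.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(grouped)
grouped["Cluster"] = kmeans.labels_
print(grouped["Cluster"])
```

The cluster labels can then be joined back to the neighborhood coordinates and drawn on a folium map, which is how the most promising locations were compared.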
Link to my blog post explaining the project in detail:
Bank_Loan_Default_Case
The objective of this problem is to predict whether a person is 'Defaulted' or 'Not Defaulted' on the basis of the eight given predictor variables.
The data consists of 8 independent variables and 1 dependent variable. The variables are:
1. Age: a continuous variable depicting the age of the person.
2. Ed: a categorical variable giving the education category of the person, converted to numerical form.
3. Employ: a categorical variable containing information about the geographic location of the person, also converted to numeric values.
4. Income: a continuous variable containing the gross income of each person.
5. DebtInc: a continuous variable giving an individual's debt relative to his or her gross income.
6. Creddebt: a continuous variable giving the debt-to-credit ratio, a measurement of how much a person owes their creditors as a percentage of their available credit.
7. Othdebt: a continuous variable describing any other debt a person owes.
8. Default: a categorical variable telling whether a person is a Default (1) or Not-Default (0).
After extensive exploratory data analysis, the data was fed to multiple models: Logistic Regression, Decision Tree classifier, Random Forest classifier, KNN, and Gradient Boosting classifier, each with and without hyperparameter tuning. The final results were compared on metrics such as precision score, recall score, and AUC-ROC score.
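The model-comparison loop can be sketched as follows. This is a minimal sketch on synthetic data standing in for the bank-loan data set (eight predictors, binary target); hyperparameter tuning is omitted for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Synthetic stand-in for the bank-loan data: 8 predictors, binary 'Default' target.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "auc_roc": roc_auc_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, scores)
```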
Identify The Best Model For Class Imbalance Data In Multiclass Problem
In "Robust Model for Imbalanced Classes of Data", we research the effectively infinite space of possible class imbalances and investigate which models perform best across all possible imbalanced configurations of a data set. Usually, imbalanced data is up-sampled or down-sampled to make it balanced before applying machine learning models; in both cases, we lose information about the data set. There is no precise definition of imbalanced data. In general, data that is not balanced is called imbalanced: at least one class has significantly fewer training examples, or the examples belonging to one class heavily outnumber the examples in the other classes. Most machine learning algorithms, such as SVM, Logistic Regression, and Naïve Bayes, assume the training data to be balanced. Over the last few decades, some effective methods have been proposed to attack this problem, such as up-sampling, down-sampling, and SMOTE.
The experimental setup is:
1. Generate data points in a square pattern, with boundaries classifying the points into multiple classes named class_1, class_2, and class_3.
2. Add jitter to every data point so that points near the boundaries can cross into neighbouring classes and be misclassified.
3. Make the balanced data set imbalanced by reducing one class to a proportion of samples such as 1%, 2%, 3%, ..., 10%, keeping the other classes the same.
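The data-generation steps above can be sketched as follows. This is a minimal sketch; the grid size, jitter scale, and 1% minority proportion are illustrative choices, not the exact values used in the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Points on a square grid, labelled into three classes by x-coordinate bands.
xx, yy = np.meshgrid(np.linspace(0, 1, 60), np.linspace(0, 1, 60))
X = np.column_stack([xx.ravel(), yy.ravel()])
labels = np.digitize(X[:, 0], bins=[1 / 3, 2 / 3])  # class_1/2/3 -> 0/1/2

# Add jitter so points near the boundaries can cross into neighbouring classes.
X = X + rng.normal(scale=0.03, size=X.shape)

# Make the data imbalanced: keep only ~1% of class 0, all of classes 1 and 2.
mask = (labels != 0) | (rng.random(len(labels)) < 0.01)
X_imb, y_imb = X[mask], labels[mask]

counts = np.bincount(y_imb)
print("class counts:", counts)  # class 0 is now a small minority
```

Repeating the last step with proportions of 2%, 3%, ..., 10% produces the family of imbalanced data sets on which the candidate models are compared.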
Dog And Cat Image Classification
The project classifies an image as either a dog or a cat. The model was built using a Convolutional Neural Network (CNN), a deep learning architecture for analysing images that is widely used for image recognition and classification. The project was developed in Python, an interpreted, high-level, general-purpose programming language, and implemented in a Jupyter Notebook.
Libraries and functions used:
Various Python libraries and functions were used while developing the ML model:
1. tensorflow: focuses on building and training neural networks
2. load_model: a Keras function that loads a saved model and reconstructs it identically
3. tkinter: Python's GUI toolkit
4. PIL: the Python Imaging Library, which supports operations on images
5. filedialog: a tkinter module used for selecting a file/directory
6. playsound: used for playing audio
7. ImageDataGenerator: a Keras class used for real-time data augmentation
8. flow_from_directory: an ImageDataGenerator method that reads and augments images from a directory
9. keras.preprocessing: the data preprocessing module of Keras, which provides utilities for working with image data
10. load_img: loads an image in PIL format
11. img_to_array: converts an image into a NumPy array
12. expand_dims: adds an extra dimension (axis=0) for a batch containing a single image
Two activation functions were used in this neural network.
The methods followed were:
1. Pre-processing of data
1.1 Training data
1.2 Testing data
2. Building the CNN
2.1 Adding the first convolution layer
2.2 Adding the second convolution layer
2.3 Full connection
2.4 Output layer
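The shape arithmetic behind the convolution steps can be sketched as follows. The write-up does not specify the layer sizes, so this assumes hypothetical 64×64 input images, 3×3 kernels with stride 1 and no padding, and 2×2 max-pooling after each convolution (a common choice).

```python
def conv_output_size(size, kernel=3, stride=1, padding=0):
    """Side length of a square feature map after a convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, pool=2):
    """Side length after max-pooling with a pool x pool window."""
    return size // pool

size = 64                      # hypothetical input image side length
size = conv_output_size(size)  # after the first convolution layer: 62
size = pool_output_size(size)  # after max-pooling: 31
size = conv_output_size(size)  # after the second convolution layer: 29
size = pool_output_size(size)  # after max-pooling: 14
print(size)  # side length flattened into the full-connection step
```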
The accuracy at the last (50th) epoch was 97%.
This function loads the ML model, takes the image input given by the user, and pre-processes it. The pre-processed image then goes as input to the ML model, which gives the prediction. For the output, the code plays a sound corresponding to the prediction.
The final page asks the user to select an image from the local computer. The tab’s name is ‘Image Classifier’.
Once the user selects the image, the model predicts whether the image is of a dog or a cat, and a sound announcing the prediction is played.
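The pre-processing described above (img_to_array followed by expand_dims) can be illustrated with plain NumPy. The 64×64 image size and the 1/255 rescaling are assumptions for illustration; the actual sizes depend on how the model was trained.

```python
import numpy as np

# A hypothetical 64x64 RGB image, as load_img + img_to_array would produce.
img_array = np.zeros((64, 64, 3), dtype=np.float32)

# Rescale pixel values to [0, 1], matching a typical training-time
# ImageDataGenerator(rescale=1/255) setup.
img_array = img_array / 255.0

# Add a leading batch dimension: the model expects a batch, even of one image.
batch = np.expand_dims(img_array, axis=0)
print(batch.shape)  # (1, 64, 64, 3)
```

The resulting `(1, height, width, channels)` array is what gets passed to the model's predict call.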
Student Staff Management System
This was a minor project done in my B.Tech third year and submitted to my department that same year. The project was built entirely with VB.NET as its front end and MySQL as its database for data storage and management.
This small project focused on performing basic operations swiftly on the data of the staff and students of the university, such as CRUD (create, retrieve, update, delete) operations, so that the data could be managed easily, retrieved quickly, and handled with minimal hassle. The languages used in the project were:
1) VB.NET: for the front end
2) MySQL: for the database
The database for the project was fully normalized up to third normal form (3NF) so that the data could be stored in an optimized, effective manner. Relational schemas range across several normal forms, from 1NF (the least normalized) up to 4NF. I chose 3NF because it gave me good optimization without losing anything: a 3NF decomposition is lossless and preserves the data's dependencies. Normalizing further, to 4NF, can reduce redundancy even more, but the decomposition may not preserve all dependencies. Hence the optimal choice here was 3NF, as I didn't want to lose anything in the process.
After designing the database, I moved on to designing the front end with .NET. I kept the user interface as simple as possible, so that anyone could use it regardless of their knowledge of computer systems. The interface focuses only on the work at hand and carries no unnecessary decoration such as elaborate styling or colouring.
After completing both of these steps, I linked the database to the program so that the front end could access the database running in the background and store and retrieve data efficiently. After linking the two, the project was almost complete and ready to be deployed.
In short, in this project I successfully created a centralized management system for the students and staff of the university, which helped manage and store data more efficiently than the previous model.
P.S.: I don't currently have the project links for two of my projects. Sorry for that.
Project - Mercedes-Benz Greener Manufacturing
Reduce the time a Mercedes-Benz spends on the test bench.
Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with the crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.
To ensure the safety and reliability of every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.
You are required to reduce the time that cars spend on the test bench, working with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Daimler's standards.
I performed data exploration, checked for missing values and outliers, and treated the outliers. I applied label encoding to the categorical variables and scaled the data. I applied PCA to reduce the dimensionality of the data, although it had no effect on the result. For prediction, I used Random Forest, KNN, and XGBoost models; of these, XGBoost gave the best result.
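The preprocessing-and-modelling pipeline above can be sketched as follows. This is a minimal sketch on synthetic data standing in for the Mercedes-Benz data set (categorical columns plus binary option flags, and a continuous test-time target); scikit-learn's RandomForestRegressor is used as the model here since XGBoost is a separate library, so the best-performing XGBoost step is not reproduced.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic stand-in: two categorical feature columns plus twenty binary
# option flags, and a continuous test-bench-time target.
df = pd.DataFrame({
    "X0": rng.choice(list("abcd"), size=300),
    "X1": rng.choice(list("wxyz"), size=300),
})
flags = pd.DataFrame(rng.integers(0, 2, size=(300, 20)),
                     columns=[f"X{i}" for i in range(10, 30)])
y = 80 + flags.sum(axis=1) * 1.5 + rng.normal(scale=2, size=300)

# Label-encode the categorical columns, as in the project write-up.
for col in ["X0", "X1"]:
    df[col] = LabelEncoder().fit_transform(df[col])
X = pd.concat([df, flags], axis=1)

# Scale the data, then reduce its dimensionality with PCA before modelling.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=10, random_state=7).fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, random_state=7)
model = RandomForestRegressor(random_state=7).fit(X_train, y_train)
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))
```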