Cluster AI
Last Updated on May 3, 2021
Explore a galaxy of research papers in 3D space using a state-of-the-art machine learning model.
Search engines like Google Scholar make it easy to find research papers on a specific topic. However, it can be hard to branch out from a general starting point and narrow your research down to a specific topic. Wouldn’t it be great to have a tool that not only recommends research papers, but does it in a way that makes it easy to explore other topics and solutions related to yours?
What it does
Users input either a text query or a research paper into Cluster AI. Cluster AI uses BERT (Bidirectional Encoder Representations from Transformers), a Natural Language Processing model, to connect users to similar papers. It uses the CORE Research API to fetch research articles that may be relevant, then visualizes the similarity of these papers in a 3D space. Each node represents a research paper, and the distances between the nodes show the similarity between those papers. Using this, users can visualize clusters of closely connected research papers in order to quickly find resources that pertain to their topic.
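As a rough illustration of how distances between nodes can encode similarity, the sketch below computes pairwise cosine distances between embedding vectors. The three-number vectors are made-up stand-ins for BERT embeddings, the function names are hypothetical, and cosine distance is an assumption: the write-up does not say which distance Cluster AI feeds into MDS.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(embeddings):
    """Pairwise cosine distances; a matrix like this is a typical input to MDS."""
    return [[cosine_distance(u, v) for v in embeddings] for u in embeddings]

# Made-up stand-ins for paper embeddings (real BERT vectors have hundreds of dimensions).
papers = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.8, 0.2, 0.1],
    "paper_c": [0.0, 0.1, 0.9],
}
matrix = distance_matrix(list(papers.values()))
```

Here paper_a ends up much closer to paper_b than to paper_c, so a 3D layout that preserves these distances would place the first two nodes near each other.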
Test Cluster AI here
Note: The demo runs on a CPU-based server, so deploying your own Django server using the instructions in the Source Code is highly recommended. The demo may have delays depending on the query and the number of users at any given point. A query can request 10-100 papers, but requesting up to 20 papers is optimal.
Check out the Source Code!
How we built it
We used a multitude of technologies, languages, and frameworks to build Cluster AI.
- BERT (Bidirectional Encoder Representations from Transformers) and MDS (Multidimensional Scaling) with PyTorch for the Machine Learning
- Python and Django for the backend
Challenges we ran into
The CORE Research API did not always provide all the necessary information that was requested. It sometimes returned papers not in English or without abstracts. We were able to solve this problem by validating the results ourselves. Getting the HTML/CSS to do exactly what we wanted gave us trouble.
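The kind of validation described above could look roughly like the following sketch. The record layout is an assumption: the keys `title`, `abstract`, and `language` are hypothetical and may not match the fields the CORE Research API actually returns.

```python
def is_usable(paper):
    """Keep a paper only if it has a non-empty abstract and is marked as English."""
    abstract = (paper.get("abstract") or "").strip()
    language = (paper.get("language") or "").lower()
    return bool(abstract) and language in {"en", "english"}

# Made-up records; the keys "title", "abstract", and "language" are assumptions.
results = [
    {"title": "Paper A", "abstract": "We study graph layouts.", "language": "en"},
    {"title": "Paper B", "abstract": None, "language": "en"},
    {"title": "Paper C", "abstract": "Wir untersuchen das Problem.", "language": "de"},
    {"title": "Paper D", "abstract": "   ", "language": "English"},
]
usable = [p for p in results if is_usable(p)]
```

Only "Paper A" survives the filter: the others are missing an abstract, have a blank one, or are not marked as English.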
Accomplishments that we're proud of
We worked with a state-of-the-art natural language processing model which successfully condensed each paper into a 3D point.
The visualization of the graph turned out great and let us see the results of the machine learning techniques we used and the similarities between large amounts of research papers.
What we learned
What's next for Cluster AI
We can add filtering to the nodes so that only nodes of a given specification are shown. We can expand Cluster AI to visualize other corpora of text, such as books, movie scripts, or news articles. Some papers are in different languages; we would like to use an API to convert the different languages into a person’s native language, so anyone will be able to read the papers.
False Alarm Detection System
Last Updated on May 3, 2021
This project was made for a chemical company that had sensors installed in various parts of its factory to detect H2S gas, which is hazardous to health. Every time one or more sensors detected an H2S leak, an emergency alarm rang to alert the workers. For every alarm, the company called in a team to sanitize the place and check for the leak, which was a big cost to the company.
Some of the alarms that ring do not indicate a hazardous situation at all. The company gave us the data for each alarm, with a final column stating whether the alarm was dangerous or not.
Unwanted substance deposition (0/1)
The data was first pre-processed, using analysis libraries like NumPy and pandas to make it ready for a machine learning algorithm.
Feature scaling, categorical data, and missing values were handled with appropriate techniques.
Then, we used a Logistic Regression model to build a classifier, with the first five columns as the independent variables and the dangerous column as the dependent (target) variable.
Now, whenever there is a leak and the alarm rings, the data is sent to us and we predict whether it is dangerous or not. The team is called to sanitize the place and fix the leak only if the alarm is found to be dangerous. This saved a lot of money for the company.
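To make the scale-then-classify idea concrete, here is a minimal from-scratch sketch of standard scaling followed by logistic regression trained with gradient descent. It is not the project's code: the two feature columns (a sensor reading and a 0/1 flag), the toy rows, and all function names are made up for illustration, and a library implementation would normally be used instead.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit variance; return rows plus the stats."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b, threshold=0.5):
    """Return 1 (dangerous) if the predicted probability reaches the threshold."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= threshold else 0

# Made-up alarm records: [sensor reading, 0/1 flag]; label 1 means "dangerous".
raw = [[2.0, 0], [3.0, 0], [2.5, 1], [8.0, 0], [9.0, 1], [7.5, 0]]
labels = [0, 0, 0, 1, 1, 1]
scaled, means, stds = standardize(raw)
weights, bias = train_logistic(scaled, labels)

def classify_alarm(record):
    """Scale a new record with the training statistics, then classify it."""
    x = [(v - m) / s for v, m, s in zip(record, means, stds)]
    return predict(x, weights, bias)
```

A new alarm is scaled with the means and standard deviations learned from the training data, never with its own statistics, so that training and prediction see features on the same scale.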
Real Time Object Detection Using Tensorflow
Last Updated on May 3, 2021
Object detection is a computer vision technique in which a software system can detect, locate, and trace objects in a given image or video. What distinguishes object detection is that it identifies both the class of each object (person, table, chair, etc.) and its location-specific coordinates in the given image. The location is pointed out by drawing a bounding box around the object. The bounding box may or may not accurately locate the position of the object. The ability to locate the object inside an image defines the performance of the algorithm used for detection. Face detection is one example of object detection.
Object detection algorithms can be pre-trained or trained from scratch. In most use cases, we start from the weights of a pre-trained model and then fine-tune them to our requirements.
Generally, the object detection task is carried out in three steps:
- Small segments (candidate regions) are generated in the input, producing a large set of bounding boxes that span the full image.
- Feature extraction is carried out for each segmented rectangular area to predict whether the rectangle contains a valid object.
- Overlapping boxes are reduced to a single bounding rectangle by keeping the highest-scoring box and discarding the ones that overlap it (Non-Maximum Suppression).
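The last step can be sketched in a few lines. This is a minimal greedy Non-Maximum Suppression, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates with one confidence score each; the function names and the 0.5 overlap threshold are illustrative choices, not taken from the project. Note that greedy NMS keeps the best-scoring box of each overlapping group rather than averaging the boxes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two heavily overlapping boxes and one separate box (made-up coordinates).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

The function returns the indices of the boxes to keep. In this example the second box overlaps the first with an IoU of about 0.82, so it is suppressed and `kept` is `[0, 2]`.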
TensorFlow is an open-source library, developed by the Google Brain team, for numerical computation and large-scale machine learning. It eases the process of acquiring data, training models, serving predictions, and refining future results.
- TensorFlow bundles together Machine Learning and Deep Learning models and algorithms.
- It uses Python as a convenient front-end and runs the computations efficiently in optimized C++.
- TensorFlow allows developers to create a graph of computations to perform.
- Each node in the graph represents a mathematical operation and each connection represents data. Hence, instead of dealing with low-level details like figuring out proper ways to hitch the output of one function to the input of another, the developer can focus on the overall logic of the application.
The TensorFlow Object Detection API is an open-source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models.
- The framework already provides pre-trained models, collected in what is referred to as the Model Zoo.
- It includes a collection of pre-trained models trained on various datasets such as the COCO (Common Objects in Context) dataset, the KITTI dataset, and the Open Images Dataset.
Various models are available, so what is different between them? They have different architectures and thus provide different accuracies, and there is a trade-off between speed of execution and accuracy in placing bounding boxes.
Google Brain, the deep learning artificial intelligence research team at Google, developed TensorFlow for Google’s internal use and released it as open source in 2015. The research team uses this software library to perform several important tasks.
TensorFlow is at present one of the most popular deep learning libraries, and several real-world applications of deep learning account for that popularity. Being an open-source library for deep learning and machine learning, TensorFlow finds a role to play in text-based applications, image recognition, voice search, and many more. Many Google products make use of TensorFlow to improve the user experience.
mAP (mean average precision) combines precision and recall on detecting bounding boxes: for each object class, the average precision is the area under the precision-recall curve, and mAP is the mean of these values over all classes. It’s a good combined measure of how sensitive the network is to objects of interest and how well it avoids false alarms. The higher the mAP score, the more accurate the network is, but that generally comes at the cost of execution speed, which we want to avoid here.
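As a rough sketch of how such a score is computed, the code below calculates a simple, non-interpolated average precision per class and averages it over classes. It assumes each detection has already been marked as a true or false positive (in practice by comparing its box to the ground truth with an IoU threshold); benchmark suites such as COCO use interpolated and multi-threshold variants, so the numbers here are illustrative only. All names and data are made up.

```python
def average_precision(detections, num_ground_truth):
    """Non-interpolated AP for one class.

    detections: list of (score, is_true_positive) pairs.
    num_ground_truth: number of ground-truth objects of this class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_pos += 1
            precision = true_pos / rank
            # Each true positive raises recall by 1 / num_ground_truth.
            ap += precision / num_ground_truth
    return ap

def mean_average_precision(per_class):
    """Mean of the per-class AP values; per_class maps name -> (detections, num_gt)."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps)

# Made-up detections for two classes.
per_class = {
    "person": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "chair": ([(0.6, True)], 1),
}
score = mean_average_precision(per_class)
```

For the "person" class the two true positives arrive at ranks 1 and 3, giving an AP of (1/1 + 2/3) / 2, while "chair" scores 1.0; the mean of the two is 11/12.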
As my PC is a low-end machine with not much processing power, I am using the model ssd_mobilenet_v1_coco, which is trained on the COCO dataset. This model has a decent mAP score and a short execution time. COCO is a dataset of 300k images of the 90 most commonly found objects, so the model can recognise 90 objects.
This brings us to the end of this project, where we learned how to use the TensorFlow Object Detection API to detect objects in images.
Bank_Loan_Default_Case
Last Updated on May 3, 2021
The objective of this problem is to predict whether a person is ‘Defaulted’ or ‘Not Defaulted’ on the basis of the given 8 predictor variables.
The data consists of 8 independent variables and 1 dependent variable. The variables are:
I. Age: It is a continuous variable. This feature depicts the age of the person.
II. Ed: It is a categorical variable. This feature has the education category of the person converted to numerical form.
III. Employ: It is a categorical variable. This feature contains information about the geographic location of the person. This column has also been converted to numeric values.
IV. Income: It is a continuous variable. This feature contains the gross income of each person.
V. DebtInc: It is a continuous variable. This feature tells us the ratio of an individual’s debt to his or her gross income.
VI. Creddebt: It is a continuous variable. This feature tells us about the debt-to-credit ratio. It is a measurement of how much a person owes their creditors as a percentage of their available credit.
VII. Othdebt: It is a continuous variable. It tells about any other debt a person owes.
VIII. Default: It is a categorical variable. It tells whether a person is a Default (1) or Not-Default (0).
After extensive exploratory data analysis, the data is given to multiple models: Logistic Regression, Decision Tree classifier, Random Forest classifier, KNN, and Gradient Boosting classifier, each with and without hyperparameter tuning. The final results are obtained and compared on metrics like precision score, recall score, and AUC-ROC score.
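For reference, the comparison metrics named above can be computed as in this minimal sketch. The labels and scores are made up, and the function names are illustrative; libraries such as scikit-learn provide equivalent functions.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = defaulted) and predicted default probabilities.
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, y_score)
```

Precision and recall depend on the 0.5 decision threshold, while AUC-ROC is computed from the raw scores and is threshold-independent, which is why the two kinds of metric are usually reported together.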
Credit Card Detection
Last Updated on May 3, 2021
Models trained to label anonymized credit card transactions as fraudulent or genuine, using a dataset from Kaggle. In this project I build machine learning models to identify fraud in credit card transactions. I also make several data visualizations to reveal patterns and structure in the data.
The dataset, hosted on Kaggle, includes credit card transactions made by cardholders. The data contains 7983 transactions, of which 17 (0.21%) are fraudulent. Each transaction has 30 features, all of which are numerical. The features V1, V2, ..., V28 are the result of a PCA transformation. To protect confidentiality, background information on these features is not available. The Time feature contains the time elapsed since the first transaction, and the Amount feature contains the transaction amount. The response variable, Class, is 1 in the case of fraud, and 0 otherwise.
Project Introduction
The approach for the project is:
- Randomly split the dataset into train, validation, and test sets.
- Do feature engineering.
- Train on the train set, then predict and evaluate with the validation set.
- Try other models.
- Compare the predictions and choose the best model.
- Predict on the test set to report the final result.
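The first step could be sketched as below: a seeded random split into train, validation, and test sets. The 60/20/20 proportions, the seed, and the function name are assumptions for illustration; the write-up does not state the split sizes that were used.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Stand-in data: row indices 0..99 instead of real transactions.
rows = list(range(100))
train, val, test = train_val_test_split(rows)
```

With so few fraudulent transactions in the data, a purely random split can leave a subset with very few or no fraud cases; a stratified split that preserves the class ratio in each subset is a common alternative.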
I was able to accurately identify fraudulent transactions using a LogisticRegression model. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Class' is the target variable, with value 1 in case of fraud and 0 otherwise. To improve a particular model, I optimized hyperparameters via a grid search with 3-fold cross-validation.
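A grid search with 3-fold cross-validation can be sketched generically as follows. This is not the project's code: `score_fn` is a hypothetical callback that would normally fit a model on the training indices and score it on the validation indices, and the toy threshold "model" below, which needs no fitting, only stands in for a real classifier.

```python
from itertools import product

def k_fold_indices(n, k=3):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def grid_search(param_grid, n_samples, score_fn, k=3):
    """Return (best_params, best_mean_score) over every combination in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy data and a toy "model": predict 1 when the feature reaches the threshold.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
ys = [0, 1, 0, 1, 0, 1]

def accuracy_score_fn(params, train_idx, val_idx):
    """Validation accuracy; a real score_fn would first fit a model on train_idx."""
    hits = sum(1 for i in val_idx if (xs[i] >= params["threshold"]) == bool(ys[i]))
    return hits / len(val_idx)

best_params, best_score = grid_search({"threshold": [0.0, 0.5, 1.0]}, len(xs), accuracy_score_fn)
```

Each hyperparameter combination is scored on three validation folds and the mean decides the winner, so no single lucky split determines the choice. The data should be shuffled before contiguous folds are used.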