Image Captioning Bot Using RNN and CNN

Last Updated on May 3, 2021


What does an Image Captioning Problem entail?

Suppose you see this picture –

What is the first thing that comes to your mind? (PS: Let me know in the comments below!)

Here are a few sentences that people could come up with :

A man and a girl sit on the ground and eat .
A man and a little girl are sitting on a sidewalk near a blue bag eating .
A man wearing a black shirt and a little girl wearing an orange dress share a treat .

A quick glance is sufficient for you to understand and describe what is happening in the picture. Automatically generating this textual description from an artificial system is the task of image captioning.

The task sounds straightforward: the generated output is expected to describe, in a single sentence, what is shown in the image – the objects present, their properties, the actions being performed, the interactions between the objects, and so on. But replicating this behaviour in an artificial system is a huge task, as with any other image processing problem, hence the use of complex and advanced techniques such as Deep Learning.

Before I go on, I want to give special thanks to Andrej Karpathy et al., who helped me understand the topic with their insightful course, CS231n.


Methodology to Solve the Task

The task of image captioning can be divided into two modules logically – one is an image based model – which extracts the features and nuances out of our image, and the other is a language based model – which translates the features and objects given by our image based model to a natural sentence.

For our image based model (viz encoder) – we usually rely on a Convolutional Neural Network model. And for our language based model (viz decoder) – we rely on a Recurrent Neural Network. The image below summarizes the approach given above.

Usually, a pretrained CNN extracts the features from our input image. The feature vector is linearly transformed to have the same dimension as the input dimension of the RNN/LSTM network. This network is trained as a language model on our feature vector.

For training our LSTM model, we predefine our label and target text. For example, if the caption is “A man and a girl sit on the ground and eat.”, our label and target would be as follows –

Label – [ <start>, A, man, and, a, girl, sit, on, the, ground, and, eat, . ] 

Target – [ A, man, and, a, girl, sit, on, the, ground, and, eat, ., <end> ]

This is done so that our model understands the start and end of our labelled sequence.
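The shifted pair above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `make_input_and_target` is hypothetical and does not come from the original code.

```python
def make_input_and_target(tokens, start="<start>", end="<end>"):
    """Return (label, target): the same caption shifted by one position."""
    label = [start] + list(tokens)   # what the decoder is fed at each step
    target = list(tokens) + [end]    # what the decoder should predict at each step
    return label, target


caption = "A man and a girl sit on the ground and eat .".split()
label, target = make_input_and_target(caption)
print(label[:3])    # ['<start>', 'A', 'man']
print(target[-2:])  # ['.', '<end>']
```

At every position the target holds the word that follows the corresponding label word, which is what lets the LSTM learn next-word prediction.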



Walkthrough of Implementation

Let’s look at a simple implementation of image captioning in PyTorch. We will take an image as input, and predict its description using a Deep Learning model.

The code for this example can be found on GitHub. The original author of this code is Yunjey Choi. Hats off to his excellent examples in PyTorch!

In this walkthrough, a pre-trained ResNet-152 model is used as the encoder, and the decoder is an LSTM network.

To run the code given in this example, you have to install the prerequisites. Make sure you have a working Python environment, preferably with Anaconda installed. Then run the following commands to install the rest of the required libraries.

git clone

cd coco/PythonAPI/
python build
python install

cd ../../

git clone
cd pytorch-tutorial/tutorials/03-advanced/image_captioning/

pip install -r requirements.txt

After you have set up your system, you should download the dataset required to train the model. Here we will be using the MS-COCO dataset. To download the dataset automatically, you can run the following commands:

chmod +x

Now you can go on and start your model building process. First – you have to process the input:

# Search for all the possible words in the dataset and 
# build a vocabulary list

# resize all the images to bring them to shape 224x224
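To give a feel for what the vocabulary step does, here is a minimal sketch in plain Python. It is not the repository's script: the names `build_vocab` and `encode`, the `threshold` parameter, and the `<pad>` and `<unk>` tokens are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(captions, threshold=1):
    """Map every word seen at least `threshold` times to an integer id."""
    counter = Counter(word for caption in captions for word in caption.lower().split())
    # Special tokens come first so that their ids stay fixed.
    words = ["<pad>", "<start>", "<end>", "<unk>"]
    words += sorted(w for w, n in counter.items() if n >= threshold)
    return {word: idx for idx, word in enumerate(words)}


def encode(text, vocab):
    """Turn a sentence into ids, falling back to <unk> for unseen or rare words."""
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]


captions = [
    "A man and a girl sit on the ground and eat .",
    "A man and a little girl are sitting on a sidewalk .",
]
vocab = build_vocab(captions, threshold=2)
print(encode("a man and a dog", vocab))  # [5, 8, 6, 5, 3]
```

Words below the threshold ("dog" here) all map to the `<unk>` id, which keeps the vocabulary, and therefore the decoder's output layer, small.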

Now you can start training your model by running the below command:

python --num_epochs 10 --learning_rate 0.01

Just to peek under the hood and check out how we defined our model, you can refer to the code written in the file.

import torch
import torch.nn as nn
import torchvision.models as models
from torch.nn.utils.rnn import pack_padded_sequence


class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        """Load the pretrained ResNet-152 and replace top fc layer."""
        super(EncoderCNN, self).__init__()
        resnet = models.resnet152(pretrained=True)
        modules = list(resnet.children())[:-1]      # delete the last fc layer.
        self.resnet = nn.Sequential(*modules)
        self.linear = nn.Linear(resnet.fc.in_features, embed_size)
        self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)
        self.init_weights()

    def init_weights(self):
        """Initialize the weights."""
        self.linear.weight.data.normal_(0.0, 0.02)
        self.linear.bias.data.fill_(0)

    def forward(self, images):
        """Extract the image feature vectors."""
        features = self.resnet(images)
        features = features.detach()                # do not backpropagate into the ResNet
        features = features.view(features.size(0), -1)
        features = self.bn(self.linear(features))   # project to embed_size and normalize
        return features


class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers):
        """Set the hyper-parameters and build the layers."""
        super(DecoderRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.init_weights()

    def init_weights(self):
        """Initialize weights."""
        self.embed.weight.data.uniform_(-0.1, 0.1)
        self.linear.weight.data.uniform_(-0.1, 0.1)
        self.linear.bias.data.fill_(0)

    def forward(self, features, captions, lengths):
        """Decode image feature vectors and generates captions."""
        embeddings = self.embed(captions)
        # The image feature acts as the first "word" of every sequence.
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
        packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
        hiddens, _ = self.lstm(packed)
        outputs = self.linear(hiddens[0])           # hiddens[0] is the packed data tensor
        return outputs

    def sample(self, features, states=None):
        """Samples captions for given image features (Greedy search)."""
        sampled_ids = []
        inputs = features.unsqueeze(1)
        for i in range(20):                                    # maximum sampling length
            hiddens, states = self.lstm(inputs, states)        # (batch_size, 1, hidden_size)
            outputs = self.linear(hiddens.squeeze(1))          # (batch_size, vocab_size)
            predicted = outputs.max(1)[1]                      # (batch_size,) most likely word ids
            sampled_ids.append(predicted)
            inputs = self.embed(predicted)
            inputs = inputs.unsqueeze(1)                       # (batch_size, 1, embed_size)
        sampled_ids = torch.stack(sampled_ids, 1)              # (batch_size, 20)
        return sampled_ids.squeeze()

Now we can test our model using:

python --image='png/example.png'

For our example image, our model gives us this output:

<start> a group of giraffes standing in a grassy area . <end>

And that’s how you build a Deep Learning model for image captioning!



The model which we saw above was just the tip of the iceberg. There has been a lot of research done on this topic. At the time of writing, one of the best-known image captioning systems is Microsoft’s CaptionBot. You can look at a demo of the system on their official website.

I will list down a few ideas which you can use to build a better image captioning model.


More Details: Image Captioning Bot using RNN and CNN

Resume Up-Loader

Last Updated on May 3, 2021



When you apply to an organisation by emailing your CV, the organisation often does not know what you are actually looking for, such as your preferred job location or type of job. The Resume Up-loader app makes this easier.

Working model:

This is my first self-driven project using Django (a Python framework), called Resume Up-loader.

You enter every detail about yourself: job location, photo, signature and CV. After you submit, the information is uploaded to the server, and on the next page you can review all your information and also download the resume. I am continuously working on it and upgrading it so that it lists all the companies in your preferred job location that match your current qualifications and skills. This helps candidates know which companies they are suitable for, and it also helps companies know their candidates better.

Under a projects section

To build this single-page website I used the Python web framework Django.

Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design. Built by experienced developers, it takes care of much of the hassle of Web development, so you can focus on writing your app without needing to reinvent the wheel. It’s free and open source.

I also used HTML to define the structure of the front end, and the style tag to make it look good.

More Details: Resume up-loader

Student Staff Management System

Last Updated on May 3, 2021


This was a minor project that I completed in the 3rd year of my B.Tech and submitted to my department the same year. The project was built entirely with VB.NET as its front end and MySQL as its database for data storage and management.

It was a small project focused on performing basic operations quickly on the data of the staff and students of the university, namely CRUD (create, retrieve, update, delete) operations, so that the data could be managed easily and retrieved quickly with little hassle. The technologies used in the project were as follows:

1) VB.NET: for the front end

2) MySQL: for the database
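The four CRUD operations can be sketched as follows. Note that this is not the project's code: the project used VB.NET with MySQL, while this sketch uses Python's built-in `sqlite3` module so that it runs anywhere. The `student` table, its columns and the sample record are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL, course TEXT)")

# Create
conn.execute("INSERT INTO student (name, course) VALUES (?, ?)", ("Asha", "B.Tech"))
# Retrieve
row = conn.execute("SELECT id, name, course FROM student WHERE name = ?", ("Asha",)).fetchone()
# Update
conn.execute("UPDATE student SET course = ? WHERE id = ?", ("M.Tech", row[0]))
updated = conn.execute("SELECT course FROM student WHERE id = ?", (row[0],)).fetchone()
# Delete
conn.execute("DELETE FROM student WHERE id = ?", (row[0],))
remaining = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]

print(row, updated, remaining)  # (1, 'Asha', 'B.Tech') ('M.Tech',) 0
conn.close()
```

The `?` placeholders keep user input out of the SQL string itself, which matters for any front end that passes form fields to the database.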

The database for the project was normalized up to 3NF so that the data could be stored without redundancy. Relational schemas are commonly designed against a series of normal forms, from 1NF (the least strict) through 2NF, 3NF and BCNF up to 4NF and beyond. I chose 3NF because a schema can always be decomposed into 3NF losslessly while preserving every functional dependency. Higher normal forms such as BCNF and 4NF remove more redundancy, but decomposing into them can fail to preserve some dependencies. Since I did not want to lose any information in the process, 3NF was the optimal choice here.

After designing the database I moved on to the front end, which I built with .NET. I kept the user interface as simple as possible so that anyone could use it regardless of their knowledge of computer systems. It focuses only on the work at hand and carries no unnecessary detail such as decorative design or colouring.

After completing both parts I linked the database to the program so that the front end could access the database running in the background and store and retrieve data easily and efficiently. With the two linked, the project was almost complete and ready to be deployed.

In short, in this project I successfully created a centralized management system for the students and staff of the university, which helped manage and store data more efficiently than the previous model.

P.S.: I don't currently have the project links for 2 of my projects. Sorry for that.

Thank You

More Details: Student Staff Management System

Telecom Churn Prediction

Last Updated on May 3, 2021


This case requires trainees to develop a model for predicting customer churn at a fictitious wireless telecom company and to use insights from the model to develop an incentive plan for enticing would-be churners to remain with the company.

The data for the case are available in CSV format. They are a scaled-down version of the full database generously donated by an anonymous wireless telephone company. There are still 7043 customers in the database, and 20 potential predictors. The data come in one file with 7043 rows that combines two sets of customers: a “calibration” database consisting of 4000 customers and a “validation” database consisting of 3043 customers. Each database contains (1) a “churn” variable signifying whether the customer had left the company two months after observation, and (2) a set of 20 potential predictor variables that could be used in a predictive churn model.

Candidates can use whatever method they wish to develop their machine learning model. Following the usual model development procedure, the model should be estimated on the calibration data and tested on the validation data. This case requires both statistical analysis and creativity/judgment. I recommend you spend plenty of time on both fine-tuning and interpreting the results of your machine learning model.
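The calibration/validation split described above could look roughly like this. It is a sketch under assumptions: the function name `split_calibration_validation` is hypothetical, the rows are stand-in integers rather than real customer records, and the case does not say whether the split should be random, so the shuffle is a choice made here.

```python
import random


def split_calibration_validation(rows, n_calibration=4000, seed=0):
    """Shuffle the rows once, then cut them into calibration and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    return shuffled[:n_calibration], shuffled[n_calibration:]


# Stand-in for the 7043 customer records; real rows would come from the CSV file.
customers = list(range(7043))
calibration, validation = split_calibration_validation(customers)
print(len(calibration), len(validation))  # 4000 3043
```

Whatever model is chosen, it should only ever be fitted on `calibration`; `validation` is kept aside to measure how well the churn predictions generalise.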

More Details: Telecom churn Prediction

Scholastic


Last Updated on May 3, 2021


Problem Statement

Develop tools that would increase Productivity for students and teachers. In the past 10-15 years we have seen the transition of things around us from offline to online, whether it's business, entertainment activities, daily needs, and now even education. Productivity tools have been a success with businesses and firms. Develop productivity tools for students and teachers in any domain of your choice that can achieve the same success in the educational field in the future.

Problem Solution

In this post-COVID era, the education sector has opened up a plethora of new opportunities. Scholastic provides a complete and comprehensive education portal for students as well as staff.

  • The USP of the application is lab sessions simulated using Augmented Reality.
  • Other features include the use of virtual assistants like Alexa to provide reminders, a complete timetable and file integration.
  • A blockchain-based digital report card system where teachers can upload report cards for students and send them to parents.
  • A plagiarism checker for assignments.

It is a one-stop solution for all needs such as announcements and circulars from the institution or a staff member, fee payment and even a chatbot for additional support.

Tech Stack

  • Google Assistant for the chatbot (via the Actions Console)
  • Python 3 for the plagiarism checker (Gensim, NumPy, NLP models with word embeddings)
  • Heroku (for deployment and making API calls)
  • Android Studio with Java for the main Android app
  • AR Foundation for simulated lab sessions, with Blender and Unity
  • Ethereum, Solidity and React.js for blockchain-based storage of report cards (along with Ganache and Truffle Suite)
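To illustrate the idea behind a plagiarism checker, here is a minimal sketch that compares two texts by the cosine similarity of their word counts. It deliberately swaps in plain word counts for the Gensim word embeddings listed above, and the function name and sample sentences are made up for the example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts (0.0 to 1.0)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


original = "the mitochondria is the powerhouse of the cell"
reworded = "the powerhouse of the cell is the mitochondria"
unrelated = "photosynthesis happens inside chloroplasts"
print(cosine_similarity(original, reworded) > 0.99)  # True: same words, different order
print(cosine_similarity(original, unrelated))        # 0.0
```

Word counts ignore word order and synonyms, which is one reason a real checker would move to embeddings; the scoring idea, a similarity above some chosen threshold flags the assignment, stays the same.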

More Details: Scholastic

Real Time Object Detection Using TensorFlow

Last Updated on May 3, 2021


Object detection is a computer vision technique in which a software system detects, locates, and traces objects in a given image or video. What makes object detection special is that it identifies both the class of each object (person, table, chair, etc.) and its location-specific coordinates in the image. The location is indicated by drawing a bounding box around the object. The bounding box may or may not locate the object accurately; how well the objects are located inside an image defines the performance of the detection algorithm. Face detection is one example of object detection.

These object detection algorithms might be pre-trained or can be trained from scratch. In most use cases, we use pre-trained weights from pre-trained models and then fine-tune them as per our requirements and different use cases.

Generally, the object detection task is carried out in three steps:

  • Small segments are generated in the input image, producing a large set of bounding boxes that span the full image.

  • Feature extraction is carried out for each segmented rectangular area to predict whether the rectangle contains a valid object.

  • Overlapping boxes for the same object are reduced to a single bounding rectangle (Non-Maximum Suppression).
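The last step, Non-Maximum Suppression, can be sketched in plain Python as below. This is the textbook greedy variant, not the TensorFlow implementation; the function names, the `(x1, y1, x2, y2)` box format and the 0.5 threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The first two boxes overlap heavily, so only the higher-scoring one survives; the third box covers a different region and is kept.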

TensorFlow is an open-source library from Google Brain for numerical computation and large-scale machine learning that eases the process of acquiring data, training models, serving predictions, and refining future results.

  • TensorFlow bundles together Machine Learning and Deep Learning models and algorithms.
  • It uses Python as a convenient front end and runs the computation efficiently in optimized C++.
  • TensorFlow allows developers to create a graph of computations to perform.
  • Each node in the graph represents a mathematical operation and each connection represents data. Hence, instead of dealing with low-level details like figuring out proper ways to hitch the output of one function to the input of another, the developer can focus on the overall logic of the application.

The TensorFlow Object Detection API is an open-source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models.

  • The framework already ships pre-trained models, collectively referred to as the Model Zoo.
  • It includes a collection of pre-trained models trained on various datasets such as the COCO (Common Objects in Context) dataset, the KITTI dataset, and the Open Images Dataset.

Various models are available, so what is the difference between them? They have different architectures and thus provide different accuracies, and there is a trade-off between speed of execution and accuracy in placing bounding boxes.

Google Brain, the deep learning research team at Google, developed TensorFlow for Google’s internal use and released it as open source in 2015. The research team uses the library to perform several important tasks.

TensorFlow is one of the most popular deep learning libraries. Several real-world applications of deep learning make TensorFlow popular. Being an open-source library for deep learning and machine learning, TensorFlow finds a role to play in text-based applications, image recognition, voice search, and many more. Google itself uses TensorFlow across many of its products.

Here mAP (mean average precision) is the mean, over all object classes, of the average precision, which summarises the precision-recall curve for the detected bounding boxes. It’s a good combined measure of how sensitive the network is to objects of interest and how well it avoids false alarms. The higher the mAP score, the more accurate the network is, but that usually comes at the cost of execution speed, which we want to avoid here.
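As a rough illustration of how such a score is computed, the sketch below calculates average precision for one class as the area under its precision-recall curve and then averages over classes. The function names and sample detections are made up, `num_ground_truth` is assumed to be greater than zero, and benchmark implementations add further details (COCO, for instance, also averages over several IoU thresholds).

```python
def average_precision(scored_hits, num_ground_truth):
    """AP for one class from (confidence, is_true_positive) pairs, one per detection."""
    hits = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)
    precisions, recalls = [], []
    true_pos = 0
    for rank, (_, is_tp) in enumerate(hits, start=1):
        true_pos += int(is_tp)
        precisions.append(true_pos / rank)
        recalls.append(true_pos / num_ground_truth)
    # Replace each precision with the best precision reached at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(average_precision(h, n) for h, n in per_class) / len(per_class)


giraffe = ([(0.9, True), (0.8, False), (0.7, True)], 2)  # 3 detections, 2 real objects
person = ([(0.6, True)], 1)                              # 1 detection, 1 real object
print(f"{mean_average_precision([giraffe, person]):.3f}")  # 0.917
```

A detection counts as a true positive when its box overlaps a ground-truth box of the same class above some IoU threshold; that matching step is assumed to have happened before these functions are called.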

As my PC is a low-end machine with not much processing power, I am using the model ssd_mobilenet_v1_coco, which is trained on the COCO dataset. This model has a decent mAP score and a short execution time. COCO is a dataset of roughly 300k images of commonly found objects; the label map used with this model has class IDs running up to 90, so the model can recognise those everyday object classes.

This brings us to the end of this project, where we learned how to use the TensorFlow Object Detection API to detect objects in images.

More Details: Real Time Object Detection using Tensorflow
