Prerequisite: Building a Pipeline in Rasa for Training

In this tutorial, we will learn how to prepare the training data for our chatbot model.

The training data for a chatbot is generally in the form of intents and entities. These are simply examples of how we humans speak or convey something in a language such as English.

Rasa NLU has multiple ways of defining the intents and their entities in our training data.

It supports training data in markdown, in JSON as a single file, or as a directory containing multiple files.
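For instance, in the markdown format examples are grouped under intent headings. The following is only an illustrative sketch (the intent names here are placeholders, not this tutorial's data):

```
## intent:greeting
- Hello
- Hi there

## intent:get_food_order
- I'd like to order a pizza
```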

We will be working on the most difficult but highly scalable method first, which is JSON data. A JSON file is tedious to create by hand but very easy and scalable to produce programmatically.

The JSON format of the data that Rasa NLU expects has a top-level object called rasa_nlu_data, with the keys common_examples, regex_features, and entity_synonyms.

The most important one, and the one we are going to be working with, is common_examples.

The following is the skeleton form of how our JSON data is going to look:

{
    "rasa_nlu_data": {
        "common_examples": [],
        "regex_features" : [],
        "entity_synonyms": []
    }
}

The common_examples key in our JSON data is the central place that will be used to train our model. We will be adding all our training examples in the common_examples array.

regex_features are regular-expression patterns that help the intent classifier recognize entities and intents, improving the accuracy of intent classification.
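As an illustration, each regex feature is an object with a name and a pattern. A hypothetical entry for matching five-digit ZIP codes might look like the following (the name and pattern are example values, not part of this tutorial's data):

```json
"regex_features": [
    {
        "name": "zipcode",
        "pattern": "[0-9]{5}"
    }
]
```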

Let’s start writing our JSON file. Let’s call it chatbot_data.json.

  1. Create a folder called restaurant_bot.
  2. Change the current working directory to restaurant_bot.
  3. Start Jupyter Notebook.
  4. Create a new folder called data.
  5. Click on the data folder and go to “Text File” under the New menu in Jupyter Notebook.
  6. Click on the name of the file created, change the name to chatbot_data.json, and write the intents for your chatbot.

For Steps 5 and 6, feel free to use your favorite editor, such as Sublime, Notepad++, or PyCharm, to work with the JSON file.
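Since every training example has the same shape, the file can also be generated programmatically instead of typed by hand. Here is a minimal sketch; the utterance/intent pairs below are just the examples used in this tutorial:

```python
import json

# Each training example pairs an utterance with its intent label.
examples = [
    ("Hello", "greeting"),
    ("I'd like to order a pizza", "get_food_order"),
]

# Build the structure Rasa NLU expects, with rasa_nlu_data at the top level.
training_data = {
    "rasa_nlu_data": {
        "common_examples": [
            {"text": text, "intent": intent, "entities": []}
            for text, intent in examples
        ],
        "regex_features": [],
        "entity_synonyms": [],
    }
}

# Write the file that Rasa NLU will train on.
with open("chatbot_data.json", "w") as f:
    json.dump(training_data, f, indent=4)
```

This scales to hundreds of examples: keep the utterance/intent pairs in a spreadsheet or database, load them into the examples list, and regenerate the JSON file whenever the data changes.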

The following is what chatbot_data.json under the data folder looks like:

{
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "Hello",
                "intent": "greeting",
                "entities": []
            },
            {
                "text": "What would you like to have?",
                "intent": "get_food_order",
                "entities": []
            },
            {
                "text": "I'd like to order a pizza",
                "intent": "get_food_order",
                "entities": []
            },
            {
                "text": "Do you have something in Italian?",
                "intent": "get_order_query",
                "entities": []
            }
        ],
        "regex_features": [],
        "entity_synonyms": []
    }
}
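Hand-edited JSON is easy to break with a missing comma or brace. A quick sanity check is to parse the file back with Python's json module, which raises an error on malformed input. The helper below is a sketch; validate_training_data is a hypothetical name, not part of Rasa:

```python
import json

def validate_training_data(path):
    """Parse a Rasa NLU JSON file and sanity-check its common examples."""
    with open(path) as f:
        data = json.load(f)  # raises json.JSONDecodeError if the JSON is malformed

    examples = data["rasa_nlu_data"]["common_examples"]
    for ex in examples:
        # Every example needs at least a text and an intent label.
        assert "text" in ex and "intent" in ex, f"incomplete example: {ex}"
    return len(examples)
```

Calling validate_training_data("data/chatbot_data.json") on the file above should return 4 without raising an error.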

As you can see, preparing this by hand is clumsy.

There are nicer and easier methods for preparing our data, similar to what we have in Dialogflow.

There are many interesting tools for creating training data in the format that Rasa expects. One of them was created by Polgár András, and it is also quite good for inspecting and modifying data that we prepared earlier.

This tool saves a lot of time if we are working on small projects where we have to create the data by hand.

It’s always a good idea to visualize the data in any application you are building that is completely data-driven.

We will cover this tool in the next tutorial.