"OK Google, let me talk to my API."
"Sure, here is the test version of your API."
You've probably heard of them by now: the "smart speakers" everybody is talking about. They look like they fit right into your home and seem only good for entertaining you with some tunes, but there is a lot more power behind those simple speakers than you might think. The Google Home, Amazon's Echo, Apple's HomePod, Microsoft's up-and-coming Cortana Invoke,... Every major technology firm is jumping on the bandwagon to try and beat its competitors.
It didn't use to be like this. Originally, these smart "assistants" appeared on your phone. The Google Assistant, Siri, Cortana,... They all focused on letting you use your phone without actually touching the screen. Amazon was the first major technology firm to realise the phone was no longer necessary. And now, with the ability to develop custom Conversational Agents for several of these devices, the possibilities are endless.
Google in your Home
For our tryouts, we decided to order a few Google Home devices from the UK. From my experience while living there, it was clear that the Google Assistant is already way ahead of Amazon's Alexa in terms of NLP (Natural Language Processing) and voice recognition. Since Kunstmaan is based in Belgium and none of the developers are native English speakers, those features improve the accuracy of spoken voice commands considerably. (Neither Apple's HomePod nor Microsoft's Cortana Invoke was available to buy at the time of writing.)
The Google Home is powered by the Google Assistant, Google's version of a personal assistant that, depending on which device it's on, helps you through your day in different ways. The Google Assistant on your phone can do things the one on the Google Home can't, such as sending text messages to a friend. Conversely, the Google Home's Google Assistant can differentiate between multiple users by voice recognition, or trigger actions on other devices with queries like "Show me pictures of cars on the TV".
Similarly, the Google Home only supports English, German, and French currently, while the more generic Google Assistant supports English, French, German, Hindi, Japanese, Portuguese, and Spanish. Obviously, the feature set of the Google Assistant varies between languages, which is why the rollout of the Google Home is limited to select countries.
Developing for the Google Assistant
Now, as developers, we've always wanted our own J.A.R.V.I.S., so what we actually want is to create our own voice commands! There are three ways to do this: Shortcuts (for the average consumer), If This Then That (also known as IFTTT.com, for power users), and Actions on Google (for developers, so this one gets its own massive section later on in this post). All three will be explained, but if you only care about talking to your own API, skip right ahead to Actions on Google!
Google Assistant Shortcuts
First and foremost, Shortcuts for the Google Assistant don't allow you to create new functionality. They're limited to creating new commands for already existing functionality. As an example, you could create a Shortcut "Put on relax TV" which would execute the command "Show cute kitten videos from YouTube on the TV". Shortcuts like these are good for the average consumer, but they don't help us developers do some magic, for which If This Then That and Actions on Google are better options.
If This, Then That
The tight integration of IFTTT.com with the Google Assistant gives "power users" the ability to create custom voice commands that can actually do something new. IFTTT aggregates several third parties, including all the major social networks, online shops, several smart-home and appliance manufacturers, and even car manufacturers. This huge list of services, combined with a very easy-to-learn interface, allows consumers to expand the functionality of their Google Assistant considerably.
However, as developers, you'll again hit the limits of what's possible quite early on. The voice commands are limited to a maximum of two variables (a string and a number), and custom webhooks that do GET/POST calls to an API can't actually return the result to the Google Assistant. As such, the IFTTT integration with your own API is limited to triggering basic tasks, and doesn't really allow anything more than that.
Custom Conversational Agents with Actions on Google
As mentioned previously, both the Shortcuts and the IFTTT integration are limited to triggering one-off basic tasks; they don't allow you to run advanced commands with multiple parameters or to have an actual conversation with your API.
This is where Actions on Google comes in. Again, there are multiple ways to start developing for Actions on Google. You can either use the Actions SDK directly, or use a third-party conversational agent tool such as API.AI, which wraps the functionality of the Actions SDK and provides additional features such as an easy-to-use IDE, natural language understanding (NLU), machine learning, and more. Note that API.AI is actually owned by Google, and is the recommended approach for creating conversational agents for Actions on Google. It also allows you to create conversational agents for several other platforms (not just Actions on Google), but more on that later.
An API AI?
To start working with API.AI for Actions on Google, you'll need to link your API.AI project to an Actions on Google project. By giving that Actions on Google project a name and several sample invocations for the Google Assistant, you'll be able to test your conversational agent in the Actions on Google simulator in the browser. That way, you don't necessarily need a physical Google Home or a phone with the Google Assistant. Once you've got your sample invocation (such as "Let me talk to my Custom API"), you can trigger it by voice or chat, and from then on you'll be communicating with your own conversational agent.
A conversational agent on API.AI consists of several concepts, which I'll explain below.
Entities are sets of variables that are limited to certain values, so the best way to look at them is as Enum values. API.AI provides a set of predefined entities, which are either limited sets (such as @sys.address or @sys.flight-number) or unlimited sets (such as @sys.last-name), but it's also possible to create your own entities. These entities are recognised by API.AI in user queries, and are sent along to your webhook if one is specified. A more technical example than the one visualised would be an Entity named "Project", in which you would define all the different projects that can be built on your deploy server. By including the name of a project in a user query, API.AI will know you're talking about a project, and can match your query accordingly.
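As an illustration, such a custom "Project" entity boils down to a list of values and their synonyms, roughly like the sketch below (the project names are made up, and the exact JSON layout of API.AI's entity export may differ):

```json
{
  "name": "project",
  "entries": [
    {
      "value": "corporate-website",
      "synonyms": ["corporate website", "the main site"]
    },
    {
      "value": "labs-blog",
      "synonyms": ["labs blog", "the blog"]
    }
  ]
}
```

The synonyms are what make voice input practical: the user can say "the main site" and your webhook still receives the canonical value "corporate-website".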
As visible in the image above, an API.AI intent matches a user's intent to do something, in this particular example to book a hotel room. You can define several ways the user could state this intent, with or without certain (required) parameters. After adding these sample sentences, API.AI will be able to match user queries with this intent, even if they don't match a sample sentence exactly, by using Natural Language Understanding (NLU). Note that NLU is not yet available for Dutch. Several entities such as dates, cities, and the sample "chain" entity we created previously are clearly marked in these sample sentences.
For each intent, you can provide a list of entities that might be mentioned in the user query, and mark some of them as required. Marking an entity as required will cause the conversational agent to automatically prompt the user for its value. As soon as all required entities in an intent are known, API.AI will resolve the intent. You can either resolve the intent directly within the API.AI IDE (by providing a text/speech response or an Actions on Google-specific response), or trigger a webhook. In both cases, the entity values are available for use in your output or further business logic.
What's the context?
API.AI also has the concept of Context. A context can be the input and/or output of an intent, and can contain variables of previous intents. In the example above, our "book.hotel" intent had an output context "booking" with a lifetime of 5. This means that after the book.hotel intent has been resolved successfully, the booking context (including the different variables of the booking) will stay around for another 5 user queries. These contexts can be used for follow-up intents such as "Change the end date to the 25th of July", which could take a booking input context and output a booking context with the end date changed to the new value. Intents defined with a certain input context can only be matched while that context exists. Therefore, it would be impossible to start a conversation with "Change the end date to the 25th of July", since the booking context would not exist yet.
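To make this concrete, a webhook response that keeps such a booking context alive could look roughly like the sketch below. The field names (speech, displayText, contextOut) follow API.AI's v1 response format as I know it; the booking values themselves are invented:

```json
{
  "speech": "I've changed the end date to the 25th of July.",
  "displayText": "I've changed the end date to the 25th of July.",
  "contextOut": [
    {
      "name": "booking",
      "lifespan": 5,
      "parameters": {
        "city": "Antwerp",
        "startDate": "2017-07-20",
        "endDate": "2017-07-25"
      }
    }
  ]
}
```

Returning the context with a fresh lifespan resets its counter, so the booking stays available for another 5 queries after every follow-up.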
Providing the user with a response
Simple responses that don't require any business logic can be defined straight in API.AI. It's also possible to provide Actions on Google-specific visual responses such as cards or carousels on phones.
However, it's also possible to have the response handled by a webhook, which API.AI calls Fulfillment: API.AI will do a POST request to an endpoint of your choosing. Note that this endpoint is the same for all intents, so you'll need to provide an Action string to differentiate between the different intents.
There are several ways to build your own webhook endpoint. Google's recommendation is the Actions on Google Node.js library, but for better integration with several projects here at Kunstmaan, we built our prototype in Java. It doesn't really matter which backend language you use, as long as you can easily work with JSON objects.
On completion of an intent, API.AI will do a POST call to your endpoint with the Action, all relevant entities, and contexts, so you have everything you need for your business logic. Your endpoint is responsible for translating the API.AI JSON objects into valid API calls for your business logic, and for formatting the responses from those API calls into response JSON objects for API.AI. All of this has to take less than 5 seconds though, since the Google Assistant will mark your conversational agent as non-responsive if a response takes longer than that.
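To illustrate, here's a minimal sketch of the core of such an endpoint in Node.js (our prototype was in Java, but the idea is identical). The "book.hotel" action and its parameters are hypothetical, and the request/response field names follow API.AI's v1 JSON format:

```javascript
// Minimal sketch of a fulfillment handler for API.AI's v1 webhook JSON.
// The "book.hotel" action and its parameters are hypothetical examples.

// Translate an incoming API.AI request object into business logic calls,
// and format the result as an API.AI response object.
function handleFulfillment(request) {
  const action = request.result.action;
  const params = request.result.parameters || {};

  switch (action) {
    case 'book.hotel': {
      // A real backend would call the actual booking API here.
      const speech = `I've booked a room in ${params.city} until ${params.endDate}.`;
      return {
        speech: speech,
        displayText: speech,
        // Keep the booking details around for follow-up intents.
        contextOut: [{ name: 'booking', lifespan: 5, parameters: params }]
      };
    }
    default: {
      const speech = "Sorry, I can't help you with that yet.";
      return { speech: speech, displayText: speech };
    }
  }
}

module.exports = { handleFulfillment };
```

Whatever HTTP server you wrap this in has to return the response within that 5-second budget, including any calls to your own API, so keep your backend calls fast or kick off slow work asynchronously.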
Conversational Agent Architecture
Hooking up your API.AI project with your business logic via a fulfillment webhook completes your conversational agent. The image above shows this complete architecture, of which we've only scratched the surface. With API.AI as your conversational agent tool, you're not limited to just the Google Assistant: it provides integrations with all the major platforms, including Facebook Messenger, Alexa, Slack, and others. On top of that, there are several SDKs available for even more platforms, so your conversational agent can be fully integrated within your web app, mobile app, or even watch app.
As you can see, fulfillment allows you to call into any existing API, as long as you provide the translations, so the possibilities are quite extensive. We can't wait to see what applications you come up with; feel free to let us know on Twitter!
"Hey Google, ask Kunstmaan Labs to publish the most recent draft."
"Alright, I've published the post 'Hey Google, what can you do?' on the Kunstmaan Labs blog."