TechTip: Watson APIs - Text to Speech

We looked at Watson Speech to Text; you talk and Watson converts the audio to a written document. Did you think there wouldn't be tit for tat? Now, we go the other way.

By David Shirey

In one sense, you might think this would be a mirror image of the Speech to Text article, but things like that rarely happen. Although the two processes (text to speech versus speech to text) run in opposite directions, there are some differences in what happens along the way. Let's look, shall we?

Getting There

As with previous APIs, we start on the Watson home page. Click on Products and Services. In the window that appears, select Text to Speech.

The format here is roughly the same as in the previous two APIs that we looked at. The Try for Free button lets you dive right in and start working. The View Demo button will take you into a couple of demos.

What Does This API Do?

Obviously, this API converts written text into audio using a specific language and even a specific voice. The supported languages include German, French, Spanish (Castilian plus Latin American and North American dialects), Brazilian Portuguese, Italian, Japanese, and English (US and UK dialects).
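
To make "text in, audio out" concrete, here is a minimal sketch of the underlying REST call that the SDKs wrap. It assumes the service's POST /v1/synthesize endpoint, a JSON body with a "text" field, the voice passed as a query parameter, the audio format requested through the Accept header, and an IBM Cloud API key sent as Basic auth with the user name "apikey". SERVICE_URL and API_KEY are hypothetical placeholders for the values in your own credentials, and the voice names reflect the service's original voices (newer deployments use names such as en-US_AllisonV3Voice). The builder function only constructs the request, so you can inspect it before sending anything.

```python
import base64
import json
import urllib.parse
import urllib.request

# Hypothetical placeholders: substitute the service URL and API key
# from your own Text to Speech credentials.
SERVICE_URL = "https://example.com/text-to-speech/api"
API_KEY = "your-api-key"


def build_synthesize_request(text, voice="en-US_AllisonVoice",
                             audio_format="audio/wav",
                             service_url=SERVICE_URL, api_key=API_KEY):
    """Build (but do not send) a POST to the /v1/synthesize endpoint."""
    query = urllib.parse.urlencode({"voice": voice})
    url = f"{service_url}/v1/synthesize?{query}"
    # IBM Cloud API keys can be sent as HTTP Basic auth with user name "apikey".
    token = base64.b64encode(f"apikey:{api_key}".encode()).decode()
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": audio_format,  # audio format requested for the response
            "Authorization": f"Basic {token}",
        },
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request (needs real credentials) and save the audio bytes."""
    request = build_synthesize_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    # Only builds and prints the request; nothing is sent to the service.
    req = build_synthesize_request("Hello from Watson.", voice="en-GB_KateVoice")
    print(req.get_method(), req.full_url)
```

Swapping the voice parameter is all it takes to move from Allison to Kate or Dieter; the text and audio handling stay the same.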

A variety of genders and intonations is available. German comes in either male (Dieter) or female (Birgit). British English is female (Kate), and American English offers Michael, Lisa, and Allison (the documentation claims that Allison is "expressive"). Castilian Spanish has both male (Enrique) and female (Laura) voices, but Latin American and North American Spanish are female only (Sofia). French is female (Renee), and Italian is also female (Francesca). To round things out, Japanese is female (Emi), and Brazilian Portuguese is female (Isabela). There seems to be a preponderance of female voices. Kind of odd, considering all the publicity about the scarcity of women in the top-level tech environment. Reminds me of the time someone reprogrammed the Enterprise computer to be a woman.
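
If you want to check that head count yourself, the voice lineup fits in a small lookup table. This sketch assumes the service's voice identifiers follow the language-code_NameVoice pattern used for its original voices (for example, de-DE_DieterVoice); the genders for Michael, Lisa, and Allison are added here and are not stated above. Treat the table as illustrative and confirm the current names against the service's GET /v1/voices list, since IBM has since retired or renamed several of these voices.

```python
from collections import Counter

# Voices described above, keyed by their (assumed) service identifiers.
VOICES = {
    "de-DE_BirgitVoice": ("German", "female"),
    "de-DE_DieterVoice": ("German", "male"),
    "en-GB_KateVoice": ("English (UK)", "female"),
    "en-US_AllisonVoice": ("English (US)", "female"),
    "en-US_LisaVoice": ("English (US)", "female"),
    "en-US_MichaelVoice": ("English (US)", "male"),
    "es-ES_EnriqueVoice": ("Spanish (Castilian)", "male"),
    "es-ES_LauraVoice": ("Spanish (Castilian)", "female"),
    "es-LA_SofiaVoice": ("Spanish (Latin American)", "female"),
    "es-US_SofiaVoice": ("Spanish (North American)", "female"),
    "fr-FR_ReneeVoice": ("French", "female"),
    "it-IT_FrancescaVoice": ("Italian", "female"),
    "ja-JP_EmiVoice": ("Japanese", "female"),
    "pt-BR_IsabelaVoice": ("Brazilian Portuguese", "female"),
}


def voices_for(language_prefix, gender=None):
    """Return sorted voice IDs whose language code starts with language_prefix,
    optionally filtered by gender ("male" or "female")."""
    return sorted(
        voice
        for voice, (_, voice_gender) in VOICES.items()
        if voice.startswith(language_prefix)
        and (gender is None or voice_gender == gender)
    )


def gender_counts():
    """Count voices by gender across the whole table."""
    return Counter(gender for _, gender in VOICES.values())


if __name__ == "__main__":
    print(voices_for("es"))
    print(gender_counts())
```

By this table, the tally is 11 female voices to 3 male ones, which backs up the "preponderance" remark.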

Apparently, it delivers a seamless voice interaction that caters to your audience with control over every word. I know that this, at least, is true because it is taken word for word from the IBM website. Hey, baby, have they ever lied to you?

Of course, this is not hard to believe. Text to speech is not a cutting-edge technology. Truly world-class companies already have voice-activated systems that use text-to-speech technology, although in a slightly different way. The trick, however, is to make text to speech truly customized.

OK, I can see you are a sarcastic and disbelieving group. Fine. I can dig it. So let me prove you wrong.

Other Stuff

I tried the demo here, and at first it didn't work. I think that was because I was using Safari; when I switched over to Chrome, it worked fine. You can choose from any of the supported languages, and you have several options for the type of text and the intonation used. There is a flat piece of text that sounds like it came from a high-school social studies book. The next one is a sample of something a customer service representative would say. The tone had been modified to be more personal, but while listening to it I had the strange feeling that it sounded vaguely like a homicide detective telling someone that their loved one had just been found dead in the old cemetery. The final section gave some hints on how the speaker's voice can be changed to sound harder or softer and whatnot. That was kind of freaky, like listening to someone go through a list of their multiple personalities. But it was nice to see the range that the app can provide.
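
The "harder or softer" effects in the demo come from markup in the text itself rather than a separate setting. Here is a minimal sketch that wraps text in SSML using IBM's voice-transformation extension element. The attribute names (type, strength, glottal_tension, breathiness) and the percentage values follow IBM's SSML documentation for its original voices as best understood here; treat them as assumptions, check the current docs, and expect that newer voices may not support the element at all. The resulting string goes into the same "text" field of a synthesize request.

```python
from xml.sax.saxutils import escape, quoteattr


def transformed_ssml(text, **attributes):
    """Wrap text in <speak> plus IBM's <voice-transformation> element.

    Attribute names and values are passed through as-is (assumed, not
    validated); the text is XML-escaped so characters like & stay legal.
    """
    attrs = "".join(
        f" {name}={quoteattr(str(value))}" for name, value in attributes.items()
    )
    return (
        "<speak>"
        f"<voice-transformation{attrs}>{escape(text)}</voice-transformation>"
        "</speak>"
    )


if __name__ == "__main__":
    # Softer: the built-in "Soft" transformation at 80% strength (assumed values).
    print(transformed_ssml("Your order has shipped.", type="Soft", strength="80%"))
    # Harder: a custom transformation with more vocal-cord tension, less breath.
    print(transformed_ssml("Your order has shipped.", type="Custom",
                           glottal_tension="60%", breathiness="-40%"))
```

Keeping the markup generation in one function means the escaping is done once, so a stray ampersand in customer text cannot break the SSML.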

The Software Development Kits (SDKs) for this app include the usual suspects: Java, Python, Swift, .NET, Node.js, and Unity.

Probably the best thing about Text to Speech (which seems simpler than the opposite) is that it's customizable. You can add a voice (for example, if you live in Brooklyn or maybe Alabama), and you can also add custom words.
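
Adding custom words works through a customization model. This sketch assumes the service's REST endpoints: POST /v1/customizations (with name, language, and description) creates a model and returns its customization_id, and POST /v1/customizations/{customization_id}/words adds entries that pair a word with a "sounds-like" translation. SERVICE_URL, the model name, and the sample terms are hypothetical. As with the other sketches, the functions only build the requests; after the words are added, you would pass the model's ID as the customization_id query parameter when synthesizing.

```python
import base64
import json
import urllib.parse
import urllib.request

SERVICE_URL = "https://example.com/text-to-speech/api"  # hypothetical placeholder


def _json_request(method, path, payload, api_key, service_url=SERVICE_URL):
    """Build an authenticated JSON request (Basic auth, user name "apikey")."""
    token = base64.b64encode(f"apikey:{api_key}".encode()).decode()
    return urllib.request.Request(
        service_url + path,
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
    )


def create_model_request(name, language, api_key, description=""):
    """Create an empty custom model for one language, e.g. "en-US"."""
    payload = {"name": name, "language": language, "description": description}
    return _json_request("POST", "/v1/customizations", payload, api_key)


def add_words_request(customization_id, words, api_key):
    """Add industry terms: maps each word to how it should sound."""
    payload = {"words": [{"word": w, "translation": t} for w, t in words.items()]}
    model = urllib.parse.quote(customization_id, safe="")
    return _json_request("POST", f"/v1/customizations/{model}/words",
                         payload, api_key)


if __name__ == "__main__":
    # Hypothetical shop-floor vocabulary; nothing is sent to the service.
    terms = {"SQL": "sequel", "RPGLE": "R P G L E"}
    req = add_words_request("your-customization-id", terms, "your-api-key")
    print(req.get_method(), req.full_url)
    print(req.data.decode())
```

Keeping the pronunciations in a plain dictionary makes it easy to maintain a company glossary in one place and push it to the model whenever it changes.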

And that's where most speech software breaks down: in the special words or phrases that each particular industry uses, or in dialects that are just hard to understand. Sure, most of us in the United States speak English. But there is a world of difference (as any Hollywood star knows) between Dixie and New England and Southern California.

But the Watson app is open to customization, and you can make what comes out sound as homelike as you want. Plus, it can pick out special terms or phrases that are important to your industry and make sure they are pronounced correctly. It's as American as American can be. And I don't really know what that means. It just sounded good.

Obviously, adding a new voice or new terms will not be a simple process. But if it's really important to your business, it's nice to know that Watson supports it.

Can't Go Any Further

The bottom line is that between Speech to Text and Text to Speech, Watson handles a lot of the communications and conversion issues that you might have. The only question is how you'll use each.