Open-Source Tools for Watson, Part 1

There’s way more to AI than just Watson. What are some of those other tools? And where do you get them? If you think those are good questions, keep reading.

You may have been thinking of Watson as a single product that does it all from soup to nuts. And to be fair, Watson covers a lot of ground.

But the truth is, while Watson may provide the horsepower, a set of machine-learning APIs, and an overall framework that allows you to connect it to a number of programming languages in your business system, there are many open-source products that can and should be used to create the perfect Watson experience. Fortunately, IBM has made a real commitment to open-source software, and nowhere is that more evident than in the number and types of tools that can be used with Watson.

“Free” and “Open Source” Software

I want to say something about the terms “free” and “open source.” Maybe I’m alone in this, but when I hear the word “free,” I think of no cost. And when I hear the term “open source,” I think of free, as in no cost.

However, a closer look at both terms shows that, while they overlap heavily, whether a specific piece of software qualifies for either category really depends on the wording of its license. For example, what does the license say about your ability to sell the software, or even to give it away for free?

Truth is, most open-source software does come with a license. And the type of license it comes with and the license’s exact terms determine what you can and can’t do with the open-source software. And that is independent of whether or not the original open-source group charges you for the software.

What’s important to remember is that the “free” part refers to what you are allowed to do with the software, not necessarily to whether it costs money. That is, you will find open-source or free software that has to be paid for, but you will still be free to use it in whatever ways its license allows.

I just didn’t want you to get excited when we talk about IBM’s commitment to open source and think you can use all of the tools mentioned below for no cost. Some you can (Jupyter Notebook, TensorFlow, etc.), and some you can’t (SAS is a commercial product, for example, and H2O sells paid products alongside its open-source core).

Data Analysis Engines

The first tools that Watson tends to rely on are the Data Analysis Engines, which take a large amount of data, process it in a reasonable time frame, and organize it so that relationships and dependencies can be isolated.

The two primary engines today are Hadoop and Spark. Both are now Apache Software Foundation projects: Hadoop grew out of work at Yahoo in 2006, and Spark was started at the AMPLab at UC Berkeley in 2009 and became an Apache project in 2013.

What’s the difference? Basically, Hadoop runs in batch mode using disk, while Spark can also handle streaming data and does its work in memory. Both specialize in “clustering,” which means spreading the data and the work across a group of machines (the cluster) rather than processing everything on one machine.
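To make the batch-versus-streaming distinction concrete, here is a minimal sketch in plain Python. It does not use Hadoop or Spark at all; the function names batch_average and streaming_average are made up for illustration, and the only point is the shape of the two approaches.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch style: wait until all the data is collected, then process it once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the answer as each new value arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        # An up-to-date result is available after every single value.
        yield total / count
```

Calling batch_average([10, 20, 30]) gives one answer, 20.0, once all the data is in, while list(streaming_average([10, 20, 30])) gives [10.0, 15.0, 20.0], an updated answer after every value.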

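Consequently, Spark is generally faster; the figures usually quoted are up to 100 times faster than Hadoop in memory and up to 10 times faster on disk. There are a couple of reasons for this, but mostly it comes down to what happens between processing steps: Hadoop’s MapReduce writes intermediate results to disk after each step, while Spark can keep them in memory and optimize across the whole chain of steps. (Spark supports the same map-and-reduce style of processing, but it is not tied to Hadoop’s MapReduce engine.)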
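If MapReduce is new to you, the idea can be sketched in a few lines of plain Python. This is only an illustration of the pattern running in a single process, not Hadoop or Spark code, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are my own assumptions. The classic example is counting words.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases in order and return a word -> count mapping."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

For example, word_count(["Spark and Hadoop", "Hadoop on disk"]) returns {'spark': 1, 'and': 1, 'hadoop': 2, 'on': 1, 'disk': 1}. On a real cluster, each phase would be split across many machines, and the handoff between phases is exactly where the disk-versus-memory difference shows up.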

There are cases where Hadoop holds its own, specifically large batch jobs where the data is too big to fit in memory, but in general Spark is faster if you have a ton of data. Certainly, Spark is the fair-haired child today, although there are a lot of Hadoop specialists out there.

And that brings us to another consideration: hardware and staffing. The software may be free (as in no cost), but you need something fairly heavy duty to run it on. This could be your own site or Watson. If it is your own site, then you need to have some hardware in place (and Spark will require more memory than Hadoop) as well as people who are familiar with cluster admin techniques.

For a concise article on the detailed differences in the software and the costs, see this.

Watson, Spark, and Hadoop are similar in that they all deal with unstructured data, but the difference is that Spark and Hadoop do only part of what Watson does. They do not have the ability to learn. Watson does, taking in the input and refining its ability to interpret what it sees.

Coming Up Next: Models

Before we finish part 1 of this two-part series, I want to take a high-level look at the term “models.”

As you dive into AI, you will see that term used often, and the second part of this series will deal with the software you use to set up and improve these models.

In a nutshell, a model is a set of rules that tells the software what we want it to do. It is basically a decision system: given a particular situation, what should the machine’s response to it be? The model is what tells the machine what to do.
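Here is a toy sketch of that idea in Python, using a made-up purchasing example. Everything in it (the ReorderModel class, the fit_threshold helper, the “reorder” and “wait” responses) is hypothetical and has nothing to do with how Watson builds its models. It only shows a rule that maps a situation to a response, plus one simple way such a rule could be tuned from past decisions.

```python
from dataclasses import dataclass


@dataclass
class ReorderModel:
    """A toy model: one rule that maps a situation to a response."""
    threshold: float  # reorder when fewer than this many days of stock remain

    def decide(self, units_in_stock: int, units_sold_per_day: float) -> str:
        if units_sold_per_day <= 0:
            return "wait"  # nothing is selling, so stock will not run out
        days_left = units_in_stock / units_sold_per_day
        return "reorder" if days_left < self.threshold else "wait"


def fit_threshold(examples):
    """Pick the cut-off that reproduces the most past decisions.

    `examples` is a non-empty list of (days_left, correct_response) pairs.
    """
    candidates = sorted({days for days, _ in examples})
    # Also try a cut-off above every observed value ("always reorder").
    candidates.append(candidates[-1] + 1)

    def score(cutoff):
        return sum(
            ("reorder" if days < cutoff else "wait") == label
            for days, label in examples
        )

    return max(candidates, key=score)
```

Given the history [(2, "reorder"), (3, "reorder"), (6, "wait"), (9, "wait")], fit_threshold picks 6, and ReorderModel(threshold=6).decide(40, 10) then answers "reorder" because only four days of stock are left.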

There are many types of models, just as there are many types of AI systems: chat bots, purchasing systems, air traffic control, language translation, whatever.

How you set up and, especially, improve that model is crucial to the success of your AI adventure. And the software we will be talking about in the next installment is what will let you do that. Until then, see ya.