
# AI Theory vs. Reality - The Final 5 of 10 Lessons from the Field


There’s a huge difference between the purely academic exercise of training machine learning (ML) models versus building end-to-end data-science solutions to help solve real enterprise problems.

This chapter summarizes the lessons learned after two years of our team engaging with dozens of enterprise clients from different industries, including manufacturing, financial services, retail, entertainment, and healthcare, among others.

6. Data Is Often Unbalanced

Say you have a dataset with labeled credit-card transactions and 0.1% of those transactions turn out to be fraudulent, whereas 99.9% of them are good/normal. If we create a model that says that there’s never fraud, guess what? The model will give a correct answer in 99.9% of the cases, so its accuracy will be 99.9%! This common accuracy fallacy can be avoided by considering different metrics such as precision and recall.

These are defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

• TP = Total number of instances correctly predicted as positive
• TN = Total number of instances correctly predicted as negative
• FP = Total number of instances incorrectly predicted as positive
• FN = Total number of instances incorrectly predicted as negative

In a typical anomaly-detection scenario, the primary goal is to minimize false negatives (for example, ignoring a fraudulent transaction, not recognizing a defective chip, or diagnosing a sick patient as healthy) while not incurring a great number of false positives.

Precision = TP/(TP + FP)

Recall = TP/(TP + FN)

Note that precision penalizes FP while recall penalizes FN. A model that never predicts fraud will have zero recall and undefined precision. Conversely, a model that always predicts fraud will have 100% recall but a very low precision due to a high number of false positives.
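The accuracy fallacy and the two extreme models described above can be checked with a few lines of Python. This is a minimal sketch on a made-up dataset (10,000 transactions, 0.1% fraud); the helper names `confusion_counts` and `report` are illustrative and not part of any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def report(y_true, y_pred):
    """Compute accuracy, precision, and recall; None marks an undefined ratio."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Hypothetical data: 10 fraudulent transactions out of 10,000 (0.1%).
y_true = [1] * 10 + [0] * 9990

never_fraud = report(y_true, [0] * len(y_true))   # model that never flags fraud
always_fraud = report(y_true, [1] * len(y_true))  # model that always flags fraud

print("never fraud: ", never_fraud)
print("always fraud:", always_fraud)
```

The never-fraud model scores 99.9% accuracy with zero recall and undefined precision, while the always-fraud model reaches 100% recall at 0.1% precision.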

The use of receiver operating characteristic (ROC) curves in anomaly detection is discouraged. This is because the false positive rate (FPR), which ROC curves rely on, is heavily biased by the number of negative instances in the dataset (i.e., FP + TN), leading to a potentially small FPR even when there’s a huge number of FPs.

FPR = FP/(FP + TN)

Instead, the false discovery rate (FDR) is useful to have a better understanding of the impact of FPs in an anomaly detection model:

FDR = 1 – Precision = FP/(TP + FP)
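A small worked example shows how far apart FPR and FDR can be on unbalanced data. The confusion-matrix counts below are invented for illustration (one million transactions, 1,000 of them fraudulent).

```python
# Hypothetical confusion-matrix counts for a fraud model on 1,000,000 transactions.
tp = 900        # fraud correctly flagged
fn = 100        # fraud missed
fp = 9_000      # normal transactions incorrectly flagged
tn = 990_000    # normal transactions correctly passed

fpr = fp / (fp + tn)          # false positive rate, the x-axis of a ROC curve
precision = tp / (tp + fp)
fdr = fp / (tp + fp)          # false discovery rate, equal to 1 - precision

print(f"FPR = {fpr:.4f}")     # under 1%: looks excellent
print(f"FDR = {fdr:.4f}")     # about 91% of the alerts are false alarms
```

With these counts the FPR stays below 1% because the 9,000 false positives are divided by almost a million negatives, yet roughly nine out of ten alerts raised by the model are wrong.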

7. Don’t Predict. Just Tell Me Why!

We have come across several projects in which the goal is not to create a model to make predictions in real time but rather to explain a hypothesis or analyze which factors explain a certain behavior. This is to be taken with a grain of salt, given that most machine-learning algorithms are based upon correlation, not causation. Some examples are:

• Which factors make a patient fall into high risk?
• Which drug has the highest impact on blood test results?
• Which insurance-plan parameter values maximize profit?
• Which characteristics of a customer make him or her more prone to delinquency?
• What’s the profile of someone involved in customer churn (a “churner”)?

One way to approach these questions is by calculating feature importance, which is given by algorithms such as random forests, decision trees, and XGBoost. Furthermore, algorithms such as Local Interpretable Model-agnostic Explanations (LIME) or SHapley Additive exPlanations (SHAP) are helpful to explain models and predictions, even if they come from neural networks or other “black-box” models.
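As a rough illustration of feature importance, the sketch below trains a scikit-learn random forest on synthetic data and ranks the features. The dataset, the feature names (`age`, `debt_ratio`, `noise`), and the rule that generates the label are all assumptions made up for this example. LIME and SHAP are not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic, hypothetical customer features.
age = rng.normal(50, 10, n)
debt_ratio = rng.uniform(0, 1, n)
noise = rng.normal(0, 1, n)

X = np.column_stack([age, debt_ratio, noise])
feature_names = ["age", "debt_ratio", "noise"]

# By construction, the label depends on debt_ratio only.
y = (debt_ratio > 0.6).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Because the label was generated from `debt_ratio` alone, that feature should come out on top. On real data, the ranking reflects correlation with the target, not causation, so it is a starting point for investigation rather than an answer.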

8. Tune Your Hyperparameters

Machine-learning algorithms have both parameters and hyperparameters. Parameters are estimated directly by the algorithm (for example, the coefficients of a regression or the weights of a neural network), whereas hyperparameters are not and must be set by the user: for example, the number of trees in a random forest, the regularization method in a neural network, or the kernel function of a support vector machine (SVM) classifier.

Setting the right hyperparameter values for your ML model can make a huge difference. For instance, a linear kernel for an SVM won’t be able to classify data that is not linearly separable. A tree-based classifier may overfit if the maximum depth or the number of splits is set too high, or it may underfit if the maximum number of features is set too low. Finding the optimal values for hyperparameters is a very complex optimization problem. Here are a few tips:

• Understand the priorities for hyperparameters. In a random forest, the number of trees and the max depth may be the most relevant hyperparameters, whereas for deep learning, the learning rate and the number of layers might be prioritized.
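• Use a search strategy like grid search or random search. Random search is usually preferred because, for the same number of trials, it tries more distinct values of each hyperparameter.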
• Use cross-validation: set aside a separate test set, split the remaining data into k folds, and iterate k times, each time holding out one fold for validation (that is, to tune hyperparameters) and training on the other k - 1 folds. Finally, average the quality metrics over all folds.
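The tips above can be sketched with scikit-learn: hold out a test set, run a random search with 5-fold cross-validation on the rest, and score the best model once on the held-out data. The dataset (`make_circles`, which is not linearly separable), the candidate values, and the number of trials are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data: two concentric circles, so no straight line separates the classes.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)

# Set aside a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Candidate hyperparameter values (illustrative, not a recommendation).
param_distributions = {
    "kernel": ["linear", "rbf"],
    "C": [0.01, 0.1, 1, 10, 100],
}

search = RandomizedSearchCV(
    SVC(),
    param_distributions,
    n_iter=8,            # number of random combinations to try
    cv=5,                # 5-fold cross-validation on the training data
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)  # best model, refit on all training data
print("best hyperparameters:", search.best_params_)
print(f"mean CV accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy: {test_accuracy:.3f}")
```

On this data the search should settle on the RBF kernel, since a linear kernel cannot separate concentric circles. The test set is used only once at the end, so its score is not biased by the tuning.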

9. Deep Learning: A Panacea?

During the past few years, deep learning (DL) has been an immense focus of research and industry development. Frameworks such as TensorFlow, Keras, and Caffe now enable rapid implementation of complex neural networks through a high-level application programming interface (API). Application types are countless, including computer vision, chatbots, self-driving cars, machine translation, and even games (including one that can beat the top chess computer in the world).

One of the main premises behind DL is its ability to continue learning as the amount of data increases, which is especially useful in the era of big data. This, combined with recent developments in hardware (e.g., graphics processing units, or GPUs) allows the execution of large deep-learning jobs, which used to be prohibitive due to resource limitations.

So, does this mean that DL is always the way to go for any machine-learning problem? Not really. Here’s why:

• Simplicity: The results of a neural network model are very dependent on the architecture and the hyperparameters of the network. In most cases, you’ll need some expertise on network architectures to correctly tune the model. There’s also a significant trial-and-error component in this regard.
• Interpretability: As we saw earlier, a number of use cases require not only predicting but also explaining the reason behind a prediction. Why was a loan denied? Or why was an insurance policy price increased? While tree-based and coefficient-based algorithms directly allow for explainability, this is not the case with neural networks.
• Quality: In our experience, for most structured datasets, the quality of neural-network models is not necessarily better than that of random forests and XGBoost. Where DL excels is when unstructured data is involved: images, text, or audio.

The bottom line: Don’t use a shotgun to kill a fly. ML algorithms such as random forest and XGBoost are sufficient for most structured supervised problems, and they are also simpler to tune, run, and explain. Let DL speak for itself in unstructured data problems or for reinforcement learning.

10. Don’t Let the Data Leak

While working on a project to predict flight arrival delays, we noticed that the model suddenly reached 99% accuracy when using all the features available in the dataset. The cause was the departure delay being used as a predictor for the arrival delay, even though it is not yet known when the prediction has to be made. This is a typical example of data leakage, which occurs when any of the features used to create the model will be unavailable or unknown at prediction time. So be warned.
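One simple guard against this kind of leakage is to record when each feature becomes known and to filter the feature list by the moment the prediction is made. The sketch below assumes a hypothetical flight-delay feature catalog; the feature names and stages are invented for illustration.

```python
# Stages in chronological order (hypothetical).
STAGES = ["booking", "before_departure", "after_departure"]

# Hypothetical catalog: the stage at which each feature becomes known.
FEATURE_KNOWN_AT = {
    "carrier": "booking",
    "origin_airport": "booking",
    "scheduled_departure_hour": "booking",
    "forecast_wind_speed": "before_departure",
    "departure_delay_minutes": "after_departure",  # the leaky feature
}


def usable_features(feature_known_at, prediction_stage):
    """Keep only features that are already known at the prediction stage."""
    cutoff = STAGES.index(prediction_stage)
    return [
        name
        for name, stage in feature_known_at.items()
        if STAGES.index(stage) <= cutoff
    ]


features = usable_features(FEATURE_KNOWN_AT, "before_departure")
print(features)
```

Here `departure_delay_minutes` is excluded because it only exists after the aircraft has left the gate. A suspiciously high accuracy is still worth investigating by hand, since a catalog like this is only as good as the timing information recorded in it.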

Open Source Gives Us Everything. Why Do We Need a Platform?

It has never been easier to build a machine-learning model. A few lines of R or Python code will suffice for such an endeavor, and there are plenty of resources and tutorials online to train even a complex neural network. For data preparation, Apache Spark can be really useful, even scaling to large datasets. And tools like Docker™ (containerization software) and plumber (an R package that exposes R code as a web API) ease the deployment of machine-learning models through HTTP requests. So, it looks like one could build an end-to-end ML system purely using the open-source stack. Right?

This may be true for building proofs of concept. A graduate student working on a dissertation would certainly be covered under the open-source umbrella. However, for the enterprise, the story is a bit different.

We are big fans of open source, and many open-source tools are available. But at the same time, there are also quite a few gaps. Here are some of the reasons why enterprises choose data science platforms:

• Open-source integration: Up and running in minutes, support for multiple environments, and transparent version updates
• Collaboration: Easy sharing of datasets, data connections, code, models, environments, and deployments
• Governance and security: Not only over data, but over all analytics assets
• Model management, deployment, and retraining
• Model bias: Detect and correct a model that’s biased by things like gender or age
• Assisted data curation: Visual tools to address the most painful task in data science
• Graphics processing units (GPUs): Immediate provisioning and configuration for optimal performance of deep-learning frameworks (e.g., TensorFlow)
• Codeless modeling: For statisticians, subject matter experts, and even executives who don’t code but want to build models visually

An integrated data science platform should be able to provide all of the above and more so that the end-user does not have to be a systems integrator.

Look for another AI: Evolution and Revolution excerpt in an upcoming issue of MC Systems Insight. Can't wait? Pick up your copy of Artificial Intelligence: Evolution and Revolution at the MC Press Bookstore today!

Mark Simmonds is a Program Director in IBM Data and AI communications. He writes extensively on machine learning and data science, holding a number of author recognition awards. He previously worked as an IT architect, leading complex infrastructure design projects. He is a member of the British Computer Society and holds a Bachelor’s Degree in Computer Science.

