For AI, Synthetic Data Is Anything but Fake News

Analytics & Cognitive

AI apps must be trained to evaluate data. However, the use of some data is restricted by law or by the logistics of gathering it. These difficulties created a need for synthetic data, and the market is providing it.

To most people, “synthetic data” sounds like a paradox, or at least a contradiction in terms. Synthetic is something made up, and data is presumed to be a collection of facts. In the realm of AI, though, synthetic data is not only real, it has become a critical tool. In fact, it is becoming so important that it may actually surpass the use of real data for training AI apps by the end of this decade, according to at least one prediction. Any enterprise that plans to use AI at some point will have to embrace this concept with the confusing name, if not become an outright consumer of synthetic data.

Machine learning (ML), the process by which the neural networks that run AI apps learn how to do what they do, requires large amounts of data both to train an AI and to help ensure that the training an app receives isn't biased in some subtle way. Essentially, an AI app is presented with known data and works on it until it produces an expected value or output, at which point the owner of the app can consider it trained enough to proceed to the next training step. Although the process sounds simple, it can be rather complicated in execution.

Without getting too much into the weeds here, at a basic level, an AI app isn't learning the data itself, but rather how to classify it. The app turns that classification into a math function for subsequent processing. Part of this classification process also involves labeling the data, a function that currently is mostly done by humans to shortcut the trial-and-error process an AI app would have to go through to do that itself. Labeling provides a context for the data, and as the ML process continues, more labels are added to the data to give it more context. Generally speaking, the more data in a dataset and the more accurate the labels a dataset is given, the more efficient the ML process can be.
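As a rough illustration of turning labeled data into a "math function," here is a minimal sketch in Python. It is not how any particular AI app works: a simple nearest-centroid classifier stands in for a neural network, and the data, labels, and function names are all hypothetical.

```python
from statistics import mean

# Hypothetical labeled training data: (feature vector, label) pairs.
labeled_data = [
    ((1.0, 1.2), "mild"),
    ((0.8, 1.0), "mild"),
    ((3.1, 2.9), "severe"),
    ((2.9, 3.3), "severe"),
]

def train(examples):
    """Learn one centroid (average point) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(dim) for dim in zip(*points))
        for label, points in by_label.items()
    }

def classify(model, features):
    """Return the label whose centroid is closest to the input."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

model = train(labeled_data)
prediction = classify(model, (1.1, 0.9))
```

The trained model stores only the per-label averages, not the original records, which mirrors the point that the app learns how to classify the data rather than the data itself.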

Finding Enough of the Right Kind of Data

Ah, but where does that data come from? What's the guarantee for its accuracy? And how much data of the kind needed can one get?

If you were trying to figure out something like whether people would be more or less likely to use an antacid tablet for heartburn depending on the tablet's flavor, asking every person in the nation that question would produce a large dataset, but the time and expense of doing it would be prohibitive. If instead you tried to get data from hospitals on how often their patients needed antacid tablets based on what they had consumed from hospital room service, you'd run up against privacy protections embedded in such laws as the Health Insurance Portability and Accountability Act (HIPAA). Those laws don't let hospitals release that kind of data because it could be traced back to the individuals from whom it came, potentially violating their privacy.

Finally, even if data were somehow available by either method, how should that data be labeled? How would you prevent human error in the labeling process? Synthetic data solves these problems.

It used to be that the privacy dilemma could be worked around by anonymizing data. For example, a large amount of data about epidemic victims would be processed by jumbling their names or substituting other letters for them, so specific records couldn't be traced back to an individual. However, in 2019, researchers from Imperial College London and Belgium's Université Catholique de Louvain showed that the actual people in an anonymized dataset could be re-identified using just 15 demographic attributes.

While there had been similar demonstrations with smaller datasets earlier, this was a tipping point: it showed that simply anonymized data could fall short of Europe's General Data Protection Regulation (GDPR). Simple anonymizing is no longer enough protection, particularly for text data. Now the standard is that datasets used for research on medical problems, or those dealing with financial records, must stay true to the structure of the original data without violating any person's privacy.
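To see why removing names alone offers weak protection, consider the following sketch. It is only a counting illustration, not the researchers' method, and the records and the `unique_fraction` helper are hypothetical. It measures how many records become one of a kind as more demographic attributes are combined.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, demographics kept.
records = [
    {"zip": "90210", "birth_year": 1980, "sex": "F"},
    {"zip": "90210", "birth_year": 1980, "sex": "M"},
    {"zip": "90210", "birth_year": 1975, "sex": "F"},
    {"zip": "10001", "birth_year": 1980, "sex": "F"},
    {"zip": "10001", "birth_year": 1980, "sex": "F"},
    {"zip": "10001", "birth_year": 1975, "sex": "M"},
]

def unique_fraction(rows, attributes):
    """Fraction of rows whose attribute combination appears exactly once."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

one_attr = unique_fraction(records, ["zip"])
two_attr = unique_fraction(records, ["zip", "birth_year"])
three_attr = unique_fraction(records, ["zip", "birth_year", "sex"])
```

In this toy data, no record is unique by ZIP code alone, a third are unique once birth year is added, and two-thirds are unique with all three attributes. A record that is unique on attributes an outsider can look up elsewhere is a record that can potentially be tied back to a person.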

The Value of Synthetic Data

Synthetic data is produced by custom-built algorithms that generate artificial datasets replicating the structure of any kind of actual dataset. Because these algorithms model the data in a real dataset without replicating the actual data within it, the synthetic dataset retains the form, probability distribution, and other characteristics of a dataset of any particular kind that might be used for AI app training. However, the synthetic dataset has no actual information that can be tied to any individual, so it avoids the privacy risk. Synthetic data simply represents actual data and lets an AI app set baselines just as it could with real data. ML training can therefore still help AI apps learn to recognize statistical patterns and other properties in a dataset, yet nothing in that dataset can be traced back to a real person because it's replacement data rather than actual data. And it's well-known that ML training activities can generate more-accurate AI app models if the training data is more diverse.

In keeping with the idea of making medical-related data available for AI training and other purposes without violating privacy rules, health-insurance provider Anthem announced in May it would partner with Google Cloud to generate synthetic data that will include "medical histories, health care claims and other medical data."

Synthetic data has an economic benefit that counteracts the problem of prohibitive data-gathering costs as well. The generated datasets can be scaled to any size without affecting validity. This lets enterprises looking for a very large amount of data find some without having to research or generate it themselves and lets smaller companies access this aid to ML activity at a reasonable cost. It also provides enough data diversity to represent the reality-based models that ML training tries to teach and opens possibilities for enterprises to simply experiment with ML training relatively inexpensively. Such experimentation can act as a confidence-building measure for organizations choosing to sample AI technology before launching any full-bore project.

In addition, synthetic datasets come with labels already assigned. The entity using such datasets gets to skip that step in preparing AI training data. Making tweaks to the synthetic datasets also helps AI support staff adjust data-generation parameters and observe those results. The technology has the potential, for example, to help spot and counteract lending biases in the financial services industry and to help retailers analyze consumer purchasing behavior, all without exposing personal information about individuals.

Synthetic Data's Types and Categories

Generally speaking, synthetic data is divided into three major types and three categories. The three major types are synthetic text, synthetic images, and tabular synthetic data. Text is data in a natural language format. Images, primarily used in computer vision applications, include still and video graphics that are used to train AI apps to correctly "see" and discriminate between images of different objects. Tabular synthetic data is data of many types that appears in field-like structures for analysis by algorithms.

The three categories are fully synthetic data, partially synthetic data, and hybrid synthetic data. Fully synthetic data, as one might expect, is generated independently of real data and therefore has no relationship to actual data except that its format is similar. Partially synthetic data uses real data as a structure but substitutes for actual data values if there is a potential privacy issue. Hybrid synthetic data uses randomly selected items of real data that are then combined with similarly structured synthetic records to create an independent synthetic data item.
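The partially synthetic category can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the records, the choice of which fields count as sensitive, and the `partially_synthesize` helper are all hypothetical, and drawing replacement values uniformly from the observed range is far cruder than what a commercial generator would do.

```python
import random

# Hypothetical real records; "name" and "income" are treated as sensitive.
real_records = [
    {"name": "Alice", "age_band": "30-39", "income": 52000},
    {"name": "Bob", "age_band": "40-49", "income": 61000},
    {"name": "Carol", "age_band": "30-39", "income": 58000},
]

def partially_synthesize(rows, rng):
    """Keep the record structure; substitute values in sensitive fields."""
    incomes = [row["income"] for row in rows]
    low, high = min(incomes), max(incomes)
    result = []
    for i, row in enumerate(rows):
        new_row = dict(row)                         # copy; leave the original intact
        new_row["name"] = f"person_{i}"             # placeholder identifier
        new_row["income"] = rng.randint(low, high)  # drawn from the observed range
        result.append(new_row)
    return result

rng = random.Random(42)  # fixed seed so the sketch is repeatable
synthetic_records = partially_synthesize(real_records, rng)
```

The non-sensitive field (`age_band`) passes through unchanged, so the structure of the real data survives while the values that could identify someone are replaced.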

There are two major ways to create synthetic data. The first is to take a statistical distribution of a set of real-world data and then substitute numbers for the original data to create a dataset with different actual values. The second is to generate a model of an observed statistical distribution of an actual dataset and then substitute random data using the model as a structure. The models primarily used are generative, which means they include a model of the data distribution itself and can tell users the probability that a particular data item is representative of the overall dataset. (This is opposed to discriminative models, which use labels and tell users how likely the label is to be correct.)
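The second approach, modeling an observed distribution and then drawing random data from the model, can be sketched as follows. This is a minimal example, not a production generator: the sample values are hypothetical, and the assumption that a single normal distribution fits the data is made only to keep the sketch short.

```python
import random
from statistics import mean, stdev

# Hypothetical real measurements (for example, one numeric column of a dataset).
real_values = [4.8, 5.1, 5.5, 4.9, 5.3, 5.0, 5.2, 4.7, 5.4, 5.1]

# Step 1: model the observed distribution. A normal distribution is an
# assumption made for this sketch; real generators use richer models.
mu = mean(real_values)
sigma = stdev(real_values)

# Step 2: draw new values from the model instead of copying real ones.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic_values = [rng.gauss(mu, sigma) for _ in range(1000)]

# The synthetic column should have roughly the same center and spread.
synthetic_mu = mean(synthetic_values)
synthetic_sigma = stdev(synthetic_values)
```

Note that ten real values yield a thousand synthetic ones here. Because the values come from the model rather than from the records themselves, the output can be made as large as needed, although it can only be as faithful as the model fitted to the real data.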

(As a side note, synthetic data shouldn't be confused with a somewhat similar term, surrogate data. Surrogate data is specific to a group of data points in a sequence that have been taken and displayed at equally spaced moments in time, which can subsequently be analyzed via moving-average models.)

Obtaining Synthetic Data

If these definitions aren't entirely clear, the good news is that rather than building an algorithm to generate synthetic data, it's easier to find enterprises that provide synthetic data for training AI apps. For most situations, synthetic data produced by established companies using their own tested generation algorithms is both more accurate, without violating privacy rules, and cheaper than data from a generation algorithm an enterprise invents on its own.

Admittedly, no one can guarantee total accuracy with today's means of generating synthetic data. However, the general consensus among many enterprises seeking ML training data for AI apps is that the accuracy is high enough to offset the expense, inconvenience, and necessity of generating and labeling some version of synthetic data in-house.

There are currently scores of vendors offering synthetic data of various kinds and tools for building in-house synthetic datasets, should anyone's needs make traveling that route more practical in the long term. Because of space considerations, presented here is simply a list of associated providers. Readers interested in obtaining synthetic data or related tools for building it independently are encouraged to research which vendors provide aid best suited to any particular AI training needs.

Synthetic Data and Data Generation Apps Vendors



Avo Automation

Bulian AI

Clearbox AI

Curiosity Software

FinCrime Dynamics

Intelligent Delivery Solutions

Instill AI

MD Clone

Mostly AI

Octopize MD

Particle Health

Replica Analytics

Sarus Technologies

Sky Engine

Syndata AB







John Ghrist

John Ghrist has been a journalist, programmer, and systems manager in the computer industry since 1982. He has covered the market for IBM i servers and their predecessor platforms for more than a quarter century and has attended more than 25 COMMON conferences. A former editor-in-chief with Defense Computing and a senior editor with SystemiNEWS, John has written and edited hundreds of articles and blogs for more than a dozen print and electronic publications. You can reach him at
