How synthetic data could be key to unlocking the potential of AI in healthcare

By Opinion editor. Published on: May 19, 2025. Last updated: May 28, 2025.

By Christian Hardahl, healthcare leader (EMEA region) at software firm SAS.

As AI continues to permeate every facet of the modern world, industries are rapidly discovering its potential to revolutionise operations, improve efficiency and drive innovation. Among the sectors pushing the boundaries of AI adoption, the healthcare sector stands at the forefront.

However, healthcare organisations face a unique set of challenges when it comes to integrating AI, including data scarcity, privacy concerns and stringent regulatory constraints. These barriers have made the process of training AI models slower, more expensive and less effective.

Data scarcity and privacy concerns

In the healthcare sector, the importance of high-quality, real-world data cannot be overstated. But acquiring enough data to train AI models that can accurately predict outcomes or assist in decision-making is often an insurmountable task.

The challenge lies not only in obtaining sufficient volumes of data but also in ensuring that it is diverse, relevant and free from any biases that might skew results. Some data collection methods may fall short in these areas, leading to gaps in knowledge and limiting the ability of AI systems to make truly informed predictions.

On top of this, privacy is a major concern for many. Healthcare data is some of the most sensitive information that exists and its improper use or exposure can result in serious consequences for both individuals and organisations, so the sector must tread carefully.

Additionally, in many countries, stringent laws such as the General Data Protection Regulation (GDPR) in the UK and EU, or the Health Insurance Portability and Accountability Act (HIPAA) in the US, impose significant constraints on how personal health data can be used.

Overcoming the roadblocks

Defined as algorithmically generated data that mimics real-world data, synthetic data is proving to be a game changer for many industries – particularly healthcare. This artificially generated data mirrors the statistical properties of real-world data, but without containing any sensitive or identifiable information.

From financial transactions and medical records to customer behaviour patterns, synthetic data enables businesses to generate highly relevant datasets while preserving privacy and maintaining the integrity of the information used to train their models.

By generating datasets that preserve the statistical relationships and distributions found in real data, without using any personal or confidential information, synthetic data allows organisations to sidestep the privacy issues entirely.

Ultimately, this makes it possible to train AI models with rich datasets while adhering to regulatory requirements and ensuring that sensitive information remains secure.
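
The idea of preserving statistical relationships without copying any record can be sketched in a few lines of Python. This is a minimal illustration, not a production method: it assumes two numeric columns that are roughly jointly Gaussian, the function names and the sample values are made up for the example, and a simple fit like this offers no formal privacy guarantee on its own.

```python
import math
import random


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, seed=0):
    """Draw n new rows from the fitted distribution; no real row is copied."""
    mx, my, sx, sy, r = params
    rng = random.Random(seed)
    # Guard against tiny negative values caused by floating-point rounding.
    tail = math.sqrt(max(0.0, 1.0 - r * r))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Mixing z1 into the second column reproduces the fitted correlation.
        out.append((mx + sx * z1, my + sy * (r * z1 + tail * z2)))
    return out


# Made-up (age, systolic blood pressure) pairs, for illustration only.
real_rows = [(34, 118), (45, 124), (52, 131), (61, 139), (29, 112),
             (70, 148), (48, 127), (57, 135), (39, 121), (66, 142)]

params = fit_bivariate_gaussian(real_rows)
synthetic_rows = sample_synthetic(params, n=5000)
```

Refitting the same statistics on `synthetic_rows` should give means and a correlation close to those of `real_rows`, which is the sense in which the synthetic set "mirrors" the original.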

Enhancing AI quality and performance

While privacy is a critical factor, the quality and performance of AI models are also paramount. For AI systems to be effective, they need access to the type of data that accurately represents the complexities that often occur in real-world scenarios.

This is where synthetic data can shine. Healthcare organisations can create large volumes of high-quality data, designed to fill gaps and reduce biases, that reflect the variability seen in real life.

However, the methods used to create synthetic data are just as important as the data itself. One commonly used approach is the Generative Adversarial Network (GAN), which consists of two neural networks – the generator and the discriminator – that are trained against each other to produce realistic data.

The generator creates synthetic data, while the discriminator evaluates it for authenticity. Over time, both networks improve: the generator gets better at producing convincing samples and the discriminator gets better at spotting fakes. The result is highly realistic data that mimics the nuances of real-world datasets. This is particularly effective for generating medical data, where real-world data can be sparse and difficult to obtain due to privacy regulations.
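
The generator and discriminator loop can be illustrated with a deliberately tiny, one-dimensional sketch. Everything here is an assumption made for illustration: the "real" data is a single made-up measurement drawn from a normal distribution, the generator is a linear function of noise, the discriminator is a logistic regression, and the gradients are written by hand. Real medical GANs use deep networks and a machine learning framework, and a toy setup like this may oscillate around the target rather than settle on it.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(real_mean=4.0, real_std=1.0, steps=2000, lr=0.01, seed=0):
    """Alternate discriminator and generator updates on 1-D data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: G(z) = a * z + b
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.gauss(real_mean, real_std)  # hypothetical real sample
        z = rng.gauss(0.0, 1.0)                  # noise fed to the generator
        x_fake = a * z + b

        # Discriminator step: minimise -[log D(x_real) + log(1 - D(x_fake))].
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = -(1.0 - d_real) * x_real + d_fake * x_fake
        grad_c = -(1.0 - d_real) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimise -log D(G(z)), the non-saturating loss.
        d_fake = sigmoid(w * x_fake + c)
        grad_x = -(1.0 - d_fake) * w  # gradient of the loss w.r.t. x_fake
        a -= lr * grad_x * z
        b -= lr * grad_x
    return (a, b), (w, c)


(a, b), (w, c) = train_toy_gan()
sampler = random.Random(1)
synthetic = [a * sampler.gauss(0.0, 1.0) + b for _ in range(5)]
```

The discriminator's feedback is the only signal the generator receives: it never sees a real sample directly, only the gradient telling it how to look more "real".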

Another widely used method is the Synthetic Minority Over-sampling Technique (SMOTE). This technique is especially valuable for addressing any imbalances in datasets, such as the underrepresentation of certain diseases or particular patient demographics.

SMOTE generates synthetic examples of minority classes by interpolating between existing minority samples and their nearest neighbours. This creates more balanced training sets, ultimately leading to more accurate and inclusive AI models within the healthcare context.
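
The core interpolation step of SMOTE can be sketched as follows. This is a simplified illustration written with the standard library only, not the reference implementation: the function name, parameters and the four made-up feature vectors are assumptions for the example, and maintained libraries handle details such as categorical features that are ignored here.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up, already-scaled feature vectors for a rare condition.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=6, k=2)
```

Because each new point lies on a line segment between two existing minority samples, the synthetic examples stay inside the region the minority class already occupies rather than being arbitrary noise.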

Addressing bias and ensuring fairness

As synthetic data becomes more prevalent across the industry, it’s crucial for organisations to address one of its potential pitfalls: the reinforcement of bias. Just as real-world data can reflect societal or systemic biases, synthetic data can inadvertently carry those same patterns if generated from biased source material.

This can pose significant risks, especially in the healthcare sector, where biased data can lead to unequal or even harmful outcomes for underrepresented groups.

To mitigate this, organisations must implement robust processes to audit and refine their synthetic datasets. This includes evaluating training data sources, applying fairness metrics and conducting regular assessments of model outputs across different demographic groups. Beyond this, it’s also crucial to rigorously test and validate AI models trained on synthetic data against real-world datasets to ensure their performance, reliability, fairness and applicability in practical scenarios.
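
One way to picture an assessment of model outputs across demographic groups is to compare a single metric per group. The sketch below uses recall (the share of true cases the model catches) and reports the gap between the best and worst served group. The choice of metric, the function names and the labels are all assumptions made for illustration; a real audit would use several metrics and real held-out data.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """Return recall (true positive rate) for each demographic group."""
    positives = defaultdict(int)
    true_positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 1:
                true_positives[group] += 1
    return {g: true_positives[g] / positives[g] for g in positives}


def recall_gap(rates):
    """Difference between the best and worst served group."""
    return max(rates.values()) - min(rates.values())


# Made-up labels and predictions for two hypothetical patient groups.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0]
groups = ["A"] * 6 + ["B"] * 6

rates = recall_by_group(y_true, y_pred, groups)
gap = recall_gap(rates)
```

In this made-up example the model catches three of four true cases in group A but only one of four in group B, so the gap is 0.5. A gap of that size would be a signal to revisit the source data or the synthetic generation process before deployment.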

The goal is to ensure that AI systems trained on synthetic data produce results that are both equitable and accurate. Proactively managing the integrity of synthetic datasets enables healthcare organisations to build more trustworthy models, foster public confidence in AI solutions and deliver better, more equitable care.

Scaling AI solutions with synthetic data

As organisations look to scale their AI initiatives, synthetic data is emerging as a powerful enabler of speed, flexibility and efficiency. Unlike real-world data, which often involves lengthy processes of collection, cleaning and anonymisation, synthetic data can be generated programmatically, often in a matter of hours or days rather than weeks or months. This allows teams to quickly produce datasets tailored to their exact specifications, reducing the time spent wrangling imperfect or sensitive data.

Organisations can create training datasets on demand, eliminating bottlenecks in the AI development process. On-demand generation also allows for iterative model refinement: developers can test, tweak and retrain their models using updated synthetic datasets, accelerating innovation and improving outcomes.

Moreover, synthetic data plays a critical role in expanding model coverage. For instance, AI models often struggle with edge cases or rare events due to their limited presence in real-world data. Synthetic data can be strategically generated to simulate these underrepresented scenarios, helping models become more robust and capable of handling complex, real-life conditions.

Crucially, synthetic data isn’t intended to replace real-world data – it complements it. This hybrid approach enables healthcare organisations to scale AI solutions confidently, resulting in a faster path from development to deployment, more reliable outcomes and smarter systems that are better equipped to meet the demands of modern healthcare.

Gaining a competitive advantage

SAS VP of Data Ethics, Reggie Townsend, states: “The rapid adoption of AI in healthcare needs responsible innovation and responsible innovators.”

This couldn’t be more true: as AI capabilities continue to advance, the pressure on healthcare organisations to innovate swiftly and responsibly has never been greater. Synthetic data is proving to be a critical asset in this race, enabling faster development cycles, reducing reliance on sensitive data and dramatically lowering costs.

But the value of synthetic data goes far beyond operational efficiency. It’s not merely a workaround for data scarcity or privacy concerns – it’s a strategic driver of innovation.

This shift is reshaping the future of AI in healthcare, enabling breakthroughs that were previously constrained by access, regulation, or bias. It allows organisations to explore new use cases, personalise patient care and deliver AI-driven insights at scale – all without compromising trust or security.

In an industry where accuracy, speed and ethics are non-negotiable, synthetic data is becoming a cornerstone of competitive advantage. Healthcare providers that embrace this technology will not only accelerate their AI journey but also lead the charge in building a more intelligent, equitable and data-driven future.
