Microsoft synthetic datasets, synthetic clinical data hubs, data for Canadian smart cities
Concrete synthetic data examples and applications that will speed up the adoption of synthetic data next year
In this first and last edition of the year, no predictions, but some examples of trends that, in my opinion, will speed up the adoption of synthetic data next year:
Making synthetic datasets available on demand or via synthetic data catalogs will allow people to quickly test and evaluate the performance of models and analyses built on synthetic data.
Collaboration across entities will lower technology costs and make it possible to build larger synthetic data pools that benefit all industry stakeholders.
Providing intuitive interfaces and embedding synthetic data capabilities in industry-specialized tools will make synthetic data more accessible to domain experts.
📰 Synthetic data in the wild
SYNTHEMA Launches Cross-Border Hub for Developing AI Techniques in Rare Hematological Diseases: 16 partners joined forces under an EU Horizon-funded initiative to develop AI models for anonymization and synthetic data generation of clinical data in rare hematological diseases. The project will focus on increasing the data samples of these diseases and use a federated learning infrastructure to connect clinical sites, academia, and SMEs across Europe. It also aims to promote interoperability standards and GDPR-compliant research in rare hematological diseases. (link)
£1m to advance regulation of AI and synthetic clinical trial data in the UK: The grant will fund three projects for the Medicines and Healthcare Products Regulatory Agency (MHRA). The first project will develop synthetic datasets to mimic real patient data in clinical trials, while the second will produce a methodology for regulating both transparent and complex AI models. (link)
A differentially private public synthetic dataset to build support systems for anti-trafficking efforts: The International Organization for Migration (IOM) and Microsoft have released the first public synthetic dataset on human trafficking victim-perpetrator relations. Generated with differential privacy to enable data sharing, the dataset is meant to improve the use of and access to data by supporting interactive exploration and ML. The new synthesizer is available within the OpenDP initiative in Microsoft’s SmartNoise library. (link)
Synthetic data for Canadian smart cities to improve privacy and data quality: Toronto-based nonprofit Innovate Cities and synthetic data generation provider Replica Analytics are partnering to provide Canadian municipalities with synthetic data for smart city projects, using real data as a basis. The synthetic data will be generated using Replica Analytics' SDG technology and used in CityShield, Innovate Cities' data trust, which aims to protect municipal data up to EU General Data Protection Regulation standards. (link)
New framework for generating synthetic Electronic Health Records (EHRs): Researchers from Google developed EHR-Safe, a framework to generate synthetic EHRs that are both high-fidelity and meet privacy constraints. Based on a sequential encoder-decoder architecture and generative adversarial networks (GANs), it can deliver downstream performance similar to real data when used to train diagnostic models. (link)
⚙ New synthetic data companies and tools
Synthetic Datasets is an online dataset store for synthetic image data that takes advantage of the recent advent of image generation models. (link)
Synthetic Future provides on-demand image data for object detection. (link)
Synthetic Data Directory lists existing synthetic data companies and tools. (link)
Red Flag Test uses synthetic data that mimic money laundering transactional patterns to check the performance of transaction monitoring systems. (link)
📣 From the community
David Pujol, Amir Gilad, and Ashwin Machanavajjhala designed PreFair, a system for differentially private (DP) fair synthetic data generation. (link)
In a blog post co-written with ChatGPT, I explained what Differential Privacy is and how it's used. (link)
Using a data set on Titanic passengers, Noa Zamstein demonstrates how to generate and use synthetic data to make statistical inferences about the original data set while protecting privacy. (link)
I'll post this content every two weeks. You can subscribe below to receive it by email. Have a great end of the year. ✌