What is Federated Machine Learning?
And how AI companies use it to get around privacy regulations
Many AI applications rely on personalization to be useful for a user. With rising AI competition around the globe, we’ve seen many countries enact data localization and privacy regulations. These regulations restrict access to the personal information of a country’s citizens, either by requiring that their data be stored within the country’s borders or by placing strict limits on data exports.
Notable regions and countries that have enacted policies like this are the European Union, China, Russia, and India, all of which have a massive potential user base for AI products and can be large players in AI at the global scale.
We’re going to go over why a country enacts this regulation, what it means for companies creating AI products, and what federated machine learning is and how it circumvents regulatory restrictions legally.
Why this regulation?
There are two primary reasons countries enact this type of regulation:
To protect their citizens’ data from being misused by a foreign entity. Once a person’s data has geographically left the country, it’s impossible to ensure its safety.
To improve their position in the global AI race. Personal data is very valuable for training AI models and countries that can keep their competitors from having access to their citizens’ data are at an advantage.
This regulation has a very real two-fold impact: it affects the companies building AI products, and it affects the citizens of the countries that enact it.
What does this regulation impact?
Companies wanting to train models on personal data across borders where localization laws are in play can’t just download that data, transfer it to their data center, and start training with the rest of their data. That would be illegal. This is especially tricky when data from multiple areas with localization laws is needed, and it creates situations where training is either logistically complex or seemingly impossible.
Large companies with data centers in many regions can more easily get around these regulations (legally, of course) by training models in regions where the data can be legally accessed. Larger companies also tend to have enough data to fully train models without relying on data across multiple regions. Even in these cases, a great deal of money, time, and care is spent by these companies ensuring regulatory compliance.
Unfortunately, the same can’t be said for startups. Without the time and resources to ensure compliance, startups are forced to rely on a larger company’s infrastructure (think Google, Amazon, or Microsoft’s cloud offerings) to train their models. Even with the proper infrastructure, smaller companies may not have the data necessary to train in regions with data localization.
This means startups may not be able to offer their services or products to users in those regions. This is the second fold of the impact of localization laws: users miss out on AI products or get them on a delayed schedule while compliance is ensured. This is something we’ve seen occur many times in the European Union.
What is federated machine learning and how does it help?
There is a possible solution: federated machine learning. Many people are familiar with federation due to federated social media platforms and understand it as another term for distribution. It’s important to note that when it comes to machine learning, federated training isn’t the same as distributed training. Federated training is a subset of distributed training.
Distributed training is defined as training models across multiple machines. This is the norm for modern AI, as large-scale models require numerous chips split across multiple machines to train. A data center is really just a house for many computers all connected together on a local network. LLMs and other modern machine learning applications use data centers for fast, efficient training.
Federated training takes place across multiple machines but instead of being defined by the location of training, it’s defined by the location of data. An overview of federated learning looks something like this:
A model is sent to multiple machines across the internet.
The model is trained on the separate machines using the data on that machine.
When training is done, the model updates are sent to a central node.
Model updates are reconciled across all machines on that central node.
The model is updated and the updated model is sent back to the machines to continue training.
Instead of training across machines on the same local network, federated learning leverages training machines across the internet to train a single model on sensitive data without that data having to leave the premises. Only the model updates are sent back to the central server for reconciliation.
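To make that loop concrete, here is a minimal sketch of it in plain Python. Everything in it is an assumption made for illustration: the model is a tiny line fit (y = w*x + b), the three "machines" are just lists in one process, and the names local_train, federated_round, and make_client_data are hypothetical. A real system would send the weights over the network and use a proper ML framework.

```python
import random


def local_train(weights, data, lr=0.05, epochs=5):
    """Train a copy of the model y = w*x + b on one machine's private data."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error on this example
            w -= lr * err * x       # gradient step for the slope
            b -= lr * err           # gradient step for the intercept
    return (w, b)


def federated_round(global_weights, client_datasets):
    """One round: send the model out, train locally, average what comes back."""
    # Each machine trains on its own data. Only weights are returned,
    # never the (x, y) examples themselves.
    local_models = [local_train(global_weights, data) for data in client_datasets]
    # The central node reconciles the results with a simple average.
    n = len(local_models)
    new_w = sum(m[0] for m in local_models) / n
    new_b = sum(m[1] for m in local_models) / n
    return (new_w, new_b)


def make_client_data(rng, n):
    """Hypothetical private dataset: noisy samples of y = 2x + 1."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.01)) for x in xs]


rng = random.Random(0)
clients = [make_client_data(rng, 20) for _ in range(3)]  # three separate "machines"

weights = (0.0, 0.0)
for _ in range(30):
    # The updated model goes back out for another round of local training.
    weights = federated_round(weights, clients)

print(weights)  # should land close to the true values (2, 1)
```

Notice that federated_round never reads a client's examples directly. It only sees the weights each client hands back, which is the property that matters for data localization.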
This reconciliation process generally uses federated averaging. Simply put, this updates the central model based on the average of the separate model updates, commonly weighted by how much data each machine trained on. More sophisticated algorithms can also weight the different training jobs based on their quality and relevance to the training task.
The most important takeaway about federated learning is that training data never leaves its origin. This allows models to be trained across regions while complying with data localization laws. This is especially useful for sensitive data such as medical records, intellectual property, or anything with personally identifiable information. Another added benefit of federated training is that it reduces the risk of a large-scale data breach that comes with hosting large datasets on a single machine.
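Here is a rough sketch of what that averaging step can look like. It assumes each machine sends back its weights as a flat list of numbers, and the function name federated_average and the client_sizes parameter are hypothetical. Weighting by the number of local examples is one common choice. A quality or relevance score could be passed in the same slot instead.

```python
def federated_average(client_updates, client_sizes=None):
    """Average per-client weight vectors, optionally weighted per client.

    client_updates: list of equal-length lists of floats, one per client.
    client_sizes:   optional list of weights (for example, number of local
                    training examples). If omitted, every client counts equally.
    """
    if not client_updates:
        raise ValueError("need at least one client update")
    if client_sizes is None:
        client_sizes = [1] * len(client_updates)
    if len(client_sizes) != len(client_updates):
        raise ValueError("client_sizes must match client_updates in length")

    total = sum(client_sizes)
    dim = len(client_updates[0])
    # For each parameter position, take the weighted mean across clients.
    return [
        sum(size * update[i] for update, size in zip(client_updates, client_sizes)) / total
        for i in range(dim)
    ]


updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(federated_average(updates))                # plain mean: [3.0, 4.0]
print(federated_average(updates, [10, 10, 80]))  # third client dominates: [4.4, 5.4]
```

In the second call the third machine holds 80% of the data, so the result is pulled toward its update.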
Of course, federated learning doesn’t come without its compromises. It:
Depends on latency, bandwidth, and availability across multiple machines, which can slow the training process.
Adds infrastructure complexity as model updates are transmitted via the internet, reconciled, and sent back to training machines.
Introduces potential data quality issues due to differences in quality, organization, and bias within the separate datasets.
As we see more AI and data privacy regulation introduced, the benefits of federated training will likely begin to outweigh the compromises.
Let me know what you think and feel free to ask any questions in the comments.
Always be (machine) learning,
Logan
Awesome read, thanks Logan!
Like you said, federated learning is essential in medicine and health these days. As a matter of fact, it's now the resort for those who want to get their hands on biological data. Take Western and Eastern Europe. The former oftentimes either refuses to share sequencing data with the latter, even when fully anonymized, or charges so much for it that grants can barely cover the cost.