To perform its magic, a machine learning model or algorithm must be "trained" to discern patterns of interest in the data it ingests. The accuracy of the model depends heavily on the amount and quality of the data used to train it. That's why, for most real-world use cases, producing an effective and useful AI/ML model requires huge amounts of training data. And that presents a problem with respect to privacy.
- For our purposes in this article, we’ll use the terms AI, ML, and AI/ML interchangeably.
Privacy Is a Major Issue for AI Today
Here's an example of the problem. Developing AI/ML algorithms that can reliably assist physicians in diagnosing medical conditions requires that the models be trained on immense quantities of data from real patients. The amount and variety of data required is far beyond what a single hospital could provide. Traditionally, that has meant pooling the data from many institutions in a centralized repository to aggregate the huge amount required for training the ML model. But with today's emphasis on privacy, sharing patients' personal information has become extremely problematic. The European Union's General Data Protection Regulation (GDPR), for example, generally forbids exchanging an individual's personal information (PI) between organizations without that person's express permission. It also gives individuals control over the uses to which their information can be put. The impracticality of obtaining consent from each person whose data forms part of a training dataset severely limits the development of effective AI/ML diagnostic assistants. But a new approach, introduced by Google in 2017 and called federated learning, allows AI models to be trained without the need to share and consolidate private information.
What Is Federated Learning?
Federated learning was developed as a means of eliminating the requirement for a central store of raw data for AI model training. Instead, model training is carried out at each data source. (Examples of data sources, often referred to as endpoint devices or clients, include consumers' smartphones, IoT devices, autonomous vehicles, and electronic health information systems.) Only model updates, and never the raw data residing on the endpoint devices, are sent to a central location. Here's how it works.
The Learning Process
First, a generic machine learning model is generated at a central server. This model, which is nothing more than a starting baseline, is distributed to all endpoint or client devices. In the case of smartphones or IoT devices, for example, these could number in the millions. It is on the clients that the raw data, including any potentially sensitive or protected personal information, resides.

Each client updates the ML model it receives from the central server, using its own data as training inputs. The client then returns its locally updated model to the central server, which aggregates the updates from all clients and uses them to generate a new baseline model. The new baseline is then distributed to the clients, and the cycle is repeated until the baseline is optimized.
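To make the cycle concrete, here is a minimal sketch in Python of what one simulated round of this process might look like. The toy linear model and the function names (local_update, federated_average) are illustrative assumptions for this article, not Google's actual implementation.

```python
# A minimal sketch of the federated learning cycle, using NumPy and a
# toy linear-regression model as a stand-in for a real network.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.01, epochs=5):
    """Client-side step: train the received baseline on local data only."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w  # only the updated weights leave the device, never the data

def federated_average(client_weights, client_sizes):
    """Server-side step: average updates, weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Simulate a handful of clients, each holding private data that, in a
# real deployment, would never leave the device.
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(5)]

global_model = np.zeros(3)           # generic baseline from the server
for round_num in range(10):          # repeat until the baseline is optimized
    updates = [local_update(global_model, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    global_model = federated_average(updates, sizes)
```

Note that the server sees only weight vectors: the raw training examples held by each client never appear in the aggregation step.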
Why This Process Is Valuable
In its announcement of this new technology, Google provided a concrete, real-world example of its value. Although most users are unaware of it, whenever they type text into their smartphones, they are using AI: smartphones rely on an AI-based predictive text model to suggest the next word as the user types. As Karen Hao, artificial intelligence reporter for the MIT Technology Review, notes in a recent article, it is federated learning that "allowed Google to train its predictive text model on all the messages sent and received by Android users—without ever actually reading them or removing them from their phones."
Impact on Machine Learning
Federated learning is expected to fundamentally change how AI models are developed. A good example of that transformation is the way medical AI models are trained. Before the advent of federated learning, the necessity of amassing huge quantities of data at a central location severely limited researchers' ability to develop effective AI diagnostic models. As Karen Hao notes, most organizations today have only a limited supply of internally generated data they can use in training their AI models, and they face huge obstacles, due to legal, regulatory, or business restrictions, in acquiring valid training data from other organizations to augment it. Federated learning should give a tremendous boost to the use of AI in areas such as medicine, IoT, and autonomous vehicles by allowing organizations to collaborate in building accurate AI models while keeping their sensitive personal or business data safely in-house.
Potential Issues
Training AI models is a compute- and memory-intensive process. Because federated learning requires that such training take place on endpoint devices such as smartphones, autonomous vehicles, or IoT devices, the compute load could be disruptive to those devices' normal functions. One approach to mitigating this difficulty is to schedule AI model training for times when the device would normally be idle.

In addition, having perhaps millions of devices sending and receiving model updates across a network could create bandwidth problems. Google has addressed this issue with its Federated Averaging algorithm, which can train deep networks using 10-100x fewer rounds of communication than a naive federated implementation.

Another, perhaps more serious issue is the vulnerability of federated learning to what's called "model poisoning." Because a federated learning AI model is developed by ingesting model updates from large numbers of endpoint devices, malicious actors may be able to compromise the final model by fabricating or "poisoning" the updates sent from some of those devices, potentially creating back doors into the model. Because model update data is extremely difficult for humans to interpret, and because keeping the source of model updates anonymous is a design feature of many federated learning implementations, identifying the source, or even the existence, of tainted information fed into the baseline model could be extremely difficult. Protecting against this possibility will probably involve developing some kind of "set a good AI model to catch a bad AI model" strategy.
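To illustrate the poisoning risk, here is a small, hypothetical sketch of one defensive idea researchers have explored: replacing the plain average of client updates with a coordinate-wise median, which a few fabricated updates cannot easily skew. This is just one illustrative mitigation under assumed conditions, not a description of any production system's defenses.

```python
# A sketch of robust aggregation: a coordinate-wise median resists a
# small number of poisoned client updates far better than a plain mean.
import numpy as np

def median_aggregate(client_weights):
    """Aggregate updates by taking the median of each model parameter."""
    stacked = np.stack(client_weights)   # shape: (num_clients, num_params)
    return np.median(stacked, axis=0)

# Eight honest clients send updates near the true weights [1, 2, 3] ...
honest = [np.array([1.0, 2.0, 3.0])
          + np.random.default_rng(i).normal(scale=0.1, size=3)
          for i in range(8)]
# ... while one malicious client fabricates an extreme "poisoned" update.
poisoned = [np.array([100.0, -100.0, 100.0])]

print(np.mean(np.stack(honest + poisoned), axis=0))  # mean is badly skewed
print(median_aggregate(honest + poisoned))           # median stays near [1, 2, 3]
```

The trade-off is that robust aggregators like this can discard useful information from legitimate but unusual clients, which is part of why detecting poisoned updates remains an open research problem.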
The Future of Federated Learning
The ability to train AI/ML models without violating data privacy is a huge technological advancement. That’s why federated learning has the potential to be a game-changer in many AI application areas, including computer vision, natural language processing, health care, autonomous vehicles, IoT, and the large-scale prediction and recommendation applications used in e-commerce systems. It would be no exaggeration to say that, to a significant degree, federated learning is reshaping the future of AI. © 2020 Ronald E Franklin