Afsana Khan

Universiteit Maastricht

Publication date: 11 June 2026

University: Universiteit Maastricht

ISBN: 978-94-6534-390-7

UNLOCKING VALUE OF DATA WITH VERTICAL FEDERATED LEARNING

Summary

Organisations generate and store increasing amounts of data through their daily activities. Customer interactions, financial records, medical information, and digital services all produce data that can help organisations understand patterns, improve decisions, and support their operations. Machine learning has become an important way to make use of this data. By learning from existing data, a machine learning model can identify relevant patterns and make predictions for new situations. This has helped organisations detect risks, improve planning, and gain a clearer view of the factors that influence different outcomes.

Although the value of data is widely recognised, that value often cannot be fully realised because the information needed for a strong model is rarely held by one organisation alone. Different parties possess different parts of the information, and these parts would be most useful when combined. However, the parties are often unable or unwilling to share their data because it contains sensitive details, is subject to legal requirements, or has commercial importance. When data remains siloed in this way, it cannot be gathered in a single place, and conventional machine learning cannot be applied. This situation has increased the need for methods that allow organisations to collaborate without revealing their raw data.

Federated learning offers a way for organisations to train a shared model while keeping their data local. Each party trains on its own data and shares only model updates or intermediate results, never the raw data. How this collaboration works depends on how the data is partitioned. In horizontal federated learning (HFL), organisations hold the same features for different individuals, so each can train a complete local model, and the updates can be combined easily. This makes the process relatively simple. In vertical federated learning (VFL), organisations hold different features about the same samples, so no single party has all the information needed to train a full model. The model must therefore be split, and intermediate representations must be exchanged during training. This setup increases coordination needs, raises communication and privacy concerns, and makes it harder to judge which parties provide useful information. Despite these difficulties, vertical federated learning allows organisations to use complementary information that cannot be shared directly. These characteristics make VFL both challenging and valuable, and they form the focus of this dissertation.

In Chapter 2, we define federated learning and explain how it applies the same loss-minimisation and gradient descent principles used in conventional machine learning, but in a distributed setting where data remains local and is never shared between parties. The chapter also defines horizontal and vertical federated learning and shows how they differ technically. Horizontal learning allows each party to train a full local model, while vertical learning requires the exchange of intermediate representations because no single party has all the features.

Chapter 3 answers the first guiding question through a structured review of the vertical federated learning literature across major databases. The review is organised into a lifecycle view with stages covering foundations, development, evaluation and management, and deployment. This organisation clarifies how work in VFL fits together and reveals important gaps. These gaps shaped the remaining chapters of the dissertation.
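The split model described for vertical federated learning can be illustrated with a minimal sketch. It assumes a logistic regression whose weights are divided between two parties over already aligned samples; the `Party` class, the `train_vfl` function, and the toy numbers are hypothetical and not taken from the dissertation. Each party sends only its partial logits, and the active party, which holds the labels, returns per-sample residuals. No privacy protection is applied here, so the exchanged values could still leak information in a real deployment.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class Party:
    """One participant: a vertical slice of the features plus its own weights."""

    def __init__(self, features):
        self.features = features  # raw rows never leave this object
        self.weights = [0.0] * len(features[0])

    def partial_logits(self):
        # Intermediate result shared with the active party: one number per sample.
        return [sum(w * x for w, x in zip(self.weights, row)) for row in self.features]

    def apply_residuals(self, residuals, lr):
        # residuals[i] is dLoss/dLogit for sample i, sent back by the active party.
        n = len(residuals)
        grads = [
            sum(r * row[j] for r, row in zip(residuals, self.features)) / n
            for j in range(len(self.weights))
        ]
        self.weights = [w - lr * g for w, g in zip(self.weights, grads)]


def train_vfl(parties, labels, rounds=200, lr=0.5):
    """Active-party loop: sum partial logits, compute the loss, return residuals."""
    losses = []
    for _ in range(rounds):
        partials = [p.partial_logits() for p in parties]
        probs = [sigmoid(sum(per_sample)) for per_sample in zip(*partials)]
        losses.append(
            -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, probs))
            / len(labels)
        )
        residuals = [p - y for p, y in zip(probs, labels)]
        for party in parties:
            party.apply_residuals(residuals, lr)
    return losses


# Toy, hypothetical data: four aligned samples, features split across two parties.
party_a = Party([[-1.0, 0.5], [1.0, -0.5], [-0.5, -1.0], [0.8, 1.0]])
party_b = Party([[-1.0], [0.5], [-0.8], [1.2]])
labels = [0, 1, 0, 1]  # held only by the active party

losses = train_vfl([party_a, party_b], labels)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the logit is a sum over feature blocks, this particular split reproduces the gradient descent updates of a centralised logistic regression on the concatenated features. Each training round still needs one exchange in each direction, which is the communication cost mentioned above.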

Chapter 4 focuses on selecting suitable participants in a vertical federation. This problem matters because each party contributes different features, and including irrelevant or redundant features increases communication while providing little benefit. This dissertation proposes VFL-RPS, a privacy-preserving method that estimates the relevance of each party’s features before training begins. It combines secure computation of correlations among the parties' features, redundancy detection, and a forward selection strategy to identify complementary participants. The results show that the method can preserve or improve model performance while involving far fewer parties, reducing both computation and communication. The approach also avoids the heavy requirements of mutual-information-based techniques and handles realistic scenarios with overlapping or irrelevant features.

Chapter 5 addresses communication efficiency in VFL. Standard VFL requires repeated exchanges of intermediate results during training, which is impractical in settings with slow networks, limited bandwidth, or many participants. To reduce this burden, the dissertation proposes a one-shot method in which each party compresses its local data into latent representations and shares these once with the active party. Two approaches are explored: principal component analysis (PCA) for linear patterns and autoencoders for nonlinear ones. This replaces many iterative exchanges with a single round of communication. Experiments across several datasets show that the aggregated model consistently performs better than the local model and remains close to the centralised model, with the expected small trade-off in accuracy. Higher compression also improves privacy by limiting the detail in the shared representations. This demonstrates that practical VFL does not need to rely on expensive iterative communication and can be adapted to constrained environments.

Chapter 6 considers how incentives can be shared fairly among participants. In many VFL settings, the active party gains the most direct benefit from the final model because it holds the labels, while passive parties provide additional features. Without a fair way to share benefits, it is difficult to maintain long-term collaboration. This chapter reframes the problem as a bankruptcy game, where the total performance gain is the estate, and each party’s marginal contribution is its claim. Because contributions can overlap, the sum of claims can exceed the total gain, making this framing suitable. The Talmud rule, also known as the contested-garment rule, is used to obtain the nucleolus allocation efficiently, preserving fairness without requiring exponential coalition computations. Experiments show that this method provides stable and intuitive allocations, rewards genuine contributions proportionally, and remains robust to strategic behaviour. Compared with existing baselines, it offers a clearer and more balanced incentive structure that supports sustained federations.

Chapter 7 explores hybrid data partitions, where vertical and horizontal splits appear together. Many real-world settings involve organisations that hold different features and observe different individuals. This chapter proposes a hybrid federated learning method in which each party creates an embedding of its own features, and the active party combines these embeddings with its own representation to make predictions.
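The bankruptcy framing used in Chapter 6 can be made concrete with a small worked example. The sketch below implements the standard textbook form of the Talmud rule: when the estate is at most half of the total claims, every party receives an equal award capped at half its claim; otherwise every party bears an equal loss capped at half its claim. The function names and the numbers are illustrative assumptions (the claims 100, 200, 300 are the classic example from the bankruptcy literature, not results from the dissertation), and the sketch does not cover how the dissertation measures marginal contributions.

```python
def constrained_equal_awards(caps, amount):
    """Split `amount` as equally as possible without exceeding any party's cap."""
    awards = [0.0] * len(caps)
    remaining = amount
    # Serve the smallest caps first so that unused share flows to larger ones.
    order = sorted(range(len(caps)), key=lambda i: caps[i])
    for rank, i in enumerate(order):
        equal_share = remaining / (len(caps) - rank)
        awards[i] = min(caps[i], equal_share)
        remaining -= awards[i]
    return awards


def talmud_rule(estate, claims):
    """Allocate `estate` among `claims` with the Talmud (contested-garment) rule."""
    total = sum(claims)
    if not 0 <= estate <= total:
        raise ValueError("estate must lie between 0 and the sum of claims")
    half_claims = [c / 2 for c in claims]
    if estate <= total / 2:
        # Small estate: equal awards, each capped at half the claim.
        return constrained_equal_awards(half_claims, estate)
    # Large estate: equal losses, each capped at half the claim.
    losses = constrained_equal_awards(half_claims, total - estate)
    return [c - loss for c, loss in zip(claims, losses)]


# Hypothetical reading for VFL: the estate is the total performance gain and
# each claim is one party's marginal contribution, in the same units.
claims = [100, 200, 300]
for estate in (100, 200, 300, 400):
    print(estate, [round(x, 2) for x in talmud_rule(estate, claims)])
```

The rule needs only one sort of the claims, which is why it avoids enumerating coalitions.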

A financial crime detection scenario illustrates how transactions create shared samples across otherwise disjoint customer groups, forming a hybrid structure. Experiments show that the method can effectively learn from such partitions, outperforming models trained only on local features and approaching the performance of a centralised model while keeping data local. This extends federated learning to more realistic and complex data environments.
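One possible reading of this hybrid structure is sketched below, purely as an illustration. It assumes two banks with disjoint customers and an active party that holds the transaction records; the `Bank` class, the fixed linear embedding weights, and all numbers are hypothetical and are not the architecture or data used in the dissertation. The sketch shows only how a transaction links accounts at different banks into one model input, not how the embeddings are trained.

```python
def embed(features, weights):
    """Local linear embedding: one output value per weight row (illustrative only)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


class Bank:
    """Passive party: private account features plus a local embedding function."""

    def __init__(self, accounts, weights):
        self._accounts = accounts  # account id -> private feature vector
        self._weights = weights    # hypothetical fixed embedding weights

    def embedding_for(self, account_id):
        # Only the embedding leaves the bank, never the raw features.
        return embed(self._accounts[account_id], self._weights)


# Two banks with disjoint customers and different feature sets.
banks = {
    "A": Bank({"A1": [0.9, 0.1], "A2": [0.2, 0.7]}, weights=[[1.0, 0.0], [0.5, 0.5]]),
    "B": Bank({"B1": [0.4, 0.4, 0.2]}, weights=[[1.0, 1.0, 1.0], [0.0, 1.0, -1.0]]),
}

# The active party's own data: each transaction joins two accounts into one sample.
transactions = [
    {"amount": 0.3, "sender": ("A", "A1"), "receiver": ("B", "B1")},
    {"amount": 0.8, "sender": ("B", "B1"), "receiver": ("A", "A2")},
]


def model_input(tx):
    """Concatenate the active party's features with sender and receiver embeddings."""
    sender_bank, sender_id = tx["sender"]
    receiver_bank, receiver_id = tx["receiver"]
    return (
        [tx["amount"]]
        + banks[sender_bank].embedding_for(sender_id)
        + banks[receiver_bank].embedding_for(receiver_id)
    )


inputs = [model_input(tx) for tx in transactions]
for row in inputs:
    print([round(v, 2) for v in row])
```

In this sketch both embeddings have the same length, so every transaction yields an input of fixed size regardless of which banks are involved.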

Chapter 8 concludes the dissertation by bringing together the insights from all chapters and outlining directions for future work: handling drift in VFL, making VFL more explainable, and applying VFL to large language models. The chapters collectively show that VFL can unlock value that remains inaccessible when organisations keep their data siloed. By identifying suitable partners, reducing communication, sharing incentives fairly, and supporting hybrid data partitions, the methods developed in this dissertation make it possible to use distributed information without sharing raw data. These elements form a practical foundation for collaborations in which data cannot be shared but where combining insights can lead to outcomes that no single organisation could achieve alone.