Model Merging by Output Space Projection - a case study by Bethan Evans

Modern machine-learning models are often adapted, or fine-tuned, for specific tasks from a pre-trained base model. One model might perform well on a particular language task, another on an image-classification problem, and another on a different domain altogether. Model merging asks whether these specialised models can be combined into a single model that performs well across tasks, without retraining from scratch.

Existing merging methods have shown that this is often possible in practice. They typically combine fine-tuned parameter updates from several tasks using averaging, pruning, or rescaling rules. Perhaps surprisingly, these simple methods of combining weights often perform well empirically. However, it is not always clear when or why a particular merging rule should succeed.

In our work, we aim to characterise this more precisely. We formulate model merging as a convex quadratic programme over the fine-tuned updates. Its solution gives merging weights that minimise a squared output error on a set of calibration inputs, measured against fine-tuned model outputs. This framework also includes several existing merging methods as special cases.

Consider merging parameters in the $N$th layer of a neural network $h(x; \theta_0)$ with base parameters $\theta_0$. If $\delta_N^{(k)}$ denotes the $N$th-layer parameter update from the $k$-th fine-tuned model, then we consider merged updates of the form \[ \delta_{\mathrm{merge}}=\sum_{k=1}^K D_k\delta_N^{(k)}, \] where each $D_k$ is a diagonal weighting matrix that rescales the rows of $\delta_N^{(k)}$. The feasible set therefore contains common merging strategies as special cases (for example, $D_k = \frac{1}{K} I$ gives uniform averaging), while also allowing the weights to be chosen optimally.
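As a concrete illustration of the merged update above, the following minimal sketch forms $\sum_k D_k\delta_N^{(k)}$ with NumPy. The function name `merge_updates`, the array shapes and the random data are assumptions made for this example only, and each $D_k$ is stored as a vector holding its diagonal.

```python
import numpy as np


def merge_updates(deltas, row_scales):
    """Return sum_k D_k @ delta_k, where D_k = diag(row_scales[k])."""
    merged = np.zeros_like(deltas[0], dtype=float)
    for delta, scale in zip(deltas, row_scales):
        # Left-multiplying by a diagonal matrix rescales each row of delta.
        merged += scale[:, None] * delta
    return merged


rng = np.random.default_rng(0)
K, rows, cols = 3, 4, 5  # hypothetical sizes: 3 fine-tuned models, 4x5 layer
deltas = [rng.normal(size=(rows, cols)) for _ in range(K)]

# Uniform averaging corresponds to D_k = I / K for every k.
avg_scales = [np.full(rows, 1.0 / K) for _ in range(K)]
merged_avg = merge_updates(deltas, avg_scales)
```

Storing only the diagonals avoids building each $D_k$ explicitly, so the cost is one elementwise multiplication per fine-tuned model.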

Stacking the diagonals of the $D_k$ into a vector $\mathbf d$ and concatenating the $N$th-layer fine-tuned updates into a matrix $\boldsymbol{\delta}_N$, we select the merge by minimising the squared-error loss on a chosen calibration dataset $F$, \[ \min_{\mathbf d}\sum_{j\in F} \left\|h(x_j;\theta_0 + \boldsymbol{\delta}_N\mathbf d )-y_j\right\|^2, \] where the targets $y_j$ may be labelled outputs or outputs of the fine-tuned models themselves. When the network is linearised in the fine-tuned updates, this becomes a convex quadratic optimisation problem, so the optimal diagonal weights can be found globally rather than chosen heuristically.
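To make the quadratic problem concrete, here is a minimal sketch for a toy case in which the "network" is a single linear layer $h(x; W) = Wx$, so it is exactly linear in the update and no approximation is needed. This toy model, the synthetic data and the names `calib_loss` and `D_opt` are assumptions for the example, not the setup used in the work described here. Because each $D_k$ scales rows, the objective separates into one small least-squares problem per output coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K, num_cal = 4, 6, 3, 50  # output dim, input dim, models, calibration points

W0 = rng.normal(size=(m, n))                # base layer weights
deltas = 0.1 * rng.normal(size=(K, m, n))   # fine-tuned updates
X = rng.normal(size=(num_cal, n))           # calibration inputs

# Hypothetical targets: each calibration input is labelled by the output
# of one fine-tuned model, chosen at random.
task = rng.integers(0, K, size=num_cal)
Y = np.stack([(W0 + deltas[task[j]]) @ X[j] for j in range(num_cal)])


def calib_loss(D):
    """Sum of squared output errors for diagonal weights D of shape (K, m)."""
    W = W0 + np.sum(D[:, :, None] * deltas, axis=0)
    return float(np.sum((X @ W.T - Y) ** 2))


R = Y - X @ W0.T                              # targets minus base outputs
Z = np.stack([X @ d.T for d in deltas])       # Z[k, j, i] = (delta_k x_j)_i

# Output coordinate i depends only on the K weights D[:, i], so the problem
# splits into m independent least-squares problems with K unknowns each.
D_opt = np.zeros((K, m))
for i in range(m):
    A = Z[:, :, i].T                          # shape (num_cal, K)
    D_opt[:, i] = np.linalg.lstsq(A, R[:, i], rcond=None)[0]

loss_opt = calib_loss(D_opt)
loss_avg = calib_loss(np.full((K, m), 1.0 / K))   # uniform averaging
loss_sum = calib_loss(np.ones((K, m)))            # unscaled sum of updates
```

Since uniform averaging and the unscaled sum are both feasible points of the same convex problem, `loss_opt` cannot exceed `loss_avg` or `loss_sum` on the calibration set. For a nonlinear network the same construction would apply only to a linearisation around $\theta_0$.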

The same viewpoint also gives a geometric interpretation of model merging. Rather than thinking only in parameter space, the merge can be viewed as correcting the outputs of a base model. Define the output-space residual between the base model and the calibration targets, together with its second-moment matrix, as \[ b_j=h(x_j; \theta_0)-y_j, \qquad S=\sum_{j\in F }b_jb_j^\top. \] Choosing a basis for merging is then equivalent to choosing a subspace that captures as much of this residual correction energy as possible. The best choice of basis is given by the leading eigenvectors of $S$, while diagonal merging corresponds to using the standard coordinate basis.
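The comparison between the eigenvector basis and the coordinate basis can be sketched numerically. The example below assumes one plausible reading of "captured energy": the fraction of $\operatorname{tr}(S)$ retained after orthogonal projection onto a $p$-dimensional orthonormal basis. The synthetic residuals and the helper `captured_energy` are hypothetical and serve only to illustrate the idea.

```python
import numpy as np


def captured_energy(S, U):
    """Fraction of trace(S) retained by projecting onto orthonormal columns of U."""
    return float(np.trace(U.T @ S @ U) / np.trace(S))


rng = np.random.default_rng(0)
m, num_cal = 8, 200

# Synthetic residuals b_j with correlated coordinates (hypothetical data).
mixing = rng.normal(size=(m, m))
B = rng.normal(size=(num_cal, m)) @ mixing.T   # row j is b_j
S = B.T @ B                                    # S = sum_j b_j b_j^T

# eigh returns eigenvalues in ascending order; reorder to put the largest first.
eigvals, eigvecs = np.linalg.eigh(S)
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# Coordinate basis: pick the coordinates with the largest diagonal entries of S.
coord_order = np.argsort(np.diag(S))[::-1]
identity = np.eye(m)

energy_eig, energy_coord = [], []
for p in range(1, m + 1):
    energy_eig.append(captured_energy(S, eigvecs[:, :p]))
    energy_coord.append(captured_energy(S, identity[:, coord_order[:p]]))
```

For every dimension $p$, the leading eigenvectors capture at least as much energy as the best $p$ coordinate directions, and both reach the full energy when $p = m$. The gap between the two curves is largest when the residual coordinates are strongly correlated.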

Experiments on image-classification models and LLMs support this projection picture: bases that capture more residual energy produce lower output error. This is illustrated in Figure 1, where we plot MSE against the energy captured by several choices of basis as the basis dimension increases. We observe that MSE decreases as captured energy increases, and that the leading eigenvectors of $S$ give the optimal basis.

Figure 1: MSE against captured energy for increasing basis dimensions $p$ on MNIST.

Compared with other common methods, including task arithmetic, model soups and DARE, both merging in the optimal basis and the simpler diagonal-mask framework achieve lower MSE on the same validation set. Merging in the optimal basis is significantly more computationally intensive than the diagonal-mask framework, so when the diagonal-mask approach already captures sufficient energy, it may be the better practical choice. Overall, this gives both a principled explanation of existing merging methods in terms of mean-squared error and a practical diagnostic for predicting when a merge is likely to succeed.

Bethan Evans is a postgraduate student in Oxford Mathematics.
