Machine Learning and Matrix Completion for Bioactivity Predictions


Bioactivity predictions are a crucial part of drug discovery. Given a set of proteins linked to a particular disease, e-therapeutics would like to identify compounds that will interact with these proteins, in order to disrupt the progression of the disease. Working from compound-protein-interaction matrices, which store known experimental data for interactions between compounds and proteins, the challenge is to make predictions where no data exists. These matrices are binary and known data is very sparse; fewer than 1% of entries are known. This problem is similar to that faced by companies within e-commerce wanting to develop recommender systems that recommend new products of services to existing customers.

Machine learning techniques use extra chemical and structural information about compounds and proteins to learn a model that will predict bioactivity where there is no experimental evidence, but the chemical and structural information is available. This requires appropriate selection of what extra information to use, in order to establish whether predictions are reliable beyond the heavily sampled regions (i.e. groups of compounds and proteins with substantial experimental data available). Figure 1 illustrates how different machine learning methods perform when trained using different information about chemical and structural properties, bench-marked against two matrix completion methods, Naïve Bayes and Low Rank matrix completion. One of aims of this project is to determine how well predictions extrapolate to compounds and or proteins for which no experimental data exists. 

Figure 1: Error plot  for different machine learning methods (higher F-scores indicates better predictions).

In addition, predictions can be difficult to interpret or validate. We use techniques from tiling literature, an approach that looks for sub-groups of shared entries, to identify sub-groupings of compounds and proteins that interact. The problem can be formulated as finding a rank-k approximation to our database, as a product of two binary matrices. This can be done in terms of finding either the maximum k-tiling (where positive tiles cannot include known negative entries) or best k-tiling (the tiling that has least overall error from the underlying database). PROXIMUS is an algorithm that finds the best rank-1 tile, splits according to the rows present in the tile, re-tiles the two sides of the partition, partitions again and repeats until convergence. The project aims to extend this algorithm to recommender systems, specifically extending tiling approaches to the case of missing data. 


Shen et al (2009) relax an interger program formulation of the rank-1 best tile problem (an integer program) and show that the solution of this relaxed problem is close to the solution of the original problem. We have extended this result to the case where the underlying database has missing entries. This bound has been numerically verified for random matrices.

In the case of rank-k tiling, we implement PROXIMUS for missing data at different amounts of missing data in the interaction matrix (80%,50%,10%,1%) and find that the approximation achieves an accuracy of 90% for our dataset with a small number of tiles. 

Figure 2: Error on training data at different levels of missing data.

Future Work

We hope to extend machine learning methods to the question of extrapolation to unsampled rows and columns. We hope to consider different adaptations of the current algorithm, such as using row-wise as well as column-wise splits, using a max tile approach or measures of purity to determine the best split (rather than the Frobenius norm for best tile), and using both positive and negative information as the algorithm progresses. In addition, we hope to develop a measure of ‘tile reliability’ and perform numerical experiments with different data sets, for example recommender systems for film users.