Recommender Systems for Drug Discovery

Researcher: Mel Beckerleg
Academic Supervisor: Raphael Hauser, Andrew Thompson
Industrial Supervisor: Jonny Wray

Background

Predicting missing entries in databases has a wide range of applications, from e-commerce to film recommendations to biomedical sciences.

e-Therapeutics are a drug discovery company who use computational and scientific analysis to identify novel compounds for treating complex diseases. One challenge they face is how to identify drugs that might work on a subset of target proteins that are linked to causing disease. Testing all possible drugs in a laboratory is infeasible in terms of time and cost, and therefore they need methods for using existing evidence about the interactions to make predictions for those that are unknown.

In fact, this problem resembles the problem faced by companies like Amazon and Netflix, of how to recommend products or films to their customers, based on the preferences of other customers. It turns out that in both cases, the databases often have an approximately low rank structure, meaning that a relatively small number of patterns are sufficient to explain the whole database. This has allowed for the development of highly accurate algorithms even in the case where a high proportion of the entries are missing. However, these methods do not always allow for interpretable predictions. In particular, for binary databases, where interactions are the result of groups (clusters) of similarly behaved drugs or proteins, there is a lot of benefit to be gained from also being able to identify these clusters.

For this reason I am researching algorithms for predictive clustering. The aim is to find algorithms that can recover not only accurate predictions, but also the underlying clusterings.

Outcomes

I have explored the use of a linear program for identifying the best single cluster that approximates a database with missing entries. I have shown that for any binary matrix, this method has an error that is no worse than twice the minimal error for any clustering. For the case of a single noisy, partially observed cluster, I have identified a relationship between the noise levels and the cluster size that quantifies when exact recovery is possible.

I have used this work as the basis of an algorithm for rank-k binary matrix completion, which uses an iterative framework for generating clusters through rank-1 partitions. The method performs well on real world recommender systems, when benchmarked against other approaches for predictive clustering. [1]

I have also proved recovery in the asymptotic limit of binary databases composed of clustered rows. One is a nearest-neighbor inspired approach, directly grouping rows based on the distance between them. The second uses a log-likelihood maximization approach to recover the original database, before grouping rows within a certain distance. Both results link the number of missing entries to the rank of the database. I have also explored this relationship numerically for finite databases.