In many applications we need to find or estimate the $p \ge 1$ largest elements of a matrix, along with their locations. This is required for recommender systems used by Amazon and Netflix, link prediction in graphs, and in finding the most important links in a complex network, for example.

Our algorithm uses only matrix vector products and is based upon a power method for mixed subordinate norms. We have obtained theoretical results on the convergence of this algorithm via a comparison with rook pivoting for the LU decomposition. We have also improved the practicality of the algorithm by producing a blocked version iterating on $n \times t$ matrices, as opposed to vectors, where $t$ is a tunable parameter. For $p > 1$ we show how deflation can be used to improve the convergence of the algorithm.

Finally, numerical experiments on both randomly generated matrices and real-life datasets (the latter for $A^TA$ and $e^A$) show how our algorithms can reliably estimate the largest elements of a matrix whilst obtaining considerable speedups when compared to forming the matrix explicitly: over 1000x in some cases.

# Past Computational Mathematics and Applications Seminar

For many years, sum-factored algorithms for finite elements in rectangular reference geometry have combined low complexity with the mathematical power of high-order approximation. However, such algorithms rely heavily on the tensor product structure inherent in the geometry and basis functions, and similar algorithms for simplicial geometry have proven elusive.

Bernstein polynomials are totally nonnegative, rotationally symmetric, and geometrically decomposed bases with many other remarkable properties that lead to optimal-complexity algorithms for element wise finite element computations. The also form natural building blocks for the finite element exterior calculus bases for the de Rham complex so that H(div) and H(curl) bases have efficient representations as well. We will also their relevance for explicit discontinuous Galerkin methods, where the element mass matrix requires special attention.

We will review the basic building blocks of iterative solvers, i.e. sparse matrix-vector multiplication, in the context of GPU devices such

as the cards by NVIDIA; we will then discuss some techniques in preconditioning by approximate inverses, and we will conclude with an

application to an image processing problem from the biomedical field.

The aim of this talk is to design an efficient multigrid method for constrained convex optimization problems arising from discretization of some underlying infinite dimensional problems. Due to problem dependency of this approach, we only consider bound constraints with (possibly) a linear equality constraint. As our aim is to target large-scale problems, we want to avoid computation of second

derivatives of the objective function, thus excluding Newton like methods. We propose a smoothing operator that only uses first-order information and study the computational efficiency of the resulting method. In the second part, we consider application of multigrid techniques to more general optimization problems, in particular, the topology design problem.

Quadrature is the term for the numerical evaluation of integrals. It's a beautiful subject because it's so accessible, yet full of conceptual surprises and challenges. This talk will review ten of these, with plenty of history and numerical demonstrations. Some are old if not well known, some are new, and two are subjects of my current research.

Partial differential equations with more than three coordinates arise naturally if the model features certain kinds of stochasticity. Typical examples are the Schroedinger, Fokker-Planck and Master equations in quantum mechanics or cell biology, as well as quantification of uncertainty.

The principal difficulty of a straightforward numerical solution of such equations is the `curse of dimensionality': the storage cost of the discrete solution grows exponentially with the number of coordinates (dimensions).

One way to reduce the complexity is the low-rank separation of variables. One can see all discrete data (such as the solution) as multi-index arrays, or tensors. These large tensors are never stored directly.

We approximate them by a sum of products of smaller factors, each carrying only one of the original variables. I will present one of the simplest but powerful of such representations, the Tensor Train (TT) decomposition. The TT decomposition generalizes the approximation of a given matrix by a low-rank matrix to the tensor case. It was found that many interesting models allow such approximations with a significant reduction of storage demands.

A workhorse approach to computations with the TT and other tensor product decompositions is the alternating optimization of factors. The simple realization is however prone to convergence issues.

I will show some of the recent improvements that are indispensable for really many dimensions, or solution of linear systems with non-symmetric or indefinite matrices.

To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming

models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications.

In this talk we present the design of task-based sparse direct solvers on top of runtime systems. In the context of the

qr_mumps solver, we prove the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on a Sequential Task flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method allowing for better scalability compared to the original approach used in qr_mumps.

Following this approach, we move to heterogeneous architectures where task granularity and scheduling strategies are critical to achieve performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources. Finally we introduce a memory-aware algorithm to control the memory behavior of our solver and show, in the context of multicore architectures, an important reduction of the memory footprint for the multifrontal QR factorization with a small impact on performance.