Cyrus Cousins

Statistical Data Science with Rademacher Averages

Project Overview

These works apply Rademacher averages to obtain state-of-the-art approximation algorithms for classic data science tasks, including pattern mining and betweenness centrality estimation. Bavarian focuses on approximating betweenness centrality (BC) in graphs. By using Monte-Carlo Empirical Rademacher Averages (MCERAs), Bavarian provides strong, sample-dependent approximation guarantees, reducing error bounds by a factor of 2–4 over existing methods. The algorithms offer a unifying framework for comparing various BC estimators and introduce new sample-complexity results based on the graph's vertex-diameter. Similarly, MCRapper applies MCERAs to the poset-structured function families common in pattern mining tasks. It efficiently computes upper bounds on the maximum deviation of sample means from their expectations, enabling both the identification of statistically significant patterns and the approximation of sets of high-expectation functions. MCRapper also leads to TFP-R, an algorithm for True Frequent Pattern mining that outperforms current methods by offering higher precision and recall.

Both algorithms demonstrate the flexibility and power of MCERAs in providing robust, efficient solutions to complex problems in graph analysis and pattern mining. Their strong statistical guarantees and improved computational efficiency make them broadly applicable wherever precise data analysis and pattern recognition are required. In machine learning, Rademacher averages are generally used to bound the deviation between the training and test risk of a learned model; these applications leverage the strength of the Rademacher average more effectively by simultaneously bounding large families of statistics, which are then used to produce domain-specific approximations. These projects are closely related to my work on empirical centralization, and I have employed similar techniques in adversarial learning and fair adversarial learning settings, as well as empirical game theory, to simultaneously bound many statistics efficiently.
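To make the central quantity concrete, here is a minimal sketch (not the papers' implementation) of the n-trial MCERA for a finite function family, assuming the family's values on the sample have been precomputed as a matrix; the function name and interface are illustrative:

```python
import numpy as np

def mcera(values, n_trials=100, seed=0):
    """n-trial Monte-Carlo Empirical Rademacher Average of a finite family.

    values: array of shape (num_functions, m), where values[f, i] = f(x_i)
    Returns (1/n) * sum_j max_f (1/m) * sum_i sigma[j, i] * values[f, i],
    i.e., the supremum over the family of the empirical correlation with
    random Rademacher signs, averaged over n Monte-Carlo trials.
    """
    rng = np.random.default_rng(seed)
    num_functions, m = values.shape
    # n_trials independent vectors of uniform +/-1 (Rademacher) signs
    sigmas = rng.choice([-1.0, 1.0], size=(n_trials, m))
    # correlation of each function with each sign vector: shape (n_trials, num_functions)
    corr = sigmas @ values.T / m
    # supremum over the family in each trial, then the Monte-Carlo average
    return corr.max(axis=1).mean()
```

The key point, as in the overview above, is that a single MCERA simultaneously controls the deviation of every function in the family, rather than one statistic at a time.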


Bavarian: Betweenness Centrality Approximation with Variance-Aware Rademacher Averages

Cyrus Cousins, Chloe Wohlgemuth, and Matteo Riondato

Abstract

We present Bavarian, a collection of sampling-based algorithms for approximating the Betweenness Centrality (BC) of all vertices in a graph. Our algorithms use Monte-Carlo Empirical Rademacher Averages (MCERAs), a concept from statistical learning theory, to efficiently compute tight bounds on the maximum deviation of the estimates from the exact values. The MCERAs provide a sample-dependent approximation guarantee much stronger than the state of the art, thanks to their use of variance-aware probabilistic tail bounds. The flexibility of the MCERAs allows us to introduce a unifying framework that can be instantiated with existing sampling-based estimators of BC, thus allowing a fair comparison between them, decoupled from the sample-complexity results with which they were originally introduced. Additionally, we prove novel sample-complexity results showing that, for all estimators, the sample size sufficient to achieve a desired approximation guarantee depends on the vertex-diameter of the graph, an easy-to-bound characteristic quantity. We also show progressive-sampling algorithms and extensions to other centrality measures, such as percolation centrality. Our extensive experimental evaluation of Bavarian shows the improvement over the state of the art made possible by the MCERAs (2–4x reduction in the error bound), and it allows us to assess the different trade-offs between sample size and accuracy guarantees offered by the different estimators.
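To illustrate the kind of sampling-based BC estimator the framework unifies, here is a hedged sketch on an unweighted graph. The function names and interface are illustrative, not Bavarian's code, and one simplification is labeled in the comments: walking back through uniformly chosen predecessors does not sample shortest paths exactly uniformly (exact estimators weight by shortest-path counts).

```python
import random
from collections import deque

def sample_shortest_path(adj, s, t, rng):
    """BFS from s recording all shortest-path predecessors, then walk back
    from t. NOTE (simplification): choosing a predecessor uniformly at each
    step is not an exactly uniform sample over shortest paths; exact
    estimators weight choices by shortest-path counts."""
    preds, dist = {s: []}, {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:                 # first time reached: new BFS layer
                dist[w] = dist[u] + 1
                preds[w] = [u]
                q.append(w)
            elif dist[w] == dist[u] + 1:      # another shortest-path predecessor
                preds[w].append(u)
    if t not in dist:
        return None                           # t unreachable from s
    path = [t]
    while path[-1] != s:
        path.append(rng.choice(preds[path[-1]]))
    path.reverse()
    return path

def sample_bc(adj, num_samples=1000, seed=0):
    """Pair-sampling BC estimate: draw random vertex pairs, sample a shortest
    path between them, and credit each internal vertex of the path.
    adj: dict mapping each vertex to a list of neighbors (unweighted)."""
    rng = random.Random(seed)
    vertices = list(adj)
    bc = {v: 0.0 for v in vertices}
    for _ in range(num_samples):
        s, t = rng.sample(vertices, 2)
        path = sample_shortest_path(adj, s, t, rng)
        if path is None:
            continue
        for v in path[1:-1]:                  # internal vertices only
            bc[v] += 1.0 / num_samples
    return bc
```

In the MCERA framework, each sampled path defines the value of a family of per-vertex functions on that sample, and the MCERA then bounds the maximum estimation error over all vertices simultaneously.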


MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

Leonardo Pellegrina, Cyrus Cousins, Fabio Vandin, and Matteo Riondato

Abstract

We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both (1) statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility offered by MCRapper is a big advantage over previously proposed solutions, which could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm, TFP-R, for the task of True Frequent Pattern (TFP) mining, by appropriately computing approximations of the negative and positive borders of the collection of patterns of interest, which allow an effective pruning of the pattern space and the computation of strong bounds to the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state of the art for their respective tasks.

Editor's note: To alleviate any potential confusion, we clarify that MCRapper should be parsed and pronounced as MC-Rapper, rather than M-CRapper.