Tahsin Reza | University of Waterloo

Tahsin Reza

Home Publications Projects Research Group Teaching Service

I am an Assistant Professor of Electrical and Computer Engineering at the University of Waterloo. I completed the PhD degree in Electrical and Computer Engineering at The University of British Columbia, advised by Professor Matei Ripeanu. Before coming to UWaterloo, I was a member of the research staff at the Center for Applied Scientific Computing, Lawrence Livermore National Laboratory in California.

I am broadly interested in software technologies for parallel and distributed computing. My research focus is investigating novel systems techniques, both at the application and middleware level, for contemporary and emerging large-scale problems that demand scalable and timely solution. My goal is building performant and scalable yet sustainable systems that harness state-of-the-art massively parallel and large computing platforms, distributed communication, and memory technologies.

Research Group

We are always looking forward to working with highly motivated Postdoctoral scholars, and PhD, Master's and undergraduate students interested in computer systems research. If you are interested in working with us, please send an email. Please refer to Electrical and Computer Engineering resources for future graduate students for admission requirements and program details.

Graduate Students

Guilherme Campos, MASc (co-advised by Professor Wojciech Golab)

Faraz Awan, MASc

Kevin Kim, MASc

Yisheng Liu, PhD

Hasti Ghaneshirazi, PhD (co-advised by Professor Ladan Tahvildari)

Alumni

Zonghao Huang, MEng project, Spring 2025 (next, AMD)

Jerry Hu, BASc (URA)

Kevin Lin, BASc (URA, next, graduate student at University of Toronto)

Akira Yoshiyama, BASc (NSERC USRA)

Vinamra Singh, MEng project, Spring 2024 (co-advised by Professor Ladan Tahvildari)

Key Research Awards and Honors

■ Award - Natural Sciences and Engineering Research Council of Canada (NSERC) Postdoctoral Fellowship (2020)

■ Best paper nomination - The 36th IEEE International Parallel Distributed Processing Symposium (IPDPS'22)

■ Best paper nomination - The 19th IEEE International Conference on Cluster Computing (CLUSTER'17)

Publications Google Scholar DBLP

Journal

Distributed Approximate Minimal Steiner Trees with Millions of Seed Vertices on Billion-Edge Graphs. Reza, T., Steil, T., Sanders, G., and Pearce, R. Journal of Parallel and Distributed Computing (JPDC), 181(1), 2023, Elsevier B.V.

A Biomedical Open Knowledge Network Harnesses the Power of AI to Understand Deep Human Biology. Baranzini, S., Börner, K., Morris, J., Nelson, C., Soman, K., Schleimer, E., Keiser, M., Musen, M., Pearce, R., Reza, T., Smith, B., Herr II, B., Oskotsky, B., Rizk-Jackson, A., Rankin, K., Sanders, S., Bove, R., Rose, P., Israni, S., and Huang, S. AI Magazine, 43(1), pp. 46 - 58, May, 2022.

Scalable Pattern Matching in Metadata Graphs via Constraint Checking. Reza, T., Halawa, H., Ripeanu, M., Sanders, G., and Pearce, R. ACM Transactions on Parallel Computing (TOPC), 8(1), April, 2021.

PtSel: Accelerating Persistent Scatterer Pixel Selection for InSAR Processing. Reza, T., Zimmer, A., Delgado Blasco, J., Ghuman, P., Aasawat, T., and Ripeanu, M. IEEE Transaction on Parallel and Distributed Systems (TPDS), 29(1), pp. 16 - 30, January, 2018.

QaASs: QoS Aware Adaptive Security Scheme for Video Streaming in MANETs. Reza, T., and Barbeau, M. Journal of Information Security and Applications (JISA), 18(1), pp. 68 - 82, July, 2013, Elsevier Ltd.

Conference

Embracing Irregular Parallelism in HPC with YGM. Steil, T., Reza, T., Priest, B., and Pearce, R. In Proceedings of The IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'23), 12 - 17 November, 2023. (Acceptance rate 23.9%)

Towards Distributed 2-Approximation Steiner Minimal Trees in Billion-edge Graphs. Reza, T., Sanders, G., and Pearce, R. In Proceedings of The 36th IEEE International Parallel and Distributed Processing Symposium (IPDPS'22), Lyon, France, 30 May - 03 June, 2022. (Acceptance rate 25.9%, 123/474, Best Paper Nomination)

Computing Surveys of Triangles in Massive-scale Temporal Graphs with Metadata. Steil, T., Reza, T., Iwabuchi, K., Priest, B., Sanders, G., and Pearce, R. In Proceedings of The IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'21), 14 - 19 November, 2021. (Acceptance rate 23.6%, 86/365)

HyGN: Hybrid Graph Engine for NUMA. Aasawat, T., Reza, T., Yoshizoe, K., and Ripeanu, M. In Proceedings of The IEEE International Conference on BigData (BigData'20), 10 - 13 December, 2020. (Acceptance rate 16.4% for eight page papers)

Approximate Pattern Matching in Distributed Graphs with Precision and Recall Guarantees. Reza, T., Ripeanu, M., Sanders, G., and Pearce, R. In Proceedings of The ACM SIGMOD International Conference on Management of Data (SIGMOD'20), Portland, Oregon, 14 - 19 June, 2020. (Acceptance rate 22.3%)

PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution. Reza, T., Ripeanu, M., Tripoul, N., Sanders, G., and Pearce, R. In Proceedings of The IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'18), Dallas, Texas, 11 - 16 November, 2018. (Acceptance rate 19%, 55/288)

Towards Practical and Robust Labeled Pattern Matching in Trillion-Edge Graphs. Reza, T., Klymko, C., Ripeanu, M., Sanders, G., and Pearce, R. In Proceedings of The 19th IEEE International Conference on Cluster Computing (CLUSTER'17), Honolulu, Hawaii, 05 - 08 September, 2017. (Acceptance rate 21.8%, 47/216, Best Paper Nomination)

Accelerating Persistent Scatterer Pixel Selection for InSAR Processing. Reza, T., Zimmer, A., Ghuman, P., Aasawat, T., and Ripeanu, M. In Proceedings of The 26th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP'15), Toronto, Ontario, 27 - 29 July, 2015. (Acceptance rate 24.7%, 21/85)

Non-cooperating Vehicle Tracking in VANETs Using the Conditional Logit Model. Reza, T., Lamothe, G., Barbeau, M., and Alsubaihi, B. In Proceedings of The 16th IEEE Conference on Intelligent Transport Systems (ITSC'13), Hague, Netherlands, pp. 626 - 633, 06 - 09 October, 2013.

Tracking an ‘on the run vehicle' in a metropolitan VANET. Reza, T., Barbeau, M., and Alsubaihi, B. In Proceedings of The IEEE Intelligent Vehicles Symposium (IV'13), Gold Coast, Australia, pp. 220 - 227, 23 - 26 June, 2013.

QoS Aware Adaptive Security Scheme for Video Streaming in MANETs. Reza, T., and Barbeau, M. In Proceedings of The 5th International Symposium on Foundations & Practice of Security (FPS'12), Montreal, Quebec, pp. 324 - 240, 25 - 26 October, 2012.

Workshop

Labeled Triangle Indexing for Efficiency Gains in Distributed Interactive Subgraph Search. Reza, T., Ripeanu, M., Sanders, G., and Pearce, R. In Proceedings of The 10th Workshop on Irregular Applications: Architectures and Algorithms (IA^3'20), co-located with SC'20, 11 November, 2020.

Towards Interactive Pattern Search in Massive Graphs. Reza, T., Ripeanu, M., Sanders, G., and Pearce, R. In Proceedings of The 3rd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) (GRADES-NDA'20), co-located with SIGMOD'20, Portland, Oregon, 14 June, 2020.

There are Trillions of Little Forks in the Road. Choose Wisely! - Estimating the Cost and Likelihood of Success of Constrained Walks to Optimize a Graph Pruning Pipeline. Tripoul, N., Halawa, H., Reza, T., Ripeanu, M., Sanders, G., and Pearce, R. In Proceedings of The 8th Workshop on Irregular Applications: Architectures and Algorithms (IA^3'18), co-located with SC'18, Dallas, Texas, 11 - 16 November, 2018.

Scale-Free Graph Processing on a NUMA Machine. Aasawat, T., Reza, T., and Ripeanu, M. In Proceedings of The 8th Workshop on Irregular Applications: Architectures and Algorithms (IA^3'18), co-located with SC'18, Dallas, Texas, 11 - 16 November, 2018.

How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Aasawat, T., Reza, T., and Ripeanu, M. In Proceedings of The 4th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing (HPBDC'18), co-located with IPDPS'18, Vancouver, British Columbia, 21 May, 2018.

Poster, Research Showcase

Accelerating BFS and SSSP on a NUMA machine for the Graph500 Challenge. Aasawat, T., Yoshizoe, K., Reza, T., and Ripeanu, M. The IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), Denver, Colorado, 17 - 22 November, 2019.

Pattern Matching in Massive Metadata Graphs at Scale. Reza, T., and Ripeanu, M. Doctoral Showcase. In IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'18), Dallas, Texas, 11 - 16 November, 2018.

Efficient GPU Techniques for Processing Temporally Correlated Satellite Image Data. Reza, T., Mukherjee, D., and Ripeanu, M. In IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'15), Austin, Texas, 15 - 20 November, 2015.

Parallel Clustering Coefficient Computation using GPUs. Reza, T., Aasawat, T., and Ripeanu, M. In IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'14), New Orleans, Louisiana, 16 - 21 November, 2014.

Technical Report

Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems. Gharaibeh, A., Reza, T., Santos-Neto, E., Beltrão Costa, L., Sallinen, S., and Ripeanu. M. Technical Report, NetSysLab, Electrical and Computer Engineering, The University of British Columbia, December, 2014.

Research Projects

Scalable Analytics in High-dimensional Data

UWaterloo | Code - TBA | Paper - TBA

■ Distributed and GPU accelerated vector similar search.

■ Distributed and GPU accelerated clustering in vector data.

Scalable Deep Learning

UWaterloo | Code - TBA | Paper - TBA

■ Federated learning.

■ Low-cost neural architecture search.

Distributed Maximal Independent Set Computation

UWaterloo | Code - TBA | Paper - TBA

We present a distributed Maximal Independent Set algorithm and its scalable implementation to support graph datasets that are too big for the main memory of most servers, and applications that demand high throughput. The proposed distributed solution is based on the randomized priority assignment approach which has been shown to identify large maximal independent sets. Our MPI-based, in-memory solution accelerates distributed computation by harnessing asynchronous processing, achieves scalability through load balanced graph partitioning and produces deterministic results. We demonstrate scalability using massive datasets - up to a 128 billion edge real-world web graph and 512 compute nodes (16K CPU-cores) - to the best of our knowledge, the largest scale-free graph data and platform scale for maximal independent set computation found in the literature.

Distributed Middleware for Data-intensive Workloads

UWaterloo | Code - TBA | Paper - TBA

In recent years, the scale and throughput need of data science and machine learning tasks have been driving up the demand for HPC capabilities across many industries. The core design philosophy of popular HPC/distributed computing middleware such as MPI and GASNet, conceived decades ago, is largely motivated by the need of compute-bound problems. In contrast, often distributed communication is the bottleneck in modern data-intensive problems; many exhibit irregular, data-driven computation patterns - leading to load-imbalanced processing, which makes strong scaling challenging. Lack of sophisticated flow control mechanisms in traditional HPC middleware, in communication-heavy applications, often cause message buildup and memory exhaustion, fault manifestation in the messaging layer and system collapse. Furthermore, current HPC middleware have limitations in maximizing performance of individual nodes, e.g., limited scalability with increasing CPU-core count, and inadequate support for multithreading, which is inconsistent with the node architecture of modern distributed platforms. Given these challenges, our goal is to address the design shortcomings of current HPC distributed middleware for efficiently and reliably supporting contemporary and emerging data-intensive tasks on modern distributed platforms.

Accelerating Neural Architecture Search

UWaterloo and CASC, LLNL | Code - TBA | Paper - TBA

Neural Architecture Search (NAS) is a high-impact machine learning problem area that focuses on automating the task of identifying high-quality models. NAS solutions are compute intensive - often thousands of models are trained and evaluated. To speed up the search process, low overhead model quality estimation is considered an attractive alternative to spending a prolonged duration in model training and validation. We have developed a graph-centric approach for approximating model quality: a cheap to compute quality metric that, within a NAS workflow, can be used as a direct substitute for model validation accuracy. We use a graph representation of the architecture model that captures co-occurrences of universal model building blocks. Through sampling and validating a small number of models, we co-relate graph features (related to model building blocks) to ground truth model quality. Later, we approximate the quality of a new model based on these weighted graph features. We implement our solution by extending the neuroevolution-based framework NSGANet. Evaluation using multiple computer vision datasets, including CIFAR-10, CIFAR-100, and several NAS-Bench-360 datasets shows that our approach can speed up architecture search by up to 6x over NSGANet, while identifying models with comparable quality. Also, our MPI-based distributed implementation scales out architecture search on distributed GPUs.

Middleware for Distributed Platforms with Fast Networks

UWaterloo and CASC, LLNL | Code - YGM, UPC++, UPC++ benchmarks | Paper - [SC'23]

YGM is a C++ distributed computing library developed at LLNL. YGM offers a high-level API for implementing distributed algorithms on top of its asynchronous Remote Procedure Call (RPC) engine. YGM employs advanced message buffering for improving bandwidth utilization, thus, throughput, in MPI-based communication. We are actively working on extending API features and improving performance under irregular workloads and communication patterns. We have multiple projects in the pipeline that use YGM for distributed messaging.

UPC++ is a Partitioned Global Address Space (PGAS) library for implementing distributed algorithms, developed at LBNL. UPC++ provides APIs for both one-sided Remote Memory Access (RMA) and RPC. UPC++ offers true one-sided RMA communication (to the global address space) and fire-and-forget RPCs. UPC++ uses GASNet-EX for distributed messaging and supports multiple low-level protocols for different network architectures, e.g., InfiniBand Verbs (IBV), OpenFabrics Interfaces (OFI) and UDP. We thoroughly evaluated UPC++ to understand its performance offerings for our target applications. We have developed a number of benchmark applications and evaluated them using 100+ compute nodes (thousands of cores) on distributed platforms with both Mellanox EDR Infiniband and Intel Omni-Path networks.

Distributed Steiner Tree Computation

CASC, LLNL | Papers - [JPDC'23], [IPDPS'22]

Steiner Tree computation has direct applications to VLSI design, communication network optimization, systems biology, knowledge discovery and database search. Given a set of seed vertices of an edge-weighted graph, a Steiner minimal tree is an acyclic subgraph (tree) that connects the seed vertices while minimizing its distance - the sum of edge weights. Finding a Steiner minimal tree of a graph and arbitrary seed vertices is an NP-hard problem. Therefore, polynomial time algorithms for finding a tree with a distance close to the distance of a Steiner minimal tree are popular because of to their practical relevance. We have developed a distributed 2-approximate Steiner minimal tree solution and demonstrated scalability using up to a hundred-billion edge real-world web graph, millions of seed vertices and 512 compute nodes. In place of distance computation between all pairs of seed vertices, an expensive phase in many 2-approximate algorithms, our solution exploits Voronoi cell computation. Also, this approach offers higher parallel efficiency than related works that perform minimum spanning tree computation on the entire data graph. We provide theoretical guarantees of the approximation bound of our solution and empirical evidence supporting these guarantees.

Distributed Subgraph Matching

NetSysLab, UBC and CASC, LLNL | Code - PruneJuice | Papers - [TOPC'21], [IA^3'20], [SIGMOD'20], [GRADES-NDA'20], [SC'18], [IA^3'18], [CLUSTER'17]

Subgraph matching is a key algorithmic technique enabling fine-grained analysis of a graph's local structures - fundamental to answering pattern queries in graph databases. Exact subgraph matching poses various scalability challenges: the problem does not have a know polynomial time solution, skewed graph topology and match distribution makes load balanced processing difficult. We have developed a novel algorithmic pipeline which enables distributed subgraph matching in massive labeled graphs. This work is the first to show that it is much cheaper in practice to first reduce a large graph to the solution subgraph - the union of all vertices and edges that participate in at least one match of the query subgraph. The solution subgraph directly answers a class of popular graph database queries. Match enumeration is also cheaper in practice in the reduced solution subgraph. First, we present a general solution for exact subgraph matching that, (a) is highly scalable in practice, (b) supports arbitrary labeled subgraph queries, and (c) offers message and memory efficiency. Then, we show that this graph reduction approach lends itself well to efficiently solve a class of edit-distance subgraph matching and human-in-the-loop incremental subgraph search problems. This work has resulted in an open source software artifact, PruneJuice, which has been used internally at LLNL and by academic research groups. We demonstrated advantages of this solution approach through strong and weak scaling experiments on massive-scale real-world (up to 127 billion edges) and synthetic (up to 2.2 trillion edges) graphs, respectively, and at scales (up to 1,024 nodes or 36,864 cores), orders of magnitude larger than used in the past for similar problems. Currently we are working on extending the graph reduction principles of PruneJuice to efficiently support property graphs with arbitrary vertex and edge attributes.

Graph-based Reasoning of Biomedical Data

CASC, LLNL and Weill, UCSF | Paper - [AI Magazine'22]

SPOKE is a multi-institutional NFS funded effort for representation and reasoning of a biomedical knowledge network. The SPOKE network is a property graph constructed from (and evolving) 30+ human-validated sources, e.g., molecule, genome and (anonymized) patient clinical datasets. The goal is to exploit this vast relational knowledge repository for reasoning research questions in health sciences, as well as application areas such as insights in understanding diseases, discovering and repurposing drugs. To date, our contributions include in-depth analysis of various topological properties (e.g., network diameter and centrality measures), and discovery of metabolic paths and subgraphs of interest identified by domain experts. Also, we are investigating novel techniques (e.g., Steiner tree computation) for discovering causal relationships between genes and diseases to help domain experts understand role of genes in diseases.

Optimized Graph Processing on NUMA and GPU-equipped Shared Memory Platforms

NetSysLab, UBC | Code - Totem | Papers - [BigData'20], [IA^3'18], [IEEE HPBDC'18], [NetSysLab Technical Report'14]

This is a large effort by the NetSysLab at UBC toward building general purpose graph processing solutions optimized for modern shared memory platforms and accelerators. Most recently, we characterized parallel graph traversal algorithms and studied design trade-offs of key solutions on Non-Uniform Memory Access (NUMA) platforms. Lessons learned from our investigations led to the development of HyGN - a graph engine that marries, (a) distributed, shared-nothing designs (e.g., data partitioning) with shared memory features (e.g., byte addressing), and (b) unique advantages of synchronous and asynchronous processing within the same computation task, toward offering superior performance and scalability on NUMA platforms. Using popular benchmarks, we demonstrate that HyGN comfortably outperforms state-of-the-art NUMA-oblivious solutions, achieves maximum 97% efficient scaling on a four-socket NUMA machine. Totem is an in-memory graph processing framework, which leverages both CPU and GPU of a shared memory platform. Within Totem, a graph is partitioned and processed simultaneously on the CPU and GPU. Totem has been widely used by researchers for performance comparison. To date, work resulting from these efforts won several Graph 500 awards in the small and medium size dataset category.

Accelerating InSAR data Processing using GPUs

NetSysLab, UBC and 3vGeomatics | Papers - [TPDS'18], [IEEE ASAP'15]

A key application of the remote sensing technology Interferometric Synthetic Aperture Radar (InSAR) is monitoring displacement of a natural or artificial structure on the earth's surface over time. Modern SAR satellites produce higher resolution sensor data leading to orders of magnitude increase in volume of accumulated data, and therefore, the demand for superior computing capabilities for timely processing of these large datasets (tens of terabytes). We engineered and implemented a Persistent Scatterer selection kernel in the InSAR processing chain for Nvidia GPUs: on four commodity GPUs, it offers 65x speedup over an OpenMP implementation on a dual-socket (16-cores) Intel Xeon platform. Also, this work presents a detailed characterization of the target problem under different workloads, and demonstrates, effectiveness of the design choices to improve parallelism and GPU utilization, while avoiding redundant computations, and near optimal strong scaling with increasing GPU count.

Teaching

University of Waterloo

■ ECE 750 Special Topics: Scalable Computer System Design (graduate) - Fall 2026

■ ECE 454 Distributed Computing (undergraduate) - Spring 2026

■ ECE 750 Special Topics: Scalable Computer System Design (graduate) - Fall 2025

■ ECE 454 Distributed Computing (undergraduate) - Spring 2025

■ ECE 750 Special Topics: Scalable Computer System Design (graduate) - Fall 2024

■ ECE 454 Distributed Computing (undergraduate) - Spring 2024

Service

Conference Program Committee

■ 2026 IEEE International Conference on Distributed Computing Systems (ICDCS'26), Program Committee Member (Big Data, Models, and Systems Track)

■ 2026 IEEE International Conference on Cluster Computing (CLUSTER'26), Program Committee Member (Applications, Algorithms, and Libraries Track)

■ 2026 IEEE/ACM The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'26), Tutorials Committee Member

■ 2025 IEEE International Conference on Distributed Computing Systems (ICDCS'25), Program Committee Member (Big Data, Models, and Systems Track)

■ 2025 ACM International Conference on Supercomputing (ICS'25), Program Committee Member

■ 2025 IEEE/ACM The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'25), Tutorials Committee Member

Reviewer - Conference

■ 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS’22)

Reviewer - Journal

■ IEEE Transaction on Parallel and Distributed Systems (TPDS)

■ Journal of Parallel and Distributed Computing (JPDC), Elsevier

■ ACM Transactions on Parallel Computing (TOPC)

■ ACM Transactions on Architecture and Code Optimization (TACO)

■ ACM Computing Surveys

■ Information Sciences, Elsevier

■ Expert Systems with Applications, Elsevier