Abstract
Tree traversal is a fundamental operation in many applications, such as database indexing and physics
simulations. Although tree traversals feature high parallelism, they are inherently divergent and irregular,
leading to inefficient performance on GPUs. Tree traversals are also prevalent in ray tracing, which is
executed on dedicated Ray-Tracing Accelerators (RTAs) in modern GPUs to mitigate inefficiencies such as
control flow divergence and underutilization of memory bandwidth by irregular memory accesses.
In this paper, we propose the Tree Traversal Accelerator (TTA) to replicate the success
of RTAs in ray tracing for general tree traversal applications.
TTAs extend RTAs to support tree structures and operations beyond
those in ray tracing, such as B-Tree search and radius search algorithms, by modifying existing computing
units. Despite TTAs' effectiveness, they still rely on fixed-function computations, making it challenging
to support other tree-based applications such as N-Body simulation fully. Thus, we introduce TTA+ as an
alternative design, which modularizes the RTA computing units and makes them programmable, trading some
efficiency for flexibility. With less than 1% increase in RTA area, our proposals can achieve up to 5.4x
speedup for B-Tree search, 1.7x for N-Body simulation, and 1.2x for select ray-tracing applications.
Resources