Writing a Hardware-Accelerated PBRT Engine
This entry documents the progress of writing a physically-accurate ray tracer step by step. This starts from bsaic ray tracing optimizations (BVH acceleration structure with thread paralliziation for the 67th time), to adding more cool features. This project is built off of a minimalistic ray tracing library, but it is not enough for what we want to achieve!
Jump to Entry:
2/9/2026: Monte Carlo integration for photorealistic rendering stuff yay woo wahoo i love math
yeah
2/8/2026: We balling with C++ parallelization library
After implementing the basic BVH, one tends to figure out two things. One, BVH fast. Two, building the BVH still takes a while. I profiled the time it takes to build the BVH on my machine (my machine iiiiiis a 12th Gen Intel(R) Core(TM) i7-12800H yet it is hanging on by a thread hahahahahah):
BVH Statistics: Build time: 657.81 ms Triangle count: 544566 Interior nodes: 77570 Leaf nodes: 77571 Avg triangles per leaf: 7.02023 Bytes per node: 32 Total nodes: 155141
Basically serial BVH construction builds the tree by appending nodes into a single shared array as it recurses. Therefore, I employ the helpful Thread Building Blocks Library to attempt to parallelize the tree building process. As my code is right now, two threads CANNOT safely push to the same array at the same time. Rather than trying to synchronize access to a shared data structure, I restructured the build so each subtree is constructed in complete isolation (average cs major lifestyle is also to be in complete isolation); it gets its own copy of the triangle indices and builds into its own private arrays of nodes and indices, with no shared mutable state between threads:
struct BVHBuildResult {
std::vector nodes;
std::vector indices;
};
// each call owns its indices (passed by value) and writes to its own result
BVHBuildResult Accel::buildParallel(std::vector indices, const BoundingBox3f &bbox) {
BVHBuildResult result;
// ... SAH split logic, partition into leftIndices / rightIndices ...
At each split, if the current subtree contains a shit ton of triangles (I set 1024 as the threshold), the left and right children are dispatched as concurrent tasks using TBB's task_group. Below that threshold we just run the construction serially because teh overhead of task scheduling outweighs the benefits for small workloads!
BVHBuildResult leftResult, rightResult;
if (count >= PARALLEL_THRESHOLD) { // PARALLEL_THRESHOLD = 1024
tbb::task_group tg;
tg.run([&]() { leftResult = buildParallel(std::move(leftIndices), leftBox); });
tg.run([&]() { rightResult = buildParallel(std::move(rightIndices), rightBox); });
tg.wait();
} else {
leftResult = buildParallel(std::move(leftIndices), leftBox);
rightResult = buildParallel(std::move(rightIndices), rightBox);
}
Once the children finish, their independently built arrays are merged back together into the parent's result. The final merged result slots direcrly into the same flat node layout the serial BVH construction produces, so I didn't have to modify any ray traversal code at all! And this appraoch scales well because TBB's work-stealing scheduler is really good at handling load balancing across cores.
With fallat said,the build time is now much smaller with all other stats being the same!
BVH Statistics: Build time: 168.2 ms Triangle count: 544566 Interior nodes: 77570 Leaf nodes: 77571 Avg triangles per leaf: 7.02023 Bytes per node: 32 Total nodes: 155141
C++ library import is a nightmare but we balling. Here is some more stats if you gaf:
gracejin@grace:~/nori-26sp/build$ ./nori ../scenes/a1/ajax-normals.xml --no-gui
Loading "../scenes/a1/ajax.obj" .. done. (V=409676, F=544566, took 1.5s and 18.7 MiB)
BVH Statistics:
Build time: 168.2 ms
Triangle count: 544566
Interior nodes: 77570
Leaf nodes: 77571
Avg triangles per leaf: 7.02023
Bytes per node: 32
Total nodes: 155141
Configuration: Scene[
integrator = NormalIntegrator[],
sampler = Independent[sampleCount=32]
camera = PerspectiveCamera[
cameraToWorld = [ -0.5356, 0.2999, 0.7894, -65.61;
1.353e-05, 0.9348, -0.3551, 47.58;
-0.8445, -0.1902, -0.5007, 24.36;
0, 0, 0, 1],
outputSize = [768, 768],
fov = 30.000000,
clip = [0.000100, 10000.000000],
rfilter = GaussianFilter[radius=2.000000, stddev=0.500000]
],
meshes = {
Mesh[
name = "../scenes/a1/ajax.obj",
vertexCount = 409676,
triangleCount = 544566,
bsdf = Diffuse[
albedo = [0.500000, 0.500000, 0.500000]
],
emitter = null
]
}
]
Rendering .. done. (took 2.2s)
Writing a 768x768 OpenEXR file to "../scenes/a1/ajax-normals"
Writing a 768x768 PNG file to "../scenes/a1/ajax-normals"
gracejin@grace:~/nori-26sp/build$
2/6/2026: starting writing basic, CPU-bound ray tracer in C++.
Question: how do we speed up rendering 500k+ triangles from taking 30+ minutes to take 2 seconds?
We start with a very basic, unoptimized CPU bound codebase. The base code from this library currently implements a brute force ray intersesctor algorithm, in short it traces a ray against all triangles and stores the resulting information in an intersection struct. This shit slow:
/* Brute force search through all triangles */
for (uint32_t idx = 0; idx < m_mesh->getTriangleCount(); ++idx) {
float u, v, t;
if (m_mesh->rayIntersect(idx, ray, u, v, t)) {
/* An intersection was found! Can terminate
immediately if this is a shadow ray query */
if (shadowRay)
return true;
ray.maxt = its.t = t;
its.uv = Point2f(u, v);
its.mesh = m_mesh;
f = idx;
foundIntersection = true;
}
}
Let's do the low hanging fruit of implementing a Thread Bounding Volume Hiearchy, which organizes a scene's triangles into a binary tree where each node stores an axis-aligned bounding box enclosing all geometry in its subtree. During ray tracing, if a ray misses a node's bounding box, the entire subtree is skipped, which turns O(n) brute force search into something closer to O(logn). For the scene I rendered with more than 500k triangles, this optimization determines the difference between minutes and seconds (for the brute force version of my code, I ran it for 30+ minutes before rage quiting.) Here is a picture of BVH acceleration: And, My basic algorithm looks like this:
buildBVH(triangles, bbox):
if no triangles: return null
if triangles ≤ 10: return leaf(triangles)
// SAH binned split along longest axis
axis = longest axis of bbox
divide axis into 12 bins
for each triangle:
assign to bin based on centroid position
// evaluate all 11 candidate splits
bestCost = ∞
for each split plane between bins:
cost = surfaceArea(left) × count(left)
+ surfaceArea(right) × count(right)
if cost < bestCost: record this split
// only split if it's cheaper than a leaf
if bestCost ≥ surfaceArea(bbox) × totalCount:
return leaf(triangles)
partition triangles at best split
return interior(
left = buildBVH(leftTriangles, leftBBox),
right = buildBVH(rightTriangles, rightBBox)
)
rayTraverse(ray):
stack = [root]
while stack not empty:
node = stack.pop()
if ray misses node.bbox: skip
if node is leaf:
test ray against each triangle
if shadow ray and hit: return true
else:
stack.push(left child)
stack.push(right child)
return closest hit
The construction of my BVH is top-down, at each node I use the Surface Area Heuristic to find a good split. I bucket the triangle centroids into a few binds along the bounding box's longest axis, and I evaluate the SAH cost C = S(A) * N_A + S(B) * N_B for each candidate split plane. The split that minimizes the cost is chosen, My design for ray traversal uses an explicit stack rather than recursion, which a fixed-size array of 64 entries (more than enough) to avoid any per-ray heap allocation:
uint32_t stack[64];
int stackPtr = 0;
stack[stackPtr++] = 0; // start at root
while (stackPtr > 0) {
const BVHNode &node = m_nodes[stack[--stackPtr]];
if (!node.bbox.rayIntersect(ray))
continue;
if (node.isLeaf()) {
// test triangles in this leaf
} else {
stack[stackPtr++] = nodeIdx + 1; // left child
stack[stackPtr++] = node.start; // right child
}
}
After I implemented this optimization, I ultimately shaved the brute-force render time from over half an hour to about 2 seconds. This is 800x faster!