Friday was my last day working on TensorFlow at Google. The past five years were a lot of fun, and I feel incredibly lucky to have been in the war room at 6 AM when we launched the project into the world. Since then, it’s been inspiring to see how people have used TensorFlow, from all the folks asking questions on Stack Overflow, through the machine learning and systems research that built on our work, to the huge software ecosystem that has grown up around it.

Before I move on, I wanted to reminisce about what I enjoyed most about working at Google. Google is well-known for the in-house developer infrastructure that its software engineers use every day. Among my favorites as a curious engineer, Code Search and Piper offer a way to learn how almost anything the company uses was implemented, including several systems whose papers have been required reading in graduate-level systems classes for more than a decade (and even the edit histories for some of those papers).

For me, however, the most vital developer infrastructure was the Google Wide Profiler (GWP). GWP aggregates profiles from all code running in production, so you can understand how it performs under realistic conditions. Once a piece of code becomes heavily used, it starts showing up in GWP, letting you measure the impact of performance optimizations in terms of thousands of CPUs saved, or more. Sometimes a straightforward C++ change, like adding a std::move() or switching a std::string argument to take a std::string_view, will save a few hundred cores.1 In terms of effort-to-impact ratio, this single-character change to the tf.concat() implementation will be hard to beat. But GWP can also spur deeper investigations, and I wanted to share one: the story of my final significant contribution to TensorFlow.
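
To make the flavor of these micro-optimizations concrete, here is a minimal standalone sketch; the function and type names are mine, not TensorFlow’s. Passing a std::string where a std::string_view would do forces callers holding a literal or a substring to allocate, and std::move() lets a temporary be stolen rather than copied.

    #include <string>
    #include <string_view>
    #include <utility>

    // Hypothetical example: checking a device-name prefix on a hot path.
    // Taking const std::string& forces callers that hold a string literal or
    // a substring to materialize a heap-allocated std::string first.
    bool HasGpuPrefixOld(const std::string& device_name) {
      return device_name.rfind("/device:GPU", 0) == 0;
    }

    // Taking std::string_view accepts the same callers without any allocation.
    bool HasGpuPrefixNew(std::string_view device_name) {
      return device_name.substr(0, 11) == "/device:GPU";
    }

    // Similarly, moving a temporary into a member avoids one more copy.
    struct NodeInfo {
      std::string name;
      explicit NodeInfo(std::string n) : name(std::move(n)) {}  // was: name(n)
    };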

A GWP story: optimizing TensorFlow’s executor

Most mornings, I would check in with GWP to see which functions in TensorFlow had been hot over the previous day, and whether there were any surprising new entries or regressions. One day during mandatory work-from-home, I ran my usual GWP query, and a familiar symbol stared back at me: tensorflow::(anonymous namespace)::ExecutorState::PropagateOutputs(). This method is the heart of TensorFlow’s dataflow executor, and it is responsible for two things: (i) forwarding output tensors from one kernel to the kernels that consume them, and (ii) adding kernels to the scheduler queue when they become runnable. In other words, it is literally the method that makes tensors flow in TensorFlow. It was also painfully inefficient, and I didn’t want to leave it as unfinished business.

For sure, PropagateOutputs() is non-trivial code, and better engineers than I have optimized it down to the level of structure packing and bitfields. Even with these optimizations, it remained stubbornly expensive, made worse by the contended mutex that guards the state update. This mutex would cause most kernel completions within a single graph to run serially, and put worker threads to sleep when they still had a lot of work to do. This was particularly bad when you had a lot of fine-grained and potentially parallelizable kernels in your graph, which is common in both inference and tf.data input preprocessing.

Despite this complexity, the high-level logic in PropagateOutputs() is simple, based on Kahn’s algorithm for topological sorting (sketched in code after the list):

  1. Each kernel starts out with a “pending count” equal to its in-degree.
  2. When a kernel finishes, it forwards its outputs along edges in the graph, and decrements the pending count for each consumer kernel.
  3. When the pending count for a kernel hits zero, it becomes runnable, and we add it to the ready queue.
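
In C++-flavored pseudocode, and ignoring control flow entirely, that propagation step looks roughly like the sketch below. The types and names are invented; the real PropagateOutputs() also forwards the actual output tensors and keeps its per-node state in carefully packed structures.

    #include <vector>

    // Hypothetical per-kernel bookkeeping for a control-flow-free graph.
    struct Kernel {
      std::vector<Kernel*> consumers;  // outgoing edges in the dataflow graph
      int pending_count = 0;           // initialized to the kernel's in-degree
    };

    // Called when `finished` completes: decrement each consumer's pending
    // count and enqueue any consumer whose count reaches zero.
    void PropagateOutputs(Kernel* finished, std::vector<Kernel*>* ready) {
      for (Kernel* consumer : finished->consumers) {
        // (Forwarding the output tensors themselves is elided here.)
        if (--consumer->pending_count == 0) {
          ready->push_back(consumer);  // all inputs available: runnable
        }
      }
    }

In the real executor, those pending-count updates happen under the mutex described above.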

Unfortunately, it’s a bit more complicated than that: the propagation rules are different if the consumer is a special control flow kernel, or if the edge is in a dead branch of a control flow construct (e.g. the untaken branch of a tf.cond()).2 That means PropagateOutputs() had to load the graph node structure for each consumer kernel, which was unlikely to be in the cache already, and then evaluate hard-to-predict branches based on it. On top of this overhead, the mutex in PropagateOutputs() had bothered me for a long time. It always seemed possible to model the “pending count” as an atomic reference count and remove the mutex altogether for some graphs. However, the code is intricate, and so heavily used that one false step changing it could easily cost more than my salary in increased utilization.

The answer seemed clear: we needed one executor for graphs with control flow, and another, simpler executor for graphs without it. The original release of TensorFlow actually had this split, but it was too tedious to keep the duplicated logic in sync, so we combined them. To land the optimization, I needed a zero-cost way to dispatch to different implementations of PropagateOutputs() based on the static graph topology: even making PropagateOutputs() a virtual method would slow down existing users too much. I’m quite pleased with the following refactoring steps, which achieved my goal without too much churn or duplication (a rough sketch follows the list):

  1. Split the propagation and execution logic into two classes, PropagatorState and ExecutorState<PropagatorStateType>.
  2. Implement a SimplePropagatorState class with the same interface as PropagatorState, and instantiate an ExecutorState<SimplePropagatorState> when the graph contains no control flow. The SimplePropagatorState avoids unnecessary indirections, dynamic allocations, and branches when running simple graphs.
  3. Replace the PendingCounts in SimplePropagatorState with atomic counts for each kernel, and finally get rid of the infernal mutex.
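
The class names ExecutorState, PropagatorState, and SimplePropagatorState come from the steps above, but everything else in the following sketch (member names, method signatures, the exact atomic scheme) is invented for illustration; the real code is considerably more involved.

    #include <atomic>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Full-featured propagator: handles control flow, dead edges, and frames,
    // with a mutex guarding shared state (details elided).
    class PropagatorState { /* ... */ };

    // Simplified propagator for graphs without control flow: one atomic
    // pending count per kernel, and no mutex at all.
    class SimplePropagatorState {
     public:
      explicit SimplePropagatorState(const std::vector<int>& in_degrees)
          : pending_(in_degrees.size()) {
        for (std::size_t i = 0; i < in_degrees.size(); ++i) {
          pending_[i].store(in_degrees[i], std::memory_order_relaxed);
        }
      }

      // Returns true if kernel `id` just became runnable (its count hit zero).
      bool DecrementPending(int id) {
        return pending_[id].fetch_sub(1, std::memory_order_acq_rel) == 1;
      }

     private:
      std::vector<std::atomic<int>> pending_;
    };

    // The executor is templated on its propagator, so each instantiation is
    // statically specialized: no virtual dispatch on the hot path, and the
    // compiler can inline the simple, lock-free propagation logic.
    template <class PropagatorStateType>
    class ExecutorState {
     public:
      explicit ExecutorState(PropagatorStateType propagator)
          : propagator_(std::move(propagator)) {}
      // PropagateOutputs() and friends call directly into propagator_.
     private:
      PropagatorStateType propagator_;
    };

    // At executor construction time, static graph analysis picks the
    // instantiation:
    //   graph contains control flow -> ExecutorState<PropagatorState>
    //   otherwise                   -> ExecutorState<SimplePropagatorState>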

Along the way, I relied on a growing suite of microbenchmarks to make sure that my changes didn’t slow down existing users. For a general-purpose component like the executor, it can be hard to predict which benchmarks will be useful, so I used measurements from GWP to write microbenchmarks that tracked realistic usage patterns.3 The microbenchmarks covered a suite of synthetic graphs and, when run under pprof, gave me an easy way to visualize executor performance as a flame graph or to drill down to the time spent in individual instructions.
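
For the flavor of such a microbenchmark, here is a standalone Google Benchmark-style sketch rather than the actual TensorFlow benchmark code; BuildChainGraph() and RunGraph() are hypothetical stand-ins for building a synthetic graph and pushing it through the executor.

    #include <benchmark/benchmark.h>

    // Hypothetical stand-ins for graph construction and execution.
    struct SyntheticGraph { int num_kernels = 0; };

    SyntheticGraph BuildChainGraph(int num_kernels) { return {num_kernels}; }

    void RunGraph(const SyntheticGraph& graph) {
      // Placeholder work; the real benchmark would exercise
      // PropagateOutputs() and the scheduler for every kernel in the graph.
      volatile int sink = 0;
      for (int i = 0; i < graph.num_kernels; ++i) sink = sink + i;
    }

    // A long chain of cheap kernels: a shape that, per GWP, is common in
    // inference and tf.data input pipelines and stresses executor overhead
    // rather than the kernels themselves.
    static void BM_ChainOfCheapKernels(benchmark::State& state) {
      SyntheticGraph graph = BuildChainGraph(static_cast<int>(state.range(0)));
      for (auto _ : state) {
        RunGraph(graph);
      }
      state.SetItemsProcessed(state.iterations() * state.range(0));
    }
    BENCHMARK(BM_ChainOfCheapKernels)->Arg(100)->Arg(10000);

    BENCHMARK_MAIN();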

The results so far have been pretty encouraging. As the days went by, GWP showed the fraction of time spent in PropagateOutputs() decreasing as users rebuilt their code against the latest version. There were some decent reductions in inference latency, with some users reporting up to a 10% improvement end-to-end. If you want to try out the new code, the changes will be in the upcoming 2.3 release, and they are available today on GitHub and in tf-nightly. There are still opportunities to improve things: in particular, it feels like it should be possible to extend the atomic optimization to at least some graphs that have control flow (tf.cond() at least, if not tf.while_loop()). If you see something that could be done better, I hope you’ll consider submitting a pull request!

Postscript: September 5th, 2022

Apparently I’m not the only one who cares about speeding up TensorFlow’s executor! After I picked the low-hanging fruit, two more intrepid engineers have since stepped up to optimize the harder, more general executor:

These are much trickier optimizations than the fairly mechanical one I described above, and each would surely make an interesting write-up.

There is still one useful optimization that could apply to the executor, if anyone’s interested: when TensorFlow executes a multi-device graph, each per-device subgraph is passed to the slower executor, even when there is no control flow in the graph. In my original change, I assumed that any _Recv node could produce a dead output, because there was no way to tell locally if the subgraph is nested within some other control flow construct (and left myself a TODO to fix this!). If this information were passed down from the graph partitioner when control flow is added (e.g. by undeprecating this field in the control message), the lock-free executor would work for many large partitioned graphs that turn up in multi-device inference or parameter server training (if anyone still does that…).


  1. Many of these can be found automatically with tools like clang-tidy.

  2. We wrote a paper about TensorFlow’s control flow scheme, but unfortunately it doesn’t go into the fine details of efficient executor implementations. With hindsight, we should have evaluated how adding control flow support affected the performance of graphs without control flow.

  3. I could go on at length about how usage shifted and how a lot of our original assumptions were invalidated over time. Graphs with thousands of fine-grained “inexpensive” kernels, and kernels with huge in- and out-degrees, are just two examples that particularly stressed the executor.