Bryan Cantrill
2011-07-02 18:27:14 UTC
All,
A longstanding problem that we have had is that enablings on defunct
providers (e.g., USDT probes on dead processes) are not reaped: the
probes will exist as long as there exists an enabling for them. When
processes are turning over frequently (or when enablings are
long-running), this can clog up the probe space to the point that
DTrace probe creation will silently fail (an absolutely maddening
failure mode). This has been hit several times over the years (we
were nailed by it on our build machines at Fishworks) -- so when Theo
Schlossnagle mentioned to me that he was getting killed by this
problem in an environment with rapidly turning over Postgres
processes, I was embarrassed that I hadn't tackled it earlier. As it
turns out, it was a tad thorny for locking reasons, but a patch for
this problem is attached. We have integrated this into our bits at
Joyent (internal ticket is OS-454, "enablings on defunct providers
prevent providers from unregistering"), so you'll see this show up
soon at http://github.com/joyent/illumos-joyent -- but I wanted to
give everyone here a heads-up.

Anyway, patch is attached, with my thanks to Adam for a helpful
discussion on fasttrap's asynchronous provider retiring mechanics.
Note that Adam hasn't (yet) reviewed this, and its integration
upstream should wait until he's had a chance to look it over. Please
let me know if you have any questions or comments!
Thanks,
Bryan