2011-07-07 00:47:38 UTC
The last time I visited the DTrace folks in person, we talked about wishing
there were a way to allocate kernel memory without going to sleep AND without
causing unnatural amounts of reclamation. I don't remember exactly how or
why DTrace needs this, but I can say with certainty that a fair amount of
networking-related allocation could benefit from a lighter-weight
non-blocking allocation as well.
Turns out we have one in Illumos now:
everywhere(common/sys)% hg log kmem.h | head -15
user: Tom Erickson <***@Sun.COM>
date: Thu Jun 24 11:35:31 2010 -0700
6710343 dnode cache should register a dnode_move() callback to limit fragmentation
6583724 dnode_create should not call kmem_cache_constructor directly
6374545 disk write cache flush code overloads buf_t.b_list pointer
user: Stan Studzinski <***@Sun.COM>
date: Tue Apr 13 11:03:56 2010 -0700
6675738 KM_NOSLEEP may still try too hard for some allocations
The fix for 6675738 adds a new flag, KM_NORMALPRI, that can be ORed with
KM_NOSLEEP. With both set, an allocation still never blocks, but under memory
pressure it returns NULL much more quickly, because the VM system won't be
invoked fully to reclaim pages.
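
For anyone who wants to try it, here is a minimal sketch of what a caller might look like. Only kmem_alloc(), kmem_free(), KM_NOSLEEP, and KM_NORMALPRI are real names; demo_sa_t, demo_sa_create(), demo_sa_destroy(), and sim_low_memory are made up for illustration. The stubs at the top are userland stand-ins so the fragment compiles outside the kernel, and their flag values are placeholders. In kernel code you would include <sys/kmem.h> instead.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userland stand-ins so this sketch compiles outside the kernel.
 * In real kernel code, include <sys/kmem.h> and delete this section.
 * The flag values below are placeholders for illustration.
 */
#define	KM_NOSLEEP	0x0001
#define	KM_NORMALPRI	0x0008

static int sim_low_memory = 0;	/* hypothetical knob: force a failure */

static void *
kmem_alloc(size_t size, int kmflag)
{
	(void) kmflag;		/* the stub ignores the flags */
	return (sim_low_memory ? NULL : malloc(size));
}

static void
kmem_free(void *buf, size_t size)
{
	(void) size;
	free(buf);
}

/* Hypothetical record, standing in for something like an IPsec SA. */
typedef struct demo_sa {
	unsigned int	ds_spi;
	unsigned char	ds_key[32];
} demo_sa_t;

/*
 * Allocate without blocking and without asking the VM system to try
 * hard. The caller must be prepared for NULL and fail the operation.
 */
static demo_sa_t *
demo_sa_create(unsigned int spi)
{
	demo_sa_t *sa;

	sa = kmem_alloc(sizeof (*sa), KM_NOSLEEP | KM_NORMALPRI);
	if (sa == NULL)
		return (NULL);	/* give up early; no retry loop */
	memset(sa, 0, sizeof (*sa));
	sa->ds_spi = spi;
	return (sa);
}

static void
demo_sa_destroy(demo_sa_t *sa)
{
	kmem_free(sa, sizeof (*sa));
}
```

The point of the pattern is that the NULL path has to be cheap and safe, since with KM_NORMALPRI set it should be taken sooner than with KM_NOSLEEP alone.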
I recalled from talking with the DTrace folks that the problem was
reproducible. A discussion today with Bryan confirmed my recollection:
    Yes! I would love to make DTrace behave better in these situations
    -- especially because the test that deliberately induces this
    condition is really, really painful on non-DEBUG kernels (to the
    point that it will often just run the system out of memory and hang
    out waiting for more). So please send a note to illumos-dev (and
    dtrace-discuss?). This fix should be a no-brainer provided that
    KM_NORMALPRI is implemented correctly: it should be of verifiably low
    risk and demonstrably high value...
So here I am mentioning this idea to folks in both audiences. I'm a little
distracted with Nexenta-driven work, alas, but there's gotta be someone out
there who can try this out with DTrace? I'll even entertain reviewing
attempts to make the TCP/IP stack better here, and if you want to keep it
small, start with the creation of IPsec SAs (easily tested with massive
(ab)use of ipseckey(1M)'s "add" command).
As for KM_NORMALPRI's implementation: I went poking down that rabbit hole
after landing at Nexenta. At first glance it gives up and returns NULL at
the appropriate times (it has to hit the VM layer before doing so; there's
a VM_NORMALPRI as well), but it might require further review.