On using KM_NORMALPRI for non-blocking AND non-destructive allocs

Dan McDonald

2011-07-07 00:47:38 UTC

Hello Illumos and DTrace communities.

The last time I visited the DTrace folks in person, we discussed how we wished
there were a way to allocate kernel memory without going to sleep AND without
causing unnatural amounts of reclamation. I don't remember exactly how or
why DTrace desires this, but I can say with certainty that there's a fair
amount of networking-related allocation that could benefit from a
lighter-weight non-blocking allocation as well.

Turns out we have one in Illumos now:

everywhere(common/sys)[0]% pwd
/export/home/danmcd/ws/illumos-clone/usr/src/uts/common/sys
everywhere(common/sys)[0]% hg log kmem.h | head -15
changeset: 12684:397e44ebb8a9
user: Tom Erickson <***@Sun.COM>
date: Thu Jun 24 11:35:31 2010 -0700
description:
6710343 dnode cache should register a dnode_move() callback to limit fragmentation
6583724 dnode_create should not call kmem_cache_constructor directly
6374545 disk write cache flush code overloads buf_t.b_list pointer

changeset: 12156:3c537b2a7425
user: Stan Studzinski <***@Sun.COM>
date: Tue Apr 13 11:03:56 2010 -0700
description:
6675738 KM_NOSLEEP may still try too hard for some allocations

The fix for 6675738 adds a new flag, KM_NORMALPRI, that can be ORed with
KM_NOSLEEP. The allocation still never blocks, and it returns NULL much more
quickly because the VM system won't be invoked fully to reclaim pages.
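
For anyone who hasn't used the flag, here is a minimal sketch of the calling
pattern. It is written so it compiles in userland: kmem_alloc() and
kmem_free() below are stubs standing in for the real kmem_alloc(9F) routines,
the flag values are illustrative assumptions rather than copies from
<sys/kmem.h>, and sim_low_memory, sa_entry_t, and sa_entry_create() are
hypothetical names made up for the example. The only part meant to carry over
to kernel code is the flag combination and the NULL check.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Userland stand-ins so the sketch compiles outside the kernel. In real
 * illumos code these come from <sys/kmem.h>; the numeric values here are
 * assumptions for illustration only.
 */
#define	KM_NOSLEEP	0x0001
#define	KM_NORMALPRI	0x0008

/* Hypothetical knob: lets the stub pretend memory is tight. */
static int sim_low_memory = 0;

/*
 * Stub with the same shape as kmem_alloc(9F). It only models the one
 * behavior the example cares about: a KM_NOSLEEP request may return NULL.
 * It does not model reclaim, and it never blocks for sleeping requests.
 */
static void *
kmem_alloc(size_t size, int kmflags)
{
	if ((kmflags & KM_NOSLEEP) && sim_low_memory)
		return (NULL);
	return (malloc(size));
}

/* Stub matching kmem_free(9F); the size argument is unused here. */
static void
kmem_free(void *buf, size_t size)
{
	(void) size;
	free(buf);
}

/* Hypothetical object, loosely in the spirit of an IPsec SA. */
typedef struct sa_entry {
	unsigned int sa_spi;
} sa_entry_t;

/*
 * Allocate without sleeping AND without asking for aggressive reclaim.
 * The caller must be prepared for NULL, and should expect it sooner than
 * with KM_NOSLEEP alone.
 */
static sa_entry_t *
sa_entry_create(unsigned int spi)
{
	sa_entry_t *sa;

	sa = kmem_alloc(sizeof (*sa), KM_NOSLEEP | KM_NORMALPRI);
	if (sa == NULL)
		return (NULL);	/* fail fast; let the caller drop or retry */

	sa->sa_spi = spi;
	return (sa);
}
```

The design point is that any existing KM_NOSLEEP caller already has to handle
NULL, so ORing in KM_NORMALPRI should not need new error paths; it only
changes how hard the system tries before giving up.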

I recalled from talking with the DTrace folks that the problem was
reproducible. A discussion today with Bryan confirmed my recollection:

> Yes! I would love to make DTrace behave better in these situations
> -- especially because the test that deliberately induces this
> condition is really, really painful on non-DEBUG kernels (to the
> point that it will often just run the system out of memory and hang
> out waiting for more). So please send a note to illumos-dev (and
> dtrace-discuss?). This fix should be a no-brainer provided that
> KM_NORMALPRI is implemented correctly: it should be of verifiably low
> risk and demonstrably high value...

So here I am mentioning this idea to folks in both audiences. I'm a little
distracted with Nexenta-driven work, alas, but there's gotta be someone out
there who can try this out with DTrace? I'll even entertain reviewing
attempts to make the TCP/IP stack better here, and if you want to keep it
small, start with the creation of IPsec SAs (easily tested with massive
(ab)use of ipseckey(1M)'s "add" command).

As for KM_NORMALPRI's implementation: I went poking down that rabbit hole
after landing at Nexenta. At first glance it gives up and returns NULL at
the appropriate times (it has to hit the VM layer before doing so; there's
a VM_NORMALPRI flag there as well), but it might require further review.

Whatcha think?

Thanks,
Dan