Discussion:
ustack() misses caller of leaf under profile probe
Ryan Johnson
2010-07-14 10:05:38 UTC
Permalink
Hi all,

There was a discussion about this a while back, but now some new
information has come to light:

Consider the situation where a thread calls a leaf function which does
not create a stack frame (such as atomic_*). The following toy program
demonstrates this scenario:

--- begin backtrace.c ------------
#include <atomic.h>
unsigned int x;
int foo() {
int y = 1;
while(y != 100) {
y = atomic_swap_32(&x, y);
}
return x;
}
int main() {
return 10 + foo();
}
--- end backtrace.c ------------

If we grab stack frames by instrumenting the function directly --
"pid$target::atomic_swap_32:entry {@counts[ustack()]=count();}" -- then
we get:

libc.so.1`atomic_swap_32
a.out`foo+0x2c
a.out`main+0x4
a.out`_start+0x108
1054138

If, on the other hand, we profile the process --
"profile-7777us/pid==$target/ {@counts[ustack()]=count(); }" -- and grab
stack frames again, we get (among other far rarer variants):

libc.so.1`atomic_swap_32
a.out`main+0x4
a.out`_start+0x108
1167

Note that this is *not* related to tail calls (there are none here).
Since in both cases $pc is the same, the problem must be due to either
the profile probe's way of saving process context or else the stack
walking code doing something wrong under profiling...

Ideas?
Ryan
Adam Leventhal
2010-07-19 17:55:47 UTC
Permalink
Hey Ryan,

We have special logic in pid provider entry and return probes to account for the fact that they execute in leaf call context. Unfortunately, this is not generally possible.

Search for uses of CPU_DTRACE_ENTRY on cvs.opensolaris.org.

Adam
Post by Ryan Johnson
Hi all,
--- begin backtrace.c ------------
#include <atomic.h>
unsigned int x;
int foo() {
int y = 1;
while(y != 100) {
y = atomic_swap_32(&x, y);
}
return x;
}
int main() {
return 10 + foo();
}
--- end backtrace.c ------------
libc.so.1`atomic_swap_32
a.out`foo+0x2c
a.out`main+0x4
a.out`_start+0x108
1054138
libc.so.1`atomic_swap_32
a.out`main+0x4
a.out`_start+0x108
1167
Note that this is *not* related to tail calls (there are none here). Since in both cases $pc is the same, the problem must be due to either the profile probe's way of saving process context or else the stack walking code doing something wrong under profiling...
Ideas?
Ryan
_______________________________________________
dtrace-discuss mailing list
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Ryan Johnson
2010-07-20 15:50:22 UTC
Permalink
Hi Adam,

CPU_DTRACE_ENTRY: that would explain the difference between the two use
cases...

Not generally possible: I guess this is technically true, thanks to
trampolines, tail calls, and JMPL whose source registers have probably
been clobbered. However, we should at least be able to detect vanilla
CALL reliably by decoding %i7 (I've never heard of a compiler clobbering
it). The attached scripts do this. The dtrace script reads in the
instruction at %i7, decodes it and computes the target when possible,
and emits the result at the head of each stack trace. The awk script
(which takes the dtrace script's output as input) then checks whether
i7's target points to the same function as the leaf's caller and decide
whether to filter it out.

Running the two scripts on a process, I got 13.5k total samples, with
with 214 of those unable to read %i7 for some reason; of the valid
samples, 8.2k had %i7's target point to a different function than the leaf.

I've checked source code for a smattering of the samples, and every time
the i7-enhanced stack trace is the correct one.

Maybe this won't ever get baked into dtrace, but at least there's a
workaround now if other folks need it.

Also, it might be good to update the wiki with this gotcha. Missing tail
calls is one thing... silently missing legitimate callers who made a
stack frame is pretty annoying (and makes profiling significantly less
useful if you're hunting for the caller of that expensive function).

Regards,
Ryan
Post by Adam Leventhal
Hey Ryan,
We have special logic in pid provider entry and return probes to account for the fact that they execute in leaf call context. Unfortunately, this is not generally possible.
Search for uses of CPU_DTRACE_ENTRY on cvs.opensolaris.org.
Adam
Post by Ryan Johnson
Hi all,
--- begin backtrace.c ------------
#include<atomic.h>
unsigned int x;
int foo() {
int y = 1;
while(y != 100) {
y = atomic_swap_32(&x, y);
}
return x;
}
int main() {
return 10 + foo();
}
--- end backtrace.c ------------
libc.so.1`atomic_swap_32
a.out`foo+0x2c
a.out`main+0x4
a.out`_start+0x108
1054138
libc.so.1`atomic_swap_32
a.out`main+0x4
a.out`_start+0x108
1167
Note that this is *not* related to tail calls (there are none here). Since in both cases $pc is the same, the problem must be due to either the profile probe's way of saving process context or else the stack walking code doing something wrong under profiling...
Ideas?
Ryan
_______________________________________________
dtrace-discuss mailing list
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Adam Leventhal
2010-07-21 00:18:11 UTC
Permalink
Not generally possible: I guess this is technically true, thanks to trampolines, tail calls, and JMPL whose source registers have probably been clobbered. However, we should at least be able to detect vanilla CALL reliably by decoding %i7 (I've never heard of a compiler clobbering it). The attached scripts do this. The dtrace script reads in the instruction at %i7, decodes it and computes the target when possible, and emits the result at the head of each stack trace. The awk script (which takes the dtrace script's output as input) then checks whether i7's target points to the same function as the leaf's caller and decide whether to filter it out.
It it not generally possible to determine the the probe fires whether or not the thread is currently executing in leaf context. All symbolic translations happens in user-land after the fact for both ustack() and uaddr().
Running the two scripts on a process, I got 13.5k total samples, with with 214 of those unable to read %i7 for some reason; of the valid samples, 8.2k had %i7's target point to a different function than the leaf.
I've checked source code for a smattering of the samples, and every time the i7-enhanced stack trace is the correct one.
Maybe this won't ever get baked into dtrace, but at least there's a workaround now if other folks need it.
What we could do is have the ustack() action record %o7 as well and then figure out in user-land whether or not it's relevant.
Also, it might be good to update the wiki with this gotcha. Missing tail calls is one thing... silently missing legitimate callers who made a stack frame is pretty annoying (and makes profiling significantly less useful if you're hunting for the caller of that expensive function).
Good idea. Propose some changes to the list and we'll get it reviewed.

Thanks.

Adam
Regards,
Ryan
Post by Adam Leventhal
Hey Ryan,
We have special logic in pid provider entry and return probes to account for the fact that they execute in leaf call context. Unfortunately, this is not generally possible.
Search for uses of CPU_DTRACE_ENTRY on cvs.opensolaris.org.
Adam
Post by Ryan Johnson
Hi all,
--- begin backtrace.c ------------
#include<atomic.h>
unsigned int x;
int foo() {
int y = 1;
while(y != 100) {
y = atomic_swap_32(&x, y);
}
return x;
}
int main() {
return 10 + foo();
}
--- end backtrace.c ------------
libc.so.1`atomic_swap_32
a.out`foo+0x2c
a.out`main+0x4
a.out`_start+0x108
1054138
libc.so.1`atomic_swap_32
a.out`main+0x4
a.out`_start+0x108
1167
Note that this is *not* related to tail calls (there are none here). Since in both cases $pc is the same, the problem must be due to either the profile probe's way of saving process context or else the stack walking code doing something wrong under profiling...
Ideas?
Ryan
_______________________________________________
dtrace-discuss mailing list
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
<profile-i7.d><profile-i7.awk>
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Ryan Johnson
2010-07-21 06:46:07 UTC
Permalink
Post by Adam Leventhal
Not generally possible: I guess this is technically true, thanks to trampolines, tail calls, and JMPL whose source registers have probably been clobbered. However, we should at least be able to detect vanilla CALL reliably by decoding %i7 (I've never heard of a compiler clobbering it). The attached scripts do this. The dtrace script reads in the instruction at %i7, decodes it and computes the target when possible, and emits the result at the head of each stack trace. The awk script (which takes the dtrace script's output as input) then checks whether i7's target points to the same function as the leaf's caller and decide whether to filter it out.
It it not generally possible to determine the the probe fires whether or not the thread is currently executing in leaf context. All symbolic translations happens in user-land after the fact for both ustack() and uaddr().
I'm afraid you lost me. What does symbolic translation have to do with
detecting the presence of a stack frame? I could imagine it being
possible to stash this info in debug sections (at least for dwarf2), but
I don't think that's what you're referring to here.

By 'leaf' I just meant 'the top of the user stack' (as in the analyzer's
'leaf is' predicate) but maybe there's a more precise definition of leaf
I wasn't aware of?
Post by Adam Leventhal
Running the two scripts on a process, I got 13.5k total samples, with with 214 of those unable to read %i7 for some reason; of the valid samples, 8.2k had %i7's target point to a different function than the leaf.
I've checked source code for a smattering of the samples, and every time the i7-enhanced stack trace is the correct one.
Maybe this won't ever get baked into dtrace, but at least there's a workaround now if other folks need it.
What we could do is have the ustack() action record %o7 as well and then figure out in user-land whether or not it's relevant.
Actually, I tried %o7 first, and it's pretty dodgy -- once a stack frame
has been created, it can hold anything. CC seems to use it a lot for
bit-shifting and as the target of JMPL. In contrast, %i7 is (nearly)
always valid.
Post by Adam Leventhal
Also, it might be good to update the wiki with this gotcha. Missing tail calls is one thing... silently missing legitimate callers who made a stack frame is pretty annoying (and makes profiling significantly less useful if you're hunting for the caller of that expensive function).
Good idea. Propose some changes to the list and we'll get it reviewed.
Two changes, I think. BTW, is this a sparc-only issue or would it also
affect x86?

http://wikis.sun.com/display/DTrace/Actions+and+Subroutines#ActionsandSubroutines-%7B%7Bustack%7D%7D,
needs something resembling:

Limitations: Because ustack() must traverse stack frames to build its
stack trace, functions which do not establish a stack frame can lead to
unpredictable results. In particular

* Functions making tail calls will not appear because they tear down
their own stack frame before making the call.
* Except inside function entry probes, leaf functions which have not
(yet) established a stack frame sometimes prevent their caller
from appearing in the stack trace (e.g. foo -> bar -> baz will
appear as foo -> baz). See <link-to-note-at-profile-provider>.


http://wikis.sun.com/display/DTrace/profile+Provider needs a new section
at the end (before 'Stability'):

Limitations:

ustack() only reports the caller of a leaf function if the latter has
established a stack frame. Otherwise, the caller's caller appears to
call the leaf function (e.g. foo -> bar -> baz will be repoted as foo ->
baz). This behavior occurs because dtrace has no way to determine
whether the current top of stack has established a frame or not. DTrace
can avoid the problem with function entry probes because they always
fire before the target has established a stack frame, and with the
return address is in a known location, neither of which is necessarily
true in profile probe context.

Thoughts?
Ryan
Adam Leventhal
2010-07-22 01:15:21 UTC
Permalink
Post by Adam Leventhal
It it not generally possible to determine the the probe fires whether or not the thread is currently executing in leaf context. All symbolic translations happens in user-land after the fact for both ustack() and uaddr().
I'm afraid you lost me. What does symbolic translation have to do with detecting the presence of a stack frame? I could imagine it being possible to stash this info in debug sections (at least for dwarf2), but I don't think that's what you're referring to here.
Neither the symbol table nor stabs-like sections are available from the context in which the probe fires.
By 'leaf' I just meant 'the top of the user stack' (as in the analyzer's 'leaf is' predicate) but maybe there's a more precise definition of leaf I wasn't aware of?
Yes, I understand what you mean.
Post by Adam Leventhal
What we could do is have the ustack() action record %o7 as well and then figure out in user-land whether or not it's relevant.
Actually, I tried %o7 first, and it's pretty dodgy -- once a stack frame has been created, it can hold anything. CC seems to use it a lot for bit-shifting and as the target of JMPL. In contrast, %i7 is (nearly) always valid.
In leaf-context, %o7 is all you have -- the %is and %ls belong to the previous frame.
Two changes, I think. BTW, is this a sparc-only issue or would it also affect x86?
This would apply for x86/x64 as well as SPARC.
Limitations: Because ustack() must traverse stack frames to build its stack trace, functions which do not establish a stack frame can lead to unpredictable results. In particular
• Functions making tail calls will not appear because they tear down their own stack frame before making the call.
• Except inside function entry probes, leaf functions which have not (yet) established a stack frame sometimes prevent their caller from appearing in the stack trace (e.g. foo -> bar -> baz will appear as foo -> baz). See <link-to-note-at-profile-provider>.
ustack() only reports the caller of a leaf function if the latter has established a stack frame.
A leaf function is typically used to refer to one that doesn't establish a stack frame.
Otherwise, the caller's caller appears to call the leaf function (e.g. foo -> bar -> baz will be repoted as foo -> baz). This behavior occurs because dtrace has no way to determine whether the current top of stack has established a frame or not. DTrace can avoid the problem with function entry probes because they always fire before the target has established a stack frame, and with the return address is in a known location, neither of which is necessarily true in profile probe context.
Close enough.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Ryan Johnson
2010-07-22 10:09:14 UTC
Permalink
This post might be inappropriate. Click to display it.
Nicolas Williams
2010-08-23 19:07:19 UTC
Permalink
Post by Adam Leventhal
Post by Ryan Johnson
Post by Adam Leventhal
It it not generally possible to determine the the probe fires
whether or not the thread is currently executing in leaf context.
All symbolic translations happens in user-land after the fact for
both ustack() and uaddr().
I'm afraid you lost me. What does symbolic translation have to do
with detecting the presence of a stack frame? I could imagine it
being possible to stash this info in debug sections (at least for
dwarf2), but I don't think that's what you're referring to here.
Neither the symbol table nor stabs-like sections are available from
the context in which the probe fires.
Incidentally, for frame-pointer-less code this is fatal to ustack().

GCC generates debug code for traversing stacks of frame-pointer-less
code. That debug code requires interpretation, and it requires the live
stack. That implies that DTrace would have to have the debug sections
mapped in and the interpreter code (running in dtrace context!). DTrace
context can't page fault, so debug sections would have to be pinned in
memory by the kernel, and that's not bloody likely (particularly if
there is memory pressure).

Nico
--

Loading...