Discussion:
dtrace (userland) seg faults while processing aggregates
Ryan Johnson
2010-07-15 13:51:12 UTC
Hi all,

Odd behavior here: I'm using the profile probe to sample stack traces in
my app, 100+ threads banging hard on a T5440. The script exits after 30
seconds of profiling and seg faults while trying to print the aggregated
results:
$ dbx - core
Corefile specified executable: "/usr/sbin/sparcv9/dtrace"
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.8' in
your .dbxrc
Reading /usr/sbin/sparcv9/dtrace
core file header read successfully
<snipped long list of .so files>
program terminated by signal SEGV (no mapping at the fault address)
0xffffffff7e8a8ff0: _ndoprnt+0x0004: stx %i2, [%sp + 2223]
(dbx) bt
=>[1] _ndoprnt()
[2] snprintf()
[3] dtrace_uaddr2str()
[4] pfprint_uaddr()
[5] dt_printf_format()
[6] dt_fprinta()
[7] dt_aggregate_walk_sorted()
[8] dtrace_fprinta()
[9] dt_consume_cpu()
[10] dtrace_consume()
[11] dtrace_work()
[12] main()
The problem seems related to the fact that I had specified 'dtrace -x
aggsize=256k' -- rerunning with aggsize=2048k makes the problem go away.
Note that I didn't see any messages from dtrace about drops, so I can't
say whether 256k should have been enough or not. However, the app has a
lot of code and deep call stacks, so I could easily imagine 256k not
being enough.
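
For reference, a minimal sketch of the kind of script described (the probe
rate, aggregation name, and file name are guesses, not the actual script):

```
/* Hypothetical reconstruction -- not the actual script. Sample user
 * stacks at 997Hz, quit after 30 seconds, print @stacks on exit.
 * Invoked as something like: dtrace -x aggsize=256k -s profile.d -p <pid> */
profile-997
/pid == $target/
{
	@stacks[ustack()] = count();
}

tick-30s
{
	exit(0);
}
```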

Ideas? Should I file a bug report?

Ryan
Ryan Johnson
2010-07-16 15:41:51 UTC
Update: the bug manifests quite often no matter what aggsize I use. I
just got lucky that one time...

Has anyone else seen this?

Ryan
Post by Ryan Johnson
<snipped quoted original message>
_______________________________________________
dtrace-discuss mailing list
Adam Leventhal
2010-07-16 21:22:23 UTC
Hey Ryan,

Thanks for providing the core file to me offline. This turns out to be a
bug in libdtrace. Here's the stack from the core, in mdb(1):

$C
ffffffff7f7ff6b1 libc.so.1`_ndoprnt+4(ffffffff7f174760, ffffffff7f800e58,
ffffffff7f800d38, 0, ff00, 2f000000)
ffffffff7f800481 libc.so.1`snprintf+0x88(ffffffff7f801690, 0, ffffffff7f174760,
ffffffff7f800e9c, ffffffff7f801290, c4)
ffffffff7f8005c1 libdtrace.so.1`dtrace_uaddr2str+0x108(1001117e0, 6106, 100521a6c, ffffffff7f801f40, 10d, ffffffff7f801690)
ffffffff7f801691 libdtrace.so.1`pfprint_uaddr+0xd8(1001117e0, 10010b570,
ffffffff7fffde54, 6106, 10d, 10)
ffffffff7fffd4d1 libdtrace.so.1`dt_printf_format+0x6f4(ffffffff7fffde55, 100577ed8, 100114d40, ffffffff7fffde55, ffffffff7f276000, ffffffff7f1584bc)
ffffffff7fffd6b1 libdtrace.so.1`dt_fprinta+0x88(0, ffffffff7fffe108, 100114d40, 4, 100577e70, 100577e98)
ffffffff7fffd791 libdtrace.so.1`dt_aggregate_walk_sorted+0x108(1001117e0,
ffffffff7f15af84, ffffffff7fffe108, 4c50, 98a, 1013ddb20)
ffffffff7fffd841 libdtrace.so.1`dtrace_fprinta+0x11c(1001117e0, 10010b570, 100114d40, 1, 10010dac8, 1)
ffffffff7fffd941 libdtrace.so.1`dt_consume_cpu+0x78c(1001117e0, 10010b570, 0,
ffffffff7f276000, 1, 1009949b8)
ffffffff7fffdb41 libdtrace.so.1`dtrace_consume+0x2f8(1001117e0, 10010b570, 1000044f8, 10000446c, c0, 0)
ffffffff7fffdc01 libdtrace.so.1`dtrace_work+0x108(1001117e0, 10010b570, 1000044f8, 10000446c, 0, 1)
ffffffff7fffdcb1 main+0x1fb0(10010b, 10010b390, 10010b378, 100000, 10010b, 100000)
ffffffff7fffdee1 _start+0x17c(0, 0, 0, 0, 0, 0)
ffffffff7fffd4d1-ffffffff7f801691=K
7fbe40
(ffffffff7fffd4d1-ffffffff7f801691)%0t1024=D
8175

That's an 8MB stack frame!

Here's the problem in dt_printf_format():

	for (i = 0; i < pfv->pfv_argc; i++, pfd = pfd->pfd_next) {
		...
		if (pfd->pfd_preflen != 0) {
			char *tmp = alloca(pfd->pfd_preflen + 1);

			bcopy(pfd->pfd_prefix, tmp, pfd->pfd_preflen);
			tmp[pfd->pfd_preflen] = '\0';

			if ((rval = dt_printf(dtp, fp, tmp)) < 0)
				return (rval);
		...

We're allocating unbounded amounts of data off the stack. You may be able to
work around the problem by using ulimit(1) to increase the size of dtrace(1M)'s
stack.
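
A sketch of that workaround (32768 KB is an arbitrary value, chosen only to
exceed the ~8MB frame above; the script name is hypothetical):

```shell
ulimit -s                               # show the current soft stack limit, in KB
# ulimit -s 32768                       # raise it for this shell (capped by the hard limit)
# dtrace -x aggsize=256k -s profile.d   # then re-run the same invocation
```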

I've filed the following bug:

6969970 dt_printf_format() calls alloca(3C) in a loop

Adam
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl