Orange is my favorite color

Here’s a guerilla debugging tactic based upon an old post of mine, Finding Memory Leaks in Coldfusion JVM, and some inspiration from Charlie Arehart. I’m in the process of migrating from ColdFusion 8 to 10 and am encountering some unexpected errors. This stack trace snippet should strike fear into the heart of any developer:

ava.lang.StackOverflowError at org.apache.xerces.dom.ParentNode.item(Unknown Source) at
net.sf.saxon.dom.NodeWrapper$ChildEnumeration.skipFollowingTextNodes(NodeWrapper.java:1166) at
net.sf.saxon.dom.NodeWrapper$ChildEnumeration.next(NodeWrapper.java:1194) at
net.sf.saxon.tree.util.Navigator$EmptyTextFilter.next(Navigator.java:918) at
net.sf.saxon.tree.util.Navigator$DescendantEnumeration.advance(Navigator.java:1052) at
net.sf.saxon.tree.util.Navigator$DescendantEnumeration.advance(Navigator.java:1111) at
net.sf.saxon.tree.util.Navigator$BaseEnumeration.next(Navigator.java:949) at
net.sf.saxon.tree.util.Navigator$DescendantEnumeration.advance(Navigator.java:1043) at
net.sf.saxon.tree.util.Navigator$BaseEnumeration.next(Navigator.java:949) at
... (endless repeat)

Something is stuck in a loop but because the stack trace fills up, you can’t see any of the lines of code being executed just before the ka-blammo. So, what to do?

Brute Force Debugging

My solution was to swipe a perl script to run commands at very short intervals and dump jstack output into a series of text files. Low-tech, but effective. This is the timer.pl script:

#!/usr/bin/perl

use strict;
use warnings;
use Time::HiRes qw/time sleep/;

sub launch {
    return if fork;
    exec @_;
    die "Couldn't exec";
}

$SIG{CHLD} = 'IGNORE';

my $interval = shift;
my $start = time();
while (1) {
    launch(@ARGV);
    $start += $interval;
    sleep $start - time();
}

And a separate shell script, dump.sh, so the redirect would create a new file every time it was run. 9892 is just the process id of the ColdFusion instance:

#!/bin/sh
sudo -u nobody /usr/java/latest/bin/jstack 9892 > t_$(date '+%Y%m%d%H%M%S_%N')

I tried running it every second, but that wasn’t enough to catch the stack overflow in the act. I upped it to every 500ms, 100ms and finally 50ms until it caught the offending code. Here’s the command in action:

perl timer.pl 0.05 ./dump.sh 

That says run the dump.sh every 50ms which will output to files formatted like t_20130402171521_009840232 using the current date and time.

Results

Running every 50ms, it doesn’t take long to fill up a directory so I started the script on the command line and in my web browser, triggered the offending code. Here’s what my directory looks like:

 52 t_20130402171519_959011475
 52 t_20130402171520_009648175
 52 t_20130402171520_060316767
 52 t_20130402171520_110450371
 52 t_20130402171520_160301392
 52 t_20130402171520_210357359
 52 t_20130402171520_260432922
 52 t_20130402171520_309171913
 52 t_20130402171520_360033048
 52 t_20130402171520_409978373
 52 t_20130402171520_459903958
 52 t_20130402171520_510488116
 52 t_20130402171520_558777256
 60 t_20130402171520_610244244
 72 t_20130402171520_659494494
136 t_20130402171520_709058569
136 t_20130402171520_760286651
136 t_20130402171520_810351161
136 t_20130402171520_860382457
136 t_20130402171520_910136459
136 t_20130402171520_960370954
136 t_20130402171521_009840232
 72 t_20130402171521_059986004
136 t_20130402171521_110340195
136 t_20130402171521_160277982
136 t_20130402171521_209090149
 84 t_20130402171521_260090368
136 t_20130402171521_310428721
136 t_20130402171521_367027484
136 t_20130402171521_410303418
136 t_20130402171521_460314216
136 t_20130402171521_509053355
136 t_20130402171521_560120477
136 t_20130402171521_610417764
128 t_20130402171521_677647036
136 t_20130402171521_712667159
136 t_20130402171521_758884870
136 t_20130402171521_810562747
136 t_20130402171521_860425517
...

That’s kilobytes in the first column, filename in the second. You can see the steady state while I’m going to push the button in the web browser and then very quickly the stack traces get larger as code is executing. What I wasn’t getting until I stepped down to 50ms iterations were the intermediate 72kb and 84kb stack trace which actually showed the offending lines between the stack overflows churning.

While this hasn’t solved the problem yet, at least I know where to go looking (and it’s at you, XmlSearch()) to continue the troubleshooting process.

Conclusion

This is a pretty hacky solution but if you’re this far into the deep end of the “what do I do?” pool, it’s a useful technique. A line debugger would be far better but I can’t replicate this in my test environment. So, forget the “right” way to do it and let’s just get some useful information so we can make progress.

Comments are closed.