Orange is my favorite color

I’m setting up a new server as recent posts and tweets have suggested. Just about everything was working with one exception: horribly slow startup times of 2-4 minutes on a dual quad-core Xeon machine with 16GB of RAM. Tailing the cfusion-event.log during startup, ColdFusion would hang either when starting the SQL manager pool or MessageBrokerServlet. It looked something like this (note the timestamps):

01/10 13:03:28 user MessageBrokerServlet: init
**** 90 seconds of "working" ****
01/10 13:05:13 user CFFormGateway: init
01/10 12:32:58 Information [scheduler-12] - Starting debugging...
01/10 12:32:58 Information [scheduler-12] - Starting sql...
****  almost 2 minutes here ****
01/10 12:34:46 Information [scheduler-12] - Pool Manager Started
01/10 12:34:46 Information [scheduler-12] - Starting mail... 

This simply would not do. Not only was it slow, it was unpredictable, sometimes taking 90-115 seconds and other times as long as 224 seconds. We generally restart our CF servers when we do deployments and the sluggish startup meant more queued requests and timeouts as we come back from an update. No bueno.

I turned to my friends on the CFGURU list hoping someone had the answer. The always helpful Dan Switzer pointed me to a cfaussie thread which was Ubuntu specific but had my same behavior:

With the startup you can see that cranking up the Message Broker Servlet (ie Livecycle/Blaze) is taking almost a minute, which I find is usually due to some network ports used by LCDS still being open from the last time ColdFusion was run.

The good news was that they were experiencing the exact same problem I was. The bad news is that I have nothing to do with Blaze or LiveCycle so I wasn’t sure this was related. We are also running RHEL/CentOS but the Ubuntu connection proved helpful later in my research.

Anytime I see hanging processes in ColdFusion, I generate a complete JVM stack trace. Unfortunately, I’m quite bad at reading them. I identified a BLOCKED process but Daryl Banttari took a look and set me straight:

This is the thread that's your root problem:

"scheduler-10" prio=10 tid=0x00002aaaac28c800 nid=0x5edd runnable [0x00000000441de000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:199)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x00002b4fc75a5018> (a java.io.BufferedInputStream)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x00002b4fc75a4ce0> (a java.io.BufferedInputStream)
at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:453)
at sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:123)
at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:118)
at sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:114)
at java.security.SecureRandom.generateSeed(SecureRandom.java:495)
at com.rsa.jsafe.provider.c7.engineGenerateSeed(Unknown Source)
at java.security.SecureRandom.generateSeed(SecureRandom.java:495)
at java.security.SecureRandom.getSeed(SecureRandom.java:482)
at com.rsa.jsafe.crypto.au.g(Unknown Source)
at com.rsa.jsafe.crypto.au.a(Unknown Source)
at com.rsa.jsafe.provider.JSA_FIPS186PRNGXChangeNoticeGeneral.engineNextBytes(Unknown Source)
at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
- locked <0x00002b4fdca32370> (a java.security.SecureRandom)
at java.security.SecureRandom.next(SecureRandom.java:455)
at java.util.Random.nextInt(Random.java:253)
at flex.messaging.util.UUIDUtils.createUUID(UUIDUtils.java:110)
at flex.messaging.util.UUIDUtils.createUUID(UUIDUtils.java:97)
at flex.messaging.log.AbstractTarget.<init>(AbstractTarget.java:52)
at flex.messaging.log.LineFormattedTarget.<init>(LineFormattedTarget.java:79)
at flex.messaging.log.ConsoleTarget.<init>(ConsoleTarget.java:34)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at java.lang.Class.newInstance0(Class.java:355)
at java.lang.Class.newInstance(Class.java:308)
at flex.messaging.config.MessagingConfiguration.createLogger(MessagingConfiguration.java:298)
at flex.messaging.MessageBrokerServlet.init(MessageBrokerServlet.java:114)
at coldfusion.flex.ColdFusionMessageBrokerServlet.init(ColdFusionMessageBrokerServlet.java:24)
at coldfusion.bootstrap.ClassloaderHelper.initServletClass(ClassloaderHelper.java:94)
at coldfusion.bootstrap.BootstrapServlet.init(BootstrapServlet.java:59)
at jrun.servlet.WebApplicationService.loadServlet(WebApplicationService.java:1213)
- locked <0x00002b4fdbc4e5c8> (a java.lang.Object)
at jrun.servlet.WebApplicationService.preloadServlets(WebApplicationService.java:793)
at jrun.servlet.WebApplicationService.postStart(WebApplicationService.java:295)
at jrun.ea.EnterpriseApplication.start(EnterpriseApplication.java:203)
at jrun.deployment.DeployerService.initModules(DeployerService.java:708)
at jrun.deployment.DeployerService.createWatchedDeployment(DeployerService.java:243)
at jrun.deployment.DeployerService.deploy(DeployerService.java:428)
- locked <0x00002b4ed0c3f140> (a java.util.HashMap)
at jrun.deployment.DeployerService.checkWatchedDirectories(DeployerService.java:179)
at jrun.deployment.DeployerService.run(DeployerService.java:889)
- locked <0x00002b4ed0c3f140> (a java.util.HashMap)
at jrunx.scheduler.SchedulerService.invokeRunnable(SchedulerService.java:230)
at jrunx.scheduler.ThreadPool$DownstreamMetrics.invokeRunnable(ThreadPool.java:320)
at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:428)
at jrunx.scheduler.ThreadPool$UpstreamMetrics.invokeRunnable(ThreadPool.java:266)
at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)

Looks like it's stuck trying to read from /dev/random while trying to create a UUID?

The key was that Java was waiting on input to the secure number generator code. Thanks to Daryl’s stack trace reading kung fu, I had something to Google. On Linux, /dev/random is a “blocking” number generator meaning if it doesn’t have enough random data to provide, it will simply wait until it does. Here’s some background on /dev/random. Keyboard and mouse input as well as disk activity can generate the randomness, or entropy, needed but perhaps not fast enough for particular applications. A lack of random data would force the JVM and ColdFusion to wait, for eternity if necessary, until chaos caught up with /etc/init.d/coldfusion8multi.

It turns out a lot of applications experience this problem: any that use random number generators. I found frustrated users of Apache, SSL, Livecycle/Blaze DS and a myriad of other applications as my search terms refined.

Verify the culprit

Before you try the changes below, verify that this is indeed your problem. It’s easy! First, start by checking how much entropy you have:

cat /proc/sys/kernel/random/entropy_avail

This kernel facility in /proc has been available since 2.3.16 so it should work on any reasonably recent Linux. A full pool should be 4096 (check your poolsize config in /proc/sys/kernel/random/poolsize). On my normal servers, I had values around 3500 but this particular server was 150. Stop ColdFusion (or Apache or LCDS or whatever app is giving you fits) and run the following command before starting CF again:

watch –interval 0.1 cat /proc/sys/kernel/random/entropy_avail

Now start ColdFusion and what you may see is as it gets to MessageBrokerServlet, the available entropy drops to near-zero, climbs back up to around 50-70, then drops to zero and repeats like a stairmaster until it completes and CF can resume at a normal pace.

Make random data non-randomly appear

Now we know roughly what the problem is: there ain’t enough random data. How do we make it so ColdFusion and JRun can start up faster? We need to ensure that there is a constantly available source of entropy so the JVM can seed its secure number generator quickly.

Armed with new search terms for Java’s SecureRandom, I came across a StackOverflow thread which offered a JVM argument to switch from the blocking /dev/random to the less-secure-but-non-blocking /dev/urandom:

-Djava.security.egd=file:/dev/./urandom

I appended that onto the end of java.args in /opt/jrun4/bin/jvm.config and the server started up in 12 seconds! I felt a little bit of hair on my head grow back at that moment.

Solving for Production

At this stage I’ve identified the culprit but I had uneasy feelings about using /dev/urandom in production where evil doers are constantly trying to do evil things to our server, application and users. We don’t want an insecure random number generator.

Following the Ubuntu connection from Stack Overflow, this post on slow Apache starts described the same symptoms and the use of rngd (a random number generator daemon) to populate the entropy pool. Rngd is included in rng-tools or rng-utils depending on your distro. On RHEL 5.5, it was already installed under /sbin.

The trick is to feed rngd data from the kinda-insecure /dev/urandom and let it figure out what is secure enough to pass to /dev/random with a FIPS-140 test. You can run rngd two ways. If you want to populate /dev/random once, right now, run it in the foreground:

/sbin/rngd -r /dev/urandom -o /dev/random -f -t 5

Ctrl-C out of it after 15 seconds and check entropy_avail – you should find it’s up around 4000. Very nice! Now, how about running it as a daemon so you don’t have to ever worry if ColdFusion has enough to start quickly? I added this to my /etc/rc3.d/S99local file so it will automatically start every time the server is started:


/sbin/rngd -r /dev/urandom -o /dev/random -t 5
 

Fire that up, restart ColdFusion, crack a beer and ride off into the sunset. Another ridiculous unexpected behavior solved by yours truly.

Interesting side note: Once upon a time, Intel included hardware RNGs in their chip sets that would be exposed as /dev/hw_random. If you had one of those, you could use it as the parameter for -r. Unfortunately that practice died out in the 800 series chips and are no longer available. You can still buy external, cryptographically secure hardware RNGs that can be used for rngd but the above seems to be “good enough” according to the tubes.

If you want to go nuts, you can use a radio tuned to static as an audio input to provide the source data. Unless your a crypto-nerd, this thread about RNG sources will make your eyes bleed.

Resources

EGD – entropy gathering daemon is a userspace substitute for /dev/random, written in perl. It could be used with the java.security.egd parameter above in lieu of /dev/random.

http://whatan00b.com/slow-apache-starts-on-ubuntu Llinked to from the StackOverflow thread on the Java SecureRandom slowness, explained entropy and how to monitor it.

http://blog.sbf5.com/?p=50 Pointed me in the right direction for using rngd to generate entropy for /dev/random

Increasing entropy pool on RHEL/Fedora without keyboard or mouse Another source to monitoring entropy and using rngd that turned up in Google as I got more specific with my searches.

Wikipedia on Hardware RNGs An overview of ways to source random data

Did you know? /proc/sys/kernel/random/uuid contains a fresh Version 4 UUID.

4 Comments

  1. Mark Mandel said:

    on January 11, 2011 at 2:54 pm

    Dude!

    I’ve battled this for years on my local Linux servers. I have to say a big THANK YOU! I never actually thought to go hunting through stack traces *facepalm*

    So I wrote up a little script for Ubuntu users, if you want to run it manually if you ever need to reset CF (I’m sure it can be adjusted for other people as well)


    sudo timeout -2 20s rngd -r /dev/urandom -o /dev/random -f -t 5
    cat /proc/sys/kernel/random/entropy_avail

    There you go, no more long waits on restarts. Awesome sauce.

  2. mark kruger said:

    on January 11, 2011 at 3:52 pm

    Brian,

    Best troubleshooting battle and solution EVER!!! I love it. Good work – and my hats off to you.

    -Mark

  3. Pete Freitag said:

    on January 11, 2011 at 6:55 pm

    Nice work Brian! I think I’ve seen similar delays on my Mac desktop in the MessageBrokerServlet, good info!

  4. faser said:

    on May 3, 2011 at 5:06 am

    A BIG thank you!

{ RSS feed for comments on this post}