SRX High CPU: httpd

Over the past few years of my Juniper SRX adventures, I’ve run into a few cases where the Routing Engine (RE) CPU is pegged at 100%. From what I’ve seen so far, it usually comes down to one of three causes: high traffic (for example, a spike in IPS inspection), logging in event mode, or a stuck web management session.

Occasionally the CPU issue doesn’t resolve itself and someone needs to investigate manually. Luckily, the httpd issue is pretty easy to spot and fix, so I wanted to cover that briefly today. It can crop up seemingly at random after someone uses the J-Web GUI to administer an SRX firewall. You could avoid it altogether by disabling the web interface, but that’s not always possible.

So the first thing we want to do is log in to our SRX firewall and check the current CPU utilization on the RE:

{primary:node0}
[email protected]> show chassis routing-engine node 0 
node0:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                  41 degrees C / 105 degrees F
    CPU temperature              70 degrees C / 158 degrees F
    Total memory               4096 MB Max 1556 MB used ( 38 percent)
      Control plane memory     2976 MB Max 804 MB used ( 27 percent)
      Data plane memory        1120 MB Max 773 MB used ( 69 percent)
    5 sec CPU utilization:
      User                       41 percent
      Background                  0 percent
      Kernel                     59 percent
      Interrupt                   0 percent
      Idle                        0 percent
    Model                           RE-SRX345
    Serial ID                       XX1000XX0002
    Start time                      2016-09-01 02:49:50 UTC
    Uptime                          351 days, 13 hours, 28 minutes, 47 seconds
    Last reboot reason              0x1:power cycle/failure
    Load averages:                  1 minute   5 minute   15 minute
                                        1.29       1.27        1.10

So we can see that over the past 5 seconds there is 0% idle CPU; it’s all split between User and Kernel. Some higher-end SRX models will also show utilization for 1 minute, 5 minutes, and 15 minutes.
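If you collect this output from a script rather than reading it by eye, the CPU numbers are easy to pull out. Here is a minimal Python sketch, assuming the output looks like the sample above. The function name `parse_cpu_utilization` and the `SAMPLE` string are my own for illustration, and how you fetch the text (SSH, NETCONF, etc.) is left out on purpose.

```python
import re

def parse_cpu_utilization(output: str) -> dict:
    """Return the first 'N sec CPU utilization' block as {'user': 41, 'idle': 0, ...}."""
    fields = {}
    in_cpu_block = False
    for line in output.splitlines():
        stripped = line.strip()
        if re.match(r"^\d+ sec CPU utilization:", stripped):
            in_cpu_block = True
            continue
        if in_cpu_block:
            m = re.match(r"^(User|Background|Kernel|Interrupt|Idle)\s+(\d+) percent$", stripped)
            if not m:
                break  # first non-matching line ends the block
            fields[m.group(1).lower()] = int(m.group(2))
    return fields

# Hypothetical sample, copied from the CLI output shown above.
SAMPLE = """\
    5 sec CPU utilization:
      User                       41 percent
      Background                  0 percent
      Kernel                     59 percent
      Interrupt                   0 percent
      Idle                        0 percent
    Model                           RE-SRX345
"""

cpu = parse_cpu_utilization(SAMPLE)
print(cpu)  # {'user': 41, 'background': 0, 'kernel': 59, 'interrupt': 0, 'idle': 0}
```

Only the first utilization block is read, so on models that also print longer averages you would still get the 5-second figures.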

Next, we want to confirm which process is consuming that CPU:

{primary:node0}
[email protected]> show system processes extensive node 0
node0:
--------------------------------------------------------------------------
last pid: 25330;  load averages:  1.16,  1.24,  1.10  up 351+13:29:51    16:19:11
165 processes: 21 running, 132 sleeping, 12 waiting

Mem: 354M Active, 191M Inact, 1253M Wired, 585M Cache, 112M Buf, 1595M Free
Swap:


  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
1635 root           7  76    0  1192M   113M RUN    0    ??? 281.93% flowd_octeon_hm
14607 nobody         3  76    0 14848K  6308K ucondt 0  25:03 83.45% httpd
   21 root           1 171   52     0K    16K RUN    0 6952.9  0.00% idle: cpu0
 1679 root           1  76    0 48580K 24476K select 0  90.2H  0.00% mib2d
 1715 root           1  76    0 35264K 19520K select 0  49.0H  0.00% snmpd
   23 root           1 -20 -139     0K    16K RUN    0  29.9H  0.00% swi7: clock
 1681 root           1   4    0   101M 68284K kqread 0  28.0H  0.00% rpd
   22 root           1 -40 -159     0K    16K WAIT   0  26.0H  0.00% swi2: netisr 0
  <-- Output Truncated -->

In this case it’s pretty clear that httpd is the top offender for CPU usage. You might also notice the process named ‘flowd_octeon_hm’. This is one of the firewall processes, so don’t be surprised if it also shows up near the top of the list. It’s pretty normal for this process to show >100% CPU, so it’s safe to ignore. If you see eventd as a top consumer, then you might have your logging configured to use event mode rather than stream mode, which I’ll cover in another post.
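The same check can be scripted. This is a rough Python sketch, assuming the process table has the layout shown above, where WCPU is the first token on each row that ends in a percent sign. The function name `busy_processes`, the 50% threshold, and the ignore list are my own choices for illustration, not anything Junos defines.

```python
import re

WCPU_RE = re.compile(r"^\d+(?:\.\d+)?%$")

def busy_processes(output, threshold=50.0, ignore=("flowd_octeon_hm",)):
    """Return (pid, command, wcpu) tuples at or above threshold, busiest first."""
    busy = []
    for line in output.splitlines():
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue  # skip headers, blank lines, and truncation markers
        for i, token in enumerate(parts):
            if WCPU_RE.match(token):
                break
        else:
            continue  # no WCPU column on this line (e.g. the process summary line)
        wcpu = float(parts[i].rstrip("%"))
        command = " ".join(parts[i + 1:])  # commands like "idle: cpu0" contain spaces
        if wcpu >= threshold and command not in ignore:
            busy.append((int(parts[0]), command, wcpu))
    return sorted(busy, key=lambda p: p[2], reverse=True)

# Hypothetical sample, copied from the CLI output shown above.
SAMPLE = """\
165 processes: 21 running, 132 sleeping, 12 waiting
  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1635 root           7  76    0  1192M   113M RUN    0    ??? 281.93% flowd_octeon_hm
14607 nobody         3  76    0 14848K  6308K ucondt 0  25:03 83.45% httpd
   21 root           1 171   52     0K    16K RUN    0 6952.9  0.00% idle: cpu0
  <-- Output Truncated -->
"""

print(busy_processes(SAMPLE))  # [(14607, 'httpd', 83.45)]
```

flowd_octeon_hm is skipped by default because, as noted above, it normally shows high CPU; pass `ignore=()` if you want to see it anyway.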

So how do we fix the httpd problem? Reboot the SRX? Well, yeah, that would probably fix it, but there is an easier way:

{primary:node0}
[email protected]> restart web-management
Web management gatekeeper process started, pid 25343

One quick command and we’ve restarted all of the web management processes, including httpd. Now give the SRX a few seconds to recover, then run the show chassis routing-engine command again:

{primary:node0}
[email protected]> show chassis routing-engine node 0
node0:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                 41 degrees C / 105 degrees F
    CPU temperature             69 degrees C / 156 degrees F
    Total memory              4096 MB Max  1556 MB used ( 38 percent)
      Control plane memory    2976 MB Max   804 MB used ( 27 percent)
      Data plane memory       1120 MB Max   773 MB used ( 69 percent)
    5 sec CPU utilization:
      User                       6 percent
      Background                 0 percent
      Kernel                     3 percent
      Interrupt                  0 percent
      Idle                      91 percent
    Model                          RE-SRX345
    Serial ID                      XX1000XX0002
    Start time                     2016-09-01 02:49:50 UTC
    Uptime                         351 days, 13 hours, 32 minutes, 52 seconds
    Last reboot reason             0x1:power cycle/failure
    Load averages:                 1 minute   5 minute  15 minute
                                       0.35       0.99       1.04

Looks much better, with 91% idle CPU!

Even though this issue can be annoying, it’s a quick fix. I recommend that you set up some sort of CPU monitoring/alerting on your SRX clusters (I use Observium for this). That way the issue gets spotted and resolved quickly. Left unchecked, it can sometimes cause latency and performance issues.
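If you roll your own alerting instead of using a monitoring tool, one thing worth doing is alerting only on sustained high CPU, since short spikes usually clear on their own. Here is a small Python sketch of that idea. The class name `SustainedCpuAlert`, the 10% idle floor, and the three-sample window are all assumptions of mine, so tune them for your environment.

```python
from collections import deque

class SustainedCpuAlert:
    """Fire only when idle CPU stays below a floor for N consecutive samples."""

    def __init__(self, idle_floor=10, samples=3):
        self.idle_floor = idle_floor          # hypothetical threshold, in percent idle
        self.window = deque(maxlen=samples)   # keeps only the most recent samples

    def observe(self, idle_percent):
        """Record one idle-CPU sample and return True if the alert should fire."""
        self.window.append(idle_percent)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(v < self.idle_floor for v in self.window)

alert = SustainedCpuAlert(idle_floor=10, samples=3)
results = [alert.observe(v) for v in [0, 0, 91, 0, 0, 0]]
print(results)  # [False, False, False, False, False, True]
```

In the example, the single 91% idle reading resets the streak, so the alert only fires once three low-idle samples arrive in a row.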

Hope this helps!