Over the past few years of my Juniper SRX adventures, I’ve run into a few cases where the Routing Engine (RE) CPU is pegged at 100%. From what I’ve seen so far, this is typically one of three causes: high traffic (spike in IPS inspection), logging using event mode, or a stuck web management session.
In a few cases, the CPU issue doesn’t resolve itself and someone needs to manually investigate the cause. Luckily, the httpd issue is pretty easy to spot and fix, so I wanted to cover that briefly today. This issue can crop up randomly after someone uses the J-Web GUI to administer an SRX firewall. You could avoid it entirely by disabling the web interface, but that’s not always possible.
So the first thing we want to do is log into our SRX firewall and check the current CPU utilization for our RE processor:
{primary:node0}
[email protected]> show chassis routing-engine node 0
node0:
--------------------------------------------------------------------------
Routing Engine status:
Temperature 41 degrees C / 105 degrees F
CPU temperature 70 degrees C / 158 degrees F
Total memory 4096 MB Max 1556 MB used ( 38 percent)
Control plane memory 2976 MB Max 804 MB used ( 27 percent)
Data plane memory 1120 MB Max 773 MB used ( 69 percent)
5 sec CPU utilization:
User 41 percent
Background 0 percent
Kernel 59 percent
Interrupt 0 percent
Idle 0 percent
Model RE-SRX345
Serial ID XX1000XX0002
Start time 2016-09-01 02:49:50 UTC
Uptime 351 days, 13 hours, 28 minutes, 47 seconds
Last reboot reason 0x1:power cycle/failure
Load averages: 1 minute 5 minute 15 minute
1.29 1.27 1.10
So we can see that over the past 5 seconds, there is 0% idle CPU – it’s all split between User and Kernel. Some higher-end SRX models will also show utilization for 1 minute, 5 minutes, and 15 minutes.
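If you want to pull those numbers out with a script rather than by eye, here is a minimal Python sketch. It assumes you have already captured the text of show chassis routing-engine somehow (how you collect it is up to you), and the function name parse_re_cpu is just something I made up for illustration. It also assumes that on models which print several intervals, the 5 second block comes first, so only the first value seen for each field is kept.

```python
import re

# Matches lines such as "User 41 percent" or "Idle 0 percent".
CPU_LINE = re.compile(r"\s*(User|Background|Kernel|Interrupt|Idle)\s+(\d+)\s+percent")

def parse_re_cpu(output: str) -> dict:
    """Return CPU percentages keyed by lowercase field name (hypothetical helper)."""
    fields = {}
    for line in output.splitlines():
        m = CPU_LINE.match(line)
        if m:
            # Assumption: keep the first value per field (the 5 sec block).
            fields.setdefault(m.group(1).lower(), int(m.group(2)))
    return fields

sample = """\
Routing Engine status:
Temperature 41 degrees C / 105 degrees F
5 sec CPU utilization:
User 41 percent
Background 0 percent
Kernel 59 percent
Interrupt 0 percent
Idle 0 percent
"""

cpu = parse_re_cpu(sample)
print(cpu)
```

With the sample above, the idle value comes back as 0, which matches what we read off the CLI output.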
Next, we want to confirm which process is consuming that CPU:
{primary:node0}
[email protected]> show system processes extensive node 0
node0:
--------------------------------------------------------------------------
last pid: 25330; load averages: 1.16, 1.24, 1.10 up 351+13:29:51 16:19:11
165 processes: 21 running, 132 sleeping, 12 waiting
Mem: 354M Active, 191M Inact, 1253M Wired, 585M Cache, 112M Buf, 1595M Free
Swap:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
1635 root 7 76 0 1192M 113M RUN 0 ??? 281.93% flowd_octeon_hm
14607 nobody 3 76 0 14848K 6308K ucondt 0 25:03 83.45% httpd
21 root 1 171 52 0K 16K RUN 0 6952.9 0.00% idle: cpu0
1679 root 1 76 0 48580K 24476K select 0 90.2H 0.00% mib2d
1715 root 1 76 0 35264K 19520K select 0 49.0H 0.00% snmpd
23 root 1 -20 -139 0K 16K RUN 0 29.9H 0.00% swi7: clock
1681 root 1 4 0 101M 68284K kqread 0 28.0H 0.00% rpd
22 root 1 -40 -159 0K 16K WAIT 0 26.0H 0.00% swi2: netisr 0
<-- Output Truncated -->
In this case it’s pretty clear that httpd is the top offender for CPU usage. You might also notice the process named ‘flowd_octeon_hm’. This is part of the firewall’s forwarding processes, so don’t be surprised if it also shows up near the top of the list. It’s pretty normal for this process to show >100% CPU, so it is safe to ignore. If you see eventd as a top consumer, then you might have your logging configured to use event mode rather than stream mode – which I’ll cover in another post.
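The same check can be sketched in Python: read the process table, skip the entries that are expected to be busy, and sort the rest by WCPU. This is only an illustration. The function name top_consumers and the ignore list (flowd and idle) are my own assumptions based on the output above, and the regex only handles lines shaped like the ones shown here.

```python
import re

# PID, USERNAME, anything, then "NN.NN%" followed by the command name.
PROC_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+.*?\s(\d+\.\d+)%\s+(.+)$")

def top_consumers(output: str, ignore_prefixes=("flowd", "idle")):
    """Return (pid, command, wcpu) tuples sorted by WCPU, highest first."""
    rows = []
    for line in output.splitlines():
        m = PROC_LINE.match(line)
        if not m:
            continue  # header and summary lines do not match
        pid, command, wcpu = int(m.group(1)), m.group(4).strip(), float(m.group(3))
        # Assumption: flowd and idle entries are normal and not worth reporting.
        if command.startswith(ignore_prefixes):
            continue
        rows.append((pid, command, wcpu))
    return sorted(rows, key=lambda row: row[2], reverse=True)

sample = """\
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
1635 root 7 76 0 1192M 113M RUN 0 ??? 281.93% flowd_octeon_hm
14607 nobody 3 76 0 14848K 6308K ucondt 0 25:03 83.45% httpd
21 root 1 171 52 0K 16K RUN 0 6952.9 0.00% idle: cpu0
1679 root 1 76 0 48580K 24476K select 0 90.2H 0.00% mib2d
"""

result = top_consumers(sample)
print(result[0])
```

On the sample, the first entry is httpd with PID 14607, which is the process we care about here.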
So how do we fix the httpd problem? Reboot the SRX? Well, yeah that would probably fix it – but there is an easier way:
{primary:node0}
[email protected]> restart web-management
Web management gatekeeper process started, pid 25343
One quick command and we’ve restarted all of the web management processes, including httpd. So now you’ll want to give the SRX a few seconds to recover itself – then run the show chassis routing-engine command again:
{primary:node0}
[email protected]> show chassis routing-engine node 0
node0:
--------------------------------------------------------------------------
Routing Engine status:
Temperature 41 degrees C / 105 degrees F
CPU temperature 69 degrees C / 156 degrees F
Total memory 4096 MB Max 1556 MB used ( 38 percent)
Control plane memory 2976 MB Max 804 MB used ( 27 percent)
Data plane memory 1120 MB Max 773 MB used ( 69 percent)
5 sec CPU utilization:
User 6 percent
Background 0 percent
Kernel 3 percent
Interrupt 0 percent
Idle 91 percent
Model RE-SRX345
Serial ID XX1000XX0002
Start time 2016-09-01 02:49:50 UTC
Uptime 351 days, 13 hours, 32 minutes, 52 seconds
Last reboot reason 0x1:power cycle/failure
Load averages: 1 minute 5 minute 15 minute
0.35 0.99 1.04
Looks much better, with 91% idle CPU!
Even though this issue can be annoying, it’s a quick fix. I recommend that you set up some sort of CPU monitoring/alerting on your SRX clusters (I use Observium for this) so you can spot the issue and resolve it quickly. If left unchecked, it can cause latency and performance issues.
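As a rough idea of what an alert rule could look like, here is a small Python sketch. Since a short CPU spike often clears on its own, it only flags the case where idle CPU stays low for several polls in a row. The function name, the 10% idle threshold, and the three-sample window are all assumptions of mine, not anything Observium or Junos defines, so tune them to your own environment.

```python
def sustained_high_cpu(idle_samples, idle_threshold=10, consecutive=3):
    """Return True if the most recent `consecutive` idle readings are all below the threshold.

    idle_samples: idle CPU percentages, oldest first (hypothetical input,
    e.g. one reading per polling interval).
    """
    if len(idle_samples) < consecutive:
        return False  # not enough data yet to call it sustained
    recent = idle_samples[-consecutive:]
    return all(idle < idle_threshold for idle in recent)

stuck = sustained_high_cpu([91, 0, 0, 0])   # idle stays at 0 for three polls
spike = sustained_high_cpu([0, 91, 0])      # recovered in between
print(stuck, spike)
```

Requiring a few consecutive low readings keeps the alert quiet during short bursts while still catching a stuck process like the httpd case above.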
Hope this helps!