Performance analysis in Linux

My steps for troubleshooting performance issues with CPU, memory or IOPS.


Performance tools:
mpstat
iostat
ioping
collectl
htop
ps
vmstat
netstat (ss) / nfsstat
iotop

1. CPU
High CPU utilization usually points to a misbehaving process, a memory leak, applications overloading the system, or protocol latency (NFS, SMB, iSCSI reads/writes to disk).

1.1 Identify processes that use the most CPU / MEM

Identify high CPU utilization using mpstat -P ALL, which breaks the numbers down per core.

# mpstat -P ALL
CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  all   10.68    2.12    4.22    1.58    0.00    0.08    0.00    0.00    0.00   81.33
   0     4.98    1.16    2.18    2.21    0.00    0.13    0.00    0.00    0.00   89.34
   1   18.06    3.34    6.63    1.16    0.00    0.01    0.00    0.00    0.00   70.79
   2   15.59    3.10    6.74    0.82    0.00    0.05    0.00    0.00    0.00   73.69
   3   16.57    2.96    5.76    0.73    0.00    0.01    0.00    0.00    0.00   73.97
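
Without arguments mpstat reports averages since boot; give it an interval and a count to watch the load as it happens (the 5-second interval and count of 3 are illustrative):

# mpstat -P ALL 5 3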

htop helps find which processes use the most CPU and/or memory.

The htop column headers (CPU%, MEM%, Command, etc.) are clickable, so it is easy to change the sort order.
For more options, see the htop man page.
Use lsof to identify files opened by certain processes.
List open files by process (example: tgt)
# lsof -c tgt
COMMAND PID USER  FD TYPE DEVICE SIZE/OFF   NODE NAME
tgtd    517 root cwd  DIR    8,0     4096      2 /
tgtd    517 root rtd  DIR    8,0     4096      2 /
tgtd    517 root txt  REG    8,0   300016 419498 /usr/bin/tgtd
tgtd    517 root mem  REG    8,0  2047384 396560 /usr/lib/libc-2.19.so
tgtd    517 root mem  REG    8,0    14648 396525 /usr/lib/libdl-2.19.so
tgtd    517 root mem  REG    8,0   149301 396575 /usr/lib/libpthread-2.19.so
tgtd    517 root mem  REG    8,0   160026 396544 /usr/lib/ld-2.19.so
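
lsof can also be pointed at a single process ID rather than a command name (using the PID from the output above):

# lsof -p 517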

List open files by user (example: root)

# lsof -u root

For a complete list of processes and their commands:

# ps auxefw

The output will be fairly long, so it is better to pipe it through grep when searching for a specific process or user.
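
For instance, to narrow the listing down to the tgtd process from the earlier example:

# ps auxefw | grep tgtd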

1.2 Once the offending process has been identified, check whether it can be killed or restarted. Use kill to send specific signals to a process.

# kill -s SIGSTOP 519
# kill -9 519
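
kill with no flags sends SIGTERM, which lets the process shut down cleanly; SIGKILL (-9) cannot be caught and is best kept as a last resort. A process paused with SIGSTOP can be resumed later:

# kill -s SIGCONT 519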

2. Memory
vmstat is a good tool for troubleshooting memory utilization.
A few examples:
# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 631948  13532 114992    0    0    15     2   16   35  0  0 99  1  0

# vmstat -a
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 632080 101472 124512    0    0    15     2   16   35  0  0 99  1  0

# vmstat -d    
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
fd0        0      0       0       0      0      0       0       0      0      0
sda     4843   1484  249642  156650   1378   2059   42232   27756      0     39
sdb     1276    226    9198    1180     38      0     189       6      0      1
sdd     1269    229    7456    1326     38      0     189       3      0      1
sdc     1254    229    7394     980     45      0     189      20      0      0

Run vmstat --help for all options.
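
A single vmstat snapshot shows averages since boot; pass an interval and a count to watch memory pressure as it develops (the 5-second interval and count of 3 are illustrative). Consistently non-zero si/so (swap-in/swap-out) columns mean the system is actively swapping:

# vmstat 5 3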

3. Disks

The iostat command generates two types of reports: the CPU Utilization report and the Device Utilization report.

iostat is a very good tool for troubleshooting disk problems such as IO and throughput issues.

# iostat
Linux 3.14.2-1-ARCH (server1)   09/18/14        _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.04    0.00    0.11    0.57    0.00   99.28

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.25        24.84         4.25     124913      21392
sdb               0.26         0.91         0.02       4599         94
sdd               0.26         0.74         0.02       3728         94
sdc               0.26         0.74         0.02       3697         94
sde               0.03         0.19         0.00        960          0
sdg               0.03         0.19         0.00        960          0
sdf               0.03         0.19         0.00        960          0
sdh               0.02         0.18         0.00        924          0


%user - Show the percentage of CPU utilization that occurred while executing at the user level (application).
%nice - Show the percentage of CPU utilization that occurred while executing at the user level with nice priority.
%system - Show the percentage of CPU utilization that occurred while executing at the system level (kernel).
%iowait - Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
%steal - Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
%idle - Show the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.


The transfer rates are shown in kilobytes per second here (the kB_ prefix in the headers); older versions of iostat report 512-byte blocks (Blk_read/s, Blk_wrtn/s) unless -k is given.
tps - number of transfers (I/O requests) per second issued to the device
kB_read/s / kB_wrtn/s - kilobytes read / written per second
kB_read / kB_wrtn - total kilobytes read / written
%util - Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
await - The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.

# man iostat

%util and await (shown in the extended report, iostat -x) are the most important indicators of an IO bottleneck.
Disk performance is determined by the RAID type, the number of disks in the array, the average IOPS per drive (determined by rotational speed) and the IO workload (the mix of reads and writes).
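
A typical invocation when chasing a bottleneck is extended statistics in kilobytes, sampled over time (the interval and count are illustrative):

# iostat -xk 5 3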

Sometimes databases running on NFS exports are trawling the filesystem (indexing, scanning, etc.) instead of doing any actually useful reads or writes.
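
On the NFS client, nfsstat can help show whether the traffic is metadata churn or real data transfer; a high proportion of getattr/lookup/access calls relative to read/write suggests scanning rather than useful IO:

# nfsstat -c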

iotop

iotop watches I/O usage information output by the Linux kernel (requires 2.6.20 or later) and displays a table of current I/O usage by processes or threads on the system. The -o flag shows only processes that are actually doing IO, and -a shows accumulated totals instead of current bandwidth.

# iotop -o -a
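
For logging IO usage over time, iotop also has a batch mode (a minimal sketch; the 5-second delay and the log file name are illustrative):

# iotop -botqk -d 5 > iotop.log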

The collectl utility is a system monitoring tool that records or displays specific operating system data for one or more sets of subsystems. Any set of subsystems, such as CPU, Disks, Memory or Sockets, can be included in or excluded from data collection.
Used without options it shows a general view of CPU, disk and network utilization:

# collectl
waiting for 1 second sample...
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut 
   0   0    30     60      0      0      0      0      0      0      0       0
   0   0    47     78      0      0      0      0      0      2      0       1
   0   0    32     78      0      0      0      0      0      4      0       1

Individual subsystems are selected with the -s switch; each generates summary data, which is the total of ALL data for a particular type (see the example after this list):
  b - buddy info (memory fragmentation)
  c - cpu
  d - disk
  f - nfs
  i - inodes
  j - interrupts by CPU
  l - lustre
  m - memory
  n - network
  s - sockets
  t - tcp
  x - interconnect (currently supported: Infiniband and Quadrics)
  y - slabs
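
For example, to watch just the CPU and disk summaries at the default interval (a minimal sketch):

# collectl -scd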