Isolating system resources with cgroups

Date: 2014-05-27 03:55:06

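The following shares my first hands-on experience with NUMA and cgroups; treat it as a reference only.

For a detailed introduction, see: https://access.redhat.com/site/documentation/zh-CN/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/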

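I. Symptoms on production machines
Recently, many machines running OS 6.4 have shown high load with uneven CPU usage: some CPUs are very busy while others are nearly idle. Physical memory is largely free, yet a lot of swap is in use, and some programs fail because they cannot allocate memory. More seriously, on some machines even starting iptables fails, as shown below: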
[root@abc tmp]# /etc/init.d/iptables start
iptables: Applying firewall rules: iptables-restore: line 31 failed
[FAILED]
[root@abc tmp]# tail /var/log/messages
Aug 14 00:21:19 abc kernel: Swap cache stats: add 386347, delete 385412, find 747707/754322
Aug 14 00:21:19 abc kernel: Free swap = 20167792kB
Aug 14 00:21:19 abc kernel: Total swap = 20971512kB
Aug 14 00:21:19 abc kernel: 4194303 pages RAM
Aug 14 00:21:19 abc kernel: 115322 pages reserved
Aug 14 00:21:19 abc kernel: 437874 pages shared
Aug 14 00:21:19 abc kernel: 3591085 pages non-shared
Aug 14 00:21:19 abc kernel: Unable to create nf_conn slab cache
Aug 14 00:21:20 abc modprobe: FATAL: Error inserting xt_state (/lib/modules/2.6.32-358.14.1.el6.x86_64/kernel/net/netfilter/xt_state.ko): Cannot allocate memory
Aug 14 00:21:47 abc kernel: nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[root@abc tmp]# free -m
             total       used       free     shared    buffers     cached
Mem:         15933      15768        165          0         58      11429
-/+ buffers/cache:       4280      11652
Swap:        20479        784      19695

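At this point both memory and swap have free space, yet the kernel still reports "Cannot allocate memory". After some research, the cause turned out to be NUMA.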

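II. A brief introduction to NUMA
NUMA is one of the architectures used for multi-processor machines; the name stands for Non-Uniform Memory Access. Put simply, the machine's physical memory is divided among the nodes (groups of CPU cores), and a core accesses the memory attached to its own node faster than the memory attached to another node.

From a system architecture point of view, mainstream enterprise servers fall into three broad classes: SMP (Symmetric Multi Processing), NUMA (Non-Uniform Memory Access) and MPP (Massive Parallel Processing). Each has its own characteristics. SMP: all CPUs access memory at equal cost and share the system bus, which can become a performance bottleneck and is hard to scale. MPP: the system is logically divided into multiple nodes and the CPUs of each node access only their own local resources; it scales well, but exchanging data between nodes is difficult. NUMA sits between the two. More detailed descriptions are easy to find online.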

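To check whether NUMA is supported and view the NUMA information:
[root@abc tmp]# numactl --show
policy: default //the current NUMA policy is default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1 //the lines above list the CPUs, memory nodes and other resources available for binding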

[root@abc tmp]# numactl --hardware
available: 2 nodes (0-1) //two nodes are available
node 0 cpus: 0 2 4 6
node 0 size: 8192 MB
node 0 free: 2904 MB //the CPUs of node 0 and its memory usage (its local resources)
node 1 cpus: 1 3 5 7
node 1 size: 8179 MB
node 1 free: 4220 MB //node 1
node distances:
node   0   1
  0:  10  20
  1:  20  10 //node 0 accessing node 0 (local resources) costs 10; node 0 accessing node 1 costs 20 (local access is faster than remote access)
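
Since the failure above happened while the machine as a whole still had free memory, it helps to look at free memory per node rather than in total. The following is a minimal sketch, not part of the original setup: numa_free_report is a hypothetical helper name, and it assumes the `numactl --hardware` output format shown above (size and free reported in MB).

```shell
# Print "node <id> <free_MB> <size_MB> <free_percent>" for each NUMA node.
# Reads `numactl --hardware` output on stdin (hypothetical helper).
numa_free_report() {
    awk '
        /^node [0-9]+ size:/ { size[$2] = $4 }
        /^node [0-9]+ free:/ { free[$2] = $4; order[++n] = $2 }
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                if (size[id] > 0)
                    printf "node %s %d %d %.1f\n", id, free[id], size[id], 100 * free[id] / size[id]
            }
        }'
}

# Usage on a real machine (requires numactl):
#   numactl --hardware | numa_free_report
```

With the figures shown above, node 0 has roughly 35% of its memory free and node 1 roughly 52%.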

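On Linux, the NUMA API supports four memory allocation policies:
1. default - always allocate on the local node (the node on which the current thread is running)
2. bind - allocate on the specified nodes
3. interleave - interleave allocations across all nodes or across the specified nodes
4. preferred - allocate on the specified node, and fall back to other nodes if that fails
The difference between bind and preferred shows when allocation on the specified node fails (for example, because it has too little memory): bind reports the failure, while preferred tries the other nodes. Forcing bind can therefore cause memory shortage early on and trigger heavy paging. The default policy is a more generally applicable form of the preferred policy.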
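
The four policies map onto options of the numactl command. The sketch below is only an illustration: numa_policy_opt is a hypothetical helper, and treating --localalloc as the command-line counterpart of the default policy is my assumption about the closest match.

```shell
# Print the numactl option for one of the policy names listed above.
# $1 = policy (default|bind|interleave|preferred), $2 = node list (unused for default)
numa_policy_opt() {
    case "$1" in
        default)    echo "--localalloc" ;;
        bind)       echo "--membind=$2" ;;
        interleave) echo "--interleave=$2" ;;
        preferred)  echo "--preferred=$2" ;;
        *)          echo "unknown policy: $1" >&2; return 1 ;;
    esac
}

# Usage (my_program is a placeholder):
#   numactl "$(numa_policy_opt interleave all)" ./my_program
```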

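III. Using cgroups to allocate, control and isolate resources
Cgroups, short for control groups, is a mechanism provided by the Linux kernel to limit, account for and isolate the physical resources (CPU, memory, I/O and so on) used by groups of processes.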

1. Installation
yum install libcgroup -y

2. Configuration
cat > /etc/cgconfig.conf <<END
# Mount point for the cpuset and memory subsystems
mount {
cpuset = /cgroup/cpu_and_mem;
memory = /cgroup/cpu_and_mem;
}

# Control group chunk_server; a group may use any of the subsystems listed by lssubsys -a
group chunk_server {
cpuset {
# Use CPUs 0,2,4,6, which belong to the same node
cpuset.cpus = "0,2,4,6";
# Use the memory of node 0
cpuset.mems="0";
}
memory {
# Limit memory to 4G
memory.limit_in_bytes=4G;
# Limit memory+swap to 4G
memory.memsw.limit_in_bytes=4G;
# Prefer physical memory over swap (same meaning as vm.swappiness)
memory.swappiness=0;
}
}

group other {
cpuset {
cpuset.cpus = "1,3,5,7";
cpuset.mems="1";
}
memory {
memory.limit_in_bytes=8G;
memory.memsw.limit_in_bytes=8G;
memory.swappiness=0;
}
}

END

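memory.limit_in_bytes: the limit on memory;

memory.memsw.limit_in_bytes: the limit on memory plus swap (must not be smaller than memory.limit_in_bytes);

cgconfigparser -l /etc/cgconfig.conf: parses and loads the configuration file, which also shows whether it is valid;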
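
Because the memory+swap limit may not be smaller than the memory limit, the two values can be compared before they go into the configuration file. This is a small sketch with hypothetical helper names (to_bytes, limits_ok); it assumes the K/M/G suffixes are powers of 1024, as used for the limits above.

```shell
# Convert a size such as 4G, 512M, 64K or a plain byte count to bytes.
to_bytes() {
    local v=$1 n
    case "$v" in
        *[Kk]) n=${v%?}; echo $(( n * 1024 )) ;;
        *[Mm]) n=${v%?}; echo $(( n * 1024 * 1024 )) ;;
        *[Gg]) n=${v%?}; echo $(( n * 1024 * 1024 * 1024 )) ;;
        *)     echo "$v" ;;
    esac
}

# Succeed only if the memory+swap limit ($2) is at least the memory limit ($1).
limits_ok() {
    [ "$(to_bytes "$2")" -ge "$(to_bytes "$1")" ]
}

# Usage:
#   limits_ok 4G 4G && echo "limits are consistent"
```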

3. Starting the service
/etc/init.d/cgconfig restart

4. Running tasks in a group
killall -9 chunk_server
cgexec -g memory,cpuset:chunk_server /usr/local/gfs/bin/gfs_chunk_server_daemon.sh

killall -9 vip_cdn
cgexec -g memory,cpuset:other /usr/local/vip_cdn/bin/monitor.sh

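I find the approach in steps 2 and 4 above the most practical. Resources can of course also be mounted, allocated and used with individual commands such as cgcreate, cgget, cgset, cgdelete, cgclear and cgclassify.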
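
As an illustration of the command-based route, the sketch below builds the same kind of group with cgcreate and cgset and moves existing processes into it with cgclassify. The helper names (run, make_group, move_pids) and the DRY_RUN switch are my own additions, and the sketch assumes the cpuset and memory subsystems are already mounted together as in step 2.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a group in the cpuset,memory hierarchy and set its parameters.
# $1 = group name, $2 = CPU list, $3 = memory node, $4 = memory limit
make_group() {
    run cgcreate -g "cpuset,memory:/$1" &&
    run cgset -r cpuset.cpus="$2" "$1" &&
    run cgset -r cpuset.mems="$3" "$1" &&
    run cgset -r memory.limit_in_bytes="$4" "$1" &&
    run cgset -r memory.memsw.limit_in_bytes="$4" "$1"
}

# Move already running processes into a group.
# $1 = group name, remaining arguments = PIDs
move_pids() {
    local group=$1
    shift
    run cgclassify -g "cpuset,memory:/$group" "$@"
}

# Usage (prints the commands instead of running them):
#   DRY_RUN=1 make_group chunk_server 0,2,4,6 0 4G
```

The memory limit is set before the memory+swap limit, in the same order as in the configuration file.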

5. Some useful commands
[root@abc tmp]# lssubsys -am //list the supported subsystems and where they are mounted
ns
cpu
cpuacct
devices
freezer
net_cls
blkio
perf_event
net_prio
cpuset,memory /cgroup/cpu_and_mem

[root@abc tmp]# lscgroup //show the current cgroup hierarchy
cpuset,memory:/
cpuset,memory:/other
cpuset,memory:/chunk_server

[root@abc tmp]# wc -l /cgroup/cpu_and_mem/chunk_server/tasks
31 /cgroup/cpu_and_mem/chunk_server/tasks
[root@abc tmp]# wc -l /cgroup/cpu_and_mem/other/tasks
243 /cgroup/cpu_and_mem/other/tasks
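//the number of tasks currently in each of the two groups; the tasks file lists process IDs (a process ID can also be added to the tasks file by hand, after which that process and the children it forks from then on are controlled by the group)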
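
Adding a process by hand amounts to writing its PID into the group's tasks file. A minimal sketch, assuming the /cgroup/cpu_and_mem mount point used in this article; add_task and the CGROUP_ROOT override are hypothetical names introduced here so the helper can be tried against a scratch directory.

```shell
# Write one PID into a group's tasks file.
# CGROUP_ROOT defaults to the mount point used in this article.
add_task() {
    local group=$1 pid=$2
    local tasks="${CGROUP_ROOT:-/cgroup/cpu_and_mem}/$group/tasks"
    [ -e "$tasks" ] || { echo "no such group: $group" >&2; return 1; }
    echo "$pid" > "$tasks"
}

# Usage (as root, 12345 is a placeholder PID):
#   add_task chunk_server 12345
```

Children that were already running before the write stay in their old group and have to be added separately.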

[root@abc tmp]# cat /proc/`pidof chunk_server`/cgroup

2:memory,cpuset:/chunk_server //the group that the chunk_server program belongs to (here: hierarchy ID 2, subsystems memory and cpuset, group /chunk_server)
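
The same file can be read from a script to confirm that a process landed in the expected group. The sketch below assumes the `hierarchy-id:subsystems:path` line format shown above; cgroup_path is a hypothetical helper name.

```shell
# Print the group path for one subsystem.
# $1 = subsystem name; reads the contents of /proc/<pid>/cgroup on stdin.
cgroup_path() {
    awk -F: -v want="$1" '
        {
            n = split($2, subsys, ",")
            for (i = 1; i <= n; i++)
                if (subsys[i] == want) { print $3; exit }
        }'
}

# Usage (12345 is a placeholder PID):
#   cgroup_path memory < /proc/12345/cgroup
```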

[root@abc tmp]# numastat //NUMA access statistics (with NUMA enabled)
                         node0          node1
numa_hit          124707339647    29192299625
numa_miss          13468643333     2043118290
numa_foreign        2043118290    13468643333
interleave_hit           10961          10991
local_node        124707158180    29191808808
other_node         13468824800     2043609107
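
One way to read these counters is the share of allocations on each node that counted as a miss, numa_miss / (numa_hit + numa_miss). The sketch below computes it from numastat output; numa_miss_pct is a hypothetical helper, and it assumes the value columns appear in node order starting at node0, as in the output above.

```shell
# Print "node<N> <miss_percent>" for each node column.
# Reads `numastat` output on stdin (hypothetical helper).
numa_miss_pct() {
    awk '
        $1 == "numa_hit"  { for (i = 2; i <= NF; i++) hit[i] = $i; cols = NF }
        $1 == "numa_miss" { for (i = 2; i <= NF; i++) miss[i] = $i }
        END {
            for (i = 2; i <= cols; i++) {
                total = hit[i] + miss[i]
                if (total > 0)
                    printf "node%d %.2f\n", i - 2, 100 * miss[i] / total
            }
        }'
}

# Usage on a real machine:
#   numastat | numa_miss_pct
```

For the statistics above this works out to roughly 9.7% on node0 and 6.5% on node1.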

[root@abc tmp]# numastat //NUMA access statistics (with NUMA disabled)
                         node0
numa_hit            1201229534
numa_miss                    0
numa_foreign                 0
interleave_hit           21915
local_node          1201229534
other_node                   0

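6. Configuring and using cgroups automatically
Steps 1 to 4 above cover basic installation and using cgroups to control the resources assigned to a program. For bulk deployment, where programs should join their group automatically at startup instead of being launched by hand with cgexec, the cgred service can work together with cgroups to control programs and their child processes automatically. Below is the deployment script used on most of our production machines: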
cat cgroup.sh
#!/bin/bash
# bigy @ 20130916

# Check the number of NUMA nodes
nodes=`/usr/bin/numactl --hardware |awk '/available/{print $2}'`
if [ "${nodes:-0}" -lt 2 ];then
echo "available nodes is $nodes,not support"
exit 1
fi

# Get the CPU list of each node (the field list assumes four CPUs per node)
node0=`/usr/bin/numactl --hardware |awk -F: '/node 0 cpus/{print $2}' |awk '{print $1","$2","$3","$4}'`
node1=`/usr/bin/numactl --hardware |awk -F: '/node 1 cpus/{print $2}' |awk '{print $1","$2","$3","$4}'`

# The machine is NUMA: make sure libcgroup is installed
# (braces instead of a subshell, so that exit really stops the script)
rpm -q libcgroup >/dev/null || yum install libcgroup -y || { echo "yum install libcgroup error"; exit 1; }

#=====update /etc/cgconfig.conf =====
cat > /etc/cgconfig.conf <<END
mount {
cpuset = /cgroup/cpu_and_mem;
memory = /cgroup/cpu_and_mem;
}

group chunk_server {
cpuset {
cpuset.cpus = "$node0";
cpuset.mems="0";
}
memory {
memory.limit_in_bytes=4G;
memory.memsw.limit_in_bytes=4G;
memory.swappiness=0;
}
}

group other {
cpuset {
cpuset.cpus = "$node1";
cpuset.mems="1";
}
memory {
memory.limit_in_bytes=8G;
memory.memsw.limit_in_bytes=8G;
memory.swappiness=0;
}
}
END

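#=====update /etc/cgrules.conf =====

# Format: user:program  subsystems  destination group

# program: a program name, a full path (the program must then be run by its full path), or the program's startup script
# destination group: as listed by the lscgroup command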

cat > /etc/cgrules.conf <<END
*:/usr/local/gfs/bin/chunk_server cpuset,memory /chunk_server
*:/usr/local/vip_cdn/bin/monitor.sh cpuset,memory /other
END

#===== start cgconfig and cgred =====
/etc/init.d/cgconfig restart
/sbin/chkconfig cgconfig on

/etc/init.d/cgred restart
/sbin/chkconfig cgred on

echo "===== install cgroup done! ====="

# Restart the programs so that they join their groups automatically
echo "===== kill processes for restart ====="
killall -9 chunk_server
/usr/local/gfs/bin/gfs_chunk_server_daemon.sh
killall -9 vip_cdn
/usr/local/vip_cdn/bin/monitor.sh

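Notes:
*:/usr/local/gfs/bin/chunk_server cpuset,memory /chunk_server
or *:chunk_server cpuset,memory /chunk_server
The program must be started by its full path, otherwise the rule does not match;
*:/usr/local/vip_cdn/bin/monitor.sh cpuset,memory /other
Because http_down is started as ./vip_cdn rather than by its full path, the rule cannot simply name the program the way the chunk_server rule does; the rule names the full path of the monitoring script instead.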
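
The deployment script above builds each node's CPU list from exactly four awk fields, which fits machines with four CPUs per node. For other core counts, a more general extraction could look like the sketch below; node_cpus is a hypothetical helper and assumes the `node N cpus: ...` line format printed by `numactl --hardware`.

```shell
# Print the CPUs of one NUMA node as a comma-separated list.
# $1 = node number; reads `numactl --hardware` output on stdin.
node_cpus() {
    awk -F: -v pat="^node $1 cpus" '
        $0 ~ pat {
            list = $2
            gsub(/^[ \t]+|[ \t]+$/, "", list)   # trim surrounding whitespace
            gsub(/[ \t]+/, ",", list)           # turn separators into commas
            print list
        }'
}

# Usage on a real machine (requires numactl):
#   node0=$(numactl --hardware | node_cpus 0)
```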

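7. Disabling NUMA
When a system runs many programs whose resource usage is hard to control, consider disabling NUMA. I ran different tests on some machines; the results were slightly better with NUMA disabled, and it is now disabled at the OS kernel level on all of them.
How to disable it:
1. Hardware level: disable it in the BIOS;
2. OS kernel: append the parameter numa=off to the kernel line used at boot;
3. Alternatively, use the numactl command to change the memory allocation policy to interleave.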
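
Whether the kernel parameter from method 2 is in effect can be checked from the boot command line. A minimal sketch; numa_off_in_cmdline is a hypothetical helper that takes the command line as a string, so it can be tried without rebooting anything.

```shell
# Succeed if the given kernel command line contains numa=off as a separate word.
numa_off_in_cmdline() {
    case " $1 " in
        *" numa=off "*) return 0 ;;
        *)              return 1 ;;
    esac
}

# Usage on a real machine:
#   numa_off_in_cmdline "$(cat /proc/cmdline)" && echo "NUMA disabled at boot"
# Method 3 applied to a single program (my_program is a placeholder):
#   numactl --interleave=all ./my_program
```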

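PS:

Because the workloads on our production machines are fairly complex, I only split them roughly into two groups, which is enough to keep the most important service running unaffected. Given how complex our deployments are, NUMA is disabled on most machines and cgroups are rarely used for resource isolation. One data center is a special case: a few machines with 24 cores, 64 GB of memory and 10 GbE NICs together serve as a single node, so each machine runs even more programs. There, cgroups isolate the resources of each program, and this works very well. Because of performance limits in the programs themselves, these machines currently carry only about 15G of bandwidth.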

This article comes from the "Big_Y" blog; please keep this attribution: http://bigys.blog.51cto.com/2933677/1412873
