The kernel defines the following memory zones:
enum zone_type {
#ifdef CONFIG_ZONE_DMA
/*
* ZONE_DMA is used when there are devices that are not able
* to do DMA to all of addressable memory (ZONE_NORMAL). Then we
* carve out the portion of memory that is needed for these devices.
* The range is arch specific.
*
* Some examples
*
* Architecture Limit
* ---------------------------
* parisc, ia64, sparc <4G
* s390 <2G
* arm Various
* alpha Unlimited or 0-16MB.
*
* i386, x86_64 and multiple other arches
* <16M.
*/
ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
/*
* x86_64 needs two ZONE_DMAs because it supports devices that are
* only able to do DMA to the lower 16M but also 32 bit devices that
* can only do DMA areas below 4G.
*/
ZONE_DMA32,
#endif
/*
* Normal addressable memory is in ZONE_NORMAL. DMA operations can be
* performed on pages in ZONE_NORMAL if the DMA devices support
* transfers to all addressable memory.
*/
ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
/*
* A memory area that is only addressable by the kernel through
* mapping portions into its own address space. This is for example
* used by i386 to allow the kernel to address the memory beyond
* 900MB. The kernel will set up special mappings (page
* table entries on i386) for each page that the kernel needs to
* access.
*/
ZONE_HIGHMEM,
#endif
ZONE_MOVABLE,
__MAX_NR_ZONES
};
Simply put, movable pages do not all have to live in ZONE_MOVABLE, but every page in ZONE_MOVABLE must be movable. Let's look at /proc/pagetypeinfo for a real example:
xie:/proc # cat pagetypeinfo
Page block order: 10
Pages per block: 1024
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 76 50 24 20 27 25 19 3 1 2 0
Node 0, zone DMA, type Movable 117 35 28 172 281 93 49 21 7 4 4
Node 0, zone DMA, type Reclaimable 0 3 1 0 0 0 0 1 0 1 0
Node 0, zone DMA, type CMA 3380 1798 856 386 152 55 21 8 4 0 0
Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Unmovable 521 654 531 286 132 52 15 2 1 4 0
Node 0, zone Normal, type Movable 1 8 21 21 1 1 5 3 1 0 0
Node 0, zone Normal, type Reclaimable 18 24 1 1 0 0 1 0 1 0 0
Node 0, zone Normal, type CMA 9 0 1 6 2 0 1 0 0 0 0
Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Movable, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Movable, type Movable 963 649 188 48 24 112 49 21 8 3 50
Node 0, zone Movable, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Movable, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Movable, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Movable, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Number of blocks type Unmovable Movable Reclaimable CMA HighAtomic Isolate
Node 0, zone DMA 123 310 18 61 0 0
Node 0, zone Normal 406 310 43 9 0 0
Node 0, zone Movable 0 256 0 0 0 0
Number of mixed blocks Unmovable Movable Reclaimable CMA HighAtomic Isolate
Node 0, zone DMA 0 61 0 0 0 0
Node 0, zone Normal 0 11 3 0 0 0
Node 0, zone Movable 0 0 0 0 0 0
As the output shows, the Movable zone contains no Unmovable pages, only Movable ones.
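Each row of the "Free pages count" table lists the number of free blocks per order, and a block of order i holds 2^i pages. The following is a minimal sketch of turning one row into a total free page count. The function name `free_pages_in_row` is made up for this illustration and is not a kernel API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sum the free pages described by one /proc/pagetypeinfo row.
 * counts[i] is the number of free blocks of order i, and each such
 * block holds 2^i pages.
 */
static unsigned long free_pages_in_row(const unsigned long *counts,
                                       size_t n_orders)
{
    unsigned long total = 0;
    size_t order;

    assert(counts != NULL);
    for (order = 0; order < n_orders; order++)
        total += counts[order] << order;
    return total;
}
```

Applied to the "zone Movable, type Movable" row above (963 649 188 48 24 112 49 21 8 3 50), this yields 67973 free pages.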
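This zone exists mainly to support memory hotplug. Memory hotplug, in turn, was designed with the following considerations in mind:
1. Logical memory hotplug: support for virtual machines, so that a guest can be given usable memory on demand.
2. Physical memory hotplug: support for NUMA servers, where memory that is not needed can be set offline to reduce power consumption.
3. Reducing memory fragmentation.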
Every page in this zone is movable, and the zone can only serve allocation requests that set both __GFP_HIGHMEM and __GFP_MOVABLE, for example:
#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
Note that the allocation flag __GFP_MOVABLE must not be confused with the zone ZONE_MOVABLE. The two do not correspond one to one.
#define __GFP_DMA ((__force gfp_t)___GFP_DMA)
#define __GFP_HIGHMEM ((__force gfp_t)___GFP_HIGHMEM)
#define __GFP_DMA32 ((__force gfp_t)___GFP_DMA32)
#define __GFP_MOVABLE ((__force gfp_t)___GFP_MOVABLE) /* Page is movable */
#define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
These allocation flags are called zone modifiers. They indicate which zone an allocation should preferably come from.
bit result
=================
0x0 => NORMAL
0x1 => DMA or NORMAL
0x2 => HIGHMEM or NORMAL
0x3 => BAD (DMA+HIGHMEM)
0x4 => DMA32 or DMA or NORMAL
0x5 => BAD (DMA+DMA32)
0x6 => BAD (HIGHMEM+DMA32)
0x7 => BAD (HIGHMEM+DMA32+DMA)
0x8 => NORMAL (MOVABLE+0)
0x9 => DMA or NORMAL (MOVABLE+DMA)
0xa => MOVABLE (Movable is valid only if HIGHMEM is set too)
0xb => BAD (MOVABLE+HIGHMEM+DMA)
0xc => DMA32 (MOVABLE+DMA32)
0xd => BAD (MOVABLE+DMA32+DMA)
0xe => BAD (MOVABLE+DMA32+HIGHMEM)
0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
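Four bits encode the combination. At most one of the low three bits (__GFP_DMA, __GFP_HIGHMEM, __GFP_DMA32) may be set, while __GFP_MOVABLE can be combined with any of them. Of the 16 possible values, only the 8 that are not marked BAD above are valid. The result for each value is shifted by an offset derived from the bit combination and packed into a single long-typed table.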
GFP_ZONE_TABLE:
|BAD|BAD|BAD|DMA32|BAD|MOVABLE|......|NORMAL|
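Each result is stored in the zone table at an offset derived from the bit combination above, so the target zone can be located quickly from the flags. As this shows, __GFP_MOVABLE describes an allocation policy and does not map directly to ZONE_MOVABLE. As noted earlier, memory is allocated from ZONE_MOVABLE only when __GFP_HIGHMEM and __GFP_MOVABLE are both set.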
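The packed-table lookup described above can be sketched in a few lines of C. This is an illustration under stated assumptions, not the kernel's code. The zone numbering, the 3-bit entry width, the `Z_BAD` marker stored inside the table, and all names (`sketch_zone`, `build_zone_table`, `lookup_zone`) are hypothetical. It also assumes every zone is configured, so "DMA or NORMAL" resolves to DMA.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative zone numbering; Z_BAD is a marker used only in this sketch. */
enum sketch_zone { Z_DMA, Z_DMA32, Z_NORMAL, Z_HIGHMEM, Z_MOVABLE, Z_BAD };

/* Zone modifier bits, matching the bit values in the table above. */
#define F_DMA     0x1u
#define F_HIGHMEM 0x2u
#define F_DMA32   0x4u
#define F_MOVABLE 0x8u

#define ENTRY_BITS 3u  /* hypothetical width: enough to hold 6 values */
#define ENTRY_MASK ((1u << ENTRY_BITS) - 1u)

/* Decide the target zone for one of the 16 modifier combinations. */
static enum sketch_zone zone_for_bits(unsigned int bits)
{
    unsigned int low = bits & (F_DMA | F_HIGHMEM | F_DMA32);

    if (low & (low - 1u))      /* more than one of the low three bits set */
        return Z_BAD;
    if (low == F_DMA)
        return Z_DMA;
    if (low == F_DMA32)
        return Z_DMA32;
    if (low == F_HIGHMEM)      /* MOVABLE only takes effect with HIGHMEM */
        return (bits & F_MOVABLE) ? Z_MOVABLE : Z_HIGHMEM;
    return Z_NORMAL;           /* no low bit set, with or without MOVABLE */
}

/* Pack all 16 results into one integer, entry i at bit i * ENTRY_BITS. */
static uint64_t build_zone_table(void)
{
    uint64_t table = 0;
    unsigned int bits;

    for (bits = 0; bits < 16; bits++)
        table |= (uint64_t)zone_for_bits(bits) << (bits * ENTRY_BITS);
    return table;
}

/* Look up the zone for a flag combination with one shift and one mask. */
static enum sketch_zone lookup_zone(uint64_t table, unsigned int flags)
{
    assert(flags <= 0xfu);
    return (enum sketch_zone)((table >> (flags * ENTRY_BITS)) & ENTRY_MASK);
}
```

With this table, `lookup_zone(table, F_HIGHMEM | F_MOVABLE)` (value 0xa) returns `Z_MOVABLE`, while `lookup_zone(table, F_MOVABLE)` (value 0x8) returns `Z_NORMAL`. This matches the point that __GFP_MOVABLE alone does not select ZONE_MOVABLE.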
The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA
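So an allocation is not necessarily satisfied from the zone that the flags select. If that zone has no suitable memory, the allocator falls back through the zones in the order above until it finds memory that meets the request.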
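The fallback walk can be pictured with a deliberately simplified sketch. It assumes a per-zone free page counter and treats "enough free pages" as the only criterion. The real allocator applies further checks that are not modeled here. The enum, the `free_pages` array and the function name are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative zone order, lowest to highest, mirroring the chain above. */
enum sketch_zone_idx {
    ZI_DMA, ZI_DMA32, ZI_NORMAL, ZI_HIGHMEM, ZI_MOVABLE, ZI_COUNT
};

/*
 * Walk from the preferred zone down toward ZONE_DMA and return the first
 * zone with at least `need` free pages, or -1 if no zone qualifies.
 * free_pages[] is hypothetical bookkeeping indexed by zone.
 */
static int pick_zone_with_fallback(int preferred,
                                   const unsigned long *free_pages,
                                   unsigned long need)
{
    int z;

    assert(preferred >= 0 && preferred < ZI_COUNT);
    assert(free_pages != NULL);
    for (z = preferred; z >= 0; z--) {
        if (free_pages[z] >= need)
            return z;
    }
    return -1;
}
```

A request that prefers `ZI_MOVABLE` can therefore end up in `ZI_NORMAL` or lower, whereas a request that prefers `ZI_NORMAL` never moves up into `ZI_HIGHMEM` or `ZI_MOVABLE`.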
- For all memory hotplug
Memory model -> Sparse Memory (CONFIG_SPARSEMEM)
Allow for memory hot-add (CONFIG_MEMORY_HOTPLUG)
- To enable memory removal, the followings are also necessary
Allow for memory hot remove (CONFIG_MEMORY_HOTREMOVE)
Page Migration (CONFIG_MIGRATION)
- For ACPI memory hotplug, the followings are also necessary
Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY)
This option can be kernel module.
- As a related configuration, if your box has a feature of NUMA-node hotplug
via ACPI, then this option is necessary too.
ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
(CONFIG_ACPI_CONTAINER).
This option can be kernel module too.
1) When kernelcore=YYYY boot option is used,
Size of memory not for movable pages (not for offline) is YYYY.
Size of memory for movable pages (for offline) is TOTAL-YYYY.
2) When movablecore=ZZZZ boot option is used,
Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
Size of memory for movable pages (for offline) is ZZZZ.
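The arithmetic behind the two boot options can be written out as a small sketch. It only restates the two rules above. The struct, the function names and the clamping of oversized values to the total are assumptions of this example, and the per-node distribution the kernel performs is not modeled.

```c
#include <assert.h>

struct mem_split {
    unsigned long long kernel_bytes;   /* memory not for movable pages */
    unsigned long long movable_bytes;  /* memory for movable pages */
};

/* kernelcore=YYYY: YYYY stays non-movable, TOTAL - YYYY becomes movable. */
static struct mem_split split_by_kernelcore(unsigned long long total,
                                            unsigned long long kernelcore)
{
    struct mem_split s;

    assert(total > 0);
    /* Clamping to total is an assumption of this sketch. */
    s.kernel_bytes = kernelcore < total ? kernelcore : total;
    s.movable_bytes = total - s.kernel_bytes;
    return s;
}

/* movablecore=ZZZZ: ZZZZ becomes movable, TOTAL - ZZZZ stays non-movable. */
static struct mem_split split_by_movablecore(unsigned long long total,
                                             unsigned long long movablecore)
{
    struct mem_split s;

    assert(total > 0);
    /* Clamping to total is an assumption of this sketch. */
    s.movable_bytes = movablecore < total ? movablecore : total;
    s.kernel_bytes = total - s.movable_bytes;
    return s;
}
```

For an 8 GiB machine, kernelcore=2G gives 2 GiB non-movable and 6 GiB movable, while movablecore=2G gives the opposite split.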
The kernel exposes sysfs nodes for controlling memory hotplug:
% echo online > /sys/devices/system/memory/memoryXXX/state
Brings the memory block online.
% echo online_movable > /sys/devices/system/memory/memoryXXX/state
Brings the memory block online in ZONE_MOVABLE.
% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
Brings the memory block online in ZONE_NORMAL.
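The same writes can be issued from a program instead of the shell. Below is a minimal sketch that uses only standard C I/O. The helper name `set_memory_block_state` is hypothetical, and the caller supplies the full path to the block's state file, since the block number depends on the system.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a state string such as "online", "online_movable" or
 * "online_kernel" to a memory block's sysfs state file.
 * Returns 0 on success, -1 on failure.
 */
static int set_memory_block_state(const char *state_path, const char *state)
{
    FILE *f;
    int ok;

    assert(state_path != NULL && state != NULL);
    f = fopen(state_path, "w");
    if (f == NULL)
        return -1;
    ok = (fputs(state, f) >= 0);
    /* stdio buffers the data, so a rejected write may only show up here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Writing to these files normally requires root privileges, and the kernel may refuse the request, so the return value should always be checked.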
Let's first look at what happens when the memory zones are initialized.
On a NUMA-enabled system the path is:
zone_sizes_init->free_area_init_nodes->find_zone_movable_pfns_for_nodes:
/*
* If movable_node is specified, ignore kernelcore and movablecore
* options.
*/
if (movable_node_is_enabled()) {
for_each_memblock(memory, r) {
if (!memblock_is_hotpluggable(r))
continue;
nid = r->nid;
usable_startpfn = PFN_DOWN(r->base);
zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
min(usable_startpfn, zone_movable_pfn[nid]) :
usable_startpfn;
}
goto out2;
}
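To make the loop above easier to follow outside the kernel, here is a self-contained sketch of the same idea: for each node, keep the lowest start PFN among its hotpluggable regions. The struct, the constants (4 KiB pages, 4 nodes) and the function name are assumptions made for this illustration and do not come from the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12   /* assumes 4 KiB pages */
#define SKETCH_MAX_NODES  4    /* hypothetical node count */

struct sketch_region {
    unsigned long long base;   /* physical start address */
    int nid;                   /* NUMA node id */
    int hotpluggable;          /* nonzero if marked hotpluggable */
};

/*
 * For each node, record the lowest start PFN among its hotpluggable
 * regions. 0 means "no hotpluggable region seen", as in the excerpt.
 */
static void find_movable_start_pfns(const struct sketch_region *r, size_t n,
                                    unsigned long movable_pfn[SKETCH_MAX_NODES])
{
    size_t i;

    for (i = 0; i < SKETCH_MAX_NODES; i++)
        movable_pfn[i] = 0;
    for (i = 0; i < n; i++) {
        unsigned long pfn;

        if (!r[i].hotpluggable)
            continue;          /* only hotpluggable memory becomes movable */
        assert(r[i].nid >= 0 && r[i].nid < SKETCH_MAX_NODES);
        pfn = (unsigned long)(r[i].base >> SKETCH_PAGE_SHIFT);
        if (movable_pfn[r[i].nid] == 0 || pfn < movable_pfn[r[i].nid])
            movable_pfn[r[i].nid] = pfn;
    }
}
```

Everything on a node from the recorded PFN upward is then a candidate for ZONE_MOVABLE, which is why the kernelcore and movablecore options are ignored on this path.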
When the corresponding property is set in the device tree (dts), the matching memblock flag is set:
int __init early_init_dt_scan_memory(unsigned long node, const char *uname,
                                     int depth, void *data)
{
    bool hotpluggable;

    hotpluggable = of_get_flat_dt_prop(node, "hotpluggable", NULL);

    while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
        u64 base, size;

        base = dt_mem_next_cell(dt_root_addr_cells, &reg);
        size = dt_mem_next_cell(dt_root_size_cells, &reg);

        if (size == 0)
            continue;
        pr_debug(" - %llx , %llx\n", (unsigned long long)base,
                 (unsigned long long)size);

        early_init_dt_add_memory_arch(base, size);

        if (!hotpluggable)
            continue;

        if (early_init_dt_mark_hotplug_memory_arch(base, size))
            pr_warn("failed to mark hotplug range 0x%llx - 0x%llx\n",
                    base, base + size);
    }
}

int __init __weak early_init_dt_mark_hotplug_memory_arch(u64 base, u64 size)
{
    return memblock_mark_hotplug(base, size);
}

int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
{
    return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG);
}
from: https://blog.csdn.net/rikeyone/article/details/86498298
Original: https://www.cnblogs.com/aspirs/p/12781693.html