[转载] Jemalloc 内存分配器解析

张三太爷 2023/06/03 0

INSIDE OF JEMALLOC
The Algorithm and Implementation of Jemalloc

author: vector03
mail: mmzsmm@163.com

–[ Table of contents

1 – 简介
2 – Basic structures
2.1 – Overview
2.2 – Arena (arena_t)
2.2.1 – CPU Cache-Line
2.2.2 – Arena 原理
2.2.3 – choose_arena
2.2.4 – Arena 结构
2.3 – Chunk (arena_chunk_t)
2.3.1 – overview
2.3.2 – chunk 结构
2.3.3 – chunk map (arena_chunk_map_t)
2.4 – Run (arena_run_t)
2.4.1 – run 结构
2.4.2 – size class
2.4.3 – size2bin/bin2size
2.5 – Bins (arena_bin_t)
2.6 – Thread caches (tcache_t)
2.7 – Extent Node (extent_node_t)
2.8 – Base
3 – Allocation
3.1 – Overview
3.2 – Initialize
3.3 – small allocation (Arena)
3.3.1 – arena_run_reg_alloc
3.3.2 – arena_bin_malloc_hard
3.3.3 – arena_bin_nonfull_run_get
3.3.4 – small run alloc
3.3.5 – chunk alloc
3.4 – small allocation (tcache)
3.5 – large allocation
3.6 – huge allocation
4 – Deallocation
4.1 – Overview
4.2 – arena_dalloc_bin
4.3 – small run dalloc
4.4 – arena purge
4.5 – arena chunk dalloc
4.6 – large/huge dalloc
5 – 总结: 与 Dl 的对比
附: 快速调试 Jemalloc

–[ 1 – 简介

Jemalloc 最初是 Jason Evans 为 FreeBSD 开发的新一代内存分配器，用来替代原来的 phkmalloc，最早投入使用是在 2005 年，到目前为止，除了原版 Je，还有很多变种被用在各种项目里，Google 在 Android 5.0 里将 bionic 中的默认分配器从 Dl 替换为 Je，也是看中了其强大的多核多线程分配能力。

同经典分配器 —— 如 Dlmalloc 相比，Je 在基本思路和实现上存在明显的差别，比如，Dl 在分配策略上倾向于先 dss 后 mmap 的方式，为的是快速向前分配，但 Je 则完全相反，而实现上也放弃了经典的 boundary tag，这些设计牺牲了局部分配速度和回收效率，但在更大的空间和时间范围内却获得更好的分配效果。

更重要的是，相对经典分配器，Je 最大的优势还是其强大的多核/多线程分配能力。以现代计算机硬件架构来说，最大的瓶颈已经不再是内存容量或 CPU 速度，而是多核/多线程下的 lock contention。因为无论核心数量如何多，通常情况下内存只有一份。可以说，如果内存足够大，CPU 的核心数量越多，程序线程数越多，Je 的分配速度越快。而这一点是经典分配器所无法达到的。

这篇文章基于 Android 5.x 中的 Jemalloc 3.6.0。最新的版本为 4.x，获取最新代码请至：https://github.com/jemalloc/jemalloc/releases

–[ 2 – Basic structures

相对于 Dl，Je 引入了更多更复杂的分配结构，如 arena、chunk、bin、run、region、tcache 等等。其中有些类似 Dl，但更多的具有不同含义。本节将对它们一一介绍。

—-[ 2.1 – Overview

首先，先给出一个整体的概念。Je 对内存划分按照如下由高到低的顺序，

1. 内存是由一定数量的 arenas 进行管理；
2. 一个 arena 被分割成若干 chunks, 后者主要负责记录 bookkeeping；
3. chunk 内部又包含着若干 runs，作为分配小块内存的基本单元；
4. run 由 pages 组成，最终被划分成一定数量的 regions；
5. 对于 small size 的分配请求来说，这些 region 就相当于 user memory。

Arena #0
+----------------------------------------------------------------------------+
|                                                                            |
|    Chunk #0                             Chunk #1                           |
|  +---------------------------------+  +---------------------------------+  |
|  |                                 |  |                                 |  |
|  |   Run #0          Run #1        |  |   Run #0          Run #1        |  |
|  | +-------------+ +-------------+ |  | +-------------+ +-------------+ |  |
|  | |             | |             | |  | |             | |             | |  |
|  | |   Page      | |   Page      | |  | |   Page      | |   Page      | |  |
|  | | +---------+ | | +---------+ | |  | | +---------+ | | +---------+ | |  |
|  | | |         | | | |         | | |  | | |         | | | |         | | |  |
|  | | | Regions | | | | Regions | | |  | | | Regions | | | | Regions | | |  |
|  | | |[] [] [] | | | |[] [] [] | | |  | | |[] [] [] | | | |[] [] [] | | |  |
|  | | |         | | | |         | | |  | | |         | | | |         | | |  |
|  | | +---------+ | | +---------+ | |  | | +---------+ | | +---------+ | |  |
|  | |             | |             | |  | |             | |             | |  |
|  | |   Page      | |   Page      | |  | |   Page      | |   Page      | |  |
|  | | +---------+ | | +---------+ | |  | | +---------+ | | +---------+ | |  |
|  | | |         | | | |         | | |  | | |         | | | |         | | |  |
|  | | | ...     | | | | ...     | | |  | | | ...     | | | | ...     | | |  |
|  | | +---------+ | | +---------+ | |  | | +---------+ | | +---------+ | |  |
|  | +-------------+ +-------------+ |  | +-------------+ +-------------+ |  |
|  +---------------------------------+  +---------------------------------+  |
+----------------------------------------------------------------------------+

Arena #0

+----------------------------------------------------------------------------+

| |

| Chunk #0 Chunk #1 |

| +---------------------------------+ +---------------------------------+ |

| | | | | |

| | +-------------+ +-------------+ | | +-------------+ +-------------+ | |

| | | | | | | | | | | | | |

| | | +---------+ | | +---------+ | | | | +---------+ | | +---------+ | | |

| | | | | | | | | | | | | | | | | | | | | |

| | | |[] [] [] | | | |[] [] [] | | | | | |[] [] [] | | | |[] [] [] | | | |

| | | | | | | | | | | | | | | | | | | | | |

| | | +---------+ | | +---------+ | | | | +---------+ | | +---------+ | | |

| | | | | | | | | | | | | |

| | | +---------+ | | +---------+ | | | | +---------+ | | +---------+ | | |

| | | | | | | | | | | | | | | | | | | | | |

| | | | ... | | | | ... | | | | | | ... | | | | ... | | | |

| | | +---------+ | | +---------+ | | | | +---------+ | | +---------+ | | |

| | +-------------+ +-------------+ | | +-------------+ +-------------+ | |

| +---------------------------------+ +---------------------------------+ |

+----------------------------------------------------------------------------+

—-[ 2.2 – Arena (arena_t)

如前所述，Arena 是 Je 中最大或者说最顶层的基础结构。这个概念其实上是针对“对称多处理机（SMP）”产生的。在 SMP 中，导致性能劣化的一个重要原因在于“false sharing”导致 cache-line 失效。

为了解决 cache-line 共享问题，同时保证更少的内部碎片（internal fragmentation），Je 使用了 arena。

——[ 2.2.1 – CPU Cache-Line

现代处理器为了解决内存总线吞吐量的瓶颈使用了内部 cache 技术。尽管 cache 的工作机制很复杂，且对外透明，但在编程上，还是有必要了解 cache 的基本性质。

Cache 相当于嵌入到 CPU 内部的一组内存单元，速度是主存的 N 倍，但造价很高，因此一般容量很小。有些 CPU 设计了容量逐级逐渐增大的多级 cache，但速度逐级递减。多级处理更复杂，但原理类似，为了简化，仅讨论 L1 data cache。

Cache 同主存进行数据交换有一个最小粒度，称为 cache-line，通常这个值为 64。例如在一个 ILP32 的机器上，一次 cache 交换可以读写连续 16 个 int 型数据。因此当访问数组 #0 元素时，后面 15 个元素也被同时“免费”读到了 cache 中，这对于数组的连续访问是非常有利的。

然而这种免费加载不总是会带来好处，有时候甚至起到反效果，所谓“false sharing”。试想两个线程 A 和 B 分别执行在不同的 CPU 核心中并分别操作各自上下文中的变量 x 和 y。如果因为某种原因（比如 x、y 可能位于同一个 class 内部，或者分别是数组中的两个相邻元素），两者位于相同的 cache-line 中，则在两个 core 的 L1 cache 里都存在 x 和 y 的副本。倘若线程 A 修改了 x 的值, 就会导致在 B 中的 x 与 A 中看到的不一致. 尽管这个变量 x 对 B 可能毫无用处, 但 CPU 为了保证前后的正确和一致性，只能判定 core #1 的 cache 失效。因此 core #0 必须将 cache-line 回写到主存，然后 core #1 再重新加载 cache-line，反之亦然。如果恰好两个线程交替操作同一 cache-line 中的数据，将对 cache 造成极大的损害，导致严重的性能退化。

+-----------------------+        +-----------------------+
| core #0               |        | core #1               |
|                       |        |                       |
|  +----------+         |        |  +----------+         |
|  | ThreadA  |         |        |  | ThreadB  |         |
|  +----------+         |        |  +----------+         |
|        |              |        |        |              |
|    +---+              |        |        |              |
|    |                  |        |        |              |
|    v        D-cache   |        |        v     D-cache  |
|  +-----------------+  |        |  +-----------------+  |
|  | x'| y | ... ... | <---+  +---> | x | y'| ... ... |  |
|  |-----------------|  |  |  |  |  |-----------------|  |
|  |    ... ...      |  |  |  |  |  |    ... ...      |  |
|  |    ... ...      |  |  |  |  |  |    ... ...      |  |
|  |    ... ...      |  |  |  |  |  |    ... ...      |  |
|  +-----------------+  |  |  |  |  +-----------------+  |
+-----------------------+  |  |  +-----------------------+
|  |
+------+  |
|         |
v         v
memory   +-----------------------------
| ... | x | y |     ... ...
+-----------------------------

+-----------------------+ +-----------------------+

| core #0 | | core #1 |

| | | |

| +----------+ | | +----------+ |

| +----------+ | | +----------+ |

| | | | | |

| +---+ | | | |

| | | | | |

| v D-cache | | v D-cache |

| +-----------------+ | | +-----------------+ |

| | x'| y | ... ... | <---+ +---> | x | y'| ... ... | |

| |-----------------| | | | | |-----------------| |

| | ... ... | | | | | | ... ... | |

| +-----------------+ | | | | +-----------------+ |

+-----------------------+ | | +-----------------------+

| |

+------+ |

| |

v v

memory +-----------------------------

| ... | x | y | ... ...

+-----------------------------

说到底，从程序的角度看，变量是独立的地址单元，但在 CPU 看来则是以 cache-line 为整体的单元。单独的变量竞争可以在代码中增加同步来解决，而 cache-line 的竞争是透明的、不可控的，只能被动由 CPU 仲裁。这种观察角度和处理方式的区别，正是 false sharing 的根源。

——[ 2.2.2 – Arena 原理

回到 memory allocator 的话题上。对于一个多线程+多 CPU 核心的运行环境，传统分配器中大量开销被浪费在 lock contention 和 false sharing 上，随着线程数量和核心数量增多，这种分配压力将越来越大。

针对多线程，一种解决方法是将一把 global lock 分散成很多与线程相关的 lock。而针对多核心，则要尽量把不同线程下分配的内存隔离开，避免不同线程使用同一个 cache-line 的情况。按照上面的思路，一个较好的实现方式就是引入 arena。

+---------+     +---------+     +---------+     +---------+     +---------+
| threadA |     | threadB |     | threadC |     | threadD |     | threadE |
+---------+     +---------+     +---------+     +---------+     +---------+
|               |               |               |               |
|               +---------------|---------------|---------+     |
+------------------+    +-------+        +------+         |     |
+-----------------|----|----------------|----------------|-----+
|                 |    |                |                |
v                 v    v                v                v
+----------+        +----------+        +----------+        +----------+
|          |        |          |        |          |        |          |
| Arena #0 |        | Arena #1 |        | Arena #2 |        | Arena #3 |
|          |        |          |        |          |        |          |
+----------+        +----------+        +----------+        +----------+

+---------+ +---------+ +---------+ +---------+ +---------+

+---------+ +---------+ +---------+ +---------+ +---------+

| | | | |

| +---------------|---------------|---------+ |

+------------------+ +-------+ +------+ | |

+-----------------|----|----------------|----------------|-----+

| | | | |

v v v v v

+----------+ +----------+ +----------+ +----------+

| | | | | | | |

| | | | | | | |

+----------+ +----------+ +----------+ +----------+

Je 将内存划分成若干数量的 arenas，线程最终会与某一个 arena 绑定。比如上图中的 threadA 和 B 就分别绑定到 arena #1 和 #3 上。由于两个 arena 在地址空间上几乎不存在任何联系，就可以在无锁的状态下完成分配。同样由于空间不连续，落到同一个 cache-line 中的几率也很小，保证了各自独立。

由于 arena 的数量有限，因此不能保证所有线程都能独占 arena，比如，图中 threadA 和 C 就都绑定到了 arena1 上。分享同一个 arena 的所有线程，由该 arena 内部的 lock 保持同步。

Je 将 arena 保存到一个数组中，该数组全局记录了所有 arenas，

arena_t            **arenas;

1	arena_t **arenas;

事实上，该数组是动态分配的，arenas 仅仅是一个数组指针。默认情况下 arenas 数组的长度与如下变量相关，

unsigned    narenas_total;
unsigned    narenas_auto;
size_t        opt_narenas = 0;

unsigned narenas_total;

unsigned narenas_auto;

size_t opt_narenas = 0;

而它们又与当前 CPU 核心数量相关。核心数量记录在另外一个全局变量 ncpus 里，

unsigned    ncpus;

1	unsigned ncpus;

如果 ncpus 等于 1，则有且仅有一个 arena，如果大于 1，则默认 arenas 的数量为 ncpus 的四倍。即双核下 8 个 arena，四核下 16 个 arena，依此类推。

(gdb) p ncpus
$20 = 4
(gdb) p narenas_total
$21 = 16

(gdb) p ncpus

$20 = 4

(gdb) p narenas_total

$21 = 16

jemalloc 变体很多，不同变体对 arenas 的数量有所调整，比如 firefox 中 arena 固定为 1，而 Android 被限定为最大不超过 2。这个限制被写到 Android jemalloc 的 mk 文件中。

——[ 2.2.3 – choose_arena

最早引入 arena 并非由 Je 首创，但早期线程与 arena 绑定是通过 hash 线程 id 实现的，相对来说随机性比较强。Je 改进了绑定的算法，使之更加科学合理。

Je 中线程与 arena 绑定由函数 choose_arena 完成，被绑定的 arena 记录在该线程的 tls 中，

JEMALLOC_INLINE arena_t *
choose_arena(arena_t *arena)
{
......
// xf: 通常情况下线程所绑定的 arena 记录在 arenas_tls 中
if ((ret = *arenas_tsd_get()) == NULL) {
// xf: 如果当前 thread 未绑定 arena, 则为其指定一个, 并保存到 tls
ret = choose_arena_hard();
}
return (ret);
}

JEMALLOC_INLINE arena_t *

choose_arena(arena_t *arena)

{

......

// xf: 通常情况下线程所绑定的 arena 记录在 arenas_tls 中

if ((ret = *arenas_tsd_get()) == NULL) {

// xf: 如果当前 thread 未绑定 arena, 则为其指定一个, 并保存到 tls

ret = choose_arena_hard();

}

return (ret);

}

初次搜索 arenas_tsd_get 可能找不到该函数在何处被定义。实际上，Je 使用了一组宏来生成一个函数族，以达到类似函数模板的目的。tsd 相关的函数族被定义在 tsd.h 中。

1. malloc_tsd_protos – 定义了函数声明，包括初始化函数 boot、get/set 函数；
2. malloc_tsd_externs – 定义变量声明，包括 tls，初始化标志等等；
3. malloc_tsd_data – tls 变量定义；
4. malloc_tsd_funcs – 定义了 1 中声明函数的实现。

与 arena tsd 相关的函数和变量声明如下，

malloc_tsd_protos(JEMALLOC_ATTR(unused), arenas, arena_t *)
malloc_tsd_externs(arenas, arena_t *)
malloc_tsd_data(, arenas, arena_t *, NULL)
malloc_tsd_funcs(JEMALLOC_ALWAYS_INLINE, arenas, arena_t *, NULL, arenas_cleanup)

malloc_tsd_protos(JEMALLOC_ATTR(unused), arenas, arena_t *)

malloc_tsd_externs(arenas, arena_t *)

malloc_tsd_data(, arenas, arena_t *, NULL)

malloc_tsd_funcs(JEMALLOC_ALWAYS_INLINE, arenas, arena_t *, NULL, arenas_cleanup)

当线程还未与任何 arena 绑定时，会进一步通过 choose_arena_hard 寻找一个合适的 arena进行绑定。Je 会遍历 arenas 数组，并按照优先级由高到低的顺序挑选，

1. 如果找到当前线程绑定数为 0 的 arena，则优先使用它；
2. 如果当前已初始化 arena 中没有线程绑定数为 0 的，则优先使用剩余空的数组位置构造一个新的 arena。需要说明的是，arenas 数组遵循 lazy create 原则，初始状态整个数组只有 0 号元素是被初始化的，其他的 slot 位置都是 null 指针。因此随着新的线程不断创造出来，arena 数组也被逐渐填满。
3. 如果 1、2 两条都不满足，则选择当前绑定数最小的，且 slot 位置更靠前的一个 arena。

arena_t * choose_arena_hard(void)
{
......
if (narenas_auto > 1) {
......
first_null = narenas_auto;
// xf: 循环遍历所有 arenas, 找到绑定 thread 数量最小的 arena, 并记录
// first_null 索引值
for (i = 1; i < narenas_auto; i++) {
if (arenas[i] != NULL) {
if (arenas[i]->nthreads < arenas[choose]->nthreads)
choose = i;
} else if (first_null == narenas_auto) {
first_null = i;
}
}
// xf: 若选定的 arena 绑定 thread 为 0, 或者当前 arena 数组中已满, 则返回
// 被选中的 arena
if (arenas[choose]->nthreads == 0 || first_null == narenas_auto) {
ret = arenas[choose];
} else {
// xf: 否则构造并初始化一个新的 arena
ret = arenas_extend(first_null);
}
......
} else {
// xf: 若不存在多于一个 arena（单核 CPU 或人为强制设定），则返回唯一的
// 0 号 arena
ret = arenas[0];
......
}
// xf: 将已绑定的 arena 设置到 tsd 中
arenas_tsd_set(&ret);
return (ret);
}

arena_t * choose_arena_hard(void)

{

......

if (narenas_auto > 1) {

......

first_null = narenas_auto;

// xf: 循环遍历所有 arenas, 找到绑定 thread 数量最小的 arena, 并记录

// first_null 索引值

for (i = 1; i < narenas_auto; i++) {

if (arenas[i] != NULL) {

if (arenas[i]->nthreads < arenas[choose]->nthreads)

choose = i;

} else if (first_null == narenas_auto) {

first_null = i;

}

// xf: 若选定的 arena 绑定 thread 为 0, 或者当前 arena 数组中已满, 则返回

// 被选中的 arena

if (arenas[choose]->nthreads == 0 || first_null == narenas_auto) {

ret = arenas[choose];

} else {

// xf: 否则构造并初始化一个新的 arena

ret = arenas_extend(first_null);

}

......

} else {

// xf: 若不存在多于一个 arena（单核 CPU 或人为强制设定），则返回唯一的

// 0 号 arena

ret = arenas[0];

......

}

// xf: 将已绑定的 arena 设置到 tsd 中

arenas_tsd_set(&ret);

return (ret);

}

对比早期的绑定方式，Je 的算法显然更加公平，尽可能的让各个 CPU 核心平分当前线程，平衡负载。

——[ 2.2.4 – Arena 结构

struct arena_s {
unsigned              ind;
unsigned              nthreads;
malloc_mutex_t        lock;
arena_stats_t         stats;
ql_head(tcache_t)     tcache_ql;
uint64_t              prof_accumbytes;
dss_prec_t            dss_prec;
arena_chunk_tree_t    chunks_dirty;
arena_chunk_t         *spare;
size_t                nactive;
size_t                ndirty;
size_t                npurgatory;
arena_avail_tree_t    runs_avail;
chunk_alloc_t         *chunk_alloc;
chunk_dalloc_t        *chunk_dalloc;
arena_bin_t           bins[NBINS];
};

struct arena_s {

unsigned ind;

unsigned nthreads;

malloc_mutex_t lock;

arena_stats_t stats;

ql_head(tcache_t) tcache_ql;

uint64_t prof_accumbytes;

dss_prec_t dss_prec;

arena_chunk_tree_t chunks_dirty;

arena_chunk_t *spare;

size_t nactive;

size_t ndirty;

size_t npurgatory;

arena_avail_tree_t runs_avail;

chunk_alloc_t *chunk_alloc;

chunk_dalloc_t *chunk_dalloc;

arena_bin_t bins[NBINS];

};

ind：在 arenas 数组中的索引值。

nthreads：当前绑定的线程数。

lock：局部 arena lock，取代传统分配器的 global lock。
一般地，如下操作需要 arena lock 同步，
1. 线程绑定，需要修改 nthreads
2. new chunk alloc
3. new run alloc

stats：全局统计, 需要打开统计功能。

tcache_ql：ring queue，注册所有绑定线程的 tcache，作为统计用途，需要打开统计功能。

dss_prec：代表当前 chunk alloc 时对系统内存的使用策略，分为几种情况，

typedef enum {
dss_prec_disabled  = 0,
dss_prec_primary   = 1,
dss_prec_secondary = 2,
dss_prec_limit     = 3
} dss_prec_t;

typedef enum {

dss_prec_disabled = 0,

dss_prec_primary = 1,

dss_prec_secondary = 2,

dss_prec_limit = 3

} dss_prec_t;

第一个代表禁止使用系统 DSS，后两种代表是否优先选用 DSS。如果使用 primary，则本着先 dss->mmap 的顺序，否则按照先 mmap->dss。默认使用 dss_prec_secondary。

chunks_dirty：rb tree，代表所有包含 dirty page 的 chunk 集合。后面在 chunk 中会详细介绍。

spare：是一个缓存变量，记录最近一次被释放的 chunk。当 arena 收到一个新的 chunk alloc 请求时，会优先从 spare 中开始查找，由此提高频繁分配释放时，可能导致内部 chunk 利用率下降的情况。

runs_avail：rb tree，记录所有未被分配的 runs，用来在分配 new run 时寻找合适的 available run。一般作为 alloc run 时的仓库。

chunk_alloc/chunk_dalloc：用户可定制的 chunk 分配/释放函数，Je 提供了默认的版本，chunk_alloc_default/chunk_dalloc_default

bins：bins 数组，记录不同 class size 可用 free regions 的分配信息，后面会详细介绍。

—-[ 2.3 – Chunk (arena_chunk_t)

chunk 是仅次于 arena 的次级内存结构。如果有了解过 Dlmalloc，就会知道在 Dl 中同样定义了名为 chunk 的基础结构。但这个概念在两个分配器中含义完全不同，Dl 中的 chunk 指代最低级分配单元，而 Je 中则是一个较大的内存区域。

——[ 2.3.1 – overview

从前面 arena 的数据结构可以发现，它是一个非常抽象的概念，其大小也不代表实际的内存分配量。原始的内存数据既非挂载在 arena 外部，也并没有通过内部指针引用，而是记录在 chunk 中。按照一般的思路，chunk 包含原始内存数据，又从属于 arena，因此后者应该会有一个数组之类的结构以记录所有 chunk 信息。但事实上同样找不到这样的记录。那 Je 又如何获得 chunk 指针呢？

所谓的 chunk 结构，只是整个 chunk 的一个 header，bookkeeping 以及 user memory 都挂在 header 外面。另外 Je 对 chunk 又做了规定，默认每个 chunk 大小为 4MB，同时还必须对齐到 4MB 的边界上。

#define    LG_CHUNK_DEFAULT    22

1	#define LG_CHUNK_DEFAULT 22

这个宏定义了 chunk 的大小。注意到前缀 LG_，代表 log 即指数部分。Je 中所有该前缀的代码都是这个含义，便于通过 bit 操作进行快速的运算。

有了上述规定，获得 chunk 就变得几乎没有代价。因为返回给 user 程序的内存地址肯定属于某个 chunk，而该 chunk header 对齐到 4M 边界上，且不可能超过 4M 大小，所以只需要对该地址做一个下对齐就得到 chunk 指针，如下，

#define    CHUNK_ADDR2BASE(a)                        \
((void *)((uintptr_t)(a) & ~chunksize_mask))

1 2	#define CHUNK_ADDR2BASE(a) \ ((void *)((uintptr_t)(a) & ~chunksize_mask))

计算相对于 chunk header 的偏移量，

#define    CHUNK_ADDR2OFFSET(a)                        \
((size_t)((uintptr_t)(a) & chunksize_mask))

1 2	#define CHUNK_ADDR2OFFSET(a) \ ((size_t)((uintptr_t)(a) & chunksize_mask))

以及上对齐到 chunk 边界的计算，

#define    CHUNK_CEILING(s)                        \
(((s) + chunksize_mask) & ~chunksize_mask)

1 2	#define CHUNK_CEILING(s) \ (((s) + chunksize_mask) & ~chunksize_mask)

用图来表示如下，

chunk_ptr(4M aligned)                memory for user
|                                   |
v                                   v
+--------------+--------------------------------------------
| chunk header |        ... ...     | region |   ... ...
+--------------+--------------------------------------------
|<------------- offset ------------>|

chunk_ptr(4M aligned) memory for user

| |

v v

+--------------+--------------------------------------------

| chunk header | ... ... | region | ... ...

+--------------+--------------------------------------------

|<------------- offset ------------>|

——[ 2.3.2 – Chunk 结构

struct arena_chunk_s {
arena_t                  *arena;
rb_node(arena_chunk_t)   dirty_link;
size_t                   ndirty;
size_t                   nruns_avail;
size_t                   nruns_adjac;
arena_chunk_map_t        map[1];
}

struct arena_chunk_s {

arena_t *arena;

rb_node(arena_chunk_t) dirty_link;

size_t ndirty;

size_t nruns_avail;

size_t nruns_adjac;

arena_chunk_map_t map[1];

}

arena：chunk 属于哪个 arena

dirty_link：用于 rb tree 的链接节点。如果某个 chunk 内部含有任何 dirty page，就会被挂载到 arena 中的 chunks_dirty tree 上。

ndirty：内部 dirty page 数量。

nruns_avail：内部 available runs 数量。

nruns_adjac：available runs 又分为 dirty 和 clean 两种，相邻的两种 run 是无法合并的，除非其中的 dirty runs 通过 purge 才可以。该数值记录的就是可以通过 purge 合并的 run 数量。

map：动态数组，每一项对应 chunk 中的一个 page 状态（不包含 header 即 map 本身的占用）。chunk（包括内部的 run）都是由 page 组成的。page 又分为 unallocated、small、
large 三种。unallocated 指的那些还未建立 run 的 page。small/large 分别指代该 page 所属 run 的类型是 small/large run。这些 page 的分配状态、属性、偏移量，及其他的标记信息等等，都记录在 arena_chunk_map_t 中。

|<--------- map_bias ----------->|
| page | page |  ... ...  | page |
+-----------------------------------------------------------------------+
| chunk_header |    chunk map    | page #0  | page #1  | ... | page #n  |
|    ... ...   | [0] [1] ... [n] |          |          |     |          |
+-----------------|---|-------|-----------------------------------------+
|   |       |       ^          ^                 ^
+---|-------|-------+          |                 |
+-------|------------------+                 |
+ -----------------------------------+

|<--------- map_bias ----------->|

+-----------------------------------------------------------------------+

| ... ... | [0] [1] ... [n] | | | | |

+-----------------|---|-------|-----------------------------------------+

| | | ^ ^ ^

+---|-------|-------+ | |

+-------|------------------+ |

+ -----------------------------------+

至于由 chunk header 和 chunk map 占用的 page 数量，保存在 map_bias 变量中。该变量是 Je 在 arena boot 时通过迭代算法预先计算好的，所有 chunk 都是相同的。迭代方法如下，

1. 第一次迭代初始 map_bias 等于 0，计算最大可能大小，即 header_size + chunk_npages * map_size，获得 header+map 需要的 page 数量，结果肯定高于最终的值。
2. 第二次将之前计算的 map_bias 迭代回去，将最大 page 数减去 map_bias 数，重新计算 header+map 大小，由于第一次迭代 map_bias 过大，第二次迭代必定小于最终结果。
3. 第三次再将 map_bias 迭代回去，得到最终大于第二次且小于第一次的计算结果。

——[ 2.3.3 – chunk map (arena_chunk_map_t)

chunk 记录 page 状态的结构为 arena_chunk_map_t，为了节省空间，使用了 bit 压缩存储信息。

struct arena_chunk_map_s {
#ifndef JEMALLOC_PROF
union {
#endif
union {
rb_node(arena_chunk_map_t)    rb_link;
ql_elm(arena_chunk_map_t)     ql_link;
}                      u;
prof_ctx_t            *prof_ctx;
#ifndef JEMALLOC_PROF
};
#endif
size_t                 bits;
}

struct arena_chunk_map_s {

#ifndef JEMALLOC_PROF

union {

#endif

union {

rb_node(arena_chunk_map_t) rb_link;

ql_elm(arena_chunk_map_t) ql_link;

} u;

prof_ctx_t *prof_ctx;

#ifndef JEMALLOC_PROF

};

#endif

size_t bits;

}

chunk map 内部包含两个 link node，分别可以挂载到 rb tree 或环形队列上，同时为了节省空间又使用了 union。由于 run 本身也是由连续 page 组成的，因此 chunk map 除了记录 page 状态之外，还负责 run 的基址检索。

举例来说，Je 会把所有已分配 run 记录在内部 rb tree 上以快速检索，实际的操作是将该 run 中第一个 page 对应的 chunk_map 作为 rb node 挂载到 tree 上，检索时也是先找出将相应的 chunk map，再进行地址转换得到 run 的基址。

按照通常的设计思路，我们可能会把 run 指针作为节点直接保存到 rb tree 中，但 Je 中的设计明显要更复杂。究其原因，如果把 link node 放到 run 中，后果是 bookkeeping 和 user memory 将混淆在一起，这对于分配器的安全性是很不利的。包括 Dl 在内的传统分配器都具有这样的缺陷。而如果单独用 link node 记录 run，又会造成空间浪费。正因为 Je 中无论是 chunk 还是 run 都是连续 page 组成，所以用首个 page 对应的 chunk map 就能同时代表该 run 的基址。

Je 中通常用 mapelm 换算出 pageind，再将 pageind << LG_PAGE + chunk_base，就能得到 run 指针，代码如下，

arena_chunk_t *run_chunk = CHUNK_ADDR2BASE(mapelm);
size_t pageind = arena_mapelm_to_pageind(mapelm);
run = (arena_run_t *)((uintptr_t)run_chunk + (pageind << LG_PAGE));
JEMALLOC_INLINE_C size_t
arena_mapelm_to_pageind(arena_chunk_map_t *mapelm)
{
uintptr_t map_offset = CHUNK_ADDR2OFFSET(mapelm) - offsetof(arena_chunk_t, map);
return ((map_offset / sizeof(arena_chunk_map_t)) + map_bias);
}

arena_chunk_t *run_chunk = CHUNK_ADDR2BASE(mapelm);

size_t pageind = arena_mapelm_to_pageind(mapelm);

run = (arena_run_t *)((uintptr_t)run_chunk + (pageind << LG_PAGE));

JEMALLOC_INLINE_C size_t

arena_mapelm_to_pageind(arena_chunk_map_t *mapelm)

{

uintptr_t map_offset = CHUNK_ADDR2OFFSET(mapelm) - offsetof(arena_chunk_t, map);

return ((map_offset / sizeof(arena_chunk_map_t)) + map_bias);

}

chunk map 对 page 状态描述都压缩记录到 bits 中，由于内容较多，直接引用 Je 代码中的注释，

下面是一个假想的 ILP32 系统下的 bits layout，

???????? ???????? ????nnnn nnnndula

1	???????? ???????? ????nnnn nnnndula

“?”的部分分三种情况，分别对应 unallocated、small 和 large。
? : Unallocated：首尾 page 写入该 run 的地址，而内部 page 则不做要求。
Small：全部是 page 的偏移量。
Large：首 page 是 run size，后续的 page 不做要求。
n : 对于 small run 指其所在 bin 的 index，对 large run 写入 BININD_INVALID。
d : dirty？
u : unzeroed？
l : large？
a : allocated？

下面是对三种类型的 run page 做的举例，

p : run page offset
s : run size
n : binind for size class; large objects set these to BININD_INVALID
x : don’t care
– : 0
+ : 1
[DULA] : bit set
[dula] : bit unset

Unallocated (clean):

ssssssss ssssssss ssss++++ ++++du-a
xxxxxxxx xxxxxxxx xxxxxxxx xxxx-Uxx
ssssssss ssssssss ssss++++ ++++dU-a

ssssssss ssssssss ssss++++ ++++du-a

xxxxxxxx xxxxxxxx xxxxxxxx xxxx-Uxx

ssssssss ssssssss ssss++++ ++++dU-a

Unallocated (dirty):

ssssssss ssssssss ssss++++ ++++D--a
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
ssssssss ssssssss ssss++++ ++++D--a

ssssssss ssssssss ssss++++ ++++D--a

xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx

ssssssss ssssssss ssss++++ ++++D--a

Small:

pppppppp pppppppp ppppnnnn nnnnd--A
pppppppp pppppppp ppppnnnn nnnn---A
pppppppp pppppppp ppppnnnn nnnnd--A

pppppppp pppppppp ppppnnnn nnnnd--A

pppppppp pppppppp ppppnnnn nnnn---A

pppppppp pppppppp ppppnnnn nnnnd--A

Small page 需要注意的是，这里代表的 p 并非是一个固定值，而是该 page 相对于其所在 run 的第一个 page 的偏移量，比如可能是这样，

00000000 00000000 0000nnnn nnnnd--A
00000000 00000000 0001nnnn nnnn---A
00000000 00000000 0010nnnn nnnn---A
00000000 00000000 0011nnnn nnnn---A
...
00000000 00000001 1010nnnn nnnnd--A
Large:
ssssssss ssssssss ssss++++ ++++D-LA
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
-------- -------- ----++++ ++++D-LA
Large (sampled, size <= PAGE):
ssssssss ssssssss ssssnnnn nnnnD-LA
Large (not sampled, size == PAGE):
ssssssss ssssssss ssss++++ ++++D-LA

00000000 00000000 0000nnnn nnnnd--A

00000000 00000000 0001nnnn nnnn---A

00000000 00000000 0010nnnn nnnn---A

00000000 00000000 0011nnnn nnnn---A

...

00000000 00000001 1010nnnn nnnnd--A

Large:

ssssssss ssssssss ssss++++ ++++D-LA

xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx

-------- -------- ----++++ ++++D-LA

Large (sampled, size <= PAGE):

ssssssss ssssssss ssssnnnn nnnnD-LA

Large (not sampled, size == PAGE):

ssssssss ssssssss ssss++++ ++++D-LA

为了提取/设置 map bits 内部的信息，Je 提供了一组函数，这里列举两个最基本的，剩下的都是读取 mapbits 后做一些位运算而已，

读取 mapbits，

JEMALLOC_ALWAYS_INLINE size_t
arena_mapbits_get(arena_chunk_t *chunk, size_t pageind)
{
return (arena_mapbitsp_read(arena_mapbitsp_get(chunk, pageind)));
}

JEMALLOC_ALWAYS_INLINE size_t

arena_mapbits_get(arena_chunk_t *chunk, size_t pageind)

{

return (arena_mapbitsp_read(arena_mapbitsp_get(chunk, pageind)));

}

根据 pageind 获取对应的 chunk map，

JEMALLOC_ALWAYS_INLINE arena_chunk_map_t *
arena_mapp_get(arena_chunk_t *chunk, size_t pageind)
{
......
return (&chunk->map[pageind-map_bias]);
}

JEMALLOC_ALWAYS_INLINE arena_chunk_map_t *

arena_mapp_get(arena_chunk_t *chunk, size_t pageind)

{

......

return (&chunk->map[pageind-map_bias]);

}

—-[ 2.4 – Run (arena_run_t)

如同在 2.1 节所述，在 Je 中 run 才是真正负责分配的主体（前提是对 small region 来说）。run 的大小对齐到 page size 上，并且在内部划分成大小相同的 region。当有外部分配请求时，run 就会从内部挑选一个 free region 返回。可以认为 run 就是 small region 仓库。

——[ 2.4.1 – Run 结构

struct arena_run_s {
arena_bin_t    *bin;
uint32_t        nextind;
unsigned        nfree;
};

struct arena_run_s {

arena_bin_t *bin;

uint32_t nextind;

unsigned nfree;

};

run 的结构非常简单，但同 chunk 类似，所谓的 arena_run_t 不过是整个 run 的 header 部分。

bin：与该 run 相关联的 bin。每个 run 都有其所属的 bin，详细内容在之后介绍。
nextind：记录下一个 clean region 的索引。
nfree：记录当前空闲 region 数量。

除了 header 部分之外，run 的真实 layout 如下，

/--------------------\
| arena_run_t header |
| ...                |
bitmap_offset | bitmap             |
| ...                |
|--------------------|
| redzone            |
reg0_offset | region 0           |
| redzone            |
|--------------------| \
| redzone            | |
| region 1           |  > reg_interval
| redzone            | /
|--------------------|
| ...                |
| ...                |
| ...                |
|--------------------|
| redzone            |
| region nregs-1     |
| redzone            |
|--------------------|
| alignment pad?     |
\--------------------/

/--------------------\

| arena_run_t header |

| ... |

bitmap_offset | bitmap |

| ... |

|--------------------|

| redzone |

reg0_offset | region 0 |

| redzone |

|--------------------| \

| redzone | |

| region 1 | > reg_interval

| redzone | /

|--------------------|

| ... |

|--------------------|

| redzone |

| region nregs-1 |

| redzone |

|--------------------|

| alignment pad? |

\--------------------/

正如 chunk 通过 chunk map 记录内部所有 page 状态一样，run 通过在 header 后挂载 bitmap 来记录其内部的 region 状态。bitmap 之后是 regions 区域。内部 region 大小相等，且在前后都有 redzone 保护（需要在设置里打开 redzone 选项）。

简单来说，run 就是通过查询 bitmap 来找到可用的 region。而传统分配器由于使用 boundary tag，空闲 region 一般是被双向链表管理的。相比之下，传统方式查找速度更快，也更简单。缺点之前也提到过，安全和稳定性都存在缺陷。从这一点可以看到，Je 在设计思路上将 bookkeeping 和 user memory 分离是贯穿始终的原则，更甚于对性能的影响（事实上这点影响在并发条件下被大大赚回来了）。

——[ 2.4.2 – size classes

内存分配器对内部管理的 region 往往按照某种特殊规律来分配。比如 Dl 将内存划分成 small 和 large 两种类型。small 类型从 8 字节开始每 8 个字节为一个分割直至 256 字节。而 large 类型则从 256 字节开始，挂载到 dst 上。这种划分方式有助于分配器对内存进行有效的管理和控制，让已分配的内存更加紧实（tightly packed），以降低外部碎片率。

Je 进一步优化了分配效率。采用了类似于“二分伙伴（Binary Buddy）算法”的分配方式。在 Je 中将不同大小的类型称为 size class。

在 Je 源码的 size_classes.h 文件中，定义了不同体系架构下的 region size。该文件实际是通过名为 size_classes.sh 的 shell script 自动生成的。script 按照四种不同量纲定义来区分各个体系平台的区别，然后将它们做排列组合，就可以兼容各个体系。这四种量纲分别是，

LG_SIZEOF_PTR：代表指针长度，ILP32 下是 2，LP64 则是 3。

LG_QUANTUM：量子，binary buddy 分得的最小单位。除了 tiny size，其他的 size classes 都是 quantum 的整数倍大小。

LG_TINY_MIN：是比 quantum 更小的 size class，且必须对齐到 2 的指数倍上。它是 Je 可分配的最小的 size class。

LG_PAGE：就是 page 大小。

根据 binary buddy 算法，Je 会将内存不断的二平分，每一份称作一个 group。同一个 group 内又做四等分。例如，一个典型的 ILP32，tiny 等于 8byte，quantum 为 16byte，page 为 4096byte 的系统，其 size classes 划分是这样的，

#if (LG_SIZEOF_PTR == 2 && LG_TINY_MIN == 3 && LG_QUANTUM == 4 && LG_PAGE == 12)
#define    SIZE_CLASSES \
index, lg_grp, lg_delta, ndelta,  bin, lg_delta_lookup  \
SC(  0,      3,        3,      0,   yes,        3) \
\
SC(  1,      3,        3,      1,   yes,        3) \
SC(  2,      4,        4,      1,   yes,        4) \
SC(  3,      4,        4,      2,   yes,        4) \
SC(  4,      4,        4,      3,   yes,        4) \
\
SC(  5,      6,        4,      1,   yes,        4) \
SC(  6,      6,        4,      2,   yes,        4) \
SC(  7,      6,        4,      3,   yes,        4) \
SC(  8,      6,        4,      4,   yes,        4) \
\
SC(  9,      7,        5,      1,   yes,        5) \
SC( 10,      7,        5,      2,   yes,        5) \
SC( 11,      7,        5,      3,   yes,        5) \
SC( 12,      7,        5,      4,   yes,        5) \
... ...

#if (LG_SIZEOF_PTR == 2 && LG_TINY_MIN == 3 && LG_QUANTUM == 4 && LG_PAGE == 12)

#define SIZE_CLASSES \

index, lg_grp, lg_delta, ndelta, bin, lg_delta_lookup \

SC( 0, 3, 3, 0, yes, 3) \

SC( 1, 3, 3, 1, yes, 3) \

SC( 2, 4, 4, 1, yes, 4) \

SC( 3, 4, 4, 2, yes, 4) \

SC( 4, 4, 4, 3, yes, 4) \

SC( 5, 6, 4, 1, yes, 4) \

SC( 6, 6, 4, 2, yes, 4) \

SC( 7, 6, 4, 3, yes, 4) \

SC( 8, 6, 4, 4, yes, 4) \

SC( 9, 7, 5, 1, yes, 5) \

SC( 10, 7, 5, 2, yes, 5) \

SC( 11, 7, 5, 3, yes, 5) \

SC( 12, 7, 5, 4, yes, 5) \

... ...

宏 SIZE_CLASSES 主要功能就是可以生成几种类型的 table，而 SC 则根据不同的情况被定义成不同的含义。SC 传入的 6 个参数的含义如下，

index：      在 table 中的位置
lg_grp：     所在 group 的指数
lg_delta： group 内偏移量指数
ndelta：     group 内偏移数
bin：        是否由 bin 记录。small region 是记录在 bins 中
lg_delta_lookup：    在 lookup table 中的调用 S2B_# 的尾数后缀

因此得到 reg_size 的计算公式，reg_size = 1 << lg_grp + ndelta << lg_delta。根据该公式可以得到 region 的范围，

┌─────────┬─────────┬───────────────────────────────────────┐
│Category │ Spacing │ Size                                  │
├─────────┼─────────┼───────────────────────────────────────┤
│         │      lg │ [8]                                   │
│         ├─────────┼───────────────────────────────────────┤
│         │      16 │ [16, 32, 48, ..., 128]                │
│         ├─────────┼───────────────────────────────────────┤
│         │      32 │ [160, 192, 224, 256]                  │
│         ├─────────┼───────────────────────────────────────┤
│Small    │      64 │ [320, 384, 448, 512]                  │
│         ├─────────┼───────────────────────────────────────┤
│         │     128 │ [640, 768, 896, 1024]                 │
│         ├─────────┼───────────────────────────────────────┤
│         │     256 │ [1280, 1536, 1792, 2048]              │
│         ├─────────┼───────────────────────────────────────┤
│         │     512 │ [2560, 3072, 3584]                    │
├─────────┼─────────┼───────────────────────────────────────┤
│Large    │   4 KiB │ [4 KiB, 8 KiB, 12 KiB, ..., 4072 KiB] │
├─────────┼─────────┼───────────────────────────────────────┤
│Huge     │   4 MiB │ [4 MiB, 8 MiB, 12 MiB, ...]           │
└─────────┴─────────┴───────────────────────────────────────┘

┌─────────┬─────────┬───────────────────────────────────────┐

│Category │ Spacing │ Size │

├─────────┼─────────┼───────────────────────────────────────┤

│ │ lg │ [8] │

│ ├─────────┼───────────────────────────────────────┤

│ │ 16 │ [16, 32, 48, ..., 128] │

│ ├─────────┼───────────────────────────────────────┤

│ │ 32 │ [160, 192, 224, 256] │

│ ├─────────┼───────────────────────────────────────┤

│Small │ 64 │ [320, 384, 448, 512] │

│ ├─────────┼───────────────────────────────────────┤

│ │ 128 │ [640, 768, 896, 1024] │

│ ├─────────┼───────────────────────────────────────┤

│ │ 256 │ [1280, 1536, 1792, 2048] │

│ ├─────────┼───────────────────────────────────────┤

│ │ 512 │ [2560, 3072, 3584] │

├─────────┼─────────┼───────────────────────────────────────┤

│Large │ 4 KiB │ [4 KiB, 8 KiB, 12 KiB, ..., 4072 KiB] │

├─────────┼─────────┼───────────────────────────────────────┤

│Huge │ 4 MiB │ [4 MiB, 8 MiB, 12 MiB, ...] │

└─────────┴─────────┴───────────────────────────────────────┘

除此之外，在 size_classes.h 中还定义了一些常量,

tiny bins 的数量

#define    NTBINS            1

1	#define NTBINS 1

可以通过 lookup table 查询的 bins 数量

#define    NLBINS            29

1	#define NLBINS 29

small bins 的数量

#define    NBINS            28

1	#define NBINS 28

最大 tiny size class 的指数

#define    LG_TINY_MAXCLASS    3

1	#define LG_TINY_MAXCLASS 3

最大 lookup size class，也就是 NLBINS – 1 个 bins

#define    LOOKUP_MAXCLASS        ((((size_t)1) << 11) + (((size_t)4) << 9))

1	#define LOOKUP_MAXCLASS ((((size_t)1) << 11) + (((size_t)4) << 9))

最大 small size class，也就是 NBINS – 1 个 bins

#define    SMALL_MAXCLASS        ((((size_t)1) << 11) + (((size_t)3) << 9))

1	#define SMALL_MAXCLASS ((((size_t)1) << 11) + (((size_t)3) << 9))

——[ 2.4.3 – size2bin/bin2size

通过 SIZE_CLASSES 建立的 table 就是为了在 O(1) 的时间复杂度内快速进行 size2bin 或者 bin2size 切换。同样的技术在 Dl 中有所体现，来看 Je 是如何实现的。

size2bin 切换提供了两种方式，较快的是通过查询 lookup table，较慢的是计算得到。从原理上，只有 small size class 需要查找 bins，但可通过 lookup 查询的 size class 数量要小于整个 small size class 数量。因此，部分 size class 只能计算得到。在原始 Je 中统一只采用查表法，但在 Android 版本中可能是考虑减小 lookup table size 而增加了直接计算法。

JEMALLOC_ALWAYS_INLINE size_t
small_size2bin(size_t size)
{
......
if (size <= LOOKUP_MAXCLASS)
return (small_size2bin_lookup(size));
else
return (small_size2bin_compute(size));
}

JEMALLOC_ALWAYS_INLINE size_t

small_size2bin(size_t size)

{

......

if (size <= LOOKUP_MAXCLASS)

return (small_size2bin_lookup(size));

else

return (small_size2bin_compute(size));

}

小于 LOOKUP_MAXCLASS 的请求通过 small_size2bin_lookup 直接查表。查询的算法是这样的，

size_t ret = ((size_t)(small_size2bin_tab[(size-1) >> LG_TINY_MIN]));

1	size_t ret = ((size_t)(small_size2bin_tab[(size-1) >> LG_TINY_MIN]));

也就是说，Je 通过一个 f(x) = (x – 1) / 2^LG_TINY_MIN 的变换将 size 映射到 lookup table 的相应区域。这个 table 在 gdb 中可能是这样的，

(gdb) p  /d small_size2bin
$6 = {0, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10,
11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14,
14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
16, 16, 17 <repeats 16 times>, 18 <repeats 16 times>, 19 <repeats 16 times>,
20 <repeats 16 times>, 21 <repeats 32 times>, 22 <repeats 32 times>,
23 <repeats 32 times>, 24 <repeats 32 times>, 25 <repeats 64 times>,
26 <repeats 64 times>, 27 <repeats 64 times>}

(gdb) p /d small_size2bin

$6 = {0, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10,

11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14,

14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,

16, 16, 17 <repeats 16 times>, 18 <repeats 16 times>, 19 <repeats 16 times>,

20 <repeats 16 times>, 21 <repeats 32 times>, 22 <repeats 32 times>,

23 <repeats 32 times>, 24 <repeats 32 times>, 25 <repeats 64 times>,

26 <repeats 64 times>, 27 <repeats 64 times>}

该数组的含义与 binary buddy 算法是一致的。对应的 bin index 越高，其在数组中占用的字节数就越多。除了 0 号 bin 之外，相邻的 4 个 bin 属于同一 group，两个 group 之间相差二倍，因此在数组中占用的字节数也就相差 2 倍。所以，上面数组的 group 划分如下，

{0}, {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}, ...

1	{0}, {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}, ...

以 bin#9 为例，其所管辖的范围 (128, 160]，由于其位于更高一级 group，因此相比 bin#8 在 lookup table 中多一倍的字节数，假设我们需要查询 132，经过映射，

(132 - 1) >> 3 = 16

1	(132 - 1) >> 3 = 16

这样可以快速得到其所在的 bin #9。如图，

bin #1     bin #3          132 is HERE!
|          |                |
v          v                v
+----------------------------------------------------------------
| 0 | 1 | 2 2 | 3 3 | ... | 8 8 | 9 9 9 9 | ... | 16 ... 16 | ...
+----------------------------------------------------------------
^        ^                 ^       ^                ^
|        |                 |       |                |
bin #0    bin #2            bin #8  bin #9          bin #16

bin #1 bin #3 132 is HERE!

| | |

v v v

+----------------------------------------------------------------

| 0 | 1 | 2 2 | 3 3 | ... | 8 8 | 9 9 9 9 | ... | 16 ... 16 | ...

+----------------------------------------------------------------

^ ^ ^ ^ ^

| | | | |

bin #0 bin #2 bin #8 bin #9 bin #16

Je 巧妙的通过前面介绍 CLASS_SIZE 宏生成了这个 lookup table，代码如下，

JEMALLOC_ALIGNED(CACHELINE)
const uint8_t    small_size2bin_tab[] = {
#define    S2B_3(i)    i,
#define    S2B_4(i)    S2B_3(i) S2B_3(i)
#define    S2B_5(i)    S2B_4(i) S2B_4(i)
#define    S2B_6(i)    S2B_5(i) S2B_5(i)
#define    S2B_7(i)    S2B_6(i) S2B_6(i)
#define    S2B_8(i)    S2B_7(i) S2B_7(i)
#define    S2B_9(i)    S2B_8(i) S2B_8(i)
#define    S2B_no(i)
#define    SC(index, lg_grp, lg_delta, ndelta, bin, lg_delta_lookup) \
S2B_##lg_delta_lookup(index)
SIZE_CLASSES
#undef S2B_3
#undef S2B_4
#undef S2B_5
#undef S2B_6
#undef S2B_7
#undef S2B_8
#undef S2B_9
#undef S2B_no
#undef SC
};

JEMALLOC_ALIGNED(CACHELINE)

const uint8_t small_size2bin_tab[] = {

#define S2B_3(i) i,

#define S2B_4(i) S2B_3(i) S2B_3(i)

#define S2B_5(i) S2B_4(i) S2B_4(i)

#define S2B_6(i) S2B_5(i) S2B_5(i)

#define S2B_7(i) S2B_6(i) S2B_6(i)

#define S2B_8(i) S2B_7(i) S2B_7(i)

#define S2B_9(i) S2B_8(i) S2B_8(i)

#define S2B_no(i)

#define SC(index, lg_grp, lg_delta, ndelta, bin, lg_delta_lookup) \

S2B_##lg_delta_lookup(index)

SIZE_CLASSES

#undef S2B_3

#undef S2B_4

#undef S2B_5

#undef S2B_6

#undef S2B_7

#undef S2B_8

#undef S2B_9

#undef S2B_no

#undef SC

};

这里的 S2B_xx 是一系列宏的嵌套展开，最终对应的就是不同 group 在 lookup table 中占据的字节数以及 bin 索引。相信看懂了前面的介绍就不难理解。

另一方面，大于 LOOKUP_MAXCLASS 但小于 SMALL_MAXCLASS 的 size class 不能查表获得，需要进行计算。简言之，一个 bin number 是三部分组成的，

bin_number = NTBIN + group_number << LG_SIZE_CLASS_GROUP + mod

1	bin_number = NTBIN + group_number << LG_SIZE_CLASS_GROUP + mod

即 tiny bin 数量加上其所在 group 再加上 group 中的偏移 (0-2)。源码如下，

JEMALLOC_INLINE size_t
small_size2bin_compute(size_t size)
{
......
{
// xf: lg_floor 相当于 ffs
size_t x = lg_floor((size<<1)-1);
// xf: 计算size class所在group number
size_t shift = (x < LG_SIZE_CLASS_GROUP + LG_QUANTUM) ? 0 : x - (LG_SIZE_CLASS_GROUP + LG_QUANTUM);
size_t grp = shift << LG_SIZE_CLASS_GROUP;
size_t lg_delta = (x < LG_SIZE_CLASS_GROUP + LG_QUANTUM + 1) ? LG_QUANTUM : x - LG_SIZE_CLASS_GROUP - 1;
size_t delta_inverse_mask = ZI(-1) << lg_delta;
// xf: 计算剩余 mod 部分
size_t mod = ((((size-1) & delta_inverse_mask) >> lg_delta)) & ((ZU(1) << LG_SIZE_CLASS_GROUP) - 1);
// xf: 组合计算 bin number
size_t bin = NTBINS + grp + mod;
return (bin);
}
}

JEMALLOC_INLINE size_t

small_size2bin_compute(size_t size)

{

......

{

// xf: lg_floor 相当于 ffs

size_t x = lg_floor((size<<1)-1);

// xf: 计算size class所在group number

size_t shift = (x < LG_SIZE_CLASS_GROUP + LG_QUANTUM) ? 0 : x - (LG_SIZE_CLASS_GROUP + LG_QUANTUM);

size_t grp = shift << LG_SIZE_CLASS_GROUP;

size_t lg_delta = (x < LG_SIZE_CLASS_GROUP + LG_QUANTUM + 1) ? LG_QUANTUM : x - LG_SIZE_CLASS_GROUP - 1;

size_t delta_inverse_mask = ZI(-1) << lg_delta;

// xf: 计算剩余 mod 部分

size_t mod = ((((size-1) & delta_inverse_mask) >> lg_delta)) & ((ZU(1) << LG_SIZE_CLASS_GROUP) - 1);

// xf: 组合计算 bin number

size_t bin = NTBINS + grp + mod;

return (bin);

}

其中 LG_SIZE_CLASS_GROUP 是 size_classes.h 中的定值，代表一个 group 中包含的 bin 数量，根据 binary buddy 算法，该值通常情况下是 2。而要找到 size class 所在的 group，与其最高有效位相关。Je 通过类似于 ffs 的函数首先获得 size 的最高有效位 x，

size_t x = lg_floor((size<<1)-1);

1	size_t x = lg_floor((size<<1)-1);

至于 group number，则与 quantum size 有关。因为除了 tiny class，quantum size 位于 group #0 的第一个。因此不难推出，

group_number = 2^x / quantum_size / 2^LG_SIZE_CLASS_GROUP

1	group_number = 2^x / quantum_size / 2^LG_SIZE_CLASS_GROUP

对应代码就是，

size_t shift = (x < LG_SIZE_CLASS_GROUP + LG_QUANTUM) ? 0 : x - (LG_SIZE_CLASS_GROUP + LG_QUANTUM);

1	size_t shift = (x < LG_SIZE_CLASS_GROUP + LG_QUANTUM) ? 0 : x - (LG_SIZE_CLASS_GROUP + LG_QUANTUM);

而对应 group 起始位置就是，

size_t grp = shift << LG_SIZE_CLASS_GROUP;

1	size_t grp = shift << LG_SIZE_CLASS_GROUP;

至于 mod 部分，与之相关的是最高有效位之后的两个 bit。Je 在这里经过了复杂的位变换，

size_t lg_delta = (x < LG_SIZE_CLASS_GROUP + LG_QUANTUM + 1) ? LG_QUANTUM : x - LG_SIZE_CLASS_GROUP - 1;
size_t delta_inverse_mask = ZI(-1) << lg_delta;
size_t mod = ((((size-1) & delta_inverse_mask) >> lg_delta)) & ((ZU(1) << LG_SIZE_CLASS_GROUP) - 1);

size_t lg_delta = (x < LG_SIZE_CLASS_GROUP + LG_QUANTUM + 1) ? LG_QUANTUM : x - LG_SIZE_CLASS_GROUP - 1;

size_t delta_inverse_mask = ZI(-1) << lg_delta;

size_t mod = ((((size-1) & delta_inverse_mask) >> lg_delta)) & ((ZU(1) << LG_SIZE_CLASS_GROUP) - 1);

上面代码直白的翻译，实际上就是要求得如下两个 bit，

1 0000
10 0000
11 0000
group #0              100 0000
-------------------------------------------------
+--+
101 0000 - 1 = 1|00| 1111
110 0000 - 1 = 1|01| 1111
111 0000 - 1 = 1|10| 1111
group #1             1000 0000 - 1 = 1|11| 1111
+--+
--------------------------------------------------
+--+
1010 0000 - 1 = 1|00|1 1111
1100 0000 - 1 = 1|01|1 1111
1110 0000 - 1 = 1|10|1 1111
group #2            10000 0000 - 1 = 1|11|1 1111
+--+
--------------------------------------------------

1 0000

10 0000

11 0000

group #0 100 0000

-------------------------------------------------

+--+

101 0000 - 1 = 1|00| 1111

110 0000 - 1 = 1|01| 1111

111 0000 - 1 = 1|10| 1111

group #1 1000 0000 - 1 = 1|11| 1111

+--+

--------------------------------------------------

+--+

1010 0000 - 1 = 1|00|1 1111

1100 0000 - 1 = 1|01|1 1111

1110 0000 - 1 = 1|10|1 1111

group #2 10000 0000 - 1 = 1|11|1 1111

+--+

--------------------------------------------------

根据这个图示再去看 Je 的代码就不难理解了。mod 的计算结果就是从 0-3 的数值。

而最终的结果是前面三部分的组合即，

size_t bin = NTBINS + grp + mod;

1	size_t bin = NTBINS + grp + mod;

而 bin2size 查询就简单得多。上一节介绍 SIZE_CLASSES 时提到过 small region 的计算公式，只需要根据该公式提前计算出所有 bin 对应的 region size，直接查表即可。
这里不再赘述.

—-[ 2.5 – bins (arena_bin_t)

run 是分配的执行者，而分配的调度者是 bin。这个概念同 Dl 中的 bin 是类似的，但 Je 中 bin 要更复杂一些。直白地说，可以把 bin 看作 non-full run 的仓库，bin 负责记录当前 arena 中某一个 size class 范围内所有 non-full run 的使用情况。当有分配请求时，arena 查找相应 size class 的 bin，找出可用于分配的 run，再由 run 分配 region。当然，因为只有 small region 分配需要 run，所以 bin 也只对应 small size class。

与 bin 相关的数据结构主要有两个，分别是 arena_bin_t 和 arena_bin_info_t。在 2.1.3 中提到 arena_t 内部保存了一个 bin 数组，其中的成员就是 arena_bin_t。

其结构如下，

struct arena_bin_s {
malloc_mutex_t        lock;
arena_run_t            *runcur;
arena_run_tree_t    runs;
malloc_bin_stats_t  stats;
};

struct arena_bin_s {

malloc_mutex_t lock;

arena_run_t *runcur;

arena_run_tree_t runs;

malloc_bin_stats_t stats;

};

lock：该 lock 同 arena 内部的 lock 不同，主要负责保护 current run。而对于 run 本身的分配和释放还是需要依赖 arena lock。通常情况下，获得 bin lock 的前提是获得 arena lock，但反之不成立。

runcur：当前可用于分配的 run，一般情况下指向地址最低的 non-full run，同一时间一个 bin 只有一个 current run 用于分配。

runs：rb tree，记录当前 arena 中该 bin 对应 size class 的所有 non-full runs。因为分配是通过 current run 完成的，所以也相当于 current run 的仓库。

stats：统计信息。

另一个与 bin 相关的结构是 arena_bin_info_t。与前者不同，bin_info 保存的是 arena_bin_t 的静态信息，包括相对应 size class run 的 bitmap offset、region size、region number、bitmap info 等等（此类信息只要 class size 决定，就固定下来）。所有上述信息在 Je 中由全局数组 arena_bin_info 记录。因此与 arena 无关，全局仅保留一份。

arena_bin_info_t 的定义如下，

struct arena_bin_info_s {
size_t        reg_size;
size_t        redzone_size;
size_t        reg_interval;
size_t        run_size;
uint32_t    nregs;
uint32_t    bitmap_offset;
bitmap_info_t    bitmap_info;
uint32_t    reg0_offset;
};

struct arena_bin_info_s {

size_t reg_size;

size_t redzone_size;

size_t reg_interval;

size_t run_size;

uint32_t nregs;

uint32_t bitmap_offset;

bitmap_info_t bitmap_info;

uint32_t reg0_offset;

};

reg_size：与当前 bin 的 size class 相关联的 region size。

reg_interval：reg_size+redzone_size。

run_size：当前 bin 的 size class 相关联的 run size。

nregs：当前 bin 的 size class 相关联的 run 中 region 数量。

bitmap_offset：当前 bin 的 size class 相关联的 run 中 bitmap 偏移。

bitmap_info：记录当前 bin 的 size class 相关联的 run 中 bitmap 信息。

reg0_offset：index 为 0 的 region 在 run 中的偏移量。

以上记录的静态信息中尤为重要的是 bitmap_info 和 bitmap_offset。

其中 bitmap_info_t 定义如下，

struct bitmap_info_s {
size_t nbits;
unsigned nlevels;
bitmap_level_t levels[BITMAP_MAX_LEVELS+1];
};

struct bitmap_info_s {

size_t nbits;

unsigned nlevels;

bitmap_level_t levels[BITMAP_MAX_LEVELS+1];

};

nbits：bitmap 中逻辑 bit 位数量（特指 level#0 的 bit 数）

nlevels：bitmap 的 level 数量

levels：level 偏移量数组，每一项记录该级 level 在 bitmap 中的起始 index

struct bitmap_level_s {
size_t group_offset;
};

struct bitmap_level_s {

size_t group_offset;

};

在 2.3.1 节中介绍 arena_run_t 时曾提到 Je 通过外挂 bitmap 将 bookkeeping 和 user memory 分离。但 bitmap 查询速度要慢于 boundary tag。为了弥补这个缺陷，Je 对此做了改进，通过多级 level 缓冲以替代线性查找。

Je 为 bitmap 增加了多级 level。bottom level 同普通 bitmap 一致，每 1bit 代表一个 region。而高一级 level 中 1bit 代表前一级 level 中一个 byte。譬如说，若我们在当前 run 中存在 128 个 region，则在 ILP32 系统上，需要 128/32 = 4byte 来表示这 128个region。Je 将这 4 个 byte 看作 level #0。为了进一步表示这 4 个字节是否被占用，又额外需要 1byte 以缓存这 4byte 的内容（仅使用了 4bit），此为 level#1。即整个 bitmap 一共有 2 级 level，共 5byte 来描述。

+--------------+              +--------+
+-----------|------------ +|   +----------|-------+|
v           v             ||   v          v       ||
+--------------------------------------------------------------------------
| 1101 0010 | 0000 0000 | ... | 10?? ???? | ???? ???? | 1??? ????    | ...
+--------------------------------------------------------------------------
|<--------- level #0 -------->|<----- level #1 ------>|<- level #2 ->|

+--------------+ +--------+

+-----------|------------ +| +----------|-------+|

v v || v v ||

+--------------------------------------------------------------------------

| 1101 0010 | 0000 0000 | ... | 10?? ???? | ???? ???? | 1??? ???? | ...

+--------------------------------------------------------------------------

|<--------- level #0 -------->|<----- level #1 ------>|<- level #2 ->|

—-[ 2.6 – Thread caches (tcache_t)

TLS/TSD 是另一种针对多线程优化使用的分配技术，Je 中称为 tcache。tcache 解决的是同一 CPU core 下不同线程对 heap 的竞争。通过为每个线程指定专属分配区域，来减小线程间的干扰。但显然这种方法会增大整体内存消耗量。为了减小副作用，Je 将 tcache 设计成一个 bookkeeping 结构，在 tcache 中保存的仅仅是指向外部 region 的指针，region 对象仍然位于各个 run 当中。换句话说，如果一个 region 被 tcache 记录了，那么从 run 的角度看，它就已经被分配了。

tcache 的内容如下，

struct tcache_s {
ql_elm(tcache_t) link;
uint64_t         prof_accumbytes;
arena_t             *arena;
unsigned         ev_cnt;
unsigned         next_gc_bin;
tcache_bin_t     tbins[1];
};

struct tcache_s {

ql_elm(tcache_t) link;

uint64_t prof_accumbytes;

arena_t *arena;

unsigned ev_cnt;

unsigned next_gc_bin;

tcache_bin_t tbins[1];

};

link：链接节点，用于将同一个 arena 下的所有 tcache 链接起来。

prof_accumbytes：memory profile 相关。

arena：该 tcache 所属的 arena 指针。

ev_cnt：是 tcache 内部的一个周期计数器。每当 tcache 执行一次分配或释放时，ev_cnt 会记录一次。直到周期到来，Je 会执行一次 incremental GC。这里的 GC 会清理 tcache 中多余的 region，将它们释放掉。尽管这不意味着系统内存会获得释放，但可以解放更多的 region 交给其他更饥饿的线程以分配。

next_gc_bin：指向下一次 GC 的 binidx。tcache GC 按照一周期清理一个 bin 执行。

tbins：tcache bin 数组。同样外挂在 tcache 后面。

同 arena bin 类似，tcache 同样有 tcache_bin_t 和 tcache_bin_info_t。tcache_bin_t 作用类似于 arena bin，但其结构要比后者更简单。准确的说，tcache bin 并没有分配调度的功能，而仅起到记录作用。其内部通过一个 stack 记录指向外部 arena run 中的 region 指针。而一旦 region 被 cache 到 tbins 内，就不能再被其他任何线程所使用，尽管它可能甚至与其他线程 tcache 中记录的
region 位于同一个 arena run 中。

tcache bin 结构如下，

struct tcache_bin_s {
tcache_bin_stats_t tstats;
int     low_water;
unsigned    lg_fill_div;
unsigned    ncached;
void        **avail;
}

struct tcache_bin_s {

tcache_bin_stats_t tstats;

int low_water;

unsigned lg_fill_div;

unsigned ncached;

void **avail;

}

tstats：tcache bin 内部统计。

low_water：记录两次 GC 间 tcache 内部使用的最低水线。该数值与下一次 GC 时尝试释放的 region 数量有关。释放量相当于 low water 数值的 3/4。

lg_fill_div：用作 tcache refill 时作为除数。当 tcache 耗尽时，会请求 arena run 进行 refill。但 refill 不会一次性灌满 tcache，而是依照其最大容量缩小 2^lg_fill_div 的倍数。该数值同 low_water 一样是动态的，两者互相配合确保 tcache 处于一个合理的充满度。

ncached：指当前缓存的 region 数量，同时也代表栈顶 index。

avail：保存 region 指针的 stack，称为 avail-stack。

tcache_bin_info_t 保存 tcache bin 的静态信息。其本身只保存了 tcache max 容量。该数值是在 tcache boot 时根据相对应的 arena bin 的 nregs 决定的。通常等于 nregs 的二倍，但不得超过 TCACHE_NSLOTS_SMALL_MAX。该数值默认为 200，但在 Android 中大大提升了该限制，small bins 不得超过 8，large bins 则为 16。

struct tcache_bin_info_s {
unsigned    ncached_max;
};

struct tcache_bin_info_s {

unsigned ncached_max;

};

tcache layout 如下，

+---------------+
/ | link          |
tcache_t  <  | next_gc_bin   |
\ | ...           |
|---------------|
/ | tstats        |
tbins[0]  <  | ...           |
| | ncached       |
\ | avail --------------+
|---------------|     |
| ...           |     |
| ...           |     |
| ...           |     |
|---------------|     |
/ | tstats        |     |
tbins[nhb  <  | ...           |     |
ins]     | | ncached       |     |
\ | avail --------------|---+
|---------------|     |   |               current arena run
| padding       |     |   |               +----------------+
|---------------| <---+   |               | run header     |
/ | stack[0]      |         |               | ...            |
avail-stack  <  | stack[1]      |         |               | bitmap         |
for tbins[0]  | | ...           |         |               | ...            |
| | ...           |         |               |----------------|
| | stack[ncached |         |               | region #0      |
\ | _max - 1]     |         |               | ...            |
|---------------|         |               |----------------|
| ...           |         |    +--------> | region #1      |
| ...           |         |    |          | ...            |
| ...           |         |    |          |----------------|
|---------------| <-------+    |          | ...            |
avail-stack   / | stack[0]      |--------------|--+       | ...            |
for tbins[   <  | ...           |              |  |       |----------------|
nhbins]      | | stack[n]      |--------------|--|-----> | region #n      |
| | ...           |              |  |       | ...            |
| | stack[ncached |              |  |       |----------------|
\ | _max - 1]     |--------------+  |       | ...            |
+---------------+                 |       | ...            |
|       |----------------|
+-----> | region #nregs-1|
| ...            |
+----------------+

+---------------+

/ | link |

tcache_t < | next_gc_bin |

\ | ... |

|---------------|

/ | tstats |

tbins[0] < | ... |

| | ncached |

\ | avail --------------+

|---------------| |

| ... | |

|---------------| |

/ | tstats | |

tbins[nhb < | ... | |

ins] | | ncached | |

\ | avail --------------|---+

|---------------| | | current arena run

| padding | | | +----------------+

|---------------| <---+ | | run header |

/ | stack[0] | | | ... |

for tbins[0] | | ... | | | ... |

| | ... | | |----------------|

\ | _max - 1] | | | ... |

|---------------| | |----------------|

| ... | | +--------> | region #1 |

| ... | | | | ... |

| ... | | | |----------------|

|---------------| <-------+ | | ... |

avail-stack / | stack[0] |--------------|--+ | ... |

for tbins[ < | ... | | | |----------------|

nhbins] | | stack[n] |--------------|--|-----> | region #n |

| | ... | | | | ... |

| | stack[ncached | | | |----------------|

\ | _max - 1] |--------------+ | | ... |

+---------------+ | | ... |

| |----------------|

+-----> | region #nregs-1|

| ... |

+----------------+

—-[ 2.7 – Extent Node (extent_node_t)

extent node 代表 huge region，即大于 chunk 大小的内存单元。同 arena run 不同，extent node 并非是一个 header 构造，而是外挂的。因此其本身仍属 small region。只不过并不通过 bin 分配，而由 base_nodes 直接动态创建。

Je 中对所有 huge region 都是通过 rb tree 管理。因此 extent node 同时也是 tree node。一个 node 节点被同时挂载到两棵 rb tree 上。一棵采用 szad 的查询方式，另一棵则采用纯 ad 的方式。作用是当执行 chunk recycle 时查询到可用 region，后面会详述。

struct extent_node_s {
rb_node(extent_node_t)    link_szad;
rb_node(extent_node_t)    link_ad;
prof_ctx_t        *prof_ctx;
void            *addr;
size_t            size;
arena_t            *arena;
bool            zeroed;
};

struct extent_node_s {

rb_node(extent_node_t) link_szad;

rb_node(extent_node_t) link_ad;

prof_ctx_t *prof_ctx;

void *addr;

size_t size;

arena_t *arena;

bool zeroed;

};

link_szad：szad tree 的 link 节点。

link_ad：ad tree 的 link 节点。

prof_ctx：用于 memory profile。

addr：指向 huge region 的指针。

size：region size。

arena：huge region 所属 arena。

zeroed：代表是否 zero-filled，chunk recycle 时会用到。

—-[ 2.8 – Base

base 并不是数据类型，而是一块特殊区域，主要服务于内部 meta data（例如 arena_t、tcache_t、extent_node_t 等等）的分配。base 区域管理与如下变量相关，

static malloc_mutex_t    base_mtx;
static void        *base_pages;
static void        *base_next_addr;
static void        *base_past_addr;
static extent_node_t    *base_nodes;

static malloc_mutex_t base_mtx;

static void *base_pages;

static void *base_next_addr;

static void *base_past_addr;

static extent_node_t *base_nodes;

base_mtx：       base lock。
base_pages：     base page 指针，代表整个区域的起始位置。
base_next_addr：当前 base 指针，类似于 brk 指针。
base_past_addr： base page 的上限指针。
base_nodes：     指向 extent_node_t 链表的外挂头指针。

base page 源于 arena 中的空闲 chunk，通常情况下大小相当于 chunk。当 base 耗尽时，会以 chunk alloc 的名义重新申请新的 base pages。

为了保证内部 meta data 的快速分配和访问，Je 将内部请求大小都对齐到 cache-line 上，以避免在 SMP 下的 false sharing。而分配方式上，采用了快速移动 base_next_addr
指针进行高速开采的方法。

void *
base_alloc(size_t size)
{
......
// xf: 将内部分配请求对齐的cache-line上, 阻止false sharing
csize = CACHELINE_CEILING(size);
malloc_mutex_lock(&base_mtx);
// xf: 如果base耗尽, 则重新分配base_pages. 默认大小为chunksize.
if ((uintptr_t)base_next_addr + csize > (uintptr_t)base_past_addr) {
if (base_pages_alloc(csize)) {
malloc_mutex_unlock(&base_mtx);
return (NULL);
}
}
// xf: 快速向前开采
ret = base_next_addr;
base_next_addr = (void *)((uintptr_t)base_next_addr + csize);
malloc_mutex_unlock(&base_mtx);
return (ret);
}

void *

base_alloc(size_t size)

{

......

// xf: 将内部分配请求对齐的cache-line上, 阻止false sharing

csize = CACHELINE_CEILING(size);

malloc_mutex_lock(&base_mtx);

// xf: 如果base耗尽, 则重新分配base_pages. 默认大小为chunksize.

if ((uintptr_t)base_next_addr + csize > (uintptr_t)base_past_addr) {

if (base_pages_alloc(csize)) {

malloc_mutex_unlock(&base_mtx);

return (NULL);

}

// xf: 快速向前开采

ret = base_next_addr;

base_next_addr = (void *)((uintptr_t)base_next_addr + csize);

malloc_mutex_unlock(&base_mtx);

return (ret);

}

与通常的 base alloc 有所不同，分配 extent_node_t 会优先从一个 node 链表中获取节点，而 base_nodes 则为该链表的外挂头指针。只有当其耗尽时，才使用前面的分配方式。这里区别对待 extent_node_t 与其他类型，主要与 chunk recycle 机制有关，后面会做详细说明。

有意思的是，该链表实际上借用了 extent node 内部 rb tree node 的左子树节点指针作为其 link 指针。如 2.7 节所述，extent_node_t 结构的起始位置存放一个 rb node。但在这里，当 base_nodes 赋值给 ret 后，会强制将 ret 转型成 (extent_node_t **)，实际上就是指向 extent_node_t 结构体的第一个 field 的指针，并将其指向的 node 指针记录到 base_nodes 里，成为新的 header 节点。这里需要仔细体会这个强制类型转换的巧妙之处。

ret = base_nodes
|
v   +---- (extent_node_t**)ret
+---|------------------------------ +
|   |              extent_node      |
| +-|-------------------------+     |
| | v       rb_node           |     |
| | +----------+-----------+  |     |
| | | rbn_left | rbn_right |  | ... |
| | +----------+-----------+  |     |
| +-------|-------------------+     |
+---------|-------------------------+
v
base_nodes---> +---------------+
| extent_node   |
+---------------+

ret = base_nodes

v +---- (extent_node_t**)ret

+---|------------------------------ +

| | extent_node |

| +-|-------------------------+ |

| | v rb_node | |

| | +----------+-----------+ | |

| | +----------+-----------+ | |

| +-------|-------------------+ |

+---------|-------------------------+

base_nodes---> +---------------+

| extent_node |

+---------------+

extent_node_t *
base_node_alloc(void)
{
extent_node_t *ret;
malloc_mutex_lock(&base_mtx);
if (base_nodes != NULL) {
ret = base_nodes;
// xf: 这里也可以理解为, base_nodes = (extent_node_t*)(*ret);
base_nodes = *(extent_node_t **)ret;
malloc_mutex_unlock(&base_mtx);
} else {
malloc_mutex_unlock(&base_mtx);
ret = (extent_node_t *)base_alloc(sizeof(extent_node_t));
}
return (ret);
}

extent_node_t *

base_node_alloc(void)

{

extent_node_t *ret;

malloc_mutex_lock(&base_mtx);

if (base_nodes != NULL) {

ret = base_nodes;

// xf: 这里也可以理解为, base_nodes = (extent_node_t*)(*ret);

base_nodes = *(extent_node_t **)ret;

malloc_mutex_unlock(&base_mtx);

} else {

malloc_mutex_unlock(&base_mtx);

ret = (extent_node_t *)base_alloc(sizeof(extent_node_t));

}

return (ret);

}

–[ 3 – Allocation

—-[ 3.1 – Overview

在 2.3.2 节中得知，Je 将 size class 划分成 small、large、huge 三种类型。分配时这三种类型分别按照不同的算法执行。后面的章节也将按照这个类型顺序描述。

总体来说，Je 分配函数从 je_malloc 入口开始，经过，

je_malloc -> imalloc_body -> imalloc -> imalloct ---> arena_malloc
|
+-> huge_malloc

je_malloc -> imalloc_body -> imalloc -> imalloct ---> arena_malloc

+-> huge_malloc

实际执行分配的分别是对应 small/large 的 arena malloc 和对应 huge 的 huge malloc。

分配算法可以概括如下，
1. 首先检查 Je 是否初始化，如果没有则初始化 Je，并标记全局 malloc_initialized 标记；
2. 检查请求 size 是否大于 huge，如果是则执行 8，否则进入下一步；
3. 执行 arena_malloc，首先检查 size 是否小于等于 small maxclass，如果是则下一步，否则执行 6；
4. 如果允许且当前线程已绑定 tcache，则从 tcache 分配 small，并返回，否则下一步；
5. 选择 arena，并执行 arena malloc small，返回；
6. 如果允许且当前线程已绑定 tcache，则从 tcache 分配 large，并返回，否则下一步；
7. 选择 arena，并执行 arena malloc large，返回；
8. 执行 huge malloc，并返回。

—-[ 3.2 – Initialize

Je 通过全局标记 malloc_initialized 指代是否初始化。在每次分配时，需要检查该标记，如果没有则执行 malloc_init。

但通常条件下，malloc_init 是在 Je 库被载入之前就调用的。通过 gcc 的编译扩展属性“constructor”实现，

JEMALLOC_ATTR(constructor)
static void
jemalloc_constructor(void)
{
malloc_init();
}

JEMALLOC_ATTR(constructor)

static void

jemalloc_constructor(void)

{

malloc_init();

}

接下来由 malloc_init_hard 执行各项初始化工作。这里首先需要考虑的是多线程初始化导致的重入，Je 通过 malloc_initialized 和 malloc_initializer 两个标记来识别。

malloc_mutex_lock(&init_lock);
// xf: 如果在获得 init_lock 前已经有其他线程完成 malloc_init，
// 或者当前线程在初始化过程中执行了 malloc，导致递归初始化，则立即退出.
if (malloc_initialized || IS_INITIALIZER) {
malloc_mutex_unlock(&init_lock);
return (false);
}
// xf: 如果开启多线程初始化，需要执行 busy wait 直到 malloc_init 在另外线程中
// 执行完毕后返回。
#ifdef JEMALLOC_THREADED_INIT
if (malloc_initializer != NO_INITIALIZER && IS_INITIALIZER == false) {
do {
malloc_mutex_unlock(&init_lock);
CPU_SPINWAIT;
malloc_mutex_lock(&init_lock);
} while (malloc_initialized == false);
malloc_mutex_unlock(&init_lock);
return (false);
}
#endif
// xf: 将当前线程注册为 initializer
malloc_initializer = INITIALIZER;

malloc_mutex_lock(&init_lock);

// xf: 如果在获得 init_lock 前已经有其他线程完成 malloc_init，

// 或者当前线程在初始化过程中执行了 malloc，导致递归初始化，则立即退出.

if (malloc_initialized || IS_INITIALIZER) {

malloc_mutex_unlock(&init_lock);

return (false);

}

// xf: 如果开启多线程初始化，需要执行 busy wait 直到 malloc_init 在另外线程中

// 执行完毕后返回。

#ifdef JEMALLOC_THREADED_INIT

if (malloc_initializer != NO_INITIALIZER && IS_INITIALIZER == false) {

do {

malloc_mutex_unlock(&init_lock);

CPU_SPINWAIT;

malloc_mutex_lock(&init_lock);

} while (malloc_initialized == false);

malloc_mutex_unlock(&init_lock);

return (false);

}

#endif

// xf: 将当前线程注册为 initializer

malloc_initializer = INITIALIZER;

初始化工作由各个 xxx_boot 函数完成。注意的是，boot 函数返回 false 代表成功，否则代表失败。

tsd boot：Thread specific data 初始化，主要负责 tsd 析构函数数组长度初始化。

base boot：base 是 Je 内部用于 meta data 分配的保留区域，使用内部独立的分配方式。base boot 负责 base node 和 base mutex 的初始化。

chunk boot：主要有三件工作，
1. 确认 chunk_size 和 chunk_npages；
2. chunk_dss_boot，chunk dss 指 chunk 分配的 dss（Data Storage Segment）方式。其中涉及 dss_base、dss_prev 指针的初始化工作。
3. chunk tree 的初始化，在 chunk recycle 时要用到。

arena boot：主要是确认 arena_maxclass，这个 size 代表 arena 管理的最大 region，超过该值被认为 huge region。在 2.2.2 小节中有过介绍，先通过多次迭代计算出 map_bias，再用 chunksize – (map_bias << LG_PAGE) 即可得到。另外还对另一个重要的静态数组 arena_bin_info 执行了初始化。可参考 2.3.2 介绍 class size 的部分。

tcache boot：分为 tcache_boot0 和 tcache_boot1 两个部分执行。前者负责 tcache 所有静态信息，包含 tcache_bin_info、stack_nelms、nhbins 等的初始化。后者负责 tcache tsd 数据的初始化（tcache 保存到线程 tsd 中）。

huge boot：负责 huge mutex 和 huge tree 的初始化。

除此之外，其他重要的初始化还包括分配 arenas 数组。注意 arenas 是一个指向指针数组的指针，因此各个 arena 还需要动态创建。这里 Je 采取了 lazy create 的方式，只有当 choose_arena 时才可能由 choose_arena_hard 创建真实的 arena 实例。但在 malloc_init 中，首个 arena 还是会在此时创建，以保证基本的分配。

—-[ 3.3 – Small allocation (Arena)

先介绍最复杂的 arena malloc small。

1. 先通过 small_size2bin 查到 bin index（2.4.3 节有述）；
2. 若对应 bin 中 current run 可用则进入下一步，否则执行 4；
3. 由 arena_run_reg_alloc 在 current run 中直接分配，并返回；
4. current run 耗尽或不存在，尝试从 bin 中获得可用 run 以填充 current run，成功则执行 9，否则进入下一步；
5. 当前 bin 的 run tree 中没有可用 run，转而从 arena 的 avail-tree 上尝试切割一个可用 run，成功则执行 9，否则进入下一步；
6. 当前 arena 没有可用的空闲 run，构造一个新的 chunk 以分配 new run。成功则执行 9，否则进入下一步；
7. chunk 分配失败，再次查询 arena 的 avail-tree，查找可用 run。成功则执行 9，否则进入下一步；
8. alloc run 尝试彻底失败，则再次查询当前 bin 的 run-tree，尝试获取 run；
9. 在使用新获得 run 之前，重新检查当前 bin 的 current run，如果可用（这里有两种可能，其一是其他线程可能通过 free 释放了多余的 region 或 run，另一种可能是抢在当前线程之前已经分配了新 run），则使用其分配，并返回；另外，如果当前手中的 new run 是空的，则将其释放掉。否则若其地址比 current run 更低，则交换二者，将旧的 current run 插回 avail-tree；
10. 在 new run 中分配 region，并返回。

void *
arena_malloc_small(arena_t *arena, size_t size, bool zero)
{
......
// xf: 根据size计算bin index
binind = small_size2bin(size);
assert(binind < NBINS);
bin = &arena->bins[binind];
size = small_bin2size(binind);
malloc_mutex_lock(&bin->lock);
// xf: 如果 bin 中 current run 不为空，且存在空闲 region，则在 current
// run 中分配。否则在其他 run 中分配。
if ((run = bin->runcur) != NULL && run->nfree > 0)
ret = arena_run_reg_alloc(run, &arena_bin_info[binind]);
else
ret = arena_bin_malloc_hard(arena, bin);
// xf: 若返回 null，则分配失败。
if (ret == NULL) {
malloc_mutex_unlock(&bin->lock);
return (NULL);
}
......
return (ret);
}

void *

arena_malloc_small(arena_t *arena, size_t size, bool zero)

{

......

// xf: 根据size计算bin index

binind = small_size2bin(size);

assert(binind < NBINS);

bin = &arena->bins[binind];

size = small_bin2size(binind);

malloc_mutex_lock(&bin->lock);

// xf: 如果 bin 中 current run 不为空，且存在空闲 region，则在 current

// run 中分配。否则在其他 run 中分配。

if ((run = bin->runcur) != NULL && run->nfree > 0)

ret = arena_run_reg_alloc(run, &arena_bin_info[binind]);

else

ret = arena_bin_malloc_hard(arena, bin);

// xf: 若返回 null，则分配失败。

if (ret == NULL) {

malloc_mutex_unlock(&bin->lock);

return (NULL);

}

......

return (ret);

}

——[ 3.3.1 – arena_run_reg_alloc

1. 首先根据 bin_info 中的静态信息 bitmap_offset 计算 bitmap 基址；
2. 扫描当前 run 的 bitmap，获得第一个 free region 所在的位置；
3. region 地址 = run 基址 + 第一个 region 的偏移量 + free region 索引 * region 内部 size

static inline void *
arena_run_reg_alloc(arena_run_t *run, arena_bin_info_t *bin_info)
{
......
// xf: 计算 bitmap 基址
bitmap_t *bitmap = (bitmap_t *)((uintptr_t)run + (uintptr_t)bin_info->bitmap_offset);
......
// xf: 获得当前 run 中第一个 free region 所在 bitmap 中的位置
regind = bitmap_sfu(bitmap, &bin_info->bitmap_info);
// xf: 计算返回值
ret = (void *)((uintptr_t)run + (uintptr_t)bin_info->reg0_offset + (uintptr_t)(bin_info->reg_interval * regind));
// xf: free 减 1
run->nfree--;
......
return (ret);
}

static inline void *

arena_run_reg_alloc(arena_run_t *run, arena_bin_info_t *bin_info)

{

......

// xf: 计算 bitmap 基址

bitmap_t *bitmap = (bitmap_t *)((uintptr_t)run + (uintptr_t)bin_info->bitmap_offset);

......

// xf: 获得当前 run 中第一个 free region 所在 bitmap 中的位置

regind = bitmap_sfu(bitmap, &bin_info->bitmap_info);

// xf: 计算返回值

ret = (void *)((uintptr_t)run + (uintptr_t)bin_info->reg0_offset + (uintptr_t)(bin_info->reg_interval * regind));

// xf: free 减 1

run->nfree--;

......

return (ret);

}

其中 bitmap_sfu 是执行 bitmap 遍历并设置第一个 unset bit。如 2.5 节所述，bitmap 由多级组成，遍历由 top level 开始循环迭代，直至 bottom level。

JEMALLOC_INLINE size_t
bitmap_sfu(bitmap_t *bitmap, const bitmap_info_t *binfo)
{
......
// xf: 找到最高级 level，并计算 ffs
i = binfo->nlevels - 1;
g = bitmap[binfo->levels[i].group_offset];
bit = jemalloc_ffsl(g) - 1;
// xf: 循环迭代, 直到 level0
while (i > 0) {
i--;
// xf: 根据上一级 level 的结果，计算当前 level 的 group
g = bitmap[binfo->levels[i].group_offset + bit];
// xf: 根据当前 level group，计算下一级需要的 bit
bit = (bit << LG_BITMAP_GROUP_NBITS) + (jemalloc_ffsl(g) - 1);
}
// xf: 得到 level0 的 bit，设置 bitmap
bitmap_set(bitmap, binfo, bit);
return (bit);
}

JEMALLOC_INLINE size_t

bitmap_sfu(bitmap_t *bitmap, const bitmap_info_t *binfo)

{

......

// xf: 找到最高级 level，并计算 ffs

i = binfo->nlevels - 1;

g = bitmap[binfo->levels[i].group_offset];

bit = jemalloc_ffsl(g) - 1;

// xf: 循环迭代, 直到 level0

while (i > 0) {

i--;

// xf: 根据上一级 level 的结果，计算当前 level 的 group

g = bitmap[binfo->levels[i].group_offset + bit];

// xf: 根据当前 level group，计算下一级需要的 bit

bit = (bit << LG_BITMAP_GROUP_NBITS) + (jemalloc_ffsl(g) - 1);

}

// xf: 得到 level0 的 bit，设置 bitmap

bitmap_set(bitmap, binfo, bit);

return (bit);

}

bitmap_set 同普通 bitmap 操作没有什么区别，只是在 set/unset 之后需要反向迭代更新各个高等级 level 对应的 bit 位。

JEMALLOC_INLINE void
bitmap_set(bitmap_t *bitmap, const bitmap_info_t *binfo, size_t bit)
{
......
// xf: 计算该 bit 所在 level0 中的 group
goff = bit >> LG_BITMAP_GROUP_NBITS;
// xf: 得到目标 group 的值 g
gp = &bitmap[goff];
g = *gp;
// xf: 根据 remainder，找到 target bit，并反转
g ^= 1LU << (bit & BITMAP_GROUP_NBITS_MASK);
*gp = g;
......
// xf: 若 target bit 所在 group 为 0，则需要更新 highlevel 的相应 bit，
// 是 bitmap_sfu 的反向操作。
if (g == 0) {
unsigned i;
for (i = 1; i < binfo->nlevels; i++) {
bit = goff;
goff = bit >> LG_BITMAP_GROUP_NBITS;
gp = &bitmap[binfo->levels[i].group_offset + goff];
g = *gp;
assert(g & (1LU << (bit & BITMAP_GROUP_NBITS_MASK)));
g ^= 1LU << (bit & BITMAP_GROUP_NBITS_MASK);
*gp = g;
if (g != 0)
break;
}
}
}

JEMALLOC_INLINE void

bitmap_set(bitmap_t *bitmap, const bitmap_info_t *binfo, size_t bit)

{

......

// xf: 计算该 bit 所在 level0 中的 group

goff = bit >> LG_BITMAP_GROUP_NBITS;

// xf: 得到目标 group 的值 g

gp = &bitmap[goff];

g = *gp;

// xf: 根据 remainder，找到 target bit，并反转

g ^= 1LU << (bit & BITMAP_GROUP_NBITS_MASK);

*gp = g;

......

// xf: 若 target bit 所在 group 为 0，则需要更新 highlevel 的相应 bit，

// 是 bitmap_sfu 的反向操作。

if (g == 0) {

unsigned i;

for (i = 1; i < binfo->nlevels; i++) {

bit = goff;

goff = bit >> LG_BITMAP_GROUP_NBITS;

gp = &bitmap[binfo->levels[i].group_offset + goff];

g = *gp;

assert(g & (1LU << (bit & BITMAP_GROUP_NBITS_MASK)));

g ^= 1LU << (bit & BITMAP_GROUP_NBITS_MASK);

*gp = g;

if (g != 0)

break;

}

——[ 3.3.2 – arena_bin_malloc_hard

1. 从 bin 中获得可用的 nonfull run，这个过程中 bin->lock 有可能被解锁；
2. 暂不使用 new run，返回检查 bin->runcur 是否重新可用。如果是，则直接在其中分配 region（其他线程在 bin lock 解锁期间可能提前修改了 runcur）。否则，执行4；
3. 重新检查 1 中得到的 new run，如果为空，则释放该 run；否则与当前 runcur 作比较，若地址低于 runcur，则与其做交换。将旧的 runcur 插回 run tree。并返回 new rigion；
4. 用 new run 填充 runcur，并在其中分配 region，返回。

static void *
arena_bin_malloc_hard(arena_t *arena, arena_bin_t *bin)
{
......
// xf: 获得 bin 对应的 arena_bin_info，并将 current run 置空
binind = arena_bin_index(arena, bin);
bin_info = &arena_bin_info[binind];
bin->runcur = NULL;
// xf: 从指定 bin 中获得一个可用的 run
run = arena_bin_nonfull_run_get(arena, bin);
// 对 bin->runcur 做重新检查，如果可用且未耗尽，则直接分配
if (bin->runcur != NULL && bin->runcur->nfree > 0) {
ret = arena_run_reg_alloc(bin->runcur, bin_info);
// xf: 若 new run 为空则将其释放，否则重新插入 run tree
if (run != NULL) {
arena_chunk_t *chunk;
chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(run);
if (run->nfree == bin_info->nregs)
arena_dalloc_bin_run(arena, chunk, run, bin);
else
arena_bin_lower_run(arena, chunk, run, bin);
}
return (ret);
}
if (run == NULL)
return (NULL);
// xf: 到这里在 bin->runcur 中分配失败，用当前新获得的 run 填充 current run
bin->runcur = run;
// xf: 在 new run 中分配 region
return (arena_run_reg_alloc(bin->runcur, bin_info));
}

static void *

arena_bin_malloc_hard(arena_t *arena, arena_bin_t *bin)

{

......

// xf: 获得 bin 对应的 arena_bin_info，并将 current run 置空

binind = arena_bin_index(arena, bin);

bin_info = &arena_bin_info[binind];

bin->runcur = NULL;

// xf: 从指定 bin 中获得一个可用的 run

run = arena_bin_nonfull_run_get(arena, bin);

// 对 bin->runcur 做重新检查，如果可用且未耗尽，则直接分配

if (bin->runcur != NULL && bin->runcur->nfree > 0) {

ret = arena_run_reg_alloc(bin->runcur, bin_info);

// xf: 若 new run 为空则将其释放，否则重新插入 run tree

if (run != NULL) {

arena_chunk_t *chunk;

chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(run);

if (run->nfree == bin_info->nregs)

arena_dalloc_bin_run(arena, chunk, run, bin);

else

arena_bin_lower_run(arena, chunk, run, bin);

}

return (ret);

}

if (run == NULL)

return (NULL);

// xf: 到这里在 bin->runcur 中分配失败，用当前新获得的 run 填充 current run

bin->runcur = run;

// xf: 在 new run 中分配 region

return (arena_run_reg_alloc(bin->runcur, bin_info));

}

——[ 3.3.3 – arena_bin_nonfull_run_get

1. 尝试在当前 run tree 中寻找可用 run，成功则返回，否则进入下一步；
2. 解锁 bin lock，并加锁 arena lock，尝试在当前 arena 中分配 new run，之后重新解锁 arena lock，并加锁 bin lock。如果成功则返回，否则进入下一步；
3. 分配失败，重新在当前 run tree 中寻找一遍可用 run。

static arena_run_t *
arena_bin_nonfull_run_get(arena_t *arena, arena_bin_t *bin)
{
......
// xf: 尝试从当前 run tree 中寻找一个可用 run，如果存在就返回
run = arena_bin_nonfull_run_tryget(bin);
if (run != NULL)
return (run);
......
// xf: 打开 bin lock，让其他线程可以操作当前的 bin tree
malloc_mutex_unlock(&bin->lock);
// xf: 锁住 arena lock，以分配 new run
malloc_mutex_lock(&arena->lock);
// xf: 尝试分配 new run
run = arena_run_alloc_small(arena, bin_info->run_size, binind);
if (run != NULL) {
// 初始化 new run 和 bitmap
bitmap_t *bitmap = (bitmap_t *)((uintptr_t)run + (uintptr_t)bin_info->bitmap_offset);
run->bin = bin;
run->nextind = 0;
run->nfree = bin_info->nregs;
bitmap_init(bitmap, &bin_info->bitmap_info);
}
// xf: 解锁 arena lock
malloc_mutex_unlock(&arena->lock);
// xf: 重新加锁 bin lock
malloc_mutex_lock(&bin->lock);
if (run != NULL) {
......
return (run);
}
// xf: 如果 run alloc 失败，则回过头重新 try get 一次（前面解锁 bin lock
// 给了其他线程机会）
run = arena_bin_nonfull_run_tryget(bin);
if (run != NULL)
return (run);
return (NULL);
}

static arena_run_t *

arena_bin_nonfull_run_get(arena_t *arena, arena_bin_t *bin)

{

......

// xf: 尝试从当前 run tree 中寻找一个可用 run，如果存在就返回

run = arena_bin_nonfull_run_tryget(bin);

if (run != NULL)

return (run);

......

// xf: 打开 bin lock，让其他线程可以操作当前的 bin tree

malloc_mutex_unlock(&bin->lock);

// xf: 锁住 arena lock，以分配 new run

malloc_mutex_lock(&arena->lock);

// xf: 尝试分配 new run

run = arena_run_alloc_small(arena, bin_info->run_size, binind);

if (run != NULL) {

// 初始化 new run 和 bitmap

bitmap_t *bitmap = (bitmap_t *)((uintptr_t)run + (uintptr_t)bin_info->bitmap_offset);

run->bin = bin;

run->nextind = 0;

run->nfree = bin_info->nregs;

bitmap_init(bitmap, &bin_info->bitmap_info);

}

// xf: 解锁 arena lock

malloc_mutex_unlock(&arena->lock);

// xf: 重新加锁 bin lock

malloc_mutex_lock(&bin->lock);

if (run != NULL) {

......

return (run);

}

// xf: 如果 run alloc 失败，则回过头重新 try get 一次（前面解锁 bin lock

// 给了其他线程机会）

run = arena_bin_nonfull_run_tryget(bin);

if (run != NULL)

return (run);

return (NULL);

}

——[ 3.3.4 – Small Run Alloc

1. 从 arena avail tree 上获得一个可用 run，并对其切割。失败进入下一步；
2. 尝试给 arena 分配新的 chunk，以构造 new run。此过程可能会解锁 arena lock。失败进入下一步；
3. 其他线程可能在此过程中释放了某些 run，重新检查 avail-tree，尝试获取 run。

static arena_run_t *
arena_run_alloc_small(arena_t *arena, size_t size, size_t binind)
{
......
// xf: 从 available tree 上尝试寻找并切割一个合适的 run，并对其初始化
run = arena_run_alloc_small_helper(arena, size, binind);
if (run != NULL)
return (run);
// xf: 当前 arena 内没有可用的空闲 run，构造一个新的 chunk 以分配 new run
chunk = arena_chunk_alloc(arena);
if (chunk != NULL) {
run = (arena_run_t *)((uintptr_t)chunk + (map_bias << LG_PAGE));
arena_run_split_small(arena, run, size, binind);
return (run);
}
// xf: 重新检查 arena avail-tree
return (arena_run_alloc_small_helper(arena, size, binind));
}
static arena_run_t *
arena_run_alloc_small_helper(arena_t *arena, size_t size, size_t binind)
{
......
// xf: 在 arena 的 available tree 中寻找一个大于等于 size 大小的最小 run
key = (arena_chunk_map_t *)(size | CHUNK_MAP_KEY);
mapelm = arena_avail_tree_nsearch(&arena->runs_avail, key);
if (mapelm != NULL) {
arena_chunk_t *run_chunk = CHUNK_ADDR2BASE(mapelm);
size_t pageind = arena_mapelm_to_pageind(mapelm);
// xf: 计算候选 run 的地址
run = (arena_run_t *)((uintptr_t)run_chunk + (pageind << LG_PAGE));
// xf: 根据分配需求，切割候选 run
arena_run_split_small(arena, run, size, binind);
return (run);
}
return (NULL);
}

static arena_run_t *

arena_run_alloc_small(arena_t *arena, size_t size, size_t binind)

{

......

// xf: 从 available tree 上尝试寻找并切割一个合适的 run，并对其初始化

run = arena_run_alloc_small_helper(arena, size, binind);

if (run != NULL)

return (run);

// xf: 当前 arena 内没有可用的空闲 run，构造一个新的 chunk 以分配 new run

chunk = arena_chunk_alloc(arena);

if (chunk != NULL) {

run = (arena_run_t *)((uintptr_t)chunk + (map_bias << LG_PAGE));

arena_run_split_small(arena, run, size, binind);

return (run);

}

// xf: 重新检查 arena avail-tree

return (arena_run_alloc_small_helper(arena, size, binind));

}

static arena_run_t *

arena_run_alloc_small_helper(arena_t *arena, size_t size, size_t binind)

{

......

// xf: 在 arena 的 available tree 中寻找一个大于等于 size 大小的最小 run

key = (arena_chunk_map_t *)(size | CHUNK_MAP_KEY);

mapelm = arena_avail_tree_nsearch(&arena->runs_avail, key);

if (mapelm != NULL) {

arena_chunk_t *run_chunk = CHUNK_ADDR2BASE(mapelm);

size_t pageind = arena_mapelm_to_pageind(mapelm);

// xf: 计算候选 run 的地址

run = (arena_run_t *)((uintptr_t)run_chunk + (pageind << LG_PAGE));

// xf: 根据分配需求，切割候选 run

arena_run_split_small(arena, run, size, binind);

return (run);

}

return (NULL);

}

切割 small run 主要分为 4 步，

1. 将候选 run 的 arena_chunk_map_t 节点从 avail-tree 上摘除；
2. 根据节点储存的原始 page 信息，以及 need pages 信息，切割该 run；
3. 更新 remainder 节点信息（只需更新首尾 page），重新插入 avail-tree；
4. 设置切割后 new run 所有 page 对应的 map 节点信息（根据 2.3.3 节所述）。

static void
arena_run_split_small(arena_t *arena, arena_run_t *run, size_t size,
size_t binind)
{
......
// xf: 获取目标 run 的 dirty flag
chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(run);
run_ind = (unsigned)(((uintptr_t)run - (uintptr_t)chunk) >> LG_PAGE);
flag_dirty = arena_mapbits_dirty_get(chunk, run_ind);
need_pages = (size >> LG_PAGE);
// xf: 1. 将候选 run 从 available tree 上摘除
//     2. 根据 need pages 对候选 run 进行切割
//     3. 将 remainder 重新插入 available tree
arena_run_split_remove(arena, chunk, run_ind, flag_dirty, need_pages);
// xf: 设置刚刚被 split 后的 run 的第一个 page
arena_mapbits_small_set(chunk, run_ind, 0, binind, flag_dirty);
......
// xf: 依次设置 run 中的其他 page，run index 依次递增
for (i = 1; i < need_pages - 1; i++) {
arena_mapbits_small_set(chunk, run_ind+i, i, binind, 0);
.......
}
// xf: 设置 run 中的最后一个 page
arena_mapbits_small_set(chunk, run_ind+need_pages-1, need_pages-1, binind, flag_dirty);
......
}

static void

arena_run_split_small(arena_t *arena, arena_run_t *run, size_t size,

size_t binind)

{

......

// xf: 获取目标 run 的 dirty flag

chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(run);

run_ind = (unsigned)(((uintptr_t)run - (uintptr_t)chunk) >> LG_PAGE);

flag_dirty = arena_mapbits_dirty_get(chunk, run_ind);

need_pages = (size >> LG_PAGE);

// xf: 1. 将候选 run 从 available tree 上摘除

// 2. 根据 need pages 对候选 run 进行切割

// 3. 将 remainder 重新插入 available tree

arena_run_split_remove(arena, chunk, run_ind, flag_dirty, need_pages);

// xf: 设置刚刚被 split 后的 run 的第一个 page

arena_mapbits_small_set(chunk, run_ind, 0, binind, flag_dirty);

......

// xf: 依次设置 run 中的其他 page，run index 依次递增

for (i = 1; i < need_pages - 1; i++) {

arena_mapbits_small_set(chunk, run_ind+i, i, binind, 0);

.......

}

// xf: 设置 run 中的最后一个 page

arena_mapbits_small_set(chunk, run_ind+need_pages-1, need_pages-1, binind, flag_dirty);

......

}

——[ 3.3.5 – Chunk Alloc

arena 获取 chunk 一般有两个途径。其一是通过内部的 spare 指针，该指针缓存了最近一次 chunk 被释放的记录，因此该方式速度很快。另一种更加常规，通过内部分配函数分配，最终将由 chunk_alloc_core 执行。但在 Je 的设计中，执行 arena chunk 的分配器是可定制的，你可以替换任何第三方 chunk 分配器。这里仅讨论默认情况。

Je 在 chunk_alloc_core 中同传统分配器如 Dl 有较大区别。通常情况下，从系统获取内存无非是 morecore 或 mmap 两种方式。Dl 中按照先 morecore->mmap 的顺序，而 Je 更为灵活，具体的顺序由 dss_prec_t 决定。

该类型是一个枚举，定义如下，

typedef enum {
dss_prec_disabled  = 0,
dss_prec_primary   = 1,
dss_prec_secondary = 2,
dss_prec_limit     = 3
} dss_prec_t;

typedef enum {

dss_prec_disabled = 0,

dss_prec_primary = 1,

dss_prec_secondary = 2,

dss_prec_limit = 3

} dss_prec_t;

这里 dss 和 morecore 含义是相同的。primary 表示优先 dss，secondary 则优先 mmap。Je 默认使用后者。

实际分配时，无论采用哪种策略，都会优先执行 chunk_recycle，再执行 chunk alloc，如下，

static void *
chunk_alloc_core(size_t size, size_t alignment, bool base, bool *zero,
dss_prec_t dss_prec)
{
void *ret;
if (have_dss && dss_prec == dss_prec_primary) {
if ((ret = chunk_recycle(&chunks_szad_dss, &chunks_ad_dss, size, alignment, base, zero)) != NULL)
return (ret);
if ((ret = chunk_alloc_dss(size, alignment, zero)) != NULL)
return (ret);
}
if ((ret = chunk_recycle(&chunks_szad_mmap, &chunks_ad_mmap, size, alignment, base, zero)) != NULL)
return (ret);
if ((ret = chunk_alloc_mmap(size, alignment, zero)) != NULL)
return (ret);
if (have_dss && dss_prec == dss_prec_secondary) {
if ((ret = chunk_recycle(&chunks_szad_dss, &chunks_ad_dss, size, alignment, base, zero)) != NULL)
return (ret);
if ((ret = chunk_alloc_dss(size, alignment, zero)) != NULL)
return (ret);
}
return (NULL);
}

static void *

chunk_alloc_core(size_t size, size_t alignment, bool base, bool *zero,

dss_prec_t dss_prec)

{

void *ret;

if (have_dss && dss_prec == dss_prec_primary) {

if ((ret = chunk_recycle(&chunks_szad_dss, &chunks_ad_dss, size, alignment, base, zero)) != NULL)

return (ret);

if ((ret = chunk_alloc_dss(size, alignment, zero)) != NULL)

return (ret);

}

if ((ret = chunk_recycle(&chunks_szad_mmap, &chunks_ad_mmap, size, alignment, base, zero)) != NULL)

return (ret);

if ((ret = chunk_alloc_mmap(size, alignment, zero)) != NULL)

return (ret);

if (have_dss && dss_prec == dss_prec_secondary) {

if ((ret = chunk_recycle(&chunks_szad_dss, &chunks_ad_dss, size, alignment, base, zero)) != NULL)

return (ret);

if ((ret = chunk_alloc_dss(size, alignment, zero)) != NULL)

return (ret);

}

return (NULL);

}

所谓 chunk recycle 是在 alloc chunk 之前，优先在废弃的 chunk tree 上搜索可用 chunk，并分配 base node 以储存 meta data 的过程。好处是其一可以加快分配速度，其二是使空间分配更加紧凑，并节省内存。

在 Je 中存在 4 棵全局的 rb tree，分别为，

static extent_tree_t    chunks_szad_mmap;
static extent_tree_t    chunks_ad_mmap;
static extent_tree_t    chunks_szad_dss;
static extent_tree_t    chunks_ad_dss;

static extent_tree_t chunks_szad_mmap;

static extent_tree_t chunks_ad_mmap;

static extent_tree_t chunks_szad_dss;

static extent_tree_t chunks_ad_dss;

它们分别对应 mmap 和 dss 方式。当一个 chunk 或 huge region 被释放后，将收集到这 4 棵 tree 中。szad 和 ad 在内容上并无本质区别，只是检索方式不一样。前者采用先 size 后 address 的方式，后者则是纯 address 的检索。

recycle 算法概括如下，
1. 检查 base 标志，如果为真则直接返回，否则进入下一步；开始的检查是必要的，因为 recycle 过程中可能会创建新的 extent node，要求调用 base allocator 分配。另一方面，base alloc 可能因为耗尽的原因而反过来调用 chunk alloc。如此将导致 dead loop；
2. 根据 alignment 计算分配大小，并在 szad tree（mmap 还是 dss 需要上一级决定）上寻找一个大于等于 alloc size 的最小 node；
3. chunk tree 上的 node 未必对齐到 alignment 上，将地址对齐，之后将得到 leadsize 和 trailsize；
4. 将原 node 从 chunk tree 上 remove。若 leadsize 不为 0，则将其作为新的 chunk 重新 insert 回 chunk tree。trailsize 不为 0 的情况亦然。若 leadsize 和 trailsize 同时不为 0，则通过 base_node_alloc 为 trailsize 生成新的 node 并插入。若 base alloc 失败，则整个新分配的 region 都要销毁；
5. 若 leadsize 和 trailsize 都为 0，则将 node（注意仅仅是节点）释放。返回对齐后的 chunk 地址。

static void *
chunk_recycle(extent_tree_t *chunks_szad, extent_tree_t *chunks_ad, size_t size,
size_t alignment, bool base, bool *zero)
{
......
// xf: 由于构造 extent_node 时可能因为内存不足的原因，同样需要构造 chunk，
// 这样就导致 recursively dead loop。因此依靠 base 标志，区分普通 alloc 和
// base node alloc。如果是 base alloc，则立即返回
if (base) {
return (NULL);
}
// xf: 计算分配大小
alloc_size = size + alignment - chunksize;
......
key.addr = NULL;
key.size = alloc_size;
// xf: 在指定的 szad tree 上寻找大于等于 alloc size 的最小可用 node
malloc_mutex_lock(&chunks_mtx);
node = extent_tree_szad_nsearch(chunks_szad, &key);
......
// xf: 将候选节点基址对齐到分配边界上，并计算 leadsize、trailsize
// 以及返回地址
leadsize = ALIGNMENT_CEILING((uintptr_t)node->addr, alignment) - (uintptr_t)node->addr;
trailsize = node->size - leadsize - size;
ret = (void *)((uintptr_t)node->addr + leadsize);
......
// xf: 将原 node 从 szad/ad tree 上移除
extent_tree_szad_remove(chunks_szad, node);
extent_tree_ad_remove(chunks_ad, node);
// xf: 如果存在 leadsize，则将前面多余部分作为一个 chunk 重新插入
// szad/ad tree 上
if (leadsize != 0) {
node->size = leadsize;
extent_tree_szad_insert(chunks_szad, node);
extent_tree_ad_insert(chunks_ad, node);
node = NULL;
}
// xf: 同样如果存在 trailsize，也将后面的多余部分插入
if (trailsize != 0) {
// xf: 如果 leadsize 不为 0，这时原来的 extent_node 已经被用过了，
// 则必须为 trailsize 部分重新分配新的 extent_node
if (node == NULL) {
malloc_mutex_unlock(&chunks_mtx);
node = base_node_alloc();
......
}
// xf: 计算 trail chunk，并插入
node->addr = (void *)((uintptr_t)(ret) + size);
node->size = trailsize;
node->zeroed = zeroed;
extent_tree_szad_insert(chunks_szad, node);
extent_tree_ad_insert(chunks_ad, node);
node = NULL;
}
malloc_mutex_unlock(&chunks_mtx);
// xf: leadsize & basesize 都不存在，将 node 释放
if (node != NULL)
base_node_dalloc(node);
......
return (ret);
}

static void *

chunk_recycle(extent_tree_t *chunks_szad, extent_tree_t *chunks_ad, size_t size,

size_t alignment, bool base, bool *zero)

{

......

// xf: 由于构造 extent_node 时可能因为内存不足的原因，同样需要构造 chunk，

// 这样就导致 recursively dead loop。因此依靠 base 标志，区分普通 alloc 和

// base node alloc。如果是 base alloc，则立即返回

if (base) {

return (NULL);

}

// xf: 计算分配大小

alloc_size = size + alignment - chunksize;

......

key.addr = NULL;

key.size = alloc_size;

// xf: 在指定的 szad tree 上寻找大于等于 alloc size 的最小可用 node

malloc_mutex_lock(&chunks_mtx);

node = extent_tree_szad_nsearch(chunks_szad, &key);

......

// xf: 将候选节点基址对齐到分配边界上，并计算 leadsize、trailsize

// 以及返回地址

leadsize = ALIGNMENT_CEILING((uintptr_t)node->addr, alignment) - (uintptr_t)node->addr;

trailsize = node->size - leadsize - size;

ret = (void *)((uintptr_t)node->addr + leadsize);

......

// xf: 将原 node 从 szad/ad tree 上移除

extent_tree_szad_remove(chunks_szad, node);

extent_tree_ad_remove(chunks_ad, node);

// xf: 如果存在 leadsize，则将前面多余部分作为一个 chunk 重新插入

// szad/ad tree 上

if (leadsize != 0) {

node->size = leadsize;

extent_tree_szad_insert(chunks_szad, node);

extent_tree_ad_insert(chunks_ad, node);

node = NULL;

}

// xf: 同样如果存在 trailsize，也将后面的多余部分插入

if (trailsize != 0) {

// xf: 如果 leadsize 不为 0，这时原来的 extent_node 已经被用过了，

// 则必须为 trailsize 部分重新分配新的 extent_node

if (node == NULL) {

malloc_mutex_unlock(&chunks_mtx);

node = base_node_alloc();

......

}

// xf: 计算 trail chunk，并插入

node->addr = (void *)((uintptr_t)(ret) + size);

node->size = trailsize;

node->zeroed = zeroed;

extent_tree_szad_insert(chunks_szad, node);

extent_tree_ad_insert(chunks_ad, node);

node = NULL;

}

malloc_mutex_unlock(&chunks_mtx);

// xf: leadsize & basesize 都不存在，将 node 释放

if (node != NULL)

base_node_dalloc(node);

......

return (ret);

}

常规分配方式先来看 dss。由于 dss 是与当前进程的 brk 指针相关的，任何线程（包括可能不通过 Je 执行分配的线程）都有权修改该指针值。因此，首先要把 dss 指针调整到对齐在
chunksize 边界的位置，否则很多与 chunk 相关的计算都会失效。接下来，还要做第二次调整对齐到外界请求的 alignment 边界。在此基础上再进行分配。

与 dss 分配相关的变量如下，

static malloc_mutex_t    dss_mtx;
static void        *dss_base;
static void        *dss_prev;
static void        *dss_max;

static malloc_mutex_t dss_mtx;

static void *dss_base;

static void *dss_prev;

static void *dss_max;

dss_mtx： dss lock。注意其并不能起到保护 dss 指针的作用，因为 brk 是一个系统资源。该 lock 保护的是 dss_prev、dss_max 指针。

dss_base：只在 chunk_dss_boot 时更新一次。主要用作识别 chunk 在线性地址空间中所处的位置，与 mmap 作出区别。

dss_prev：当前 dss 指针，是系统 brk 指针的副本，值等于 -1 代表 dss 耗尽。

dss_max：当前 dss 区域上限。

dss alloc 算法如下，
1. 获取 brk 指针，更新到 dss_max；
2. 将 dss_max 对齐到 chunksize 边界上，计算 padding 大小 gap_size；
3. 再将 dss_max 对齐到 aligment 边界上，得到 cpad_size；
4. 计算需要分配的大小，并尝试 sbrk；incr = gap_size + cpad_size + size
5. 分配成功，检查 cpad 是否非 0，是则将这部分重新回收。而 gap_size 部分因为不可用则被废弃；
6. 如果分配失败，则检查 dss 是否耗尽，如果没有则返回 1 重新尝试，否则返回。

示意图,

chunk_base             cpad        ret        dss_next
|                   |           |            |
v                   v           v            v
+--------+----------+-----------+------   ---+
|  used  | gap_size | cpad_size | size ...   |
+--------+----------+-----------+------   ---+
|<------------- incr -------------->|
^          ^           ^
|          |           |
dss_max  chunk_base +     +-- chunk_base +
chunk_size          alignment

chunk_base cpad ret dss_next

| | | |

v v v v

+--------+----------+-----------+------ ---+

+--------+----------+-----------+------ ---+

|<------------- incr -------------->|

^ ^ ^

| | |

dss_max chunk_base + +-- chunk_base +

chunk_size alignment

源码注释,

void *
chunk_alloc_dss(size_t size, size_t alignment, bool *zero)
{
......
// xf: dss 是否耗尽
malloc_mutex_lock(&dss_mtx);
if (dss_prev != (void *)-1) {
......
do {
// xf: 获取当前 dss 指针
dss_max = chunk_dss_sbrk(0);
// xf: 计算对齐到 chunk size 边界需要的 padding 大小
gap_size = (chunksize - CHUNK_ADDR2OFFSET(dss_max)) & chunksize_mask;
// xf: 对齐到 chunk 边界的 chunk 起始地址
cpad = (void *)((uintptr_t)dss_max + gap_size);
// xf: 对齐到 alignment 边界的起始地址
ret = (void *)ALIGNMENT_CEILING((uintptr_t)dss_max, alignment);
cpad_size = (uintptr_t)ret - (uintptr_t)cpad;
// xf: dss_max 分配后的更新地址
dss_next = (void *)((uintptr_t)ret + size);
......
incr = gap_size + cpad_size + size;
// xf: 从 dss 分配
dss_prev = chunk_dss_sbrk(incr);
if (dss_prev == dss_max) {
dss_max = dss_next;
malloc_mutex_unlock(&dss_mtx);
// xf: 如果 ret 和 cpad 对齐不在同一个位置，则将 cpad 开始
// cpad_size 大小的内存回收到 szad/ad tree 中。而以之前
// dss 起始的 gap_size 大小内存由于本身并非对齐到
// chunk_size，则被废弃
if (cpad_size != 0)
chunk_unmap(cpad, cpad_size);
......
return (ret);
}
} while (dss_prev != (void *)-1);   // xf: 反复尝试直至dss耗尽
}
malloc_mutex_unlock(&dss_mtx);
return (NULL);
}

void *

chunk_alloc_dss(size_t size, size_t alignment, bool *zero)

{

......

// xf: dss 是否耗尽

malloc_mutex_lock(&dss_mtx);

if (dss_prev != (void *)-1) {

......

do {

// xf: 获取当前 dss 指针

dss_max = chunk_dss_sbrk(0);

// xf: 计算对齐到 chunk size 边界需要的 padding 大小

gap_size = (chunksize - CHUNK_ADDR2OFFSET(dss_max)) & chunksize_mask;

// xf: 对齐到 chunk 边界的 chunk 起始地址

cpad = (void *)((uintptr_t)dss_max + gap_size);

// xf: 对齐到 alignment 边界的起始地址

ret = (void *)ALIGNMENT_CEILING((uintptr_t)dss_max, alignment);

cpad_size = (uintptr_t)ret - (uintptr_t)cpad;

// xf: dss_max 分配后的更新地址

dss_next = (void *)((uintptr_t)ret + size);

......

incr = gap_size + cpad_size + size;

// xf: 从 dss 分配

dss_prev = chunk_dss_sbrk(incr);

if (dss_prev == dss_max) {

dss_max = dss_next;

malloc_mutex_unlock(&dss_mtx);

// xf: 如果 ret 和 cpad 对齐不在同一个位置，则将 cpad 开始

// cpad_size 大小的内存回收到 szad/ad tree 中。而以之前

// dss 起始的 gap_size 大小内存由于本身并非对齐到

// chunk_size，则被废弃

if (cpad_size != 0)

chunk_unmap(cpad, cpad_size);

......

return (ret);

}

} while (dss_prev != (void *)-1); // xf: 反复尝试直至dss耗尽

}

malloc_mutex_unlock(&dss_mtx);

return (NULL);

}

最后介绍 chunk_alloc_mmap。同 dss 方式类似，mmap 也存在对齐的问题。由于系统 mmap 调用无法指定 alignment，因此 Je 实现了一个可以实现对齐但速度更慢的 mmap slow 方式。作为弥补，在 chunk alloc mmap 时先尝试依照普通方式 mmap，如果返回值恰好满足对齐要求则直接返回（多数情况下是可行的）。否则将返回值 munmap，再调用 mmap slow。

void *
chunk_alloc_mmap(size_t size, size_t alignment, bool *zero)
{
......
ret = pages_map(NULL, size);
if (ret == NULL)
return (NULL);
offset = ALIGNMENT_ADDR2OFFSET(ret, alignment);
if (offset != 0) {
pages_unmap(ret, size);
return (chunk_alloc_mmap_slow(size, alignment, zero));
}
......
return (ret);
}

void *

chunk_alloc_mmap(size_t size, size_t alignment, bool *zero)

{

......

ret = pages_map(NULL, size);

if (ret == NULL)

return (NULL);

offset = ALIGNMENT_ADDR2OFFSET(ret, alignment);

if (offset != 0) {

pages_unmap(ret, size);

return (chunk_alloc_mmap_slow(size, alignment, zero));

}

......

return (ret);

}

mmap slow 通过事先分配超量 size，对齐后再执行 trim，去掉前后余量实现 mmap 对齐。page trim 通过两次 munmap 将 leadsize 和 trailsize 部分分别释放。因此理论上，mmap
slow 需要最多三次 munmap。

|<-------------alloc_size---------->|
+-----------+-----   --+------------+
| lead_size | size...  | trail_size |
+-----------+-----   --+------------+
^           ^
|           |
pages      ret(alignment)

|<-------------alloc_size---------->|

+-----------+----- --+------------+

| lead_size | size... | trail_size |

+-----------+----- --+------------+

^ ^

| |

pages ret(alignment)

static void *
chunk_alloc_mmap_slow(size_t size, size_t alignment, bool *zero)
{
......
alloc_size = size + alignment - PAGE;
if (alloc_size < size)
return (NULL);
do {
pages = pages_map(NULL, alloc_size);
if (pages == NULL)
return (NULL);
leadsize = ALIGNMENT_CEILING((uintptr_t)pages, alignment) - (uintptr_t)pages;
ret = pages_trim(pages, alloc_size, leadsize, size);
} while (ret == NULL);
......
return (ret);
}
static void *
pages_trim(void *addr, size_t alloc_size, size_t leadsize, size_t size)
{
void *ret = (void *)((uintptr_t)addr + leadsize);
......
{
size_t trailsize = alloc_size - leadsize - size;
if (leadsize != 0)
pages_unmap(addr, leadsize);
if (trailsize != 0)
pages_unmap((void *)((uintptr_t)ret + size), trailsize);
return (ret);
}
}

static void *

chunk_alloc_mmap_slow(size_t size, size_t alignment, bool *zero)

{

......

alloc_size = size + alignment - PAGE;

if (alloc_size < size)

return (NULL);

do {

pages = pages_map(NULL, alloc_size);

if (pages == NULL)

return (NULL);

leadsize = ALIGNMENT_CEILING((uintptr_t)pages, alignment) - (uintptr_t)pages;

ret = pages_trim(pages, alloc_size, leadsize, size);

} while (ret == NULL);

......

return (ret);

}

static void *

pages_trim(void *addr, size_t alloc_size, size_t leadsize, size_t size)

{

void *ret = (void *)((uintptr_t)addr + leadsize);

......

{

size_t trailsize = alloc_size - leadsize - size;

if (leadsize != 0)

pages_unmap(addr, leadsize);

if (trailsize != 0)

pages_unmap((void *)((uintptr_t)ret + size), trailsize);

return (ret);

}

—-[ 3.4 – Small allocation (tcache)

tcache 内分配按照先 easy 后 hard 的方式。easy 方式直接从 tcache bin 的 avail-stack 中获得可用 region。如果 tbin 耗尽，使用 hard 方式，先 refill avail-stack，再执行 easy 分配。

JEMALLOC_ALWAYS_INLINE void *
tcache_alloc_small(tcache_t *tcache, size_t size, bool zero)
{
......
// xf: 先从 tcache bin 尝试分配
ret = tcache_alloc_easy(tbin);
// xf: 如果尝试失败，则 refill tcache bin，并尝试分配
if (ret == NULL) {
ret = tcache_alloc_small_hard(tcache, tbin, binind);
if (ret == NULL)
return (NULL);
}
......
// xf: 执行 tcache event
tcache_event(tcache);
return (ret);
}
JEMALLOC_ALWAYS_INLINE void *
tcache_alloc_easy(tcache_bin_t *tbin)
{
void *ret;
// xf: 如果 tcache bin 耗尽，更新水线为 -1
if (tbin->ncached == 0) {
tbin->low_water = -1;
return (NULL);
}
// xf: pop 栈顶的 region，如果需要更新水线
tbin->ncached--;
if ((int)tbin->ncached < tbin->low_water)
tbin->low_water = tbin->ncached;
ret = tbin->avail[tbin->ncached];
return (ret);
}
void *
tcache_alloc_small_hard(tcache_t *tcache, tcache_bin_t *tbin, size_t binind)
{
void *ret;
arena_tcache_fill_small(tcache->arena, tbin, binind, config_prof ? tcache->prof_accumbytes : 0);
if (config_prof)
tcache->prof_accumbytes = 0;
ret = tcache_alloc_easy(tbin);
return (ret);
}

JEMALLOC_ALWAYS_INLINE void *

tcache_alloc_small(tcache_t *tcache, size_t size, bool zero)

{

......

// xf: 先从 tcache bin 尝试分配

ret = tcache_alloc_easy(tbin);

// xf: 如果尝试失败，则 refill tcache bin，并尝试分配

if (ret == NULL) {

ret = tcache_alloc_small_hard(tcache, tbin, binind);

if (ret == NULL)

return (NULL);

}

......

// xf: 执行 tcache event

tcache_event(tcache);

return (ret);

}

JEMALLOC_ALWAYS_INLINE void *

tcache_alloc_easy(tcache_bin_t *tbin)

{

void *ret;

// xf: 如果 tcache bin 耗尽，更新水线为 -1

if (tbin->ncached == 0) {

tbin->low_water = -1;

return (NULL);

}

// xf: pop 栈顶的 region，如果需要更新水线

tbin->ncached--;

if ((int)tbin->ncached < tbin->low_water)

tbin->low_water = tbin->ncached;

ret = tbin->avail[tbin->ncached];

return (ret);

}

void *

tcache_alloc_small_hard(tcache_t *tcache, tcache_bin_t *tbin, size_t binind)

{

void *ret;

arena_tcache_fill_small(tcache->arena, tbin, binind, config_prof ? tcache->prof_accumbytes : 0);

if (config_prof)

tcache->prof_accumbytes = 0;

ret = tcache_alloc_easy(tbin);

return (ret);

}

tcache fill 同普通的 arena bin 分配类似。首先，获得与 tbin 相同 index 的 arena bin。之后确定 fill 值，该数值与 2.7 节介绍的 lg_fill_div 有关。如果 arena run 的 runcur 可用则直接分配并 push stack，否则 arena_bin_malloc_hard 分配 region。push 后的顺序按照从低到高排列，低地址的 region 更靠近栈顶位置。

void
arena_tcache_fill_small(arena_t *arena, tcache_bin_t *tbin, size_t binind,
uint64_t prof_accumbytes)
{
......
// xf: 得到与 tbin 同 index 的 arena bin
bin = &arena->bins[binind];
malloc_mutex_lock(&bin->lock);
// xf: tbin 的充满度与 lg_fill_div 相关
for (i = 0, nfill = (tcache_bin_info[binind].ncached_max >> tbin->lg_fill_div); i < nfill; i++) {
// xf: 如果 current run 可用，则从中分配
if ((run = bin->runcur) != NULL && run->nfree > 0)
ptr = arena_run_reg_alloc(run, &arena_bin_info[binind]);
else    // xf: current run 耗尽，则从 bin 中查找其他 run 分配
ptr = arena_bin_malloc_hard(arena, bin);
if (ptr == NULL)
break;
......
// xf: 低地址 region 优先放入栈顶
tbin->avail[nfill - 1 - i] = ptr;
}
......
malloc_mutex_unlock(&bin->lock);
// xf: 更新 ncached
tbin->ncached = i;
}

void

arena_tcache_fill_small(arena_t *arena, tcache_bin_t *tbin, size_t binind,

uint64_t prof_accumbytes)

{

......

// xf: 得到与 tbin 同 index 的 arena bin

bin = &arena->bins[binind];

malloc_mutex_lock(&bin->lock);

// xf: tbin 的充满度与 lg_fill_div 相关

for (i = 0, nfill = (tcache_bin_info[binind].ncached_max >> tbin->lg_fill_div); i < nfill; i++) {

// xf: 如果 current run 可用，则从中分配

if ((run = bin->runcur) != NULL && run->nfree > 0)

ptr = arena_run_reg_alloc(run, &arena_bin_info[binind]);

else // xf: current run 耗尽，则从 bin 中查找其他 run 分配

ptr = arena_bin_malloc_hard(arena, bin);

if (ptr == NULL)

break;

......

// xf: 低地址 region 优先放入栈顶

tbin->avail[nfill - 1 - i] = ptr;

}

......

malloc_mutex_unlock(&bin->lock);

// xf: 更新 ncached

tbin->ncached = i;

}

另外，如 2.7 节所述，tcache 在每次分配和释放后都会更新 ev_cnt 计数器。当计数周期达到 TCACHE_GC_INCR 时，就会启动 tcache GC。GC 过程中会清理相当于 low_water 3/4 数量的 region，并根据当前的 low_water 和 lg_fill_div 动态调整下一次 refill 时 tbin 的充满度。

void
tcache_bin_flush_small(tcache_bin_t *tbin, size_t binind, unsigned rem,
tcache_t *tcache)
{
......
// xf: 循环 scan，直到 nflush 为空
// 因为 avail-stack 中的 region 可能来自不同 arena，因此需要多次 scan
// 每次 scan 将不同 arena 的 region 移动到栈顶，留到下一轮 scan 时清理
for (nflush = tbin->ncached - rem; nflush > 0; nflush = ndeferred) {
// xf: 获得栈顶 region 所属的 arena 和 arena bin
arena_chunk_t *chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(tbin->avail[0]);
arena_t *arena = chunk->arena;
arena_bin_t *bin = &arena->bins[binind];
......
// xf: 锁住栈顶 region 的 arena bin
malloc_mutex_lock(&bin->lock);
......
// xf: ndefered 代表所属不同 arena 的 region 被搬移的位置，默认从 0 开始。
// 本意是随着 scan 进行，nflush 逐渐递增，nflush 之前的位置空缺出来。
// 当 scan 到不同 arena region 时，将其指针移动到 nflush 前面的空缺中，
// 留到下一轮 scan，nflush 重新开始。直到 ndefered 和 nflush 重新为 0。
ndeferred = 0;
for (i = 0; i < nflush; i++) {
ptr = tbin->avail[i];
chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(ptr);
// xf: 如果 scan 的 region 与栈顶 region 位于同一 arena，则释放，
// 否则移动到 ndefered 标注的位置，留到后面 scan。
if (chunk->arena == arena) {
size_t pageind = ((uintptr_t)ptr - (uintptr_t)chunk) >> LG_PAGE;
arena_chunk_map_t *mapelm = arena_mapp_get(chunk, pageind);
......
// xf: 释放多余 region
arena_dalloc_bin_locked(arena, chunk, ptr, mapelm);
} else {
tbin->avail[ndeferred] = ptr;
ndeferred++;
}
}
malloc_mutex_unlock(&bin->lock);
}
......
// xf: 将 remainder regions 指针移动到栈顶位置，完成 gc 过程
memmove(tbin->avail, &tbin->avail[tbin->ncached - rem], rem * sizeof(void *));
// xf: 修正 ncached 以及 low_water
tbin->ncached = rem;
if ((int)tbin->ncached < tbin->low_water)
tbin->low_water = tbin->ncached;
}

void

tcache_bin_flush_small(tcache_bin_t *tbin, size_t binind, unsigned rem,

tcache_t *tcache)

{

......

// xf: 循环 scan，直到 nflush 为空

// 因为 avail-stack 中的 region 可能来自不同 arena，因此需要多次 scan

// 每次 scan 将不同 arena 的 region 移动到栈顶，留到下一轮 scan 时清理

for (nflush = tbin->ncached - rem; nflush > 0; nflush = ndeferred) {

// xf: 获得栈顶 region 所属的 arena 和 arena bin

arena_chunk_t *chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(tbin->avail[0]);

arena_t *arena = chunk->arena;

arena_bin_t *bin = &arena->bins[binind];

......

// xf: 锁住栈顶 region 的 arena bin

malloc_mutex_lock(&bin->lock);

......

// xf: ndefered 代表所属不同 arena 的 region 被搬移的位置，默认从 0 开始。

// 本意是随着 scan 进行，nflush 逐渐递增，nflush 之前的位置空缺出来。

// 当 scan 到不同 arena region 时，将其指针移动到 nflush 前面的空缺中，

// 留到下一轮 scan，nflush 重新开始。直到 ndefered 和 nflush 重新为 0。

ndeferred = 0;

for (i = 0; i < nflush; i++) {

ptr = tbin->avail[i];

chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(ptr);

// xf: 如果 scan 的 region 与栈顶 region 位于同一 arena，则释放，

// 否则移动到 ndefered 标注的位置，留到后面 scan。

if (chunk->arena == arena) {

size_t pageind = ((uintptr_t)ptr - (uintptr_t)chunk) >> LG_PAGE;

arena_chunk_map_t *mapelm = arena_mapp_get(chunk, pageind);

......

// xf: 释放多余 region

arena_dalloc_bin_locked(arena, chunk, ptr, mapelm);

} else {

tbin->avail[ndeferred] = ptr;

ndeferred++;

}

malloc_mutex_unlock(&bin->lock);

}

......

// xf: 将 remainder regions 指针移动到栈顶位置，完成 gc 过程

memmove(tbin->avail, &tbin->avail[tbin->ncached - rem], rem * sizeof(void *));

// xf: 修正 ncached 以及 low_water

tbin->ncached = rem;

if ((int)tbin->ncached < tbin->low_water)

tbin->low_water = tbin->ncached;

}

—-[ 3.5 – Large allocation

arena 上的 large alloc 同 small 相比除了省去 arena bin 的部分之外，并无本质区别。基本算法如下，

1. 把请求大小对齐到 page size 上，直接从 avail-tree 上寻找 first-best-fit runs。如果成功，则根据请求大小切割内存。切割过程也同切割 small run 类似，区别在之后对 chunk map 的初始化不同。chunk map 细节可回顾 2.3.3。如果失败，则进入下一步；
2. 没有可用 runs，尝试创建 new chunk，成功同样切割 run，失败进入下一步；
3. 再次尝试从 avail-tree 上寻找可用 runs，并返回。

同上面的过程可以看出，所谓 large region 分配相当于 small run 的分配。区别仅在于 chunk map 信息不同。

tcache 上的 large alloc 同样按照先 easy 后 hard 的顺序。尽管常规 arena 上的分配不存在 large bin，但在 tcache 中却存在 large tbin，因此仍然是先查找 avail-stack。如果 tbin 中找不到，就会向 arena 申请 large runs。这里与 small alloc 的区别在不执行 tbin refill，因为考虑到过多 large region 的占用量问题。large tbin 仅在 tcache_dalloc_large 的时候才负责收集 region。当 tcache 已满或 GC 周期到时执行 tcache GC。

—-[ 3.6 – Huge allocation

huge alloc 相对于前面就更加简单。因为对于 Je 而言，huge region 和 chunk 是等同的，这在前面有过叙述。huge alloc 就是调用 chunk alloc，并将 extent_node 记录在 huge tree 上。

void *
huge_palloc(arena_t *arena, size_t size, size_t alignment, bool zero)
{
void *ret;
size_t csize;
extent_node_t *node;
bool is_zeroed;
// xf: huge alloc 对齐到 chunksize
csize = CHUNK_CEILING(size);
......
// xf: create extent node 以记录 huge region
node = base_node_alloc();
......
arena = choose_arena(arena);
// xf: 调用 chunk alloc 分配
ret = arena_chunk_alloc_huge(arena, csize, alignment, &is_zeroed);
// xf: 失败则清除 extent node
if (ret == NULL) {
base_node_dalloc(node);
return (NULL);
}
node->addr = ret;
node->size = csize;
node->arena = arena;
// xf: 插入 huge tree 上
malloc_mutex_lock(&huge_mtx);
extent_tree_ad_insert(&huge, node);
malloc_mutex_unlock(&huge_mtx);
......
return (ret);
}

void *

huge_palloc(arena_t *arena, size_t size, size_t alignment, bool zero)

{

void *ret;

size_t csize;

extent_node_t *node;

bool is_zeroed;

// xf: huge alloc 对齐到 chunksize

csize = CHUNK_CEILING(size);

......

// xf: create extent node 以记录 huge region

node = base_node_alloc();

......

arena = choose_arena(arena);

// xf: 调用 chunk alloc 分配

ret = arena_chunk_alloc_huge(arena, csize, alignment, &is_zeroed);

// xf: 失败则清除 extent node

if (ret == NULL) {

base_node_dalloc(node);

return (NULL);

}

node->addr = ret;

node->size = csize;

node->arena = arena;

// xf: 插入 huge tree 上

malloc_mutex_lock(&huge_mtx);

extent_tree_ad_insert(&huge, node);

malloc_mutex_unlock(&huge_mtx);

......

return (ret);

}

–[ 4 – Deallocation

—-[ 4.1 – Overview

释放同分配过程相反，按照一个从 ptr -> run -> bin -> chunk -> arena 的路径。但因为涉及 page 合并和 purge，实现更为复杂。dalloc 的入口从 je_free -> ifree -> iqalloc -> iqalloct -> idalloct。对 dalloc 的分析从 idalloct 开始，代码如下，

JEMALLOC_ALWAYS_INLINE void
idalloct(void *ptr, bool try_tcache)
{
......
// xf: 获得被释放地址所在的 chunk
chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(ptr);
if (chunk != ptr)
arena_dalloc(chunk, ptr, try_tcache);
else
huge_dalloc(ptr);
}

JEMALLOC_ALWAYS_INLINE void

idalloct(void *ptr, bool try_tcache)

{

......

// xf: 获得被释放地址所在的 chunk

chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(ptr);

if (chunk != ptr)

arena_dalloc(chunk, ptr, try_tcache);

else

huge_dalloc(ptr);

}

首先会检测被释放指针 ptr 所在 chunk 的首地址与 ptr 是否一致，如果是，则一定为 huge region，否则为 small/large。从这里分为 arena 和 huge 两条线。再看一下 arena_dalloc，

JEMALLOC_ALWAYS_INLINE void
arena_dalloc(arena_chunk_t *chunk, void *ptr, bool try_tcache)
{
......
// xf: 得到页面 mapbits
mapbits = arena_mapbits_get(chunk, pageind);
if ((mapbits & CHUNK_MAP_LARGE) == 0) {
if (try_tcache && (tcache = tcache_get(false)) != NULL) {
// xf: ptr 所在 tcache 的 index
binind = arena_ptr_small_binind_get(ptr, mapbits);
tcache_dalloc_small(tcache, ptr, binind);
} else
arena_dalloc_small(chunk->arena, chunk, ptr, pageind);
} else {
size_t size = arena_mapbits_large_size_get(chunk, pageind);
if (try_tcache && size <= tcache_maxclass && (tcache = tcache_get(false)) != NULL) {
tcache_dalloc_large(tcache, ptr, size);
} else
arena_dalloc_large(chunk->arena, chunk, ptr);
}
}

JEMALLOC_ALWAYS_INLINE void

arena_dalloc(arena_chunk_t *chunk, void *ptr, bool try_tcache)

{

......

// xf: 得到页面 mapbits

mapbits = arena_mapbits_get(chunk, pageind);

if ((mapbits & CHUNK_MAP_LARGE) == 0) {

if (try_tcache && (tcache = tcache_get(false)) != NULL) {

// xf: ptr 所在 tcache 的 index

binind = arena_ptr_small_binind_get(ptr, mapbits);

tcache_dalloc_small(tcache, ptr, binind);

} else

arena_dalloc_small(chunk->arena, chunk, ptr, pageind);

} else {

size_t size = arena_mapbits_large_size_get(chunk, pageind);

if (try_tcache && size <= tcache_maxclass && (tcache = tcache_get(false)) != NULL) {

tcache_dalloc_large(tcache, ptr, size);

} else

arena_dalloc_large(chunk->arena, chunk, ptr);

}

这里通过得到 ptr 所在 page 的 mapbits，判断其来自于 small 还是 large。然后再分别作处理。

因此，在 dalloc 一开始基本上分成了 small/large/huge 三条路线执行。事实上，结合前面的知识，large/huge 可以看作 run 和 chunk 的特例。所以，这三条 dalloc 路线最终会汇到一起，只需要搞清楚其中最为复杂的 small region dalloc 就可以了。

无论 small/large region，都会先尝试释放回 tcache，不管其是否从 tache 中分配而来。所谓 tcache dalloc 只不过是将 region 记录在 tbin 中，并不算真正的释放。除非两种情况，一是如果当前线程 tbin 已满，会直接执行一次 tbin flush，释放出部分 tbin 空间。二是如果 tcache_event 触发了 tache GC，也会执行 flush。两者的区别在于，前者会回收指定 tbin 1/2 的空间，而后者则释放 next_gc_bin 相当于 3/4 low water 数量的空间。

JEMALLOC_ALWAYS_INLINE void
tcache_dalloc_small(tcache_t *tcache, void *ptr, size_t binind)
{
......
tbin = &tcache->tbins[binind];
tbin_info = &tcache_bin_info[binind];
// xf: 如果当前 tbin 已满，则执行 flush 清理 tbin
if (tbin->ncached == tbin_info->ncached_max) {
tcache_bin_flush_small(tbin, binind, (tbin_info->ncached_max >> 1), tcache);
}
// xf: 将被释放的 ptr 重新 push 进 tbin
tbin->avail[tbin->ncached] = ptr;
tbin->ncached++;
tcache_event(tcache);
}

JEMALLOC_ALWAYS_INLINE void

tcache_dalloc_small(tcache_t *tcache, void *ptr, size_t binind)

{

......

tbin = &tcache->tbins[binind];

tbin_info = &tcache_bin_info[binind];

// xf: 如果当前 tbin 已满，则执行 flush 清理 tbin

if (tbin->ncached == tbin_info->ncached_max) {

tcache_bin_flush_small(tbin, binind, (tbin_info->ncached_max >> 1), tcache);

}

// xf: 将被释放的 ptr 重新 push 进 tbin

tbin->avail[tbin->ncached] = ptr;

tbin->ncached++;

tcache_event(tcache);

}

tcache GC 和 tcache flush 在 2.7 和 3.4 节中已经介绍，不再赘述。

—-[ 4.2 – arena_dalloc_bin

small region dalloc 的第一步是尝试将 region 返还给所属的 bin。首要的步骤就是根据用户传入的 ptr 推算出其所在 run 的地址。

run addr = chunk base + run page offset << LG_PAGE

1	run addr = chunk base + run page offset << LG_PAGE

而 run page offset 根据 2.3.3 小节的说明，可以通过 ptr 所在 page 的 mapbits 获得。

run page offset = ptr page index - ptr page offset

1	run page offset = ptr page index - ptr page offset

得到 run 后就进一步拿到所属的 bin，接着对 bin 加锁并回收，如下，

void
arena_dalloc_bin(arena_t *arena, arena_chunk_t *chunk, void *ptr,
size_t pageind, arena_chunk_map_t *mapelm)
{
......
// xf: 计算 ptr 所在 run 地址
run = (arena_run_t *)((uintptr_t)chunk + (uintptr_t)((pageind - arena_mapbits_small_runind_get(chunk, pageind)) << LG_PAGE));
bin = run->bin;
malloc_mutex_lock(&bin->lock);
arena_dalloc_bin_locked(arena, chunk, ptr, mapelm);
malloc_mutex_unlock(&bin->lock);
}

void

arena_dalloc_bin(arena_t *arena, arena_chunk_t *chunk, void *ptr,

size_t pageind, arena_chunk_map_t *mapelm)

{

......

// xf: 计算 ptr 所在 run 地址

run = (arena_run_t *)((uintptr_t)chunk + (uintptr_t)((pageind - arena_mapbits_small_runind_get(chunk, pageind)) << LG_PAGE));

bin = run->bin;

malloc_mutex_lock(&bin->lock);

arena_dalloc_bin_locked(arena, chunk, ptr, mapelm);

malloc_mutex_unlock(&bin->lock);

}

lock 的内容无非是将 region 在 run 内部的 bitmap 上标记为可用。bitmap unset 的过程此处省略，请参考 3.3.1 小节中分配算法的解释。与 tcache dalloc 类似，通常情况下 region 并不会真正释放。但如果 run 内部全部为空闲 region，则会进一步触发 run 的释放。

void
arena_dalloc_bin_locked(arena_t *arena, arena_chunk_t *chunk, void *ptr,
arena_chunk_map_t *mapelm)
{
......
// xf: 通过 run 回收 region，在 bitmap 上重新标记 region 可用
arena_run_reg_dalloc(run, ptr);
// xf: 如果其所在 run 完全 free，则尝试释放该 run。
// 如果所在 run 处在将满状态（因为刚刚的释放腾出一个 region 的空间），
// 则根据地址高低优先将其交换到 current run 的位置（MRU）。
if (run->nfree == bin_info->nregs) {
arena_dissociate_bin_run(chunk, run, bin);
arena_dalloc_bin_run(arena, chunk, run, bin);
} else if (run->nfree == 1 && run != bin->runcur)
arena_bin_lower_run(arena, chunk, run, bin);
......
}

void

arena_dalloc_bin_locked(arena_t *arena, arena_chunk_t *chunk, void *ptr,

arena_chunk_map_t *mapelm)

{

......

// xf: 通过 run 回收 region，在 bitmap 上重新标记 region 可用

arena_run_reg_dalloc(run, ptr);

// xf: 如果其所在 run 完全 free，则尝试释放该 run。

// 如果所在 run 处在将满状态（因为刚刚的释放腾出一个 region 的空间），

// 则根据地址高低优先将其交换到 current run 的位置（MRU）。

if (run->nfree == bin_info->nregs) {

arena_dissociate_bin_run(chunk, run, bin);

arena_dalloc_bin_run(arena, chunk, run, bin);

} else if (run->nfree == 1 && run != bin->runcur)

arena_bin_lower_run(arena, chunk, run, bin);

......

}

此外还有一种情况是，如果原先 run 本来是满的，因为前面的释放多出一个空闲位置，就会尝试与 current run 交换位置。若当前 run 比 current run 地址更低，会替代后者并成为新的 current run，这样的好处显然可以保证低地址的内存更紧实。

static void
arena_bin_lower_run(arena_t *arena, arena_chunk_t *chunk, arena_run_t *run,
arena_bin_t *bin)
{
if ((uintptr_t)run < (uintptr_t)bin->runcur) {
if (bin->runcur->nfree > 0)
arena_bin_runs_insert(bin, bin->runcur);
bin->runcur = run;
if (config_stats)
bin->stats.reruns++;
} else
arena_bin_runs_insert(bin, run);
}

static void

arena_bin_lower_run(arena_t *arena, arena_chunk_t *chunk, arena_run_t *run,

arena_bin_t *bin)

{

if ((uintptr_t)run < (uintptr_t)bin->runcur) {

if (bin->runcur->nfree > 0)

arena_bin_runs_insert(bin, bin->runcur);

bin->runcur = run;

if (config_stats)

bin->stats.reruns++;

} else

arena_bin_runs_insert(bin, run);

}

通常情况下，至此一个 small region 就释放完毕了，准确的说是回收了。但如前面所说，若整个 run 都为空闲 region，则进入 run dalloc。这是一个比较复杂的过程。

—-[ 4.3 – small run dalloc

一个 non-full 的 small run 被记录在 bin 内的 run tree 上，因此要移除它，首先要移除其在 run tree 中的信息，即 arena_dissociate_bin_run。

static void
arena_dissociate_bin_run(arena_chunk_t *chunk, arena_run_t *run,
arena_bin_t *bin)
{
// xf: 如果当前 run 为 current run，清除 runcur，否则，从 run tree 上 remove
if (run == bin->runcur)
bin->runcur = NULL;
else {
......
if (bin_info->nregs != 1) {
arena_bin_runs_remove(bin, run);
}
}
}

static void

arena_dissociate_bin_run(arena_chunk_t *chunk, arena_run_t *run,

arena_bin_t *bin)

{

// xf: 如果当前 run 为 current run，清除 runcur，否则，从 run tree 上 remove

if (run == bin->runcur)

bin->runcur = NULL;

else {

......

if (bin_info->nregs != 1) {

arena_bin_runs_remove(bin, run);

}

接下来要通过 arena_dalloc_bin_run() 正式释放 run，由于过程稍复杂，这里先给出整个算法的梗概，

1. 计算 nextind region 所在 page 的 index。所谓 nextind 是 run 内部 clean-dirty region 的边界。如果内部存在 clean pages 则执行下一步，否则执行 3；
2. 将原始的 small run 转化成 large run，之后根据上一步得到的 nextind 将 run 切割成 dirty 和 clean 两部分，且单独释放掉 clean 部分；
3. 将待 remove 的 run pages 标记为 unalloc。且根据传入的 dirty 和 cleaned 两个 hint 决定标记后的 page mapbits 的 dirty flag；
4. 检查 unalloc 后的 run pages 是否可以前后合并。合并的标准是，
1) 不超过 chunk 范围；
2) 前后毗邻的 page 同样为 unalloc；
3) 前后毗邻 page 的 dirty flag 与 run pages 相同；
5. 将合并后（也可能没合并）的 unalloc run 插入 avail-tree；
6. 检查如果 unalloc run 的大小等于 chunk size，则将 chunk 释放掉；
7. 如果之前释放 run pages 为 dirty，则检查当前 arena 内部的 dirty-active pages 比例。若 dirty 数量超过了 active 的 1/8（Android 这里的标准有所不同），则启动 arena purge。否则直接返回；
8. 计算当前 arena 可以清理的 dirty pages 数量 npurgatory；
9. 从 dirty tree 上依次取出 dirty chunk，并检查内部的 unalloc dirty pages，将其重新分配为 large pages，并插入到临时的 queue 中；
10. 对临时队列中的 dirty pages 执行 purge，返回值为 unzeroed 标记。再将 purged pages 的 unzeroed 标记设置一遍；
11. 最后对所有 purged pages 重新执行一遍 dalloc run 操作，将其重新释放回 avail-tree。

可以看到，释放 run 本质上是将其回收至 avail-tree。但额外的 dirty page 机制却增加了整个算法的复杂程度。原因就在于 Je 使用了不同以往的内存释放方式。

在 Dl 这样的经典分配器中，系统内存回收方式更加“古板”。比如在 heap 区需要 top-most space 存在大于某个 threshold 的连续 free 空间时才能进行 auto-trimming。而 mmap 区则更要等到某个 segment 全部空闲才能执行 munmap。这对于回收系统内存是极为不利的，因为条件过于严格。

而 Je 使用了更为聪明的方式，并不会直接交还系统内存，而是通过 madvise 暂时释放掉页面与物理页面之间的映射。本质上这同 sbrk/munmap 之类的调用要达到的目的是类似的，只不过从进程内部的角度看，该地址仍然被占用。但 Je 对这些使用过的地址都详细做了记录，因此再分配时可以 recycle，并不会导致对线性地址无休止的开采。

另外，为了提高对已释放 page 的利用率，Je 将 unalloc pages 用 dirty flag（注意，这里同 page replacement 中的含义不同）做了标记（参考 2.3.3 节中 chunkmapbits）。所有 pages 被分成 active、dirty 和 clean 三种。dirty pages 表示曾经使用过，且仍可能关联着物理页面，recycle 速度较快。而 clean 则代表尚未使用，或已经通过 purge 释放了物理页面，较前者速度慢。显然，需要一种内置算法来保持三种 page 的动态平衡，以兼顾分配速度和内存占用量。如果当前 dirty pages 数量超过了 active pages 数量的 1/2^opt_lg_dirty_mult，就会启动 arena_purge()。这个值默认是 1/8，如下，

static inline void
arena_maybe_purge(arena_t *arena)
{
......
// xf: 如果当前 dirty pages 全部在执行 purging, 则直接返回
if (arena->ndirty <= arena->npurgatory)
return;
// xf: 检查 purageable pages 是否超出 active-dirty 比率，超出则
// 执行 purge。google 在这里增加了 ANDROID_ALWAYS_PURGE 开关，
// 打开则总会执行 arena_purge（默认是打开的）。
#if !defined(ANDROID_ALWAYS_PURGE)
npurgeable = arena->ndirty - arena->npurgatory;
threshold = (arena->nactive >> opt_lg_dirty_mult);
if (npurgeable <= threshold)
return;
#endif
// xf: 执行 purge
arena_purge(arena, false);
}

static inline void

arena_maybe_purge(arena_t *arena)

{

......

// xf: 如果当前 dirty pages 全部在执行 purging, 则直接返回

if (arena->ndirty <= arena->npurgatory)

return;

// xf: 检查 purageable pages 是否超出 active-dirty 比率，超出则

// 执行 purge。google 在这里增加了 ANDROID_ALWAYS_PURGE 开关，

// 打开则总会执行 arena_purge（默认是打开的）。

#if !defined(ANDROID_ALWAYS_PURGE)

npurgeable = arena->ndirty - arena->npurgatory;

threshold = (arena->nactive >> opt_lg_dirty_mult);

if (npurgeable <= threshold)

return;

#endif

// xf: 执行 purge

arena_purge(arena, false);

}

但 google 显然希望对 dirty pages 管理更严格一些，以适应移动设备上内存偏小的问题。这里增加了一个 ALWAYS_PURGE 的开关，打开后会强制每次释放时都执行 arena_purge。

arena_run_dalloc 代码如下，

static void
arena_run_dalloc(arena_t *arena, arena_run_t *run, bool dirty, bool cleaned)
{
......
// xf: 如果 run pages 的 dirty flag 实际读取为 true，且 cleaned 不为 true，
// 则同样认为该 pages 在 dalloc 后是 dirty 的，否则被视为 clean（该情况适用于
// chunk purge 后，重新 dalloc 时，此时的 run pages 虽然 dirty flag 可能为 ture，
// 但经过 purge 后应该修改为 clean）。
if (cleaned == false && arena_mapbits_dirty_get(chunk, run_ind) != 0)
dirty = true;
flag_dirty = dirty ? CHUNK_MAP_DIRTY : 0;
// xf: 将被 remove 的 run 标记为 unalloc pages。前面的判断如果是 dirty，则 pages
// mapbits 将带有 dirty flag，否则将不带有 dirty flag。
if (dirty) {
arena_mapbits_unallocated_set(chunk, run_ind, size, CHUNK_MAP_DIRTY);
arena_mapbits_unallocated_set(chunk, run_ind+run_pages-1, size, CHUNK_MAP_DIRTY);
} else {
arena_mapbits_unallocated_set(chunk, run_ind, size,
arena_mapbits_unzeroed_get(chunk, run_ind));
arena_mapbits_unallocated_set(chunk, run_ind+run_pages-1, size,
arena_mapbits_unzeroed_get(chunk, run_ind+run_pages-1));
}
// xf: 尝试将被 remove run 与前后 unalloc pages 合并
arena_run_coalesce(arena, chunk, &size, &run_ind, &run_pages, flag_dirty);
......
// xf: 将执行过合并后的 run 重新 insert 到 avail-tree
arena_avail_insert(arena, chunk, run_ind, run_pages, true, true);
// xf: 检查如果合并后的 size 已经完全 unallocated，则 dalloc 整个 chunk
if (size == arena_maxclass) {
......
arena_chunk_dalloc(arena, chunk);
}
if (dirty)
arena_maybe_purge(arena);
}

static void

arena_run_dalloc(arena_t *arena, arena_run_t *run, bool dirty, bool cleaned)

{

......

// xf: 如果 run pages 的 dirty flag 实际读取为 true，且 cleaned 不为 true，

// 则同样认为该 pages 在 dalloc 后是 dirty 的，否则被视为 clean（该情况适用于

// chunk purge 后，重新 dalloc 时，此时的 run pages 虽然 dirty flag 可能为 ture，

// 但经过 purge 后应该修改为 clean）。

if (cleaned == false && arena_mapbits_dirty_get(chunk, run_ind) != 0)

dirty = true;

flag_dirty = dirty ? CHUNK_MAP_DIRTY : 0;

// xf: 将被 remove 的 run 标记为 unalloc pages。前面的判断如果是 dirty，则 pages

// mapbits 将带有 dirty flag，否则将不带有 dirty flag。

if (dirty) {

arena_mapbits_unallocated_set(chunk, run_ind, size, CHUNK_MAP_DIRTY);

arena_mapbits_unallocated_set(chunk, run_ind+run_pages-1, size, CHUNK_MAP_DIRTY);

} else {

arena_mapbits_unallocated_set(chunk, run_ind, size,

arena_mapbits_unzeroed_get(chunk, run_ind));

arena_mapbits_unallocated_set(chunk, run_ind+run_pages-1, size,

arena_mapbits_unzeroed_get(chunk, run_ind+run_pages-1));

}

// xf: 尝试将被 remove run 与前后 unalloc pages 合并

arena_run_coalesce(arena, chunk, &size, &run_ind, &run_pages, flag_dirty);

......

// xf: 将执行过合并后的 run 重新 insert 到 avail-tree

arena_avail_insert(arena, chunk, run_ind, run_pages, true, true);

// xf: 检查如果合并后的 size 已经完全 unallocated，则 dalloc 整个 chunk

if (size == arena_maxclass) {

......

arena_chunk_dalloc(arena, chunk);

}

if (dirty)

arena_maybe_purge(arena);

}

coalesce 代码如下，

static void
arena_run_coalesce(arena_t *arena, arena_chunk_t *chunk, size_t *p_size,
size_t *p_run_ind, size_t *p_run_pages, size_t flag_dirty)
{
......
// xf: 尝试与后面的 pages 合并
if (run_ind + run_pages < chunk_npages &&
arena_mapbits_allocated_get(chunk, run_ind+run_pages) == 0 &&
arena_mapbits_dirty_get(chunk, run_ind+run_pages) == flag_dirty) {
size_t nrun_size = arena_mapbits_unallocated_size_get(chunk, run_ind+run_pages);
size_t nrun_pages = nrun_size >> LG_PAGE;
......
// xf: 如果与后面的 unalloc pages 合并，remove page 时后方的 adjacent hint 应为 true
arena_avail_remove(arena, chunk, run_ind+run_pages, nrun_pages, false, true);
size += nrun_size;
run_pages += nrun_pages;
arena_mapbits_unallocated_size_set(chunk, run_ind, size);
arena_mapbits_unallocated_size_set(chunk, run_ind+run_pages-1, size);
}
// xf: 尝试与前面的 pages 合并
if (run_ind > map_bias && arena_mapbits_allocated_get(chunk,
run_ind-1) == 0 && arena_mapbits_dirty_get(chunk, run_ind-1) ==
flag_dirty) {
......
}
*p_size = size;
*p_run_ind = run_ind;
*p_run_pages = run_pages;
}

static void

arena_run_coalesce(arena_t *arena, arena_chunk_t *chunk, size_t *p_size,

size_t *p_run_ind, size_t *p_run_pages, size_t flag_dirty)

{

......

// xf: 尝试与后面的 pages 合并

if (run_ind + run_pages < chunk_npages &&

arena_mapbits_allocated_get(chunk, run_ind+run_pages) == 0 &&

arena_mapbits_dirty_get(chunk, run_ind+run_pages) == flag_dirty) {

size_t nrun_size = arena_mapbits_unallocated_size_get(chunk, run_ind+run_pages);

size_t nrun_pages = nrun_size >> LG_PAGE;

......

// xf: 如果与后面的 unalloc pages 合并，remove page 时后方的 adjacent hint 应为 true

arena_avail_remove(arena, chunk, run_ind+run_pages, nrun_pages, false, true);

size += nrun_size;

run_pages += nrun_pages;

arena_mapbits_unallocated_size_set(chunk, run_ind, size);

arena_mapbits_unallocated_size_set(chunk, run_ind+run_pages-1, size);

}

// xf: 尝试与前面的 pages 合并

if (run_ind > map_bias && arena_mapbits_allocated_get(chunk,

run_ind-1) == 0 && arena_mapbits_dirty_get(chunk, run_ind-1) ==

flag_dirty) {

......

}

*p_size = size;

*p_run_ind = run_ind;

*p_run_pages = run_pages;

}

avail-tree remove 代码如下，

static void
arena_avail_remove(arena_t *arena, arena_chunk_t *chunk, size_t pageind,
size_t npages, bool maybe_adjac_pred, bool maybe_adjac_succ)
{
......
// xf: 该调用可能将导致 chunk 内部的碎片化率改变，从而影响其在 dirty tree
// 中的排序。因此，在正式 remove 之前需要将 chunk 首先从 dirty tree 中 remove，
// 待更新内部 ndirty 后，再将其重新 insert 回 dirty tree。
if (chunk->ndirty != 0)
arena_chunk_dirty_remove(&arena->chunks_dirty, chunk);
// xf: maybe_adjac_pred/succ 是外界传入的 hint，根据该值检查前后是否存在
// clean-dirty 边界。若存在边界，则 remove avail pages 后边界将减 1。
if (maybe_adjac_pred && arena_avail_adjac_pred(chunk, pageind))
chunk->nruns_adjac--;
if (maybe_adjac_succ && arena_avail_adjac_succ(chunk, pageind, npages))
chunk->nruns_adjac--;
chunk->nruns_avail--;
......
// xf: 更新 arena 及 chunk 中 dirty pages 统计
if (arena_mapbits_dirty_get(chunk, pageind) != 0) {
arena->ndirty -= npages;
chunk->ndirty -= npages;
}
// xf: 如果 chunk 内部 dirty 不为 0，将其重新 insert 到 arena dirty tree
if (chunk->ndirty != 0)
arena_chunk_dirty_insert(&arena->chunks_dirty, chunk);
// xf: 从 chunk avail-tree 中 remove 掉 unalloc pages
arena_avail_tree_remove(&arena->runs_avail, arena_mapp_get(chunk, pageind));
}

static void

arena_avail_remove(arena_t *arena, arena_chunk_t *chunk, size_t pageind,

size_t npages, bool maybe_adjac_pred, bool maybe_adjac_succ)

{

......

// xf: 该调用可能将导致 chunk 内部的碎片化率改变，从而影响其在 dirty tree

// 中的排序。因此，在正式 remove 之前需要将 chunk 首先从 dirty tree 中 remove，

// 待更新内部 ndirty 后，再将其重新 insert 回 dirty tree。

if (chunk->ndirty != 0)

arena_chunk_dirty_remove(&arena->chunks_dirty, chunk);

// xf: maybe_adjac_pred/succ 是外界传入的 hint，根据该值检查前后是否存在

// clean-dirty 边界。若存在边界，则 remove avail pages 后边界将减 1。

if (maybe_adjac_pred && arena_avail_adjac_pred(chunk, pageind))

chunk->nruns_adjac--;

if (maybe_adjac_succ && arena_avail_adjac_succ(chunk, pageind, npages))

chunk->nruns_adjac--;

chunk->nruns_avail--;

......

// xf: 更新 arena 及 chunk 中 dirty pages 统计

if (arena_mapbits_dirty_get(chunk, pageind) != 0) {

arena->ndirty -= npages;

chunk->ndirty -= npages;

}

// xf: 如果 chunk 内部 dirty 不为 0，将其重新 insert 到 arena dirty tree

if (chunk->ndirty != 0)

arena_chunk_dirty_insert(&arena->chunks_dirty, chunk);

// xf: 从 chunk avail-tree 中 remove 掉 unalloc pages

arena_avail_tree_remove(&arena->runs_avail, arena_mapp_get(chunk, pageind));

}

从 avail-tree 上 remove pages 可能会改变当前 chunk 内部 clean-dirty 碎片率，因此一开始要将其所在 chunk 从 dirty tree 上 remove，再从 avail-tree 上 remove pages。另外，arena_avail_insert() 的算法同 remove 是一样的，只是方向相反，不再赘述。

—-[ 4.4 – arena purge

清理 arena 的方式是按照从小到大的顺序遍历一棵 dirty tree，直到将 dirty pages 降低到 threshold 以下。dirty tree 挂载所有 dirty chunks，同其他 tree 的区别在于它的 cmp 函数较特殊，决定了最终的 purging order，如下，

static inline int
arena_chunk_dirty_comp(arena_chunk_t *a, arena_chunk_t *b)
{
......
if (a == b)
return (0);
{
size_t a_val = (a->nruns_avail - a->nruns_adjac) * b->nruns_avail;
size_t b_val = (b->nruns_avail - b->nruns_adjac) *
a->nruns_avail;
if (a_val < b_val)
return (1);
if (a_val > b_val)
return (-1);
}
{
uintptr_t a_chunk = (uintptr_t)a;
uintptr_t b_chunk = (uintptr_t)b;
int ret = ((a_chunk > b_chunk) - (a_chunk < b_chunk));
if (a->nruns_adjac == 0) {
assert(b->nruns_adjac == 0);
ret = -ret;
}
return (ret);
}
}

static inline int

arena_chunk_dirty_comp(arena_chunk_t *a, arena_chunk_t *b)

{

......

if (a == b)

return (0);

{

size_t a_val = (a->nruns_avail - a->nruns_adjac) * b->nruns_avail;

size_t b_val = (b->nruns_avail - b->nruns_adjac) *

a->nruns_avail;

if (a_val < b_val)

return (1);

if (a_val > b_val)

return (-1);

}

{

uintptr_t a_chunk = (uintptr_t)a;

uintptr_t b_chunk = (uintptr_t)b;

int ret = ((a_chunk > b_chunk) - (a_chunk < b_chunk));

if (a->nruns_adjac == 0) {

assert(b->nruns_adjac == 0);

ret = -ret;

}

return (ret);

}

Je 在这里给出的算法是这样的，
1. 首先排除 short cut，即 a 和 b 相同的特例；
2. 计算 a、b 的 fragmentation，该数值越高，相应的在 dirty tree 上就越靠前。
其计算方法为，

当前平均 avail run 大小      所有 avail run 数量 - 边界数量
---------------------  =  -----------------------------
去碎片后的平均大小            所有 avail run 数量

当前平均 avail run 大小所有 avail run 数量 - 边界数量

--------------------- = -----------------------------

去碎片后的平均大小所有 avail run 数量

注意，这个 fragment 不是通常意义理解的碎片。这里指由于 clean-dirty 边界形成的所谓碎片，并且是可以通过 purge 清除掉的，如图，

nruns_adjac = 2
+--------+----------+--------+-------+---------+----------+--------+-----
| dirty  |  clean   |        | clean |  dirty  |          | dirty  | ...
+--------+----------+--------+-------+---------+----------+--------+-----
^                           ^
|                           |
+--adjac #0                 +--adjac #1

nruns_adjac = 2

+--------+----------+--------+-------+---------+----------+--------+-----

+--------+----------+--------+-------+---------+----------+--------+-----

^ ^

| |

+--adjac #0 +--adjac #1

3. 当 a, b 的 fragmentation 相同时，同通常的方法类似，按地址大小排序。但若 nruns_adjac 为 0，即不存在 clean-dirty 边界时，反而会将低地址 chunk 排到后面。因为 adjac 为 0 的 chunk 再利用价值是比较高的，所以放到后面可以增加其在 purge 中的幸存几率，从而提升 recycle 效率。

这里需要说明的是，Je 这个 cmp 函数个人觉得似乎有问题，实际跟踪代码也发现其并不能更优先 purge 高碎片率的 chunk。但与其本人证实并未得到信服的说明。但这套算法仅仅在 3.x 版本中有效，在最新的 4.x 中则完全抛弃了现有的回收算法。

purge 代码如下，

static void
arena_purge(arena_t *arena, bool all)
{
......
// xf: 计算 purgeable pages，结果加入到 npurgatory 信息中
npurgatory = arena_compute_npurgatory(arena, all);
arena->npurgatory += npurgatory;
// xf: 从 dirty chunk tree 上逐 chunk 执行 purge，直到期望值 npurgatory 为 0
while (npurgatory > 0) {
......
chunk = arena_chunk_dirty_first(&arena->chunks_dirty);
// xf: traversal 结束，当前线程无法完成 purge 任务，返回
if (chunk == NULL) {
arena->npurgatory -= npurgatory;
return;
}
npurgeable = chunk->ndirty;
......
// xf: 如果当前 chunk 中 purgeable 大于前期计算的 purgatory，
// 且其 clean-dirty 碎片为 0，则让当前线程负责 purge 所有 prgeable pages。
// 原因是为了尽可能避免避免多个线程对该 chunk 的 purge 竞争。
if (npurgeable > npurgatory && chunk->nruns_adjac == 0) {
arena->npurgatory += npurgeable - npurgatory;
npurgatory = npurgeable;
}
arena->npurgatory -= npurgeable;
npurgatory -= npurgeable;
npurged = arena_chunk_purge(arena, chunk, all);
// xf: 计算 purge 期望值 npurgatory 和实际 purge 值 npurged 差值
nunpurged = npurgeable - npurged;
arena->npurgatory += nunpurged;
npurgatory += nunpurged;
}
}

static void

arena_purge(arena_t *arena, bool all)

{

......

// xf: 计算 purgeable pages，结果加入到 npurgatory 信息中

npurgatory = arena_compute_npurgatory(arena, all);

arena->npurgatory += npurgatory;

// xf: 从 dirty chunk tree 上逐 chunk 执行 purge，直到期望值 npurgatory 为 0

while (npurgatory > 0) {

......

chunk = arena_chunk_dirty_first(&arena->chunks_dirty);

// xf: traversal 结束，当前线程无法完成 purge 任务，返回

if (chunk == NULL) {

arena->npurgatory -= npurgatory;

return;

}

npurgeable = chunk->ndirty;

......

// xf: 如果当前 chunk 中 purgeable 大于前期计算的 purgatory，

// 且其 clean-dirty 碎片为 0，则让当前线程负责 purge 所有 prgeable pages。

// 原因是为了尽可能避免避免多个线程对该 chunk 的 purge 竞争。

if (npurgeable > npurgatory && chunk->nruns_adjac == 0) {

arena->npurgatory += npurgeable - npurgatory;

npurgatory = npurgeable;

}

arena->npurgatory -= npurgeable;

npurgatory -= npurgeable;

npurged = arena_chunk_purge(arena, chunk, all);

// xf: 计算 purge 期望值 npurgatory 和实际 purge 值 npurged 差值

nunpurged = npurgeable - npurged;

arena->npurgatory += nunpurged;

npurgatory += nunpurged;

}

chunk purge 如下，

static inline size_t
arena_chunk_purge(arena_t *arena, arena_chunk_t *chunk, bool all)
{
......
if (chunk == arena->spare) {
......
arena_chunk_alloc(arena);
}
......
// xf: 为了减小 arena purge 时 arena lock 的暂停时间，先将所有满足
// 需求的 unalloc dirty pages 重新“alloc”并保存，待 purge 结束再重新
// 释放回 avail-tree。
arena_chunk_stash_dirty(arena, chunk, all, &mapelms);
npurged = arena_chunk_purge_stashed(arena, chunk, &mapelms);
arena_chunk_unstash_purged(arena, chunk, &mapelms);
return (npurged);
}

static inline size_t

arena_chunk_purge(arena_t *arena, arena_chunk_t *chunk, bool all)

{

......

if (chunk == arena->spare) {

......

arena_chunk_alloc(arena);

}

......

// xf: 为了减小 arena purge 时 arena lock 的暂停时间，先将所有满足

// 需求的 unalloc dirty pages 重新“alloc”并保存，待 purge 结束再重新

// 释放回 avail-tree。

arena_chunk_stash_dirty(arena, chunk, all, &mapelms);

npurged = arena_chunk_purge_stashed(arena, chunk, &mapelms);

arena_chunk_unstash_purged(arena, chunk, &mapelms);

return (npurged);

}

chunk purge 重点在于这是一个线性查找 dirty pages 过程，Je 在这里会导致性能下降。更糟糕的是，之前和之后都是在 arena lock 被锁定的条件下被执行，绑定同一 arena 的线程不得不停下工作。因此，在正式 purge 前需要先把 unalloc dirty pages 全部临时分配出来，当 purging 时解锁 arena lock，而结束后再一次将它们全部释放。

stash dirty 代码，

static void
arena_chunk_stash_dirty(arena_t *arena, arena_chunk_t *chunk, bool all,
arena_chunk_mapelms_t *mapelms)
{
......
for (pageind = map_bias; pageind < chunk_npages; pageind += npages) {
arena_chunk_map_t *mapelm = arena_mapp_get(chunk, pageind);
if (arena_mapbits_allocated_get(chunk, pageind) == 0) {
......
if (arena_mapbits_dirty_get(chunk, pageind) != 0 &&
(all || arena_avail_adjac(chunk, pageind, npages))) {
arena_run_t *run = (arena_run_t *)((uintptr_t)chunk + (uintptr_t)(pageind << LG_PAGE));
// xf: 暂时将这些 unalloc dirty pages 通过 split large 重新分配出来
arena_run_split_large(arena, run, run_size, false);
// 加入临时列表, 留待后用
ql_elm_new(mapelm, u.ql_link);
ql_tail_insert(mapelms, mapelm, u.ql_link);
}
} else {
// xf: 跳过 allocated pages
......
}
}
......
}

static void

arena_chunk_stash_dirty(arena_t *arena, arena_chunk_t *chunk, bool all,

arena_chunk_mapelms_t *mapelms)

{

......

for (pageind = map_bias; pageind < chunk_npages; pageind += npages) {

arena_chunk_map_t *mapelm = arena_mapp_get(chunk, pageind);

if (arena_mapbits_allocated_get(chunk, pageind) == 0) {

......

if (arena_mapbits_dirty_get(chunk, pageind) != 0 &&

(all || arena_avail_adjac(chunk, pageind, npages))) {

arena_run_t *run = (arena_run_t *)((uintptr_t)chunk + (uintptr_t)(pageind << LG_PAGE));

// xf: 暂时将这些 unalloc dirty pages 通过 split large 重新分配出来

arena_run_split_large(arena, run, run_size, false);

// 加入临时列表, 留待后用

ql_elm_new(mapelm, u.ql_link);

ql_tail_insert(mapelms, mapelm, u.ql_link);

}

} else {

// xf: 跳过 allocated pages

......

}

......

}

stash 时会根据传入的 hint all 判断，如果为 false，只会 stash 存在 clean-dirty adjac 的 pages，否则会全部加入列表。

purge stashed pages 代码如下，

static size_t
arena_chunk_purge_stashed(arena_t *arena, arena_chunk_t *chunk,
arena_chunk_mapelms_t *mapelms)
{
......
// xf: 暂时解锁 arena lock，前面已经 realloc 过，这里不考虑 contention 问题
malloc_mutex_unlock(&arena->lock);
......
ql_foreach(mapelm, mapelms, u.ql_link) {
......
// xf: 逐个 purge dirty page，返回 pages 是否 unzeroed
unzeroed = pages_purge((void *)((uintptr_t)chunk + (pageind << LG_PAGE)), (npages << LG_PAGE));
flag_unzeroed = unzeroed ? CHUNK_MAP_UNZEROED : 0;
// xf: 逐 pages 设置 unzeroed 标志
for (i = 0; i < npages; i++) {
arena_mapbits_unzeroed_set(chunk, pageind+i, flag_unzeroed);
}
......
}
// xf: purging 结束重新 lock arena
malloc_mutex_lock(&arena->lock);
......
return (npurged);
}

static size_t

arena_chunk_purge_stashed(arena_t *arena, arena_chunk_t *chunk,

arena_chunk_mapelms_t *mapelms)

{

......

// xf: 暂时解锁 arena lock，前面已经 realloc 过，这里不考虑 contention 问题

malloc_mutex_unlock(&arena->lock);

......

ql_foreach(mapelm, mapelms, u.ql_link) {

......

// xf: 逐个 purge dirty page，返回 pages 是否 unzeroed

unzeroed = pages_purge((void *)((uintptr_t)chunk + (pageind << LG_PAGE)), (npages << LG_PAGE));

flag_unzeroed = unzeroed ? CHUNK_MAP_UNZEROED : 0;

// xf: 逐 pages 设置 unzeroed 标志

for (i = 0; i < npages; i++) {

arena_mapbits_unzeroed_set(chunk, pageind+i, flag_unzeroed);

}

......

}

// xf: purging 结束重新 lock arena

malloc_mutex_lock(&arena->lock);

......

return (npurged);

}

这里要注意的是，在 page purge 过后，会逐一设置 unzero flag。这是因为有些操作系统在 demand page 后会有一步 zero-fill-on-demand。因此，被 purge 过的 clean page 当再一次申请到物理页面时会全部填充为 0。

unstash 代码，

static void
arena_chunk_unstash_purged(arena_t *arena, arena_chunk_t *chunk,
arena_chunk_mapelms_t *mapelms)
{
......
for (mapelm = ql_first(mapelms); mapelm != NULL; mapelm = ql_first(mapelms)) {
......
run = (arena_run_t *)((uintptr_t)chunk + (uintptr_t)(pageind << LG_PAGE));
ql_remove(mapelms, mapelm, u.ql_link);
arena_run_dalloc(arena, run, false, true);
}
}

static void

arena_chunk_unstash_purged(arena_t *arena, arena_chunk_t *chunk,

arena_chunk_mapelms_t *mapelms)

{

......

for (mapelm = ql_first(mapelms); mapelm != NULL; mapelm = ql_first(mapelms)) {

......

run = (arena_run_t *)((uintptr_t)chunk + (uintptr_t)(pageind << LG_PAGE));

ql_remove(mapelms, mapelm, u.ql_link);

arena_run_dalloc(arena, run, false, true);

}

unstash 需要再一次调用 arena_run_dalloc() 以释放临时分配的 pages。要注意此时我们已经位于 arena_run_dalloc 调用栈中，而避免无限递归重入依靠参数 cleaned flag。

—-[ 4.5 – arena chunk dalloc

当 free chunk 被 Je 释放时，根据局部性原理，会成为下一个 spare chunk 而保存起来，其真身并未消散。而原先的 spare 则会根据内部 dalloc 方法被处理掉。

static void
arena_chunk_dalloc(arena_t *arena, arena_chunk_t *chunk)
{
......
// xf: 将 chunk 从 avail-tree 上 remove
arena_avail_remove(arena, chunk, map_bias, chunk_npages-map_bias, false, false);
// xf: 如果 spare 不为空，则将被释放的 chunk 替换原 spare chunk
if (arena->spare != NULL) {
arena_chunk_t *spare = arena->spare;
arena->spare = chunk;
arena_chunk_dalloc_internal(arena, spare);
} else
arena->spare = chunk;
}

static void

arena_chunk_dalloc(arena_t *arena, arena_chunk_t *chunk)

{

......

// xf: 将 chunk 从 avail-tree 上 remove

arena_avail_remove(arena, chunk, map_bias, chunk_npages-map_bias, false, false);

// xf: 如果 spare 不为空，则将被释放的 chunk 替换原 spare chunk

if (arena->spare != NULL) {

arena_chunk_t *spare = arena->spare;

arena->spare = chunk;

arena_chunk_dalloc_internal(arena, spare);

} else

arena->spare = chunk;

}

同 chunk alloc 一样，chunk dalloc 算法也是可定制的。Je 提供的默认算法 chunk_dalloc_default 最终会调用 chunk_unmap，如下，

void
chunk_unmap(void *chunk, size_t size)
{
......
// xf: 如果启用 dss，且当前 chunk 在 dss 内，将其 record 在 dss tree 上
// 否则如果就记录在 mmap tree 上，或者直接 munmap 释放掉
if (have_dss && chunk_in_dss(chunk))
chunk_record(&chunks_szad_dss, &chunks_ad_dss, chunk, size);
else if (chunk_dalloc_mmap(chunk, size))
chunk_record(&chunks_szad_mmap, &chunks_ad_mmap, chunk, size);
}

void

chunk_unmap(void *chunk, size_t size)

{

......

// xf: 如果启用 dss，且当前 chunk 在 dss 内，将其 record 在 dss tree 上

// 否则如果就记录在 mmap tree 上，或者直接 munmap 释放掉

if (have_dss && chunk_in_dss(chunk))

chunk_record(&chunks_szad_dss, &chunks_ad_dss, chunk, size);

else if (chunk_dalloc_mmap(chunk, size))

chunk_record(&chunks_szad_mmap, &chunks_ad_mmap, chunk, size);

}

在 3.3.5 小节中 alloc 时会根据 dss 和 mmap 优先执行 recycle。源自在 dalloc 时 record 在四棵 chunk tree 上的记录。但同 spare 记录的不同，这里的记录仅仅只剩下躯壳，record 时会强行释放物理页面，因此 recycle 速度相比 spare 较慢。

chunk record 算法如下，
1. 先 purge chunk 内部所有 pages；
2. 预分配 base node，以记录释放后的 chunk。这里分配的 node 到后面可能没有用，提前分配是因为接下来要加锁 chunks_mtx。而如果在临界段内再分配 base node，则可能因为 base pages 不足而申请新的 chunk，这样一来就会导致 dead lock；
3. 寻找与要插入 chunk 的毗邻地址。首先尝试与后面的地址合并，成功则用后者的 base node 记录，之后执行 5；
4. 合并失败，用预分配的 base node 记录 chunk；
5. 尝试与前面的地址合并；
6. 如果预分配的 base node 没有使用，释放掉。

代码如下，

static void
chunk_record(extent_tree_t *chunks_szad, extent_tree_t *chunks_ad, void *chunk,
size_t size)
{
......
// xf: purge all chunk pages
unzeroed = pages_purge(chunk, size);
// xf: 预先分配 extent_node 以记录 chunk。如果该 chunk 可以进行合并，该 node
// 可能并不会使用。这里预先分配主要是避免 dead lock。因为某些情况
// base_node_alloc 同样可能会 alloc base chunk，由于后面 chunk mutex 被 lock，
// 那样将导致 dead lock。
xnode = base_node_alloc();
xprev = NULL;
malloc_mutex_lock(&chunks_mtx);
// xf: 首先尝试与后面的 chunk 合并
key.addr = (void *)((uintptr_t)chunk + size);
node = extent_tree_ad_nsearch(chunks_ad, &key);
if (node != NULL && node->addr == key.addr) {
extent_tree_szad_remove(chunks_szad, node);
node->addr = chunk;
node->size += size;
node->zeroed = (node->zeroed && (unzeroed == false));
extent_tree_szad_insert(chunks_szad, node);
} else {
// xf: 合并失败，用提前分配好的 xnode 保存当前 chunk 信息
if (xnode == NULL) {
goto label_return;
}
node = xnode;
xnode = NULL;
node->addr = chunk;
node->size = size;
node->zeroed = (unzeroed == false);
extent_tree_ad_insert(chunks_ad, node);
extent_tree_szad_insert(chunks_szad, node);
}
// xf: 再尝试与前面的 chunk 合并
prev = extent_tree_ad_prev(chunks_ad, node);
if (prev != NULL && (void *)((uintptr_t)prev->addr + prev->size) == chunk) {
......
}
label_return:
malloc_mutex_unlock(&chunks_mtx);
// xf: 如果预先分配的 node 没有使用，则在此将之销毁
if (xnode != NULL)
base_node_dalloc(xnode);
if (xprev != NULL)
base_node_dalloc(xprev);
}

static void

chunk_record(extent_tree_t *chunks_szad, extent_tree_t *chunks_ad, void *chunk,

size_t size)

{

......

// xf: purge all chunk pages

unzeroed = pages_purge(chunk, size);

// xf: 预先分配 extent_node 以记录 chunk。如果该 chunk 可以进行合并，该 node

// 可能并不会使用。这里预先分配主要是避免 dead lock。因为某些情况

// base_node_alloc 同样可能会 alloc base chunk，由于后面 chunk mutex 被 lock，

// 那样将导致 dead lock。

xnode = base_node_alloc();

xprev = NULL;

malloc_mutex_lock(&chunks_mtx);

// xf: 首先尝试与后面的 chunk 合并

key.addr = (void *)((uintptr_t)chunk + size);

node = extent_tree_ad_nsearch(chunks_ad, &key);

if (node != NULL && node->addr == key.addr) {

extent_tree_szad_remove(chunks_szad, node);

node->addr = chunk;

node->size += size;

node->zeroed = (node->zeroed && (unzeroed == false));

extent_tree_szad_insert(chunks_szad, node);

} else {

// xf: 合并失败，用提前分配好的 xnode 保存当前 chunk 信息

if (xnode == NULL) {

goto label_return;

}

node = xnode;

xnode = NULL;

node->addr = chunk;

node->size = size;

node->zeroed = (unzeroed == false);

extent_tree_ad_insert(chunks_ad, node);

extent_tree_szad_insert(chunks_szad, node);

}

// xf: 再尝试与前面的 chunk 合并

prev = extent_tree_ad_prev(chunks_ad, node);

if (prev != NULL && (void *)((uintptr_t)prev->addr + prev->size) == chunk) {

......

}

label_return:

malloc_mutex_unlock(&chunks_mtx);

// xf: 如果预先分配的 node 没有使用，则在此将之销毁

if (xnode != NULL)

base_node_dalloc(xnode);

if (xprev != NULL)

base_node_dalloc(xprev);

}

最后顺带一提，对于 mmap 区的 pages，Je 也可以直接 munmap，前提是需要在 jemalloc_internal_defs.h 中开启 JEMALLOC_MUNMAP，这样就不会执行 pages purge。默认该选项是不开启的。但源自 dss 区中的分配则不存在反向释放一说，默认 Je 也不会优先选择 dss 就是了。

bool
chunk_dalloc_mmap(void *chunk, size_t size)
{
if (config_munmap)
pages_unmap(chunk, size);
return (config_munmap == false);
}

bool

chunk_dalloc_mmap(void *chunk, size_t size)

{

if (config_munmap)

pages_unmap(chunk, size);

return (config_munmap == false);

}

—-[ 4.6 – large/huge dalloc

前面说过 large/huge 相当于以 run 和 chunk 为粒度的特例。因此对于 arena dalloc large 来说，最终就是 arena_run_dalloc，

void
arena_dalloc_large_locked(arena_t *arena, arena_chunk_t *chunk, void *ptr)
{
if (config_fill || config_stats) {
size_t pageind = ((uintptr_t)ptr - (uintptr_t)chunk) >> LG_PAGE;
size_t usize = arena_mapbits_large_size_get(chunk, pageind);
arena_dalloc_junk_large(ptr, usize);
if (config_stats) {
arena->stats.ndalloc_large++;
arena->stats.allocated_large -= usize;
arena->stats.lstats[(usize >> LG_PAGE) - 1].ndalloc++;
arena->stats.lstats[(usize >> LG_PAGE) - 1].curruns--;
}
}
arena_run_dalloc(arena, (arena_run_t *)ptr, true, false);
}

void

arena_dalloc_large_locked(arena_t *arena, arena_chunk_t *chunk, void *ptr)

{

if (config_fill || config_stats) {

size_t pageind = ((uintptr_t)ptr - (uintptr_t)chunk) >> LG_PAGE;

size_t usize = arena_mapbits_large_size_get(chunk, pageind);

arena_dalloc_junk_large(ptr, usize);

if (config_stats) {

arena->stats.ndalloc_large++;

arena->stats.allocated_large -= usize;

arena->stats.lstats[(usize >> LG_PAGE) - 1].ndalloc++;

arena->stats.lstats[(usize >> LG_PAGE) - 1].curruns--;

}

arena_run_dalloc(arena, (arena_run_t *)ptr, true, false);

}

而 huge dalloc，则是在 huge tree 上搜寻，最终执行 chunk_dalloc，

void
huge_dalloc(void *ptr)
{
......
malloc_mutex_lock(&huge_mtx);
key.addr = ptr;
node = extent_tree_ad_search(&huge, &key);
assert(node != NULL);
assert(node->addr == ptr);
extent_tree_ad_remove(&huge, node);
malloc_mutex_unlock(&huge_mtx);
huge_dalloc_junk(node->addr, node->size);
arena_chunk_dalloc_huge(node->arena, node->addr, node->size);
base_node_dalloc(node);
}
void
arena_chunk_dalloc_huge(arena_t *arena, void *chunk, size_t size)
{
chunk_dalloc_t *chunk_dalloc;
malloc_mutex_lock(&arena->lock);
chunk_dalloc = arena->chunk_dalloc;
if (config_stats) {
arena->stats.mapped -= size;
arena->stats.allocated_huge -= size;
arena->stats.ndalloc_huge++;
stats_cactive_sub(size);
}
arena->nactive -= (size >> LG_PAGE);
malloc_mutex_unlock(&arena->lock);
chunk_dalloc(chunk, size, arena->ind);
}

void

huge_dalloc(void *ptr)

{

......

malloc_mutex_lock(&huge_mtx);

key.addr = ptr;

node = extent_tree_ad_search(&huge, &key);

assert(node != NULL);

assert(node->addr == ptr);

extent_tree_ad_remove(&huge, node);

malloc_mutex_unlock(&huge_mtx);

huge_dalloc_junk(node->addr, node->size);

arena_chunk_dalloc_huge(node->arena, node->addr, node->size);

base_node_dalloc(node);

}

void

arena_chunk_dalloc_huge(arena_t *arena, void *chunk, size_t size)

{

chunk_dalloc_t *chunk_dalloc;

malloc_mutex_lock(&arena->lock);

chunk_dalloc = arena->chunk_dalloc;

if (config_stats) {

arena->stats.mapped -= size;

arena->stats.allocated_huge -= size;

arena->stats.ndalloc_huge++;

stats_cactive_sub(size);

}

arena->nactive -= (size >> LG_PAGE);

malloc_mutex_unlock(&arena->lock);

chunk_dalloc(chunk, size, arena->ind);

}

前面已做了充分介绍，这里不再赘述。

—-[ 5 – 总结: 与 Dl 的对比

1. 单核单线程分配能力上两者不相上下，甚至小块内存分配速度理论上 Dl 还略占优势。原因是 Dl 利用双向链表组织 free chunk 可以做到 O(1)，而尽管 Je 在 bitmap 上做了一定优化，但不能做到常数时间。

2. 多核多线程下，Je 可以秒杀 Dl。arena 的加入既可以避免 false sharing，又可以减少线程间 lock contention。另外，tcache 也是可以大幅加快多线程分配速度的技术。这些 Dl 完全不具备竞争力。

3. 系统内存交换效率上也是 Je 占明显优势。Je 使用 mmap/madvise 的组合要比 Dl 使用 sbrk/mmap/munmap 灵活的多。实际对系统的压力也更小。另外，Dl 使用 dss->mmap，追求的是速度，而 Je 相反 mmap->dss，为的是灵活性。

4. 小块内存的碎片抑制上双方做的都不错，但总体上个人觉得 Je 更好一些。首先 dalloc 时，两者对空闲内存都可以实时 coalesce。alloc 时 Dl 依靠 dv 约束外部碎片，Je 更简单暴力，直接在固定的 small runs 里分配。

两相比较，dv 的候选者是随机的，大小不固定，如果选择比较小的 chunk，效果其实有限。更甚者，当找不到 dv 时，Dl 会随意切割 top-most space，通常这不利于 heap trim。

而 small runs 则是固定大小，同时是页面的整数倍，对外部碎片的约束力和规整度上都更好。

但 Dl 的优势在算法更简单，速度更快。无论是 coalesce 还是 split 代价都很低。在 Je 中有可能因为分配 8byte 的内存而实际去分配并初始化 4k 甚至 4M 的空间。

5. 大块内存分配能力上，Dl 使用 dst 管理，而 Je 采用 rb tree。原理上，据说 rb tree 的 cache 亲和力较差，不适合 memory allocator。我没有仔细研究 Je 的 rb tree 实现有何过人之处，暂且认为各有千秋吧。可以肯定的是 Je 的 large/huge region 具有比 Dl 更高的内部碎片，皆因为其更规整的 size class 划分导致的。

6. 说到 size class，可以看到 Je 的划分明显比 Dl 更细致，tiny/small/large/huge 四种分类能兼顾更多的内存使用模型。且根据不同架构和配置，可以灵活改变划分方式，具有更好的兼容性。Dl 划分的相对粗糙很多且比较固定。一方面可能在当时 256byte 以上就可以算作大块的分配了吧。另一方面某种程度是碍于算法的限制。比如在 boundary tag 中为了容纳更多的信息，就不能小于 8byte（实际有效的最小 chunk 是 16byte），bin 数量不得多余 31 个也是基于位运算的方式。

7. bookkeeping 占用上 Dl 因为算法简单，本应该占用更少内存。但由于 boundary tag 本身导致的占用，chunk 数量越多，bookkeeping 就越大。再考虑到系统回收效率上的劣势，应该说，应用内存占用越大，尤其是小内存使用量越多，运行时间越长，Dl 相对于 Je 内存使用量倾向越大。

8. 安全健壮性。只说一点，boundary tag 是原罪，其他的可以免谈了。

–[ 附: 快速调试 Jemalloc

一个简单的调试 Je 的方法是以静态库的方式将其编译到你的应用程序中。先编译 Je 的静态库，在源码目录下执行，

./configure
make
make install

./configure

make

make install

就可以编译并安装 Je 到系统路径。调试还必须打开一些选项，例如，

./configure --enable-debug  --with-jemalloc-prefix=<prefix>

1	./configure --enable-debug --with-jemalloc-prefix=<prefix>

这些选项的意义可以参考 INSTALL 文档。比如，

–disable-tcache 是否禁用 tcache，对调试非 tcache 流程有用。
–disable-prof 是否禁用 heap profile。
–enable-debug 打开调试模式，启动 assert 并关闭优化。
–with-jemalloc-prefix 将编译出的 malloc 加上设定的前缀，以区别 C 库的调用。

之后就可以将其编译到你的代码中，如，

gcc main.c /usr/local/lib/libjemalloc.a -std=c99 -O0 -g3 -pthread -o jhello

1	gcc main.c /usr/local/lib/libjemalloc.a -std=c99 -O0 -g3 -pthread -o jhello

打赏赞

发表回复取消回复

上一篇：六一杂记

下一篇：发现了一个牛人，和她的有趣项目

INSIDE OF JEMALLOC The Algorithm and Implementation of Jemalloc

–[ Table of contents

–[ 1 – 简介

–[ 2 – Basic structures

—-[ 2.1 – Overview

—-[ 2.2 – Arena (arena_t)

——[ 2.2.1 – CPU Cache-Line

——[ 2.2.2 – Arena 原理

——[ 2.2.3 – choose_arena

——[ 2.2.4 – Arena 结构

—-[ 2.3 – Chunk (arena_chunk_t)

——[ 2.3.1 – overview

——[ 2.3.2 – Chunk 结构

——[ 2.3.3 – chunk map (arena_chunk_map_t)

—-[ 2.4 – Run (arena_run_t)

——[ 2.4.1 – Run 结构

——[ 2.4.2 – size classes

——[ 2.4.3 – size2bin/bin2size

—-[ 2.5 – bins (arena_bin_t)

—-[ 2.6 – Thread caches (tcache_t)

—-[ 2.7 – Extent Node (extent_node_t)

—-[ 2.8 – Base

–[ 3 – Allocation

—-[ 3.1 – Overview

—-[ 3.2 – Initialize

—-[ 3.3 – Small allocation (Arena)

——[ 3.3.1 – arena_run_reg_alloc

——[ 3.3.2 – arena_bin_malloc_hard

——[ 3.3.3 – arena_bin_nonfull_run_get

——[ 3.3.4 – Small Run Alloc

——[ 3.3.5 – Chunk Alloc

—-[ 3.4 – Small allocation (tcache)

—-[ 3.5 – Large allocation

—-[ 3.6 – Huge allocation

–[ 4 – Deallocation

—-[ 4.1 – Overview

—-[ 4.2 – arena_dalloc_bin

—-[ 4.3 – small run dalloc

—-[ 4.4 – arena purge

—-[ 4.5 – arena chunk dalloc

—-[ 4.6 – large/huge dalloc

—-[ 5 – 总结: 与 Dl 的对比

–[ 附: 快速调试 Jemalloc

发表回复 取消回复

INSIDE OF JEMALLOC
The Algorithm and Implementation of Jemalloc

发表回复取消回复