Merge branch ' docs/bpf: Add description of register liveness tracking algorithm'
Eduard Zingerman says: ==================== An overview of the register tracking liveness algorithm. Previous versions posted here: [1], [2], [3]. - Changes from RFC to v2 (suggested by Andrii Nakryiko): - wording corrected to use term "stack slot" instead of "stack spill"; - parentage chain diagram updated to show nil links for frame #1; - added example for non-BPF_DW writes behavior; - explanation in "Read marks propagation for cache hits" is reworked. - Changes from v2 to v3: - lot's of grammatical / wording fixes as suggested by David Vernet; - "Register parentage chains" section is fixed to reflect what happens to r1-r5 when function call is processed (as suggested by David and Alexei); - Example in "Liveness marks tracking" section updated to explain why partial writes should not lead to REG_LIVE_WRITTEN marks (suggested by David); - "Read marks propagation for cache hits" section updates: - Explanation updated to hint why read marks should be propagated before jumping to example (suggested by David); - Removed box around B/D in the diagram updated (suggested by Alexei). - Changes from v3 to v4 (suggested by Edward Cree): - register parentage chain diagram updated to explain why r6 mark is not propagated; - read mark propagation algorithm pseudo-code fixed to correctly show "if state->live & REG_LIVE_WRITTEN" stop condition; - general wording improvements in section "Liveness marks tracking". [1] https://lore.kernel.org/bpf/20230124220343.2942203-1-eddyz87@gmail.com/ [2] https://lore.kernel.org/bpf/20230130182400.630997-1-eddyz87@gmail.com/ [3] https://lore.kernel.org/bpf/20230131181118.733845-1-eddyz87@gmail.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit is contained in:
commit
d9e44c324c
|
@ -316,6 +316,301 @@ Pruning considers not only the registers but also the stack (and any spilled
|
|||
registers it may hold). They must all be safe for the branch to be pruned.
|
||||
This is implemented in states_equal().
|
||||
|
||||
Some technical details about state pruning implementation could be found below.
|
||||
|
||||
Register liveness tracking
|
||||
--------------------------
|
||||
|
||||
In order to make state pruning effective, liveness state is tracked for each
|
||||
register and stack slot. The basic idea is to track which registers and stack
|
||||
slots are actually used during subseqeuent execution of the program, until
|
||||
program exit is reached. Registers and stack slots that were never used could be
|
||||
removed from the cached state thus making more states equivalent to a cached
|
||||
state. This could be illustrated by the following program::
|
||||
|
||||
0: call bpf_get_prandom_u32()
|
||||
1: r1 = 0
|
||||
2: if r0 == 0 goto +1
|
||||
3: r0 = 1
|
||||
--- checkpoint ---
|
||||
4: r0 = r1
|
||||
5: exit
|
||||
|
||||
Suppose that a state cache entry is created at instruction #4 (such entries are
|
||||
also called "checkpoints" in the text below). The verifier could reach the
|
||||
instruction with one of two possible register states:
|
||||
|
||||
* r0 = 1, r1 = 0
|
||||
* r0 = 0, r1 = 0
|
||||
|
||||
However, only the value of register ``r1`` is important to successfully finish
|
||||
verification. The goal of the liveness tracking algorithm is to spot this fact
|
||||
and figure out that both states are actually equivalent.
|
||||
|
||||
Data structures
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Liveness is tracked using the following data structures::
|
||||
|
||||
enum bpf_reg_liveness {
|
||||
REG_LIVE_NONE = 0,
|
||||
REG_LIVE_READ32 = 0x1,
|
||||
REG_LIVE_READ64 = 0x2,
|
||||
REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
|
||||
REG_LIVE_WRITTEN = 0x4,
|
||||
REG_LIVE_DONE = 0x8,
|
||||
};
|
||||
|
||||
struct bpf_reg_state {
|
||||
...
|
||||
struct bpf_reg_state *parent;
|
||||
...
|
||||
enum bpf_reg_liveness live;
|
||||
...
|
||||
};
|
||||
|
||||
struct bpf_stack_state {
|
||||
struct bpf_reg_state spilled_ptr;
|
||||
...
|
||||
};
|
||||
|
||||
struct bpf_func_state {
|
||||
struct bpf_reg_state regs[MAX_BPF_REG];
|
||||
...
|
||||
struct bpf_stack_state *stack;
|
||||
}
|
||||
|
||||
struct bpf_verifier_state {
|
||||
struct bpf_func_state *frame[MAX_CALL_FRAMES];
|
||||
struct bpf_verifier_state *parent;
|
||||
...
|
||||
}
|
||||
|
||||
* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new
|
||||
verifier state creation;
|
||||
|
||||
* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is
|
||||
defined by some instruction verified between this verifier state's parent and
|
||||
verifier state itself;
|
||||
|
||||
* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot)
|
||||
is read by a some child state of this verifier state;
|
||||
|
||||
* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid
|
||||
processing same verifier state multiple times and for some sanity checks;
|
||||
|
||||
* ``->live`` field values are formed by combining ``enum bpf_reg_liveness``
|
||||
values using bitwise or.
|
||||
|
||||
Register parentage chains
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
In order to propagate information between parent and child states, a *register
|
||||
parentage chain* is established. Each register or stack slot is linked to a
|
||||
corresponding register or stack slot in its parent state via a ``->parent``
|
||||
pointer. This link is established upon state creation in ``is_state_visited()``
|
||||
and might be modified by ``set_callee_state()`` called from
|
||||
``__check_func_call()``.
|
||||
|
||||
The rules for correspondence between registers / stack slots are as follows:
|
||||
|
||||
* For the current stack frame, registers and stack slots of the new state are
|
||||
linked to the registers and stack slots of the parent state with the same
|
||||
indices.
|
||||
|
||||
* For the outer stack frames, only caller saved registers (r6-r9) and stack
|
||||
slots are linked to the registers and stack slots of the parent state with the
|
||||
same indices.
|
||||
|
||||
* When function call is processed a new ``struct bpf_func_state`` instance is
|
||||
allocated, it encapsulates a new set of registers and stack slots. For this
|
||||
new frame, parent links for r6-r9 and stack slots are set to nil, parent links
|
||||
for r1-r5 are set to match caller r1-r5 parent links.
|
||||
|
||||
This could be illustrated by the following diagram (arrows stand for
|
||||
``->parent`` pointers)::
|
||||
|
||||
... ; Frame #0, some instructions
|
||||
--- checkpoint #0 ---
|
||||
1 : r6 = 42 ; Frame #0
|
||||
--- checkpoint #1 ---
|
||||
2 : call foo() ; Frame #0
|
||||
... ; Frame #1, instructions from foo()
|
||||
--- checkpoint #2 ---
|
||||
... ; Frame #1, instructions from foo()
|
||||
--- checkpoint #3 ---
|
||||
exit ; Frame #1, return from foo()
|
||||
3 : r1 = r6 ; Frame #0 <- current state
|
||||
|
||||
+-------------------------------+-------------------------------+
|
||||
| Frame #0 | Frame #1 |
|
||||
Checkpoint +-------------------------------+-------------------------------+
|
||||
#0 | r0 | r1-r5 | r6-r9 | fp-8 ... |
|
||||
+-------------------------------+
|
||||
^ ^ ^ ^
|
||||
| | | |
|
||||
Checkpoint +-------------------------------+
|
||||
#1 | r0 | r1-r5 | r6-r9 | fp-8 ... |
|
||||
+-------------------------------+
|
||||
^ ^ ^
|
||||
|_______|_______|_______________
|
||||
| | |
|
||||
nil nil | | | nil nil
|
||||
| | | | | | |
|
||||
Checkpoint +-------------------------------+-------------------------------+
|
||||
#2 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
|
||||
+-------------------------------+-------------------------------+
|
||||
^ ^ ^ ^ ^
|
||||
nil nil | | | | |
|
||||
| | | | | | |
|
||||
Checkpoint +-------------------------------+-------------------------------+
|
||||
#3 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
|
||||
+-------------------------------+-------------------------------+
|
||||
^ ^
|
||||
nil nil | |
|
||||
| | | |
|
||||
Current +-------------------------------+
|
||||
state | r0 | r1-r5 | r6-r9 | fp-8 ... |
|
||||
+-------------------------------+
|
||||
\
|
||||
r6 read mark is propagated via these links
|
||||
all the way up to checkpoint #1.
|
||||
The checkpoint #1 contains a write mark for r6
|
||||
because of instruction (1), thus read propagation
|
||||
does not reach checkpoint #0 (see section below).
|
||||
|
||||
Liveness marks tracking
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
For each processed instruction, the verifier tracks read and written registers
|
||||
and stack slots. The main idea of the algorithm is that read marks propagate
|
||||
back along the state parentage chain until they hit a write mark, which 'screens
|
||||
off' earlier states from the read. The information about reads is propagated by
|
||||
function ``mark_reg_read()`` which could be summarized as follows::
|
||||
|
||||
mark_reg_read(struct bpf_reg_state *state, ...):
|
||||
parent = state->parent
|
||||
while parent:
|
||||
if state->live & REG_LIVE_WRITTEN:
|
||||
break
|
||||
if parent->live & REG_LIVE_READ64:
|
||||
break
|
||||
parent->live |= REG_LIVE_READ64
|
||||
state = parent
|
||||
parent = state->parent
|
||||
|
||||
Notes:
|
||||
|
||||
* The read marks are applied to the **parent** state while write marks are
|
||||
applied to the **current** state. The write mark on a register or stack slot
|
||||
means that it is updated by some instruction in the straight-line code leading
|
||||
from the parent state to the current state.
|
||||
|
||||
* Details about REG_LIVE_READ32 are omitted.
|
||||
|
||||
* Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`)
|
||||
might override the first parent link. Please refer to the comments in the
|
||||
``propagate_liveness()`` and ``mark_reg_read()`` source code for further
|
||||
details.
|
||||
|
||||
Because stack writes could have different sizes ``REG_LIVE_WRITTEN`` marks are
|
||||
applied conservatively: stack slots are marked as written only if write size
|
||||
corresponds to the size of the register, e.g. see function ``save_register_state()``.
|
||||
|
||||
Consider the following example::
|
||||
|
||||
0: (*u64)(r10 - 8) = 0 ; define 8 bytes of fp-8
|
||||
--- checkpoint #0 ---
|
||||
1: (*u32)(r10 - 8) = 1 ; redefine lower 4 bytes
|
||||
2: r1 = (*u32)(r10 - 8) ; read lower 4 bytes defined at (1)
|
||||
3: r2 = (*u32)(r10 - 4) ; read upper 4 bytes defined at (0)
|
||||
|
||||
As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should
|
||||
it be otherwise, the algorithm above wouldn't be able to propagate the read mark
|
||||
from (3) to checkpoint #0.
|
||||
|
||||
Once the ``BPF_EXIT`` instruction is reached ``update_branch_counts()`` is
|
||||
called to update the ``->branches`` counter for each verifier state in a chain
|
||||
of parent verifier states. When the ``->branches`` counter reaches zero the
|
||||
verifier state becomes a valid entry in a set of cached verifier states.
|
||||
|
||||
Each entry of the verifier states cache is post-processed by a function
|
||||
``clean_live_states()``. This function marks all registers and stack slots
|
||||
without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``.
|
||||
Registers/stack slots marked in this way are ignored in function ``stacksafe()``
|
||||
called from ``states_equal()`` when a state cache entry is considered for
|
||||
equivalence with a current state.
|
||||
|
||||
Now it is possible to explain how the example from the beginning of the section
|
||||
works::
|
||||
|
||||
0: call bpf_get_prandom_u32()
|
||||
1: r1 = 0
|
||||
2: if r0 == 0 goto +1
|
||||
3: r0 = 1
|
||||
--- checkpoint[0] ---
|
||||
4: r0 = r1
|
||||
5: exit
|
||||
|
||||
* At instruction #2 branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }``
|
||||
is pushed to states processing queue (pc stands for program counter).
|
||||
|
||||
* At instruction #4:
|
||||
|
||||
* ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``;
|
||||
* ``checkpoint[0].r0`` is marked as written;
|
||||
* ``checkpoint[0].r1`` is marked as read;
|
||||
|
||||
* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
|
||||
by ``clean_live_states()``. After this processing ``checkpoint[0].r0`` has a
|
||||
read mark and all other registers and stack slots are marked as ``NOT_INIT``
|
||||
or ``STACK_INVALID``
|
||||
|
||||
* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue
|
||||
and is compared against a cached state ``{ r1 == 0, pc == 4 }``, the states
|
||||
are considered equivalent.
|
||||
|
||||
.. _read_marks_for_cache_hits:
|
||||
|
||||
Read marks propagation for cache hits
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Another point is the handling of read marks when a previously verified state is
|
||||
found in the states cache. Upon cache hit verifier must behave in the same way
|
||||
as if the current state was verified to the program exit. This means that all
|
||||
read marks, present on registers and stack slots of the cached state, must be
|
||||
propagated over the parentage chain of the current state. Example below shows
|
||||
why this is important. Function ``propagate_liveness()`` handles this case.
|
||||
|
||||
Consider the following state parentage chain (S is a starting state, A-E are
|
||||
derived states, -> arrows show which state is derived from which)::
|
||||
|
||||
r1 read
|
||||
<------------- A[r1] == 0
|
||||
C[r1] == 0
|
||||
S ---> A ---> B ---> exit E[r1] == 1
|
||||
|
|
||||
` ---> C ---> D
|
||||
|
|
||||
` ---> E ^
|
||||
|___ suppose all these
|
||||
^ states are at insn #Y
|
||||
|
|
||||
suppose all these
|
||||
states are at insn #X
|
||||
|
||||
* Chain of states ``S -> A -> B -> exit`` is verified first.
|
||||
|
||||
* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is
|
||||
propagated up to state ``A``.
|
||||
|
||||
* When chain of states ``C -> D`` is verified the state ``D`` turns out to be
|
||||
equivalent to state ``B``.
|
||||
|
||||
* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state
|
||||
``C`` might get mistakenly marked as equivalent to state ``E`` even though
|
||||
values for register ``r1`` differ between ``C`` and ``E``.
|
||||
|
||||
Understanding eBPF verifier messages
|
||||
====================================
|
||||
|
||||
|
|
Loading…
Reference in New Issue