Commit 34b2021

netoptimizer authored and borkmann committed
bpf: Add BPF-helper for MTU checking
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.

The SKB object is complex and the skb->len value (accessible from the
BPF-prog) also includes the length of any extra GRO/GSO segments, but
without taking into account that these GRO/GSO segments get transport
(L4) and network (L3) headers added before being transmitted. Thus,
this BPF-helper is created such that the BPF-programmer doesn't need to
handle these details in the BPF-prog.

The API is designed to help the BPF-programmer who wants to do packet
context size changes, which involve other helpers. These other helpers
usually do a delta size adjustment. This helper also supports a delta
size (len_diff), which allows the BPF-programmer to reuse arguments
needed by these other helpers, and perform the MTU check prior to doing
any actual size adjustment of the packet context.

It is on purpose that we allow the len adjustment to produce a negative
result, which will pass the MTU check. This might seem weird, but it's
not this helper's responsibility to "catch" wrong len_diff adjustments.
Other helpers will take care of these checks, if the BPF-programmer
chooses to do the actual size adjustment.

V14:
 - Improve man-page desc of len_diff.

V13:
 - Enforce flag BPF_MTU_CHK_SEGS cannot use len_diff.

V12:
 - Simplify segment check that calls skb_gso_validate_network_len.
 - Helpers should return long

V9:
 - Use dev->hard_header_len (instead of ETH_HLEN)
 - Annotate with unlikely req from Daniel
 - Fix logic error using skb_gso_validate_network_len from Daniel

V6:
 - Took John's advice and dropped BPF_MTU_CHK_RELAX
 - Returned MTU is kept at L3-level (like fib_lookup)

V4: Lot of changes
 - ifindex 0 now use current netdev for MTU lookup
 - rename helper from bpf_mtu_check to bpf_check_mtu
 - fix bug for GSO pkt length (as skb->len is total len)
 - remove __bpf_len_adj_positive, simply allow negative len adj

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/161287790461.790810.3429728639563297353.stgit@firesoul
1 parent e1850ea commit 34b2021
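
The len_diff behaviour described in the commit message can be illustrated with a small userspace model. This is only a sketch of the length comparison, not kernel or BPF code: the function name model_len_check and the MODEL_* constants are hypothetical, and the frame sizes in the usage note are made-up examples.

```c
#include <stdint.h>

/* Local stand-ins for two outcomes; the values follow enum bpf_check_mtu_ret. */
enum { MODEL_MTU_OK = 0, MODEL_MTU_FRAG_NEEDED = 1 };

/*
 * Userspace model of the length comparison only (hypothetical helper).
 *   pkt_len:         current L2 frame length in bytes
 *   mtu:             device MTU, which is an L3 size
 *   hard_header_len: device L2 header length (14 for plain Ethernet)
 *   len_diff:        planned size change, may be negative
 */
static int model_len_check(uint32_t pkt_len, uint32_t mtu,
                           uint32_t hard_header_len, int32_t len_diff)
{
    int dev_len = (int)(mtu + hard_header_len);   /* largest allowed L2 length */
    int new_len = (int)pkt_len + len_diff;        /* may go negative */

    /* A negative new_len is still <= dev_len, so it passes by design. */
    return new_len <= dev_len ? MODEL_MTU_OK : MODEL_MTU_FRAG_NEEDED;
}
```

With an MTU of 1500 and a 14-byte L2 header, a 1514-byte frame passes with len_diff 0, fails with len_diff +20, and a 100-byte frame with len_diff -2000 still passes, which matches the "negative result passes" rule above.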

3 files changed: +264 -0 lines changed
include/uapi/linux/bpf.h

Lines changed: 75 additions & 0 deletions
@@ -3847,6 +3847,69 @@ union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
+ *	Description
+
+ *		Check ctx packet size against exceeding MTU of net device (based
+ *		on *ifindex*).  This helper will likely be used in combination
+ *		with helpers that adjust/change the packet size.
+ *
+ *		The argument *len_diff* can be used for querying with a planned
+ *		size change. This allows to check MTU prior to changing packet
+ *		ctx. Providing an *len_diff* adjustment that is larger than the
+ *		actual packet size (resulting in negative packet size) will in
+ *		principle not exceed the MTU, why it is not considered a
+ *		failure.  Other BPF-helpers are needed for performing the
+ *		planned size change, why the responsability for catch a negative
+ *		packet size belong in those helpers.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device.  This is practical if this isn't
+ *		used prior to redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag will only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often knows as GSO skb), then MTU check is harder to
+ *			check at this point, because in transmit path it is
+ *			possible for the skb packet to get re-segmented
+ *			(depending on net device features).  This could still be
+ *			a MTU violation, so this flag enables performing MTU
+ *			check against segments, with a different violation
+ *			return code to tell it apart. Check cannot use len_diff.
+ *
+ *		On return *mtu_len* pointer contains the MTU value of the net
+ *		device.  Remember the net device configured MTU is the L3 size,
+ *		which is returned here and XDP and TX length operate at L2.
+ *		Helper take this into account for you, but remember when using
+ *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
+ *		pointer and be initialized (to zero), else verifier will reject
+ *		BPF program.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_len* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_len* not updated)
+ *
+ *		MTU violations return positive values, but also populate MTU
+ *		value in *mtu_len* pointer, as this can be needed for
+ *		implementing PMTU handing:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN) \
 	FN(unspec), \
@@ -4012,6 +4075,7 @@ union bpf_attr {
 	FN(ktime_get_coarse_ns), \
 	FN(ima_inode_hash), \
 	FN(sock_from_file), \
+	FN(check_mtu), \
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5045,6 +5109,17 @@ struct bpf_redir_neigh {
 	};
 };
 
+/* bpf_check_mtu flags*/
+enum bpf_check_mtu_flags {
+	BPF_MTU_CHK_SEGS = (1U << 0),
+};
+
+enum bpf_check_mtu_ret {
+	BPF_MTU_CHK_RET_SUCCESS, /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED, /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_SEGS_TOOBIG, /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */
 	BPF_FD_TYPE_TRACEPOINT, /* tp name */
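
The Return section documented above distinguishes three cases: negative (invalid input, *mtu_len* untouched), zero (success), and positive (MTU violation, *mtu_len* populated). A caller might branch on them roughly as sketched below. This is a plain userspace illustration under assumptions: the model_* names and the ACT_* actions are hypothetical and are not part of the kernel API, and the enum only mirrors the ordering of enum bpf_check_mtu_ret.

```c
#include <stdint.h>

/* Local mirror of enum bpf_check_mtu_ret (same ordering as in the diff). */
enum model_check_mtu_ret {
    MODEL_RET_SUCCESS,
    MODEL_RET_FRAG_NEEDED,
    MODEL_RET_SEGS_TOOBIG,
};

/* Hypothetical actions a forwarding program could take. */
enum model_action {
    ACT_FORWARD,
    ACT_SEND_PMTU_ERROR,
    ACT_DROP,
};

/*
 * Decide what to do with the helper's return value.
 *   ret:      value returned by the MTU check
 *   mtu_len:  MTU written back by the helper (L3 size)
 *   pmtu_out: receives the MTU to advertise when a violation is reported
 */
static enum model_action model_handle_ret(long ret, uint32_t mtu_len,
                                          uint32_t *pmtu_out)
{
    if (ret < 0)
        return ACT_DROP;            /* invalid argument, mtu_len not updated */
    if (ret == MODEL_RET_SUCCESS)
        return ACT_FORWARD;

    /* Positive value: violation, but mtu_len is valid and usable for PMTU. */
    *pmtu_out = mtu_len;
    return ACT_SEND_PMTU_ERROR;
}
```

Because the returned MTU is the L3 size, a caller that compares it against an L2 frame length would still need to add the link-layer header length itself.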

net/core/filter.c

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5637,6 +5637,116 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
56375637
.arg4_type = ARG_ANYTHING,
56385638
};
56395639

5640+
static struct net_device *__dev_via_ifindex(struct net_device *dev_curr,
5641+
u32 ifindex)
5642+
{
5643+
struct net *netns = dev_net(dev_curr);
5644+
5645+
/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
5646+
if (ifindex == 0)
5647+
return dev_curr;
5648+
5649+
return dev_get_by_index_rcu(netns, ifindex);
5650+
}
5651+
5652+
BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
5653+
u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
5654+
{
5655+
int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
5656+
struct net_device *dev = skb->dev;
5657+
int skb_len, dev_len;
5658+
int mtu;
5659+
5660+
if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
5661+
return -EINVAL;
5662+
5663+
if (unlikely(flags & BPF_MTU_CHK_SEGS && len_diff))
5664+
return -EINVAL;
5665+
5666+
dev = __dev_via_ifindex(dev, ifindex);
5667+
if (unlikely(!dev))
5668+
return -ENODEV;
5669+
5670+
mtu = READ_ONCE(dev->mtu);
5671+
5672+
dev_len = mtu + dev->hard_header_len;
5673+
skb_len = skb->len + len_diff; /* minus result pass check */
5674+
if (skb_len <= dev_len) {
5675+
ret = BPF_MTU_CHK_RET_SUCCESS;
5676+
goto out;
5677+
}
5678+
/* At this point, skb->len exceed MTU, but as it include length of all
5679+
* segments, it can still be below MTU. The SKB can possibly get
5680+
* re-segmented in transmit path (see validate_xmit_skb). Thus, user
5681+
* must choose if segs are to be MTU checked.
5682+
*/
5683+
if (skb_is_gso(skb)) {
5684+
ret = BPF_MTU_CHK_RET_SUCCESS;
5685+
5686+
if (flags & BPF_MTU_CHK_SEGS &&
5687+
!skb_gso_validate_network_len(skb, mtu))
5688+
ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
5689+
}
5690+
out:
5691+
/* BPF verifier guarantees valid pointer */
5692+
*mtu_len = mtu;
5693+
5694+
return ret;
5695+
}
5696+
5697+
BPF_CALL_5(bpf_xdp_check_mtu, struct xdp_buff *, xdp,
5698+
u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
5699+
{
5700+
struct net_device *dev = xdp->rxq->dev;
5701+
int xdp_len = xdp->data_end - xdp->data;
5702+
int ret = BPF_MTU_CHK_RET_SUCCESS;
5703+
int mtu, dev_len;
5704+
5705+
/* XDP variant doesn't support multi-buffer segment check (yet) */
5706+
if (unlikely(flags))
5707+
return -EINVAL;
5708+
5709+
dev = __dev_via_ifindex(dev, ifindex);
5710+
if (unlikely(!dev))
5711+
return -ENODEV;
5712+
5713+
mtu = READ_ONCE(dev->mtu);
5714+
5715+
/* Add L2-header as dev MTU is L3 size */
5716+
dev_len = mtu + dev->hard_header_len;
5717+
5718+
xdp_len += len_diff; /* minus result pass check */
5719+
if (xdp_len > dev_len)
5720+
ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
5721+
5722+
/* BPF verifier guarantees valid pointer */
5723+
*mtu_len = mtu;
5724+
5725+
return ret;
5726+
}
5727+
5728+
static const struct bpf_func_proto bpf_skb_check_mtu_proto = {
5729+
.func = bpf_skb_check_mtu,
5730+
.gpl_only = true,
5731+
.ret_type = RET_INTEGER,
5732+
.arg1_type = ARG_PTR_TO_CTX,
5733+
.arg2_type = ARG_ANYTHING,
5734+
.arg3_type = ARG_PTR_TO_INT,
5735+
.arg4_type = ARG_ANYTHING,
5736+
.arg5_type = ARG_ANYTHING,
5737+
};
5738+
5739+
static const struct bpf_func_proto bpf_xdp_check_mtu_proto = {
5740+
.func = bpf_xdp_check_mtu,
5741+
.gpl_only = true,
5742+
.ret_type = RET_INTEGER,
5743+
.arg1_type = ARG_PTR_TO_CTX,
5744+
.arg2_type = ARG_ANYTHING,
5745+
.arg3_type = ARG_PTR_TO_INT,
5746+
.arg4_type = ARG_ANYTHING,
5747+
.arg5_type = ARG_ANYTHING,
5748+
};
5749+
56405750
#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
56415751
static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
56425752
{
@@ -7222,6 +7332,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
72227332
return &bpf_get_socket_uid_proto;
72237333
case BPF_FUNC_fib_lookup:
72247334
return &bpf_skb_fib_lookup_proto;
7335+
case BPF_FUNC_check_mtu:
7336+
return &bpf_skb_check_mtu_proto;
72257337
case BPF_FUNC_sk_fullsock:
72267338
return &bpf_sk_fullsock_proto;
72277339
case BPF_FUNC_sk_storage_get:
@@ -7291,6 +7403,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
72917403
return &bpf_xdp_adjust_tail_proto;
72927404
case BPF_FUNC_fib_lookup:
72937405
return &bpf_xdp_fib_lookup_proto;
7406+
case BPF_FUNC_check_mtu:
7407+
return &bpf_xdp_check_mtu_proto;
72947408
#ifdef CONFIG_INET
72957409
case BPF_FUNC_sk_lookup_udp:
72967410
return &bpf_xdp_sk_lookup_udp_proto;
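
The control flow of the skb variant in net/core/filter.c (flag validation, the length check, then the GSO special case) can be followed more easily in a userspace model. The sketch below is an approximation under stated assumptions, not the kernel implementation: struct model_skb and its fields are hypothetical stand-ins, the device lookup is left out, and the real skb_gso_validate_network_len() call is replaced by a simple comparison of an assumed per-segment L3 length against the MTU.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MTU_CHK_SEGS (1ULL << 0)   /* mirrors BPF_MTU_CHK_SEGS */

enum { MODEL_SUCCESS, MODEL_FRAG_NEEDED, MODEL_SEGS_TOOBIG };

/* Hypothetical stand-in for the skb state the helper inspects. */
struct model_skb {
    uint32_t len;         /* total length, all GSO segments included */
    bool     is_gso;      /* models skb_is_gso() */
    uint32_t seg_l3_len;  /* assumed L3 length of the largest segment */
};

static long model_skb_check_mtu(const struct model_skb *skb, uint32_t mtu,
                                uint32_t hard_header_len, int32_t len_diff,
                                uint64_t flags, uint32_t *mtu_len)
{
    int ret = MODEL_FRAG_NEEDED;

    if (flags & ~MODEL_MTU_CHK_SEGS)
        return -EINVAL;               /* unknown flag bits */
    if ((flags & MODEL_MTU_CHK_SEGS) && len_diff)
        return -EINVAL;               /* segment check cannot use len_diff */

    if ((int)skb->len + len_diff <= (int)(mtu + hard_header_len)) {
        ret = MODEL_SUCCESS;
    } else if (skb->is_gso) {
        /* Oversized GSO skb may be re-segmented later, so it passes
         * unless the caller asked for a per-segment check that fails. */
        ret = MODEL_SUCCESS;
        if ((flags & MODEL_MTU_CHK_SEGS) && skb->seg_l3_len > mtu)
            ret = MODEL_SEGS_TOOBIG;
    }

    *mtu_len = mtu;                   /* populated on success and violation */
    return ret;
}
```

In this model a 64000-byte GSO skb passes against a 1500-byte MTU when no flag is set, and only reports a segment violation when MODEL_MTU_CHK_SEGS is set and the assumed segment length exceeds the MTU.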

tools/include/uapi/linux/bpf.h

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3847,6 +3847,69 @@ union bpf_attr {
38473847
* Return
38483848
* A pointer to a struct socket on success or NULL if the file is
38493849
* not a socket.
3850+
*
3851+
* long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
3852+
* Description
3853+
3854+
* Check ctx packet size against exceeding MTU of net device (based
3855+
* on *ifindex*). This helper will likely be used in combination
3856+
* with helpers that adjust/change the packet size.
3857+
*
3858+
* The argument *len_diff* can be used for querying with a planned
3859+
* size change. This allows to check MTU prior to changing packet
3860+
* ctx. Providing an *len_diff* adjustment that is larger than the
3861+
* actual packet size (resulting in negative packet size) will in
3862+
* principle not exceed the MTU, why it is not considered a
3863+
* failure. Other BPF-helpers are needed for performing the
3864+
* planned size change, why the responsability for catch a negative
3865+
* packet size belong in those helpers.
3866+
*
3867+
* Specifying *ifindex* zero means the MTU check is performed
3868+
* against the current net device. This is practical if this isn't
3869+
* used prior to redirect.
3870+
*
3871+
* The Linux kernel route table can configure MTUs on a more
3872+
* specific per route level, which is not provided by this helper.
3873+
* For route level MTU checks use the **bpf_fib_lookup**\ ()
3874+
* helper.
3875+
*
3876+
* *ctx* is either **struct xdp_md** for XDP programs or
3877+
* **struct sk_buff** for tc cls_act programs.
3878+
*
3879+
* The *flags* argument can be a combination of one or more of the
3880+
* following values:
3881+
*
3882+
* **BPF_MTU_CHK_SEGS**
3883+
* This flag will only works for *ctx* **struct sk_buff**.
3884+
* If packet context contains extra packet segment buffers
3885+
* (often knows as GSO skb), then MTU check is harder to
3886+
* check at this point, because in transmit path it is
3887+
* possible for the skb packet to get re-segmented
3888+
* (depending on net device features). This could still be
3889+
* a MTU violation, so this flag enables performing MTU
3890+
* check against segments, with a different violation
3891+
* return code to tell it apart. Check cannot use len_diff.
3892+
*
3893+
* On return *mtu_len* pointer contains the MTU value of the net
3894+
* device. Remember the net device configured MTU is the L3 size,
3895+
* which is returned here and XDP and TX length operate at L2.
3896+
* Helper take this into account for you, but remember when using
3897+
* MTU value in your BPF-code. On input *mtu_len* must be a valid
3898+
* pointer and be initialized (to zero), else verifier will reject
3899+
* BPF program.
3900+
*
3901+
* Return
3902+
* * 0 on success, and populate MTU value in *mtu_len* pointer.
3903+
*
3904+
* * < 0 if any input argument is invalid (*mtu_len* not updated)
3905+
*
3906+
* MTU violations return positive values, but also populate MTU
3907+
* value in *mtu_len* pointer, as this can be needed for
3908+
* implementing PMTU handing:
3909+
*
3910+
* * **BPF_MTU_CHK_RET_FRAG_NEEDED**
3911+
* * **BPF_MTU_CHK_RET_SEGS_TOOBIG**
3912+
*
38503913
*/
38513914
#define __BPF_FUNC_MAPPER(FN) \
38523915
FN(unspec), \
@@ -4012,6 +4075,7 @@ union bpf_attr {
40124075
FN(ktime_get_coarse_ns), \
40134076
FN(ima_inode_hash), \
40144077
FN(sock_from_file), \
4078+
FN(check_mtu), \
40154079
/* */
40164080

40174081
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5045,6 +5109,17 @@ struct bpf_redir_neigh {
50455109
};
50465110
};
50475111

5112+
/* bpf_check_mtu flags*/
5113+
enum bpf_check_mtu_flags {
5114+
BPF_MTU_CHK_SEGS = (1U << 0),
5115+
};
5116+
5117+
enum bpf_check_mtu_ret {
5118+
BPF_MTU_CHK_RET_SUCCESS, /* check and lookup successful */
5119+
BPF_MTU_CHK_RET_FRAG_NEEDED, /* fragmentation required to fwd */
5120+
BPF_MTU_CHK_RET_SEGS_TOOBIG, /* GSO re-segmentation needed to fwd */
5121+
};
5122+
50485123
enum bpf_task_fd_type {
50495124
BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */
50505125
BPF_FD_TYPE_TRACEPOINT, /* tp name */
