forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 456
Sync to v5.13 #399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Sync to v5.13 #399
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
When do system reboot, it calls dwc3_shutdown and the whole debugfs for dwc3 has removed first, when the gadget tries to do deinit, and remove debugfs for its endpoints, it meets NULL pointer dereference issue when call debugfs_lookup. Fix it by removing the whole dwc3 debugfs later than dwc3_drd_exit. [ 2924.958838] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000002 .... [ 2925.030994] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) [ 2925.037005] pc : inode_permission+0x2c/0x198 [ 2925.041281] lr : lookup_one_len_common+0xb0/0xf8 [ 2925.045903] sp : ffff80001276ba70 [ 2925.049218] x29: ffff80001276ba70 x28: ffff0000c01f0000 x27: 0000000000000000 [ 2925.056364] x26: ffff800011791e70 x25: 0000000000000008 x24: dead000000000100 [ 2925.063510] x23: dead000000000122 x22: 0000000000000000 x21: 0000000000000001 [ 2925.070652] x20: ffff8000122c6188 x19: 0000000000000000 x18: 0000000000000000 [ 2925.077797] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004 [ 2925.084943] x14: ffffffffffffffff x13: 0000000000000000 x12: 0000000000000030 [ 2925.092087] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f x9 : ffff8000102b2420 [ 2925.099232] x8 : 7f7f7f7f7f7f7f7f x7 : feff73746e2f6f64 x6 : 0000000000008080 [ 2925.106378] x5 : 61c8864680b583eb x4 : 209e6ec2d263dbb7 x3 : 000074756f307065 [ 2925.113523] x2 : 0000000000000001 x1 : 0000000000000000 x0 : ffff8000122c6188 [ 2925.120671] Call trace: [ 2925.123119] inode_permission+0x2c/0x198 [ 2925.127042] lookup_one_len_common+0xb0/0xf8 [ 2925.131315] lookup_one_len_unlocked+0x34/0xb0 [ 2925.135764] lookup_positive_unlocked+0x14/0x50 [ 2925.140296] debugfs_lookup+0x68/0xa0 [ 2925.143964] dwc3_gadget_free_endpoints+0x84/0xb0 [ 2925.148675] dwc3_gadget_exit+0x28/0x78 [ 2925.152518] dwc3_drd_exit+0x100/0x1f8 [ 2925.156267] dwc3_remove+0x11c/0x120 [ 2925.159851] dwc3_shutdown+0x14/0x20 [ 2925.163432] platform_shutdown+0x28/0x38 [ 2925.167360] device_shutdown+0x15c/0x378 [ 2925.171291] kernel_restart_prepare+0x3c/0x48 [ 2925.175650] kernel_restart+0x1c/0x68 [ 2925.179316] __do_sys_reboot+0x218/0x240 [ 2925.183247] __arm64_sys_reboot+0x28/0x30 [ 2925.187262] invoke_syscall+0x48/0x100 [ 2925.191017] el0_svc_common.constprop.0+0x48/0xc8 [ 2925.195726] do_el0_svc+0x28/0x88 [ 2925.199045] el0_svc+0x20/0x30 [ 2925.202104] el0_sync_handler+0xa8/0xb0 [ 2925.205942] el0_sync+0x148/0x180 [ 2925.209270] Code: a9025bf5 2a0203f5 121f0056 370802b5 (79400660) [ 2925.215372] ---[ end trace 124254d8e485a58b ]--- [ 2925.220012] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 2925.227676] Kernel Offset: disabled [ 2925.231164] CPU features: 0x00001001,20000846 [ 2925.235521] Memory Limit: none [ 2925.238580] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- Fixes: 8d396bb ("usb: dwc3: debugfs: Add and remove endpoint dirs dynamically") Cc: Jack Pham <[email protected]> Tested-by: Jack Pham <[email protected]> Signed-off-by: Peter Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 2a04276) Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
Commit b0b3b2c ("powerpc: Switch to relative jump labels") switched us to using relative jump labels. That involves changing the code, target and key members in struct jump_entry to be relative to the address of the jump_entry, rather than absolute addresses. We have two static inlines that create a struct jump_entry, arch_static_branch() and arch_static_branch_jump(), as well as an asm macro ARCH_STATIC_BRANCH, which is used by the pseries-only hypervisor tracing code. Unfortunately we missed updating the key to be a relative reference in ARCH_STATIC_BRANCH. That causes a pseries kernel to have a handful of jump_entry structs with bad key values. Instead of being a relative reference they instead hold the full address of the key. However the code doesn't expect that, it still adds the key value to the address of the jump_entry (see jump_entry_key()) expecting to get a pointer to a key somewhere in kernel data. The table of jump_entry structs sits in rodata, which comes after the kernel text. In a typical build this will be somewhere around 15MB. The address of the key will be somewhere in data, typically around 20MB. Adding the two values together gets us a pointer somewhere around 45MB. We then call static_key_set_entries() with that bad pointer and modify some members of the struct static_key we think we are pointing at. A pseries kernel is typically ~30MB in size, so writing to ~45MB won't corrupt the kernel itself. However if we're booting with an initrd, depending on the size and exact location of the initrd, we can corrupt the initrd. Depending on how exactly we corrupt the initrd it can either cause the system to not boot, or just corrupt one of the files in the initrd. The fix is simply to make the key value relative to the jump_entry struct in the ARCH_STATIC_BRANCH macro. Fixes: b0b3b2c ("powerpc: Switch to relative jump labels") Reported-by: Anastasia Kovaleva <[email protected]> Reported-by: Roman Bolshakov <[email protected]> Reported-by: Greg Kurz <[email protected]> Reported-by: Daniel Axtens <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Tested-by: Daniel Axtens <[email protected]> Tested-by: Greg Kurz <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
The proc_symlink() function returns NULL on error, it doesn't return error pointers. Fixes: 5b86d4f ("afs: Implement network namespacing") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: David Howells <[email protected]> cc: [email protected] Link: https://lore.kernel.org/r/YLjMRKX40pTrJvgf@mwanda/ Signed-off-by: Linus Torvalds <[email protected]>
xa_destroy() needs to be called to destroy a virtual EPC's page array before calling kfree() to free the virtual EPC. Currently it is not called so add the missing xa_destroy(). Fixes: 540745d ("x86/sgx: Introduce virtual EPC for use by KVM guests") Signed-off-by: Kai Huang <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Dave Hansen <[email protected]> Tested-by: Yang Zhong <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Commit 591a22c ("proc: Track /proc/$pid/attr/ opener mm_struct") we started using __mem_open() to track the mm_struct at open-time, so that we could then check it for writes. But that also ended up making the permission checks at open time much stricter - and not just for writes, but for reads too. And that in turn caused a regression for at least Fedora 29, where NIC interfaces fail to start when using NetworkManager. Since only the write side wanted the mm_struct test, ignore any failures by __mem_open() at open time, leaving reads unaffected. The write() time verification of the mm_struct pointer will then catch the failure case because a NULL pointer will not match a valid 'current->mm'. Link: https://lore.kernel.org/netdev/YMjTlp2FSJYvoyFa@unreal/ Fixes: 591a22c ("proc: Track /proc/$pid/attr/ opener mm_struct") Reported-and-tested-by: Leon Romanovsky <[email protected]> Cc: Kees Cook <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Andrea Righi <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Scaled PPM conversion to PPB may (on 64bit systems) result in a value larger than s32 can hold (freq/scaled_ppm is a long). This means the kernel will not correctly reject unreasonably high ->freq values (e.g. > 4294967295ppb, 281474976645 scaled PPM). The conversion is equivalent to a division by ~66 (65.536), so the value of ppb is always smaller than ppm, but not small enough to assume narrowing the type from long -> s32 is okay. Note that reasonable user space (e.g. ptp4l) will not use such high values, anyway, 4289046510ppb ~= 4.3x, so the fix is somewhat pedantic. Fixes: d39a743 ("ptp: validate the requested frequency adjustment.") Fixes: d94ba80 ("ptp: Added a brand new class driver for ptp clocks.") Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Richard Cochran <[email protected]> Signed-off-by: David S. Miller <[email protected]>
The function get_net_ns_by_fd() could be inlined when NET_NS is not enabled. Signed-off-by: Changbin Du <[email protected]> Signed-off-by: David S. Miller <[email protected]>
This is meant to make the host side cdc_ncm interface consistently named just like the older CDC protocols: cdc_ether & cdc_ecm (and even rndis_host), which all use 'FLAG_ETHER | FLAG_POINTTOPOINT'. include/linux/usb/usbnet.h: #define FLAG_ETHER 0x0020 /* maybe use "eth%d" names */ #define FLAG_WLAN 0x0080 /* use "wlan%d" names */ #define FLAG_WWAN 0x0400 /* use "wwan%d" names */ #define FLAG_POINTTOPOINT 0x1000 /* possibly use "usb%d" names */ drivers/net/usb/usbnet.c @ line 1711: strcpy (net->name, "usb%d"); ... // heuristic: "usb%d" for links we know are two-host, // else "eth%d" when there's reasonable doubt. userspace // can rename the link if it knows better. if ((dev->driver_info->flags & FLAG_ETHER) != 0 && ((dev->driver_info->flags & FLAG_POINTTOPOINT) == 0 || (net->dev_addr [0] & 0x02) == 0)) strcpy (net->name, "eth%d"); /* WLAN devices should always be named "wlan%d" */ if ((dev->driver_info->flags & FLAG_WLAN) != 0) strcpy(net->name, "wlan%d"); /* WWAN devices should always be named "wwan%d" */ if ((dev->driver_info->flags & FLAG_WWAN) != 0) strcpy(net->name, "wwan%d"); So by using ETHER | POINTTOPOINT the interface naming is either usb%d or eth%d based on the global uniqueness of the mac address of the device. Without this 2.5gbps ethernet dongles which all seem to use the cdc_ncm driver end up being called usb%d instead of eth%d even though they're definitely not two-host. (All 1gbps & 5gbps ethernet usb dongles I've tested don't hit this problem due to use of different drivers, primarily r8152 and aqc111) Fixes tag is based purely on git blame, and is really just here to make sure this hits LTS branches newer than v4.5. Cc: Lorenzo Colitti <[email protected]> Fixes: 4d06dd5 ("cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind") Signed-off-by: Maciej Żenczykowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
When the QMI_WWAN_FLAG_PASS_THROUGH is set, netif_rx() is called from qmi_wwan_rx_fixup(). When the call to netif_rx() is successful (which is most of the time), usbnet_skb_return() is called (from rx_process()). usbnet_skb_return() will then call netif_rx() a second time for the same skb. Simplify the code and avoid the redundant netif_rx() call by changing qmi_wwan_rx_fixup() to always return 1 when QMI_WWAN_FLAG_PASS_THROUGH is set. We then leave it up to the existing infrastructure to call netif_rx(). Suggested-by: Bjørn Mork <[email protected]> Signed-off-by: Kristian Evensen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
The previous commit didn't fix the bug properly. By mistake, it replaces the pointer of the next skb in the descriptor ring instead of the current one. As a result, the two descriptors are assigned the same SKB. The error is seen during the iperf test when skb_put tries to insert a second packet and exceeds the available buffer. Fixes: c7718ee ("net: lantiq: fix memory corruption in RX ring ") Signed-off-by: Aleksander Jan Bajkowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Daniel Borkmann says: ==================== pull-request: bpf 2021-06-15 The following pull-request contains BPF updates for your *net* tree. We've added 5 non-merge commits during the last 11 day(s) which contain a total of 10 files changed, 115 insertions(+), 16 deletions(-). The main changes are: 1) Fix marking incorrect umem ring as done in libbpf's xsk_socket__create_shared() helper, from Kev Jackson. 2) Fix oob leakage under a spectre v1 type confusion attack, from Daniel Borkmann. ==================== Signed-off-by: David S. Miller <[email protected]>
i.MX8MM cannot detect certain CDP USB HUBs. usbmisc_imx.c driver is not following CDP timing requirements defined by USB BC 1.2 specification and section 3.2.4 Detection Timing CDP. During Primary Detection the i.MX device should turn on VDP_SRC and IDM_SINK for a minimum of 40ms (TVDPSRC_ON). After a time of TVDPSRC_ON, the i.MX is allowed to check the status of the D- line. Current implementation is waiting between 1ms and 2ms, and certain BC 1.2 complaint USB HUBs cannot be detected. Increase delay to 40ms allowing enough time for primary detection. During secondary detection the i.MX is required to disable VDP_SRC and IDM_SNK, and enable VDM_SRC and IDP_SINK for at least 40ms (TVDMSRC_ON). Current implementation is not disabling VDP_SRC and IDM_SNK, introduce disable sequence in imx7d_charger_secondary_detection() function. VDM_SRC and IDP_SINK should be enabled for at least 40ms (TVDMSRC_ON). Increase delay allowing enough time for detection. Cc: <[email protected]> Fixes: 746f316 ("usb: chipidea: introduce imx7d USB charger detection") Signed-off-by: Breno Lima <[email protected]> Signed-off-by: Jun Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Peter Chen <[email protected]>
…l/git/peter.chen/usb into usb-linus Peter writes: One bug fix for USB charger detection at imx7d and imx8m series SoCs * tag 'usb-v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/peter.chen/usb: usb: chipidea: imx: Fix Battery Charger 1.2 CDP detection
Commit 28e1745 ("printk: rename vprintk_func to vprintk") while improving readability by removing vprintk indirection, inadvertently placed the EXPORT_SYMBOL() for the newly renamed function at the end of the file. For reader sanity, and as is convention move the EXPORT_SYMBOL() declaration just after the end of the function. Fixes: 28e1745 ("printk: rename vprintk_func to vprintk") Signed-off-by: Punit Agrawal <[email protected]> Acked-by: Rasmus Villemoes <[email protected]> Acked-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
This patch fixes a Use-after-Free found by the syzbot. The problem is that a skb is taken from the per-session skb queue, without incrementing the ref count. This leads to a Use-after-Free if the skb is taken concurrently from the session queue due to a CTS. Fixes: 9d71dd0 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/r/[email protected] Cc: Hillf Danton <[email protected]> Cc: linux-stable <[email protected]> Reported-by: [email protected] Reported-by: [email protected] Signed-off-by: Oleksij Rempel <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
syzbot is reporting hung task at register_netdevice_notifier() [1] and unregister_netdevice_notifier() [2], for cleanup_net() might perform time consuming operations while CAN driver's raw/bcm/isotp modules are calling {register,unregister}_netdevice_notifier() on each socket. Change raw/bcm/isotp modules to call register_netdevice_notifier() from module's __init function and call unregister_netdevice_notifier() from module's __exit function, as with gw/j1939 modules are doing. Link: https://syzkaller.appspot.com/bug?id=391b9498827788b3cc6830226d4ff5be87107c30 [1] Link: https://syzkaller.appspot.com/bug?id=1724d278c83ca6e6df100a2e320c10d991cf2bce [2] Link: https://lore.kernel.org/r/[email protected] Cc: linux-stable <[email protected]> Reported-by: syzbot <[email protected]> Reported-by: syzbot <[email protected]> Reviewed-by: Kirill Tkhai <[email protected]> Tested-by: syzbot <[email protected]> Tested-by: Oliver Hartkopp <[email protected]> Signed-off-by: Tetsuo Handa <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
On 64-bit systems, struct bcm_msg_head has an added padding of 4 bytes between struct members count and ival1. Even though all struct members are initialized, the 4-byte hole will contain data from the kernel stack. This patch zeroes out struct bcm_msg_head before usage, preventing infoleaks to userspace. Fixes: ffd980f ("[CAN]: Add broadcast manager (bcm) protocol") Link: https://lore.kernel.org/r/trinity-7c1b2e82-e34f-4885-8060-2cd7a13769ce-1623532166177@3c-app-gmx-bs52 Cc: linux-stable <[email protected]> Signed-off-by: Norbert Slusarek <[email protected]> Acked-by: Oliver Hartkopp <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
Syzbot reported memory leak in SocketCAN driver for Microchip CAN BUS Analyzer Tool. The problem was in unfreed usb_coherent. In mcba_usb_start() 20 coherent buffers are allocated and there is nothing, that frees them: 1) In callback function the urb is resubmitted and that's all 2) In disconnect function urbs are simply killed, but URB_FREE_BUFFER is not set (see mcba_usb_start) and this flag cannot be used with coherent buffers. Fail log: | [ 1354.053291][ T8413] mcba_usb 1-1:0.0 can0: device disconnected | [ 1367.059384][ T8420] kmemleak: 20 new suspected memory leaks (see /sys/kernel/debug/kmem) So, all allocated buffers should be freed with usb_free_coherent() explicitly NOTE: The same pattern for allocating and freeing coherent buffers is used in drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c Fixes: 51f3baa ("can: mcba_usb: Add support for Microchip CAN BUS Analyzer") Link: https://lore.kernel.org/r/[email protected] Cc: linux-stable <[email protected]> Reported-and-tested-by: [email protected] Signed-off-by: Pavel Skripkin <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
In order to access the HDMI controller, we need to make sure the HSM clock is enabled. If we were to access it with the clock disabled, the CPU would completely hang, resulting in an hard crash. Since we have different code path that would require it, let's move that clock enable / disable to runtime_pm that will take care of the reference counting for us. Fixes: 4f6e3d6 ("drm/vc4: Add runtime PM support to the HDMI encoder driver") Signed-off-by: Maxime Ripard <[email protected]> Reviewed-by: Dave Stevenson <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
If the HPD GPIO is not available and drm_probe_ddc fails, we end up reading the HDMI_HOTPLUG register, but the controller might be powered off resulting in a CPU hang. Make sure we have the power domain and the HSM clock powered during the detect cycle to prevent the hang from happening. Fixes: 4f6e3d6 ("drm/vc4: Add runtime PM support to the HDMI encoder driver") Signed-off-by: Maxime Ripard <[email protected]> Reviewed-by: Dave Stevenson <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
…linux/kernel/git/kees/linux Pull clang LTO fix from Kees Cook: "It seems Clang has been scrubbing through the missing LTO IR flags for Clang 13, and the last of these 'only with LTO' flags is fixed now. I've asked that they please consider making these changes in a less 'break all the Clang kernel builds' kind of way in the future. :P Summary: - The '-warn-stack-size' option under LTO has moved in Clang 13 (Tor Vic)" * tag 'clang-features-v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: Makefile: lto: Pass -warn-stack-size only on LLD < 13.0.0
…kernel/git/vkoul/dmaengine Pull dmaengine fixes from Vinod Koul: "A bunch of driver fixes, notably: - More idxd fixes for driver unregister, error handling and bus assignment - HAS_IOMEM depends fix for few drivers - lock fix in pl330 driver - xilinx drivers fixes for initialize registers, missing dependencies and limiting descriptor IDs - mediatek descriptor management fixes" * tag 'dmaengine-fix-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: dmaengine: mediatek: use GFP_NOWAIT instead of GFP_ATOMIC in prep_dma dmaengine: mediatek: do not issue a new desc if one is still current dmaengine: mediatek: free the proper desc in desc_free handler dmaengine: ipu: fix doc warning in ipu_irq.c dmaengine: rcar-dmac: Fix PM reference leak in rcar_dmac_probe() dmaengine: idxd: Fix missing error code in idxd_cdev_open() dmaengine: stedma40: add missing iounmap() on error in d40_probe() dmaengine: SF_PDMA depends on HAS_IOMEM dmaengine: QCOM_HIDMA_MGMT depends on HAS_IOMEM dmaengine: ALTERA_MSGDMA depends on HAS_IOMEM dmaengine: idxd: Add missing cleanup for early error out in probe call dmaengine: xilinx: dpdma: Limit descriptor IDs to 16 bits dmaengine: xilinx: dpdma: Add missing dependencies to Kconfig dmaengine: stm32-mdma: fix PM reference leak in stm32_mdma_alloc_chan_resourc() dmaengine: zynqmp_dma: Fix PM reference leak in zynqmp_dma_alloc_chan_resourc() dmaengine: xilinx: dpdma: initialize registers before request_irq dmaengine: pl330: fix wrong usage of spinlock flags in dma_cyclc dmaengine: fsl-dpaa2-qdma: Fix error return code in two functions dmaengine: idxd: add missing dsa driver unregister dmaengine: idxd: add engine 'struct device' missing bus type assignment
When hugetlb page fault (under overcommitting situation) and memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race: CPU0: CPU1: gather_surplus_pages() page = alloc_surplus_huge_page() memory_failure_hugetlb() get_hwpoison_page(page) __get_hwpoison_page(page) get_page_unless_zero(page) zero = put_page_testzero(page) VM_BUG_ON_PAGE(!zero, page) enqueue_huge_page(h, page) put_page(page) __get_hwpoison_page() only checks the page refcount before taking an additional one for memory error handling, which is not enough because there's a time window where compound pages have non-zero refcount during hugetlb page initialization. So make __get_hwpoison_page() check page status a bit more for hugetlb pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags under hugetlb_lock makes sure that the hugetlb page is not transitive. It's notable that another new function, HWPoisonHandlable(), is helpful to prevent a race against other transitive page states (like a generic compound page just before PageHuge becomes true). Link: https://lkml.kernel.org/r/[email protected] Fixes: ead07f6 ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling") Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: Muchun Song <[email protected]> Acked-by: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tony Luck <[email protected]> Cc: <[email protected]> [5.12+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
I found it by pure code review, that pte_same_as_swp() of unuse_vma() didn't take uffd-wp bit into account when comparing ptes. pte_same_as_swp() returning false negative could cause failure to swapoff swap ptes that was wr-protected by userfaultfd. Link: https://lkml.kernel.org/r/[email protected] Fixes: f45ec5f ("userfaultfd: wp: support swap and page migration") Signed-off-by: Peter Xu <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: <[email protected]> [5.7+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Patch series "Actually fix freelist pointer vs redzoning", v4. This fixes redzoning vs the freelist pointer (both for middle-position and very small caches). Both are "theoretical" fixes, in that I see no evidence of such small-sized caches actually be used in the kernel, but that's no reason to let the bugs continue to exist, especially since people doing local development keep tripping over it. :) This patch (of 3): Instead of repeating "Redzone" and "Poison", clarify which sides of those zones got tripped. Additionally fix column alignment in the trailer. Before: BUG test (Tainted: G B ): Redzone overwritten ... Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........ Object (____ptrval____): f6 f4 a5 40 1d e8 ...@.. Redzone (____ptrval____): 1a aa .. Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........ After: BUG test (Tainted: G B ): Right Redzone overwritten ... Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........ Object (____ptrval____): f6 f4 a5 40 1d e8 ...@.. Redzone (____ptrval____): 1a aa .. Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........ The earlier commits that slowly resulted in the "Before" reporting were: d86bd1b ("mm/slub: support left redzone") ffc79d2 ("slub: use print_hex_dump") 2492268 ("SLUB: change error reporting format to follow lockdep loosely") Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Kees Cook <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Marco Elver <[email protected]> Cc: "Lin, Zhenpeng" <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
The redzone area for SLUB exists between s->object_size and s->inuse (which is at least the word-aligned object_size). If a cache were created with an object_size smaller than sizeof(void *), the in-object stored freelist pointer would overwrite the redzone (e.g. with boot param "slub_debug=ZF"): BUG test (Tainted: G B ): Right Redzone overwritten ----------------------------------------------------------------------------- INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200 INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620 Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........ Object (____ptrval____): f6 f4 a5 40 1d e8 ...@.. Redzone (____ptrval____): 1a aa .. Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........ Store the freelist pointer out of line when object_size is smaller than sizeof(void *) and redzoning is enabled. Additionally remove the "smaller than sizeof(void *)" check under CONFIG_DEBUG_VM in kmem_cache_sanity_check() as it is now redundant: SLAB and SLOB both handle small sizes. (Note that no caches within this size range are known to exist in the kernel currently.) Link: https://lkml.kernel.org/r/[email protected] Fixes: 81819f0 ("SLUB core") Signed-off-by: Kees Cook <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: "Lin, Zhenpeng" <[email protected]> Cc: Marco Elver <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
It turns out that SLUB redzoning ("slub_debug=Z") checks from s->object_size rather than from s->inuse (which is normally bumped to make room for the freelist pointer), so a cache created with an object size less than 24 would have the freelist pointer written beyond s->object_size, causing the redzone to be corrupted by the freelist pointer. This was very visible with "slub_debug=ZF": BUG test (Tainted: G B ): Right Redzone overwritten ----------------------------------------------------------------------------- INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200 INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620 Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........ Object (____ptrval____): 00 00 00 00 00 f6 f4 a5 ........ Redzone (____ptrval____): 40 1d e8 1a aa @.... Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........ Adjust the offset to stay within s->object_size. (Note that no caches of in this size range are known to exist in the kernel currently.) Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lore.kernel.org/lkml/[email protected]/Fixes: 89b83f2 (slub: avoid redzone when choosing freepointer location) Link: https://lore.kernel.org/lkml/CANpmjNOwZ5VpKQn+SYWovTkFB4VsT-RPwyENBmaK0dLcpqStkA@mail.gmail.com Signed-off-by: Kees Cook <[email protected]> Reported-by: Marco Elver <[email protected]> Reported-by: "Lin, Zhenpeng" <[email protected]> Tested-by: Marco Elver <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
The routine restore_reserve_on_error is called to restore reservation information when an error occurs after page allocation. The routine alloc_huge_page modifies the mapping reserve map and potentially the reserve count during allocation. If code calling alloc_huge_page encounters an error after allocation and needs to free the page, the reservation information needs to be adjusted. Currently, restore_reserve_on_error only takes action on pages for which the reserve count was adjusted(HPageRestoreReserve flag). There is nothing wrong with these adjustments. However, alloc_huge_page ALWAYS modifies the reserve map during allocation even if the reserve count is not adjusted. This can cause issues as observed during development of this patch [1]. One specific series of operations causing an issue is: - Create a shared hugetlb mapping Reservations for all pages created by default - Fault in a page in the mapping Reservation exists so reservation count is decremented - Punch a hole in the file/mapping at index previously faulted Reservation and any associated pages will be removed - Allocate a page to fill the hole No reservation entry, so reserve count unmodified Reservation entry added to map by alloc_huge_page - Error after allocation and before instantiating the page Reservation entry remains in map - Allocate a page to fill the hole Reservation entry exists, so decrement reservation count This will cause a reservation count underflow as the reservation count was decremented twice for the same index. A user would observe a very large number for HugePages_Rsvd in /proc/meminfo. This would also likely cause subsequent allocations of hugetlb pages to fail as it would 'appear' that all pages are reserved. This sequence of operations is unlikely to happen, however they were easily reproduced and observed using hacked up code as described in [1]. Address the issue by having the routine restore_reserve_on_error take action on pages where HPageRestoreReserve is not set. In this case, we need to remove any reserve map entry created by alloc_huge_page. A new helper routine vma_del_reservation assists with this operation. There are three callers of alloc_huge_page which do not currently call restore_reserve_on error before freeing a page on error paths. Add those missing calls. [1] https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 96b96a9 ("mm/hugetlb: fix huge page reservation leak in private mapping error paths" Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Mina Almasry <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Peter Xu <[email protected]> Cc: Muchun Song <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Our syzkaller trigger the "BUG_ON(!list_empty(&inode->i_wb_list))" in clear_inode: kernel BUG at fs/inode.c:519! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7) CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95 Hardware name: linux,dummy-virt (DT) pstate: 80000005 (Nzcv daif -PAN -UAO) pc : clear_inode+0x280/0x2a8 lr : clear_inode+0x280/0x2a8 Call trace: clear_inode+0x280/0x2a8 ext4_clear_inode+0x38/0xe8 ext4_free_inode+0x130/0xc68 ext4_evict_inode+0xb20/0xcb8 evict+0x1a8/0x3c0 iput+0x344/0x460 do_unlinkat+0x260/0x410 __arm64_sys_unlinkat+0x6c/0xc0 el0_svc_common+0xdc/0x3b0 el0_svc_handler+0xf8/0x160 el0_svc+0x10/0x218 Kernel panic - not syncing: Fatal exception A crash dump of this problem show that someone called __munlock_pagevec to clear page LRU without lock_page: do_mmap -> mmap_region -> do_munmap -> munlock_vma_pages_range -> __munlock_pagevec. As a result memory_failure will call identify_page_state without wait_on_page_writeback. And after truncate_error_page clear the mapping of this page. end_page_writeback won't call sb_clear_inode_writeback to clear inode->i_wb_list. That will trigger BUG_ON in clear_inode! Fix it by checking PageWriteback too to help determine should we skip wait_on_page_writeback. Link: https://lkml.kernel.org/r/[email protected] Fixes: 0bc1f8b ("hwpoison: fix the handling path of the victimized page frame that belong to non-LRU") Signed-off-by: yangerkun <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Jan Kara <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yu Kuai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
As mentioned in kernel commit 1d50e5d ("crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the formula: #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate PAGES_PER_SECTION in makedumpfile just like kernel. Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g. recently in kernel commit f0b13ee ("arm64/sparsemem: reduce SECTION_SIZE_BITS"). But user space wants a stable interface to get this info. Such info is impossible to be deduced from a crashdump vmcore. Hence append SECTION_SIZE_BITS to vmcoreinfo. Link: https://lkml.kernel.org/r/[email protected] Link: http://lists.infradead.org/pipermail/kexec/2021-June/022676.html Signed-off-by: Pingfan Liu <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Dave Young <[email protected]> Cc: Boris Petkov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: James Morse <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dave Anderson <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit 13d60f4 ("futex: Take hugepages into account when generating futex_key"), the only shared huge pages came from hugetlbfs, and the code added to deal with its exceptional page->index was put into hugetlb source. Then that was missed when 4.8 added shmem huge pages. page_to_pgoff() is what others use for this nowadays: except that, as currently written, it gives the right answer on hugetlbfs head, but nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as on head. Yes, it's unconventional to declare hugetlb_basepage_index() there in pagemap.h, rather than in hugetlb.h; but I do not expect anything but page_to_pgoff() ever to need it. [[email protected]: give hugetlb_basepage_index() prototype the correct scope] Link: https://lkml.kernel.org/r/[email protected] Fixes: 800d8c6 ("shmem: add huge pages support") Reported-by: Neel Natu <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Zhang Yi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Darren Hart <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5. I wrote this patchset to materialize what I think is the current allowable solution mentioned by the previous discussion [1]. I simply borrowed Tony's mutex patch and Aili's return code patch, then I queued another one to find error virtual address in the best effort manner. I know that this is not a perfect solution, but should work for some typical case. [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/ This patch (of 2): There can be races when multiple CPUs consume poison from the same page. The first into memory_failure() atomically sets the HWPoison page flag and begins hunting for tasks that map this page. Eventually it invalidates those mappings and may send a SIGBUS to the affected tasks. But while all that work is going on, other CPUs see a "success" return code from memory_failure() and so they believe the error has been handled and continue executing. Fix by wrapping most of the internal parts of memory_failure() in a mutex. [[email protected]: make mf_mutex local to memory_failure()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Aili Yao <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jue Wang <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
…en poisoned When memory_failure() is called with MF_ACTION_REQUIRED on the page that has already been hwpoisoned, memory_failure() could fail to send SIGBUS to the affected process, which results in infinite loop of MCEs. Currently memory_failure() returns 0 if it's called for already hwpoisoned page, then the caller, kill_me_maybe(), could return without sending SIGBUS to current process. An action required MCE is raised when the current process accesses to the broken memory, so no SIGBUS means that the current process continues to run and access to the error page again soon, so running into MCE loop. This issue can arise for example in the following scenarios: - Two or more threads access to the poisoned page concurrently. If local MCE is enabled, MCE handler independently handles the MCE events. So there's a race among MCE events, and the second or latter threads fall into the situation in question. - If there was a precedent memory error event and memory_failure() for the event failed to unmap the error page for some reason, the subsequent memory access to the error page triggers the MCE loop situation. To fix the issue, make memory_failure() return an error code when the error page has already been hwpoisoned. This allows memory error handler to control how it sends signals to userspace. And make sure that any process touching a hwpoisoned page should get a SIGBUS even in "already hwpoisoned" path of memory_failure() as is done in page fault path. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aili Yao <[email protected]> Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jue Wang <[email protected]> Cc: Tony Luck <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
…recovers Currently me_huge_page() temporary unlocks page to perform some actions then locks it again later. My testcase (which calls hard-offline on some tail page in a hugetlb, then accesses the address of the hugetlb range) showed that page allocation code detects this page lock on buddy page and printed out "BUG: Bad page state" message. check_new_page_bad() does not consider a page with __PG_HWPOISON as bad page, so this flag works as kind of filter, but this filtering doesn't work in this case because the "bad page" is not the actual hwpoisoned page. So stop locking page again. Actions to be taken depend on the page type of the error, so page unlocking should be done in ->action() callbacks. So let's make it assumed and change all existing callbacks that way. Link: https://lkml.kernel.org/r/[email protected] Fixes: commit 78bb920 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error") Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tony Luck <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
… array In the event that somebody would call this with an already fully populated page_array, the last loop iteration would do an access beyond the end of page_array. It's of course extremely unlikely that would ever be done, but this triggers my internal static analyzer. Also, if it really is not supposed to be invoked this way (i.e., with no NULL entries in page_array), the nr_populated<nr_pages check could simply be removed instead. Link: https://lkml.kernel.org/r/[email protected] Fixes: 0f87d9d ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Rasmus Villemoes <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
…ements Dan Carpenter reported the following The patch 0f87d9d: "mm/page_alloc: add an array-based interface to the bulk page allocator" from Apr 29, 2021, leads to the following static checker warning: mm/page_alloc.c:5338 __alloc_pages_bulk() warn: potentially one past the end of array 'page_array[nr_populated]' The problem can occur if an array is passed in that is fully populated. That potentially ends up allocating a single page and storing it past the end of the array. This patch returns 0 if the array is fully populated. Link: https://lkml.kernel.org/r/[email protected] Fixes: 0f87d9d ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Mel Gorman <[email protected]> Reported-by: Dan Carpenter <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Fix my name to use diacritics, since MAINTAINERS supports it. Fix my e-mail address in MAINTAINERS' marvell10g PHY driver description, I accidentally put my other e-mail address here. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marek Behún <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
…tics Some of my commits were sent with identities Marek Behun <[email protected]> Marek Behún <[email protected]> while the correct one is Marek Behún <[email protected]> Put this into mailmap so that git shortlog prints all my commits under one identity. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marek Behún <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Both of these drivers use ioport_map(), so they need to depend on HAS_IOPORT_MAP. Otherwise, they cannot be built even with COMPILE_TEST on architectures without an ioport implementation, such as ARCH=um. Reported-by: kernel test robot <[email protected]> Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Bartosz Golaszewski <[email protected]>
…el/git/tiwai/sound Pull sound fixes from Takashi Iwai: "Two small changes have been cherry-picked as a last material for 5.13: a coverage after UMN revert action and a stale MAINTAINERS entry fix" * tag 'sound-5.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: MAINTAINERS: remove Timur Tabi from Freescale SOC sound drivers ASoC: rt5645: Avoid upgrading static warnings to errors
…x/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - fix wake-up interrupt support on gpio-mxc - zero the padding bytes in a structure passed to user-space in the GPIO character device - require HAS_IOPORT_MAP in two drivers that need it to fix a Kbuild issue * tag 'gpio-fixes-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpio: AMD8111 and TQMX86 require HAS_IOPORT_MAP gpiolib: cdev: zero padding during conversion to gpioline_info_changed gpio: mxc: Fix disabled interrupt wake-up support
…x/kernel/git/dhowells/linux-fs Pull netfs fixes from David Howells: "This contains patches to fix netfs_write_begin() and afs_write_end() in the following ways: (1) In netfs_write_begin(), extract the decision about whether to skip a page out to its own helper and have that clear around the region to be written, but not clear that region. This requires the filesystem to patch it up afterwards if the hole doesn't get completely filled. (2) Use offset_in_thp() in (1) rather than manually calculating the offset into the page. (3) Due to (1), afs_write_end() now needs to handle short data write into the page by generic_perform_write(). I've adopted an analogous approach to ceph of just returning 0 in this case and letting the caller go round again. It also adds a note that (in the future) the len parameter may extend beyond the page allocated. This is because the page allocation is deferred to write_begin() and that gets to decide what size of THP to allocate." Jeff Layton points out: "The netfs fix in particular fixes a data corruption bug in cephfs" * tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: netfs: fix test for whether we can skip read when writing beyond EOF afs: Fix afs_write_end() to handle short writes
Pull ceph fixes from Ilya Dryomov: "Two regression fixes from the merge window: one in the auth code affecting old clusters and one in the filesystem for proper propagation of MDS request errors. Also included a locking fix for async creates, marked for stable" * tag 'ceph-for-5.13-rc8' of https://github.com/ceph/ceph-client: libceph: set global_id as soon as we get an auth ticket libceph: don't pass result into ac->ops->handle_reply() ceph: fix error handling in ceph_atomic_open and ceph_lookup ceph: must hold snap_rwsem when filling inode for async create
…x/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "Two more urgent FPU fixes: - prevent unprivileged userspace from reinitializing supervisor states - prepare init_fpstate, which is the buffer used when initializing FPU state, properly in case the skip-writing-state-components XSAVE* variants are used" * tag 'x86_urgent_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Make init_fpstate correct with optimized XSAVE x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate()
…/kvm Pull kvm fixes from Paolo Bonzini: "A selftests fix for ARM, and the fix for page reference count underflow. This is a very small fix that was provided by Nick Piggin and tested by myself" * tag 'for-linus-urgent' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: do not allow mapping valid but non-reference-counted pages KVM: selftests: Fix mapping length truncation in m{,un}map()
…inux/kernel/git/xen/tip Pull xen fix from Juergen Gross: "A fix for a regression introduced in 5.12: when migrating an irq related to a Xen user event to another cpu, a race might result in a WARN() triggering" * tag 'for-linus-5.13b-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/events: reset active flag for lateeoi events later
…rnel/git/rafael/linux-pm Pull device properties framework fix from Rafael Wysocki: "Fix a NULL pointer dereference introduced by a recent commit and occurring when device_remove_software_node() is used with a device that has never been registered (Heikki Krogerus)" * tag 'devprop-5.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: software node: Handle software node injection to an existing device properly
…kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "Three more driver bugfixes and an annotation fix for the core" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: robotfuzz-osif: fix control-request directions i2c: dev: Add __user annotation i2c: cp2615: check for allocation failure in cp2615_i2c_recv() i2c: i801: Ensure that SMBHSTSTS_INUSE_STS is cleared when leaving i801_access
This ioctl request reads from uffdio_continue structure written by userspace which justifies _IOC_WRITE flag. It also writes back to that structure which justifies _IOC_READ flag. See NOTEs in include/uapi/asm-generic/ioctl.h for more information. Fixes: f619147 ("userfaultfd: add UFFDIO_CONTINUE ioctl") Signed-off-by: Gleb Fotengauer-Malinovskiy <[email protected]> Acked-by: Peter Xu <[email protected]> Reviewed-by: Axel Rasmussen <[email protected]> Reviewed-by: Dmitry V. Levin <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Merge misc fixes from Andrew Morton: "24 patches, based on 4a09d38. Subsystems affected by this patch series: mm (thp, vmalloc, hugetlb, memory-failure, and pagealloc), nilfs2, kthread, MAINTAINERS, and mailmap" * emailed patches from Andrew Morton <[email protected]>: (24 commits) mailmap: add Marek's other e-mail address and identity without diacritics MAINTAINERS: fix Marek's identity again mm/page_alloc: do bulk array bounds check after checking populated elements mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array mm/hwpoison: do not lock page again when me_huge_page() successfully recovers mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned mm/memory-failure: use a mutex to avoid memory_failure() races mm, futex: fix shared futex pgoff on shmem huge page kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() kthread_worker: split code for canceling the delayed work timer mm/vmalloc: unbreak kasan vmalloc support KVM: s390: prepare for hugepage vmalloc mm/vmalloc: add vmalloc_no_huge nilfs2: fix memory leak in nilfs_sysfs_delete_device_group mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes mm: page_vma_mapped_walk(): get vma_address_end() earlier mm: page_vma_mapped_walk(): use goto instead of while (1) mm: page_vma_mapped_walk(): add a level of indentation mm: page_vma_mapped_walk(): crossing page table boundary ...
…it/jejb/scsi Pull SCSI fixes from James Bottomley: "Two small fixes, both in upper layer drivers (scsi disk and cdrom). The sd one is fixing a commit changing revalidation that came from the block tree a while ago (5.10) and the sr one adds handling of a condition we didn't previously handle for manually removed media" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: sd: Call sd_revalidate_disk() for ioctl(BLKRRPART) scsi: sr: Return appropriate error code when disk is ejected
…nel/git/linusw/linux-pinctrl Pull pin control fixes from Linus Walleij: "Two last-minute fixes: - Put an fwnode in the errorpath in the SGPIO driver - Fix the number of GPIO lines per bank in the STM32 driver" * tag 'pinctrl-v5.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: stm32: fix the reported number of GPIO lines per bank pinctrl: microchip-sgpio: Put fwnode in error case during ->probe()
…git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix a couple of late pt_regs flags handling findings of conversion to generic entry. - Fix potential register clobbering in stack switch helper. - Fix thread/group masks for offline cpus. - Fix cleanup of mdev resources when remove callback is invoked in vfio-ap code. * tag 's390-5.13-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/stack: fix possible register corruption with stack switch helper s390/topology: clear thread/group maps for offline cpus s390/vfio-ap: clean up mdev resources when remove callback invoked s390: clear pt_regs::flags on irq entry s390: fix system call restart with multiple signals
This reverts commits 4bad58e (and 399f8dd, which tried to fix it). I do not believe these are correct, and I'm about to release 5.13, so am reverting them out of an abundance of caution. The locking is odd, and appears broken. On the allocation side (in __sigqueue_alloc()), the locking is somewhat straightforward: it depends on sighand->siglock. Since one caller doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid the case with no locks held. On the freeing side (in sigqueue_cache_or_free()), there is no locking at all, and the logic instead depends on 'current' being a single thread, and not able to race with itself. To make things more exciting, there's also the data race between freeing a signal and allocating one, which is handled by using WRITE_ONCE() and READ_ONCE(), and being mutually exclusive wrt the initial state (ie freeing will only free if the old state was NULL, while allocating will obviously only use the value if it was non-NULL, so only one or the other will actually act on the value). However, while the free->alloc paths do seem mutually exclusive thanks to just the data value dependency, it's not clear what the memory ordering constraints are on it. Could writes from the previous allocation possibly be delayed and seen by the new allocation later, causing logical inconsistencies? So it's all very exciting and unusual. And in particular, it seems that the freeing side is incorrect in depending on "current" being single-threaded. Yes, 'current' is a single thread, but in the presense of asynchronous events even a single thread can have data races. And such asynchronous events can and do happen, with interrupts causing signals to be flushed and thus free'd (for example - sending a SIGCONT/SIGSTOP can happen from interrupt context, and can flush previously queued process control signals). So regardless of all the other questions about the memory ordering and locking for this new cached allocation, the sigqueue_cache_or_free() assumptions seem to be fundamentally incorrect. It may be that people will show me the errors of my ways, and tell me why this is all safe after all. We can reinstate it if so. But my current belief is that the WRITE_ONCE() that sets the cached entry needs to be a smp_store_release(), and the READ_ONCE() that finds a cached entry needs to be a smp_load_acquire() to handle memory ordering correctly. And the sequence in sigqueue_cache_or_free() would need to either use a lock or at least be interrupt-safe some way (perhaps by using something like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the percpu operations it needs to be interrupt-safe). Fixes: 399f8dd ("signal: Prevent sigqueue caching after task got released") Fixes: 4bad58e ("signal: Allow tasks to cache one sigqueue struct") Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Linux 5.13 Signed-off-by: Miguel Ojeda <[email protected]>
This comment has been minimized.
This comment has been minimized.
ojeda
pushed a commit
that referenced
this pull request
Jul 29, 2021
Subprograms are calling map_poke_track(), but on program release there is no hook to call map_poke_untrack(). However, on program release, the aux memory (and poke descriptor table) is freed even though we still have a reference to it in the element list of the map aux data. When we run map_poke_run(), we then end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run(): [...] [ 402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e [ 402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337 [ 402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G I 5.12.0+ #399 [ 402.824715] Call Trace: [ 402.824719] dump_stack+0x93/0xc2 [ 402.824727] print_address_description.constprop.0+0x1a/0x140 [ 402.824736] ? prog_array_map_poke_run+0xc2/0x34e [ 402.824740] ? prog_array_map_poke_run+0xc2/0x34e [ 402.824744] kasan_report.cold+0x7c/0xd8 [ 402.824752] ? prog_array_map_poke_run+0xc2/0x34e [ 402.824757] prog_array_map_poke_run+0xc2/0x34e [ 402.824765] bpf_fd_array_map_update_elem+0x124/0x1a0 [...] The elements concerned are walked as follows: for (i = 0; i < elem->aux->size_poke_tab; i++) { poke = &elem->aux->poke_tab[i]; [...] The access to size_poke_tab is a 4 byte read, verified by checking offsets in the KASAN dump: [ 402.825004] The buggy address belongs to the object at ffff8881905a7800 which belongs to the cache kmalloc-1k of size 1024 [ 402.825008] The buggy address is located 320 bytes inside of 1024-byte region [ffff8881905a7800, ffff8881905a7c00) The pahole output of bpf_prog_aux: struct bpf_prog_aux { [...] /* --- cacheline 5 boundary (320 bytes) --- */ u32 size_poke_tab; /* 320 4 */ [...] In general, subprograms do not necessarily manage their own data structures. For example, BTF func_info and linfo are just pointers to the main program structure. This allows reference counting and cleanup to be done on the latter which simplifies their management a bit. The aux->poke_tab struct, however, did not follow this logic. The initial proposed fix for this use-after-free bug further embedded poke data tracking into the subprogram with proper reference counting. However, Daniel and Alexei questioned why we were treating these objects special; I agree, its unnecessary. The fix here removes the per subprogram poke table allocation and map tracking and instead simply points the aux->poke_tab pointer at the main programs poke table. This way, map tracking is simplified to the main program and we do not need to manage them per subprogram. This also means, bpf_prog_free_deferred(), which unwinds the program reference counting and kfrees objects, needs to ensure that we don't try to double free the poke_tab when free'ing the subprog structures. This is easily solved by NULL'ing the poke_tab pointer. The second detail is to ensure that per subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do this, we add a pointer in the poke structure to point at the subprogram value so JITs can easily check while walking the poke_tab structure if the current entry belongs to the current program. The aux pointer is stable and therefore suitable for such comparison. On the jit_subprogs() error path, we omit cleaning up the poke->aux field because these are only ever referenced from the JIT side, but on error we will never make it to the JIT, so its fine to leave them dangling. Removing these pointers would complicate the error path for no reason. However, we do need to untrack all poke descriptors from the main program as otherwise they could race with the freeing of JIT memory from the subprograms. Lastly, a748c69 ("bpf: propagate poke descriptors to subprograms") had an off-by-one on the subprogram instruction index range check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'. However, subprog_end is the next subprogram's start instruction. Fixes: a748c69 ("bpf: propagate poke descriptors to subprograms") Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Let's get rid of some differences before the next big
-rc1
.