Random "Bad page state" kernel error on an Azure VM while SAS is running

Problem description

A few months ago, we started seeing Bad page state errors in /var/log/messages.

Here is the exact stack trace:

Sep 27 15:14:11 az-pro-sas1 kernel: BUG: Bad page state in process sas  pfn:1a49ff
Sep 27 15:14:11 az-pro-sas1 kernel: page:ffffd9a146927fc0 count:0 mapcount:1 mapping:          (null) index:0x7f48e7fff
Sep 27 15:14:11 az-pro-sas1 kernel: page flags: 0x2fffff00080018(uptodate|dirty|swapbacked)
Sep 27 15:14:11 az-pro-sas1 kernel: page dumped because: nonzero mapcount
Sep 27 15:14:11 az-pro-sas1 kernel: Modules linked in: binfmt_misc iptable_security bridge stp llc nf_conntrack_netlink nfnetlink ext4 mbcache jbd2 nfsv3 nfs_acl nfs lockd grace drbg fscache ansi_cprng cmac arc4 md4 nls_utf8 cifs ccm dns_resolver overlay(T) ipt_REJECT nf_reject_ipv4 xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack sunrpc dm_mirror dm_region_hash dm_log dm_mod joydev sb_edac iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg hv_utils ptp hv_balloon pps_core pcspkr i2c_piix4 ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic hv_netvsc hv_storvsc scsi_transport_fc hyperv_keyboard scsi_tgt hid_hyperv crct10dif_pclmul crct10dif_common crc32c_intel ata_generic
Sep 27 15:14:11 az-pro-sas1 kernel: pata_acpi floppy hyperv_fb ata_piix serio_raw libata hv_vmbus
Sep 27 15:14:11 az-pro-sas1 kernel: CPU: 2 PID: 117797 Comm: sas Tainted: G               ------------ T 3.10.0-957.12.2.el7.x86_64 #1
Sep 27 15:14:11 az-pro-sas1 kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
Sep 27 15:14:11 az-pro-sas1 kernel: Call Trace:
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a963041>] dump_stack+0x19/0x1b
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a95dcf3>] bad_page.part.76+0xdc/0xf9
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a3bf100>] free_pages_prepare+0x170/0x190
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a3bfb74>] free_hot_cold_page+0x74/0x160
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a3c4a13>] __put_single_page+0x23/0x30
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a3c4a65>] put_page+0x45/0x60
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a42ca37>] __split_huge_page+0x357/0x880
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a42cfd6>] split_huge_page_to_list+0x76/0xf0
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a42de30>] __split_huge_page_pmd+0x1d0/0x5c0
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a3e781d>] unmap_page_range+0xbdd/0xc30
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a3e78f1>] unmap_single_vma+0x81/0xf0
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a3e8d2d>] zap_page_range+0x11d/0x190
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a3e3c1d>] SyS_madvise+0x49d/0xac0
Sep 27 15:14:11 az-pro-sas1 kernel: [<ffffffff8a975ddb>] system_call_fastpath+0x22/0x27
Sep 27 15:14:11 az-pro-sas1 kernel: Disabling lock debugging due to kernel taint
Sep 27 15:14:12 az-pro-sas1 sh: abrt-dump-oops: Found oopses: 1disa
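
Since the error shows up at random, it may help to count how often it fires and for which process. The following is a minimal sketch, not part of the original report: it scans syslog-style lines for the "Bad page state" header and the "page dumped because" reason, using only the line formats visible in the trace above. The function name summarize_bad_pages and the sample list are hypothetical.

```python
import re
from collections import Counter

# First line of a report, e.g.
# "... kernel: BUG: Bad page state in process sas  pfn:1a49ff"
BAD_PAGE_RE = re.compile(
    r"kernel: BUG: Bad page state in process (?P<comm>\S+)\s+pfn:(?P<pfn>[0-9a-f]+)"
)
# Reason line, e.g. "... kernel: page dumped because: nonzero mapcount"
REASON_RE = re.compile(r"kernel: page dumped because: (?P<reason>.+)$")


def summarize_bad_pages(lines):
    """Return (events, reasons) for an iterable of log lines.

    events  -- list of (process name, page frame number as int)
    reasons -- Counter of the "page dumped because" reasons
    """
    events = []
    reasons = Counter()
    for line in lines:
        header = BAD_PAGE_RE.search(line)
        if header:
            events.append((header.group("comm"), int(header.group("pfn"), 16)))
            continue
        reason = REASON_RE.search(line)
        if reason:
            reasons[reason.group("reason").strip()] += 1
    return events, reasons


# Hypothetical sample: two lines copied from the trace above.
SAMPLE = [
    "Sep 27 15:14:11 az-pro-sas1 kernel: BUG: Bad page state in process sas  pfn:1a49ff",
    "Sep 27 15:14:11 az-pro-sas1 kernel: page dumped because: nonzero mapcount",
]

if __name__ == "__main__":
    events, reasons = summarize_bad_pages(SAMPLE)
    for comm, pfn in events:
        print(f"process={comm} pfn={pfn:#x}")
    for reason, count in reasons.items():
        print(f"{count} x {reason}")
```

To run it against a real log, pass an open file object instead of SAMPLE, for example summarize_bad_pages(open("/var/log/messages")).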

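We managed to reproduce the issue on other VMs. We tried upgrading and downgrading the kernel, and we also tried disabling transparent huge pages, but with no luck.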
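
The call trace above passes through __split_huge_page, which is part of the transparent huge page code, so it may be worth confirming that the setting was really in effect on the VM where the trace was captured. Below is a minimal sketch, assuming the standard sysfs file /sys/kernel/mm/transparent_hugepage/enabled, whose content marks the active mode with square brackets (for example "always madvise [never]"). The helper names parse_thp_mode and current_thp_mode are hypothetical.

```python
from pathlib import Path

# Standard sysfs location of the THP switch (assumed present on this kernel).
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active mode from sysfs text such as 'always madvise [never]'."""
    for token in text.split():
        # The kernel wraps the currently selected mode in square brackets.
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no bracketed mode found in {text!r}")


def current_thp_mode(path=THP_ENABLED):
    """Read the sysfs file and return the active THP mode."""
    return parse_thp_mode(path.read_text())


if __name__ == "__main__":
    print("transparent_hugepage/enabled:", current_thp_mode())
```

If this prints anything other than "never", transparent huge pages are still active for at least some mappings. The setting does not persist across reboots unless it is also applied at boot, for example with the transparent_hugepage=never kernel parameter.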

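It is a CentOS 7 VM running on Azure, with the following version:

The error appears at random, but always while SAS is running. When it happens, the SAS process hangs forever; after a while the VM starts burning CPU and becomes unresponsive.

Any help would be greatly appreciated!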

Tags: azure, linux-kernel, centos7, sas-token

Solution

