Watchdog soft lockup

pvanloo · September 24, 2019, 8:02am

Hello

I have an Ubuntu Core 16 device (RPi CM3) that occasionally, at random, loses its Ethernet connection.

Core snap is 16-2.40
Kernel snap is pi2-kernel 4.4.0-1120.129

I have now managed to capture the serial console output. It looks like the system crashes completely, but it never recovers or performs a full reset.

[18037.433255] INFO: rcu_sched self-detected stall on CPU
[18037.441257] INFO: rcu_sched detected stalls on CPUs/tasks:
[18037.441264] 1-...: (3708966 ticks this GP) idle=78d/140000000000001/0 softirq=7574/7589 fqs=1588
[18037.441270] (detected by 3, t=4495427 jiffies, g=2096, c=2095, q=35614)
[18037.441272] Task dump for CPU 1:
[18037.441278] manager-control R running      0  2310   2308 0x00000082
[18037.441285] rcu_sched kthread starved for 4493705 jiffies! g2096 c2095 f0x2 s3 ->state=0x0
[18037.720278] 1-...: (3708966 ticks this GP) idle=78d/140000000000001/0 softirq=7574/7589 fqs=1588
[18037.782050]  (t=4495512 jiffies g=2096 c=2095 q=35615)
[18037.813311] rcu_sched kthread starved for 4493798 jiffies! g2096 c2095 f0x2 s3 ->state=0x0
[18037.873341] Task dump for CPU 1:
[18037.901730] manager-control R running      0  2310   2308 0x00000082
[18037.933017] [<80112554>] (unwind_backtrace) from [<8010d7dc>] (show_stack+0x20/0x24)
[18037.990378] [<8010d7dc>] (show_stack) from [<801571bc>] (sched_show_task+0xb8/0x110)
[18038.048268] [<801571bc>] (sched_show_task) from [<80159a00>] (dump_cpu_task+0x48/0x4c)
[18038.107812] [<80159a00>] (dump_cpu_task) from [<801919b8>] (rcu_dump_cpu_stacks+0x9c/0xd4)
[18038.168554] [<801919b8>] (rcu_dump_cpu_stacks) from [<80195fec>] (rcu_check_callbacks+0x5c0/0x8bc)
[18038.230663] [<80195fec>] (rcu_check_callbacks) from [<8019c3cc>] (update_process_times+0x4c/0x74)
[18038.294469] [<8019c3cc>] (update_process_times) from [<801b04dc>] (tick_sched_handle+0x64/0x70)
[18038.358337] [<801b04dc>] (tick_sched_handle) from [<801b0550>] (tick_sched_timer+0x68/0xbc)
[18038.423493] [<801b0550>] (tick_sched_timer) from [<8019d1b0>] (__hrtimer_run_queues+0x188/0x364)
[18038.490259] [<8019d1b0>] (__hrtimer_run_queues) from [<8019db4c>] (hrtimer_interrupt+0xd8/0x244)
[18038.557715] [<8019db4c>] (hrtimer_interrupt) from [<80743e5c>] (arch_timer_handler_phys+0x40/0x48)
[18038.626525] [<80743e5c>] (arch_timer_handler_phys) from [<8018bd2c>] (handle_percpu_devid_irq+0x80/0x194)
[18038.696876] [<8018bd2c>] (handle_percpu_devid_irq) from [<80186f64>] (generic_handle_irq+0x34/0x44)
[18038.768565] [<80186f64>] (generic_handle_irq) from [<80187270>] (__handle_domain_irq+0x6c/0xc4)
[18038.840084] [<80187270>] (__handle_domain_irq) from [<801096f4>] (handle_IRQ+0x28/0x2c)
[18038.910902] [<801096f4>] (handle_IRQ) from [<801015e4>] (bcm2836_arm_irqchip_handle_irq+0xb8/0xbc)
[18038.982964] [<801015e4>] (bcm2836_arm_irqchip_handle_irq) from [<808a3678>] (__irq_svc+0x58/0x78)
[18039.055005] Exception stack(0xaea2b9d0 to 0xaea2ba18)
[18039.091313] b9c0:                                     b9e3e01c 00000000 00000025 00000024
[18039.161335] b9e0: ae800040 ffffffff aea2bae4 00011000 aea2bae4 80e0355c 000b0000 aea2ba2c
[18039.230628] ba00: aea2ba30 aea2ba20 8027a628 808a2c30 200f0013 ffffffff
[18039.267788] [<808a3678>] (__irq_svc) from [<808a2c30>] (_raw_spin_lock+0x40/0x54)
[18039.335854] [<808a2c30>] (_raw_spin_lock) from [<8027a628>] (unmap_single_vma+0x1e8/0x64c)
[18039.404789] [<8027a628>] (unmap_single_vma) from [<8027bab4>] (unmap_vmas+0x64/0x78)
[18039.473309] [<8027bab4>] (unmap_vmas) from [<80282e38>] (exit_mmap+0x110/0x214)
[18039.511274] [<80282e38>] (exit_mmap) from [<80122264>] (mmput+0x6c/0x138)
[18039.548375] [<80122264>] (mmput) from [<80128a74>] (do_exit+0x32c/0xb38)
[18039.585148] [<80128a74>] (do_exit) from [<8010db5c>] (die+0x37c/0x38c)
[18039.621494] [<8010db5c>] (die) from [<8011bee0>] (__do_kernel_fault.part.0+0x74/0x1f4)
[18039.688470] [<8011bee0>] (__do_kernel_fault.part.0) from [<808a40d8>] (do_page_fault+0x244/0x3c4)
[18039.756059] [<808a40d8>] (do_page_fault) from [<80101284>] (do_DataAbort+0x58/0xe8)
[18039.822179] [<80101284>] (do_DataAbort) from [<808a35e4>] (__dabt_svc+0x44/0x80)
[18039.887651] Exception stack(0xaea2bd00 to 0xaea2bd48)
[18039.921312] bd00: b9f55fc0 00000037 00000038 00000000 b9036000 ad5f32a0 b9f55fc0 00017000
[18039.987397] bd20: ae80005c b8af3eec 80e0354c aea2bd6c aea2bd70 aea2bd50 802855e8 802a4a5c
[18040.054541] bd40: 60010113 ffffffff
[18040.086842] [<808a35e4>] (__dabt_svc) from [<802a4a5c>] (mem_cgroup_begin_page_stat+0x94/0xa0)
[18040.153350] [<802a4a5c>] (mem_cgroup_begin_page_stat) from [<802855e8>] (page_add_file_rmap+0x1c/0xa4)
[18040.220784] [<802855e8>] (page_add_file_rmap) from [<8027c2d4>] (do_set_pte+0xec/0x100)
[18040.287422] [<8027c2d4>] (do_set_pte) from [<802498e0>] (filemap_map_pages+0x27c/0x298)

and

[18064.097260] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [manager-control:2310]
[18064.163238] Modules linked in: cfg80211 nls_ascii bcm2835_wdt bcm2835_gpiomem spi_bcm2835 uio_pdrv_genirq uio i2c_bcm2708
[18064.233969] CPU: 1 PID: 2310 Comm: manager-control Tainted: G      D      L  4.4.0-1120-raspi2 #129-Ubuntu
[18064.302713] Hardware name: BCM2709
[18064.334894] task: b83f1440 ti: aea2a000 task.ti: aea2a000
[18064.368692] PC is at _raw_spin_lock+0x40/0x54
[18064.400880] LR is at unmap_single_vma+0x1e8/0x64c
[18064.432726] pc : [<808a2c30>]    lr : [<8027a628>]    psr: 200f0013
[18064.432726] sp : aea2ba20  ip : aea2ba30  fp : aea2ba2c
[18064.497639] r10: 000b0000  r9 : 80e0355c  r8 : aea2bae4
[18064.528646] r7 : 00011000  r6 : aea2bae4  r5 : ffffffff  r4 : ae800040
[18064.560481] r3 : 00000024  r2 : 00000025  r1 : 00000000  r0 : b9e3e01c
[18064.591720] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[18064.623483] Control: 10c5383d  Table: 301ec06a  DAC: 00000051
[18064.653583] CPU: 1 PID: 2310 Comm: manager-control Tainted: G      D      L  4.4.0-1120-raspi2 #129-Ubuntu
[18064.711017] Hardware name: BCM2709
[18064.737944] [<80112554>] (unwind_backtrace) from [<8010d7dc>] (show_stack+0x20/0x24)
[18064.793410] [<8010d7dc>] (show_stack) from [<804be668>] (dump_stack+0xc8/0x10c)
[18064.825162] [<804be668>] (dump_stack) from [<80109a78>] (show_regs+0x1c/0x20)
[18064.856288] [<80109a78>] (show_regs) from [<801ed21c>] (watchdog_timer_fn+0x258/0x2c0)
[18064.911793] [<801ed21c>] (watchdog_timer_fn) from [<8019d1b0>] (__hrtimer_run_queues+0x188/0x364)
[18064.968537] [<8019d1b0>] (__hrtimer_run_queues) from [<8019db4c>] (hrtimer_interrupt+0xd8/0x244)
[18065.025759] [<8019db4c>] (hrtimer_interrupt) from [<80743e5c>] (arch_timer_handler_phys+0x40/0x48)
[18065.084353] [<80743e5c>] (arch_timer_handler_phys) from [<8018bd2c>] (handle_percpu_devid_irq+0x80/0x194)
[18065.145475] [<8018bd2c>] (handle_percpu_devid_irq) from [<80186f64>] (generic_handle_irq+0x34/0x44)
[18065.208195] [<80186f64>] (generic_handle_irq) from [<80187270>] (__handle_domain_irq+0x6c/0xc4)
[18065.272665] [<80187270>] (__handle_domain_irq) from [<801096f4>] (handle_IRQ+0x28/0x2c)
[18065.338842] [<801096f4>] (handle_IRQ) from [<801015e4>] (bcm2836_arm_irqchip_handle_irq+0xb8/0xbc)
[18065.407526] [<801015e4>] (bcm2836_arm_irqchip_handle_irq) from [<808a3678>] (__irq_svc+0x58/0x78)
[18065.477520] Exception stack(0xaea2b9d0 to 0xaea2ba18)
[18065.513738] b9c0:                                     b9e3e01c 00000000 00000025 00000024
[18065.583538] b9e0: ae800040 ffffffff aea2bae4 00011000 aea2bae4 80e0355c 000b0000 aea2ba2c

Is this caused by my process ‘manager-control’? If so, how can I make sure the device resets completely if this happens again?

Best regards

ogra · September 24, 2019, 9:51am

Even if it was, it should not be able to lock up the kernel like this. You should file a bug against the kernel snap at:

https://bugs.launchpad.net/ubuntu/+source/linux-raspi2/+filebug
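
On the second question (forcing a full reset), one possible stopgap is to tell the kernel to panic and reboot instead of hanging. This is not confirmed anywhere in this thread. The backtrace goes through die() and the kernel is tainted with the D flag, so an oops seems to have happened before the soft lockup. Below is a minimal sketch, assuming the standard sysctl files kernel.panic_on_oops, kernel.softlockup_panic and kernel.panic are present on this kernel and that the script runs as root. The function name `apply_settings` and the 10 second reboot delay are illustrative choices, not values from this thread.

```python
from pathlib import Path

# Assumed values for illustration; none of them come from this thread.
PANIC_SETTINGS = {
    "kernel/panic_on_oops": "1",     # turn an oops into a panic
    "kernel/softlockup_panic": "1",  # turn a soft lockup into a panic
    "kernel/panic": "10",            # reboot 10 seconds after a panic
}


def apply_settings(settings=PANIC_SETTINGS, root="/proc/sys"):
    """Write each setting below `root` and return the ones that were applied.

    Entries whose file does not exist are skipped, since not every kernel
    build exposes every knob.
    """
    applied = []
    for rel_path, value in settings.items():
        target = Path(root) / rel_path
        if not target.exists():
            continue
        target.write_text(value + "\n")
        applied.append(rel_path)
    return applied
```

Calling `apply_settings()` as root changes the running kernel only, so the values are lost on the next boot. How best to persist them on Ubuntu Core is outside this sketch.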

pvanloo · September 24, 2019, 10:08am

Thanks for the reply.

Reported the bug here: https://bugs.launchpad.net/ubuntu/+source/linux-raspi2/+bug/1845178
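
For a bug report like this one, it can help to summarize a long serial capture before attaching it. The following is a rough sketch, not a general dmesg parser, that extracts the soft lockup and rcu_sched starvation lines in the format shown in the log above. The function name `summarize` and the dictionary keys are made up for illustration.

```python
import re

# Matches e.g.
# [18064.097260] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [manager-control:2310]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+.*BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)

# Matches e.g.
# [18037.441285] rcu_sched kthread starved for 4493705 jiffies! ...
RCU_STARVED = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+rcu_sched kthread starved for "
    r"(?P<jiffies>\d+) jiffies"
)


def summarize(log_text):
    """Return a list of dicts, one per lockup or starvation line found."""
    events = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "kind": "soft_lockup",
                "time": float(match["ts"]),
                "cpu": int(match["cpu"]),
                "stuck_s": int(match["secs"]),
                "comm": match["comm"],
                "pid": int(match["pid"]),
            })
            continue
        match = RCU_STARVED.search(line)
        if match:
            events.append({
                "kind": "rcu_starved",
                "time": float(match["ts"]),
                "jiffies": int(match["jiffies"]),
            })
    return events
```

Running `summarize(open("serial.log").read())` on a capture (the file name is a placeholder) gives one entry per event, which makes it easier to see which task and CPU are involved each time.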