maemo.org Bugzilla – Bug 5712
Ad-hoc wifi leaks kernel memory, puts N900 into unusable swap hell
Last modified: 2010-06-18 13:48:02 UTC
You need to log in before you can comment on or make changes to this bug.
SOFTWARE VERSION: (Control Panel > General > About product) 1.2009.41-10 STEPS TO REPRODUCE THE PROBLEM: Try to use the N900 for a day or two, opening and closing apps periodically. EXPECTED OUTCOME: The N900 is usable and snappy. ACTUAL OUTCOME: It gets more and more sluggish until finally the UI totally freezes up, forcing you to reboot (if you can) or pull the battery (if you can't). REPRODUCIBILITY: (always/sometimes/once) Never happened during the first week of use after the Maemo Summit, happens every day now. It *might* be triggered by some software package I've installed, but the symptoms resemble a kernel bug (see the "OTHER COMMENTS" section). EXTRA SOFTWARE INSTALLED: account-plugin-idle 0.2 user/network 84 account-plugin-salut 0.2 user/network 80 amazon-installer 1.0+0m5 user/other 96 ap-installer 1.0+0m5 user/other 88 attitude 0.0.8 user/graphics 168 bluemaemo 0.3-7 user/multimedia 2380 bounce 1.0.0 user/games 22236 classic-gaming-wallpapers 1.3-1maemo3 user/desktop 4092 conboy 1:0.5.6 user/utilities 288 connection-switcher 0.3.3 user/other 100 dtg-installer 1.0-4+0m5 user/other 96 eyes-widget 1.0.0 user/desktop 188 facebook-installer 1.0-4+0m5 user/other 88 facebook-services 0.4.5 user/desktop 60 fbreader 0.10.7-1 user/tools 1464 fbreader-page-buttons 0.10.7-1 user/tools 40 feedhandler 2.0 user/network 88 foreca-installer 1.0+0m5 user/other 88 gpodder 2.0+git20091013-thp1 user/multimedia 1660 gps-data-logger 0.2-4 user/desktop 84 hermes 0.1.0 user/utilities 248 irreco-backend-internal-lirc 0.1.3 user/multimedia 92 irreco-core 0.8.8 user/multimedia 1048 irreco-for-n900 0.8.0 user/multimedia 44 libfribidi0 0.10.7-4 user/libs 160 liblinebreak 1.0-1 user/libs 100 libmpd0 0.12.0-1 user/multimedia 308 libzlibrary 0.10.7-1 user/libs 2280 lirc 0.8.4a-1fremantle30 user/multimedia 744 load-applet 0.4.5-2 user/utilities 100 maemo-pc-connectivity 0.4.13 user/tools 44 mc 1:4.6.2-pre1-1maemo5 user/tools 6388 mmpc 0.2.1 user/multimedia 596 more-gconv-modules 2.3.5cs2005q3.2-4.osso10 user/office 1024 numptyphysics 0.3.0.5 user/games 1048 ogg-support 1.0.5 user/multimedia 136 openntpd 3.9p1-4-maemo3 user/other 172 openssh 1:5.1p1-6.maemo5 user/network 8 openssh-client 1:5.1p1-6.maemo5 user/network 1192 openssh-server 1:5.1p1-6.maemo5 user/network 456 osm2go 0.7.21-maemo1 user/navigation 4216 pastebinit 0.11-1maemo2 user/utilities 328 pc-connectivity-manager 0.7.10 user/connection 5140 personal-gprs-mon 0.5-5 user/desktop 140 personal-ip-address 0.3-6 user/desktop 120 pwsafe 1.5.5d user/utilities 112 rdesktop-cli 1.6.3-maemo2 user/tools 536 rootsh 1.5 user/system 32 simple-brightness-applet 1.1-1maemo0 user/desktop 64 telnet 0.17-maemo2 user/network 160 transitioncontrol 1.1.1 user/system 104 ttf-droid 1.01-dfsg0maemo3 user/graphics 4392 tuner 0.1.1 user/multimedia 132 tutorial-home-applet 0.6.14+0m5 user/support 1272 unzip 5.52-10 user/utils 364 vim 7.2-0maemo5 user/utilities 20672 vncviewer 0.6.3-fremantle2 user/network 224 x11vnc 0.9.3-maemo6 user/utilities 1436 xchat 2.8.6-maemo11 user/network 744 zip 2.32-1maemo2 user/tools 168 OTHER COMMENTS: When this happens, I ssh into my n900 and try to investigate using htop and vmstat (built from procps sources, not packaged into a deb). I see the CPU spending 80-95% time in iowait state. I see a loadavg fluctuating between 17 and 23, usually at around 20. I see about 130 megs of swap space used. I see a lot of free RAM (free as in not used for apps, not used for cache, not used for buffers). The free RAM number slowly increases: from 70 to 100 megs. The system is swapping a lot: about 5 megabytes per second swapin, about 1 megabyte per second swapout. The working set of applications doesn't fit in memory, but the kernel is keeping nearly a hundred megs of RAM unused -- why? This smells like a kernel bug to me. vmstat excerpt: procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 12 119620 97156 76 15236 10580 988 10608 988 9982 10767 1 15 0 84 2 16 119780 94408 76 15692 5380 1260 5384 1260 5351 5429 0 6 0 94 3 19 119964 95404 84 15412 3516 884 3688 884 4475 4785 3 19 0 77 2 18 119992 91264 84 16060 5856 284 5856 284 5063 5311 3 17 0 81 2 17 120116 89764 84 16352 5600 908 5608 908 5087 6194 0 14 0 86 2 19 120188 89884 84 16272 5756 816 5764 816 5322 5884 0 11 0 89 1 18 120276 91684 80 15992 680 480 696 480 1026 979 1 17 0 82 2 17 120372 89020 92 16428 7400 884 7488 884 7396 7844 2 13 0 85 0 15 119976 91668 84 16096 7808 2048 7920 2048 8711 8521 0 8 0 91 0 16 120144 92380 80 15992 5116 1040 5156 1040 4856 4933 3 13 0 84 2 17 120232 95500 80 15752 3072 840 3076 840 3817 4446 0 4 0 96 2 18 120232 95500 80 15796 5624 684 5632 684 4940 5100 0 7 0 93 2 17 120356 93580 80 15296 8100 908 8148 908 8012 7564 3 21 0 76 2 16 120552 97120 80 15592 12700 2072 12704 2072 12065 12535 1 8 0 91
Does dmesg show anything that might be relevant?
Sadly, no. The only messages in dmesg were about wlan signal strength switching between low and high (a *lot* of those), slide being open/closed, and kb_lock being open/closed. I have a dmesg from an earlier occasion a few days back, when the kernel decided to OOM-kill my browser for unknown reasons. The UI was sluggish but perfectly responsible back then; I restarted the browser and continued happily using it. What is somewhat suspicious is that, if I'm interpreting the kernel message correctly, there was plenty of free RAM at the time the OOM killer triggered, so maybe it is related to this bug? Judge for yourself: [ 8589.538574] Spurious irq 95: 0xffffffdf, please flush posted write for irq 86 [ 8620.396881] icd2 invoked oom-killer: gfp_mask=0xd0, order=2, oomkilladj=0 [ 8620.396911] [<c0030470>] (dump_stack+0x0/0x14) from [<c008e618>] (oom_kill_process+0x5c/0x1f0) ... snip long stack trace ... [ 8620.397369] [<c0054cbc>] (do_fork+0x0/0x2d0) from [<c002f9c0>] (sys_clone+0x34/0x3c) [ 8620.397399] [<c002f98c>] (sys_clone+0x0/0x3c) from [<c002c940>] (ret_fast_syscall+0x0/0x2c) [ 8620.397430] Mem-info: [ 8620.397460] Normal per-cpu: [ 8620.397460] CPU 0: hi: 90, btch: 15 usd: 0 [ 8620.397491] Active_anon:156 active_file:0 inactive_anon:320 [ 8620.397491] inactive_file:45 unevictable:4733 dirty:0 writeback:34 unstable:0 [ 8620.397521] free:23379 slab:25337 mapped:2642 pagetables:1675 bounce:0 [ 8620.397552] Normal free:93516kB min:2036kB low:2544kB high:3052kB active_anon:624kB inactive_anon:1280kB active_file:0kB inactive_file:180kB unevictable:18932kB present:260096kB pages_scanned:0 all_unreclaimable? no [ 8620.397583] lowmem_reserve[]: 0 0 [ 8620.397583] Normal: 2407*4kB 10466*8kB 2*16kB 2*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 93516kB [ 8620.397644] 3133 total pagecache pages [ 8620.397674] 476 pages in swap cache [ 8620.397674] Swap cache stats: add 286699, delete 286223, find 23896/37254 [ 8620.397705] Free swap = 668132kB [ 8620.397705] Total swap = 786424kB [ 8620.407867] 65536 pages of RAM [ 8620.407897] 23973 free pages [ 8620.407928] 4276 reserved pages [ 8620.407928] 15660 slab pages [ 8620.407928] 40255 pages shared [ 8620.407958] 476 pages swap cached [ 8620.407958] Out of memory: kill process 1637 (browserd) score 706347008 or a child [ 8620.408111] Killed process 1637 (browserd) Am I reading it right? The 23973 free pages (= 93 megabytes, assuming a page size of 4K) were free *before* browserd was killed?
That looks seriously pathological, but I haven't seen that behaviour here. Right now free reports: total used free shared buffers Mem: 245556 241288 4268 0 13928 Swap: 786424 202336 584088 Total: 1031980 443624 588356 and the device is generally responsive. One thing I noticed is that /proc/sys/vm/swappiness is now tuned all the way up to 100, maybe that's related? Adding Andre to CC: list, I think this is worth forwarding internally to get some feedback.
Interesting. We haven't seen anything like this internally (lots of swapping with free RAM). Can you attach output from Xrestop? Is there anything "suspicious" looking in /proc/slabinfo? After getting those infos into safety, does doing "killall hildon-desktop" to restart Desktop help anything? (I wonder whether there could be something funny related to X / SGX that triggers issues) > the kernel decided to OOM-kill my browser for unknown reasons. Because browser + flashplayer and the content they execute from the network can consume infinite amount of resources, Browser is set as OOM-preferred (along with couple of other things). That's why it's usually one of the first things to go when (kernel thinks that) device gets really low on RAM.
I think the device may be doing it again. Only this time I have no running apps (except for a single xterm), so the load is low and UI is mostly responsive (sliding the lock to unlock keyboard & screen feels slow, but that may be just my anxiety speaking). vmstat (this time with a 10s period instead of 1s) shows slow swapout and an increase in free RAM: ~ $ vmstat 10 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 1 141852 84576 56 16860 26 9 37 33 255 263 9 4 85 2 5 0 141856 85312 48 16584 157 70 157 70 744 1928 6 9 79 6 0 0 141856 83196 44 16636 375 12 375 12 912 2240 7 10 75 8 0 3 141928 83364 44 17024 296 215 296 215 847 1792 6 10 72 11 1 13 141984 84040 48 16440 1758 259 1759 259 1992 3060 9 13 45 32 0 7 142144 60064 68 17236 4282 164 4286 164 3850 5153 2 14 0 84 0 0 142116 58980 76 17432 893 50 893 54 1180 2332 3 7 73 17 0 0 142116 57488 76 17428 245 7 245 7 921 2294 4 10 83 3 0 0 142132 64472 72 17228 108 60 108 60 702 1848 4 9 83 4 0 0 142132 65016 72 17180 2 7 2 7 552 1649 2 6 91 0 0 0 142144 69480 68 17132 4 33 4 33 671 1925 3 7 84 6 4 1 142160 75316 60 16996 68 48 68 48 654 1793 2 7 85 6 0 0 142256 78056 60 16736 30 84 30 84 621 1736 2 6 89 3 1 1 142368 60480 60 17584 2363 259 2366 259 2502 3587 3 11 34 52 2 0 142452 67480 92 17088 500 404 508 416 1114 2271 2 8 43 48 0 0 142456 66972 100 17104 120 0 120 6 596 1508 1 5 85 10 3 0 142456 68544 96 17056 38 43 38 44 613 1808 1 5 91 3 1 0 142464 71892 92 16804 101 37 101 37 658 1770 2 7 84 7 1 0 142600 76096 84 16244 1182 251 1183 256 1654 2722 7 9 37 47 1 0 142632 82148 104 15560 184 182 185 195 783 1974 3 6 60 31 0 0 142668 82776 96 15112 274 117 274 120 847 1930 3 6 82 9 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 142668 83280 88 15080 28 30 28 30 600 1725 2 6 90 2 1 1 142684 82972 88 14984 136 31 136 31 714 1867 2 7 88 3 0 0 142700 83348 88 14756 178 186 178 186 752 1875 2 6 81 12 0 0 142700 83728 88 14776 26 34 26 34 571 1685 2 5 92 1 0 0 142704 85364 88 14496 7 53 7 53 558 1672 1 6 92 1 uptime is 1 day, 10:08 again (coincidence? When I was reporting this bug I noticed uptime being 1 day and 10 hours with some minutes). I can try running something (a web browser) to see if that would provoke the high loads I've seen yesterday. I'm a bit reluctant to do that since it makes even ssh sessions sluggish, and then it's hard to collect information. Currently xrestop shows Monitoring 31 clients. XErrors: 0 Pixmaps: 2930K total, Other: 166K total, All: 3096K total which doesn't seem like much. /proc/slabinfo is something I'm not familiar with so I cannot say what's suspicious. I can attach a copy here, after I provoke the high load. The "unknown reasons" bit was supposed to mean "I don't know why the kernel thought it was out of memory" rather than "why the kernel picked the browser instead of something else". One thing that I think I should mention: since this Monday I started using my N900 as a wireless "access point", sharing my 3G connection with my laptop. It's accomplished by running a simple shell script on the n900, and then ssh'ing into it from my laptop over ad-hoc wifi and enabling the SSH socks proxy emulation with ssh -D 1080. The script itself does this: IFACE=wlan0 ADDR=10.20.30.1 NETMASK=255.255.255.0 ADDR_MIN=10.20.30.10 ADDR_MAX=10.20.30.126 CHANNEL=$1 sudo /etc/init.d/avahi-daemon start sudo ifconfig $IFACE down # otherwise iwconfig mode says device busy sudo iwconfig $IFACE mode ad-hoc if [ -n "$CHANNEL" ]; then sudo iwconfig $IFACE channel $CHANNEL fi sudo ifconfig $IFACE up # otherwise iwconfig essid times out sudo iwconfig $IFACE essid n900 sudo ifconfig $IFACE $ADDR netmask $NETMASK up echo "Sadly, NAT is not supported yet so you'll have to use ssh -D 1080 user@$ADDR and use the SOCKS5 proxy on localhost:1080." echo "When you want to disconnect, press Ctrl+C." echo "XXX in 5 minutes the n900 will decide wlan0 is not used and power it down, breaking the sharing anyway :( to wake it up sudo ifconfig $IFACE up; sudo iwconfig $IFACE essid n900;" sudo /usr/sbin/dnsmasq -i $IFACE -a $ADDR -I lo -z -d \ -x /var/run/dnsmasq.$IFACE.pid \ --dhcp-range=$ADDR_MIN,$ADDR_MAX,6h \ --dhcp-option=3,$ADDR \ --dhcp-option=6,$ADDR I use it this way for hours every evening. It could be a coincidence, but my troubles started about the same time I started doing this. In the meantime, vmstat, still running in the background, now shows procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 0 143320 92700 48 11656 460 68 460 68 948 1996 1 6 78 15 1 0 143320 94692 48 11448 135 29 135 29 649 1683 1 5 87 7 3 1 143320 95616 48 11292 189 38 193 38 745 1985 1 6 81 12 1 0 143324 96856 48 11216 167 24 171 24 638 1524 1 5 87 7 1 0 143332 92948 48 11144 813 96 818 96 1262 2441 2 8 73 17 2 1 143332 93336 48 11368 1535 144 1539 144 1810 2886 1 8 47 44 1 0 143332 90720 48 11484 433 34 433 34 920 2035 1 5 84 9 1 0 143332 87312 48 11524 1777 224 1777 224 1995 2906 1 8 46 45 2 0 143344 92008 48 11480 1719 196 1719 196 1956 3005 2 7 45 46 2 2 143344 95276 48 11612 1940 251 1947 251 2246 3372 2 9 31 58 2 0 143348 91692 48 11236 1482 164 1482 164 1785 2722 1 8 40 50 2 3 143352 94068 48 11216 1874 416 1881 416 2247 3380 1 8 19 72 1 0 143348 91348 48 11388 1062 22 1062 22 1395 2440 1 6 59 35 the amount of swapout has markedly increased, as well as time spent in iowait. I can't wake up the screen by sliding the keylock slider. Tapping on it turned on the backlight, but the screen is black. Sliding out the keyboard turns on its backlight too. I just lost my ssh connection: 1 0 143348 91348 48 11388 1062 22 1062 22 1395 2440 1 6 59 35 2 2 143348 89316 48 11252 1175 229 1175 229 1534 2663 1 6 46 47 2 1 143360 92984 48 11264 1946 276 1953 276 2231 3204 1 8 24 67 2 7 143364 96272 48 11396 2645 496 2648 496 2731 3380 1 9 0 90 2 18 143548 96196 48 10988 4359 580 4364 580 4435 5575 1 11 0 88 2 5 143612 96076 48 10984 3591 326 3600 326 3588 4790 1 11 0 88 2 10 143616 93572 48 10928 4440 371 4447 371 4333 5676 1 10 0 89 2 6 143612 82160 48 12000 4548 121 4556 121 4326 5773 2 15 0 84 Timeout, server not responding. The n900 responds to pings with a typical TTL of under 4ms. I can barely ssh in again. The session is very slow. ~ $ uptime 22:34:19 up 1 day, 10:29, load average: 27.56, 17.36, 8.24 Looks like I won't need to attempt to start a web browser in order to reproduce this. ~ $ cat /proc/slabinfo > slabinfo-Im-melting.txt ~ $ vmstat 1 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 3 15 77160 80532 48 14648 41 11 51 34 271 292 9 4 85 3 0 13 77180 81424 48 14452 3036 316 3040 316 4809 7143 2 11 0 87 1 10 77256 84644 48 13468 3032 924 3032 924 3523 4455 3 16 0 81 2 10 77444 87036 48 12808 2792 2080 2796 2080 3809 4428 2 12 0 86 uptime is decreasing now 22:37:55 up 1 day, 10:33, load average: 7.62, 14.59, 9.21 but the screen is still black with no GUI available. Interesting! dmesg contains enough messages from the OOM killer to have filled the ring buffer. When I started typing this comment I checked dmesg and it had nothing except for the usual wlan messages etc. load is rising again: 0 1 88844 89280 72 9280 4616 4 4616 4 4198 4974 1 12 0 87 0 1 88844 93352 72 9280 0 8 0 8 453 1626 0 8 0 92 0 2 88844 98496 72 9276 60 1620 60 1620 1108 1907 0 9 0 91 2 4 88844 98128 72 8664 4376 676 4452 676 4377 5598 1 16 0 84 2 5 88844 99612 72 8616 1352 376 1428 376 1987 2969 0 9 0 91 0 8 88840 96664 72 9024 5440 4 5440 4 4777 5400 0 11 0 89 1 5 88840 93828 72 9180 5408 4 5408 4 4773 5674 1 13 0 86 1 6 88840 91300 72 9280 5380 0 5380 0 4912 5899 3 11 0 86 2 11 88848 98984 72 8476 4964 2340 5036 2340 5345 6331 1 10 0 90 2 12 88848 98840 72 8516 9724 736 9800 736 9298 10787 0 13 0 87 3 8 88848 98092 72 8848 6456 380 6456 380 6048 7022 0 15 0 85 2 10 88848 98308 72 8676 5148 588 5220 588 5102 6449 0 12 0 88 2 11 88848 98768 72 8596 6304 648 6380 648 6577 7705 0 16 0 84 2 12 88848 98468 72 8732 5656 516 5732 516 5368 6356 0 15 0 85 2 12 88848 98152 72 8776 6172 620 6248 620 6334 7367 1 11 0 88 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 13 88848 98124 72 8760 7672 1004 7756 1004 7919 10072 0 18 0 82 1 12 88848 97776 72 8844 4864 228 4864 228 4763 5366 1 8 0 91 2 14 88848 98360 72 8668 6676 872 6752 872 6465 6804 0 13 0 87 3 17 88844 97940 72 8864 6964 464 7040 464 6713 6960 1 13 0 86 1 13 88848 97636 72 8960 5368 476 5368 476 4854 5541 1 12 0 87 2 14 88848 98284 72 8728 3976 668 4052 668 4409 5353 1 14 0 85 2 15 88848 98512 72 8712 5964 516 6040 516 6280 7468 0 13 0 87 2 15 88848 98492 72 8916 6428 592 6504 592 6391 7461 0 19 0 81 2 11 88848 98188 72 8788 5676 376 5752 376 5509 6385 1 15 0 84 2 18 85736 97372 72 9336 12928 1328 13008 1328 12677 15245 1 18 0 81 2 20 85740 98420 72 8596 5376 908 5452 908 5754 7102 0 12 0 88 2 16 85736 97280 72 8800 7340 540 7416 540 7351 8529 2 15 0 83 Timeout, server not responding. meanwhile in my other ssh session ~ $ uptime 22:41:57 up 1 day, 10:37, load average: 12.90, 10.36, 8.46 I'd like to attempt to killall hildon-desktop now, but my one still-working ssh session just timed out, and new ones can't be easily established. (ping still works great.) Looks like the device either rebooted, or at least restarted hildon-desktop. The UI is back, applets are showing up one by one. The UI is snappy and responsive. ~ $ uptime 22:46:43 up 2 min, load average: 2.29, 1.42, 0.56 Yep, it rebooted by itself.
Created an attachment (id=1492) [details] /proc/slabinfo during high loadavg, massive swapin & swapout, lots of free ram
Created an attachment (id=1493) [details] dmesg's output during the same
Brain not working. In the above comment, make the following corrections: s/ping with a typical TTL/ping with a typical roundtrip time/ s/uptime is decreasing/loadavg is decreasing/
Apparently there's no need to wait 1 day and 10 hours. After the boot I ran my share-wifi script again, ssh'ed and ran vmstat. In the beginning things looked like this: ~ $ vmstat 1 Unknown HZ value! (93) Assume 100. procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 1 0 2484 55216 71476 0 0 454 36 798 1260 38 20 36 7 1 0 128 3076 55216 70832 0 124 0 124 613 2167 2 6 50 42 0 0 128 3068 55216 70832 0 0 0 0 634 2336 4 5 91 0 0 0 128 3060 55216 70832 0 0 0 0 486 1673 2 4 94 0 I hour later 0 0 91520 75096 320 23620 0 0 0 0 574 1836 1 5 94 0 ^C ~ $ uptime 23:54:22 up 1:10, load average: 0.47, 0.66, 0.71 then I tapped the screen to wake it up 0 0 91520 74376 320 23656 0 0 0 0 431 1594 0 6 94 0 0 0 91520 74368 320 23656 0 0 0 0 562 1771 0 6 94 0 0 11 91520 62480 328 24376 11076 0 11084 0 3957 3522 3 26 15 56 1 5 91864 54892 292 24964 9608 444 9608 444 4734 4851 22 39 0 40 0 0 92048 52004 268 25112 5584 264 5616 264 2871 3361 23 25 25 26 0 0 92048 51996 268 25112 0 0 0 0 598 1739 4 11 85 0 0 0 92048 51988 268 25112 0 0 0 0 512 1784 2 10 88 0 0 0 92048 51980 268 25112 0 0 0 0 607 1851 1 11 88 0 2 4 92100 48304 264 25368 4592 84 4592 84 2240 3248 9 26 47 19 7 3 93168 52084 244 24876 796 1336 796 1336 1536 2870 16 21 33 31 0 0 93420 52676 244 24720 128 256 132 256 780 2176 6 12 80 3 0 0 93420 52668 244 24720 0 0 0 0 558 1907 3 6 91 0 0 0 93420 52480 244 24756 304 0 304 0 799 2089 4 12 82 2 0 0 94504 51836 260 24616 2524 1176 2564 1184 2139 3136 15 18 31 37 0 1 94504 51820 264 24660 0 0 0 36 567 1792 2 6 80 12 and now it's slowly but steadily swapping things out again 0 0 99796 69912 172 22636 0 0 0 0 464 1660 0 5 95 0 ^C ~ $ uptime 00:00:10 up 1:15, load average: 0.61, 0.69, 0.71 Either this slow swap-everything-out behaviour is normal (swappiness *is* set to 100 after all) and I'm focusing on the wrong symptoms, or the problem shows up immediately and it's more or less constant usage that keeps things in memory and lets me use my N900 all day, postponing the freeze. Only I don't actually use it constantly; during daytime it's lying on the desk with wifi on (connected to office wlan) and 3g off, and I pick it up maybe once an hour. Hm, I just got an incoming IM message which woke up the screen and dropped free ram usage a bit: 0 0 102512 78780 112 22036 0 0 0 0 658 1853 0 4 96 0 Timeout, server not responding. $ ssh -D 1080 mg-n900 ~ $ vmstat 1 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 1 105844 60716 164 23756 32 23 52 26 675 1990 5 9 84 2 In the office I get an incoming mail notification every 30 minutes (the sound and the vibration are off so those don't bother me). Modest by default checks for new mail only when connected via WLAN, but not via 3G. Grasping at straws here. So far what I've observed in this comment seems to be perfectly normal behaviour. Perhaps bad things start happening when too many pages get evicted? In previous vmstat logs the load starts rising when free ram reaches ~100 megs. Meanwhile another wakeup happened, apparently, although the screen is blank now and I don't know what caused the activity: 0 0 110020 78604 68 20248 0 0 0 0 576 1840 0 6 94 0 0 0 110020 78596 68 20248 0 0 0 0 435 1633 0 4 96 0 1 5 110020 70908 80 20456 7680 0 7692 0 4256 5430 6 24 15 55 1 0 110020 60812 80 20604 9976 0 9976 0 4757 5069 6 16 0 78 0 0 110020 59664 80 20808 896 0 900 0 874 1668 1 8 87 4 0 0 110020 59536 80 20808 112 0 112 0 518 1724 4 6 90 0 ~$ uptime 00:14:49 up 1:30, load average: 0.77, 0.69, 0.69 I'm also thinking that this kind of constant background swapin/swapout cannot possibly be good for the internal eMMC card.
So, overnight I stopped dnsmasq and took wlan0 down, just in case. In the morning I looked at the tablet, was happy to find it alive (same xterm still open) and responsive, checked vmstat quickly, saw < 10 megs free RAM being used. All appears well. I restart my sharing script, ssh from the laptop, try to get uptime (12:22, load 0.58, 0.26, 0.09) and vmstat (ssh session times out before it can print any numbers). I walk back to the N900 and see that it has just rebooted. ~ $ cat /proc/bootreason sw_rst Last night, after the reboot during the high load, bootreason was 32wd_to.
Are you using other radios than phone and wlan (in ad-hoc mode) at the same time? Can you attach also ifconfig output from the situation when the device is getting quite slow so that we can see how much data has been transferred over these interfaces? (In reply to comment #6) > Created an attachment (id=1492) [details] [details] > /proc/slabinfo during high loadavg, massive swapin & swapout, lots of free ram This is what you had: # name <active_objs> <num_objs> <objsize> kmalloc-1024 66854 66856 1024 kmalloc-512 1163 1168 512 kmalloc-256 456 624 256 kmalloc-128 2129 4256 128 kmalloc-64 23718 27392 64 kmalloc-192 70590 70812 192 This is what I have currently in my device that has been up for few ours: # name <active_objs> <num_objs> <objsize> kmalloc-1024 136 144 1024 kmalloc-512 686 720 512 kmalloc-256 1936 1936 256 kmalloc-128 2113 3008 128 kmalloc-64 16915 17664 64 kmalloc-192 4494 4515 192 The number of active objects in the 1024 and 192 byte buckets is suspiciously high for you. In a device that has had light usage for >3 days, 1024 bucket has 153 objects and 192 byte bucket < 5000. You have nearly seventy thousand more in both. Could you use some script that at suitable intervals saves /proc/slabinfo and ifconfig output while the device otherwise idles while you use it as a wireless access point? E.g. something like this (untested): ------ #!/bin/sh mkdir -p data; cd data for i in $(seq 100); do ifconfig > ifconfig-$i.txt # awk '{print $2,$1}' /proc/slabinfo|sort -n|tail -5 > slabinfo-$i.txt cat /proc/slabinfo > slabinfo-$i.txt sleep $((30*60)) done ------ If there's some specific slab item that always increases, while the device gets slower, then we got at least an easier method for detecting (and therefore isolating) the cause of slowdown. ("slabtop" from "procps" package can also show the slab usage at realtime.)
Generic comments: I stopped using my ad-hoc wifi sharing for a few days, and this problem hadn't recurred. Also, another thing I noticed while I was still using adhoc wifi was that sometimes, just after setting up an ad-hoc wifi and ssh'ing in, my N900 would spontaneously reboot. Bug in the wifi driver? Now answers to comment #11: I have Bluetooth on (hidden), but no active BT connections. Thank you for the instructions for getting extra info, I'll try to find some time to reproduce it --- if not this week, then definitely during next week, when I'm back in Sweden with 3G as my only connectivity source. (BTW procps is not installable on the n900: I get busybox-symlinks-procps instead, and that one doesn't have slabtop. IIRC I got my vmstat binary by compiling procps in scratchbox, but I couldn't install the whole package because it conflicted with busybox, and I didn't want to remove busybox and brick my 770. I've been copying the armel binary manually from the 770 to N800 to N810 and N900. I should package it properly one day.)
(In reply to comment #12) > Bug in the wifi driver? > > Now answers to comment #11: I have Bluetooth on (hidden), but no active > BT connections. Could be either ad-hoc mode or interaction between the different radio parts in the kernel causing internal kernel memory leakage, maybe only in your specific WLAN/phone environment... Now need to find out what exactly causes this. Just having normal WLAN or phone connection doesn't seem to cause any changes to kmalloc bucket counts. Ad-hoc mode not yet tested. > (BTW procps is not installable on the n900: I get busybox-symlinks-procps > instead, and that one doesn't have slabtop. You can install procps to scratchbox and copy the required binary from there to the device. Or you can just replace busybox-symlinks-procps with real procps, the symlink package just adds symlinks that point these things to busybox: /bin/kill /bin/ps /usr/bin/uptime /usr/bin/pkill /usr/bin/watch /usr/bin/free /usr/bin/pgrep /usr/bin/top And "procps" naturally provides (real versions of) these things among other things like slabtop.
Internal comment: > I was testing this with just adhoc and basic ping's of different sizes for a > couple of hours and into each direction. > > I was monitoring the /proc/slabinfo buckets, and saw no increase like the one > described in the error. -> more information needed on how to reproduce this issue.
I, on the other hand, can still reproduce the issue, without even particularly wanting to ;) First attempt of collecting ifconfig and slabinfo failed since I forgot to increment $i and ended up overwriting the information. Then my ssh connection died, which terminated the script. I noticed a few minutes later, ssh'ed back in and re-ran that script (after fixing it). About an hour later ssh connection dropped again. vmstat 30 running over ssh on another terminal showed iowait being ~80% for the last 4 minutes before both ssh connections dropped. After trying to reconnect for five minutes (and watching at the frozen UI on screen -- the clock and battery indicators didn't update for 6 minutes) I held down the power button for a while. Two or three more minutes later the device shut down. kmalloc-1024 numbers from the record: bug5712/slabinfo-0.txt:kmalloc-1024 8671 8672 1024 8 2 : tunables 0 0 0 : slabdata 1084 1084 0 that was the script that overwrote its data in place, sigh bug5712-run0/slabinfo-0.txt:kmalloc-1024 39007 39008 1024 8 2 : tunables 0 0 0 : slabdata 4876 4876 0 bug5712-run0/slabinfo-1.txt:kmalloc-1024 44736 44736 1024 8 2 : tunables 0 0 0 : slabdata 5592 5592 0 bug5712-run0/slabinfo-2.txt:kmalloc-1024 50464 50464 1024 8 2 : tunables 0 0 0 : slabdata 6308 6308 0 bug5712-run0/slabinfo-3.txt:kmalloc-1024 56192 56192 1024 8 2 : tunables 0 0 0 : slabdata 7024 7024 0 bug5712-run0/slabinfo-4.txt:kmalloc-1024 61911 61912 1024 8 2 : tunables 0 0 0 : slabdata 7739 7739 0 bug5712-run0/slabinfo-5.txt:kmalloc-1024 67679 67680 1024 8 2 : tunables 0 0 0 : slabdata 8460 8460 0 constant growth until death bug5712-run1/slabinfo-0.txt:kmalloc-1024 1534 1536 1024 8 2 : tunables 0 0 0 : slabdata 192 192 0 just after reboot There was little change in traffic during the hour covered by the monitoring script: gprs0 - RX bytes:39651379 (37.8 MiB) TX bytes:2145926 (2.0 MiB) + RX bytes:43291104 (41.2 MiB) TX bytes:2981780 (2.8 MiB) wlan0 - RX bytes:27058743 (25.8 MiB) TX bytes:45860584 (43.7 MiB) + RX bytes:28396816 (27.0 MiB) TX bytes:50342120 (48.0 MiB) Overall I was sharing my 3g connection for maybe 3 hours.
After another ssh disconnection I reran my monitor-ifconfig-and-slabinfo script under screen. kmalloc-1024 grew steadily from 6814 to 64247 during 1 hours and 20 min. Snapshots made every 10 min showed an increase of almost exactly 6000. Then I stopped my ssh sessions and suspended my laptop. Overnight the kmalloc-1024 figure stayed more or less constant at 66687 (with small variations to 66689 or 66693 or 66698). In the morning I reconnected with ssh -D 1080 (plus another ssh session to run vmstat 60, plus sshfs) and the growth resumed (currently to 72351). dmesg is completely filled with messages of the kind (one message every 30 seconds): [112022.223846] wlan0: No active IBSS STAs - trying to scan for other IBSS networks with same SSID (merge) [112052.684814] wlan0: No active IBSS STAs - trying to scan for other IBSS networks with same SSID (merge) Since the timing doesn't match (once per 30s rather than 10 times per second) and the messages continued during the quiet period, I guess it is not related to the kmalloc growth.
(In reply to comment #15) > kmalloc-1024 numbers from the record: > bug5712-run0/slabinfo-0.txt:kmalloc-1024 39007 39008 1024 8 2 : > tunables 0 0 0 : slabdata 4876 4876 0 ... > bug5712-run0/slabinfo-5.txt:kmalloc-1024 67679 67680 1024 8 2 : > tunables 0 0 0 : slabdata 8460 8460 0 > > constant growth until death Thanks! Was the 192 sized bucket growth at the same rate too? > There was little change in traffic during the hour covered by the monitoring > script: > > gprs0 > - RX bytes:39651379 (37.8 MiB) TX bytes:2145926 (2.0 MiB) > + RX bytes:43291104 (41.2 MiB) TX bytes:2981780 (2.8 MiB) > > wlan0 > - RX bytes:27058743 (25.8 MiB) TX bytes:45860584 (43.7 MiB) > + RX bytes:28396816 (27.0 MiB) TX bytes:50342120 (48.0 MiB) > > Overall I was sharing my 3g connection for maybe 3 hours. The amount of traffic above doesn't seem suspicious. (In reply to comment #16) > After another ssh disconnection I reran my monitor-ifconfig-and-slabinfo > script under screen. > > kmalloc-1024 grew steadily from 6814 to 64247 during 1 hours and 20 min. > Snapshots made every 10 min showed an increase of almost exactly 6000. Thanks. If it's growing that that rate, it seems easy to detect. What kind of other network activity you had running at the same time? Can you provide e.g. tcpdump data for both wlan & gprs interfaces? > Then I stopped my ssh sessions and suspended my laptop. Overnight > the kmalloc-1024 figure stayed more or less constant at 66687 (with small > variations to 66689 or 66693 or 66698). In the morning I reconnected with > ssh -D 1080 (plus another ssh session to run vmstat 60, plus sshfs) and > the growth resumed (currently to 72351). Here are couple of other things you could try to isolate the cause of the issue: - use just "ping" instead of ssh, both from and to the device, without any other network activity to/from/through the device. - have laptop idling over the night, on power, WLAN on, but not doing network activity (i.e. is the issue triggered by whatever the laptop is doing) > dmesg is completely filled with messages of the kind (one message every 30 > seconds): > > [112022.223846] wlan0: No active IBSS STAs - trying to scan for other IBSS > networks with same SSID (merge) > [112052.684814] wlan0: No active IBSS STAs - trying to scan for other IBSS > networks with same SSID (merge) > > Since the timing doesn't match (once per 30s rather than 10 times per second) > and the messages continued during the quiet period, I guess it is not related > to the kmalloc growth. Ok.
(In reply to comment #17) > Thanks! Was the 192 sized bucket growth at the same rate too? Yes: ~/MyDocs $ grep -- -192 bug5712-run0/slab* bug5712-run0/slabinfo-0.txt:kmalloc-192 44094 44100 192 21 1 : tunables 0 0 0 : slabdata 2100 2100 0 bug5712-run0/slabinfo-1.txt:kmalloc-192 49833 49833 192 21 1 : tunables 0 0 0 : slabdata 2373 2373 0 bug5712-run0/slabinfo-2.txt:kmalloc-192 55576 55587 192 21 1 : tunables 0 0 0 : slabdata 2647 2647 0 bug5712-run0/slabinfo-3.txt:kmalloc-192 61277 61299 192 21 1 : tunables 0 0 0 : slabdata 2919 2919 0 bug5712-run0/slabinfo-4.txt:kmalloc-192 66938 66948 192 21 1 : tunables 0 0 0 : slabdata 3188 3188 0 bug5712-run0/slabinfo-5.txt:kmalloc-192 72739 72765 192 21 1 : tunables 0 0 0 : slabdata 3465 3465 0 > (In reply to comment #16) > > After another ssh disconnection I reran my monitor-ifconfig-and-slabinfo > > script under screen. > > > > kmalloc-1024 grew steadily from 6814 to 64247 during 1 hours and 20 min. > > Snapshots made every 10 min showed an increase of almost exactly 6000. > > Thanks. If it's growing that that rate, it seems easy to detect. > > What kind of other network activity you had running at the same time? Nothing special: IM accounts on the n900, IM accounts on the laptop. Facebook widget on the desktop. I don't recall if I had email checking enabled. > Can you provide e.g. tcpdump data for both wlan & gprs interfaces? I'll see what I can do. > > Then I stopped my ssh sessions and suspended my laptop. Overnight > > the kmalloc-1024 figure stayed more or less constant at 66687 (with small > > variations to 66689 or 66693 or 66698). In the morning I reconnected with > > ssh -D 1080 (plus another ssh session to run vmstat 60, plus sshfs) and > > the growth resumed (currently to 72351). Continuation of the story: kmalloc-1024 grew to 88905, and soon after the device entered happy swapland, which, suprisingly, wasn't immediately fatal. I went over to the xterm running the sharing script (i.e. dnsmasq), hit ^c. A few minutes later the device recovered. It wasn't fully okay (still sluggish, still swapping out apps, but maintaining ~40 free megs rather than ~90, which meant the apps could be swapped in in a reasonable amount of time, with only occasional multi-second ui pauses). The sluggishness was unpleasant so I rebooted. I'm mostly mentioning this because previously the device usually died after reaching about 67000 whatevers in the kmalloc-1024 column. Oh, and I discovered messages from the OOM killer in dmesg (which I've saved to a file) after the temporary reprieve. It could be that it wasn't ^C but OOM-killer killing osso-xterm that made the device unfreeze. I think it killed osso-xterm, I can't easily check now because of bug 5824 (zombie ssh sessions ate all my PTYs; can't create new ones; can't open xterms; sshfs flaky due to local network conditions).
More data: - connevted via gprs0, enabled adhoc wifi, used ssh for a while without memory growth - lost gprs connection, left ssh connection open over wifi - disconnected ssh, growth persisted - netstat showed no tcp connections, growth persisted - disconnecting from wifi on my laptop stopped memory growth on the n900 after a couple of seconds Looks like wifi driver's fault to me. Reconnecting to wifi from my laptop, pinging the n900, even reconnecting with ssh didn't start memory growth immediatelly. After a longer period with wifi up and an active ssh connection with some periodic small traffic the growth resumed. gprs0 was down the whole time. I have periodic snapshots of ifonfig and slabinfo taken every 10 minutes.
(In reply to comment #19) > More data: > - connevted via gprs0, enabled adhoc wifi, used ssh for a while without memory > growth By "enabling adhoc wifi", do you mean you connect with N900 to some adhoc AP, or do you mean you configure the N900 to be an adhoc AP, and associate with some other device to it?
By "enabling adhoc wifi" I mean the setup mentioned in comment 5: ifconfig + iwconfig behind icd's back, so I can browse from my laptop by using an ssh socks proxy. "AP" doesn't really apply to adhoc connections, although since I'm running dnsmasq on the n900 to serve DHCP leases and do DNS resolution, you could call it a sort-of AP.
I have a setup, with the N900 configured from command line to adhoc mode (hence bypassing the icd), with a static IP. With a linux laptop, I connect with the N900, manually configuring a static IP there too. On the N900, I'm running the dropbear ssh server. On the N900, I have no UI applications running except one xterm. In addition, I have configured the automatic connection stuff in the UI so that icd does not intervene with my command line use of the wlan0 interface. I have had varying pings ongoing between the devices, and a ssh session from the laptop to the N900 varyingly running top, and varyingly inspecting /proc/slabinfo. The setup has been set-up several hours now, and the kmalloc bucket counters in slabinfo have not incremented at all. There is still something in the problematic setup that I'm not reproducing in my environment. May I query the OS of the laptop, the make and driver of the WLAN adapter in use with it?
Created an attachment (id=1558) [details] script to bring up adhoc wifi and start dhcp server on n900 The laptop is a Lenovo ThinkPad T61, running Ubuntu 9.10, with Intel PRO/Wireless 3945ABG driven by iwl3945.ko version 1.2.26ks. I use the attached script on the N900 side, and regular Network Manager on the laptop side. I haven't fully figured out when exactly the slabinfo growth starts up. So far a full-proof method for me is to 1. connect the N900 to the Internet via 3G 2. run sharing script (optional argument specifies ad-hoc wifi channel; I've been using 2 and 3) 3. connect to the N900 ad-hoc wlan from the laptop 4. ssh -D user@mg-n900.local from the laptop 5. start browsing (Google Reader, some ssh connections, etc.) I've noticed that sometimes the slabinfo growth doesn't start immediately (I've browsed for maybe an hour once and found ~140 allocations); sometimes it starts immediately (once I rebooted, did this exact sequence of steps and was forced to reboot again within less than 3 hours because of memory pressure issues). I've confirmed twice that disconnecting from the adhoc wlan on the laptop side stops slabinfo growth on the N900 almost immediately (within a couple of seconds: in an xterm I press up, enter, up, enter, up, enter to repeat my grep -- -1024 /proc/slabinfo, and I see the numbers going up a bit and then stopping and even decreasing by one or two). I'll see if I can find a shorter way of reproducing the issue.
(In reply to comment #23) > I've noticed that sometimes the slabinfo growth doesn't start immediately > (I've browsed for maybe an hour once and found ~140 allocations); sometimes > it starts immediately Hm. So in the worst case the trigger can actually be some specific traffic that happens otherwise through the WLAN...?
It _could_ also be something that the "PRO/Wireless 3945ABG" does / sends. I'll have to see if I can find one such card to try with.
FYI: Finally we were able to re-produce this. It's not related to gprs connection or specific WLAN access point, just ad-hoc WLAN with the device. It can be gotten also with two rovers connected to each other over ad-hoc network. However, triggering of this reliably requires running the test at least for a day. In the test we run, it took something like 13 hours until the leakage started, and it wasn't constant, at some point it stopped leaking for 5 hours. I don't think the exact triggering condition is yet known. Seems like it's more related to what kind of traffic there is (not sure at which level).
I experienced the similar behavior at least twice. I did install some stuff which created me ad-hoc WiFi but I don't use this connection - just WiFi. The last time it happens after I downloaded a tar backup of whole system from SDHC to host via WLAN and that was not immediate - I completed the download, did something unsignificant (X-Term, ps, ssh etc) and drive home and tried to make a phone call. Applications button freezes in "pressed" state, I was not able to handle it a couple of minutes (traffic) and it unfreezes later and phone application starts. The first time is difficult to remember but it seems to me that it happens right after backup of whole FS to SDHC via tar, again during driving and I just rebooted. So, before this ticket I counted it to slow SDHC (1.6MB/sec on WiFi G) and full buffers.
(In reply to comment #26) > just ad-hoc WLAN with the device. Aha. I have just had my N900 running for about 3-4 hours in ad-hoc wifi mode (to my laptop which is forwarding net connections), while charging. It was switched off at the start of that. Just now, I received an incoming phone call, and the UI was not responding to touches to accept the call. It turned out it was responding, but taking up to 10 seconds to process each touch. Voice transmission was somewhat compromised, so we couldn't actually talk. After closing that call, because it wasn't usable, I uses the right-side slider to blank the screen then unblank it (wondering why the "End" call button wasn't responding), and it took a similar amount of time to respond to the slider. The top button for "End task" also took ages to respond. I opened an xterm to look for rogue processes, and 'top' showed /usr/sbin/icd2 using about 4% CPU which it doesn't normally do (it doesn't normally show as a CPU user in top at all), plus X using about 3% which is normal. Until finding this bug report, it didn't occur to me to look at memory statistics. Now, it isn't even possible to unblank the screen: it's just not responding. Yet, it did ring a couple of times as the person on the other end still tried to reach me, and eventually left a message, and the alert light is now flashing. But can't unblank the screen :-) Ah, no, unblanking worked after about 40 seconds, or pressing the top button, depending on which event it was that triggered it eventually. There are no applications running (except X-terminal at this point), and the usual things on the desktop don't normally cause this problem even with 48 hours use. I have only the calendar widget, X-terminal shortcut, a contact shortcut, some bookmarks, etc. That is, the calendar widget is the only active item on the desktop. Other than that, a Skype connection was running, but again I have not had anything like this problem with Skype previously. The firmware is the PR1.1 release. Very interesting that ad-hoc wifi mode shows up in this bug report. I don't use it often; this is only the second time.
(In reply to comment #28) > Now, it isn't even possible to unblank the screen: it's just not responding. > Yet, it did ring a couple of times as the person on the other end still tried > to reach me, and eventually left a message, and the alert light is now > flashing. But can't unblank the screen :-) > > Ah, no, unblanking worked after about 40 seconds, or pressing the top button, > depending on which event it was that triggered it eventually. Well, it unblanked and showed the X-terminal where I was running top then stopped it, but the X-terminal's cursor is unfocused and it's not responding to key presses or touch to focus, or anything really. I tried holding down the top button, expecting it to power off after 5 seconds or so, like most phones will do. I'd assumed there would be a hard power off circuit for emergencies. But that didn't have any effect, so I've had to remove the battery to switch it off. I can only imagine that it was getting slow at the point where I answered that call (in the previous comment) and then opened the X-terminal, but then it became catastrophically slow. Most likely it won't be alerting about incoming calls at this point either. In my personal circumstances, I have a friend who may have a medical emergency, so this behaviour is quite bad if it makes incoming call alerts stop working. If the device ends up in a swap storm, it would be far better to trigger a watchdog reset than getting stuck like this, as it's effectively frozen. Simply reducing or removing swap isn't the solution, because it's needed for idle application memory, but removing the ability to get stuck would be good. Is the watchdog kept alive by the kernel running, or does it regularly poll core applications for their response time?
(In reply to comment #29) > But that didn't have any effect, so I've had to remove the battery to switch it off. Got that bit wrong. I was expecting to have to remove the battery because holding down the top button for plenty of time didn't force it off. But it did reboot itself, after a few minutes of not responding to anything. (I was distracted by an incoming call on another phone otherwise I wouldn't have waited long enough to see that). Unfortunately I wasn't able to get any memory statistics. But it was noticable that it went from very slow (up to 10 seconds to respond to touches, but keyboard inside X-terminal working quite well), to completely non-responsive within a few minutes, but had sat there charging for a couple of hours at least. The charge reached full less than an hour before the problems; I don't know if that's relevant. I don't use ad-hoc mode often but I will try it every few days now, to see if there's a consistent pattern. My laptop also uses the iwl3945.ko wireless driver module, with Ubuntu 9.10. -- Jamie
(In reply to comment #29) > Is the watchdog kept alive by the kernel running, or does it regularly poll > core applications for their response time? There's a separate watchdog kicker thread in DSME for that, but DSME is locked into RAM so swapping doesn't affect it much. Kernel Wifi module eating all device memory would eventually do that, but it would happen after user patience ends I guess. The SW watchdog polls critical applications, but if they don't respond it just restarts them. Only if they get restarted too many times in a row / within certain time, then device is rebooted. (With kernel leak causing swapping restart might be so slow that this time limit doesn't get hit). If you can find a use-case that doesn't require from several hours to one day to trigger the start of the leakage, or differs from the one in comment , please report it[1]. If you trigger this, please provide also your /proc/slabinfo data from the device so that it can be verified whether this is the same issue as the original reported had. (In reply to comment #27) > The first time is difficult to remember but it seems to me that it happens > right after backup of whole FS to SDHC via tar, again during driving and I > just rebooted. [1] Hm, the way it could be internally triggered (after 13 hours of running a test, with 5 hour gap in leaks some time after that) was running ls in one device from another device using ssh. Maybe the triggering needs both ad-hoc WLAN and file system access? The issue isn't currently reproducible on our latest internal release, but there were no specific fixes for this issue, so it's possible that the use-case we had just doesn't happen to trigger the issue anymore (within day).
(In reply to comment #31) > [1] Hm, the way it could be internally triggered (after 13 hours of running a > test, with 5 hour gap in leaks some time after that) was running ls in one > device from another device using ssh. Maybe the triggering needs both ad-hoc > WLAN and file system access? See my comment #27.
Retitling this bug to point at ad-hoc wifi and kernel memory usage as the culprit.
Today I had to use ad-hoc mode again, with the same laptop as before. This time, some new things happened, culminating in a reboot within about 5 minutes of starting ad-hoc mode: 1. It appeared to establish a link (the ad-hoc wifi icon appeared and soon stopped flashing), but then it was unable to reach the internet via the ad-hoc peer, my laptop. It's possible it could have been a laptop configuration issue, but I did check various things and it seems unlikely. 2. After I'd seen it could not reach the internet, and checked basic configuration on the laptop, I used the Disconnect button on the N900, and then searched for WiFi access points to reconnect. (I was intending to look at packets on the laptop using tcpdump, to see why it the N900 could not reach the internet). 3. For about a minute, the N900 did not show any WiFi access points, which is unexpected as there are 4 to 5 within range which show up normally, had shown up a few minutes earlier, and would show up in step 4. This included not showing the laptop, <1m away. It did show the 3G connection as available. 4. Then all the local WiFi access points appeared, and I selected the laptop. 5. A couple of minutes after that, I thought to myself "oh, I just remembered that ad-hoc mode has problems". And a few seconds later, the N900 spontaneously rebooted itself as I watched. I didn't have much open or running on it when it rebooted. I.e. it wasn't a case of already using lots of memory. Something blatantly triggered the reboot with a few minutes of ad-hoc mode starting. Not showing any WiFi access points for about a minute indicates a driver problem, not just a memory leak. I think it did not display it's usual spinner indicating that it's searching, during that time, either. Interesting: Not showing the ad-hoc laptop connection _might_ not be a driver issue. It normally shows that in the list of available connections _always_, including when the laptop is switched off. (Which is annoying, by the way.) I.e. it's possible something went wrong with the wireless connection manager app.... Or maybe that app only adds all the access points, including the ones it's going to show regardless of signal availability, after it gets some kind of response back from the kernel. Is it possible the memory leak + connection problems in ad-hoc mode are caused by the connection manager app, rather than the kernel driver?
> And a few seconds later, the N900 spontaneously rebooted itself as I watched. Please attach /dev/mtd2 contents or "sp-oops-extract /dev/mtd2" output (you need to be root for these).
I have configured ad hoc wireless network with my laptop (Network controller: Intel Corporation PRO/Wireless 3945ABG). I do not have problems described above, but when I run top in xterm process icd2 is consuming all time above 6% cpu. Most of the time icd2 is the first process in top list with highest cpu usage. After half an hour the phone is considerably warm and battery indicator is showing about 15% less charge. Internet was bridged from my laptop lan to ad-hoc wifi. No special data transfers was occuring between phone and laptop, there was just inactive connections to googletalk and skype and inactive shh connection from laptop to phone. I'm often using similar configuration with infrastructure mode wifi (with AP), but I never observed similar icd2 behaviour.
Created an attachment (id=2222) [details] mtdoops of kernel panic after crash 5 minutes after starting ad-hoc mode (In reply to comment #35) > > And a few seconds later, the N900 spontaneously rebooted itself as I watched. > > Please attach /dev/mtd2 contents or "sp-oops-extract /dev/mtd2" output (you > need to be root for these). sp-oops-extract wasn't installed nor did apt-get find it, so I read /dev/mtd2 and edited out the runs of 0xff. (Ooh, I like mtdoops, think I'll use it on my own devices in future. But we can see it needs a bit more work, because the panic log is mangled at block boundaries.) It's triggered in __bug, called from rate_control_get_rate, called from ieee80211_process_measurement_req. Please enjoy!
(In reply to comment #36) > I do not have problems described above, but when I run top in xterm process > icd2 is consuming all time above 6% cpu. Please file a separate bug about that with "use-time" keyword and attach to it output of "strace -f -p $(pidof icd2)" trace. Strace is available from the SDK tools repo. Instructions for SDK tools installation are at the end of this page: http://wiki.maemo.org/Documentation/devtools/maemo5 (In reply to comment #37) >> Please attach /dev/mtd2 contents or "sp-oops-extract /dev/mtd2" output >> (you need to be root for these). > > sp-oops-extract wasn't installed nor did apt-get find it, It's in the SDK tools repo (you get it also if you install e.g. Crash reporter). > so I read /dev/mtd2 and edited out the runs of 0xff Thanks! I'd rather get the raw one if you think there was something wrong in the log writing though... > (Ooh, I like mtdoops, think I'll use it on my own devices in future. > But we can see it needs a bit more work, because the panic log is mangled > at block boundaries.) ...as it should handle oopslogs spanning multiple blocks fine. Btw. latest upstream kernel version has some fixes to it (the stuff used by oopslog may sometimes call other parts of the kernel which assume kernel internal memory is fine, I understood latest upstream kernel isolates mtdoops better than the version we originally contributed). If you do find issues & do fixes to mtdoops (or backport them from later kernel version to ours), please file a bug about that with a patch, thanks! > It's triggered in __bug, called from rate_control_get_rate, called from > ieee80211_process_measurement_req. Please enjoy! This seems internal bug NB#106780. It may be caused by a station/AP not announcing correct supported rates and hence triggers an incorrect rate index from minstrel. According to that bug comments, it's not triggered if a fixed rate is set using e.g. "iwconfig wlan0 rate 1M fixed". It should be fixed in next public release (internal release version: 10.2010.03-12). A related upstream kernel bug: http://bugzilla.kernel.org/show_bug.cgi?id=12490 I didn't see any comments about this being related to the internal kernel memory leakage in ad-hoc case (about which this bug is about) though.
For the record I've experienced this too. I run an ad-hoc network between PC and N900, and leave it on all day. Firstly it completely drains my battery in no time. Then also grinds the N900 to a messy halt... slow to unusable state.... easy to replicate.
(In reply to comment #39) > For the record I've experienced this too. > I run an ad-hoc network between PC and N900, and leave it on all day. > > Firstly it completely drains my battery in no time. Then also grinds the N900 > to a messy halt... slow to unusable state.... easy to replicate. Do you also see the growth in /proc/slabinfo kmalloc-* alloc buckets? In the other bug you mentioned that the slowness went away when you disconnected WLAN, if that helps with the slowness and the slabinfo counts drop significantly, then the issue is not lost allocs, but e.g. some list used by the driver growing when it shouldn't.
FYI: Internal fixes to this are being tested (there were actually several small leaks in the WiFi driver) and if they're good, they'll be in the next public release.
Fixes have been included internally and two tests (26hours and 60hours) have shown that this is fixed internally. This has been fixed in package kernel 2.6.28-20101001+0m5 which is part of the internal build version 10.2010.10-9 (Note: 2009/2010 is the year, and the number after is the week.) A future public update released with the year/week later than this internal build version will include the fix. (This is not always already the next public update.) Please verify that this new version fixes the bug by marking this bug report as VERIFIED after the public update has been released and if you have some time. To answer popular followup questions: * Nokia does not announce release dates of public updates in advance. * There is currently no access to these internal, non-public build versions. A Brainstorm proposal to change this exists at http://maemo.org/community/brainstorm/view/undelayed_bugfix_releases_for_nokia_open_source_packages-002/
*** Bug 8645 has been marked as a duplicate of this bug. ***
Fix for this is now delivered with Pr1.2. Marius could you please verify that it has solved your problem
RX-51_2009SE_10.2010.19-1_PR_MR0: ~20 minutes of internet sharing with the old script, zero kmalloc-1024 growth so far. Bug looks very fixed to me.
Thank you, I kinda assumed it was fixed :) As you (In reply to comment #45) > RX-51_2009SE_10.2010.19-1_PR_MR0: ~20 minutes of internet sharing with the old > script, zero kmalloc-1024 growth so far. Bug looks very fixed to me. > Thank you. I kinda assumed it is fixed :) Since you are the reporter coudl you mark bbug as verified so that we close the issue