Bug 5712 (int-144089)

Summary: Ad-hoc wifi leaks kernel memory, puts N900 into unusable swap hell
Product: [Maemo Official Platform] Core Reporter: Marius Gedminas <marius>
Component: KernelAssignee: unassigned <nobody>
Status: VERIFIED FIXED QA Contact: linux-kernel-bugs
Severity: major    
Priority: Medium CC: andre_klapper, dymaxion, eero.tamminen, egoshin, ext-petri.t.pitkanen, jamie, josh, juuso.oikarinen, maemo, marius, roman.kapusta, tim
Version: 5.0/(3.2010.02-8)Keywords: performance
Target Milestone: 5.0/(10.2010.19-1)   
Hardware: N900   
OS: Maemo   
Attachments: /proc/slabinfo during high loadavg, massive swapin & swapout, lots of free ram
dmesg's output during the same
script to bring up adhoc wifi and start dhcp server on n900
mtdoops of kernel panic after crash 5 minutes after starting ad-hoc mode

Description Marius Gedminas (reporter) 2009-10-22 13:20:29 UTC
SOFTWARE VERSION:
(Control Panel > General > About product)
1.2009.41-10

STEPS TO REPRODUCE THE PROBLEM:
Try to use the N900 for a day or two, opening and closing apps periodically.

EXPECTED OUTCOME:
The N900 is usable and snappy.

ACTUAL OUTCOME:
It gets more and more sluggish until finally the UI totally freezes up, forcing
you to reboot (if you can) or pull the battery (if you can't).

REPRODUCIBILITY:
(always/sometimes/once)
Never happened during the first week of use after the Maemo Summit, happens
every day now.  It *might* be triggered by some software package I've
installed, but the symptoms resemble a kernel bug (see the "OTHER COMMENTS"
section).

EXTRA SOFTWARE INSTALLED:
account-plugin-idle    0.2    user/network    84
account-plugin-salut    0.2    user/network    80
amazon-installer    1.0+0m5    user/other    96
ap-installer    1.0+0m5    user/other    88
attitude    0.0.8    user/graphics    168
bluemaemo    0.3-7    user/multimedia    2380
bounce    1.0.0    user/games    22236
classic-gaming-wallpapers    1.3-1maemo3    user/desktop    4092
conboy    1:0.5.6    user/utilities    288
connection-switcher    0.3.3    user/other    100
dtg-installer    1.0-4+0m5    user/other    96
eyes-widget    1.0.0    user/desktop    188
facebook-installer    1.0-4+0m5    user/other    88
facebook-services    0.4.5    user/desktop    60
fbreader    0.10.7-1    user/tools    1464
fbreader-page-buttons    0.10.7-1    user/tools    40
feedhandler    2.0    user/network    88
foreca-installer    1.0+0m5    user/other    88
gpodder    2.0+git20091013-thp1    user/multimedia    1660
gps-data-logger    0.2-4    user/desktop    84
hermes    0.1.0    user/utilities    248
irreco-backend-internal-lirc    0.1.3    user/multimedia    92
irreco-core    0.8.8    user/multimedia    1048
irreco-for-n900    0.8.0    user/multimedia    44
libfribidi0    0.10.7-4    user/libs    160
liblinebreak    1.0-1    user/libs    100
libmpd0    0.12.0-1    user/multimedia    308
libzlibrary    0.10.7-1    user/libs    2280
lirc    0.8.4a-1fremantle30    user/multimedia    744
load-applet    0.4.5-2    user/utilities    100
maemo-pc-connectivity    0.4.13    user/tools    44
mc    1:4.6.2-pre1-1maemo5    user/tools    6388
mmpc    0.2.1    user/multimedia    596
more-gconv-modules    2.3.5cs2005q3.2-4.osso10    user/office    1024
numptyphysics    0.3.0.5    user/games    1048
ogg-support    1.0.5    user/multimedia    136
openntpd    3.9p1-4-maemo3    user/other    172
openssh    1:5.1p1-6.maemo5    user/network    8
openssh-client    1:5.1p1-6.maemo5    user/network    1192
openssh-server    1:5.1p1-6.maemo5    user/network    456
osm2go    0.7.21-maemo1    user/navigation    4216
pastebinit    0.11-1maemo2    user/utilities    328
pc-connectivity-manager    0.7.10    user/connection    5140
personal-gprs-mon    0.5-5    user/desktop    140
personal-ip-address    0.3-6    user/desktop    120
pwsafe    1.5.5d    user/utilities    112
rdesktop-cli    1.6.3-maemo2    user/tools    536
rootsh    1.5    user/system    32
simple-brightness-applet    1.1-1maemo0    user/desktop    64
telnet    0.17-maemo2    user/network    160
transitioncontrol    1.1.1    user/system    104
ttf-droid    1.01-dfsg0maemo3    user/graphics    4392
tuner    0.1.1    user/multimedia    132
tutorial-home-applet    0.6.14+0m5    user/support    1272
unzip    5.52-10    user/utils    364
vim    7.2-0maemo5    user/utilities    20672
vncviewer    0.6.3-fremantle2    user/network    224
x11vnc    0.9.3-maemo6    user/utilities    1436
xchat    2.8.6-maemo11    user/network    744
zip    2.32-1maemo2    user/tools    168


OTHER COMMENTS:

When this happens, I ssh into my n900 and try to investigate using htop and
vmstat (built from procps sources, not packaged into a deb).  I see the CPU
spending 80-95% time in iowait state.  I see a loadavg fluctuating between 17
and 23, usually at around 20.  I see about 130 megs of swap space used.  I see
a lot of free RAM (free as in not used for apps, not used for cache, not used
for buffers).  The free RAM number slowly increases: from 70 to 100 megs.  The
system is swapping a lot: about 5 megabytes per second swapin, about 1 megabyte
per second swapout.  The working set of applications doesn't fit in memory, but
the kernel is keeping nearly a hundred megs of RAM unused -- why?   This smells
like a kernel bug to me.

vmstat excerpt:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1 12 119620  97156     76  15236 10580  988 10608   988 9982 10767  1 15  0 84
 2 16 119780  94408     76  15692 5380 1260  5384  1260 5351  5429  0  6  0 94
 3 19 119964  95404     84  15412 3516  884  3688   884 4475  4785  3 19  0 77
 2 18 119992  91264     84  16060 5856  284  5856   284 5063  5311  3 17  0 81
 2 17 120116  89764     84  16352 5600  908  5608   908 5087  6194  0 14  0 86
 2 19 120188  89884     84  16272 5756  816  5764   816 5322  5884  0 11  0 89
 1 18 120276  91684     80  15992  680  480   696   480 1026   979  1 17  0 82
 2 17 120372  89020     92  16428 7400  884  7488   884 7396  7844  2 13  0 85
 0 15 119976  91668     84  16096 7808 2048  7920  2048 8711  8521  0  8  0 91
 0 16 120144  92380     80  15992 5116 1040  5156  1040 4856  4933  3 13  0 84
 2 17 120232  95500     80  15752 3072  840  3076   840 3817  4446  0  4  0 96
 2 18 120232  95500     80  15796 5624  684  5632   684 4940  5100  0  7  0 93
 2 17 120356  93580     80  15296 8100  908  8148   908 8012  7564  3 21  0 76
 2 16 120552  97120     80  15592 12700 2072 12704  2072 12065 12535  1  8  0
91
Comment 1 Lucas Maneos 2009-10-22 19:19:29 UTC
Does dmesg show anything that might be relevant?
Comment 2 Marius Gedminas (reporter) 2009-10-22 22:40:50 UTC
Sadly, no.  The only messages in dmesg were about wlan signal strength
switching between low and high (a *lot* of those), slide being open/closed, and
kb_lock being open/closed.

I have a dmesg from an earlier occasion a few days back, when the kernel
decided to OOM-kill my browser for unknown reasons.  The UI was sluggish but
perfectly responsible back then; I restarted the browser and continued happily
using it.  What is somewhat suspicious is that, if I'm interpreting the kernel
message correctly, there was plenty of free RAM at the time the OOM killer
triggered, so maybe it is related to this bug?  Judge for yourself:

[ 8589.538574] Spurious irq 95: 0xffffffdf, please flush posted write for irq
86
[ 8620.396881] icd2 invoked oom-killer: gfp_mask=0xd0, order=2, oomkilladj=0
[ 8620.396911] [<c0030470>] (dump_stack+0x0/0x14) from [<c008e618>]
(oom_kill_process+0x5c/0x1f0)
... snip long stack trace ...
[ 8620.397369] [<c0054cbc>] (do_fork+0x0/0x2d0) from [<c002f9c0>]
(sys_clone+0x34/0x3c)
[ 8620.397399] [<c002f98c>] (sys_clone+0x0/0x3c) from [<c002c940>]
(ret_fast_syscall+0x0/0x2c)
[ 8620.397430] Mem-info:
[ 8620.397460] Normal per-cpu:
[ 8620.397460] CPU    0: hi:   90, btch:  15 usd:   0
[ 8620.397491] Active_anon:156 active_file:0 inactive_anon:320
[ 8620.397491]  inactive_file:45 unevictable:4733 dirty:0 writeback:34
unstable:0
[ 8620.397521]  free:23379 slab:25337 mapped:2642 pagetables:1675 bounce:0
[ 8620.397552] Normal free:93516kB min:2036kB low:2544kB high:3052kB
active_anon:624kB inactive_anon:1280kB active_file:0kB inactive_file:180kB
unevictable:18932kB present:260096kB pages_scanned:0 all_unreclaimable? no
[ 8620.397583] lowmem_reserve[]: 0 0
[ 8620.397583] Normal: 2407*4kB 10466*8kB 2*16kB 2*32kB 1*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 0*2048kB 0*4096kB = 93516kB
[ 8620.397644] 3133 total pagecache pages
[ 8620.397674] 476 pages in swap cache
[ 8620.397674] Swap cache stats: add 286699, delete 286223, find 23896/37254
[ 8620.397705] Free swap  = 668132kB
[ 8620.397705] Total swap = 786424kB
[ 8620.407867] 65536 pages of RAM
[ 8620.407897] 23973 free pages
[ 8620.407928] 4276 reserved pages
[ 8620.407928] 15660 slab pages
[ 8620.407928] 40255 pages shared
[ 8620.407958] 476 pages swap cached
[ 8620.407958] Out of memory: kill process 1637 (browserd) score 706347008 or a
child
[ 8620.408111] Killed process 1637 (browserd)

Am I reading it right?  The 23973 free pages (= 93 megabytes, assuming a page
size of 4K) were free *before* browserd was killed?
Comment 3 Lucas Maneos 2009-10-23 00:04:54 UTC
That looks seriously  pathological, but I haven't seen that behaviour here. 
Right now free reports:

              total         used         free       shared      buffers
  Mem:       245556       241288         4268            0        13928
 Swap:       786424       202336       584088
Total:      1031980       443624       588356

and the device is generally responsive.

One thing I noticed is that /proc/sys/vm/swappiness is now tuned all the way up
to 100, maybe that's related?

Adding Andre to CC: list, I think this is worth forwarding internally to get
some feedback.
Comment 4 Eero Tamminen nokia 2009-10-23 18:41:32 UTC
Interesting.  We haven't seen anything like this internally (lots of swapping
with free RAM).

Can you attach output from Xrestop?

Is there anything "suspicious" looking in /proc/slabinfo?

After getting those infos into safety, does doing "killall hildon-desktop" to
restart Desktop help anything?  (I wonder whether there could be something
funny related to X / SGX that triggers issues)


> the kernel decided to OOM-kill my browser for unknown reasons.

Because browser + flashplayer and the content they execute from the network can
consume infinite amount of resources, Browser is set as OOM-preferred (along
with couple of other things).  That's why it's usually one of the first things
to go when (kernel thinks that) device gets really low on RAM.
Comment 5 Marius Gedminas (reporter) 2009-10-23 23:49:47 UTC
I think the device may be doing it again.  Only this time I have no running
apps (except for a single xterm), so the load is low and UI is mostly
responsive (sliding the lock to unlock keyboard & screen feels slow, but that
may be just my anxiety speaking).  vmstat (this time with a 10s period instead
of 1s) shows slow swapout and an increase in free RAM:

~ $ vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1 141852  84576     56  16860   26    9    37    33  255   263  9  4 85  2
 5  0 141856  85312     48  16584  157   70   157    70  744  1928  6  9 79  6
 0  0 141856  83196     44  16636  375   12   375    12  912  2240  7 10 75  8
 0  3 141928  83364     44  17024  296  215   296   215  847  1792  6 10 72 11
 1 13 141984  84040     48  16440 1758  259  1759   259 1992  3060  9 13 45 32
 0  7 142144  60064     68  17236 4282  164  4286   164 3850  5153  2 14  0 84
 0  0 142116  58980     76  17432  893   50   893    54 1180  2332  3  7 73 17
 0  0 142116  57488     76  17428  245    7   245     7  921  2294  4 10 83  3
 0  0 142132  64472     72  17228  108   60   108    60  702  1848  4  9 83  4
 0  0 142132  65016     72  17180    2    7     2     7  552  1649  2  6 91  0
 0  0 142144  69480     68  17132    4   33     4    33  671  1925  3  7 84  6
 4  1 142160  75316     60  16996   68   48    68    48  654  1793  2  7 85  6
 0  0 142256  78056     60  16736   30   84    30    84  621  1736  2  6 89  3
 1  1 142368  60480     60  17584 2363  259  2366   259 2502  3587  3 11 34 52
 2  0 142452  67480     92  17088  500  404   508   416 1114  2271  2  8 43 48
 0  0 142456  66972    100  17104  120    0   120     6  596  1508  1  5 85 10
 3  0 142456  68544     96  17056   38   43    38    44  613  1808  1  5 91  3
 1  0 142464  71892     92  16804  101   37   101    37  658  1770  2  7 84  7
 1  0 142600  76096     84  16244 1182  251  1183   256 1654  2722  7  9 37 47
 1  0 142632  82148    104  15560  184  182   185   195  783  1974  3  6 60 31
 0  0 142668  82776     96  15112  274  117   274   120  847  1930  3  6 82  9
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0 142668  83280     88  15080   28   30    28    30  600  1725  2  6 90  2
 1  1 142684  82972     88  14984  136   31   136    31  714  1867  2  7 88  3
 0  0 142700  83348     88  14756  178  186   178   186  752  1875  2  6 81 12
 0  0 142700  83728     88  14776   26   34    26    34  571  1685  2  5 92  1
 0  0 142704  85364     88  14496    7   53     7    53  558  1672  1  6 92  1

uptime is 1 day, 10:08 again (coincidence?  When I was reporting this bug I
noticed uptime being 1 day and 10 hours with some minutes).

I can try running something (a web browser) to see if that would provoke the
high loads I've seen yesterday.  I'm a bit reluctant to do that since it makes
even ssh sessions sluggish, and then it's hard to collect information.

Currently xrestop shows 

          Monitoring 31 clients. XErrors: 0
          Pixmaps:    2930K total, Other:     166K total, All:    3096K total

which doesn't seem like much.

/proc/slabinfo is something I'm not familiar with so I cannot say what's
suspicious.  I can attach a copy here, after I provoke the high load.

The "unknown reasons" bit was supposed to mean "I don't know why the kernel
thought it was out of memory" rather than "why the kernel picked the browser
instead of something else".

One thing that I think I should mention: since this Monday I started using my
N900 as a wireless "access point", sharing my 3G connection with my laptop. 
It's accomplished by running a simple shell script on the n900, and then
ssh'ing into it from my laptop over ad-hoc wifi and enabling the SSH socks
proxy emulation with ssh -D 1080.  The script itself does this:

IFACE=wlan0
ADDR=10.20.30.1
NETMASK=255.255.255.0
ADDR_MIN=10.20.30.10
ADDR_MAX=10.20.30.126

CHANNEL=$1

sudo /etc/init.d/avahi-daemon start
sudo ifconfig $IFACE down # otherwise iwconfig mode says device busy
sudo iwconfig $IFACE mode ad-hoc
if [ -n "$CHANNEL" ]; then
    sudo iwconfig $IFACE channel $CHANNEL
fi
sudo ifconfig $IFACE up # otherwise iwconfig essid times out
sudo iwconfig $IFACE essid n900
sudo ifconfig $IFACE $ADDR netmask $NETMASK up
echo "Sadly, NAT is not supported yet so you'll have to use ssh -D 1080
user@$ADDR and use the SOCKS5 proxy on localhost:1080."
echo "When you want to disconnect, press Ctrl+C."
echo "XXX in 5 minutes the n900 will decide wlan0 is not used and power it
down, breaking the sharing anyway :(  to wake it up  sudo ifconfig $IFACE up;
sudo iwconfig $IFACE essid n900;"
sudo /usr/sbin/dnsmasq -i $IFACE -a $ADDR -I lo -z -d \
                       -x /var/run/dnsmasq.$IFACE.pid \
               --dhcp-range=$ADDR_MIN,$ADDR_MAX,6h \
               --dhcp-option=3,$ADDR \
               --dhcp-option=6,$ADDR

I use it this way for hours every evening.  It could be a coincidence, but my
troubles started about the same time I started doing this.

In the meantime, vmstat, still running in the background, now shows

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0 143320  92700     48  11656  460   68   460    68  948  1996  1  6 78 15
 1  0 143320  94692     48  11448  135   29   135    29  649  1683  1  5 87  7
 3  1 143320  95616     48  11292  189   38   193    38  745  1985  1  6 81 12
 1  0 143324  96856     48  11216  167   24   171    24  638  1524  1  5 87  7
 1  0 143332  92948     48  11144  813   96   818    96 1262  2441  2  8 73 17
 2  1 143332  93336     48  11368 1535  144  1539   144 1810  2886  1  8 47 44
 1  0 143332  90720     48  11484  433   34   433    34  920  2035  1  5 84  9
 1  0 143332  87312     48  11524 1777  224  1777   224 1995  2906  1  8 46 45
 2  0 143344  92008     48  11480 1719  196  1719   196 1956  3005  2  7 45 46
 2  2 143344  95276     48  11612 1940  251  1947   251 2246  3372  2  9 31 58
 2  0 143348  91692     48  11236 1482  164  1482   164 1785  2722  1  8 40 50
 2  3 143352  94068     48  11216 1874  416  1881   416 2247  3380  1  8 19 72
 1  0 143348  91348     48  11388 1062   22  1062    22 1395  2440  1  6 59 35

the amount of swapout has markedly increased, as well as time spent in iowait.

I can't wake up the screen by sliding the keylock slider.  Tapping on it turned
on the backlight, but the screen is black.  Sliding out the keyboard turns on
its backlight too.

I just lost my ssh connection:

 1  0 143348  91348     48  11388 1062   22  1062    22 1395  2440  1  6 59 35
 2  2 143348  89316     48  11252 1175  229  1175   229 1534  2663  1  6 46 47
 2  1 143360  92984     48  11264 1946  276  1953   276 2231  3204  1  8 24 67
 2  7 143364  96272     48  11396 2645  496  2648   496 2731  3380  1  9  0 90
 2 18 143548  96196     48  10988 4359  580  4364   580 4435  5575  1 11  0 88
 2  5 143612  96076     48  10984 3591  326  3600   326 3588  4790  1 11  0 88
 2 10 143616  93572     48  10928 4440  371  4447   371 4333  5676  1 10  0 89
 2  6 143612  82160     48  12000 4548  121  4556   121 4326  5773  2 15  0 84
Timeout, server not responding.

The n900 responds to pings with a typical TTL of under 4ms.  I can barely ssh
in again.  The session is very slow.

~ $ uptime
 22:34:19 up 1 day, 10:29, load average: 27.56, 17.36, 8.24

Looks like I won't need to attempt to start a web browser in order to reproduce
this.

~ $ cat /proc/slabinfo > slabinfo-Im-melting.txt
~ $ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 3 15  77160  80532     48  14648   41   11    51    34  271   292  9  4 85  3
 0 13  77180  81424     48  14452 3036  316  3040   316 4809  7143  2 11  0 87
 1 10  77256  84644     48  13468 3032  924  3032   924 3523  4455  3 16  0 81
 2 10  77444  87036     48  12808 2792 2080  2796  2080 3809  4428  2 12  0 86

uptime is decreasing now

 22:37:55 up 1 day, 10:33, load average: 7.62, 14.59, 9.21

but the screen is still black with no GUI available.

Interesting!  dmesg contains enough messages from the OOM killer to have filled
the ring buffer.  When I started typing this comment I checked dmesg and it had
nothing except for the usual wlan messages etc.

load is rising again:

 0  1  88844  89280     72   9280 4616    4  4616     4 4198  4974  1 12  0 87
 0  1  88844  93352     72   9280    0    8     0     8  453  1626  0  8  0 92
 0  2  88844  98496     72   9276   60 1620    60  1620 1108  1907  0  9  0 91
 2  4  88844  98128     72   8664 4376  676  4452   676 4377  5598  1 16  0 84
 2  5  88844  99612     72   8616 1352  376  1428   376 1987  2969  0  9  0 91
 0  8  88840  96664     72   9024 5440    4  5440     4 4777  5400  0 11  0 89
 1  5  88840  93828     72   9180 5408    4  5408     4 4773  5674  1 13  0 86
 1  6  88840  91300     72   9280 5380    0  5380     0 4912  5899  3 11  0 86
 2 11  88848  98984     72   8476 4964 2340  5036  2340 5345  6331  1 10  0 90
 2 12  88848  98840     72   8516 9724  736  9800   736 9298 10787  0 13  0 87
 3  8  88848  98092     72   8848 6456  380  6456   380 6048  7022  0 15  0 85
 2 10  88848  98308     72   8676 5148  588  5220   588 5102  6449  0 12  0 88
 2 11  88848  98768     72   8596 6304  648  6380   648 6577  7705  0 16  0 84
 2 12  88848  98468     72   8732 5656  516  5732   516 5368  6356  0 15  0 85
 2 12  88848  98152     72   8776 6172  620  6248   620 6334  7367  1 11  0 88
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2 13  88848  98124     72   8760 7672 1004  7756  1004 7919 10072  0 18  0 82
 1 12  88848  97776     72   8844 4864  228  4864   228 4763  5366  1  8  0 91
 2 14  88848  98360     72   8668 6676  872  6752   872 6465  6804  0 13  0 87
 3 17  88844  97940     72   8864 6964  464  7040   464 6713  6960  1 13  0 86
 1 13  88848  97636     72   8960 5368  476  5368   476 4854  5541  1 12  0 87
 2 14  88848  98284     72   8728 3976  668  4052   668 4409  5353  1 14  0 85
 2 15  88848  98512     72   8712 5964  516  6040   516 6280  7468  0 13  0 87
 2 15  88848  98492     72   8916 6428  592  6504   592 6391  7461  0 19  0 81
 2 11  88848  98188     72   8788 5676  376  5752   376 5509  6385  1 15  0 84
 2 18  85736  97372     72   9336 12928 1328 13008  1328 12677 15245  1 18  0
81
 2 20  85740  98420     72   8596 5376  908  5452   908 5754  7102  0 12  0 88
 2 16  85736  97280     72   8800 7340  540  7416   540 7351  8529  2 15  0 83
Timeout, server not responding.

meanwhile in my other ssh session

~ $ uptime
 22:41:57 up 1 day, 10:37, load average: 12.90, 10.36, 8.46

I'd like to attempt to killall hildon-desktop now, but my one still-working ssh
session just timed out, and new ones can't be easily established.  (ping still
works great.)

Looks like the device either rebooted, or at least restarted hildon-desktop. 
The UI is back, applets are showing up one by one.  The UI is snappy and
responsive.

~ $ uptime
 22:46:43 up 2 min, load average: 2.29, 1.42, 0.56

Yep, it rebooted by itself.
Comment 6 Marius Gedminas (reporter) 2009-10-23 23:52:55 UTC
Created an attachment (id=1492) [details]
/proc/slabinfo during high loadavg, massive swapin & swapout, lots of free ram
Comment 7 Marius Gedminas (reporter) 2009-10-23 23:53:26 UTC
Created an attachment (id=1493) [details]
dmesg's output during the same
Comment 8 Marius Gedminas (reporter) 2009-10-23 23:59:28 UTC
Brain not working.  In the above comment, make the following corrections:

s/ping with a typical TTL/ping with a typical roundtrip time/

s/uptime is decreasing/loadavg is decreasing/
Comment 9 Marius Gedminas (reporter) 2009-10-24 01:15:50 UTC
Apparently there's no need to wait 1 day and 10 hours.  After the boot I ran my
share-wifi script again, ssh'ed and ran vmstat.  In the beginning things looked
like this:

~ $ vmstat 1
Unknown HZ value! (93) Assume 100.
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1      0   2484  55216  71476    0    0   454    36  798  1260 38 20 36  7
 1  0    128   3076  55216  70832    0  124     0   124  613  2167  2  6 50 42
 0  0    128   3068  55216  70832    0    0     0     0  634  2336  4  5 91  0
 0  0    128   3060  55216  70832    0    0     0     0  486  1673  2  4 94  0

I hour later

 0  0  91520  75096    320  23620    0    0     0     0  574  1836  1  5 94  0
^C
~ $ uptime
 23:54:22 up  1:10, load average: 0.47, 0.66, 0.71

then I tapped the screen to wake it up

 0  0  91520  74376    320  23656    0    0     0     0  431  1594  0  6 94  0
 0  0  91520  74368    320  23656    0    0     0     0  562  1771  0  6 94  0
 0 11  91520  62480    328  24376 11076    0 11084     0 3957  3522  3 26 15 56
 1  5  91864  54892    292  24964 9608  444  9608   444 4734  4851 22 39  0 40
 0  0  92048  52004    268  25112 5584  264  5616   264 2871  3361 23 25 25 26
 0  0  92048  51996    268  25112    0    0     0     0  598  1739  4 11 85  0
 0  0  92048  51988    268  25112    0    0     0     0  512  1784  2 10 88  0
 0  0  92048  51980    268  25112    0    0     0     0  607  1851  1 11 88  0
 2  4  92100  48304    264  25368 4592   84  4592    84 2240  3248  9 26 47 19
 7  3  93168  52084    244  24876  796 1336   796  1336 1536  2870 16 21 33 31
 0  0  93420  52676    244  24720  128  256   132   256  780  2176  6 12 80  3
 0  0  93420  52668    244  24720    0    0     0     0  558  1907  3  6 91  0
 0  0  93420  52480    244  24756  304    0   304     0  799  2089  4 12 82  2
 0  0  94504  51836    260  24616 2524 1176  2564  1184 2139  3136 15 18 31 37
 0  1  94504  51820    264  24660    0    0     0    36  567  1792  2  6 80 12

and now it's slowly but steadily swapping things out again

 0  0  99796  69912    172  22636    0    0     0     0  464  1660  0  5 95  0
^C
~ $ uptime
 00:00:10 up  1:15, load average: 0.61, 0.69, 0.71

Either this slow swap-everything-out behaviour is normal (swappiness *is* set
to 100 after all) and I'm focusing on the wrong symptoms, or the problem shows
up immediately and it's more or less constant usage that keeps things in memory
and lets me use my N900 all day, postponing the freeze.  Only I don't actually
use it constantly; during daytime it's lying on the desk with wifi on
(connected to office wlan) and 3g off, and I pick it up maybe once an hour.

Hm, I just got an incoming IM message which woke up the screen and dropped free
ram usage a bit:

 0  0 102512  78780    112  22036    0    0     0     0  658  1853  0  4 96  0
Timeout, server not responding.
$ ssh -D 1080 mg-n900
~ $ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1 105844  60716    164  23756   32   23    52    26  675  1990  5  9 84  2

In the office I get an incoming mail notification every 30 minutes (the sound
and the vibration are off so those don't bother me).  Modest by default checks
for new mail only when connected via WLAN, but not via 3G.

Grasping at straws here.  So far what I've observed in this comment seems to be
perfectly normal behaviour.  Perhaps bad things start happening when too many
pages get evicted?  In previous vmstat logs the load starts rising when free
ram reaches ~100 megs.

Meanwhile another wakeup happened, apparently, although the screen is blank now
and I don't know what caused the activity:

 0  0 110020  78604     68  20248    0    0     0     0  576  1840  0  6 94  0
 0  0 110020  78596     68  20248    0    0     0     0  435  1633  0  4 96  0
 1  5 110020  70908     80  20456 7680    0  7692     0 4256  5430  6 24 15 55
 1  0 110020  60812     80  20604 9976    0  9976     0 4757  5069  6 16  0 78
 0  0 110020  59664     80  20808  896    0   900     0  874  1668  1  8 87  4
 0  0 110020  59536     80  20808  112    0   112     0  518  1724  4  6 90  0

~$ uptime
 00:14:49 up  1:30, load average: 0.77, 0.69, 0.69

I'm also thinking that this kind of constant background swapin/swapout cannot
possibly be good for the internal eMMC card.
Comment 10 Marius Gedminas (reporter) 2009-10-24 12:21:22 UTC
So, overnight  I stopped dnsmasq and took wlan0 down, just in case.  In the
morning I looked at the tablet, was happy to find it alive (same xterm still
open) and responsive, checked vmstat quickly, saw < 10 megs free RAM being
used.  All appears well.  I restart my sharing script, ssh from the laptop, try
to get uptime (12:22, load 0.58, 0.26, 0.09) and vmstat (ssh session times out
before it can print any numbers).  I walk back to the N900 and see that it has
just rebooted.

~ $ cat /proc/bootreason 
sw_rst

Last night, after the reboot during the high load, bootreason was 32wd_to.
Comment 11 Eero Tamminen nokia 2009-10-27 18:56:35 UTC
Are you using other radios than phone and wlan (in ad-hoc mode) at the same
time?

Can you attach also ifconfig output from the situation when the device is
getting quite slow so that we can see how much data has been transferred over
these interfaces?


(In reply to comment #6)
> Created an attachment (id=1492) [details] [details]
> /proc/slabinfo during high loadavg, massive swapin & swapout, lots of free ram

This is what you had:

# name            <active_objs> <num_objs> <objsize>
kmalloc-1024       66854  66856   1024
kmalloc-512         1163   1168    512
kmalloc-256          456    624    256
kmalloc-128         2129   4256    128
kmalloc-64         23718  27392     64
kmalloc-192        70590  70812    192

This is what I have currently in my device that has been up for few ours:

# name            <active_objs> <num_objs> <objsize>
kmalloc-1024         136    144   1024
kmalloc-512          686    720    512
kmalloc-256         1936   1936    256
kmalloc-128         2113   3008    128
kmalloc-64         16915  17664     64
kmalloc-192         4494   4515    192

The number of active objects in the 1024 and 192 byte buckets is suspiciously
high for you.  

In a device that has had light usage for >3 days, 1024 bucket has 153 objects
and 192 byte bucket < 5000.  You have nearly seventy thousand more in both.


Could you use some script that at suitable intervals saves /proc/slabinfo and
ifconfig output while the device otherwise idles while you use it as a wireless
access point?  E.g. something like this (untested):
------
#!/bin/sh
mkdir -p data; cd data
for i in $(seq 100); do
  ifconfig > ifconfig-$i.txt
  # awk '{print $2,$1}' /proc/slabinfo|sort -n|tail -5 > slabinfo-$i.txt
  cat /proc/slabinfo > slabinfo-$i.txt
  sleep $((30*60))
done
------

If there's some specific slab item that always increases, while the device gets
slower, then we got at least an easier method for detecting (and therefore
isolating) the cause of slowdown.

("slabtop" from "procps" package can also show the slab usage at realtime.)
Comment 12 Marius Gedminas (reporter) 2009-10-28 15:35:45 UTC
Generic comments: I stopped using my ad-hoc wifi sharing for a few days, and
this problem hadn't recurred.  Also, another thing I noticed while I was still
using adhoc wifi was that sometimes, just after setting up an ad-hoc wifi and
ssh'ing in, my N900 would spontaneously reboot.

Bug in the wifi driver?

Now answers to comment #11: I have Bluetooth on (hidden), but no active BT
connections.

Thank you for the instructions for getting extra info, I'll try to find some
time to reproduce it --- if not this week, then definitely during next week,
when I'm back in Sweden with 3G as my only connectivity source.

(BTW procps is not installable on the n900: I get busybox-symlinks-procps
instead, and that one doesn't have slabtop.  IIRC I got my vmstat binary by
compiling procps in scratchbox, but I couldn't install the whole package
because it conflicted with busybox, and I didn't want to remove busybox and
brick my 770.  I've been copying the armel binary manually from the 770 to N800
to N810 and N900.  I should package it properly one day.)
Comment 13 Eero Tamminen nokia 2009-10-28 18:14:41 UTC
(In reply to comment #12)
> Bug in the wifi driver?
> 
> Now answers to comment #11: I have Bluetooth on (hidden), but no active
> BT connections.

Could be either ad-hoc mode or interaction between the different radio parts in
the kernel causing internal kernel memory leakage, maybe only in your specific
WLAN/phone environment...  Now need to find out what exactly causes this.  Just
having normal WLAN or phone connection doesn't seem to cause any changes to
kmalloc bucket counts.  Ad-hoc mode not yet tested.


> (BTW procps is not installable on the n900: I get busybox-symlinks-procps
> instead, and that one doesn't have slabtop.

You can install procps to scratchbox and copy the required binary from there to
the device.

Or you can just replace busybox-symlinks-procps with real procps, the symlink
package just adds symlinks that point these things to busybox:
  /bin/kill
  /bin/ps
  /usr/bin/uptime
  /usr/bin/pkill
  /usr/bin/watch
  /usr/bin/free
  /usr/bin/pgrep
  /usr/bin/top
And "procps" naturally provides (real versions of) these things among other
things like slabtop.
Comment 14 Eero Tamminen nokia 2009-10-30 12:42:29 UTC
Internal comment:
> I was testing this with just adhoc and basic ping's of different sizes for a
> couple of hours and into each direction.
>
> I was monitoring the /proc/slabinfo buckets, and saw no increase like the one
> described in the error.

-> more information needed on how to reproduce this issue.
Comment 15 Marius Gedminas (reporter) 2009-11-05 01:47:13 UTC
I, on the other hand, can still reproduce the issue, without even particularly
wanting to ;)

First attempt of collecting ifconfig and slabinfo failed since I forgot to
increment $i and ended up overwriting the information.  Then my ssh connection
died, which terminated the script.  I noticed a few minutes later, ssh'ed back
in and re-ran that script (after fixing it).  About an hour later ssh
connection dropped again.  vmstat 30 running over ssh on another terminal
showed iowait being ~80% for the last 4 minutes before both ssh connections
dropped.  After trying to reconnect for five minutes (and watching at the
frozen UI on screen -- the clock and battery indicators didn't update for 6
minutes) I held down the power button for a while.  Two or three more minutes
later the device shut down.

kmalloc-1024 numbers from the record:

bug5712/slabinfo-0.txt:kmalloc-1024        8671   8672   1024    8    2 :
tunables    0    0    0 : slabdata   1084   1084      0

  that was the script that overwrote its data in place, sigh

bug5712-run0/slabinfo-0.txt:kmalloc-1024       39007  39008   1024    8    2 :
tunables    0    0    0 : slabdata   4876   4876      0
bug5712-run0/slabinfo-1.txt:kmalloc-1024       44736  44736   1024    8    2 :
tunables    0    0    0 : slabdata   5592   5592      0
bug5712-run0/slabinfo-2.txt:kmalloc-1024       50464  50464   1024    8    2 :
tunables    0    0    0 : slabdata   6308   6308      0
bug5712-run0/slabinfo-3.txt:kmalloc-1024       56192  56192   1024    8    2 :
tunables    0    0    0 : slabdata   7024   7024      0
bug5712-run0/slabinfo-4.txt:kmalloc-1024       61911  61912   1024    8    2 :
tunables    0    0    0 : slabdata   7739   7739      0
bug5712-run0/slabinfo-5.txt:kmalloc-1024       67679  67680   1024    8    2 :
tunables    0    0    0 : slabdata   8460   8460      0

  constant growth until death

bug5712-run1/slabinfo-0.txt:kmalloc-1024        1534   1536   1024    8    2 :
tunables    0    0    0 : slabdata    192    192      0

  just after reboot

There was little change in traffic during the hour covered by the monitoring
script:

gprs0
-          RX bytes:39651379 (37.8 MiB)  TX bytes:2145926 (2.0 MiB)
+          RX bytes:43291104 (41.2 MiB)  TX bytes:2981780 (2.8 MiB)

wlan0
-          RX bytes:27058743 (25.8 MiB)  TX bytes:45860584 (43.7 MiB)
+          RX bytes:28396816 (27.0 MiB)  TX bytes:50342120 (48.0 MiB)

Overall I was sharing my 3g connection for maybe 3 hours.
Comment 16 Marius Gedminas (reporter) 2009-11-05 10:05:48 UTC
After another ssh disconnection I reran my monitor-ifconfig-and-slabinfo script
under screen.

kmalloc-1024 grew steadily from 6814 to 64247 during 1 hours and 20 min. 
Snapshots made every 10 min showed an increase of almost exactly 6000.  Then I
stopped my ssh sessions and suspended my laptop.  Overnight the kmalloc-1024
figure stayed more or less constant at 66687 (with small variations to 66689 or
66693 or 66698).  In the morning I reconnected with ssh -D 1080 (plus another
ssh session to run vmstat 60, plus sshfs) and the growth resumed (currently to
72351).

dmesg is completely filled with messages of the kind (one message every 30
seconds):

[112022.223846] wlan0: No active IBSS STAs - trying to scan for other IBSS
networks with same SSID (merge)
[112052.684814] wlan0: No active IBSS STAs - trying to scan for other IBSS
networks with same SSID (merge)

Since the timing doesn't match (once per 30s rather than 10 times per second)
and the messages continued during the quiet period, I guess it is not related
to the kmalloc growth.
Comment 17 Eero Tamminen nokia 2009-11-05 14:56:38 UTC
(In reply to comment #15)
> kmalloc-1024 numbers from the record:
> bug5712-run0/slabinfo-0.txt:kmalloc-1024       39007  39008   1024    8    2 :
> tunables    0    0    0 : slabdata   4876   4876      0
...
> bug5712-run0/slabinfo-5.txt:kmalloc-1024       67679  67680   1024    8    2 :
> tunables    0    0    0 : slabdata   8460   8460      0
> 
>   constant growth until death

Thanks!  Was the 192 sized bucket growth at the same rate too?


> There was little change in traffic during the hour covered by the monitoring
> script:
> 
> gprs0
> -          RX bytes:39651379 (37.8 MiB)  TX bytes:2145926 (2.0 MiB)
> +          RX bytes:43291104 (41.2 MiB)  TX bytes:2981780 (2.8 MiB)
> 
> wlan0
> -          RX bytes:27058743 (25.8 MiB)  TX bytes:45860584 (43.7 MiB)
> +          RX bytes:28396816 (27.0 MiB)  TX bytes:50342120 (48.0 MiB)
> 
> Overall I was sharing my 3g connection for maybe 3 hours.

The amount of traffic above doesn't seem suspicious.


(In reply to comment #16)
> After another ssh disconnection I reran my monitor-ifconfig-and-slabinfo
> script under screen.
>
> kmalloc-1024 grew steadily from 6814 to 64247 during 1 hours and 20 min. 
> Snapshots made every 10 min showed an increase of almost exactly 6000.

Thanks.  If it's growing that that rate, it seems easy to detect.

What kind of other network activity you had running at the same time?

Can you provide e.g. tcpdump data for both wlan & gprs interfaces?


> Then I stopped my ssh sessions and suspended my laptop.  Overnight
> the kmalloc-1024 figure stayed more or less constant at 66687 (with small
> variations to 66689 or 66693 or 66698).  In the morning I reconnected with
> ssh -D 1080 (plus another ssh session to run vmstat 60, plus sshfs) and
> the growth resumed (currently to 72351).

Here are couple of other things you could try to isolate the cause of the
issue:
- use just "ping" instead of ssh, both from and to the device, without
  any other network activity to/from/through the device.
- have laptop idling over the night, on power, WLAN on, but not doing network
  activity (i.e. is the issue triggered by whatever the laptop is doing)


> dmesg is completely filled with messages of the kind (one message every 30
> seconds):
> 
> [112022.223846] wlan0: No active IBSS STAs - trying to scan for other IBSS
> networks with same SSID (merge)
> [112052.684814] wlan0: No active IBSS STAs - trying to scan for other IBSS
> networks with same SSID (merge)
> 
> Since the timing doesn't match (once per 30s rather than 10 times per second)
> and the messages continued during the quiet period, I guess it is not related
> to the kmalloc growth.

Ok.
Comment 18 Marius Gedminas (reporter) 2009-11-05 18:18:38 UTC
(In reply to comment #17)
> Thanks!  Was the 192 sized bucket growth at the same rate too?

Yes:

~/MyDocs $ grep -- -192 bug5712-run0/slab*
bug5712-run0/slabinfo-0.txt:kmalloc-192        44094  44100    192   21    1 :
tunables    0    0    0 : slabdata   2100   2100      0
bug5712-run0/slabinfo-1.txt:kmalloc-192        49833  49833    192   21    1 :
tunables    0    0    0 : slabdata   2373   2373      0
bug5712-run0/slabinfo-2.txt:kmalloc-192        55576  55587    192   21    1 :
tunables    0    0    0 : slabdata   2647   2647      0
bug5712-run0/slabinfo-3.txt:kmalloc-192        61277  61299    192   21    1 :
tunables    0    0    0 : slabdata   2919   2919      0
bug5712-run0/slabinfo-4.txt:kmalloc-192        66938  66948    192   21    1 :
tunables    0    0    0 : slabdata   3188   3188      0
bug5712-run0/slabinfo-5.txt:kmalloc-192        72739  72765    192   21    1 :
tunables    0    0    0 : slabdata   3465   3465      0

> (In reply to comment #16)
> > After another ssh disconnection I reran my monitor-ifconfig-and-slabinfo
> > script under screen.
> >
> > kmalloc-1024 grew steadily from 6814 to 64247 during 1 hours and 20 min. 
> > Snapshots made every 10 min showed an increase of almost exactly 6000.
> 
> Thanks.  If it's growing that that rate, it seems easy to detect.
> 
> What kind of other network activity you had running at the same time?

Nothing special: IM accounts on the n900, IM accounts on the laptop.  Facebook
widget on the desktop.  I don't recall if I had email checking enabled.

> Can you provide e.g. tcpdump data for both wlan & gprs interfaces?

I'll see what I can do.

> > Then I stopped my ssh sessions and suspended my laptop.  Overnight
> > the kmalloc-1024 figure stayed more or less constant at 66687 (with small
> > variations to 66689 or 66693 or 66698).  In the morning I reconnected with
> > ssh -D 1080 (plus another ssh session to run vmstat 60, plus sshfs) and
> > the growth resumed (currently to 72351).

Continuation of the story: kmalloc-1024 grew to 88905, and soon after the
device entered happy swapland, which, suprisingly, wasn't immediately fatal.  I
went over to the xterm running the sharing script (i.e. dnsmasq), hit ^c.  A
few minutes later the device recovered.  It wasn't fully okay (still sluggish,
still swapping out apps, but maintaining ~40 free megs rather than ~90, which
meant the apps could be swapped in in a reasonable amount of time, with only
occasional multi-second ui pauses).  The sluggishness was unpleasant so I
rebooted.

I'm mostly mentioning this because previously the device usually died after
reaching about 67000 whatevers in the kmalloc-1024 column.

Oh, and I discovered messages from the OOM killer in dmesg (which I've saved to
a file) after the temporary reprieve.  It could be that it wasn't ^C but
OOM-killer killing osso-xterm that made the device unfreeze.  I think it killed
osso-xterm, I can't easily check now because of bug 5824 (zombie ssh sessions
ate all my PTYs; can't create new ones; can't open xterms; sshfs flaky due to
local network conditions).
Comment 19 Marius Gedminas (reporter) 2009-11-07 15:44:31 UTC
More data:
- connevted via gprs0, enabled adhoc wifi, used ssh for a while without memory
growth
- lost gprs connection, left ssh connection open over wifi
- disconnected ssh, growth persisted
- netstat showed no tcp connections, growth persisted
- disconnecting from wifi on my laptop stopped memory growth on the n900 after
a couple of seconds

Looks like wifi driver's fault to me.

Reconnecting to wifi from my laptop, pinging the n900, even reconnecting with
ssh didn't start memory growth immediatelly.  After a longer period with wifi
up and an active ssh connection with some periodic small traffic the growth
resumed.  gprs0 was down the whole time.  I have periodic snapshots of ifonfig
and slabinfo taken every 10 minutes.
Comment 20 Juuso Oikarinen nokia 2009-11-09 10:54:37 UTC
(In reply to comment #19)
> More data:
> - connevted via gprs0, enabled adhoc wifi, used ssh for a while without memory
> growth

By "enabling adhoc wifi", do you mean you connect with N900 to some adhoc AP,
or do you mean you configure the N900 to be an adhoc AP, and associate with
some other device to it?
Comment 21 Marius Gedminas (reporter) 2009-11-09 11:14:48 UTC
By "enabling adhoc wifi" I mean the setup mentioned in comment 5: ifconfig +
iwconfig behind icd's back, so I can browse from my laptop by using an ssh
socks proxy.

"AP" doesn't really apply to adhoc connections, although since I'm running
dnsmasq on the n900 to serve DHCP leases and do DNS resolution, you could call
it a sort-of AP.
Comment 22 Juuso Oikarinen nokia 2009-11-09 13:04:58 UTC
I have a setup, with the N900 configured from command line to adhoc mode (hence
bypassing the icd), with a static IP. With a linux laptop, I connect with the
N900, manually configuring a static IP there too. On the N900, I'm running the
dropbear ssh server.

On the N900, I have no UI applications running except one xterm. In addition, I
have configured the automatic connection stuff in the UI so that icd does not
intervene with my command line use of the wlan0 interface.

I have had varying pings ongoing between the devices, and a ssh session from
the laptop to the N900 varyingly running top, and varyingly inspecting
/proc/slabinfo.

The setup has been set-up several hours now, and the kmalloc bucket counters in
slabinfo have not incremented at all.

There is still something in the problematic setup that I'm not reproducing in
my environment.

May I query the OS of the laptop, the make and driver of the WLAN adapter in
use with it?
Comment 23 Marius Gedminas (reporter) 2009-11-09 16:48:29 UTC
Created an attachment (id=1558) [details]
script to bring up adhoc wifi and start dhcp server on n900

The laptop is a Lenovo ThinkPad T61, running Ubuntu 9.10, with Intel
PRO/Wireless 3945ABG driven by iwl3945.ko version 1.2.26ks.

I use the attached script on the N900 side, and regular Network Manager on the
laptop side.

I haven't fully figured out when exactly the slabinfo growth starts up.  So far
a full-proof method for me is to

  1. connect the N900 to the Internet via 3G
  2. run sharing script (optional argument specifies ad-hoc wifi channel; I've
been using 2 and 3)
  3. connect to the N900 ad-hoc wlan from the laptop
  4. ssh -D user@mg-n900.local from the laptop
  5. start browsing (Google Reader, some ssh connections, etc.)

I've noticed that sometimes the slabinfo growth doesn't start immediately (I've
browsed for maybe an hour once and found ~140 allocations); sometimes it starts
immediately (once I rebooted, did this exact sequence of steps and was forced
to reboot again within less than 3 hours because of memory pressure issues).

I've confirmed twice that disconnecting from the adhoc wlan on the laptop side
stops slabinfo growth on the N900 almost immediately (within a couple of
seconds: in an xterm I press up, enter, up, enter, up, enter to repeat my grep
-- -1024 /proc/slabinfo, and I see the numbers going up a bit and then stopping
and even decreasing by one or two).

I'll see if I can find a shorter way of reproducing the issue.
Comment 24 Eero Tamminen nokia 2009-11-09 17:27:36 UTC
(In reply to comment #23)
> I've noticed that sometimes the slabinfo growth doesn't start immediately
> (I've browsed for maybe an hour once and found ~140 allocations); sometimes
> it starts immediately

Hm.  So in the worst case the trigger can actually be some specific traffic
that happens otherwise through the WLAN...?
Comment 25 Juuso Oikarinen nokia 2009-11-10 07:48:31 UTC
It _could_ also be something that the "PRO/Wireless 3945ABG" does / sends. I'll
have to see if I can find one such card to try with.
Comment 26 Eero Tamminen nokia 2009-11-19 11:07:47 UTC
FYI:  Finally we were able to re-produce this.  It's not related to gprs
connection or specific WLAN access point, just ad-hoc WLAN with the device.  It
can be gotten also with two rovers connected to each other over ad-hoc network.

However, triggering of this reliably requires running the test at least for a
day.  In the test we run, it took something like 13 hours until the leakage
started, and it wasn't constant, at some point it stopped leaking for 5 hours.

I don't think the exact triggering condition is yet known.  Seems like it's
more  related to what kind of traffic there is (not sure at which level).
Comment 27 egoshin 2009-12-11 22:18:33 UTC
I experienced the similar behavior at least twice.

I did install some stuff which created me ad-hoc WiFi but I don't use this
connection - just WiFi.

The last time it happens after I downloaded a tar backup of whole system from
SDHC to host via WLAN and that was not immediate - I completed the download,
did something unsignificant (X-Term, ps, ssh etc) and drive home and tried to
make a phone call. Applications button freezes in "pressed" state, I was not
able to handle it a couple of minutes (traffic) and it unfreezes later and
phone application starts.

The first time is difficult to remember but it seems to me that it happens
right after backup of whole FS to SDHC via tar, again during driving and I just
rebooted.

So, before this ticket I counted it to slow SDHC (1.6MB/sec on WiFi G) and full
buffers.
Comment 28 Jamie Lokier 2010-01-20 02:25:18 UTC
(In reply to comment #26)
> just ad-hoc WLAN with the device.

Aha.  I have just had my N900 running for about 3-4 hours in ad-hoc wifi mode
(to my laptop which is forwarding net connections), while charging.  It was
switched off at the start of that.

Just now, I received an incoming phone call, and the UI was not responding to
touches to accept the call.  It turned out it was responding, but taking up to
10 seconds to process each touch.  Voice transmission was somewhat compromised,
so we couldn't actually talk.

After closing that call, because it wasn't usable, I uses the right-side slider
to blank the screen then unblank it (wondering why the "End" call button wasn't
responding), and it took a similar amount of time to respond to the slider. 
The top button for "End task" also took ages to respond.

I opened an xterm to look for rogue processes, and 'top' showed /usr/sbin/icd2
using about 4% CPU which it doesn't normally do (it doesn't normally show as a
CPU user in top at all), plus X using about 3% which is normal.

Until finding this bug report, it didn't occur to me to look at memory
statistics.

Now, it isn't even possible to unblank the screen: it's just not responding. 
Yet, it did ring a couple of times as the person on the other end still tried
to reach me, and eventually left a message, and the alert light is now
flashing.  But can't unblank the screen :-)

Ah, no, unblanking worked after about 40 seconds, or pressing the top button,
depending on which event it was that triggered it eventually.

There are no applications running (except X-terminal at this point), and the
usual things on the desktop don't normally cause this problem even with 48
hours use.  I have only the calendar widget, X-terminal shortcut, a contact
shortcut, some bookmarks, etc.  That is, the calendar widget is the only active
item on the desktop.  Other than that, a Skype connection was running, but
again I have not had anything like this problem with Skype previously.

The firmware is the PR1.1 release.

Very interesting that ad-hoc wifi mode shows up in this bug report.  I don't
use it often; this is only the second time.
Comment 29 Jamie Lokier 2010-01-20 02:33:55 UTC
(In reply to comment #28)
> Now, it isn't even possible to unblank the screen: it's just not responding. 
> Yet, it did ring a couple of times as the person on the other end still tried
> to reach me, and eventually left a message, and the alert light is now
> flashing.  But can't unblank the screen :-)
> 
> Ah, no, unblanking worked after about 40 seconds, or pressing the top button,
> depending on which event it was that triggered it eventually.

Well, it unblanked and showed the X-terminal where I was running top then
stopped it, but the X-terminal's cursor is unfocused and it's not responding to
key presses or touch to focus, or anything really.

I tried holding down the top button, expecting it to power off after 5 seconds
or so, like most phones will do.  I'd assumed there would be a hard power off
circuit for emergencies.

But that didn't have any effect, so I've had to remove the battery to switch it
off.

I can only imagine that it was getting slow at the point where I answered that
call (in the previous comment) and then opened the X-terminal, but then it
became catastrophically slow.

Most likely it won't be alerting about incoming calls at this point either.  In
my personal circumstances, I have a friend who may have a medical emergency, so
this behaviour is quite bad if it makes incoming call alerts stop working.

If the device ends up in a swap storm, it would be far better to trigger a
watchdog reset than getting stuck like this, as it's effectively frozen.

Simply reducing or removing swap isn't the solution, because it's needed for
idle application memory, but removing the ability to get stuck would be good.

Is the watchdog kept alive by the kernel running, or does it regularly poll
core applications for their response time?
Comment 30 Jamie Lokier 2010-01-20 02:50:55 UTC
(In reply to comment #29)
> But that didn't have any effect, so I've had to remove the battery to switch it off.

Got that bit wrong.  I was expecting to have to remove the battery because
holding down the top button for plenty of time didn't force it off.

But it did reboot itself, after a few minutes of not responding to anything. 
(I was distracted by an incoming call on another phone otherwise I wouldn't
have waited long enough to see that).

Unfortunately I wasn't able to get any memory statistics.  But it was noticable
that it went from very slow (up to 10 seconds to respond to touches, but
keyboard inside X-terminal working quite well), to completely non-responsive
within a few minutes, but had sat there charging for a couple of hours at
least.  The charge reached full less than an hour before the problems; I don't
know if that's relevant.

I don't use ad-hoc mode often but I will try it every few days now, to see if
there's a consistent pattern.

My laptop also uses the iwl3945.ko wireless driver module, with Ubuntu 9.10.

-- Jamie
Comment 31 Eero Tamminen nokia 2010-01-20 17:34:36 UTC
(In reply to comment #29)
> Is the watchdog kept alive by the kernel running, or does it regularly poll
> core applications for their response time?

There's a separate watchdog kicker thread in DSME for that, but DSME is locked
into RAM so swapping doesn't affect it much.  Kernel Wifi module eating all
device memory would eventually do that, but it would happen after user patience
ends I guess.

The SW watchdog polls critical applications, but if they don't respond it just
restarts them.  Only if they get restarted too many times in a row / within
certain time, then device is rebooted.  (With kernel leak causing swapping
restart might be so slow that this time limit doesn't get hit).


If you can find a use-case that doesn't require from several hours to one day
to trigger the start of the leakage, or differs from the one in comment ,
please report it[1].  If you trigger this, please provide also your
/proc/slabinfo data from the device so that it can be verified whether this is
the same issue as the original reported had.


(In reply to comment #27)
> The first time is difficult to remember but it seems to me that it happens
> right after backup of whole FS to SDHC via tar, again during driving and I
> just rebooted.

[1] Hm, the way it could be internally triggered (after 13 hours of running a
test, with 5 hour gap in leaks some time after that) was running ls in one
device from another device using ssh.  Maybe the triggering needs both ad-hoc
WLAN and file system access?

The issue isn't currently reproducible on our latest internal release, but
there were no specific fixes for this issue, so it's possible that the use-case
we had just doesn't happen to trigger the issue anymore (within day).
Comment 32 egoshin 2010-01-20 18:44:38 UTC
(In reply to comment #31)

> [1] Hm, the way it could be internally triggered (after 13 hours of running a
> test, with 5 hour gap in leaks some time after that) was running ls in one
> device from another device using ssh.  Maybe the triggering needs both ad-hoc
> WLAN and file system access?

See my comment #27.
Comment 33 Josh Triplett 2010-01-24 21:30:46 UTC
Retitling this bug to point at ad-hoc wifi and kernel memory usage as the
culprit.
Comment 34 Jamie Lokier 2010-02-02 22:37:48 UTC
Today I had to use ad-hoc mode again, with the same laptop as before.

This time, some new things happened, culminating in a reboot within about 5
minutes of starting ad-hoc mode:

    1. It appeared to establish a link (the ad-hoc wifi icon appeared and soon
stopped flashing), but then it was unable to reach the internet via the ad-hoc
peer, my laptop.  It's possible it could have been a laptop configuration
issue, but I did check various things and it seems unlikely.

    2. After I'd seen it could not reach the internet, and checked basic
configuration on the laptop, I used the Disconnect button on the N900, and then
searched for WiFi access points to reconnect.  (I was intending to look at
packets on the laptop using tcpdump, to see why it the N900 could not reach the
internet).

    3. For about a minute, the N900 did not show any WiFi access points, which
is unexpected as there are 4 to 5 within range which show up normally, had
shown up a few minutes earlier, and would show up in step 4.  This included not
showing the laptop, <1m away.  It did show the 3G connection as available.

    4. Then all the local WiFi access points appeared, and I selected the
laptop.

    5. A couple of minutes after that, I thought to myself "oh, I just
remembered that ad-hoc mode has problems".  And a few seconds later, the N900
spontaneously rebooted itself as I watched.

I didn't have much open or running on it when it rebooted.  I.e. it wasn't a
case of already using lots of memory.  Something blatantly triggered the reboot
with a few minutes of ad-hoc mode starting.

Not showing any WiFi access points for about a minute indicates a driver
problem, not just a memory leak.  I think it did not display it's usual spinner
indicating that it's searching, during that time, either.

Interesting: Not showing the ad-hoc laptop connection _might_ not be a driver
issue.  It normally shows that in the list of available connections _always_,
including when the laptop is switched off.  (Which is annoying, by the way.)

I.e. it's possible something went wrong with the wireless connection manager
app....  Or maybe that app only adds all the access points, including the ones
it's going to show regardless of signal availability, after it gets some kind
of response back from the kernel.

Is it possible the memory leak + connection problems in ad-hoc mode are caused
by the connection manager app, rather than the kernel driver?
Comment 35 Eero Tamminen nokia 2010-02-04 17:20:08 UTC
> And a few seconds later, the N900 spontaneously rebooted itself as I watched.

Please attach /dev/mtd2 contents or "sp-oops-extract /dev/mtd2" output (you
need to be root for these).
Comment 36 Roman Kapusta 2010-02-05 15:34:56 UTC
I have configured ad hoc wireless network with my laptop (Network controller:
Intel Corporation PRO/Wireless 3945ABG). I do not have problems described
above, but when I run top in xterm process icd2 is consuming all time above 6%
cpu. Most of the time icd2 is the first process in top list with highest cpu
usage. After half an hour the phone is considerably warm and battery indicator
is showing about 15% less charge.

Internet was bridged from my laptop lan to ad-hoc wifi. No special data
transfers was occuring between phone and laptop, there was just inactive
connections to googletalk and skype and inactive shh connection from laptop to
phone.

I'm often using similar configuration with infrastructure mode wifi (with AP),
but I never observed similar icd2 behaviour.
Comment 37 Jamie Lokier 2010-02-05 17:07:07 UTC
Created an attachment (id=2222) [details]
mtdoops of kernel panic after crash 5 minutes after starting ad-hoc mode

(In reply to comment #35)
> > And a few seconds later, the N900 spontaneously rebooted itself as I watched.
> 
> Please attach /dev/mtd2 contents or "sp-oops-extract /dev/mtd2" output (you
> need to be root for these).

sp-oops-extract wasn't installed nor did apt-get find it, so I read /dev/mtd2
and edited out the runs of 0xff.  (Ooh, I like mtdoops, think I'll use it on my
own devices in future.  But we can see it needs a bit more work, because the
panic log is mangled at block boundaries.)

It's triggered in __bug, called from rate_control_get_rate, called from
ieee80211_process_measurement_req.  Please enjoy!
Comment 38 Eero Tamminen nokia 2010-02-08 14:58:49 UTC
(In reply to comment #36)
> I do not have problems described above, but when I run top in xterm process
> icd2 is consuming all time above 6% cpu.

Please file a separate bug about that with "use-time" keyword and attach to it
output of "strace -f -p $(pidof icd2)" trace.

Strace is available from the SDK tools repo. Instructions for SDK tools
installation are at the end of this page:
  http://wiki.maemo.org/Documentation/devtools/maemo5


(In reply to comment #37)
>> Please attach /dev/mtd2 contents or "sp-oops-extract /dev/mtd2" output
>> (you need to be root for these).
> 
> sp-oops-extract wasn't installed nor did apt-get find it,

It's in the SDK tools repo (you get it also if you install e.g. Crash
reporter).  


> so I read /dev/mtd2 and edited out the runs of 0xff

Thanks!  I'd rather get the raw one if you think there was something wrong in
the log writing though...


> (Ooh, I like mtdoops, think I'll use it on my own devices in future.
> But we can see it needs a bit more work, because the panic log is mangled
> at block boundaries.)

...as it should handle oopslogs spanning multiple blocks fine.

Btw. latest upstream kernel version has some fixes to it (the stuff used by
oopslog may sometimes call other parts of the kernel which assume kernel
internal memory is fine, I understood latest upstream kernel isolates mtdoops
better than the version we originally contributed).

If you do find issues & do fixes to mtdoops (or backport them from later kernel
version to ours), please file a bug about that with a patch, thanks!


> It's triggered in __bug, called from rate_control_get_rate, called from
> ieee80211_process_measurement_req.  Please enjoy!

This seems internal bug NB#106780.  It may be caused by a station/AP not
announcing correct supported rates and hence triggers an incorrect rate index
from minstrel.  According to that bug comments, it's not triggered if a fixed
rate is set using e.g. "iwconfig wlan0 rate 1M fixed".  It should be fixed in
next public release (internal release version: 10.2010.03-12).

A related upstream kernel bug:
http://bugzilla.kernel.org/show_bug.cgi?id=12490


I didn't see any comments about this being related to the internal kernel
memory leakage in ad-hoc case (about which this bug is about) though.
Comment 39 Spencer Goh 2010-02-09 14:26:25 UTC
For the record I've experienced this too.
I run an ad-hoc network between PC and N900, and leave it on all day.

Firstly it completely drains my battery in no time.  Then also grinds the N900
to a messy halt... slow to unusable state.... easy to replicate.
Comment 40 Eero Tamminen nokia 2010-02-09 14:32:53 UTC
(In reply to comment #39)
> For the record I've experienced this too.
> I run an ad-hoc network between PC and N900, and leave it on all day.
> 
> Firstly it completely drains my battery in no time.  Then also grinds the N900
> to a messy halt... slow to unusable state.... easy to replicate.

Do you also see the growth in /proc/slabinfo kmalloc-* alloc buckets?

In the other bug you mentioned that the slowness went away when you
disconnected WLAN, if that helps with the slowness and the slabinfo counts drop
significantly, then the issue is not lost allocs, but e.g. some list used by
the driver growing when it shouldn't.
Comment 41 Eero Tamminen nokia 2010-03-12 15:53:25 UTC
FYI: Internal fixes to this are being tested (there were actually several small
leaks in the WiFi driver) and if they're good, they'll be in the next public
release.
Comment 42 Andre Klapper maemo.org 2010-03-15 14:42:25 UTC
Fixes have been included internally and two tests (26hours and 60hours) have
shown that this is fixed internally.

This has been fixed in package
kernel 2.6.28-20101001+0m5
which is part of the internal build version
10.2010.10-9
(Note: 2009/2010 is the year, and the number after is the week.)

A future public update released with the year/week later than this internal
build version will include the fix. (This is not always already the next public
update.)
Please verify that this new version fixes the bug by marking this bug report as
VERIFIED after the public update has been released and if you have some time.


To answer popular followup questions:
 * Nokia does not announce release dates of public updates in advance.
 * There is currently no access to these internal, non-public build versions.
   A Brainstorm proposal to change this exists at
http://maemo.org/community/brainstorm/view/undelayed_bugfix_releases_for_nokia_open_source_packages-002/
Comment 43 Uwe Kaminski 2010-03-16 23:02:44 UTC
*** Bug 8645 has been marked as a duplicate of this bug. ***
Comment 44 ext-petri.t.pitkanen nokia 2010-05-28 08:28:03 UTC
Fix for this is now delivered with Pr1.2. Marius could you please verify that
it has solved your problem
Comment 45 Marius Gedminas (reporter) 2010-06-17 20:54:00 UTC
RX-51_2009SE_10.2010.19-1_PR_MR0: ~20 minutes of internet sharing with the old
script, zero kmalloc-1024 growth so far.  Bug looks very fixed to me.
Comment 46 ext-petri.t.pitkanen nokia 2010-06-18 09:35:55 UTC
Thank you, I kinda assumed it was fixed :) As you (In reply to comment #45)
> RX-51_2009SE_10.2010.19-1_PR_MR0: ~20 minutes of internet sharing with the old
> script, zero kmalloc-1024 growth so far.  Bug looks very fixed to me.
> 

Thank you. I kinda assumed it is fixed :)

Since you are the reporter coudl you mark bbug as verified so that we close the
issue