Bug 9153 - Swapping connections locks wlancond and drains battery.
Status: NEW
Product: Connectivity
Component: Networking
Version: 5.0/(3.2010.02-8)
Platform: All Maemo
Importance: Unspecified normal, with 5 votes
Target Milestone: ---
Assigned To: unassigned
QA Contact: networking-bugs
Keywords: moreinfo, use-time

Reported: 2010-02-19 07:49 UTC by Bart
Modified: 2012-02-02 20:50 UTC
CC List: 11 users



Attachments
Results of strace -o PROG.OUT -ff PROG; PROG={wlancond, ifconfig} (12.47 KB, application/x-compressed-tar)
2010-02-19 15:55 UTC, Bart
Contents of syslog after wifi stopped working, _without_ making wlancond go 100%. (9.00 KB, application/x-compressed-tar)
2010-02-19 15:56 UTC, Bart
Script to turn switch between connections. (550 bytes, application/x-shellscript)
2010-02-22 15:19 UTC, Bart
Syslog while wlancond locked up. (32.29 KB, application/x-gzip)
2010-02-22 15:23 UTC, Bart
Contents of /dev/mtd2 (5.49 KB, application/x-gzip)
2010-02-22 17:35 UTC, Bart
my mtd2 partition (289 bytes, application/gzip)
2010-02-26 00:55 UTC, Evan Driscoll




Description Bart (reporter) 2010-02-19 07:49:17 UTC
SOFTWARE VERSION:
(Settings > General > About product)
PR 1.1.1

EXACT STEPS LEADING TO PROBLEM: 
1. Change connection (Wifi/3G), either manually or automatically when
entering/leaving wifi areas.

EXPECTED OUTCOME:
Phone stops using former connection and starts using the new one.

ACTUAL OUTCOME:
- Daemon wlancond consumes 100% CPU in sys mode (drains battery), ignores "kill
-9"
- Phone switches to 3G
- invoking ifconfig hangs
- conky shows empty black window
- connection selector widget doesn't show connections, which prevents me from
disabling 3G
- 3G connection actually works, web browsing and online radio work.

REPRODUCIBILITY:
Low, around 1/10. I can usually force the bug after a couple of minutes of
manually switching between 3G and wifi.

EXTRA SOFTWARE INSTALLED:
conky, sshd C&S, openvpn

OTHER COMMENTS:
- Updated to 1.1.1 with flasher 3.5. It happened before the update too, but
then it wouldn't connect to 3G; it just rendered the wifi unusable.
- Not using static IP, always DHCP.
- Maybe unrelated: A couple of times the UI completely froze and I had to pull
the battery to restart, although the media player kept on playing an internet
radio (tried to get a response for several minutes, but even sliding it open
didn't illuminate the keyboard).

User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2)
Gecko/20100209 Firefox/3.6
Comment 1 Lucas Maneos 2010-02-19 14:43:11 UTC
Thanks for the report.

> - Daemon wlancond consumes 100% CPU in sys mode (drains battery), ignores
> "kill -9"

That is very strange. Ignoring "kill -9" usually happens when a process is
blocked on I/O, but it wouldn't consume CPU in that case.  Just checking, is
the kill command executed as root?

> Low, around 1/10. Usually can force the bug after a couple minutes switching
> from 3G and wifi manually.

Is this reproducible on more than one access point?  What is the
make/model/firmware version on those?

Can you provide strace output, a core dump and/or syslog output?  See
<http://wiki.maemo.org/Bugs:Stock_answers#Software_hangs_with_100.25_CPU> for
details on obtaining those.
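A process that burns CPU in sys mode while ignoring SIGKILL is usually looping inside the kernel. A fatal signal is only acted on once the task reaches a point where it checks for pending signals (typically on the way back to user space). One way to check this on the device is to sample wlancond's state and CPU times from /proc/<pid>/stat twice, a few seconds apart. The sketch below is a minimal, hypothetical helper (the name proc_cpu_summary is made up for illustration). It assumes the field layout documented in proc(5): state is field 3, utime is field 14 and stime is field 15, both in clock ticks.

```sh
#!/bin/sh
# Hypothetical helper: print state, utime and stime from a /proc/<pid>/stat file.
# Per proc(5): field 3 = state, field 14 = utime, field 15 = stime (clock ticks).
proc_cpu_summary() {
    # Drop the leading "pid (comm) " so a command name containing spaces
    # cannot shift the numbered fields; original field N becomes field N-2.
    sed 's/^.*) //' "$1" |
        awk '{ printf "state=%s utime=%s stime=%s\n", $1, $12, $13 }'
}

# Usage on the device (take the PID from top or ps), sampled twice:
#   proc_cpu_summary /proc/1234/stat; sleep 5; proc_cpu_summary /proc/1234/stat
# A state that stays R while stime grows much faster than utime suggests the
# time is spent in the kernel rather than in wlancond's own code.
```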
Comment 2 Bart (reporter) 2010-02-19 15:20:39 UTC
(In reply to comment #1)
> Thanks for the report.
> 
> > - Daemon wlancond consumes 100% CPU in sys mode (drains battery), ignores
> > "kill -9"
> 
> That is very strange and usually happens when a process is blocked on I/O, but
> it wouldn't consume CPU in that case.  Just checking, is the kill command
> executed as root?
>
Yes, everything was executed as root. The kill command returned with no
message, the ifconfig just froze with no message either.


> > Low, around 1/10. Usually can force the bug after a couple minutes switching
> > from 3G and wifi manually.
> 
> Is the reproducible on more than one access points?  What is the
> make/model/firmware version on those?
>
So far it has happened on two Foneras (one running FON's original firmware and
one running OpenWRT), one Linksys WRT610N running DD-WRT and an unknown AP at a
friend's place, all of those running WPA. It also happened with an SMC
VoIP-enabled AP (I don't know the exact model) running WEP.
With some of them (the original Fonera and the unknown AP) it was 90%
reproducible (it required several reboots in a row to get wifi), and with others
(SMC) it seemed to happen less often. Right now I only have access to an OpenWRT
Fonera and the WRT610N to run tests; I might flash another Fonera with the FON
firmware in case it could help.


> Can you provide strace output, a core dump and/or syslog output?  See
> <http://wiki.maemo.org/Bugs:Stock_answers#Software_hangs_with_100.25_CPU> for
> details on obtaining those.
> 
I'll do that ASAP.
Comment 3 Bart (reporter) 2010-02-19 15:55:00 UTC
Created an attachment (id=2318)
Results of strace -o PROG.OUT -ff PROG; PROG={wlancond, ifconfig}

This time wlancond didn't use 100% CPU, but the wifi stopped working. It's the
first time I've seen this behavior.
Comment 4 Bart (reporter) 2010-02-19 15:56:05 UTC
Created an attachment (id=2319)
Contents of syslog after wifi stopped working, _without_ making wlancond go
100%.

This time wlancond didn't use 100% CPU, but the wifi stopped working. It's the
first time I've seen this behavior.
Comment 5 Warren Baird 2010-02-19 21:18:44 UTC
I've had this happen to me quite a few times.   I'll notice that my N900 is
suddenly warm to the touch and the battery is draining quickly.   Then
I check top and see wlancond pulling huge amounts of CPU.

It happens to me about once every couple of days on average.   It does seem to
happen when I'm moving from cellular data to wifi or vice versa.

I only installed strace yesterday, and I haven't had the problem since, but
I can get an strace log when it happens again.

I've never managed to get to the point of trying to kill -9 it...  I don't
normally have a root terminal open, and every time this has happened, 'sudo
gainroot' has hung for me when I tried to become root to kill it.
Comment 6 Evan Driscoll 2010-02-22 08:59:33 UTC
I think I also get this bug. Since I got the phone in the middle of last week,
my battery hasn't lasted more than 6 or 7 hours, apparently because of this issue.

I don't even need to do anything to get the problem to arise; my phone can be
behaving normally and I can put it down for a while, come back, and wlancond
will be pegging my CPU.

Thus my "exact steps leading to problem" would be:
1. Don't turn off the phone

Reproducibility:
Within an hour or something, 100%.

I have the latest firmware as of when I was playing around with that stuff
Thursday evening last week. The only extra software I have in common with Bart
is openssh client (no server); I have a few other things (a couple timer
utilities, the flashlight utility, passwordsafe, and Python through apt-get).

I also had Warren's problem where even 'sudo gainroot' hung. I restarted the
phone, opened a root terminal, and now I'll leave it there and when wlancond
becomes intimate with the CPU again, see if I get the kill -9 and ifconfig
behaviors specified by Bart.

(Also, see
http://discussions.nokiausa.com/t5/Maemo-Devices/Is-my-N900-defective/td-p/636119
for a couple other problems I've seen, including a display glitch.)
Comment 7 Evan Driscoll 2010-02-22 10:08:39 UTC
Okay: 'ifconfig' hangs, and 'kill -9 (wlancond's PID)' exits with no output,
but without having killed wlancond.

I'll also post some strace stuff, but I won't be able to do that until tomorrow
some time.
Comment 8 Eero Tamminen nokia 2010-02-22 12:14:30 UTC
(In reply to comment #3)
> Created an attachment (id=2318)
> Results of strace -o PROG.OUT -ff PROG; PROG={wlancond, ifconfig}
> 
> This time wlancond didn't use 100% cpu, but the wifi stopped working. It's the
> 1st time I see this behavior.

This may be a separate issue.

It seems to be doing some D-BUS traffic:
poll([{fd=3, events=POLLIN}, {fd=7, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN}], 3, 0) = 0 (Timeout)
poll([{fd=3, events=POLLIN}, {fd=7, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN}], 3, -1) = 1 ([{fd=5, revents=POLLIN}])
read(5, "l\4\1\0011\0\0\0y\2\0\0f\0\0\0\1\1o\0\16\0\0\0/com/nok"..., 2048) = 169
read(5, 0x288e0, 2048)                  = -1 EAGAIN (Resource temporarily unavailable)

And syslog has beacon loss messages similar to those in bug 6615:
----
Feb 19 14:34:59 viking kernel: [21833.193634] gprs0: attached
...
Feb 19 14:35:05 viking wlancond[3795]: Deauthenticating
Feb 19 14:35:05 viking kernel: [21839.053497] wlan0: deauthenticating by local choice (reason=3)
...
Feb 19 14:35:05 viking kernel: [21839.322143] wl1251: down
Feb 19 14:35:07 viking telepathy-gabble[3999]: GLIB MESSAGE default - lm-ssl-openssl.c: Issued for CN: gmail.com
Feb 19 14:35:21 viking wlancond[3795]: Could not set interface wlan0 flags
Feb 19 14:35:21 viking kernel: [21854.372558] wl1251: ERROR timeout waiting for the hardware to complete initialization
Feb 19 14:35:21 viking kernel: [21854.931335] wl1251: 151 tx blocks at 0x3b788, 35 rx blocks at 0x3a780
Feb 19 14:35:21 viking kernel: [21854.946960] wl1251: firmware booted (Rev 4.0.4.3.7)
...
Feb 19 14:35:23 viking kernel: [21856.847808] wlan0: authenticated
...
Feb 19 14:35:24 viking udhcpc[4101]: Lease of 172.24.10.195 obtained, lease time 43200
Feb 19 14:35:25 viking telepathy-spirit[3996]: GLIB WARNING ** tp-glib - tp_base_connection_change_status: attempted to re-emit the current status 0, reason 1
Feb 19 14:35:31 viking kernel: [21864.743743] slide (GPIO 71) is now closed
Feb 19 14:35:34 viking kernel: [21868.024688] wlan0: driver reports beacon loss from AP cf1bb52c - sending probe request
Feb 19 14:35:53 viking kernel: [21887.076507] wlan0: driver reports beacon loss from AP cf1bb52c - sending probe request
------

Btw, do you have Bluetooth enabled (e.g. do you use a BT headset)?  BT & WLAN
share the antenna, and it's possible that the drivers have some interaction
issue.


(In reply to comment #6)
> I think I also get this bug. I haven't had my battery last more than 6 or 7
> hours since I got the phone mid last week (apparently) because of this issue.

Evan, is this happening with PR1.1.1 for you also?


> I don't even need to do anything to get the problem to arise; my phone can be
> behaving normally and I can put it down for a while, come back, and wlancond
> will be pegging my CPU.
> 
> Thus my "exact steps leading to problem" would be:
> 1. Don't turn off the phone

Do you mean that in your case it's triggered also in some other condition than
switching between Wifi/3G?


> Reproducibility:
> Within an hour or something, 100%.

If you have in Settings -> Internet connections -> Connect automatically as
"Any connection", could you try changing it to "Always ask" and see whether you
can better pinpoint when the issue happens?

(This is also a good way to increase use-time even in normal situations.  AFAIK
the radios take the most power after the display, so preventing device services
from silently enabling them at the refresh intervals you've configured (and from
doing keepalives for certain things) helps battery life.  The services will then
do refreshes only when you've explicitly enabled networking.)


> I also had Warren's problem where even 'sudo gainroot' hung. I restarted the
> phone, opened a root terminal, and now I'll leave it there and when wlancond
> becomes intimate with the CPU again, see if I get the kill -9 and ifconfig
> behaviors specified by Bart.
> 
> (Also, see
> http://discussions.nokiausa.com/t5/Maemo-Devices/Is-my-N900-defective/td-p/636119
> for a couple other problems I've seen, including a display glitch.)

The display glitch is interesting.  It looks exactly like an SGX HW reset issue
we have that we haven't been able to reproduce (with a large set of users, some
of them very rarely encounter it).  Because it isn't reproducible, it hasn't been
possible to fix the SGX 3D driver completely, at least so far. So... if this is a
reproducible use-case and also the cause of the wlancond issue, I would be very
much interested.


(In reply to comment #7)
> Okay: 'ifconfig' hangs, and 'kill -9 (wlancond's PID)' exits with no output,
> but without having killed wlancond.
> 
> I'll also post some strace stuff, but I won't be able to do that until
> tomorrow some time.

I think the most interesting information now would be the output of "dmesg" when
this happens.  These sound like kernel issues, and you can see kernel messages
with the "dmesg" command without becoming root.

When you see the display glitches, I would be especially interested whether you
see any messages about SGX HW resets in the "dmesg" output.
Comment 9 Evan Driscoll 2010-02-22 13:26:46 UTC
> Evan, is this happening with PR1.1.1 for you also?

"About product" says 3.2010.02-8.002, so... I guess?


> Do you mean that in your case it's triggered also in some 
> other condition than switching between Wifi/3G?

I can't speak to whether it's trying to do something like that internally, but
the "internet connection" button in the status area says I'm not connected.
There are a couple Wi-Fi networks in range, so it's possible it is trying to
switch between them.

As for 3G, I have AT&T, so the phone's not even on the right frequency. (Though
maybe that is interesting in-and-of itself.) I do get EDGE reception though. My
primary use cases are disconnected PDA/media player, actual phone, and wi-fi
connection; internet over the cell network was never a reason I got the N900.

(Just FYI, one more thing that seems to hang when the phone is in this state is
the list of network connections that are available. And actually, now I wonder
if the Notes freezing problem has the same cause. Seems like it. Maybe we can
fix all my problems in one fell swoop. :-))


> If you have in Settings -> Internet connections -> Connect automatically
> as "Any connection", could you try changing it to "Always ask" and see 
> whether you can better pinpoint when the issue happens?

I had "connect automatically" set to "Wi-Fi"; I've switched it to "always ask",
and we'll see if and when the problem returns.


> Due to the issue being non-reproducible, it hasn't been possible to
> fix the SGX 3D driver completely, at least so far. So... If this is a 
> reproducible use-case and also cause for the wlancond issue, I would
> be very much interested.

"Unfortunately" the display glitch has been a one-time thing for me too. If it
recurs with any frequency I'll let you know. (Interesting that it sounds like a
software issue; my guess was a hardware problem, perhaps something
overheating.)


I'll also try to get you dmesg output. I've saved the current output, but
transferring it will have to wait for a little bit (I did check for the string
"SGX" and it's not present, but that issue was a while ago and its messages are
probably long gone), and I'll grab another dmesg output when the wlancond thing
comes back.
Comment 10 Evan Driscoll 2010-02-22 13:28:48 UTC
Also, Bluetooth is off for me.
Comment 11 Bart (reporter) 2010-02-22 15:19:58 UTC
Created an attachment (id=2338)
Script to turn switch between connections.

It switches between 3G and a wifi connection. It's quite easy to modify it to
switch among wifi APs or just to turn the wifi on and off.

To find the big ugly strings that identify the wifi connections, use this:
gconftool -R /system/osso/connectivity/IAP | grep -a -B 10 YOURWIFINAME
You will see a section starting with
/system/osso/connectivity/IAP/XXX-XXX-XXX-XXXXX followed by the wifi
configuration; the string after "IAP/" is what you need to put in the script.
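The same lookup can be done without reading the grep context by eye. Below is a minimal sketch. It assumes the "gconftool -R" output format described above: an indented "/system/osso/connectivity/IAP/<ID>:" header line followed by that connection's settings. "iap_ids_for" is a hypothetical helper name. It simply reports the ID of each IAP section that contains the given text.

```sh
#!/bin/sh
# Hypothetical helper: print the ID of each IAP whose gconf section contains
# the given text. ASSUMPTION: `gconftool -R` prints each IAP directory as an
# (indented) "/system/osso/connectivity/IAP/<ID>:" header followed by its
# "key = value" lines.
iap_ids_for() {
    awk -v want="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # ignore indentation
        }
        line ~ /^\/system\/osso\/connectivity\/IAP\/[^\/]+:$/ {
            id = substr(line, length("/system/osso/connectivity/IAP/") + 1)
            sub(/:$/, "", id)                  # drop the trailing colon
            next
        }
        id != "" && index(line, want) > 0 {
            print id                           # report each IAP only once
            id = ""
        }
    '
}

# Usage on the device:
#   gconftool -R /system/osso/connectivity/IAP | iap_ids_for "YOURWIFINAME"
```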
Comment 12 Bart (reporter) 2010-02-22 15:23:40 UTC
Created an attachment (id=2339)
Syslog while wlancond locked up.

It covers the period from boot until I realized the situation and rebooted.
Comment 13 Bart (reporter) 2010-02-22 15:31:06 UTC
> Btw. Do you have bluetooth enabled (do you e.g. use BT headset)?  BT & WLAN
> share the antenna and it's possible that the drivers may have some interaction
> issue.
> 
> 
This happened to me both with BT on and off.

I've created a script to change the configuration constantly, but in 4 hours of
running it I didn't trigger the bug. Later I had to take a flight, and as soon as I
landed and turned the phone on, I hit the bug. I don't have an strace, since I
had no time to restart wlancond under strace after booting.

I've attached both files, script and syslog.
Comment 14 Eero Tamminen nokia 2010-02-22 16:39:16 UTC
(In reply to comment #9)
> > Evan, is this happening with PR1.1.1 for you also?
> 
> "About product" says 3.2010.02-8.002, so... I guess?

Yes.


(In reply to comment #13)
> I don't have the strace since I
> had no time to restart wlancond with strace since I booted.

dmesg output might also be interesting, to compare it against syslog (see
below).


(In reply to comment #12)
> Created an attachment (id=2339)
> Syslog while wlancond locked up.

There were some strange issues:

* Lots of messages like this:
---
Feb 20 15:21:36 viking udhcpc[1448]: bogus packet, option fields too long: Read past the packet length when getting option 0x2c (308 >= 308)
---

* This is most worrying:
---
Feb 20 15:32:47 viking wlancond[1791]: Unable to run: another instance running
(
PID 1144)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^
---

It seems that the syslog contents have gotten corrupted (there was also another
instance of it) unless the file somehow got corrupted before you tarred &
gzipped (if it was corrupted later on, untarring/ungzipping should complain).

I've seen that kind of thing once before, last fall, in which case the root
file system had gotten corrupted.

Can you provide /dev/mtd2 (small oopslog partition) contents?   I would like to
see whether kernel has logged any oopses for you.
Comment 15 Bart (reporter) 2010-02-22 17:35:12 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > I don't have the strace since I
> > had no time to restart wlancond with strace since I booted.
> 
> dmesg output might also be interesting, to compare it against syslog (see
> below).
> 
> 
Unfortunately, the syslog is from Saturday; I've checked dmesg and there is
nothing that old, only association info from today.

> (In reply to comment #12)
> > Created an attachment (id=2339)
> > Syslog while wlancond locked up.
> 
> There were some strange issues:
> 
> * Lots of messages like this:
> ---
> Feb 20 15:21:36 viking udhcpc[1448]: bogus packet, option fields too long: Read 
> past the packet length when getting option 0x2c (308 >= 308)
> ---
> 
Well, that was at the airport's baggage claim. I have no idea what the wifi
environment was there; there was probably some open hotspot, but I'm not sure.

> * This is most worrying:
> ---
> Feb 20 15:32:47 viking wlancond[1791]: Unable to run: another instance running
> (
> PID 1144)
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
> ^
> ---
> 
> It seems that the syslog contents have gotten corrupted (there was also another
> instance of it) unless the file somehow got corrupted before you tarred &
> gzipped (if it was corrupted later on, untarring/ungzipping should complain).
> 
> I've seen that kind of thing once before, from last fall in which case the root
> file system had gotten corrupted.
> 
> Can you provide /dev/mtd2 (small oopslog partition) contents?   I would like to
> see whether kernel has logged any oopses for you.
> 
I will upload it right away.
Comment 16 Bart (reporter) 2010-02-22 17:35:51 UTC
Created an attachment (id=2340)
Contents of /dev/mtd2
Comment 17 Eero Tamminen nokia 2010-02-22 18:22:32 UTC
Hm.  There were some swap-related oopses and a couple of oopses from mtdoops
itself (which I haven't seen before).   I didn't see any oopses from the PR1.1
release or later, but as you did an SSU, it's possible that your file system had
been slightly corrupted before the SSU.

Evan, can you provide also your /dev/mtd2 contents and later on syslog from
time when this issue happens to you (if you get it again)?  I want to know
whether they show similar issues.


Note that when you no longer need syslog, it's best to remove it and rm
/var/log/syslog*.  Syslog does some log file rotation, but over time it can fill
your rootfs completely.  Because it's written by a root process, the rootfs can
become so full that the device no longer boots (my device has had syslog running
for 1.5 months, but it's lightly used and I don't have that much extra stuff
installed on it).
Comment 18 Evan Driscoll 2010-02-22 21:03:15 UTC
> Evan, can you provide also your /dev/mtd2 contents and later on syslog
> from time when this issue happens to you (if you get it again)?  I want
> to know whether they show similar issues.

How do I get /dev/mtd2 off the device? Can I just 'tar cvf mtd2.tar /dev/mtd2'
then copy that or something?

Also, switching the internet setting to 'always ask' has at least solved my
problem; the phone's been running beautifully today, and is still reporting
full battery 4 hours after I unplugged it, even after a bit of use listening to
music.

At some point I'll switch it back to where it was and do the syslog thing,
assuming that the problem resurfaces.
Comment 19 Bart (reporter) 2010-02-22 23:42:20 UTC
(In reply to comment #18)
> > Evan, can you provide also your /dev/mtd2 contents and later on syslog
> > from time when this issue happens to you (if you get it again)?  I want
> > to know whether they show similar issues.
> 
> How do I get /dev/mtd2 off the device? Can I just 'tar cvf mtd2.tar /dev/mtd2'
> then copy that or something?
> 
I used "dd if=/dev/mtd2 of=/home/user/MyDocs/FILE_NAME" and then tarred the
file.
Comment 20 Eero Tamminen nokia 2010-02-23 13:01:05 UTC
(In reply to comment #18)
> > How do I get /dev/mtd2 off the device?

You need root.  Then just:
# cd /home/user/MyDocs
# gzip -c /dev/mtd2 > mtd2.gz
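Before attaching a dump like this, you can check whether it holds anything at all. Below is a minimal sketch. It assumes that erased flash reads back as 0xFF bytes, so an mtdoops partition with no logged oopses consists only of 0xFF. "mtd_dump_is_empty" is a hypothetical helper, and it expects the raw, uncompressed dump.

```sh
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if a raw mtdoops dump holds no data.
# ASSUMPTION: erased flash reads back as 0xFF, so an unused partition is all 0xFF.
mtd_dump_is_empty() {
    # Delete every 0xFF byte and count what is left.
    left=$(tr -d '\377' < "$1" | wc -c | tr -d ' ')
    [ "$left" -eq 0 ]
}
```

For example: gunzip -c mtd2.gz > mtd2.img && mtd_dump_is_empty mtd2.img && echo "no oopses logged"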
Comment 21 Evan Driscoll 2010-02-26 00:55:26 UTC
Created an attachment (id=2362)
my mtd2 partition

Sorry for taking a while to do this.
Comment 22 Evan Driscoll 2010-02-26 01:31:49 UTC
Also, I had this problem yesterday a couple times when manually switching
networks. (I caught it right away both times, fortunately.)

Also, it seems weird that my mtd2 partition is an order of magnitude smaller
than Bart's... I just noticed that now. I used the 'gzip -c /dev/mtd2 >
mtd2.gz' method.
Comment 23 Eero Tamminen nokia 2010-02-26 11:07:46 UTC
(In reply to comment #22)
> Also, I had this problem yesterday a couple times when manually switching
> networks. (I caught it right away both times, fortunately.)

Could you also try stracing wlancond when it starts taking all the CPU?
You can get strace from the SDK tools repository:
  http://wiki.maemo.org/Documentation/devtools/maemo5#Installation


> Also, it seems weird that my mtd2 partition is an order of magnitude smaller
> than Bart's... I just noticed that now. I used the 'gzip -c /dev/mtd2 >
> mtd2.gz' method.

Your mtd2 partition was empty, i.e. your device apparently hasn't had any
kernel oopses.  While in Bart's case I was wondering whether his root file
system could have been corrupted (before the PR1.1 update), you don't have any
indications of that, so it was probably a false clue.
Comment 24 Eero Tamminen nokia 2010-02-26 11:37:59 UTC
(In reply to comment #11)
> Created an attachment (id=2338)
> Script to turn switch between connections.
> 
> It switches between 3G and a wifi connection. It's quite easy to modifi to
> switch among wifi APs or just turn the wifi on and off.
> 
> To find out the big-ugly-strings that define the wifi connections, use this:
> gconftool -R /system/osso/connectivity/IAP | grep YOURWIFINAME -a B 10
> You will see a section starting with
> /system/osso/connectivity/IAP/XXX-XXX-XXX-XXXXX and with the wifi
> configuration following, the string after "IAP/" is what you need
> to put in the script.

Thanks!  The current Maemo Busybox ping doesn't support "-f" (or even "-i" or
"-w"), so I changed the ping line to:
  ping $TESTADDR -c 4 -s 1000

(I used "maemo.org" as TESTADDR and set the packet count to 4 (= 4 secs) to get
reasonable timeouts for ping.)

So far (~15 mins) it hasn't been able to trigger the issue.  How relevant is the
ECHO flood pinging for triggering this?
Comment 25 Bart (reporter) 2010-02-26 19:56:45 UTC
(In reply to comment #24)

> Thanks!  Current Maemo Busybox ping doesn't support "-f" (or even "-i" or "-w",
> so I changed the ping line to:
>   ping $TESTADDR -c 4 -s 1000
> 
> (I used "maemo.org" as TESTADDR and set packet count to 4 (=4 secs) to
> reasonable timeouts for ping.)
> 
I installed ping with apt-get, that one supports all usual options :)

> So far (~15 mins) it hasn't been able to trigger the issue.  How relevant the
> ECHO flood pinging is for triggering this?
> 
Unfortunately I don't know, since I haven't been able to trigger the bug in one
4-hour session with this script. I just wanted to make sure to "stress" the
connection, just in case it helps... I haven't ruled it out, since I've only
tried it once; I'll try to run it more this weekend.
Comment 26 Masood Mehmood 2010-04-01 03:47:50 UTC
@Evan Driscoll: is this bug resolved for you?

I'm facing the same problem. Running strace on wlancond gives you nothing. It
looks to me like a hardware issue, as I'm not seeing anyone else having this
problem, and it's really easy to reproduce.

Whenever I have this issue, I get these in the logs:

[44555.085845] wl1251: initialized
[44555.163269] cfg80211: Regulatory domain changed to country: US
[44555.163299]  (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[44555.163299]  (2402000 KHz - 2472000 KHz @ 40000 KHz), (600 mBi, 2000 mBm)
[44555.163330]  (5170000 KHz - 5190000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[44555.163360]  (5190000 KHz - 5210000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[44555.163360]  (5210000 KHz - 5230000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[44555.163391]  (5230000 KHz - 5330000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[44555.163391]  (5735000 KHz - 5835000 KHz @ 40000 KHz), (600 mBi, 3000 mBm)
[44555.243591] wlan0: deauthenticating by local choice (reason=3)
[44555.571777] wl12xx spi4.0: firmware: requesting wl1251-fw.bin
[44556.431121] wl1251: 151 tx blocks at 0x3b788, 35 rx blocks at 0x3a780
[44556.431457] wl1251: firmware booted (Rev 4.0.4.3.7)
[44556.609954] cfg80211: Regulatory domain changed to country: US
[44556.610015]  (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[44556.610046]  (2402000 KHz - 2472000 KHz @ 40000 KHz), (600 mBi, 2000 mBm)
[44556.610076]  (5170000 KHz - 5190000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[44556.610137]  (5190000 KHz - 5210000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[44556.610168]  (5210000 KHz - 5230000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[44556.610198]  (5230000 KHz - 5330000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[44556.610229]  (5735000 KHz - 5835000 KHz @ 40000 KHz), (600 mBi, 3000 mBm)
[44557.939056] wl1251: down
[44569.939208] wl1251: ERROR timeout waiting for the hardware to complete initialization
[44580.462432] wl1251: 151 tx blocks at 0x3b788, 35 rx blocks at 0x3a780
[44580.478057] wl1251: firmware booted (Rev 4.0.4.3.7)
[44581.798339] wl1251: down
[44597.423706] wl1251: ERROR timeout waiting for the hardware to complete initialization


- After that, wlancond takes 100% CPU.
- sudo gainroot hangs.
- ifconfig hangs.
- rmmod wl12xx hangs.

- kill -9 wlancond has no impact on wlancond.
- Reflashing with the latest image doesn't help; the issue is still there.
- Tested open and secure wireless routers of different brands; same issue
everywhere.


Any help?
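To see how often the driver went through the reinitialization cycle shown in the log in this comment, you can count the relevant wl1251 lines. Below is a minimal sketch. The two message patterns are copied from that dmesg excerpt, and "wl1251_error_counts" is a hypothetical helper name. It only counts lines; it does not diagnose anything.

```sh
#!/bin/sh
# Hypothetical helper: count two wl1251 messages taken from the dmesg excerpt above.
wl1251_error_counts() {
    # Reads kernel log text on stdin, e.g.:  dmesg | wl1251_error_counts
    awk '
        /wl1251: firmware booted/                         { boots++ }
        /wl1251: ERROR timeout waiting for the hardware/  { timeouts++ }
        END { printf "firmware_boots=%d init_timeouts=%d\n", boots, timeouts }
    '
}
```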
Comment 27 Oskar Arvidsson 2010-05-02 14:50:55 UTC
*** This bug has been confirmed by popular vote. ***
Comment 28 Andre Klapper maemo.org 2010-05-02 22:17:57 UTC
CC'ing Jason.

(In reply to comment #26)
> [44557.939056] wl1251: down
> [44569.939208] wl1251: ERROR timeout waiting for the hardware to complete
> initialization

Jason, any idea how to further track this down?
Comment 29 Oskar Arvidsson 2010-05-31 11:25:30 UTC
The problem still exists in PR 1.2. It happened to me twice yesterday, with
auto-connect enabled and the device connecting to my wireless AP at home. The
second time it happened, the device was charging, leading to even more heat.

I'm concerned this will eventually shorten the device's lifetime. As this seems
to be a hardware issue, is there a possibility to have the unit replaced?
Comment 30 Eero Tamminen nokia 2010-05-31 14:14:24 UTC
(In reply to comment #29)
> As this seems to be a hardware issue, is there a possibility to have the unit replaced?

You need to ask Nokia Care.  Maemo community bugzilla tracks just SW issues.
Comment 31 Andre Klapper maemo.org 2010-08-31 15:02:28 UTC
Oskar: Did you try to get a new unit?

Anybody else still having this problem in 10.2010.19-1?
Comment 32 Bart (reporter) 2010-08-31 15:42:01 UTC
(In reply to comment #31)
> Oskar: Did you try to get a new unit?
> 
> Anybody else still having this problem in 10.2010.19-1?
> 

Yes, although it happens MUCH less often than with the original
firmware. In the last couple of months it happened to me only three times (two
of them in a row), while before it happened in at least 50% of attempts to
connect to an AP.
Maybe related: all three times it happened, it was because a program
(imhere-0.3, extras-devel) was trying to connect.
Maybe related: lately I have set auto-connect to "Wifi Only", since I no
longer pay for a 3G flatrate.
Comment 33 Oskar Arvidsson 2010-09-01 21:33:34 UTC
(In reply to comment #32)
> (In reply to comment #31)
> > Oskar: Did you try to get a new unit?
> > 
> > Anybody else still having this problem in 10.2010.19-1?
> > 
> 
> Yes, although it happens with MUCH lower frequency than with the original
> firmware. In the last couple of months it happened to me only three times (two
> of them in a row), while before it happened at least in 50% of attempts to
> connect to an AP.
> Maybe related: when it happened, all the three times it was because a program
> (imhere-0.3, extras-devel) was trying to connect.
> Maybe related: lately I have the auto-connect setting to "Wifi Only" since I no
> longer pay a 3G flatrate.
> 

I did contact Nokia Care, but as I did not receive a return call from them and
the problem almost ceased to occur, I have not tried again.

The frequency of this problem is now approximately once a month. I am using
both 3G and wifi.

Maybe of interest: I am using kernel power from extras.
Comment 34 Oskar Arvidsson 2010-11-21 21:09:05 UTC
I thought I'd update you on the status. I handed it in to Nokia for repair and
they replaced the wifi card. After this, the issue is gone (at least it's been
a few days with no problems). So I think we could close this bug report. It
seems to be a hardware issue after all.

Message from Nokia: "Replaced NOK/WLAN Size4.0 Module ENW49701N LGA80"
Comment 35 Bart (reporter) 2010-11-21 21:31:43 UTC
(In reply to comment #34)
> I thought I'd update you on the status. I handed it in to Nokia for repair and
> they replaced the wifi card. After this, the issue is gone (at least it's been
> a few days with no problems). So I think we could close this bug report. It
> seems to be a hardware issue after all.
> 
> Message from Nokia: "Replaced NOK/WLAN Size4.0 Module ENW49701N LGA80"
> 

In my case (as the original reporter) the issue slowly went away over time,
without any hardware modifications. I had only a couple of issues in the whole
firmware 1.2 period and none since 1.3 went public.
Comment 36 Fabrice Gabolde 2011-01-25 23:49:44 UTC
Hi,

This is happening to me as well; since I have additional symptoms, my setup
probably has several issues.  Some details in chronological order:

Immediately after the phone is bought and unpacked (around October), everything
works fine: connection to wifi, 3G, battery life is shortish but OK.

Updated over the air around December.

Shortly (immediately?) after this, the phone sometimes doesn't find the wireless
connection or the 3G connection; nothing changed in my home wifi setup.  Battery life
unchanged.  dmesg shows wl1251 spewing error messages when I pull up the
connections dialog and nothing shows up.  Rebooting sometimes fixes the issue.

Last Friday I decided to use "restore original settings" (not "clear device") in
the hope that some package I had installed was messing with the wireless.

Now I suffer from these issues:

* The phone can almost never find wireless; 3G works barely better

* When it fails to find wireless or 3G, wlancond shows up in top as using up
all the CPU; sudo gainroot hangs; kill -9 started from a previously opened root
shell hangs as well; naturally, since a process is running at 100%, the phone
gets warm and the battery lasts only a few hours

* When it does find the wireless it can lose it again randomly, but wlancond
doesn't show up in top, and I don't know yet if the battery is drained

  * then ifconfig still shows an IP but I can't ping anything

  * dmesg shows wl1251 spamming the logs with "ERROR elp wakeup timeout"

  * ifconfig wlan0 down && ifconfig wlan0 up logs this:

    "mac80211-phy0: failed to remove key (0, ff:ff:ff:ff:ff:ff) from hardware
(-110)"

    before the "down" message, then the usual "firmware booted" message, but
still no connection

* Probably unrelated, but I couldn't even install strace and others since the
connection dropped when I was trying to reload the cache after adding the tools
repository.
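To collect the wl1251 evidence described above, a small helper can count the "elp wakeup timeout" lines in the kernel log before and after bouncing wlan0. This is a sketch, not a tested diagnostic: the function names are hypothetical, the 5-second wait for the firmware to come back is a guess, and since the kernel ring buffer wraps, the counts only cover what dmesg still holds.

```shell
#!/bin/sh
# Sketch: count wl1251 "elp wakeup timeout" errors around a wlan0 bounce.
# Function names are hypothetical; run bounce_wlan0 as root on the device.

# Print how many lines on stdin mention the elp wakeup timeout.
# grep -c prints 0 but exits 1 when nothing matches, hence "|| true".
count_elp_timeouts() {
  grep -c 'elp wakeup timeout' || true
}

# Take the interface down and up, then compare the error counts.
bounce_wlan0() {
  before=$(dmesg | count_elp_timeouts)
  ifconfig wlan0 down && ifconfig wlan0 up
  sleep 5  # assumption: enough time for the "firmware booted" message
  after=$(dmesg | count_elp_timeouts)
  echo "elp wakeup timeouts: before=$before after=$after"
  # Show the most recent wl1251/mac80211 lines to attach to the bug report.
  dmesg | grep -E 'wl1251|mac80211' | tail -n 20
}
```

Only `count_elp_timeouts` is portable; `bounce_wlan0` needs root and the real wlan0 interface.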

If it is of any value I'll try to post any output you require (made more
difficult by the fact that I have no SSH to or from the phone, obviously, and I
don't know where I've put the USB cable).
Comment 37 Eero Tamminen nokia 2011-01-31 12:01:36 UTC
> * Probably unrelated, but I couldn't even install strace and others since the
> connection dropped when I was trying to reload the cache after adding the tools
> repository.

Just copy the packages to a memory card on PC and install them on device with
"dpkg -i" as root.  Things like strace don't have extra dependencies so they
can be easily installed like that.
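A minimal sketch of that offline install, assuming the microSD card is mounted at /media/mmc1 and the .deb files were copied into a hypothetical `debs` folder on it. Passing all packages to one `dpkg -i` call lets dpkg configure packages that depend on each other together.

```shell
#!/bin/sh
# Sketch: install .deb files copied onto the memory card, without network access.
# Assumption: the card is mounted at /media/mmc1 and the packages sit in a
# hypothetical "debs" folder there. Run as root.

# DPKG may be overridden (for example with "echo") for a dry run.
install_local_debs() {
  pkg_dir=${1:-/media/mmc1/debs}
  set -- "$pkg_dir"/*.deb
  # If the glob matched nothing, "$1" is the literal pattern and does not exist.
  if [ ! -e "$1" ]; then
    echo "no .deb files found in $pkg_dir" >&2
    return 1
  fi
  ${DPKG:-dpkg} -i "$@"
}
```

On the device, `install_local_debs` with no argument uses the default folder.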
Comment 38 Samuli Seppänen 2011-12-15 11:04:09 UTC
I just sent my N900 to Nokia Care due to this issue - I'll let you know if it
gets fixed.

Anyway, this bug was triggered by a WLAN scan, about 10-20% of the time. When
using the connection applet, the WLAN networks popped up unusually late (after
3-5 seconds). Attempting to connect to a WLAN AP always failed. At this point
wlancond was _not_ yet hogging the CPU. However, another WLAN scan caused
wlancond to start eating 100% CPU.

This problem was also triggered by the automatic background WLAN scan. I was
able to reproduce this bug in at least 4 different locations with many
different WLAN APs.

Any attempt to kill wlancond as root from X Terminal failed. Removing the WLAN
chipset's kernel module also failed. That said, one time I tried the latter and
did not reboot the phone myself: after a few minutes it rebooted spontaneously.

I worked around this with a script that used dbus signals to connect to my home
network. This worked ok, but after several connections (10-15) it would not
reconnect anymore, and starting a WLAN scan using the connection applet
triggered the bug.
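The script itself isn't posted. As a hedged sketch only: on Maemo 5 a connection request can be sent to the connectivity daemon (ICD) with dbus-send. The method used here, com.nokia.icd.connect taking an IAP id string and a uint32 flags value, is an assumption to check against the ICD documentation, and HOME_IAP is a placeholder for your own saved connection's id.

```shell
#!/bin/sh
# Sketch: ask ICD to bring up a saved connection over D-Bus.
# Assumption: ICD answers com.nokia.icd.connect(string iap_id, uint32 flags)
# on the system bus. HOME_IAP is a placeholder for your own IAP id.
HOME_IAP=${HOME_IAP:-my-home-iap}

# DBUS_SEND may be overridden (for example with "echo") for a dry run.
connect_iap() {
  ${DBUS_SEND:-dbus-send} --system --type=method_call --print-reply \
    --dest=com.nokia.icd /com/nokia/icd com.nokia.icd.connect \
    string:"$1" uint32:0
}

# Usage on the device: connect_iap "$HOME_IAP"
```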

When I got the phone, it apparently had PR1.2 installed (did not check), as I
got one major update (PR1.3) right away. After that (a month ago?) I installed
a small security update. Before sending the phone to Nokia Care, I tried
flashing the rootfs (fiasco image) to PR1.3 and installed the latest security
update on top of that. The issue was still there, even though no extra software
was installed.

Also, my wife's N900 does not have this issue and it's used in the exact same
WLAN environment. There is/was definitely something funky in my N900's
hardware.
Comment 39 Samuli Seppänen 2012-02-02 20:50:49 UTC
It's probably not worth sending an N900 that has this issue to Nokia Care. The
guy in my local Nokia Care shop said that the factory will most likely replace
the phone with a Nokia E7 or similar, and people reading this bug report would
probably be scared of that idea :). Also, replacing the WLAN card yourself will
be difficult because it's soldered to the motherboard. 

Now to the good news... at least for me updating to these (backported) WLAN
drivers fixed the issue:

<http://david.gnedt.eu/blog/wl1251>

Install them according to the standard instructions. If you want to autoload
them, add the following to /etc/event.d/wlancond:

  pre-start script
  cd /etc/compat-wireless
  sh load.sh
  end script

You need to copy the new kernel modules, as well as the load.sh and unload.sh
scripts to /etc/compat-wireless. Do not install the customized osso-wlan
package unless you know you need it; it will cause problems when installing
other software.
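The copy step can be scripted. This is a minimal sketch: the source directory is whatever you unpacked the driver package into (left as a parameter), and the layout, with all .ko files next to load.sh and unload.sh, is an assumption about the package. The pre-start stanza above runs `sh load.sh`, so the scripts don't need the execute bit.

```shell
#!/bin/sh
# Sketch: copy the backported modules and helper scripts into
# /etc/compat-wireless, where the pre-start stanza expects them.
# Assumption: the unpacked package keeps the .ko files, load.sh and
# unload.sh together in one directory. Run as root.

install_compat_wireless() {
  src=$1
  dest=${2:-/etc/compat-wireless}
  # Refuse to continue if either helper script is missing.
  for f in load.sh unload.sh; do
    [ -f "$src/$f" ] || { echo "missing $src/$f" >&2; return 1; }
  done
  mkdir -p "$dest" || return 1
  # cp fails (and we return 1) if the *.ko glob matched nothing.
  cp "$src"/*.ko "$src/load.sh" "$src/unload.sh" "$dest"/ || return 1
}
```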

I have not noticed much of a difference in battery life using this new driver
compared to the stock driver. I still get about 4 days in my normal use.