Bug 2006 - Memory corruption during WLAN use
: Memory corruption during WLAN use
Status: RESOLVED FIXED
Product: Connectivity
WiFi
: 2.2
: 770 Linux
: High critical with 1 vote (vote)
: ---
Assigned To: unassigned
: wifi-bugs
:
:
:
:
  Show dependency tree
 
Reported: 2007-09-13 01:16 UTC by Tilman Vogel
Modified: 2007-12-11 22:48 UTC (History)
6 users (show)

See Also:


Attachments
cx3110x-nodma.ko (66.06 KB, application/octet-stream)
2007-10-28 16:36 UTC, Siarhei Siamashka
Details
n770_wlan_memory_corruption_fix.diff (1.35 KB, patch)
2007-10-29 22:06 UTC, Siarhei Siamashka
Details
declaring recv_buffer on stack instead of global data (1.21 KB, patch)
2007-10-30 23:17 UTC, Frantisek Dufka
Details
declaring recv_buffer on stack instead of global data - binary module (61.57 KB, application/octet-stream)
2007-10-30 23:18 UTC, Frantisek Dufka
Details
n770_wlan_align_fix.diff (2.70 KB, patch)
2007-10-30 23:27 UTC, Siarhei Siamashka
Details
adds some stats into cx3110x_spi_dma_read/write (3.74 KB, patch)
2007-11-11 15:28 UTC, Frantisek Dufka
Details
declaring recv_buffer on stack instead of global data - final (2.06 KB, patch)
2007-11-15 17:23 UTC, Frantisek Dufka
Details
declaring recv_buffer on stack instead of global data - final binary module (61.61 KB, application/octet-stream)
2007-11-15 17:27 UTC, Frantisek Dufka
Details
installable deb package for OS2006 or older OS2007HE (28.74 KB, application/x-debian-package)
2007-11-16 12:03 UTC, Frantisek Dufka
Details
source for deb package (3.40 KB, application/x-gzip)
2007-11-16 12:08 UTC, Frantisek Dufka
Details
installable deb package for OS2006 or older OS2007HE 0.2 (28.97 KB, application/x-debian-package)
2007-11-20 15:12 UTC, Frantisek Dufka
Details
source for deb package (3.58 KB, application/x-gzip)
2007-11-20 15:15 UTC, Frantisek Dufka
Details


Note

You need to log in before you can comment on or make changes to this bug.


Description Tilman Vogel (reporter) 2007-09-13 01:16:15 UTC
EXPECTED OUTCOME: 
n/a

ACTUAL OUTCOME:
Depending on the memory location to which the modules umac.ko and cx3110x.ko
get loaded, exactly two consecutive bytes at fixed physical locations in memory
get overwritten by zeroes everytime there is WLAN activity: while scanning for
networks as well as when data is actually transmitted.

On a vanilla NOKIA770_2006SE_3.2006.49-2_PR_MR0, the modules get loaded at (cat
/proc/modules):
cx3110x 51420 0 - Live 0xbf03f000
umac 253316 1 cx3110x, Live 0xbf000000

In this case, the two bytes are at physical location 0x1304b8b4 and 0x1304b8b5
(these addresses include an offset of 0x10000000 - see /proc/iomem).

When booting the same OS from an ext2 formatted MMC partition, then the modules
are at:
cx3110x 51420 0 - Live 0xbf04e000
umac 253316 1 cx3110x, Live 0xbf00f000
ext2 43524 1 - Live 0xbf003000
mbcache 7716 0 - Live 0xbf000000

I.e. due to the two extra modules, umac.ko and cx3110x.ko are shifted by
0xf000. Accordingly, the corrupted bytes get shifted by 0xf000 to 0x1305a8b4
and 0x1305a8b5.


STEPS TO REPRODUCE THE PROBLEM:

Needed tools:

memtester: http://pyropus.ca/software/memtester/
binaries: http://freenet-homepage.de/tvogel/memtester-4.0.7-bin.tar.bz2

urumt: http://www.normalesup.org/~george/comp/rumt/
with patch: http://freenet-homepage.de/tvogel/rumt-n770.patch
binaries: http://freenet-homepage.de/tvogel/rumt-bin.tar.bz2

How to reproduce the problem:

- open two root shells on your 770 (in xterms or via ssh)
- start WLAN on the 770, flood ping "ping -f" your 770 in order to create
network traffic
- in the first shell, start memtester starting with a size that shows the
corruption (the first argument is the size in MB, e.g. ./memtester 24)
- successively reduce the size until you don't see corruption: This makes it
likely, that the next alloc of 1 MB will get the block with the bad bytes
- let memtester run and now in the second shell, start "urumt -p 256": This
will allocate 1 MB of memory, locate its physical addresses in /dev/mem and
start testing.

You'll get bit-precise location information on which bits get corrupted into
which direction: + (1->0) or - (0->1).

WORKAROUND:
My idea was to write a programm that tries to allocate the bad memory block,
lock it and then just sleep forever. This would save other processes from
stepping into the trap.

source code: http://freenet-homepage.de/tvogel/blockbad.c

binary: http://freenet-homepage.de/tvogel/blockbad

The programm takes as argument the memory page to block. If urumt reported
123ef:8bc, strip off the leading 1 and the last 3 digits, i.e. use 0x23ef in
this example. The programm will always allocate 32MB RAM in order to search for
the block. This is currently hardcoded. After the block is found, the other
blocks are freed up again. Of course, you should stop memtester and urumt
before that.

If this works for you, you might consider starting blockbad 0x23ef at the end
of /etc/init.d/minircS .

Then you can check with "ps" if blockbad is running. If so, it found and
allocated the suspicious memory block. If not, it was out of luck and didn't
get that block assigned by the kernel.


OTHER COMMENTS:
OS version: NOKIA770_2006SE_3.2006.49-2_PR_MR0
WLAN settings: WEP, problem also present without WEP, WPA not tested
N800: not yet tested
Comment 1 Kalle Valo nokia 2007-09-13 08:34:45 UTC
I'm testing N800 software version RX-34_2007SE_4.2007.26-8 with memtester and
haven't seen any corruption after five minutes. I'm using WEP and ping flooding
the device constantly.

My commandline:

./memtester 50

I'll leave the test running for few hours, let's see what happens.
Comment 2 Frantisek Dufka maemo.org 2007-09-13 10:25:08 UTC
This was reproduced by more people including myself, see my mail here
http://maemo.org/pipermail/maemo-developers/2006-September/005633.html
See also related/duplicate bug #677 (comment 4)
Comment 3 Tilman Vogel (reporter) 2007-09-13 11:35:54 UTC
(comment 1) I don't think you should have to wait for the error to appear. For
me, either the errors turn up immediately or never. It just depends on which
physical memory blocks are given to memtester by the kernel. So for the N800 I
see different possibilities:
1) the error is fixed
2) the block with the damaged bytes is already allocated by something else: one
could try to start memtester in an earlier stage of booting when the block is
still free and redirect its output. I also had this problem sometimes on the
770 when trying to reproduce the problem after I had it running for a while.
Especially now, as I start blockbad in minircS, when I kill it later in order
to check the bug again, the tiny hole in memory allocation that blockbad
leaves, gets quickly filled by other small things, like e.g. the keyboard
applet. When I start blockbad again, it has bad chances to catch the block
again.
3) memtester just doesn't catch the block: then increasing the size further
could help, also opening and closing applications could generate some
rearrangement of memory making it more likely that the questionable block gets
free again by chance (paging). I don't know a good tool for that, but you can
check if a given physical block is available by running blockbad and checking
if it can allocate the block you want.

For N800, one should probably increase the search range of blockbad which is
currently hardcoded to 32M. Also, memory mapping is definitely not the same as
on 770. /proc/iomem shows that system RAM is mapped to 0x10000000-0x14000000 on
the 770. This mapping is currently hardcoded in both blockbad and in the
rumt-n770.patch.
Comment 4 Kalle Valo nokia 2007-09-13 18:21:44 UTC
I couldn't reproduce this with N800 with simple memtester tests.

Moving to Connectivity/WiFi.
Comment 5 Tilman Vogel (reporter) 2007-09-14 02:11:48 UTC
Ok, there was a bug in my blockbad.c: After finding the block, I called
daemon(0,0) in order to detach from the terminal, but this lifts the mlock()!
(In fact, running the buggy version twice, sometimes found the block again :-/)

Now I fork first and mlock() the block in the child which lives forever. Please
get the new version at the same URLs.

Sorry!
Comment 6 Siarhei Siamashka 2007-10-09 22:11:44 UTC
(In reply to comment #1)
> I'm testing N800 software version RX-34_2007SE_4.2007.26-8 with memtester and
> haven't seen any corruption after five minutes. I'm using WEP and ping flooding
> the device constantly.
> 
> My commandline:
> 
> ./memtester 50
> 
> I'll leave the test running for few hours, let's see what happens.

Just in case if you have never received it, here is the text from my e-mail
sent to you on 2007-01-28 (shortly after the last Nokia 770 firmware update):

"Hello Kalle,

I just flashed the latest IT OS2006 update 3.2006.49-2 (in the hope 
that it could probably fix this problem) and run some tests. The memory
corruption problem is still there (see memtester report at the end of my
message).

For your convenience here is the binary of the test program that I used 
(patch for upstream memtester 4.0.5 included):
http://ufo2000.xcomufo.com/files/memtester-n770.tar.gz
Memtester home page is here: http://pyropus.ca/software/memtester/
Here is a direct link for downloading memtester 4.0.5:
http://pyropus.ca/software/memtester/old-versions/memtester-4.0.5.tar.gz

The detailed steps the following (some may be redundant, but again, that's
exactly what I did):
* flashed new IT OS update (battery full)
* installed xterm, ssh and midnight commander
* added support for ext3 (using the steps similar to 
https://maemo.org/maemowiki/HowTo_mmc_ext2 but more modules for insmod were 
required)
* started memtester from mmc card to run in a loop from ssh session (and using
slightly more than 20MB for testing)
* after some time of running tests also started background transfer of some 
large file via scp
* as none of the problems showed up up to this moment, stopped memtester with
ctrl-c and decided to increase memory size to test (up to 30MB), in the
process of running this test the device suddenly rebooted
* immediately after reboot it complained about low battery charge
* I put the device on charger, started wifi connection and tried to run
memtester again, it detected some problems 

All the process starting from flashing the latest firmware till getting error
messages in memtester took a few hours.

If you need ANY additional information, I would be glad to help.


I'm more worried about the following. This is not the first time this problem
got reported (the last time in a private mail to Kimmo Hämäläinen). But the
problem was never confirmed and could not be reproduced on your side. On the 
other hand it was reproduced on a few other devices. And there are still lots
of complaints about spontaneous reboots and crashes in forums. This memtester
is 
check seems to be the most easy and "scientific" way to verify wifi related 
system stability problems at the moment. So it is a big step forward in problem
reproducibility.

...
"

Well, regarding this issue, as Tilman Vogel said in a few comments above, it
takes some time and patience to reproduce the problem. But what is much more
important is just a bit of trust in our bugreports, we are really not bad guys
and have no evil intentions :) A quick five minute test is definitely not
enough to say that the problem is not reproducible, especially for a critical
one such as this one.

Do you have any advice about what could be done to investigate this problem
further and try to fix it? Is N800 wifi driver much different from the one used
in Nokia 770? I hope it will not take much of your time and efforts to provide
some minor guidance and a few advices about how to build a debugging version of
wifi driver for Nokia 770. Such help would be very much appreciated, especially
considering that you are primarily focused on N800 support now.
Comment 7 Frantisek Dufka maemo.org 2007-10-09 22:55:54 UTC
BTW as Tilman mentioned the address depends on address where wi-fi modules are
loaded. This definitely looks like some SW bug in the driver. I'm not sure when
I'll get to it but is is relatively easy to either modify load_wlan_module in
/mnt/initfs/linuxrc or use newest bootmenu with usb recovery mode and insert
umac.ko and cx3110x.ko by hand intermixed with some other module to pinpoint
which of those two contains the bug. Anyone?

In case of cx3110x.ko latest n770 sources of cx3110x driver would be enough
either to fix it or implement workaround with allocating the corrupted page
directly in cx3110x.ko. In case of umac only workaround is possible to do by us
(perhaps via other kernel module taking umac module address as a parameter).

Kalle, are cx3110x sources for latest 770 firmware available? The cx3110x
garage project and Bora repository seems to contain only N800 code. I couldn't
find similar source in 2.2/gregale repository . I think n770 sources were
available at some time in the past.
Comment 8 Frantisek Dufka maemo.org 2007-10-10 12:01:35 UTC
Just googled around and it looks there is a reserve_bootmem(addr,size) call in
linux kernel that should allocate the page so the workaround may not be hard to
do in kernel. 

See also
http://lxr.linux.no/ident?i=reserve_bootmem
http://rick.vanrein.org/linux/badram/index.html
Comment 9 Kalle Valo nokia 2007-10-15 16:51:09 UTC
(In reply to comment #6)
> Just in case if you have never received it, here is the text from my e-mail
> sent to you on 2007-01-28 (shortly after the last Nokia 770 firmware update):

I remember your email now. It got burried somewhere inside my inbox, I'm sorry
about that.

> Well, regarding this issue, as Tilman Vogel said in a few comments above, it
> takes some time and patience to reproduce the problem. But what is much more
> important is just a bit of trust in our bugreports, we are really not bad guys
> and have no evil intentions :)

It's not about trust, don't worry. It's more like there's only 24 hours in a
day and I have to sleep also :)

> A quick five minute test is definitely not
> enough to say that the problem is not reproducible, especially for a critical
> one such as this one.

I run it longer than that and also one of our testers tested this on N800, but
we couldn't reproduce it. Either this is much more difficult to reproduce on
N800 or this only a bug in 770.

> Do you have any advice about what could be done to investigate this problem
> further and try to fix it? Is N800 wifi driver much different from the one used
> in Nokia 770? 

The biggest difference is that 770 uses McBSP and N800 uses McSPI. I would
start checking that first.

I would appreciate if people reported if they can or cannot reproduce this on
N800.
Comment 10 Kalle Valo nokia 2007-10-15 16:54:35 UTC
(In reply to comment #7)
> Kalle, are cx3110x sources for latest 770 firmware available? The cx3110x
> garage project and Bora repository seems to contain only N800 code. I couldn't
> find similar source in 2.2/gregale repository . I think n770 sources were
> available at some time in the past.

They should be, but obviously something has gone wrong. I will sort this out,
but it might take few days. I'll let you know when they are again available.
Comment 11 Siarhei Siamashka 2007-10-15 22:52:37 UTC
(In reply to comment #9)
> I run it longer than that and also one of our testers tested this on N800, but
> we couldn't reproduce it. Either this is much more difficult to reproduce on
> N800 or this only a bug in 770.

Maybe. On the other hand, we always take seriously the bugs reported against
the older version of software at my job even if they are not reproducible with
the latest version (unless there have been some changes to fix it). Unfixed
bugs tend to hide and bite you again :) But as you said, the problem could be
in the code that is different for 770 and N800.

> The biggest difference is that 770 uses McBSP and N800 uses McSPI. I would
> start checking that first.
> 
> I would appreciate if people reported if they can or cannot reproduce this on
> N800.

OK, I'll try to do some tests when I have a bit more spare time.

> They should be, but obviously something has gone wrong. I will sort this out,
> but it might take few days. I'll let you know when they are again available.

That's great. Having the last Nokia 770 wifi driver sources will definitely
increase our chances to find some solution to this problem.


I had a look at this issue. As far as I remember, in my tests with memtester
4.0.5, the offset of faulty address aligned at 1K boundary reported by
memtester was 0x1a5 on the older version of OS2006 firmware and 0x22d with the
latest one. I don't know if memtester provides byte precise results (and the
initial Tilman's post lists different address). This gives a consistent 0x88
bytes difference. I unpacked FIASCO images from OS2006 3.206.49-2 and
1.2006.26-8 and took a look at cx3110x.ko
This module is really a bit larger in the latest version of firmware. If we
suppose that the problematic address has a constant address relative to
cx3110x.ko, then finding a symbol in disassembly listing that got shifted by
this amount (0x88?) would make it very suspicious. I did a quick comparison,
but it did not reveal any interesting symbols at 0x88 difference. Maybe I'll
try to do some more tests again to get problematic addresses with both these
wifi driver modules to calculate precise difference.

I also took a quick look at the N800 driver sources (without trying too hard to
understand what it does). My wild guess is that 'spi_dma.recv_buffer' looks the
most suspicious to me. Could it possibly be that cx3110x.ko gets relocated in
physical memory somehow and DMA(?) keeps writing data to now wrong memory
location that was occupied by 'spi_dma.recv_buffer' earlier? I'm not a kernel
hacker and that was just an attempt (maybe completely stupid) for
brainstorming.
Comment 12 Siarhei Siamashka 2007-10-28 01:22:11 UTC
Well, I really can't wait for being able to compile the latest stable Nokia 770
version of cx3110x.ko with the dummy variable 'spi_dma.recv_buffer' moved to
some kmalloc allocated block of memory (see
linux/Documentation/DMA-mapping.txt, "This rule also means that you may use
neither kernel image addresses (items in data/text/bss segments), nor module
image addresses, nor stack addresses for DMA." part). I strongly suspect that
this minor fix is all that is needed to solve the problem.

Also, it is not quite related, but probably wifi performance can be improved a
bit (waiting for DMA completion in Nokia 770 versions of 'cx3110x_spi_dma_read'
and 'cx3110x_spi_dma_write' does not look too efficient). I downloaded
STLC4370.pdf and it mentions that SPI serial host interface should support
speeds up to 48Mbps. So the wlan chip itself should be capable of providing
much better speed than we see now. That's particularly interesting as it might
probably surpass (very slow) USB performance and become the fastest way to
transfer data to or from the device. How fast is McBSP actually, is it really
the bottleneck? Also I wonder why is it needed to perform dummy DMA read in
'cx3110x_spi_dma_write' and dummy DMA write is 'cx3110x_spi_dma_read'? Is it a
workaround for some problem or McBSP always requires bidirectional transfers?
If not, doesn't this just waste precious bandwidth?
Comment 13 Siarhei Siamashka 2007-10-28 16:36:51 UTC
Created an attachment (id=585) [details]
cx3110x-nodma.ko

I have attached a patched cx3110x.ko binary file. It was originally taken from
the last OS2006 firmware (3.2006.49) and I have inserted a single jump
instruction passing control to 'cx3110x_spi_write' into the very beginning of
'cx3110x_spi_dma_write' function (which is apparently causing this memory
corruption problem). So we effectively fallback to 'cx3110x_spi_write' instead
of 'cx3110x_spi_dma_write' with this patched binary :)

Steps to test a patched driver:
1. copy 'cx3110x-nodma.ko' somewhere to your device
2. login as root
3. run the following sequence of commands from the directory where you have
'cx3110x-nodma.ko':
> rmmod cx3110x
> insmod cx3110x-nodma.ko
> chroot /mnt/initfs /usr/bin/wlan-cal
4. connect to wireless network
5. do some tests trying to catch memory corruption error with memtester and
flood ping

I could not reproduce any memory corruption problems with it (while I did
confirm them with the official wifi driver again today) and now have flashed
patched cx3110x driver to initfs on my device (replacing 'cx3110x.ko' with it
and also deleting 'cx3110x_mt.ko' just in case as it seems useless there).
Wireless network connection seems to work properly, and it seems stable so far.
Most likely it will be slower on sending data, but I prefer a slow driver
instead of a broken one :)

I'm still keeping an eye on it while heavily using wlan. I will report if I
notice anything unusual. Reports confirming that this fix is working (or
proving that the bug is still here) are also welcome.

Having the driver sources will allow us to get a proper fix for this problem
instead of a binary hack. We'll also have to think about the way of providing
some convenient driver upgrade solution for all the Nokia 770 users.
Comment 14 Kalle Valo nokia 2007-10-29 16:06:18 UTC
I have now published the cx3110x sources for Nokia 770:

https://garage.maemo.org/frs/shownotes.php?release_id=1012

I had some problems with the tarball upload, but I hope it's fixed now. Here's
the md5sum just in case:

85c115a81fa4429bee2cd16bfe961d44  cx3110x-0.8.1.tar.gz
Comment 15 Tilman Vogel (reporter) 2007-10-29 16:27:50 UTC
(In reply to comment #14)
> 85c115a81fa4429bee2cd16bfe961d44  cx3110x-0.8.1.tar.gz

It seems, the upload is missing some bits, I got 68594 bytes with an md5sum of

4874784b6948d056cf17b8b7aa0ca9bb  cx3110x-0.8.1.tar.gz
Comment 16 Frantisek Dufka maemo.org 2007-10-29 17:19:32 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > 85c115a81fa4429bee2cd16bfe961d44  cx3110x-0.8.1.tar.gz
> 
> It seems, the upload is missing some bits, I got 68594 bytes with an md5sum of
> 
> 4874784b6948d056cf17b8b7aa0ca9bb  cx3110x-0.8.1.tar.gz
> 

Works for me

[sbox-SDK22_ARMEL: ~] > ls -la cx3110x-0.8.1.tar.gz
-rw-rw-r--  1 maemo maemo 68594 Oct 29 15:49 cx3110x-0.8.1.tar.gz
[sbox-SDK22_ARMEL: ~] > md5sum cx3110x-0.8.1.tar.gz
85c115a81fa4429bee2cd16bfe961d44  cx3110x-0.8.1.tar.gz
Comment 17 Kalle Valo nokia 2007-10-29 17:33:02 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #14)
> > > 85c115a81fa4429bee2cd16bfe961d44  cx3110x-0.8.1.tar.gz
> > 
> > It seems, the upload is missing some bits, I got 68594 bytes with an md5sum of
> > 
> > 4874784b6948d056cf17b8b7aa0ca9bb  cx3110x-0.8.1.tar.gz
> > 
> 
> Works for me

Really strange, I tried with another laptop and the file was corrupted for me
also.
Comment 18 Tilman Vogel (reporter) 2007-10-29 17:39:18 UTC
(In reply to comment #17)
> Really strange, I tried with another laptop and the file was corrupted for me
> also.

Yes, indeed! When I download with firefox, it's corrupted, when I use wget, it
works. The files have the same size, but the content is shifted strangely:

Correct:
00000000  1f 8b 08 00 37 56 13 47  00 03 ec 5b df 73 db 46  |....7V.G...[.s.F|
00000010  92 ce eb ce 5f 31 e7 97  48 55 34 23 c9 92 1c c7  |...._1..HU4#....|
00000020  75 0f 94 44 db bc 95 48  85 a4 e2 f8 29 05 12 43  |u..D...H....)..C|
...

Wrong:
00000000  1f 8b 08 00 00 00 00 00  00 03 00 2d 80 d2 7f 1f  |...........-....|
00000010  8b 08 00 37 56 13 47 00  03 ec 5b df 73 db 46 92  |...7V.G...[.s.F.|
00000020  ce eb ce 5f 31 e7 97 48  55 34 23 c9 92 1c c7 75  |..._1..HU4#....u|
...

Fifteen bytes spliced in after the first four. That's strange and (!) it
currently also happens to me with other garage downloads! E.g. from my mogg
project... Is this some bug in the garage server or is my firefox going nuts?

Sorry for being OT.
Comment 19 Tilman Vogel (reporter) 2007-10-29 17:44:09 UTC
OT: No, downloading tar.gz from other sites works fine with my firefox. I'll
exclude firefox being nuts.
Comment 20 Kalle Valo nokia 2007-10-29 21:16:23 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > Really strange, I tried with another laptop and the file was corrupted for me
> > also.
> 
> Yes, indeed! When I download with firefox, it's corrupted, when I use wget, it
> works.

Ah, this is firefox (iceweasel) specific! That explains it why IE and WinZip
didn't complaint about any errors when I tried to download earlier. Now I tried
with my PC at home (has debian unstable and is on different network than the
office network) and got similar results:

$ /bin/ls -l
total 144
-rw-r--r-- 1 kvalo kvalo 68594 2007-10-29 21:08 cx3110x-0.8.1.tar.gz-iceweasel
-rw-r--r-- 1 kvalo kvalo 68594 2007-10-29 21:08 cx3110x-0.8.1.tar.gz-wget
$ md5sum *
4874784b6948d056cf17b8b7aa0ca9bb  cx3110x-0.8.1.tar.gz-iceweasel
85c115a81fa4429bee2cd16bfe961d44  cx3110x-0.8.1.tar.gz-wget
$ 

And I spent over an hour uploading with different PCs and from different
networks, and this was a problem with downloading. Bah.

Thanks a lot for investigating this!
Comment 21 Siarhei Siamashka 2007-10-29 22:06:36 UTC
Created an attachment (id=592) [details]
n770_wlan_memory_corruption_fix.diff

Patch that fixes this memory corruption issue is attached, it seems to be
working fine in my preliminary tests. In addition, using some printk debugging
output, I confirmed that DMA write in 'cx3110x_spi_dma_write' really missed and
did not touch 'spi_dma.recv_buffer' variable (so it hit some other unrelated
physical memory address) in old code, and now it properly hits a new kmalloc
allocated buffer.

So we finally solved it, congratulations to everyone :)
Comment 22 Frantisek Dufka maemo.org 2007-10-29 23:28:47 UTC
(In reply to comment #21)
> Created an attachment (id=592) [details] [details]
> n770_wlan_memory_corruption_fix.diff
> 
> Patch that fixes this memory corruption issue is attached, it seems to be
> working fine in my preliminary tests.

Excellent ! Congratulations :-)

> In addition, using some printk debugging
> output, I confirmed that DMA write in 'cx3110x_spi_dma_write' really missed and
> did not touch 'spi_dma.recv_buffer' variable (so it hit some other unrelated
> physical memory address) in old code, and now it properly hits a new kmalloc
> allocated buffer.

Question is - why? Why virt_to_phys is not safe to use with pointer pointing to
writable data section of kernel module? Some bug/feature of dynamic kernel
linker (i.e. some bad pointer relocation when & operator is used) or some bug
in gcc linker script when linking cx3110x.ko? Or virt_t_phys cannot handle such
kernel memory? Or is the value itself fine but it is not DMA-ble and/or few
bits of such pointer are masked away?

Anyway, thanks a lot for pinpointing this :-)

Also now, how to distribute this, new official firmware? That would be best
solution of course.

If this is not possible we could make custom initfs reflasher similar to the
one used for bootmenu which just replaces this module. Perhaps I could even
bite the bullet this time and (unlike with bootmenu) try to package it into
nice .deb with postinstall script that reflashes initfs in place :-)
Comment 23 Tilman Vogel (reporter) 2007-10-30 00:22:37 UTC
(In reply to comment #21) Great, thanks a lot!
Comment 24 Frantisek Dufka maemo.org 2007-10-30 10:26:23 UTC
(In reply to comment #22)
> Why virt_to_phys is not safe to use with pointer pointing to
> writable data section of kernel module?

Just answering myself - STFW :-)
http://mirror.linux.org.au/linux.conf.au/2005/cdrom-beta-1/linux-mandocs-2.6.12.6/virt_to_phys.html
"The returned physical address is the physical (CPU) mapping for the memory
address given. It is only valid to use this function on addresses directly
mapped or allocated via kmalloc." 

BTW, same fix should be done also in n800 source with sections #ifdef-ed for
N770.
Comment 25 Frantisek Dufka maemo.org 2007-10-30 10:52:56 UTC
Also from the declaration here
http://www.gelato.unsw.edu.au/lxr/source/include/asm-arm/memory.h#L65
it is clear that this must be wrong when called with pointer to module space,
see definition of MODULE_END below, modules live below PAGE_OFFSET not above as
kmalloc-ed space. So in theory this could be done without using kmalloc for
those 2 extra bytes but something like (x) - MODULE_START + PHYS_OFFSET is
needed (not sure if this is valid for kernel code or data). Quite hackish and
not worth the effort. I did not find equivalent definition for virt_to_phys for
kernel modules. See also
http://uwsg.indiana.edu/hypermail/linux/kernel/0009.2/0898.html

It would be interesting to search kernel source for using "virt_to_phys(&" to
find some other occurences.
Comment 26 Siarhei Siamashka 2007-10-30 11:16:54 UTC
(In reply to comment #22)
> Question is - why? Why virt_to_phys is not safe to use with pointer pointing to
> writable data section of kernel module? 

As can be seen in disassembled code, virt_to_phys macro just adds some constant
(-0xB0000000) to the virtual address value to get physical address. It is quite
possible, that it expects input virtual address to reside in some specific
address range in memory for which this assumption is valid (and it might be not
true for kernel module data section).

> Some bug/feature of dynamic kernel
> linker (i.e. some bad pointer relocation when & operator is used) or some bug
> in gcc linker script when linking cx3110x.ko? Or virt_t_phys cannot handle such
> kernel memory? Or is the value itself fine but it is not DMA-ble and/or few
> bits of such pointer are masked away?

I referenced to 'Documentation/DMA-mapping.txt' in comment #12, it explicitly
forbids using kernel module data section for DMA. They must have some reason
for this restriction, so it's better to just follow the rules to stay out of
the trouble :)

> Anyway, thanks a lot for pinpointing this :-)

Thanks to Tilman Vogel for reviving hunt of this bug (after everyone lost
motivation and hope long ago, having neither driver sources nor even official
confirmation of the problem at that time). Also providing precise description
of memory corruption pattern was a great help and allowed to spot a suspicious
part of code quite fast by just simply looking at the source.

Kalle Valo provided invaluable help with very useful comments and ensured
availability of the sources for Nokia 770 wlan driver (and N800 driver sources
were available much earlier, but as it turned out, N800 never suffered from
wlan memory corruption problem). It is good to have a person available for
contact who is knowledgeable and can provide support for some system component.
A good contrast with abstract 'Maemo QA' :)

> Also now, how to distribute this, new official firmware? That would be best
> solution of course.

Sure, it would be best solution (and probably official YUV420 fix in the kernel
would be very nice too ;)).

About the driver, there is still a part I don't like very much:

    /* UMAC may send us odd number of bytes long frames */
    if (length % 2) {
        u16 last_word;

        last_word = *(uint16_t *)(buffer + length - 1);
        omap_mcbsp_spi_master_xmit_word_poll(OMAP_MCBSP2, last_word);
    }

It has problems with alignment and can read data behind buffer. Just replacing
'uint16_t' -> 'uint8_t' may be enough (and a similar code is used for N800). I
wonder if 'odd length writes' ever happen in reality? If there were many of
them, this code could possibly cause a lot of problems.

> If this is not possible we could make custom initfs reflasher similar to the
> one used for bootmenu which just replaces this module. Perhaps I could even
> bite the bullet this time and (unlike with bootmenu) try to package it into
> nice .deb with postinstall script that reflashes initfs in place :-)

I thought about a whole package which might do system packages upgrade in
OS2006 without reflashing. Something like a gui which checks existing system
components (by verifying MD5 of the relevant files), provides the user with a
list of unpatched components that need an upgrade and asks for a confirmation
to replace them. If the user has everything up to date, it would just show some
status information about the packages (version numbers and patch levels).
Comment 27 Frantisek Dufka maemo.org 2007-10-30 11:25:00 UTC
(In reply to comment #25)
Sorry to bother you again with my kernel newbie findings but to avoid extra
kmalloc (which probably wastes more than just 2 bytes), recv_buffer could be
declared on stack (which is allowed if I understood Russell King corrctly in
the mail linked above). Especially if it is there just for satisfying dummy
read in spi_dma_write or write in spi_dma_read. If that value is really used
for something it can be just copied back to struct after transfer is done (or
cached from it before reading).
Comment 28 Frantisek Dufka maemo.org 2007-10-30 23:17:03 UTC
Created an attachment (id=593) [details]
declaring recv_buffer on stack instead of global data
Comment 29 Frantisek Dufka maemo.org 2007-10-30 23:18:09 UTC
Created an attachment (id=594) [details]
declaring recv_buffer on stack instead of global data - binary module
Comment 30 Siarhei Siamashka 2007-10-30 23:27:25 UTC
Created an attachment (id=595) [details]
n770_wlan_align_fix.diff

Some additional changes in the sources for the stuff I did not like
(inconsistent waiting for dma completion, potential unaligned memory accesses
and reading/writing outside buffer). This patch can be applied after memory
corruption fix.

Also I have added informational message showing current McBSP2 clock frequency.
McBSP2 base frequency is 63MHz and some integer divisor is used to actually get
the frequency we want. Original sources have a frequency limit at 14MHz in
'max_mhz' variable (so divisor 5 is used and real frequency is only 12.6MHz).
It is possible to extend frequency limit and I have set it to 40 (40MHz is
mentioned as a valid frequency in STLC4370.pdf), so I use divisor 2 and clock
frequency 31.5MHz. This all boosts WLAN performance from ~550KB/s to ~1MB/s in
my benchmarks downloading a large file with wget from a local webserver. If it
proves to be reliable at such clock frequency, that is a very nice speed
improvement and already seems to be faster than USB :)
Comment 31 Frantisek Dufka maemo.org 2007-10-30 23:31:55 UTC
Tried the idea with declaring on stack and it seems to work fine too. Patch and
binary module attached. Tried few (5?) times to remove and reinsert new and
original module and run memtester like this:

disconnect WLAN, ssh in over BT PAN
Nokia-770-36:~# rmmod cx3110x
Nokia-770-36:~# insmod ./cx3110x-onstack.ko
Nokia-770-36:~# chroot /mnt/initfs wlan-cal
connect to AP, ssh in over WLAN
Nokia-770-36:~# ./memtester-4.0.7/memtester 20

and got always error with original and no error when declaring on stack (both
when logged in via ssh over wi-fi). Also it seem like the variable is indeed a
dummy one, networking works fine and recv_buffer value on stack is not saved or
initialized.
Comment 32 Siarhei Siamashka 2007-10-31 00:01:55 UTC
(In reply to comment #31)
> got always error with original and no error when declaring on stack (both
> when logged in via ssh over wi-fi). Also it seem like the variable is indeed a
> dummy one, networking works fine and recv_buffer value on stack is not saved or
> initialized.

I use a better test - initialize 'recv_buffer' with some value (ex. 0xBEEF) and
check it after DMA is done, if it is reset to 0, DMA hit the right spot and we
don't even need memtester to verify it. Stack allocated variable works indeed.
Though considering the rest of code, it might be not safe as the current code
waits for completion of primary DMA transfer and not the dummy one - we might
exit from this function while dummy DMA transfer is still not complete ;)
That's one of the reasons why I attached a second patch which tries to change
things in the code I did not like. I started 'improving' code after I observed
a strange thing in my tests - sometimes wlan speed drops down to 30-50KB/s and
remains such until reconnected. Now I think about what it could be and how it
can be hunted down? Cleaning up the code and inserting more debugging messages
seems like the way to go.

BTW, I tried to remove this dummy DMA transfer completely, but wlan driver just
hangs. I checked spru762c.pdf (McBSP for OMAP5912) and it looks like McBSP has
independent pins for reading and writing, so at least bandwidth is not wasted
:)
Comment 33 Siarhei Siamashka 2007-10-31 00:31:51 UTC
Frantisek, you have been historically one of the most trusted 770/n800
'kernelware' suppliers. So when it comes to building a binary module with the
final fix for this problem and deploying it to end users, I hope you don't mind
if we all relay this responsibility to you? :)

Regarding possible official bugfix from Nokia, it would be very much welcome
(if 770 stuff is not completely disbanded yet).
Comment 34 Frantisek Dufka maemo.org 2007-10-31 10:12:18 UTC
(In reply to comment #33)
> So when it comes to building a binary module with the
> final fix for this problem and deploying it to end users, I hope you don't mind
> if we all relay this responsibility to you? :)

OK, I can make it part of bootmenu flasher (which reflashes initfs where the
module resides) and also make separate one just with this fix for those who do
not want bootmenu. I'll also try to make installable deb out of it.

The only thing that bothers me is that I need to bear misdirected thanks and
undeserved fame. I will definitely give proper credit to you guys but still ...

> Regarding possible official bugfix from Nokia, it would be very much welcome
> (if 770 stuff is not completely disbanded yet).

It should be at least pushed to initfs in next hacker edition firmware or even
re-release the current (possibly the last) 2007 based one.

Kalle, can you comment patches and suggest best fix and/or explain what the
code does and what exactly is the recv_buffer for? Also can you comment how
Nokia would handle this or at least push it to appropriate people (Quim?) who
can provide the answer? Thanks.
Comment 35 Frantisek Dufka maemo.org 2007-10-31 16:27:02 UTC
<Commenting n770_wlan_align_fix.diff>

first chunk (cx3110x_spi_dma_write)

- waiting for spi_dma.dma_rx_done is perhaps more correct, yes, especially if
recv_buffer is on stack, otherwise we are not interested much when it is
finished so that's why it is not there in original code, and it should finish
at the same time as writing anyway
- "last_word = *(uint16_t *)(buffer + length - 1);" looks fishy to me too,
buffer is pointer to void so I'm not sure what even compiler thinks this should
do, seems like unaligned word read to me, fix looks sane

second and third chunk (cx3110x_spi_read)
- original code seem OK to me except that it does not read last byte if length
is odd, patched code seems unneeded to me it just does the same thing in two
writes
last part (cx3110x_spi_write)
- seems not needed to me, as in previous part omap_mcbsp_spi_master_*_word_poll
takes long pointer to buffer which is the only reason why there is explicit
cast, I see nothing wrong with this in original code.
</Commenting n770_wlan_align_fix.diff>

As for the dummy part of transfer, maybe this is the way to tell the other side
our clock so it is synchronized with us, transferred data does not matter but
the clock does and other side send/receive is driven by our clock. I was also a
bit worried why one 16 bit buffer is ok for dummy transfer even if length is >1
only later I found that the DMA channel is set up as OMAP_DMA_AMODE_CONSTANT so
it always reads/writes same word 'length>>1' times.
Comment 36 Siarhei Siamashka 2007-11-01 09:28:07 UTC
(In reply to comment #35)
> <Commenting n770_wlan_align_fix.diff>
> 
> first chunk (cx3110x_spi_dma_write)
> 
> - waiting for spi_dma.dma_rx_done is perhaps more correct, yes, especially if
> recv_buffer is on stack, otherwise we are not interested much when it is
> finished so that's why it is not there in original code, and it should finish
> at the same time as writing anyway

It is safer to wait for both to be sure. We definitely would not want any
possibility of having DMA still hitting stack (already used by some other part
of program) after we exit from this function. All in all, kmalloc seems a bit
more safe. Do you have a reference to the piece of documentation proving that
virt_to_phys can be officially used for stack (and it is not just a
coincidence)? I relied on the information from
http://www.mjmwired.net/kernel/Documentation/DMA-mapping.txt which states that
neither module image addresses, nor stack can be used for DMA.

> - "last_word = *(uint16_t *)(buffer + length - 1);" looks fishy to me too,
> buffer is pointer to void so I'm not sure what even compiler thinks this should
> do, seems like unaligned word read to me, fix looks sane
> 
> second and third chunk (cx3110x_spi_read)
> - original code seem OK to me except that it does not read last byte if length
> is odd, patched code seems unneeded to me it just does the same thing in two
> writes

Actually cx3110x_spi_read has a dangerous bug which I tried to fix. The point
is that 'omap_mcbsp_spi_master_recv_word_poll' function writes 32-bit value to
output buffer. This leads to write outside the input buffer on last iteration
(and possible alignment issues when storing 32-bit values at 16-bit aligned
positions in the buffer). A good example is the use of this function from
'sm_drv_spi_rx' to read 16-bit 'length' variable. A simple test, reading data
to some intermediate buffer (to move this data to 'length' variable later)
shows that buffer overrun really exists and more than 2 bytes are modified.
Fortunately, gcc pads this variable to 4 bytes on stack, so neighbour variables
are not overwritten in this particular case. Nevertheless 'cx3110x_spi_read' is
very dangerous and can be potentially the source of unexpected and hard to
track problems.

> last part (cx3110x_spi_write)
> - seems not needed to me, as in previous part omap_mcbsp_spi_master_*_word_poll
> takes long pointer to buffer which is the only reason why there is explicit
> cast, I see nothing wrong with this in original code.

The wrong thing with the code is that it casts (unsigned char *) to (u16 *) and
is reading 16-bit values at once. If provided with arbitrary input buffer,
having arbitrary alignment, that would cause alignment problems. Surely, it
most likely always gets properly aligned buffers in this context (the driver
works somehow and that's a proof), but there is no comment describing argument
constraints.

Potential odd length in spi_read is one more undocumented possible constraint,
so we don't know for sure if it is a bug or part of protocol of communication
with firmware. Maybe Kalle can have a look at it and tell us if it is ok or
should be fixed. We are mostly blind and don't see the whole picture as the
open driver part is only a glue between firmware and closed source umac code.
We can spot suspicious looking code though.


Regarding increasing McBSP clock frequency. I did some more experiments and
looks like the real divisor is equal to 'div + 1'. I benchmarked DMA transfer
speed in spi_dma_read/spi_dma_write, and it looks close to ~2.6MB/s at div ==
2. So if div == 2, real frequency is most likely 21MHz (21MHz / 8bits =
2.625MB/s). And with the original div == 5, we have only 10.5MHz. Regarding
stability at 21MHz, it seems to be ok when downloading data to the device, but
deadlocks and reboots on transferring data from the device. Let's hope that's
one of the numerous software bugs and can be fixed :) I will try to disable
watchdog to see what happens there.
Comment 37 Frantisek Dufka maemo.org 2007-11-01 10:59:12 UTC
(In reply to comment #36)
>  Do you have a reference to the piece of documentation proving that
> virt_to_phys can be officially used for stack (and it is not just a
> coincidence)? 

Here (linked also in comment 25)
http://uwsg.indiana.edu/hypermail/linux/kernel/0009.2/0898.html
Someone making same error as in our driver and is corrected

> I relied on the information from
> http://www.mjmwired.net/kernel/Documentation/DMA-mapping.txt which states that
> neither module image addresses, nor stack can be used for DMA.
> 
> > - "last_word = *(uint16_t *)(buffer + length - 1);" looks fishy to me too,
> > buffer is pointer to void so I'm not sure what even compiler thinks this should
> > do, seems like unaligned word read to me, fix looks sane
> > 
> > second and third chunk (cx3110x_spi_read)
> > - original code seem OK to me except that it does not read last byte if length
> > is odd, patched code seems unneeded to me it just does the same thing in two
> > writes
> 
> Actually cx3110x_spi_read has a dangerous bug which I tried to fix. The point
> is that 'omap_mcbsp_spi_master_recv_word_poll' function writes 32-bit value to
> output buffer.

Ah, I see, missed that. Seems like bug in omap_mcbsp_spi_master_recv_word_poll
to me. If word size is 16 it shouldn't write whole long but just 16 bit word.

> Regarding
> stability at 21MHz, it seems to be ok when downloading data to the device, but
> deadlocks and reboots on transferring data from the device.

Tried this too yesterday. First I tried to read from the device and it locked
up (after reading ~11MB) and rebooted in few seconds.

Next thing I don't understand is why the driver busy waits (udelay(5)) for DMA
completion and does not use interrupts and wait_for_completion mechanism like
omap_mcbsp_xmit_buffer in arch/arm/plat-omap/mcbsp.c But considering that
Samuel Ortiz is author of that code too there is perhaps some reason.
Comment 38 Siarhei Siamashka 2007-11-01 12:04:12 UTC
(In reply to comment #37)
> (In reply to comment #36)
> >  Do you have a reference to the piece of documentation proving that
> > virt_to_phys can be officially used for stack (and it is not just a
> > coincidence)? 
> 
> Here (linked also in comment 25)
> http://uwsg.indiana.edu/hypermail/linux/kernel/0009.2/0898.html
> Someone making same error as in our driver and is corrected

Yes, I have read it before. I was just wondering if it is stated somewhere in
the trusted documentation that it is expected behavior and we can rely on it in
our code. Btw, the next followup post in that discussion states that we
shouldn't rely on buffers in stack and kmalloc is the only reliable option:
http://uwsg.indiana.edu/hypermail/linux/kernel/0009.2/0999.html

> > Regarding
> > stability at 21MHz, it seems to be ok when downloading data to the device, but
> > deadlocks and reboots on transferring data from the device.
> 
> Tried this too yesterday. First I tried to read from the device and it locked
> up (after reading ~11MB) and rebooted in few seconds.

I see, looks like my hardware is capable of working at a bit higher
frequencies, to 'div = 2' is probably too much and is really a hardware issue.

> Next thing I don't understand is why the driver busy waits (udelay(5)) for DMA
> completion and does not use interrupts and wait_for_completion mechanism like
> omap_mcbsp_xmit_buffer in arch/arm/plat-omap/mcbsp.c But considering that
> Samuel Ortiz is author of that code too there is perhaps some reason.

Yes, I wondered about this too in one of the early comments. Anyway, the
typical time is hundreds of useconds, so overhead of using calls to udelay(5)
should not be too high. We can also estimate the transfer time depending on bus
speed and call the first udelay() with that value.

As for the performance, it needs to be investigated properly, but most likely a
lot of time is spent in umac.ko (probably doing AES encryption/decryption).
OMAP should support AES in hardware, but only for OMAP2420 and not OMAP1710.
Comment 39 Siarhei Siamashka 2007-11-01 13:40:04 UTC
(In reply to comment #38)
> > > Regarding
> > > stability at 21MHz, it seems to be ok when downloading data to the device, but
> > > deadlocks and reboots on transferring data from the device.
> > 
> > Tried this too yesterday. First I tried to read from the device and it locked
> > up (after reading ~11MB) and rebooted in few seconds.
> 
> I see, looks like my hardware is capable of working at a bit higher
> frequencies, so 'div = 2' is probably too much and is really a hardware issue.

I'm sorry, I did not read your post carefully enough last time. If you only
tested reading from the device, this looks exactly like my experience. Writing
to the device was very stable in my tests and I transfered multiple gigabytes
without a single fault. This still can be a software issue, as reading and
writing uses a bit different code, for example the following chunk of code is
used after writing but nothing like this is used after reading:

        /* We wait for the DMA Ready interrupt */
        host_ints = 0;
        sm_spi_read(dev, SPI_ADRS_HOST_INTERRUPTS,
                (unsigned char *) &host_ints, sizeof(host_ints));

        timeout = jiffies + 2 * HZ;
        while(!(host_ints & SPI_HOST_INT_WR_READY)) {
            if (time_after(jiffies, timeout)) {
                printk(KERN_WARNING "cx3110x: Haven't got a "
                       "WR_READY for DMA write."
                       "(firmware crashed?)\n");
                result = -1;
                goto exit;
            }
            sm_spi_read(dev, SPI_ADRS_HOST_INTERRUPTS,
                    (unsigned char *) &host_ints,
                    sizeof(host_ints));
        }

        host_ints_ack = SPI_HOST_INT_WR_READY;
        sm_spi_write(dev, SPI_ADRS_HOST_INT_ACK,
                 (unsigned char *)&host_ints_ack,
                 sizeof(host_ints_ack));

Anyway, the first priority is a reliable fix for memory corruption issue.
Discussing possible performance improvements should be probably moved somewhere
else (to the maemo-developers mailing list or to the mailing list of cx3110x
garage project).
Comment 40 Kalle Valo nokia 2007-11-02 22:51:38 UTC
First of all, great work everyone! This is a good example how powerful Open
Source community really is. You guys found the bug and fixed it in no time.
Biggest bottleneck was me with releasing sources, I'm sorry about that.

I haven't had time to look at the fixes yet, but I'll try answer all questions
directed to me. Please point out if I missed something.

(In reply to comment #24)
> BTW, same fix should be done also in n800 source with sections #ifdef-ed for
> N770.

There's no need to do that. In OS2008 WLAN driver there won't support for 770
anymore.

(In reply to comment #34)
> Kalle, can you comment patches and suggest best fix and/or explain what the
> code does and what exactly is the recv_buffer for?

Sorry, I haven't had the time to look at the fixes yet. Also I'm not that
familiar with 770's McBSP, so I can't comment before I study it more.

> Also can you comment how
> Nokia would handle this or at least push it to appropriate people (Quim?) who
> can provide the answer?

I'm not working with 770 anymore, so from my point of view the case is clear:
there won't be any official updates or fixes coming to 770. I can try to help
you with this occasionally, whenever I find some free time. But nowadays that
doesn't happen that often.

NB: This is not my call, so there's no point of trying to argue this with me.
Please have that discussion somewhere else, for example in a mailing list.
Let's try to keep this bug report short and clean.
Comment 41 Neil MacLeod maemo.org 2007-11-04 02:19:28 UTC
Surely the patches submitted thus far can be rolled into a future release of OS
2007 or OS 2008 HE (probably the latter) in order that this critical defect is
corrected for every 770 user willing to upgrade to OS 200x HE?

"Back porting" to OS 2006 may be best left to the "community"...
Comment 42 Neil MacLeod maemo.org 2007-11-04 02:25:22 UTC
(In reply to comment #41)
> Surely the patches submitted thus far can be rolled into a future release of OS
> 2007 or OS 2008 HE (probably the latter) in order that this critical defect is
> corrected for every 770 user willing to upgrade to OS 200x HE?
> 
> "Back porting" to OS 2006 may be best left to the "community"...
> 

The first para should have said "rolled into a future release of OS 2007 HE or
OS 2008 HE (probably the latter) ..."
Comment 43 Kalle Valo nokia 2007-11-04 16:05:47 UTC
(In reply to comment #41)
> Surely the patches submitted thus far can be rolled into a future release of OS
> 2007 or OS 2008 HE (probably the latter) in order that this critical defect is
> corrected for every 770 user willing to upgrade to OS 200x HE?

The 770 Hacker Edition only updates rootfs. Kernel and initfs are not updated
AFAIK. And WLAN driver is located in initfs so I don't see how we could get the
fix into the Hacker Edition.
Comment 44 Neil MacLeod maemo.org 2007-11-04 16:14:04 UTC
Hacker Edition is shipped as a full FIASCO image, so in theory Nokia could
respin initfs in a future HE update, be it OS 2007 or OS 2008.

If what you say is true, I find it really bizarre that Nokia would not fix such
a critical bug in a future firmware update.
Comment 45 Frantisek Dufka maemo.org 2007-11-11 01:10:30 UTC
Hi, I was thinking about two ways to install fixed module. The easiest and
safest one (to install) is to add early startup script to rootfs which unloads
buggy module and loads fixed one. The second (perhaps more correct) one is
reflashing initfs on the fly and replacing module there which is a bit risky
when it goes wrong in the middle of flashing. I guess it can be relatively safe
when doing same thing that backup utility does when restoring settings -
shutdown all UI programs, do it and then reboot. Still the first one would be
preferred but I think there is a catch. The buggy spi_dma_write is already used
to send firmware to the chip so once driver is loaded and wlan-cal is called
the firmware is sent to the device corrupting few random bytes already. See
src/sm_drv_spi.c functions  sm_drv_spi_up, sm_drv_spi_upload_firmware. Is this
correct? Or maybe the firmware is just prefetched to RAM on driver load or
wlan-cal time but is sent to the chip when interface is really brought up (by
ifconfig) later? So maybe we still have clean, not corrupted memory early in
rootfs?
Comment 46 Siarhei Siamashka 2007-11-11 15:01:41 UTC
I have added some simple code which shows printk message on first
'cx3110x_spi_dma_write' call in wlan driver. This test shows that freshly
booted Nokia 770 loads cx3110x driver, but does not call
'cx3110x_spi_dma_write' function until somebody actually tries to use wifi
(initiates scan for wlan networks for example).

Looks like just modifying startup script should be safe.
Comment 47 Frantisek Dufka maemo.org 2007-11-11 15:26:13 UTC
(In reply to comment #45)
> So maybe we still have clean, not corrupted memory early in
> rootfs?

Yes, firmware is sent only after interface is really up, can be verified by
dmesg output (McBSP2: frequency info and 'Loading 3825.arm firmware"). I have
also added simple statistics into both dma functions and result is interesting
(after few minutes of surfing and internet radio streaming).

[100565.627899] cx3110x_spi_dma_read stats: called 13588 minwait 0 maxwait 305
oddlen 0 unalign 0
[100565.628021] cx3110x_spi_dma_write stats: called 10137 minwait 1 maxwait
3819 oddlen 130 unalign 0

So we really have frames with odd length and also busywaiting for finished
transfer can be sometimes quite long (it counts while loops with udelay), patch
attached (for convenience, it is quite simple).
Comment 48 Frantisek Dufka maemo.org 2007-11-11 15:28:24 UTC
Created an attachment (id=608) [details]
adds some stats into cx3110x_spi_dma_read/write
Comment 49 Siarhei Siamashka 2007-11-11 15:31:44 UTC
Nevertheless I think that an updated full FIASCO image is still needed to solve
this problem. We can be never sure that the system is not already unrecoverably
broken by the moment when the user decided to install wlan driver fix. If the
user connected wlan at least once (to download .deb package with wlan driver
fix for example), the system consistency can't be guaranteed.

I initially expected that we can get only one official firmware update with
wlan bugfix at most (and even that is uncertain), so I tried to find as many
problems as possible so that the last bugfix could include fixes for them too.
But now I see that the only really critical bugfix for inclusion into firmware
update is memory corruption fix. The driver may be slow, have some minor bugs,
consume extra cpu resources, break wlan connection sometimes, but it should
never compromise stability of the rest of the system. All the other problems
can be fixed later and installed using startup script solution without the
strong need to update FIASCO image again.
Comment 50 Eero Tamminen nokia 2007-11-12 17:06:52 UTC
(In reply to comment #44)
> Hacker Edition is shipped as a full FIASCO image, so in theory Nokia could
> respin initfs in a future HE update, be it OS 2007 or OS 2008.

AFAIK they are looking into getting this fix to the next HE release.

Thanks for the hard work!
Comment 51 Attila Domokos nokia 2007-11-14 10:07:39 UTC
Bugfix released in the OS2007 Hacker Edition update. Resolving bug.
See:
http://maemo.org/news/announcements/view/new_os2007_he_for_the_nokia_770_with_mozilla_engine_and_memory_corruption_bugfix.html
Comment 52 Frantisek Dufka maemo.org 2007-11-15 17:23:44 UTC
Created an attachment (id=611) [details]
declaring recv_buffer on stack instead of global data - final

bumped up version string so it can be easily identified in kernel log and also
added waiting for both dma transfers in cx3110x_spi_dma_write so we are sure
stack variable is not writen after function exit
Comment 53 Frantisek Dufka maemo.org 2007-11-15 17:27:40 UTC
Created an attachment (id=612) [details]
declaring recv_buffer on stack instead of global data - final binary module

More correct version of binary module as explained in patch description. It
identifies itself as version 0.8.1-fix#2006 in kernel log.
Comment 54 Frantisek Dufka maemo.org 2007-11-16 12:03:54 UTC
Created an attachment (id=613) [details]
installable deb package for OS2006 or older OS2007HE

Loads fixed driver in startup script when device boots.
If you boot multiple systems from MMC, this should be installed in each system
separately.

If possible, please test and report problems, thanks.
Comment 55 Frantisek Dufka maemo.org 2007-11-16 12:08:13 UTC
Created an attachment (id=614) [details]
source for deb package

Read included README, patches are inside. I have added also fix for writing
past end of the buffer inspired by n770_wlan_align_fix.diff

Alignment issues remain unfixed but it seems to not cause any problems as the
buffer seems to be aligned.
Comment 56 Kalle Valo nokia 2007-11-16 15:13:45 UTC
(In reply to comment #39)

> I'm sorry, I did not read your post carefully enough last time. If you only
> tested reading from the device, this looks exactly like my experience. Writing
> to the device was very stable in my tests and I transfered multiple gigabytes
> without a single fault. This still can be a software issue, as reading and
> writing uses a bit different code, for example the following chunk of code is
> used after writing but nothing like this is used after reading:

That's because the firmware interface does not have similar as WriteReady
interrupt for reading. So it is implemented as specified.

> Discussing possible performance improvements should be probably moved somewhere
> else (to the maemo-developers mailing list or to the mailing list of cx3110x
> garage project).

Is there a need for cx3110x specific mailing list? I can set it up if needed.
If any of you thinks that there should be one, please send me email privately. 

(Let's try to not to pollute this bug.)
Comment 57 Frantisek Dufka maemo.org 2007-11-20 15:12:40 UTC
Created an attachment (id=618) [details]
installable deb package for OS2006 or older OS2007HE 0.2

hopefully the last one, fixed spelling of Tilman's surname, reboot is now not
needed if wi-fi is off at install time
Comment 58 Frantisek Dufka maemo.org 2007-11-20 15:15:00 UTC
Created an attachment (id=619) [details]
source for deb package

source replaced
Comment 59 Jim Carter 2007-11-30 07:28:52 UTC
I'm afraid we have a problem: the hotfix interacts with the clock.  Badly.
How to reproduce: (some of the steps may be irrelevant but that's what I did)
    Start the clock program and set an alarm.
    Leaving the clock program open, close the cover (suspend to RAM).
    Plug in the charger.
    Wait for the alarm time.  The alarm sound does not play.
    The O.S. has frozen solid -- black screen, no key response.
    Power cycle by removing and replacing the battery, to recover.
Without the hotfix the alarm feature was 99.9% reliable.  Coincident with
installing the hotfix, it started failing like this, 3 crashes out of 3
attempts.  Both cx3110x-hotfix-bug2006_0.2_armel.deb and version 0.1 fail
equally.  I *did* reboot the machine immediately after both installations.
Comment 60 Frantisek Dufka maemo.org 2007-11-30 10:36:21 UTC
(In reply to comment #59)
> Without the hotfix the alarm feature was 99.9% reliable.  Coincident with
> installing the hotfix, it started failing like this, 3 crashes out of 3
> attempts.  Both cx3110x-hotfix-bug2006_0.2_armel.deb and version 0.1 fail
> equally.  I *did* reboot the machine immediately after both installations.

So, umm, when you just uninstall the hotfix without touching anything else
clock starts to work again?

It sounds impossible that this is caused by the change in cx3110x driver
itself. I can only imagine that the startup script that unloads old module and
loads fixed one might change something (even only timing) of boot sequence. But
you say you do not shutdown your tablet so the system should not boot at all.

It looks like coincidence to me, several people in mailing list mentioned the
alarm clock is not 100% reliable and sometimes device just hangs, searching
bugzilla for word 'alarm' gives few bugs too.

Could you try to uninstall just the hotfix without doing any other change, test
alarms for few days and then reinstall back? You can also just rename the
module in /lib/modules/2.6.16.27-omap1/ to something else to disable loading it
at boot time.

Or you can rename the startup script to disable it at boot time temporarily
too.
sudo gainroot
cd /etc/rc2.d/
mv S10cx3110x-hotfix-bug2006 whateverS10cx3110x-hotfix-bug2006
Comment 61 Jim Carter 2007-12-11 22:48:03 UTC
Good news: the alarm feature is once again operational.  Here's what I did:
Following Frantisek's suggestion I disabled the hotfix by renaming
    /lib/modules/2.6.16.27-omap1/cx3110x-hotfix-bug2006.ko and
    rebooting (telinit 6). From here to the last step the WLAN was never 
    activated.  I wonder if telinit 6 was insufficient, i.e. if it was
    important to power off, then back on.
There was a "hanging" alarm from when the alarm program crashed the
    machine, i.e. in the clock app the bell icon did not have a red
    slash.  I removed this alarm, and set another one, then suspended to
    RAM.
At the alarm time the machine froze solid.  Removed battery and rebooted.
I removed the hanging alarm and set another one, but did not suspend to
    RAM. Behavior was successful, i.e. the sound played and the alarm
    was removed. I suspect that this was the key step in bringing it
    back to life.
I set another alarm, suspended to RAM, but impatiently checked if the
    machine had died.  It was suspended when the alarm went off: success.
I did the same procedure but left it suspended for the whole time: 
    success.
Now I reactivated the hotfix by reverting the rename.  Power off, then on.
I set another alarm and suspended to RAM: Success!
I activated the WLAN, transferred files, did web stuff, etc.  Set another
    alarm and suspended to RAM: Success!
So I conclude that there may still have been a bad interaction between
    modules (post hoc ergo propter hoc), but whatever internal state trash
    may have been involved was cleaned up by one of the maneuvers that
    I did.  I've always attributed occasional unreliability of the alarm
    to mysterious WLAN-related corruption (which was proven to be real by
    Tilman Vogel).  Everyone cross fingers.