maemo.org Bugzilla – Bug 2006
Memory corruption during WLAN use
Last modified: 2007-12-11 22:48:03 UTC
You need to log in before you can comment on or make changes to this bug.
EXPECTED OUTCOME: n/a ACTUAL OUTCOME: Depending on the memory location to which the modules umac.ko and cx3110x.ko get loaded, exactly two consecutive bytes at fixed physical locations in memory get overwritten by zeroes everytime there is WLAN activity: while scanning for networks as well as when data is actually transmitted. On a vanilla NOKIA770_2006SE_3.2006.49-2_PR_MR0, the modules get loaded at (cat /proc/modules): cx3110x 51420 0 - Live 0xbf03f000 umac 253316 1 cx3110x, Live 0xbf000000 In this case, the two bytes are at physical location 0x1304b8b4 and 0x1304b8b5 (these addresses include an offset of 0x10000000 - see /proc/iomem). When booting the same OS from an ext2 formatted MMC partition, then the modules are at: cx3110x 51420 0 - Live 0xbf04e000 umac 253316 1 cx3110x, Live 0xbf00f000 ext2 43524 1 - Live 0xbf003000 mbcache 7716 0 - Live 0xbf000000 I.e. due to the two extra modules, umac.ko and cx3110x.ko are shifted by 0xf000. Accordingly, the corrupted bytes get shifted by 0xf000 to 0x1305a8b4 and 0x1305a8b5. STEPS TO REPRODUCE THE PROBLEM: Needed tools: memtester: http://pyropus.ca/software/memtester/ binaries: http://freenet-homepage.de/tvogel/memtester-4.0.7-bin.tar.bz2 urumt: http://www.normalesup.org/~george/comp/rumt/ with patch: http://freenet-homepage.de/tvogel/rumt-n770.patch binaries: http://freenet-homepage.de/tvogel/rumt-bin.tar.bz2 How to reproduce the problem: - open two root shells on your 770 (in xterms or via ssh) - start WLAN on the 770, flood ping "ping -f" your 770 in order to create network traffic - in the first shell, start memtester starting with a size that shows the corruption (the first argument is the size in MB, e.g. ./memtester 24) - successively reduce the size until you don't see corruption: This makes it likely, that the next alloc of 1 MB will get the block with the bad bytes - let memtester run and now in the second shell, start "urumt -p 256": This will allocate 1 MB of memory, locate its physical addresses in /dev/mem and start testing. You'll get bit-precise location information on which bits get corrupted into which direction: + (1->0) or - (0->1). WORKAROUND: My idea was to write a programm that tries to allocate the bad memory block, lock it and then just sleep forever. This would save other processes from stepping into the trap. source code: http://freenet-homepage.de/tvogel/blockbad.c binary: http://freenet-homepage.de/tvogel/blockbad The programm takes as argument the memory page to block. If urumt reported 123ef:8bc, strip off the leading 1 and the last 3 digits, i.e. use 0x23ef in this example. The programm will always allocate 32MB RAM in order to search for the block. This is currently hardcoded. After the block is found, the other blocks are freed up again. Of course, you should stop memtester and urumt before that. If this works for you, you might consider starting blockbad 0x23ef at the end of /etc/init.d/minircS . Then you can check with "ps" if blockbad is running. If so, it found and allocated the suspicious memory block. If not, it was out of luck and didn't get that block assigned by the kernel. OTHER COMMENTS: OS version: NOKIA770_2006SE_3.2006.49-2_PR_MR0 WLAN settings: WEP, problem also present without WEP, WPA not tested N800: not yet tested
I'm testing N800 software version RX-34_2007SE_4.2007.26-8 with memtester and haven't seen any corruption after five minutes. I'm using WEP and ping flooding the device constantly. My commandline: ./memtester 50 I'll leave the test running for few hours, let's see what happens.
This was reproduced by more people including myself, see my mail here http://maemo.org/pipermail/maemo-developers/2006-September/005633.html See also related/duplicate bug #677 (comment 4)
(comment 1) I don't think you should have to wait for the error to appear. For me, either the errors turn up immediately or never. It just depends on which physical memory blocks are given to memtester by the kernel. So for the N800 I see different possibilities: 1) the error is fixed 2) the block with the damaged bytes is already allocated by something else: one could try to start memtester in an earlier stage of booting when the block is still free and redirect its output. I also had this problem sometimes on the 770 when trying to reproduce the problem after I had it running for a while. Especially now, as I start blockbad in minircS, when I kill it later in order to check the bug again, the tiny hole in memory allocation that blockbad leaves, gets quickly filled by other small things, like e.g. the keyboard applet. When I start blockbad again, it has bad chances to catch the block again. 3) memtester just doesn't catch the block: then increasing the size further could help, also opening and closing applications could generate some rearrangement of memory making it more likely that the questionable block gets free again by chance (paging). I don't know a good tool for that, but you can check if a given physical block is available by running blockbad and checking if it can allocate the block you want. For N800, one should probably increase the search range of blockbad which is currently hardcoded to 32M. Also, memory mapping is definitely not the same as on 770. /proc/iomem shows that system RAM is mapped to 0x10000000-0x14000000 on the 770. This mapping is currently hardcoded in both blockbad and in the rumt-n770.patch.
I couldn't reproduce this with N800 with simple memtester tests. Moving to Connectivity/WiFi.
Ok, there was a bug in my blockbad.c: After finding the block, I called daemon(0,0) in order to detach from the terminal, but this lifts the mlock()! (In fact, running the buggy version twice, sometimes found the block again :-/) Now I fork first and mlock() the block in the child which lives forever. Please get the new version at the same URLs. Sorry!
(In reply to comment #1) > I'm testing N800 software version RX-34_2007SE_4.2007.26-8 with memtester and > haven't seen any corruption after five minutes. I'm using WEP and ping flooding > the device constantly. > > My commandline: > > ./memtester 50 > > I'll leave the test running for few hours, let's see what happens. Just in case if you have never received it, here is the text from my e-mail sent to you on 2007-01-28 (shortly after the last Nokia 770 firmware update): "Hello Kalle, I just flashed the latest IT OS2006 update 3.2006.49-2 (in the hope that it could probably fix this problem) and run some tests. The memory corruption problem is still there (see memtester report at the end of my message). For your convenience here is the binary of the test program that I used (patch for upstream memtester 4.0.5 included): http://ufo2000.xcomufo.com/files/memtester-n770.tar.gz Memtester home page is here: http://pyropus.ca/software/memtester/ Here is a direct link for downloading memtester 4.0.5: http://pyropus.ca/software/memtester/old-versions/memtester-4.0.5.tar.gz The detailed steps the following (some may be redundant, but again, that's exactly what I did): * flashed new IT OS update (battery full) * installed xterm, ssh and midnight commander * added support for ext3 (using the steps similar to https://maemo.org/maemowiki/HowTo_mmc_ext2 but more modules for insmod were required) * started memtester from mmc card to run in a loop from ssh session (and using slightly more than 20MB for testing) * after some time of running tests also started background transfer of some large file via scp * as none of the problems showed up up to this moment, stopped memtester with ctrl-c and decided to increase memory size to test (up to 30MB), in the process of running this test the device suddenly rebooted * immediately after reboot it complained about low battery charge * I put the device on charger, started wifi connection and tried to run memtester again, it detected some problems All the process starting from flashing the latest firmware till getting error messages in memtester took a few hours. If you need ANY additional information, I would be glad to help. I'm more worried about the following. This is not the first time this problem got reported (the last time in a private mail to Kimmo Hämäläinen). But the problem was never confirmed and could not be reproduced on your side. On the other hand it was reproduced on a few other devices. And there are still lots of complaints about spontaneous reboots and crashes in forums. This memtester is check seems to be the most easy and "scientific" way to verify wifi related system stability problems at the moment. So it is a big step forward in problem reproducibility. ... " Well, regarding this issue, as Tilman Vogel said in a few comments above, it takes some time and patience to reproduce the problem. But what is much more important is just a bit of trust in our bugreports, we are really not bad guys and have no evil intentions :) A quick five minute test is definitely not enough to say that the problem is not reproducible, especially for a critical one such as this one. Do you have any advice about what could be done to investigate this problem further and try to fix it? Is N800 wifi driver much different from the one used in Nokia 770? I hope it will not take much of your time and efforts to provide some minor guidance and a few advices about how to build a debugging version of wifi driver for Nokia 770. Such help would be very much appreciated, especially considering that you are primarily focused on N800 support now.
BTW as Tilman mentioned the address depends on address where wi-fi modules are loaded. This definitely looks like some SW bug in the driver. I'm not sure when I'll get to it but is is relatively easy to either modify load_wlan_module in /mnt/initfs/linuxrc or use newest bootmenu with usb recovery mode and insert umac.ko and cx3110x.ko by hand intermixed with some other module to pinpoint which of those two contains the bug. Anyone? In case of cx3110x.ko latest n770 sources of cx3110x driver would be enough either to fix it or implement workaround with allocating the corrupted page directly in cx3110x.ko. In case of umac only workaround is possible to do by us (perhaps via other kernel module taking umac module address as a parameter). Kalle, are cx3110x sources for latest 770 firmware available? The cx3110x garage project and Bora repository seems to contain only N800 code. I couldn't find similar source in 2.2/gregale repository . I think n770 sources were available at some time in the past.
Just googled around and it looks there is a reserve_bootmem(addr,size) call in linux kernel that should allocate the page so the workaround may not be hard to do in kernel. See also http://lxr.linux.no/ident?i=reserve_bootmem http://rick.vanrein.org/linux/badram/index.html
(In reply to comment #6) > Just in case if you have never received it, here is the text from my e-mail > sent to you on 2007-01-28 (shortly after the last Nokia 770 firmware update): I remember your email now. It got burried somewhere inside my inbox, I'm sorry about that. > Well, regarding this issue, as Tilman Vogel said in a few comments above, it > takes some time and patience to reproduce the problem. But what is much more > important is just a bit of trust in our bugreports, we are really not bad guys > and have no evil intentions :) It's not about trust, don't worry. It's more like there's only 24 hours in a day and I have to sleep also :) > A quick five minute test is definitely not > enough to say that the problem is not reproducible, especially for a critical > one such as this one. I run it longer than that and also one of our testers tested this on N800, but we couldn't reproduce it. Either this is much more difficult to reproduce on N800 or this only a bug in 770. > Do you have any advice about what could be done to investigate this problem > further and try to fix it? Is N800 wifi driver much different from the one used > in Nokia 770? The biggest difference is that 770 uses McBSP and N800 uses McSPI. I would start checking that first. I would appreciate if people reported if they can or cannot reproduce this on N800.
(In reply to comment #7) > Kalle, are cx3110x sources for latest 770 firmware available? The cx3110x > garage project and Bora repository seems to contain only N800 code. I couldn't > find similar source in 2.2/gregale repository . I think n770 sources were > available at some time in the past. They should be, but obviously something has gone wrong. I will sort this out, but it might take few days. I'll let you know when they are again available.
(In reply to comment #9) > I run it longer than that and also one of our testers tested this on N800, but > we couldn't reproduce it. Either this is much more difficult to reproduce on > N800 or this only a bug in 770. Maybe. On the other hand, we always take seriously the bugs reported against the older version of software at my job even if they are not reproducible with the latest version (unless there have been some changes to fix it). Unfixed bugs tend to hide and bite you again :) But as you said, the problem could be in the code that is different for 770 and N800. > The biggest difference is that 770 uses McBSP and N800 uses McSPI. I would > start checking that first. > > I would appreciate if people reported if they can or cannot reproduce this on > N800. OK, I'll try to do some tests when I have a bit more spare time. > They should be, but obviously something has gone wrong. I will sort this out, > but it might take few days. I'll let you know when they are again available. That's great. Having the last Nokia 770 wifi driver sources will definitely increase our chances to find some solution to this problem. I had a look at this issue. As far as I remember, in my tests with memtester 4.0.5, the offset of faulty address aligned at 1K boundary reported by memtester was 0x1a5 on the older version of OS2006 firmware and 0x22d with the latest one. I don't know if memtester provides byte precise results (and the initial Tilman's post lists different address). This gives a consistent 0x88 bytes difference. I unpacked FIASCO images from OS2006 3.206.49-2 and 1.2006.26-8 and took a look at cx3110x.ko This module is really a bit larger in the latest version of firmware. If we suppose that the problematic address has a constant address relative to cx3110x.ko, then finding a symbol in disassembly listing that got shifted by this amount (0x88?) would make it very suspicious. I did a quick comparison, but it did not reveal any interesting symbols at 0x88 difference. Maybe I'll try to do some more tests again to get problematic addresses with both these wifi driver modules to calculate precise difference. I also took a quick look at the N800 driver sources (without trying too hard to understand what it does). My wild guess is that 'spi_dma.recv_buffer' looks the most suspicious to me. Could it possibly be that cx3110x.ko gets relocated in physical memory somehow and DMA(?) keeps writing data to now wrong memory location that was occupied by 'spi_dma.recv_buffer' earlier? I'm not a kernel hacker and that was just an attempt (maybe completely stupid) for brainstorming.
Well, I really can't wait for being able to compile the latest stable Nokia 770 version of cx3110x.ko with the dummy variable 'spi_dma.recv_buffer' moved to some kmalloc allocated block of memory (see linux/Documentation/DMA-mapping.txt, "This rule also means that you may use neither kernel image addresses (items in data/text/bss segments), nor module image addresses, nor stack addresses for DMA." part). I strongly suspect that this minor fix is all that is needed to solve the problem. Also, it is not quite related, but probably wifi performance can be improved a bit (waiting for DMA completion in Nokia 770 versions of 'cx3110x_spi_dma_read' and 'cx3110x_spi_dma_write' does not look too efficient). I downloaded STLC4370.pdf and it mentions that SPI serial host interface should support speeds up to 48Mbps. So the wlan chip itself should be capable of providing much better speed than we see now. That's particularly interesting as it might probably surpass (very slow) USB performance and become the fastest way to transfer data to or from the device. How fast is McBSP actually, is it really the bottleneck? Also I wonder why is it needed to perform dummy DMA read in 'cx3110x_spi_dma_write' and dummy DMA write is 'cx3110x_spi_dma_read'? Is it a workaround for some problem or McBSP always requires bidirectional transfers? If not, doesn't this just waste precious bandwidth?
Created an attachment (id=585) [details] cx3110x-nodma.ko I have attached a patched cx3110x.ko binary file. It was originally taken from the last OS2006 firmware (3.2006.49) and I have inserted a single jump instruction passing control to 'cx3110x_spi_write' into the very beginning of 'cx3110x_spi_dma_write' function (which is apparently causing this memory corruption problem). So we effectively fallback to 'cx3110x_spi_write' instead of 'cx3110x_spi_dma_write' with this patched binary :) Steps to test a patched driver: 1. copy 'cx3110x-nodma.ko' somewhere to your device 2. login as root 3. run the following sequence of commands from the directory where you have 'cx3110x-nodma.ko': > rmmod cx3110x > insmod cx3110x-nodma.ko > chroot /mnt/initfs /usr/bin/wlan-cal 4. connect to wireless network 5. do some tests trying to catch memory corruption error with memtester and flood ping I could not reproduce any memory corruption problems with it (while I did confirm them with the official wifi driver again today) and now have flashed patched cx3110x driver to initfs on my device (replacing 'cx3110x.ko' with it and also deleting 'cx3110x_mt.ko' just in case as it seems useless there). Wireless network connection seems to work properly, and it seems stable so far. Most likely it will be slower on sending data, but I prefer a slow driver instead of a broken one :) I'm still keeping an eye on it while heavily using wlan. I will report if I notice anything unusual. Reports confirming that this fix is working (or proving that the bug is still here) are also welcome. Having the driver sources will allow us to get a proper fix for this problem instead of a binary hack. We'll also have to think about the way of providing some convenient driver upgrade solution for all the Nokia 770 users.
I have now published the cx3110x sources for Nokia 770: https://garage.maemo.org/frs/shownotes.php?release_id=1012 I had some problems with the tarball upload, but I hope it's fixed now. Here's the md5sum just in case: 85c115a81fa4429bee2cd16bfe961d44 cx3110x-0.8.1.tar.gz
(In reply to comment #14) > 85c115a81fa4429bee2cd16bfe961d44 cx3110x-0.8.1.tar.gz It seems, the upload is missing some bits, I got 68594 bytes with an md5sum of 4874784b6948d056cf17b8b7aa0ca9bb cx3110x-0.8.1.tar.gz
(In reply to comment #15) > (In reply to comment #14) > > 85c115a81fa4429bee2cd16bfe961d44 cx3110x-0.8.1.tar.gz > > It seems, the upload is missing some bits, I got 68594 bytes with an md5sum of > > 4874784b6948d056cf17b8b7aa0ca9bb cx3110x-0.8.1.tar.gz > Works for me [sbox-SDK22_ARMEL: ~] > ls -la cx3110x-0.8.1.tar.gz -rw-rw-r-- 1 maemo maemo 68594 Oct 29 15:49 cx3110x-0.8.1.tar.gz [sbox-SDK22_ARMEL: ~] > md5sum cx3110x-0.8.1.tar.gz 85c115a81fa4429bee2cd16bfe961d44 cx3110x-0.8.1.tar.gz
(In reply to comment #16) > (In reply to comment #15) > > (In reply to comment #14) > > > 85c115a81fa4429bee2cd16bfe961d44 cx3110x-0.8.1.tar.gz > > > > It seems, the upload is missing some bits, I got 68594 bytes with an md5sum of > > > > 4874784b6948d056cf17b8b7aa0ca9bb cx3110x-0.8.1.tar.gz > > > > Works for me Really strange, I tried with another laptop and the file was corrupted for me also.
(In reply to comment #17) > Really strange, I tried with another laptop and the file was corrupted for me > also. Yes, indeed! When I download with firefox, it's corrupted, when I use wget, it works. The files have the same size, but the content is shifted strangely: Correct: 00000000 1f 8b 08 00 37 56 13 47 00 03 ec 5b df 73 db 46 |....7V.G...[.s.F| 00000010 92 ce eb ce 5f 31 e7 97 48 55 34 23 c9 92 1c c7 |...._1..HU4#....| 00000020 75 0f 94 44 db bc 95 48 85 a4 e2 f8 29 05 12 43 |u..D...H....)..C| ... Wrong: 00000000 1f 8b 08 00 00 00 00 00 00 03 00 2d 80 d2 7f 1f |...........-....| 00000010 8b 08 00 37 56 13 47 00 03 ec 5b df 73 db 46 92 |...7V.G...[.s.F.| 00000020 ce eb ce 5f 31 e7 97 48 55 34 23 c9 92 1c c7 75 |..._1..HU4#....u| ... Fifteen bytes spliced in after the first four. That's strange and (!) it currently also happens to me with other garage downloads! E.g. from my mogg project... Is this some bug in the garage server or is my firefox going nuts? Sorry for being OT.
OT: No, downloading tar.gz from other sites works fine with my firefox. I'll exclude firefox being nuts.
(In reply to comment #18) > (In reply to comment #17) > > Really strange, I tried with another laptop and the file was corrupted for me > > also. > > Yes, indeed! When I download with firefox, it's corrupted, when I use wget, it > works. Ah, this is firefox (iceweasel) specific! That explains it why IE and WinZip didn't complaint about any errors when I tried to download earlier. Now I tried with my PC at home (has debian unstable and is on different network than the office network) and got similar results: $ /bin/ls -l total 144 -rw-r--r-- 1 kvalo kvalo 68594 2007-10-29 21:08 cx3110x-0.8.1.tar.gz-iceweasel -rw-r--r-- 1 kvalo kvalo 68594 2007-10-29 21:08 cx3110x-0.8.1.tar.gz-wget $ md5sum * 4874784b6948d056cf17b8b7aa0ca9bb cx3110x-0.8.1.tar.gz-iceweasel 85c115a81fa4429bee2cd16bfe961d44 cx3110x-0.8.1.tar.gz-wget $ And I spent over an hour uploading with different PCs and from different networks, and this was a problem with downloading. Bah. Thanks a lot for investigating this!
Created an attachment (id=592) [details] n770_wlan_memory_corruption_fix.diff Patch that fixes this memory corruption issue is attached, it seems to be working fine in my preliminary tests. In addition, using some printk debugging output, I confirmed that DMA write in 'cx3110x_spi_dma_write' really missed and did not touch 'spi_dma.recv_buffer' variable (so it hit some other unrelated physical memory address) in old code, and now it properly hits a new kmalloc allocated buffer. So we finally solved it, congratulations to everyone :)
(In reply to comment #21) > Created an attachment (id=592) [details] [details] > n770_wlan_memory_corruption_fix.diff > > Patch that fixes this memory corruption issue is attached, it seems to be > working fine in my preliminary tests. Excellent ! Congratulations :-) > In addition, using some printk debugging > output, I confirmed that DMA write in 'cx3110x_spi_dma_write' really missed and > did not touch 'spi_dma.recv_buffer' variable (so it hit some other unrelated > physical memory address) in old code, and now it properly hits a new kmalloc > allocated buffer. Question is - why? Why virt_to_phys is not safe to use with pointer pointing to writable data section of kernel module? Some bug/feature of dynamic kernel linker (i.e. some bad pointer relocation when & operator is used) or some bug in gcc linker script when linking cx3110x.ko? Or virt_t_phys cannot handle such kernel memory? Or is the value itself fine but it is not DMA-ble and/or few bits of such pointer are masked away? Anyway, thanks a lot for pinpointing this :-) Also now, how to distribute this, new official firmware? That would be best solution of course. If this is not possible we could make custom initfs reflasher similar to the one used for bootmenu which just replaces this module. Perhaps I could even bite the bullet this time and (unlike with bootmenu) try to package it into nice .deb with postinstall script that reflashes initfs in place :-)
(In reply to comment #21) Great, thanks a lot!
(In reply to comment #22) > Why virt_to_phys is not safe to use with pointer pointing to > writable data section of kernel module? Just answering myself - STFW :-) http://mirror.linux.org.au/linux.conf.au/2005/cdrom-beta-1/linux-mandocs-2.6.12.6/virt_to_phys.html "The returned physical address is the physical (CPU) mapping for the memory address given. It is only valid to use this function on addresses directly mapped or allocated via kmalloc." BTW, same fix should be done also in n800 source with sections #ifdef-ed for N770.
Also from the declaration here http://www.gelato.unsw.edu.au/lxr/source/include/asm-arm/memory.h#L65 it is clear that this must be wrong when called with pointer to module space, see definition of MODULE_END below, modules live below PAGE_OFFSET not above as kmalloc-ed space. So in theory this could be done without using kmalloc for those 2 extra bytes but something like (x) - MODULE_START + PHYS_OFFSET is needed (not sure if this is valid for kernel code or data). Quite hackish and not worth the effort. I did not find equivalent definition for virt_to_phys for kernel modules. See also http://uwsg.indiana.edu/hypermail/linux/kernel/0009.2/0898.html It would be interesting to search kernel source for using "virt_to_phys(&" to find some other occurences.
(In reply to comment #22) > Question is - why? Why virt_to_phys is not safe to use with pointer pointing to > writable data section of kernel module? As can be seen in disassembled code, virt_to_phys macro just adds some constant (-0xB0000000) to the virtual address value to get physical address. It is quite possible, that it expects input virtual address to reside in some specific address range in memory for which this assumption is valid (and it might be not true for kernel module data section). > Some bug/feature of dynamic kernel > linker (i.e. some bad pointer relocation when & operator is used) or some bug > in gcc linker script when linking cx3110x.ko? Or virt_t_phys cannot handle such > kernel memory? Or is the value itself fine but it is not DMA-ble and/or few > bits of such pointer are masked away? I referenced to 'Documentation/DMA-mapping.txt' in comment #12, it explicitly forbids using kernel module data section for DMA. They must have some reason for this restriction, so it's better to just follow the rules to stay out of the trouble :) > Anyway, thanks a lot for pinpointing this :-) Thanks to Tilman Vogel for reviving hunt of this bug (after everyone lost motivation and hope long ago, having neither driver sources nor even official confirmation of the problem at that time). Also providing precise description of memory corruption pattern was a great help and allowed to spot a suspicious part of code quite fast by just simply looking at the source. Kalle Valo provided invaluable help with very useful comments and ensured availability of the sources for Nokia 770 wlan driver (and N800 driver sources were available much earlier, but as it turned out, N800 never suffered from wlan memory corruption problem). It is good to have a person available for contact who is knowledgeable and can provide support for some system component. A good contrast with abstract 'Maemo QA' :) > Also now, how to distribute this, new official firmware? That would be best > solution of course. Sure, it would be best solution (and probably official YUV420 fix in the kernel would be very nice too ;)). About the driver, there is still a part I don't like very much: /* UMAC may send us odd number of bytes long frames */ if (length % 2) { u16 last_word; last_word = *(uint16_t *)(buffer + length - 1); omap_mcbsp_spi_master_xmit_word_poll(OMAP_MCBSP2, last_word); } It has problems with alignment and can read data behind buffer. Just replacing 'uint16_t' -> 'uint8_t' may be enough (and a similar code is used for N800). I wonder if 'odd length writes' ever happen in reality? If there were many of them, this code could possibly cause a lot of problems. > If this is not possible we could make custom initfs reflasher similar to the > one used for bootmenu which just replaces this module. Perhaps I could even > bite the bullet this time and (unlike with bootmenu) try to package it into > nice .deb with postinstall script that reflashes initfs in place :-) I thought about a whole package which might do system packages upgrade in OS2006 without reflashing. Something like a gui which checks existing system components (by verifying MD5 of the relevant files), provides the user with a list of unpatched components that need an upgrade and asks for a confirmation to replace them. If the user has everything up to date, it would just show some status information about the packages (version numbers and patch levels).
(In reply to comment #25) Sorry to bother you again with my kernel newbie findings but to avoid extra kmalloc (which probably wastes more than just 2 bytes), recv_buffer could be declared on stack (which is allowed if I understood Russell King corrctly in the mail linked above). Especially if it is there just for satisfying dummy read in spi_dma_write or write in spi_dma_read. If that value is really used for something it can be just copied back to struct after transfer is done (or cached from it before reading).
Created an attachment (id=593) [details] declaring recv_buffer on stack instead of global data
Created an attachment (id=594) [details] declaring recv_buffer on stack instead of global data - binary module
Created an attachment (id=595) [details] n770_wlan_align_fix.diff Some additional changes in the sources for the stuff I did not like (inconsistent waiting for dma completion, potential unaligned memory accesses and reading/writing outside buffer). This patch can be applied after memory corruption fix. Also I have added informational message showing current McBSP2 clock frequency. McBSP2 base frequency is 63MHz and some integer divisor is used to actually get the frequency we want. Original sources have a frequency limit at 14MHz in 'max_mhz' variable (so divisor 5 is used and real frequency is only 12.6MHz). It is possible to extend frequency limit and I have set it to 40 (40MHz is mentioned as a valid frequency in STLC4370.pdf), so I use divisor 2 and clock frequency 31.5MHz. This all boosts WLAN performance from ~550KB/s to ~1MB/s in my benchmarks downloading a large file with wget from a local webserver. If it proves to be reliable at such clock frequency, that is a very nice speed improvement and already seems to be faster than USB :)
Tried the idea with declaring on stack and it seems to work fine too. Patch and binary module attached. Tried few (5?) times to remove and reinsert new and original module and run memtester like this: disconnect WLAN, ssh in over BT PAN Nokia-770-36:~# rmmod cx3110x Nokia-770-36:~# insmod ./cx3110x-onstack.ko Nokia-770-36:~# chroot /mnt/initfs wlan-cal connect to AP, ssh in over WLAN Nokia-770-36:~# ./memtester-4.0.7/memtester 20 and got always error with original and no error when declaring on stack (both when logged in via ssh over wi-fi). Also it seem like the variable is indeed a dummy one, networking works fine and recv_buffer value on stack is not saved or initialized.
(In reply to comment #31) > got always error with original and no error when declaring on stack (both > when logged in via ssh over wi-fi). Also it seem like the variable is indeed a > dummy one, networking works fine and recv_buffer value on stack is not saved or > initialized. I use a better test - initialize 'recv_buffer' with some value (ex. 0xBEEF) and check it after DMA is done, if it is reset to 0, DMA hit the right spot and we don't even need memtester to verify it. Stack allocated variable works indeed. Though considering the rest of code, it might be not safe as the current code waits for completion of primary DMA transfer and not the dummy one - we might exit from this function while dummy DMA transfer is still not complete ;) That's one of the reasons why I attached a second patch which tries to change things in the code I did not like. I started 'improving' code after I observed a strange thing in my tests - sometimes wlan speed drops down to 30-50KB/s and remains such until reconnected. Now I think about what it could be and how it can be hunted down? Cleaning up the code and inserting more debugging messages seems like the way to go. BTW, I tried to remove this dummy DMA transfer completely, but wlan driver just hangs. I checked spru762c.pdf (McBSP for OMAP5912) and it looks like McBSP has independent pins for reading and writing, so at least bandwidth is not wasted :)
Frantisek, you have been historically one of the most trusted 770/n800 'kernelware' suppliers. So when it comes to building a binary module with the final fix for this problem and deploying it to end users, I hope you don't mind if we all relay this responsibility to you? :) Regarding possible official bugfix from Nokia, it would be very much welcome (if 770 stuff is not completely disbanded yet).
(In reply to comment #33) > So when it comes to building a binary module with the > final fix for this problem and deploying it to end users, I hope you don't mind > if we all relay this responsibility to you? :) OK, I can make it part of bootmenu flasher (which reflashes initfs where the module resides) and also make separate one just with this fix for those who do not want bootmenu. I'll also try to make installable deb out of it. The only thing that bothers me is that I need to bear misdirected thanks and undeserved fame. I will definitely give proper credit to you guys but still ... > Regarding possible official bugfix from Nokia, it would be very much welcome > (if 770 stuff is not completely disbanded yet). It should be at least pushed to initfs in next hacker edition firmware or even re-release the current (possibly the last) 2007 based one. Kalle, can you comment patches and suggest best fix and/or explain what the code does and what exactly is the recv_buffer for? Also can you comment how Nokia would handle this or at least push it to appropriate people (Quim?) who can provide the answer? Thanks.
<Commenting n770_wlan_align_fix.diff> first chunk (cx3110x_spi_dma_write) - waiting for spi_dma.dma_rx_done is perhaps more correct, yes, especially if recv_buffer is on stack, otherwise we are not interested much when it is finished so that's why it is not there in original code, and it should finish at the same time as writing anyway - "last_word = *(uint16_t *)(buffer + length - 1);" looks fishy to me too, buffer is pointer to void so I'm not sure what even compiler thinks this should do, seems like unaligned word read to me, fix looks sane second and third chunk (cx3110x_spi_read) - original code seem OK to me except that it does not read last byte if length is odd, patched code seems unneeded to me it just does the same thing in two writes last part (cx3110x_spi_write) - seems not needed to me, as in previous part omap_mcbsp_spi_master_*_word_poll takes long pointer to buffer which is the only reason why there is explicit cast, I see nothing wrong with this in original code. </Commenting n770_wlan_align_fix.diff> As for the dummy part of transfer, maybe this is the way to tell the other side our clock so it is synchronized with us, transferred data does not matter but the clock does and other side send/receive is driven by our clock. I was also a bit worried why one 16 bit buffer is ok for dummy transfer even if length is >1 only later I found that the DMA channel is set up as OMAP_DMA_AMODE_CONSTANT so it always reads/writes same word 'length>>1' times.
(In reply to comment #35) > <Commenting n770_wlan_align_fix.diff> > > first chunk (cx3110x_spi_dma_write) > > - waiting for spi_dma.dma_rx_done is perhaps more correct, yes, especially if > recv_buffer is on stack, otherwise we are not interested much when it is > finished so that's why it is not there in original code, and it should finish > at the same time as writing anyway It is safer to wait for both to be sure. We definitely would not want any possibility of having DMA still hitting stack (already used by some other part of program) after we exit from this function. All in all, kmalloc seems a bit more safe. Do you have a reference to the piece of documentation proving that virt_to_phys can be officially used for stack (and it is not just a coincidence)? I relied on the information from http://www.mjmwired.net/kernel/Documentation/DMA-mapping.txt which states that neither module image addresses, nor stack can be used for DMA. > - "last_word = *(uint16_t *)(buffer + length - 1);" looks fishy to me too, > buffer is pointer to void so I'm not sure what even compiler thinks this should > do, seems like unaligned word read to me, fix looks sane > > second and third chunk (cx3110x_spi_read) > - original code seem OK to me except that it does not read last byte if length > is odd, patched code seems unneeded to me it just does the same thing in two > writes Actually cx3110x_spi_read has a dangerous bug which I tried to fix. The point is that 'omap_mcbsp_spi_master_recv_word_poll' function writes 32-bit value to output buffer. This leads to write outside the input buffer on last iteration (and possible alignment issues when storing 32-bit values at 16-bit aligned positions in the buffer). A good example is the use of this function from 'sm_drv_spi_rx' to read 16-bit 'length' variable. A simple test, reading data to some intermediate buffer (to move this data to 'length' variable later) shows that buffer overrun really exists and more than 2 bytes are modified. Fortunately, gcc pads this variable to 4 bytes on stack, so neighbour variables are not overwritten in this particular case. Nevertheless 'cx3110x_spi_read' is very dangerous and can be potentially the source of unexpected and hard to track problems. > last part (cx3110x_spi_write) > - seems not needed to me, as in previous part omap_mcbsp_spi_master_*_word_poll > takes long pointer to buffer which is the only reason why there is explicit > cast, I see nothing wrong with this in original code. The wrong thing with the code is that it casts (unsigned char *) to (u16 *) and is reading 16-bit values at once. If provided with arbitrary input buffer, having arbitrary alignment, that would cause alignment problems. Surely, it most likely always gets properly aligned buffers in this context (the driver works somehow and that's a proof), but there is no comment describing argument constraints. Potential odd length in spi_read is one more undocumented possible constraint, so we don't know for sure if it is a bug or part of protocol of communication with firmware. Maybe Kalle can have a look at it and tell us if it is ok or should be fixed. We are mostly blind and don't see the whole picture as the open driver part is only a glue between firmware and closed source umac code. We can spot suspicious looking code though. Regarding increasing McBSP clock frequency. I did some more experiments and looks like the real divisor is equal to 'div + 1'. I benchmarked DMA transfer speed in spi_dma_read/spi_dma_write, and it looks close to ~2.6MB/s at div == 2. So if div == 2, real frequency is most likely 21MHz (21MHz / 8bits = 2.625MB/s). And with the original div == 5, we have only 10.5MHz. Regarding stability at 21MHz, it seems to be ok when downloading data to the device, but deadlocks and reboots on transferring data from the device. Let's hope that's one of the numerous software bugs and can be fixed :) I will try to disable watchdog to see what happens there.
(In reply to comment #36) > Do you have a reference to the piece of documentation proving that > virt_to_phys can be officially used for stack (and it is not just a > coincidence)? Here (linked also in comment 25) http://uwsg.indiana.edu/hypermail/linux/kernel/0009.2/0898.html Someone making same error as in our driver and is corrected > I relied on the information from > http://www.mjmwired.net/kernel/Documentation/DMA-mapping.txt which states that > neither module image addresses, nor stack can be used for DMA. > > > - "last_word = *(uint16_t *)(buffer + length - 1);" looks fishy to me too, > > buffer is pointer to void so I'm not sure what even compiler thinks this should > > do, seems like unaligned word read to me, fix looks sane > > > > second and third chunk (cx3110x_spi_read) > > - original code seem OK to me except that it does not read last byte if length > > is odd, patched code seems unneeded to me it just does the same thing in two > > writes > > Actually cx3110x_spi_read has a dangerous bug which I tried to fix. The point > is that 'omap_mcbsp_spi_master_recv_word_poll' function writes 32-bit value to > output buffer. Ah, I see, missed that. Seems like bug in omap_mcbsp_spi_master_recv_word_poll to me. If word size is 16 it shouldn't write whole long but just 16 bit word. > Regarding > stability at 21MHz, it seems to be ok when downloading data to the device, but > deadlocks and reboots on transferring data from the device. Tried this too yesterday. First I tried to read from the device and it locked up (after reading ~11MB) and rebooted in few seconds. Next thing I don't understand is why the driver busy waits (udelay(5)) for DMA completion and does not use interrupts and wait_for_completion mechanism like omap_mcbsp_xmit_buffer in arch/arm/plat-omap/mcbsp.c But considering that Samuel Ortiz is author of that code too there is perhaps some reason.
(In reply to comment #37) > (In reply to comment #36) > > Do you have a reference to the piece of documentation proving that > > virt_to_phys can be officially used for stack (and it is not just a > > coincidence)? > > Here (linked also in comment 25) > http://uwsg.indiana.edu/hypermail/linux/kernel/0009.2/0898.html > Someone making same error as in our driver and is corrected Yes, I have read it before. I was just wondering if it is stated somewhere in the trusted documentation that it is expected behavior and we can rely on it in our code. Btw, the next followup post in that discussion states that we shouldn't rely on buffers in stack and kmalloc is the only reliable option: http://uwsg.indiana.edu/hypermail/linux/kernel/0009.2/0999.html > > Regarding > > stability at 21MHz, it seems to be ok when downloading data to the device, but > > deadlocks and reboots on transferring data from the device. > > Tried this too yesterday. First I tried to read from the device and it locked > up (after reading ~11MB) and rebooted in few seconds. I see, looks like my hardware is capable of working at a bit higher frequencies, to 'div = 2' is probably too much and is really a hardware issue. > Next thing I don't understand is why the driver busy waits (udelay(5)) for DMA > completion and does not use interrupts and wait_for_completion mechanism like > omap_mcbsp_xmit_buffer in arch/arm/plat-omap/mcbsp.c But considering that > Samuel Ortiz is author of that code too there is perhaps some reason. Yes, I wondered about this too in one of the early comments. Anyway, the typical time is hundreds of useconds, so overhead of using calls to udelay(5) should not be too high. We can also estimate the transfer time depending on bus speed and call the first udelay() with that value. As for the performance, it needs to be investigated properly, but most likely a lot of time is spent in umac.ko (probably doing AES encryption/decryption). OMAP should support AES in hardware, but only for OMAP2420 and not OMAP1710.
(In reply to comment #38) > > > Regarding > > > stability at 21MHz, it seems to be ok when downloading data to the device, but > > > deadlocks and reboots on transferring data from the device. > > > > Tried this too yesterday. First I tried to read from the device and it locked > > up (after reading ~11MB) and rebooted in few seconds. > > I see, looks like my hardware is capable of working at a bit higher > frequencies, so 'div = 2' is probably too much and is really a hardware issue. I'm sorry, I did not read your post carefully enough last time. If you only tested reading from the device, this looks exactly like my experience. Writing to the device was very stable in my tests and I transfered multiple gigabytes without a single fault. This still can be a software issue, as reading and writing uses a bit different code, for example the following chunk of code is used after writing but nothing like this is used after reading: /* We wait for the DMA Ready interrupt */ host_ints = 0; sm_spi_read(dev, SPI_ADRS_HOST_INTERRUPTS, (unsigned char *) &host_ints, sizeof(host_ints)); timeout = jiffies + 2 * HZ; while(!(host_ints & SPI_HOST_INT_WR_READY)) { if (time_after(jiffies, timeout)) { printk(KERN_WARNING "cx3110x: Haven't got a " "WR_READY for DMA write." "(firmware crashed?)\n"); result = -1; goto exit; } sm_spi_read(dev, SPI_ADRS_HOST_INTERRUPTS, (unsigned char *) &host_ints, sizeof(host_ints)); } host_ints_ack = SPI_HOST_INT_WR_READY; sm_spi_write(dev, SPI_ADRS_HOST_INT_ACK, (unsigned char *)&host_ints_ack, sizeof(host_ints_ack)); Anyway, the first priority is a reliable fix for memory corruption issue. Discussing possible performance improvements should be probably moved somewhere else (to the maemo-developers mailing list or to the mailing list of cx3110x garage project).
First of all, great work everyone! This is a good example how powerful Open Source community really is. You guys found the bug and fixed it in no time. Biggest bottleneck was me with releasing sources, I'm sorry about that. I haven't had time to look at the fixes yet, but I'll try answer all questions directed to me. Please point out if I missed something. (In reply to comment #24) > BTW, same fix should be done also in n800 source with sections #ifdef-ed for > N770. There's no need to do that. In OS2008 WLAN driver there won't support for 770 anymore. (In reply to comment #34) > Kalle, can you comment patches and suggest best fix and/or explain what the > code does and what exactly is the recv_buffer for? Sorry, I haven't had the time to look at the fixes yet. Also I'm not that familiar with 770's McBSP, so I can't comment before I study it more. > Also can you comment how > Nokia would handle this or at least push it to appropriate people (Quim?) who > can provide the answer? I'm not working with 770 anymore, so from my point of view the case is clear: there won't be any official updates or fixes coming to 770. I can try to help you with this occasionally, whenever I find some free time. But nowadays that doesn't happen that often. NB: This is not my call, so there's no point of trying to argue this with me. Please have that discussion somewhere else, for example in a mailing list. Let's try to keep this bug report short and clean.
Surely the patches submitted thus far can be rolled into a future release of OS 2007 or OS 2008 HE (probably the latter) in order that this critical defect is corrected for every 770 user willing to upgrade to OS 200x HE? "Back porting" to OS 2006 may be best left to the "community"...
(In reply to comment #41) > Surely the patches submitted thus far can be rolled into a future release of OS > 2007 or OS 2008 HE (probably the latter) in order that this critical defect is > corrected for every 770 user willing to upgrade to OS 200x HE? > > "Back porting" to OS 2006 may be best left to the "community"... > The first para should have said "rolled into a future release of OS 2007 HE or OS 2008 HE (probably the latter) ..."
(In reply to comment #41) > Surely the patches submitted thus far can be rolled into a future release of OS > 2007 or OS 2008 HE (probably the latter) in order that this critical defect is > corrected for every 770 user willing to upgrade to OS 200x HE? The 770 Hacker Edition only updates rootfs. Kernel and initfs are not updated AFAIK. And WLAN driver is located in initfs so I don't see how we could get the fix into the Hacker Edition.
Hacker Edition is shipped as a full FIASCO image, so in theory Nokia could respin initfs in a future HE update, be it OS 2007 or OS 2008. If what you say is true, I find it really bizarre that Nokia would not fix such a critical bug in a future firmware update.
Hi, I was thinking about two ways to install fixed module. The easiest and safest one (to install) is to add early startup script to rootfs which unloads buggy module and loads fixed one. The second (perhaps more correct) one is reflashing initfs on the fly and replacing module there which is a bit risky when it goes wrong in the middle of flashing. I guess it can be relatively safe when doing same thing that backup utility does when restoring settings - shutdown all UI programs, do it and then reboot. Still the first one would be preferred but I think there is a catch. The buggy spi_dma_write is already used to send firmware to the chip so once driver is loaded and wlan-cal is called the firmware is sent to the device corrupting few random bytes already. See src/sm_drv_spi.c functions sm_drv_spi_up, sm_drv_spi_upload_firmware. Is this correct? Or maybe the firmware is just prefetched to RAM on driver load or wlan-cal time but is sent to the chip when interface is really brought up (by ifconfig) later? So maybe we still have clean, not corrupted memory early in rootfs?
I have added some simple code which shows printk message on first 'cx3110x_spi_dma_write' call in wlan driver. This test shows that freshly booted Nokia 770 loads cx3110x driver, but does not call 'cx3110x_spi_dma_write' function until somebody actually tries to use wifi (initiates scan for wlan networks for example). Looks like just modifying startup script should be safe.
(In reply to comment #45) > So maybe we still have clean, not corrupted memory early in > rootfs? Yes, firmware is sent only after interface is really up, can be verified by dmesg output (McBSP2: frequency info and 'Loading 3825.arm firmware"). I have also added simple statistics into both dma functions and result is interesting (after few minutes of surfing and internet radio streaming). [100565.627899] cx3110x_spi_dma_read stats: called 13588 minwait 0 maxwait 305 oddlen 0 unalign 0 [100565.628021] cx3110x_spi_dma_write stats: called 10137 minwait 1 maxwait 3819 oddlen 130 unalign 0 So we really have frames with odd length and also busywaiting for finished transfer can be sometimes quite long (it counts while loops with udelay), patch attached (for convenience, it is quite simple).
Created an attachment (id=608) [details] adds some stats into cx3110x_spi_dma_read/write
Nevertheless I think that an updated full FIASCO image is still needed to solve this problem. We can be never sure that the system is not already unrecoverably broken by the moment when the user decided to install wlan driver fix. If the user connected wlan at least once (to download .deb package with wlan driver fix for example), the system consistency can't be guaranteed. I initially expected that we can get only one official firmware update with wlan bugfix at most (and even that is uncertain), so I tried to find as many problems as possible so that the last bugfix could include fixes for them too. But now I see that the only really critical bugfix for inclusion into firmware update is memory corruption fix. The driver may be slow, have some minor bugs, consume extra cpu resources, break wlan connection sometimes, but it should never compromise stability of the rest of the system. All the other problems can be fixed later and installed using startup script solution without the strong need to update FIASCO image again.
(In reply to comment #44) > Hacker Edition is shipped as a full FIASCO image, so in theory Nokia could > respin initfs in a future HE update, be it OS 2007 or OS 2008. AFAIK they are looking into getting this fix to the next HE release. Thanks for the hard work!
Bugfix released in the OS2007 Hacker Edition update. Resolving bug. See: http://maemo.org/news/announcements/view/new_os2007_he_for_the_nokia_770_with_mozilla_engine_and_memory_corruption_bugfix.html
Created an attachment (id=611) [details] declaring recv_buffer on stack instead of global data - final bumped up version string so it can be easily identified in kernel log and also added waiting for both dma transfers in cx3110x_spi_dma_write so we are sure stack variable is not writen after function exit
Created an attachment (id=612) [details] declaring recv_buffer on stack instead of global data - final binary module More correct version of binary module as explained in patch description. It identifies itself as version 0.8.1-fix#2006 in kernel log.
Created an attachment (id=613) [details] installable deb package for OS2006 or older OS2007HE Loads fixed driver in startup script when device boots. If you boot multiple systems from MMC, this should be installed in each system separately. If possible, please test and report problems, thanks.
Created an attachment (id=614) [details] source for deb package Read included README, patches are inside. I have added also fix for writing past end of the buffer inspired by n770_wlan_align_fix.diff Alignment issues remain unfixed but it seems to not cause any problems as the buffer seems to be aligned.
(In reply to comment #39) > I'm sorry, I did not read your post carefully enough last time. If you only > tested reading from the device, this looks exactly like my experience. Writing > to the device was very stable in my tests and I transfered multiple gigabytes > without a single fault. This still can be a software issue, as reading and > writing uses a bit different code, for example the following chunk of code is > used after writing but nothing like this is used after reading: That's because the firmware interface does not have similar as WriteReady interrupt for reading. So it is implemented as specified. > Discussing possible performance improvements should be probably moved somewhere > else (to the maemo-developers mailing list or to the mailing list of cx3110x > garage project). Is there a need for cx3110x specific mailing list? I can set it up if needed. If any of you thinks that there should be one, please send me email privately. (Let's try to not to pollute this bug.)
Created an attachment (id=618) [details] installable deb package for OS2006 or older OS2007HE 0.2 hopefully the last one, fixed spelling of Tilman's surname, reboot is now not needed if wi-fi is off at install time
Created an attachment (id=619) [details] source for deb package source replaced
I'm afraid we have a problem: the hotfix interacts with the clock. Badly. How to reproduce: (some of the steps may be irrelevant but that's what I did) Start the clock program and set an alarm. Leaving the clock program open, close the cover (suspend to RAM). Plug in the charger. Wait for the alarm time. The alarm sound does not play. The O.S. has frozen solid -- black screen, no key response. Power cycle by removing and replacing the battery, to recover. Without the hotfix the alarm feature was 99.9% reliable. Coincident with installing the hotfix, it started failing like this, 3 crashes out of 3 attempts. Both cx3110x-hotfix-bug2006_0.2_armel.deb and version 0.1 fail equally. I *did* reboot the machine immediately after both installations.
(In reply to comment #59) > Without the hotfix the alarm feature was 99.9% reliable. Coincident with > installing the hotfix, it started failing like this, 3 crashes out of 3 > attempts. Both cx3110x-hotfix-bug2006_0.2_armel.deb and version 0.1 fail > equally. I *did* reboot the machine immediately after both installations. So, umm, when you just uninstall the hotfix without touching anything else clock starts to work again? It sounds impossible that this is caused by the change in cx3110x driver itself. I can only imagine that the startup script that unloads old module and loads fixed one might change something (even only timing) of boot sequence. But you say you do not shutdown your tablet so the system should not boot at all. It looks like coincidence to me, several people in mailing list mentioned the alarm clock is not 100% reliable and sometimes device just hangs, searching bugzilla for word 'alarm' gives few bugs too. Could you try to uninstall just the hotfix without doing any other change, test alarms for few days and then reinstall back? You can also just rename the module in /lib/modules/2.6.16.27-omap1/ to something else to disable loading it at boot time. Or you can rename the startup script to disable it at boot time temporarily too. sudo gainroot cd /etc/rc2.d/ mv S10cx3110x-hotfix-bug2006 whateverS10cx3110x-hotfix-bug2006
Good news: the alarm feature is once again operational. Here's what I did: Following Frantisek's suggestion I disabled the hotfix by renaming /lib/modules/2.6.16.27-omap1/cx3110x-hotfix-bug2006.ko and rebooting (telinit 6). From here to the last step the WLAN was never activated. I wonder if telinit 6 was insufficient, i.e. if it was important to power off, then back on. There was a "hanging" alarm from when the alarm program crashed the machine, i.e. in the clock app the bell icon did not have a red slash. I removed this alarm, and set another one, then suspended to RAM. At the alarm time the machine froze solid. Removed battery and rebooted. I removed the hanging alarm and set another one, but did not suspend to RAM. Behavior was successful, i.e. the sound played and the alarm was removed. I suspect that this was the key step in bringing it back to life. I set another alarm, suspended to RAM, but impatiently checked if the machine had died. It was suspended when the alarm went off: success. I did the same procedure but left it suspended for the whole time: success. Now I reactivated the hotfix by reverting the rename. Power off, then on. I set another alarm and suspended to RAM: Success! I activated the WLAN, transferred files, did web stuff, etc. Set another alarm and suspended to RAM: Success! So I conclude that there may still have been a bad interaction between modules (post hoc ergo propter hoc), but whatever internal state trash may have been involved was cleaned up by one of the maneuvers that I did. I've always attributed occasional unreliability of the alarm to mysterious WLAN-related corruption (which was proven to be real by Tilman Vogel). Everyone cross fingers.