maemo.org Bugzilla – Bug 9150
Device doesn't respond via UI. syslog reports HWRecoveryResetSGX: SGX Hardware Recovery triggered, sgx_misr eating all CPU
Last modified: 2013-12-06 03:12:14 UTC
SOFTWARE VERSION:
NOKIA Maemo 5
Version: 3.2010.02-8
WLAN MAC address: EC:9B:5B:FD:xx:xx
Bluetooth address: EC:9B:5B:FD:xx:xx
IMEI: 35693803153xxxx

EXACT STEPS LEADING TO PROBLEM:
1. The system UI locks up randomly. This does not appear to be related to activity or use of the system, and has mostly occurred when idle. When this happens, opening the slide keyboard, activating the keyboard lock switch, opening the camera shutter or pressing the power button has no effect. The system has been observed not to indicate incoming phone calls, and the battery drains very quickly with the device becoming very warm.
2. [sgx_misr] eats all the CPU
3. syslog reports kernel: [100992.835906] HWRecoveryResetSGX: SGX Hardware Recovery triggered
4. syslog reports mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_open: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
5. syslog reports mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

EXPECTED OUTCOME:
System shouldn't lock up.

ACTUAL OUTCOME:
System is unusable. The only way to recover from this without remote access (i.e. ssh) appears to be to remove the battery, which may cause data loss.

REPRODUCIBILITY:
random - has occurred while idle on charge, or idle in my pocket; incidence about once every day or two.
EXTRA SOFTWARE INSTALLED:
openssh, syslogd, maep

OTHER COMMENTS:
Initially sshing into the device and running top gave the following information:

PID PPID USER STAT RSS %MEM %CPU COMMAND
800 2 root RW 0 0.0 97.2 [sgx_misr]

I believe [sgx_misr] is the kernel process for the PowerVR SGX GPU Masked Interrupt Status Register

Since then, the phone has been reflashed with vanilla nokia firmware (I think it was with 1.2009.42-11 before):
http://tablets-dev.nokia.com/nokia_N900.php?f=RX-51_2009SE_3.2010.02-8_PR_COMBINED_MR0_ARM.bin
http://tablets-dev.nokia.com/nokia_N900.php?f=RX-51_2009SE_1.2009.41-1.VANILLA_PR_EMMC_MR0_ARM.bin
This hasn't fixed the problem.

Syslogd was installed to try to gather more information. The problem is first shown with the following in /var/log/syslog:

[period of inactivity, nothing out of the ordinary]
Feb 17 19:45:35 Nokia-N900-02-8 kernel: [100992.835906] HWRecoveryResetSGX: SGX Hardware Recovery triggered
Feb 17 19:47:28 Nokia-N900-02-8 ke_recv[1322]: prop_modified:1889: udi /org/freedesktop/Hal/devices/platform_slide modified button.state.value
Feb 17 19:47:28 Nokia-N900-02-8 kernel: [101105.542236] slide (GPIO 71) is now open
Feb 17 19:47:28 Nokia-N900-02-8 systemui-tklock[988]: Method call received from: :1.6, iface: com.nokia.system_ui.request, method: tklock_close
Feb 17 19:47:29 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:31 Nokia-N900-02-8 ke_recv[1322]: prop_modified:1889: udi /org/freedesktop/Hal/devices/platform_slide modified button.state.value
Feb 17 19:47:31 Nokia-N900-02-8 kernel: [101108.628143] slide (GPIO 71) is now closed
Feb 17 19:47:32 Nokia-N900-02-8 kernel: [101109.518768] kb_lock (GPIO 113) is now closed
Feb 17 19:47:32 Nokia-N900-02-8 kernel: [101109.557830] kb_lock (GPIO 113) is now open
Feb 17 19:47:32 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_open: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:33 Nokia-N900-02-8 kernel: [101109.995330] kb_lock (GPIO 113) is now closed
Feb 17 19:47:33 Nokia-N900-02-8 kernel: [101110.315643] kb_lock (GPIO 113) is now open
Feb 17 19:47:34 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:34 Nokia-N900-02-8 kernel: [101111.034393] kb_lock (GPIO 113) is now closed
Feb 17 19:47:34 Nokia-N900-02-8 kernel: [101111.315643] kb_lock (GPIO 113) is now open
Feb 17 19:47:34 Nokia-N900-02-8 kernel: [101111.432830] kb_lock (GPIO 113) is now closed
Feb 17 19:47:34 Nokia-N900-02-8 kernel: [101111.604736] kb_lock (GPIO 113) is now open
Feb 17 19:47:35 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:36 Nokia-N900-02-8 ke_recv[1322]: prop_modified:1889: udi /org/freedesktop/Hal/devices/platform_slide modified button.state.value
Feb 17 19:47:36 Nokia-N900-02-8 kernel: [101112.925018] slide (GPIO 71) is now open
Feb 17 19:47:37 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:37 Nokia-N900-02-8 ke_recv[1322]: prop_modified:1889: udi /org/freedesktop/Hal/devices/platform_slide modified button.state.value
Feb 17 19:47:37 Nokia-N900-02-8 kernel: [101113.979736] slide (GPIO 71) is now closed
Feb 17 19:47:38 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_open: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:41 Nokia-N900-02-8 kernel: [101118.065643] cam_shutter (GPIO 110) is now open
Feb 17 19:47:41 Nokia-N900-02-8 camera-ui[2292]: GLIB DEBUG liblocation - Loading initial values from com.nokia.Location::las
Feb 17 19:47:41 Nokia-N900-02-8 location-daemon[3174]: GLIB DEBUG default - :1.820 now having 1 connections
Feb 17 19:47:41 Nokia-N900-02-8 camera-ui[2292]: GLIB DEBUG liblocation - Object path: /com/nokia/location/las
Feb 17 19:47:41 Nokia-N900-02-8 location-daemon[3174]: GLIB DEBUG default - :1.820 now having 0 connections
Feb 17 19:47:42 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:42 Nokia-N900-02-8 location-daemon[3174]: GLIB DEBUG default - :1.411 now having 1 connections
Feb 17 19:47:42 Nokia-N900-02-8 location-daemon[3174]: GLIB DEBUG default - New client. Not modifying LAS session
Feb 17 19:47:45 Nokia-N900-02-8 kernel: [101122.182830] cam_shutter (GPIO 110) is now closed
Feb 17 19:48:09 Nokia-N900-02-8 kernel: [101146.620544] wl1251: 151 tx blocks at 0x3b788, 35 rx blocks at 0x3a780
Feb 17 19:48:09 Nokia-N900-02-8 kernel: [101146.620941] wl1251: firmware booted (Rev 4.0.4.3.7)
Feb 17 19:48:09 Nokia-N900-02-8 wlancond[1128]: Scan issued
Feb 17 19:48:10 Nokia-N900-02-8 wlancond[1128]: Scan results ready -- scan active
Feb 17 19:48:10 Nokia-N900-02-8 wlancond[1128]: Scan results (10 APs) to :1.81
Feb 17 19:48:10 Nokia-N900-02-8 kernel: [101147.667541] wl1251: down

A little later I asked someone to call me, to see if the device would indicate the incoming call - it didn't, but the syslog has a few entries from it.

Feb 17 20:17:40 Nokia-N900-02-8 telepathy-ring[1061]: GLIB MESSAGE Modem-Call - incoming call from "07871xxxxxx"
Feb 17 20:17:41 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 20:17:42 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 20:17:50 Nokia-N900-02-8 kernel: [102926.987762] proximity (GPIO 89) is now closed
Feb 17 20:17:51 Nokia-N900-02-8 kernel: [102927.972015] proximity (GPIO 89) is now open
Feb 17 20:17:52 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 20:17:55 Nokia-N900-02-8 kernel: [102932.620361] proximity (GPIO 89) is now closed
Feb 17 20:17:56 Nokia-N900-02-8 kernel: [102933.190673] proximity (GPIO 89) is now open
Feb 17 20:17:57 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 20:18:02 Nokia-N900-02-8 telepathy-ring[1061]: GLIB MESSAGE Modem-Call - mt-released incoming call from '07871xxxxxx' com.nokia.csd.Call.Error.Network.NormalCallClearing: Normal Call Clearing
Feb 17 20:18:02 Nokia-N900-02-8 telepathy-ring[1061]: GLIB MESSAGE Modem-Call - terminated incoming call from '07871xxxxxx' com.nokia.csd.Call.Error.Network.NormalCallClearing: Normal Call Clearing
Feb 17 20:18:03 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_open: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Possibly related / useful URLs:

Bug 7017 - SGX memory reset seems failed during reboot
https://bugs.maemo.org/show_bug.cgi?id=7017

The SGX driver
http://www.daimi.au.dk/~cvm/repo/add_nokia_sgx_driver.patch

Possible hardware fault followed by failure to recover with the "HWRecoveryResetSGX: SGX Hardware Recovery triggered" routine?

Thanks
James
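For anyone collecting similar logs, here is a minimal sketch (not part of the original report) of how the syslog could be summarised off-device: it records the kernel uptime of each "HWRecoveryResetSGX" message and counts the mce tklock requests that failed after the first reset. The regular expressions assume the syslog layout shown in this report; the function name summarize is made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout, taken from the syslog lines quoted in this report:
#   "Feb 17 19:45:35 host kernel: [100992.835906] HWRecoveryResetSGX: ..."
RESET_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] HWRecoveryResetSGX")
TKLOCK_RE = re.compile(
    r"mce\[\d+\]: Error sending with reply to "
    r"com\.nokia\.system_ui\.request\.(tklock_\w+)"
)


def summarize(lines):
    """Return (reset_uptimes, failures).

    reset_uptimes: kernel uptime in seconds of each SGX recovery message.
    failures: Counter of tklock methods that got no reply after the
              first SGX recovery message was seen.
    """
    reset_uptimes = []
    failures = Counter()
    for line in lines:
        reset = RESET_RE.search(line)
        if reset:
            reset_uptimes.append(float(reset.group(1)))
            continue
        failed = TKLOCK_RE.search(line)
        if failed and reset_uptimes:  # only count failures after a reset
            failures[failed.group(1)] += 1
    return reset_uptimes, failures
```

It can be fed an open file object for a copy of /var/log/syslog. A reset followed by a run of tklock failures would match outcome seen here; a reset with no failures would suggest the device recovered.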
Hi James, thanks for reporting this!
I believe I just experienced the same bug. The device had been on charge for some hours, with a steady green LED confirming full charge. When I unplugged the charger, the LED switched to standard blinking as expected, but there was no response to any physical actions. The keyboard even remained unlit when slid open. The device was also warmer than expected. I ssh'd to the device without problems and top confirmed the same process:

Mem: 228024K used, 17516K free, 0K shrd, 4252K buff, 72024K cached
CPU: 0.3% usr 99.6% sys 0.0% nice 0.0% idle 0.0% io 0.0% irq 0.0% softirq
Load average: 2.44 3.01 2.49
PID PPID USER STAT RSS %MEM %CPU COMMAND
796 2 root RW 0 0.0 98.0 [sgx_misr]

I attempted to kill this process several times with "kill -9 796" but without success. Finally I used the "reboot" command, which seemed to perform a normal reboot without any unusual delays.

Firmware: 2.2009.51-1.203.2 (Latest version not rolled to UK users yet)

?:~# cat /proc/version
Linux version 2.6.28-omap1 (bifh6@maemo-bifh-19) (gcc version 4.2.1) #1 PREEMPT Thu Dec 17 09:40:52 EET 2009

Recent activity that possibly may have contributed or triggered:
Updated OMWeather to version 0.25.6 from the dev repository a few hours earlier but had yet to reboot device as instructed during update.
Installed BatteryGraph 0.2.2-1 about a week prior, which I understand keeps a background process running to poll and record battery status every 15 minutes.
Also possibly related: On several occasions in the past week the device has not commenced charging upon connecting the supplied Nokia charger. Resolved each time, on first attempt, by disconnecting then reconnecting the charger to device. Uptime was 4 days prior to rebooting. Can't say for sure, but the occasional charging issue described above may only have occurred during this time-frame.
Just to add, the display wasn't turning itself off when running off battery after experiencing this bug. Toggling various options in the Settings, Display applet resolved it. N.B. I have the option to keep display lit when charging enabled.
Exactly the same symptoms as Faz, including UK Firmware 2.2009.51-1.203.2. Except that I don't have OMWeather or BatteryGraph installed. My problems seem to relate to using GPS (leaving Maep running while locking the screen), rather than charging. See bug #8689.

[75990.246429] kb_lock (GPIO 113) is now closed
[75990.379119] kb_lock (GPIO 113) is now open
[76364.634124] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76386.777008] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76674.741302] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76676.639343] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76879.832122] kb_lock (GPIO 113) is now closed
[76880.082122] kb_lock (GPIO 113) is now open
[76882.089965] kb_lock (GPIO 113) is now closed
[76882.347778] kb_lock (GPIO 113) is now open

Load average: 2.10 1.97 1.57
PID PPID USER STAT RSS %MEM %CPU COMMAND
784 2 root RW 0 0.0 81.6 [sgx_misr]
4546 4524 root R 588 0.2 18.1 top
3950 1153 user S 23636 9.5 0.0 /usr/sbin/browserd -s 3950 -n RTComMessagingServer
4459 950 user S 18296 7.4 0.0 /usr/bin/modest
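Since several resets appear close together in this dmesg output, a small sketch for measuring the spacing between them may help when comparing devices. It is only an illustration: the function name reset_gaps is invented here, and the pattern assumes the "[seconds.micros] HWRecoveryResetSGX" layout seen in this comment.

```python
import re

# Matches the uptime stamp of each SGX recovery message in dmesg output.
STAMP_RE = re.compile(r"\[\s*(\d+\.\d+)\] HWRecoveryResetSGX")


def reset_gaps(dmesg_text):
    """Return the seconds elapsed between consecutive SGX recovery messages."""
    stamps = [float(s) for s in STAMP_RE.findall(dmesg_text)]
    # Pair each stamp with its successor and take the difference.
    return [round(later - earlier, 6) for earlier, later in zip(stamps, stamps[1:])]
```

Applied to the four reset lines above, this gives gaps of roughly 22 s, 288 s and 2 s, i.e. the resets here arrive in irregular bursts rather than at a fixed interval.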
Can you test if this still happens after uninstalling maep, having bug 8689 in mind?
Hi Andre,

I was getting to the end of my 28 day 'on the spot replacement for a new one' period, so I took advantage of that, and haven't had the device go unresponsive since. I've just installed syslog to see if there's anything about maep and SGX still happening, and indeed occasionally I do get entries like these:

Mar 1 21:30:17 Nokia-N900-02-8 kernel: [37680.657348] HWRecoveryResetSGX: SGX Hardware Recovery triggered
Mar 1 22:05:20 Nokia-N900-02-8 kernel: [39783.652770] HWRecoveryResetSGX: SGX Hardware Recovery triggered

However the device remains responsive, and the sgx_misr kernel process isn't eating all the CPU.

Any idea as to the origins of the HWRecoveryResetSGX routine? Is it a kernel patch around a hardware bug which detects and resets the SGX when bad stuff happens? (but doesn't handle all cases of bad stuff happening?)
(In reply to comment #2)
> PID PPID USER STAT RSS %MEM %CPU COMMAND
> 796 2 root RW 0 0.0 98.0 [sgx_misr]
>
> I attempted to kill this process several times with "kill -9 796"
> but without success.

Things which take zero memory and whose names are shown in [] are kernel threads, i.e. there's nothing to kill from the user-space perspective.

(In reply to comment #7)
> Any idea as to the origins of the HWRecoveryResetSGX routine?
> Is it a kernel patch around a hardware bug which detects and resets
> the SGX when bad stuff happens?

AFAIK: SGX HW (microkernel?) resets its state when it encounters a state it cannot handle, and the Linux kernel on the ARM side notices this and outputs the message.

The issue could be in Imagination's SGX driver. Next release has some driver changes which should at least decrease the likelihood of the resets happening, but as the issue is non-reproducible, it's nearly impossible to reliably verify this.

This doesn't happen for everybody; some people never encounter this issue (for example my device has syslog for over a month and there's nothing about SGX resets, but my usage is quite light, I might not stress the device enough).
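The point about kernel threads can be checked from an ssh session. The sketch below is a heuristic, not an official API: it treats a task as a kernel thread if it is kthreadd itself (PID 2) or if its parent is PID 2 and /proc/PID/cmdline is empty, which matches the "PPID 2" and "[sgx_misr]" entries in the top output quoted above. The helper names and the proc_root parameter are invented for illustration.

```python
from pathlib import Path


def parse_stat(stat_text):
    """Split a /proc/PID/stat line into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid, _, comm = head.partition(" (")
    fields = tail.split()
    return int(pid), comm, fields[0], int(fields[1])


def is_kernel_thread(pid, proc_root="/proc"):
    """Heuristic: kthreadd (PID 2) or a child of it with an empty cmdline."""
    base = Path(proc_root) / str(pid)
    _, _, _, ppid = parse_stat((base / "stat").read_text())
    cmdline = (base / "cmdline").read_bytes()
    return pid == 2 or (ppid == 2 and cmdline == b"")
```

If is_kernel_thread(796) returns True, sending signals such as "kill -9" is not expected to stop the task; only the kernel can.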
(In reply to comment #8)
> This doesn't happen for everybody, some people never encounter this
> issue (for example my device has syslog for over a month and there's
> nothing about SGX resets, but my usage is quite light, I might not
> stress the device enough).

Have you tried leaving maep running for extended periods of time? (Note I'm not blaming maep for doing anything wrong, but bug 8689 indicates something maep does triggers the SGX resets.)

There do seem to be two different outcomes from SGX resets while running maep, either
a) as per my old N900 - UI dies, sgx_misr eats CPU, SGX Hardware recovery triggered.
b) as per my new N900 - SGX Hardware Recovery triggered. Device continues running just fine.

This is with the same firmware flashed to the devices, so I'm curious as to whether the difference is intentional - i.e. different revisions of SGX chip, or whether there are a few faulty SGX chips (or something related to the resetting of the SGX chips) which cause symptoms as per a) when HW recovery is initiated.

In any case there seem to be several people experiencing a) and several people experiencing b).

Cheers
James
(In reply to comment #9)
> Have you tried leaving maep running for extended periods of time?

No, I haven't used that, but it seems interesting (and starts faster than PR1.1 Maps), so I'll give it a try.
Tried it on older device, which started getting SGX resets after installing & testing Maep... I'll try it next on the device with 20 days uptime and no SGX resets so far.
Seems a related bug: https://bugs.maemo.org/show_bug.cgi?id=8689
(In reply to comment #11)
> Tried it on older device, which started getting SGX resets after installing &
> testing Maep...
>
> I'll try it next on the device with 20 days uptime and no SGX resets so far.

That device doesn't get SGX resets even with Maep. Hm... I don't have the same SIM in them. I wonder whether the network could have some impact on triggering this (wild guess: a hard-to-trigger issue might come from interaction between different drivers)?
(In reply to comment #8)
>The issue could be in Imagination's SGX driver. Next release has some driver
>changes which should at least decrease the likelihood of the resets happening,
>but as the issue is non-reproducible, it's nearly impossible to reliably verify
>this.

If you send it to me I can verify it. My N900 has a high probability of "SGX Reset" problem after reboot (it is a reason why I use OFF-ON instead of reboot) - see bug 7017
(In reply to comment #14)
>> The issue could be in Imagination's SGX driver. Next release has some
>> driver changes which should at least decrease the likelihood of the resets
>> happening, but as the issue is non-reproducible, it's nearly impossible to
>> reliably verify this.
>
> If you send it to me I can verify it. My N900 has a high probability of "SGX
> Reset" problem after reboot (it is a reason why I use OFF-ON instead of reboot)
> - see bug 7017

There was limited community pre-testing for the PR1.1 release (to find out whether there are issues that don't appear in our own testing environments). Quim can comment on whether something similar is possible also for PR1.2 release.
I'm afraid I've seen this bug as well. Not with Maep, but with maemo-mapper.
*** This bug has been confirmed by popular vote. ***
*** Bug 8689 has been marked as a duplicate of this bug. ***
Just experienced this issue for the first time today. Tried to use phone and found it completely unresponsive. Pulled the battery and then, about an hour later, found it the same way. This time I was in a position to SSH into the phone, and saw the same process eating 100% cpu. No data from syslog yet, as I don't have it set up and configured, but I'll make that my next step.
I've seen this behaviour more than once. Today I was hiking and it occurred four times while
* the device was in offline mode (to avoid roaming cost of 7¢ per KB)
* the only app running was Maep
* the device had a GPS fix

It usually happens when the device is in locked mode. Grab it, try to unlock it, no response. As I have no other device available to ssh in, I need to remove the battery and do a reboot. After that, the N900 needs up to 20 minutes to get a GPS fix (due to offline mode) and will crash again some time after it is in locked mode again.

This issue is still unassigned but rather annoying (no offense BTW). Hiking, biking and editing osm data was the main reason to buy the N900. Are there hopes to get the issue fixed or should I better use the N810?

Refs:
http://www.christeck.de/wp/2010/04/02/mapping-initiation-of-the-n900/
http://www.christeck.de/wp/2010/04/25/loosing-the-tracks-of-a-hiking-trip-9869/
(In reply to comment #20)
> This issue is still unassigned but rather annoying (no offense BTW).

It's being looked at (alias field indicates that it's forwarded to internal bug tracker).

> Hiking, biking and editing osm data was the main reason to buy the N900.
> Are there hopes to get the issue fixed or should I better use the N810?

It's potentially fixed with PR1.2 which has quite a lot of SGX driver fixes, but as the issue here isn't really re-producible, it's impossible to say for sure that this is the case.

There are still two known remaining issues with SGX:
* One that was possible to trigger (sometimes) with maemo-mapper (which uses clutter i.e. GLES). The kernel oops is a page fault in v7_dma_inv_range(). It seems that maemo-mapper called glReadPixels() on a pixmap but the backing memory had already been freed. If SGX tried to access the pixmap it'd probably trigger a page fault too, which would cause a lock-up. It may be possible that the same issue is also triggerable by Maep. This has a potential fix, but it's not yet properly tested or integrated (NB#144156).
* SGX driver has problems when the FPS stays below 0.5. I would think this unlikely to get triggered in anything other than a synthetic test program; who would want to write or use an application that's so badly written that it doesn't get more speed out of SGX?
(In reply to comment #21)
> It's potentially fixed with PR1.2 which has quite a lot of SGX driver fixes,
> but as the issue here isn't really re-producible

I can consistently reproduce it on my device (it seems like a few other people can too). If there's a chance that further testing might reveal a bug *and* suggest a fix for 1.2, please suggest things for me to try or do. Otherwise I'll report back after 1.2 is released.
(In reply to comment #22)
> If there's a chance that further testing might reveal a bug *and*
> suggest a fix for 1.2, please suggest things for me to try or do.
> Otherwise I'll report back after 1.2 is released.

PR1.2 should be coming soon, I think it's best to do testing on that. (Note: releases are based on internal testing results -> we don't have dates for them.)
(In reply to comment #22)
> If there's a chance that further testing might reveal a bug *and*
> suggest a fix for 1.2, please suggest things for me to try or do.

Something came up which would be good to check. When this happens, log in to the device with SSH and check whether any processes are in D state. For any such process, check what /proc/PID/wchan reports.
(In reply to comment #24)
> Something came up which would be good to check. When this happens, log in to the
> device with SSH and check whether any processes are in D state. For any such
> process, check what /proc/PID/wchan reports.

Right, sorry for the delay. When the device is in this state:

Nookie:~# ps | grep ' D ' | grep -v grep
786 root 26132 D < /usr/bin/Xorg -logfile /tmp/Xorg.0.log -logverbose 1

ohmd intermittently shows up blocked in this state, but also during the device's normal state too. Xorg is the key here I guess.

/proc/786/wchan contains "PVRSRVPowerLock".

This is on PR 1.1.1.
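For others who want to run the same check, here is a rough sketch that does both steps in one pass: it lists every task whose /proc/PID/status reports state D (uninterruptible sleep) together with the contents of /proc/PID/wchan. It assumes Python is available on the machine where it runs and that the kernel exposes wchan, as it evidently does on the device above; the function name blocked_tasks and the proc_root parameter are invented for illustration.

```python
from pathlib import Path


def blocked_tasks(proc_root="/proc"):
    """Yield (pid, name, wchan) for each process in uninterruptible sleep (D)."""
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            status = (entry / "status").read_text()
            # status is a list of "Key:\tvalue" lines; keep only those lines.
            fields = dict(
                line.split(":", 1) for line in status.splitlines() if ":" in line
            )
            if not fields.get("State", "").strip().startswith("D"):
                continue
            wchan = (entry / "wchan").read_text().strip()
        except OSError:
            continue  # the process exited, or the file is not readable
        yield int(entry.name), fields.get("Name", "").strip(), wchan
```

Running "for pid, name, wchan in blocked_tasks(): print(pid, name, wchan)" in the hung state would be expected to print a line for Xorg with PVRSRVPowerLock, matching the manual check above.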
I got my phone a week ago, and only yesterday it started to have this bug. I was using maep and logging my moving position, but I think it sometimes also happens when I have maep closed while charging the phone in idle. Is there a workaround, or are there certain apps we should not use/have installed to avoid it? Is there a way I can provide useful information? I would hope it gets solved in firmware 1.2, but I don't expect it to come soon :(
(In reply to comment #26)
> I got my phone a week ago, only yesterday it started to have this bug.
> I was using maep and logging my moving position, but I think sometimes it also
> happens when I had maep closed while charging the phone in idle.

What programs & applets did you have running when the phone was "idle"?

> Is there a workaround or are there certain apps we should not use/have
> installed to avoid it? Is there a way I can provide useful information?
>
> I would hope it gets solved in firmware 1.2, but I don't expect it to come soon
> :(

It should really come soon now (within weeks, not month(s)), it's been postponed too much already.
I've experienced the same issues, with PR1.1 Global, associated with use of maep or OVI Maps. Always using 3G for network connectivity (3 UK), if that's of any consideration. I've never attempted to debug the cause, though.
(In reply to comment #28)
> I've experienced the same issues, with PR1.1 Global, associated with use of
> maep or OVI Maps.

This is the first time I hear that it has happened with the (pre-installed) OVI maps. Has anybody else gotten this issue with OVI Maps?
(In reply to comment #29)
> (In reply to comment #28)
> > I've experienced the same issues, with PR1.1 Global, associated with use of
> > maep or OVI Maps.
>
> This is the first time I hear that it has happened with the (pre-installed) OVI
> maps. Has anybody else gotten this issue with OVI Maps?

Not me. I have left OVI Maps in the background with the screen locked on many occasions, too.
Bug confirmed on PR1.2 :-( (someone please update the version field) UK variant firmware, installed via command-line flasher (including emmc flash). Xorg still blocked and its wchan reporting 'PVRSRVPowerLock'. (Exactly as per my comment #25). I think it might have taken a bit longer to go belly-up this time (over 1hr, on PR1.1.1 it rarely took more than 20 mins) but the timing was never precise. This is with maep v1.3.2. I've delayed upgrading to 1.3.5 or greater (which the maep maintainer says might avoid triggering this bug) in the hope that I might be of some help nailing the underlying problem. See https://garage.maemo.org/tracker/index.php?func=detail&aid=5293&group_id=1155&atid=4332
Saw this on BBC iPlayer - halfway through Top Gear. Flash video froze, though audio continued playing. Screen blanked after the blank timeout and would not resume with the lock switch; touching the screen unblanked it (exposing the original content). There was only one line in dmesg:

[89946.780700] HWRecoveryResetSGX: SGX Hardware Recovery triggered

All the rest looked normal.
Has just happened to me on PR1.2. I was not using any maps program at the time, but did run Maep earlier today for less than 1 minute. I had MicroB open on a plain HTML page (no flash/javascript stuff) and it was downloading a large file (350M) in the background, and pidgin running in background, then I locked the screen. When I came back to check it 30 minutes later, the screen wouldn't turn on. I am able to ssh into it, though. I didn't have syslog running at the time but dmesg shows: HWRecoveryResetSGX: SGX Hardware Recovery triggered After which the screen won't turn on and sgx_misr uses near 100% CPU, same as the others have reported.
"Me too" PR 1.2, this has occurred twice for me, both times since I found a way to keep GPS active (AGTL app; Maps & others tend to let GPS shutdown when idle). Why is this keyworded 'performance'?
Another "me too" here on PR1.2:

Tasks: 173 total, 2 running, 170 sleeping, 0 stopped, 1 zombie
Cpu(s): 9.6%us, 20.1%sy, 0.7%ni, 65.8%id, 3.7%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 245348k total, 231696k used, 13652k free, 1212k buffers
Swap: 786424k total, 88164k used, 698260k free, 75096k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
866 root 20 0 0 0 0 R 91.3 0.0 31:33.25 sgx_misr
3908 user 20 0 2344 1016 760 R 7.0 0.4 0:00.10 top
1 root 20 0 1832 608 484 S 0.0 0.2 0:00.75 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:02.37 ksoftirqd/0

Linux Nokia-N900 2.6.28.10power37 #1 PREEMPT Wed May 26 00:24:03 EEST 2010 armv7l unknown
(In reply to comment #34)
> Why is this keyworded 'performance'?

Mostly the effects of SGX resets (the SGX resetting itself to recover from something) are seen as (extreme) slowdown, and the sgx_misr kernel thread taking all CPU of course also affects performance. Unlocking not working could be a side-effect or an unrelated bug.

An internal (w25) release has additional SGX fixes which will be in the next public release. As the issues aren't reproducible, it's hard to say whether it will fix the issues people have still encountered after the SGX fixes in PR1.2.

If some program from Extras is especially "good" at causing the SGX resets with PR1.2 (now that Maep doesn't generate them, at least re-producibly), please comment in this bug. Typically such programs are also otherwise bad: they do updates in the background, leak a lot of memory etc.
(Some fixes internally had to be reverted because of side effects.)
I've got a non-scientific observation about this bug: it happened way more frequently (2 times a week instead of once a month) during my holiday trip. Possible causes:
- a TIM (Telecom Italia) sim card instead of an SFR (French telecom operator) one.
- a different usage pattern (more mail usage, more GPS usage via OVI maps, but it once locked with only mail app running)

As said elsewhere, it happens when idle, screen locked, either in my pocket or while charging.
Ok, I've "seen" it happen just after I slide-to-unlock (you know, after you press the power button when it's locked). In fact, the window manager was dead (mail app fullscreen without decoration), re-pressing the power button made the menu appear, but without borders and all buttons stacked in 1 column, instead of 2 as it should in horizontal mode. And then it deadlocked, screen off. All this happened quite fast (1-2 seconds), but I'm sure the WM was dead, and that may be interesting to whoever is in charge of this bug.
Don't know if this helps, but here's output from sysrq-w (shoW-blocked-tasks) in the hung state:
[17890.795166] SysRq : Show Blocked State
[17890.795227] task PC stack pid father
[17890.795288] Xorg D c0283c14 0 831 687
[17890.795288] [<c028396c>] (schedule+0x0/0x328) from [<c0284994>] (__mutex_lock_slowpath+0xb0/0x124)
[17890.795349] [<c02848e4>] (__mutex_lock_slowpath+0x0/0x124) from [<c02846c8>] (mutex_lock+0x24/0x28)
[17890.795379] r6:00000000 r5:ffffffff r4:d0d9a004
[17890.795410] [<c02846a4>] (mutex_lock+0x0/0x28) from [<bf2004f4>] (PVRSRVPowerLock+0x5c/0xf0 [pvrsrvkm])
[17890.795532] [<bf200498>] (PVRSRVPowerLock+0x0/0xf0 [pvrsrvkm]) from [<bf2006c8>] (PVRSRVSetDevicePowerStateKM+0x3c/0xa0 [pvrsrvkm])
[17890.795623] r8:00000001 r7:00000000 r6:00000000 r5:ffffffff r4:d0d9a004
[17890.795623] [<bf20068c>] (PVRSRVSetDevicePowerStateKM+0x0/0xa0 [pvrsrvkm]) from [<bf2068e8>] (SGXScheduleCCBCommandKM+0x3c/0x1f4 [pvrsrvkm])
[17890.795715] r9:ce6dfe18 r8:d0d7e000 r7:ffffffff r6:00000001 r5:cd741180
[17890.795745] r4:d0d9a004
[17890.795776] [<bf2068ac>] (SGXScheduleCCBCommandKM+0x0/0x1f4 [pvrsrvkm]) from [<bf2073ac>] (SGXSubmitTransferKM+0x2dc/0x2ec [pvrsrvkm])
[17890.795867] [<bf2070d0>] (SGXSubmitTransferKM+0x0/0x2ec [pvrsrvkm]) from [<bf2020dc>] (SGXSubmitTransferBW+0x17c/0x190 [pvrsrvkm])
[17890.795928] r7:d0d9b000 r6:00000001 r5:d0d9a000 r4:d0d9a004
[17890.795959] [<bf201f60>] (SGXSubmitTransferBW+0x0/0x190 [pvrsrvkm]) from [<bf201714>] (BridgedDispatchKM+0x110/0x140 [pvrsrvkm])
[17890.796051] [<bf201604>] (BridgedDispatchKM+0x0/0x140 [pvrsrvkm]) from [<bf1f6524>] (PVRSRV_BridgeDispatchKM+0xc4/0xec [pvrsrvkm])
[17890.796142] r9:ce6de000 r8:ce6ab900 r7:ce6dfeb0 r6:0000033f r5:c01c6755
[17890.796173] r4:bedb6b04
[17890.796173] [<bf1f6460>] (PVRSRV_BridgeDispatchKM+0x0/0xec [pvrsrvkm]) from [<c00c6a50>] (vfs_ioctl+0x34/0x94)
[17890.796234] r7:00000006 r6:c01c6755 r5:bedb6b04 r4:ce6ab900
[17890.796264] [<c00c6a1c>] (vfs_ioctl+0x0/0x94) from [<c00c7044>] (do_vfs_ioctl+0x498/0x4d8)
[17890.796295] r7:00000006 r6:bedb6b04 r5:00000006 r4:ce6ab900
[17890.796325] [<c00c6bac>] (do_vfs_ioctl+0x0/0x4d8) from [<c00c70dc>] (sys_ioctl+0x58/0x7c)
[17890.796356] r9:ce6de000 r8:ce6ab900 r6:c01c6755 r5:bedb6b04 r4:00000000
[17890.796386] [<c00c7084>] (sys_ioctl+0x0/0x7c) from [<c002c920>] (ret_fast_syscall+0x0/0x2c)
[17890.796417] r8:c002caa4 r7:00000036 r6:bedb6d4c r5:bedb7064 r4:0019cb94
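For anyone wanting to collect the same kind of dump, here is a minimal sketch of how a sysrq-w report might be triggered and summarized. It assumes the kernel has magic SysRq enabled and exposes /proc/sysrq-trigger, and that you have a root shell (e.g. over ssh). The helper name blocked_tasks is made up for illustration and is not part of any Maemo tool.

```shell
#!/bin/sh
# Hypothetical helper: list the names and PIDs of tasks reported as blocked
# (state D) in a "SysRq : Show Blocked State" dump read from stdin.
blocked_tasks() {
    # Strip the leading "[timestamp]" and keep only task header lines of
    # the form "<name> D <pc> <stack> <pid> <father>".
    sed 's/^\[ *[0-9.]*] *//' |
        awk '$2 == "D" && NF >= 6 { print $1, $5 }'
}

# Usage on the device, as root (assumes magic SysRq is available):
#   echo w > /proc/sysrq-trigger
#   dmesg | blocked_tasks
```

For the dump in this comment, the summary would be a single line naming Xorg, which is blocked in PVRSRVPowerLock.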
Me too. Confirm the presence of this problem when running modrana. Firmware: India version.
I can also see this issue. Running PR1.2
(In reply to comment #42) > I can also see this issue. Running PR1.2 > As per https://bugs.maemo.org/show_bug.cgi?id=7017 the issue started after entering reboot in xterm and was fixed after powering off and back on again.
I want to add some more observations: 1. Till now I had seen the issue triggering only when the display was in off state. I installed "Simple Brightness Applet" to retain the display ON. For the first time now I could see the issue triggering even when display was ON. Nothing on the display was responding and I had to as usual take out the battery to restart. 2. The triggering of issue seems to have something to do with the data connection. When I was "roaming" the issue triggered more number of times (almost every 5/10 minutes) than when I was on my home network (where it triggered once in a few hours). (I'm suggesting this to be attributed to the quality of data network rather than roaming or non roaming.) 3. I switched off internet connection completely. Disabled auto-connection options., set the map images to None in modrana. (Retained only the tracklog view.) The bug did not trigger in several hours of continuous usage after that.
(In reply to comment #44) > 2. The triggering of issue seems to have something to do with the data > connection. When I was "roaming" the issue triggered more number of times > (almost every 5/10 minutes) than when I was on my home network (where it > triggered once in a few hours). (I'm suggesting this to be attributed to the > quality of data network rather than roaming or non roaming.) It could be just that programs in the device do more (at the same time) and consume/dirty more memory when you have a network connection. dbus-monitor will tell what kind of D-Bus messages are flying around in the device, "xresponse -a '*' -w 0" will tell what kind of window updates the programs do and e.g. powertop will tell what extra processes had wakeups during that time.
> It could be just that programs in the device do more (at the same time) and > consume/dirty more memory when you have network connection. > > dbus-monitor will tell what kind of dbus messages there are flying in the > device, "xresponse -a '*' -w 0" will tell what kind kind of window updates the > programs do and e.g. powertop will tell what extra processes had wakeups during > that time. > You know that a bug-triggering state was reached only after the device hangs. So I don't know WHEN to check above things. I had kept the syslogs enabled. Using time stamp and treating "syslog restart" as a separator, I have segregated the log into segments between a restart till occurrence of issue. I also have the log of the session where the same app (modrana) was running though sans internet and no issue got triggered. Thus I have 4 "bad" session logs and 1 "good" session log. I am trying to isolate messages that occur only in "bad" session and not in "good" session to uniquely tell what happened before the triggering of issue. No luck so far. It could be that the issue causing circumstances do not reflect in the log.
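The comparison described above (split the syslog at each restart, then look for messages that appear only in "bad" sessions) could be sketched roughly as follows. This is only an illustration: the function names are invented, the restart marker is passed in as a pattern because the exact wording of the syslogd restart line is not shown in this thread, and the line layout is assumed to match the "Feb 17 19:45:35 Nokia-N900-02-8 kernel: [...]" format quoted in the report.

```shell
#!/bin/sh
# Sketch only: function names and the restart marker are assumptions.

# Split a syslog (stdin) into session-1.log, session-2.log, ... inside
# directory $1. A new file is started at every line matching pattern $2.
split_sessions() {
    awk -v dir="$1" -v marker="$2" '
        BEGIN { n = 1 }
        $0 ~ marker { n++ }
        { print > (dir "/session-" n ".log") }
    '
}

# Reduce a syslog (stdin) to its distinct message shapes: drop the
# "Mon DD HH:MM:SS host" prefix and any "[number]" stamp (kernel uptime
# or process ID), then sort and de-duplicate.
normalize() {
    sed -e 's/^[A-Z][a-z][a-z]  *[0-9][0-9]*  *[0-9:]*  *[^ ]*  *//' \
        -e 's/\[ *[0-9.]*] *//g' |
        sort -u
}

# Print message shapes that occur in the "bad" log $1 but not in the
# "good" log $2.
only_in_bad() {
    bad=$(mktemp) && good=$(mktemp) || return 1
    normalize < "$1" > "$bad"
    normalize < "$2" > "$good"
    comm -23 "$bad" "$good"
    rm -f "$bad" "$good"
}

# Usage (the marker pattern is a guess; adjust to your syslogd's wording):
#   split_sessions /media/mmc1/sessions 'syslogd.*restart' < /var/log/syslog
#   only_in_bad session-2.log session-5.log
```

Running only_in_bad for each bad session against the good one and keeping the lines common to all results narrows the list further. As noted above, it may still come up empty if the trigger never reaches the log.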
(In reply to comment #46) > You know that a bug-triggering state was reached only after the device hangs. > So I don't know WHEN to check above things. If xresponse (which you use through an SSH connection) reports apps updating their windows when they're not visible (some other app is on top or the screen is blanked), that's a clear bug in the application and has, at least earlier, made this issue trigger much more easily. -> bug should be filed against the app. Applications using huge amounts of memory can also make this issue easier to trigger. If an application leaks memory when it's just repeating the same thing without its data (e.g. number of mails, contacts etc) growing, that's a clear bug also (-> should be reported against the app). mem-cpu-monitor (from the sp-memusage package) and xrestop can be used to track that. Among other things, PR1.3 fixes the last known (smallish) memory leaks in the pre-installed applications. > No luck so far. It could be that the issue causing circumstances > do not reflect in the log. I think it's better to first wait until PR1.3 is released and see whether this is still reproducible. It has several fixes to the SGX drivers.
> If xresponse (which you use through SSH connection) reports apps updating their > windows when they're not visible (some other app is on top or screen is > blanked), that's a clear bug in the application and has at least earlier made > this issue to trigger much more easily. -> bug should be filed against the app I've seen this bug with SEVERAL navigation apps till now. I have also seen it trigger when the display was on and the application in question was visible in the foreground. > > Applications using huge amounts of memory can also make this issue easier to > trigger. If application leaks memory when it's just repeating the same thing > without its data (e.g. number of mails, contacts etc) growing, that's a clear > bug also (-> should be reported against app). mem-cpu-monitor (from > sp-memusage package) and xrestop can be used to track that. I had run top and monitored memory taken once (by modrana though I already said this happens with several Navig apps). No reason to believe that it was huge or I noticed memory usage trend indicative of leaks. STILL, to pre-empt above possibility I wrapped modrana in a bash script to be restarted every 1 minute. The bug still got triggered within that minute. Rather than display and memory leaks etc. when internet connectivity was either very smooth or completely disabled, the bug did not get triggered for several hours of continuous usage no matter whether display was on or off, no matter whether other applications popped up messages in the foreground or whatever. It got triggered much more frequently with poor internet connectivity. I think with the observations I have gathered till now, I might be able to provide a way to reproduce this bug reliably, perhaps using home router to create poor network conditions when the app is trying to load map images etc. Will post if I manage to do so. > > Among other things PR1.3 fixes last know (smallish) memory leaks in the > pre-installed applications. 
Doesn't mean much for the present problem unless the pre-installed navig application has more functionality and interfaces to load gpx or other file formats etc. > > > > No luck so far. It could be that the issue causing circumstances > > do not reflect in the log. > > I think it's better to first wait until PR1.3 is released and see whether this > is still reproducible. It has several fixes to the SGX drivers. That's certainly interesting. Keeping fingers crossed since there is no word whatsoever on WHEN that's going to be released. Mayuresh
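The restart-every-minute wrapper mentioned above was not posted, so the following is only a guess at what such a wrapper might look like. run_for is a hypothetical helper, and the command used to launch the application (here simply "modrana") is an assumption that may need adjusting.

```shell
#!/bin/sh
# Sketch only: not the wrapper actually used in this comment.

# Run a command in the background, wait $1 seconds, then terminate it.
# Returns the command's exit status (non-zero if it had to be killed).
run_for() {
    limit=$1
    shift
    "$@" &
    pid=$!
    sleep "$limit"
    kill "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null
}

# Usage: restart the application every 60 seconds so that a slow leak
# cannot accumulate (the launch command is an assumption):
#   while true; do run_for 60 modrana; done
```

If the hang still appears within a single 60-second run, as reported here, gradual memory growth in the application becomes a much less likely explanation.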
(In reply to comment #45) > dbus-monitor will tell what kind of dbus messages there are flying in the > device, "xresponse -a '*' -w 0" will tell what kind kind of window updates the > programs do and e.g. powertop will tell what extra processes had wakeups during > that time. I have captured top, powertop, xresponse, syslog and dbus-monitor logs for a session that ended in triggering the issue. The issue was triggered as usual while on the move for 2/3 hours with internet connection ON and a navig application - modrana - on.
syslog: Seen garbled (with binary characters) when the crash happens. Previous few messages suggest network registration activity. Some also show "...or the network connection was broken" error message, though the process that logs this message varies. But such message comes in normal run also. This time I did not see SGX recovery triggers, though I have seen them in logs in previous such crashes.
top: This time I could NOT see sgx_misr taking all the cpu though I have noticed this before. (It could be that the top logging got choked and this did not get logged.) There was nothing abnormal in top log. Memory taken by modrana was 38%.
xresponse: The log showed the following 3 messages repeating several times towards the end of the log:
4898834ms : 2ms : Unmapped window 0x1a000fc (hildon-home)
4898836ms : 2ms : Unmapped window 0x1a000fc (hildon-home)
4898837ms : 1ms : Destroyed window 0x1a000fc (hildon-home)
Throughout the log it shows "Got damage event" from SCREEN, hildon-home, hildon-status-menu and modrana. Seems nothing unusual as these messages are seen throughout the run.
powertop and dbus-monitor: The logs did not show anything different at the point of crash.
Note: I am still on PR1.2. Will upgrade soon, though above investigation will help re-assess the issue in PR1.3. Will appreciate pointers to analyze this information further. Regards, Mayuresh.
(In reply to comment #49) > syslog: > > Seen garbled (with binary characters) when the crash happens. Clearly as just part of some specific process' syslog message content, or like syslog content itself got garbled (which could be either syslog process or file system corruption issue...)? > Previous few messages suggest network registration activity. Some also show > "...or the network connection was broken" error message, though the process > that logs this message varies. But such message comes in normal run also. It would be pretty bad if the network stack corrupted kernel memory (or it could be SGX corrupting it and it showing up with network). Is there anything suspicious in your oopslog (/dev/mtd2) for PR1.2 release? > This time I did not see SGX recovery triggers, though I have seen them in logs > in previous such crashes. > > top: > > This time I could NOT see sgx_misr taking all the cpu though I have noticed > this before. (It could be that the top logging got choked and this did not get > logged.) Maybe this was a different issue. Because syslog access is synchronous, it's possible for there to be a priority inversion situation between processes. (which makes debugging things awkward...) > Note: I am still on PR1.2. Will upgrade soon, though above investigation will > help re-assess the issue in PR1.3. Can you reproduce this bug with PR1.3?
FWIW I've seen those hangs with PR1.3, but I have nothing handy to debug.
(In reply to comment #51) > FWIW I've seen those hangs with PR1.3, but I have nothing handy to debug. > I have just written the following script to log the activity continuously. Besides these, the contents of /var/log/syslog should be captured for the time window of interest. Will appreciate comments regarding usage of the log tools used, what else may be worth logging etc.

#!/bin/sh
# NOTE: This will create various monitoring logs till the script is killed.
# Be aware that it may fill up the media card space. Be sure to close the script
# after the watch window is over. Also periodically delete the logs created
# or move them to a computer with sufficient storage space.

# Kill all background loggers when this script exits.
trap "kill 0" EXIT

LOGDIR="/media/mmc1/logs"
NEWLOGDIR="$LOGDIR/$(date +%Y%m%d%H%M%S)"
mkdir -p "$NEWLOGDIR"
cd "$NEWLOGDIR" || exit 1

# -b: batch mode, so that top writes plain text suitable for a log file
top -b > top.log &
xresponse -a '*' -w 0 > xresp.log &
dbus-monitor > dbus.log &

while [ true ]
do
    sp-oops-extract /dev/mtd2
    sleep 5
done > oops.log &

while [ true ]
do
    date
    echo "====="
    powertop
done > powertop.log &

# Keep the script alive until it is killed.
while [ true ]
do
    sleep 86400
done
(In reply to comment #52) > while [ true ] > do > sp-oops-extract /dev/mtd2 > sleep 5 > done > oops.log & This isn't needed. Like syslog, oopslog will be there after reboot (SDK syslog is AFAIK rotated on bootup based on log size and oops partition is used as oops record ring buffer).
(In reply to comment #53) Thanks. Will drop that. Just now the issue got triggered once again. The device was at home with modrana running and connected with a laptop via ssh over wifi. Observations: 1. We are probably talking about two (maybe several) different issues with the same end-user perceived symptom - hanging/having to take out battery etc. 2. This time I had switched off the display and this time sgx_misr cpu usage triggering was seen. (Last few times when display was left on with appropriate setting, sgx_misr wasn't taking the cpu though the device was hung.) 3. Thanks to top log I can tell that at the point at which sgx_misr CPU usage shot up, Xorg was taking 85% memory. I think that's the most prominent observation from all the logs. 4. I have got all the logs produced by above script and can share any information anyone wants. Regards, Mayuresh.
> 3. Thanks to top log I can tell that the point at which sgx_misr CPU usage shot > up, Xorg was taking 85% memory. Sorry, that was wrong observation. (In the log, the column titles and contents do not come properly aligned.) 85% was CPU usage, not memory.
Most interesting thing is whether this is reproducible with PR1.3...
I am running PR1.3, and am seeing this often now. I was NOT seeing it before the PR1.3 update, but have had this happen about 4 times now, verified via SSH login:
Mem: 207244K used, 38056K free, 0K shrd, 1740K buff, 66536K cached
CPU: 0.0% usr 100% sys 0.0% nice 0.0% idle 0.0% io 0.0% irq 0.0% softirq
Load average: 1.10 1.02 0.93
PID PPID USER STAT RSS %MEM %CPU COMMAND
868 2 root RW 0 0.0 81.6 [sgx_misr]
4704 4684 root R 588 0.2 9.0 top
10 2 root SW 0 0.0 9.0 [omap2_mcspi]
I have found it happens most often when modRana is running tile download caching (memory, GPS, CPU & FS intensive). I didn't have this issue before the PR1.3 update, with modRana and Mappero being regularly used items. (I did OTA update for PR1.3.) I've also found that text messages and alarms don't go off when this is happening. (But are time-stamped properly after rebooting.) I do have power-kernel installed, so if this is a kernel-level fix, that may be part of the issue. Since it's semi-reproducible, would it be of help if I installed the stock PR1.3 kernel to see if it helps?
I know it sounds weird, but by now I'm sure it happens near certainly when in my trousers' pocket. That may mean there's a hardware problem when it's somewhat pressed. At first I thought it may be linked to the acceleration sensor, but it doesn't happen when it's in my jacket.
FYI: I've repeated this with stock kernel in PR1.3 on a couple of occasions now. It seems to happen mainly when I have either a nav app running (like modRana) or a flash game running in the browser. Both are graphic, CPU, and memory intensive. I've also had instances where, minutes after shutting down such apps, changing between Wifi and GPRS triggers this. Let's be clear, this is a system-level error, not an app error as comment 47 seems to imply. Sure, an app shouldn't eat all system memory. But frankly, a user process eating too much memory should trigger a process kill, not a system lockup. A kernel driver going into a spin and requiring a reboot to fix isn't an acceptable result of low memory. The kernel should be able to do a process kill if it's dead out of memory and needs it, most Linux kernels have this built in. This driver clearly has issues, and needs to be fixed. For those stuck with this bug, I posted a script (link below) to monitor the driver and reboot if it starts spinning, so as to have a clean shutdown and not burn through your entire battery. If it would be helpful to add some collection code to that before the reboot, let me know. I'd be happy to add it. http://talk.maemo.org/showpost.php?p=890903&postcount=1
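The linked script is not reproduced in this thread. As a rough, independent sketch of the idea (not the script from the link), a watchdog could sample top and reboot once [sgx_misr] stays busy. The function names, the 80% threshold and the 60-second interval are all assumptions, and the column layout is taken from the busybox-style top output pasted earlier in this bug.

```shell
#!/bin/sh
# Sketch only: an illustration of the approach, not the linked script.

# Print the %CPU of the [sgx_misr] kernel thread from top batch output on
# stdin (columns: PID PPID USER STAT RSS %MEM %CPU COMMAND), or 0 if the
# thread is not listed.
sgx_misr_cpu() {
    awk '$NF == "[sgx_misr]" { cpu = $(NF - 1) } END { print cpu + 0 }'
}

# Succeed when the CPU share in $1 is at or above the threshold in $2.
is_spinning() {
    awk -v cpu="$1" -v limit="$2" 'BEGIN { exit !(cpu + 0 >= limit + 0) }'
}

# Usage, as root. Requiring three consecutive busy samples is an arbitrary
# choice meant to avoid rebooting on a short spike:
#   busy=0
#   while sleep 60; do
#       if is_spinning "$(top -b -n 1 | sgx_misr_cpu)" 80; then
#           busy=$((busy + 1))
#       else
#           busy=0
#       fi
#       [ "$busy" -ge 3 ] && reboot
#   done
```

Whether a single "top -b -n 1" sample gives a meaningful CPU figure on the device's busybox is untested here, so the sampling command may need adjusting. Any log collection would go just before the reboot call.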
(In reply to comment #59) > Lets be clear, this is a system-level error, not an app error as comment 47 > seems to imply. If the issue is e.g. app needing larger GL command buffer than what's available by default, it can add a /etc/powervr.d/ file[1] where it sets a suitable command buffer size for itself (I have no idea whether there's documentation on this, maybe Imagination's website has something). [1] "strace -e open -f <application>" will tell which files SGX libs try to open. > Sure, an app shouldn't eat all system memory. Problem isn't related to things eating all virtual memory (as that would be very hard because swap size is huge), just a lot of it & using it actively so that device swaps a lot. > But frankly, a user process eating too much memory should trigger > a process kill, not a system lockup. A kernel driver going into a spin SGX resets are about what the SGX HW microkernel does e.g. when rendering is too slow. CPU side kernel driver is just reporting that, it doesn't have control over it or know which of the processes pushing stuff to SGX pipeline might be causing it. > and requiring a reboot to fix isn't an acceptable result of low memory. > The kernel should be able to do a process kill if it's dead out of memory If this is a question of SGX itself running out of memory, there's no API in SGX driver to get information about this nor which process is using most of it (no idea whether the current driver is even tracking that). > and needs it, most Linux kernels have this built in. The kernel has OOM-killing enabled, but it works only if device runs out of normal (virtual) memory, it cannot help with SGX issues. > This driver clearly has issues, and needs to be fixed. While the driver has issues which AFAIK are going to be handled only in Harmattan version of it, what the applications do has a very large effect on triggering them. 
I've never heard of anybody being able to trigger this with pre-installed device software, it happens only when running 3rd party SW.
Pretty sure I had this problem with a clean PR1.2 flash. I /don't/ see it anymore, still on PR1.2. Maybe coincidence, but I think it stopped when I stopped leaving GPS active 24/7.
Some of the apps that I've found which trigger this are QT BASED APPS, meaning they're not using the SGX driver directly. The whole point of QT is to not have to do machine-level tweaking like this. Further, there doesn't appear to be any documentation about how to set the "Command buffer size" for the PowerVR driver. Are the options for this ini file documented anywhere? Even Google power search isn't turning up anything useful (in this forum or outside). In any case, this is a *closed source* kernel driver, which is causing the entire device to die randomly. I think people would be ok with it if it threw an error, or skipped a drawing request, or even crashed the calling program. *Anything*, even rebooting, would be a better result than silently spinning in a tight kernel loop, eating 100% CPU, completely draining the battery, all the while disabling the only UI on the device. The fact that this major bug has existed since PR 1.1, and now appears to be headed toward a "WONTFIX" state is really sad. This bug was marked as a "critical show stopper" by Nokia in Feb, almost a year ago. Yes, shipped software doesn't trigger it. Shipped software also doesn't provide turn-by-turn navigation or lots of features. If this device were limited to Nokia-only software it would be quite boring, and not purchased by most people who now own one. I'll tell you this much: If Nokia goes "WONTFIX" on something as critical as a display driver on a 4th gen device a year and 3 patches later, what confidence is anyone going to have with MeeGo? The fact that it's still labeled "New" and "unassigned" is telling all by itself.
Does somebody have reliably *reproducible* test-case for this with PR1.3? Then it at least can be checked what happens in the device. (In reply to comment #62) > Some of the apps that I've found which trigger this are QT BASED APPS, > meaning they're not using the SGX driver directly. > The whole point of QT is to not have to do machine-level tweaking like this. If that really is the issue with those Qt apps, whether something is using HW indirectly or directly is really immaterial in regards to what the program does with the HW. There being extra SW layers in-between rarely helps things if app is trying to use more resources than are available (typically it makes things worse as developer is less aware of those issues). > Further, there doesn't appear to be > any documentation about how set the "Command buffer size" for the PowerVR > driver. Are the options for this ini file documented anywhere? Maybe in the “OMAP35x Graphics SDK Getting Started Guide” or other documents referred by it? > Even Google power search isn't turning up anything useful > (in this forum or outside). You may need to register to get SGX SDK & its documentation from Imagination. > In any case, this is a *closed source* kernel driver, which is causing the > entire device to die randomly. The kernel driver is AFAIK open source. The user space libraries are closed source. > I think people would be ok with it if it threw > an error, or skipped a drawing request, I don't think it's that simple. SW might be executing the commands asynchronously for performance reasons and not waiting for return values. The HW throws an error (resets itself) and as result stuff actually *isn't* being drawn. For you that (UI frames not being drawn) seems like UI freeze. > or even crashed the calling program. I don't think kernel driver knows what process pushing things to the graphics pipeline eventually caused the HW stall/reset and whether HW will eventually recover from it. Crashing e.g. 
X server because some other app caused GFX pipeline stall from which system might recover, would cause device reboot. Not very desirable. AFAIK one of the triggers for GFX HW state reset is also SW trying to do things that are too slow on SGX. There are many things (not just GL related) that user-space processes (even ones not running as root) can do which can completely ruin the device usability. Applications need to behave nicely, not DOS the device. > *Anything*, even rebooting, would be a better result than silently spinning > in a tight kernel loop, eating 100% CPU, completely draining the battery, > all the while disabling the only UI on the device. > > The fact that this major bug has existed since PR 1.1, and now appears to It existed already earlier and there have been additional fixes to it in every release (some also in PR1.3). > be headed toward a "WONTFIX" state is really sad. This bug was marked as a > "critical show stopper" by Nokia in Feb, almost a year ago. > > Yes, shipped software doesn't trigger it. Shipped software also doesn't > provide turn-by-turn navigation or lots of features. Why turn-by-turn navigation needs GL? Embedded GL HW is typically suited for smallish textures and a good amount of geometry, that's why they're a better match for (suitably designed) games than applications which typically have lots of large textures but very little geometry. Even worse is if such an application tries to use 32-bit textures, they more than halve the speed compared to 16-bit ones. > If this device were limited to > Nokia-only software it would be quite boring, and not purchased > by most people who now own one. > > I'll tell you this much: If Nokia goes "WONTFIX" on something as critical > as a display driver on a 4th gen device a year and 3 patches later, what > confidence is anyone going to have with MeeGo? In regards to graphics drivers (kernel & user-space); there are large changes in them, X server, Qt etc. in MeeGo. 
If you check meegotouch library & applications that are publicly available (e.g. on ARM MeeGo release), they're all using GLES for drawing and should be working fine (in regards to issue discussed in this bug). Doing such huge changes to a working release (Fremantle) which is in a bugfixes-only mode is out of question. Disclaimer: I'm not related to the development of the drivers or haven't even used them, information here is at best second hand. > The fact that it's still labeled "New" and > "unassigned" is telling all by itself. For Nokia SW issues, typically the state after those is fixed (or wontfix/invalid). The intermediate bug states are mainly used by non-Nokia SW here in bugs.maemo.org (the community bugzilla).
(In reply to comment #63) > Does somebody have reliably *reproducible* test-case for this with PR1.3? > Then it at least can be checked what happens in the device. I had a 100% reliable test-case for PR1.2 on MY device and got tired of asking for ways to help you diagnose it further. If Nokia ever manages to release a FIASCO image for the UK variant of PR1.3 I'm sure I'll be able to reproduce it in PR1.3 too.
(In reply to comment #64) > (In reply to comment #63) > > Does somebody have reliably *reproducible* test-case for this with PR1.3? > > Then it at least can be checked what happens in the device. > > I had a 100% reliable test-case for PR1.2 on MY device If you refer to comment 31, you said that you had older version of Maep which behaved badly. It leaked memory like sieve and did frequent window updates on background. I had reported a bug against it as such behavior is obvious performance & use-time issue for the whole device and that got fixed. After it was fixed, the issue didn't happen again. > and got tired of asking for ways to help you diagnose it further. There are many easy ways to find bad behavior[1] in apps. - memory leakage: sp-endurance - background activity & window updates: strace & xresponse See: http://wiki.maemo.org/Documentation/devtools/maemo5 (For diagnosing SGX reset issues, none that I know.)
> I had a 100% reliable test-case for PR1.2 on MY device and got tired of asking You mean, you have exact steps to reproduce the issue? Can you share them? I have been able to trigger the issue on many occasions till now though don't have a reliable way to reproduce it. Mayuresh.
(In reply to comment #63) > Maybe in the “OMAP35x Graphics SDK Getting Started Guide” or other documents > referred by it? Finally, something useful. At least that I can look up and see if there's a bandaid one can put on this mess. Sadly, it looks like they only are offering the 4.0 SDK, which is not what we're based on. But it may have something related at least. (And yes, I have a TI developer account, as I work with TI chipsets routinely in my "real job".) (In reply to comment #63) > The kernel driver is AFAIK open source. The user space libraries are closed > source. Really? Would you happen to know its source file name? From everything I've found this is a closed source module. I can't imagine that the kernel driver is open source, since it's the part directly touching the hardware. If it IS open, I (or the community) could look at it and see if we can find where it's spinning and patch the community based kernel. (In reply to comment #63) > Crashing e.g. X server because some other app caused GFX > pipeline stall from which system might recover, would cause > device reboot. Not very desirable. Actually, YES! Yes it IS more desirable to have it reboot. What's NOT desirable is to have it spin and drain the battery. Tell me, which is more desirable to you: A device that sometimes reboots randomly but usually works fine, or one that you can't rely on because it randomly and *silently* drains its entire battery? My vote is to reboot. (In reply to comment #63) > Does somebody have reliably *reproducible* test-case for this with PR1.3? > Then it at least can be checked what happens in the device. I've had it happen while using MicroB and/or modRana, and on occasion while just sitting with nothing running user side. But then you *have* a reliable way to reproduce this: an old version of Maep. (In reply to comment #65) > you said that you had older version of Maep which behaved badly. 
No matter how poorly a program is written, a USER SPACE program should never be able to cause a kernel to spin/lock. THIS IS A KERNEL ISSUE. Even older systems have measures in place to prevent user programs like "while(1) fork();" from bringing down a system. This is no different. I'm really tired of the game you're playing here in blaming apps. If you had a car that sometimes shut down randomly, but did so reliably if you drove it over 70kmh, your mechanic saying "Well, don't go over 60kmh!" wouldn't fly. That's essentially what you're saying here. A key kernel driver is spin-locking under load, and your reply is "lighten the load by filing bugs against other programs". This issue CAN be reproduced reliably with a known (admittedly buggy) version of software. You have the tools you need to fix it, but are asking that the tools be fixed to accommodate the bug instead! If the issue is that UNDER LOAD the hardware resets or faults, saying "don't stress the device" is not a solution. The solution is fixing the kernel driver to handle the fault state, or limit the load, or in a worst case make it eventually give up and trigger a reboot. SPINNING FOREVER IS NOT AN ACCEPTABLE SOLUTION FOR A KERNEL DRIVER, EVER. And that's exactly what's happening here! (In reply to comment #63) > Why turn-by-turn navigation needs GL? Way to be snarky... Look at the apps triggering this issue. Most often they have been nav apps. Maep, mappero, modRana, are ALL apps that trigger this, and ALL apps that revolve around navigation. Most of them exist only because the OVI maps app lacks (among many other things) turn-by-turn. None of the above mentioned apps may actually NEED GL. Most of them NEED to do screen drawing though, and use existing libraries (like QT) that use GL in doing drawing on their behalf. Even if this is caused by X updating the screen on behalf of an app, the issue still persists. 
(In reply to comment #63) > Doing such huge changes to a working release (Fremantle) which is in a > bugfixes-only mode is out of question. Define working. I'm not sure I'd classify a mobile device that randomly drains its battery as "working". This is a critical bug. If you're not going to fix a critical level bug while in "bugfixes-only mode", then why have such a mode? Nobody is asking for a back-port of the entire MeeGo display driver! We're asking for support in stopping this *broken* behavior. Even a solution that triggers the watchdog and reboots the device when it spins in this state would be preferable to what it does now.
About crash-and-reboot - Hey, hey - it is not the only option, it is still possible to configure upstart in a way which would cause a restart of X11 and GUI apps like hildon etc after X11 failure-and-crash, but keep running server applications. I am not sure, but it seems that MOST of GUI apps are able to safely restart via DSME tools. At least some thread in talk.maemo.org has a script to periodically restart some amount of GUI and servers to avoid fragmentation in swap and memory.
+1 for reboot being way preferable to the current behavior. (In reply to comment #67) > I've had it happen while using MicroB and/or modRana, and on occasion while > just sitting with nothing running user side. But then you *have* a reliable way > to reproduce this: an old version of Maep. I haven't been able to notice a difference between old and new (including the latest) versions of Maep; both trigger this bug routinely.
> I haven't been able to notice a difference between old and new (including the > latest) versions of Maep; both trigger this bug routinely. Bump. STOP CLAIMING MAEP FIXED THE PROBLEM ON THIS THREAD. It triggers with maep as well. It has been reported with microb as well and at least once with ovi maps as well somewhere on above thread. I also have top logs where the issue got triggered with modrana taking just 13% memory. Not a stressful situation by any standards. All the times the issue got triggered, there was some activity on the GPRS/wifi connection such as reconnecting, poor signal quality etc. High resource usage could be playing some part, but it's not only that. I agree with comment 67. Either provide at least half decent navig app or stop passing on the buck to apps that are doing a good job of filling the void. It has been reasoned a number of times on this thread how the kernel has to be responsible for this.
> I also have top logs where the issue got triggered with modrana taking just 13%
> memory. Not a stressful situation by any standards. All the times the issue got
> triggered, there was some activity on the GPRS/wifi connection such as
> reconnecting, poor signal quality etc. High resource usage could be playing
> some part, but it's not only that.

Want to add one more point. If I simply keep the internet connection switched off, no matter how much stress (memory/CPU) I put on the device, I have never managed to trigger the issue. I can create that stress by panning fast in modrana for a long time until it takes a lot of memory and finally appears hung. In such instances, though, the device is NOT hung. You can recover from this situation. In fact you get a yellow message box saying that a new application can't be allocated memory etc. That's quite acceptable.

Now just keep the internet connection on. No matter what the stress level is, the issue may trigger. There have been occasions where the issue triggered with modrana using as little as 13% of memory, within about a minute of launching it.
I also would really prefer the device to reboot. The current behavior means:
- if I don't see it right away, the battery is drained fast
- if I see it quickly enough, I have to pull out the battery (the only means to reboot the device) and put it back, which means setting the date and time again, because the N900 doesn't do it itself (I can't remember which bug# it is). Awfully annoying, I have lots of photos dated 1/1/2009 because of this.
(In reply to comment #67) >> The kernel driver is AFAIK open source. The user space libraries >> are closed source. > > Really? Would you happen to know it's source file name? > From everything I've found this is a closed source module. > I can't imagine that the kernel driver is open source, since > it's the part directly touching the hardware. In that case it's better to rely on facts than on hearsay or imagination. :) 2 minutes of googling gives the used upstream kernel and changes on top of it: http://repository.maemo.org/pool/fremantle/free/k/kernel/kernel_2.6.28-20103103+0m5.diff.gz kernel-2.6.28/drivers/gpu/pvr/module.c seems like a good starting point. > (In reply to comment #63) > > Crashing e.g. X server because some other app caused GFX > > pipeline stall from which system might recover, would cause > > device reboot. Not very desirable. > > Actually, YES! Yes it IS more desirable to have it reboot. What's NOT > desirable is to have it spin and drain the battery. > > Tell me, which is more desirable to you: A device that sometimes reboots > randomly but usually works fine, or one that you can't rely on because it > randomly and *silently* drains it's entire battery? My vote is to reboot. Have you checked how many times in your case the HW would recover within reasonable time or how many times it has happened even without you noticing it from the UI (I think you should see this from syslog)? If device is just rebooted, you lose all unsaved data in your apps. Bug 7017 would also indicate that it might be possible that reboot doesn't always fix the issue (in which case device with your suggested solution would be in reboot loop and anyway empty the battery). Based on comments in that bug, it would seem to be fixed for PR1.3 (when not meddling with device voltages) Regarding bug 7017, has anybody encountering this issue modified device CPU voltage or Mhz values? 
> (In reply to comment #63) >> Does somebody have reliably *reproducible* test-case for this with PR1.3? >> Then it at least can be checked what happens in the device. > >I've had it happen while using MicroB and/or modRana, and on occasion while >just sitting with nothing running user side. But then you *have* a reliable >way to reproduce this: an old version of Maep. Not in normal usage. It had to be attached to charger so that I could run it over a day. (that was in spring, before PR1.2. If I remember correctly, it didn't happen with PR1.2.) > (In reply to comment #65) > I'm really tired of the game you're playing here in blaming apps. > If you had a car that sometimes shutdown randomly, but did so > reliably if you drove it over 70kmh, your mechanic saying > "Well, don't go over 60kmh!" wouldn't fly. That's essentially > what you're saying here. And if the reason is that you kept brake down all the time and didn't use clutch while changing the gears, guess what the mechanic would say? ;-) Buggy user processes can DOS a normally configured Linux system very easily[1], the reason why you don't see such applications on Desktop is that typically they don't get into distro repos if they behave too badly. [1] I can easily think of ways how process can DOS Linux with either: - D-BUS (lots of messages which recipient doesn't process), - X server (server or input grab or just lots of request), - disk (writing huge amounts of data without madvise when you have slow disk), - memory (constantly dirtying more memory than system has RAM), or - GL usage (well, this breaks things even with apps accepted to distros). - use higher priority, but not work appropriately for that I've myself encountered all of these, but I'm sure there are many other ways bad program can DOS our modern multi-purpose operating systems. 
X & D-BUS don't offer mechanisms to control the above, and because there can be valid reasons for some programs to (rarely) temporarily do the rest of these things, they are allowed. If you have programs that cause this kind of issue only together, they also need to be fixed, but finding out about the issue is much harder and you need to judge what is and isn't justified in the program. But for example drawing to windows when they aren't visible isn't justified under any circumstances.

> A key kernel driver is spin-locking under load, and your reply is
> "lighten the load by filing bugs against other programs".

As I stated, AFAIK:
- One of the conditions where the HW/driver tries to fix issues by resetting SGX is when an operation is too slow
- This can be caused by an application asking the HW to do more than is reasonable (the operation it asks for is too slow)
- Operations are pipelined to the HW, so the current SGX user isn't necessarily the cause of the stall
- Therefore the kernel has no idea which process is causing the issue (killing processes randomly isn't reasonable either)

> This issue CAN be reproduce reliably with a known (admittedly buggy)
> version of software.

I'm not working for Fremantle anymore[1], but when I was, I couldn't reproduce these in "normal" conditions. I've also understood that Imagination wasn't able to reproduce the issue with older Maep for newer driver releases (like the one in PR1.2, I think). If these issues were easily reproducible (say within a day) by the developers, of course they would have been fixed earlier. I think there are also other conditions that need to be present for these issues to happen.

> (In reply to comment #63)
> > Why turn-by-turn navigation needs GL?
>
> Way to be snarky... Look at the apps triggering this issue.
> Most often they have been nav apps. Meap,

Maep works fine for me. I've never seen an SGX reset in normal usage. And Maep's so nice that I don't use the others[1].
[1] Take my comments as from another, potentially more knowledgeable, N900 user. :-) > mappero, Mappero I know to use Clutter i.e. GL. > modRana, are ALL apps that trigger this, If it's this: http://maemo.org/packages/package_instance/view/fremantle_extras-devel_free_armel/modrana/0.20-3/ It actually would seem to be using python & cairo, not GL. > and ALL apps that revolve around navigation. [...] > None of the above mentioned apps may actually NEED GL. Most of them > NEED to do screen drawing though, and use existing libraries (like QT) > that use GL in doing drawing on their behalf. Even if this is caused > by X updating the screen on behalf of an app, the issue still persists. If any of them do window updates on background, they're just broken and should be fixed. Regardless of whether that triggers SGX resets. > (In reply to comment #63) > Nobody is asking for back-port of the then entire MeeGo display driver! > We're asking for support in stopping this *broken* behavior. I've understood that the case I referred above (too slow drawing by app) AFAIK isn't something that will be fixed in the SGX drivers. Expectation of things working faster than certain (very low/unusable) minimum speed is AFAIK builtin to the (user-space) driver and cannot be "fixed". I would assume such buggy apps should be pretty obvious to users though, they always draw things too slowly & cause the issue.
(In reply to comment #71)
>> I also have top logs where the issue got triggered with modrana taking
>> just 13% memory. Not a stressful situation by any standards. All the times
>> the issue got triggered, there was some activity on the GPRS/wifi connection
>> such as reconnecting, poor signal quality etc. High resource usage could be
>> playing some part, but it's not only that.

You've verified this is the whole-device UI freeze caused by SGX issues, e.g. by sshing to the device and seeing the sgx kernel thread highest in top or SGX resets in dmesg? You haven't modified device CPU MHz or voltages?

> Want to add one more point. If I simply keep internet connection switched off,
> no matter how much stress (memory/cpu) I cause on the device I have never
> managed to trigger the issue.
...
> Now just keep the internet connection on. No matter what the stress level is,
> the issue may trigger. There have been occasions of triggering the issue wth as
> low memory usage by modrana as 13% and within a matter of minute of launching
> it.

Very interesting. Does this happen with both GPRS and Wifi or only with one of them? Or only when the device is moving and switching between base stations?

(I'm wondering whether there could be any relation to bug 9116 / could this issue actually be related to radios instead of SGX.)

Does it also happen when the battery is full?

(I'm wondering whether battery consumption could affect it.)
(In reply to comment #68)
> Hey, hey - it is not the only option, it is still possible to configure upstart
> in a way which would cause a restart of X11 and GUI apps like hildon etc after
> X11 failure-and-crash, but keep running server applications.

UI related startup takes most of the bootup time. If one is restarting the whole UI session and losing all of the user's unsaved data anyway, it would be much safer just to do a full reboot in such a stressed situation.
(In reply to comment #74) > (In reply to comment #71) > >> I also have top logs where the issue got triggered with modrana taking > >> just 13% memory. Not a stressful situation by any standards. All the times > >> the issue got triggered, there was some activity on the GPRS/wifi connection > >> such as reconnecting, poor signal quality etc. High resource usage could be > >> playing some part, but it's not only that. > > You've verified this is the whole device UI freeze caused by SGX issues e.g. by > sshing to the device and seeing sgx kernel thread highest in top or SGX resets > in dmesg? In above instance I mentioned it wasn't sgx, it was X server that was consuming a lot of CPU. (I have already provided the details in one of the posts above.) Nevertheless, as a consumer it means the same to me. > > You haven't modified device CPU MHz or voltages? No. And I don't know how to. > > Want to add one more point. If I simply keep internet connection switched off, > > no matter how much stress (memory/cpu) I cause on the device I have never > > managed to trigger the issue. > ... > > Now just keep the internet connection on. No matter what the stress level is, > > the issue may trigger. There have been occasions of triggering the issue wth as > > low memory usage by modrana as 13% and within a matter of minute of launching > > it. > > Very interesting. Does this happen with both GPRS and Wifi or only with the > other one? Or only when the device is moving and switching between base > stations? Noticed more often with GPRS and noticed when roaming i.e. possibly related to the quality of network. (I have already logged these details on this thread.) > (I'm wondering whether there could be any relation to bug 9116 / could this > issue actually be relate to radios instead of SGX.) > > Does it happen also when battery is full? > > (I'm wondering whether battery consumption could affect it.) Yes.
(In reply to comment #76) >> You've verified this is the whole device UI freeze caused by SGX issues >> e.g. by sshing to the device and seeing sgx kernel thread highest in top >> or SGX resets in dmesg? > > In above instance I mentioned it wasn't sgx, it was X server that was > consuming a lot of CPU. > > (I have already provided the details in one of the posts above.) You had above commented about both the SGX and the other issues, so I wasn't sure which one this is. So, to verify again, the SGX issue isn't reproducible for you, but you have some other (potentially X usage related) issue which is reproducible? > Nevertheless, as a consumer it means the same to me. If you don't see SGX resets or sgx_misr kernel thread using all CPU, it's a separate issue which should be discussed in separate bug. This bug is only about the (currently non-reproducible) SGX issue.
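The check asked for above (are there SGX resets in the kernel log or not?) can be sketched in a few lines of Python. This is only an illustration: the function name find_sgx_resets is made up, and the pattern is taken from the single syslog line quoted in the original report, so other firmware versions may word the message differently.

```python
import re

# Pattern based on the kernel message quoted in the original report:
# "kernel: [100992.835906] HWRecoveryResetSGX: SGX Hardware Recovery triggered"
SGX_RESET = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+HWRecoveryResetSGX: SGX Hardware Recovery triggered"
)


def find_sgx_resets(log_text):
    """Return the kernel timestamps (seconds since boot) of all SGX reset lines."""
    return [float(m.group(1)) for m in SGX_RESET.finditer(log_text)]


if __name__ == "__main__":
    # Two lines copied from the syslog excerpt in the original report.
    sample = (
        "Feb 17 19:45:35 Nokia-N900-02-8 kernel: [100992.835906] "
        "HWRecoveryResetSGX: SGX Hardware Recovery triggered\n"
        "Feb 17 19:47:28 Nokia-N900-02-8 kernel: [101105.542236] "
        "slide (GPIO 71) is now open\n"
    )
    resets = find_sgx_resets(sample)
    if resets:
        print("SGX resets at kernel time:", resets)
    else:
        print("No SGX reset found, probably a different issue")
```

Feeding it the contents of /var/log/syslog (or saved dmesg output) from a frozen device would show whether a given freeze belongs to this bug or to a separate one.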
(In reply to comment #77) > So, to verify again, the SGX issue isn't reproducible for you, but you have > some other (potentially X usage related) issue which is reproducible? > > If you don't see SGX resets or sgx_misr kernel thread using all CPU, it's a > separate issue which should be discussed in separate bug. This bug is only > about the (currently non-reproducible) SGX issue. When this happens to me with recent versions of Maep (which is all the time if I let the screen go dark while navigating), I see a SGX reset message in dmesg and top showing X taking all CPU in D state. I don't think that necessarily means it's not sgx_misr kernel thread using all CPU. (See also my comment #40)
(In reply to comment #78)
> > If you don't see SGX resets or sgx_misr kernel thread using all CPU, it's a
> > separate issue which should be discussed in separate bug. This bug is only
> > about the (currently non-reproducible) SGX issue.
>
> When this happens to me with recent versions of Maep (which is all the time if
> I let the screen go dark

Hm. Is Maep still/again doing window updates when its window isn't visible?

Could you check that from e.g. an ssh console with "xresponse -w 0 -a '*'"[1] and, if Maep indeed is doing window updates while the screen is blanked or it's in the background, file a use-time & performance [2] bug against Maep?

[1] http://wiki.maemo.org/Documentation/devtools/maemo5/xresponse
[2] Because a window update means that a large amount of memory is actively (but unnecessarily) used, this is also a device memory usage issue. Due to the SGX issue discussed here, it's also a (triggering) reliability issue.
-> Non-visible window updates are really evil.

> while navigating),

So you need to be moving to get this issue? Do you need to have network also enabled?

> I see a SGX reset message in dmesg and top showing X taking all CPU
> in D state.

In that case X isn't taking CPU (unlike for Mayuresh) as it's in D state, and you have an SGX reset. So it sounds like an issue that belongs to this bug.

> I don't think that necessarily means it's not sgx_misr kernel thread
> using all CPU.

There are two SGX issues. One where you see the sgx_misr kernel thread taking all CPU (which I've never seen myself, just mentioned here in bugzilla) and SGX resets (which I was able to reproduce in spring). I think both can be handled in this bug.
(In reply to comment #77)
> So, to verify again, the SGX issue isn't reproducible for you, but you have
> some other (potentially X usage related) issue which is reproducible?

Hang on. I didn't say the sgx issue is not reproducible. In fact, in comment 54 I already said that there are 2 (and possibly more) different issues with the same end-user perceived symptom. I just said that in one of the logs when the device hung, sgx was not taking 100% CPU. But in some other log it was.

My point is, modrana WASN'T taking excessive memory in EITHER of the scenarios. E.g. the following are 3 consecutive snapshots of top output, taken at 5s intervals, from the log where sgx GOT triggered:

Snapshot 1: sgx is yet to trigger. NOTE: modrana memory 20% NOT HIGH
884 713 root R < 14036 5.7 85.5 /usr/bin/Xorg -logfile /tmp/Xorg.0.log
3621 3620 user S 50476 20.4 11.1 python2.5 modrana.py n900
767 1 messageb S < 2100 0.8 0.6 /usr/bin/dbus-daemon --system --nofork
2231 2228 root R 740 0.3 0.6 top
1153 1028 user S 9624 3.9 0.4 /usr/bin/hildon-desktop
878 2 root SW 0 0.0 0.4 [sgx_misr]

Snapshot 2: It's just triggering, X and sgx are seen competing for CPU. NOTE: modrana memory 20% NOT HIGH
878 2 root RW 0 0.0 47.2 [sgx_misr]
884 713 root D < 13488 5.4 44.2 /usr/bin/Xorg -logfile /tmp/Xorg.0.log
3621 3620 user S 50476 20.4 5.7 python2.5 modrana.py n900

Snapshot 3: sgx took over and it's all over now ... NOTE: modrana memory 20% NOT HIGH
878 2 root RW 0 0.0 97.6 [sgx_misr]
767 1 messageb S < 2100 0.8 0.8 /usr/bin/dbus-daemon --system --nofork
2231 2228 root R 740 0.3 0.4 top
1142 1028 user S 10004 4.0 0.2 /usr/bin/hildon-status-menu
1799 1028 user S 8008 3.2 0.2 /usr/bin/osso-xterm
757 1 root S 1072 0.4 0.2 /usr/sbin/bme_RX-51
5155 2236 root S 528 0.2 0.2 powertop
3621 3620 user S 50476 20.4 0.0 python2.5 modrana.py n900

> > Nevertheless, as a consumer it means the same to me.
> > If you don't see SGX resets or sgx_misr kernel thread using all CPU, it's a > separate issue which should be discussed in separate bug. This bug is only > about the (currently non-reproducible) SGX issue. Fair enough though I think title of the bug doesn't say sgx. It's about an end user perceived behavior. Till the issue is completely analyzed one won't know whether to treat them different. Nevertheless, I'll confine myself for now to those instances where sgx triggers. I also am not sure why it has suddenly been referred as "non-reproducible" while so much discussion is still going on about it?
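For anyone collecting similar top logs, a rough Python sketch for pulling the [sgx_misr] CPU share out of each snapshot follows. It assumes the column order shown in the original report (PID PPID USER STAT RSS %MEM %CPU COMMAND); the names parse_top_rows and sgx_misr_cpu are invented for this example, and the regular expression has only been checked against the rows quoted in this thread.

```python
import re

# Columns as in the original report: PID PPID USER STAT RSS %MEM %CPU COMMAND.
# STAT may carry a separate "<" flag (e.g. "R <"), so it needs special handling.
TOP_ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<ppid>\d+)\s+(?P<user>\S+)\s+"
    r"(?P<stat>[A-Z]+(?:\s*<)?)\s+(?P<rss>\d+)\s+"
    r"(?P<mem>[\d.]+)\s+(?P<cpu>[\d.]+)\s+(?P<command>.+)$"
)


def parse_top_rows(text):
    """Parse process rows of one top snapshot into dictionaries."""
    rows = []
    for line in text.splitlines():
        m = TOP_ROW.match(line)
        if m:
            row = m.groupdict()
            row["mem"] = float(row["mem"])
            row["cpu"] = float(row["cpu"])
            rows.append(row)
    return rows


def sgx_misr_cpu(text):
    """Return %CPU of the [sgx_misr] kernel thread, or None if it is not listed."""
    for row in parse_top_rows(text):
        if row["command"].strip() == "[sgx_misr]":
            return row["cpu"]
    return None


if __name__ == "__main__":
    # Two rows from the third snapshot quoted in this thread.
    snapshot = (
        "878 2 root RW 0 0.0 97.6 [sgx_misr]\n"
        "767 1 messageb S < 2100 0.8 0.8 /usr/bin/dbus-daemon --system --nofork\n"
    )
    print("sgx_misr %CPU:", sgx_misr_cpu(snapshot))
```

Running this over every snapshot in a log gives a simple time series of sgx_misr load, which makes it easier to say which freezes are the SGX case and which are not.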
(In reply to comment #79)
> Hm. Is Maep still/again doing window updates when its window isn't visible?
>
> Could you check that from e.g. ssh console with "xresponse -w 0 -a '*'"[1] and
> if Maep indeed is doing window updates while screen is blanked or it's on
> background, file a use-time & performance [2] bug against Maep?

Ok, I'll try that. At least with Maep 1.3.6 I'm still seeing these freezes frequently.

> So you need to be moving to get this issue? Do you need to have network also
> enabled?

I think I've only ever seen this when moving. I've actually been thinking of trying to code a test case which only simulates the GPS receiver moving; I hypothesize that could be used to trigger this. So far I've had too many other things to do to investigate this, though.

If by network you mean either wifi or 3G, I think at least most of the time I've had 3G connectivity when I've seen this happen. I don't think I've ever tried (or seen) it without a 3G connection. Network usage is a possibility; Maep downloads the tiles it draws if they are not found in the cache.

> > I see a SGX reset message in dmesg and top showing X taking all CPU
> > in D state.
>
> In that case X isn't taking CPU (unlike for Mayuresh) as it's in D state, and
> you have SGX reset. So sounds like an issue that belongs to this bug.

Yet I seem to remember top showing X taking 100% CPU, but in D state (I believe that can also happen when the CPU time is spent in the kernel as a result of a syscall by X). I'll recheck this too the next time I have the opportunity.

> There are two SGX issues. One where you see sgx_misr kernel thread taking all
> CPU (which I've never seen myself, just mentioned here in bugzilla) and SGX
> resets (which I was able to reproduce in spring). I think both can be handled
> in this bug.

Ah, that clarifies things. Thank you for still looking into this.
While the tone in some comments here is slowly becoming aggressive-ish, and while I'm myself disappointed that the response has seemed too often to be mostly "please wait for the next PR which contains some fixes to something and retest", it's still more than I've come to expect of many proprietary software vendors. While we open source people are obviously frustrated when we can't just fix broken things ourselves, I'm still optimistic about this bug because Nokia seems to be still engaged in resolving this.
(In reply to comment #81)
> (In reply to comment #79)
> > Hm. Is Maep still/again doing window updates when its window isn't visible?
> >
> > Could you check that from e.g. ssh console with "xresponse -w 0 -a '*'"[1] and
> > if Maep indeed is doing window updates while screen is blanked or it's on
> > background, file a use-time & performance [2] bug against Maep?
>
> Ok, I'll try that. At least with Maep 1.3.6 I'm still seeing these freezes
> frequently.

Yeah, I do see damage events from maep while the screen is blanked:

977414476ms : 38988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977415515ms : 1039ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977416491ms : 976ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977421480ms : 4989ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977422489ms : 1009ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977423473ms : 984ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977424483ms : 1010ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977425487ms : 1004ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977433467ms : 7980ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977437478ms : 4011ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977447466ms : 9988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977448475ms : 1009ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977449507ms : 1032ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977450481ms : 974ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977470482ms : 20001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977475470ms : 4988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977479481ms : 4011ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977480470ms : 989ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977488475ms : 8005ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977489492ms : 1017ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977490496ms : 1004ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977492484ms : 1988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977493479ms : 995ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977499470ms : 5991ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977506483ms : 7013ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977507487ms : 1004ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977508490ms : 1003ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977509481ms : 991ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977510489ms : 1008ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977511496ms : 1007ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977513484ms : 1988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977514482ms : 998ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977517476ms : 2994ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977518475ms : 999ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977521481ms : 3006ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977525482ms : 4001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977529467ms : 3985ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977531482ms : 2015ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977532479ms : 997ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977533489ms : 1010ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977535489ms : 2000ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977538470ms : 2981ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977542472ms : 4002ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977552473ms : 10001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977554490ms : 2017ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977556491ms : 2001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977557519ms : 1028ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977558506ms : 987ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977559517ms : 1011ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977562479ms : 2962ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977563489ms : 1010ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977564547ms : 1058ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977568463ms : 3916ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977569485ms : 1022ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977574490ms : 5005ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977575495ms : 1005ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977578487ms : 2992ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977579471ms : 984ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977582481ms : 3010ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977584521ms : 2040ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977585519ms : 998ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
977586520ms : 1001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
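A log like the one above is easier to judge when summarized per client. The following Python sketch does that under two assumptions: the line layout is exactly what this log shows, and the second column is the time since the previous damage event (that is how it reads here, but it is not confirmed against the xresponse documentation). The function name damage_summary is made up for the example.

```python
import re
from statistics import median

# Line layout as seen in the log above, e.g.
# "977415515ms : 1039ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)"
DAMAGE = re.compile(
    r"(?P<stamp>\d+)ms\s*:\s*(?P<delta>\d+)ms\s*:\s*Got damage event\s+"
    r"(?P<w>\d+)x(?P<h>\d+)\+(?P<x>\d+)\+(?P<y>\d+)\s+from\s+"
    r"(?P<window>0x[0-9a-fA-F]+)\s+\((?P<client>[^)]*)\)"
)


def damage_summary(log_text):
    """Map client name -> (number of damage events, median gap in ms)."""
    gaps = {}
    for m in DAMAGE.finditer(log_text):
        gaps.setdefault(m.group("client"), []).append(int(m.group("delta")))
    return {client: (len(g), median(g)) for client, g in gaps.items()}


if __name__ == "__main__":
    # First three entries of the log above.
    sample = (
        "977414476ms : 38988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)\n"
        "977415515ms : 1039ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)\n"
        "977416491ms : 976ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)\n"
    )
    for client, (count, gap) in damage_summary(sample).items():
        print(client, "events:", count, "median gap (ms):", gap)
```

The median is used rather than the mean so that the occasional long pause does not hide a steady roughly once-per-second redraw.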
(In reply to comment #82) > Yeah, I do see damage events from maep while the screen is blanked: This seems to be related to maep's drawing of the (red by default) track. If I disable track capturing, I no longer see damage events. I'll see if I can come up with a patch to maep; presumably identifying the exact operation in Maep that causes this would also help you to fix this bug?
(In reply to comment #75)
> (In reply to comment #68)
> > Hey, hey - it is not the only option, it is still possible to configure upstart
> > in a way which would cause a restart of X11 and GUI apps like hildon etc after
> > X11 failure-and-crash, but keep running server applications.
>
> UI related startup takes most of the bootup time, if one is re-starting the
> whole UI session and loosing all user's unsaved data anyway, it would be much
> safer just to do full reboot in such a stressed situation.

No:
1) there is the possibility of bug 7017.
2) some SERVER applications also have state, and it could be lost as well.
3) I run special ADJUSTMENT scripts after the N900 boots, depending on the situation. A typical example is "stop klogd;stop sysklogd", which I use for day-to-day running. If I have a problem, I don't run it, so that all of syslog is recorded at the cost of some battery/performance. Another example: I use a non-regular boot (M32GB) which chooses a kernel+system depending on the KBD. And I know many users use something like bootmenu etc.

So please find another solution before enforcing a reboot. I suggested one: restart X11 and all GUI apps; it is not difficult to modify the upstart scripts.
(In reply to comment #80)
> Fair enough though I think title of the bug doesn't say sgx.

It does?

> It's about an end user perceived behavior.

Which is/shows up as SGX issue. Please read the original bug report and title.

> Till the issue is completely analyzed one won't know whether
> to treat them different.

If you don't see the SGX issues, it's a different one. It's easy for a buggy / bad program to DOS a Linux system just by misusing e.g. X, as X doesn't have any mechanisms to prevent that from clients that the user allows to access it. If that's the case, that app needs to be fixed not to DOS the device X server.

> I also am not sure why it has suddenly been referred as "non-reproducible"
> while so much discussion is still going on about it?

People encounter this issue now and then, but so far there haven't been any steps with which one could reliably reproduce it with PR1.2 or later, e.g. on a device that is just lying on the desk where it could be debugged with HW debuggers. For example, I haven't seen these issues except last spring before PR1.2, with old Maep, in a way that isn't normal device usage (device on charge over a day with Maep running), and even then it wasn't reliably reproducible.
-> people are omitting some crucial information on how to reproduce the issues, if it really is reproducible for them.

(In reply to comment #83)
> (In reply to comment #82)
>> Yeah, I do see damage events from maep while the screen is blanked:

Thanks for testing! Does it do that also when some other application is on top & active?

Doing window updates in the background takes a lot more CPU away from the top app than just keeping track of the position. It dirties memory (e.g. the window composite buffer) and therefore causes unnecessary memory pressure (and potentially extra swapping because of that).

(Nokia Maps has its own faults, but it doesn't have this kind of crappy behavior for a mobile application. Neither does any other of the pre-installed apps.)
You might also want to check with strace whether it is waking up even more often than this 1 sec interval, because that would make it even more of a use-time issue.

> This seems to be related to maep's drawing of the (red by default) track.
> If I disable track capturing, I no longer see damage events.

I guess you encounter the UI freeze only when you have tracking enabled...?

Btw. X does some things differently with regard to SGX when the screen is blanked. Have you had the SGX related UI freezes only when the screen is blanked, or do they also get triggered when the screen is kept ON?

> I'll see if I can come up with a patch to maep; presumably identifying the
> exact operation in Maep that causes this would also help you to fix this bug?

Not really. The only thing that would help with that is exact steps to trigger the issue that are 100% reproducible for anybody, not just "this happens randomly to me when using app X".

I'm no longer involved in Fremantle development, but my educated guess is that this issue won't get resolved on the SGX side for Fremantle; that kind of steps would have been needed right after the PR1.2 release for things to proceed.

However, the apps can and should be fixed, because their bad behavior also has many other downsides besides the UI freeze. If Maep stops triggering the issue, that's a really good thing in itself and can be used as an example of how to fix the other apps to behave like mobile apps should. I mean, who wouldn't want his navigation software to consume slightly less battery[1] so that you could use it longer?

[1] Battery consumption effect goes typically in this order: display backlight, radios (3g, wlan, bt, gps etc), CPU wakeups/usage (e.g. on previous devices, having one process do 1Hz wakeups for a couple of syscalls on an otherwise idle & offline device reduced the device use-time from a week to 1 day).
(In reply to comment #85)
> Which is/shows up as SGX issue. Please read the original bug report and title.

Point taken. I'll keep it to instances when sgx takes 100% CPU. The point is that, prior to sgx triggering, the X server first went on to take all the CPU, and then in one instance sgx triggered and in the other it didn't. If you insist one should see these differently, let's talk about only those where sgx triggered.

My point still remains. modrana WASN'T taking high memory at that time, as was speculated in some of the posts on this thread.

I also have powertop, xresponse and dbus-monitor logs of all these instances. I can post information that may be useful to analyze this issue.

Mayuresh.
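To make the "no window updates while not visible" advice concrete, here is a toy Python model of the pattern. It is not Maep's code and uses no real toolkit API: the class TrackView and its methods are invented for illustration, and how a real GTK/Hildon application learns that its window is hidden or the screen is blanked is left out. The point is only that position tracking continues while drawing is deferred.

```python
class TrackView:
    """Toy map view: records every GPS fix but only redraws while visible."""

    def __init__(self):
        self.visible = True
        self.track = []      # every recorded position, visible or not
        self.dirty = False   # True when the on-screen track is out of date
        self.redraws = 0     # number of actual redraws performed

    def on_gps_fix(self, lat, lon):
        """Record the new position; draw only if somebody can see it."""
        self.track.append((lat, lon))
        if self.visible:
            self._redraw()
        else:
            self.dirty = True  # remember to repaint later, draw nothing now

    def on_visibility_changed(self, visible):
        """Called when the window is shown/hidden or the screen (un)blanks."""
        self.visible = visible
        if visible and self.dirty:
            self._redraw()     # one catch-up redraw instead of many hidden ones

    def _redraw(self):
        # A real application would invalidate its window here.
        self.redraws += 1
        self.dirty = False


if __name__ == "__main__":
    view = TrackView()
    view.on_gps_fix(60.17, 24.94)         # visible: drawn immediately
    view.on_visibility_changed(False)     # screen blanked
    for i in range(3):
        view.on_gps_fix(60.17 + i * 0.001, 24.94)  # recorded, not drawn
    view.on_visibility_changed(True)      # single catch-up redraw
    print("fixes:", len(view.track), "redraws:", view.redraws)
```

With this structure the track is still complete when the user looks at the screen again, but no damage events would be generated while the screen is blanked.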
(In reply to comment #86) > The point is prior to sgx triggering firstly X server went to take all the cpu > and then in one instance sgx triggered and in the other it didn't. Any idea what app was causing this and whether it was on foreground or on background? Was screen lit or blanked? > My point still remains. modrana WASN'T taking high memory at that time as was > speculated in some of the posts on this thread. It doesn't necessarily need to be the program itself, but the device as whole actively using a lot of memory / swapping while some other things[1] happen seemed as one variable affecting this issue (at least in spring when I was looking a bit into this issue). I would guess that high memory usage isn't the trigger but it makes the issue more likely to happen. [1] like non-visible window updates.
(In reply to comment #87) > (In reply to comment #86) > > The point is prior to sgx triggering firstly X server went to take all the cpu > > and then in one instance sgx triggered and in the other it didn't. > > Any idea what app was causing this and whether it was on foreground or on > background? Was screen lit or blanked? I logged this in comment 48. I managed to keep the display ON with an applet that has this option. (Newer versions of modrana have this option in the app itself.) The application was in the foreground and when the issue triggered fully lit display that showed only the app and nothing else was frozen. If at all I find a correlation of this issue with other factors - it is not with background updates, it is not with display being switched off or lit, it is not with high memory usage (of navig app at least), it is seen quite repeatedly that it is with some events on the network - poor connectivity situations in particular, it's relatively rare in areas with good connectivity. I can regularly see this when traveling outside city areas when the likelihood of this issue triggering increases. Has anyone on this list ever seen this issue with network connection switched off? If not why don't we look at that aspect? No matter whether I leave display on or off and keep a navig app running for hours, no matter if I manage to take modrana memory usage up to 65% or so by continuous panning, nothing ever locks the device if network is switched off. This whole episode reminds of the story of six blind men and an elephant as many of us notice different aspect of the issue. May be there is a background screen update issue, but may be the navig app doesn't do it, something pops up while tries to pop up when network disconnects and reconnects with a foreground app making heavy use of screen in the foreground. Something like that. Just speculating. Pending this issue I have started using modrana regularly only in offline mode. 
I cache the data on home wifi network and when going out just switch off the network access from modrana fully. Then I never face any problem.
(In reply to comment #73) > 2 minutes of googling gives the used upstream kernel and changes on top of it: > http://repository.maemo.org/pool/fremantle/free/k/kernel/kernel_2.6.28-20103103+0m5.diff.gz > > kernel-2.6.28/drivers/gpu/pvr/module.c seems like a good starting point. Thank you for the pointer, I'll start looking there. (In reply to comment #73) > Have you checked how many times in your case the HW would recover within > reasonable time or how many times it has happened even without you noticing it > from the UI (I think you should see this from syslog)? If device is just > rebooted, you lose all unsaved data in your apps. Yes, I have looked at the logs. And yes, there are instances when it resets and everything is fine. The issue is when a reset *doesnt* fix it, it just keeps resetting. I'm just asking for a simple counter. If the device has tried 20 times to reset in a short window (say a minute or two), it's time to call for a system reset. That's not that difficult to do. A static counter, with a setup in the init code, a ++ in the reset code, and a clear in the draw exit code. In fact, I've made a script that does just this by waking up every 30 seconds and collecting the stats from /proc/stat and /proc/<pid>/stat for the sgx_misr process. It warns (verbally) via espeak that it's having issues, then reboots after a short period if not corrected. I'd just prefer this to be built in... As for Bug 7017, that looks like it's been fixed. Even if it hasn't when the device reboots it makes noise, buzzes, and does other things that make it clear that it's rebooting. If the device is in a reboot loop in my pocket, it will get my attention and I can shut it down before it drains battery. If it's sitting idle somewhere, then the worst case result is the same: drained battery. The more common case I'd bet is it reboots and recovers because of the hardware reset. 
(In reply to comment #73) > Buggy user processes can DOS a normally configured Linux system very easily In each of the methods you listed, you're talking about extreme, purposeful behaviors. And in each case, it's a denial of service, where the system is still running, and a watchdog process can be set up to see/kill such user processes causing this behavior. In this case because the culprit is NOT a user process but a kernel level process, that's not an option. There's no way to "kill" a misbehaving kernel process. Let's be clear here. It's not Maep, or modRana, or MicroB that's locking up the system. What's locking the system is a kernel process that's spinning. Sure, a kernel process (like kswap) eating all resources because of requests from a user app is a bad app. But a kernel process spinning after a request failure is a bad kernel driver, regardless of the app triggering it. (In reply to comment #73) > If it's this: > http://maemo.org/packages/package_instance/view/fremantle_extras-devel_free_armel/modrana/0.20-3/ > > It actually would seem to be using python & cairo, not GL. It is, and it is using python/cairo. Yet it still *triggers* this behavior, especially when downloading maps and actively navigating. I've had less frequent occurrences as I've been using it, in part I think because it's no longer fetching tiles from the air, and is instead pulling them from the local cache. (I do think this is tied to GPRS/GPS combo usage as well.) (In reply to comment #74) > You've verified this is the whole device UI freeze caused by SGX issues e.g. by > sshing to the device and seeing sgx kernel thread highest in top or SGX resets > in dmesg? I can verify when it happens because I have a script running that watches the sgx user-space process. When it starts consuming >80% cpu, it uses espeak to announce it has an issue and after 3 such announcements (over 15 seconds) it reboots. 
It's triggered three times so far, twice when using modRana, and once when sitting idle (at about 5am) on my night stand. (In reply to comment #74) > Very interesting. Does this happen with both GPRS and Wifi or only with the > other one? Or only when the device is moving and switching between base > stations? This is interesting. I know I've seen it primarily when on gprs, but have seen it on Wifi too. But this made me think... I checked my reboots, system logs and script logs. One thing I found was a very high (100%) correlation to an inbound SMS. Each time this happened there was an in-bound text message that was time-stamped just before the script triggered and/or shortly before the reboot. I also recall often getting a text just *after* the reboot, including that 5am instance. (Did I mention I hate my bank, posting mortgage payment notices via SMS at 5am...) Could it be that heavy cell usage is involved? (I use only 2.5 EDGE, not 3G.) I will attempt to make a call and/or have people text me in the background while navigating tonight to see if I can trigger it. Also, of a later question: I've had it trigger both when the screen is off, and when on. In the former cases, the screen backlight comes on when I try to access the device, but the screen stays black. In the later cases, the UI is unresponsive, but the backlight continues to alter state based on hardware input.
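The watcher described in comment 89 can be approximated as follows. This is a minimal sketch, not the script posted on talk.maemo.org: it assumes the PID of the sgx kernel thread is already known (e.g. from top), it measures CPU share as the task's utime + stime delta over the delta of the aggregate "cpu" line in /proc/stat, and the names `StrikeCounter`, `watch` and `on_trip` are hypothetical. The 80% threshold and the three strikes come from the comment; the polling interval is a parameter because the comment mentions both 30 seconds and 15 seconds.

```python
import time

def read_total_jiffies(stat_text):
    """Sum the jiffy counters on the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return sum(int(v) for v in fields[1:])
    raise ValueError("no aggregate cpu line found")

def read_task_jiffies(pid_stat_text):
    """Return utime + stime from the text of /proc/<pid>/stat."""
    # The comm field may contain spaces or ')', so split after the last ')'.
    rest = pid_stat_text[pid_stat_text.rindex(")") + 1:].split()
    # rest[0] is the state (field 3); utime is field 14, stime is field 15.
    return int(rest[11]) + int(rest[12])

def cpu_share(total_before, task_before, total_after, task_after):
    """Percentage of all elapsed CPU time that the task consumed."""
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (task_after - task_before) / elapsed

class StrikeCounter:
    """Trips after `strikes` consecutive samples above `threshold` percent."""
    def __init__(self, threshold=80.0, strikes=3):
        self.threshold = threshold
        self.strikes = strikes
        self.count = 0

    def update(self, share):
        self.count = self.count + 1 if share > self.threshold else 0
        return self.count >= self.strikes

def watch(pid, interval=5.0, threshold=80.0, strikes=3, on_trip=None):
    """Poll until the task stays above the threshold, then call on_trip()."""
    def sample():
        with open("/proc/stat") as f:
            total = read_total_jiffies(f.read())
        with open("/proc/%d/stat" % pid) as f:
            task = read_task_jiffies(f.read())
        return total, task

    counter = StrikeCounter(threshold, strikes)
    prev = sample()
    while True:
        time.sleep(interval)
        cur = sample()
        if counter.update(cpu_share(prev[0], prev[1], cur[0], cur[1])):
            if on_trip is not None:
                on_trip()  # hypothetical hook: log, warn, or reboot
            return
        prev = cur
```

A single busy sample resets nothing permanent: one sample below the threshold clears the strike count, so a reset that recovers on its own would not trip the watcher. On a multi-core machine the share is relative to all cores combined, which does not matter on a single-core device like this one.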
Comment 89: > As for Bug 7017, that looks like it's been fixed. Even if it hasn't when the device reboots it makes noise, buzzes, and does other things that make it clear that it's rebooting. Unfortunately, a device that sits in a pocket and makes noises is not for everybody. And unexpected reboot is still unexpected reboot, it kills ANYTHING. But as I understand we need to kill-and-restart only GUI apps. I think a complete reboot is not a good solution.
(In reply to comment #88) > I can regularly see this when traveling outside city areas when > the likelihood of this issue triggering increases. > > Has anyone on this list ever seen this issue with network connection > switched off? If not why don't we look at that aspect? > > No matter whether I leave display on or off and keep a navig app running for > hours, no matter if I manage to take modrana memory usage up to 65% or so by > continuous panning, nothing ever locks the device if network is switched off. ... > Pending this issue I have started using modrana regularly only in offline > mode. I cache the data on home wifi network and when going out just switch > off the network access from modrana fully. Then I never face any problem. Hm. That might explain why I never see it. That's how I typically use my personal device (for use-time reasons). And at work devices of course typically are / have to be stationary. > This whole episode reminds of the story of six blind men and an elephant > as many of us notice different aspect of the issue. May be there is > a background screen update issue, but may be the navig app doesn't do it, > something pops up while tries to pop up when network disconnects and > reconnects with a foreground app making heavy use of screen in > the foreground. Something like that. Just speculating. Yes, there can be multiple reasons/triggers. Highlights the importance of getting exact, reproducible steps and exact environment setup for reproducing the issue. If it's related to network, it could even be something that can be reproduced in a certain network environment (wouldn't be first such bug). (In reply to comment #89) > (In reply to comment #73) > > Buggy user processes can DOS a normally configured Linux system very easily > > In each of the methods you listed, you're talking about extreme, purposeful > behaviors. Several of them can happen just by having a buggy program. I've seen such. They can e.g. 
have some untested corner-case code that in some (e.g. network error) situations gets triggered constantly. > In this case because the culprit is NOT a user process but a kernel level > process, that's not an option. There's no way to "kill" a misbehaving kernel > process. Let's be clear here. It's not Maep, or modRana, or MicroB that's > locking up the system. Did you actually check that e.g. by killing all the UI processes one by one to see whether the issue goes away? If that can cure the issue, it would be good to know what was on screen, e.g. as output of "DISPLAY=:0 xwininfo -root -tree" command (in x11-utils package) taken before starting killing. > I can verify when it happens because I have a script running that watches the > sgx user-space process. When it starts consuming >80% cpu, it uses espeak to > announce it has an issue and after 3 such announcements (over 15 seconds) it > reboots. It's triggered three times so far, twice when using modRana, and > once when sitting idle (at about 5am) on my night stand. Device being stationary, screen blanked? Which apps were running in last case? You might consider starting xresponse logging when your script first notices the issue to see what apps are doing screen updates. + take above xwininfo data too. > I checked my reboots, system logs and script logs. One thing I found was > a very high (100%) correlation to an inbound SMS. Each time this happened > there was an in-bound text message that was time-stamped just before the > script triggered and/or shortly before the reboot. > I also recall often getting a text just *after* the reboot, including > that 5am instance. Very interesting finding! When an SMS comes in, cgroups is used to freeze the extra processes (maybe for ~1 sec) to get the notification banner up fast. Notification coming up on top of a fullscreen application means that it is switched from direct to composited mode. 
Maybe there's some race condition in X regarding drawing, SGX operations and compositing switching? (There's some SGX specific driver code in X side too, but like kernel driver, I think that's also Open Source, unlike the OpenGL/SGX libraries.) > Could it be that heavy cell usage is involved? (I use only 2.5 EDGE, > not 3G.) The device has several HW with their own "OSes", cellmo side has one, SGX has one etc. Maybe they interact badly, I think all of them have their own MMUs and can therefore read & write everywhere in RAM... > Also, of a later question: I've had it trigger both when the screen is off, > and when on. In the former cases, the screen backlight comes on when I try to > access the device, but the screen stays black. In the later cases, the UI is > unresponsive, but the backlight continues to alter state based on hardware > input. If screen was blanked before and UI isn't drawing anything on screen after unblanking, what you see is backlighted black color.
(In reply to comment #90) > And unexpected reboot is still unexpected reboot, it kills ANYTHING. But as I > understand we need to kill-and-restart only GUI apps. No... Your understanding is incorrect. The ONLY solution so far found has been to reboot the system. Killing and restarting the GUI apps has NO effect. The problem here is that a KERNEL DRIVER gets into a spinning state. There is no way to kill or unload the driver, and killing apps seems to have NO effect on the driver. I've tried killing apps and even X itself. The only thing that brings back the display and stops the driver from taking 100% cpu is a reboot. Again, many would prefer to have the device reboot than to have it SILENTLY drain all the battery and die. An unexpected battery drain "kills anything" just as much as a reboot does. At least with the reboot there's a controlled shutdown, and a chance of having battery left after an occurrence. Let me state again, quite clearly: There is NO way we currently know to recover from this state other than a reboot. If you know of another way, please do share it. I would love to alter my script to simply reset the display system if that's an option. To my knowledge, there is no known way of fixing this other than a reboot/powerdown. (In reply to comment #91) > Hm. That might explain why I never see it. That's how I typically use my > personal device (for use-time reasons). And at work devices of course > typically are / have to be stationary. I'm not sure being mobile is a requirement. modRana has the ability to download tiles to cache them locally. One could easily emulate the device strain of navigation by setting the connection to GPRS, setting it to download lots of tiles, leaving the GPS on, and dragging the display around a bit from time to time to force redraws. All of that can be done on the bench, and emulates the device usage in a mobile situation. (Sending a few SMS during that would also seem to be a good indicator?) 
(In reply to comment #91) > Did you actually check that e.g. by killing all the UI processes one by one to > see whether the issue goes away? I did once when it happened at work (I was downloading cache tiles via wifi). I was able to ssh into the device and look around. I tried killing modRana, but decided since a reboot was going to happen anyway, I should try killing other things too. Most of the apps respawned, since the system likes some of the system apps to always be loaded. I even tried killing X with various levels to see if it would recover. (First -1, then -15, then -9.) Nothing caused it to recover (though it did reboot after I killed something it didn't like me killing.) (In reply to comment #91) > If that can cure the issue, it would be good to know what was on screen, e.g. > as output of "DISPLAY=:0 xwininfo -root -tree" command (in x11-utils package) > taken before starting killing. I can add that to my script, to record that.. and maybe even try killing the app(s) involved. I'm up for adding any debug collection if it will help. Right now I'm not having it happen as much, but I've been unable to run a few tests as I'm busy wrapping things up before the holiday. I'm hopeful that after Thursday I can sit and test some things out (including the IM while navigating test I wanted to do yesterday). (In reply to comment #91) > Device being stationary, screen blanked? Which apps were running in last case? None... I generally don't have apps running in the background, especially at night. I do have IM setup, but at night even that goes off. When the 5am instance triggered nothing but the desktop widgets were running, and the screen was off at the time, so they should have been idle (facebook, calendar & weather widgets is all I have there). The only thing I know was going on was I got a SMS around the same time the device had the issue. 
(In reply to comment #91) > You might consider starting xresponse logging when your script first notices > the issue to see what apps are doing screen updates. + take above xwininfo > data too. I will add both. The current script is listed here btw: http://talk.maemo.org/showthread.php?t=66660 (In reply to comment #91) > Very interesting finding! When an SMS comes in, cgroups is used to freeze the > extra processes (maybe for ~1 sec) to get the notification banner up fast. > Notification coming up on top of a fullscreen application means that it is > switched from direct to composited mode. > > Maybe there's some race condition in X regarding drawing, SGX operations and > compositing switching? Possible... I don't generally use full-screen mode though, even for nav apps, as I like to have that top status bar with the time and such visible. But even then, it would still use cgroups and cause X to do redraws. That could be part of it, since a GL "burst" is whats suspected for triggering this, yes? (In reply to comment #91) > The device has several HW with their own "OSes", cellmo side has one, SGX has > one etc. Maybe they interact badly, I think all of them have their own MMUs > and can therefore read & write everywhere in RAM... This was my initial thought when I realized most of the triggering apps were GPS related. I was wondering if it was the GPS module conflicting... But it could be a mix of the SGX and any number of them. But again, the SGX module should be fixed to not get stuck in a loop. Even if the display just goes away until a reboot, the killer is not the UI loss, it's the battery drain. Sucking up 100% cpu and blowing through all battery power is a bad thing. If I picked up my device and couldn't get the UI to refresh without a reboot, that's annoying. If I pick it up and it's off, and has no battery left to restart with, that really sucks. 
(In reply to comment #91) > If screen was blanked before and UI isn't drawing anything on screen after > unblanking, what you see is backlighted black color. My point was more that other things were running still (backlight adjusting, sensors for unlock etc). So the device isn't "locked up", it's just not drawing. But we kind of already knew that. Is there a key sequence one can use to force a reboot safely? I'm not talking about the hard power-down by holding the power button, but something where the OS gets to close things up, like issuing a reboot or shutdown command.
(In reply to comment #92) > (In reply to comment #90) > > And unexpected reboot is still unexpected reboot, it kills ANYTHING. But as I > > understand we need to kill-and-restart only GUI apps. > > No... Your understanding is incorrect. The ONLY solution so far found has been > to reboot the system. Killing and restarting the GUI apps has NO effect. > > The problem here is that a KERNEL DRIVER gets into a spinning state. There is > no way to kill or unload the driver, and killing apps seems to have NO effect > on the driver. I've tried killing apps and even X itself. The only thing that > brings back the display and stops the driver from taking 100% cpu is a reboot. Excuse me, but it is POSSIBLE to restart driver. Just some work needs to be done. > Again, many would prefer to have the device reboot than to have it SILENTLY > drain all the battery and die. An unexpected battery drain "kills anything" > just as much as a reboot does. You suggest a choice between bad and awful. But there is a man who has a similar symptom and suffers exactly from reboots, see - http://talk.maemo.org/showthread.php?t=67292 It seems that this man's situation is exactly your dream, but he is not happy. > At least with the reboot there's a controlled > shutdown, and a chance of having battery left after an occurrence. > > Let me state again, quite clearly: There is NO way we currently know to recover > from this state other than a reboot. You are not informed. If device (SGX) can be reset then there IS a way w/out reboot. > If you know of another way, please do > share it. I would love to alter my script to simply reset the display system > if that's an option. To my knowledge, there is no known way of fixing this > other than a reboot/powerdown. Without programming - yes, you are right. But SGX reset with killing X11 (or sending it some message) is possible.
My case is a little different. The phone will reboot on its own when "HWRecoveryResetSGX: SGX Hardware Recovery triggered" happens. This usually happens after I finish a phone call (receiving calls seems to be ok). The firmware running on my system is PR 1.3.

Dec 22 10:14:41 Nokia-N900-42-11 -- MARK --
Dec 22 10:15:28 Nokia-N900-42-11 kernel: [80711.257019] kb_lock (GPIO 113) is now closed
Dec 22 10:15:28 Nokia-N900-42-11 kernel: [80711.506866] kb_lock (GPIO 113) is now open
Dec 22 10:15:44 Nokia-N900-42-11 dorian[3714]: GLIB CRITICAL ** GLib-GObject - g_object_get: assertion `G_IS_OBJECT (object)' failed
Dec 22 10:15:44 Nokia-N900-42-11 dorian[3714]: GLIB CRITICAL ** Gtk - gtk_widget_set_sensitive: assertion `GTK_IS_WIDGET (widget)' failed
Dec 22 10:25:01 Nokia-N900-42-11 kernel: [81284.624237] kb_lock (GPIO 113) is now closed
Dec 22 10:25:02 Nokia-N900-42-11 kernel: [81284.890228] kb_lock (GPIO 113) is now open
Dec 22 10:31:00 Nokia-N900-42-11 kernel: [81643.178833] HWRecoveryResetSGX: SGX Hardware Recovery triggered
Dec 22 10:32:29 Nokia-N900-42-11 kernel: [81732.428314] HWRecoveryResetSGX: SGX Hardware Recovery triggered
Dec 22 10:32:49 Nokia-N900-42-11 syslogd 1.5.0#5maemo7+0m5: restart.
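Reset entries like the two in the log above can be counted mechanically, which is roughly what the "simple counter" proposed in comment 89 would do inside the driver. The following is only a user-space sketch under stated assumptions: it reads the kernel timestamp in square brackets (seconds since boot) rather than the syslog date, it assumes all lines come from a single boot, and the function names and the `limit`/`window` parameters are hypothetical.

```python
import re
from collections import deque

# Matches e.g. "kernel: [81643.178833] HWRecoveryResetSGX: SGX Hardware Recovery triggered"
RESET_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+HWRecoveryResetSGX: SGX Hardware Recovery triggered"
)

def reset_times(lines):
    """Extract the kernel timestamps (seconds since boot) of SGX resets."""
    times = []
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            times.append(float(match.group(1)))
    return times

def first_burst(times, limit, window):
    """Return the timestamp at which `limit` resets fell within `window`
    seconds of each other, or None if that never happened."""
    recent = deque()
    for t in times:
        recent.append(t)
        # Drop resets that are older than the window relative to this one.
        while t - recent[0] > window:
            recent.popleft()
        if len(recent) >= limit:
            return t
    return None
```

With the figures suggested in comment 89, the check would be `first_burst(times, limit=20, window=120)`. The two resets in the log above are about 89 seconds apart, so they would only count as a burst with a much lower limit.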
(In reply to comment #93) > Excuse me, but it is POSSIBLE to restart driver. Just some work needs to be > done. > > You are not informed. If device (SGX) can be reset then there IS a way w/out > reboot. > > Without programming - yes, you are right. But SGX reset with killing X11 (or > sending it some message) is possible. Since you say I'm "not informed", please DO INFORM ME on how to do this. I have yet to see ANYONE show a way to reset the driver when in this state outside of a reboot. You can say it can be done, and is possible. How? Have you done it? Did it work? If not, how can you say with certainty that it's possible? As I said, I would love to be able to just reset the driver and have it work again. That would be a perfectly acceptable workaround, and most here would happily use such a patch. But currently, there is NO WAY to do this. PLEASE instead of just calling the multiple people here asking for a reboot fix "uninformed", why not do something useful and give an alternate solution. (In reply to comment #93) > You suggest a choice between bad and awful. But there is a man who has a > similar symptom and suffers exactly from reboots, see - The issue he is seeing is NOT the same as this bug. It's a very different issue. His device is being reset by the watchdog after calls. Some of the output in his logs indicated an SGX reset, among other things, before the watchdog reboots his device. But this is not a bug for all things where the SGX resets. His is a separate bug, totally unrelated to what's happening here. Given a choice, I think most would prefer that Nokia fix this so the kernel driver worked and just reset the hardware. But given the choice between a device that randomly DIES and is USELESS for the rest of the day (how it works now), or one that randomly REBOOTS (the only SHOWN way of fixing this bug), I and many others would choose the latter. You may choose differently, but *multiple people* have shown a desire for a reboot vs a dead battery. 
Again, if you know of another working option, PLEASE DO SHARE by showing HOW to do that. Ultimately, I would like this to not happen at all, by having Nokia fix this device killing bug. Until that's a reality, I posted a script to implement a watcher that collects a log and reboots. If you think it's possible to reset the driver and X instead, please post something on how to do so.
(In reply to comment #92) > The problem here is that a KERNEL DRIVER gets into a spinning state. It isn't necessarily the kernel driver, it may just be reacting to what SGX does in a loop (I mean, you may be seeing the effect, not the cause). Would be nice if somebody checks that from the driver... > There is no way to kill or unload the driver, To do rmmod on the driver (pvrsrvkm?), you need to kill its users first. > and killing apps seems to have NO effect on the driver. ... >> Did you actually check that e.g. by killing all the UI processes one by >> one to see whether the issue goes away? > > I did once when it happened at work (I was downloading cache tiles via wifi). Stationary device, no GPS? > I was able to ssh into the device and look around. I tried killing modRana, > but decided since a reboot was going to happen anyway, I should try killing > other things too. Most of the apps respawned, since the system likes some of > the system apps to always be loaded. You need to use dsmetool to kill stuff that uses DSME SW watchdog. > I even tried killing X with various levels to see if it would recover. > (First -1, then -15, then -9.) Were you able to kill it? (If it's in D state, even SIGKILL might not work.) > (In reply to comment #91) > > Hm. That might explain why I never see it. That's how I typically use my > > personal device (for use-time reasons). And at work devices of course > > typically are / have to be stationary. > > I'm not sure being mobile is a requirement. modRana has the ability to > download tiles to cache them locally. One could easily emulate the device > strain of navigation by setting the connection to GPRS, setting it to > download lots of tiles, leaving the GPS on, and dragging the display around > a bit from time to time to force redraws. All of that can be done on the > bench, and emulates the device usage in a mobile situation. Will that trigger the issue? (By mobile I meant things happening in other kernel drivers, not just SGX one. 
Things like base station change etc.) > (In reply to comment #91) > > Device being stationary, screen blanked? Which apps were running > > in last case? > > None... I generally don't have apps running in the background, especially at > night. I do have IM setup, but at night even that goes off. When the 5am > instance triggered nothing but the desktop widgets were running, and > the screen was off at the time, so they should have been idle (facebook, > calendar & weather widgets is all I have there). Did you have network up? Facebook or weather widget might be doing screen update although screen is blanked (as update e.g. once a day isn't a use-time problem). > The only thing I know was going on was I > got a SMS around the same time the device had the issue. > (In reply to comment #91) > > You might consider starting xresponse logging when your script first notices > > the issue to see what apps are doing screen updates. + take above xwininfo > > data too. > > I will add both. The current script is listed here btw: > http://talk.maemo.org/showthread.php?t=66660 Hm. They don't work if X is stuck at D state. Best to first get: cat /proc/$(pidof Xorg)/wchan output to see where it's stuck (if it indeed happens to be in D state). > (In reply to comment #91) >> Maybe there's some race condition in X regarding drawing, SGX operations >> and compositing switching? > > Possible... I don't generally use full-screen mode though, even for nav apps, > as I like to have that top status bar with the time and such visible. But > even then, it would still use cgroups and cause X to do redraws. Non-fullscreen apps are run in composited mode, so that shouldn't trigger composition switch. Incoming SMS triggers temporary cgroups freeze for non-related processes though. > That could be part of it, since a GL "burst" is whats suspected for > triggering this, yes? 
If by "burst" you mean multiple processes waking up (after freeze) to do window updates at the same time (while system is otherwise stressed), yes something like that could be one of the triggering conditions. > Is there a key sequence one can use to force a reboot safely? I'm not talking > about the hard power-down by holding the power button, but something where the > OS gets to close things up, like issuing a reboot or shutdown command. Holding power button down should be doing "clean" shutdown (unmount etc).
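The D-state check suggested above (the `cat /proc/$(pidof Xorg)/wchan` step) could also be scripted so that a watcher records it before trying xresponse or xwininfo. This is a minimal sketch under assumptions: the helper names are hypothetical, the PID is passed in by the caller, and `proc_root` is a parameter only so the code can be exercised against a fake directory. It relies on the standard layout of /proc/<pid>/stat, where the state letter is the first field after the parenthesised command name.

```python
import os

def task_state(pid_stat_text):
    """Return the one-letter state (R, S, D, ...) from /proc/<pid>/stat text."""
    # The command name may contain spaces or ')', so split after the last ')'.
    return pid_stat_text[pid_stat_text.rindex(")") + 1:].split()[0]

def describe_task(pid, proc_root="/proc"):
    """Return (state, wchan); wchan is only read when the task is in D state."""
    base = os.path.join(proc_root, str(pid))
    with open(os.path.join(base, "stat")) as f:
        state = task_state(f.read())
    wchan = None
    if state == "D":
        # Kernel function the task is blocked in, as exposed by /proc/<pid>/wchan.
        with open(os.path.join(base, "wchan")) as f:
            wchan = f.read().strip()
    return state, wchan
```

If `describe_task` reports `D` for Xorg, tools that talk to the X server are likely to hang as well, so it would make sense to log the wchan value first and skip them.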
(In reply to comment #96) > > Is there a key sequence one can use to force a reboot safely? I'm not talking > > about the hard power-down by holding the power button, but something where the > > OS gets to close things up, like issuing a reboot or shutdown command. > > Holding power button down should be doing "clean" shutdown (unmount etc). That doesn't work. Hmm. I always assumed that's because X is in an uninterruptible state, but then I'm not sure if that makes clean shutdown impossible. In the PC world inability to cleanly shutdown can be seen when something that the shutdown depends on is in uninterruptible state (in my experience most often when umount blocks waiting on a D-state task with a file open in the target filesystem).
It seems that I found a way to reproduce it easily... Installed Marble 1.0.0 from extras/testing selected OpenStreetMap theme, selected on the screen a map region about 300x500 km. Selected "Download Region" from the pull down menu, then "Visible region", set Zoom to Tile level range 10-13 (estimated download size: 387 MB) and pressed the button "Done". The download started, put the phone on the table and after about 15-20 it is frozen - had to pull out the battery...
(In reply to comment #98) > It seems that I found a way to reproduce it easily... > Installed Marble 1.0.0 from extras/testing selected OpenStreetMap theme, > selected on the screen a map region about 300x500 km. Selected "Download > Region" from the pull down menu, then "Visible region", set Zoom to Tile level > range 10-13 (estimated download size: 387 MB) and pressed the button "Done". > The download started, put the phone on the table and after about 15-20 it is Seconds? Minutes? > frozen - had to pull out the battery... Could you check with xresponse (from ssh console) whether this application does screen updates after screen has blanked, and how often? Also, please install sp-memusage and give marble PID to mem-cpu-monitor to monitor its (and whole system) memory usage changes.
Created an attachment (id=3362) [details]
thread apply all bt for marble

(In reply to comment #99)
> after about 15-20 it is
>
> Seconds? Minutes?

Rather minutes. But it hangs for 90% of downloads, IIRC it has completed only one large download for me. marble 1.1.0.

I find it easier to make it reproducible for yourself than the remote debugging like this one.
http://userbase.kde.org/Tutorials#Marble
http://userbase.kde.org/Marble/Maemo/OfflineRouting

> Could you check with xresponse

I never got any output from it using:
xresponse -w0 -i
xresponse -w0 -k A
xresponse -w0 -c 100x100
xresponse -w0 -t foo

> sp-memusage and give marble PID to mem-cpu-monitor to monitor its

It has never changed the output:
# mem-cpu-monitor 2710
System total memory: 245380 kB RAM, 786424 kB swap
_______________ ____________
________ __ / system memory \/ system CPU \
time: \/BL\/ used: change: %: MHz:
21:43:24 -- 171136 +0 0.00 0

Also tried:
Nokia-N900:~# ls -l /proc/2710/exe
lrwxrwxrwx 1 user users 0 May 3 22:00 /proc/2710/exe -> /opt/marble/bin/marble
Nokia-N900:~# strace -p 2710
Process 2710 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>^C <unfinished ...>
Process 2710 detached
--- strace prints no output

GDB backtraces attached. Thanks for investigation.

CPU: 0.1% usr 99.8% sys 0.0% nice 0.0% idle 0.0% io 0.0% irq 0.0% softirq
Mem: 194720K used, 50660K free, 0K shrd, 364K buff, 85616K cached
CPU: 0.0% usr 100% sys 0.0% nice 0.0% idle 0.0% io 0.0% irq 0.0% softirq
Load average: 1.00 1.10 1.13
PID PPID USER STAT RSS %MEM %CPU COMMAND
876 2 root RW 0 0.0 98.4 [sgx_misr]
3103 1566 root S 1348 0.5 0.3 sshd: root@pts/0
3412 3105 root R 736 0.3 0.3 top
10 2 root SW 0 0.0 0.3 [omap2_mcspi]
544 2 root SW 0 0.0 0.2 [wl12xx]
884 729 root D < 16428 6.6 0.0 /usr/bin/Xorg -logfile /tmp/Xorg.0.log -logverbose 1 -nolisten tcp -noreset -s 0 -core
2710 1 user S 11932 4.8 0.0 /opt/marble/bin/marble
[...]