Bug 9150 - (int-157801) Device doesn't respond via UI. syslog reports HWRecoveryResetSGX: SGX Hardware Recovery triggered, sgx_misr eating all CPU
Alias: int-157801
Status: NEW
Product: Core
Component: Kernel
Version: 5.0:(20.2010.36-2)
Platform: N900 Maemo
Importance: Unspecified critical with 29 votes
Target Milestone: 5.0+
Assigned To: unassigned
QA Contact: linux-kernel-bugs
Keywords: performance

Reported: 2010-02-19 03:28 UTC by James A. T. Rice
Modified: 2013-12-06 03:12 UTC
CC List: 30 users

Attachments
thread apply all bt for marble (8.86 KB, text/plain)
2011-05-03 23:06 UTC, Jan Kratochvil

Description James A. T. Rice (reporter) 2010-02-19 03:28:25 UTC
SOFTWARE VERSION:
NOKIA
Maemo 5
Version: 3.2010.02-8
WLAN MAC address: EC:9B:5B:FD:xx:xx
Bluetooth address: EC:9B:5B:FD:xx:xx
IMEI: 35693803153xxxx


EXACT STEPS LEADING TO PROBLEM: 
1. System UI will lock up randomly, appears not to
be related to activity or use of the system,
has mostly occurred when idle. When this happens,
opening the slide keyboard, or activating the
keyboard lock switch, or opening the camera shutter
or pressing the power button has no effect, system
has been observed to not indicate incoming phone
calls, and battery drains very quickly with the
device becoming very warm.
2. [sgx_misr] eats all the CPU
3. syslog reports kernel: [100992.835906] HWRecoveryResetSGX: SGX Hardware
Recovery triggered
4. syslog reports mce[749]: Error sending with reply to
com.nokia.system_ui.request.tklock_open: Did not receive a reply. Possible
causes include: the remote application did not send a reply, the message bus
security policy blocked the reply, the reply timeout expired, or the network
connection was broken.
5. syslog reports mce[749]: Error sending with reply to
com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible
causes include: the remote application did not send a reply, the message bus
security policy blocked the reply, the reply timeout expired, or the network
connection was broken.

EXPECTED OUTCOME:
System shouldn't lock up.

ACTUAL OUTCOME:
System is unusable. The only way to recover from this
without remote access (i.e. ssh) appears to be to remove
the battery, which may cause data loss.

REPRODUCIBILITY:
random - has occurred while idle on charge, or idle
in my pocket, incidence about once every day or two.

EXTRA SOFTWARE INSTALLED:
openssh, syslogd, maep

OTHER COMMENTS:
Initially sshing into the device and running
top gave the following information:

PID  PPID USER     STAT   RSS %MEM %CPU COMMAND
800     2 root     RW       0  0.0 97.2 [sgx_misr]

I believe [sgx_misr] is the kernel thread that runs the
PowerVR SGX driver's mid-level interrupt service routine (MISR).
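
For anyone who wants to check for this state over ssh without watching top by hand, here is a rough Python sketch. It assumes the BusyBox-style column layout shown above (PID PPID USER STAT RSS %MEM %CPU COMMAND); the function name and the 90% threshold are my own choices for illustration, and rows whose STAT column contains a space (such as "D <") would not parse correctly.

```python
def find_cpu_hogs(top_output, threshold=90.0):
    """Return (pid, command, cpu) for rows at or above `threshold` %CPU.

    Assumes rows shaped like: PID PPID USER STAT RSS %MEM %CPU COMMAND.
    """
    hogs = []
    for line in top_output.splitlines():
        fields = line.split(None, 7)
        # Skip the header and anything that is not a process row.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        try:
            cpu = float(fields[6])
        except ValueError:
            continue
        if cpu >= threshold:
            hogs.append((int(fields[0]), fields[7].strip(), cpu))
    return hogs


sample = """PID  PPID USER     STAT   RSS %MEM %CPU COMMAND
800     2 root     RW       0  0.0 97.2 [sgx_misr]
"""
print(find_cpu_hogs(sample))
```

Feeding it the captured output of a single batch run of top should be enough to tell whether sgx_misr is the process pinning the CPU.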

Since then, the phone has been reflashed with vanilla
nokia firmware (I think it was with 1.2009.42-11 before):
http://tablets-dev.nokia.com/nokia_N900.php?f=RX-51_2009SE_3.2010.02-8_PR_COMBINED_MR0_ARM.bin
http://tablets-dev.nokia.com/nokia_N900.php?f=RX-51_2009SE_1.2009.41-1.VANILLA_PR_EMMC_MR0_ARM.bin

This hasn't fixed the problem.

Syslogd was installed to try to gather more information.
The problem first shows up as follows in /var/log/syslog:

[period of inactivity, nothing out of the ordinary]
Feb 17 19:45:35 Nokia-N900-02-8 kernel: [100992.835906] HWRecoveryResetSGX: SGX Hardware Recovery triggered
Feb 17 19:47:28 Nokia-N900-02-8 ke_recv[1322]: prop_modified:1889: udi /org/freedesktop/Hal/devices/platform_slide modified button.state.value
Feb 17 19:47:28 Nokia-N900-02-8 kernel: [101105.542236] slide (GPIO 71) is now open
Feb 17 19:47:28 Nokia-N900-02-8 systemui-tklock[988]: Method call received from: :1.6, iface: com.nokia.system_ui.request, method: tklock_close
Feb 17 19:47:29 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:31 Nokia-N900-02-8 ke_recv[1322]: prop_modified:1889: udi /org/freedesktop/Hal/devices/platform_slide modified button.state.value
Feb 17 19:47:31 Nokia-N900-02-8 kernel: [101108.628143] slide (GPIO 71) is now closed
Feb 17 19:47:32 Nokia-N900-02-8 kernel: [101109.518768] kb_lock (GPIO 113) is now closed
Feb 17 19:47:32 Nokia-N900-02-8 kernel: [101109.557830] kb_lock (GPIO 113) is now open
Feb 17 19:47:32 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_open: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:33 Nokia-N900-02-8 kernel: [101109.995330] kb_lock (GPIO 113) is now closed
Feb 17 19:47:33 Nokia-N900-02-8 kernel: [101110.315643] kb_lock (GPIO 113) is now open
Feb 17 19:47:34 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:34 Nokia-N900-02-8 kernel: [101111.034393] kb_lock (GPIO 113) is now closed
Feb 17 19:47:34 Nokia-N900-02-8 kernel: [101111.315643] kb_lock (GPIO 113) is now open
Feb 17 19:47:34 Nokia-N900-02-8 kernel: [101111.432830] kb_lock (GPIO 113) is now closed
Feb 17 19:47:34 Nokia-N900-02-8 kernel: [101111.604736] kb_lock (GPIO 113) is now open
Feb 17 19:47:35 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:36 Nokia-N900-02-8 ke_recv[1322]: prop_modified:1889: udi /org/freedesktop/Hal/devices/platform_slide modified button.state.value
Feb 17 19:47:36 Nokia-N900-02-8 kernel: [101112.925018] slide (GPIO 71) is now open
Feb 17 19:47:37 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:37 Nokia-N900-02-8 ke_recv[1322]: prop_modified:1889: udi /org/freedesktop/Hal/devices/platform_slide modified button.state.value
Feb 17 19:47:37 Nokia-N900-02-8 kernel: [101113.979736] slide (GPIO 71) is now closed
Feb 17 19:47:38 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_open: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:41 Nokia-N900-02-8 kernel: [101118.065643] cam_shutter (GPIO 110) is now open
Feb 17 19:47:41 Nokia-N900-02-8 camera-ui[2292]: GLIB DEBUG liblocation - Loading initial values from com.nokia.Location::las
Feb 17 19:47:41 Nokia-N900-02-8 location-daemon[3174]: GLIB DEBUG default - :1.820 now having 1 connections
Feb 17 19:47:41 Nokia-N900-02-8 camera-ui[2292]: GLIB DEBUG liblocation - Object path: /com/nokia/location/las
Feb 17 19:47:41 Nokia-N900-02-8 location-daemon[3174]: GLIB DEBUG default - :1.820 now having 0 connections
Feb 17 19:47:42 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 19:47:42 Nokia-N900-02-8 location-daemon[3174]: GLIB DEBUG default - :1.411 now having 1 connections
Feb 17 19:47:42 Nokia-N900-02-8 location-daemon[3174]: GLIB DEBUG default - New client. Not modifying LAS session
Feb 17 19:47:45 Nokia-N900-02-8 kernel: [101122.182830] cam_shutter (GPIO 110) is now closed
Feb 17 19:48:09 Nokia-N900-02-8 kernel: [101146.620544] wl1251: 151 tx blocks at 0x3b788, 35 rx blocks at 0x3a780
Feb 17 19:48:09 Nokia-N900-02-8 kernel: [101146.620941] wl1251: firmware booted (Rev 4.0.4.3.7)
Feb 17 19:48:09 Nokia-N900-02-8 wlancond[1128]: Scan issued
Feb 17 19:48:10 Nokia-N900-02-8 wlancond[1128]: Scan results ready -- scan active
Feb 17 19:48:10 Nokia-N900-02-8 wlancond[1128]: Scan results (10 APs) to :1.81
Feb 17 19:48:10 Nokia-N900-02-8 kernel: [101147.667541] wl1251: down
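
To sift a long syslog for this pattern (an SGX recovery followed by mce failing to get replies from the system UI), something like the following Python sketch might help. It only does substring matching on the two messages quoted in this report; the function name and the sample lines are illustrative, and it makes no claim about how mce or the SGX driver work internally.

```python
SGX_RESET = "HWRecoveryResetSGX"
MCE_TIMEOUT = "Error sending with reply to"


def summarize(lines):
    """Count SGX resets, and mce reply timeouts logged after the first reset."""
    resets = 0
    timeouts_after_reset = 0
    for line in lines:
        if SGX_RESET in line:
            resets += 1
        elif resets and "mce[" in line and MCE_TIMEOUT in line:
            timeouts_after_reset += 1
    return resets, timeouts_after_reset


sample = [
    "Feb 17 19:45:35 Nokia-N900-02-8 kernel: [100992.835906] "
    "HWRecoveryResetSGX: SGX Hardware Recovery triggered",
    "Feb 17 19:47:28 Nokia-N900-02-8 kernel: [101105.542236] "
    "slide (GPIO 71) is now open",
    "Feb 17 19:47:29 Nokia-N900-02-8 mce[749]: Error sending with reply to "
    "com.nokia.system_ui.request.tklock_close: Did not receive a reply.",
]
print(summarize(sample))
```

On the device, passing an open /var/log/syslog file object instead of the sample list should work the same way, since the function only iterates over lines.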

A little later I asked someone to call me, to see
if the device would indicate the incoming call -
it didn't, but the syslog has a few entries from it.

Feb 17 20:17:40 Nokia-N900-02-8 telepathy-ring[1061]: GLIB MESSAGE Modem-Call - incoming call from "07871xxxxxx"
Feb 17 20:17:41 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 20:17:42 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 20:17:50 Nokia-N900-02-8 kernel: [102926.987762] proximity (GPIO 89) is now closed
Feb 17 20:17:51 Nokia-N900-02-8 kernel: [102927.972015] proximity (GPIO 89) is now open
Feb 17 20:17:52 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 20:17:55 Nokia-N900-02-8 kernel: [102932.620361] proximity (GPIO 89) is now closed
Feb 17 20:17:56 Nokia-N900-02-8 kernel: [102933.190673] proximity (GPIO 89) is now open
Feb 17 20:17:57 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_close: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 17 20:18:02 Nokia-N900-02-8 telepathy-ring[1061]: GLIB MESSAGE Modem-Call - mt-released incoming call from '07871xxxxxx' com.nokia.csd.Call.Error.Network.NormalCallClearing: Normal Call Clearing
Feb 17 20:18:02 Nokia-N900-02-8 telepathy-ring[1061]: GLIB MESSAGE Modem-Call - terminated incoming call from '07871xxxxxx' com.nokia.csd.Call.Error.Network.NormalCallClearing: Normal Call Clearing
Feb 17 20:18:03 Nokia-N900-02-8 mce[749]: Error sending with reply to com.nokia.system_ui.request.tklock_open: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


Possibly related / useful URLs:

Bug 7017 - SGX memory reset seems failed during reboot
https://bugs.maemo.org/show_bug.cgi?id=7017

The SGX driver
http://www.daimi.au.dk/~cvm/repo/add_nokia_sgx_driver.patch

Possible hardware fault followed by failure to recover
with the "HWRecoveryResetSGX: SGX Hardware Recovery triggered"
routine?

Thanks
James
Comment 1 Andre Klapper maemo.org 2010-02-22 14:06:12 UTC
Hi James, thanks for reporting this!
Comment 2 Faz 2010-02-27 21:48:26 UTC
I believe I just experienced same bug.

Device was on charge for some hours with a steady green LED confirming full
charge.  Unplugged the charger; LED switched to standard blinking as expected but no
response to any physical actions.  Keyboard even remained unlit when slid open. 
Device also warmer than expected.

ssh'd to device without problems and top confirmed same process:


Mem: 228024K used, 17516K free, 0K shrd, 4252K buff, 72024K cached
CPU:  0.3% usr 99.6% sys  0.0% nice  0.0% idle  0.0% io  0.0% irq  0.0% softirq
Load average: 2.44 3.01 2.49
  PID  PPID USER     STAT   RSS %MEM %CPU COMMAND
  796     2 root     RW       0  0.0 98.0 [sgx_misr]



I attempted to kill this process several times with "kill -9 796" but without
success.

Finally used the "reboot" command which seemed to perform a normal reboot
without any unusual delays.

Firmware: 2.2009.51-1.203.2
(Latest version not rolled to UK users yet)


?:~# cat /proc/version 
Linux version 2.6.28-omap1 (bifh6@maemo-bifh-19) (gcc version 4.2.1) #1 PREEMPT
Thu Dec 17 09:40:52 EET 2009



Recent activity that possibly may have contributed or triggered:

Updated OMWeather to version 0.25.6 from the dev repository a few hours earlier
but had yet to reboot device as instructed during update.

Installed BatteryGraph 0.2.2-1 about a week prior, which I understand keeps a
background process running to poll and record battery status every 15 minutes.
Comment 3 Faz 2010-02-27 22:12:26 UTC
Also possibly related:

On several occasions in the past week the device has not commenced charging
upon connecting the supplied Nokia charger.  Resolved each time, on first
attempt, by disconnecting then reconnecting the charger to device.

Uptime was 4 days prior to rebooting.
Can't say for sure, but the occasional charging issue described above may only
have occurred during this time-frame.
Comment 4 Faz 2010-02-28 01:20:45 UTC
Just to add, the display wasn't turning itself off when running off battery
after experiencing this bug.
Toggling various options in the Settings, Display applet resolved it.
N.B. I have the option to keep display lit when charging enabled.
Comment 5 PeteC 2010-03-01 15:04:25 UTC
Exactly the same symptoms as Faz, including UK Firmware 2.2009.51-1.203.2.
Except that I don't have OMWeather or BatteryGraph installed.

My problems seem to relate to using GPS (leaving Maep running while locking the
screen), rather than charging. See bug #8689.

[75990.246429] kb_lock (GPIO 113) is now closed
[75990.379119] kb_lock (GPIO 113) is now open
[76364.634124] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76386.777008] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76674.741302] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76676.639343] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76879.832122] kb_lock (GPIO 113) is now closed
[76880.082122] kb_lock (GPIO 113) is now open
[76882.089965] kb_lock (GPIO 113) is now closed
[76882.347778] kb_lock (GPIO 113) is now open
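
To see how closely the recoveries cluster, the kernel timestamps can be turned into gaps between consecutive resets. A small Python sketch, assuming dmesg lines in the "[seconds.micros] HWRecoveryResetSGX: ..." form pasted above (the function name and the rounding to milliseconds are arbitrary choices of mine):

```python
import re

# Kernel timestamp in brackets, followed by the SGX recovery message.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s+HWRecoveryResetSGX")


def reset_gaps(dmesg_text):
    """Return the seconds elapsed between consecutive SGX recovery messages."""
    times = [float(m.group(1)) for m in STAMP.finditer(dmesg_text)]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


sample = """[75990.379119] kb_lock (GPIO 113) is now open
[76364.634124] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76386.777008] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76674.741302] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[76676.639343] HWRecoveryResetSGX: SGX Hardware Recovery triggered
"""
print(reset_gaps(sample))
```

Lines that are not recovery messages are ignored, so the whole dmesg output can be passed in as one string.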

Load average: 2.10 1.97 1.57
  PID  PPID USER     STAT   RSS %MEM %CPU COMMAND
  784     2 root     RW       0  0.0 81.6 [sgx_misr]
 4546  4524 root     R      588  0.2 18.1 top
 3950  1153 user     S    23636  9.5  0.0 /usr/sbin/browserd -s 3950 -n RTComMessagingServer
 4459   950 user     S    18296  7.4  0.0 /usr/bin/modest
Comment 6 Andre Klapper maemo.org 2010-03-01 22:29:43 UTC
Can you test if this still happens after uninstalling maep, having bug 8689 in
mind?
Comment 7 James A. T. Rice (reporter) 2010-03-02 01:03:57 UTC
Hi Andre,

I was getting to the end of my 28 day 'on the spot
replacement for a new one' period, so I took advantage
of that, and haven't had the device go unresponsive
since.

I've just installed syslog to see if there's anything
about maep and SGX still happening, and indeed
occasionally it does seem to happen that I get a

Mar  1 21:30:17 Nokia-N900-02-8 kernel: [37680.657348] HWRecoveryResetSGX: SGX
Hardware Recovery triggered

Mar  1 22:05:20 Nokia-N900-02-8 kernel: [39783.652770] HWRecoveryResetSGX: SGX
Hardware Recovery triggered

However the device remains responsive, and the
sgx_misr kernel process isn't eating all the CPU.


Any idea as to the origins of the HWRecoveryResetSGX
routine? Is it a kernel patch around a hardware bug
which detects and resets the SGX when bad stuff
happens? (but doesn't handle all cases of bad stuff
happening?)
Comment 8 Eero Tamminen nokia 2010-03-02 16:14:15 UTC
(In reply to comment #2)
>   PID  PPID USER     STAT   RSS %MEM %CPU COMMAND
>   796     2 root     RW       0  0.0 98.0 [sgx_misr]
> 
> I attempted to kill this process several times with "kill -9 796"
> but without success.

Things which take zero memory and whose names are shown in [] are kernel
threads, i.e. there's nothing to kill from the user-space perspective.
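
A quick way to check this from a script, as a hedged sketch: on Linux a kernel thread has an empty /proc/PID/cmdline, which is also why ps and top fall back to showing the name in brackets. The helper name and the `proc_root` parameter are made up for illustration, and the heuristic is not exact (zombie processes also report an empty cmdline).

```python
import os


def looks_like_kernel_thread(pid, proc_root="/proc"):
    """Heuristic: kernel threads have an empty /proc/<pid>/cmdline.

    Returns False if the process is gone or the file cannot be read.
    """
    path = os.path.join(proc_root, str(pid), "cmdline")
    try:
        with open(path, "rb") as f:
            return f.read() == b""
    except OSError:
        return False
```

For the case above, `looks_like_kernel_thread(796)` would be expected to return True while the sgx_misr thread exists, which is consistent with kill -9 having no effect on it.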


(In reply to comment #7)
> Any idea as to the origins of the HWRecoveryResetSGX routine?
> Is it a kernel patch around a hardware bug which detects and resets
> the SGX when bad stuff happens?

AFAIK: SGX HW (microkernel?) resets its state when it encounters a state it
cannot handle and Linux kernel on the ARM side notices this and outputs the
message.

The issue could be in Imagination's SGX driver.  Next release has some driver
changes which should at least decrease the likelihood of the resets happening, 
but as the issue is non-reproducible, it's nearly impossible to reliably verify
this.

This doesn't happen for everybody, some people never encounter this issue (for
example my device has syslog for over a month and there's nothing about SGX
resets, but my usage is quite light, I might not stress the device enough).
Comment 9 James A. T. Rice (reporter) 2010-03-02 16:29:50 UTC
(In reply to comment #8)

> This doesn't happen for everybody, some people never encounter this
> issue (for example my device has syslog for over a month and there's
> nothing about SGX resets, but my usage is quite light, I might not
> stress the device enough).

Have you tried leaving maep running for extended periods of time?
(note I'm not blaming maep for doing anything wrong, but bug 8689
indicates something maep does triggers the SGX resets)..

There do seem to be two different outcomes from SGX resets while
running maep, either
a) as per my old N900 - UI dies, sgx_misr eats CPU, SGX Hardware
recovery triggered.
b) as per my new N900 - SGX Hardware Recovery triggered. Device
continues running just fine.

This is with the same firmware flashed to the devices, so
I'm curious as to whether the difference is intentional - i.e.
different revisions of SGX chip, or whether there's a few
faulty SGX chips (or something related to the resetting of
the SGX chips) which cause symptoms as per a) when HW 
recovery is initiated. In any case there seems to be several
people experiencing a) and several people experiencing b).

Cheers
James
Comment 10 Eero Tamminen nokia 2010-03-02 17:20:12 UTC
(In reply to comment #9)
> Have you tried leaving maep running for extended periods of time?

No, I haven't used that, but it seems interesting (and starts faster than PR1.1
Maps), so I'll give it a try.
Comment 11 Eero Tamminen nokia 2010-03-02 19:14:30 UTC
Tried it on older device, which started getting SGX resets after installing &
testing Maep...

I'll try it next on the device with 20 days uptime and no SGX resets so far.
Comment 12 Eero Tamminen nokia 2010-03-02 19:30:42 UTC
Seems a related bug:
https://bugs.maemo.org/show_bug.cgi?id=8689
Comment 13 Eero Tamminen nokia 2010-03-03 19:03:39 UTC
(In reply to comment #11)
> Tried it on older device, which started getting SGX resets after installing &
> testing Maep...
> 
> I'll try it next on the device with 20 days uptime and no SGX resets so far.

That device doesn't get SGX resets even with Maep.  Hm...

I don't have the same SIM in them.  I wonder whether the network could have some impact
on triggering this (wild guess: a hard-to-trigger issue might come from
interaction between different drivers)?
Comment 14 egoshin 2010-03-10 00:23:04 UTC
(In reply to comment #8)
>The issue could be in Imagination's SGX driver.  Next release has some driver
>changes which should at least decrease the likelihood of the resets happening,
>but as the issue is non-reproducible, it's nearly impossible to reliably verify
>this.

If you send it to me I can verify it. My N900 has a high probability of "SGX
Reset" problem after reboot (it is a reason why I use OFF-ON instead of reboot)
- see bug 7017
Comment 15 Eero Tamminen nokia 2010-03-10 11:05:00 UTC
(In reply to comment #14)
>> The issue could be in Imagination's SGX driver.  Next release has some
>> driver changes which should at least decrease the likelihood of the resets
>> happening, but as the issue is non-reproducible, it's nearly impossible to
>> reliably verify this.
> 
> If you send it to me I can verify it. My N900 has a high probability of "SGX
> Reset" problem after reboot (it is a reason why I use OFF-ON instead of reboot)
> - see bug 7017

There was limited community pre-testing for the PR1.1 release (to find out
whether there are issues that don't appear in our own testing environments). 
Quim can comment on whether something similar is possible also for PR1.2
release.
Comment 16 Alberto Mardegan 2010-03-11 09:45:51 UTC
I'm afraid I've seen this bug as well. Not with Maep, but with maemo-mapper.
Comment 17 Alberto Mardegan 2010-03-11 09:46:12 UTC
*** This bug has been confirmed by popular vote. ***
Comment 18 Andre Klapper maemo.org 2010-04-21 19:54:57 UTC
*** Bug 8689 has been marked as a duplicate of this bug. ***
Comment 19 ryang 2010-04-23 20:44:07 UTC
Just experienced this issue for the first time today. Tried to use phone and
found it completely unresponsive. Pulled the battery and then, about an hour
later, found it the same way. This time I was in a position to SSH into the
phone, and saw the same process eating 100% cpu. No data from syslog yet, as I
don't have it setup and configured, but I'll make that my next step.
Comment 20 Ch. Eckert 2010-04-25 22:31:13 UTC
I've seen this behaviour more than once. Today I was hiking and it occurred four
times while
* the device was in offline mode (to avoid roaming cost of 7¢ per KB)
* the only app running was Maep
* the device had a GPS fix

It usually happens when the device is in locked mode. Grab it, try to unlock
it, no response. As I have no other device available to ssh in, I need to
remove the battery and reboot. After that, the N900 needs up to 20
minutes to get a GPS fix (due to offline mode) and will crash again some time
after it is in locked mode again.

This issue is still unassigned but rather annoying (no offense BTW). Hiking,
biking and editing osm data was the main reason to buy the N900. Are there
hopes to get the issue fixed or should I better use the N810?

Refs:
http://www.christeck.de/wp/2010/04/02/mapping-initiation-of-the-n900/
http://www.christeck.de/wp/2010/04/25/loosing-the-tracks-of-a-hiking-trip-9869/
Comment 21 Eero Tamminen nokia 2010-04-26 10:58:22 UTC
(In reply to comment #20)
> This issue is still unassigned but rather annoying (no offense BTW).

It's being looked at (alias field indicates that it's forwarded to internal bug
tracker).


> Hiking, biking and editing osm data was the main reason to buy the N900.
> Are there hopes to get the issue fixed or should I better use the N810?

It's potentially fixed with PR1.2 which has quite a lot of SGX driver fixes,
but as the issue here isn't really re-producible, it's impossible to say for
sure that this is the case.


There are still two known remaining issues with SGX:

* One that was possible to trigger (sometimes) with maemo-mapper (which uses
clutter i.e. GLES). The kernel oops is a page fault in v7_dma_inv_range(). It
seems that maemo-mapper called glReadPixels() on a pixmap but the backing
memory had already been freed. If SGX tried to access the pixmap it'd probably
trigger a page fault too, which would cause a lock-up.  It may be possible that
the same issue is also triggerable by Maep. This has a potential fix, but it's
not yet properly tested or integrated (NB#144156).

* SGX driver has problems when the FPS stays below 0.5.  I would think this is
unlikely to be triggered by anything other than a synthetic test program; who
would want to write or use an application so badly written that it doesn't
get more speed out of SGX?
Comment 22 PeteC 2010-04-28 14:19:09 UTC
(In reply to comment #21)
> It's potentially fixed with PR1.2 which has quite a lot of SGX driver fixes,
> but as the issue here isn't really re-producible

I can consistently reproduce it on my device (it seems like a few other people
can too). If there's a chance that further testing might reveal a bug *and*
suggest a fix for 1.2, please suggest things for me to try or do. Otherwise
I'll report back after 1.2 is released.
Comment 23 Eero Tamminen nokia 2010-04-29 18:40:57 UTC
(In reply to comment #22)
> If there's a chance that further testing might reveal a bug *and*
> suggest a fix for 1.2, please suggest things for me to try or do.
> Otherwise I'll report back after 1.2 is released.

PR1.2 should be coming soon, I think it's best to do testing on that.

(Note: releases are based on internal testing results -> we don't have dates
for them.)
Comment 24 Eero Tamminen nokia 2010-04-30 09:58:20 UTC
(In reply to comment #22)
> If there's a chance that further testing might reveal a bug *and*
> suggest a fix for 1.2, please suggest things for me to try or do.

Something came up which would be good to check.  When this happens, log in to the
device with SSH and check whether any processes are in D state.  For any such
process, check what /proc/PID/wchan reports.
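
If it is easier than doing this by hand, the check can be scripted. The Python sketch below walks /proc, picks tasks whose state field in /proc/PID/stat is D (uninterruptible sleep), and reads their wchan. It is only an illustration: the function name and the `proc_root` parameter are my own, and it assumes a Python interpreter is available on the device.

```python
import os


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, name, wchan) tuples for tasks in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
            # The command name sits in parentheses and may contain spaces,
            # so split around the last ')' rather than on whitespace.
            name = stat[stat.index("(") + 1:stat.rindex(")")]
            state = stat[stat.rindex(")") + 1:].split()[0]
            if state != "D":
                continue
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except (OSError, ValueError, IndexError):
            continue  # task exited meanwhile, or the file is unreadable
        found.append((int(entry), name, wchan))
    return sorted(found)
```

Calling `blocked_tasks()` while the device is hung and printing the result should list each blocked process together with the kernel function it is waiting in.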
Comment 25 PeteC 2010-05-03 17:11:01 UTC
(In reply to comment #24)
> Something came up which would be good to check.  When this happens, log in to the
> device with SSH and check whether any processes are in D state.  For any such
> process, check what /proc/PID/wchan reports.

Right, sorry for the delay. When the device is in this state:

Nookie:~# ps | grep ' D ' | grep -v grep
  786 root     26132 D <  /usr/bin/Xorg -logfile /tmp/Xorg.0.log -logverbose 1

ohmd intermittently shows up blocked in this state, but also during the
device's normal state too. Xorg is the key here I guess.

/proc/786/wchan contains "PVRSRVPowerLock".

This is on PR 1.1.1.
Comment 26 caco3 2010-05-19 06:09:06 UTC
I got my phone a week ago, only yesterday it started to have this bug.
I was using maep and logging my moving position, but I think sometimes it also
happens when I had maep closed while charging the phone in idle.

Is there a workaround or are there certain apps we should not use/have
installed to avoid it?
Is there a way I can provide useful information?

I would hope it gets solved in firmware 1.2, but I don't expect it to come soon
:(
Comment 27 Eero Tamminen nokia 2010-05-19 11:00:16 UTC
(In reply to comment #26)
> I got my phone a week ago, only yesterday it started to have this bug.
> I was using maep and logging my moving position, but I think sometimes it also
> happens when I had maep closed while charging the phone in idle.

What programs & applets you had running when the phone was "idle"?


> Is there a workaround or are there certain apps we should not use/have
> installed to avoid it? Is there a way I can provide useful information?
> 
> I would hope it gets solved in firmware 1.2, but I don't expect it to come soon
> :(

It should really come soon now (within weeks, not month(s)), it's been
postponed too much already.
Comment 28 Darren Long 2010-05-20 22:30:31 UTC
I've experienced the same issues, with PR1.1 Global, associated with use of
maep or OVI Maps.  Always using 3G for network connectivity (3 UK), if that's
of any consideration.  I've never attempted to debug the cause, though.
Comment 29 Eero Tamminen nokia 2010-05-21 10:31:32 UTC
(In reply to comment #28)
> I've experienced the same issues, with PR1.1 Global, associated with use of
> maep or OVI Maps.

This is the first time I hear that it has happened with the (pre-installed) OVI
maps.  Has anybody else gotten this issue with OVI Maps?
Comment 30 PeteC 2010-05-27 16:11:21 UTC
(In reply to comment #29)
> (In reply to comment #28)
> > I've experienced the same issues, with PR1.1 Global, associated with use of
> > maep or OVI Maps.
> 
> This is the first time I hear that it has happened with the (pre-installed) OVI
> maps.  Has anybody else gotten this issue with OVI Maps?

Not me. I have left OVI Maps in the background with the screen locked on many
occasions, too.
Comment 31 PeteC 2010-05-27 16:19:44 UTC
Bug confirmed on PR1.2 :-( 
(someone please update the version field)

UK variant firmware, installed via command-line flasher (including emmc flash).

Xorg still blocked and its wchan reporting 'PVRSRVPowerLock'. (Exactly as per
my comment #25).

I think it might have taken a bit longer to go belly-up this time (over 1hr, on
PR1.1.1 it rarely took more than 20 mins) but the timing was never precise.

This is with maep v1.3.2.

I've delayed upgrading to 1.3.5 or greater (which the maep maintainer says
might avoid triggering this bug) in the hope that I might be of some help
nailing the underlying problem.

See
https://garage.maemo.org/tracker/index.php?func=detail&aid=5293&group_id=1155&atid=4332
Comment 32 Ian Stirling 2010-06-03 18:03:12 UTC
Saw this on BBC Iplayer - halfway through Top Gear.
Flash video froze, though continued audio playing.
Screen blanked after the blank timeout, would not resume with the lock switch,
touching the screen unblanked (exposing the original content).
There was only one line in dmesg - 
[89946.780700] HWRecoveryResetSGX: SGX Hardware Recovery triggered

All the rest looked normal.
Comment 33 Paul Hartman 2010-06-29 06:54:35 UTC
Has just happened to me on PR1.2. I was not using any maps program at the time,
but did run Maep earlier today for less than 1 minute. I had MicroB open on a
plain HTML page (no flash/javascript stuff) and it was downloading a large file
(350M) in the background, and pidgin running in background, then I locked the
screen. When I came back to check it 30 minutes later, the screen wouldn't turn
on. I am able to ssh into it, though.

I didn't have syslog running at the time but dmesg shows:

HWRecoveryResetSGX: SGX Hardware Recovery triggered

After which the screen won't turn on and sgx_misr uses near 100% CPU, same as
the others have reported.
Comment 34 Luke Dashjr 2010-06-29 09:05:54 UTC
"Me too"

PR 1.2, this has occurred twice for me, both times since I found a way to keep
GPS active (AGTL app; Maps & others tend to let GPS shutdown when idle).

Why is this keyworded 'performance'?
Comment 35 Paul Hedderly 2010-06-29 17:55:43 UTC
Another "me too" here on PR1.2:

Tasks: 173 total,   2 running, 170 sleeping,   0 stopped,   1 zombie
Cpu(s):  9.6%us, 20.1%sy,  0.7%ni, 65.8%id,  3.7%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:    245348k total,   231696k used,    13652k free,     1212k buffers
Swap:   786424k total,    88164k used,   698260k free,    75096k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  866 root      20   0     0    0    0 R 91.3  0.0  31:33.25 sgx_misr
 3908 user      20   0  2344 1016  760 R  7.0  0.4   0:00.10 top
    1 root      20   0  1832  608  484 S  0.0  0.2   0:00.75 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S  0.0  0.0   0:02.37 ksoftirqd/0


Linux Nokia-N900 2.6.28.10power37 #1 PREEMPT Wed May 26 00:24:03 EEST 2010
armv7l unknown
Comment 36 Eero Tamminen nokia 2010-07-26 17:26:07 UTC
(In reply to comment #34)
> Why is this keyworded 'performance'?

Mostly the effects of SGX resets (the SGX resetting itself to recover from
something) are seen as (extreme) slowdown, and the sgx_misr kernel thread taking all
CPU of course also affects performance.  Unlocking not working could be a
side-effect or an unrelated bug.


An internal (w25) release has additional SGX fixes which will be in the next
public release.  As the issues aren't reproducible, it's hard to say whether it
will fix the issues people have still encountered after the SGX fixes in PR1.2.

If some program from Extras is especially "good" at causing the SGX resets (now
that Maep doesn't generate them, at least reproducibly) with PR1.2, please
comment in this bug.  Typically such programs are also otherwise bad; they do
updates in the background, leak a lot of memory etc.
Comment 37 Andre Klapper maemo.org 2010-08-09 13:14:48 UTC
(Some fixes internally had to be reverted because of side effects.)
Comment 38 Xavier Bestel 2010-08-24 19:19:56 UTC
I've got a non-scientific observation to this bug: it happened way more
frequently (2 times a week instead of once a month) during my holidays trip.
Possible causes:

- a TIM (Telecom Italia) sim card instead of an SFR (French telecom operator)
one.
- a different usage pattern (more mail usage, more GPS usage via OVI maps, but
it once locked with only mail app running)

As said elsewhere, it happens when idle, screen locked, either in my pocket or
while charging.
Comment 39 Xavier Bestel 2010-09-02 17:53:18 UTC
Ok, I've "seen" it happen just after I slide-to-unlock (you know, after you
press the power button when it's locked).

In fact, the window manager was dead (mail app fullscreen without decoration),
re-pressing the power button made the menu appear, but without borders and all
buttons stacked in 1 column, instead of 2 as it should in horizontal mode.
And then it deadlocked, screen off.

All this happened quite fast (1-2 seconds), but I'm sure the WM was dead, and
that may be interesting to whoever is in charge of this bug.
Comment 40 Sami Liedes 2010-09-02 19:23:18 UTC
Don't know if this helps, but here's output from sysrq-w (shoW-blocked-tasks)
in the hung state:

[17890.795166] SysRq : Show Blocked State
[17890.795227]   task                PC stack   pid father
[17890.795288] Xorg          D c0283c14     0   831    687
[17890.795288] [<c028396c>] (schedule+0x0/0x328) from [<c0284994>]
(__mutex_lock_slowpath+0xb0/0x124)
[17890.795349] [<c02848e4>] (__mutex_lock_slowpath+0x0/0x124) from [<c02846c8>]
(mutex_lock+0x24/0x28)
[17890.795379]  r6:00000000 r5:ffffffff r4:d0d9a004
[17890.795410] [<c02846a4>] (mutex_lock+0x0/0x28) from [<bf2004f4>]
(PVRSRVPowerLock+0x5c/0xf0 [pvrsrvkm])
[17890.795532] [<bf200498>] (PVRSRVPowerLock+0x0/0xf0 [pvrsrvkm]) from
[<bf2006c8>] (PVRSRVSetDevicePowerStateKM+0x3c/0xa0 [pvrsrvkm])
[17890.795623]  r8:00000001 r7:00000000 r6:00000000 r5:ffffffff r4:d0d9a004
[17890.795623] [<bf20068c>] (PVRSRVSetDevicePowerStateKM+0x0/0xa0 [pvrsrvkm])
from [<bf2068e8>] (SGXScheduleCCBCommandKM+0x3c/0x1f4 [pvrsrvkm])
[17890.795715]  r9:ce6dfe18 r8:d0d7e000 r7:ffffffff r6:00000001 r5:cd741180
[17890.795745] r4:d0d9a004
[17890.795776] [<bf2068ac>] (SGXScheduleCCBCommandKM+0x0/0x1f4 [pvrsrvkm]) from
[<bf2073ac>] (SGXSubmitTransferKM+0x2dc/0x2ec [pvrsrvkm])
[17890.795867] [<bf2070d0>] (SGXSubmitTransferKM+0x0/0x2ec [pvrsrvkm]) from
[<bf2020dc>] (SGXSubmitTransferBW+0x17c/0x190 [pvrsrvkm])
[17890.795928]  r7:d0d9b000 r6:00000001 r5:d0d9a000 r4:d0d9a004
[17890.795959] [<bf201f60>] (SGXSubmitTransferBW+0x0/0x190 [pvrsrvkm]) from
[<bf201714>] (BridgedDispatchKM+0x110/0x140 [pvrsrvkm])
[17890.796051] [<bf201604>] (BridgedDispatchKM+0x0/0x140 [pvrsrvkm]) from
[<bf1f6524>] (PVRSRV_BridgeDispatchKM+0xc4/0xec [pvrsrvkm])
[17890.796142]  r9:ce6de000 r8:ce6ab900 r7:ce6dfeb0 r6:0000033f r5:c01c6755
[17890.796173] r4:bedb6b04
[17890.796173] [<bf1f6460>] (PVRSRV_BridgeDispatchKM+0x0/0xec [pvrsrvkm]) from
[<c00c6a50>] (vfs_ioctl+0x34/0x94)
[17890.796234]  r7:00000006 r6:c01c6755 r5:bedb6b04 r4:ce6ab900
[17890.796264] [<c00c6a1c>] (vfs_ioctl+0x0/0x94) from [<c00c7044>]
(do_vfs_ioctl+0x498/0x4d8)
[17890.796295]  r7:00000006 r6:bedb6b04 r5:00000006 r4:ce6ab900
[17890.796325] [<c00c6bac>] (do_vfs_ioctl+0x0/0x4d8) from [<c00c70dc>]
(sys_ioctl+0x58/0x7c)
[17890.796356]  r9:ce6de000 r8:ce6ab900 r6:c01c6755 r5:bedb6b04 r4:00000000
[17890.796386] [<c00c7084>] (sys_ioctl+0x0/0x7c) from [<c002c920>]
(ret_fast_syscall+0x0/0x2c)
[17890.796417]  r8:c002caa4 r7:00000036 r6:bedb6d4c r5:bedb7064 r4:0019cb94
Comment 41 Mayuresh 2010-09-14 17:10:36 UTC
Me too.

Confirm the presence of this problem when running modrana.

Firmware: India version.
Comment 42 Karl 2010-09-23 08:55:23 UTC
I can also see this issue. Running PR1.2
Comment 43 Karl 2010-09-23 10:41:05 UTC
(In reply to comment #42)
> I can also see this issue. Running PR1.2
> 

As per https://bugs.maemo.org/show_bug.cgi?id=7017 the issue started after
entering reboot in xterm and was fixed after powering off and back on again.
Comment 44 Mayuresh 2010-10-18 07:18:55 UTC
I want to add some more observations:

1. Till now I had seen the issue triggering only when the display was in off
state. I installed "Simple Brightness Applet" to retain the display ON. For the
first time now I could see the issue triggering even when display was ON.
Nothing on the display was responding and I had to as usual take out the
battery to restart.


2. The triggering of issue seems to have something to do with the data
connection. When I was "roaming" the issue triggered more number of times
(almost every 5/10 minutes) than when I was on my home network (where it
triggered once in a few hours). (I'm suggesting this to be attributed to the
quality of data network rather than roaming or non roaming.)


3. I switched off internet connection completely. Disabled auto-connection
options, set the map images to None in modrana. (Retained only the tracklog
view.) The bug did not trigger in several hours of continuous usage after that.
Comment 45 Eero Tamminen nokia 2010-10-18 10:50:28 UTC
(In reply to comment #44)
> 2. The triggering of issue seems to have something to do with the data
> connection. When I was "roaming" the issue triggered more number of times
> (almost every 5/10 minutes) than when I was on my home network (where it
> triggered once in a few hours). (I'm suggesting this to be attributed to the
> quality of data network rather than roaming or non roaming.)

It could be just that programs in the device do more (at the same time) and
consume/dirty more memory when you have network connection.

dbus-monitor will tell what kind of dbus messages there are flying in the
device, "xresponse -a '*' -w 0" will tell what kind of window updates the
programs do and e.g. powertop will tell what extra processes had wakeups during
that time.
Comment 46 Mayuresh 2010-10-18 11:46:47 UTC
> It could be just that programs in the device do more (at the same time) and
> consume/dirty more memory when you have network connection.
> 
> dbus-monitor will tell what kind of dbus messages there are flying in the
> device, "xresponse -a '*' -w 0" will tell what kind of window updates the
> programs do and e.g. powertop will tell what extra processes had wakeups during
> that time.
> 

You know that a bug-triggering state was reached only after the device hangs.
So I don't know WHEN to check above things.

I had kept the syslogs enabled. Using time stamp and treating "syslog restart"
as a separator, I have segregated the log into segments between a restart till
occurrence of issue. I also have the log of the session where the same app
(modrana) was running though sans internet and no issue got triggered. Thus I
have 4 "bad" session logs and 1 "good" session log.

I am trying to isolate messages that occur only in "bad" session and not in
"good" session to uniquely tell what happened before the triggering of issue.

No luck so far. It could be that the issue causing circumstances do not reflect
in the log.
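
A minimal sketch of the good/bad comparison described above, assuming the
session logs are plain syslog files with the usual "Mon DD HH:MM:SS host"
prefix and that the analysis runs on a computer with standard sed, sort and
comm. The function names and file names are hypothetical. Timestamps, kernel
uptime stamps and PIDs are stripped first so that the same message from two
different sessions compares as equal.

```shell
#!/bin/sh
# Hypothetical helpers for comparing one "good" and one "bad" session log.

# Print the distinct messages of a syslog file with the volatile parts removed:
# the "Mon DD HH:MM:SS host " prefix, kernel "[uptime] " stamps and "[pid]".
normalize_syslog() {
    sed -E \
        -e 's/^[A-Z][a-z]{2} +[0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2} [^ ]+ //' \
        -e 's/\[ *[0-9]+\.[0-9]+\] //' \
        -e 's/\[[0-9]+\]/[]/' "$1" | LC_ALL=C sort -u
}

# Print messages that occur in the bad log ($2) but never in the good log ($1).
bad_only_messages() {
    good_tmp=$(mktemp) || return 1
    bad_tmp=$(mktemp) || { rm -f "$good_tmp"; return 1; }
    normalize_syslog "$1" > "$good_tmp"
    normalize_syslog "$2" > "$bad_tmp"
    # comm -13 keeps only the lines unique to the second (bad) file.
    LC_ALL=C comm -13 "$good_tmp" "$bad_tmp"
    rm -f "$good_tmp" "$bad_tmp"
}

# Usage: bad_only_messages good-session.log bad-session-1.log
```

Running it once per bad session and keeping only the messages common to all
of the outputs would narrow the candidates further.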
Comment 47 Eero Tamminen nokia 2010-10-18 15:29:26 UTC
(In reply to comment #46)
> You know that a bug-triggering state was reached only after the device hangs.
> So I don't know WHEN to check above things.

If xresponse (which you use through SSH connection) reports apps updating their
windows when they're not visible (some other app is on top or screen is
blanked), that's a clear bug in the application and has at least earlier made
this issue to trigger much more easily. -> bug should be filed against the app

Applications using huge amounts of memory can also make this issue easier to
trigger.  If application leaks memory when it's just repeating the same thing
without its data (e.g. number of mails, contacts etc) growing, that's a clear
bug also (-> should be reported against app).  mem-cpu-monitor (from
sp-memusage package) and xrestop can be used to track that.

Among other things PR1.3 fixes last known (smallish) memory leaks in the
pre-installed applications.


> No luck so far. It could be that the issue causing circumstances
> do not reflect in the log.

I think it's better to first wait until PR1.3 is released and see whether this
is still reproducible.  It has several fixes to the SGX drivers.
Comment 48 Mayuresh 2010-10-18 16:12:22 UTC
> If xresponse (which you use through SSH connection) reports apps updating their
> windows when they're not visible (some other app is on top or screen is
> blanked), that's a clear bug in the application and has at least earlier made
> this issue to trigger much more easily. -> bug should be filed against the app

I've seen this bug with SEVERAL navigation apps till now. I have also seen it
trigger when the display was on and the application in question was visible in
the foreground.

> 
> Applications using huge amounts of memory can also make this issue easier to
> trigger.  If application leaks memory when it's just repeating the same thing
> without its data (e.g. number of mails, contacts etc) growing, that's a clear
> bug also (-> should be reported against app).  mem-cpu-monitor (from
> sp-memusage package) and xrestop can be used to track that.


I had run top and monitored memory taken once (by modrana though I already said
this happens with several Navig apps). I saw no reason to believe that it was
huge, nor did I notice a memory usage trend indicative of leaks.

STILL, to pre-empt above possibility I wrapped modrana in a bash script to be
restarted every 1 minute. The bug still got triggered within that minute.

Rather than display and memory leaks etc. when internet connectivity was either
very smooth or completely disabled, the bug did not get triggered for several
hours of continuous usage no matter whether display was on or off, no matter
whether other applications popped up messages in the foreground or whatever. It
got triggered much more frequently with poor internet connectivity.

I think with the observations I have gathered till now, I might be able to
provide a way to reproduce this bug reliably, perhaps using home router to
create poor network conditions when the app is trying to load map images etc.
Will post if I manage to do so.


> 
> Among other things PR1.3 fixes last known (smallish) memory leaks in the
> pre-installed applications.

Doesn't mean much for the present problem unless the pre-installed navig
application has more functionality and interfaces to load gpx or other file
formats etc.

> 
> 
> > No luck so far. It could be that the issue causing circumstances
> > do not reflect in the log.
> 
> I think it's better to first wait until PR1.3 is released and see whether this
> is still reproducible.  It has several fixes to the SGX drivers.

That's certainly interesting. Keeping fingers crossed since there is no word
whatsoever on WHEN that's going to be released.



Mayuresh
Comment 49 Mayuresh 2010-10-31 18:51:48 UTC
(In reply to comment #45)

> dbus-monitor will tell what kind of dbus messages there are flying in the
> device, "xresponse -a '*' -w 0" will tell what kind of window updates the
> programs do and e.g. powertop will tell what extra processes had wakeups during
> that time.

I have captured top, powertop, xresponse, syslog and dbus monitor logs for a
session that ended in triggering the issue.

The issue was triggered as usual while on the move for 2/3 hours with internet
connection ON and a navig application - modrana - on.

syslog:

Seen garbled (with binary characters) when the crash happens.
Previous few messages suggest network registration activity. Some also show
"...or the network connection was broken" error message, though the process
that logs this message varies. But such message comes in normal run also.

This time I did not see SGX recovery triggers, though I have seen them in logs
in previous such crashes.

top:

This time I could NOT see sgx_misr taking all the cpu though I have noticed
this before. (It could be that the top logging got choked and this did not get
logged.)

There was nothing abnormal in top log. Memory taken by modrana was 38%.

xresponse log showed the following 3 messages repeating several times towards
the end of the log:

   4898834ms :     2ms : Unmapped window 0x1a000fc (hildon-home)
   4898836ms :     2ms : Unmapped window 0x1a000fc (hildon-home)
   4898837ms :     1ms : Destroyed window 0x1a000fc (hildon-home)

Throughout the log it shows "Got damage event" from SCREEN, hildon-home,
hildon-status-menu and modrana. Seems nothing unusual as these messages are
seen throughout the run.

powertop and dbus-monitor logs did not show anything different at the point of
crash.

Note: I am still on PR1.2. Will upgrade soon, though above investigation will
help re-assess the issue in PR1.3.

Will appreciate pointers to analyze this information further.

Regards,
Mayuresh.
Comment 50 Eero Tamminen nokia 2010-11-01 14:58:35 UTC
(In reply to comment #49)
> syslog:
> 
> Seen garbled (with binary characters) when the crash happens.

Clearly as just part of some specific process' syslog message content, or like
syslog content itself got garbled (which could be either syslog process or file
system corruption issue...)?


> Previous few messages suggest network registration activity. Some also show
> "...or the network connection was broken" error message, though the process
> that logs this message varies. But such message comes in normal run also.

It would be pretty bad if network stack would corrupt kernel memory (or it
could be SGX corrupting it and it showing up with network).

Is there anything suspicious in your oopslog (/dev/mtd2) for PR1.2 release?


> This time I did not see SGX recovery triggers, though I have seen them in logs
> in previous such crashes.
> 
> top:
> 
> This time I could NOT see sgx_misr taking all the cpu though I have noticed
> this before. (It could be that the top logging got choked and this did not get
> logged.)

Maybe this was a different issue.  Because syslog access is synchronous, it's
possible for there to be priority inversion situation between processes.

(which makes debugging things awkward...)


> Note: I am still on PR1.2. Will upgrade soon, though above investigation will
> help re-assess the issue in PR1.3.

Can you reproduce this bug with PR1.3?
Comment 51 Xavier Bestel 2010-11-01 15:36:57 UTC
FWIW I've seen those hangs with PR1.3, but I have nothing handy to debug.
Comment 52 Mayuresh 2010-11-01 17:08:47 UTC
(In reply to comment #51)
> FWIW I've seen those hangs with PR1.3, but I have nothing handy to debug.
> 

I have just written the following script to log the activity continuously.

Besides these contents of /var/log/syslog should be captured for the time
window of interest.

Will appreciate comments regarding usage of the log tools used, what else may
be worth logging etc.

#!/bin/sh

# NOTE: This will create various monitoring logs till the script is killed.
# Be aware that it may fill up the media card space. Be sure to close the
# script after the watch window is over. Also periodically delete the logs
# created or move them to a computer with sufficient storage space.

trap "kill 0" EXIT

LOGDIR="/media/mmc1/logs"
NEWLOGDIR=$LOGDIR/`date +%Y%m%d%H%M%S`

mkdir -p $NEWLOGDIR
cd $NEWLOGDIR

top > top.log &

xresponse -a '*' -w 0 > xresp.log &

dbus-monitor > dbus.log &

while [ true ]
do
    sp-oops-extract /dev/mtd2
    sleep 5
done > oops.log &

while [ true ]
do
    date
    echo "====="
    powertop
done > powertop.log &



while [ true ]
do
        sleep 86400
done
Comment 53 Eero Tamminen nokia 2010-11-01 17:34:17 UTC
(In reply to comment #52)
> while [ true ]
> do
>     sp-oops-extract /dev/mtd2
>     sleep 5
> done > oops.log &

This isn't needed.  Like syslog, oopslog will be there after reboot (SDK syslog
is AFAIK rotated on bootup based on log size and oops partition is used as oops
record ring buffer).
Comment 54 Mayuresh 2010-11-01 18:49:39 UTC
(In reply to comment #53)
Thanks. Will drop that.

Just now the issue got triggered once again. The device was at home with
modrana running and connected with a laptop via ssh over wifi.

Observations:

1. We are probably talking about two (may be several) different issues with
same end-user perceived symptom - hanging/having to take out battery etc.

2. This time I had switched off the display and this time sgx_misr cpu usage
triggering was seen. (Last few times when display was left on with appropriate
setting, sgx_misr wasn't taking the cpu though the device was hung.)

3. Thanks to top log I can tell that the point at which sgx_misr CPU usage shot
up, Xorg was taking 85% memory. I think that's prominent enough observation
from all the logs.

4. I have got all the logs produced by above script and can share any
information anyone wants.

Regards,
Mayuresh.
Comment 55 Mayuresh 2010-11-01 19:48:23 UTC
> 3. Thanks to top log I can tell that the point at which sgx_misr CPU usage shot
> up, Xorg was taking 85% memory.

Sorry, that was a wrong observation. (In the log, the column titles and contents
do not come properly aligned.) 85% was CPU usage, not memory.
Comment 56 Eero Tamminen nokia 2010-11-02 11:32:57 UTC
Most interesting thing is whether this is reproducible with PR1.3...
Comment 57 Craig Woodward 2010-11-04 22:38:22 UTC
I am running PR1.3, and am seeing this often now.  I was NOT seeing it before
the PR1.3 update, but have had this happen about 4 times now, verified via SSH
login:

Mem: 207244K used, 38056K free, 0K shrd, 1740K buff, 66536K cached
CPU:  0.0% usr  100% sys  0.0% nice  0.0% idle  0.0% io  0.0% irq  0.0% softirq
Load average: 1.10 1.02 0.93
  PID  PPID USER     STAT   RSS %MEM %CPU COMMAND
  868     2 root     RW       0  0.0 81.6 [sgx_misr]
 4704  4684 root     R      588  0.2  9.0 top
   10     2 root     SW       0  0.0  9.0 [omap2_mcspi]

I have found it happens most often when modRana is running tile download
caching (memory, GPS, CPU & FS intensive).  I didn't have this issue before the
PR1.3 update, with modRana and Mappero being regularly used items.  (I did OTA
update for PR1.3.)  I've also found that text messages and alarms don't go off
when this is happening. (But are time-stamped properly after rebooting.)

I do have power-kernel installed, so if this is a kernel-level fix, that may be
part of the issue.  Since it's semi-reproducible, would it be of help if I
installed the stock PR1.3 kernel to see if it helps?
Comment 58 Xavier Bestel 2010-11-04 23:20:21 UTC
I know it sounds weird, but by now I'm sure it happens near certainly when in
my trousers' pocket. That may mean there's a hardware problem when it's somewhat
pressed. At first I thought it may be linked to the acceleration sensor, but it
doesn't happen when it's in my jacket.
Comment 59 Craig Woodward 2010-12-06 22:47:51 UTC
FYI: I've repeated this with stock kernel in PR1.3 on a couple of occasions
now.  It seems to happen mainly when I have either a nav app running (like
modRana) or a flash game running in the browser.  Both are graphic, CPU, and
memory intensive.
I've also had instances where after shutting down such apps, minutes after,
when changing between Wifi and GPRS this triggers.

Let's be clear, this is a system-level error, not an app error as comment 47
seems to imply. Sure, an app shouldn't eat all system memory.  But frankly, a
user process eating too much memory should trigger a process kill, not a system
lockup.  A kernel driver going into a spin and requiring a reboot to fix isn't
an acceptable result of low memory.  The kernel should be able to do a process
kill if it's dead out of memory and needs it, most Linux kernels have this
built in.  This driver clearly has issues, and needs to be fixed.

For those stuck with this bug, I posted a script (link below) to monitor the
driver and reboot if it starts spinning, as to have a clean shutdown and not
burn through your entire battery.  If it would be helpful to add some
collection code to that before the reboot, let me know.  I'd be happy to add
it.

http://talk.maemo.org/showpost.php?p=890903&postcount=1
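
For readers who cannot reach that post, the idea can be sketched roughly as
follows. This is not the linked script: the helper names, the 80% limit and
the three-sample rule are assumptions made for illustration, and the loop is
left commented out because it reboots the device. The parser locates the
%CPU column from the header row, so it copes with both top layouts pasted
earlier in this bug.

```shell
#!/bin/sh
# Hypothetical watchdog helpers; not the script from the talk.maemo.org post.

# Read `top` batch output on stdin and print the %CPU value of sgx_misr.
# The %CPU column index is taken from the header row, then the first row whose
# last field is "sgx_misr" or "[sgx_misr]" is reported.
sgx_misr_cpu() {
    awk '
        col == 0 { for (i = 1; i <= NF; i++) if ($i == "%CPU") col = i; next }
        $NF ~ /^\[?sgx_misr\]?$/ { print $col; exit }
    '
}

# Succeed (exit status 0) when the %CPU value $1 is at or above the limit $2.
cpu_at_least() {
    awk -v cpu="$1" -v limit="$2" 'BEGIN { exit !(cpu + 0 >= limit + 0) }'
}

# Example loop (commented out on purpose): act only after three consecutive
# busy samples, so a short burst around a single SGX reset is ignored.
# busy=0
# while sleep 60; do
#     cpu=$(top -b -n 1 | sgx_misr_cpu)
#     if cpu_at_least "${cpu:-0}" 80; then busy=$((busy + 1)); else busy=0; fi
#     [ "$busy" -ge 3 ] && { logger "sgx_misr spinning, rebooting"; reboot; }
# done
```

Diagnostic collection (for example a dmesg dump to the memory card) could be
added just before the reboot line.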
Comment 60 Eero Tamminen nokia 2010-12-09 12:44:15 UTC
(In reply to comment #59)
> Let's be clear, this is a system-level error, not an app error as comment 47
> seems to imply.

If the issue is e.g. app needing larger GL command buffer than what's available
by default, it can add a /etc/powervr.d/ file[1] where it sets a suitable
command buffer size for itself (I have no idea whether there's documentation on
this, maybe Imagination www-site has something).

[1] "strace -e open -f <application>" will tell which files SGX libs try to
open.
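
A small sketch of how the strace output from [1] could be narrowed down,
assuming strace prints calls in its usual `open("path", flags) = result`
form. The function name is hypothetical, and filtering on the substring
"powervr" is only a guess based on the /etc/powervr.d/ directory name above;
it is not a documented list of the files the SGX libraries read.

```shell
#!/bin/sh
# Hypothetical filter for strace output read on stdin.

# Print "path result" for every open() call whose path contains "powervr".
# A result of -1 means the library looked for the file but did not find it.
powervr_opens() {
    sed -n -E 's/^.*open\("([^"]*powervr[^"]*)".*\) = (-?[0-9]+).*$/\1 \2/p' |
        LC_ALL=C sort -u
}

# Usage (strace writes its trace to stderr):
#   strace -e open -f <application> 2>&1 | powervr_opens
```

Paths reported with -1 are candidates for a per-application settings file,
since the libraries tried to open them and found nothing.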


> Sure, an app shouldn't eat all system memory.

Problem isn't related to things eating all virtual memory (as that would be
very hard because swap size is huge), just a lot of it & using it actively so
that device swaps a lot.


>  But frankly, a user process eating too much memory should trigger
> a process kill, not a system lockup.  A kernel driver going into a spin

SGX resets are about what the SGX HW microkernel does e.g. when rendering is
too slow.  CPU side kernel driver is just reporting that, it doesn't have
control over it or know which of the processes pushing stuff to SGX pipeline
might be causing it.


> and requiring a reboot to fix isn't an acceptable result of low memory.
> The kernel should be able to do a process kill if it's dead out of memory

If this is a question of SGX itself running out of memory, there's no API in
SGX driver to get information about this nor which process is using most of it
(no idea whether the current driver is even tracking that).


> and needs it, most Linux kernels have this built in.

The kernel has OOM-killing enabled, but it works only if device runs out of
normal (virtual) memory, it cannot help with SGX issues.


> This driver clearly has issues, and needs to be fixed.

While the driver has issues which AFAIK are going to be handled only in
Harmattan version of it, what the applications do has very large effect on
triggering them. I've never heard of anybody being able to trigger this with
pre-installed device software, it happens only when running 3rd party SW.
Comment 61 Luke Dashjr 2010-12-09 17:07:46 UTC
Pretty sure I had this problem with a clean PR1.2 flash. I /don't/ see it
anymore, still on PR1.2. Maybe coincidence, but I think it stopped when I
stopped leaving GPS active 24/7.
Comment 62 Craig Woodward 2010-12-13 22:14:11 UTC
Some of the apps that I've found which trigger this are QT BASED APPS, meaning
they're not using the SGX driver directly. The whole point of QT is to not have
to do machine-level tweaking like this.  Further, there doesn't appear to be
any documentation about how to set the "Command buffer size" for the PowerVR
driver. Are the options for this ini file documented anywhere?  Even Google
power search isn't turning up anything useful (in this forum or outside).

In any case, this is a *closed source* kernel driver, which is causing the
entire device to die randomly.  I think people would be ok with it if it threw
an error, or skipped a drawing request, or even crashed the calling program. 
*Anything*, even rebooting, would be a better result than silently spinning in
a tight kernel loop, eating 100% CPU, completely draining the battery, all the
while disabling the only UI on the device.  

The fact that this major bug has existed since PR 1.1, and now appears to be
headed toward a "WONTFIX" state is really sad.  This bug was marked as a
"critical show stopper" by Nokia in Feb, almost a year ago.  Yes, shipped
software doesn't trigger it.  Shipped software also doesn't provide
turn-by-turn navigation or lots of features.  If this device were limited to
Nokia-only software it would be quite boring, and not purchased by most people
who now own one.

I'll tell you this much: If Nokia goes "WONTFIX" on something as critical as a
display driver on a 4th gen device a year and 3 patches later, what confidence
is anyone going to have with MeeGo?  The fact that it's still labeled "New" and
"unassigned" is telling all by itself.
Comment 63 Eero Tamminen nokia 2010-12-14 17:34:15 UTC
Does somebody have reliably *reproducible* test-case for this with PR1.3?
Then it at least can be checked what happens in the device.


(In reply to comment #62)
> Some of the apps that I've found which trigger this are QT BASED APPS,
> meaning they're not using the SGX driver directly.
> The whole point of QT is to not have to do machine-level tweaking like this.

If that really is the issue with those Qt apps, whether something is using HW
indirectly or directly is really immaterial in regards to what the program does
with the HW.  There being extra SW layers in-between rarely helps things if app
is trying to use more resources than are available (typically it makes things
worse as developer is less aware of those issues).


> Further, there doesn't appear to be
> any documentation about how to set the "Command buffer size" for the PowerVR
> driver. Are the options for this ini file documented anywhere?

Maybe in the “OMAP35x Graphics SDK Getting Started Guide” or other documents
referred by it?


> Even Google power search isn't turning up anything useful
> (in this forum or outside).

You may need to register to get SGX SDK & its documentation from Imagination.


> In any case, this is a *closed source* kernel driver, which is causing the
> entire device to die randomly.

The kernel driver is AFAIK open source.  The user space libraries are closed
source.


> I think people would be ok with it if it threw
> an error, or skipped a drawing request,

I don't think it's that simple.

SW might be executing the commands asynchronously for performance reasons and
not waiting for return values.

The HW throws an error (resets itself) and as result stuff actually *isn't*
being drawn.  For you that (UI frames not being drawn) seems like UI freeze.


> or even crashed the calling program.

I don't think kernel driver knows what process pushing things to the graphics
pipeline eventually caused the HW stall/reset and whether HW will eventually
recover from it.  Crashing e.g. X server because some other app caused GFX
pipeline stall from which system might recover, would cause device reboot.  Not
very desirable.

AFAIK one of the triggers for GFX HW state reset is also SW trying to do things
that are too slow on SGX.

There are many things (not just GL related) that user-space processes (even
ones not running as root) can do which can completely ruin the device
usability.  Applications need to behave nicely, not DOS the device.


> *Anything*, even rebooting, would be a better result than silently spinning
> in a tight kernel loop, eating 100% CPU, completely draining the battery,
> all the while disabling the only UI on the device.  
> 
> The fact that this major bug has existed since PR 1.1, and now appears to

It existed already earlier and there have been additional fixes to it in every
release (some also in PR1.3).


> be headed toward a "WONTFIX" state is really sad.  This bug was marked as a
> "critical show stopper" by Nokia in Feb, almost a year ago.
>
> Yes, shipped software doesn't trigger it.  Shipped software also doesn't
> provide turn-by-turn navigation or lots of features.

Why does turn-by-turn navigation need GL?

Embedded GL HW is typically suited for smallish textures and a good amount of
geometry, that's why they're better match for (suitably designed) games than
applications which typically have lots of large textures but very little
geometry. Even worse is if such application tries to use 32-bit textures, they
more than halve the speed compared to 16-bit ones.


> If this device were limited to
> Nokia-only software it would be quite boring, and not purchased
> by most people who now own one.
> 
> I'll tell you this much: If Nokia goes "WONTFIX" on something as critical
> as a display driver on a 4th gen device a year and 3 patches later, what
> confidence is anyone going to have with MeeGo?

In regards to graphics drivers (kernel & user-space); there are large changes
in them, X server, Qt etc. in MeeGo.  If you check meegotouch library &
applications that are publicly available (e.g. on ARM MeeGo release), they're
all using GLES for drawing and should be working fine (in regards to issue
discussed in this bug).

Doing such huge changes to a working release (Fremantle) which is in a
bugfixes-only mode is out of question.

Disclaimer: I'm not involved in the development of the drivers and haven't even
used them; information here is at best second hand.


> The fact that it's still labeled "New" and
> "unassigned" is telling all by itself.

For Nokia SW issues, typically the state after those is fixed (or
wontfix/invalid). The intermediate bug states are mainly used by non-Nokia SW
here in bugs.maemo.org (the community bugzilla).
Comment 64 PeteC 2010-12-14 17:49:55 UTC
(In reply to comment #63)
> Does somebody have reliably *reproducible* test-case for this with PR1.3?
> Then it at least can be checked what happens in the device.

I had a 100% reliable test-case for PR1.2 on MY device and got tired of asking
for ways to help you diagnose it further. If Nokia ever manages to release a
FIASCO image for the UK variant of PR1.3 I'm sure I'll be able to reproduce it
in PR1.3 too.
Comment 65 Eero Tamminen nokia 2010-12-14 19:14:22 UTC
(In reply to comment #64)
> (In reply to comment #63)
> > Does somebody have reliably *reproducible* test-case for this with PR1.3?
> > Then it at least can be checked what happens in the device.
> 
> I had a 100% reliable test-case for PR1.2 on MY device

If you refer to comment 31, you said that you had older version of Maep which
behaved badly.  It leaked memory like a sieve and did frequent window updates in
the background.  I had reported a bug against it as such behavior is obvious
performance & use-time issue for the whole device and that got fixed.  After it
was fixed, the issue didn't happen again.


> and got tired of asking for ways to help you diagnose it further.

There are many easy ways to find bad behavior[1] in apps.
- memory leakage: sp-endurance
- background activity & window updates: strace & xresponse

See: http://wiki.maemo.org/Documentation/devtools/maemo5


(For diagnosing SGX reset issues, none that I know.)
Comment 66 Mayuresh 2010-12-14 19:41:11 UTC
> I had a 100% reliable test-case for PR1.2 on MY device and got tired of asking

You mean, you have exact steps to reproduce the issue? Can you share them?

I have been able to trigger the issue on many occasions till now though don't
have a reliable way to reproduce it.

Mayuresh.
Comment 67 Craig Woodward 2010-12-16 01:16:53 UTC
(In reply to comment #63)
> Maybe in the “OMAP35x Graphics SDK Getting Started Guide” or other documents
> referred by it?

Finally, something useful.  At least that I can look up and see if there's a
bandaid one can put on this mess.  Sadly, it looks like they only are offering
the 4.0 SDK, which is not what we're based on.  But it may have something
related at least.  (And yes, I have a TI developer account, as I work with TI
chipsets routinely in my "real job".)

(In reply to comment #63) 
> The kernel driver is AFAIK open source.  The user space libraries are closed
> source.

Really?  Would you happen to know its source file name? From everything I've
found this is a closed source module.  I can't imagine that the kernel driver
is open source, since it's the part directly touching the hardware.  If it IS
open, I (or the community) could at look at it and see if we can find where
it's spinning and patch the community based kernel. 

(In reply to comment #63) 
> Crashing e.g. X server because some other app caused GFX
> pipeline stall from which system might recover, would cause
> device reboot.  Not very desirable.

Actually, YES! Yes it IS more desirable to have it reboot.  What's NOT
desirable is to have it spin and drain the battery.

Tell me, which is more desirable to you:  A device that sometimes reboots
randomly but usually works fine, or one that you can't rely on because it
randomly and *silently* drains its entire battery?  My vote is to reboot.

(In reply to comment #63)
> Does somebody have reliably *reproducible* test-case for this with PR1.3?
> Then it at least can be checked what happens in the device.

I've had it happen while using MicroB and/or modRana, and on occasion while
just sitting with nothing running user side. But then you *have* a reliable way
to reproduce this: an old version of Maep.

(In reply to comment #65)
> you said that you had older version of Maep which behaved badly.

No matter how poorly a program is written, a USER SPACE program should never be
able to cause a kernel to spin/lock.  THIS IS A KERNEL ISSUE.  Even older
systems have measures in place to prevent user programs like "while(1) fork();"
from bringing down a system.  This is no different.

I'm really tired of the game you're playing here in blaming apps.  If you had a
car that sometimes shutdown randomly, but did so reliably if you drove it over
70kmh, your mechanic saying "Well, don't go over 60kmh!" wouldn't fly.  That's
essentially what you're saying here.

A key kernel driver is spin-locking under load, and your reply is "lighten the
load by filing bugs against other programs".  This issue CAN be reproduced
reliably with a known (admittedly buggy) version of software.  You have the
tools you need to fix it, but are asking that the tools be fixed to accommodate
the bug instead! 

If the issue is that UNDER LOAD the hardware resets or faults, saying "don't
stress the device" is not a solution.  The solution is fixing the kernel driver
to handle the fault state, or limit the load, or in a worst case make it
eventually give up and trigger a reboot.

SPINNING FOREVER IS NOT AN ACCEPTABLE SOLUTION FOR A KERNEL DRIVER, EVER.  And
that's exactly what's happening here!

(In reply to comment #63) 
> Why turn-by-turn navigation needs GL?

Way to be snarky... Look at the apps triggering this issue.  Most often they
have been nav apps.  Maep, mappero, modRana, are ALL apps that trigger this,
and ALL apps that revolve around navigation.  Most of them exist only because
the OVI maps app lacks (among many other things) turn-by-turn.

None of the above mentioned apps may actually NEED GL.  Most of them NEED to do
screen drawing though, and use existing libraries (like QT) that use GL in
doing drawing on their behalf.  Even if this is caused by X updating the screen
on behalf of an app, the issue still persists.

(In reply to comment #63) 
> Doing such huge changes to a working release (Fremantle) which is in a
> bugfixes-only mode is out of question.

Define working.  I'm not sure I'd classify a mobile device that randomly drains
its battery as "working".  This is a critical bug.  If you're not going to fix
a critical level bug while in "bugfixes-only mode", then why have such a mode?

Nobody is asking for a back-port of the entire MeeGo display driver!  We're
asking for support in stopping this *broken* behavior.  Even a solution that
triggers the watchdog and reboots the device when it spins in this state would
be preferable to what it does now.
Comment 68 egoshin 2010-12-16 02:08:21 UTC
About crash-and-reboot - 

Hey, hey - it is not the only option, it is still possible to configure upstart
in a way which would cause a restart of X11 and GUI apps like hildon etc after
X11 failure-and-crash, but keep running server applications.

I am not sure, but it seems that MOST of GUI apps are able to safely restart
via DSME tools. At least some thread in talk.maemo.org has a script to
periodically restart some amount of GUI and servers to avoid fragmentation in
swap and memory.
Comment 69 Sami Liedes 2010-12-16 03:10:45 UTC
+1 for reboot being way preferable to the current behavior.

(In reply to comment #67)
> I've had it happen while using MicroB and/or modRana, and on occasion while
> just sitting with nothing running user side. But then you *have* a reliable way
> to reproduce this: an old version of Maep.

I haven't been able to notice a difference between old and new (including the
latest) versions of Maep; both trigger this bug routinely.
Comment 70 Mayuresh 2010-12-16 06:09:45 UTC
> I haven't been able to notice a difference between old and new (including the
> latest) versions of Maep; both trigger this bug routinely.

Bump.

STOP CLAIMING MAEP FIXED THE PROBLEM ON THIS THREAD.

It triggers with maep as well.

It has been reported with microb as well and at least once with ovi maps as
well somewhere on above thread.

I also have top logs where the issue got triggered with modrana taking just 13%
memory. Not a stressful situation by any standards. All the times the issue got
triggered, there was some activity on the GPRS/wifi connection such as
reconnecting, poor signal quality etc. High resource usage could be playing
some part, but it's not only that.

I agree with comment 67. Either provide at least a half-decent navigation app or
stop passing the buck to apps that are doing a good job of filling the void. It
has been reasoned a number of times on this thread why the kernel has to be
responsible for this.
Comment 71 Mayuresh 2010-12-16 07:05:12 UTC
> I also have top logs where the issue got triggered with modrana taking just 13%
> memory. Not a stressful situation by any standards. All the times the issue got
> triggered, there was some activity on the GPRS/wifi connection such as
> reconnecting, poor signal quality etc. High resource usage could be playing
> some part, but it's not only that.

Want to add one more point. If I simply keep internet connection switched off,
no matter how much stress (memory/cpu) I cause on the device I have never
managed to trigger the issue. I can achieve this by panning fast in modrana for
a long time till it takes a lot of memory and finally appears hung. Though in
such instances the device is NOT hung. You can recover from this situation. In
fact you get yellow message box saying that new application can't be allocated
memory etc. That's quite acceptable.

Now just keep the internet connection on. No matter what the stress level is,
the issue may trigger. There have been occasions of triggering the issue with as
low memory usage by modrana as 13% and within a matter of minute of launching
it.
Comment 72 Xavier Bestel 2010-12-16 11:58:57 UTC
I also would really prefer the device to reboot.
The current behavior means:
- if I don't see it right away, the battery becomes empty fast
- if I see it quickly enough, I have to pull out the battery (the only means to
reboot the device), and put it back, which means setting the date and time
again, because the N900 doesn't do it itself (I can't remember which bug# it
is). Awfully annoying, I have lots of photos dated 1/1/2009 because of
this.
Comment 73 Eero Tamminen nokia 2010-12-16 12:49:08 UTC
(In reply to comment #67)
>> The kernel driver is AFAIK open source.  The user space libraries
>> are closed source.
> 
> Really?  Would you happen to know its source file name?
> From everything I've found this is a closed source module.
> I can't imagine that the kernel driver is open source, since
> it's the part directly touching the hardware.

In that case it's better to rely on facts than on hearsay or imagination. :)

2 minutes of googling gives the used upstream kernel and changes on top of it:
http://repository.maemo.org/pool/fremantle/free/k/kernel/kernel_2.6.28-20103103+0m5.diff.gz

kernel-2.6.28/drivers/gpu/pvr/module.c seems like a good starting point.


> (In reply to comment #63) 
> > Crashing e.g. X server because some other app caused GFX
> > pipeline stall from which system might recover, would cause
> > device reboot.  Not very desirable.
> 
> Actually, YES! Yes it IS more desirable to have it reboot.  What's NOT
> desirable is to have it spin and drain the battery.
> 
> Tell me, which is more desirable to you:  A device that sometimes reboots
> randomly but usually works fine, or one that you can't rely on because it
> randomly and *silently* drains its entire battery?  My vote is to reboot.

Have you checked how many times in your case the HW would recover within
reasonable time or how many times it has happened even without you noticing it
from the UI (I think you should see this from syslog)?  If device is just
rebooted, you lose all unsaved data in your apps.

Bug 7017 would also indicate that it might be possible that reboot doesn't
always fix the issue (in which case device with your suggested solution would
be in reboot loop and anyway empty the battery).  Based on comments in that
bug, it would seem to be fixed for PR1.3 (when not meddling with device
voltages)

Regarding bug 7017, has anybody encountering this issue modified device CPU
voltage or MHz values?


> (In reply to comment #63)
>> Does somebody have reliably *reproducible* test-case for this with PR1.3?
>> Then it at least can be checked what happens in the device.
> 
> I've had it happen while using MicroB and/or modRana, and on occasion while
> just sitting with nothing running user side. But then you *have* a reliable
> way to reproduce this: an old version of Maep.

Not in normal usage.  It had to be attached to charger so that I could run it
over a day.  (that was in spring, before PR1.2. If I remember correctly, it
didn't happen with PR1.2.)


> (In reply to comment #65)
> I'm really tired of the game you're playing here in blaming apps.
> If you had a car that sometimes shutdown randomly, but did so
> reliably if you drove it over 70kmh, your mechanic saying
> "Well, don't go over 60kmh!" wouldn't fly.  That's essentially
> what you're saying here.

And if the reason is that you kept brake down all the time and didn't use
clutch while changing the gears, guess what the mechanic would say? ;-)


Buggy user processes can DOS a normally configured Linux system very easily[1].
The reason you don't see such applications on the desktop is that they
typically don't get into distro repos if they behave too badly.

[1] I can easily think of ways a process can DOS Linux with:
- D-BUS (lots of messages which the recipient doesn't process),
- X server (server or input grab, or just lots of requests),
- disk (writing huge amounts of data without madvise when you have a slow disk),
- memory (constantly dirtying more memory than the system has RAM),
- GL usage (well, this breaks things even with apps accepted to distros), or
- running at a higher priority without behaving appropriately for it.

I've myself encountered all of these, but I'm sure there are many other ways a
bad program can DOS our modern multi-purpose operating systems.

X & D-BUS don't offer mechanisms to control the above, and because there can be
valid reasons for some programs to (rarely and temporarily) do the rest of
these things, they are allowed.

If you have programs that only together cause this kind of issue, they also
need to be fixed, but finding out about the issue is much harder and you need
to judge what is and isn't justified in the program.

But for example drawing to windows when they aren't visible isn't justified
under any circumstances.


> A key kernel driver is spin-locking under load, and your reply is
> "lighten the load by filing bugs against other programs".

As I stated, AFAIK:
- One of the conditions where HW/driver tries to fix issues
  by resetting SGX is when an operation is too slow
- This can be caused by application asking HW to do more than
  is reasonable (operation it asks is too slow)
- Operations are pipelined to the HW, so that current SGX user
  isn't necessarily the cause for the stall
- Therefore kernel has no idea which process is causing the issue
  (killing processes randomly isn't reasonable either)


> This issue CAN be reproduced reliably with a known (admittedly buggy)
> version of software.

I'm not working for Fremantle anymore[1], but when I was, I couldn't reproduce
these in "normal" conditions.  I've also understood that Imagination wasn't
able to reproduce the issue with older Maep for newer driver releases (like one
in PR1.2 I think).

If these issues were easily reproducible (say within a day) by the developers,
of course they would have been fixed earlier.  I think there are also other
conditions that need to be present for these issues to happen.


> (In reply to comment #63) 
> > Why turn-by-turn navigation needs GL?
> 
> Way to be snarky... Look at the apps triggering this issue.
> Most often they have been nav apps.  Maep,

Maep works fine for me. I've never seen SGX reset in normal usage.  And Maep's
so nice that I don't use the others[1].

[1] Take my comments as from another, potentially more knowledgeable, N900
user. :-)


> mappero,

Mappero I know to use Clutter i.e. GL.


> modRana, are ALL apps that trigger this,

If it's this:
http://maemo.org/packages/package_instance/view/fremantle_extras-devel_free_armel/modrana/0.20-3/

It actually would seem to be using python & cairo, not GL.


> and ALL apps that revolve around navigation.
[...]
> None of the above mentioned apps may actually NEED GL.  Most of them
> NEED to do screen drawing though, and use existing libraries (like QT)
> that use GL in doing drawing on their behalf.  Even if this is caused
> by X updating the screen on behalf of an app, the issue still persists.

If any of them do window updates on background, they're just broken and
should be fixed.  Regardless of whether that triggers SGX resets.


> (In reply to comment #63) 
> Nobody is asking for back-port of the then entire MeeGo display driver!
> We're asking for support in stopping this *broken* behavior.

I've understood that the case I referred to above (too slow drawing by app)
isn't something that will be fixed in the SGX drivers.  Expectation of things
working faster than a certain (very low/unusable) minimum speed is AFAIK built
into the (user-space) driver and cannot be "fixed".  I would assume such buggy
apps should be pretty obvious to users though, as they always draw things too
slowly & cause the issue.
Comment 74 Eero Tamminen nokia 2010-12-16 13:01:19 UTC
(In reply to comment #71)
>> I also have top logs where the issue got triggered with modrana taking
>> just 13% memory. Not a stressful situation by any standards. All the times
>> the issue got triggered, there was some activity on the GPRS/wifi connection
>> such as reconnecting, poor signal quality etc. High resource usage could be
>> playing some part, but it's not only that.

You've verified this is the whole device UI freeze caused by SGX issues e.g. by
sshing to the device and seeing sgx kernel thread highest in top or SGX resets
in dmesg?

You haven't modified device CPU MHz or voltages?


> Want to add one more point. If I simply keep internet connection switched off,
> no matter how much stress (memory/cpu) I cause on the device I have never
> managed to trigger the issue.
...
> Now just keep the internet connection on. No matter what the stress level is,
> the issue may trigger. There have been occasions of triggering the issue with as
> low memory usage by modrana as 13% and within a matter of minute of launching
> it.

Very interesting.  Does this happen with both GPRS and Wifi or only with one
of them?  Or only when the device is moving and switching between base
stations?

(I'm wondering whether there could be any relation to bug 9116 / could this
issue actually be related to radios instead of SGX.)

Does it happen also when battery is full?

(I'm wondering whether battery consumption could affect it.)
Comment 75 Eero Tamminen nokia 2010-12-16 13:05:39 UTC
(In reply to comment #68)
> Hey, hey - it is not the only option, it is still possible to configure upstart
> in a way which would cause a restart of X11 and GUI apps like hildon etc after
> X11 failure-and-crash, but keep running server applications.

UI related startup takes most of the bootup time.  If one is re-starting the
whole UI session and losing all the user's unsaved data anyway, it would be much
safer just to do a full reboot in such a stressed situation.
Comment 76 Mayuresh 2010-12-16 15:33:54 UTC
(In reply to comment #74)
> (In reply to comment #71)
> >> I also have top logs where the issue got triggered with modrana taking
> >> just 13% memory. Not a stressful situation by any standards. All the times
> >> the issue got triggered, there was some activity on the GPRS/wifi connection
> >> such as reconnecting, poor signal quality etc. High resource usage could be
> >> playing some part, but it's not only that.
> 
> You've verified this is the whole device UI freeze caused by SGX issues e.g. by
> sshing to the device and seeing sgx kernel thread highest in top or SGX resets
> in dmesg?

In the above instance I mentioned, it wasn't sgx; it was the X server that was
consuming a lot of CPU. (I have already provided the details in one of the
posts above.) Nevertheless, as a consumer it means the same to me.

> 
> You haven't modified device CPU MHz or voltages?

No. And I don't know how to.


> > Want to add one more point. If I simply keep internet connection switched off,
> > no matter how much stress (memory/cpu) I cause on the device I have never
> > managed to trigger the issue.
> ...
> > Now just keep the internet connection on. No matter what the stress level is,
> > the issue may trigger. There have been occasions of triggering the issue with as
> > low memory usage by modrana as 13% and within a matter of minute of launching
> > it.
> 
> Very interesting.  Does this happen with both GPRS and Wifi or only with the
> other one?  Or only when the device is moving and switching between base
> stations?

Noticed more often with GPRS and noticed when roaming i.e. possibly related to
the quality of network. (I have already logged these details on this thread.)

> (I'm wondering whether there could be any relation to bug 9116 / could this
> issue actually be related to radios instead of SGX.)
> 
> Does it happen also when battery is full?
> 
> (I'm wondering whether battery consumption could affect it.)
Yes.
Comment 77 Eero Tamminen nokia 2010-12-16 17:51:11 UTC
(In reply to comment #76)
>> You've verified this is the whole device UI freeze caused by SGX issues
>> e.g. by sshing to the device and seeing sgx kernel thread highest in top
>> or SGX resets in dmesg?
> 
> In above instance I mentioned it wasn't sgx, it was X server that was
> consuming a lot of CPU.
>
> (I have already provided the details in one of the posts above.)

You had above commented about both the SGX and the other issues, so I wasn't
sure which one this is.

So, to verify again, the SGX issue isn't reproducible for you, but you have
some other (potentially X usage related) issue which is reproducible?


> Nevertheless, as a consumer it means the same to me.

If you don't see SGX resets or sgx_misr kernel thread using all CPU, it's a
separate issue which should be discussed in separate bug.  This bug is only
about the (currently non-reproducible) SGX issue.
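
For anyone wanting to automate that check, here is a minimal sketch that counts
SGX recovery messages in text piped from dmesg or syslog. It assumes kernel
timestamps are enabled and that the message looks like the one quoted in the
bug description ("[100992.835906] HWRecoveryResetSGX: SGX Hardware Recovery
triggered"). The function name sgx_reset_times is made up for illustration.

```python
import re
import sys

# Matches kernel lines such as:
#   kernel: [100992.835906] HWRecoveryResetSGX: SGX Hardware Recovery triggered
# Assumption: printk timestamps ("[seconds.micros]") are present in the log.
SGX_RESET = re.compile(r"\[\s*(\d+\.\d+)\]\s+HWRecoveryResetSGX:")


def sgx_reset_times(log_text):
    """Return the kernel timestamps (seconds since boot) of SGX recovery messages."""
    return [float(m.group(1)) for m in SGX_RESET.finditer(log_text)]


if __name__ == "__main__":
    times = sgx_reset_times(sys.stdin.read())
    print("%d SGX reset(s) at: %s" % (len(times), times))
```

It could be run as e.g. "dmesg | python sgx_resets.py" (the file name is
arbitrary). Zero matches during a freeze would suggest the freeze is the
separate, non-SGX issue.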
Comment 78 Sami Liedes 2010-12-16 18:06:20 UTC
(In reply to comment #77)
> So, to verify again, the SGX issue isn't reproducible for you, but you have
> some other (potentially X usage related) issue which is reproducible?
> 
> If you don't see SGX resets or sgx_misr kernel thread using all CPU, it's a
> separate issue which should be discussed in separate bug.  This bug is only
> about the (currently non-reproducible) SGX issue.

When this happens to me with recent versions of Maep (which is all the time if
I let the screen go dark while navigating), I see a SGX reset message in dmesg
and top showing X taking all CPU in D state. I don't think that necessarily
means it's not sgx_misr kernel thread using all CPU. (See also my comment #40)
Comment 79 Eero Tamminen nokia 2010-12-16 18:46:32 UTC
(In reply to comment #78)
> > If you don't see SGX resets or sgx_misr kernel thread using all CPU, it's a
> > separate issue which should be discussed in separate bug.  This bug is only
> > about the (currently non-reproducible) SGX issue.
> 
> When this happens to me with recent versions of Maep (which is all the time if
> I let the screen go dark

Hm.  Is Maep still/again doing window updates when its window isn't visible?

Could you check that from e.g. ssh console with "xresponse -w 0 -a '*'"[1] and
if Maep indeed is doing window updates while screen is blanked or it's on
background, file a use-time & performance [2] bug against Maep?

[1] http://wiki.maemo.org/Documentation/devtools/maemo5/xresponse

[2] Because a window update means that a large amount of memory is actively
(but unnecessarily) used, this is also a device memory usage issue. Since it
triggers the SGX issue discussed here, it's also a reliability issue.

-> Non-visible window updates are really evil.


> while navigating),

So you need to be moving to get this issue?  Do you need to have network also
enabled?


> I see a SGX reset message in dmesg and top showing X taking all CPU
> in D state.

In that case X isn't taking CPU (unlike for Mayuresh) as it's in D state, and
you have SGX reset.  So sounds like an issue that belongs to this bug.


> I don't think that necessarily means it's not sgx_misr kernel thread
> using all CPU.

There are two SGX issues.  One where you see sgx_misr kernel thread taking all
CPU (which I've never seen myself, just mentioned here in bugzilla) and SGX
resets (which I was able to reproduce in spring).  I think both can be handled
in this bug.
Comment 80 Mayuresh 2010-12-16 19:33:32 UTC
(In reply to comment #77)

> So, to verify again, the SGX issue isn't reproducible for you, but you have
> some other (potentially X usage related) issue which is reproducible?

Hang on. I didn't say sgx issue is not reproducible. In fact in comment 54 I
already said that there are 2 (and possibly multiple) different issues with
same end-user perceived symptom.

I just said that in one of the logs when the device hung, sgx was not taking
100% CPU. But in some other log it was.

My point is, modrana WASN'T taking excessive memory in EITHER of the scenarios.

E.g. following are 3 consecutive snapshots of top log produced at 5s interval
in that log where sgx GOT triggered:

Snapshot 1: sgx is yet to trigger. NOTE: modrana memory 20% NOT HIGH

  884   713 root     R <  14036  5.7 85.5 /usr/bin/Xorg -logfile /tmp/Xorg.0.log
 3621  3620 user     S    50476 20.4 11.1 python2.5 modrana.py n900
  767     1 messageb S <   2100  0.8  0.6 /usr/bin/dbus-daemon --system --nofork
 2231  2228 root     R      740  0.3  0.6 top
 1153  1028 user     S     9624  3.9  0.4 /usr/bin/hildon-desktop
  878     2 root     SW       0  0.0  0.4 [sgx_misr]

Snapshot 2: It's just triggering, X and sgx are seen competing for CPU. NOTE:
modrana memory 20% NOT HIGH

  878     2 root     RW       0  0.0 47.2 [sgx_misr]
  884   713 root     D <  13488  5.4 44.2 /usr/bin/Xorg -logfile /tmp/Xorg.0.log
 3621  3620 user     S    50476 20.4  5.7 python2.5 modrana.py n900

Snapshot 3: sgx took over and it's all over now ... NOTE: modrana memory 20%
NOT HIGH

  878     2 root     RW       0  0.0 97.6 [sgx_misr]
  767     1 messageb S <   2100  0.8  0.8 /usr/bin/dbus-daemon --system --nofork
 2231  2228 root     R      740  0.3  0.4 top
 1142  1028 user     S    10004  4.0  0.2 /usr/bin/hildon-status-menu
 1799  1028 user     S     8008  3.2  0.2 /usr/bin/osso-xterm
  757     1 root     S     1072  0.4  0.2 /usr/sbin/bme_RX-51
 5155  2236 root     S      528  0.2  0.2 powertop
 3621  3620 user     S    50476 20.4  0.0 python2.5 modrana.py n900
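
Snapshots like these can also be checked mechanically. A rough sketch,
assuming the column order is PID, PPID, USER, STAT, VSZ, %MEM, %CPU, COMMAND
(inferred from the notes above, where modrana's 20.4 is its memory share).
The names parse_top and sgx_misr_cpu are illustrative only.

```python
import re

# One process row of the top output shown above. The STAT column may carry a
# detached "<" or "N" priority flag (e.g. "R <"), so that is allowed for here.
# Assumed column order: PID PPID USER STAT VSZ %MEM %CPU COMMAND.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+(?:\s+[<N])?)\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+(.+)$"
)


def parse_top(snapshot):
    """Parse process rows; lines that are not process rows are ignored."""
    rows = []
    for line in snapshot.splitlines():
        m = ROW.match(line)
        if m:
            pid, ppid, user, stat, vsz, mem, cpu, cmd = m.groups()
            rows.append({"pid": int(pid), "user": user, "stat": stat,
                         "mem": float(mem), "cpu": float(cpu), "cmd": cmd})
    return rows


def sgx_misr_cpu(snapshot):
    """Return the %CPU of the [sgx_misr] kernel thread, or None if absent."""
    for row in parse_top(snapshot):
        if row["cmd"].startswith("[sgx_misr]"):
            return row["cpu"]
    return None
```

Feeding each snapshot to sgx_misr_cpu() and flagging values near 100 would
separate the sgx_misr case from the case where only X is busy.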


> > Nevertheless, as a consumer it means the same to me.
> 
> If you don't see SGX resets or sgx_misr kernel thread using all CPU, it's a
> separate issue which should be discussed in separate bug.  This bug is only
> about the (currently non-reproducible) SGX issue.

Fair enough, though I think the title of the bug doesn't say sgx. It's about an
end-user perceived behavior. Until the issue is completely analyzed one won't
know whether to treat them differently. Nevertheless, I'll confine myself for
now to those instances where sgx triggers.


I also am not sure why it has suddenly been referred to as "non-reproducible"
while so much discussion is still going on about it?
Comment 81 Sami Liedes 2010-12-16 20:31:06 UTC
(In reply to comment #79)
> Hm.  Is Maep still/again doing window updates when its window isn't visible?
> 
> Could you check that from e.g. ssh console with "xresponse -w 0 -a '*'"[1] and
> if Maep indeed is doing window updates while screen is blanked or it's on
> background, file a use-time & performance [2] bug against Maep?

Ok, I'll try that. At least with Maep 1.3.6 I'm still seeing these freezes
frequently.

> So you need to be moving to get this issue?  Do you need to have network also
> enabled?

I think I've only ever seen this when moving. I've actually been thinking of
trying to code a test case which only simulates the GPS receiver moving; I
hypothesize that could be used to trigger this. So far I've had too many other
things to do to investigate this, though.

If by network you mean either of wifi or 3G, I think at least most of the time
I've had 3G connectivity when I've seen this happen. I don't think I've ever
tried (or seen) it without 3G connection. Network usage is a possibility; Maep
downloads the tiles it draws if they are not found in the cache.

> > I see a SGX reset message in dmesg and top showing X taking all CPU
> > in D state.
> 
> In that case X isn't taking CPU (unlike for Mayuresh) as it's in D state, and
> you have SGX reset.  So sounds like an issue that belongs to this bug.

Yet I seem to remember top showing X taking 100% CPU, but in D state (I believe
that can happen also when the cpu time is spent in kernel as result of a
syscall by X). I'll recheck this too the next time I have the opportunity to.

> There are two SGX issues.  One where you see sgx_misr kernel thread taking all
> CPU (which I've never seen myself, just mentioned here in bugzilla) and SGX
> resets (which I was able to reproduce in spring).  I think both can be handled
> in this bug.

Ah, that clarifies.

Thank you for still looking into this. While the tone in some comments here is
slowly becoming aggressive-ish, and while I'm myself disappointed that the
response has seemed too often to be mostly "please wait for the next PR which
contains some fixes to something and retest", it's still more than I've come to
expect of many proprietary software vendors. While we open source people are
obviously frustrated when we can't just fix broken things ourselves, I'm still
optimistic about this bug because Nokia seems to be still engaged in resolving
this.
Comment 82 Sami Liedes 2010-12-16 21:32:44 UTC
(In reply to comment #81)
> (In reply to comment #79)
> > Hm.  Is Maep still/again doing window updates when its window isn't visible?
> > 
> > Could you check that from e.g. ssh console with "xresponse -w 0 -a '*'"[1] and
> > if Maep indeed is doing window updates while screen is blanked or it's on
> > background, file a use-time & performance [2] bug against Maep?
> 
> Ok, I'll try that. At least with Maep 1.3.6 I'm still seeing these freezes
> frequently.

Yeah, I do see damage events from maep while the screen is blanked:

 977414476ms : 38988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977415515ms :  1039ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977416491ms :   976ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977421480ms :  4989ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977422489ms :  1009ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977423473ms :   984ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977424483ms :  1010ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977425487ms :  1004ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977433467ms :  7980ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977437478ms :  4011ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977447466ms :  9988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977448475ms :  1009ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977449507ms :  1032ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977450481ms :   974ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977470482ms : 20001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977475470ms :  4988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977479481ms :  4011ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977480470ms :   989ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977488475ms :  8005ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977489492ms :  1017ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977490496ms :  1004ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977492484ms :  1988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977493479ms :   995ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977499470ms :  5991ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977506483ms :  7013ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977507487ms :  1004ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977508490ms :  1003ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977509481ms :   991ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977510489ms :  1008ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977511496ms :  1007ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977513484ms :  1988ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977514482ms :   998ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977517476ms :  2994ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977518475ms :   999ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977521481ms :  3006ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977525482ms :  4001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977529467ms :  3985ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977531482ms :  2015ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977532479ms :   997ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977533489ms :  1010ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977535489ms :  2000ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977538470ms :  2981ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977542472ms :  4002ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977552473ms : 10001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977554490ms :  2017ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977556491ms :  2001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977557519ms :  1028ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977558506ms :   987ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977559517ms :  1011ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977562479ms :  2962ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977563489ms :  1010ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977564547ms :  1058ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977568463ms :  3916ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977569485ms :  1022ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977574490ms :  5005ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977575495ms :  1005ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977578487ms :  2992ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977579471ms :   984ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977582481ms :  3010ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977584521ms :  2040ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977585519ms :   998ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
 977586520ms :  1001ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
Comment 83 Sami Liedes 2010-12-16 21:39:33 UTC
(In reply to comment #82)
> Yeah, I do see damage events from maep while the screen is blanked:

This seems to be related to maep's drawing of the (red by default) track. If I
disable track capturing, I no longer see damage events.

I'll see if I can come up with a patch to maep; presumably identifying the
exact operation in Maep that causes this would also help you to fix this bug?
Comment 84 egoshin 2010-12-16 22:20:20 UTC
(In reply to comment #75)
> (In reply to comment #68)
> > Hey, hey - it is not the only option, it is still possible to configure upstart
> > in a way which would cause a restart of X11 and GUI apps like hildon etc after
> > X11 failure-and-crash, but keep running server applications.
> 
> UI related startup takes most of the bootup time, if one is re-starting the
> whole UI session and losing all user's unsaved data anyway, it would be much
> safer just to do full reboot in such a stressed situation.

No -

1) there is a possibility of bug 7017.

2) some SERVER applications also have state and it could be lost as well.

3) I use special ADJUSTMENT scripts after N900 boot, depending on the
situation.
A typical example of this - "stop klogd;stop sysklogd", which I use for
day-to-day running. If I have a problem - I don't run it, record all syslog and
lose battery/performance. Another example - I use a non-regular boot (M32GB) and
it chooses a kernel+system depending on the KBD. And I know many users use
something like bootmenu etc.

So - please find another solution before enforcing a reboot. I suggested one -
restart X11 and all GUI apps, it is not difficult to modify upstart scripts.
Comment 85 Eero Tamminen nokia 2010-12-17 11:06:34 UTC
(In reply to comment #80)
> Fair enough though I think title of the bug doesn't say sgx.

It does?


> It's about an end user perceived behavior.

Which is/shows up as SGX issue.  Please read the original bug report and title.


> Till the issue is completely analyzed one won't know whether
> to treat them different.

If you don't see the SGX issues, it's a different one.  It's easy for a buggy /
bad program to DOS a Linux system just by misusing e.g. X as X doesn't have any
mechanisms to prevent that from clients that user allows to access it.  If
that's the case, that app needs to be fixed not to DOS the device X server.


> I also am not sure why it has suddenly been referred as "non-reproducible"
> while so much discussion is still going on about it?

People encounter this issue now and then, but so far there haven't been any
steps with which one could reliably reproduce it with PR1.2 or later, e.g. on a
device that is just lying on the desk where it could be debugged with HW
debuggers.

For example I haven't seen these issues, except last spring before PR1.2, with
old Maep, in a way that isn't normal device usage (device on charge over a day
with Maep running) and even then it wasn't reliably reproducible.
-> people are omitting some crucial information on how to reproduce the issues,
if it really is reproducible for them.


(In reply to comment #83)
> (In reply to comment #82)
>> Yeah, I do see damage events from maep while the screen is blanked:

Thanks for testing!  Does it do that also when some other application is on top
& active?

Doing window updates on bg takes a lot more CPU away from the top app than just
keeping track of the position.  It dirties memory (e.g. window composite
buffer) and therefore causes unnecessary memory pressure (and potentially extra
swapping because of that).

(Nokia Maps has its own faults, but it doesn't have this kind of crappy
behavior for a mobile application.  Neither does any other of the pre-installed
apps.)

You might also want to check with strace whether it is waking up even more often
than this 1 sec interval, because that would make it even more of a use-time issue.
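
One rough way to check the update interval from an xresponse log like the one in comment #82 is to compute the gaps between damage events per client. This is only a sketch in Python: the regular expression is an assumption based on the line format pasted above, and the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the log pasted in this bug:
#  977511496ms :  1007ms : Got damage event 800x424+0+56 from 0x3a0001a (maep)
DAMAGE_RE = re.compile(
    r"^\s*(\d+)ms\s*:\s*\d+ms\s*:\s*Got damage event \S+ "
    r"from 0x[0-9a-fA-F]+ \((.*?)\)\s*$"
)

def damage_gaps(lines):
    """Return {client: (shortest_gap_ms, mean_gap_ms)} for each client
    that produced at least two damage events."""
    last_seen = {}             # client -> timestamp of its previous event
    gaps = defaultdict(list)   # client -> list of gaps in ms
    for line in lines:
        m = DAMAGE_RE.match(line)
        if not m:
            continue           # ignore anything that is not a damage event
        stamp, client = int(m.group(1)), m.group(2)
        if client in last_seen:
            gaps[client].append(stamp - last_seen[client])
        last_seen[client] = stamp
    return {c: (min(g), sum(g) / len(g)) for c, g in gaps.items()}
```

Gaps are computed from the first column (absolute time) rather than the second, so they stay per-client even when several applications are logging at once.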


> This seems to be related to maep's drawing of the (red by default) track.
> If I disable track capturing, I no longer see damage events.

I guess you encounter the UI freeze only when you have tracking enabled...?


Btw. X does some things differently when the screen is blanked in regards to
SGX.  Have you had the SGX related UI freeze only when screen is blanked or
does it get triggered also when screen is kept ON?


> I'll see if I can come up with a patch to maep; presumably identifying the
> exact operation in Maep that causes this would also help you to fix this bug?

Not really.  Only thing to help in that would be exact steps to trigger the
issue that are 100% reproducible for anybody, not just "this happens randomly
to me when using app X".  I'm not anymore related to Fremantle development but
my educated guess is that this issue won't get resolved on the SGX side for
Fremantle, that kind of steps would have been needed right after PR1.2 release
for things to proceed.

However, the apps can and should be fixed because their bad behavior has also
many other downsides besides the UI freeze. If Maep stops triggering the issue,
that's a really good thing in itself and can be used as an example of how to
fix the other apps to behave like mobile apps should.

I mean, who wouldn't want his navigation software to consume slightly less
battery[1] so that you could use it longer?

[1] Battery consumption effect goes typically in this order: display backlight,
radios (3g, wlan, bt, gps etc), CPU wakeups/usage (e.g. on previous devices
having one process do 1Hz wakeups for couple of syscalls on an otherwise idle &
offline device reduced the device use-time from a week to 1 day).
Comment 86 Mayuresh 2010-12-17 14:54:57 UTC
(In reply to comment #85)

> Which is/shows up as SGX issue.  Please read the original bug report and title.

Point taken. I'll keep it to instances when sgx takes 100% cpu.

The point is prior to sgx triggering firstly X server went to take all the cpu
and then in one instance sgx triggered and in the other it didn't.

If you insist one should see these differently, let's talk about only those
where sgx triggered.

My point still remains. modrana WASN'T taking high memory at that time as was
speculated in some of the posts on this thread.

I also have powertop, xresponse and dbus-monitor logs of all these instances. I
can post information that may be useful to analyze this issue.

Mayuresh.
Comment 87 Eero Tamminen nokia 2010-12-17 18:38:11 UTC
(In reply to comment #86)
> The point is prior to sgx triggering firstly X server went to take all the cpu
> and then in one instance sgx triggered and in the other it didn't.

Any idea what app was causing this and whether it was on foreground or on
background?  Was screen lit or blanked?


> My point still remains. modrana WASN'T taking high memory at that time as was
> speculated in some of the posts on this thread.

It doesn't necessarily need to be the program itself, but the device as whole 
actively using a lot of memory / swapping while some other things[1] happen
seemed as one variable affecting this issue (at least in spring when I was
looking a bit into this issue).  I would guess that high memory usage isn't the
trigger but it makes the issue more likely to happen.

[1] like non-visible window updates.
Comment 88 Mayuresh 2010-12-17 19:43:22 UTC
(In reply to comment #87)
> (In reply to comment #86)
> > The point is prior to sgx triggering firstly X server went to take all the cpu
> > and then in one instance sgx triggered and in the other it didn't.
> 
> Any idea what app was causing this and whether it was on foreground or on
> background?  Was screen lit or blanked?

I logged this in comment 48.

I managed to keep the display ON with an applet that has this option. (Newer
versions of modrana have this option in the app itself.) The application was in
the foreground, and when the issue triggered, the fully lit display (which
showed only the app and nothing else) was frozen.

If at all I find a correlation of this issue with other factors - it is not
with background updates, it is not with display being switched off or lit, it
is not with high memory usage (of navig app at least), it is seen quite
repeatedly that it is with some events on the network - poor connectivity
situations in particular, it's relatively rare in areas with good connectivity.
I can regularly see this when traveling outside city areas when the likelihood
of this issue triggering increases.

Has anyone on this list ever seen this issue with network connection switched
off? If not why don't we look at that aspect?

No matter whether I leave display on or off and keep a navig app running for
hours, no matter if I manage to take modrana memory usage up to 65% or so by
continuous panning, nothing ever locks the device if network is switched off.

This whole episode reminds me of the story of six blind men and an elephant, as
many of us notice different aspects of the issue. Maybe there is a background
screen update issue, but maybe the navig app doesn't do it; something pops up,
or tries to pop up, when the network disconnects and reconnects while a
foreground app is making heavy use of the screen. Something like that. Just
speculating.

Pending this issue I have started using modrana regularly only in offline mode.
I cache the data on home wifi network and when going out just switch off the
network access from modrana fully. Then I never face any problem.
Comment 89 Craig Woodward 2010-12-20 22:58:58 UTC
(In reply to comment #73) 
> 2 minutes of googling gives the used upstream kernel and changes on top of it:
> http://repository.maemo.org/pool/fremantle/free/k/kernel/kernel_2.6.28-20103103+0m5.diff.gz
> 
> kernel-2.6.28/drivers/gpu/pvr/module.c seems like a good starting point.

Thank you for the pointer, I'll start looking there.

(In reply to comment #73) 
> Have you checked how many times in your case the HW would recover within
> reasonable time or how many times it has happened even without you noticing it
> from the UI (I think you should see this from syslog)?  If device is just
> rebooted, you lose all unsaved data in your apps.

Yes, I have looked at the logs.  And yes, there are instances when it resets
and everything is fine.  The issue is when a reset *doesn't* fix it, it just
keeps resetting.  I'm just asking for a simple counter.  If the device has
tried 20 times to reset in a short window (say a minute or two), it's time to
call for a system reset.  That's not that difficult to do.  A static counter,
with a setup in the init code, a ++ in the reset code, and a clear in the draw
exit code.
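
To make the counter idea concrete, here is a rough sketch of the logic, written in Python only for illustration (a real version would live in the C driver). The class and method names are hypothetical and the thresholds are the ones suggested above (20 resets within about two minutes); none of this is taken from the pvr source.

```python
import time

class ResetEscalator:
    """Counts hardware recovery resets and says when to give up and reboot."""

    def __init__(self, max_resets=20, window_s=120.0, clock=time.monotonic):
        self.max_resets = max_resets
        self.window_s = window_s
        self.clock = clock          # injectable so the logic can be tested
        self.count = 0              # the "static counter", set up at init
        self.window_start = None

    def on_hw_reset(self):
        """Call from the reset path. Returns True when a system reset is due."""
        now = self.clock()
        if self.window_start is None or now - self.window_start > self.window_s:
            # Start a new window; older resets no longer count.
            self.window_start = now
            self.count = 0
        self.count += 1             # the "++ in the reset code"
        return self.count >= self.max_resets

    def on_draw_ok(self):
        """Call when a draw completes normally: the recovery worked."""
        self.count = 0              # the "clear in the draw exit code"
        self.window_start = None
```

A single successful recovery clears the counter, so only back-to-back failed recoveries can reach the limit.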

In fact, I've made a script that does just this by waking up every 30 seconds
and collecting the stats from /proc/stat and /proc/<pid>/stat for the sgx_misr
process.  It warns (verbally) via espeak that it's having issues, then reboots
after a short period if not corrected.  I'd just prefer this to be built in...
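
For anyone who wants to build something similar, the CPU share calculation could look roughly like this. It is a Python sketch and not the script mentioned above; the function names are made up, and the field positions follow the /proc/stat and /proc/<pid>/stat layout documented in proc(5).

```python
def total_jiffies(proc_stat_text):
    """Sum the aggregate 'cpu' line of /proc/stat (user .. steal).
    Guest fields are skipped because they are already included in user."""
    fields = proc_stat_text.splitlines()[0].split()
    return sum(int(v) for v in fields[1:9])

def task_jiffies(pid_stat_text):
    """utime + stime (fields 14 and 15) from /proc/<pid>/stat.
    Split after the last ')' because the command name may contain spaces."""
    rest = pid_stat_text.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])

def read_sample(pid):
    """Take one (task, total) sample for the given pid."""
    with open("/proc/%d/stat" % pid) as fh:
        task = task_jiffies(fh.read())
    with open("/proc/stat") as fh:
        total = total_jiffies(fh.read())
    return task, total

def cpu_share(before, after):
    """Fraction of all CPU time the task used between two samples (0.0 .. 1.0)."""
    d_task = after[0] - before[0]
    d_total = after[1] - before[1]
    return d_task / d_total if d_total > 0 else 0.0
```

Taking two samples 30 seconds apart and comparing cpu_share() against a threshold gives the same kind of check; the pid of sgx_misr has to be looked up first, for example from ps output.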

As for Bug 7017, that looks like it's been fixed.  Even if it hasn't when the
device reboots it makes noise, buzzes, and does other things that make it clear
that it's rebooting.  If the device is in a reboot loop in my pocket, it will
get my attention and I can shut it down before it drains battery.  If it's
sitting idle somewhere, then the worst case result is the same: drained
battery.  The more common case I'd bet is it reboots and recovers because of
the hardware reset.

(In reply to comment #73) 
> Buggy user processes can DOS a normally configured Linux system very easily

In each of the methods you listed, you're talking about extreme, purposeful
behaviors.  And in each case, it's a denial of service, where the system is
still running, and a watchdog process can be setup to see/kill such user
processes causing this behavior.

In this case because the culprit is NOT a user process but a kernel level
process, that's not an option.  There's no way to "kill" a misbehaving kernel
process.  Let's be clear here.  It's not Maep, or modRana, or MicroB that's
locking up the system.  What's locking the system is a kernel process that's
spinning.  Sure, a kernel process (like kswap) eating all resources because of
requests from a user app is a bad app.  But a kernel process spinning after a
request failure is a bad kernel driver, regardless of the app triggering it.

(In reply to comment #73) 
> If it's this:
> http://maemo.org/packages/package_instance/view/fremantle_extras-devel_free_armel/modrana/0.20-3/
> 
> It actually would seem to be using python & cairo, not GL.

It is, and it is using python/cairo.  Yet it still *triggers* this behavior,
especially when downloading maps and actively navigating.  I've had less
frequent occurrences as I've been using it, in part I think because it's no
longer fetching tiles from the air, and is instead pulling them from the local
cache.  (I do think this is tied to GPRS/GPS combo usage as well.)

(In reply to comment #74)
> You've verified this is the whole device UI freeze caused by SGX issues e.g. by
> sshing to the device and seeing sgx kernel thread highest in top or SGX resets
> in dmesg?

I can verify when it happens because I have a script running that watches the
sgx user-space process.  When it starts consuming >80% cpu, it uses espeak to
announce it has an issue and after 3 such announcements (over 15 seconds) it
reboots.  It's triggered three times so far, twice when using modRana, and once
when sitting idle (at about 5am) on my night stand.

(In reply to comment #74)
> Very interesting.  Does this happen with both GPRS and Wifi or only with the
> other one?  Or only when the device is moving and switching between base
> stations?

This is interesting.  I know I've seen it primarily when on gprs, but have seen
it on Wifi too.  But this made me think...

I checked my reboots, system logs and script logs.  One thing I found was a
very high (100%) correlation to an inbound SMS.  Each time this happened there
was an in-bound text message that was time-stamped just before the script
triggered and/or shortly before the reboot.  I also recall often getting a text
just *after* the reboot, including that 5am instance. (Did I mention I hate my
bank, posting mortgage payment notices via SMS at 5am...)

Could it be that heavy cell usage is involved?  (I use only 2.5 EDGE, not 3G.) 
I will attempt to make a call and/or have people text me in the background
while navigating tonight to see if I can trigger it.

Also, of a later question:  I've had it trigger both when the screen is off,
and when on.  In the former cases, the screen backlight comes on when I try to
access the device, but the screen stays black.  In the latter cases, the UI is
unresponsive, but the backlight continues to alter state based on hardware
input.
Comment 90 egoshin 2010-12-20 23:39:28 UTC
Comment 89:

> As for Bug 7017, that looks like it's been fixed.  Even if it hasn't when the
> device reboots it makes noise, buzzes, and does other things that make it clear
> that it's rebooting.

Unfortunately, it is not for anybody - sit in pocket and make noises.

And unexpected reboot is still unexpected reboot, it kills ANYTHING. But as I
understand we need to kill-and-restart only GUI apps.

I think a complete reboot is not a good solution.
Comment 91 Eero Tamminen nokia 2010-12-21 11:12:55 UTC
(In reply to comment #88)
> I can regularly see this when traveling outside city areas when
> the likelihood of this issue triggering increases.
> 
> Has anyone on this list ever seen this issue with network connection
> switched off? If not why don't we look at that aspect?
> 
> No matter whether I leave display on or off and keep a navig app running for
> hours, no matter if I manage to take modrana memory usage up to 65% or so by
> continuous panning, nothing ever locks the device if network is switched off.
...
> Pending this issue I have started using modrana regularly only in offline
> mode. I cache the data on home wifi network and when going out just switch
> off the network access from modrana fully. Then I never face any problem.

Hm. That might explain why I never see it.  That's how I typically use my
personal device (for use-time reasons).  And at work devices of course
typically are / have to be stationary.


> This whole episode reminds of the story of six blind men and an elephant
> as many of us notice different aspect of the issue. May be there is
> a background screen update issue, but may be the navig app doesn't do it,
> something pops up while tries to pop up when network disconnects and
> reconnects with a foreground app making heavy use of screen in
> the foreground. Something like that. Just speculating.

Yes, there can be multiple reasons/triggers.  Highlights the importance of
getting exact, reproducible steps and exact environment setup for reproducing
the issue.  If it's related to network, it could even be something that can be
reproduced in a certain network environment (wouldn't be first such bug).


(In reply to comment #89)
> (In reply to comment #73) 
> > Buggy user processes can DOS a normally configured Linux system very easily
> 
> In each of the methods you listed, you're talking about extreme, purposeful
> behaviors.

Several of them can happen just by having a buggy program.  I've seen such. 
They can e.g. have some untested corner-case code that in some (e.g. network
error) situations gets triggered constantly.


> In this case because the culprit is NOT a user process but a kernel level
> process, that's not an option.  There's no way to "kill" a misbehaving kernel
> process.  Let's be clear here.  It's not Maep, or modRana, or MicroB that's
> locking up the system.

Did you actually check that e.g. by killing all the UI processes one by one to
see whether the issue goes away?


If that can cure the issue, it would be good to know what was on screen, e.g.
as output of "DISPLAY=:0 xwininfo -root -tree" command (in x11-utils package)
taken before starting killing.
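
If the snapshot is taken from a watcher script, something along these lines could save it to a file. This is a hedged Python sketch (it assumes Python 3.7+ for capture_output, which is newer than what ships on the device); the function names and the timeout value are made up, and only the xwininfo command line itself comes from the text above.

```python
import os
import subprocess

def xwininfo_command(display=":0"):
    """Build the argv and environment for 'DISPLAY=:0 xwininfo -root -tree'."""
    env = dict(os.environ, DISPLAY=display)
    return ["xwininfo", "-root", "-tree"], env

def save_window_tree(path, display=":0", timeout_s=10):
    """Write the window tree (or the failure reason) to path and return it."""
    argv, env = xwininfo_command(display)
    try:
        result = subprocess.run(argv, env=env, capture_output=True,
                                text=True, timeout=timeout_s)
        text = result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        # xwininfo missing, or X too stuck to answer within the timeout.
        text = "xwininfo failed: %r\n" % (exc,)
    with open(path, "w") as fh:
        fh.write(text)
    return text
```

The timeout matters here because a wedged X server may never answer, and the watcher should not hang along with it.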


> I can verify when it happens because I have a script running that watches the
> sgx user-space process.  When it starts consuming >80% cpu, it uses espeak to
> announce it has an issue and after 3 such announcements (over 15 seconds) it
> reboots.  It's triggered three times so far, twice when using modRana, and
> once when sitting idle (at about 5am) on my night stand.

Device being stationary, screen blanked?  Which apps were running in last case?

You might consider starting xresponse logging when your script first notices
the issue to see what apps are doing screen updates.  + take above xwininfo
data too.


> I checked my reboots, system logs and script logs.  One thing I found was
> a very high (100%) correlation to an inbound SMS.  Each time this happened
> there was an in-bound text message that was time-stamped just before the
> script triggered and/or shortly before the reboot.
> I also recall often getting a text just *after* the reboot, including
> that 5am instance.

Very interesting finding!  When an SMS comes in, cgroups is used to freeze the
extra processes (maybe for ~1 sec) to get the notification banner up fast. 
Notification coming up on top of a fullscreen application means that it is
switched from direct to composited mode.

Maybe there's some race condition in X regarding drawing, SGX operations and
compositing switching?

(There's some SGX specific driver code in X side too, but like kernel driver, I
think that's also Open Source, unlike the OpenGL/SGX libraries.)


> Could it be that heavy cell usage is involved?  (I use only 2.5 EDGE,
> not 3G.)

The device has several HW with their own "OSes", cellmo side has one, SGX has
one etc.  Maybe they interact badly, I think all of them have their own MMUs
and can therefore read & write everywhere in RAM...


> Also, of a later question:  I've had it trigger both when the screen is off,
> and when on.  In the former cases, the screen backlight comes on when I try to
> access the device, but the screen stays black.  In the latter cases, the UI is
> unresponsive, but the backlight continues to alter state based on hardware
> input.

If screen was blanked before and UI isn't drawing anything on screen after
unblanking, what you see is backlighted black color.
Comment 92 Craig Woodward 2010-12-22 23:19:02 UTC
(In reply to comment #90)
> And unexpected reboot is still unexpected reboot, it kills ANYTHING. But as I
> understand we need to kill-and-restart only GUI apps.

No... Your understanding is incorrect.  The ONLY solution so far found has been
to reboot the system.  Killing and restarting the GUI apps has NO effect.

The problem here is that a KERNEL DRIVER gets into a spinning state.  There is
no way to kill or unload the driver, and killing apps seems to have NO effect
on the driver.  I've tried killing apps and even X itself.  The only thing that
brings back the display and stops the driver from taking 100% cpu is a reboot.

Again, many would prefer to have the device reboot than to have it SILENTLY
drain all the battery and die.  An unexpected battery drain "kills anything"
just as much as a reboot does.  At least with the reboot there's a controlled
shutdown, and a chance of having battery left after an occurrence.

Let me state again, quite clearly: There is NO way we currently know to recover
from this state other than a reboot.  If you know of another way, please do
share it.  I would love to alter my script to simply reset the display system
if that's an option.  To my knowledge, there is no known way of fixing this
other than a reboot/powerdown.

(In reply to comment #91) 
> Hm. That might explain why I never see it.  That's how I typically use my
> personal device (for use-time reasons).  And at work devices of course
> typically are / have to be stationary.

I'm not sure being mobile is a requirement.  modRana has the ability to
download tiles to cache them locally.  One could easily emulate the device
strain of navigation by setting the connection to GPRS, setting it to download
lots of tiles, leaving the GPS on, and dragging the display around a bit from
time to time to force redraws.  All of that can be done on the bench, and
emulates the device usage in a mobile situation.  (Sending a few SMS during
that would also seem to be a good indicator?)

(In reply to comment #91) 
> Did you actually check that e.g. by killing all the UI processes one by one to
> see whether the issue goes away?

I did once when it happened at work (I was downloading cache tiles via wifi). 
I was able to ssh into the device and look around.  I tried killing modRana,
but decided since a reboot was going to happen anyway, I should try killing
other things too.  Most of the apps respawned, since the system likes some of
the system apps to always be loaded.  I even tried killing X with various
levels to see if it would recover. (First -1, then -15, then -9.)  Nothing
caused it to recover (though it did reboot after I killed something it didn't
like me killing.)

(In reply to comment #91)
> If that can cure the issue, it would be good to know what was on screen, e.g.
> as output of "DISPLAY=:0 xwininfo -root -tree" command (in x11-utils package)
> taken before starting killing.

I can add that to my script, to record that.. and maybe even try killing the
app(s) involved.  I'm up for adding any debug collection if it will help.

Right now I'm not having it happen as much, but I've been unable to run a few
tests as I'm busy wrapping things up before the holiday.  I'm hopeful that
after Thursday I can sit and test some things out (including the IM while
navigating test I wanted to do yesterday).

(In reply to comment #91)
> Device being stationary, screen blanked?  Which apps were running in last case?

None... I generally don't have apps running in the background, especially at
night.  I do have IM setup, but at night even that goes off.  When the 5am
instance triggered nothing but the desktop widgets were running, and the screen
was off at the time, so they should have been idle (facebook, calendar &
weather widgets is all I have there).  The only thing I know was going on was I
got a SMS around the same time the device had the issue.

(In reply to comment #91)
> You might consider starting xresponse logging when your script first notices
> the issue to see what apps are doing screen updates.  + take above xwininfo
> data too.

I will add both.  The current script is listed here btw:
http://talk.maemo.org/showthread.php?t=66660

(In reply to comment #91)
> Very interesting finding!  When an SMS comes in, cgroups is used to freeze the
> extra processes (maybe for ~1 sec) to get the notification banner up fast. 
> Notification coming up on top of a fullscreen application means that it is
> switched from direct to composited mode.
> 
> Maybe there's some race condition in X regarding drawing, SGX operations and
> compositing switching?

Possible... I don't generally use full-screen mode though, even for nav apps,
as I like to have that top status bar with the time and such visible.  But even
then, it would still use cgroups and cause X to do redraws.  That could be part
of it, since a GL "burst" is what's suspected for triggering this, yes?

(In reply to comment #91)
> The device has several HW with their own "OSes", cellmo side has one, SGX has
> one etc.  Maybe they interact badly, I think all of them have their own MMUs
> and can therefore read & write everywhere in RAM...

This was my initial thought when I realized most of the triggering apps were
GPS related.  I was wondering if it was the GPS module conflicting... But it
could be a mix of the SGX and any number of them.

But again, the SGX module should be fixed to not get stuck in a loop.  Even if
the display just goes away until a reboot, the killer is not the UI loss, it's
the battery drain.  Sucking up 100% cpu and blowing through all battery power
is a bad thing.  If I picked up my device and couldn't get the UI to refresh
without a reboot, that's annoying.  If I pick it up and it's off, and has no
battery left to restart with, that really sucks.

(In reply to comment #91)
> If screen was blanked before and UI isn't drawing anything on screen after
> unblanking, what you see is backlighted black color.

My point was more that other things were running still (backlight adjusting,
sensors for unlock etc).  So the device isn't "locked up", it's just not
drawing.  But we kind of already knew that.

Is there a key sequence one can use to force a reboot safely?  I'm not talking
about the hard power-down by holding the power button, but something where the
OS gets to close things up, like issuing a reboot or shutdown command.
Comment 93 egoshin 2010-12-22 23:38:22 UTC
(In reply to comment #92)
> (In reply to comment #90)
> > And unexpected reboot is still unexpected reboot, it kills ANYTHING. But as I
> > understand we need to kill-and-restart only GUI apps.
> 
> No... Your understanding is incorrect.  The ONLY solution so far found has been
> to reboot the system.  Killing and restarting the GUI apps has NO effect.
> 
> The problem here is that a KERNEL DRIVER gets into a spinning state.  There is
> no way to kill or unload the driver, and killing apps seems to have NO effect
> on the driver.  I've tried killing apps and even X itself.  The only thing that
> brings back the display and stops the driver from taking 100% cpu is a reboot.

Excuse me, but it is POSSIBLE to restart driver. Just some work needs to be
done. 

> Again, many would prefer to have the device reboot than to have it SILENTLY
> drain all the battery and die.  An unexpected battery drain "kills anything"
> just as much as a reboot does. 

You suggest a choice between bad and awful. But there is a man who has a
similar symptom and suffers exactly from reboots, see -
http://talk.maemo.org/showthread.php?t=67292
It seems that man's situation is exactly your dream, but he is not happy.

> At least with the reboot there's a controlled
> shutdown, and a chance of having battery left after an occurrence.
> 
> Let me state again, quite clearly: There is NO way we currently know to recover
> from this state other than a reboot. 

You are not informed. If device (SGX) can be reset then there IS a way w/out
reboot.

>  If you know of another way, please do
> share it.  I would love to alter my script to simply reset the display system
> if that's an option.  To my knowledge, there is no known way of fixing this
> other than a reboot/powerdown.

Without programming - yes, you are right. But SGX reset with killing X11 (or
sending it some message) is possible.
Comment 94 darkwalker000 2010-12-23 00:39:11 UTC
My case is a little different. The phone will reboot on its own when
"HWRecoveryResetSGX: SGX Hardware Recovery triggered" happens. This usually
happens after I finish a phone call (receiving calls seem to be ok.)  The
firmware running on my system is PR 1.3. 

Dec 22 10:14:41 Nokia-N900-42-11 -- MARK --
Dec 22 10:15:28 Nokia-N900-42-11 kernel: [80711.257019] kb_lock (GPIO 113) is
now closed
Dec 22 10:15:28 Nokia-N900-42-11 kernel: [80711.506866] kb_lock (GPIO 113) is
now open
Dec 22 10:15:44 Nokia-N900-42-11 dorian[3714]: GLIB CRITICAL ** GLib-GObject -
g_object_get: assertion `G_IS_OBJECT (object)' failed
Dec 22 10:15:44 Nokia-N900-42-11 dorian[3714]: GLIB CRITICAL ** Gtk -
gtk_widget_set_sensitive: assertion `GTK_IS_WIDGET (widget)' failed
Dec 22 10:25:01 Nokia-N900-42-11 kernel: [81284.624237] kb_lock (GPIO 113) is
now closed
Dec 22 10:25:02 Nokia-N900-42-11 kernel: [81284.890228] kb_lock (GPIO 113) is
now open
Dec 22 10:31:00 Nokia-N900-42-11 kernel: [81643.178833] HWRecoveryResetSGX: SGX
Hardware Recovery triggered
Dec 22 10:32:29 Nokia-N900-42-11 kernel: [81732.428314] HWRecoveryResetSGX: SGX
Hardware Recovery triggered
Dec 22 10:32:49 Nokia-N900-42-11 syslogd 1.5.0#5maemo7+0m5: restart.
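
A small Python sketch for pulling the recovery events out of a syslog excerpt like the one above and seeing how tightly they cluster. The regular expression is an assumption based on the lines pasted in this bug, the helper names are made up, and the kernel uptime stamps are assumed to come from a single boot in ascending order.

```python
import re

# Matches e.g. "kernel: [81643.178833] HWRecoveryResetSGX: SGX ..."
RESET_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] HWRecoveryResetSGX")

def reset_uptimes(syslog_lines):
    """Kernel uptime stamps (seconds) of every SGX recovery line."""
    stamps = []
    for line in syslog_lines:
        m = RESET_RE.search(line)
        if m:
            stamps.append(float(m.group(1)))
    return stamps

def resets_within(uptimes, window_s):
    """Largest number of resets falling inside any window of window_s seconds."""
    best = 0
    start = 0
    for end in range(len(uptimes)):
        # Shrink the window from the left until it spans at most window_s.
        while uptimes[end] - uptimes[start] > window_s:
            start += 1
        best = max(best, end - start + 1)
    return best
```

On the excerpt above this finds two resets about 89 seconds apart, shortly before syslogd restarts.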
Comment 95 Craig Woodward 2010-12-23 02:00:34 UTC
(In reply to comment #93)
> Excuse me, but it is POSSIBLE to restart driver. Just some work needs to be
> done. 
>
> You are not informed. If device (SGX) can be reset then there IS a way w/out
> reboot.
>
> Without programming - yes, you are right. But SGX reset with killing X11 (or
> sending it some message) is possible.

Since you say I'm "not informed", please DO INFORM ME on how to do this. I have
yet to see ANYONE show a way to reset the driver when in this state outside of
a reboot.  You can say it can be done, and is possible.  How?  Have you done
it? Did it work? If not, how can you say with certainty that it's possible?

As I said, I would love to be able to just reset the driver and have it work
again.  That would be a perfectly acceptable work around, and most here would
happily use such a patch.  But currently, there is NO WAY to do this.

PLEASE instead of just calling the multiple people here asking for a reboot fix
here "uninformed", why not do something useful and give an alternate solution.

(In reply to comment #93) 
> You suggest a choice between bad and awful. But there is a man who has a
> similar symptom and suffers exactly from reboots, see -

The issue he is seeing is NOT the same as this bug.  It's a very different
issue.  His device is being reset by the watchdog after calls.  Some of the
output in his logs indicated an SGX reset, among other things, before the
watchdog reboots his device.  But this is not a bug for all things where the
SGX resets. His is a separate bug, totally unrelated to what's happening here.

Given a choice, I think most would prefer that Nokia fix this so the kernel
driver worked and just reset the hardware.  But given the choice between a
device that randomly DIES and is USELESS for the rest of the day (how it works
now), or one that randomly REBOOTS (the only SHOWN way of fixing this bug), I
and many others would choose the latter.  You may choose differently, but
*multiple people* have shown a desire for a reboot vs a dead battery.  Again,
if you know of another working option, PLEASE DO SHARE by showing HOW to do
that.

Ultimately, I would like this to not happen at all, by having Nokia fix this
device killing bug.  Until that's a reality, I posted a script to implement a
watcher that collects a log and reboots.  If you think it's possible to reset
the driver and X instead, please post something on how to do so.
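
For illustration only, a watcher of this kind could be sketched roughly as below. This is not the script referred to above: the function names, the log path, the copy destination and the threshold of 2 are all assumptions, and the count covers the whole file rather than only recent entries.

```shell
# Count "HWRecoveryResetSGX" messages in a syslog file.
# The file path is supplied by the caller (hypothetical helper).
count_sgx_resets() {
    grep -c 'HWRecoveryResetSGX' "$1"
}

# Succeed when the number of recorded resets has reached the threshold.
# $1 = syslog file, $2 = threshold (an arbitrary, assumed value).
sgx_needs_reboot() {
    [ "$(count_sgx_resets "$1")" -ge "$2" ]
}

# Hypothetical usage, left commented out so that sourcing this file
# never reboots anything by itself:
# if sgx_needs_reboot /var/log/syslog 2; then
#     cp /var/log/syslog /home/user/sgx-hang.log   # keep a copy of the log
#     reboot
# fi
```

Such a check would be run periodically, and it reboots on the log message alone; it does not attempt to reset the driver.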
Comment 96 Eero Tamminen nokia 2010-12-29 16:58:13 UTC
(In reply to comment #92)
> The problem here is that a KERNEL DRIVER gets into a spinning state.

It isn't necessarily the kernel driver, it may just be reacting to what SGX
does in a loop (I mean, you may be seeing the effect, not the cause).  Would be nice if
somebody checks that from the driver...


> There is no way to kill or unload the driver,

To do rmmod on the driver (pvrsrvkm?), you need to kill its users first.


> and killing apps seems to have NO effect on the driver.
...
>> Did you actually check that e.g. by killing all the UI processes one by
>> one to see whether the issue goes away?
> 
> I did once when it happened at work (I was downloading cache tiles via wifi).

Stationary device, no GPS?

> I was able to ssh into the device and look around.  I tried killing modRana,
> but decided since a reboot was going to happen anyway, I should try killing
> other things too.  Most of the apps respawned, since the system likes some of
> the system apps to always be loaded.

You need to use dsmetool to kill stuff that uses DSME SW watchdog.


> I even tried killing X with various levels to see if it would recover.
> (First -1, then -15, then -9.)

Were you able to kill it?

(If it's in D state, even SIGKILL might not work.)


> (In reply to comment #91) 
> > Hm. That might explain why I never see it.  That's how I typically use my
> > personal device (for use-time reasons).  And at work devices of course
> > typically are / have to be stationary.
> 
> I'm not sure being mobile is a requirement.  modRana has the ability to
> download tiles to cache them locally.  One could easily emulate the device
> strain of navigation by setting the connection to GPRS, setting it to
> download lots of tiles, leaving the GPS on, and dragging the display around
> a bit from time to time to force redraws.  All of that can be done on the
> bench, and emulates the device usage in a mobile situation.

Will that trigger the issue?

(By mobile I meant things happening in other kernel drivers, not just SGX one. 
Things like base station change etc.)


> (In reply to comment #91)
> > Device being stationary, screen blanked?  Which apps were running
> > in last case? 
>
> None... I generally don't have apps running in the background, especially at
> night.  I do have IM setup, but at night even that goes off.  When the 5am
> instance triggered nothing but the desktop widgets were running, and
> the screen was off at the time, so they should have been idle (facebook,
> calendar & weather widgets is all I have there).

Did you have network up?  Facebook or weather widget might be doing screen
update although screen is blanked (as update e.g. once a day isn't a use-time
problem).

> The only thing I know was going on was I
> got a SMS around the same time the device had the issue.



> (In reply to comment #91)
> > You might consider starting xresponse logging when your script first notices
> > the issue to see what apps are doing screen updates.  + take above xwininfo
> > data too.
> 
> I will add both.  The current script is listed here btw:
> http://talk.maemo.org/showthread.php?t=66660

Hm.  They don't work if X is stuck at D state.  Best to first get:
  cat /proc/$(pidof Xorg)/wchan

output to see where it's stuck (if it indeed happens to be in D state).
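
A minimal sketch of that check, assuming the usual Linux /proc layout where /proc/PID/status carries a "State:" line. The function names proc_state and report_stuck are made up for this example:

```shell
# Print the one-letter process state (R, S, D, ...) found in a
# /proc/PID/status-style file given as $1.
proc_state() {
    awk '/^State:/ { print $2; exit }' "$1"
}

# Report the state of a PID and, if it is in D state, the kernel
# function it is waiting in (read from /proc/PID/wchan).
report_stuck() {
    pid=$1
    state=$(proc_state "/proc/$pid/status")
    if [ "$state" = "D" ]; then
        echo "pid $pid is in D state, waiting in: $(cat "/proc/$pid/wchan")"
    else
        echo "pid $pid state: $state"
    fi
}

# Hypothetical usage:
# report_stuck "$(pidof Xorg)"
```

Running this before xresponse or xwininfo shows whether those tools have any chance of getting an answer from X.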


> (In reply to comment #91)
>> Maybe there's some race condition in X regarding drawing, SGX operations
>> and compositing switching?
> 
> Possible... I don't generally use full-screen mode though, even for nav apps,
> as I like to have that top status bar with the time and such visible.  But
> even then, it would still use cgroups and cause X to do redraws.

Non-fullscreen apps are run in composited mode, so that shouldn't trigger
composition switch.  Incoming SMS triggers temporary cgroups freeze for
non-related processes though.


> That could be part of it, since a GL "burst" is whats suspected for
> triggering this, yes?

If by "burst" you mean multiple processes waking up (after freeze) to do window
updates at the same time (while system is otherwise stressed), yes something like
that could be one of the triggering conditions.


> Is there a key sequence one can use to force a reboot safely?  I'm not talking
> about the hard power-down by holding the power button, but something where the
> OS gets to close things up, like issuing a reboot or shutdown command.

Holding power button down should be doing "clean" shutdown (unmount etc).
Comment 97 Sami Liedes 2010-12-29 18:25:23 UTC
(In reply to comment #96)
> > Is there a key sequence one can use to force a reboot safely?  I'm not talking
> > about the hard power-down by holding the power button, but something where the
> > OS gets to close things up, like issuing a reboot or shutdown command.
> 
> Holding power button down should be doing "clean" shutdown (unmount etc).

That doesn't work. Hmm. I always assumed that's because X is in an
uninterruptible state, but then I'm not sure if that makes clean shutdown
impossible. In the PC world inability to cleanly shut down can be seen when
something that the shutdown depends on is in uninterruptible state (in my
experience most often when umount blocks waiting on a D-state task with a file
open in the target filesystem).
Comment 98 Jüri Ivask 2011-01-28 14:45:07 UTC
It seems that I found a way to reproduce it easily...
Installed Marble 1.0.0 from extras/testing selected OpenStreetMap theme,
selected on the screen a map region about 300x500 km. Selected "Download
Region" from the pull down menu, then "Visible region", set Zoom to Tile level
range 10-13 (estimated download size: 387 MB) and pressed the button "Done".
The download started, put the phone on the table and after about 15-20 it is
frozen - had to pull out the battery...
Comment 99 Eero Tamminen nokia 2011-02-01 12:19:05 UTC
(In reply to comment #98)
> It seems that I found a way to reproduce it easily...
> Installed Marble 1.0.0 from extras/testing selected OpenStreetMap theme,
> selected on the screen a map region about 300x500 km. Selected "Download
> Region" from the pull down menu, then "Visible region", set Zoom to Tile level
> range 10-13 (estimated download size: 387 MB) and pressed the button "Done".
> The download started, put the phone on the table and after about 15-20 it is

Seconds?  Minutes?

> frozen - had to pull out the battery...

Could you check with xresponse (from ssh console) whether this application does
screen updates after screen has blanked, and how often?  Also, please install
sp-memusage and give marble PID to mem-cpu-monitor to monitor its (and whole
system) memory usage changes.
Comment 100 Jan Kratochvil 2011-05-03 23:06:34 UTC
Created an attachment (id=3362) [details]
thread apply all bt for marble

(In reply to comment #99)
> after about 15-20 it is
> 
> Seconds?  Minutes?

Rather minutes.  But it hangs for 90% of downloads; IIRC it has completed only
one large download for me.  marble 1.1.0.  I think it would be easier for you to
reproduce it yourself than to debug it remotely like this.
http://userbase.kde.org/Tutorials#Marble
http://userbase.kde.org/Marble/Maemo/OfflineRouting


> Could you check with xresponse

I never got any output from it using:
xresponse -w0 -i
xresponse -w0 -k A
xresponse -w0 -c 100x100
xresponse -w0 -t foo


> sp-memusage and give marble PID to mem-cpu-monitor to monitor its

It has never changed the output:
# mem-cpu-monitor 2710
System total memory: 245380 kB RAM, 786424 kB swap
               _______________  ____________ 
________  __  / system memory \/ system CPU \
time:   \/BL\/  used:  change:     %:  MHz: 
21:43:24  --   171136       +0   0.00     0

Also tried:
Nokia-N900:~# ls -l /proc/2710/exe
lrwxrwxrwx    1 user     users           0 May  3 22:00 /proc/2710/exe ->
/opt/marble/bin/marble
Nokia-N900:~# strace -p 2710
Process 2710 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>^C <unfinished ...>
Process 2710 detached
 --- strace prints no output

GDB backtraces attached.  Thanks for investigation.

CPU:  0.1% usr 99.8% sys  0.0% nice  0.0% idle  0.0% io  0.0% irq  0.0% softirq
Mem: 194720K used, 50660K free, 0K shrd, 364K buff, 85616K cached
CPU:  0.0% usr  100% sys  0.0% nice  0.0% idle  0.0% io  0.0% irq  0.0% softirq
Load average: 1.00 1.10 1.13
  PID  PPID USER     STAT   RSS %MEM %CPU COMMAND
  876     2 root     RW       0  0.0 98.4 [sgx_misr]
 3103  1566 root     S     1348  0.5  0.3 sshd: root@pts/0     
 3412  3105 root     R      736  0.3  0.3 top 
   10     2 root     SW       0  0.0  0.3 [omap2_mcspi]
  544     2 root     SW       0  0.0  0.2 [wl12xx]
  884   729 root     D <  16428  6.6  0.0 /usr/bin/Xorg -logfile
/tmp/Xorg.0.log -logverbose 1 -nolisten tcp -noreset -s 0 -core 
 2710     1 user     S    11932  4.8  0.0 /opt/marble/bin/marble 
[...]
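
As a rough sketch, the [sgx_misr] CPU share could be pulled out of top output in the format shown above like this. The function names and the 90% threshold in the usage line are assumptions, and so is the availability of batch-mode flags in the device's top:

```shell
# Read top-style output on stdin and print the %CPU value of the
# [sgx_misr] kernel thread. In the format above, %CPU is the field
# immediately before the COMMAND column on that line.
sgx_misr_cpu() {
    awk '/\[sgx_misr\]/ { print $(NF-1); exit }'
}

# Succeed when [sgx_misr] uses at least $1 percent CPU according to
# the top-style output on stdin; fail if the thread is not listed.
sgx_misr_spinning() {
    cpu=$(sgx_misr_cpu)
    [ -n "$cpu" ] && awk -v c="$cpu" -v t="$1" 'BEGIN { exit !(c + 0 >= t + 0) }'
}

# Hypothetical usage (assumes top accepts -b and -n):
# top -b -n 1 | sgx_misr_spinning 90 && echo "sgx_misr is spinning"
```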