Bug 7017 - (int-144156) SGX memory reset seems failed during reboot
Status: RESOLVED FIXED
Product: Core
Component: Kernel
Version: 5.0:(10.2010.19-1)
Platform: N900 Maemo
Importance: Medium critical with 5 votes
Target Milestone: 5.0/(10.2010.19-1)
Assigned To: unassigned
QA Contact: linux-kernel-bugs
Keywords: moreinfo
 
Reported: 2009-12-16 03:37 UTC by egoshin
Modified: 2011-04-28 05:13 UTC

Attachments
syslog (173.55 KB, text/plain)
2009-12-16 03:38 UTC, egoshin
syslog (after software reset) (292.64 KB, text/plain)
2009-12-16 21:12 UTC, egoshin
gzip of /dev/mtd2 (5.13 KB, application/gzip)
2009-12-17 09:33 UTC, egoshin



Description egoshin (reporter) 2009-12-16 03:37:12 UTC
SOFTWARE VERSION:
(Settings > General > About product)

EXACT STEPS LEADING TO PROBLEM: 
1. reboot
2. sudo gainroot
3. grep SGX /var/log/syslog

EXPECTED OUTCOME:

The system reboots without problems and logs no messages about SGX.

ACTUAL OUTCOME:

The screen may show a blackened symbol (a black square in place of the X-Term
symbol). The syslog contains the message

Dec 15 16:52:55 Nokia-N900-42-11 kernel: [30295.444152] HWRecoveryResetSGX: SGX Hardware Recovery triggered

which repeats.

REPRODUCIBILITY:

Unknown, but it has never happened after a power OFF-ON cycle. It was seen
only after the shell command "reboot".

EXTRA SOFTWARE INSTALLED:

Many, but nothing in kernel or power management.

OTHER COMMENTS:

It was seen only after the shell command "reboot". Switching the device OFF
(or running "halt") and then switching it ON helps.
The syslog is attached (the IMEI and WiFi name have been edited out).

Comment 1 egoshin (reporter) 2009-12-16 03:38:22 UTC
Created an attachment (id=1772)
syslog
Comment 2 egoshin (reporter) 2009-12-16 21:12:09 UTC
Created an attachment (id=1785)
syslog (after software reset)

Syslog of a reboot whose reason, according to /proc/bootreason, was "sw_rst".
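
The boot reason mentioned here can be checked from a shell before looking at the logs. This is only a minimal sketch: it assumes /proc/bootreason holds a single word such as "sw_rst" (the only value named in this report), and the function name is hypothetical.

```shell
# Succeed (exit 0) if the recorded boot reason is a software reset.
# $1: path to the boot reason file; defaults to /proc/bootreason.
# Exit status 1 means another reason, 2 means the file could not be read.
booted_by_sw_reset() {
    reason=$(cat "${1:-/proc/bootreason}" 2>/dev/null) || return 2
    [ "$reason" = "sw_rst" ]
}

# On the device (not run here):
#   booted_by_sw_reset && echo "last boot was a software reset"
```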
Comment 3 egoshin (reporter) 2009-12-16 21:16:37 UTC
Today I experienced the problem again.

I used KMPlayer (based on gstreamer0.10) to play a video file that Nokia Media
Player cannot play (missing codec) and got a reboot due to "sw_rst". I
attached a log.

After that reboot I again see the blackened symbol in the X-Term window, the
system slows down, and the syslog contains HWRecoveryResetSGX messages.

Again, switching OFF-ON cures the problem.
Comment 4 Eero Tamminen nokia 2009-12-17 09:12:11 UTC
(In reply to comment #3)
> Today I experienced the problem again.
> 
> I played with KMPlayer (based on gstreamer0.10) with some video file which is
> not played by Nokia Media Player (missed codec) and got a reboot due to
> "sw_rst". I attached a log.

As there was no message about some critical system service termination causing
reboot, I assume this was a kernel oops.  Can you attach output of
"sp-oops-extract /dev/mtd2" (tool available from SDK tools repo) or just the
gzipped /dev/mtd2 contents?
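
A minimal sketch of producing the gzipped copy, assuming root access on the device. The function name and the default output file name are hypothetical; the partition is read through stdin because some gzip implementations skip non-regular files given as arguments.

```shell
# Compress the raw contents of the oops log partition.
# $1: source; defaults to /dev/mtd2, the partition named in this thread.
# $2: destination file; the default name is only an example.
dump_oops_log() {
    src=${1:-/dev/mtd2}
    dest=${2:-mtd2.gz}
    gzip -c < "$src" > "$dest"
}

# On the device, as root (not run here):
#   dump_oops_log
```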


> After that reboot I see again the blackened symbol on X-Term window, slowdown
> the system and find in the syslog the messages about HWRecoveryResetSGX. 
> 
> Again, the switching OFF-ON cures a problem.

If I understood you right, SGX resets don't go away with reboot.  But can the
resets start happening before the device has been rebooted even once (after
last power key power up), or do they start happening only after device has
rebooted?


Btw: in many cases you don't need syslog (or root access to read
/var/log/syslog); you can just run "dmesg | grep SGX" as user if the issue is
recent (the dmesg buffer size is quite limited).  If the device has rebooted,
then syslog is needed.
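
The check described above can be wrapped in a small helper that works on either source. This is a sketch rather than a recommended tool: the function name is hypothetical, and it only counts lines containing the HWRecoveryResetSGX tag seen in the reporter's log.

```shell
# Print how many lines mention an SGX hardware recovery.
# Reads the files given as arguments, or stdin when none are given.
count_sgx_resets() {
    awk '/HWRecoveryResetSGX/ { n++ } END { print n + 0 }' "$@"
}

# On the device (not run here):
#   dmesg | count_sgx_resets            # as user, recent messages only
#   count_sgx_resets /var/log/syslog    # as root, also covers earlier boots
```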
Comment 5 egoshin (reporter) 2009-12-17 09:33:48 UTC
Created an attachment (id=1792)
gzip of /dev/mtd2

You may find a lot of crashes in it. I have been experimenting with different
video codecs to play the MPEG files from my JVC camcorder, and those codecs do
not work properly with HD video. Each time the system crashed, it was because
of SW bugs/incompatibility in a "controlled" test.

That is not the issue; the real issue is an unstable SGX after reboot. Again, I
stress: after a power cycle the problem disappears.
Comment 6 Eero Tamminen nokia 2009-12-17 09:47:09 UTC
The oops log has a few MEM_FlushCache oopses; the DSP driver or DSP code (the
DSP can also manipulate the MMU...) may be corrupting the memory.

Do the SGX issues start occurring only when using video & things that use DSP?


> It is not an issue,

Well, the SGX lockups and the HW resets they require are a problem.


> the real issue is an unstable SGX after reboot. Again, I
> stress - after power cycle the problem disapears.

Although this is of course also an issue, reboot should clear everything
needed.
Comment 7 egoshin (reporter) 2009-12-17 10:20:20 UTC
(In reply to comment #6)
> The oopslog has few MEM_FlushCache oopses, DSP driver or code (DSP can also
> manipulate MMU...) may be corrupting the memory.
> 
> Do the SGX issues start occurring only when using video & things that use DSP?
> 

I don't know; I don't normally use "reboot". I have restarted the system, but
I used the power button to switch the N900 OFF and then ON.

But the SGX issue starts AFTER an automatic reboot, not before. You can see
this in the log, which includes messages from before the reboots.


> Well, the SGX lockups + resulting required HW resets appearing is a problem.

OK, I used KMPlayer to try to play a video (see its profile below) with
"Decoder Support" and other codec packages. It reliably causes a system
reboot if I move the slider a long way to the right.

I also played some HD video from a JVC HM100U, which makes the screen blink. I
switched the N900 OFF-ON, but in some cases it may have rebooted itself (I am
not sure; I made a lot of attempts with different codec packs).

> 
> 
> > the real issue is an unstable SGX after reboot. Again, I
> > stress - after power cycle the problem disapears.
> 
> Although this is of course also an issue, reboot should clear everything
> needed.

... but it doesn't, and that worries me the most (do I need to send the N900
in for repair, or can it be fixed in software?)

Actually, it looks like a problem with the SW reset pulse length and/or the
setup of the power parameters.

---------------------------------- video parameters ----------------

Format                           : Matroska
File size                        : 376 MiB
Duration                         : 29mn 25s
Overall bit rate                 : 1 788 Kbps
Encoded date                     : UTC 2006-04-13 23:14:00
Writing application              : mkvmerge v1.6.5 ('Watcher Of The Skies') built on Dec  7 2005 18:53:53
Writing library                  : libebml v0.7.6 + libmatroska v0.8.0

Video
ID                               : 1
Format                           : MPEG-4 Visual
Format settings, BVOP            : Yes
Format settings, QPel            : No
Format settings, GMC             : No warppoints
Format settings, Matrix          : Default (H.263)
Codec ID                         : DX50
Codec ID/Hint                    : DivX 5
Duration                         : 29mn 23s
Bit rate                         : 1 368 Kbps
Width                            : 640 pixels
Height                           : 480 pixels
Display aspect ratio             : 4/3
Frame rate                       : 29.970 fps
Resolution                       : 24 bits
Colorimetry                      : 4:2:0
Scan type                        : Progressive
Bits/(Pixel*Frame)               : 0.149
Stream size                      : 287 MiB (76%)
Writing library                  : DivX 6.1.1 (UTC 2006-02-01)

Audio #1
ID                               : 2
Format                           : AC-3
Format/Info                      : Audio Coding 3
Codec ID                         : A_AC3
Duration                         : 29mn 25s
Bit rate mode                    : Constant
Bit rate                         : 192 Kbps
Channel(s)                       : 2 channels
Channel positions                : L R
Sampling rate                    : 48.0 KHz
Stream size                      : 40.4 MiB (11%)
Title                            : Undefined AC3 2.0
Language                         : Undefined

Audio #2
ID                               : 3
Format                           : AC-3
Format/Info                      : Audio Coding 3
Codec ID                         : A_AC3
Duration                         : 29mn 25s
Bit rate mode                    : Constant
Bit rate                         : 192 Kbps
Channel(s)                       : 2 channels
Channel positions                : L R
Sampling rate                    : 48.0 KHz
Stream size                      : 40.4 MiB (11%)
Title                            : English AC3 2.0
Language                         : English

Text
ID                               : 4
Format                           : UTF-8
Codec ID                         : S_TEXT/UTF8
Codec ID/Info                    : UTF-8 Plain Text
Title                            : English Subs
Language                         : English
Comment 8 egoshin (reporter) 2009-12-17 10:27:25 UTC
Sorry, I missed your questions -

(In reply to comment #4)

> If I understood you right, SGX resets don't go away with reboot.  But can the
> resets start happening before the device has been rebooted even once (after
> last power key power up), or do they start happening only after device has
> rebooted?

I saw it only after a reboot, not before; both logs are attached. Actually,
the device slows down, and that was my primary reason to look into the syslog.
But when I started X-Term I found a "blackened" symbol while typing a shell
command.

> 
> 
> Btw: in many cases you don't need syslog (and be root to be able to read
> /var/log/syslog), you can just do "dmesg | grep SGX" as user if the issue is
> recent (dmesg buffer size is quite limited).  If device has rebooted, then
> syslog is needed.
> 

Thank you.
Comment 9 pepitoe 2009-12-21 12:54:04 UTC
*** This bug has been confirmed by popular vote. ***
Comment 10 Andre Klapper maemo.org 2010-03-09 22:31:28 UTC
Also see bug 9150.
Comment 11 Eero Tamminen nokia 2010-03-23 15:32:06 UTC
There are HWRecoveryResetSGX fixes in PR1.2.  Not sure whether this issue is
solved with them, but tentatively setting TM as PR1.2.
Comment 12 Andre Klapper maemo.org 2010-05-28 21:28:27 UTC
This week Nokia released the Maemo5 update version 10.2010.19-1 for public
(also called "PR1.2").
If you have some time we kindly ask you to test again if the problem reported
here still happens in this new version - just leave a comment (and feel free to
update the "Version" field to the new version if it's still a problem).
Comment 13 egoshin (reporter) 2010-06-30 01:27:40 UTC
I have not seen it anymore during reboots. I have been on PR1.2 for about 1 month.
Comment 14 egoshin (reporter) 2010-07-20 20:57:26 UTC
Today I saw it again, but I have new input:

It happened after the battery was depleted (near the red line) and the charger
connection to the wall socket had failed. In the morning I found the N900
showing the "NOKIA" screen without backlight, and I tried to shut down and
start the N900. I now have a frame buffer console driver installed, and I saw
the N900 attempt to run in ACT_DEAD mode; a couple of reboots happened. During
that time the indicator light blinked inconsistently: a yellow blink while the
N900 was running and even a short green flash (with a depleted battery).

After one of these reboots I again found the blackened symbol and the "SGX"
message in the syslog.

Again, after I successfully forced a switch OFF, charged it a little and
switched it ON, there was no problem.
Comment 15 steve 2010-10-15 23:46:33 UTC
This is still happening for me on PR1.2.

SOFTWARE VERSION:
10.2010.19-1.203.1

EXACT STEPS LEADING TO PROBLEM: 
1)  Issue 'reboot' shell command
2)  Follow dmesg via ssh with:
    while [ 1 ];do sleep 1 ; dmesg ; dmesg -c ; done
3)  Scrolling between the 4 desktops frequently pauses for ~1 second, and dmesg
shows:
[  100.487518] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[  102.081024] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[  102.081024] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[  104.620056] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[  104.620056] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[  106.495178] HWRecoveryResetSGX: SGX Hardware Recovery triggered
[  106.495178] HWRecoveryResetSGX: SGX Hardware Recovery triggered

REPRODUCIBILITY:
Approx. 5/10 reboots.

OTHER COMMENTS:
After a power-off then power-on there are no problems;  this ONLY happens after
a reboot.
Interestingly no SGX recovery messages are thrown whilst playing Angry Birds, 
but they will resume after quitting back to the desktop;  this problem isn't
related to 'heavy' usage of the SGX.

On the reboots when this bug doesn't occur,  instead I get this constantly (but
again never after power-off then power-on):
[  255.422363] dspbridge: timed out waiting for mailbox
[  256.227050] dspbridge: timed out waiting for mailbox
[  256.242736] dspbridge: timed out waiting for mailbox
[  256.258331] dspbridge: timed out waiting for mailbox
[  256.280212] dspbridge: timed out waiting for mailbox
[  256.312988] dspbridge: timed out waiting for mailbox

I'd guess some hardware isn't being reset correctly at reboot,  only via a full
power cycle.
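
The two symptoms reported in this comment can be tallied from captured dmesg or syslog output. The following is a sketch under stated assumptions: the function name is hypothetical, and exact repeats of a line are counted once, because the loop in step 2 runs "dmesg" and then "dmesg -c", which prints each buffered message twice and may explain the doubled lines in the paste. Two genuinely separate events with identical text and timestamp would therefore also be merged.

```shell
# Tally SGX recovery and dspbridge mailbox timeout messages.
# Reads the files given as arguments, or stdin when none are given.
tally_reboot_symptoms() {
    awk '
        seen[$0]++ { next }    # skip exact repeats of an earlier line
        /HWRecoveryResetSGX/ { sgx++ }
        /dspbridge: timed out waiting for mailbox/ { dsp++ }
        END { printf "sgx_recovery=%d dspbridge_timeout=%d\n", sgx, dsp }
    ' "$@"
}

# On the device (not run here):
#   dmesg | tally_reboot_symptoms
```

Comparing the tallies after a "reboot" and after a power-off/power-on cycle would show whether either symptom ever appears after a cold boot.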
Comment 16 Andre Klapper maemo.org 2010-10-17 20:25:35 UTC
(In reply to comment #15)
> This is still happening for me on PR1.2.

Can you please file a new bug report with steps (if possible) and post the
number here? Thanks a lot in advance!
Comment 17 steve 2010-10-17 21:29:46 UTC
(In reply to comment #16)
> Can you please file a new bug report with steps (if possible) and post the
> number here? Thanks a lot in advance!

I'm ashamed to admit that after much experimentation today I've narrowed down
the root cause of my problem to running lower than default voltages.  I wasn't
aware previously that the SGX was affected by modifying ARM's vCore.  Still, 
it's odd that this only happens after a reboot and not a cold boot.

I'm very sorry for the noise.  Thanks for your attention though.  Keep up the
great work, Andre.
Comment 18 egoshin (reporter) 2010-11-01 21:07:37 UTC
I saw this problem again, right after the reboot during the huge OTA upgrade
from PR1.2 to PR1.3.

But it has not occurred since then, even with reboots.
Comment 19 ancow 2011-04-28 05:13:10 UTC
It's happening to me on PR1.3 after a couple of soft reboots triggered by the
watchdog.