Bug 5767 - Corrupted /home/user/MyDocs VFAT filesystem out of the box
Status: RESOLVED WORKSFORME
Product: System software
mmc-and-usb
: 5.0/(1.2009.41-10)
: N900 Maemo
: Low critical with 4 votes
: ---
Assigned To: unassigned
: linux-kernel-bugs
:
: moreinfo
:
: int-154777
Reported: 2009-10-25 02:42 UTC by Kasper Souren
Modified: 2010-09-27 11:56 UTC (History)


Description Kasper Souren (reporter) 2009-10-25 02:42:12 UTC
SOFTWARE VERSION:
(Control Panel > General > About product)

1.2009.41-10


STEPS TO REPRODUCE THE PROBLEM:

Not sure - yet.  I don't know where to look.

EXPECTED OUTCOME:

No problem.

ACTUAL OUTCOME:

Problem: /media/mmc* filesystems mounted read-only.

REPRODUCIBILITY:
(always/sometimes/once)

Sometimes.  Too often actually.

EXTRA SOFTWARE INSTALLED:

A bunch.
Comment 1 Lucas Maneos 2009-10-25 11:27:26 UTC
This usually happens when the kernel thinks the filesystem on the card is
corrupted (it automatically remounts it read-only to avoid causing more
damage).

Can you check the output of the command "dmesg" in X Terminal, and
check/repair the card on another computer?
Comment 2 Kasper Souren (reporter) 2009-10-25 15:10:36 UTC
I actually meant /home/user/MyDocs, a bit confused with N810.

I can find this in dmesg:

[55636.575378] FAT: Filesystem error (dev mmcblk1p1)
[55636.575408]     fat_get_cluster: invalid cluster chain (i_pos 0)
[55636.575439]     File system has been set read-only
[55638.017578] FAT: Filesystem error (dev mmcblk1p1)
[55638.017639]     fat_get_cluster: invalid cluster chain (i_pos 0)
[55638.124145] FAT: Filesystem error (dev mmcblk1p1)
[55638.124176]     fat_get_cluster: invalid cluster chain (i_pos 0)

Can I just fsck it from the N900 itself?
I guess it would be nice if the software would deal with this in a more
convenient way.
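
Editor's sketch: the dmesg excerpt above can be summarized mechanically. The following is a minimal Python sketch, assuming the kernel messages keep exactly the wording shown in this comment; the function name summarize_fat_errors is hypothetical and not part of any Maemo tool.

```python
import re
from collections import defaultdict

# Matches kernel lines such as "[55636.575378] FAT: Filesystem error (dev mmcblk1p1)".
ERROR_RE = re.compile(r"FAT: Filesystem error \(dev (?P<dev>[^)]+)\)")
READONLY_MARKER = "File system has been set read-only"


def summarize_fat_errors(dmesg_text):
    """Return {device: {"errors": count, "read_only": bool}} for FAT errors in dmesg text."""
    summary = defaultdict(lambda: {"errors": 0, "read_only": False})
    current = None  # device named by the most recent "Filesystem error" line
    for line in dmesg_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            current = match.group("dev")
            summary[current]["errors"] += 1
        elif READONLY_MARKER in line and current is not None:
            # The read-only notice follows the error line it belongs to.
            summary[current]["read_only"] = True
    return dict(summary)
```

Feeding it the text printed by "dmesg" shows at a glance which device the kernel complained about, which is how the mmcblk1p1 (external card) versus mmcblk0p1 (MyDocs) confusion in this thread could be spotted early.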
Comment 3 Kasper Souren (reporter) 2009-10-25 15:13:36 UTC
Weird.  So that in dmesg _was_ the external card.  And everything is still
mounted rw *BUT* it's not writable:

Nokia-N900-41-10:/home/user/MyDocs# mount|grep mmcblk
/dev/mmcblk0p2 on /home type ext3
(rw,noatime,errors=continue,commit=1,data=writeback)
/dev/mmcblk0p1 on /home/user/MyDocs type vfat
(rw,noauto,nodev,noexec,nosuid,noatime,nodiratime,utf8,uid=29999,shortname=mixed,dmask=000,fmask=0133,rodir)
/dev/mmcblk1p1 on /media/mmc1 type vfat
(rw,noauto,nodev,noexec,nosuid,noatime,nodiratime,utf8,uid=29999,shortname=mixed,dmask=000,fmask=0133,rodir)
Nokia-N900-41-10:/home/user/MyDocs# touch test
touch: test: Read-only file system

And then...

Nokia-N900-41-10:/home/user/MyDocs# mount -o remount,rw /home/user/MyDocs
Nokia-N900-41-10:/home/user/MyDocs# touch test
Nokia-N900-41-10:/home/user/MyDocs# 

What can I try to fix this?

Come to think of it... This might be the cause of #5766
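
Editor's sketch: since the mount table above reports "rw" while writes fail, the only dependable check is to attempt a write. A minimal Python sketch follows, assuming /proc/mounts-style input (one mount per line, unlike the wrapped "mount" output above); mount_options and probe_writable are hypothetical helper names.

```python
import errno
import os
import tempfile


def mount_options(mounts_text, mountpoint):
    """Return the option list for mountpoint from /proc/mounts-style text, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            return fields[3].split(",")
    return None


def probe_writable(directory):
    """Create and delete a scratch file; False means the kernel refused with EROFS."""
    try:
        fd, path = tempfile.mkstemp(prefix=".rwprobe-", dir=directory)
    except OSError as exc:
        if exc.errno == errno.EROFS:
            return False
        raise  # permissions, missing directory, etc. are a different problem
    os.close(fd)
    os.unlink(path)
    return True
```

A mismatch between the two (the option list contains "rw" but probe_writable returns False) corresponds to the state described in this comment.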
Comment 4 Lucas Maneos 2009-10-25 15:40:59 UTC
(In reply to comment #2)
> Can I just fsck it from the N900 itself?

Yes (dosfsck is installed by default), or you can do it on a PC over a USB
connection.

(In reply to comment #3)
> Weird.  So that in dmesg _was_ the external card.  And everything is still
> mounted rw *BUT* it's not writable:
> 
> Nokia-N900-41-10:/home/user/MyDocs# touch test
> touch: test: Read-only file system

Sounds like both internal & card need repair.

I had the same issue with the internal one on mine (but no problems after a
dosfsck repair), and there's at least one more report of a corrupted filesystem
(bug 5414 comment 2).  It may be an artifact of the way these particular
devices were flashed, but at this point I think it's worth bringing to Nokia's
attention considering a similar thing has happened before (bug 2940).
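
Editor's sketch: the dosfsck repair mentioned in this comment has to run on an unmounted filesystem. The Python sketch below only builds and prints the command sequence by default. It assumes that "mount <mountpoint>" can restore the mount afterwards, which may not hold on the N900 (a reboot may be needed instead); repair_plan and run_plan are hypothetical names.

```python
import subprocess


def repair_plan(device, mountpoint):
    """Return argv lists for an offline check-and-repair of a FAT partition."""
    return [
        ["umount", mountpoint],           # dosfsck should not run on a mounted filesystem
        ["dosfsck", "-a", "-v", device],  # -a: repair automatically, -v: verbose
        ["mount", mountpoint],            # assumption: the system knows how to remount it
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for argv in plan:
        print(" ".join(argv))
        if dry_run:
            continue
        result = subprocess.run(argv)
        print("exit status:", result.returncode)
        if argv[0] == "umount" and result.returncode != 0:
            # Never continue to the repair step while the filesystem is still mounted.
            raise RuntimeError("could not unmount " + argv[1])
```

For example, run_plan(repair_plan("/dev/mmcblk0p1", "/home/user/MyDocs")) prints the three commands without touching the device.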
Comment 5 Zach Goldberg 2009-10-27 16:04:56 UTC
Ah!  I thought this was only me!  About a week after using my 2009.41 N900 I
noticed this behavior, I now have to remount MyDocs RW on every bootup.
Comment 6 Andre Klapper maemo.org 2009-10-27 21:21:52 UTC
(In reply to comment #4)
> at this point I think it's worth bringing to Nokia's
> attention considering a similar thing has happened before (bug 2940).

As we have a few reports now about the Maemo Summit devices having issues with
damaged filesystems I'm also wondering. CC'ing Quim - any idea who to ping for
this?
Comment 7 Quim Gil nokia 2009-10-28 07:28:30 UTC
CCing Eero so he is informed.

Summit devices were flashed in an unorthodox way (manual flashing with normal
Linux laptops, instead of the official flashing process in factories). I have
no idea whether this has anything to do with the problem reported but it's
worth mentioning it.
Comment 8 Eero Tamminen nokia 2009-10-28 18:51:52 UTC
(In reply to comment #5)
> Ah!  I thought this was only me!  About a week after using my 2009.41 N900 I
> noticed this behavior, I now have to remount MyDocs RW on every bootup.

In general, the most likely reason for FAT corruption is "user error":
- removing battery (or even just back cover) while FAT is being written
- removing USB cable without doing the "safely remove" on the PC side first
Could the issue be the result of either of these?

If the device HW watchdog resets the device while FAT was being written, that
will corrupt it also.  If device reboots because of the HW watchdog, the
/proc/bootreason file will contain value "32wd_to".  Have you had those?
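
Editor's sketch: the watchdog check described above amounts to comparing one string. A minimal Python sketch, assuming only what the comment states (the file path and the value "32wd_to"); the function names are hypothetical.

```python
WATCHDOG_REASON = "32wd_to"  # value quoted in the comment above


def read_bootreason(path="/proc/bootreason"):
    """Return the boot reason string reported by the device."""
    with open(path) as handle:
        return handle.read().strip()


def was_watchdog_reset(reason):
    """True if the given boot reason indicates a hardware watchdog reset."""
    return reason.strip() == WATCHDOG_REASON
```

On the device itself, was_watchdog_reset(read_bootreason()) answers the question for the most recent boot only; earlier resets are not visible this way.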

(FAT is used because of compatibility reasons, all USB mass storage devices
support FAT by default.  However, Microsoft's FAT file system is very fragile
in regards to disconnects/power loss.)


(In reply to comment #6)
> (In reply to comment #4)
> > at this point I think it's worth bringing to Nokia's
> > attention considering a similar thing has happened before (bug 2940).
> 
> As we have a few reports now about the Maemo Summit devices having issues with
> damaged filesystems I'm also wondering.

The bug 2940 thing is easy to check.  What do "cat /proc/partitions" and "df" say?
Comment 9 Zach Goldberg 2009-10-28 18:58:11 UTC
(In reply to comment #8)
> In general, the most likely reason for FAT corruption is "user error":
> - removing battery (or even just back cover) while FAT is being written
> - removing USB cable without doing the "safely remove" on the PC side first
> Could the issue be the result of either of these?

I am one of those currently with a (partially) corrupted disk.  I know that I
have at some point had to take the battery out because of a lock up and I may
or may not have unsafely removed usb (I can't recall).  These, however, are
things any user is liable to, and will, do in the course of owning the device.

If this is really the case, in that a user can corrupt the filesystem and cause
it to be mounted read only (hence preventing most of the device's functionality
from working properly) by doing normal "dumb user things" then we will end up
with a lot of unhappy and confused users.  The device itself will need to do
something to counteract this, such as periodic fsck's or something.
Comment 10 Eero Tamminen nokia 2009-10-28 19:47:22 UTC
(In reply to comment #9)
> I am one of those currently with a (partially) corrupted disk.  I know that I
> have at some point had to take the battery out because of a lock up and I may
> or may not have unsafely removed usb (I can't recall).  These, however, are
> things any user is liable to, and will, do in the course of owning the device.
> 
> If this is really the case, in that a user can corrupt the filesystem and
> cause it to be mounted read only (hence preventing most of the device's
> functionality from working properly) by doing normal "dumb user things"
> then we will end up with a lot of unhappy and confused users.

It should give a notification that user needs to do something.  If it didn't,
that's a (separate) bug.


> The device itself will need to do something to counteract this, such as
> periodic fsck's or something.

Unfortunately doing full fsck for tens of GB of data before it's mounted on
bootup or on USB disconnect just takes too much time.  With worst case file
system content (full of small files & dirs), it will need hundred(s) of MBs of
memory (fsck needs to keep whole disk structure in RAM and if it starts
swapping, it will take tens of minutes) and besides not letting user access
MyDocs, slows down the device while that happens.  That would be really bad
user experience.

After the filesystem is mounted RW, you cannot modify it through direct block
level access and even doing just a check while it's mounted (i.e. potentially
modified through file system access) doesn't provide reliable information.

There are not really good solutions for that.  Proper solution would be getting
rid of FAT, but we cannot do that for compatibility reasons.
Comment 11 Kasper Souren (reporter) 2009-10-29 00:24:49 UTC
Is it somehow possible to notice that filesystems are turned ro?  If so,  the
user could be shown a warning sign and possibly a question whether to reboot
and fsck the filesystem.
Comment 12 Eero Tamminen nokia 2009-10-30 13:47:48 UTC
(In reply to comment #11)
> Is it somehow possible to notice that filesystems are turned ro?  If so,  the
> user could be shown a warning sign

That should be already shown.  If not, AFAIK that's a bug.
Comment 13 Kasper Souren (reporter) 2009-10-30 14:11:06 UTC
I haven't seen any warnings about filesystems being remounted ro.
Comment 14 Jamie Lokier 2009-11-02 06:49:25 UTC
I'm not sure why FAT has to be used for the internal flash.  Why?  There are
other ways to provide a USB mass storage that *looks* like FAT to a PC but is
not internally.

If FAT must be used, perhaps a solution is to place a journalling file on it to
record metadata writes, with the file being preallocated?  It would be replayed
in the event of an unclean unmount, to protect against unexpected reboots and
power failures in the same way as journalling filesystems, but completely FAT
compatible.

My experience of other phones is that watchdog resets happen often enough that
everything must be engineered to assume they happen.  And old batteries,
old/dirty battery contacts, too much vibration, or dropping the phone on a
hard surface can all result in sudden power loss.
Comment 15 Jamie Lokier 2009-11-02 06:52:08 UTC
> If FAT must be used, perhaps a solution is to place a journalling file on it to
> record metadata writes

Btw, if this turns out to be a necessary solution, I might be able to
advise/help with implementation.
Comment 16 Eero Tamminen nokia 2009-11-02 10:26:38 UTC
(In reply to comment #13)
> I haven't seen any warnings about filesystems being remounted ro.

Could you file a separate bug about that?

(e.g. "File system RO; no notice/warning to user on bootup or USB disconnect")


(In reply to comment #14)
> I'm not sure why FAT has to be used for the internal flash.  Why?  There are
> other ways to provide a USB mass storage that *looks* like FAT to a PC but is
> not internally.

If you know one that is:
- More reliable than having real FAT
- Well tested (to actually work fine with other operating systems)
- Copes with tens of GB sized partition without gobbling large amounts of RAM

Please let us know.


> If FAT must be used, perhaps a solution is to place a journalling file
> on it to record metadata writes, with the file being preallocated?

How do you get Windowses and OSXes to do that?
Comment 17 Kasper Souren (reporter) 2009-11-02 10:54:05 UTC
(In reply to comment #16)
> (In reply to comment #13)
> > I haven't seen any warnings about filesystems being remounted ro.
> 
> Could you file a separate bug about that?
> 
> (e.g. "File system RO; no notice/warning to user on bootup or USB disconnect")

Ok
https://bugs.maemo.org/show_bug.cgi?id=5997
Comment 18 Andre Klapper maemo.org 2009-11-02 16:02:59 UTC
Note to myself: "/home/user/MyDocs has become readonly" (int-141809) might be
related.
Comment 19 Jamie Lokier 2009-11-05 03:51:08 UTC
(In reply to comment #16)
> (In reply to comment #14)
> > I'm not sure why FAT has to be used for the internal flash.  Why?  There are
> > other ways to provide a USB mass storage that *looks* like FAT to a PC but is
> > not internally.
> 
> If you know one that is:
> - More reliable than having real FAT
> - Well tested (to actually work fine with other operating systems)
> - Copes with tens of GB sized partition without gobbling large amounts of RAM
> 
> Please let us know.

I've no good answer here.  Let's drop that.

> > If FAT must be used, perhaps a solution is to place a journalling file
> > on it to record metadata writes, with the file being preallocated?
> 
> How do you get Windowses and OSXes to do that?

That's not necessary.

For the internal memory: I'm assuming the 32GB internal memory is FAT too, as
an earlier comment says the internal memory was corrupted too.

It would be written only when the internal memory is being updated by Maemo
itself during normal use.  That would protect against metadata corruption
during power failures and watchdog resets, which this bug shows do occur.

It would not protect against power failure when writing the internal memory
from an external PC as a block device, but that's ok.  The big problem to fix
is corruption during normal Maemo use with power failures and watchdog resets.

Linux' FAT driver knows which blocks are file data and which are not, and the
journal file can be a normal preallocated file, so it's not unrealistic to put
this in the kernel FAT driver.

(But I realise it's a bit much to do before the release date ;-)

Journal replay can be an external tool, and is only needed after unclean
shutdown.  But it might be easier to implement in the kernel.  It's a simple
process.

For plugged-in memory cards, journalling (by Maemo FAT writes) is not an option
unless there is a reliable way to detect when the card has been modified by
another OS, so that the journal will not be replayed after that, when the card
is reinserted.  I don't know without further study if there is such a reliable
way.  But this only applies to the plugged-in memory cards.

I've considered journalling FAT in Linux before to cope with this kind of
problem on other devices, but not implemented anything.  In the end I just used
ext3 with barriers :-)  It isn't an idea made up just for this bug.
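
Editor's sketch: the journalling idea in this comment (record metadata writes in a preallocated file, replay them after an unclean shutdown) can be illustrated with a toy write-ahead log. This Python sketch is not the kernel FAT driver design and not anything Nokia implemented. The record layout, the MAGIC value and the use of a dict as the "disk" are all assumptions made for illustration.

```python
import struct
import zlib

# Record header: magic, block number, payload length, CRC32 of the payload.
RECORD_HEADER = struct.Struct("<IIII")
MAGIC = 0x4A464154  # arbitrary marker chosen for this sketch


def append_record(journal, block_no, payload):
    """Append one metadata block write to the journal (a bytearray)."""
    header = RECORD_HEADER.pack(MAGIC, block_no, len(payload), zlib.crc32(payload))
    journal.extend(header + payload)


def replay(journal, disk):
    """Apply every intact record to disk (a dict of block number -> bytes).

    Replay stops at the first record that is missing, torn, or fails its
    checksum, so a write interrupted by power loss is simply ignored.
    Returns the number of records applied.
    """
    offset = 0
    applied = 0
    while offset + RECORD_HEADER.size <= len(journal):
        magic, block_no, length, crc = RECORD_HEADER.unpack_from(journal, offset)
        start = offset + RECORD_HEADER.size
        payload = bytes(journal[start:start + length])
        if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
            break  # zero-filled preallocated space or a torn record
        disk[block_no] = payload
        applied += 1
        offset = start + length
    return applied
```

Because a preallocated journal file is zero-filled, unused space never matches MAGIC, so replay ends naturally at the last complete record. A real design would also need a way to mark records as already applied and, as the comment notes, a way to detect modification by another OS.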
Comment 20 Eero Tamminen nokia 2009-11-05 15:13:27 UTC
(In reply to comment #19)
> > > If FAT must be used, perhaps a solution is to place a journalling file
> > > on it to record metadata writes, with the file being preallocated?
[...]
> It would be written only when the internal memory is being updated by Maemo
> itself during normal use.

This may be something to consider for Harmattan.

Note that journaling would probably noticeably slow down the file system speed
(MyDocs with FAT is nearly twice as fast as Ext3 used for /home/) which may
affect for example Youtube streams buffering & camera recording (which use
MyDocs) and slow down swap (as swap partition is on same media as MyDocs).  It
will also wear the flash more.
Comment 21 Lucas Maneos 2009-11-07 21:00:22 UTC
(In reply to comment #8)
> - removing USB cable without doing the "safely remove" on the PC side first

May be related: I'm getting lots of I/O errors when connected in USB mass
storage mode on one linux (Ubuntu 9.04 earlier and 9.10 now) box here, even
when only reading, reproducible every time.  This ends with the USB getting
reset and rescanned so the effect is similar to yanking the USB cable without
umounting cleanly.

All other machines I tried, including the laptop I had with me at the summit 
(same OS, same cable) work fine, so a) I don't think it's a Maemo bug and b)
that didn't cause the corruption in my case at least.

> Bug 2940 thing is easy to check.  What "cat /proc/partitions" and "df" say?

Slight mismatch, but not in a bad way (fs < device):

major minor  #blocks  name
 179        1   28315648 mmcblk0p1

Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/mmcblk0p1        28312128  14912384  13399744  53% /home/user/MyDocs

How is the FAT partition flashed btw?  Is it a copy of a filesystem image, or a
new filesystem created and populated from a tarball or similar?
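
Editor's sketch: the size comparison above (filesystem versus device) can be done mechanically. A minimal Python sketch, assuming both tools report 1 KiB blocks as in the output quoted in this comment and that "df" prints each filesystem on a single line; the function names are hypothetical.

```python
def partition_blocks(partitions_text, name):
    """Return the #blocks column (1 KiB units) for a device in /proc/partitions text."""
    for line in partitions_text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[3] == name:
            return int(fields[2])
    return None


def df_total_blocks(df_text, device):
    """Return the total 1k-blocks column for a device in df output."""
    for line in df_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == device and fields[1].isdigit():
            return int(fields[1])
    return None


def slack_kib(partitions_text, df_text, name):
    """Device size minus filesystem size in KiB; negative means the fs overruns the device."""
    device = partition_blocks(partitions_text, name)
    filesystem = df_total_blocks(df_text, "/dev/" + name)
    if device is None or filesystem is None:
        return None
    return device - filesystem
```

With the figures above, 28315648 - 28312128 gives 3520 KiB of slack, i.e. the filesystem is slightly smaller than the partition, which is the harmless direction.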
Comment 22 Eero Tamminen nokia 2009-11-09 16:26:43 UTC
(In reply to comment #21)
> How is the FAT partition flashed btw?  Is it a copy of a filesystem image,
> or a new filesystem created and populated from a tarball or similar?

Apparently it's actually nowadays created & populated from tarballs (home +
MyDocs), so that issue in N810 shouldn't be able to happen.
Comment 23 Lucas Maneos 2009-11-16 18:16:08 UTC
For end-user flashing it looks like a filesystem image:

$ ./flasher-3.5 -F RX-51_2009SE_1.2009.41-1.VANILLA_PR_EMMC_MR0_ARM.bin -u
flasher v2.5.2 (Oct 21 2009)

Image 'mmc', size 241163 kB
    Version RX-51_2009SE_1.2009.41-1.VANILLA
Unpacking mmc image to file 'mmc'...
$ file mmc
mmc: x86 boot sector, mkdosfs boot message display, code offset 0x58, OEM-ID " 
 Maemo", sectors/cluster 128, Media descriptor 0xf8, heads 64, sectors 8390306
(volumes > 32 MB) , FAT (32 bit), sectors/FAT 513, serial number 0x4acd9aa4,
label: "Nokia N900 "

After flashing (51_2009SE_1.2009.42-11_PR_COMBINED_MR0_ARM.bin) and the above,
and just installing openssh-server (in order to get root access) dosfsck
reports no errors.

Now we just need someone to get root and run dosfsck on a
straight-out-of-the-box retail unit and we can close this.
Comment 24 Andre Klapper maemo.org 2010-01-05 17:06:45 UTC
So did this issue ever happen to somebody having a "real" device bought, not a
device from the maemo.org Summit in Amsterdam?

If so, please speak up!
Comment 25 carlson.cox 2010-01-08 06:16:44 UTC
(In reply to comment #24)
> So did this issue ever happen to somebody having a "real" device bought, not a
> device from the maemo.org Summit in Amsterdam?
> 
> If so, please speak up!
> 

Yeah, I have a production N900, and when I put in an SD card with pictures and
tried to delete them, EVERYTHING on the card was read-only. I ran chkdsk and
supposedly it fixed some errors, but I'm still not sure what to do.
Comment 26 Eero Tamminen nokia 2010-01-11 18:57:26 UTC
(In reply to comment #25)
> Yeah, I have a production N900, and when I put in an SD card with pictures
> and tried to delete them, EVERYTHING on the card was read-only.

The issue was with the SD card, not the internal eMMC (MyDocs)?

(Then it's not the potential issue discussed here.)


> I ran chkdsk and supposedly it fixed some errors, but I'm still not
> sure what to do.

I assume it fixed the issues.  The FAT file system is easy to corrupt, especially
in multipurpose devices like the N900 which can access the card data in many
different ways.  See comment 8.
Comment 27 verbelen.tim 2010-02-02 11:01:56 UTC
I also have a corrupted filesystem. I also find the mediaplayer causing
sw_reboots when playing files from the FAT fs and I think that's also caused by
the corrupted filesystem.

In dmesg I find the following lines:
[37046.271850] FAT: Directory bread(block 9782837) failed
[37046.283264] mmcblk0: error -110 transferring data, sector 9782964, nr 1,
card status 0x200900
[37046.283294] end_request: I/O error, dev mmcblk0, sector 9782964

When I run fsck it returns:  input/output error

When I try to reflash the emmc image it returns:
SU_GET_UPDATE_STATUS_REQ terminated with error code 1.
Comment 28 Eero Tamminen nokia 2010-02-08 18:24:24 UTC
Bug 8235 comment 15 has one more reason why a file system could get corrupted.


(In reply to comment #27)
> I also have a corrupted filesystem. I also find the mediaplayer
> causing sw_reboots when playing files from the FAT fs and I think
> that's also caused by the corrupted filesystem.

Can you file a separate bug about that, put me on CC and attach to the bug
contents of /dev/mtd2 file and files from /var/lib/dsme/stats/ directory?
(you need to be root to access these)


> When I try to reflash the emmc image it returns:
> SU_GET_UPDATE_STATUS_REQ terminated with error code 1.

That's bug 7433.
Comment 29 Andre Klapper maemo.org 2010-09-27 11:56:26 UTC
Closing this bug report as the issue could not be reproduced. Please feel free
to reopen this report if you can provide better steps to reproduce this /
answer the questions in comment 16 and comment 26.