summaryrefslogtreecommitdiff
path: root/fs/btrfs/inode.c
Commit message (Collapse)AuthorAge
...
| * | btrfs: btrfs_inherit_iflags() can be staticAnand Jain2017-08-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_new_inode() is the only consumer move it to inode.c, from ioctl.c. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: drop newlines from strings when using btrfs_* helpersDavid Sterba2017-08-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The helpers append "\n" so we can keep the actual strings shorter. The extra newline will print an empty line. Some messages have been slightly modified to be more consistent with the rest (lowercase first letter). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: report errors when checksum is not foundLiu Bo2017-08-16
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When btrfs fails the checksum check, it'll fill the whole page with "1". However, if %csum_expected is 0 (which means there is no checksum), then for some unknown reason, we just pretend that the read is correct, so userspace would be confused about the dilemma that read is successful but getting a page with all content being "1". This can happen due to a bug in btrfs-convert. This fixes it by always returning errors if checksum doesn't match. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: fix blk_status_t/errno confusionOmar Sandoval2017-08-24
|/ | | | | | | | | | | | | | | | | | This fixes several instances of blk_status_t and bare errno ints being mixed up, some of which are real bugs. In the normal case, 0 matches BLK_STS_OK, so we don't observe any effects of the missing conversion, but in case of errors or passes through the repair/retry paths, the errors get mixed up. The changes were identified using 'sparse', we don't have reports of the buggy behaviour. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Merge branch 'for-4.13-part2' of ↵Linus Torvalds2017-07-14
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We've identified and fixed a silent corruption (introduced by code in the first pull), a fixup after the blk_status_t merge and two fixes to incremental send that Filipe has been hunting for some time" * 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix unexpected return value of bio_readpage_error btrfs: btrfs_create_repair_bio never fails, skip error handling btrfs: cloned bios must not be iterated by bio_for_each_segment_all Btrfs: fix write corruption due to bio cloning on raid5/6 Btrfs: incremental send, fix invalid memory access Btrfs: incremental send, fix invalid path for link commands
| * btrfs: btrfs_create_repair_bio never fails, skip error handlingDavid Sterba2017-07-14
| | | | | | | | | | | | | | | | | | As the function uses the non-failing bio allocation, we can remove error handling from the callers as well. Signed-off-by: David Sterba <dsterba@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: cloned bios must not be iterated by bio_for_each_segment_allDavid Sterba2017-07-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We've started using cloned bios more in 4.13, there are some specifics regarding the iteration. Filipe found [1] that the raid56 iterated a cloned bio using bio_for_each_segment_all, which is incorrect. The cloned bios have wrong bi_vcnt and this could lead to silent corruptions. This patch adds assertions to all remaining bio_for_each_segment_all cases. [1] https://patchwork.kernel.org/patch/9838535/ Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge branch 'for-4.13' of ↵Linus Torvalds2017-07-06
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "These are the percpu changes for the v4.13-rc1 merge window. There are a couple visibility related changes - tracepoints and allocator stats through debugfs, along with __ro_after_init markings and a cosmetic rename in percpu_counter. Please note that the simple O(#elements_in_the_chunk) area allocator used by percpu allocator is again showing scalability issues, primarily with bpf allocating and freeing large number of counters. Dennis is working on the replacement allocator and the percpu allocator will be seeing increased churns in the coming cycles" * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: fix static checker warnings in pcpu_destroy_chunk percpu: fix early calls for spinlock in pcpu_stats percpu: resolve err may not be initialized in pcpu_alloc percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch percpu: add tracepoint support for percpu memory percpu: expose statistics about percpu memory via debugfs percpu: migrate percpu data structures to internal header percpu: add missing lockdep_assert_held to func pcpu_free_area mark most percpu globals as __ro_after_init
| * | percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batchNikolay Borisov2017-06-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, percpu_counter_add is a wrapper around __percpu_counter_add which is preempt safe due to explicit calls to preempt_disable. Given how __ prefix is used in percpu related interfaces, the naming unfortunately creates the false sense that __percpu_counter_add is less safe than percpu_counter_add. In terms of context-safety, they're equivalent. The only difference is that the __ version takes a batch parameter. Make this a bit more explicit by just renaming __percpu_counter_add to percpu_counter_add_batch. This patch doesn't cause any functional changes. tj: Minor updates to patch description for clarity. Cosmetic indentation updates. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@fb.com> Cc: linux-mm@kvack.org Cc: "David S. Miller" <davem@davemloft.net>
* | | Merge branch 'for-4.13-part1' of ↵Linus Torvalds2017-07-05
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The core updates improve error handling (mostly related to bios), with the usual incremental work on the GFP_NOFS (mis)use removal, refactoring or cleanups. Except the two top patches, all have been in for-next for an extensive amount of time. User visible changes: - statx support - quota override tunable - improved compression thresholds - obsoleted mount option alloc_start Core updates: - bio-related updates: - faster bio cloning - no allocation failures - preallocated flush bios - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates - prep work for btree_inode removal - dir-item validation - qgoup fixes and updates - cleanups: - removed unused struct members, unused code, refactoring - argument refactoring (fs_info/root, caller -> callee sink) - SEARCH_TREE ioctl docs" * 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits) btrfs: Remove false alert when fiemap range is smaller than on-disk extent btrfs: Don't clear SGID when inheriting ACLs btrfs: fix integer overflow in calc_reclaim_items_nr btrfs: scrub: fix target device intialization while setting up scrub context btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges btrfs: qgroup: Introduce extent changeset for qgroup reserve functions btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled btrfs: qgroup: Return actually freed bytes for qgroup release or free data btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function btrfs: qgroup: Add quick exit for non-fs extents Btrfs: rework delayed ref total_bytes_pinned accounting Btrfs: return old and new total ref mods when adding delayed refs Btrfs: always account pinned bytes when dropping a tree block ref Btrfs: update total_bytes_pinned when pinning down extents Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 btrfs: fix validation of XATTR_ITEM dir items btrfs: Verify dir_item in iterate_object_props btrfs: Check name_len before in btrfs_del_root_ref btrfs: Check name_len before reading btrfs_get_name ...
| * | btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ↵Qu Wenruo2017-06-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ranges [BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: qgroup: Introduce extent changeset for qgroup reserve functionsQu Wenruo2017-06-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it in error paths. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write ↵Qu Wenruo2017-06-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and quotas being enabled [BUG] Under the following case, we can underflow qgroup reserved space. Task A | Task B --------------------------------------------------------------- Quota disabled | Buffered write | |- btrfs_check_data_free_space() | | *NO* qgroup space is reserved | | since quota is *DISABLED* | |- All pages are copied to page | cache | | Enable quota | Quota scan finished | | Sync_fs | |- run_delalloc_range | |- Write pages | |- btrfs_finish_ordered_io | |- insert_reserved_file_extent | |- btrfs_qgroup_release_data() | Since no qgroup space is reserved in Task A, we underflow qgroup reserved space This can be detected by fstest btrfs/104. [CAUSE] In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes size of qgroup reserved_space in all cases. And btrfs_qgroup_release_data() will check if quotas are enabled. However in the above case, the buffered write happens before quota is enabled, so we don't have the reserved space for that range. [FIX] In insert_reserved_file_extent(), we tell qgroup to release the acctual byte number it released. In the above case, since we don't have the reserved space, we tell qgroups to release 0 byte, so the problem can be fixed. And thanks to the @reserved parameter introduced by the qgroup rework, and previous patch to return released bytes, the fix can be as small as 10 lines. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> [ changelog updates ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: Check name_len with boundary in verify dir_itemSu Yue2017-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally, verify_dir_item verifies name_len of dir_item with fixed values but not item boundary. If corrupted name_len was not bigger than the fixed value, for example 255, the function will think the dir_item is fine. And then reading beyond boundary will cause crash. Example: 1. Corrupt one dir_item name_len to be 255. 2. Run 'ls -lar /mnt/test/ > /dev/null' dmesg: [ 48.451449] BTRFS info (device vdb1): disk space caching is enabled [ 48.451453] BTRFS info (device vdb1): has skinny extents [ 48.489420] general protection fault: 0000 [#1] SMP [ 48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq [ 48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5 [ 48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 [ 48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000 [ 48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs] [ 48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202 [ 48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000 [ 48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368 [ 48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000 [ 48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288 [ 48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20 [ 48.490008] FS: 00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000 [ 48.490008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0 [ 48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 48.490008] Call Trace: [ 48.490008] btrfs_real_readdir+0x3b7/0x4a0 [btrfs] [ 48.490008] iterate_dir+0x181/0x1b0 [ 48.490008] SyS_getdents+0xa7/0x150 [ 48.490008] ? fillonedir+0x150/0x150 [ 48.490008] entry_SYSCALL_64_fastpath+0x18/0xad [ 48.490008] RIP: 0033:0x7fef5032546b [ 48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e [ 48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b [ 48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003 [ 48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000 [ 48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38 [ 48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e [ 48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98 [ 48.499455] ---[ end trace 321920d8e8339505 ]--- Fix it by adding a parameter @slot and check name_len with item boundary by calling btrfs_is_name_len_valid. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> rev Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: cleanup duplicate return value in insert_inline_extentDavid Sterba2017-06-20
| | | | | | | | | | | | | | | | | | | | | | | | The pattern when err is used for function exit and ret is used for return values of callees is not used here. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: compression must free at least one sector sizeTimofey Titovets2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already skip storing data where compression does not make the result at least one byte less. Let's make the logic better and check that compression frees at least one sector size of bytes, otherwise it's not that useful. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ changelog updated ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: skip checksum verification if IO error occursLiu Bo2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently dio read also goes to verify checksum if -EIO has been returned, although it usually fails on checksum, it's not necessary at all, we could directly check if there is another copy to read. And with this, the behavior of dio read is now consistent with that of buffered read. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use bool for uptodate ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: tolerate errors if we have retried successfullyLiu Bo2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With raid1 profile, dio read isn't tolerating IO errors if read length is less than the stripe length (64K). Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) is true and that happens when the read length is less than 64k. In this case, if the underlying device returns error somehow, bio->bi_error has recorded that error. If we could recover the correct data from another copy in profile raid1/10/5/6, with btrfs_subio_endio_read() returning 0, bio would have the correct data in its vector, but bio->bi_error is not updated accordingly so that the following dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed. This fixes the problem by setting bio's error to 0 if a good copy has been found. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: sink gfp parameter to btrfs_bio_cloneDavid Sterba2017-06-19
| | | | | | | | | | | | | | | | | | | | | All callers pass GFP_NOFS. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: btrfs_bio_clone never fails, skip error handlingDavid Sterba2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | Update direct callers of btrfs_bio_clone that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: simplify code with bio_io_errorGuoqing Jiang2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bio_io_error was introduced in the commit 4246a0b63bd8f56a1469b ("block: add a bi_error field to struct bio"), so use it to simplify code. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: use generic slab for for btrfs_transactionDavid Sterba2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Observing the number of slab objects of btrfs_transaction, there's just one active on an almost quiescent filesystem, and the number of objects goes to about ten when sync is in progress. Then the nubmer goes down to 1. This matches the expectations of the transaction lifetime. For such use the separate slab cache is not justified, as we do not reuse objects frequently. For the shortlived transaction, the generic slab (size 512) should be ok. We can optimistically expect that the 512 slabs are not all used (fragmentation) and there are free slots to take when we do the allocation, compared to potentially allocating a whole new page for the separate slab. We'll lose the stats about the object use, which could be added later if we really need them. Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: add statx supportYonghong Song2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return enhanced file attributes from the btrfs, including: (1). inode creation time as stx_btime, and (2). Certain BTRFS_INODE_xxx flags are mapped to stx_attributes flags. Example output: [root@localhost ~]# cat t.sh touch t chattr +aic t ~/linux/samples/statx/test-statx t chattr -aic t touch t echo "========================================" ~/linux/samples/statx/test-statx t /bin/rm t [root@localhost ~]# ./t.sh statx(t) = 0 results=fff Size: 0 Blocks: 0 IO Block: 4096 regular file Device: 00:1c Inode: 63962 Links: 1 Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 Access: 2017-05-11 16:03:13.999856591-0700 Modify: 2017-05-11 16:03:13.999856591-0700 Change: 2017-05-11 16:03:14.000856663-0700 Birth: 2017-05-11 16:03:13.999856591-0700 Attributes: 0000000000000034 (........ ........ ........ ........ ........ ........ ........ .-ai.c..) ======================================== statx(t) = 0 results=fff Size: 0 Blocks: 0 IO Block: 4096 regular file Device: 00:1c Inode: 63962 Links: 1 Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 Access: 2017-05-11 16:03:14.006857097-0700 Modify: 2017-05-11 16:03:14.006857097-0700 Change: 2017-05-11 16:03:14.006857097-0700 Birth: 2017-05-11 16:03:13.999856591-0700 Attributes: 0000000000000000 (........ ........ ........ ........ ........ ........ ........ .---.-..) [root@localhost ~]# Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: cleanup root usage by btrfs_get_alloc_profileJeff Mahoney2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two places where we don't already know what kind of alloc profile we need before calling btrfs_get_alloc_profile, but we need access to a root everywhere we call it. This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile() and relegates btrfs_system_alloc_profile to a static for use in those two cases. The next patch will eliminate one of those. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix bool type in btrfs_page_exists_in_rangeDavid Sterba2017-06-19
| | | | | | | | | | | | | | | | | | We use only a simple bool indicator, int is not a problem here. Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: hardcode GFP_NOFS for btrfs_bio_clone_partialLiu Bo2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: work around maybe-uninitialized warningArnd Bergmann2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A rewrite of btrfs_submit_direct_hook appears to have introduced a warning: fs/btrfs/inode.c: In function 'btrfs_submit_direct_hook': fs/btrfs/inode.c:8467:14: error: 'bio' may be used uninitialized in this function [-Werror=maybe-uninitialized] Where the 'bio' variable was previously initialized unconditionally, it is now set in the "while (submit_len > 0)" loop that would never execute if submit_len is zero. Assuming this cannot happen in practice, we can avoid the warning by simply replacing the while{} loop with a do{}while() loop so the compiler knows that it will always be entered at least once. Fixes changes introduced in "Btrfs: use bio_clone_bioset_partial to simplify DIO submit". Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: unify naming of btrfs_io_bioLiu Bo2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | All dio endio functions are using io_bio for struct btrfs_io_bio, this makes btrfs_submit_direct to follow this convention. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: record error if one block has failed to retryLiu Bo2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the nocsum case of dio read endio, it returns immediately if an error gets returned when repairing, which leaves the rest blocks unrepaired. The behavior is different from how buffered read endio works in the same case. This changes it to record error only and go on repairing the rest blocks. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: change how we iterate bios in endioLiu Bo2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since dio submit has used bio_clone_fast, the submitted bio may not have a reliable bi_vcnt, for the bio vector iterations in checksum related functions, bio->bi_iter is not modified yet and it's safe to use bio_for_each_segment, while for those bio vector iterations in dio read's endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning bios and use the helper __bio_for_each_segment with the saved bvec_iter to access each bvec. Also for dio reads which don't get split, we also need to save a copy of bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access each bvec in dio read's endio. Note that it doesn't affect other calls of btrfs_bio_clone() because they don't need to use this iterator. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: use bio_clone_bioset_partial to simplify DIO submitLiu Bo2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when mapping bio to limit bio to a single stripe length, we split bio by adding page to bio one by one, but later we don't modify the vector of bio at all, thus we can use bio_clone_fast to use the original bio vector directly. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: don't pass the inode through clean_io_failureJosef Bacik2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | Instead pass around the failure tree and the io tree. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: replace tree->mapping with tree->private_dataJosef Bacik2017-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For extent_io tree's we have carried the address_mapping of the inode around in the io tree in order to pull the inode back out for calling into various tree ops hooks. This works fine when everything that has an extent_io_tree has an inode. But we are going to remove the btree_inode, so we need to change this. Instead just have a generic void * for private data that we can initialize with, and have all the tree ops use that instead. This had a lot of cascading changes but should be relatively straightforward. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor reordering of the callback prototypes ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: remove an unused variableDan Carpenter2017-06-19
| | | | | | | | | | | | | | | | | | | | | "item" is never used. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: nowait aio supportGoldwyn Rodrigues2017-06-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return EAGAIN if any of the following checks fail + i_rwsem is not lockable + NODATACOW or PREALLOC is not set + Cannot nocow at the desired location + Writing beyond end of file which is not allocated Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'v4.12-rc5' into for-4.13/blockJens Axboe2017-06-12
|\ \ \ | |/ / | | | | | | | | | | | | | | | | | | | | | | | | We've already got a few conflicts and upcoming work depends on some of the changes that have gone into mainline as regression fixes for this series. Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream trees to continue working on 4.13 changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_ioLiu Bo2017-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before this, we use 'filled' mode here, ie. if all range has been filled with EXTENT_DEFRAG bits, get to clear it, but if the defrag range joins the adjacent delalloc range, then we'll have EXTENT_DEFRAG bits in extent_state until releasing this inode's pages, and that prevents extent_data from being freed. This clears the bit if any was found within the ordered extent. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | btrfs: use correct types for page indices in btrfs_page_exists_in_rangeDavid Sterba2017-06-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Variables start_idx and end_idx are supposed to hold a page index derived from the file offsets. The int type is not the right one though, offsets larger than 1 << 44 will get silently trimmed off the high bits. (1 << 44 is 16TiB) What can go wrong, if start is below the boundary and end gets trimmed: - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot) - the final check "if (page->index <= end_idx)" will unexpectedly fail The function will return false, ie. "there's no page in the range", although there is at least one. btrfs_page_exists_in_range is used to prevent races in: * in hole punching, where we make sure there are not pages in the truncated range, otherwise we'll wait for them to finish and redo truncation, but we're going to replace the pages with holes anyway so the only problem is the intermediate state * lock_extent_direct: we want to make sure there are no pages before we lock and start DIO, to prevent stale data reads For practical occurence of the bug, there are several constaints. The file must be quite large, the affected range must cross the 16TiB boundary and the internal state of the file pages and pending operations must match. Also, we must not have started any ordered data in the range, otherwise we don't even reach the buggy function check. DIO locking tries hard in several places to avoid deadlocks with buffered IO and avoids waiting for ranges. The worst consequence seems to be stale data read. CC: Liu Bo <bo.li.liu@oracle.com> CC: stable@vger.kernel.org # 3.16+ Fixes: fc4adbff823f7 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking") Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | block: switch bios to blk_status_tChristoph Hellwig2017-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | fs: remove the unused error argument to dio_end_io()Christoph Hellwig2017-06-09
|/ / | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge branch 'for-chris-4.12' of ↵Chris Mason2017-04-27
|\ \ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.12
| * | Btrfs: fix reported number of inode blocksFilipe Manana2017-04-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when there are buffered writes that were not yet flushed and they fall within allocated ranges of the file (that is, not in holes or beyond eof assuming there are no prealloc extents beyond eof), btrfs simply reports an incorrect number of used blocks through the stat(2) system call (or any of its variants), regardless of mount options or inode flags (compress, compress-force, nodatacow). This is because the number of blocks used that is reported is based on the current number of bytes in the vfs inode plus the number of dealloc bytes in the btrfs inode. The later covers bytes that both fall within allocated regions of the file and holes. Example scenarios where the number of reported blocks is wrong while the buffered writes are not flushed: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec) # The following should have reported 64K... $ du -h /mnt/sdc/foo1 128K /mnt/sdc/foo1 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo1 64K /mnt/sdc/foo1 $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 65536 64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec) # The following should have reported 128K... $ du -h /mnt/sdc/foo2 192K /mnt/sdc/foo2 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo2 128K /mnt/sdc/foo2 So the number of used file blocks is simply incorrect, unlike in other filesystems such as ext4 and xfs for example, but only while the buffered writes are not flushed. Fix this by tracking the number of delalloc bytes that fall within holes and beyond eof of a file, and use instead this new counter when reporting the number of used blocks for an inode. Another different problem that exists is that the delalloc bytes counter is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from the respective range in the inode's iotree) and the vfs inode's bytes counter is only incremented when writeback finishes (through insert_reserved_file_extent()). Therefore while writeback is ongoing we simply report a wrong number of blocks used by an inode if the write operation covers a range previously unallocated. While this change does not fix this problem, it does minimizes it a lot by shortening that time window, as the new dealloc bytes counter (new_delalloc_bytes) is only decremented when writeback finishes right before updating the vfs inode's bytes counter. Fully fixing this second problem is not trivial and will be addressed later by a different patch. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | Btrfs: fix incorrect space accounting after failure to insert inline extentFilipe Manana2017-04-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using compression, if we fail to insert an inline extent we incorrectly end up attempting to free the reserved data space twice, once through extent_clear_unlock_delalloc(), because we pass it the flag EXTENT_DO_ACCOUNTING, and once through a direct call to btrfs_free_reserved_data_space_noquota(). This results in a trace like the following: [ 834.576240] ------------[ cut here ]------------ [ 834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs] [ 834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2 [ 834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs] [ 834.596103] Call Trace: [ 834.596103] dump_stack+0x67/0x90 [ 834.596103] __warn+0xc2/0xdd [ 834.596103] warn_slowpath_null+0x1d/0x1f [ 834.596103] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.596103] compress_file_range.constprop.42+0x2fa/0x3fc [btrfs] [ 834.596103] ? submit_compressed_extents+0x3a7/0x3a7 [btrfs] [ 834.596103] async_cow_start+0x32/0x4d [btrfs] [ 834.596103] btrfs_scrubparity_helper+0x187/0x3e7 [btrfs] [ 834.596103] btrfs_delalloc_helper+0xe/0x10 [btrfs] [ 834.596103] process_one_work+0x273/0x4e4 [ 834.596103] worker_thread+0x1eb/0x2ca [ 834.596103] ? rescuer_thread+0x2b6/0x2b6 [ 834.596103] kthread+0x100/0x108 [ 834.596103] ? __list_del_entry+0x22/0x22 [ 834.596103] ret_from_fork+0x2e/0x40 [ 834.611656] ---[ end trace 719902fe6bdef08f ]--- So fix this by not calling directly btrfs_free_reserved_data_space_noquota() if an error happened. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | Btrfs: fix invalid attempt to free reserved space on failure to cow rangeFilipe Manana2017-04-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When attempting to COW a file range (we are starting writeback and doing COW), if we manage to reserve an extent for the range we will write into but fail after reserving it and before creating the respective ordered extent, we end up in an error path where we attempt to decrement the data space's bytes_may_use counter after we already did it while reserving the extent, leading to a warning/trace like the following: [ 847.621524] ------------[ cut here ]------------ [ 847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg [ 847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2 [ 847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 847.648601] Call Trace: [ 847.648601] dump_stack+0x67/0x90 [ 847.648601] __warn+0xc2/0xdd [ 847.648601] warn_slowpath_null+0x1d/0x1f [ 847.648601] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 847.648601] btrfs_clear_bit_hook+0x140/0x258 [btrfs] [ 847.648601] clear_state_bit+0x87/0x128 [btrfs] [ 847.648601] __clear_extent_bit+0x222/0x2b7 [btrfs] [ 847.648601] clear_extent_bit+0x17/0x19 [btrfs] [ 847.648601] extent_clear_unlock_delalloc+0x3b/0x6b [btrfs] [ 847.648601] cow_file_range.isra.39+0x387/0x39a [btrfs] [ 847.648601] run_delalloc_nocow+0x4d7/0x70e [btrfs] [ 847.648601] ? arch_local_irq_save+0x9/0xc [ 847.648601] run_delalloc_range+0xa7/0x2b5 [btrfs] [ 847.648601] writepage_delalloc.isra.31+0xb9/0x15c [btrfs] [ 847.648601] __extent_writepage+0x249/0x2e8 [btrfs] [ 847.648601] extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs] [ 847.648601] ? arch_local_irq_save+0x9/0xc [ 847.648601] ? mark_lock+0x24/0x201 [ 847.648601] extent_writepages+0x4b/0x5c [btrfs] [ 847.648601] ? btrfs_writepage_start_hook+0xed/0xed [btrfs] [ 847.648601] btrfs_writepages+0x28/0x2a [btrfs] [ 847.648601] do_writepages+0x23/0x2c [ 847.648601] __filemap_fdatawrite_range+0x5a/0x61 [ 847.648601] filemap_fdatawrite_range+0x13/0x15 [ 847.648601] btrfs_fdatawrite_range+0x20/0x46 [btrfs] [ 847.648601] start_ordered_ops+0x19/0x23 [btrfs] [ 847.648601] btrfs_sync_file+0x136/0x42c [btrfs] [ 847.648601] vfs_fsync_range+0x8c/0x9e [ 847.648601] vfs_fsync+0x1c/0x1e [ 847.648601] do_fsync+0x31/0x4a [ 847.648601] SyS_fsync+0x10/0x14 [ 847.648601] entry_SYSCALL_64_fastpath+0x18/0xad [ 847.648601] RIP: 0033:0x7f5b05200800 [ 847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [ 847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800 [ 847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003 [ 847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f [ 847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046 [ 847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740 [ 847.648601] ? trace_hardirqs_off_caller+0x3f/0xaa [ 847.685787] ---[ end trace 2a4a3e15382508e8 ]--- So fix this by not attempting to decrement the data space info's bytes_may_use counter if we already reserved the extent and an error happened before creating the ordered extent. We are already correctly freeing the reserved extent if an error happens, so there's no additional measure needed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| * | btrfs: Handle delalloc error correctly to avoid ordered extent hangQu Wenruo2017-04-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] If run_delalloc_range() returns error and there is already some ordered extents created, btrfs will be hanged with the following backtrace: Call Trace: __schedule+0x2d4/0xae0 schedule+0x3d/0x90 btrfs_start_ordered_extent+0x160/0x200 [btrfs] ? wake_atomic_t_function+0x60/0x60 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] process_one_work+0x2af/0x720 ? process_one_work+0x22b/0x720 worker_thread+0x4b/0x4f0 kthread+0x10f/0x150 ? process_one_work+0x720/0x720 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 [CAUSE] |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<>| |<---------- cleanup range --------->| || \_=> First page handled by end_extent_writepage() in __extent_writepage() The problem is caused by error handler of run_delalloc_range(), which doesn't handle any created ordered extents, leaving them waiting on btrfs_finish_ordered_io() to finish. However after run_delalloc_range() returns error, __extent_writepage() won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered except the first page, and btrfs_finish_ordered_io() won't be triggered for created ordered extents either. So OE 2~n will hang forever, and if OE 1 is larger than one page, it will also hang. [FIX] Introduce btrfs_cleanup_ordered_extents() function to cleanup created ordered extents and finish them manually. The function is based on existing btrfs_endio_direct_write_update_ordered() function, and modify it to act just like btrfs_writepage_endio_hook() but handles specified range other than one page. After fix, delalloc error will be handled like: |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<>|<-------- ----------->|<------ old error handler --------->| || || || \_=> Cleaned up by cleanup_ordered_extents() \_=> First page handled by end_extent_writepage() in __extent_writepage() Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum errorQu Wenruo2017-04-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] When btrfs_reloc_clone_csum() reports error, it can underflow metadata and leads to kernel assertion on outstanding extents in run_delalloc_nocow() and cow_file_range(). BTRFS info (device vdb5): relocating block group 12582912 flags data BTRFS info (device vdb5): found 1 extents assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 Currently, due to another bug blocking ordered extents, the bug is only reproducible under certain block group layout and using error injection. a) Create one data block group with one 4K extent in it. To avoid the bug that hangs btrfs due to ordered extent which never finishes b) Make btrfs_reloc_clone_csum() always fail c) Relocate that block group [CAUSE] run_delalloc_nocow() and cow_file_range() handles error from btrfs_reloc_clone_csum() wrongly: (The ascii chart shows a more generic case of this bug other than the bug mentioned above) |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- cleanup range --------------->| |<----------- ----------->| \/ btrfs_finish_ordered_io() range So error handler, which calls extent_clear_unlock_delalloc() with EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() will both cover OE n, and free its metadata, causing metadata under flow. [Fix] The fix is to ensure after calling btrfs_add_ordered_extent(), we only call error handler after increasing the iteration offset, so that cleanup range won't cover any created ordered extent. |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- ----------->|<---------- cleanup range --------->| \/ btrfs_finish_ordered_io() range Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
* | | Btrfs: handle only applicable errors returned by btrfs_get_extentDan Carpenter2017-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_get_extent() never returns NULL pointers, so this code introduces a static checker warning. The btrfs_get_extent() is a bit complex, but trust me that it doesn't return NULLs and also if it did we would trigger the BUG_ON(!em) before the last return statement. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> [ updated subject ] Signed-off-by: David Sterba <dsterba@suse.com>
* | | Btrfs: add file item tracepointsLiu Bo2017-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While debugging truncate problems, I found that these tracepoints could help us quickly know what went wrong. Two sets of tracepoints are created to track regular/prealloc file item and inline file item respectively, I put inline as a separate one since what inline file items cares about are way less than the regular one. This adds four tracepoints: - btrfs_get_extent_show_fi_regular - btrfs_get_extent_show_fi_inline - btrfs_truncate_show_fi_regular - btrfs_truncate_show_fi_inline Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ formatting adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
* | | Btrfs: remove ASSERT in btrfs_truncate_inode_itemsLiu Bo2017-04-18
| |/ |/| | | | | | | | | | | | | | | | | | | | | After 76b42abbf748 ("Btrfs: fix data loss after truncate when using the no-holes feature"), For either NO_HOLES or inline extents, we've set last_size to newsize to avoid data loss after remount or inode got evicted and read again, thus, we don't need this check anymore. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge branch 'for-linus-4.11' of ↵Linus Torvalds2017-04-14
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Dave Sterba collected a few more fixes for the last rc. These aren't marked for stable, but I'm putting them in with a batch were testing/sending by hand for this release" * 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix potential use-after-free for cloned bio Btrfs: fix segmentation fault when doing dio read Btrfs: fix invalid dereference in btrfs_retry_endio btrfs: drop the nossd flag when remounting with -o ssd