Problem statement
For the DataLad project we will establish a large number of git-annex
repositories. The majority of those git-annex repositories will not
contain any data, but rather a lot of empty files and/or broken
symlinks. Some of the data sharing projects we are going to cover
contain thousands or even millions of files (e.g. the HCP500 release S3
bucket contained over 5 million files). Originally, we had a very
good experience with ZoL (ZFS on Linux) on the
computational clusters we administer/use. Its volume management, data
integrity checking, and various other features (snapshotting, etc.) are very
appealing and have served us well. So ZFS was a logical choice to
deploy on our DataLad development/storage server(s). Unfortunately,
during development we ran into complete system stalls
(which, thanks to the ZoL developers, were promptly resolved) and overall
very slow performance (even with SSD L2ARC caches and relatively
large RAM on those boxes). So it became important to investigate whether
we were indeed hitting the limits of contemporary conventional
file-system designs when dealing with a large number of files, or were just
bound by ZoL.
System setup
All the benchmarks were run on a recently purchased server:
- CPU: Intel(R) Xeon(R) CPU E5-1607 v2 @ 3.00GHz (quad core)
- RAM: 64GB (registered)
- OS: Debian jessie (8.1) amd64 build, with NeuroDebian and ZoL
repositories enabled
- Kernel: 3.16.7-ckt9-3~deb8u1 (Debian: 3.16.0-4-amd64)
- ZoL/ZFS module: 0.6.4-15-7fec46
- BTRFS tools: originally 3.17 (as in jessie) but then tested a single setup
initiated with 4.0 tools (backport coming from NeuroDebian repository)
- Drives: 6 Seagate Constellation ES.3 (ST4000NM0033-9ZM170, fw SN04)
on an LSI SAS3008 HBA
Scripts
The test_fs.py
script was created to set up a range of prototypical filesystem
configurations, with software RAID/MD (redundancy), LVM (volume management)
and various file systems (EXT4, XFS, ZFS, ReiserFS, BTRFS) in various
configurations. We tested typical operations in the lifetime of a
git/annex repository -- creation (only for the larger repositories test),
du, chmod, git clone, git annex get, git annex (in)direct, tar, etc. --
the full list is available in the results tables below.
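To illustrate the kind of measurement involved, here is a minimal sketch of timing a list of shell commands. It is not the actual test_fs.py code: the function names `time_command` and `run_benchmark` and the shape of the result records are assumptions made for this example only.

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run a shell command and return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    proc = subprocess.run(
        cmd,
        shell=True,
        cwd=cwd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start, proc.returncode


def run_benchmark(commands, cwd=None):
    """Time each command in order and collect one record per command."""
    results = []
    for cmd in commands:
        elapsed, returncode = time_command(cmd, cwd=cwd)
        results.append(
            {"command": cmd, "seconds": elapsed, "returncode": returncode}
        )
    return results


if __name__ == "__main__":
    # Hypothetical usage: "testX" stands for a test repository directory.
    for record in run_benchmark(["du -scm testX", "tar -cf testX.tar testX"]):
        print(record)
```

Running the same command twice in a row gives a rough "warm" timing for the second run; obtaining a truly "cold" timing would additionally require flushing the caches, which this sketch does not attempt.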
Results
Below we present the results of the analysis of the collected timings, which
is available in its entirety as an
IPython notebook
with all the necessary data in the repository, in case you would like to
carry out an alternative analysis or visualization, since the presentation
here is quite messy.
The overall verdict: we are going with software RAID (RAID6) + LVM
(scalability) + BTRFS (performance) as our setup, with robust volume
management (the snapper tool providing snapshot management) and good
performance (with tolerable impact from enabled compression). BTRFS
exhibited robust performance across a wide range of meta-information
utilization, while ReiserFS -- the next contender -- failed to scale
(although it might have behaved better with some tuning/options, which we
did not explore). ZoL/ZFS quickly became too slow to even be
properly tested and thus was not considered for our case. Additional
features of BTRFS, such as COW (copy-on-write), can already be utilized
by git-annex, giving BTRFS additional bonus points in our
decision. The choice of compression (lzo vs zlib) for BTRFS did not
have a clear winner, and the impact of compression was not completely
detrimental (although the overall impact due to compression went up to
~30% on the git clone operation).
Disclaimer: YMMV. We have tested file-systems on quite an obscure
setup, which is very heavy on meta-information without much actual
data being stored -- lots of tiny files with as many symlinks and
directories (under .git/annex/objects). However, the majority of the
benchmarked commands (e.g., chmod, du) were meta-information access
heavy, so even if large data files were stored we would expect similar
performance in those cases.
Hint: if you would like to see larger plots, open the images in
separate pages so that they become zoomable in your browser.
Small repositories test
The initial test consisted of creating relatively small repositories, each containing 100 files in 20 directories added to git-annex. Multiple rounds were run without removing previously created test directories/repositories, thus slowly growing the load on the filesystem. ZFS filesystem setups ran only 10 rounds of such tests, while the other filesystems ran 100.
The following results present timings for the first 10 rounds across file systems on "cold" runs of the commands. The ratings/timing table presents the overall rating (0 is the best; estimated as the median of the ratings across all benchmark commands) along with the rating and mean timing (across rounds) for each command.
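The rating scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the notebook: the function name `rate_filesystems`, the nested-dictionary input layout, and the tie-breaking (ties keep input order) are all hypothetical choices made for this example.

```python
from statistics import mean, median


def rate_filesystems(timings):
    """Rank filesystems per command and compute an overall rating.

    timings: {fs_name: {command: [seconds for each round]}}
    Returns (overall, per_command) where
      per_command[fs][command] == (rank, mean_seconds), rank 0 is fastest
      overall[fs] == median of that filesystem's ranks across commands
    """
    # Mean timing across rounds for every filesystem/command pair.
    means = {
        fs: {cmd: mean(values) for cmd, values in cmds.items()}
        for fs, cmds in timings.items()
    }
    commands = sorted({cmd for cmds in means.values() for cmd in cmds})

    per_command = {fs: {} for fs in means}
    for cmd in commands:
        # Sort the filesystems that ran this command by mean time.
        ordered = sorted(
            (fs for fs in means if cmd in means[fs]),
            key=lambda fs: means[fs][cmd],
        )
        for rank, fs in enumerate(ordered):
            per_command[fs][cmd] = (rank, means[fs][cmd])

    overall = {
        fs: median([rank for rank, _ in per_command[fs].values()])
        for fs in means
    }
    return overall, per_command


if __name__ == "__main__":
    # Made-up timings for three hypothetical filesystems, two rounds each.
    toy = {
        "A": {"du": [1.0, 1.2], "tar": [3.0, 3.0], "chmod": [0.1, 0.1]},
        "B": {"du": [0.5, 0.5], "tar": [4.0, 4.0], "chmod": [0.2, 0.2]},
        "C": {"du": [2.0, 2.0], "tar": [1.0, 1.0], "chmod": [0.3, 0.3]},
    }
    overall, per_command = rate_filesystems(toy)
    for fs in sorted(overall):
        cells = ", ".join(
            f"{cmd}: {rank}/{secs:.2f}"
            for cmd, (rank, secs) in sorted(per_command[fs].items())
        )
        print(fs, overall[fs], cells)
```

Each printed cell has the same "rating/mean time" form as the cells in the tables below.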
Got 19 reports
# of reports per each FS: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
Got 2850 entries total
Gross-rating/timing for each filesystem/per each command
| fs | rating | 00: du -scm testX | 01: tar -cf testX.tar testX | 02: pigz testX.tar | 03: chmod +w -R testX/.git/annex/objects | 04: git clone testX testX.clone | 05: git annex get . | 06: git annex drop . | 07: du -scm testX.clone | 08: rm -rf testX | 09: rm -rf testX.clone | 10: tar -xzf testX.tar.gz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BTRFS_LVM_MD_raid6 | 6 | 6/0.78 | 6/0.85 | 9/0.19 | 6/0.41 | 6/0.58 | 5/10.77 | 5/8.87 | 12/1.09 | 5/0.98 | 11/1.39 | 4/0.59 |
| BTRFS_LVM_raid6_sda+sdb+sdc+sdd+sde+sdf | 3 | 4/0.73 | 3/0.70 | 1/0.12 | 3/0.23 | 3/0.47 | 3/10.26 | 3/8.51 | 7/0.77 | 4/0.90 | 5/1.07 | 0/0.51 |
| BTRFS_MD_raid6 | 5 | 5/0.74 | 5/0.79 | 8/0.18 | 5/0.37 | 5/0.56 | 4/10.72 | 4/8.79 | 11/1.05 | 6/0.98 | 12/1.43 | 3/0.57 |
| BTRFS_sda_-mraid6 | 4 | 3/0.73 | 4/0.72 | 0/0.12 | 4/0.28 | 4/0.48 | 6/11.46 | 6/10.80 | 8/0.79 | 3/0.87 | 7/1.11 | 1/0.56 |
| EXT4_LVM_MD_raid6 | 10 | 10/3.73 | 10/12.15 | 3/0.13 | 10/2.29 | 11/2.48 | 8/32.87 | 12/27.71 | 4/0.67 | 10/3.95 | 3/0.83 | 14/1.01 |
| EXT4_MD_raid6 | 9 | 9/3.24 | 9/10.30 | 2/0.13 | 9/2.02 | 8/2.03 | 7/32.71 | 9/25.91 | 3/0.65 | 9/3.51 | 4/0.84 | 11/0.90 |
| EXT4_MD_raid6_-Estride=128_-Estripe_width=512 | 11 | 11/4.37 | 11/14.19 | 7/0.17 | 13/3.56 | 15/3.17 | 9/36.24 | 11/27.60 | 6/0.77 | 11/4.61 | 8/1.11 | 12/0.92 |
| EXT4_MD_raid6_-Estride=32_-Estripe_width=128 | 12 | 13/4.62 | 13/15.83 | 6/0.15 | 12/3.20 | 16/3.18 | 12/37.88 | 13/28.28 | 9/0.80 | 13/5.43 | 9/1.20 | 10/0.89 |
| EXT4_MD_raid6_-Estride=4_-Estripe_width=16 | 11 | 12/4.56 | 12/14.57 | 5/0.14 | 11/3.00 | 12/3.05 | 11/36.53 | 10/27.30 | 5/0.73 | 12/4.72 | 6/1.07 | 8/0.84 |
| EXT4_MD_raid6_-Estride=512_-Estripe_width=2048 | 14 | 14/5.09 | 15/16.87 | 12/0.21 | 14/3.83 | 14/3.16 | 14/38.74 | 14/29.01 | 10/0.98 | 14/5.50 | 10/1.33 | 13/0.94 |
| ReiserFS_LVM_MD_raid6 | 1 | 0/0.12 | 0/0.43 | 11/0.21 | 1/0.08 | 0/0.20 | 2/9.93 | 1/7.69 | 1/0.13 | 0/0.39 | 2/0.74 | 6/0.80 |
| ReiserFS_LVM_raid6_sda+sdb+sdc+sdd+sde+sdf | 1 | 2/0.12 | 2/0.48 | 4/0.13 | 0/0.07 | 1/0.20 | 0/9.67 | 0/7.65 | 0/0.11 | 1/0.39 | 0/0.53 | 5/0.75 |
| ReiserFS_MD_raid6 | 2 | 1/0.12 | 1/0.44 | 14/0.22 | 2/0.09 | 2/0.21 | 1/9.84 | 2/7.76 | 2/0.13 | 2/0.46 | 1/0.70 | 7/0.83 |
| XFS_LVM_MD_raid6 | 9 | 8/2.45 | 7/7.66 | 13/0.21 | 8/0.57 | 13/3.15 | 16/42.24 | 8/20.34 | 17/2.31 | 8/1.81 | 14/2.72 | 9/0.88 |
| XFS_MD_raid6 | 8 | 7/1.61 | 8/8.13 | 10/0.20 | 7/0.55 | 17/3.62 | 17/43.25 | 7/20.07 | 16/1.65 | 7/1.76 | 13/2.56 | 2/0.57 |
| ZFS_layout=raid10_ashift=12_compression=on_sync=standard | 17 | 17/15.97 | 17/24.75 | 15/0.27 | 17/14.48 | 10/2.41 | 15/40.41 | 17/38.07 | 15/1.23 | 17/25.08 | 17/3.44 | 15/9.94 |
| ZFS_layout=raid6_ashift=12_compression=on_sync=standard | 16 | 16/10.44 | 16/19.59 | 17/0.45 | 16/9.91 | 9/2.08 | 10/36.31 | 16/36.11 | 14/1.22 | 16/18.03 | 16/3.25 | 16/11.71 |
| ZFS_layout=raid6_compression=on_sync=standard | 15 | 15/9.96 | 14/16.24 | 16/0.42 | 15/7.83 | 7/1.69 | 13/38.64 | 15/30.65 | 13/1.18 | 15/13.87 | 15/2.97 | 17/14.04 |
Gross time per each file-system
Execution time per each command across file-systems and across runs
The same report for "warm" runs (re-running the same command, thus utilizing possible caching of meta-information and data):
Gross-rating/timing for each filesystem/per each command
| fs | rating | 00: du -scm testX | 01: tar -cf testX.tar testX | 03: chmod +w -R testX/.git/annex/objects | 07: du -scm testX.clone |
| --- | --- | --- | --- | --- | --- |
| BTRFS_LVM_MD_raid6 | 4.0 | 8/0.07 | 3/0.14 | 5/0.05 | 3/0.04 |
| BTRFS_LVM_raid6_sda+sdb+sdc+sdd+sde+sdf | 6.0 | 7/0.07 | 2/0.14 | 6/0.05 | 6/0.04 |
| BTRFS_MD_raid6 | 4.0 | 5/0.06 | 0/0.12 | 3/0.05 | 8/0.04 |
| BTRFS_sda_-mraid6 | 3.5 | 6/0.07 | 1/0.13 | 7/0.05 | 0/0.03 |
| EXT4_LVM_MD_raid6 | 11.5 | 9/0.08 | 11/0.31 | 12/0.08 | 14/0.05 |
| EXT4_MD_raid6 | 12.5 | 12/0.08 | 7/0.29 | 13/0.08 | 13/0.05 |
| EXT4_MD_raid6_-Estride=128_-Estripe_width=512 | 10.0 | 10/0.08 | 10/0.31 | 11/0.08 | 9/0.04 |
| EXT4_MD_raid6_-Estride=32_-Estripe_width=128 | 12.0 | 11/0.08 | 13/0.35 | 14/0.08 | 11/0.05 |
| EXT4_MD_raid6_-Estride=4_-Estripe_width=16 | 9.5 | 14/0.08 | 5/0.26 | 9/0.08 | 10/0.05 |
| EXT4_MD_raid6_-Estride=512_-Estripe_width=2048 | 12.0 | 13/0.08 | 12/0.34 | 10/0.08 | 12/0.05 |
| ReiserFS_LVM_MD_raid6 | 1.5 | 0/0.05 | 9/0.30 | 1/0.05 | 2/0.04 |
| ReiserFS_LVM_raid6_sda+sdb+sdc+sdd+sde+sdf | 2.0 | 3/0.06 | 4/0.17 | 0/0.04 | 1/0.03 |
| ReiserFS_MD_raid6 | 4.0 | 4/0.06 | 14/0.42 | 2/0.05 | 4/0.04 |
| XFS_LVM_MD_raid6 | 4.5 | 2/0.06 | 6/0.27 | 4/0.05 | 5/0.04 |
| XFS_MD_raid6 | 7.5 | 1/0.05 | 8/0.29 | 8/0.06 | 7/0.04 |
| ZFS_layout=raid10_ashift=12_compression=on_sync=standard | 17.0 | 17/11.71 | 17/21.08 | 17/11.09 | 17/0.28 |
| ZFS_layout=raid6_ashift=12_compression=on_sync=standard | 16.0 | 16/7.74 | 16/16.75 | 16/7.85 | 16/0.21 |
| ZFS_layout=raid6_compression=on_sync=standard | 15.0 | 15/7.36 | 15/13.83 | 15/7.03 | 15/0.18 |
Execution time per each command across file-systems and across runs
In both of the above reports (cold and warm) on small repositories we can see that ZFS performs quite poorly (and also becomes slower with higher FS utilization on many commands) but still on the same order as XFS and EXT4, while BTRFS and ReiserFS perform much more smoothly, with ReiserFS being the most efficient.
The following plots show all 100 rounds:
Got 19 reports
# of reports per each FS: [10, 100, 100, 100, 100, 100, 100, 100, 100, 10, 10, 100, 100, 100, 100, 100, 100, 100, 100]
Got 24450 entries total
Gross-rating/timing for each filesystem/per each command
| fs | rating | 00: du -scm testX | 01: tar -cf testX.tar testX | 02: pigz testX.tar | 03: chmod +w -R testX/.git/annex/objects | 04: git clone testX testX.clone | 05: git annex get . | 06: git annex drop . | 07: du -scm testX.clone | 08: rm -rf testX | 09: rm -rf testX.clone | 10: tar -xzf testX.tar.gz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BTRFS_LVM_MD_raid6 | 6 | 5/0.84 | 6/0.95 | 12/0.19 | 6/0.45 | 6/0.69 | 4/10.99 | 5/8.87 | 13/1.19 | 6/1.13 | 12/1.51 | 3/0.58 |
| BTRFS_LVM_raid6_sda+sdb+sdc+sdd+sde+sdf | 4 | 4/0.79 | 4/0.76 | 2/0.12 | 3/0.27 | 4/0.57 | 3/10.45 | 3/8.59 | 10/0.92 | 4/0.94 | 10/1.23 | 0/0.54 |
| BTRFS_MD_raid6 | 5 | 6/0.84 | 5/0.91 | 11/0.18 | 5/0.44 | 5/0.69 | 5/11.00 | 4/8.84 | 11/1.18 | 5/1.09 | 11/1.50 | 2/0.58 |
| BTRFS_sda_-mraid6 | 4 | 3/0.74 | 3/0.73 | 4/0.14 | 4/0.28 | 3/0.52 | 6/11.55 | 6/10.84 | 9/0.82 | 3/0.91 | 9/1.18 | 1/0.55 |
| EXT4_LVM_MD_raid6 | 10 | 10/3.64 | 10/11.99 | 5/0.14 | 10/2.26 | 12/2.80 | 8/33.39 | 10/27.15 | 4/0.65 | 10/3.89 | 3/0.84 | 11/0.94 |
| EXT4_MD_raid6 | 8 | 9/3.27 | 9/10.35 | 1/0.12 | 9/2.05 | 8/1.94 | 7/32.69 | 9/25.78 | 3/0.62 | 9/3.51 | 4/0.85 | 8/0.85 |
| EXT4_MD_raid6_-Estride=128_-Estripe_width=512 | 11 | 11/3.66 | 11/12.28 | 3/0.14 | 13/2.91 | 11/2.71 | 9/35.04 | 11/27.37 | 5/0.66 | 11/3.97 | 5/0.94 | 9/0.87 |
| EXT4_MD_raid6_-Estride=32_-Estripe_width=128 | 14 | 14/4.66 | 14/15.62 | 10/0.15 | 14/3.28 | 15/3.09 | 13/37.86 | 14/28.71 | 8/0.80 | 14/5.24 | 8/1.16 | 12/0.96 |
| EXT4_MD_raid6_-Estride=4_-Estripe_width=16 | 12 | 13/4.40 | 13/14.64 | 6/0.15 | 12/2.84 | 14/3.06 | 12/37.51 | 13/28.21 | 7/0.76 | 13/4.76 | 7/1.11 | 10/0.92 |
| EXT4_MD_raid6_-Estride=512_-Estripe_width=2048 | 11 | 12/3.75 | 12/12.64 | 9/0.15 | 11/2.45 | 13/2.81 | 10/35.94 | 12/27.76 | 6/0.70 | 12/4.17 | 6/0.97 | 7/0.84 |
| ReiserFS_LVM_MD_raid6 | 2 | 0/0.13 | 2/0.47 | 7/0.15 | 2/0.09 | 1/0.23 | 2/9.96 | 1/7.73 | 2/0.14 | 0/0.42 | 2/0.78 | 5/0.65 |
| ReiserFS_LVM_raid6_sda+sdb+sdc+sdd+sde+sdf | 0 | 2/0.13 | 0/0.43 | 0/0.10 | 0/0.09 | 0/0.22 | 0/9.77 | 0/7.60 | 0/0.13 | 1/0.43 | 0/0.58 | 4/0.62 |
| ReiserFS_MD_raid6 | 1 | 1/0.13 | 1/0.44 | 8/0.15 | 1/0.09 | 2/0.23 | 1/9.94 | 2/7.74 | 1/0.14 | 2/0.43 | 1/0.74 | 6/0.65 |
| XFS_LVM_MD_raid6 | 13 | 7/2.37 | 7/8.02 | 13/0.26 | 8/0.61 | 16/3.93 | 16/48.10 | 8/22.02 | 17/2.29 | 7/2.21 | 17/3.90 | 14/1.25 |
| XFS_MD_raid6 | 13 | 8/2.44 | 8/8.03 | 14/0.26 | 7/0.60 | 17/3.93 | 17/48.71 | 7/21.86 | 16/2.27 | 8/2.30 | 16/3.84 | 13/1.15 |
| ZFS_layout=raid10_ashift=12_compression=on_sync=standard | 15 | 17/15.97 | 17/24.75 | 15/0.27 | 17/14.48 | 10/2.41 | 15/40.41 | 17/38.07 | 15/1.23 | 17/25.08 | 15/3.44 | 15/9.94 |
| ZFS_layout=raid6_ashift=12_compression=on_sync=standard | 16 | 16/10.44 | 16/19.59 | 17/0.45 | 16/9.91 | 9/2.08 | 11/36.31 | 16/36.11 | 14/1.22 | 16/18.03 | 14/3.25 | 16/11.71 |
| ZFS_layout=raid6_compression=on_sync=standard | 15 | 15/9.96 | 15/16.24 | 16/0.42 | 15/7.83 | 7/1.69 | 14/38.64 | 15/30.65 | 12/1.18 | 15/13.87 | 13/2.97 | 17/14.04 |
Execution time per each command across file-systems and across runs