3x Cubietruck SoC A20 @ 1GHz, 2GiB DDR3 @ 480MHz, NAND 8GB
10Gbit Switch 8 port
3 x 2.5″ SATA 5400RPM
CubianX on class10 SD
Linux node1 3.4.79-sun7i #14 SMP PREEMPT Thu Jul 3 06:39:51 CST 2014 armv7l GNU/Linux
Ceph Version ceph version 0.80.11 (8424145d49264624a3b0a204aedb127835161070) FIREFLY
3 OSDs on EXT4, journal on same spinning disk.
<< PREFACE >>
All tests are executed 3 times to reduce the risk of distortion.
<< PREFACE >>
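The three-run rule can be scripted so that every benchmark is repeated the same way. A minimal sketch; run3 is a hypothetical helper name and not part of the setup described here:

```shell
# run3 CMD [ARGS...]  (hypothetical helper)
# Run the given command three times and label each run so the
# results can be compared afterwards.
run3() {
    for run in 1 2 3; do
        echo "== run $run: $* =="
        "$@"
    done
}

# Example (commented out, needs an iperf server listening on node1):
# run3 iperf -c node1
```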
PERFORMANCE BEFORE TWEAK
Server side:
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.0.106 port 5001 connected with 192.168.0.228 port 47900
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 616 MBytes 517 Mbits/sec

Client side:
------------------------------------------------------------
Client connecting to node1, TCP port 5001
TCP window size: 58.7 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.228 port 47900 connected with 192.168.0.106 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 616 MBytes 517 Mbits/sec
iperf -s iperf -c host
0.0-10.0 sec 609 MBytes 511 Mbits/sec
0.0-10.0 sec 609 MBytes 511 Mbits/sec
iostat 5
dstat -clnv --fs --vm --top-bio --top-cpu --top-io
sudo dd if=/dev/urandom of=randwrite bs=4M count=400 iflag=fullblock
Observing 100% CPU load: generating random numbers appears to be CPU-heavy.
Also observing that there is no constant writing to disk: generating the random numbers is slower than the disk can write.
276824064 bytes (277 MB) copied, 159.145 s, 1.7 MB/s
A better approach may be to generate the random data first and then dd it to disk.
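Separating the two steps could look like the sketch below: the random data is generated once into a staging file, and only the second dd is the actual write test. stage_and_write and its arguments are hypothetical names. The staging file has to live somewhere other than the disk under test (for example a tmpfs such as /dev/shm), which on a 2GiB board limits how much data can be staged. GNU dd is assumed.

```shell
# stage_and_write SRC DST MIB  (hypothetical helper)
# 1) fill SRC with MIB mebibytes of random data (CPU-bound, not measured)
# 2) copy SRC to DST and print dd's throughput summary line
stage_and_write() {
    src=$1
    dst=$2
    mib=$3
    dd if=/dev/urandom of="$src" bs=1M count="$mib" iflag=fullblock 2>/dev/null || return 1
    # conv=fdatasync makes dd flush the data to disk before it reports its timing
    dd if="$src" of="$dst" bs=1M conv=fdatasync 2>&1 | tail -n 1
}

# Example (commented out): stage 256 MiB in RAM, then write it to the OSD disk
# stage_and_write /dev/shm/rand.bin /mnt/osd1/randwrite 256
```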
sudo dd if=/dev/zero of=here bs=500M count=1 oflag=direct
dd appears to buffer the whole block in RAM first.
The block size should be twice as big:
ceph@node1:/mnt/osd1$ sudo dd if=/dev/zero of=here bs=1G count=1 oflag=direct
dd: memory exhausted by input buffer of size 1073741824 bytes (1.0 GiB)
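dd allocates a single input buffer of bs bytes, which is what the error message above complains about. The amount written is bs × count, so the same 1GiB can be requested with a small block size and a larger count. A sketch of the arithmetic; the 4MiB block size is an assumption, picked to match the earlier urandom test:

```shell
total_mib=1024   # amount to write: 1 GiB
bs_mib=4         # block size per write (assumption)
count=$((total_mib / bs_mib))

# Print the resulting command instead of running it
echo "sudo dd if=/dev/zero of=here bs=${bs_mib}M count=${count} oflag=direct"
```

This prints a dd command with bs=4M count=256, which needs only a 4MiB buffer.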
dd if=/mounted/sda1/randwrite of=/dev/null
dd if=/mounted/sda1/seqwrite of=/dev/null
Although bandwidth and I/O should not be a problem, the architecture of the A20 influences the overall performance:
CPU clock speeds always directly influence I/O and network bandwidth
High CPU load (during rebalance, scrub, write) constrains all other components.
Not to be confused with shared bandwidth:
On A10/A20 devices, Ethernet, SATA and the 3 USB ports are all connected directly to the SoC and do not have to share bandwidth with each other (although on some devices this restriction does apply to onboard Wi-Fi connected via USB). This is a real advantage compared to many other ARM devices where 2 or more of these ports sit behind a single USB connection. Compare with every model of Raspberry Pi, for example: only one single connection exists between the SoC and an internal USB hub with integrated Ethernet controller (LAN9512/LAN9514). All expansion ports share the bandwidth of this single USB 2.0 connection.
** BENCHMARKING DONE RIGHT **
Tools/methods that definitely lead to wrong results/assumptions:
dd with small file sizes and without appropriate flags (testing mainly buffers/caches and therefore RAM, not disk)
time cp $src $dst (same problem as with dd above)
hdparm -tT (mixing up disk throughput with memory bandwidth)
wget/curl downloads from somewhere (many random unrelated influences on both sides of the network connection and in between)
scp/SFTP (random results based on the SSH ciphers negotiated between client and server)
Use IOzone, bonnie++, IOmeter and the like with large file sizes (at least twice as much as the RAM available), and test random I/O as well as sequential transfers. Use network tools that don’t mix network throughput with other factors such as disk performance on one or both sides of the connection (iperf, netperf and the like). When you’re done testing individually, always do a combined test using both synthetic benchmarks and real-world tasks (eg. copying a couple of small files and afterwards a really large file between the client and the sunxi NAS). Always test both directions, and keep in mind that copying many small files over the network is always slower than one big file, which in turn might be slower than a few large files transferred in a batch.
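The "at least twice as much as RAM" rule can be derived from /proc/meminfo. A sketch; min_test_mib is a hypothetical helper, and the iozone options in the last line (-e, -I, -a, -s, -r, -i) are one plausible choice, printed rather than executed so they can be checked against the local iozone version first:

```shell
# min_test_mib [MEMINFO]  (hypothetical helper)
# Print twice the installed RAM in MiB, based on the MemTotal line (in kB).
min_test_mib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 * 2 / 1024 }' "${1:-/proc/meminfo}"
}

size_mib=$(min_test_mib)
# -e: include flush in timing, -I: direct I/O, -a: auto mode
# -i 0/1/2: write, read, random read/write
echo "iozone -e -I -a -s ${size_mib}m -r 4k -r 1024k -i 0 -i 1 -i 2"
```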
PERFORMANCE TWEAKS ON BOARD
/tmp & /log = RAM, ramlog app saves logs to disk daily and on shut-down (Wheezy and Jessie w/o systemd)
automatic IO scheduler. (check /etc/init.d/armhwinfo)
journal data writeback enabled. (/etc/fstab)
commit=600 to flush data to the disk every 10 minutes (/etc/fstab)
optimized CPU frequency scaling 480-1010MHz (392-996MHz @Freescale, 600-2000MHz @Exynos) with interactive governor (/etc/init.d/cpufrequtils)
eth0 interrupts are using dedicated core (Allwinner based boards)
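The two /etc/fstab tweaks amount to adding data=writeback and commit=600 to the ext4 mount options. A sketch, assuming the OSD disk is /dev/sda1 mounted at /mnt/osd1 (both are illustrative); mount_opts is a hypothetical helper that shows which options are actually active after a remount or reboot:

```shell
# Illustrative /etc/fstab entry (device and mount point are assumptions):
#   /dev/sda1  /mnt/osd1  ext4  defaults,data=writeback,commit=600  0  2

# mount_opts MOUNTPOINT [MOUNTS_FILE]  (hypothetical helper)
# Print the option string of MOUNTPOINT from a mounts table.
mount_opts() {
    awk -v mp="$1" '$2 == mp { print $4 }' "${2:-/proc/mounts}"
}

# Example (commented out): expect data=writeback and commit=600 in the output
# mount_opts /mnt/osd1
```

Both options trade safety for speed: with commit=600, up to 10 minutes of writes can be lost on a power failure.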
Adjust CPU frequency scaling settings accordingly (eg. always use ondemand together with io_is_busy)
get available cpufreqs with:
show the current cpufreq with:
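A sketch of what these two lookups could look like, assuming the standard cpufreq sysfs files are present. scaling_available_frequencies is not exposed by every cpufreq driver, and the function names are hypothetical:

```shell
# Base directory of the cpufreq policy for cpu0 (standard sysfs location)
CPUFREQ=${CPUFREQ:-/sys/devices/system/cpu/cpu0/cpufreq}

# Frequencies (kHz) offered by the driver
available_freqs() {
    cat "$CPUFREQ/scaling_available_frequencies"
}

# Frequency (kHz) the CPU is currently running at; reading it may need root
current_freq() {
    cat "$CPUFREQ/cpuinfo_cur_freq"
}
```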
Using ‘performance’ as governor will only marginally increase the power consumption
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Of the available governors, the most interesting seem to be ondemand and interactive, since they switch cpufreq settings dynamically based on load. The interactive governor is only available in the sunxi-3.4 kernel. It is a simple, pragmatic enhancement of the ondemand governor, tailored for Android and shipped on millions of devices. It was submitted to the mainline kernel but never accepted, because the ondemand governor itself is considered broken by design (the “waking up to decide whether the CPU is idle” concept) and polishing a turd was not welcome. So the default with the mainline kernel is now ondemand (you can change that by writing to scaling_governor), and the lower/upper limits are 144000 and 960000 on sun7i (144MHz-960MHz). Since the default lower limit might be responsible for a laggy system (it takes quite a bit of time for the governor to realize that the frequency needs to be increased), it’s advisable to define the lower limit yourself by adjusting /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq, eg.
echo 408000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
!! You can also lower the maximum value by adjusting scaling_max_freq. Please be aware that adjusting these values works in 48MHz steps and is not limited to the few operating-points/dvfs_table entries. You can use any value between min and max, but keep in mind that the values supplied will be rounded down: you will end up with just 864MHz if you set scaling_max_freq to 911999 instead of 912000. Always compare with cpuinfo_cur_freq if in doubt. !!
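The rounding described in the note on scaling_max_freq can be reproduced with integer arithmetic. This is a simplified model that only assumes a fixed 48MHz grid, as that note states; it is not a description of the driver's actual frequency table, so cpuinfo_cur_freq remains the reference. round_cpufreq is a hypothetical helper:

```shell
# round_cpufreq KHZ [STEP_KHZ]  (hypothetical helper)
# Round a requested frequency down to the nearest multiple of the step size.
round_cpufreq() {
    step=${2:-48000}
    echo $(( $1 / step * step ))
}

round_cpufreq 911999   # prints 864000
round_cpufreq 912000   # prints 912000
```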
Assign eth0 IRQs to cpu1 since irqbalancing neither works on sunxi/ARM nor increases performance when used with network interrupts
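Pinning the eth0 interrupt could be sketched as follows. irq_of is a hypothetical helper that looks up the IRQ number in /proc/interrupts; the write to smp_affinity is left commented out because it needs root:

```shell
# irq_of NAME [INTERRUPTS_FILE]  (hypothetical helper)
# Print the IRQ number(s) whose /proc/interrupts line mentions NAME.
irq_of() {
    awk -F: -v name="$1" '$0 ~ name { gsub(/ /, "", $1); print $1 }' "${2:-/proc/interrupts}"
}

# Example (commented out, needs root): smp_affinity takes a hex CPU bitmask,
# so 2 (binary 10) selects cpu1.
# for irq in $(irq_of eth0); do echo 2 > "/proc/irq/$irq/smp_affinity"; done
```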
Check out the different I/O schedulers (they can be set and read out via /sys/block/sda/queue/scheduler). On sunxi, deadline seems to perform best.
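Reading and switching the scheduler might look like this. The kernel lists all schedulers in that file and marks the active one with brackets; active_scheduler is a hypothetical helper that extracts it:

```shell
# active_scheduler [FILE]  (hypothetical helper)
# Print only the bracketed (active) name from a line such as
# "noop [deadline] cfq".
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "${1:-/sys/block/sda/queue/scheduler}"
}

# Example (commented out, needs root): switch sda to deadline, then verify
# echo deadline > /sys/block/sda/queue/scheduler
# active_scheduler
```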
Do some TCP/IP stack tuning to adjust parameters for GBit Ethernet (especially increasing buffer sizes and queue lengths)
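How large the buffers need to be follows from the bandwidth-delay product: a sender can only keep the link full if at least bandwidth × round-trip time bytes are in flight. A sketch of that arithmetic; bdp_bytes is a hypothetical helper. The sysctl keys printed at the end are standard Linux ones, but the 4MiB ceiling is an assumption and not a value measured on this cluster:

```shell
# bdp_bytes MBIT_PER_S RTT_US  (hypothetical helper)
# bandwidth [bit/s] * rtt [s] / 8 = bytes in flight; with Mbit/s and
# microseconds as units the factors of 10^6 cancel out.
bdp_bytes() {
    echo $(( $1 * $2 / 8 ))
}

bdp_bytes 1000 1000   # 1 Gbit/s at 1 ms RTT

# Candidate settings, printed instead of applied (values are assumptions)
max=4194304
echo "sysctl -w net.core.rmem_max=$max"
echo "sysctl -w net.core.wmem_max=$max"
echo "sysctl -w net.ipv4.tcp_rmem='4096 87380 $max'"
echo "sysctl -w net.ipv4.tcp_wmem='4096 65536 $max'"
```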
When using a mainline kernel, consider using a modern filesystem like btrfs with transparent file compression
PERFORMANCE TWEAKS ON CEPH
PERFORMANCE AFTER TWEAK
iostat 5
htop
dstat -clnv --fs --vm --top-bio --top-cpu --top-io
SATA throughput is unbalanced for unknown reasons: with appropriate cpufreq settings it’s possible to get sequential read speeds of over 200 MB/s, while write speeds remain at approx. 45 MB/s. This might be caused by buggy silicon or driver problems.