Low-latency Key-Value Store for SSD


Problem Description

We are working on an SSD-backed key-value solution with the following properties:

  • Throughput: 10,000 TPS; 50/50 puts/gets
  • Latency: 1 ms average, 10 ms at the 99.9th percentile
  • Data volume: ~1 billion values, ~150 bytes each; 64-bit keys; random access; 20% of data fits in RAM (a sizing check follows below)
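
A quick back-of-the-envelope check of these figures. The arithmetic uses only the numbers from the list above; the uniform-random-access assumption is ours:

```python
# Back-of-the-envelope sizing from the workload figures above.
values = 1_000_000_000            # ~1 billion values
value_bytes = 150                 # ~150 bytes each
key_bytes = 8                     # 64-bit keys

data_gb = values * (key_bytes + value_bytes) / 1e9
ram_gb = 0.20 * data_gb           # 20% of the data fits in RAM

# 10,000 TPS at 50/50 puts/gets; assuming uniform random access,
# ~80% of gets miss the RAM-resident portion and hit the SSD.
tps = 10_000
ssd_reads_per_s = 0.5 * tps * 0.80
ssd_writes_per_s = 0.5 * tps

print(f"dataset: ~{data_gb:.0f} GB, RAM working set: ~{ram_gb:.0f} GB")
print(f"SSD load: ~{ssd_reads_per_s:.0f} random reads/s + ~{ssd_writes_per_s:.0f} writes/s")
```

On paper those rates are comfortably within a commodity SSD's IOPS budget, which is what makes the behavior below surprising.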

We tried KyotoCabinet, LevelDB, and RethinkDB on commodity SSDs with different Linux IO schedulers and ext3/xfs file systems; ran a number of tests using Rebench (a minimal version of such a mixed-workload test is sketched below); and found that in all cases:

  • Read-only throughput/latency are very good
  • Write/update-only throughput is moderate, but there are many high-latency outliers
  • Mixed read/write workloads cause catastrophic oscillations in throughput/latency, even with direct access to the block device (bypassing the file system)

The picture below illustrates this behavior for KyotoCabinet (the horizontal axis is time; three periods are clearly visible: read-only, mixed, update-only).

[figure: KyotoCabinet throughput/latency over time across the read-only, mixed, and update-only periods]

The question is: is it possible to achieve low latency for the described SLAs using SSDs, and which key-value stores are recommended?
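
For reference, a minimal sketch of the kind of mixed 50/50 test described above, assuming the plyvel LevelDB bindings; the database path, key count, and operation count are placeholders to be adapted to the SSD under test:

```python
# Minimal mixed 50/50 put/get microbenchmark, in the spirit of the
# tests described above. Assumes plyvel (pip install plyvel); the
# path and counts below are placeholders.
import os
import random
import struct
import time

import plyvel

DB_PATH = "/mnt/ssd/bench_db"   # assumed mount point of the SSD under test
N_KEYS = 1_000_000              # scaled down from the 1-billion-value target
VALUE = os.urandom(150)         # ~150-byte values, per the workload above
N_OPS = 100_000

db = plyvel.DB(DB_PATH, create_if_missing=True)
latencies = []

for _ in range(N_OPS):
    key = struct.pack("<Q", random.randrange(N_KEYS))   # 64-bit keys
    start = time.perf_counter()
    if random.random() < 0.5:
        db.put(key, VALUE)      # 50% puts
    else:
        db.get(key)             # 50% gets
    latencies.append(time.perf_counter() - start)

latencies.sort()
print("avg   : %.3f ms" % (1000 * sum(latencies) / len(latencies)))
print("p99.9 : %.3f ms" % (1000 * latencies[int(0.999 * len(latencies))]))
```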

Recommended Answer

Highly variable write latency is a common attribute of SSDs (especially consumer models). There is a pretty good explanation of why in this AnandTech review.

The summary is that SSD write performance worsens over time as the wear-leveling overhead increases. As the number of free pages on the drive decreases, the NAND controller must start defragmenting pages, which adds latency. The controller must also maintain an LBA-to-block map to track the random distribution of data across the various NAND blocks; as this map grows, operations on the map (inserts, deletions) get slower.
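
To make the map-growth argument concrete, here is a toy model of an LBA-to-physical map; it is purely illustrative and not based on any real controller. It shows how random small writes inflate the mapping structure that sequential writes keep compact:

```python
# Toy model of an FTL's LBA-to-physical map. Purely illustrative:
# real controllers are far more sophisticated. It shows that random
# writes fragment the map while sequential writes keep it compact.
import random

def count_extents(write_order, total_pages):
    """Replay writes with log-style allocation (each write goes to the
    next physical page), then count extents: runs of consecutive LBAs
    that landed on consecutive physical pages. Extents are a rough
    proxy for the size of the controller's mapping table."""
    physical_of = {}
    next_phys = 0
    for lba in write_order:
        physical_of[lba] = next_phys
        next_phys += 1
    extents = 0
    prev_phys = None
    for lba in range(total_pages):
        phys = physical_of[lba]
        if prev_phys is None or phys != prev_phys + 1:
            extents += 1        # break in the run: a new map entry is needed
        prev_phys = phys
    return extents

PAGES = 100_000
sequential = list(range(PAGES))
shuffled = sequential[:]
random.shuffle(shuffled)

print("map entries after sequential writes:", count_extents(sequential, PAGES))
print("map entries after random writes:    ", count_extents(shuffled, PAGES))
# Sequential writes collapse to a single extent; random writes need
# roughly one entry per page, so operations on the map get slower.
```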

You aren't going to solve a low-level HW issue with a SW approach; you will need to either move up to an enterprise-level SSD or relax your latency requirements.

Other Recommended Answer

Aerospike is a newer key/value (row) store that can run completely off of SSDs with < 1 ms read/write latency and very high TPS (reaching into the millions).

SSDs have great random-read performance, but the key to reducing write variance is sequential IO (as with regular hard disks). Sequential IO also greatly reduces the wear leveling and performance fade that can occur with lots of writes on SSDs.

If you're building your own key-value system, use a log-structured approach (like Aerospike) so that writes are batched and appended in large chunks. An in-memory index can maintain the data locations for the values, while a background process cleans stale/deleted data from disk and defragments files.
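
The sketch below illustrates this design in miniature; the names and on-disk layout are hypothetical, and a real store would add write batching, checksums, crash recovery, and compaction scheduling on top:

```python
# A minimal, hypothetical log-structured key-value store in the spirit
# of the approach described above (not Aerospike's actual design).
import os
import struct

class LogKV:
    """Append-only data file plus an in-memory index of
    key -> (value offset, value length)."""

    HEADER = struct.Struct("<QI")   # 64-bit key, 32-bit value length

    def __init__(self, path):
        self.path = path
        self.index = {}
        self.log = open(path, "ab+")
        self._rebuild_index()

    def _rebuild_index(self):
        """Replay the log on startup; the last record for a key wins."""
        self.log.seek(0)
        while True:
            header = self.log.read(self.HEADER.size)
            if len(header) < self.HEADER.size:
                break
            key, vlen = self.HEADER.unpack(header)
            self.index[key] = (self.log.tell(), vlen)
            self.log.seek(vlen, os.SEEK_CUR)

    def put(self, key, value):
        """Writes are sequential appends; nothing is updated in place."""
        self.log.seek(0, os.SEEK_END)
        self.log.write(self.HEADER.pack(key, len(value)))
        offset = self.log.tell()
        self.log.write(value)
        self.index[key] = (offset, len(value))

    def get(self, key):
        """One random read, located via the in-memory index."""
        entry = self.index.get(key)
        if entry is None:
            return None
        offset, vlen = entry
        self.log.seek(offset)
        return self.log.read(vlen)

    def compact(self):
        """Background-style cleanup: rewrite only the live records,
        dropping stale versions, then swap the new log in."""
        tmp_path = self.path + ".compact"
        new_index = {}
        with open(tmp_path, "wb") as tmp:
            for key, (offset, vlen) in self.index.items():
                self.log.seek(offset)
                value = self.log.read(vlen)
                tmp.write(self.HEADER.pack(key, vlen))
                new_index[key] = (tmp.tell(), vlen)
                tmp.write(value)
        self.log.close()
        os.replace(tmp_path, self.path)
        self.index = new_index
        self.log = open(self.path, "ab+")

db = LogKV("/tmp/logkv.dat")
db.put(42, b"x" * 150)
assert db.get(42) == b"x" * 150
db.compact()
assert db.get(42) == b"x" * 150
```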

Other Recommended Answer

This is kind of a harebrained idea, but it MIGHT work. Let's assume that your SSD is 128GB.

  1. Create a 128GB swap partition on the SSD
  2. Configure your machine to use it as swap
  3. Set up memcached on the machine with a 128GB memory limit
  4. Benchmark (a sketch follows below)
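
A hedged sketch of step 4, assuming the pymemcache client; the host, port, and counts are placeholders, and the key count would need to be large enough that memcached's resident set exceeds physical RAM:

```python
# Latency check against the memcached-on-swap setup above. Assumes
# pymemcache (pip install pymemcache); host, port, and counts are
# placeholders. Random gets force the kernel to page data back in
# from the SSD-backed swap.
import os
import random
import time

from pymemcache.client.base import Client

client = Client(("localhost", 11211))
N_KEYS = 1_000_000              # scale up until memcached exceeds physical RAM
VALUE = os.urandom(150)         # ~150-byte values, per the original workload

for i in range(N_KEYS):         # preload; later keys push earlier pages to swap
    client.set(str(i), VALUE)

latencies = []
for _ in range(100_000):
    start = time.perf_counter()
    client.get(str(random.randrange(N_KEYS)))
    latencies.append(time.perf_counter() - start)

latencies.sort()
print("get p99.9: %.3f ms" % (1000 * latencies[int(0.999 * len(latencies))]))
```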

Will the kernel be able to page stuff in and out fast enough? No way to know. That depends more on your hardware than the kernel.

Poul-Henning Kamp does something very similar to this in Varnish by making the kernel keep track of things (virtual vs physical memory) for Varnish rather than making Varnish do it. https://www.varnish-cache.org/trac/wiki/ArchitectNotes