286 lines
9.7 KiB
Plaintext
286 lines
9.7 KiB
Plaintext
|
Introduction
|
||
|
============
|
||
|
|
||
|
This document descibes a collection of device-mapper targets that
|
||
|
between them implement thin-provisioning and snapshots.
|
||
|
|
||
|
The main highlight of this implementation, compared to the previous
|
||
|
implementation of snapshots, is that it allows many virtual devices to
|
||
|
be stored on the same data volume. This simplifies administration and
|
||
|
allows the sharing of data between volumes, thus reducing disk usage.
|
||
|
|
||
|
Another significant feature is support for an arbitrary depth of
|
||
|
recursive snapshots (snapshots of snapshots of snapshots ...). The
|
||
|
previous implementation of snapshots did this by chaining together
|
||
|
lookup tables, and so performance was O(depth). This new
|
||
|
implementation uses a single data structure to avoid this degradation
|
||
|
with depth. Fragmentation may still be an issue, however, in some
|
||
|
scenarios.
|
||
|
|
||
|
Metadata is stored on a separate device from data, giving the
|
||
|
administrator some freedom, for example to:
|
||
|
|
||
|
- Improve metadata resilience by storing metadata on a mirrored volume
|
||
|
but data on a non-mirrored one.
|
||
|
|
||
|
- Improve performance by storing the metadata on SSD.
|
||
|
|
||
|
Status
|
||
|
======
|
||
|
|
||
|
These targets are very much still in the EXPERIMENTAL state. Please
|
||
|
do not yet rely on them in production. But do experiment and offer us
|
||
|
feedback. Different use cases will have different performance
|
||
|
characteristics, for example due to fragmentation of the data volume.
|
||
|
|
||
|
If you find this software is not performing as expected please mail
|
||
|
dm-devel@redhat.com with details and we'll try our best to improve
|
||
|
things for you.
|
||
|
|
||
|
Userspace tools for checking and repairing the metadata are under
|
||
|
development.
|
||
|
|
||
|
Cookbook
|
||
|
========
|
||
|
|
||
|
This section describes some quick recipes for using thin provisioning.
|
||
|
They use the dmsetup program to control the device-mapper driver
|
||
|
directly. End users will be advised to use a higher-level volume
|
||
|
manager such as LVM2 once support has been added.
|
||
|
|
||
|
Pool device
|
||
|
-----------
|
||
|
|
||
|
The pool device ties together the metadata volume and the data volume.
|
||
|
It maps I/O linearly to the data volume and updates the metadata via
|
||
|
two mechanisms:
|
||
|
|
||
|
- Function calls from the thin targets
|
||
|
|
||
|
- Device-mapper 'messages' from userspace which control the creation of new
|
||
|
virtual devices amongst other things.
|
||
|
|
||
|
Setting up a fresh pool device
|
||
|
------------------------------
|
||
|
|
||
|
Setting up a pool device requires a valid metadata device, and a
|
||
|
data device. If you do not have an existing metadata device you can
|
||
|
make one by zeroing the first 4k to indicate empty metadata.
|
||
|
|
||
|
dd if=/dev/zero of=$metadata_dev bs=4096 count=1
|
||
|
|
||
|
The amount of metadata you need will vary according to how many blocks
|
||
|
are shared between thin devices (i.e. through snapshots). If you have
|
||
|
less sharing than average you'll need a larger-than-average metadata device.
|
||
|
|
||
|
As a guide, we suggest you calculate the number of bytes to use in the
|
||
|
metadata device as 48 * $data_dev_size / $data_block_size but round it up
|
||
|
to 2MB if the answer is smaller. The largest size supported is 16GB.
|
||
|
|
||
|
If you're creating large numbers of snapshots which are recording large
|
||
|
amounts of change, you may need find you need to increase this.
|
||
|
|
||
|
Reloading a pool table
|
||
|
----------------------
|
||
|
|
||
|
You may reload a pool's table, indeed this is how the pool is resized
|
||
|
if it runs out of space. (N.B. While specifying a different metadata
|
||
|
device when reloading is not forbidden at the moment, things will go
|
||
|
wrong if it does not route I/O to exactly the same on-disk location as
|
||
|
previously.)
|
||
|
|
||
|
Using an existing pool device
|
||
|
-----------------------------
|
||
|
|
||
|
dmsetup create pool \
|
||
|
--table "0 20971520 thin-pool $metadata_dev $data_dev \
|
||
|
$data_block_size $low_water_mark"
|
||
|
|
||
|
$data_block_size gives the smallest unit of disk space that can be
|
||
|
allocated at a time expressed in units of 512-byte sectors. People
|
||
|
primarily interested in thin provisioning may want to use a value such
|
||
|
as 1024 (512KB). People doing lots of snapshotting may want a smaller value
|
||
|
such as 128 (64KB). If you are not zeroing newly-allocated data,
|
||
|
a larger $data_block_size in the region of 256000 (128MB) is suggested.
|
||
|
$data_block_size must be the same for the lifetime of the
|
||
|
metadata device.
|
||
|
|
||
|
$low_water_mark is expressed in blocks of size $data_block_size. If
|
||
|
free space on the data device drops below this level then a dm event
|
||
|
will be triggered which a userspace daemon should catch allowing it to
|
||
|
extend the pool device. Only one such event will be sent.
|
||
|
Resuming a device with a new table itself triggers an event so the
|
||
|
userspace daemon can use this to detect a situation where a new table
|
||
|
already exceeds the threshold.
|
||
|
|
||
|
Thin provisioning
|
||
|
-----------------
|
||
|
|
||
|
i) Creating a new thinly-provisioned volume.
|
||
|
|
||
|
To create a new thinly- provisioned volume you must send a message to an
|
||
|
active pool device, /dev/mapper/pool in this example.
|
||
|
|
||
|
dmsetup message /dev/mapper/pool 0 "create_thin 0"
|
||
|
|
||
|
Here '0' is an identifier for the volume, a 24-bit number. It's up
|
||
|
to the caller to allocate and manage these identifiers. If the
|
||
|
identifier is already in use, the message will fail with -EEXIST.
|
||
|
|
||
|
ii) Using a thinly-provisioned volume.
|
||
|
|
||
|
Thinly-provisioned volumes are activated using the 'thin' target:
|
||
|
|
||
|
dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
|
||
|
|
||
|
The last parameter is the identifier for the thinp device.
|
||
|
|
||
|
Internal snapshots
|
||
|
------------------
|
||
|
|
||
|
i) Creating an internal snapshot.
|
||
|
|
||
|
Snapshots are created with another message to the pool.
|
||
|
|
||
|
N.B. If the origin device that you wish to snapshot is active, you
|
||
|
must suspend it before creating the snapshot to avoid corruption.
|
||
|
This is NOT enforced at the moment, so please be careful!
|
||
|
|
||
|
dmsetup suspend /dev/mapper/thin
|
||
|
dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
|
||
|
dmsetup resume /dev/mapper/thin
|
||
|
|
||
|
Here '1' is the identifier for the volume, a 24-bit number. '0' is the
|
||
|
identifier for the origin device.
|
||
|
|
||
|
ii) Using an internal snapshot.
|
||
|
|
||
|
Once created, the user doesn't have to worry about any connection
|
||
|
between the origin and the snapshot. Indeed the snapshot is no
|
||
|
different from any other thinly-provisioned device and can be
|
||
|
snapshotted itself via the same method. It's perfectly legal to
|
||
|
have only one of them active, and there's no ordering requirement on
|
||
|
activating or removing them both. (This differs from conventional
|
||
|
device-mapper snapshots.)
|
||
|
|
||
|
Activate it exactly the same way as any other thinly-provisioned volume:
|
||
|
|
||
|
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
|
||
|
|
||
|
Deactivation
|
||
|
------------
|
||
|
|
||
|
All devices using a pool must be deactivated before the pool itself
|
||
|
can be.
|
||
|
|
||
|
dmsetup remove thin
|
||
|
dmsetup remove snap
|
||
|
dmsetup remove pool
|
||
|
|
||
|
Reference
|
||
|
=========
|
||
|
|
||
|
'thin-pool' target
|
||
|
------------------
|
||
|
|
||
|
i) Constructor
|
||
|
|
||
|
thin-pool <metadata dev> <data dev> <data block size (sectors)> \
|
||
|
<low water mark (blocks)> [<number of feature args> [<arg>]*]
|
||
|
|
||
|
Optional feature arguments:
|
||
|
- 'skip_block_zeroing': skips the zeroing of newly-provisioned blocks.
|
||
|
|
||
|
Data block size must be between 64KB (128 sectors) and 1GB
|
||
|
(2097152 sectors) inclusive.
|
||
|
|
||
|
|
||
|
ii) Status
|
||
|
|
||
|
<transaction id> <used metadata blocks>/<total metadata blocks>
|
||
|
<used data blocks>/<total data blocks> <held metadata root>
|
||
|
|
||
|
|
||
|
transaction id:
|
||
|
A 64-bit number used by userspace to help synchronise with metadata
|
||
|
from volume managers.
|
||
|
|
||
|
used data blocks / total data blocks
|
||
|
If the number of free blocks drops below the pool's low water mark a
|
||
|
dm event will be sent to userspace. This event is edge-triggered and
|
||
|
it will occur only once after each resume so volume manager writers
|
||
|
should register for the event and then check the target's status.
|
||
|
|
||
|
held metadata root:
|
||
|
The location, in sectors, of the metadata root that has been
|
||
|
'held' for userspace read access. '-' indicates there is no
|
||
|
held root. This feature is not yet implemented so '-' is
|
||
|
always returned.
|
||
|
|
||
|
iii) Messages
|
||
|
|
||
|
create_thin <dev id>
|
||
|
|
||
|
Create a new thinly-provisioned device.
|
||
|
<dev id> is an arbitrary unique 24-bit identifier chosen by
|
||
|
the caller.
|
||
|
|
||
|
create_snap <dev id> <origin id>
|
||
|
|
||
|
Create a new snapshot of another thinly-provisioned device.
|
||
|
<dev id> is an arbitrary unique 24-bit identifier chosen by
|
||
|
the caller.
|
||
|
<origin id> is the identifier of the thinly-provisioned device
|
||
|
of which the new device will be a snapshot.
|
||
|
|
||
|
delete <dev id>
|
||
|
|
||
|
Deletes a thin device. Irreversible.
|
||
|
|
||
|
trim <dev id> <new size in sectors>
|
||
|
|
||
|
Delete mappings from the end of a thin device. Irreversible.
|
||
|
You might want to use this if you're reducing the size of
|
||
|
your thinly-provisioned device. In many cases, due to the
|
||
|
sharing of blocks between devices, it is not possible to
|
||
|
determine in advance how much space 'trim' will release. (In
|
||
|
future a userspace tool might be able to perform this
|
||
|
calculation.)
|
||
|
|
||
|
set_transaction_id <current id> <new id>
|
||
|
|
||
|
Userland volume managers, such as LVM, need a way to
|
||
|
synchronise their external metadata with the internal metadata of the
|
||
|
pool target. The thin-pool target offers to store an
|
||
|
arbitrary 64-bit transaction id and return it on the target's
|
||
|
status line. To avoid races you must provide what you think
|
||
|
the current transaction id is when you change it with this
|
||
|
compare-and-swap message.
|
||
|
|
||
|
'thin' target
|
||
|
-------------
|
||
|
|
||
|
i) Constructor
|
||
|
|
||
|
thin <pool dev> <dev id>
|
||
|
|
||
|
pool dev:
|
||
|
the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
|
||
|
|
||
|
dev id:
|
||
|
the internal device identifier of the device to be
|
||
|
activated.
|
||
|
|
||
|
The pool doesn't store any size against the thin devices. If you
|
||
|
load a thin target that is smaller than you've been using previously,
|
||
|
then you'll have no access to blocks mapped beyond the end. If you
|
||
|
load a target that is bigger than before, then extra blocks will be
|
||
|
provisioned as and when needed.
|
||
|
|
||
|
If you wish to reduce the size of your thin device and potentially
|
||
|
regain some space then send the 'trim' message to the pool.
|
||
|
|
||
|
ii) Status
|
||
|
|
||
|
<nr mapped sectors> <highest mapped sector>
|