staging: ramster: add how-to document

Add how-to documentation that provides a step-by-step guide for
configuring and trying out a ramster cluster.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 drivers/staging/zcache/ramster/ramster-howto.txt | 366 +++++++++++++++++++++
 1 file changed, 366 insertions(+)

			RAMSTER HOW-TO

Author: Dan Magenheimer
Ramster maintainer: Konrad Wilk <konrad.wilk@oracle.com>

This is a HOWTO document for ramster which, as of this writing, is in
the kernel as a subdirectory of zcache in drivers/staging, called ramster.
(Zcache can be built with or without ramster functionality.)  If enabled
and properly configured, ramster allows memory capacity load balancing
across multiple machines in a cluster.  Further, the ramster code serves
as an example of asynchronous access for zcache (as well as cleancache and
frontswap) that may prove useful for future transcendent memory
implementations, such as KVM and NVRAM.  While ramster works today on
any network connection that supports kernel sockets, its features may
become more interesting on future high-speed fabrics/interconnects.

Ramster requires both kernel and userland support.  The userland support,
called ramster-tools, is known to work with EL6-based distros, but is a
set of poorly-hacked, slightly-modified cluster tools based on ocfs2, which
includes an init file, a config file, and a userland binary that interfaces
to the kernel.  This state of userland support reflects the abysmal userland
skills of this suitably-embarrassed author; any help/patches to turn
ramster-tools into more distributable rpms/debs useful for a wider range
of distros would be appreciated.  The source RPM that can be used as a
starting point is available at:
    http://oss.oracle.com/projects/tmem/files/RAMster/

As a result of this author's ignorance, the userland setup described in
this HOWTO assumes an EL6 distro and is described in EL6 syntax.  Apologies
if this offends anyone!

Kernel support has only been tested on x86_64.  Systems with an active
ocfs2 filesystem should work, but since ramster leverages a lot of
code from ocfs2, there may be latent issues.  A kernel configuration that
includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
if no ocfs2 filesystem is mounted.

This HOWTO demonstrates memory capacity load balancing for a two-node
cluster, where one node, called the "local" node, becomes overcommitted
and the other node, called the "remote" node, provides additional RAM
capacity for use by the local node.  Ramster is capable of more complex
topologies; see the last section, "ADVANCED RAMSTER TOPOLOGIES".

If you find any terms in this HOWTO unfamiliar or don't understand the
motivation for ramster, the following LWN reading is recommended:
-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
-- The future calculus of memory management (lwn.net/Articles/475681)
And since ramster is built on top of zcache, this article may be helpful:
-- In-kernel memory compression (lwn.net/Articles/545244)

Now that you've memorized the contents of those articles, let's get started!

A. PRELIMINARY

1) Install two x86_64 Linux systems that are known to work when
   upgraded to a recent upstream Linux kernel version.

On each system:

2) Configure, build and install, then boot Linux, just to ensure it
   can be done with an unmodified upstream kernel.  Confirm you booted
   the upstream kernel with "uname -a".

3) If you plan to do any performance testing, or to test anything
   beyond swapping, the "WasActive" patch is highly recommended.
   (Search lkml.org for WasActive, apply the patch, rebuild your kernel.)
   For a demo or simple testing, the patch can be ignored.

4) Install ramster-tools as root.  An x86_64 rpm for EL6-based systems
   can be found at:
    http://oss.oracle.com/projects/tmem/files/RAMster/
   (Sorry, but for now, non-EL6 users must recreate ramster-tools on
   their own from source.  See above.)

5) Ensure that debugfs is mounted at each boot.  Examples below assume it
   is mounted at /sys/kernel/debug.

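   One way to make that mount persistent (a standard fstab entry, not
   something ramster-specific) is to add a line like the following to
   /etc/fstab on each system:

	debugfs  /sys/kernel/debug  debugfs  defaults  0 0

   Then "mount -a" (or a reboot) will mount it.
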
B. BUILDING RAMSTER INTO THE KERNEL

Do the following on each system:

1) Using the kernel configuration mechanism of your choice, change
   your config to include:

	CONFIG_CLEANCACHE=y
	CONFIG_FRONTSWAP=y
	CONFIG_STAGING=y
	CONFIG_CONFIGFS_FS=y   # NOTE: MUST BE y, not m
	CONFIG_ZCACHE=y
	CONFIG_RAMSTER=y

   For a linux-3.10 or later kernel, you should also set:

	CONFIG_ZCACHE_DEBUG=y
	CONFIG_RAMSTER_DEBUG=y

   Before building the kernel, please double-check your kernel config
   file to ensure all of the settings are correct.

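   As a quick sanity check (just an illustrative one-liner, not part of
   ramster-tools), you can grep the finished config from the top of the
   kernel source tree:

	# grep -E "CONFIG_(CLEANCACHE|FRONTSWAP|STAGING|CONFIGFS_FS|ZCACHE|RAMSTER)=" .config

   All six options from the first list above should appear, each =y.
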
2) Build this kernel and change your boot file (e.g. /etc/grub.conf)
   so that the new kernel will boot.

3) Add "zcache" and "ramster" as kernel boot parameters for the new kernel.
4) Reboot each system approximately simultaneously.

5) Check dmesg to ensure there are some messages from ramster, prefixed
   by "ramster:":

	# dmesg | grep ramster

   You should also see a lot of files in:

	# ls /sys/kernel/debug/zcache
	# ls /sys/kernel/debug/ramster

   These are mostly counters for various zcache and ramster activities.
   You should also see files in:

	# ls /sys/kernel/mm/ramster

   These are sysfs files that control ramster, as we shall see.

   Ramster will now act as a single-system zcache on each system, but it
   doesn't yet know anything about the cluster, so it can't yet do
   anything remotely.

C. CONFIGURING THE RAMSTER CLUSTER

This part can be error-prone unless you are familiar with clustering
filesystems.  We need to describe the cluster in an /etc/ramster.conf
file, and the init scripts that parse it are extremely picky about
the syntax.

1) Create an /etc/ramster.conf file and ensure it is identical on both
   systems.  This file mimics the ocfs2 format; there is a good amount
   of documentation that can be found by searching for ocfs2.conf, but
   you can use:

	cluster:
		name = ramster
		node_count = 2
	node:
		name = system1
		cluster = ramster
		number = 0
		ip_address = my.ip.ad.r1
		ip_port = 7777
	node:
		name = system2
		cluster = ramster
		number = 1
		ip_address = my.ip.ad.r2
		ip_port = 7777

   You must ensure that the "name" field in the file exactly matches
   the output of "hostname" on each system; if "hostname" shows a
   fully-qualified hostname, ensure the name is fully qualified in
   /etc/ramster.conf.  Obviously, substitute my.ip.ad.rx with proper
   ip addresses.

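   A quick way to double-check the match (just a pair of illustrative
   shell commands) is to compare the two directly on each system:

	# hostname
	# grep "name =" /etc/ramster.conf

   The hostname should appear verbatim as one of the node "name" values.
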
2) Enable the ramster service and configure it.  If you used the
   EL6 ramster-tools, this would be:

	# chkconfig --add ramster
	# service ramster configure

   Set "load on boot" to "y", the cluster to start to "ramster" (or
   whatever name you chose in ramster.conf), the heartbeat dead threshold
   to "500", and the network idle timeout to "1000000".  Leave the others
   as default.

3) Reboot both systems.  After reboot, try (assuming EL6 ramster-tools):

	# service ramster status

   You should see "Checking RAMSTER cluster "ramster": Online".  If you
   do not, something is wrong and ramster will not work.  Note that you
   should also see that the driver for "configfs" is loaded and mounted,
   the driver for ocfs2_dlmfs is not loaded, and some numbers for network
   parameters.  You will also see "Checking RAMSTER heartbeat: Not active".
   That's all OK.

4) Now you need to start the cluster heartbeat; the cluster is not "up"
   until all nodes detect a heartbeat.  In a real cluster, heartbeat
   detection is done via a cluster filesystem, but ramster doesn't require
   one.  Some hack-y kernel code in ramster can start the heartbeat for
   you, though, if you tell it what nodes are "up".  To enable the
   heartbeat, do:

	# echo 0 > /sys/kernel/mm/ramster/manual_node_up
	# echo 1 > /sys/kernel/mm/ramster/manual_node_up

   This must be done on BOTH nodes and, to avoid timeouts, must be done
   approximately concurrently on both nodes.  On an EL6 system, it is
   convenient to put these lines in /etc/rc.local.  To confirm that the
   cluster is now up, on both systems do:

	# dmesg | grep ramster

   You should see ramster "Accepted connection" messages in dmesg on both
   nodes after this.  Note that if you check userland status again with

	# service ramster status

   you will still see "Checking RAMSTER heartbeat: Not active".  That's
   still OK... the ramster kernel heartbeat hack doesn't communicate to
   userland.

5) You now must tell each node the node to which it should "remotify"
   pages.  On this two-node cluster, we will assume the "local" node,
   node 0, has memory overcommitted and will use ramster to utilize RAM
   capacity on the "remote" node, node 1.  To configure this, on node 0,
   you do:

	# echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

   You should see "ramster: node 1 set as remotification target" in dmesg
   on node 0.  Again, on EL6, /etc/rc.local is a good place to put this
   on node 0 so you don't forget to do it at each boot.

6) One more step: By default, the ramster code does not "remotify" any
   pages; this is primarily for testing purposes, but sometimes it is
   useful.  This may change in the future, but for now, on node 0, you do:

	# echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
	# echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable

   The first enables remotifying of swap (persistent, aka frontswap)
   pages; the second enables remotifying of page cache (ephemeral,
   cleancache) pages.

   On EL6, these lines can also be put in /etc/rc.local (AFTER the
   node_up lines), or at the beginning of a script that runs a workload,
   as shown in the sketch below.

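   For example, a combined /etc/rc.local fragment for node 0 might look
   like this (a sketch composed from the commands above, assuming the
   two-node setup of this HOWTO; adjust node numbers for your cluster):

	# heartbeat: mark both nodes up (do the same on node 1)
	echo 0 > /sys/kernel/mm/ramster/manual_node_up
	echo 1 > /sys/kernel/mm/ramster/manual_node_up
	# node 0 overflows to node 1, and remotification is enabled
	echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum
	echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
	echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable
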
7) Note that most testing has been done with both/all machines booted
   roughly simultaneously to avoid cluster timeouts.  Ideally, you should
   do this too, unless you are trying to break ramster rather than just
   use it. ;-)

D. TESTING RAMSTER

1) Note that ramster has no value unless pages get "remotified".  For
   swap/frontswap/persistent pages, this doesn't happen unless/until
   the workload would cause swapping to occur, at which point pages
   are put into frontswap/zcache and the remotification thread starts
   working.  To get to the point where the system swaps, you either
   need a workload for which the working set exceeds the RAM in the
   system, or you need to somehow reduce the amount of RAM one of
   the systems sees.  The latter is easy when testing in a VM, but
   harder on physical systems.  In some cases, "mem=xxxM" on the
   kernel command line restricts memory, but for some values of xxx
   the kernel may fail to boot.  One may also try creating a fixed
   RAMdisk, doing nothing with it, but ensuring that it eats up a fixed
   amount of RAM.

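   One way to create such a RAM-eater (a ramfs-based sketch; the mount
   point and the 2GB size are just placeholders) is:

	# mkdir -p /mnt/eatmem
	# mount -t ramfs ramfs /mnt/eatmem
	# dd if=/dev/zero of=/mnt/eatmem/fill bs=1M count=2048

   Since ramfs pages cannot be swapped or reclaimed, this pins down a
   fixed 2GB of RAM until the file is removed or the filesystem is
   unmounted.
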
2) To see if ramster is working, on the "remote" node, node 1, try:

	# grep . /sys/kernel/debug/ramster/foreign_*
	# # note, that is space-dot-space between grep and the pathname

   to monitor the number (and max) of ephemeral and persistent pages
   that ramster has sent.  If these stay at zero, ramster is not working,
   either because the workload on the local node (node 0) isn't creating
   enough memory pressure or because "remotifying" isn't working.  On the
   local system, node 0, you can also watch lots of useful information.
   Try:

	grep . /sys/kernel/debug/zcache/*pageframes* \
		/sys/kernel/debug/zcache/*zbytes* \
		/sys/kernel/debug/zcache/*zpages* \
		/sys/kernel/debug/ramster/*remote*

   Of particular note are the remote_*_pages_succ_get counters.  These
   show how many disk reads and/or disk writes have been avoided on the
   overcommitted local system by storing pages remotely using ramster.

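   To watch these counters update continuously (just a convenience
   wrapper, assuming the standard watch(1) utility is installed), you
   can use something like:

	# watch -n 5 'grep . /sys/kernel/debug/ramster/foreign_*'
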
   At the risk of information overload, you can also grep:

	/sys/kernel/debug/cleancache/* and /sys/kernel/debug/frontswap/*

   These show, for example, how many disk reads and/or disk writes have
   been avoided by using zcache to optimize RAM on the local system.

AUTOMATIC SWAP REPATRIATION

You may notice that while the systems are idle, the foreign persistent
page count on the remote machine slowly decreases.  This is because
ramster implements "frontswap selfshrinking": when possible, swap
pages that have been remotified are slowly repatriated to the local
machine.  This is so that local RAM can be used when possible and
so that, in case of a remote machine crash, the probability of data
loss is reduced.

REBOOTING / POWEROFF

If a system is shut down while some of its swap pages still reside
on a remote system, the system may lock up during the shutdown
sequence.  This will occur if the network is shut down before the
swap mechanism is shut down, which is the default ordering on many
distros.  To avoid this annoying problem, simply shut off the swap
subsystem before starting the shutdown sequence, e.g.:

	# swapoff -a
	# reboot

Ideally, this swapoff-before-ifdown ordering should be enforced
permanently using shutdown scripts.

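A minimal sketch of such a script (a hypothetical SysV init fragment
for EL6; the chkconfig priorities are placeholders and should be chosen
so that "stop" runs before the network is taken down) might look like:

	#!/bin/sh
	# chkconfig: 2345 99 01
	# description: swapoff before network shutdown, for ramster
	case "$1" in
	stop)
		swapoff -a
		;;
	esac

Install it as, e.g., /etc/init.d/ramster-swapoff (a hypothetical name)
and activate it with "chkconfig --add ramster-swapoff".
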
KNOWN PROBLEMS

1) You may periodically see messages such as:

	ramster_r2net, message length problem

   This is harmless but indicates that a node is sending messages
   containing compressed pages that exceed the maximum for zcache
   (PAGE_SIZE*15/16).  The sender side needs to be fixed.

2) If you see a "No longer connected to node..." message or a "No
   connection established with node X after N seconds" message, it is
   possible you may be in an unrecoverable state.  If you are certain
   all of the appropriate cluster configuration steps described above
   have been performed, try rebooting the two servers concurrently to
   see if the cluster starts.

   Note that "Connection to node... shutdown, state 7" is an intermediate
   connection state.  As long as you later see "Accepted connection", the
   intermediate states are harmless.

3) There are known issues in counting certain values.  As a result,
   you may see periodic warnings from the kernel.  Almost always you
   will see "ramster: bad accounting for XXX".  There are also "WARN_ONCE"
   messages.  If you see kernel warnings with a tombstone, please report
   them.  They are harmless, but reflect bugs that need to eventually
   be fixed.

ADVANCED RAMSTER TOPOLOGIES

The kernel code for ramster can support up to eight nodes in a cluster,
but no testing has been done with more than three nodes.

In the example described above, the "remote" node serves as a RAM
overflow for the "local" node.  This can be made symmetric by appropriate
settings of the sysfs remote_target_nodenum file.  For example, by setting:

	# echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 0, and

	# echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, each node can serve as a RAM overflow for the other.

For more than two nodes, a "RAM server" can be configured.  For a
three-node system, set:

	# echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, and

	# echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 2.  Then node 0 is a RAM server for node 1 and node 2.

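Naturally, a three-node cluster also needs a three-node /etc/ramster.conf
on every system.  Following the format shown earlier, that means raising
node_count to 3 and adding a third node stanza (a sketch; system3 and
my.ip.ad.r3 are placeholders like the others):

	node:
		name = system3
		cluster = ramster
		number = 2
		ip_address = my.ip.ad.r3
		ip_port = 7777
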
In this implementation of ramster, any remote node is potentially a single
point of failure (SPOF).  Though the probability of failure is reduced
by automatic swap repatriation (see above), a proposed future enhancement
to ramster improves high-availability for the cluster by sending a copy
of each page of data to two other nodes.  Patches welcome!