This has been a busy cycle for documentation work. Highlights include:

- Lots of RST conversion work by Mauro, Daniel Almeida, and others.
  Maybe someday we'll get to the end of this stuff...maybe...

- Some organizational work to bring some order to the core-api manual.

- Various new docs and additions to the existing documentation.

- Typo fixes, warning fixes, ...

Merge tag 'docs-5.7' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "This has been a busy cycle for documentation work.

  Highlights include:

   - Lots of RST conversion work by Mauro, Daniel Almeida, and others.
     Maybe someday we'll get to the end of this stuff...maybe...

   - Some organizational work to bring some order to the core-api
     manual.

   - Various new docs and additions to the existing documentation.

   - Typo fixes, warning fixes, ..."

* tag 'docs-5.7' of git://git.lwn.net/linux: (123 commits)
  Documentation: x86: exception-tables: document CONFIG_BUILDTIME_TABLE_SORT
  MAINTAINERS: adjust to filesystem doc ReST conversion
  docs: deprecated.rst: Add BUG()-family
  doc: zh_CN: add translation for virtiofs
  doc: zh_CN: index files in filesystems subdirectory
  docs: locking: Drop :c:func: throughout
  docs: locking: Add 'need' to hardirq section
  docs: conf.py: avoid thousands of duplicate label warning on Sphinx
  docs: prevent warnings due to autosectionlabel
  docs: fix reference to core-api/namespaces.rst
  docs: fix pointers to io-mapping.rst and io_ordering.rst files
  Documentation: Better document the softlockup_panic sysctl
  docs: hw-vuln: tsx_async_abort.rst: get rid of an unused ref
  docs: perf: imx-ddr.rst: get rid of a warning
  docs: filesystems: fuse.rst: supress a Sphinx warning
  docs: translations: it: avoid duplicate refs at programming-language.rst
  docs: driver.rst: supress two ReSt warnings
  docs: trace: events.rst: convert some new stuff to ReST format
  Documentation: Add io_ordering.rst to driver-api manual
  Documentation: Add io-mapping.rst to driver-api manual
  ...
Linus Torvalds 2020-03-30 12:45:23 -07:00
commit 481ed297d9
141 changed files with 4535 additions and 3257 deletions


@ -1,5 +1,5 @@
What: /sys/kernel/uids/<uid>/cpu_shares
Date: December 2007
Date: December 2007, finally removed in kernel v2.6.34-rc1
Contact: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Description:


@ -13,7 +13,7 @@ endif
SPHINXBUILD = sphinx-build
SPHINXOPTS =
SPHINXDIRS = .
_SPHINXDIRS = $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst))
_SPHINXDIRS = $(sort $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst)))
SPHINX_CONF = conf.py
PAPER =
BUILDDIR = $(obj)/output


@ -239,7 +239,7 @@ from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.
See Documentation/io-mapping.txt for how to access device registers
See Documentation/driver-api/io-mapping.rst for how to access device registers
or device memory.
The device driver needs to call pci_request_region() to verify


@ -1,3 +1,5 @@
.. _psi:
================================
PSI - Pressure Stall Information
================================


@ -1,5 +1,5 @@
Kernel Support for miscellaneous (your favourite) Binary Formats v1.1
=====================================================================
Kernel Support for miscellaneous Binary Formats (binfmt_misc)
=============================================================
This Kernel feature allows you to invoke almost (for restrictions see below)
every program by simply typing its name in the shell.


@ -251,8 +251,6 @@ line of text and contains the following stats separated by whitespace:
================ =============================================================
orig_data_size uncompressed size of data stored in this disk.
This excludes same-element-filled pages (same_pages) since
no memory is allocated for them.
Unit: bytes
compr_data_size compressed size of data stored in this disk
mem_used_total the amount of memory allocated for this disk. This


@ -23,7 +23,7 @@ of dot-connected-words, and key and value are connected by ``=``. The value
has to be terminated by semi-colon (``;``) or newline (``\n``).
For array value, array entries are separated by comma (``,``). ::
KEY[.WORD[...]] = VALUE[, VALUE2[...]][;]
KEY[.WORD[...]] = VALUE[, VALUE2[...]][;]
Unlike the kernel command line syntax, spaces are OK around the comma and ``=``.
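
The grammar quoted in this hunk can be illustrated with a small parser. This is only a sketch of the ``KEY[.WORD[...]] = VALUE[, VALUE2[...]][;]`` shape: the function name ``parse_bootconfig_entries`` is hypothetical, and quoting, comments, and any other syntax the full format may define are deliberately not handled.

```python
def parse_bootconfig_entries(text):
    """Parse simple 'KEY[.WORD...] = VALUE[, VALUE2...][;]' entries.

    Assumption: simplified grammar only (no quoting, no comments).
    Returns a dict mapping each dotted key to a list of values.
    """
    entries = {}
    # An entry is terminated by a semicolon or a newline.
    for raw in text.replace(";", "\n").splitlines():
        raw = raw.strip()
        if not raw:
            continue
        key, sep, value = raw.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {raw!r}")
        # Spaces are allowed around '=' and ',' so strip them away.
        entries[key.strip()] = [v.strip() for v in value.split(",")]
    return entries


sample = "feature.options = foo, bar;\nkernel.quiet = 1"
parsed = parse_bootconfig_entries(sample)
```

A single value is still returned as a one-element list, so callers can treat scalar and array values uniformly.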


@ -1,3 +1,5 @@
.. _cgroup-v1:
========================
Control Groups version 1
========================


@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
v1 is available under Documentation/admin-guide/cgroup-v1/.
v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
.. CONTENTS
@ -1023,7 +1023,7 @@ All time durations are in microseconds.
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.rst for details.
:ref:`Documentation/accounting/psi.rst <psi>` for details.
cpu.uclamp.min
A read-write single value file which exists on non-root cgroups.
@ -1103,7 +1103,7 @@ PAGE_SIZE multiple when read back.
proportionally to the overage, reducing reclaim pressure for
smaller overages.
Effective min boundary is limited by memory.min values of
Effective min boundary is limited by memory.min values of
all ancestor cgroups. If there is memory.min overcommitment
(child cgroup or cgroups are requiring more protected memory
than parent will allow), then each child cgroup will get
@ -1313,53 +1313,41 @@ PAGE_SIZE multiple when read back.
Number of major page faults incurred
workingset_refault
Number of refaults of previously evicted pages
workingset_activate
Number of refaulted pages that were immediately activated
workingset_nodereclaim
Number of times a shadow node has been reclaimed
pgrefill
Amount of scanned pages (in an active LRU list)
pgscan
Amount of scanned pages (in an inactive LRU list)
pgsteal
Amount of reclaimed pages
pgactivate
Amount of pages moved to the active LRU list
pgdeactivate
Amount of pages moved to the inactive LRU list
pglazyfree
Amount of pages postponed to be freed under memory pressure
pglazyfreed
Amount of reclaimed lazyfree pages
thp_fault_alloc
Number of transparent hugepages which were allocated to satisfy
a page fault, including COW faults. This counter is not present
when CONFIG_TRANSPARENT_HUGEPAGE is not set.
thp_collapse_alloc
Number of transparent hugepages which were allocated to allow
collapsing an existing range of pages. This counter is not
present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
@ -1403,7 +1391,7 @@ PAGE_SIZE multiple when read back.
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for memory. See
Documentation/accounting/psi.rst for details.
:ref:`Documentation/accounting/psi.rst <psi>` for details.
Usage Guidelines
@ -1478,7 +1466,7 @@ IO Interface Files
dios Number of discard IOs
====== =====================
An example read output follows:
An example read output follows::
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
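
The example read output above is a nested-key file: one line per device, keyed by ``major:minor``, followed by ``key=value`` pairs. A minimal sketch of reading it from userspace follows; the helper name ``parse_io_stat`` is hypothetical, and the input is the sample text from this hunk rather than a live cgroup file.

```python
def parse_io_stat(text):
    """Turn nested-key lines into {device: {key: int_value}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # First field is the device number, the rest are key=value pairs.
        device, pairs = fields[0], fields[1:]
        stats[device] = {
            key: int(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return stats


sample = (
    "8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0\n"
    "8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 "
    "dbytes=50331648 dios=3021\n"
)
io_stats = parse_io_stat(sample)
```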
@ -1643,7 +1631,7 @@ IO Interface Files
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for IO. See
Documentation/accounting/psi.rst for details.
:ref:`Documentation/accounting/psi.rst <psi>` for details.
Writeback
@ -1853,7 +1841,7 @@ Cpuset Interface Files
from the requested CPUs.
The CPU numbers are comma-separated numbers or ranges.
For example:
For example::
# cat cpuset.cpus
0-4,6,8-10
@ -1892,7 +1880,7 @@ Cpuset Interface Files
from the requested memory nodes.
The memory node numbers are comma-separated numbers or ranges.
For example:
For example::
# cat cpuset.mems
0-1,3
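
Both ``cpuset.cpus`` and ``cpuset.mems`` use the same comma-separated list of numbers or ranges shown in these hunks. As a rough illustration, the format can be expanded into individual IDs as below; ``parse_id_list`` is a hypothetical helper name, and the inputs are the two example values quoted above.

```python
def parse_id_list(text):
    """Expand a list such as '0-4,6,8-10' into individual integer IDs."""
    ids = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file yields an empty list
        first, sep, last = part.partition("-")
        if sep:
            # A range 'a-b' is inclusive on both ends.
            ids.extend(range(int(first), int(last) + 1))
        else:
            ids.append(int(first))
    return ids


cpus = parse_id_list("0-4,6,8-10")
mems = parse_id_list("0-1,3")
```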


@ -11,11 +11,13 @@ Today, with the advent of Kernel Mode Setting, a graphics board is
either correctly working because all components follow the standards -
or the computer is unusable, because the screen remains dark after
booting or it displays the wrong area. Cases when this happens are:
- The graphics board does not recognize the monitor.
- The graphics board is unable to detect any EDID data.
- The graphics board incorrectly forwards EDID data to the driver.
- The monitor sends no or bogus EDID data.
- A KVM sends its own EDID data instead of querying the connected monitor.
Adding the kernel parameter "nomodeset" helps in most cases, but causes
restrictions later on.
@ -32,7 +34,7 @@ individual data for a specific misbehaving monitor, commented sources
and a Makefile environment are given here.
To create binary EDID and C source code files from the existing data
material, simply type "make".
material, simply type "make" in tools/edid/.
If you want to create your own EDID file, copy the file 1024x768.S,
replace the settings with your own data and add a new target to the


@ -136,8 +136,6 @@ enables the mitigation by default.
The mitigation can be controlled at boot time via a kernel command line option.
See :ref:`taa_mitigation_control_command_line`.
.. _virt_mechanism:
Virtualization mitigation
^^^^^^^^^^^^^^^^^^^^^^^^^


@ -75,6 +75,7 @@ configure specific aspects of kernel behavior to your liking.
cputopology
dell_rbu
device-mapper/index
edid
efi-stub
ext4
nfs/index


@ -1099,6 +1099,12 @@
A valid base address must be provided, and the serial
port must already be setup and configured.
ec_imx21,<addr>
ec_imx6q,<addr>
Start an early, polled-mode, output-only console on the
Freescale i.MX UART at the specified address. The UART
must already be setup and configured.
ar3700_uart,<addr>
Start an early, polled-mode console on the
Armada 3700 serial port at the specified
@ -1779,7 +1785,7 @@
provided by tboot because it makes the system
vulnerable to DMA attacks.
nobounce [Default off]
Disable bounce buffer for unstrusted devices such as
Disable bounce buffer for untrusted devices such as
the Thunderbolt devices. This will treat the untrusted
devices as the trusted ones, hence might expose security
risks of DMA attacks.
@ -1883,7 +1889,7 @@
No delay
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
See Documentation/admin-guide/nfs/nfsroot.rst.
ipcmni_extend [KNL] Extend the maximum number of unique System V
IPC identifiers from 32,768 to 16,777,216.
@ -2795,7 +2801,7 @@
<name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>]
mtdparts= [MTD]
See drivers/mtd/cmdlinepart.c.
See drivers/mtd/parsers/cmdlinepart.c
multitce=off [PPC] This parameter disables the use of the pSeries
firmware feature for updating multiple TCE entries
@ -2853,13 +2859,13 @@
Default value is 0.
nfsaddrs= [NFS] Deprecated. Use ip= instead.
See Documentation/filesystems/nfs/nfsroot.txt.
See Documentation/admin-guide/nfs/nfsroot.rst.
nfsroot= [NFS] nfs root filesystem for disk-less boxes.
See Documentation/filesystems/nfs/nfsroot.txt.
See Documentation/admin-guide/nfs/nfsroot.rst.
nfsrootdebug [NFS] enable nfsroot debugging messages.
See Documentation/filesystems/nfs/nfsroot.txt.
See Documentation/admin-guide/nfs/nfsroot.rst.
nfs.callback_nr_threads=
[NFSv4] set the total number of threads that the
@ -4514,10 +4520,10 @@
Format: <integer>
A nonzero value instructs the soft-lockup detector
to panic the machine when a soft-lockup occurs. This
is also controlled by CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC
which is the respective build-time switch to that
functionality.
to panic the machine when a soft-lockup occurs. It is
also controlled by the kernel.softlockup_panic sysctl
and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the
respective build-time switch to that functionality.
softlockup_all_cpu_backtrace=
[KNL] Should the soft-lockup detector generate


@ -234,7 +234,7 @@ To reduce its OS jitter, do any of the following:
Such a workqueue can be confined to a given subset of the
CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs
files. The set of WQ_SYSFS workqueues can be displayed using
"ls sys/devices/virtual/workqueue". That said, the workqueues
"ls /sys/devices/virtual/workqueue". That said, the workqueues
maintainer would like to caution people against indiscriminately
sprinkling WQ_SYSFS across all the workqueues. The reason for
caution is that it is easy to add WQ_SYSFS, but because sysfs is


@ -43,7 +43,8 @@ value 1 for supported.
AXI_ID and AXI_MASKING are mapped on DPCR1 register in performance counter.
When non-masked bits are matching corresponding AXI_ID bits then counter is
incremented. Perf counter is incremented if
incremented. Perf counter is incremented if::
AxID && AXI_MASKING == AXI_ID && AXI_MASKING
This filter doesn't support filter different AXI ID for axid-read and axid-write
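
The matching condition in this hunk can be written out as a predicate. The sketch below assumes that ``&&`` in the documented expression denotes a bitwise AND and that set bits in ``AXI_MASKING`` select the bits that take part in the comparison; neither point is confirmed by the lines shown here, and the function name is hypothetical.

```python
def axi_id_matches(axid, axi_id, axi_masking):
    """Return True when the transaction AxID passes the filter.

    Assumption: '&&' in the documented condition means bitwise AND, so
    only the bits selected by axi_masking are compared.
    """
    return (axid & axi_masking) == (axi_id & axi_masking)


# With mask 0xF0 only the upper nibble is compared.
upper_nibble_match = axi_id_matches(0x12, 0x10, 0xF0)
upper_nibble_mismatch = axi_id_matches(0x22, 0x10, 0xF0)
```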

File diff suppressed because it is too large


@ -4,18 +4,18 @@ ARM TCM (Tightly-Coupled Memory) handling in Linux
Written by Linus Walleij <linus.walleij@stericsson.com>
Some ARM SoC:s have a so-called TCM (Tightly-Coupled Memory).
Some ARM SoCs have a so-called TCM (Tightly-Coupled Memory).
This is usually just a few (4-64) KiB of RAM inside the ARM
processor.
Due to being embedded inside the CPU The TCM has a
Due to being embedded inside the CPU, the TCM has a
Harvard-architecture, so there is an ITCM (instruction TCM)
and a DTCM (data TCM). The DTCM can not contain any
instructions, but the ITCM can actually contain data.
The size of DTCM or ITCM is minimum 4KiB so the typical
minimum configuration is 4KiB ITCM and 4KiB DTCM.
ARM CPU:s have special registers to read out status, physical
ARM CPUs have special registers to read out status, physical
location and size of TCM memories. arch/arm/include/asm/cputype.h
defines a CPUID_TCM register that you can read out from the
system control coprocessor. Documentation from ARM can be found


@ -38,7 +38,11 @@ needs_sphinx = '1.3'
# ones.
extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain',
'kfigure', 'sphinx.ext.ifconfig', 'automarkup',
'maintainers_include']
'maintainers_include', 'sphinx.ext.autosectionlabel' ]
# Ensure that autosectionlabel will produce unique names
autosectionlabel_prefix_document = True
autosectionlabel_maxdepth = 2
# The name of the math extension changed on Sphinx 1.4
if (major == 1 and minor > 3) or (major > 1):


@ -8,41 +8,81 @@ This is the beginning of a manual for core kernel APIs. The conversion
Core utilities
==============
This section has general and "core core" documentation. The first is a
massive grab-bag of kerneldoc info left over from the docbook days; it
should really be broken up someday when somebody finds the energy to do
it.
.. toctree::
:maxdepth: 1
kernel-api
assoc_array
atomic_ops
cachetlb
refcount-vs-atomic
cpu_hotplug
idr
local_ops
workqueue
genericirq
xarray
librs
genalloc
errseq
packing
printk-formats
symbol-namespaces
Data structures and low-level utilities
=======================================
Library functionality that is used throughout the kernel.
.. toctree::
:maxdepth: 1
kobject
assoc_array
xarray
idr
circular-buffers
generic-radix-tree
packing
timekeeping
errseq
Concurrency primitives
======================
How Linux keeps everything from happening at the same time. See
:doc:`/locking/index` for more related documentation.
.. toctree::
:maxdepth: 1
atomic_ops
refcount-vs-atomic
local_ops
padata
../RCU/index
Low-level hardware management
=============================
Cache management, managing CPU hotplug, etc.
.. toctree::
:maxdepth: 1
cachetlb
cpu_hotplug
memory-hotplug
genericirq
protection-keys
Memory management
=================
How to allocate and use memory in the kernel. Note that there is a lot
more memory-management documentation in :doc:`/vm/index`.
.. toctree::
:maxdepth: 1
memory-allocation
mm-api
genalloc
pin_user_pages
gfp_mask-from-fs-io
timekeeping
boot-time-mm
memory-hotplug
protection-keys
../RCU/index
gcc-plugins
symbol-namespaces
padata
ioctl
gfp_mask-from-fs-io
Interfaces for kernel debugging
===============================
@ -53,6 +93,16 @@ Interfaces for kernel debugging
debug-objects
tracepoint
Everything else
===============
Documents that don't fit elsewhere or which have yet to be categorized.
.. toctree::
:maxdepth: 1
librs
.. only:: subproject and html
Indices


@ -25,7 +25,7 @@ some terms we will be working with.
usually embedded within some other structure which contains the stuff
the code is really interested in.
No structure should EVER have more than one kobject embedded within it.
No structure should **EVER** have more than one kobject embedded within it.
If it does, the reference counting for the object is sure to be messed
up and incorrect, and your code will be buggy. So do not do this.
@ -55,7 +55,7 @@ a larger, domain-specific object. To this end, kobjects will be found
embedded in other structures. If you are used to thinking of things in
object-oriented terms, kobjects can be seen as a top-level, abstract class
from which other classes are derived. A kobject implements a set of
capabilities which are not particularly useful by themselves, but which are
capabilities which are not particularly useful by themselves, but are
nice to have in other objects. The C language does not allow for the
direct expression of inheritance, so other techniques - such as structure
embedding - must be used.
@ -65,12 +65,12 @@ this is analogous as to how "list_head" structs are rarely useful on
their own, but are invariably found embedded in the larger objects of
interest.)
So, for example, the UIO code in drivers/uio/uio.c has a structure that
So, for example, the UIO code in ``drivers/uio/uio.c`` has a structure that
defines the memory region associated with a uio device::
struct uio_map {
struct kobject kobj;
struct uio_mem *mem;
struct kobject kobj;
struct uio_mem *mem;
};
If you have a struct uio_map structure, finding its embedded kobject is
@ -78,30 +78,30 @@ just a matter of using the kobj member. Code that works with kobjects will
often have the opposite problem, however: given a struct kobject pointer,
what is the pointer to the containing structure? You must avoid tricks
(such as assuming that the kobject is at the beginning of the structure)
and, instead, use the container_of() macro, found in <linux/kernel.h>::
and, instead, use the container_of() macro, found in ``<linux/kernel.h>``::
container_of(pointer, type, member)
where:
* "pointer" is the pointer to the embedded kobject,
* "type" is the type of the containing structure, and
* "member" is the name of the structure field to which "pointer" points.
* ``pointer`` is the pointer to the embedded kobject,
* ``type`` is the type of the containing structure, and
* ``member`` is the name of the structure field to which ``pointer`` points.
The return value from container_of() is a pointer to the corresponding
container type. So, for example, a pointer "kp" to a struct kobject
embedded *within* a struct uio_map could be converted to a pointer to the
*containing* uio_map structure with::
container type. So, for example, a pointer ``kp`` to a struct kobject
embedded **within** a struct uio_map could be converted to a pointer to the
**containing** uio_map structure with::
struct uio_map *u_map = container_of(kp, struct uio_map, kobj);
For convenience, programmers often define a simple macro for "back-casting"
For convenience, programmers often define a simple macro for **back-casting**
kobject pointers to the containing type. Exactly this happens in the
earlier drivers/uio/uio.c, as you can see here::
earlier ``drivers/uio/uio.c``, as you can see here::
struct uio_map {
struct kobject kobj;
struct uio_mem *mem;
struct kobject kobj;
struct uio_mem *mem;
};
#define to_map(map) container_of(map, struct uio_map, kobj)
@ -125,7 +125,7 @@ must have an associated kobj_type. After calling kobject_init(), to
register the kobject with sysfs, the function kobject_add() must be called::
int kobject_add(struct kobject *kobj, struct kobject *parent,
const char *fmt, ...);
const char *fmt, ...);
This sets up the parent of the kobject and the name for the kobject
properly. If the kobject is to be associated with a specific kset,
@ -172,13 +172,13 @@ call to kobject_uevent()::
int kobject_uevent(struct kobject *kobj, enum kobject_action action);
Use the KOBJ_ADD action for when the kobject is first added to the kernel.
Use the **KOBJ_ADD** action for when the kobject is first added to the kernel.
This should be done only after any attributes or children of the kobject
have been initialized properly, as userspace will instantly start to look
for them when this call happens.
When the kobject is removed from the kernel (details on how to do that are
below), the uevent for KOBJ_REMOVE will be automatically created by the
below), the uevent for **KOBJ_REMOVE** will be automatically created by the
kobject core, so the caller does not have to worry about doing that by
hand.
@ -238,7 +238,7 @@ Both types of attributes used here, with a kobject that has been created
with the kobject_create_and_add(), can be of type kobj_attribute, so no
special custom attribute is needed to be created.
See the example module, samples/kobject/kobject-example.c for an
See the example module, ``samples/kobject/kobject-example.c`` for an
implementation of a simple kobject and attributes.
@ -270,10 +270,10 @@ such a method has a form like::
void my_object_release(struct kobject *kobj)
{
struct my_object *mine = container_of(kobj, struct my_object, kobj);
struct my_object *mine = container_of(kobj, struct my_object, kobj);
/* Perform any additional cleanup on this object, then... */
kfree(mine);
/* Perform any additional cleanup on this object, then... */
kfree(mine);
}
One important point cannot be overstated: every kobject must have a
@ -297,11 +297,11 @@ instead, it is associated with the ktype. So let us introduce struct
kobj_type::
struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;
struct attribute **default_attrs;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;
struct attribute **default_attrs;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
};
This structure is used to describe a particular type of kobject (or, more
@ -352,8 +352,8 @@ created and never declared statically or on the stack. To create a new
kset use::
struct kset *kset_create_and_add(const char *name,
struct kset_uevent_ops *u,
struct kobject *parent);
struct kset_uevent_ops *u,
struct kobject *parent);
When you are finished with the kset, call::
@ -365,16 +365,16 @@ Because other references to the kset may still exist, the release may happen
after kset_unregister() returns.
An example of using a kset can be seen in the
samples/kobject/kset-example.c file in the kernel tree.
``samples/kobject/kset-example.c`` file in the kernel tree.
If a kset wishes to control the uevent operations of the kobjects
associated with it, it can use the struct kset_uevent_ops to handle it::
struct kset_uevent_ops {
int (*filter)(struct kset *kset, struct kobject *kobj);
const char *(*name)(struct kset *kset, struct kobject *kobj);
int (*uevent)(struct kset *kset, struct kobject *kobj,
struct kobj_uevent_env *env);
int (*filter)(struct kset *kset, struct kobject *kobj);
const char *(*name)(struct kset *kset, struct kobject *kobj);
int (*uevent)(struct kset *kset, struct kobject *kobj,
struct kobj_uevent_env *env);
};
@ -408,8 +408,8 @@ Kobject removal
After a kobject has been registered with the kobject core successfully, it
must be cleaned up when the code is finished with it. To do that, call
kobject_put(). By doing this, the kobject core will automatically clean up
all of the memory allocated by this kobject. If a KOBJ_ADD uevent has been
sent for the object, a corresponding KOBJ_REMOVE uevent will be sent, and
all of the memory allocated by this kobject. If a ``KOBJ_ADD`` uevent has been
sent for the object, a corresponding ``KOBJ_REMOVE`` uevent will be sent, and
any other sysfs housekeeping will be handled for the caller properly.
If you need to do a two-stage delete of the kobject (say you are not
@ -430,5 +430,5 @@ Example code to copy from
=========================
For a more complete example of using ksets and kobjects properly, see the
example programs samples/kobject/{kobject-example.c,kset-example.c},
which will be built as loadable modules if you select CONFIG_SAMPLE_KOBJECT.
example programs ``samples/kobject/{kobject-example.c,kset-example.c}``,
which will be built as loadable modules if you select ``CONFIG_SAMPLE_KOBJECT``.


@ -1,22 +0,0 @@
Debugging Modules after 2.6.3
-----------------------------
In almost all distributions, the kernel asks for modules which don't
exist, such as "net-pf-10" or whatever. Changing "modprobe -q" to
"succeed" in this case is hacky and breaks some setups, and also we
want to know if it failed for the fallback code for old aliases in
fs/char_dev.c, for example.
In the past a debugging message which would fill people's logs was
emitted. This debugging message has been removed. The correct way
of debugging module problems is something like this:
echo '#! /bin/sh' > /tmp/modprobe
echo 'echo "$@" >> /tmp/modprobe.log' >> /tmp/modprobe
echo 'exec /sbin/modprobe "$@"' >> /tmp/modprobe
chmod a+x /tmp/modprobe
echo /tmp/modprobe > /proc/sys/kernel/modprobe
Note that the above applies only when the *kernel* is requesting
that the module be loaded -- it won't have any effect if that module
is being loaded explicitly using "modprobe" from userspace.


@ -203,7 +203,7 @@ Cause
may not correctly copy files from sysfs.
Solution
Use ``cat``' to read ``.gcda`` files and ``cp -d`` to copy links.
Use ``cat`` to read ``.gcda`` files and ``cp -d`` to copy links.
Alternatively use the mechanism shown in Appendix B.


@ -8,7 +8,8 @@ with the difference that the orphan objects are not freed but only
reported via /sys/kernel/debug/kmemleak. A similar method is used by the
Valgrind tool (``memcheck --leak-check``) to detect the memory leaks in
user-space applications.
Kmemleak is supported on x86, arm, powerpc, sparc, sh, microblaze, ppc, mips, s390 and tile.
Kmemleak is supported on x86, arm, arm64, powerpc, sparc, sh, microblaze, mips,
s390, nds32, arc and xtensa.
Usage
-----


@ -272,8 +272,8 @@ STA information lifetime rules
.. kernel-doc:: net/mac80211/sta_info.c
:doc: STA information lifetime rules
Aggregation
===========
Aggregation Functions
=====================
.. kernel-doc:: net/mac80211/sta_info.h
:functions: sta_ampdu_mlme
@ -284,8 +284,8 @@ Aggregation
.. kernel-doc:: net/mac80211/sta_info.h
:functions: tid_ampdu_rx
Synchronisation
===============
Synchronisation Functions
=========================
TBD


@ -5,8 +5,8 @@ DMAEngine documentation
DMAEngine documentation provides documents for various aspects of DMAEngine
framework.
DMAEngine documentation
-----------------------
DMAEngine development documentation
-----------------------------------
This book helps with DMAengine internal APIs and guide for DMAEngine device
driver writers.


@ -210,7 +210,7 @@ probed.
While the typical use case for sync_state() is to have the kernel cleanly take
over management of devices from the bootloader, the usage of sync_state() is
not restricted to that. Use it whenever it makes sense to take an action after
all the consumers of a device have probed.
all the consumers of a device have probed::
int (*remove) (struct device *dev);


@ -17,6 +17,7 @@ available subsections can be seen below.
driver-model/index
basics
infrastructure
ioctl
early-userspace/index
pm/index
clk
@ -74,11 +75,12 @@ available subsections can be seen below.
connector
console
dcdbas
edid
eisa
ipmb
isa
isapnp
io-mapping
io_ordering
generic-counter
lightnvm-pblk
memory-devices/index


@ -23,7 +23,7 @@
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| riscv: | ok |
| s390: | ok |
| sh: | ok |
| sparc: | ok |


@ -1,7 +1,10 @@
v9fs: Plan 9 Resource Sharing for Linux
=======================================
.. SPDX-License-Identifier: GPL-2.0
ABOUT
=======================================
v9fs: Plan 9 Resource Sharing for Linux
=======================================
About
=====
v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
@ -14,32 +17,34 @@ and Maya Gokhale. Additional development by Greg Watson
The best detailed explanation of the Linux implementation and applications of
the 9p client is available in the form of a USENIX paper:
http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html
Other applications are described in the following papers:
* XCPU & Clustering
http://xcpu.org/papers/xcpu-talk.pdf
* KVMFS: control file system for KVM
http://xcpu.org/papers/kvmfs.pdf
* CellFS: A New Programming Model for the Cell BE
http://xcpu.org/papers/cellfs-talk.pdf
* PROSE I/O: Using 9p to enable Application Partitions
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
* VirtFS: A Virtualization Aware File System pass-through
http://goo.gl/3WPDg
USAGE
* XCPU & Clustering
http://xcpu.org/papers/xcpu-talk.pdf
* KVMFS: control file system for KVM
http://xcpu.org/papers/kvmfs.pdf
* CellFS: A New Programming Model for the Cell BE
http://xcpu.org/papers/cellfs-talk.pdf
* PROSE I/O: Using 9p to enable Application Partitions
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
* VirtFS: A Virtualization Aware File System pass-through
http://goo.gl/3WPDg
Usage
=====
For remote file server:
For remote file server::
mount -t 9p 10.10.1.2 /mnt/9
For Plan 9 From User Space applications (http://swtch.com/plan9)
For Plan 9 From User Space applications (http://swtch.com/plan9)::
mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
For server running on QEMU host with virtio transport:
For server running on QEMU host with virtio transport::
mount -t 9p -o trans=virtio <mount_tag> /mnt/9
@ -48,18 +53,22 @@ mount points. Each 9P export is seen by the client as a virtio device with an
associated "mount_tag" property. Available mount tags can be
seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
OPTIONS
Options
=======
============= ===============================================================
trans=name select an alternative transport. Valid options are
currently:
unix - specifying a named pipe mount point
tcp - specifying a normal TCP/IP connection
fd - used passed file descriptors for connection
(see rfdno and wfdno)
virtio - connect to the next virtio channel available
(from QEMU with trans_virtio module)
rdma - connect to a specified RDMA channel
======== ============================================
unix specifying a named pipe mount point
tcp specifying a normal TCP/IP connection
fd use passed file descriptors for connection
(see rfdno and wfdno)
virtio connect to the next virtio channel available
(from QEMU with trans_virtio module)
rdma connect to a specified RDMA channel
======== ============================================
uname=name user name to attempt mount as on the remote server. The
server may override or ignore this value. Certain user
@ -69,28 +78,36 @@ OPTIONS
offering several exported file systems.
cache=mode specifies a caching policy. By default, no caches are used.
none = default no cache policy, metadata and data
none
default no cache policy, metadata and data
alike are synchronous.
loose = no attempts are made at consistency,
loose
no attempts are made at consistency,
intended for exclusive, read-only mounts
fscache = use FS-Cache for a persistent, read-only
fscache
use FS-Cache for a persistent, read-only
cache backend.
mmap = minimal cache that is only used for read-write
mmap
minimal cache that is only used for read-write
mmap. Nothing else is cached, like cache=none
debug=n specifies debug level. The debug level is a bitmask.
0x01 = display verbose error messages
0x02 = developer debug (DEBUG_CURRENT)
0x04 = display 9p trace
0x08 = display VFS trace
0x10 = display Marshalling debug
0x20 = display RPC debug
0x40 = display transport debug
0x80 = display allocation debug
0x100 = display protocol message debug
0x200 = display Fid debug
0x400 = display packet debug
0x800 = display fscache tracing debug
===== ================================
0x01 display verbose error messages
0x02 developer debug (DEBUG_CURRENT)
0x04 display 9p trace
0x08 display VFS trace
0x10 display Marshalling debug
0x20 display RPC debug
0x40 display transport debug
0x80 display allocation debug
0x100 display protocol message debug
0x200 display Fid debug
0x400 display packet debug
0x800 display fscache tracing debug
===== ================================
rfdno=n the file descriptor for reading with trans=fd
@ -103,9 +120,12 @@ OPTIONS
noextend force legacy mode (no 9p2000.u or 9p2000.L semantics)
version=name Select 9P protocol version. Valid options are:
9p2000 - Legacy mode (same as noextend)
9p2000.u - Use 9P2000.u protocol
9p2000.L - Use 9P2000.L protocol
======== ==============================
9p2000 Legacy mode (same as noextend)
9p2000.u Use 9P2000.u protocol
9p2000.L Use 9P2000.L protocol
======== ==============================
dfltuid attempt to mount as a particular uid
@ -118,22 +138,27 @@ OPTIONS
hosts. This functionality will be expanded in later versions.
access there are four access modes.
user = if a user tries to access a file on v9fs
user
if a user tries to access a file on v9fs
filesystem for the first time, v9fs sends an
attach command (Tattach) for that user.
This is the default mode.
<uid> = allows only user with uid=<uid> to access
<uid>
allows only user with uid=<uid> to access
the files on the mounted filesystem
any = v9fs does single attach and performs all
any
v9fs does single attach and performs all
operations as one user
client = ACL based access check on the 9p client
client
ACL based access check on the 9p client
side for access validation
cachetag cache tag to use the specified persistent cache.
cache tags for existing cache sessions can be listed at
/sys/fs/9p/caches. (applies only to cache=fscache)
============= ===============================================================
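Because the debug level is a bitmask, several of the values listed for debug=n can be combined by OR-ing them together. The following is a minimal Python sketch of that arithmetic. The constant names and the helper ``debug_option`` are illustrative only and do not come from the kernel source, and writing the mask in hexadecimal is an assumption based on how the table lists the values.

```python
# Flag values copied from the debug= table; the constant names are
# illustrative only and are not the kernel's own identifiers.
DEBUG_ERROR = 0x01   # display verbose error messages
DEBUG_9P = 0x04      # display 9p trace
DEBUG_VFS = 0x08     # display VFS trace
DEBUG_TRANS = 0x40   # display transport debug


def debug_option(*flags):
    """OR the given flag values together and format a debug= option."""
    mask = 0
    for flag in flags:
        mask |= flag
    return "debug=" + hex(mask)


# Verbose errors plus 9p and VFS tracing.
print(debug_option(DEBUG_ERROR, DEBUG_9P, DEBUG_VFS))
```

The resulting string would be appended to the other -o options of a mount command such as those under Usage.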
RESOURCES
Resources
=========
Protocol specifications are maintained on github:
@ -158,4 +183,3 @@ http://plan9.bell-labs.com/plan9
For information on Plan 9 from User Space (Plan 9 applications and libraries
ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9


@ -1,3 +1,9 @@
.. SPDX-License-Identifier: GPL-2.0
===============================
Acorn Disc Filing System - ADFS
===============================
Filesystems supported by ADFS
-----------------------------
@ -25,6 +31,7 @@ directory updates, specifically updating the access mode and timestamp.
Mount options for ADFS
----------------------
============ ======================================================
uid=nnn All files in the partition will be owned by
user id nnn. Default 0 (root).
gid=nnn All files in the partition will be in group
@ -36,22 +43,23 @@ Mount options for ADFS
ftsuffix=n When ftsuffix=0, no file type suffix will be applied.
When ftsuffix=1, a hexadecimal suffix corresponding to
the RISC OS file type will be added. Default 0.
============ ======================================================
Mapping of ADFS permissions to Linux permissions
------------------------------------------------
ADFS permissions consist of the following:
Owner read
Owner write
Other read
Other write
- Owner read
- Owner write
- Other read
- Other write
(In older versions, an 'execute' permission did exist, but this
does not hold the same meaning as the Linux 'execute' permission
and is now obsolete).
does not hold the same meaning as the Linux 'execute' permission
and is now obsolete).
The mapping is performed as follows:
The mapping is performed as follows::
Owner read -> -r--r--r--
Owner write -> --w--w--w-
@ -66,17 +74,18 @@ Mapping of ADFS permissions to Linux permissions
Possible other mode permissions -> ----rwxrwx
Hence, with the default masks, if a file is owner read/write, and
not a UnixExec filetype, then the permissions will be:
not a UnixExec filetype, then the permissions will be::
-rw-------
However, if the masks were ownmask=0770,othmask=0007, then this would
be modified to:
be modified to::
-rw-rw----
There is no restriction on what you can do with these masks. You may
wish that either read bits give read access to the file for all, but
keep the default write protection (ownmask=0755,othmask=0577):
keep the default write protection (ownmask=0755,othmask=0577)::
-rw-r--r--
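The three mask examples above can be reproduced with a short Python sketch. It assumes that the owner bits are ANDed with ownmask, the other bits with othmask, and the two results ORed together, and that the defaults are ownmask=0700 and othmask=0077. Those assumptions are consistent with the examples but are not spelled out in the lines shown here. The function names are hypothetical, and the UnixExec case is left out.

```python
import stat

# Base permission bits from the mapping: read -> r--r--r--, write -> -w--w--w-.
READ_BITS = 0o444
WRITE_BITS = 0o222


def adfs_mode(owner_read, owner_write, other_read, other_write,
              ownmask=0o700, othmask=0o077):
    """Return the Linux mode bits for a set of ADFS permissions."""
    owner = (READ_BITS if owner_read else 0) | (WRITE_BITS if owner_write else 0)
    other = (READ_BITS if other_read else 0) | (WRITE_BITS if other_write else 0)
    # Assumed combination rule: mask each half, then merge.
    return (owner & ownmask) | (other & othmask)


def show(mode):
    """Format mode bits the way ls -l does for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)


# An owner read/write file under the three mask settings discussed above.
print(show(adfs_mode(True, True, False, False)))
print(show(adfs_mode(True, True, False, False, ownmask=0o770, othmask=0o007)))
print(show(adfs_mode(True, True, False, False, ownmask=0o755, othmask=0o577)))
```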


@ -1,9 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
=============================
Overview of Amiga Filesystems
=============================
Not all varieties of the Amiga filesystems are supported for reading and
writing. The Amiga currently knows six different filesystems:
============== ===============================================================
DOS\0 The old or original filesystem, not really suited for
hard disks and normally not used on them, either.
Supported read/write.
@ -23,6 +27,7 @@ DOS\4 The original filesystem with directory cache. The directory
sense on hard disks. Supported read only.
DOS\5 The Fast File System with directory cache. Supported read only.
============== ===============================================================
All of the above filesystems allow block sizes from 512 to 32K bytes.
Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
@ -36,14 +41,18 @@ are supported, too.
Mount options for the AFFS
==========================
protect If this option is set, the protection bits cannot be altered.
protect
If this option is set, the protection bits cannot be altered.
setuid[=uid] This sets the owner of all files and directories in the file
setuid[=uid]
This sets the owner of all files and directories in the file
system to uid or the uid of the current user, respectively.
setgid[=gid] Same as above, but for gid.
setgid[=gid]
Same as above, but for gid.
mode=mode Sets the mode flags to the given (octal) value, regardless
mode=mode
Sets the mode flags to the given (octal) value, regardless
of the original permissions. Directories will get an x
permission if the corresponding r bit is set.
This is useful since most of the plain AmigaOS files
@ -53,33 +62,41 @@ nofilenametruncate
The file system will return an error when filename exceeds
standard maximum filename length (30 characters).
reserved=num Sets the number of reserved blocks at the start of the
reserved=num
Sets the number of reserved blocks at the start of the
partition to num. You should never need this option.
Default is 2.
root=block Sets the block number of the root block. This should never
root=block
Sets the block number of the root block. This should never
be necessary.
bs=blksize Sets the blocksize to blksize. Valid block sizes are 512,
bs=blksize
Sets the blocksize to blksize. Valid block sizes are 512,
1024, 2048 and 4096. Like the root option, this should
never be necessary, as the affs can figure it out itself.
quiet The file system will not return an error for disallowed
quiet
The file system will not return an error for disallowed
mode changes.
verbose The volume name, file system type and block size will
verbose
The volume name, file system type and block size will
be written to the syslog when the filesystem is mounted.
mufs The filesystem is really a muFS, although it doesn't
mufs
The filesystem is really a muFS, although it doesn't
identify itself as one. This option is necessary if
the filesystem wasn't formatted as muFS, but is used
as one.
prefix=path Path will be prefixed to every absolute path name of
prefix=path
Path will be prefixed to every absolute path name of
symbolic links on an AFFS partition. Default = "/".
(See below.)
volume=name When symbolic links with an absolute path are created
volume=name
When symbolic links with an absolute path are created
on an AFFS partition, name will be prepended as the
volume name. Default = "" (empty string).
(See below.)
@ -119,7 +136,7 @@ The Linux rwxrwxrwx file mode is handled as follows:
- All other flags (suid, sgid, ...) are ignored and will
not be retained.
Newly created files and directories will get the user and group ID
of the current user and a mode according to the umask.
@ -148,11 +165,13 @@ might be "User", "WB" and "Graphics", the mount points /amiga/User,
Examples
========
Command line:
Command line::
mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
mount /dev/sda3 /Amiga -t affs
/etc/fstab entry:
/etc/fstab entry::
/dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
IMPORTANT NOTE
@ -170,7 +189,8 @@ before booting Windows!
If the damage is already done, the following should fix the RDB
(where <disk> is the device name).
DO AT YOUR OWN RISK:
DO AT YOUR OWN RISK::
dd if=/dev/<disk> of=rdb.tmp count=1
cp rdb.tmp rdb.fixed
@ -189,10 +209,14 @@ By default, filenames are truncated to 30 characters without warning.
'nofilenametruncate' mount option can change that behavior.
Case is ignored by the affs in filename matching, but Linux shells
do care about the case. Example (with /wb being an affs mounted fs):
do care about the case. Example (with /wb being an affs mounted fs)::
rm /wb/WRONGCASE
will remove /wb/wrongcase, but
will remove /wb/wrongcase, but::
rm /wb/WR*
will not since the names are matched by the shell.
The block allocation is designed for hard disk partitions. If more
@ -219,4 +243,4 @@ due to an incompatibility with the Amiga floppy controller.
If you are interested in an Amiga Emulator for Linux, look at
http://web.archive.org/web/*/http://www.freiburg.linux.de/~uae/
http://web.archive.org/web/%2E/http://www.freiburg.linux.de/~uae/


@ -1,8 +1,10 @@
====================
kAFS: AFS FILESYSTEM
====================
.. SPDX-License-Identifier: GPL-2.0
Contents:
====================
kAFS: AFS FILESYSTEM
====================
.. Contents:
- Overview.
- Usage.
@ -14,8 +16,7 @@ Contents:
- The @sys substitution.
========
OVERVIEW
Overview
========
This filesystem provides a fairly simple secure AFS filesystem driver. It is
@ -35,35 +36,33 @@ It does not yet support the following AFS features:
(*) pioctl() system call.
===========
COMPILATION
Compilation
===========
The filesystem should be enabled by turning on the kernel configuration
options:
options::
CONFIG_AF_RXRPC - The RxRPC protocol transport
CONFIG_RXKAD - The RxRPC Kerberos security handler
CONFIG_AFS - The AFS filesystem
Additionally, the following can be turned on to aid debugging:
Additionally, the following can be turned on to aid debugging::
CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled
CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled
They permit the debugging messages to be turned on dynamically by manipulating
the masks in the following files:
the masks in the following files::
/sys/module/af_rxrpc/parameters/debug
/sys/module/kafs/parameters/debug
=====
USAGE
Usage
=====
When inserting the driver modules the root cell must be specified along with a
list of volume location server IP addresses:
list of volume location server IP addresses::
modprobe rxrpc
modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
@ -77,14 +76,14 @@ The second module is the kerberos RxRPC security driver, and the third module
is the actual filesystem driver for the AFS filesystem.
Once the module has been loaded, more cells can be added by the following
procedure:
procedure::
echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
Where the parameters to the "add" command are the name of a cell and a list of
volume location servers within that cell, with the latter separated by colons.
Filesystems can be mounted anywhere by commands similar to the following:
Filesystems can be mounted anywhere by commands similar to the following::
mount -t afs "%cambridge.redhat.com:root.afs." /afs
mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
@ -104,8 +103,7 @@ named volume will be looked up in the cell specified during modprobe.
Additional cells can be added through /proc (see later section).
===========
MOUNTPOINTS
Mountpoints
===========
AFS has a concept of mountpoints. In AFS terms, these are specially formatted
@ -123,42 +121,40 @@ culled first. If all are culled, then the requested volume will also be
unmounted, otherwise error EBUSY will be returned.
This can be used by the administrator to attempt to unmount the whole AFS tree
mounted on /afs in one go by doing:
mounted on /afs in one go by doing::
umount /afs
============
DYNAMIC ROOT
Dynamic Root
============
A mount option is available to create a serverless mount that is only usable
for dynamic lookup. Creating such a mount can be done by, for example:
for dynamic lookup. Creating such a mount can be done by, for example::
mount -t afs none /afs -o dyn
This creates a mount that just has an empty directory at the root. Attempting
to look up a name in this directory will cause a mountpoint to be created that
looks up a cell of the same name, for example:
looks up a cell of the same name, for example::
ls /afs/grand.central.org/
===============
PROC FILESYSTEM
Proc Filesystem
===============
The AFS module creates a "/proc/fs/afs/" directory and populates it:
(*) A "cells" file that lists cells currently known to the afs module and
their usage counts:
their usage counts::
[root@andromeda ~]# cat /proc/fs/afs/cells
USE NAME
3 cambridge.redhat.com
(*) A directory per cell that contains files that list volume location
servers, volumes, and active servers known within that cell.
servers, volumes, and active servers known within that cell::
[root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers
USE ADDR STATE
@ -171,8 +167,7 @@ The AFS modules creates a "/proc/fs/afs/" directory and populates it:
1 Val 20000000 20000001 20000002 root.afs
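A script that wants the usage counts from the "cells" file could parse the listing as in the following Python sketch. It assumes the two-column USE/NAME layout shown in the sample output. Other kernel versions may print a different set of columns, and the helper name ``parse_cells`` is hypothetical.

```python
def parse_cells(text):
    """Parse a cells listing into (use_count, name) pairs."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # The first row is the "USE NAME" header, so skip it.
    return [(int(use), name) for use, name in rows[1:]]


# Sample text in the layout shown for /proc/fs/afs/cells.
sample = """USE NAME
  3 cambridge.redhat.com
"""
print(parse_cells(sample))
```

On a live system the text would come from reading /proc/fs/afs/cells instead of the sample string.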
=================
THE CELL DATABASE
The Cell Database
=================
The filesystem maintains an internal database of all the cells it knows and the
@ -181,7 +176,7 @@ the system belongs is added to the database when modprobe is performed by the
"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on
the kernel command line.
Further cells can be added by commands similar to the following:
Further cells can be added by commands similar to the following::
echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
@ -189,8 +184,7 @@ Further cells can be added by commands similar to the following:
No other cell database operations are available at this time.
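The "add" line has a fixed shape: a cell name followed by a colon-separated list of volume location server addresses. The Python sketch below builds it. The helper name ``cell_add_command`` is hypothetical, and restricting the addresses to IPv4 is a simplification made for this sketch.

```python
import ipaddress


def cell_add_command(cell_name, vl_addresses):
    """Build the line written to /proc/fs/afs/cells to add a cell."""
    if not vl_addresses:
        raise ValueError("at least one volume location server is required")
    for addr in vl_addresses:
        # Raises ValueError if addr is not a valid IPv4 address.
        ipaddress.IPv4Address(addr)
    return "add {} {}".format(cell_name, ":".join(vl_addresses))


line = cell_add_command(
    "grand.central.org", ["18.9.48.14", "128.2.203.61", "130.237.48.87"])
print(line)
```

Writing the resulting line to /proc/fs/afs/cells has the same effect as the echo command above and needs the same privileges.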
========
SECURITY
Security
========
Secure operations are initiated by acquiring a key using the klog program. A
@ -198,17 +192,17 @@ very primitive klog program is available at:
http://people.redhat.com/~dhowells/rxrpc/klog.c
This should be compiled by:
This should be compiled by::
make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils"
And then run as:
And then run as::
./klog
Assuming it's successful, this adds a key of type RxRPC, named for the service
and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or
by cat'ing /proc/keys:
by cat'ing /proc/keys::
[root@andromeda ~]# keyctl show
Session Keyring
@ -232,20 +226,19 @@ socket), then the operations on the file will be made with key that was used to
open the file.
=====================
THE @SYS SUBSTITUTION
The @sys Substitution
=====================
The list of up to 16 @sys substitutions for the current network namespace can
be configured by writing a list to /proc/fs/afs/sysname:
be configured by writing a list to /proc/fs/afs/sysname::
[root@andromeda ~]# echo foo amd64_linux_26 >/proc/fs/afs/sysname
or cleared entirely by writing an empty list:
or cleared entirely by writing an empty list::
[root@andromeda ~]# echo >/proc/fs/afs/sysname
The current list for the current network namespace can be retrieved by:
The current list for the current network namespace can be retrieved by::
[root@andromeda ~]# cat /proc/fs/afs/sysname
foo


@ -1,4 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
====================================================================
Miscellaneous Device control operations for the autofs kernel module
====================================================================
@ -36,24 +38,24 @@ For example, there are two types of automount maps, direct (in the kernel
module source you will see a third type called an offset, which is just
a direct mount in disguise) and indirect.
Here is a master map with direct and indirect map entries:
Here is a master map with direct and indirect map entries::
/- /etc/auto.direct
/test /etc/auto.indirect
/- /etc/auto.direct
/test /etc/auto.indirect
and the corresponding map files:
and the corresponding map files::
/etc/auto.direct:
/etc/auto.direct:
/automount/dparse/g6 budgie:/autofs/export1
/automount/dparse/g1 shark:/autofs/export1
and so on.
/automount/dparse/g6 budgie:/autofs/export1
/automount/dparse/g1 shark:/autofs/export1
and so on.
/etc/auto.indirect:
/etc/auto.indirect::
g1 shark:/autofs/export1
g6 budgie:/autofs/export1
and so on.
g1 shark:/autofs/export1
g6 budgie:/autofs/export1
and so on.
For the above indirect map an autofs file system is mounted on /test and
mounts are triggered for each sub-directory key by the inode lookup
@ -69,23 +71,23 @@ use the follow_link inode operation to trigger the mount.
But, each entry in direct and indirect maps can have offsets (making
them multi-mount map entries).
For example, an indirect mount map entry could also be:
For example, an indirect mount map entry could also be::
g1 \
/ shark:/autofs/export5/testing/test \
/s1 shark:/autofs/export/testing/test/s1 \
/s2 shark:/autofs/export5/testing/test/s2 \
/s1/ss1 shark:/autofs/export1 \
/s2/ss2 shark:/autofs/export2
g1 \
/ shark:/autofs/export5/testing/test \
/s1 shark:/autofs/export/testing/test/s1 \
/s2 shark:/autofs/export5/testing/test/s2 \
/s1/ss1 shark:/autofs/export1 \
/s2/ss2 shark:/autofs/export2
and similarly a direct mount map entry could also be:
and similarly a direct mount map entry could also be::
/automount/dparse/g1 \
/ shark:/autofs/export5/testing/test \
/s1 shark:/autofs/export/testing/test/s1 \
/s2 shark:/autofs/export5/testing/test/s2 \
/s1/ss1 shark:/autofs/export2 \
/s2/ss2 shark:/autofs/export2
/automount/dparse/g1 \
/ shark:/autofs/export5/testing/test \
/s1 shark:/autofs/export/testing/test/s1 \
/s2 shark:/autofs/export5/testing/test/s2 \
/s1/ss1 shark:/autofs/export2 \
/s2/ss2 shark:/autofs/export2
One of the issues with version 4 of autofs was that, when mounting an
entry with a large number of offsets, possibly with nesting, we needed
@ -170,32 +172,32 @@ autofs Miscellaneous Device mount control interface
The control interface is opening a device node, typically /dev/autofs.
All the ioctls use a common structure to pass the needed parameter
information and return operation results:
information and return operation results::
struct autofs_dev_ioctl {
__u32 ver_major;
__u32 ver_minor;
__u32 size; /* total size of data passed in
* including this struct */
__s32 ioctlfd; /* automount command fd */
struct autofs_dev_ioctl {
__u32 ver_major;
__u32 ver_minor;
__u32 size; /* total size of data passed in
* including this struct */
__s32 ioctlfd; /* automount command fd */
/* Command parameters */
union {
struct args_protover protover;
struct args_protosubver protosubver;
struct args_openmount openmount;
struct args_ready ready;
struct args_fail fail;
struct args_setpipefd setpipefd;
struct args_timeout timeout;
struct args_requester requester;
struct args_expire expire;
struct args_askumount askumount;
struct args_ismountpoint ismountpoint;
};
/* Command parameters */
union {
struct args_protover protover;
struct args_protosubver protosubver;
struct args_openmount openmount;
struct args_ready ready;
struct args_fail fail;
struct args_setpipefd setpipefd;
struct args_timeout timeout;
struct args_requester requester;
struct args_expire expire;
struct args_askumount askumount;
struct args_ismountpoint ismountpoint;
};
char path[0];
};
char path[0];
};
The ioctlfd field is a mount point file descriptor of an autofs mount
point. It is returned by the open call and is used by all calls except
@ -212,7 +214,7 @@ is used account for the increased structure length when translating the
structure sent from user space.
This structure can be initialized before setting specific fields by using
the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *).
the void function call init_autofs_dev_ioctl(``struct autofs_dev_ioctl *``).
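To make the layout concrete, the following Python sketch packs the fixed part of the structure with the ``struct`` module and appends an optional path. Several values are assumptions that this document does not state: an interface version of 1.1, an 8-byte parameter union, and -1 as the initial ioctlfd. The helper name ``pack_autofs_dev_ioctl`` is hypothetical and is not the kernel's init function.

```python
import struct

# Assumptions, not stated in this document: interface version 1.1,
# an 8-byte parameter union, and -1 as the "no descriptor" ioctlfd.
VER_MAJOR = 1
VER_MINOR = 1
UNION_SIZE = 8

# ver_major, ver_minor, size (__u32 each), ioctlfd (__s32), then the union.
HEADER = struct.Struct("=IIIi{}s".format(UNION_SIZE))


def pack_autofs_dev_ioctl(path=b""):
    """Return the bytes of an initialized structure followed by path."""
    # A path, when present, is passed NUL-terminated after the struct.
    tail = path + b"\0" if path else b""
    # size covers the struct itself plus any trailing data.
    size = HEADER.size + len(tail)
    return HEADER.pack(VER_MAJOR, VER_MINOR, size, -1, b"") + tail


buf = pack_autofs_dev_ioctl(b"/test")
print(len(buf), HEADER.unpack(buf[:HEADER.size])[:4])
```

The size field is filled in last because it depends on the length of the trailing path, which matches the "total size of data passed in including this struct" comment in the structure definition.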
All of the ioctls perform a copy of this structure from user space to
kernel space and return -EINVAL if the size parameter is smaller than


@ -1,48 +1,54 @@
.. SPDX-License-Identifier: GPL-2.0
=========================
BeOS filesystem for Linux
=========================
Document last updated: Dec 6, 2001
WARNING
Warning
=======
Make sure you understand that this is alpha software. This means that the
implementation is neither complete nor well-tested.
implementation is neither complete nor well-tested.
I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
LICENSE
=====
This software is covered by the GNU General Public License.
License
=======
This software is covered by the GNU General Public License.
See the file COPYING for the complete text of the license.
Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
AUTHOR
=====
Author
======
The largest part of the code was written by Will Dyson <will_dyson@pobox.com>.
He has been working on the code since Aug 13, 2001. See the changelog for
details.
Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
His original code can still be found at:
<http://hp.vector.co.jp/authors/VA008030/bfs/>
Does anyone know of a more current email address for Makoto? He doesn't
respond to the address given above...
This filesystem doesn't have a maintainer.
WHAT IS THIS DRIVER?
==================
This module implements the native filesystem of BeOS http://www.beincorporated.com/
What is this Driver?
====================
This module implements the native filesystem of BeOS http://www.beincorporated.com/
for the linux 2.4.1 and later kernels. Currently it is a read-only
implementation.
Which is it, BFS or BEFS?
================
Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
=========================
Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
But the UnixWare Boot Filesystem is called bfs, too, and it is already in
the kernel. Because of this naming conflict, on Linux the BeOS
filesystem is called befs.
HOW TO INSTALL
How to Install
==============
step 1. Install the BeFS patch into the source code tree of linux.
@ -54,16 +60,16 @@ is called patch-befs-xxx, you would do the following:
patch -p1 < /path/to/patch-befs-xxx
if the patching step fails (i.e. there are rejected hunks), you can try to
figure it out yourself (it shouldn't be hard), or mail the maintainer
figure it out yourself (it shouldn't be hard), or mail the maintainer
(Will Dyson <will_dyson@pobox.com>) for help.
step 2. Configuration & make kernel
The linux kernel has many compile-time options. Most of them are beyond the
scope of this document. I suggest the Kernel-HOWTO document as a good general
reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html
reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html
However, to use the BeFS module, you must enable it at configure time.
However, to use the BeFS module, you must enable it at configure time::
cd /foo/bar/linux
make menuconfig (or xconfig)
@ -82,35 +88,40 @@ step 3. Install
See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
instructions on this critical step.
USING BFS
Using BFS
=========
To use the BeOS filesystem, use filesystem type 'befs'.
ex)
ex::
mount -t befs /dev/fd0 /beos
MOUNT OPTIONS
Mount Options
=============
============= ===========================================================
uid=nnn All files in the partition will be owned by user id nnn.
gid=nnn All files in the partition will be in group nnn.
iocharset=xxx Use xxx as the name of the NLS translation table.
debug The driver will output debugging information to the syslog.
============= ===========================================================
HOW TO GET LATEST VERSION
How to Get Latest Version
==========================
The latest version is currently available at:
<http://befs-driver.sourceforge.net/>
ANY KNOWN BUGS?
===========
Any Known Bugs?
===============
As of Jan 20, 2002:
None
SPECIAL THANKS
Special Thanks
==============
Dominic Giampaolo ... Writing "Practical file system design with Be filesystem"
Hiroyuki Yamada ... Testing LinuxPPC.


@ -1,4 +1,7 @@
BFS FILESYSTEM FOR LINUX
.. SPDX-License-Identifier: GPL-2.0
========================
BFS Filesystem for Linux
========================
The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
@ -9,22 +12,22 @@ In order to access /stand partition under Linux you obviously need to
know the partition number and the kernel must support UnixWare disk slices
(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
depend on having UnixWare disklabel support because one can also mount
BFS filesystem via loopback:
BFS filesystem via loopback::
# losetup /dev/loop0 stand.img
# mount -t bfs /dev/loop0 /mnt/stand
# losetup /dev/loop0 stand.img
# mount -t bfs /dev/loop0 /mnt/stand
where stand.img is a file containing the image of BFS filesystem.
where stand.img is a file containing the image of BFS filesystem.
When you have finished using it and umounted you need to also deallocate
/dev/loop0 device by:
/dev/loop0 device by::
# losetup -d /dev/loop0
# losetup -d /dev/loop0
You can simplify mounting by just typing:
You can simplify mounting by just typing::
# mount -t bfs -o loop stand.img /mnt/stand
# mount -t bfs -o loop stand.img /mnt/stand
this will allocate the first available loopback device (and load loop.o
this will allocate the first available loopback device (and load loop.o
kernel module if necessary) automatically. If the loopback driver is not
loaded automatically, make sure that you have compiled the module and
that modprobe is functioning. Beware that umount will not deallocate
@ -33,21 +36,21 @@ that modprobe is functioning. Beware that umount will not deallocate
losetup(8). Read losetup(8) manpage for more info.
To create the BFS image under UnixWare you need to find out first which
slice contains it. The command prtvtoc(1M) is your friend:
slice contains it. The command prtvtoc(1M) is your friend::
# prtvtoc /dev/rdsk/c0b0t0d0s0
# prtvtoc /dev/rdsk/c0b0t0d0s0
(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
look for the slice with tag "STAND", which is usually slice 10. With this
information you can use dd(1) to create the BFS image:
information you can use dd(1) to create the BFS image::
# umount /stand
# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
# umount /stand
# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
Just in case, you can verify that you have done the right thing by checking
the magic number:
the magic number::
# od -Ad -tx4 stand.img | more
# od -Ad -tx4 stand.img | more
The first 4 bytes should be 0x1badface.
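The same check can be scripted. The Python sketch below compares the first four bytes of an image against the magic number, assuming the value is stored little-endian, as on an image taken from an x86 system. The helper name ``has_bfs_magic`` is hypothetical.

```python
import struct

BFS_MAGIC = 0x1BADFACE


def has_bfs_magic(first_bytes):
    """Check whether the first four bytes hold the BFS magic number."""
    if len(first_bytes) < 4:
        return False
    # Assumes a little-endian image, as produced on x86.
    (magic,) = struct.unpack("<I", first_bytes[:4])
    return magic == BFS_MAGIC


# With a real image the bytes would come from the file, for example:
#     with open("stand.img", "rb") as image:
#         print(has_bfs_magic(image.read(4)))
print(has_bfs_magic(struct.pack("<I", BFS_MAGIC)))
print(has_bfs_magic(b"\x00\x00\x00\x00"))
```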


@ -1,3 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
=====
BTRFS
=====


@ -1,3 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
============================
Ceph Distributed File System
============================
@ -15,6 +18,7 @@ Basic features include:
* Easy deployment: most FS components are userspace daemons
Also,
* Flexible snapshots (on any directory)
* Recursive accounting (nested files, directories, bytes)
@ -63,7 +67,7 @@ no 'du' or similar recursive scan of the file system is required.
Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy. Quotas can be set using
extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg:
extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg::
setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
getfattr -n ceph.quota.max_bytes /some/dir
@ -76,7 +80,7 @@ from writing as much data as it needs.
Mount Syntax
============
The basic mount syntax is:
The basic mount syntax is::
# mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
@ -84,7 +88,7 @@ You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4,
1.2.3.4::
# mount -t ceph 1.2.3.4:/ /mnt/ceph
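The device argument in the mount syntax above can be assembled programmatically. The following Python sketch builds the ``monip[:port][,monip2[:port]...]:/[subdir]`` string from a list of monitors. The helper name ``ceph_mount_source`` is hypothetical, and the second address and the port number in the usage lines are purely illustrative.

```python
def ceph_mount_source(monitors, subdir=""):
    """Build the device string monip[:port][,monip2[:port]...]:/[subdir]."""
    if not monitors:
        raise ValueError("at least one monitor is required")
    parts = []
    for host, port in monitors:
        # The port is optional and is left off when None.
        parts.append(host if port is None else "{}:{}".format(host, port))
    return "{}:/{}".format(",".join(parts), subdir.lstrip("/"))


# A single monitor on the default port, as in the example above.
print(ceph_mount_source([("1.2.3.4", None)]))
# Two monitors, one with an explicit (illustrative) port, and a subdirectory.
print(ceph_mount_source([("1.2.3.4", 6789), ("1.2.3.5", None)], "some/dir"))
```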
@ -163,14 +167,14 @@ Mount Options
available modes are "no" and "clean". The default is "no".
* no: never attempt to reconnect when client detects that it has been
blacklisted. Operations will generally fail after being blacklisted.
blacklisted. Operations will generally fail after being blacklisted.
* clean: client reconnects to the ceph cluster automatically when it
detects that it has been blacklisted. During reconnect, client drops
dirty data/metadata, invalidates page caches and writable file handles.
After reconnect, file locks become stale because the MDS loses track
of them. If an inode contains any stale file locks, read/write on the
inode is not allowed until applications release all stale file locks.
detects that it has been blacklisted. During reconnect, client drops
dirty data/metadata, invalidates page caches and writable file handles.
After reconnect, file locks become stale because the MDS loses track
of them. If an inode contains any stale file locks, read/write on the
inode is not allowed until applications release all stale file locks.
More Information
================
@ -179,8 +183,8 @@ For more information on Ceph, see the home page at
https://ceph.com/
The Linux kernel client source tree is available at
https://github.com/ceph/ceph-client.git
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
- https://github.com/ceph/ceph-client.git
- git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
and the source for the full system is at
https://github.com/ceph/ceph.git


@ -13,7 +13,7 @@ network by utilizing SMB or CIFS protocol.
In order to mount, the network stack will also need to be set up by
using 'ip=' config option. For more details, see
Documentation/filesystems/nfs/nfsroot.txt.
Documentation/admin-guide/nfs/nfsroot.rst.
A CIFS root mount currently requires the use of SMB1+UNIX Extensions
which is only supported by the Samba server. SMB1 is the older


@ -1,12 +1,15 @@
.. SPDX-License-Identifier: GPL-2.0
Cramfs - cram a filesystem onto a small ROM
===========================================
cramfs is designed to be simple and small, and to compress things well.
It uses the zlib routines to compress a file one page at a time, and
allows random page access. The meta-data is not compressed, but is
expressed in a very terse representation to make it use much less
diskspace than traditional filesystems.
You can't write to a cramfs filesystem (making it compressible and
compact also makes it _very_ hard to update on-the-fly), so you have to
@ -28,9 +31,9 @@ issue.
Hard links are supported, but hard linked files
will still have a link count of 1 in the cramfs image.
Cramfs directories have no `.' or `..' entries. Directories (like
Cramfs directories have no ``.`` or ``..`` entries. Directories (like
every other file on cramfs) always have a link count of 1. (There's
no need to use -noleaf in `find', btw.)
no need to use -noleaf in ``find``, btw.)
No timestamps are stored in a cramfs, so these default to the epoch
(1970 GMT). Recently-accessed files may have updated timestamps, but
@ -70,9 +73,9 @@ MTD drivers are cfi_cmdset_0001 (Intel/Sharp CFI flash) or physmap
(Flash device in physical memory map). MTD partitions based on such devices
are fine too. Then that device should be specified with the "mtd:" prefix
as the mount device argument. For example, to mount the MTD device named
"fs_partition" on the /mnt directory:
"fs_partition" on the /mnt directory::
$ mount -t cramfs mtd:fs_partition /mnt
To boot a kernel with this as root filesystem, it suffices to specify
something like "root=mtd:fs_partition" on the kernel command line.
@ -90,6 +93,7 @@ https://github.com/npitre/cramfs-tools
For /usr/share/magic
--------------------
===== ======================= =======================
0 ulelong 0x28cd3d45 Linux cramfs offset 0
>4 ulelong x size %d
>8 ulelong x flags 0x%x
@ -110,6 +114,7 @@ For /usr/share/magic
>552 ulelong x fsid.blocks %d
>556 ulelong x fsid.files %d
>560 string >\0 name "%.16s"
===== ======================= =======================
Hacker Notes


@ -1,4 +1,11 @@
Copyright 2009 Jonathan Corbet <corbet@lwn.net>
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
=======
DebugFS
=======
Copyright |copy| 2009 Jonathan Corbet <corbet@lwn.net>
Debugfs exists as a simple way for kernel developers to make information
available to user space. Unlike /proc, which is only meant for information
@ -6,11 +13,11 @@ about a process, or sysfs, which has strict one-value-per-file rules,
debugfs has no rules at all. Developers can put any information they want
there. The debugfs filesystem is also intended to not serve as a stable
ABI to user space; in theory, there are no stability constraints placed on
files exported there. The real world is not always so simple, though [1];
files exported there. The real world is not always so simple, though [1]_;
even debugfs interfaces are best designed with the idea that they will need
to be maintained forever.
Debugfs is typically mounted with a command like:
Debugfs is typically mounted with a command like::
mount -t debugfs none /sys/kernel/debug
@ -23,7 +30,7 @@ Note that the debugfs API is exported GPL-only to modules.
Code using debugfs should include <linux/debugfs.h>. Then, the first order
of business will be to create at least one directory to hold a set of
debugfs files:
debugfs files::
struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
@ -36,7 +43,7 @@ something went wrong. If ERR_PTR(-ENODEV) is returned, that is an
indication that the kernel has been built without debugfs support and none
of the functions described below will work.
The most general way to create a file within a debugfs directory is with:
The most general way to create a file within a debugfs directory is with::
struct dentry *debugfs_create_file(const char *name, umode_t mode,
struct dentry *parent, void *data,
@ -53,7 +60,7 @@ ERR_PTR(-ERROR) on error, or ERR_PTR(-ENODEV) if debugfs support is
missing.
To create a file with an initial size, the following function can be used
instead:
instead::
struct dentry *debugfs_create_file_size(const char *name, umode_t mode,
struct dentry *parent, void *data,
@ -66,7 +73,7 @@ as the function debugfs_create_file.
In a number of cases, the creation of a set of file operations is not
actually necessary; the debugfs code provides a number of helper functions
for simple situations. Files containing a single integer value can be
created with any of:
created with any of::
void debugfs_create_u8(const char *name, umode_t mode,
struct dentry *parent, u8 *value);
@ -80,7 +87,7 @@ created with any of:
These files support both reading and writing the given value; if a specific
file should not be written to, simply set the mode bits accordingly. The
values in these files are in decimal; if hexadecimal is more appropriate,
the following functions can be used instead:
the following functions can be used instead::
void debugfs_create_x8(const char *name, umode_t mode,
struct dentry *parent, u8 *value);
@ -94,7 +101,7 @@ the following functions can be used instead:
These functions are useful as long as the developer knows the size of the
value to be exported. Some types can have different widths on different
architectures, though, complicating the situation somewhat. There are
functions meant to help out in such special cases:
functions meant to help out in such special cases::
void debugfs_create_size_t(const char *name, umode_t mode,
struct dentry *parent, size_t *value);
@ -103,7 +110,7 @@ As might be expected, this function will create a debugfs file to represent
a variable of type size_t.
Similarly, there are helpers for variables of type unsigned long, in decimal
and hexadecimal:
and hexadecimal::
struct dentry *debugfs_create_ulong(const char *name, umode_t mode,
struct dentry *parent,
@ -111,7 +118,7 @@ and hexadecimal:
void debugfs_create_xul(const char *name, umode_t mode,
struct dentry *parent, unsigned long *value);
Boolean values can be placed in debugfs with:
Boolean values can be placed in debugfs with::
struct dentry *debugfs_create_bool(const char *name, umode_t mode,
struct dentry *parent, bool *value);
@ -120,7 +127,7 @@ A read on the resulting file will yield either Y (for non-zero values) or
N, followed by a newline. If written to, it will accept either upper- or
lower-case values, or 1 or 0. Any other input will be silently ignored.
Also, atomic_t values can be placed in debugfs with:
Also, atomic_t values can be placed in debugfs with::
void debugfs_create_atomic_t(const char *name, umode_t mode,
struct dentry *parent, atomic_t *value)
@ -129,7 +136,7 @@ A read of this file will get atomic_t values, and a write of this file
will set atomic_t values.
Another option is exporting a block of arbitrary binary data, with
this structure and function:
this structure and function::
struct debugfs_blob_wrapper {
void *data;
@ -151,7 +158,7 @@ If you want to dump a block of registers (something that happens quite
often during development, even if little such code reaches mainline),
debugfs offers two functions: one to make a registers-only file, and
another to insert a register block in the middle of another sequential
file.
file::
struct debugfs_reg32 {
char *name;
@ -175,7 +182,7 @@ The "base" argument may be 0, but you may want to build the reg32 array
using __stringify, and a number of register names (macros) are actually
byte offsets over a base for the register block.
If you want to dump a u32 array in debugfs, you can create a file with:
If you want to dump a u32 array in debugfs, you can create a file with::
void debugfs_create_u32_array(const char *name, umode_t mode,
struct dentry *parent,
@ -185,7 +192,7 @@ The "array" argument provides data, and the "elements" argument is
the number of elements in the array. Note: once the array is created, its
size cannot be changed.
There is a helper function to create a device-related seq_file:
There is a helper function to create a device-related seq_file::
struct dentry *debugfs_create_devm_seqfile(struct device *dev,
const char *name,
@ -197,14 +204,14 @@ The "dev" argument is the device related to this debugfs file, and
the "read_fn" is a function pointer to be called to print the
seq_file content.
There are a couple of other directory-oriented helper functions:
There are a couple of other directory-oriented helper functions::
struct dentry *debugfs_rename(struct dentry *old_dir,
struct dentry *old_dentry,
struct dentry *new_dir,
const char *new_name);
struct dentry *debugfs_create_symlink(const char *name,
struct dentry *parent,
const char *target);
@ -219,7 +226,7 @@ module is unloaded without explicitly removing debugfs entries, the result
will be a lot of stale pointers and no end of highly antisocial behavior.
So all debugfs users - at least those which can be built as modules - must
be prepared to remove all files and directories they create there. A file
can be removed with:
can be removed with::
void debugfs_remove(struct dentry *dentry);
@ -229,7 +236,7 @@ be removed.
Once upon a time, debugfs users were required to remember the dentry
pointer for every debugfs file they created so that all files could be
cleaned up. We live in more civilized times now, though, and debugfs users
can call:
can call::
void debugfs_remove_recursive(struct dentry *dentry);
@ -237,5 +244,4 @@ If this function is passed a pointer for the dentry corresponding to the
top-level directory, the entire hierarchy below that directory will be
removed.
Notes:
[1] http://lwn.net/Articles/309298/
.. [1] http://lwn.net/Articles/309298/


@ -1,20 +1,25 @@
dlmfs
==================
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
=====
DLMFS
=====
A minimal DLM userspace interface implemented via a virtual file
system.
dlmfs is built with OCFS2 as it requires most of its infrastructure.
Project web page: http://ocfs2.wiki.kernel.org
Tools web page: https://github.com/markfasheh/ocfs2-tools
OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
:Project web page: http://ocfs2.wiki.kernel.org
:Tools web page: https://github.com/markfasheh/ocfs2-tools
:OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
All code copyright 2005 Oracle except when otherwise noted.
CREDITS
Credits
=======
Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds
Some code taken from ramfs which is Copyright |copy| 2000 Linus Torvalds
and Transmeta Corp.
Mark Fasheh <mark.fasheh@oracle.com>
@ -96,14 +101,19 @@ operation. If the lock succeeds, you'll get an fd.
open(2) with O_CREAT to ensure the resource inode is created - dlmfs does
not automatically create inodes for existing lock resources.
============ ===========================
Open Flag Lock Request Type
--------- -----------------
============ ===========================
O_RDONLY Shared Read
O_RDWR Exclusive
============ ===========================
============ ===========================
Open Flag Resulting Locking Behavior
--------- --------------------------
============ ===========================
O_NONBLOCK Trylock operation
============ ===========================
You must provide exactly one of O_RDONLY or O_RDWR.


@ -1,14 +1,18 @@
.. SPDX-License-Identifier: GPL-2.0
======================================================
eCryptfs: A stacked cryptographic filesystem for Linux
======================================================
eCryptfs is free software. Please see the file COPYING for details.
For documentation, please see the files in the doc/ subdirectory. For
building and installation instructions please see the INSTALL file.
Maintainer: Phillip Hellewell
Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com>
Developers: Michael C. Thompson
Kent Yoder
Web Site: http://ecryptfs.sf.net
:Maintainer: Phillip Hellewell
:Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com>
:Developers: Michael C. Thompson
Kent Yoder
:Web Site: http://ecryptfs.sf.net
This software is currently undergoing development. Make sure to
maintain a backup copy of any data you write into eCryptfs.
@ -19,34 +23,36 @@ SourceForge site:
http://sourceforge.net/projects/ecryptfs/
Userspace requirements include:
- David Howells' userspace keyring headers and libraries (version
1.0 or higher), obtainable from
http://people.redhat.com/~dhowells/keyutils/
- Libgcrypt
NOTES
.. note::
In the beta/experimental releases of eCryptfs, when you upgrade
eCryptfs, you should copy the files to an unencrypted location and
then copy the files back into the new eCryptfs mount to migrate the
files.
MOUNT-WIDE PASSPHRASE
Mount-wide Passphrase
=====================
Create a new directory into which eCryptfs will write its encrypted
files (e.g., /root/crypt). Then, create the mount point directory
(e.g., /mnt/crypt). Now it's time to mount eCryptfs:
(e.g., /mnt/crypt). Now it's time to mount eCryptfs::
mount -t ecryptfs /root/crypt /mnt/crypt
You should be prompted for a passphrase and a salt (the salt may be
blank).
Try writing a new file:
Try writing a new file::
echo "Hello, World" > /mnt/crypt/hello.txt
The operation will complete. Notice that there is a new file in
/root/crypt that is at least 12288 bytes in size (depending on your
@ -59,10 +65,13 @@ keyctl clear @u
Then umount /mnt/crypt and mount again per the instructions given
above.
cat /mnt/crypt/hello.txt
::
cat /mnt/crypt/hello.txt
NOTES
Notes
=====
eCryptfs version 0.1 should only be mounted on (1) empty directories
or (2) directories containing files only created by eCryptfs. If you


@ -1,5 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
=======================================
efivarfs - a (U)EFI variable filesystem
=======================================
The efivarfs filesystem was created to address the shortcomings of
using entries in sysfs to maintain EFI variables. The old sysfs EFI
@ -11,7 +14,7 @@ than a single page, sysfs isn't the best interface for this.
Variables can be created, deleted and modified with the efivarfs
filesystem.
efivarfs is typically mounted like this,
efivarfs is typically mounted like this::
mount -t efivarfs none /sys/firmware/efi/efivars


@ -1,3 +1,9 @@
.. SPDX-License-Identifier: GPL-2.0
======================================
Enhanced Read-Only File System - EROFS
======================================
Overview
========
@ -6,6 +12,7 @@ from other read-only file systems, it aims to be designed for flexibility,
scalability, but be kept simple and high performance.
It is designed as a better filesystem solution for the following scenarios:
- read-only storage media or
- part of a fully trusted read-only solution, which means it needs to be
@ -17,6 +24,7 @@ It is designed as a better filesystem solution for the following scenarios:
for those embedded devices with limited memory (e.g., smartphones);
Here are the main features of EROFS:
- Little endian on-disk design;
- Currently 4KB block size (nobh) and therefore maximum 16TB address space;
@ -24,13 +32,17 @@ Here is the main features of EROFS:
- Metadata & data could be mixed by design;
- 2 inode versions for different requirements:
===================== ============ =====================================
compact (v1) extended (v2)
Inode metadata size: 32 bytes 64 bytes
Max file size: 4 GB 16 EB (also limited by max. vol size)
Max uids/gids: 65536 4294967296
File change time: no yes (64 + 32-bit timestamp)
Max hardlinks: 65536 4294967296
Metadata reserved: 4 bytes 14 bytes
===================== ============ =====================================
Inode metadata size 32 bytes 64 bytes
Max file size 4 GB 16 EB (also limited by max. vol size)
Max uids/gids 65536 4294967296
File change time no yes (64 + 32-bit timestamp)
Max hardlinks 65536 4294967296
Metadata reserved 4 bytes 14 bytes
===================== ============ =====================================
- Support extended attributes (xattrs) as an option;
@ -43,29 +55,36 @@ Here is the main features of EROFS:
The following git tree provides the file system user-space tools under
development (e.g., the formatting tool mkfs.erofs):
>> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
Bug reports and patches are welcome; please send them to the following
linux-erofs mailing list:
>> linux-erofs mailing list <linux-erofs@lists.ozlabs.org>
- linux-erofs mailing list <linux-erofs@lists.ozlabs.org>
Mount options
=============
=================== =========================================================
(no)user_xattr Setup Extended User Attributes. Note: xattr is enabled
by default if CONFIG_EROFS_FS_XATTR is selected.
(no)acl Setup POSIX Access Control List. Note: acl is enabled
by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
cache_strategy=%s Select a strategy for cached decompression from now on:
disabled: In-place I/O decompression only;
readahead: Cache the last incomplete compressed physical
========== =============================================
disabled In-place I/O decompression only;
readahead Cache the last incomplete compressed physical
cluster for further reading. It still does
in-place I/O decompression for the rest
compressed physical clusters;
readaround: Cache both ends of incomplete compressed
readaround Cache both ends of incomplete compressed
physical clusters for further reading.
It still does in-place I/O decompression
for the remaining compressed physical clusters.
========== =============================================
=================== =========================================================
On-disk details
===============
@ -73,7 +92,7 @@ On-disk details
Summary
-------
Different from other read-only file systems, an EROFS volume is designed
to be as simple as possible:
to be as simple as possible::
|-> aligned with the block size
____________________________________________________________
@ -83,41 +102,45 @@ to be as simple as possible:
All data areas should be aligned with the block size, but metadata areas
need not be. All metadata can now be observed in two different spaces (views):
1. Inode metadata space
Each valid inode should be aligned with an inode slot, which is a fixed
value (32 bytes) and designed to be kept in line with compact inode size.
Each inode can be directly found with the following formula:
inode offset = meta_blkaddr * block_size + 32 * nid
|-> aligned with 8B
|-> followed closely
+ meta_blkaddr blocks |-> another slot
_____________________________________________________________________
| ... | inode | xattrs | extents | data inline | ... | inode ...
|________|_______|(optional)|(optional)|__(optional)_|_____|__________
|-> aligned with the inode slot size
. .
. .
. .
. .
. .
. .
.____________________________________________________|-> aligned with 4B
| xattr_ibody_header | shared xattrs | inline xattrs |
|____________________|_______________|_______________|
|-> 12 bytes <-|->x * 4 bytes<-| .
. . .
. . .
. . .
._______________________________.______________________.
| id | id | id | id | ... | id | ent | ... | ent| ... |
|____|____|____|____|______|____|_____|_____|____|_____|
|-> aligned with 4B
|-> aligned with 4B
::
|-> aligned with 8B
|-> followed closely
+ meta_blkaddr blocks |-> another slot
_____________________________________________________________________
| ... | inode | xattrs | extents | data inline | ... | inode ...
|________|_______|(optional)|(optional)|__(optional)_|_____|__________
|-> aligned with the inode slot size
. .
. .
. .
. .
. .
. .
.____________________________________________________|-> aligned with 4B
| xattr_ibody_header | shared xattrs | inline xattrs |
|____________________|_______________|_______________|
|-> 12 bytes <-|->x * 4 bytes<-| .
. . .
. . .
. . .
._______________________________.______________________.
| id | id | id | id | ... | id | ent | ... | ent| ... |
|____|____|____|____|______|____|_____|_____|____|_____|
|-> aligned with 4B
|-> aligned with 4B
An inode can be 32 or 64 bytes; the two sizes are distinguished by a
field common to all inode versions -- i_format:
field common to all inode versions -- i_format::
__________________ __________________
| i_format | | i_format |
@ -132,16 +155,19 @@ may not. All metadatas can be now observed in two different spaces (views):
proper alignment, and they could be optional for different data mappings.
Currently, a total of 4 valid data mappings are supported:
== ====================================================================
0 flat file data without data inline (no extent);
1 fixed-sized output data compression (with non-compacted indexes);
2 flat file data with tail packing data inline (no extent);
3 fixed-sized output data compression (with compacted indexes, v5.3+).
== ====================================================================
The size of the optional xattrs is indicated by i_xattr_count in inode
header. Large xattrs or xattrs shared by many different files can be
stored in shared xattrs metadata rather than inlined right after inode.
2. Shared xattrs metadata space
The shared xattrs space is similar to the inode space above: it starts at
a specific block indicated by xattr_blkaddr, with entries laid out one
after another with proper alignment.
@ -149,11 +175,13 @@ may not. All metadatas can be now observed in two different spaces (views):
Each shared xattr can also be found directly with the following formula:
xattr offset = xattr_blkaddr * block_size + 4 * xattr_id
|-> aligned by 4 bytes
+ xattr_blkaddr blocks |-> aligned with 4 bytes
_________________________________________________________________________
| ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ...
|________|_____________|_____________|_____|______________|_______________
::
|-> aligned by 4 bytes
+ xattr_blkaddr blocks |-> aligned with 4 bytes
_________________________________________________________________________
| ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ...
|________|_____________|_____________|_____|______________|_______________
Directories
-----------
@ -163,19 +191,21 @@ random file lookup, and all directory entries are _strictly_ recorded in
alphabetical order in order to support improved prefix binary search
algorithm (see the related source code).
___________________________
/ |
/ ______________|________________
/ / | nameoff1 | nameoffN-1
____________.______________._______________v________________v__________
| dirent | dirent | ... | dirent | filename | filename | ... | filename |
|___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
\ ^
\ | * could have
\ | trailing '\0'
\________________________| nameoff0
::
Directory block
___________________________
/ |
/ ______________|________________
/ / | nameoff1 | nameoffN-1
____________.______________._______________v________________v__________
| dirent | dirent | ... | dirent | filename | filename | ... | filename |
|___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
\ ^
\ | * could have
\ | trailing '\0'
\________________________| nameoff0
Directory block
Note that apart from the offset of the first filename, nameoff0 also indicates
the total number of directory entries in this block since there is no need to
@ -184,28 +214,27 @@ introduce another on-disk field at all.
Compression
-----------
Currently, EROFS supports 4KB fixed-sized output transparent file compression,
as illustrated below:
as illustrated below::
|---- Variant-Length Extent ----|-------- VLE --------|----- VLE -----
clusterofs clusterofs clusterofs
| | | logical data
_________v_______________________________v_____________________v_______________
... | . | | . | | . | ...
____|____.________|_____________|________.____|_____________|__.__________|____
|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|
size size size size size
. . . .
. . . .
. . . .
_______._____________._____________._____________._____________________
... | | | | ... physical data
_______|_____________|_____________|_____________|_____________________
|-> cluster <-|-> cluster <-|-> cluster <-|
size size size
|---- Variant-Length Extent ----|-------- VLE --------|----- VLE -----
clusterofs clusterofs clusterofs
| | | logical data
_________v_______________________________v_____________________v_______________
... | . | | . | | . | ...
____|____.________|_____________|________.____|_____________|__.__________|____
|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|
size size size size size
. . . .
. . . .
. . . .
_______._____________._____________._____________._____________________
... | | | | ... physical data
_______|_____________|_____________|_____________|_____________________
|-> cluster <-|-> cluster <-|-> cluster <-|
size size size
Currently each on-disk physical cluster can contain 4KB (un)compressed data
at most. For each logical cluster, there is a corresponding on-disk index to
describe its cluster type, physical cluster address, etc.
See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.


@ -1,3 +1,5 @@
.. SPDX-License-Identifier: GPL-2.0
The Second Extended Filesystem
==============================
@ -14,8 +16,9 @@ Options
Most defaults are determined by the filesystem superblock, and can be
set using tune2fs(8). Kernel-determined defaults are indicated by (*).
bsddf (*) Makes `df' act like BSD.
minixdf Makes `df' act like Minix.
==================== === ================================================
bsddf (*) Makes ``df`` act like BSD.
minixdf Makes ``df`` act like Minix.
check=none, nocheck (*) Don't do extra checking of bitmaps on mount
(check=normal and check=strict options removed)
@ -62,6 +65,7 @@ quota, usrquota Enable user disk quota support
grpquota Enable group disk quota support
(requires CONFIG_QUOTA).
==================== === ================================================
The noquota option is silently ignored by ext2.
@ -294,9 +298,9 @@ respective fsck programs.
If you're exceptionally paranoid, there are 3 ways of making metadata
writes synchronous on ext2:
per-file if you have the program source: use the O_SYNC flag to open()
per-file if you don't have the source: use "chattr +S" on the file
per-filesystem: add the "sync" option to mount (or in /etc/fstab)
- per-file if you have the program source: use the O_SYNC flag to open()
- per-file if you don't have the source: use "chattr +S" on the file
- per-filesystem: add the "sync" option to mount (or in /etc/fstab)
The first and last are not ext2-specific but do force the metadata to
be written synchronously. See also Journaling below.
@ -316,10 +320,12 @@ Most of these limits could be overcome with slight changes in the on-disk
format and using a compatibility flag to signal the format change (at
the expense of some compatibility).
Filesystem block size: 1kB 2kB 4kB 8kB
File size limit: 16GB 256GB 2048GB 2048GB
Filesystem size limit: 2047GB 8192GB 16384GB 32768GB
===================== ======= ======= ======= ========
Filesystem block size 1kB 2kB 4kB 8kB
===================== ======= ======= ======= ========
File size limit 16GB 256GB 2048GB 2048GB
Filesystem size limit 2047GB 8192GB 16384GB 32768GB
===================== ======= ======= ======= ========
There is a 2.4 kernel limit of 2048GB for a single block device, so no
filesystem larger than that can be created at this time. There is also
@ -370,19 +376,24 @@ ext4 and journaling.
References
==========
======================= ===============================================
The kernel source file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Filesystem Resizing http://ext2resize.sourceforge.net/
Compression (*) http://e2compr.sourceforge.net/
Compression [1]_ http://e2compr.sourceforge.net/
======================= ===============================================
Implementations for:
Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs
Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2
DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/
(*) no longer actively developed/supported (as of Apr 2001)
(+) no longer actively developed/supported (as of Mar 2009)
======================= ===========================================================
Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs
Windows 95 [1]_ http://www.yipton.net/content.html#FSDEXT2
DOS client [1]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2 [2]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/
======================= ===========================================================
.. [1] no longer actively developed/supported (as of Apr 2001)
.. [2] no longer actively developed/supported (as of Mar 2009)


@ -1,4 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
===============
Ext3 Filesystem
===============


@ -1,6 +1,8 @@
================================================================================
.. SPDX-License-Identifier: GPL-2.0
==========================================
WHAT IS Flash-Friendly File System (F2FS)?
================================================================================
==========================================
NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
been equipped on a variety of systems ranging from mobile to server systems. Since
@ -20,14 +22,15 @@ layout, but also for selecting allocation and cleaning algorithms.
The following git tree provides the file system formatting tool (mkfs.f2fs),
a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs).
>> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
For reporting bugs and sending patches, please use the following mailing list:
>> linux-f2fs-devel@lists.sourceforge.net
================================================================================
BACKGROUND AND DESIGN ISSUES
================================================================================
- linux-f2fs-devel@lists.sourceforge.net
Background and Design issues
============================
Log-structured File System (LFS)
--------------------------------
@ -61,6 +64,7 @@ needs to reclaim these obsolete blocks seamlessly to users. This job is called
a cleaning process.
The process consists of three operations as follows.
1. A victim segment is selected through referencing segment usage table.
2. It loads parent index structures of all the data in the victim identified by
segment summary blocks.
@ -71,9 +75,8 @@ This cleaning job may cause unexpected long delays, so the most important goal
is to hide the latencies to users. And also definitely, it should reduce the
amount of valid data to be moved, and move them quickly as well.
================================================================================
KEY FEATURES
================================================================================
Key Features
============
Flash Awareness
---------------
@ -94,10 +97,11 @@ Cleaning Overhead
- Support multi-head logs for static/dynamic hot and cold data separation
- Introduce adaptive logging for efficient block allocation
================================================================================
MOUNT OPTIONS
================================================================================
Mount Options
=============
====================== ============================================================
background_gc=%s Turn on/off cleaning operations, namely garbage
collection, triggered in background when I/O subsystem is
idle. If background_gc=on, it will turn on the garbage
@ -167,7 +171,10 @@ fault_injection=%d Enable fault injection in all supported types with
fault_type=%d Support configuring fault injection type, should be
enabled with fault_injection option, fault type value
is shown below, it supports single or combined type.
=================== ===========
Type_Name Type_Value
=================== ===========
FAULT_KMALLOC 0x000000001
FAULT_KVMALLOC 0x000000002
FAULT_PAGE_ALLOC 0x000000004
@ -183,6 +190,7 @@ fault_type=%d Support configuring fault injection type, should be
FAULT_CHECKPOINT 0x000001000
FAULT_DISCARD 0x000002000
FAULT_WRITE_IO 0x000004000
=================== ===========
mode=%s Control block allocation mode which supports "adaptive"
and "lfs". In "lfs" mode, there should be no random
writes towards main area.
@ -219,7 +227,7 @@ fsync_mode=%s Control the policy of fsync. Currently supports "posix",
non-atomic files likewise "nobarrier" mount option.
test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt
context. The fake fscrypt context is used by xfstests.
checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
to reenable checkpointing. Is enabled by default. While
disabled, any unmounting or unexpected shutdowns will cause
the filesystem contents to appear as they did when the
@ -246,22 +254,22 @@ compress_extension=%s Support adding specified extension, so that f2fs can enab
on compression extension list and enable compression on
these file by default rather than to enable it via ioctl.
For other files, we can still enable compression via ioctl.
====================== ============================================================
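Because fault_type accepts single or combined types, the values in the Type_Name table can be OR-ed together. The following is a minimal sketch in Python, using only the type values visible in this excerpt; the helper names are illustrative and are not part of f2fs or its tools.

```python
# Fault type values copied from the Type_Name / Type_Value table above.
# Only the entries visible in this excerpt are listed.
FAULT_TYPES = {
    "FAULT_KMALLOC": 0x000000001,
    "FAULT_KVMALLOC": 0x000000002,
    "FAULT_PAGE_ALLOC": 0x000000004,
    "FAULT_CHECKPOINT": 0x000001000,
    "FAULT_DISCARD": 0x000002000,
    "FAULT_WRITE_IO": 0x000004000,
}


def combine_fault_types(*names):
    """OR the named fault types together into one bitmask."""
    mask = 0
    for name in names:
        mask |= FAULT_TYPES[name]
    return mask


def fault_type_option(*names):
    """Format the combined mask as a fault_type=%d mount option string."""
    return "fault_type=%d" % combine_fault_types(*names)
```

For example, `fault_type_option("FAULT_KMALLOC", "FAULT_PAGE_ALLOC")` yields `fault_type=5`. As the table notes, the option only takes effect together with fault_injection.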
================================================================================
DEBUGFS ENTRIES
================================================================================
Debugfs Entries
===============
/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as
f2fs. Each file shows the whole f2fs information.
/sys/kernel/debug/f2fs/status includes:
- major file system information managed by f2fs currently
- average SIT information about whole segments
- current memory footprint consumed by f2fs.
================================================================================
SYSFS ENTRIES
================================================================================
Sysfs Entries
=============
Information about mounted f2fs file systems can be found in
/sys/fs/f2fs. Each mounted filesystem will have a directory in
@ -271,22 +279,24 @@ The files in each per-device directory are shown in table below.
Files in /sys/fs/f2fs/<devname>
(see also Documentation/ABI/testing/sysfs-fs-f2fs)
================================================================================
USAGE
================================================================================
Usage
=====
1. Download userland tools and compile them.
2. Skip, if f2fs was compiled statically inside kernel.
Otherwise, insert the f2fs.ko module.
# insmod f2fs.ko
Otherwise, insert the f2fs.ko module::
3. Create a directory trying to mount
# mkdir /mnt/f2fs
# insmod f2fs.ko
4. Format the block device, and then mount as f2fs
# mkfs.f2fs -l label /dev/block_device
# mount -t f2fs /dev/block_device /mnt/f2fs
3. Create a directory trying to mount::
# mkdir /mnt/f2fs
4. Format the block device, and then mount as f2fs::
# mkfs.f2fs -l label /dev/block_device
# mount -t f2fs /dev/block_device /mnt/f2fs
mkfs.f2fs
---------
@ -294,18 +304,26 @@ The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem,
which builds a basic on-disk layout.
The options consist of:
-l [label] : Give a volume label, up to 512 unicode name.
-a [0 or 1] : Split start location of each area for heap-based allocation.
1 is set by default, which performs this.
-o [int] : Set overprovision ratio in percent over volume size.
5 is set by default.
-s [int] : Set the number of segments per section.
1 is set by default.
-z [int] : Set the number of sections per zone.
1 is set by default.
-e [str] : Set basic extension list. e.g. "mp3,gif,mov"
-t [0 or 1] : Disable discard command or not.
1 is set by default, which conducts discard.
=============== ===========================================================
``-l [label]`` Give a volume label, up to 512 unicode name.
``-a [0 or 1]`` Split start location of each area for heap-based allocation.
1 is set by default, which performs this.
``-o [int]`` Set overprovision ratio in percent over volume size.
5 is set by default.
``-s [int]`` Set the number of segments per section.
1 is set by default.
``-z [int]`` Set the number of sections per zone.
1 is set by default.
``-e [str]`` Set basic extension list. e.g. "mp3,gif,mov"
``-t [0 or 1]`` Disable discard command or not.
1 is set by default, which conducts discard.
=============== ===========================================================
fsck.f2fs
---------
@ -314,7 +332,8 @@ partition, which examines whether the filesystem metadata and user-made data
are cross-referenced correctly or not.
Note that, initial version of the tool does not fix any inconsistency.
The options consist of:
The options consist of::
-d debug level [default:0]
dump.f2fs
@ -327,20 +346,21 @@ It shows on-disk inode information recognized by a given inode number, and is
able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and
./dump_sit respectively.
The options consist of:
The options consist of::
-d debug level [default:0]
-i inode no (hex)
-s [SIT dump segno from #1~#2 (decimal), for all 0~-1]
-a [SSA dump segno from #1~#2 (decimal), for all 0~-1]
Examples:
# dump.f2fs -i [ino] /dev/sdx
# dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
# dump.f2fs -a 0~-1 /dev/sdx (SSA dump)
Examples::
================================================================================
DESIGN
================================================================================
# dump.f2fs -i [ino] /dev/sdx
# dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
# dump.f2fs -a 0~-1 /dev/sdx (SSA dump)
Design
======
On-disk Layout
--------------
@ -351,7 +371,7 @@ consists of a set of sections. By default, section and zone sizes are set to one
segment size identically, but users can easily modify the sizes by mkfs.
F2FS splits the entire volume into six areas, and all the areas except superblock
consists of multiple segments as described below.
consists of multiple segments as described below::
align with the zone size <-|
|-> align with the segment size
@ -373,28 +393,28 @@ consists of multiple segments as described below.
|__zone__|
- Superblock (SB)
: It is located at the beginning of the partition, and there exist two copies
It is located at the beginning of the partition, and there exist two copies
to avoid file system crash. It contains basic partition information and some
default parameters of f2fs.
- Checkpoint (CP)
: It contains file system information, bitmaps for valid NAT/SIT sets, orphan
It contains file system information, bitmaps for valid NAT/SIT sets, orphan
inode lists, and summary entries of current active segments.
- Segment Information Table (SIT)
: It contains segment information such as valid block count and bitmap for the
It contains segment information such as valid block count and bitmap for the
validity of all the blocks.
- Node Address Table (NAT)
: It is composed of a block address table for all the node blocks stored in
It is composed of a block address table for all the node blocks stored in
Main area.
- Segment Summary Area (SSA)
: It contains summary entries which contains the owner information of all the
It contains summary entries which contains the owner information of all the
data and node blocks stored in Main area.
- Main Area
: It contains file and directory data including their indices.
It contains file and directory data including their indices.
In order to avoid misalignment between file system and flash-based storage, F2FS
aligns the start block address of CP with the segment size. Also, it aligns the
@ -414,7 +434,7 @@ One of them always indicates the last valid data, which is called as shadow copy
mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism.
For file system consistency, each CP points to which NAT and SIT copies are
valid, as shown as below.
valid, as shown as below::
+--------+----------+---------+
| CP | SIT | NAT |
@ -438,7 +458,7 @@ indirect node. F2FS assigns 4KB to an inode block which contains 923 data block
indices, two direct node pointers, two indirect node pointers, and one double
indirect node pointer as described below. One direct node block contains 1018
data blocks, and one indirect node block contains also 1018 node blocks. Thus,
one inode block (i.e., a file) covers:
one inode block (i.e., a file) covers::
4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
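The coverage figure above can be reproduced with a short Python sketch. The constant and function names are illustrative, not f2fs identifiers, and the result is expressed in binary units (2^40 bytes per TB), an assumption that matches the 3.94 figure.

```python
BLOCK_SIZE = 4 * 1024        # one block is 4KB
INODE_DATA_INDICES = 923     # data block indices held directly in the inode
ENTRIES_PER_NODE = 1018      # entries in one direct or indirect node block


def max_file_blocks():
    """Count the data blocks reachable from a single inode block."""
    direct = 2 * ENTRIES_PER_NODE              # two direct node pointers
    indirect = 2 * ENTRIES_PER_NODE ** 2       # two indirect node pointers
    double_indirect = ENTRIES_PER_NODE ** 3    # one double indirect pointer
    return INODE_DATA_INDICES + direct + indirect + double_indirect


def max_file_size_tib():
    """Return the covered size in units of 2**40 bytes."""
    return max_file_blocks() * BLOCK_SIZE / 2 ** 40
```

Rounded to two decimals, `max_file_size_tib()` gives 3.94.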
@ -473,6 +493,8 @@ A dentry block consists of 214 dentry slots and file names. Therein a bitmap is
used to represent whether each dentry is valid or not. A dentry block occupies
4KB with the following composition.
::
Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
dentries(11 * 214 bytes) + file name (8 * 214 bytes)
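As a sanity check on the composition above, the parts can be summed in a few lines of Python. The names are illustrative; the only added assumption is that the bitmap holds one bit per dentry slot, which is where 27 bytes comes from.

```python
import math

DENTRY_SLOTS = 214
BITMAP_BYTES = math.ceil(DENTRY_SLOTS / 8)   # one bit per slot, rounded up
RESERVED_BYTES = 3
DENTRY_BYTES = 11                            # size of one dentry
NAME_SLOT_BYTES = 8                          # file name bytes per slot


def dentry_block_size():
    """Sum the bitmap, reserved area, dentries and file name slots."""
    return (BITMAP_BYTES + RESERVED_BYTES
            + DENTRY_BYTES * DENTRY_SLOTS
            + NAME_SLOT_BYTES * DENTRY_SLOTS)
```

The parts add up to exactly 4096 bytes, so a dentry block fills one 4KB block with no slack.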
@ -498,23 +520,25 @@ F2FS implements multi-level hash tables for directory structure. Each level has
a hash table with dedicated number of hash buckets as shown below. Note that
"A(2B)" means a bucket includes 2 data blocks.
----------------------
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------
::
level #0 | A(2B)
|
level #1 | A(2B) - A(2B)
|
level #2 | A(2B) - A(2B) - A(2B) - A(2B)
. | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
----------------------
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------
The number of blocks and buckets are determined by,
level #0 | A(2B)
|
level #1 | A(2B) - A(2B)
|
level #2 | A(2B) - A(2B) - A(2B) - A(2B)
. | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
The number of blocks and buckets are determined by::
,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
# of blocks in level #n = |
@ -532,7 +556,7 @@ dentry consisting of the file name and its inode number. If not found, F2FS
scans the next hash table in level #1. In this way, F2FS scans hash tables in
each level incrementally from 1 to N. In each level F2FS needs to scan only
one bucket determined by the following equation, which shows O(log(# of files))
complexity.
complexity::
bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
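The bucket selection can be sketched in Python as follows. The number of buckets per level is passed in by the caller rather than computed, since the formula for it is not fully shown here; the function names are hypothetical and do not come from the f2fs sources.

```python
def bucket_to_scan(hash_value, buckets_in_level):
    """Pick the single bucket to scan in one level."""
    if buckets_in_level <= 0:
        raise ValueError("a level must have at least one bucket")
    return hash_value % buckets_in_level


def buckets_to_scan(hash_value, buckets_per_level):
    """Return the bucket index to scan in each level, in level order."""
    return [bucket_to_scan(hash_value, n) for n in buckets_per_level]
```

With the bucket counts 1, 2 and 4 drawn for levels #0 to #2 in the figure, `buckets_to_scan(13, [1, 2, 4])` returns `[0, 1, 1]`: one bucket per level, which is what keeps the lookup cost proportional to the number of levels.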
@ -540,7 +564,8 @@ In the case of file creation, F2FS finds empty consecutive slots that cover the
file name. F2FS searches the empty slots in the hash tables of whole levels from
1 to N in the same way as the lookup operation.
The following figure shows an example of two cases holding children.
The following figure shows an example of two cases holding children::
--------------> Dir <--------------
| |
child child
@ -611,14 +636,15 @@ Write-hint Policy
2) whint_mode=user-based. F2FS tries to pass down hints given by
users.
===================== ======================== ===================
User F2FS Block
---- ---- -----
===================== ======================== ===================
META WRITE_LIFE_NOT_SET
HOT_NODE "
WARM_NODE "
COLD_NODE "
*ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
*extension list " "
ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
extension list " "
-- buffered io
WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
@ -635,11 +661,13 @@ WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE " WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG " WRITE_LIFE_LONG
===================== ======================== ===================
3) whint_mode=fs-based. F2FS passes down hints with its policy.
===================== ======================== ===================
User F2FS Block
---- ---- -----
===================== ======================== ===================
META WRITE_LIFE_MEDIUM;
HOT_NODE WRITE_LIFE_NOT_SET
WARM_NODE "
@ -662,6 +690,7 @@ WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE " WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG " WRITE_LIFE_LONG
===================== ======================== ===================
Fallocate(2) Policy
-------------------
@ -681,6 +710,7 @@ Allocating disk space
However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) prior to
fallocate(fd, DEFAULT_MODE), it allocates on-disk block addresses having
zero or random data, which is useful to the below scenario where:
1. create(fd)
2. ioctl(fd, F2FS_IOC_SET_PIN_FILE)
3. fallocate(fd, 0, 0, size)
@ -692,39 +722,41 @@ Compression implementation
--------------------------
- New term named cluster is defined as basic unit of compression, file can
be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not.
be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not.
- In cluster metadata layout, one special block address is used to indicate
cluster is compressed one or normal one, for compressed cluster, following
metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs
stores data including compress header and compressed data.
cluster is compressed one or normal one, for compressed cluster, following
metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs
stores data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
support compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold.
support compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
[Dnode Structure]
+-----------------------------------------------+
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
. . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
. .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout::
[Dnode Structure]
+-----------------------------------------------+
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
. . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
. .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
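The cluster arithmetic described above can be illustrated with a small Python sketch. It reads the range "[1, 4 << n - 1]" as 1 to (4 << n) - 1, an assumption that agrees with the diagram, where a four-slot compressed cluster holds the compr flag plus three blocks. The function names are illustrative.

```python
def cluster_pages(n):
    """Number of logical pages in one cluster: 4 << n, for n >= 0."""
    if n < 0:
        raise ValueError("n must be >= 0")
    return 4 << n


def compressed_block_range(n):
    """Inclusive range of physical blocks a compressed cluster may map to.

    Assumes the upper bound is (4 << n) - 1, since one slot of the
    cluster is taken by the special 'compressed' block address.
    """
    return 1, cluster_pages(n) - 1
```

For n = 0 this gives a 4-page cluster that compresses into 1 to 3 physical blocks. Note that in Python, as in C, `4 << n - 1` without parentheses parses as `4 << (n - 1)`, so the explicit parentheses matter.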


@ -1,7 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
==============
====
FUSE
==============
====
Definitions
===========


@ -1,14 +1,18 @@
uevents and GFS2
==================
.. SPDX-License-Identifier: GPL-2.0
================
uevents and GFS2
================
During the lifetime of a GFS2 mount, a number of uevents are generated.
This document explains what the events are and what they are used
for (by gfs_controld in gfs2-utils).
A list of GFS2 uevents
-----------------------
======================
1. ADD
------
The ADD event occurs at mount time. It will always be the first
uevent generated by the newly created filesystem. If the mount
@ -21,6 +25,7 @@ with no journal assigned), and read-only (with journal assigned) status
of the filesystem respectively.
2. ONLINE
---------
The ONLINE uevent is generated after a successful mount or remount. It
has the same environment variables as the ADD uevent. The ONLINE
@ -29,6 +34,7 @@ RDONLY are a relatively recent addition (2.6.32-rc+) and will not
be generated by older kernels.
3. CHANGE
---------
The CHANGE uevent is used in two places. One is when reporting the
successful mount of the filesystem by the first node (FIRSTMOUNT=Done).
@ -52,6 +58,7 @@ cluster. For this reason the ONLINE uevent was used when adding a new
uevent for a successful mount or remount.
4. OFFLINE
----------
The OFFLINE uevent is only generated due to filesystem errors and is used
as part of the "withdraw" mechanism. Currently this doesn't give any
@ -59,6 +66,7 @@ information about what the error is, which is something that needs to
be fixed.
5. REMOVE
---------
The REMOVE uevent is generated at the end of an unsuccessful mount
or at the end of a umount of the filesystem. All REMOVE uevents will
@ -68,9 +76,10 @@ kobject subsystem.
Information common to all GFS2 uevents (uevent environment variables)
----------------------------------------------------------------------
=====================================================================
1. LOCKTABLE=
--------------
The LOCKTABLE is a string, as supplied on the mount command
line (locktable=) or via fstab. It is used as a filesystem label
@ -78,6 +87,7 @@ as well as providing the information for a lock_dlm mount to be
able to join the cluster.
2. LOCKPROTO=
-------------
The LOCKPROTO is a string, and its value depends on what is set
on the mount command line, or via fstab. It will be either
@ -85,12 +95,14 @@ lock_nolock or lock_dlm. In the future other lock managers
may be supported.
3. JOURNALID=
-------------
If a journal is in use by the filesystem (journals are not
assigned for spectator mounts) then this will give the
numeric journal id in all GFS2 uevents.
4. UUID=
--------
With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID
into the filesystem superblock. If it exists, this will


@ -1,5 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
==================
Global File System
------------------
==================
https://fedorahosted.org/cluster/wiki/HomePage
@ -14,16 +17,18 @@ on one machine show up immediately on all other machines in the cluster.
GFS uses interchangeable inter-node locking mechanisms, the currently
supported mechanisms are:
lock_nolock -- allows gfs to be used as a local file system
lock_nolock
- allows gfs to be used as a local file system
lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking
The dlm is found at linux/fs/dlm/
lock_dlm
- uses a distributed lock manager (dlm) for inter-node locking.
The dlm is found at linux/fs/dlm/
Lock_dlm depends on user space cluster management systems found
at the URL above.
To use gfs as a local file system, no external clustering systems are
needed, simply:
needed, simply::
$ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
$ mount -t gfs2 /dev/block_device /dir
@ -37,9 +42,12 @@ GFS2 is not on-disk compatible with previous versions of GFS, but it
is pretty close.
The following man pages can be found at the URL above:
============ =============================================
fsck.gfs2 to repair a filesystem
gfs2_grow to expand a filesystem online
gfs2_jadd to add journals to a filesystem online
tunegfs2 to manipulate, examine and tune a filesystem
gfs2_convert to convert a gfs filesystem to gfs2 in-place
gfs2_convert to convert a gfs filesystem to gfs2 in-place
mkfs.gfs2 to make a filesystem
============ =============================================


@ -1,11 +1,16 @@
Note: This filesystem doesn't have a maintainer.
.. SPDX-License-Identifier: GPL-2.0
==================================
Macintosh HFS Filesystem for Linux
==================================
HFS stands for ``Hierarchical File System'' and is the filesystem used
.. Note:: This filesystem doesn't have a maintainer.
HFS stands for ``Hierarchical File System`` and is the filesystem used
by the Mac Plus and all later Macintosh models. Earlier Macintosh
models used MFS (``Macintosh File System''), which is not supported,
models used MFS (``Macintosh File System``), which is not supported,
MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
HFS but is extended in various areas. Use the hfsplus filesystem driver
to access such filesystems from Linux.
@ -49,25 +54,25 @@ Writing to HFS Filesystems
HFS is not a UNIX filesystem, thus it does not have the usual features you'd
expect:
o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
* You can't modify the set-uid, set-gid, sticky or executable bits or the uid
and gid of files.
o You can't create hard- or symlinks, device files, sockets or FIFOs.
* You can't create hard- or symlinks, device files, sockets or FIFOs.
HFS does, on the other hand, have the concept of multiple forks per file. These
non-standard forks are represented as hidden additional files in the normal
filesystem namespace, which is kind of a kludge and makes the semantics
a little strange:
o You can't create, delete or rename resource forks of files or the
* You can't create, delete or rename resource forks of files or the
Finder's metadata.
o They are however created (with default values), deleted and renamed
* They are however created (with default values), deleted and renamed
along with the corresponding data fork or directory.
o Copying files to a different filesystem will lose those attributes
* Copying files to a different filesystem will lose those attributes
that are essential for MacOS to work.
Creating HFS filesystems
===================================
========================
The hfsutils package from Robert Leslie contains a program called
hformat that can be used to create HFS filesystem. See


@ -1,4 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
======================================
Macintosh HFSPlus Filesystem for Linux
======================================


@ -1,13 +1,21 @@
.. SPDX-License-Identifier: GPL-2.0
====================
Read/Write HPFS 2.09
====================
1998-2004, Mikulas Patocka
email: mikulas@artax.karlin.mff.cuni.cz
homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
:email: mikulas@artax.karlin.mff.cuni.cz
:homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
CREDITS:
Credits
=======
Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
is taken from it
Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
Mount options
@ -50,6 +58,7 @@ timeshift=(-)nnn (default 0)
File names
==========
As in OS/2, filenames are case insensitive. However, shell thinks that names
are case sensitive, so for example when you create a file FOO, you can use
@ -64,6 +73,7 @@ access it under names 'a.', 'a..', 'a . . . ' etc.
Extended attributes
===================
On HPFS partitions, OS/2 can associate to each file a special information called
extended attributes. Extended attributes are pairs of (key,value) where key is
@ -88,6 +98,7 @@ values doesn't work.
Symlinks
========
You can do symlinks on HPFS partition, symlinks are achieved by setting extended
attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and
@ -101,6 +112,7 @@ to analyze or change OS2SYS.INI.
Codepages
=========
HPFS can contain several uppercasing tables for several codepages and each
file has a pointer to codepage its name is in. However OS/2 was created in
@ -128,6 +140,7 @@ this codepage - if you don't try to do what I described above :-)
Known bugs
==========
HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
should work. If you have OS/2 server, use only read-only mode. I don't know how
@ -152,7 +165,8 @@ would result in directory tree splitting, that takes disk space. Workaround is
to delete other files that are leaf (probability that the file is non-leaf is
about 1/50) or to truncate file first to make some space.
You encounter this problem only if you have many directories so that
preallocated directory band is full i.e.
preallocated directory band is full i.e.::
number_of_directories / size_of_filesystem_in_mb > 4.
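The rule of thumb above can be written as a one-line check. This Python sketch only evaluates the stated ratio; it does not inspect a filesystem, and the function name is made up for illustration.

```python
def directory_band_may_be_full(number_of_directories, size_of_filesystem_in_mb):
    """Apply the ratio above: more than 4 directories per MB of filesystem."""
    if size_of_filesystem_in_mb <= 0:
        raise ValueError("filesystem size must be positive")
    return number_of_directories / size_of_filesystem_in_mb > 4
```

For instance, 5000 directories on a 1000 MB filesystem exceed the threshold, while 4000 directories do not.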
You can't delete open directories.
@ -174,6 +188,7 @@ anybody know what does it mean?
What does "unbalanced tree" message mean?
=========================================
Old versions of this driver created sometimes unbalanced dnode trees. OS/2
chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
@ -187,6 +202,7 @@ whole created by this driver, it is BUG - let me know about it.
Bugs in OS/2
============
When you have two (or more) lost directories pointing each to other, chkdsk
locks up when repairing filesystem.
@ -199,98 +215,139 @@ File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
marks them as short (and writes "minor fs error corrected"). This bug is not in
HPFS386.
Codepage bugs described above.
Codepage bugs described above
=============================
If you don't install fixpacks, there are many, many more...
History
=======
0.90 First public release
0.91 Fixed bug that caused shooting to memory when write_inode was called on
open inode (rarely happened)
0.92 Fixed a little memory leak in freeing directory inodes
0.93 Fixed bug that locked up the machine when there were too many filenames
with first 15 characters same
Fixed write_file to zero file when writing behind file end
0.94 Fixed a little memory leak when trying to delete busy file or directory
0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
1.90 First version for 2.1.1xx kernels
1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
Fixed a race-condition when write_inode is called while deleting file
Fixed a bug that could possibly happen (with very low probability) when
using 0xff in filenames
Rewritten locking to avoid race-conditions
Mount option 'eas' now works
Fsync no longer returns error
Files beginning with '.' are marked hidden
Remount support added
Alloc is not so slow when filesystem becomes full
Atimes are no more updated because it slows down operation
Code cleanup (removed all commented debug prints)
1.92 Corrected a bug when sync was called just before closing file
1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
works with previous versions
Fixed a possible problem with disks > 64G (but I don't have one, so I can't
test it)
Fixed a file overflow at 2G
Added new option 'timeshift'
Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
read-only mode
Fixed a bug that slowed down alloc and prevented allocating 100% space
(this bug was not destructive)
1.94 Added workaround for one bug in Linux
Fixed one buffer leak
Fixed some incompatibilities with large extended attributes (but it's still
not 100% ok, I have no info on it and OS/2 doesn't want to create them)
Rewritten allocation
Fixed a bug with i_blocks (du sometimes didn't display correct values)
Directories have no longer archive attribute set (some programs don't like
it)
Fixed a bug that it set badly one flag in large anode tree (it was not
destructive)
1.95 Fixed one buffer leak, that could happen on corrupted filesystem
Fixed one bug in allocation in 1.94
1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
error sometimes when opening directories in PMSHELL)
Fixed a possible bitmap race
Fixed possible problem on large disks
You can now delete open files
Fixed a nondestructive race in rename
1.97 Support for HPFS v3 (on large partitions)
Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
====== =========================================================================
0.90 First public release
0.91 Fixed bug that caused shooting to memory when write_inode was called on
open inode (rarely happened)
0.92 Fixed a little memory leak in freeing directory inodes
0.93 Fixed bug that locked up the machine when there were too many filenames
with first 15 characters same
Fixed write_file to zero file when writing behind file end
0.94 Fixed a little memory leak when trying to delete busy file or directory
0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
1.90 First version for 2.1.1xx kernels
1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
Fixed a race-condition when write_inode is called while deleting file
Fixed a bug that could possibly happen (with very low probability) when
using 0xff in filenames.
Rewritten locking to avoid race-conditions
Mount option 'eas' now works
Fsync no longer returns error
Files beginning with '.' are marked hidden
Remount support added
Alloc is not so slow when filesystem becomes full
Atimes are no more updated because it slows down operation
Code cleanup (removed all commented debug prints)
1.92 Corrected a bug when sync was called just before closing file
1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
works with previous versions
Fixed a possible problem with disks > 64G (but I don't have one, so I can't
test it)
Fixed a file overflow at 2G
Added new option 'timeshift'
Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
read-only mode
Fixed a bug that slowed down alloc and prevented allocating 100% space
(this bug was not destructive)
1.94 Added workaround for one bug in Linux
Fixed one buffer leak
Fixed some incompatibilities with large extended attributes (but it's still
not 100% ok, I have no info on it and OS/2 doesn't want to create them)
Rewritten allocation
Fixed a bug with i_blocks (du sometimes didn't display correct values)
Directories have no longer archive attribute set (some programs don't like
it)
Fixed a bug that it set badly one flag in large anode tree (it was not
destructive)
1.95 Fixed one buffer leak, that could happen on corrupted filesystem
Fixed one bug in allocation in 1.94
1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
error sometimes when opening directories in PMSHELL)
Fixed a possible bitmap race
Fixed possible problem on large disks
You can now delete open files
Fixed a nondestructive race in rename
1.97 Support for HPFS v3 (on large partitions)
Fixed a bug that it didn't allow creation of files > 128M
(it should be 2G)
1.97.1 Changed names of global symbols
Fixed a bug when chmoding or chowning root directory
1.98 Fixed a deadlock when using old_readdir
Better directory handling; workaround for "unbalanced tree" bug in OS/2
1.99 Corrected a possible problem when there's not enough space while deleting
file
Now it tries to truncate the file if there's not enough space when deleting
Removed a lot of redundant code
2.00 Fixed a bug in rename (it was there since 1.96)
Better anti-fragmentation strategy
2.01 Fixed problem with directory listing over NFS
Directory lseek now checks for proper parameters
Fixed race-condition in buffer code - it is in all filesystems in Linux;
when reading device (cat /dev/hda) while creating files on it, files
could be damaged
2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
end of partition
2.03 Char, block devices and pipes are correctly created
Fixed non-crashing race in unlink (Alexander Viro)
Now it works with Japanese version of OS/2
2.04 Fixed error when ftruncate used to extend file
2.05 Fixed crash when got mount parameters without =
Fixed crash when allocation of anode failed due to full disk
Fixed some crashes when block io or inode allocation failed
2.06 Fixed some crash on corrupted disk structures
Better allocation strategy
Reschedule points added so that it doesn't lock CPU long time
It should work in read-only mode on Warp Server
2.07 More fixes for Warp Server. Now it really works
2.08 Creating new files is not so slow on large disks
An attempt to sync deleted file does not generate filesystem error
2.09 Fixed error on extremely fragmented files
1.98 Fixed a deadlock when using old_readdir
Better directory handling; workaround for "unbalanced tree" bug in OS/2
1.99 Corrected a possible problem when there's not enough space while deleting
file
Now it tries to truncate the file if there's not enough space when
deleting
vim: set textwidth=80:
Removed a lot of redundant code
2.00 Fixed a bug in rename (it was there since 1.96)
Better anti-fragmentation strategy
2.01 Fixed problem with directory listing over NFS
Directory lseek now checks for proper parameters
Fixed race-condition in buffer code - it is in all filesystems in Linux;
when reading device (cat /dev/hda) while creating files on it, files
could be damaged
2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
end of partition
2.03 Char, block devices and pipes are correctly created
Fixed non-crashing race in unlink (Alexander Viro)
Now it works with Japanese version of OS/2
2.04 Fixed error when ftruncate used to extend file
2.05 Fixed crash when got mount parameters without =
Fixed crash when allocation of anode failed due to full disk
Fixed some crashes when block io or inode allocation failed
2.06 Fixed some crash on corrupted disk structures
Better allocation strategy
Reschedule points added so that it doesn't lock CPU long time
It should work in read-only mode on Warp Server
2.07 More fixes for Warp Server. Now it really works
2.08 Creating new files is not so slow on large disks
An attempt to sync deleted file does not generate filesystem error
2.09 Fixed error on extremely fragmented files
====== =========================================================================


@ -1,3 +1,5 @@
.. _filesystems_index:
===============================
Filesystems in the Linux kernel
===============================
@ -46,8 +48,53 @@ Documentation for filesystem implementations.
.. toctree::
:maxdepth: 2
9p
adfs
affs
afs
autofs
autofs-mount-control
befs
bfs
btrfs
ceph
cramfs
debugfs
dlmfs
ecryptfs
efivarfs
erofs
ext2
ext3
f2fs
gfs2
gfs2-uevents
hfs
hfsplus
hpfs
fuse
inotify
isofs
nilfs2
nfs/index
ntfs
ocfs2
ocfs2-online-filecheck
omfs
orangefs
overlayfs
proc
qnx6
ramfs-rootfs-initramfs
relay
romfs
squashfs
sysfs
sysv-fs
tmpfs
ubifs
ubifs-authentication.rst
udf
virtiofs
vfat
zonefs


@ -1,27 +1,36 @@
inotify
a powerful yet simple file change notification system
.. SPDX-License-Identifier: GPL-2.0
===============================================================
Inotify - A Powerful yet Simple File Change Notification System
===============================================================
Document started 15 Mar 2005 by Robert Love <rml@novell.com>
Document updated 4 Jan 2015 by Zhang Zhen <zhenzhang.zhang@huawei.com>
--Deleted obsoleted interface, just refer to manpages for user interface.
- Deleted obsoleted interface, just refer to manpages for user interface.
(i) Rationale
Q: What is the design decision behind not tying the watch to the open fd of
Q:
What is the design decision behind not tying the watch to the open fd of
the watched object?
A: Watches are associated with an open inotify device, not an open file.
A:
Watches are associated with an open inotify device, not an open file.
This solves the primary problem with dnotify: keeping the file open pins
the file and thus, worse, pins the mount. Dnotify is therefore infeasible
for use on a desktop system with removable media as the media cannot be
unmounted. Watching a file should not require that it be open.
Q: What is the design decision behind using an-fd-per-instance as opposed to
Q:
What is the design decision behind using an-fd-per-instance as opposed to
an fd-per-watch?
A: An fd-per-watch quickly consumes more file descriptors than are allowed,
A:
An fd-per-watch quickly consumes more file descriptors than are allowed,
more fd's than are feasible to manage, and more fd's than are optimally
select()-able. Yes, root can bump the per-process fd limit and yes, users
can use epoll, but requiring both is a silly and extraneous requirement.
@ -29,8 +38,8 @@ A: An fd-per-watch quickly consumes more file descriptors than are allowed,
spaces is thus sensible. The current design is what user-space developers
want: Users initialize inotify, once, and add n watches, requiring but one
fd and no twiddling with fd limits. Initializing an inotify instance two
thousand times is silly. If we can implement user-space's preferences
cleanly--and we can, the idr layer makes stuff like this trivial--then we
thousand times is silly. If we can implement user-space's preferences
cleanly--and we can, the idr layer makes stuff like this trivial--then we
should.
There are other good arguments. With a single fd, there is a single
@ -65,9 +74,11 @@ A: An fd-per-watch quickly consumes more file descriptors than are allowed,
need not be a one-fd-per-process mapping; it is one-fd-per-queue and a
process can easily want more than one queue.
Q: Why the system call approach?
Q:
Why the system call approach?
A: The poor user-space interface is the second biggest problem with dnotify.
A:
The poor user-space interface is the second biggest problem with dnotify.
Signals are a terrible, terrible interface for file notification. Or for
anything, for that matter. The ideal solution, from all perspectives, is a
file descriptor-based one that allows basic file I/O and poll/select.


@ -0,0 +1,64 @@
.. SPDX-License-Identifier: GPL-2.0
==================
ISO9660 Filesystem
==================
Mount options that are the same as for msdos and vfat partitions.
========= ========================================================
gid=nnn All files in the partition will be in group nnn.
uid=nnn All files in the partition will be owned by user id nnn.
umask=nnn The permission mask (see umask(1)) for the partition.
========= ========================================================
Mount options that are the same as vfat partitions. These are only useful
when using discs encoded using Microsoft's Joliet extensions.
============== =============================================================
iocharset=name Character set to use for converting from Unicode to
ASCII. Joliet filenames are stored in Unicode format, but
Unix for the most part doesn't know how to deal with Unicode.
There is also an option of doing UTF-8 translations with the
utf8 option.
utf8 Encode Unicode names in UTF-8 format. Default is no.
============== =============================================================
Mount options unique to the isofs filesystem.
================= ============================================================
block=512 Set the block size for the disk to 512 bytes
block=1024 Set the block size for the disk to 1024 bytes
block=2048 Set the block size for the disk to 2048 bytes
check=relaxed Matches filenames with different cases
check=strict Matches only filenames with the exact same case
cruft Try to handle badly formatted CDs.
map=off Do not map non-Rock Ridge filenames to lower case
map=normal Map non-Rock Ridge filenames to lower case
map=acorn As map=normal but also apply Acorn extensions if present
mode=xxx Sets the permissions on files to xxx unless Rock Ridge
extensions set the permissions otherwise
dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge
extensions set the permissions otherwise
overriderockperm Set permissions on files and directories according to
'mode' and 'dmode' even though Rock Ridge extensions are
present.
nojoliet Ignore Joliet extensions if they are present.
norock Ignore Rock Ridge extensions if they are present.
hide Completely strip hidden files from the file system.
showassoc Show files marked with the 'associated' bit
unhide Deprecated; showing hidden files is now default;
If given, it is a synonym for 'showassoc' which will
recreate previous unhide behavior
session=x Select number of session on multisession CD
sbsector=xxx Session begins from sector xxx
================= ============================================================
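As a rough illustration of how the options in the tables above combine, the sketch below assembles a comma-separated option string of the kind passed to ``mount -o``. The helper name ``build_isofs_options`` and its parameters are hypothetical; only the option names and their listed values come from this document.

```python
# Hypothetical helper: assembles an isofs option string for ``mount -o``.
# Only the option names and values are taken from the tables above.
ALLOWED = {
    "block": {"512", "1024", "2048"},
    "check": {"relaxed", "strict"},
    "map": {"off", "normal", "acorn"},
}
FLAGS = {"cruft", "overriderockperm", "nojoliet", "norock",
         "hide", "showassoc", "utf8"}


def build_isofs_options(flags=(), **values):
    """Return a comma-separated option string, rejecting unlisted values."""
    parts = []
    for name, value in values.items():
        value = str(value)
        # Options with a fixed set of values are checked; others
        # (uid=, gid=, mode=, ...) are passed through unchanged.
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name}={value} is not listed for isofs")
        parts.append(f"{name}={value}")
    for flag in flags:
        if flag not in FLAGS:
            raise ValueError(f"unknown flag: {flag}")
        parts.append(flag)
    return ",".join(parts)


print(build_isofs_options(flags=["norock"], block=2048, map="off"))
```

The resulting string could then be handed to ``mount -t iso9660 -o <options>``; the validation here is only a convenience and does not replace the checks done by the kernel.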
Recommended documents about ISO 9660 standard are located at:
- http://www.y-adagio.com/
- ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
identical with ISO 9660.", so it is a valid and gratis substitute of the
official ISO specification.


@ -1,48 +0,0 @@
Mount options that are the same as for msdos and vfat partitions.
gid=nnn All files in the partition will be in group nnn.
uid=nnn All files in the partition will be owned by user id nnn.
umask=nnn The permission mask (see umask(1)) for the partition.
Mount options that are the same as vfat partitions. These are only useful
when using discs encoded using Microsoft's Joliet extensions.
iocharset=name Character set to use for converting from Unicode to
ASCII. Joliet filenames are stored in Unicode format, but
Unix for the most part doesn't know how to deal with Unicode.
There is also an option of doing UTF-8 translations with the
utf8 option.
utf8 Encode Unicode names in UTF-8 format. Default is no.
Mount options unique to the isofs filesystem.
block=512 Set the block size for the disk to 512 bytes
block=1024 Set the block size for the disk to 1024 bytes
block=2048 Set the block size for the disk to 2048 bytes
check=relaxed Matches filenames with different cases
check=strict Matches only filenames with the exact same case
cruft Try to handle badly formatted CDs.
map=off Do not map non-Rock Ridge filenames to lower case
map=normal Map non-Rock Ridge filenames to lower case
map=acorn As map=normal but also apply Acorn extensions if present
mode=xxx Sets the permissions on files to xxx unless Rock Ridge
extensions set the permissions otherwise
dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge
extensions set the permissions otherwise
overriderockperm Set permissions on files and directories according to
'mode' and 'dmode' even though Rock Ridge extensions are
present.
nojoliet Ignore Joliet extensions if they are present.
norock Ignore Rock Ridge extensions if they are present.
hide Completely strip hidden files from the file system.
showassoc Show files marked with the 'associated' bit
unhide Deprecated; showing hidden files is now default;
If given, it is a synonym for 'showassoc' which will
recreate previous unhide behavior
session=x Select number of session on multisession CD
sbsector=xxx Session begins from sector xxx
Recommended documents about ISO 9660 standard are located at:
http://www.y-adagio.com/
ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
identical with ISO 9660.", so it is a valid and gratis substitute of the
official ISO specification.


@ -0,0 +1,13 @@
===============================
NFS
===============================
.. toctree::
:maxdepth: 1
pnfs
rpc-cache
rpc-server-gss
nfs41-server
knfsd-stats


@ -1,7 +1,9 @@
============================
Kernel NFS Server Statistics
============================
:Authors: Greg Banks <gnb@sgi.com> - 26 Mar 2009
This document describes the format and semantics of the statistics
which the kernel NFS server makes available to userspace. These
statistics are available in several text form pseudo files, each of
@ -18,7 +20,7 @@ by parsing routines. All other lines contain a sequence of fields
separated by whitespace.
/proc/fs/nfsd/pool_stats
------------------------
========================
This file is available in kernels from 2.6.30 onwards, if the
/proc/fs/nfsd filesystem is mounted (it almost always should be).
@ -109,15 +111,12 @@ this case), or the transport can be enqueued for later attention
(sockets-enqueued counts this case), or the packet can be temporarily
deferred because the transport is currently being used by an nfsd
thread. This last case is not very interesting and is not explicitly
counted, but can be inferred from the other counters thus:
counted, but can be inferred from the other counters thus::
packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
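The relation above can be sketched in a few lines of Python. The function name and the sample counter values are made up for illustration; only the three counter names and the formula come from this document.

```python
def packets_deferred(packets_arrived, sockets_enqueued, threads_woken):
    """Infer the uncounted 'deferred' case from the three counted ones."""
    return packets_arrived - (sockets_enqueued + threads_woken)


# Made-up counter values for a single pool, for illustration only.
sample = {"packets-arrived": 1000, "sockets-enqueued": 120, "threads-woken": 850}
print(packets_deferred(sample["packets-arrived"],
                       sample["sockets-enqueued"],
                       sample["threads-woken"]))
```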
More
----
====
Descriptions of the other statistics file should go here.
Greg Banks <gnb@sgi.com>
26 Mar 2009


@ -0,0 +1,256 @@
=============================
NFSv4.1 Server Implementation
=============================
Server support for minorversion 1 can be controlled using the
/proc/fs/nfsd/versions control file. The string output returned
by reading this file will contain either "+4.1" or "-4.1"
correspondingly.
Currently, server support for minorversion 1 is enabled by default.
It can be disabled at run time by writing the string "-4.1" to
the /proc/fs/nfsd/versions control file. Note that to write this
control file, the nfsd service must be taken down. You can use rpc.nfsd
for this; see rpc.nfsd(8).
(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
"-4", respectively. Therefore, code meant to work on both new and old
kernels must turn 4.1 on or off *before* turning support for version 4
on or off; rpc.nfsd does this correctly.)
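A minimal sketch of checking the minorversion state from the string read out of the control file follows. It assumes the output is a whitespace-separated list of signed version tokens; the function name and the sample string are hypothetical, and only the "+4.1" and "-4.1" tokens are taken from the text above.

```python
def minorversion_state(versions_text, version="4.1"):
    """Return True for '+4.1', False for '-4.1', None if neither appears."""
    # Compare whole tokens so that '+4' is never mistaken for '+4.1'.
    tokens = versions_text.split()
    if "+" + version in tokens:
        return True
    if "-" + version in tokens:
        return False
    return None


# Illustrative string only; the real file contents depend on the kernel.
print(minorversion_state("-2 +3 +4 +4.1"))
```

On a running server the input would come from reading /proc/fs/nfsd/versions. Matching whole tokens rather than substrings keeps "+4" and "+4.1" distinct, which is the confusion the warning above describes for older servers.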
The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
on RFC 5661.
From the many new features in NFSv4.1 the current implementation
focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
"exactly once" semantics and better control and throttling of the
resources allocated for each client.
The table below, taken from the NFSv4.1 document, lists
the operations that are mandatory to implement (REQ), optional
(OPT), and NFSv4.0 operations that are required not to implement (MNI)
in minor version 1. The first column indicates the operations that
are not supported yet by the linux server implementation.
The OPTIONAL features identified and their abbreviations are as follows:
- **pNFS** Parallel NFS
- **FDELG** File Delegations
- **DDELG** Directory Delegations
The following abbreviations indicate the linux server implementation status.
- **I** Implemented NFSv4.1 operations.
- **NS** Not Supported.
- **NS\*** Unimplemented optional feature.
Operations
==========
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| Implementation status | Operation | REQ,REC, OPT or MNI | Feature (REQ, REC or OPT) | Definition |
+=======================+======================+=====================+===========================+================+
| | ACCESS | REQ | | Section 18.1 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | BACKCHANNEL_CTL | REQ | | Section 18.33 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | CLOSE | REQ | | Section 18.2 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | COMMIT | REQ | | Section 18.3 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | CREATE | REQ | | Section 18.4 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | CREATE_SESSION | REQ | | Section 18.36 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| NS* | DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | DELEGRETURN | OPT | FDELG, | Section 18.6 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | | | DDELG, pNFS | |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | | | (REQ) | |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | DESTROY_CLIENTID | REQ | | Section 18.50 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | DESTROY_SESSION | REQ | | Section 18.37 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | EXCHANGE_ID | REQ | | Section 18.35 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | FREE_STATEID | REQ | | Section 18.38 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | GETATTR | REQ | | Section 18.7 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| NS* | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | GETFH | REQ | | Section 18.8 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| NS* | GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | LINK | OPT | | Section 18.9 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | LOCK | REQ | | Section 18.10 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | LOCKT | REQ | | Section 18.11 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | LOCKU | REQ | | Section 18.12 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | LOOKUP | REQ | | Section 18.13 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | LOOKUPP | REQ | | Section 18.14 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | NVERIFY | REQ | | Section 18.15 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | OPEN | REQ | | Section 18.16 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| NS* | OPENATTR | OPT | | Section 18.17 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | OPEN_CONFIRM | MNI | | N/A |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | OPEN_DOWNGRADE | REQ | | Section 18.18 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | PUTFH | REQ | | Section 18.19 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | PUTPUBFH | REQ | | Section 18.20 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | PUTROOTFH | REQ | | Section 18.21 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | READ | REQ | | Section 18.22 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | READDIR | REQ | | Section 18.23 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | READLINK | OPT | | Section 18.24 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | RECLAIM_COMPLETE | REQ | | Section 18.51 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | RELEASE_LOCKOWNER | MNI | | N/A |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | REMOVE | REQ | | Section 18.25 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | RENAME | REQ | | Section 18.26 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | RENEW | MNI | | N/A |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | RESTOREFH | REQ | | Section 18.27 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | SAVEFH | REQ | | Section 18.28 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | SECINFO | REQ | | Section 18.29 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | | | layout (REQ) | Section 13.12 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | SEQUENCE | REQ | | Section 18.46 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | SETATTR | REQ | | Section 18.30 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | SETCLIENTID | MNI | | N/A |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | SETCLIENTID_CONFIRM | MNI | | N/A |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| NS | SET_SSV | REQ | | Section 18.47 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| I | TEST_STATEID | REQ | | Section 18.48 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | VERIFY | REQ | | Section 18.31 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| NS* | WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
| | WRITE | REQ | | Section 18.32 |
+-----------------------+----------------------+---------------------+---------------------------+----------------+
Callback Operations
===================
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| Implementation status | Operation | REQ,REC, OPT or MNI | Feature (REQ, REC or OPT) | Definition |
+=======================+=========================+=====================+===========================+===============+
| | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| NS* | CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| NS* | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| NS* | CB_NOTIFY_LOCK | OPT | | Section 20.11 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| NS* | CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | CB_RECALL | OPT | FDELG, | Section 20.2 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | | | DDELG, pNFS | |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | | | (REQ) | |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| NS* | CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | | | DDELG, pNFS | |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | | | (REQ) | |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| NS* | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | | | (REQ) | |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | | | DDELG, pNFS | |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | | | (REQ) | |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| NS* | CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | | | DDELG, pNFS | |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
| | | | (REQ) | |
+-----------------------+-------------------------+---------------------+---------------------------+---------------+
Implementation notes:
=====================
SSV:
The spec claims this is mandatory, but we don't actually know of any
implementations, so we're ignoring it for now. The server returns
NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof.
GSS on the backchannel:
Again, theoretically required but not widely implemented (in
particular, the current Linux client doesn't request it). We return
NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION.
DELEGPURGE:
mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
persist across client reboots). Thus we need not implement this for
now.
EXCHANGE_ID:
implementation ids are ignored
CREATE_SESSION:
backchannel attributes are ignored
SEQUENCE:
no support for dynamic slot table renegotiation (optional)
Nonstandard compound limitations:
No support for a session's fore channel RPC compound that requires both a
ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
fail to live up to the promise we made in CREATE_SESSION fore channel
negotiation.
See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.


@ -1,173 +0,0 @@
NFSv4.1 Server Implementation
Server support for minorversion 1 can be controlled using the
/proc/fs/nfsd/versions control file. The string output returned
by reading this file will contain either "+4.1" or "-4.1"
correspondingly.
Currently, server support for minorversion 1 is enabled by default.
It can be disabled at run time by writing the string "-4.1" to
the /proc/fs/nfsd/versions control file. Note that to write this
control file, the nfsd service must be taken down. You can use rpc.nfsd
for this; see rpc.nfsd(8).
(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
"-4", respectively. Therefore, code meant to work on both new and old
kernels must turn 4.1 on or off *before* turning support for version 4
on or off; rpc.nfsd does this correctly.)
The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
on RFC 5661.
From the many new features in NFSv4.1 the current implementation
focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
"exactly once" semantics and better control and throttling of the
resources allocated for each client.
The table below, taken from the NFSv4.1 document, lists
the operations that are mandatory to implement (REQ), optional
(OPT), and NFSv4.0 operations that are required not to implement (MNI)
in minor version 1. The first column indicates the operations that
are not supported yet by the linux server implementation.
The OPTIONAL features identified and their abbreviations are as follows:
pNFS Parallel NFS
FDELG File Delegations
DDELG Directory Delegations
The following abbreviations indicate the linux server implementation status.
I Implemented NFSv4.1 operations.
NS Not Supported.
NS* Unimplemented optional feature.
Operations
+----------------------+------------+--------------+----------------+
| Operation | REQ, REC, | Feature | Definition |
| | OPT, or | (REQ, REC, | |
| | MNI | or OPT) | |
+----------------------+------------+--------------+----------------+
| ACCESS | REQ | | Section 18.1 |
I | BACKCHANNEL_CTL | REQ | | Section 18.33 |
I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
| CLOSE | REQ | | Section 18.2 |
| COMMIT | REQ | | Section 18.3 |
| CREATE | REQ | | Section 18.4 |
I | CREATE_SESSION | REQ | | Section 18.36 |
NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
| DELEGRETURN | OPT | FDELG, | Section 18.6 |
| | | DDELG, pNFS | |
| | | (REQ) | |
I | DESTROY_CLIENTID | REQ | | Section 18.50 |
I | DESTROY_SESSION | REQ | | Section 18.37 |
I | EXCHANGE_ID | REQ | | Section 18.35 |
I | FREE_STATEID | REQ | | Section 18.38 |
| GETATTR | REQ | | Section 18.7 |
I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
NS*| GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
| GETFH | REQ | | Section 18.8 |
NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
I | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
| LINK | OPT | | Section 18.9 |
| LOCK | REQ | | Section 18.10 |
| LOCKT | REQ | | Section 18.11 |
| LOCKU | REQ | | Section 18.12 |
| LOOKUP | REQ | | Section 18.13 |
| LOOKUPP | REQ | | Section 18.14 |
| NVERIFY | REQ | | Section 18.15 |
| OPEN | REQ | | Section 18.16 |
NS*| OPENATTR | OPT | | Section 18.17 |
| OPEN_CONFIRM | MNI | | N/A |
| OPEN_DOWNGRADE | REQ | | Section 18.18 |
| PUTFH | REQ | | Section 18.19 |
| PUTPUBFH | REQ | | Section 18.20 |
| PUTROOTFH | REQ | | Section 18.21 |
| READ | REQ | | Section 18.22 |
| READDIR | REQ | | Section 18.23 |
| READLINK | OPT | | Section 18.24 |
| RECLAIM_COMPLETE | REQ | | Section 18.51 |
| RELEASE_LOCKOWNER | MNI | | N/A |
| REMOVE | REQ | | Section 18.25 |
| RENAME | REQ | | Section 18.26 |
| RENEW | MNI | | N/A |
| RESTOREFH | REQ | | Section 18.27 |
| SAVEFH | REQ | | Section 18.28 |
| SECINFO | REQ | | Section 18.29 |
I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
| | | layout (REQ) | Section 13.12 |
I | SEQUENCE | REQ | | Section 18.46 |
| SETATTR | REQ | | Section 18.30 |
| SETCLIENTID | MNI | | N/A |
| SETCLIENTID_CONFIRM | MNI | | N/A |
NS | SET_SSV | REQ | | Section 18.47 |
I | TEST_STATEID | REQ | | Section 18.48 |
| VERIFY | REQ | | Section 18.31 |
NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
| WRITE | REQ | | Section 18.32 |
Callback Operations
+-------------------------+-----------+-------------+---------------+
| Operation | REQ, REC, | Feature | Definition |
| | OPT, or | (REQ, REC, | |
| | MNI | or OPT) | |
+-------------------------+-----------+-------------+---------------+
| CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
NS*| CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 |
NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
| CB_RECALL | OPT | FDELG, | Section 20.2 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
| | | (REQ) | |
I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
| | | DDELG, pNFS | |
| | | (REQ) | |
+-------------------------+-----------+-------------+---------------+
Implementation notes:
SSV:
* The spec claims this is mandatory, but we don't actually know of any
implementations, so we're ignoring it for now. The server returns
NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof.
GSS on the backchannel:
* Again, theoretically required but not widely implemented (in
particular, the current Linux client doesn't request it). We return
NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION.
DELEGPURGE:
* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
persist across client reboots). Thus we need not implement this for
now.
EXCHANGE_ID:
* implementation ids are ignored
CREATE_SESSION:
* backchannel attributes are ignored
SEQUENCE:
* no support for dynamic slot table renegotiation (optional)
Nonstandard compound limitations:
* No support for a sessions fore channel RPC compound that requires both a
ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
fail to live up to the promise we made in CREATE_SESSION fore channel
negotiation.
See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.


@ -1,15 +1,17 @@
Reference counting in pnfs:
==========================
Reference counting in pnfs
==========================
There are several inter-related caches. We have layouts which can
reference multiple devices, each of which can reference multiple data servers.
Each data server can be referenced by multiple devices. Each device
can be referenced by multiple layouts. To keep all of this straight,
can be referenced by multiple layouts. To keep all of this straight,
we need to reference count.
struct pnfs_layout_hdr
----------------------
======================
The on-the-wire command LAYOUTGET corresponds to struct
pnfs_layout_segment, usually referred to by the variable name lseg.
Each nfs_inode may hold a pointer to a cache of these layout
@ -25,7 +27,8 @@ the reference count, as the layout is kept around by the lseg that
keeps it in the list.
deviceid_cache
--------------
==============
lsegs reference device ids, which are resolved per nfs_client and
layout driver type. The device ids are held in an RCU cache (struct
nfs4_deviceid_cache). The cache itself is referenced across each
@ -38,24 +41,26 @@ justification, but seems reasonable given that we can have multiple
deviceid's per filesystem, and multiple filesystems per nfs_client.
The hash code is copied from the nfsd code base. A discussion of
hashing and variations of this algorithm can be found at:
http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
hashing and variations of this algorithm can be found `here.
<http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809>`_
data server cache
-----------------
=================
file driver devices refer to data servers, which are kept in a module
level cache. Its reference is held over the lifetime of the deviceid
pointing to it.
lseg
----
====
lseg maintains an extra reference corresponding to the NFS_LSEG_VALID
bit which holds it in the pnfs_layout_hdr's list. When the final lseg
is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED
bit is set, preventing any new lsegs from being added.
layout drivers
--------------
==============
PNFS utilizes what are called layout drivers. The STD defines 4 basic
layout types: "files", "objects", "blocks", and "flexfiles". For each
@ -68,6 +73,6 @@ Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. directory
blocks-layout setup
-------------------
===================
TODO: Document the setup needs of the blocks layout driver


@ -1,9 +1,14 @@
This document gives a brief introduction to the caching
=========
RPC Cache
=========
This document gives a brief introduction to the caching
mechanisms in the sunrpc layer that is used, in particular,
for NFS authentication.
CACHES
Caches
======
The caching replaces the old exports table and allows for
a wide variety of values to be cached.
@ -12,6 +17,7 @@ quite possibly very different in content and use. There is a corpus
of common code for managing these caches.
Examples of caches that are likely to be needed are:
- mapping from IP address to client name
- mapping from client name and filesystem to export options
- mapping from UID to list of GIDs, to work around NFS's limitation
@ -21,6 +27,7 @@ Examples of caches that are likely to be needed are:
- mapping from network identity to public key for crypto authentication.
The common code handles such things as:
- general cache lookup with correct locking
- supporting 'NEGATIVE' as well as positive entries
- allowing an EXPIRED time on cache items, and removing
@ -35,60 +42,66 @@ The common code handles such things as:
Creating a Cache
----------------
1/ A cache needs a datum to store. This is in the form of a
structure definition that must contain a
struct cache_head
- A cache needs a datum to store. This is in the form of a
structure definition that must contain a struct cache_head
as an element, usually the first.
It will also contain a key and some content.
Each cache element is reference counted and contains
expiry and update times for use in cache management.
2/ A cache needs a "cache_detail" structure that
- A cache needs a "cache_detail" structure that
describes the cache. This stores the hash table, some
parameters for cache management, and some operations detailing how
to work with particular cache items.
The operations required are:
struct cache_head *alloc(void)
This simply allocates appropriate memory and returns
a pointer to the cache_detail embedded within the
structure
void cache_put(struct kref *)
This is called when the last reference to an item is
dropped. The pointer passed is to the 'ref' field
in the cache_head. cache_put should release any
references created by 'cache_init' and, if CACHE_VALID
is set, any references created by cache_update.
It should then release the memory allocated by
'alloc'.
int match(struct cache_head *orig, struct cache_head *new)
test if the keys in the two structures match. Return
1 if they do, 0 if they don't.
void init(struct cache_head *orig, struct cache_head *new)
Set the 'key' fields in 'new' from 'orig'. This may
include taking references to shared objects.
void update(struct cache_head *orig, struct cache_head *new)
Set the 'content' fields in 'new' from 'orig'.
int cache_show(struct seq_file *m, struct cache_detail *cd,
struct cache_head *h)
Optional. Used to provide a /proc file that lists the
contents of a cache. This should show one item,
usually on just one line.
int cache_request(struct cache_detail *cd, struct cache_head *h,
char **bpp, int *blen)
Format a request to be sent to user-space for an item
to be instantiated. *bpp is a buffer of size *blen.
bpp should be moved forward over the encoded message,
and *blen should be reduced to show how much free
space remains. Return 0 on success or <0 if not
enough room or other problem.
int cache_parse(struct cache_detail *cd, char *buf, int len)
A message from user space has arrived to fill out a
cache entry. It is in 'buf' of length 'len'.
cache_parse should parse this, find the item in the
cache with sunrpc_cache_lookup_rcu, and update the item
with sunrpc_cache_update.
The operations are:
struct cache_head \*alloc(void)
This simply allocates appropriate memory and returns
a pointer to the cache_head embedded within the
structure
void cache_put(struct kref \*)
This is called when the last reference to an item is
dropped. The pointer passed is to the 'ref' field
in the cache_head. cache_put should release any
references created by 'cache_init' and, if CACHE_VALID
is set, any references created by cache_update.
It should then release the memory allocated by
'alloc'.
int match(struct cache_head \*orig, struct cache_head \*new)
test if the keys in the two structures match. Return
1 if they do, 0 if they don't.
void init(struct cache_head \*orig, struct cache_head \*new)
Set the 'key' fields in 'new' from 'orig'. This may
include taking references to shared objects.
void update(struct cache_head \*orig, struct cache_head \*new)
Set the 'content' fields in 'new' from 'orig'.
int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h)
Optional. Used to provide a /proc file that lists the
contents of a cache. This should show one item,
usually on just one line.
int cache_request(struct cache_detail \*cd, struct cache_head \*h, char \*\*bpp, int \*blen)
Format a request to be sent to user-space for an item
to be instantiated. \*bpp is a buffer of size \*blen.
bpp should be moved forward over the encoded message,
and \*blen should be reduced to show how much free
space remains. Return 0 on success or <0 if not
enough room or other problem.
int cache_parse(struct cache_detail \*cd, char \*buf, int len)
A message from user space has arrived to fill out a
cache entry. It is in 'buf' of length 'len'.
cache_parse should parse this, find the item in the
cache with sunrpc_cache_lookup_rcu, and update the item
with sunrpc_cache_update.
3/ A cache needs to be registered using cache_register(). This
- A cache needs to be registered using cache_register(). This
includes it on a list of caches that will be regularly
cleaned to discard old data.
@ -107,7 +120,7 @@ cache_check will return -ENOENT in the entry is negative or if an up
call is needed but not possible, -EAGAIN if an upcall is pending,
or 0 if the data is valid;
cache_check can be passed a "struct cache_req *". This structure is
cache_check can be passed a "struct cache_req\*". This structure is
typically embedded in the actual request and can be used to create a
deferred copy of the request (struct cache_deferred_req). This is
done when the found cache item is not up to date, but there is reason to
@ -139,9 +152,11 @@ The 'channel' works a bit like a datagram socket. Each 'write' is
passed as a whole to the cache for parsing and interpretation.
Each cache can treat the write requests differently, but it is
expected that a message written will contain:
- a key
- an expiry time
- a content.
with the intention that an item in the cache with the given key
should be created or updated to have the given content, and the
expiry time should be set on that item.
@ -156,7 +171,8 @@ If there are no more requests to return, read will return EOF, but a
select or poll for read will block waiting for another request to be
added.
Thus a user-space helper is likely to:
Thus a user-space helper is likely to::
open the channel.
select for readable
read a request
@ -175,12 +191,13 @@ Each cache should also define a "cache_request" method which
takes a cache item and encodes a request into the buffer
provided.
Note: If a cache has no active readers on the channel, and has had no
active readers for more than 60 seconds, further requests will not be
added to the channel but instead all lookups that do not find a valid
entry will fail. This is partly for backward compatibility: The
previous nfs exports table was deemed to be authoritative and a
failed lookup meant a definite 'no'.
.. note::
If a cache has no active readers on the channel, and has had no
active readers for more than 60 seconds, further requests will not be
added to the channel but instead all lookups that do not find a valid
entry will fail. This is partly for backward compatibility: The
previous nfs exports table was deemed to be authoritative and a
failed lookup meant a definite 'no'.
request/response format
-----------------------
@ -193,10 +210,11 @@ with precisely one newline character which should be at the end.
Fields within the record should be separated by spaces, normally one.
If spaces, newlines, or nul characters are needed in a field they
must be quoted. Two mechanisms are available:
1/ If a field begins '\x' then it must contain an even number of
- If a field begins '\x' then it must contain an even number of
hex digits, and pairs of these digits provide the bytes in the
field.
2/ otherwise a \ in the field must be followed by 3 octal digits
- otherwise a \ in the field must be followed by 3 octal digits
which give the code for a byte. Other characters are treated
as themselves. At the very least, space, newline, nul, and
'\' must be quoted in this way.
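The two quoting mechanisms above can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel's or any helper daemon's actual code: the function names are hypothetical, and the octal encoder quotes every byte outside printable ASCII, which is stricter than the minimum the text requires.

```python
def quote_hex(field: bytes) -> str:
    """Mechanism 1: '\\x' followed by two hex digits per byte."""
    return "\\x" + field.hex()


def quote_octal(field: bytes) -> str:
    """Mechanism 2: a backslash followed by exactly three octal digits."""
    out = []
    for b in field:
        # Space, newline, nul and backslash must be quoted; this sketch
        # also quotes all other control and non-ASCII bytes.
        if b <= 0x20 or b >= 0x7F or b == 0x5C:
            out.append("\\%03o" % b)
        else:
            out.append(chr(b))
    return "".join(out)


def unquote(text: str) -> bytes:
    """Decode a field written with either mechanism (ASCII input assumed)."""
    if text.startswith("\\x"):
        return bytes.fromhex(text[2:])
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            out.append(int(text[i + 1:i + 4], 8))  # three octal digits
            i += 4
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)
```

A literal backslash in the data is always emitted as an octal escape here, so an octal-quoted field can never begin with the two characters that introduce the hex form.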


@ -1,4 +1,4 @@
=========================================
rpcsec_gss support for kernel RPC servers
=========================================
@ -9,14 +9,17 @@ NFSv4.1 and higher don't require the client to act as a server for the
purposes of authentication.)
RPCGSS is specified in a few IETF documents:
- RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt
- RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt
and there is a 3rd version being proposed:
- http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt
(At draft n. 02 at the time of writing)
Background
----------
==========
The RPCGSS Authentication method describes a way to perform GSSAPI
Authentication for NFS. Although GSSAPI is itself completely mechanism
@ -29,6 +32,7 @@ depends on GSSAPI extensions that are KRB5 specific.
GSSAPI is a complex library, and implementing it completely in kernel is
unwarranted. However GSSAPI operations are fundamentally separable in 2
parts:
- initial context establishment
- integrity/privacy protection (signing and encrypting of individual
packets)
@ -41,7 +45,7 @@ kernel, but leave the initial context establishment to userspace. We
need upcalls to request userspace to perform context establishment.
NFS Server Legacy Upcall Mechanism
----------------------------------
==================================
The classic upcall mechanism uses a custom text based upcall mechanism
to talk to a custom daemon called rpc.svcgssd that is provided by the
@ -62,21 +66,20 @@ groups) due to limitation on the size of the buffer that can be send
back to the kernel (4KiB).
NFS Server New RPC Upcall Mechanism
-----------------------------------
===================================
The newer upcall mechanism uses RPC over a unix socket to a daemon
called gss-proxy, implemented by a userspace program called Gssproxy.
The gss_proxy RPC protocol is currently documented here:
https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation
The gss_proxy RPC protocol is currently documented `here
<https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation>`_.
This upcall mechanism uses the kernel rpc client and connects to the gssproxy
userspace program over a regular unix socket. The gssproxy protocol does not
suffer from the size limitations of the legacy protocol.
Negotiating Upcall Mechanisms
-----------------------------
=============================
To provide backward compatibility, the kernel defaults to using the
legacy mechanism. To switch to the new mechanism, gss-proxy must bind


@ -1,5 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
======
NILFS2
------
======
NILFS2 is a log-structured file system (LFS) supporting continuous
snapshotting. In addition to versioning capability of the entire file
@ -25,9 +28,9 @@ available from the following download page. At least "mkfs.nilfs2",
cleaner or garbage collector) are required. Details on the tools are
described in the man pages included in the package.
Project web page: https://nilfs.sourceforge.io/
Download page: https://nilfs.sourceforge.io/en/download.html
List info: http://vger.kernel.org/vger-lists.html#linux-nilfs
:Project web page: https://nilfs.sourceforge.io/
:Download page: https://nilfs.sourceforge.io/en/download.html
:List info: http://vger.kernel.org/vger-lists.html#linux-nilfs
Caveats
=======
@ -47,6 +50,7 @@ Mount options
NILFS2 supports the following mount options:
(*) == default
======================= =======================================================
barrier(*) This enables/disables the use of write barriers. This
nobarrier requires an IO stack which can support barriers, and
if nilfs gets an error on a barrier write, it will
@ -79,6 +83,7 @@ discard This enables/disables the use of discard/TRIM commands.
nodiscard(*) The discard/TRIM commands are sent to the underlying
block device when blocks are freed. This is useful
for SSD devices and sparse/thinly-provisioned LUNs.
======================= =======================================================
Ioctls
======
@ -87,9 +92,11 @@ There is some NILFS2 specific functionality which can be accessed by application
through the system call interfaces. The list of all NILFS2 specific ioctls is
shown in the table below.
Table of NILFS2 specific ioctls
..............................................................................
Table of NILFS2 specific ioctls:
============================== ===============================================
Ioctl Description
============================== ===============================================
NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between
checkpoint and snapshot state. This ioctl is
used in chcp and mkcp utilities.
@ -142,11 +149,12 @@ Table of NILFS2 specific ioctls
NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and
upper limit of segments in bytes. This ioctl
is used by nilfs_resize utility.
============================== ===============================================
NILFS2 usage
============
To use nilfs2 as a local file system, simply:
To use nilfs2 as a local file system, simply::
# mkfs -t nilfs2 /dev/block_device
# mount -t nilfs2 /dev/block_device /dir
@ -157,18 +165,20 @@ This will also invoke the cleaner through the mount helper program
Checkpoints and snapshots are managed by the following commands.
Their manpages are included in the nilfs-utils package above.
==== ===========================================================
lscp list checkpoints or snapshots.
mkcp make a checkpoint or a snapshot.
chcp change an existing checkpoint to a snapshot or vice versa.
rmcp invalidate specified checkpoint(s).
==== ===========================================================
To mount a snapshot,
To mount a snapshot::
# mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir
where <cno> is the checkpoint number of the snapshot.
To unmount the NILFS2 mount point or snapshot, simply:
To unmount the NILFS2 mount point or snapshot, simply::
# umount /dir
@ -181,7 +191,7 @@ Disk format
A nilfs2 volume is equally divided into a number of segments except
for the super block (SB) and segment #0. A segment is the container
of logs. Each log is composed of summary information blocks, payload
blocks, and an optional super root block (SR):
blocks, and an optional super root block (SR)::
______________________________________________________
| |SB| | Segment | Segment | Segment | ... | Segment | |
@ -200,7 +210,7 @@ blocks, and an optional super root block (SR):
|_blocks__|_________________|__|
The payload blocks are organized per file, and each file consists of
data blocks and B-tree node blocks:
data blocks and B-tree node blocks::
|<--- File-A --->|<--- File-B --->|
_______________________________________________________________
@ -213,7 +223,7 @@ files without data blocks or B-tree node blocks.
The organization of the blocks is recorded in the summary information
blocks, which contains a header structure (nilfs_segment_summary), per
file structures (nilfs_finfo), and per block structures (nilfs_binfo):
file structures (nilfs_finfo), and per block structures (nilfs_binfo)::
_________________________________________________________________________
| Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |...
@ -223,7 +233,7 @@ file structures (nilfs_finfo), and per block structures (nilfs_binfo):
The logs include regular files, directory files, symbolic link files
and several meta data files. The meta data files are the files used
to maintain file system meta data. The current version of NILFS2 uses
the following meta data files:
the following meta data files::
1) Inode file (ifile) -- Stores on-disk inodes
2) Checkpoint file (cpfile) -- Stores checkpoints
@ -232,7 +242,7 @@ the following meta data files:
(DAT) block numbers. This file serves to
make on-disk blocks relocatable.
The following figure shows a typical organization of the logs:
The following figure shows a typical organization of the logs::
_________________________________________________________________________
| Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR|
@ -250,7 +260,7 @@ three special inodes, inodes for the DAT, cpfile, and sufile. Inodes
of regular files, directories, symlinks and other special files, are
included in the ifile. The inode of ifile itself is included in the
corresponding checkpoint entry in the cpfile. Thus, the hierarchy
among NILFS2 files can be depicted as follows:
among NILFS2 files can be depicted as follows::
Super block (SB)
|


@ -1,19 +1,21 @@
.. SPDX-License-Identifier: GPL-2.0
================================
The Linux NTFS filesystem driver
================================
Table of contents
=================
.. Table of contents
- Overview
- Web site
- Features
- Supported mount options
- Known bugs and (mis-)features
- Using NTFS volume and stripe sets
- The Device-Mapper driver
- The Software RAID / MD driver
- Limitations when using the MD driver
- Overview
- Web site
- Features
- Supported mount options
- Known bugs and (mis-)features
- Using NTFS volume and stripe sets
- The Device-Mapper driver
- The Software RAID / MD driver
- Limitations when using the MD driver
Overview
@ -66,8 +68,10 @@ Features
partition by creating a large file while in Windows and then loopback
mounting the file while in Linux and creating a Linux filesystem on it that
is used to install Linux on it.
- A comparison of the two drivers using:
- A comparison of the two drivers using::
time find . -type f -exec md5sum "{}" \;
run three times in sequence with each driver (after a reboot) on a 1.4GiB
NTFS partition, showed the new driver to be 20% faster in total time elapsed
(from 9:43 minutes on average down to 7:53). The time spent in user space
@ -104,6 +108,7 @@ In addition to the generic mount options described by the manual page for the
mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
following mount options:
======================= =======================================================
iocharset=name Deprecated option. Still supported but please use
nls=name in the future. See description for nls=name.
@ -175,16 +180,22 @@ disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
errors=opt What to do when critical filesystem errors are found.
Following values can be used for "opt":
continue: DEFAULT, try to clean-up as much as
======== =========================================
continue DEFAULT, try to clean-up as much as
possible, e.g. marking a corrupt inode as
bad so it is no longer accessed, and then
continue.
recover: At present only supported is recovery of
recover At present only supported is recovery of
the boot sector from the backup copy.
If read-only mount, the recovery is done
in memory only and not written to disk.
Note that the options are additive, i.e. specifying:
======== =========================================
Note that the options are additive, i.e. specifying::
errors=continue,errors=recover
means the driver will attempt to recover and if that
fails it will clean-up as much as possible and
continue.
@ -202,12 +213,18 @@ mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
In general use the default. If you have a lot of small
files then use a higher value. The values have the
following meaning:
===== =================================
Value MFT zone size (% of volume size)
===== =================================
1 12.5%
2 25%
3 37.5%
4 50%
===== =================================
Note this option is irrelevant for read-only mounts.
======================= =======================================================
Known bugs and (mis-)features
@ -252,18 +269,18 @@ To create the table describing your volume you will need to know each of its
components and their sizes in sectors, i.e. multiples of 512-byte blocks.
For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
example if one of your partitions is /dev/hda2 you would do:
example if one of your partitions is /dev/hda2 you would do::
$ fdisk -ul /dev/hda
$ fdisk -ul /dev/hda
Disk /dev/hda: 81.9 GB, 81964302336 bytes
255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk /dev/hda: 81.9 GB, 81964302336 bytes
255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 63 4209029 2104483+ 83 Linux
/dev/hda2 4209030 37768814 16779892+ 86 NTFS
/dev/hda3 37768815 46170809 4200997+ 83 Linux
Device Boot Start End Blocks Id System
/dev/hda1 * 63 4209029 2104483+ 83 Linux
/dev/hda2 4209030 37768814 16779892+ 86 NTFS
/dev/hda3 37768815 46170809 4200997+ 83 Linux
And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
33559785 sectors.
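The size calculation above (end - start + 1) can be sketched as a small helper that reads one partition row of the fdisk listing. The function name is hypothetical, and the parser assumes the column layout shown in the sample output, with an optional "*" boot flag in the second column.

```python
def sectors_from_fdisk_line(line):
    """Return (device, size_in_sectors) for one partition row of `fdisk -ul`.

    Assumes the columns: Device [Boot] Start End Blocks Id System.
    """
    fields = line.split()
    if fields[1] == "*":      # drop the optional boot flag
        del fields[1]
    start, end = int(fields[1]), int(fields[2])
    # Both the start and end sectors belong to the partition, hence the +1.
    return fields[0], end - start + 1
```

For the /dev/hda2 row shown earlier this gives the same 33559785 sectors as the manual calculation.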
@ -271,15 +288,17 @@ And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
For Win2k and later dynamic disks, you can for example use the ldminfo utility
which is part of the Linux LDM tools (the latest version at the time of
writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
http://www.linux-ntfs.org/
Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
able to compile this yourself easily so use the binary version!
Then you would use ldminfo in dump mode to obtain the necessary information:
Then you would use ldminfo in dump mode to obtain the necessary information::
$ ./ldminfo --dump /dev/hda
$ ./ldminfo --dump /dev/hda
This would dump the LDM database found on /dev/hda which describes all of your
dynamic disks and all the volumes on them. At the bottom you will see the
@ -305,42 +324,36 @@ give you the correct information to do this.
Assuming you know all your devices and their sizes things are easy.
For a linear raid the table would look like this (note all values are in
512-byte sectors):
512-byte sectors)::
--- cut here ---
# Offset into Size of this Raid type Device Start sector
# volume device of device
0 1028161 linear /dev/hda1 0
1028161 3903762 linear /dev/hdb2 0
4931923 2103211 linear /dev/hdc1 0
--- cut here ---
# Offset into Size of this Raid type Device Start sector
# volume device of device
0 1028161 linear /dev/hda1 0
1028161 3903762 linear /dev/hdb2 0
4931923 2103211 linear /dev/hdc1 0
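In the linear table, each row's offset into the volume is the sum of the sizes of all preceding components. A minimal sketch that generates the rows from a list of (device, size in sectors) pairs follows; the function name is hypothetical and every component is assumed to start at sector 0 of its device, as in the example.

```python
def linear_table(components):
    """Build device-mapper 'linear' table rows.

    components: list of (device_path, size_in_512_byte_sectors), in volume order.
    """
    rows = []
    offset = 0
    for device, size in components:
        # Columns: offset into volume, size, raid type, device, start sector.
        rows.append(f"{offset} {size} linear {device} 0")
        offset += size    # the next component begins where this one ends
    return rows


if __name__ == "__main__":
    parts = [("/dev/hda1", 1028161), ("/dev/hdb2", 3903762), ("/dev/hdc1", 2103211)]
    print("\n".join(linear_table(parts)))
```

Run on the three components of the example, this reproduces the offsets 0, 1028161 and 4931923.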
For a striped volume, i.e. raid level 0, you will need to know the chunk size
you used when creating the volume. Windows uses 64kiB as the default, so it
will probably be this unless you changed the defaults when creating the array.
For a raid level 0 the table would look like this (note all values are in
512-byte sectors):
512-byte sectors)::
--- cut here ---
# Offset Size Raid Number Chunk 1st Start 2nd Start
# into of the type of size Device in Device in
# volume volume stripes device device
0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
--- cut here ---
# Offset Size Raid Number Chunk 1st Start 2nd Start
# into of the type of size Device in Device in
# volume volume stripes device device
0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
If there are more than two devices, just add each of them to the end of the
line.
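The chunk size in the striped table is given in 512-byte sectors, so the Windows default of 64kiB becomes 128. A minimal sketch that converts the chunk size and builds the row, appending one device and start sector pair per stripe, is shown below. The function name and its parameters are hypothetical, and every device is assumed to start at sector 0.

```python
SECTOR_BYTES = 512


def striped_table(volume_sectors, devices, chunk_kib=64):
    """Build a device-mapper 'striped' table row.

    volume_sectors: total size of the volume in 512-byte sectors.
    devices: list of device paths, one per stripe.
    chunk_kib: chunk size in KiB (64 is the Windows default).
    """
    chunk_sectors = chunk_kib * 1024 // SECTOR_BYTES
    fields = ["0", str(volume_sectors), "striped", str(len(devices)), str(chunk_sectors)]
    for device in devices:
        fields += [device, "0"]    # device followed by its start sector
    return " ".join(fields)
```

Passing a longer device list simply extends the row, which matches the instruction to add further devices to the end of the line.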
Finally, for a mirrored volume, i.e. raid level 1, the table would look like
this (note all values are in 512-byte sectors):
this (note all values are in 512-byte sectors)::
--- cut here ---
# Ofs Size Raid Log Number Region Should Number Source Start Target Start
# in of the type type of log size sync? of Device in Device in
# vol volume params mirrors Device Device
0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
--- cut here ---
# Ofs Size Raid Log Number Region Should Number Source Start Target Start
# in of the type type of log size sync? of Device in Device in
# vol volume params mirrors Device Device
0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
If you are mirroring to multiple devices you can specify further targets at the
end of the line.
@ -353,17 +366,17 @@ to the "Target Device" or if you specified multiple target devices to all of
them.
Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
and hand it over to dmsetup to work with, like so:
and hand it over to dmsetup to work with, like so::
$ dmsetup create myvolume1 /etc/ntfsvolume1
$ dmsetup create myvolume1 /etc/ntfsvolume1
You can obviously replace "myvolume1" with whatever name you like.
If it all worked, you will now have the device /dev/device-mapper/myvolume1
which you can then just use as an argument to the mount command as usual to
mount the ntfs volume. For example:
mount the ntfs volume. For example::
$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
(You need to create the directory /mnt/myvol1 first and of course you can use
anything you like instead of /mnt/myvol1 as long as it is an existing
@ -395,18 +408,18 @@ Windows by default uses a stripe chunk size of 64k, so you probably want the
"chunk-size 64k" option for each raid-disk, too.
For example, if you have a stripe set consisting of two partitions /dev/hda5
and /dev/hdb1 your /etc/raidtab would look like this:
and /dev/hdb1 your /etc/raidtab would look like this::
raiddev /dev/md0
raid-level 0
nr-raid-disks 2
nr-spare-disks 0
persistent-superblock 0
chunk-size 64k
device /dev/hda5
raid-disk 0
device /dev/hdb1
raid-disk 1
raiddev /dev/md0
raid-level 0
nr-raid-disks 2
nr-spare-disks 0
persistent-superblock 0
chunk-size 64k
device /dev/hda5
raid-disk 0
device /dev/hdb1
raid-disk 1
For linear raid, just change the raid-level above to "raid-level linear", for
mirrors, change it to "raid-level 1", and for stripe sets with parity, change
@ -427,7 +440,9 @@ Once the raidtab is setup, run for example raid0run -a to start all devices or
raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
Then just use the mount command as usual to mount the ntfs volume using for
example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
example::
mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
It is advisable to do the mount read-only to see if the md volume has been
setup correctly to avoid the possibility of causing damage to the data on the


@ -1,5 +1,8 @@
OCFS2 online file check
-----------------------
.. SPDX-License-Identifier: GPL-2.0
=====================================
OCFS2 file system - online file check
=====================================
This document describes the OCFS2 online file check feature.
@ -40,7 +43,7 @@ When there are errors in the OCFS2 filesystem, they are usually accompanied
by the inode number which caused the error. This inode number would be the
input to check/fix the file.
There is a sysfs directory for each OCFS2 file system mounting:
There is a sysfs directory for each OCFS2 file system mounting::
/sys/fs/ocfs2/<devname>/filecheck
@ -50,34 +53,36 @@ communicate with kernel space, tell which file(inode number) will be checked or
fixed. Currently, three operations are supported: checking an
inode, fixing an inode, and setting the size of the result record history.
1. If you want to know what error exactly happened to <inode> before fixing, do
1. If you want to know what error exactly happened to <inode> before fixing, do::
# echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check
# cat /sys/fs/ocfs2/<devname>/filecheck/check
# echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check
# cat /sys/fs/ocfs2/<devname>/filecheck/check
The output is like this:
INO DONE ERROR
39502 1 GENERATION
The output is like this::
<INO> lists the inode numbers.
<DONE> indicates whether the operation has been finished.
<ERROR> says what kind of error was found. For the detailed error numbers,
please refer to the file linux/fs/ocfs2/filecheck.h.
INO DONE ERROR
39502 1 GENERATION
2. If you decide to fix this inode, do
<INO> lists the inode numbers.
<DONE> indicates whether the operation has been finished.
<ERROR> says what kind of error was found. For the detailed error numbers,
please refer to the file linux/fs/ocfs2/filecheck.h.
# echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix
# cat /sys/fs/ocfs2/<devname>/filecheck/fix
2. If you decide to fix this inode, do::
The output is like this:
INO DONE ERROR
39502 1 SUCCESS
# echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix
# cat /sys/fs/ocfs2/<devname>/filecheck/fix
The output is like this::
INO DONE ERROR
39502 1 SUCCESS
This time, the <ERROR> column indicates whether this fix is successful or not.
3. The record cache is used to store the history of check/fix results. Its
default size is 10, and it can be adjusted within the range of 10 to 100. You can
adjust the size like this:
adjust the size like this::
# echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set
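The check and fix steps can also be driven from a script. The sketch below is one possible wrapper, not an official tool: it writes an inode number to the `check` file described above and parses the three-column result. It assumes the columns are whitespace-separated and that a DONE value of 1 means the operation has finished; the function names are hypothetical.

```python
def parse_filecheck(text):
    """Parse the INO/DONE/ERROR table printed by the filecheck sysfs files."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] == "INO":
            continue  # skip the header row and anything malformed
        ino, done, error = fields
        records.append({"ino": int(ino), "done": done == "1", "error": error})
    return records


def check_inode(devname, inode):
    """Queue a check for one inode and return the parsed history (needs root)."""
    path = f"/sys/fs/ocfs2/{devname}/filecheck/check"
    with open(path, "w") as f:
        f.write(str(inode))
    with open(path) as f:
        return parse_filecheck(f.read())


# Parsing the sample output from the text; no OCFS2 mount is needed for this.
sample = "INO\t\tDONE\tERROR\n39502\t\t1\tGENERATION\n"
print(parse_filecheck(sample))
```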


@ -1,5 +1,9 @@
.. SPDX-License-Identifier: GPL-2.0
================
OCFS2 filesystem
==================
================
OCFS2 is a general purpose extent based shared disk cluster file
system with many similarities to ext3. It supports 64 bit inode
numbers, and has automatically extending metadata groups which may
@ -14,22 +18,26 @@ OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
All code copyright 2005 Oracle except when otherwise noted.
CREDITS:
Credits
=======
Lots of code taken from ext3 and other projects.
Authors in alphabetical order:
Joel Becker <joel.becker@oracle.com>
Zach Brown <zach.brown@oracle.com>
Mark Fasheh <mfasheh@suse.com>
Kurt Hackel <kurt.hackel@oracle.com>
Tao Ma <tao.ma@oracle.com>
Sunil Mushran <sunil.mushran@oracle.com>
Manish Singh <manish.singh@oracle.com>
Tiger Yang <tiger.yang@oracle.com>
- Joel Becker <joel.becker@oracle.com>
- Zach Brown <zach.brown@oracle.com>
- Mark Fasheh <mfasheh@suse.com>
- Kurt Hackel <kurt.hackel@oracle.com>
- Tao Ma <tao.ma@oracle.com>
- Sunil Mushran <sunil.mushran@oracle.com>
- Manish Singh <manish.singh@oracle.com>
- Tiger Yang <tiger.yang@oracle.com>
Caveats
=======
Features which OCFS2 does not support yet:
- Directory change notification (F_NOTIFY)
- Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
@ -37,8 +45,10 @@ Mount options
=============
OCFS2 supports the following mount options:
(*) == default
======================= ========================================================
barrier=1 This enables/disables barriers. barrier=0 disables it,
barrier=1 enables it.
errors=remount-ro(*) Remount the filesystem read-only on an error.
@ -104,3 +114,4 @@ journal_async_commit Commit block can be written to disk without waiting
for descriptor blocks. If enabled older kernels cannot
mount the device. This will enable 'journal_checksum'
internally.
======================= ========================================================


@ -0,0 +1,112 @@
.. SPDX-License-Identifier: GPL-2.0
================================
Optimized MPEG Filesystem (OMFS)
================================
Overview
========
OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR
and Rio Karma MP3 player. The filesystem is extent-based, utilizing
block sizes from 2k to 8k, with hash-based directories. This
filesystem driver may be used to read and write disks from these
devices.
Note, it is not recommended that this FS be used in place of a general
filesystem for your own streaming media device. Native Linux filesystems
will likely perform better.
More information is available at:
http://linux-karma.sf.net/
Various utilities, including mkomfs and omfsck, are included with
omfsprogs, available at:
http://bobcopeland.com/karma/
Instructions are included in its README.
Options
=======
OMFS supports the following mount-time options:
============ ========================================
uid=n make all files owned by specified user
gid=n make all files owned by specified group
umask=xxx set permission umask to xxx
fmask=xxx set umask to xxx for files
dmask=xxx set umask to xxx for directories
============ ========================================
Disk format
===========
OMFS discriminates between "sysblocks" and normal data blocks. The sysblock
group consists of super block information, file metadata, directory structures,
and extents. Each sysblock has a header containing CRCs of the entire
sysblock, and may be mirrored in successive blocks on the disk. A sysblock may
have a smaller size than a data block, but since they are both addressed by the
same 64-bit block number, any remaining space in the smaller sysblock is
unused.
Sysblock header information::
struct omfs_header {
__be64 h_self; /* FS block where this is located */
__be32 h_body_size; /* size of useful data after header */
__be16 h_crc; /* crc-ccitt of body_size bytes */
char h_fill1[2];
u8 h_version; /* version, always 1 */
char h_type; /* OMFS_INODE_X */
u8 h_magic; /* OMFS_IMAGIC */
u8 h_check_xor; /* XOR of header bytes before this */
__be32 h_fill2;
};
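As an illustration of the header layout, the following sketch unpacks the 24-byte structure and recomputes h_check_xor as the XOR of the header bytes that precede it. It assumes the structure is packed without padding and treats h_type as an unsigned byte; the field values in the synthetic header are made up and do not correspond to real OMFS constants.

```python
import struct
from functools import reduce

# Big-endian, no padding: h_self, h_body_size, h_crc, h_fill1[2],
# h_version, h_type, h_magic, h_check_xor, h_fill2
OMFS_HEADER = struct.Struct(">QIH2sBBBBI")
CHECK_XOR_OFFSET = 19  # number of bytes that precede h_check_xor


def header_xor(raw):
    """XOR together the header bytes that come before h_check_xor."""
    return reduce(lambda a, b: a ^ b, raw[:CHECK_XOR_OFFSET], 0)


def parse_header(raw):
    (h_self, body_size, crc, _fill1,
     version, h_type, magic, check_xor, _fill2) = OMFS_HEADER.unpack_from(raw)
    return {
        "h_self": h_self, "h_body_size": body_size, "h_crc": crc,
        "h_version": version, "h_type": h_type, "h_magic": magic,
        "xor_ok": check_xor == header_xor(raw),
    }


# Build a synthetic header (values are invented for illustration only).
raw = bytearray(OMFS_HEADER.pack(7, 100, 0xBEEF, b"\0\0", 1, 0x44, 0x00, 0, 0))
raw[CHECK_XOR_OFFSET] = header_xor(raw)
print(OMFS_HEADER.size, parse_header(bytes(raw)))
```

The h_crc field is not verified here, since that would require the sysblock body as well.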
Files and directories are both represented by omfs_inode::
struct omfs_inode {
struct omfs_header i_head; /* header */
__be64 i_parent; /* parent containing this inode */
__be64 i_sibling; /* next inode in hash bucket */
__be64 i_ctime; /* ctime, in milliseconds */
char i_fill1[35];
char i_type; /* OMFS_[DIR,FILE] */
__be32 i_fill2;
char i_fill3[64];
char i_name[OMFS_NAMELEN]; /* filename */
__be64 i_size; /* size of file, in bytes */
};
Directories in OMFS are implemented as a large hash table. Filenames are
hashed then prepended into the bucket list beginning at OMFS_DIR_START.
Lookup requires hashing the filename, then seeking across i_sibling pointers
until a match is found on i_name. Empty buckets are represented by block
pointers with all-1s (~0).
A file is an omfs_inode structure followed by an extent table beginning at
OMFS_EXTENT_START::
struct omfs_extent_entry {
__be64 e_cluster; /* start location of a set of blocks */
__be64 e_blocks; /* number of blocks after e_cluster */
};
struct omfs_extent {
__be64 e_next; /* next extent table location */
__be32 e_extent_count; /* total # extents in this table */
__be32 e_fill;
struct omfs_extent_entry e_entry; /* start of extent entries */
};
Each extent holds the block offset followed by number of blocks allocated to
the extent. The final extent in each table is a terminator with e_cluster
being ~0 and e_blocks being ones'-complement of the total number of blocks
in the table.
If this table overflows, a continuation inode is written and pointed to by
e_next. These have a header but lack the rest of the inode structure.
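The terminator convention can be sketched as follows. This assumes that "the total number of blocks in the table" means the sum of e_blocks over the entries before the terminator, and it models on-disk fields as 64-bit unsigned integers; the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # on-disk fields are 64-bit


def terminator(entries):
    """Return the (e_cluster, e_blocks) terminator for a list of extents.

    entries: list of (e_cluster, e_blocks) pairs, terminator excluded.
    """
    total = sum(blocks for _cluster, blocks in entries)
    # e_cluster is ~0; e_blocks is the ones' complement of the block total.
    return (MASK64, ~total & MASK64)


def total_blocks(table):
    """Recover the block total from a table whose last entry is the terminator."""
    e_cluster, e_blocks = table[-1]
    if e_cluster != MASK64:
        raise ValueError("table does not end with a terminator")
    return ~e_blocks & MASK64


extents = [(100, 8), (300, 4)]
table = extents + [terminator(extents)]
print(hex(table[-1][1]), total_blocks(table))
```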


@ -1,106 +0,0 @@
Optimized MPEG Filesystem (OMFS)
Overview
========
OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR
and Rio Karma MP3 player. The filesystem is extent-based, utilizing
block sizes from 2k to 8k, with hash-based directories. This
filesystem driver may be used to read and write disks from these
devices.
Note, it is not recommended that this FS be used in place of a general
filesystem for your own streaming media device. Native Linux filesystems
will likely perform better.
More information is available at:
http://linux-karma.sf.net/
Various utilities, including mkomfs and omfsck, are included with
omfsprogs, available at:
http://bobcopeland.com/karma/
Instructions are included in its README.
Options
=======
OMFS supports the following mount-time options:
uid=n - make all files owned by specified user
gid=n - make all files owned by specified group
umask=xxx - set permission umask to xxx
fmask=xxx - set umask to xxx for files
dmask=xxx - set umask to xxx for directories
Disk format
===========
OMFS discriminates between "sysblocks" and normal data blocks. The sysblock
group consists of super block information, file metadata, directory structures,
and extents. Each sysblock has a header containing CRCs of the entire
sysblock, and may be mirrored in successive blocks on the disk. A sysblock may
have a smaller size than a data block, but since they are both addressed by the
same 64-bit block number, any remaining space in the smaller sysblock is
unused.
Sysblock header information:
struct omfs_header {
__be64 h_self; /* FS block where this is located */
__be32 h_body_size; /* size of useful data after header */
__be16 h_crc; /* crc-ccitt of body_size bytes */
char h_fill1[2];
u8 h_version; /* version, always 1 */
char h_type; /* OMFS_INODE_X */
u8 h_magic; /* OMFS_IMAGIC */
u8 h_check_xor; /* XOR of header bytes before this */
__be32 h_fill2;
};
Files and directories are both represented by omfs_inode:
struct omfs_inode {
struct omfs_header i_head; /* header */
__be64 i_parent; /* parent containing this inode */
__be64 i_sibling; /* next inode in hash bucket */
__be64 i_ctime; /* ctime, in milliseconds */
char i_fill1[35];
char i_type; /* OMFS_[DIR,FILE] */
__be32 i_fill2;
char i_fill3[64];
char i_name[OMFS_NAMELEN]; /* filename */
__be64 i_size; /* size of file, in bytes */
};
Directories in OMFS are implemented as a large hash table. Filenames are
hashed then prepended into the bucket list beginning at OMFS_DIR_START.
Lookup requires hashing the filename, then seeking across i_sibling pointers
until a match is found on i_name. Empty buckets are represented by block
pointers with all-1s (~0).
A file is an omfs_inode structure followed by an extent table beginning at
OMFS_EXTENT_START:
struct omfs_extent_entry {
__be64 e_cluster; /* start location of a set of blocks */
__be64 e_blocks; /* number of blocks after e_cluster */
};
struct omfs_extent {
__be64 e_next; /* next extent table location */
__be32 e_extent_count; /* total # extents in this table */
__be32 e_fill;
struct omfs_extent_entry e_entry; /* start of extent entries */
};
Each extent holds the block offset followed by number of blocks allocated to
the extent. The final extent in each table is a terminator with e_cluster
being ~0 and e_blocks being ones'-complement of the total number of blocks
in the table.
If this table overflows, a continuation inode is written and pointed to by
e_next. These have a header but lack the rest of the inode structure.


@ -1,3 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
========
ORANGEFS
========
@ -21,25 +24,25 @@ Orangefs features include:
* Stateless
MAILING LIST ARCHIVES
Mailing List Archives
=====================
http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/
MAILING LIST SUBMISSIONS
Mailing List Submissions
========================
devel@lists.orangefs.org
DOCUMENTATION
Documentation
=============
http://www.orangefs.org/documentation/
USERSPACE FILESYSTEM SOURCE
Userspace Filesystem Source
===========================
http://www.orangefs.org/download
@ -48,16 +51,16 @@ Orangefs versions prior to 2.9.3 would not be compatible with the
upstream version of the kernel client.
RUNNING ORANGEFS ON A SINGLE SERVER
Running ORANGEFS On a Single Server
===================================
OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.
On Fedora, install orangefs and orangefs-server.
On Fedora, install orangefs and orangefs-server::
dnf -y install orangefs orangefs-server
dnf -y install orangefs orangefs-server
There is an example server configuration file in
/etc/orangefs/orangefs.conf. Change localhost to your hostname if
@ -70,29 +73,29 @@ single line. Uncomment it and change the hostname if necessary. This
controls clients which use libpvfs2. This does not control the
pvfs2-client-core.
Create the filesystem.
Create the filesystem::
pvfs2-server -f /etc/orangefs/orangefs.conf
pvfs2-server -f /etc/orangefs/orangefs.conf
Start the server.
Start the server::
systemctl start orangefs-server
systemctl start orangefs-server
Test the server.
Test the server::
pvfs2-ping -m /pvfsmnt
pvfs2-ping -m /pvfsmnt
Start the client. The module must be compiled in or loaded before this
point.
point::
systemctl start orangefs-client
systemctl start orangefs-client
Mount the filesystem.
Mount the filesystem::
mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
BUILDING ORANGEFS ON A SINGLE SERVER
Building ORANGEFS on a Single Server
====================================
Where OrangeFS cannot be installed from distribution packages, it may be
@ -102,49 +105,51 @@ You can omit --prefix if you don't care that things are sprinkled around
in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
default; we will probably be changing the default to LMDB soon.
./configure --prefix=/opt/ofs --with-db-backend=lmdb
::
make
./configure --prefix=/opt/ofs --with-db-backend=lmdb
make install
make
Create an orangefs config file.
make install
/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
Create an orangefs config file::
Create an /etc/pvfs2tab file.
/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
/etc/pvfs2tab
Create an /etc/pvfs2tab file::
Create the mount point you specified in the tab file if needed.
echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
/etc/pvfs2tab
mkdir /pvfsmnt
Create the mount point you specified in the tab file if needed::
Bootstrap the server.
mkdir /pvfsmnt
/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
Bootstrap the server::
Start the server.
/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
/opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf
Start the server::
/opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf
Now the server should be running. Pvfs2-ls is a simple
test to verify that the server is running.
test to verify that the server is running::
/opt/ofs/bin/pvfs2-ls /pvfsmnt
/opt/ofs/bin/pvfs2-ls /pvfsmnt
If stuff seems to be working, load the kernel module and
turn on the client core.
turn on the client core::
/opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core
/opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core
Mount your filesystem.
Mount your filesystem::
mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
RUNNING XFSTESTS
Running xfstests
================
It is useful to use a scratch filesystem with xfstests. This can be
@ -159,21 +164,23 @@ Then there are two FileSystem sections: orangefs and scratch.
This change should be made before creating the filesystem.
pvfs2-server -f /etc/orangefs/orangefs.conf
::
To run xfstests, create /etc/xfsqa.config.
pvfs2-server -f /etc/orangefs/orangefs.conf
TEST_DIR=/orangefs
TEST_DEV=tcp://localhost:3334/orangefs
SCRATCH_MNT=/scratch
SCRATCH_DEV=tcp://localhost:3334/scratch
To run xfstests, create /etc/xfsqa.config::
Then xfstests can be run
TEST_DIR=/orangefs
TEST_DEV=tcp://localhost:3334/orangefs
SCRATCH_MNT=/scratch
SCRATCH_DEV=tcp://localhost:3334/scratch
./check -pvfs2
Then xfstests can be run::
./check -pvfs2
OPTIONS
Options
=======
The following mount options are accepted:
@ -193,32 +200,32 @@ The following mount options are accepted:
Distributed locking is being worked on for the future.
DEBUGGING
Debugging
=========
If you want the debug (GOSSIP) statements in a particular
source file (inode.c for example) go to syslog:
source file (inode.c for example) go to syslog::
echo inode > /sys/kernel/debug/orangefs/kernel-debug
No debugging (the default):
No debugging (the default)::
echo none > /sys/kernel/debug/orangefs/kernel-debug
Debugging from several source files:
Debugging from several source files::
echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug
All debugging:
All debugging::
echo all > /sys/kernel/debug/orangefs/kernel-debug
Get a list of all debugging keywords:
Get a list of all debugging keywords::
cat /sys/kernel/debug/orangefs/debug-help
PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE
Protocol between Kernel Module and Userspace
============================================
Orangefs is a user space filesystem and an associated kernel module.
@ -234,7 +241,8 @@ The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.
THE BUFMAP:
The Bufmap
----------
At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers, one is used for IO and one is used for readdir
@ -250,7 +258,8 @@ copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:
* refcnt - a reference counter
* refcnt
- a reference counter
* desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
partition size, which represents the filesystem's block size and
is used for s_blocksize in super blocks.
@ -259,17 +268,19 @@ then contains:
* desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
* total_size - the total size of the IO buffer.
* page_count - the number of 4096 byte pages in the IO buffer.
* page_array - a pointer to page_count * (sizeof(struct page*)) bytes
* page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
of kcalloced memory. This memory is used as an array of pointers
to each of the pages in the IO buffer through a call to get_user_pages.
* desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc))
* desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
bytes of kcalloced memory. This memory is further initialized:
user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
structure. user_desc->ptr points to the IO buffer.
pages_per_desc = bufmap->desc_size / PAGE_SIZE
offset = 0
::
pages_per_desc = bufmap->desc_size / PAGE_SIZE
offset = 0
bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
bufmap->desc_array[0].array_count = pages_per_desc = 1024
@ -293,7 +304,8 @@ then contains:
* readdir_index_lock - a spinlock to protect readdir_index_array during
update.
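The derived bufmap values follow directly from the default descriptor size. The sketch below reproduces that arithmetic in Python rather than kernel C. DESC_COUNT is a hypothetical value chosen for illustration, and two things are assumptions rather than statements from the text: that total_size equals desc_size times desc_count, and that each descriptor's slice of page_array starts pages_per_desc entries after the previous one.

```python
PAGE_SIZE = 4096          # page size stated in the text
DESC_SIZE = 4194304       # PVFS2_BUFMAP_DEFAULT_DESC_SIZE
DESC_COUNT = 5            # hypothetical; the real value comes from userspace

desc_shift = DESC_SIZE.bit_length() - 1      # log2 of a power of two
pages_per_desc = DESC_SIZE // PAGE_SIZE
total_size = DESC_SIZE * DESC_COUNT          # assumed relationship
page_count = total_size // PAGE_SIZE

# Assumed layout: each descriptor owns a contiguous slice of page_array.
desc_array = []
offset = 0
for _ in range(DESC_COUNT):
    desc_array.append({"page_offset": offset, "array_count": pages_per_desc})
    offset += pages_per_desc

print(desc_shift, pages_per_desc, page_count, desc_array[1])
```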
OPERATIONS:
Operations
----------
The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
@ -308,13 +320,19 @@ in flight at any given time.
Ops are stateful:
* unknown - op was just initialized
* waiting - op is on request_list (upward bound)
* inprogr - op is in progress (waiting for downcall)
* serviced - op has matching downcall; ok
* purged - op has to start a timer since client-core
* unknown
- op was just initialized
* waiting
- op is on request_list (upward bound)
* inprogr
- op is in progress (waiting for downcall)
* serviced
- op has matching downcall; ok
* purged
- op has to start a timer since client-core
exited uncleanly before servicing op
* given up - submitter has given up waiting for it
* given up
- submitter has given up waiting for it
When some arbitrary userspace program needs to perform a
filesystem operation on Orangefs (readdir, I/O, create, whatever)
@ -389,10 +407,15 @@ union of structs, each of which is associated with a particular
response type.
The several members outside of the union are:
- int32_t type - type of operation.
- int32_t status - return code for the operation.
- int64_t trailer_size - 0 unless readdir operation.
- char *trailer_buf - initialized to NULL, used during readdir operations.
``int32_t type``
- type of operation.
``int32_t status``
- return code for the operation.
``int64_t trailer_size``
- 0 unless readdir operation.
``char *trailer_buf``
- initialized to NULL, used during readdir operations.
The appropriate member inside the union is filled out for any
particular response.
@ -449,18 +472,20 @@ Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.
A buffer_list containing:
- a pointer to the prepared response to the request from the
kernel (struct pvfs2_downcall_t).
- and also, in the case of a readdir request, a pointer to a
buffer containing descriptors for the objects in the target
directory.
... is sent to the function (PINT_dev_write_list) which performs
the writev.
PINT_dev_write_list has a local iovec array: struct iovec io_array[10];
The first four elements of io_array are initialized like this for all
responses:
responses::
io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
io_array[0].iov_len = sizeof(int32_t)
@ -475,7 +500,7 @@ responses:
of global variable vfs_request (vfs_request_t)
io_array[3].iov_len = sizeof(pvfs2_downcall_t)
Readdir responses initialize the fifth element io_array like this:
Readdir responses initialize the fifth element io_array like this::
io_array[4].iov_base = contents of member trailer_buf (char *)
from out_downcall member of global variable
@ -517,13 +542,13 @@ from a dentry is cheap, obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.
The timeout values d_time and getattr_time are jiffy based, and the
code is designed to avoid the jiffy-wrap problem:
code is designed to avoid the jiffy-wrap problem::
"In general, if the clock may have wrapped around more than once, there
is no way to tell how much time has elapsed. However, if the times t1
and t2 are known to be fairly close, we can reliably compute the
difference in a way that takes into account the possibility that the
clock may have wrapped between times."
"In general, if the clock may have wrapped around more than once, there
is no way to tell how much time has elapsed. However, if the times t1
and t2 are known to be fairly close, we can reliably compute the
difference in a way that takes into account the possibility that the
clock may have wrapped between times."
from course notes by instructor Andy Wang
from course notes by instructor Andy Wang
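The idea in the quotation can be shown with modular arithmetic: when two tick values are close together, their difference taken modulo the counter width and reinterpreted as a signed number is correct even if the counter wrapped in between. This is a Python sketch of the technique, not the kernel code; the 32-bit width is an assumption for the illustration, and `time_after` here is only modeled on the kernel macro of the same name.

```python
BITS = 32                      # assumed counter width for this illustration
MASK = (1 << BITS) - 1


def signed_diff(t2, t1):
    """Return t2 - t1 as a signed value, correct across one wraparound."""
    diff = (t2 - t1) & MASK
    if diff >= 1 << (BITS - 1):
        diff -= 1 << BITS
    return diff


def time_after(a, b):
    """True if tick a is later than tick b (the two must be close together)."""
    return signed_diff(a, b) > 0


before_wrap = MASK - 5         # 5 ticks before the counter wraps
after_wrap = 10                # 10 ticks after it wrapped
print(signed_diff(after_wrap, before_wrap), time_after(after_wrap, before_wrap))
```

A plain `after_wrap > before_wrap` comparison would give the wrong answer for these two values, which is the problem the jiffy-based timeouts have to avoid.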


@ -1,3 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
===================
The QNX6 Filesystem
===================
@ -14,10 +17,12 @@ Specification
qnx6fs shares many properties with traditional Unix filesystems. It has the
concepts of blocks, inodes and directories.
On QNX it is possible to create little endian and big endian qnx6 filesystems.
This feature makes it possible to create, on one machine, a filesystem for a
target platform of a different endianness (QNX is used on quite a range of
embedded systems).
The Linux driver handles both little endian and big endian filesystems transparently.
Blocks
@ -26,6 +31,7 @@ Blocks
The space in the device or file is split up into blocks. These are a fixed
size of 512, 1024, 2048 or 4096, which is decided when the filesystem is
created.
Blockpointers are 32bit, so the maximum space that can be addressed is
2^32 * 4096 bytes, or 16 TiB.
@ -50,6 +56,7 @@ Each of these root nodes holds information like total size of the stored
data and the addressing levels in that specific tree.
If the level value is 0, up to 16 direct blocks can be addressed by each
node.
Level 1 adds an additional indirect addressing level where each indirect
addressing block holds up to blocksize / 4 pointers (4 bytes each) to data blocks.
Level 2 adds an additional indirect addressing block level (so, already up
@ -57,11 +64,13 @@ to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree).
Unused block pointers are always set to ~0 - regardless of root node,
indirect addressing blocks or inodes.
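The addressing limits can be checked with a few lines of arithmetic. This sketch assumes 4-byte block pointers and 16 direct pointers per root node, as stated above; the function name is hypothetical. The 16 * 256 * 256 figure corresponds to a 1024-byte block size, which gives 256 pointers per indirect block.

```python
POINTER_SIZE = 4        # block pointers are 32 bits
DIRECT_POINTERS = 16    # pointers held directly in a root node


def addressable_blocks(levels, blocksize):
    """Blocks reachable from one root node with the given indirection depth."""
    pointers_per_block = blocksize // POINTER_SIZE
    return DIRECT_POINTERS * pointers_per_block ** levels


print(addressable_blocks(0, 1024))   # direct blocks only
print(addressable_blocks(2, 1024))   # the 16 * 256 * 256 case
print(2**32 * 4096 // 2**40)         # maximum size in TiB with 4096-byte blocks
```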
Data leaves are always on the lowest level. So no data is stored on upper
tree levels.
The first Superblock is located at 0x2000. (0x2000 is the bootblock size)
The Audi MMI 3G first superblock directly starts at byte 0.
Second superblock position can either be calculated from the superblock
information (total number of filesystem blocks) or by taking the highest
device address, zeroing the last 3 bytes and then subtracting 0x1000 from
@ -84,6 +93,7 @@ Object mode field is POSIX format. (which makes things easier)
There are also pointers to the first 16 blocks, if the object data can be
addressed with 16 direct blocks.
For more than 16 blocks an indirect addressing in form of another tree is
used. (scheme is the same as the one used for the superblock root nodes)
@ -96,13 +106,18 @@ Directories
A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate each
name with an inode number.
'.' inode number points to the directory inode
'..' inode number points to the parent directory inode
Each filename record additionally has a filename length field.
Long filenames or subdirectory names are a special case.
Their directory record has the filename length field set to 0xff and also
stores the longfile inode number.
With that longfilename inode number, the longfilename tree can be walked
starting with the superblock longfilename root node pointers.
@ -111,6 +126,7 @@ Special files
Symbolic links are also filesystem objects with inodes. They have a specific
bit in the inode mode field identifying them as symbolic links.
The directory entry file inode pointer points to the target file inode.
Hard links got an inode, a directory entry, but a specific mode bit set,
@ -126,9 +142,11 @@ Long filenames
Long filenames are stored in a separate addressing tree. The starting point
is the longfilename root node in the active superblock.
Each data block (tree leaves) holds one long filename. That filename is
limited to 510 bytes. The first two starting bytes are used as length field
for the actual filename.
If that structure shall fit for all allowed blocksizes, it is clear why there
is a limit of 510 bytes for the actual filename stored.
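The 510-byte limit follows from the smallest block size: 512 bytes minus the 2-byte length field. The sketch below reads such a block. It assumes the length field uses the filesystem's endianness and that unused bytes are zero-padded; neither detail is stated in the text, and the function name is hypothetical.

```python
import struct

MAX_LONG_NAME = 510   # 512-byte minimum block size minus the 2-byte length


def parse_long_name(block, little_endian=True):
    """Extract the filename from a long-filename data block."""
    fmt = "<H" if little_endian else ">H"
    (length,) = struct.unpack_from(fmt, block)
    if length > MAX_LONG_NAME:
        raise ValueError("length field exceeds the 510-byte limit")
    return block[2:2 + length]


# Synthetic block: 2-byte length, 300-byte name, zero padding up to 512 bytes.
name = b"a" * 300
block = (struct.pack("<H", len(name)) + name).ljust(512, b"\0")
print(len(block), parse_long_name(block) == name)
```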
@ -138,6 +156,7 @@ Bitmap
The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap
root node in the superblock and each bit in the bitmap represents one
filesystem block.
The first block is block 0, which starts 0x1000 after superblock start.
So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical
address at which block 0 is located.
@ -149,11 +168,14 @@ Bitmap system area
------------------
The bitmap itself is divided into three parts.
First the system area, that is split into two halves.
Then userspace.
The requirement for a static, fixed preallocated system area comes from how
qnx6fs deals with writes.
Each superblock has its own half of the system area. So superblock #1
always uses blocks from the lower half while superblock #2 just writes to
blocks represented by the upper half bitmap system area bits.


@ -1,5 +1,11 @@
ramfs, rootfs and initramfs
.. SPDX-License-Identifier: GPL-2.0
===========================
Ramfs, rootfs and initramfs
===========================
October 17, 2005
Rob Landley <rob@landley.net>
=============================
@ -99,14 +105,14 @@ out of that.
All this differs from the old initrd in several ways:
- The old initrd was always a separate file, while the initramfs archive is
linked into the linux kernel image. (The directory linux-*/usr is devoted
to generating this archive during the build.)
linked into the linux kernel image. (The directory ``linux-*/usr`` is
devoted to generating this archive during the build.)
- The old initrd file was a gzipped filesystem image (in some file format,
such as ext2, that needed a driver built into the kernel), while the new
initramfs archive is a gzipped cpio archive (like tar only simpler,
see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst). The
kernel's cpio extraction code is not only extremely small, it's also
see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst).
The kernel's cpio extraction code is not only extremely small, it's also
__init text and data that can be discarded during the boot process.
- The program run by the old initrd (which was called /initrd, not /init) did
@ -139,7 +145,7 @@ and living in usr/Kconfig) can be used to specify a source for the
initramfs archive, which will automatically be incorporated into the
resulting binary. This option can point to an existing gzipped cpio
archive, a directory containing files to be archived, or a text file
specification such as the following example:
specification such as the following example::
dir /dev 755 0 0
nod /dev/console 644 0 0 c 5 1
@ -175,12 +181,12 @@ or extracting your own preprepared cpio files to feed to the kernel build
(instead of a config file or directory).
The following command line can extract a cpio image (either by the above script
or by the kernel build) back into its component files:
or by the kernel build) back into its component files::
cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
The following shell script can create a prebuilt cpio archive you can
use in place of the above config file:
use in place of the above config file::
#!/bin/sh
@ -202,14 +208,17 @@ use in place of the above config file:
exit 1
fi
Note: The cpio man page contains some bad advice that will break your initramfs
archive if you follow it. It says "A typical way to generate the list
of filenames is with the find command; you should give find the -depth option
to minimize problems with permissions on directories that are unwritable or not
searchable." Don't do this when creating initramfs.cpio.gz images; it won't
work. The Linux kernel cpio extractor won't create files in a directory that
doesn't exist, so the directory entries must go before the files that go in
those directories. The above script gets them in the right order.
.. Note::
The cpio man page contains some bad advice that will break your initramfs
archive if you follow it. It says "A typical way to generate the list
of filenames is with the find command; you should give find the -depth
option to minimize problems with permissions on directories that are
unwritable or not searchable." Don't do this when creating
initramfs.cpio.gz images; it won't work. The Linux kernel cpio extractor
won't create files in a directory that doesn't exist, so the directory
entries must go before the files that go in those directories.
The above script gets them in the right order.
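
The ordering requirement in the note (directory entries before the files inside them) can be checked with a small userspace sketch. It is not part of the kernel or of cpio; the function names are hypothetical, and it relies on the fact that a parent path is a proper prefix of its children and therefore sorts first under ``strcmp()``.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for an array of C strings. */
static int cmp_path(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort paths so that each directory precedes the entries inside it. */
static void sort_parents_first(const char **paths, size_t n)
{
    qsort(paths, n, sizeof(*paths), cmp_path);
}

/*
 * Return true if every entry's parent directory appears earlier in the list.
 * Entries without a '/' are top-level and have no parent to check.
 */
static bool parents_come_first(const char *const *paths, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const char *slash = strrchr(paths[i], '/');
        if (!slash)
            continue;
        size_t plen = (size_t)(slash - paths[i]);
        bool found = false;
        for (size_t j = 0; j < i && !found; j++)
            found = strlen(paths[j]) == plen &&
                    strncmp(paths[j], paths[i], plen) == 0;
        if (!found)
            return false;
    }
    return true;
}
```

A list produced by ``find -depth`` would fail ``parents_come_first()``, because it lists a directory's contents before the directory itself.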
External initramfs images:
--------------------------
@ -236,9 +245,10 @@ An initramfs archive is a complete self-contained root filesystem for Linux.
If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:
http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
http://www.linuxfromscratch.org/lfs/view/stable/
- http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
- http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
- http://www.linuxfromscratch.org/lfs/view/stable/
The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
designed to be a tiny C library to statically link early userspace
@ -255,7 +265,7 @@ name lookups, even when otherwise statically linked.)
A good first step is to get initramfs to run a statically linked "hello world"
program as init, and test it under an emulator like qemu (www.qemu.org) or
User Mode Linux, like so:
User Mode Linux, like so::
cat > hello.c << EOF
#include <stdio.h>
@ -326,8 +336,8 @@ the above threads) is:
explained his reasoning:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
- http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
- http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
and, most importantly, designed and implemented the initramfs code.


@ -1,3 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
==================================
relay interface (formerly relayfs)
==================================
@ -108,6 +111,7 @@ The relay interface implements basic file operations for user space
access to relay channel buffer data. Here are the file operations
that are available and some comments regarding their behavior:
=========== ============================================================
open() enables user to open an _existing_ channel buffer.
mmap() results in channel buffer being mapped into the caller's
@ -136,13 +140,16 @@ poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
close() decrements the channel buffer's refcount. When the refcount
reaches 0, i.e. when no process or kernel client has the
buffer open, the channel buffer is freed.
=========== ============================================================
In order for a user application to make use of relay files, the
host filesystem must be mounted. For example,
host filesystem must be mounted. For example::
mount -t debugfs debugfs /sys/kernel/debug
NOTE: the host filesystem doesn't need to be mounted for kernel
.. Note::
the host filesystem doesn't need to be mounted for kernel
clients to create or use channels - it only needs to be
mounted when user space applications need access to the buffer
data.
@ -154,7 +161,7 @@ The relay interface kernel API
Here's a summary of the API the relay interface provides to in-kernel clients:
TBD(curr. line MT:/API/)
channel management functions:
channel management functions::
relay_open(base_filename, parent, subbuf_size, n_subbufs,
callbacks, private_data)
@ -162,17 +169,17 @@ TBD(curr. line MT:/API/)
relay_flush(chan)
relay_reset(chan)
channel management typically called on instigation of userspace:
channel management typically called on instigation of userspace::
relay_subbufs_consumed(chan, cpu, subbufs_consumed)
write functions:
write functions::
relay_write(chan, data, length)
__relay_write(chan, data, length)
relay_reserve(chan, length)
callbacks:
callbacks::
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
buf_mapped(buf, filp)
@ -180,7 +187,7 @@ TBD(curr. line MT:/API/)
create_buf_file(filename, parent, mode, buf, is_global)
remove_buf_file(dentry)
helper functions:
helper functions::
relay_buf_full(buf)
subbuf_start_reserve(buf, length)
@ -215,41 +222,41 @@ the file(s) created in create_buf_file() and is called during
relay_close().
Here are some typical definitions for these callbacks, in this case
using debugfs:
using debugfs::
/*
* create_buf_file() callback. Creates relay file in debugfs.
*/
static struct dentry *create_buf_file_handler(const char *filename,
struct dentry *parent,
umode_t mode,
struct rchan_buf *buf,
int *is_global)
{
return debugfs_create_file(filename, mode, parent, buf,
&relay_file_operations);
}
/*
* create_buf_file() callback. Creates relay file in debugfs.
*/
static struct dentry *create_buf_file_handler(const char *filename,
struct dentry *parent,
umode_t mode,
struct rchan_buf *buf,
int *is_global)
{
return debugfs_create_file(filename, mode, parent, buf,
&relay_file_operations);
}
/*
* remove_buf_file() callback. Removes relay file from debugfs.
*/
static int remove_buf_file_handler(struct dentry *dentry)
{
debugfs_remove(dentry);
/*
* remove_buf_file() callback. Removes relay file from debugfs.
*/
static int remove_buf_file_handler(struct dentry *dentry)
{
debugfs_remove(dentry);
return 0;
}
return 0;
}
/*
* relay interface callbacks
*/
static struct rchan_callbacks relay_callbacks =
{
.create_buf_file = create_buf_file_handler,
.remove_buf_file = remove_buf_file_handler,
};
/*
* relay interface callbacks
*/
static struct rchan_callbacks relay_callbacks =
{
.create_buf_file = create_buf_file_handler,
.remove_buf_file = remove_buf_file_handler,
};
And an example relay_open() invocation using them:
And an example relay_open() invocation using them::
chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL);
@ -339,23 +346,23 @@ whether or not to actually move on to the next sub-buffer.
To implement 'no-overwrite' mode, the userspace client would provide
an implementation of the subbuf_start() callback something like the
following:
following::
static int subbuf_start(struct rchan_buf *buf,
void *subbuf,
void *prev_subbuf,
unsigned int prev_padding)
{
if (prev_subbuf)
*((unsigned *)prev_subbuf) = prev_padding;
static int subbuf_start(struct rchan_buf *buf,
void *subbuf,
void *prev_subbuf,
unsigned int prev_padding)
{
if (prev_subbuf)
*((unsigned *)prev_subbuf) = prev_padding;
if (relay_buf_full(buf))
return 0;
if (relay_buf_full(buf))
return 0;
subbuf_start_reserve(buf, sizeof(unsigned int));
subbuf_start_reserve(buf, sizeof(unsigned int));
return 1;
}
return 1;
}
If the current buffer is full, i.e. all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
@ -370,20 +377,20 @@ ready sub-buffers will relay_buf_full() return 0, in which case the
buffer switch can continue.
The implementation of the subbuf_start() callback for 'overwrite' mode
would be very similar:
would be very similar::
static int subbuf_start(struct rchan_buf *buf,
void *subbuf,
void *prev_subbuf,
size_t prev_padding)
{
if (prev_subbuf)
*((unsigned *)prev_subbuf) = prev_padding;
static int subbuf_start(struct rchan_buf *buf,
void *subbuf,
void *prev_subbuf,
size_t prev_padding)
{
if (prev_subbuf)
*((unsigned *)prev_subbuf) = prev_padding;
subbuf_start_reserve(buf, sizeof(unsigned int));
subbuf_start_reserve(buf, sizeof(unsigned int));
return 1;
}
return 1;
}
In this case, the relay_buf_full() check is meaningless and the
callback always returns 1, causing the buffer switch to occur


@ -1,4 +1,8 @@
ROMFS - ROM FILE SYSTEM
.. SPDX-License-Identifier: GPL-2.0
=======================
ROMFS - ROM File System
=======================
This is a quite dumb, read only filesystem, mainly for initial RAM
disks of installation disks. It has grown up by the need of having
@ -51,9 +55,9 @@ the 16 byte padding for the name and the contents, also 16+14+15 = 45
bytes. This is quite rare however, since most file names are longer
than 3 bytes, and shorter than 15 bytes.
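
The 16 byte padding used in the size estimates above amounts to rounding a length up to the next multiple of 16. A minimal sketch follows; the helper name ``romfs_pad16`` is an assumption for illustration, not a romfs symbol.

```c
#include <stddef.h>

#define ROMFS_ALIGN 16u  /* alignment stated in the text */

/* Round len up to the next multiple of 16 (hypothetical helper name). */
static size_t romfs_pad16(size_t len)
{
    return (len + ROMFS_ALIGN - 1) & ~(size_t)(ROMFS_ALIGN - 1);
}
```

For instance, a 1 byte field pads to 16 bytes and a 17 byte field pads to 32 bytes.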
The layout of the filesystem is the following:
The layout of the filesystem is the following::
offset content
offset content
+---+---+---+---+
0 | - | r | o | m | \
@ -84,9 +88,9 @@ the source. This algorithm was chosen because although it's not quite
reliable, it does not require any tables, and it is very simple.
The following bytes are now part of the file system; each file header
must begin on a 16 byte boundary.
must begin on a 16 byte boundary::
offset content
offset content
+---+---+---+---+
0 | next filehdr|X| The offset of the next file header
@ -114,7 +118,9 @@ file is user and group 0, this should never be a problem for the
intended use. The mapping of the 8 possible values to file types is
the following:
== =============== ============================================
mapping spec.info means
== =============== ============================================
0 hard link link destination [file header]
1 directory first file's header
2 regular file unused, must be zero [MBZ]
@ -123,6 +129,7 @@ the following:
5 char device - " -
6 socket unused, MBZ
7 fifo unused, MBZ
== =============== ============================================
Note that hard links are specifically marked in this filesystem, but
they will behave as you can expect (i.e. share the inode number).
@ -158,24 +165,24 @@ to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
Pending issues:
- Permissions and owner information are pretty essential features of a
Un*x like system, but romfs does not provide the full possibilities.
I have never found this limiting, but others might.
Un*x like system, but romfs does not provide the full possibilities.
I have never found this limiting, but others might.
- The file system is read only, so it can be very small, but in case
one would want to write _anything_ to a file system, he still needs
a writable file system, thus negating the size advantages. Possible
solutions: implement write access as a compile-time option, or a new,
similarly small writable filesystem for RAM disks.
one would want to write _anything_ to a file system, he still needs
a writable file system, thus negating the size advantages. Possible
solutions: implement write access as a compile-time option, or a new,
similarly small writable filesystem for RAM disks.
- Since the files are only required to have alignment on a 16 byte
boundary, it is currently possibly suboptimal to read or execute files
from the filesystem. It might be resolved by reordering file data to
have most of it (i.e. except the start and the end) lying at "natural"
boundaries, thus it would be possible to directly map a big portion of
the file contents to the mm subsystem.
boundary, it is currently possibly suboptimal to read or execute files
from the filesystem. It might be resolved by reordering file data to
have most of it (i.e. except the start and the end) lying at "natural"
boundaries, thus it would be possible to directly map a big portion of
the file contents to the mm subsystem.
- Compression might be a useful feature, but memory is quite a
limiting factor in my eyes.
limiting factor in my eyes.
- Where is it used?
@ -183,4 +190,5 @@ limiting factor in my eyes.
Have fun,
Janos Farkas <chexum@shadow.banki.hu>


@ -1,7 +1,11 @@
SQUASHFS 4.0 FILESYSTEM
.. SPDX-License-Identifier: GPL-2.0
=======================
Squashfs 4.0 Filesystem
=======================
Squashfs is a compressed read-only filesystem for Linux.
It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4K are supported up to a
@ -15,31 +19,33 @@ needed.
Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org
1. FILESYSTEM FEATURES
1. Filesystem Features
----------------------
Squashfs filesystem features versus Cramfs:
============================== ========= ==========
Squashfs Cramfs
Max filesystem size: 2^64 256 MiB
Max file size: ~ 2 TiB 16 MiB
Max files: unlimited unlimited
Max directories: unlimited unlimited
Max entries per directory: unlimited unlimited
Max block size: 1 MiB 4 KiB
Metadata compression: yes no
Directory indexes: yes no
Sparse file support: yes no
Tail-end packing (fragments): yes no
Exportable (NFS etc.): yes no
Hard link support: yes no
"." and ".." in readdir: yes no
Real inode numbers: yes no
32-bit uids/gids: yes no
File creation time: yes no
Xattr support: yes no
ACL support: no no
============================== ========= ==========
Max filesystem size 2^64 256 MiB
Max file size ~ 2 TiB 16 MiB
Max files unlimited unlimited
Max directories unlimited unlimited
Max entries per directory unlimited unlimited
Max block size 1 MiB 4 KiB
Metadata compression yes no
Directory indexes yes no
Sparse file support yes no
Tail-end packing (fragments) yes no
Exportable (NFS etc.) yes no
Hard link support yes no
"." and ".." in readdir yes no
Real inode numbers yes no
32-bit uids/gids yes no
File creation time yes no
Xattr support yes no
ACL support no no
============================== ========= ==========
Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries. Each
@ -47,7 +53,7 @@ compressed inode is on average 8 bytes in length (the exact length varies on
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).
2. USING SQUASHFS
2. Using Squashfs
-----------------
As squashfs is a read-only filesystem, the mksquashfs program must be used to
@ -58,11 +64,11 @@ obtained from this site also.
The squashfs-tools development tree is now located on kernel.org
git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
3. SQUASHFS FILESYSTEM DESIGN
3. Squashfs Filesystem Design
-----------------------------
A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment:
byte alignment::
---------------
| superblock |
@ -229,15 +235,15 @@ location of the xattr list inside each inode, a 32-bit xattr id
is stored. This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.
4. TODOS AND OUTSTANDING ISSUES
4. TODOs and Outstanding Issues
-------------------------------
4.1 Todo list
4.1 TODO list
-------------
Implement ACL support.
4.2 Squashfs internal cache
4.2 Squashfs Internal Cache
---------------------------
Blocks in Squashfs are compressed. To avoid repeatedly decompressing


@ -1,32 +1,36 @@
.. SPDX-License-Identifier: GPL-2.0
sysfs - _The_ filesystem for exporting kernel objects.
=====================================================
sysfs - _The_ filesystem for exporting kernel objects
=====================================================
Patrick Mochel <mochel@osdl.org>
Mike Murphy <mamurph@cs.clemson.edu>
Revised: 16 August 2011
Original: 10 January 2003
:Revised: 16 August 2011
:Original: 10 January 2003
What it is:
~~~~~~~~~~~
sysfs is a ram-based filesystem initially based on ramfs. It provides
a means to export kernel data structures, their attributes, and the
linkages between them to userspace.
a means to export kernel data structures, their attributes, and the
linkages between them to userspace.
sysfs is tied inherently to the kobject infrastructure. Please read
Documentation/kobject.txt for more information concerning the kobject
interface.
interface.
Using sysfs
~~~~~~~~~~~
sysfs is always compiled in if CONFIG_SYSFS is defined. You can access
it by doing:
it by doing::
mount -t sysfs sysfs /sys
mount -t sysfs sysfs /sys
Directory Creation
@ -37,7 +41,7 @@ created for it in sysfs. That directory is created as a subdirectory
of the kobject's parent, expressing internal object hierarchies to
userspace. Top-level directories in sysfs represent the common
ancestors of object hierarchies; i.e. the subsystems the objects
belong to.
belong to.
Sysfs internally stores a pointer to the kobject that implements a
directory in the kernfs_node object associated with the directory. In
@ -58,63 +62,63 @@ attributes.
Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only one
value per file, so it is socially acceptable to express an array of
values of the same type.
values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon. Doing these things may get
you publicly humiliated and your code rewritten without notice.
you publicly humiliated and your code rewritten without notice.
An attribute definition is simply:
An attribute definition is simply::
struct attribute {
char * name;
struct module *owner;
umode_t mode;
};
struct attribute {
char * name;
struct module *owner;
umode_t mode;
};
int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
A bare attribute contains no means to read or write the value of the
attribute. Subsystems are encouraged to define their own attribute
structure and wrapper functions for adding and removing attributes for
a specific object type.
a specific object type.
For example, the driver model defines struct device_attribute like:
For example, the driver model defines struct device_attribute like::
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device *dev, struct device_attribute *attr,
char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
};
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device *dev, struct device_attribute *attr,
char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
};
int device_create_file(struct device *, const struct device_attribute *);
void device_remove_file(struct device *, const struct device_attribute *);
int device_create_file(struct device *, const struct device_attribute *);
void device_remove_file(struct device *, const struct device_attribute *);
It also defines this helper for defining device attributes:
It also defines this helper for defining device attributes::
#define DEVICE_ATTR(_name, _mode, _show, _store) \
struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
#define DEVICE_ATTR(_name, _mode, _show, _store) \
struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
For example, declaring
For example, declaring::
static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
is equivalent to doing:
is equivalent to doing::
static struct device_attribute dev_attr_foo = {
.attr = {
.name = "foo",
.mode = S_IWUSR | S_IRUGO,
},
.show = show_foo,
.store = store_foo,
};
static struct device_attribute dev_attr_foo = {
.attr = {
.name = "foo",
.mode = S_IWUSR | S_IRUGO,
},
.show = show_foo,
.store = store_foo,
};
Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally
considered a bad idea." so trying to set a sysfs file writable for
@ -127,15 +131,21 @@ readable. The above case could be shortened to:
static struct device_attribute dev_attr_foo = __ATTR_RW(foo);
The list of helpers available to define your wrapper function is:
__ATTR_RO(name): assumes default name_show and mode 0444
__ATTR_WO(name): assumes a name_store only and is restricted to mode
__ATTR_RO(name):
assumes default name_show and mode 0444
__ATTR_WO(name):
assumes a name_store only and is restricted to mode
0200 that is root write access only.
__ATTR_RO_MODE(name, mode): for more restrictive RO access currently
__ATTR_RO_MODE(name, mode):
for more restrictive RO access currently
only use case is the EFI System Resource Table
(see drivers/firmware/efi/esrt.c)
__ATTR_RW(name): assumes default name_show, name_store and setting
__ATTR_RW(name):
assumes default name_show, name_store and setting
mode to 0644.
__ATTR_NULL: which sets the name to NULL and is used as end of list
__ATTR_NULL:
which sets the name to NULL and is used as end of list
indicator (see: kernel/workqueue.c)
Subsystem-Specific Callbacks
@ -143,12 +153,12 @@ Subsystem-Specific Callbacks
When a subsystem defines a new attribute type, it must implement a
set of sysfs operations for forwarding read and write calls to the
show and store methods of the attribute owners.
show and store methods of the attribute owners::
struct sysfs_ops {
ssize_t (*show)(struct kobject *, struct attribute *, char *);
ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
};
struct sysfs_ops {
ssize_t (*show)(struct kobject *, struct attribute *, char *);
ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
};
[ Subsystems should have already defined a struct kobj_type as a
descriptor for this type, which is where the sysfs_ops pointer is
@ -157,29 +167,29 @@ stored. See the kobject documentation for more information. ]
When a file is read or written, sysfs calls the appropriate method
for the type. The method then translates the generic struct kobject
and struct attribute pointers to the appropriate pointer types, and
calls the associated methods.
calls the associated methods.
To illustrate:
To illustrate::
#define to_dev(obj) container_of(obj, struct device, kobj)
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
#define to_dev(obj) container_of(obj, struct device, kobj)
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
struct device_attribute *dev_attr = to_dev_attr(attr);
struct device *dev = to_dev(kobj);
ssize_t ret = -EIO;
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
struct device_attribute *dev_attr = to_dev_attr(attr);
struct device *dev = to_dev(kobj);
ssize_t ret = -EIO;
if (dev_attr->show)
ret = dev_attr->show(dev, dev_attr, buf);
if (ret >= (ssize_t)PAGE_SIZE) {
printk("dev_attr_show: %pS returned bad count\n",
dev_attr->show);
}
return ret;
}
if (dev_attr->show)
ret = dev_attr->show(dev, dev_attr, buf);
if (ret >= (ssize_t)PAGE_SIZE) {
printk("dev_attr_show: %pS returned bad count\n",
dev_attr->show);
}
return ret;
}
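
The pointer translation in dev_attr_show() rests on container_of(). The following userspace sketch shows the same dispatch pattern under stated assumptions: every ``demo_*`` type and function is hypothetical, and the macro is a simplified stand-in for the kernel's version without its type checking.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical generic object and attribute (stand-ins for kobject/attribute). */
struct demo_kobj { int refcount; };
struct demo_attr { const char *name; };

struct demo_device {
    const char *label;
    struct demo_kobj kobj;      /* embedded generic object */
};

struct demo_device_attr {
    struct demo_attr attr;      /* embedded generic attribute */
    int (*show)(struct demo_device *dev, char *buf, size_t len);
};

/* Typed show method: prints one value followed by a newline. */
static int label_show(struct demo_device *dev, char *buf, size_t len)
{
    return snprintf(buf, len, "%s\n", dev->label);
}

/* Generic entry point: recover the typed pointers, then call the typed method. */
static int demo_attr_show(struct demo_kobj *kobj, struct demo_attr *attr,
                          char *buf, size_t len)
{
    struct demo_device *dev = container_of(kobj, struct demo_device, kobj);
    struct demo_device_attr *dattr =
        container_of(attr, struct demo_device_attr, attr);

    return dattr->show ? dattr->show(dev, buf, len) : -1;
}
```

The generic layer only ever sees the embedded members; the enclosing structures are recovered by subtracting the member offset, which is why the generic object and attribute must be embedded by value rather than referenced by pointer.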
@ -188,11 +198,11 @@ Reading/Writing Attribute Data
To read or write attributes, show() or store() methods must be
specified when declaring the attribute. The method types should be as
simple as those defined for device attributes:
simple as those defined for device attributes::
ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
IOW, they should take only an object, an attribute, and a buffer as parameters.
@ -200,11 +210,11 @@ IOW, they should take only an object, an attribute, and a buffer as parameters.
sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
method. Sysfs will call the method exactly once for each read or
write. This forces the following behavior on the method
implementations:
implementations:
- On read(2), the show() method should fill the entire buffer.
- On read(2), the show() method should fill the entire buffer.
Recall that an attribute should only be exporting one value, or an
array of similar values, so this shouldn't be that expensive.
array of similar values, so this shouldn't be that expensive.
This allows userspace to do partial reads and forward seeks
arbitrarily over the entire file at will. If userspace seeks back to
@ -218,10 +228,10 @@ implementations:
When writing sysfs files, userspace processes should first read the
entire file, modify the values it wishes to change, then write the
entire buffer back.
entire buffer back.
Attribute method implementations should operate on an identical
buffer when reading and writing values.
buffer when reading and writing values.
Other notes:
@ -229,7 +239,7 @@ Other notes:
file position.
- The buffer will always be PAGE_SIZE bytes in length. On i386, this
is 4096.
is 4096.
- show() methods should return the number of bytes printed into the
buffer. This is the return value of scnprintf().
@ -246,31 +256,31 @@ Other notes:
through, be sure to return an error.
- The object passed to the methods will be pinned in memory via sysfs
reference counting its embedded object. However, the physical
entity (e.g. device) the object represents may not be present. Be
sure to have a way to check this, if necessary.
reference counting its embedded object. However, the physical
entity (e.g. device) the object represents may not be present. Be
sure to have a way to check this, if necessary.
A very simple (and naive) implementation of a device attribute is:
A very simple (and naive) implementation of a device attribute is::
static ssize_t show_name(struct device *dev, struct device_attribute *attr,
char *buf)
{
return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
}
static ssize_t show_name(struct device *dev, struct device_attribute *attr,
char *buf)
{
return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
}
static ssize_t store_name(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
snprintf(dev->name, sizeof(dev->name), "%.*s",
(int)min(count, sizeof(dev->name) - 1), buf);
return count;
}
static ssize_t store_name(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
snprintf(dev->name, sizeof(dev->name), "%.*s",
(int)min(count, sizeof(dev->name) - 1), buf);
return count;
}
static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
(Note that the real implementation doesn't allow userspace to set the
(Note that the real implementation doesn't allow userspace to set the
name for a device.)
@ -278,25 +288,25 @@ Top Level Directory Layout
~~~~~~~~~~~~~~~~~~~~~~~~~~
The sysfs directory arrangement exposes the relationship of kernel
data structures.
data structures.
The top level sysfs directory looks like:
The top level sysfs directory looks like::
block/
bus/
class/
dev/
devices/
firmware/
net/
fs/
block/
bus/
class/
dev/
devices/
firmware/
net/
fs/
devices/ contains a filesystem representation of the device tree. It maps
directly to the internal kernel device tree, which is a hierarchy of
struct device.
struct device.
bus/ contains flat directory layout of the various bus types in the
kernel. Each bus's directory contains two subdirectories:
kernel. Each bus's directory contains two subdirectories::
devices/
drivers/
@ -331,71 +341,71 @@ Current Interfaces
The following interface layers currently exist in sysfs:
- devices (include/linux/device.h)
----------------------------------
Structure:
devices (include/linux/device.h)
--------------------------------
Structure::
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device *dev, struct device_attribute *attr,
char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
};
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device *dev, struct device_attribute *attr,
char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
};
Declaring:
Declaring::
DEVICE_ATTR(_name, _mode, _show, _store);
DEVICE_ATTR(_name, _mode, _show, _store);
Creation/Removal:
Creation/Removal::
int device_create_file(struct device *dev, const struct device_attribute * attr);
void device_remove_file(struct device *dev, const struct device_attribute * attr);
int device_create_file(struct device *dev, const struct device_attribute * attr);
void device_remove_file(struct device *dev, const struct device_attribute * attr);
- bus drivers (include/linux/device.h)
--------------------------------------
Structure:
bus drivers (include/linux/device.h)
------------------------------------
Structure::
struct bus_attribute {
struct attribute attr;
ssize_t (*show)(struct bus_type *, char * buf);
ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
};
struct bus_attribute {
struct attribute attr;
ssize_t (*show)(struct bus_type *, char * buf);
ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
};
Declaring:
Declaring::
static BUS_ATTR_RW(name);
static BUS_ATTR_RO(name);
static BUS_ATTR_WO(name);
static BUS_ATTR_RW(name);
static BUS_ATTR_RO(name);
static BUS_ATTR_WO(name);
Creation/Removal:
Creation/Removal::
int bus_create_file(struct bus_type *, struct bus_attribute *);
void bus_remove_file(struct bus_type *, struct bus_attribute *);
int bus_create_file(struct bus_type *, struct bus_attribute *);
void bus_remove_file(struct bus_type *, struct bus_attribute *);
- device drivers (include/linux/device.h)
-----------------------------------------
device drivers (include/linux/device.h)
---------------------------------------
Structure:
Structure::
struct driver_attribute {
struct attribute attr;
ssize_t (*show)(struct device_driver *, char * buf);
ssize_t (*store)(struct device_driver *, const char * buf,
size_t count);
};
struct driver_attribute {
struct attribute attr;
ssize_t (*show)(struct device_driver *, char * buf);
ssize_t (*store)(struct device_driver *, const char * buf,
size_t count);
};
Declaring:
Declaring::
DRIVER_ATTR_RO(_name)
DRIVER_ATTR_RW(_name)
DRIVER_ATTR_RO(_name)
DRIVER_ATTR_RW(_name)
Creation/Removal:
Creation/Removal::
int driver_create_file(struct device_driver *, const struct driver_attribute *);
void driver_remove_file(struct device_driver *, const struct driver_attribute *);
int driver_create_file(struct device_driver *, const struct driver_attribute *);
void driver_remove_file(struct device_driver *, const struct driver_attribute *);
Documentation

@ -1,25 +1,40 @@
.. SPDX-License-Identifier: GPL-2.0
==================
SystemV Filesystem
==================
It implements all of
- Xenix FS,
- SystemV/386 FS,
- Coherent FS.
To install:
* Answer the 'System V and Coherent filesystem support' question with 'y'
when configuring the kernel.
* To mount a disk or a partition, use
* To mount a disk or a partition, use::
mount [-r] -t sysv device mountpoint
The file system type names
The file system type names::
-t sysv
-t xenix
-t coherent
may be used interchangeably, but the last two will eventually disappear.
Bugs in the present implementation:
- Coherent FS:
- The "free list interleave" n:m is currently ignored.
- Only file systems with no filesystem name and no pack name are recognized.
(See Coherent "man mkfs" for a description of these features.)
(See Coherent "man mkfs" for a description of these features.)
- SystemV Release 2 FS:
The superblock is only searched for in blocks 9, 15 and 18, which
correspond to the beginning of track 1 on floppy disks. No support
for this FS on hard disk yet.
@ -28,12 +43,14 @@ Bugs in the present implementation:
These filesystems are rather similar. Here is a comparison with Minix FS:
* Linux fdisk reports on partitions
- Minix FS 0x81 Linux/Minix
- Xenix FS ??
- SystemV FS ??
- Coherent FS 0x08 AIX bootable
* Size of a block or zone (data allocation unit on disk)
- Minix FS 1024
- Xenix FS 1024 (also 512 ??)
- SystemV FS 1024 (also 512 and 2048)
@ -45,37 +62,51 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
all the block numbers (including the super block) are offset by one track.
* Byte ordering of "short" (16 bit entities) on disk:
- Minix FS little endian 0 1
- Xenix FS little endian 0 1
- SystemV FS little endian 0 1
- Coherent FS little endian 0 1
Of course, this affects only the file system, not the data of files on it!
* Byte ordering of "long" (32 bit entities) on disk:
- Minix FS little endian 0 1 2 3
- Xenix FS little endian 0 1 2 3
- SystemV FS little endian 0 1 2 3
- Coherent FS PDP-11 2 3 0 1
Of course, this affects only the file system, not the data of files on it!
* Inode on disk: "short", 0 means non-existent, the root dir ino is:
- Minix FS 1
- Xenix FS, SystemV FS, Coherent FS 2
================================= ==
Minix FS 1
Xenix FS, SystemV FS, Coherent FS 2
================================= ==
* Maximum number of hard links to a file:
- Minix FS 250
- Xenix FS ??
- SystemV FS ??
- Coherent FS >=10000
=========== =========
Minix FS 250
Xenix FS ??
SystemV FS ??
Coherent FS >=10000
=========== =========
* Free inode management:
- Minix FS a bitmap
- Minix FS
a bitmap
- Xenix FS, SystemV FS, Coherent FS
There is a cache of a certain number of free inodes in the super-block.
When it is exhausted, new free inodes are found using a linear search.
* Free block management:
- Minix FS a bitmap
- Minix FS
a bitmap
- Xenix FS, SystemV FS, Coherent FS
Free blocks are organized in a "free list". Maybe a misleading term,
since it is not true that every free block contains a pointer to
@ -86,13 +117,18 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
* Super-block location:
- Minix FS block 1 = bytes 1024..2047
- Xenix FS block 1 = bytes 1024..2047
- SystemV FS bytes 512..1023
- Coherent FS block 1 = bytes 512..1023
=========== ==========================
Minix FS block 1 = bytes 1024..2047
Xenix FS block 1 = bytes 1024..2047
SystemV FS bytes 512..1023
Coherent FS block 1 = bytes 512..1023
=========== ==========================
* Super-block layout:
- Minix FS
- Minix FS::
unsigned short s_ninodes;
unsigned short s_nzones;
unsigned short s_imap_blocks;
@ -101,7 +137,9 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
unsigned short s_log_zone_size;
unsigned long s_max_size;
unsigned short s_magic;
- Xenix FS, SystemV FS, Coherent FS
- Xenix FS, SystemV FS, Coherent FS::
unsigned short s_firstdatazone;
unsigned long s_nzones;
unsigned short s_fzone_count;
@ -120,23 +158,33 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
char s_fname[6];
char s_fpack[6];
then they differ considerably:
Xenix FS
Xenix FS::
char s_clean;
char s_fill[371];
long s_magic;
long s_type;
SystemV FS
SystemV FS::
long s_fill[12 or 14];
long s_state;
long s_magic;
long s_type;
Coherent FS
Coherent FS::
unsigned long s_unique;
Note that Coherent FS has no magic.
* Inode layout:
- Minix FS
- Minix FS::
unsigned short i_mode;
unsigned short i_uid;
unsigned long i_size;
@ -144,7 +192,9 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
unsigned char i_gid;
unsigned char i_nlinks;
unsigned short i_zone[7+1+1];
- Xenix FS, SystemV FS, Coherent FS
- Xenix FS, SystemV FS, Coherent FS::
unsigned short i_mode;
unsigned short i_nlink;
unsigned short i_uid;
@ -155,38 +205,55 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
unsigned long i_mtime;
unsigned long i_ctime;
* Regular file data blocks are organized as
- Minix FS
7 direct blocks
1 indirect block (pointers to blocks)
1 double-indirect block (pointer to pointers to blocks)
- Xenix FS, SystemV FS, Coherent FS
10 direct blocks
1 indirect block (pointers to blocks)
1 double-indirect block (pointer to pointers to blocks)
1 triple-indirect block (pointer to pointers to pointers to blocks)
* Inode size, inodes per block
- Minix FS 32 32
- Xenix FS 64 16
- SystemV FS 64 16
- Coherent FS 64 8
* Regular file data blocks are organized as
- Minix FS:
- 7 direct blocks
- 1 indirect block (pointers to blocks)
- 1 double-indirect block (pointer to pointers to blocks)
- Xenix FS, SystemV FS, Coherent FS:
- 10 direct blocks
- 1 indirect block (pointers to blocks)
- 1 double-indirect block (pointer to pointers to blocks)
- 1 triple-indirect block (pointer to pointers to pointers to blocks)
=========== ========== ================
Inode size inodes per block
=========== ========== ================
Minix FS 32 32
Xenix FS 64 16
SystemV FS 64 16
Coherent FS 64 8
=========== ========== ================
* Directory entry on disk
- Minix FS
- Minix FS::
unsigned short inode;
char name[14/30];
- Xenix FS, SystemV FS, Coherent FS
- Xenix FS, SystemV FS, Coherent FS::
unsigned short inode;
char name[14];
* Dir entry size, dir entries per block
- Minix FS 16/32 64/32
- Xenix FS 16 64
- SystemV FS 16 64
- Coherent FS 16 32
=========== ============== =====================
Dir entry size dir entries per block
=========== ============== =====================
Minix FS 16/32 64/32
Xenix FS 16 64
SystemV FS 16 64
Coherent FS 16 32
=========== ============== =====================
* How to implement symbolic links such that the host fsck doesn't scream:
- Minix FS normal
- Xenix FS kludge: as regular files with chmod 1000
- SystemV FS ??
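
Two of the comparisons above reduce to small computations: decoding a Coherent FS "long" from its PDP-11 byte order, and deriving the per-block counts in the inode and directory-entry tables from the block size. The following is a minimal userspace sketch, not the kernel's implementation. All function names are hypothetical, and the 512-byte Coherent FS block size is an assumption inferred from the 64-byte inode and the 8 inodes per block listed above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode a 32-bit on-disk value stored in PDP-11 order ("2 3 0 1"):
 * disk bytes 0 and 1 hold the high 16-bit word, disk bytes 2 and 3
 * hold the low word, and each word is itself little endian.
 */
uint32_t sysv_pdp11_to_cpu32(const uint8_t d[4])
{
	return (uint32_t)d[2] |
	       ((uint32_t)d[3] << 8) |
	       ((uint32_t)d[0] << 16) |
	       ((uint32_t)d[1] << 24);
}

/* Decode a plain little-endian 32-bit value ("0 1 2 3"). */
uint32_t sysv_le_to_cpu32(const uint8_t d[4])
{
	return (uint32_t)d[0] |
	       ((uint32_t)d[1] << 8) |
	       ((uint32_t)d[2] << 16) |
	       ((uint32_t)d[3] << 24);
}

/* Number of fixed-size records (inodes or directory entries) per block. */
unsigned int sysv_records_per_block(unsigned int block_size,
				    unsigned int record_size)
{
	assert(record_size != 0);
	return block_size / record_size;
}
```

For example, the disk bytes 0B 0A 0D 0C decode to 0x0A0B0C0D in PDP-11 order, and sysv_records_per_block(512, 64) gives the 8 inodes per block listed for Coherent FS.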

@ -1,3 +1,9 @@
.. SPDX-License-Identifier: GPL-2.0
=====
Tmpfs
=====
Tmpfs is a file system which keeps all files in virtual memory.
@ -14,7 +20,7 @@ If you compare it to ramfs (which was the template to create tmpfs)
you gain swapping and limit checking. Another similar thing is the RAM
disk (/dev/ram*), which simulates a fixed size hard disk in physical
RAM, where you have to create an ordinary filesystem on top. Ramdisks
cannot swap and you do not have the possibility to resize them.
cannot swap and you do not have the possibility to resize them.
Since tmpfs lives completely in the page cache and on swap, all tmpfs
pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
@ -26,7 +32,7 @@ tmpfs has the following uses:
1) There is always a kernel internal mount which you will not see at
all. This is used for shared anonymous mappings and SYSV shared
memory.
memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
set, the user visible part of tmpfs is not built. But the internal
@ -34,7 +40,7 @@ tmpfs has the following uses:
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
POSIX shared memory (shm_open, shm_unlink). Adding the following
line to /etc/fstab should take care of this:
line to /etc/fstab should take care of this::
tmpfs /dev/shm tmpfs defaults 0 0
@ -56,15 +62,17 @@ tmpfs has the following uses:
tmpfs has three mount options for sizing:
size: The limit of allocated bytes for this tmpfs instance. The
========= ============================================================
size The limit of allocated bytes for this tmpfs instance. The
default is half of your physical RAM without swap. If you
oversize your tmpfs instances the machine will deadlock
since the OOM handler will not be able to free that memory.
nr_blocks: The same as size, but in blocks of PAGE_SIZE.
nr_inodes: The maximum number of inodes for this instance. The default
nr_blocks The same as size, but in blocks of PAGE_SIZE.
nr_inodes The maximum number of inodes for this instance. The default
is half of the number of your physical RAM pages, or (on a
machine with highmem) the number of lowmem RAM pages,
whichever is the lower.
========= ============================================================
These parameters accept a suffix k, m or g for kilo, mega and giga and
can be changed on remount. The size parameter also accepts a suffix %
@ -82,6 +90,7 @@ tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
adjusted on the fly via 'mount -o remount ...'
======================== ==============================================
mpol=default use the process allocation policy
(see set_mempolicy(2))
mpol=prefer:Node prefers to allocate memory from the given Node
@ -89,6 +98,7 @@ mpol=bind:NodeList allocates memory only from nodes in NodeList
mpol=interleave prefers to allocate from each node in turn
mpol=interleave:NodeList allocates from each node of NodeList in turn
mpol=local prefers to allocate memory from the local node
======================== ==============================================
NodeList format is a comma-separated list of decimal numbers and ranges,
a range being two hyphen-separated decimal numbers, the smallest and
@ -98,9 +108,9 @@ A memory policy with a valid NodeList will be saved, as specified, for
use at file creation time. When a task allocates a file in the file
system, the mount option memory policy will be applied with a NodeList,
if any, modified by the calling task's cpuset constraints
[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags, listed
below. If the resulting NodeLists is the empty set, the effective memory
policy for the file will revert to "default" policy.
[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags,
listed below. If the resulting NodeLists is the empty set, the effective
memory policy for the file will revert to "default" policy.
NUMA memory allocation policies have optional flags that can be used in
conjunction with their modes. These optional flags can be specified
@ -109,6 +119,8 @@ See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of
all available memory allocation policy mode flags and their effect on
memory policy.
::
=static is equivalent to MPOL_F_STATIC_NODES
=relative is equivalent to MPOL_F_RELATIVE_NODES
@ -128,9 +140,11 @@ on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.
To specify the initial root directory you can use the following mount
options:
mode: The permissions as an octal number
uid: The user id
gid: The group id
==== ==================================
mode The permissions as an octal number
uid The user id
gid The group id
==== ==================================
These options do not have any effect on remount. You can change these
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
@ -141,9 +155,9 @@ will give you tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
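
The k, m and g suffixes accepted by the sizing options are binary multipliers (1024, 1024^2 and 1024^3), which is why a value of 10k corresponds to the 10240 inodes above. A rough userspace sketch of such a parser follows. parse_size_suffix() is a hypothetical helper, not the kernel's parser, and it handles neither the % suffix nor overflow.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "<decimal>[k|m|g]" (suffix is case-insensitive) into a number.
 * Returns 0 on success and -1 on malformed input.
 * Assumption: overflow is not checked in this sketch.
 */
int parse_size_suffix(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long value;
	unsigned int shift = 0;

	assert(s != NULL && out != NULL);

	value = strtoull(s, &end, 10);
	if (end == s)
		return -1;		/* no digits at all */

	switch (tolower((unsigned char)*end)) {
	case 'g':
		shift = 30;
		end++;
		break;
	case 'm':
		shift = 20;
		end++;
		break;
	case 'k':
		shift = 10;
		end++;
		break;
	case '\0':
		break;			/* plain number, no suffix */
	default:
		return -1;		/* unknown suffix, e.g. '%' */
	}

	if (*end != '\0')
		return -1;		/* trailing characters after the suffix */

	*out = (uint64_t)value << shift;
	return 0;
}
```

With this helper, "10k" parses to 10240 and "10G" to 10737418240, while "50%" is rejected because percentage handling is left out of the sketch.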
Author:
:Author:
Christoph Rohland <cr@sap.com>, 1.12.01
Updated:
:Updated:
Hugh Dickins, 4 June 2007
Updated:
:Updated:
KOSAKI Motohiro, 16 Mar 2010

@ -1,3 +1,5 @@
.. SPDX-License-Identifier: GPL-2.0
:orphan:
.. UBIFS Authentication
@ -92,11 +94,11 @@ UBIFS Index & Tree Node Cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Basic on-flash UBIFS entities are called *nodes*. UBIFS knows different types
of nodes. Eg. data nodes (`struct ubifs_data_node`) which store chunks of file
contents or inode nodes (`struct ubifs_ino_node`) which represent VFS inodes.
Almost all types of nodes share a common header (`ubifs_ch`) containing basic
of nodes. Eg. data nodes (``struct ubifs_data_node``) which store chunks of file
contents or inode nodes (``struct ubifs_ino_node``) which represent VFS inodes.
Almost all types of nodes share a common header (``ubifs_ch``) containing basic
information like node type, node length, a sequence number, etc. (see
`fs/ubifs/ubifs-media.h`in kernel source). Exceptions are entries of the LPT
``fs/ubifs/ubifs-media.h`` in kernel source). Exceptions are entries of the LPT
and some less important node types like padding nodes which are used to pad
unusable content at the end of LEBs.

@ -1,5 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0
===============
UBI File System
===============
Introduction
=============
============
UBIFS file-system stands for UBI File System. UBI stands for "Unsorted
Block Images". UBIFS is a flash file system, which means it is designed
@ -79,6 +85,7 @@ Mount options
(*) == default.
==================== =======================================================
bulk_read read more in one go to take advantage of flash
media that read faster sequentially
no_bulk_read (*) do not bulk-read
@ -98,6 +105,7 @@ auth_key= specify the key used for authenticating the filesystem.
auth_hash_name= The hash algorithm used for authentication. Used for
both hashing and for creating HMACs. Typical values
include "sha256" or "sha512"
==================== =======================================================
Quick usage instructions
@ -107,12 +115,14 @@ The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax,
where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is
UBI volume name.
Mount volume 0 on UBI device 0 to /mnt/ubifs:
$ mount -t ubifs ubi0_0 /mnt/ubifs
Mount volume 0 on UBI device 0 to /mnt/ubifs::
$ mount -t ubifs ubi0_0 /mnt/ubifs
Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume
name):
$ mount -t ubifs ubi0:rootfs /mnt/ubifs
name)::
$ mount -t ubifs ubi0:rootfs /mnt/ubifs
The following is an example of the kernel boot arguments to attach mtd0
to UBI and mount volume "rootfs":
@ -122,5 +132,6 @@ References
==========
UBIFS documentation and FAQ/HOWTO at the MTD web site:
http://www.linux-mtd.infradead.org/doc/ubifs.html
http://www.linux-mtd.infradead.org/faq/ubifs.html
- http://www.linux-mtd.infradead.org/doc/ubifs.html
- http://www.linux-mtd.infradead.org/faq/ubifs.html

@ -1,6 +1,8 @@
*
* Documentation/filesystems/udf.txt
*
.. SPDX-License-Identifier: GPL-2.0
===============
UDF file system
===============
If you encounter problems with reading UDF discs using this driver,
please report them according to the MAINTAINERS file.
@ -18,8 +20,10 @@ performance due to very poor read-modify-write support supplied internally
by drive firmware.
-------------------------------------------------------------------------------
The following mount options are supported:
=========== ======================================
gid= Set the default group.
umask= Set the default umask.
mode= Set the default file permissions.
@ -34,6 +38,7 @@ The following mount options are supported:
longad Use long ad's (default)
nostrict Unset strict conformance
iocharset= Set the NLS character set
=========== ======================================
The uid= and gid= options need a bit more explaining. They will accept a
decimal numeric value and all inodes on that mount will then appear as
@ -47,13 +52,17 @@ the interactive user will always see the files on the disk as belonging to him.
The remaining are for debugging and disaster recovery:
novrs Skip volume sequence recognition
===== ================================
novrs Skip volume sequence recognition
===== ================================
The following expect an offset from 0.
========== =================================================
session= Set the CDROM session (default= last session)
anchor= Override standard anchor location. (default= 256)
lastblock= Set the last block of the filesystem.
========== =================================================
-------------------------------------------------------------------------------
@ -62,5 +71,5 @@ For the latest version and toolset see:
https://github.com/pali/udftools
Documentation on UDF and ECMA 167 is available FREE from:
http://www.osta.org/
http://www.ecma-international.org/
- http://www.osta.org/
- http://www.ecma-international.org/

@ -1,5 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
.. _virtiofs_index:
===================================================
virtiofs: virtio-fs host<->guest shared file system
===================================================

@ -1,4 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
================================================
ZoneFS - Zone filesystem for Zoned block devices
================================================
Introduction
============
@ -29,6 +33,7 @@ Zoned block devices
Zoned storage devices belong to a class of storage devices with an address
space that is divided into zones. A zone is a group of consecutive LBAs and all
zones are contiguous (there are no LBA gaps). Zones may have different types.
* Conventional zones: there are no access constraints to LBAs belonging to
conventional zones. Any read or write access can be executed, similarly to a
regular block device.
@ -158,6 +163,7 @@ Format options
--------------
Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional zones can be
aggregated into a single larger file instead of the default one file per zone.
* File ownership: The owner UID and GID of zone files is by default 0 (root)
@ -249,7 +255,7 @@ permissions.
Further action taken by zonefs I/O error recovery can be controlled by the user
with the "errors=xxx" mount option. The table below summarizes the result of
zonefs I/O error processing depending on the mount option and on the zone
conditions.
conditions::
+--------------+-----------+-----------------------------------------+
| | | Post error state |
@ -275,6 +281,7 @@ conditions.
+--------------+-----------+-----------------------------------------+
Further notes:
* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
error processing if no errors mount option is specified.
* With the "errors=remount-ro" mount option, the change of the file access
@ -302,6 +309,7 @@ Mount options
zonefs defines the "errors=<behavior>" mount option to allow the user to specify
zonefs behavior in response to I/O errors, inode size inconsistencies or zone
condition changes. The defined behaviors are as follows:
* remount-ro (default)
* zone-ro
* zone-offline
@ -333,78 +341,78 @@ Examples
--------
The following formats a 15TB host-managed SMR HDD with 256 MB zones
with the conventional zones aggregation feature enabled.
with the conventional zones aggregation feature enabled::
# mkzonefs -o aggr_cnv /dev/sdX
# mount -t zonefs /dev/sdX /mnt
# ls -l /mnt/
total 0
dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
# mkzonefs -o aggr_cnv /dev/sdX
# mount -t zonefs /dev/sdX /mnt
# ls -l /mnt/
total 0
dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
The size of the zone files sub-directories indicates the number of files
existing for each type of zone. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a single
file).
file)::
# ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0
# ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0
This aggregated conventional zone file can be used as a regular file.
This aggregated conventional zone file can be used as a regular file::
# mkfs.ext4 /mnt/cnv/0
# mount -o loop /mnt/cnv/0 /data
# mkfs.ext4 /mnt/cnv/0
# mount -o loop /mnt/cnv/0 /data
The "seq" sub-directory grouping files for sequential write zones has in this
example 55356 zones.
example 55356 zones::
# ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355
# ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355
For sequential write zone files, the file size changes as data is appended at
the end of the file, similarly to any regular file system.
the end of the file, similarly to any regular file system::
# dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s
# dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s
# ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
# ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
The written file can be truncated to the zone size, preventing any further
write operation.
write operation::
# truncate -s 268435456 /mnt/seq/0
# ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
# truncate -s 268435456 /mnt/seq/0
# ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
Truncation to 0 size allows freeing the file zone storage space and restarting
append-writes to the file.
append-writes to the file::
# truncate -s 0 /mnt/seq/0
# ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
# truncate -s 0 /mnt/seq/0
# ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
Since files are statically mapped to zones on the disk, the number of blocks of
a file as reported by stat() and fstat() indicates the size of the file zone.
a file as reported by stat() and fstat() indicates the size of the file zone::
# stat /mnt/seq/0
File: /mnt/seq/0
Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
Device: 870h/2160d Inode: 50431 Links: 1
Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
Birth: -
# stat /mnt/seq/0
File: /mnt/seq/0
Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
Device: 870h/2160d Inode: 50431 Links: 1
Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
Birth: -
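
The Blocks value in the stat output above can be converted to bytes with a few lines of C. This is a sketch that assumes st_blocks is counted in 512-byte units, as it is on Linux; blocks_to_bytes() and zone_capacity_bytes() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>

/* Assumption: st_blocks is reported in 512-byte units (true on Linux). */
#define STAT_BLOCK_BYTES 512ULL

/* Convert a block count as reported by stat() into bytes. */
uint64_t blocks_to_bytes(uint64_t blocks)
{
	return blocks * STAT_BLOCK_BYTES;
}

/* Return the size in bytes of the zone backing a file, or 0 on error. */
uint64_t zone_capacity_bytes(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return 0;
	return blocks_to_bytes((uint64_t)st.st_blocks);
}
```

For the file shown above, blocks_to_bytes(524288) yields 268435456 bytes, the same value that was passed to truncate earlier.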
The number of blocks of the file ("Blocks") in units of 512B blocks gives the
maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone

@ -207,10 +207,10 @@ DPIO
CSR firmware support for DMC
----------------------------
.. kernel-doc:: drivers/gpu/drm/i915/intel_csr.c
.. kernel-doc:: drivers/gpu/drm/i915/display/intel_csr.c
:doc: csr support for dmc
.. kernel-doc:: drivers/gpu/drm/i915/intel_csr.c
.. kernel-doc:: drivers/gpu/drm/i915/display/intel_csr.c
:internal:
Video BIOS Table (VBT)

@ -131,7 +131,6 @@ needed).
usb/index
PCI/index
misc-devices/index
mic/index
scheduler/index
Architecture-agnostic documentation

@ -72,6 +72,10 @@ e.g., on Ubuntu for gcc-4.9::
apt-get install gcc-4.9-plugin-dev
Or on Fedora::
dnf install gcc-plugin-devel
Enable a GCC plugin based feature in the kernel config::
CONFIG_GCC_PLUGIN_CYC_COMPLEXITY=y

@ -19,6 +19,7 @@ Kernel Build System
issues
reproducible-builds
gcc-plugins
.. only:: subproject and html

@ -601,7 +601,7 @@ Defined in ``include/linux/export.h``
This is the variant of `EXPORT_SYMBOL()` that allows specifying a symbol
namespace. Symbol Namespaces are documented in
``Documentation/core-api/symbol-namespaces.rst``.
:doc:`../core-api/symbol-namespaces`
:c:func:`EXPORT_SYMBOL_NS_GPL()`
--------------------------------
@ -610,7 +610,7 @@ Defined in ``include/linux/export.h``
This is the variant of `EXPORT_SYMBOL_GPL()` that allows specifying a symbol
namespace. Symbol Namespaces are documented in
``Documentation/core-api/symbol-namespaces.rst``.
:doc:`../core-api/symbol-namespaces`
Routines and Conventions
========================

@ -150,17 +150,17 @@ Locking Only In User Context
If you have a data structure which is only ever accessed from user
context, then you can use a simple mutex (``include/linux/mutex.h``) to
protect it. This is the most trivial case: you initialize the mutex.
Then you can call :c:func:`mutex_lock_interruptible()` to grab the
mutex, and :c:func:`mutex_unlock()` to release it. There is also a
:c:func:`mutex_lock()`, which should be avoided, because it will
Then you can call mutex_lock_interruptible() to grab the
mutex, and mutex_unlock() to release it. There is also a
mutex_lock(), which should be avoided, because it will
not return if a signal is received.
Example: ``net/netfilter/nf_sockopt.c`` allows registration of new
:c:func:`setsockopt()` and :c:func:`getsockopt()` calls, with
:c:func:`nf_register_sockopt()`. Registration and de-registration
setsockopt() and getsockopt() calls, with
nf_register_sockopt(). Registration and de-registration
are only done on module load and unload (and boot time, where there is
no concurrency), and the list of registrations is only consulted for an
unknown :c:func:`setsockopt()` or :c:func:`getsockopt()` system
unknown setsockopt() or getsockopt() system
call. The ``nf_sockopt_mutex`` is perfect to protect this, especially
since the setsockopt and getsockopt calls may well sleep.
@ -170,19 +170,19 @@ Locking Between User Context and Softirqs
If a softirq shares data with user context, you have two problems.
Firstly, the current user context can be interrupted by a softirq, and
secondly, the critical region could be entered from another CPU. This is
where :c:func:`spin_lock_bh()` (``include/linux/spinlock.h``) is
where spin_lock_bh() (``include/linux/spinlock.h``) is
used. It disables softirqs on that CPU, then grabs the lock.
:c:func:`spin_unlock_bh()` does the reverse. (The '_bh' suffix is
spin_unlock_bh() does the reverse. (The '_bh' suffix is
a historical reference to "Bottom Halves", the old name for software
interrupts. It should really be called spin_lock_softirq() in a
perfect world).
Note that you can also use :c:func:`spin_lock_irq()` or
:c:func:`spin_lock_irqsave()` here, which stop hardware interrupts
Note that you can also use spin_lock_irq() or
spin_lock_irqsave() here, which stop hardware interrupts
as well: see `Hard IRQ Context <#hard-irq-context>`__.
This works perfectly for UP as well: the spin lock vanishes, and this
macro simply becomes :c:func:`local_bh_disable()`
macro simply becomes local_bh_disable()
(``include/linux/interrupt.h``), which protects you from the softirq
being run.
@ -216,8 +216,8 @@ Different Tasklets/Timers
~~~~~~~~~~~~~~~~~~~~~~~~~
If another tasklet/timer wants to share data with your tasklet or timer,
you will both need to use :c:func:`spin_lock()` and
:c:func:`spin_unlock()` calls. :c:func:`spin_lock_bh()` is
you will both need to use spin_lock() and
spin_unlock() calls. spin_lock_bh() is
unnecessary here, as you are already in a tasklet, and none will be run
on the same CPU.
@ -234,14 +234,14 @@ The same softirq can run on the other CPUs: you can use a per-CPU array
going so far as to use a softirq, you probably care about scalable
performance enough to justify the extra complexity.
You'll need to use :c:func:`spin_lock()` and
:c:func:`spin_unlock()` for shared data.
You'll need to use spin_lock() and
spin_unlock() for shared data.
Different Softirqs
~~~~~~~~~~~~~~~~~~
You'll need to use :c:func:`spin_lock()` and
:c:func:`spin_unlock()` for shared data, whether it be a timer,
You'll need to use spin_lock() and
spin_unlock() for shared data, whether it be a timer,
tasklet, different softirq or the same or another softirq: any of them
could be running on a different CPU.
@ -259,38 +259,38 @@ If a hardware irq handler shares data with a softirq, you have two
concerns. Firstly, the softirq processing can be interrupted by a
hardware interrupt, and secondly, the critical region could be entered
by a hardware interrupt on another CPU. This is where
:c:func:`spin_lock_irq()` is used. It is defined to disable
spin_lock_irq() is used. It is defined to disable
interrupts on that CPU, then grab the lock.
:c:func:`spin_unlock_irq()` does the reverse.
spin_unlock_irq() does the reverse.
The irq handler does not to use :c:func:`spin_lock_irq()`, because
The irq handler does not need to use spin_lock_irq(), because
the softirq cannot run while the irq handler is running: it can use
:c:func:`spin_lock()`, which is slightly faster. The only exception
spin_lock(), which is slightly faster. The only exception
would be if a different hardware irq handler uses the same lock:
:c:func:`spin_lock_irq()` will stop that from interrupting us.
spin_lock_irq() will stop that from interrupting us.
This works perfectly for UP as well: the spin lock vanishes, and this
macro simply becomes :c:func:`local_irq_disable()`
macro simply becomes local_irq_disable()
(``include/asm/smp.h``), which protects you from the softirq/tasklet/BH
being run.
:c:func:`spin_lock_irqsave()` (``include/linux/spinlock.h``) is a
spin_lock_irqsave() (``include/linux/spinlock.h``) is a
variant which saves whether interrupts were on or off in a flags word,
which is passed to :c:func:`spin_unlock_irqrestore()`. This means
which is passed to spin_unlock_irqrestore(). This means
that the same code can be used inside a hard irq handler (where
interrupts are already off) and in softirqs (where the irq disabling is
required).
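
Why the saved flags word matters can be shown with a toy userspace model. This is only an illustration, not kernel code: a single global stands in for the CPU's interrupt-enable state, no actual locking takes place, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the CPU interrupt-enable state (true = enabled). */
static bool toy_irqs_on = true;

bool toy_irqs_enabled(void)
{
	return toy_irqs_on;
}

/* Model of the _irq variants: always disable, then always enable. */
void toy_lock_irq(void)
{
	toy_irqs_on = false;
}

void toy_unlock_irq(void)
{
	toy_irqs_on = true;
}

/* Model of the _irqsave variant: remember the previous state, then disable. */
unsigned long toy_lock_irqsave(void)
{
	unsigned long flags = toy_irqs_on ? 1UL : 0UL;

	toy_irqs_on = false;
	return flags;
}

/* Model of the _irqrestore variant: put back exactly what was saved. */
void toy_unlock_irqrestore(unsigned long flags)
{
	assert(flags == 0UL || flags == 1UL);
	toy_irqs_on = (flags != 0UL);
}
```

In this model, a save/restore pair entered with interrupts already off leaves them off, whereas toy_lock_irq() followed by toy_unlock_irq() would turn them back on regardless of the state on entry.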
Note that softirqs (and hence tasklets and timers) are run on return
from hardware interrupts, so :c:func:`spin_lock_irq()` also stops
these. In that sense, :c:func:`spin_lock_irqsave()` is the most
from hardware interrupts, so spin_lock_irq() also stops
these. In that sense, spin_lock_irqsave() is the most
general and powerful locking function.
Locking Between Two Hard IRQ Handlers
-------------------------------------
It is rare to have to share data between two IRQ handlers, but if you
do, :c:func:`spin_lock_irqsave()` should be used: it is
do, spin_lock_irqsave() should be used: it is
architecture-specific whether all interrupts are disabled inside irq
handlers themselves.
@ -304,11 +304,11 @@ Pete Zaitcev gives the following summary:
(``copy_from_user*(`` or ``kmalloc(x,GFP_KERNEL)``).
- Otherwise (== data can be touched in an interrupt), use
:c:func:`spin_lock_irqsave()` and
:c:func:`spin_unlock_irqrestore()`.
spin_lock_irqsave() and
spin_unlock_irqrestore().
- Avoid holding a spinlock for more than 5 lines of code and across any
function call (except accessors like :c:func:`readb()`).
function call (except accessors like readb()).
Table of Minimum Requirements
-----------------------------
@ -320,7 +320,7 @@ particular thread can only run on one CPU at a time, but if it needs
to share data with another thread, locking is required).
Remember the advice above: you can always use
:c:func:`spin_lock_irqsave()`, which is a superset of all other
spin_lock_irqsave(), which is a superset of all other
spinlock primitives.
============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
@ -363,13 +363,13 @@ They can be used if you need no access to the data protected with the
lock when some other thread is holding the lock. You should acquire the
lock later if you then need access to the data protected with the lock.
:c:func:`spin_trylock()` does not spin but returns non-zero if it
spin_trylock() does not spin but returns non-zero if it
acquires the spinlock on the first try or 0 if not. This function can be
used in all contexts like :c:func:`spin_lock()`: you must have
used in all contexts like spin_lock(): you must have
disabled the contexts that might interrupt you and acquire the spin
lock.
:c:func:`mutex_trylock()` does not suspend your task but returns
mutex_trylock() does not suspend your task but returns
non-zero if it could lock the mutex on the first try or 0 if not. This
function cannot be safely used in hardware or software interrupt
contexts despite not sleeping.
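
The "non-zero on success, 0 on failure" convention described above can be modelled in userspace. The following is only a sketch built on C11 ``atomic_flag``; ``struct sketch_lock``, ``sketch_trylock()``, ``sketch_unlock()`` and ``try_update()`` are hypothetical names, not kernel APIs, and the sketch does nothing about interrupt contexts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace lock; NOT the kernel's spinlock_t. */
struct sketch_lock {
    atomic_flag held;
};

#define SKETCH_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Returns 1 if the lock was acquired on the first try, 0 if it was busy. */
static int sketch_trylock(struct sketch_lock *l)
{
    assert(l != NULL);
    /* test_and_set returns the previous state: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void sketch_unlock(struct sketch_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Usage: touch the shared data only if the lock was free, else back off. */
static int try_update(struct sketch_lock *l, int *shared)
{
    if (!sketch_trylock(l))
        return 0;           /* busy: the caller may retry later */
    (*shared)++;
    sketch_unlock(l);
    return 1;
}
```

The caller never spins or sleeps here: a failed attempt simply reports 0 and leaves the protected data untouched.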
@ -490,14 +490,14 @@ easy, since we copy the data for the user, and never let them access the
objects directly.
There is a slight (and common) optimization here: in
:c:func:`cache_add()` we set up the fields of the object before
cache_add() we set up the fields of the object before
grabbing the lock. This is safe, as no-one else can access it until we
put it in cache.
Accessing From Interrupt Context
--------------------------------
Now consider the case where :c:func:`cache_find()` can be called
Now consider the case where cache_find() can be called
from interrupt context: either a hardware interrupt or a softirq. An
example would be a timer which deletes objects from the cache.
@ -566,16 +566,16 @@ which are taken away, and the ``+`` are lines which are added.
return ret;
}
Note that the :c:func:`spin_lock_irqsave()` will turn off
Note that the spin_lock_irqsave() will turn off
interrupts if they are on, otherwise does nothing (if we are already in
an interrupt handler), hence these functions are safe to call from any
context.
Unfortunately, :c:func:`cache_add()` calls :c:func:`kmalloc()`
Unfortunately, cache_add() calls kmalloc()
with the ``GFP_KERNEL`` flag, which is only legal in user context. I
have assumed that :c:func:`cache_add()` is still only called in
have assumed that cache_add() is still only called in
user context, otherwise this should become a parameter to
:c:func:`cache_add()`.
cache_add().
Exposing Objects Outside This File
----------------------------------
@ -592,7 +592,7 @@ This makes locking trickier, as it is no longer all in one place.
The second problem is the lifetime problem: if another structure keeps a
pointer to an object, it presumably expects that pointer to remain
valid. Unfortunately, this is only guaranteed while you hold the lock,
otherwise someone might call :c:func:`cache_delete()` and even
otherwise someone might call cache_delete() and even
worse, add another object, re-using the same address.
As there is only one lock, you can't hold it forever: no-one else would
@ -693,8 +693,8 @@ Here is the code::
We encapsulate the reference counting in the standard 'get' and 'put'
functions. Now we can return the object itself from
:c:func:`cache_find()` which has the advantage that the user can
now sleep holding the object (eg. to :c:func:`copy_to_user()` to
cache_find() which has the advantage that the user can
now sleep holding the object (eg. to copy_to_user() to
name to userspace).
The other point to note is that I said a reference should be held for
@ -710,7 +710,7 @@ number of atomic operations defined in ``include/asm/atomic.h``: these
are guaranteed to be seen atomically from all CPUs in the system, so no
lock is required. In this case, it is simpler than using spinlocks,
although for anything non-trivial using spinlocks is clearer. The
:c:func:`atomic_inc()` and :c:func:`atomic_dec_and_test()`
atomic_inc() and atomic_dec_and_test()
are used instead of the standard increment and decrement operators, and
the lock is no longer used to protect the reference count itself.
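
As a rough userspace illustration of the get/put pattern described above, the sketch below uses C11 atomics in place of the kernel's atomic_inc() and atomic_dec_and_test(). The names ``object_new()`` and ``objects_freed`` are assumptions made for the example, and the ``struct object`` here is a cut-down stand-in rather than the cache object from the document.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Cut-down, hypothetical stand-in for the cache object. */
struct object {
    atomic_int refcnt;
    int id;
};

/* Counts freed objects; exists only so the behaviour can be observed. */
static int objects_freed;

/* Allocate an object that starts with one reference, held by the creator. */
static struct object *object_new(int id)
{
    struct object *obj = malloc(sizeof(*obj));

    assert(obj != NULL);
    atomic_init(&obj->refcnt, 1);
    obj->id = id;
    return obj;
}

/* Take an extra reference (plays the role of atomic_inc()). */
static void object_get(struct object *obj)
{
    atomic_fetch_add(&obj->refcnt, 1);
}

/*
 * Drop a reference (plays the role of atomic_dec_and_test()).
 * Returns true if this call dropped the last reference and freed the object.
 */
static bool object_put(struct object *obj)
{
    /* fetch_sub returns the old value: 1 means we held the last reference. */
    if (atomic_fetch_sub(&obj->refcnt, 1) == 1) {
        free(obj);
        objects_freed++;
        return true;
    }
    return false;
}
```

Only the caller that sees the count drop to zero frees the object, so no lock is needed around the count itself.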
@ -802,7 +802,7 @@ name to change, there are three possibilities:
- You can make ``cache_lock`` non-static, and tell people to grab that
lock before changing the name in any object.
- You can provide a :c:func:`cache_obj_rename()` which grabs this
- You can provide a cache_obj_rename() which grabs this
lock and changes the name for the caller, and tell everyone to use
that function.
@ -861,11 +861,11 @@ Note that I decide that the popularity count should be protected by the
``cache_lock`` rather than the per-object lock: this is because it (like
the :c:type:`struct list_head <list_head>` inside the object)
is logically part of the infrastructure. This way, I don't need to grab
the lock of every object in :c:func:`__cache_add()` when seeking
the lock of every object in __cache_add() when seeking
the least popular.
I also decided that the id member is unchangeable, so I don't need to
grab each object lock in :c:func:`__cache_find()` to examine the
grab each object lock in __cache_find() to examine the
id: the object lock is only used by a caller who wants to read or write
the name field.
@ -887,7 +887,7 @@ trivial to diagnose: not a
stay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem.
For a slightly more complex case, imagine you have a region shared by a
softirq and user context. If you use a :c:func:`spin_lock()` call
softirq and user context. If you use a spin_lock() call
to protect it, it is possible that the user context will be interrupted
by the softirq while it holds the lock, and the softirq will then spin
forever trying to get the same lock.
@ -985,12 +985,12 @@ you might do the following::
Sooner or later, this will crash on SMP, because a timer can have just
gone off before the :c:func:`spin_lock_bh()`, and it will only get
the lock after we :c:func:`spin_unlock_bh()`, and then try to free
gone off before the spin_lock_bh(), and it will only get
the lock after we spin_unlock_bh(), and then try to free
the element (which has already been freed!).
This can be avoided by checking the result of
:c:func:`del_timer()`: if it returns 1, the timer has been deleted.
del_timer(): if it returns 1, the timer has been deleted.
If 0, it means (in this case) that it is currently running, so we can
do::
@ -1012,9 +1012,9 @@ do::
Another common problem is deleting timers which restart themselves (by
calling :c:func:`add_timer()` at the end of their timer function).
calling add_timer() at the end of their timer function).
Because this is a fairly common case which is prone to races, you should
use :c:func:`del_timer_sync()` (``include/linux/timer.h``) to
use del_timer_sync() (``include/linux/timer.h``) to
handle this case. It returns the number of times the timer had to be
deleted before we finally stopped it from adding itself back in.
@ -1086,7 +1086,7 @@ adding ``new`` to a single linked list called ``list``::
list->next = new;
The :c:func:`wmb()` is a write memory barrier. It ensures that the
The wmb() is a write memory barrier. It ensures that the
first operation (setting the new element's ``next`` pointer) is complete
and will be seen by all CPUs, before the second operation is (putting
the new element into the list). This is important, since modern
@ -1097,7 +1097,7 @@ rest of the list.
Fortunately, there is a function to do this for standard
:c:type:`struct list_head <list_head>` lists:
:c:func:`list_add_rcu()` (``include/linux/list.h``).
list_add_rcu() (``include/linux/list.h``).
Removing an element from the list is even simpler: we replace the
pointer to the old element with a pointer to its successor, and readers
@ -1108,7 +1108,7 @@ will either see it, or skip over it.
list->next = old->next;
There is :c:func:`list_del_rcu()` (``include/linux/list.h``) which
There is list_del_rcu() (``include/linux/list.h``) which
does this (the normal version poisons the old object, which we don't
want).
@ -1116,9 +1116,9 @@ The reader must also be careful: some CPUs can look through the ``next``
pointer to start reading the contents of the next element early, but
don't realize that the pre-fetched contents is wrong when the ``next``
pointer changes underneath them. Once again, there is a
:c:func:`list_for_each_entry_rcu()` (``include/linux/list.h``)
list_for_each_entry_rcu() (``include/linux/list.h``)
to help you. Of course, writers can just use
:c:func:`list_for_each_entry()`, since there cannot be two
list_for_each_entry(), since there cannot be two
simultaneous writers.
Our final dilemma is this: when can we actually destroy the removed
@ -1127,14 +1127,14 @@ the list right now: if we free this element and the ``next`` pointer
changes, the reader will jump off into garbage and crash. We need to
wait until we know that all the readers who were traversing the list
when we deleted the element are finished. We use
:c:func:`call_rcu()` to register a callback which will actually
call_rcu() to register a callback which will actually
destroy the object once all pre-existing readers are finished.
Alternatively, :c:func:`synchronize_rcu()` may be used to block
Alternatively, synchronize_rcu() may be used to block
until all pre-existing readers are finished.
But how does Read Copy Update know when the readers are finished? The
method is this: firstly, the readers always traverse the list inside
:c:func:`rcu_read_lock()`/:c:func:`rcu_read_unlock()` pairs:
rcu_read_lock()/rcu_read_unlock() pairs:
these simply disable preemption so the reader won't go to sleep while
reading the list.
@ -1223,12 +1223,12 @@ this is the fundamental idea.
}
Note that the reader will alter the popularity member in
:c:func:`__cache_find()`, and now it doesn't hold a lock. One
__cache_find(), and now it doesn't hold a lock. One
solution would be to make it an ``atomic_t``, but for this usage, we
don't really care about races: an approximate result is good enough, so
I didn't change it.
The result is that :c:func:`cache_find()` requires no
The result is that cache_find() requires no
synchronization with any other functions, so is almost as fast on SMP as
it would be on UP.
@ -1240,9 +1240,9 @@ and put the reference count.
Now, because the 'read lock' in RCU is simply disabling preemption, a
caller which always has preemption disabled between calling
:c:func:`cache_find()` and :c:func:`object_put()` does not
cache_find() and object_put() does not
need to actually get and put the reference count: we could expose
:c:func:`__cache_find()` by making it non-static, and such
__cache_find() by making it non-static, and such
callers could simply call that.
The benefit here is that the reference count is not written to: the
@ -1260,11 +1260,11 @@ counter. Nice and simple.
If that was too slow (it's usually not, but if you've got a really big
machine to test on and can show that it is), you could instead use a
counter for each CPU, then none of them need an exclusive lock. See
:c:func:`DEFINE_PER_CPU()`, :c:func:`get_cpu_var()` and
:c:func:`put_cpu_var()` (``include/linux/percpu.h``).
DEFINE_PER_CPU(), get_cpu_var() and
put_cpu_var() (``include/linux/percpu.h``).
Of particular use for simple per-cpu counters is the ``local_t`` type,
and the :c:func:`cpu_local_inc()` and related functions, which are
and the cpu_local_inc() and related functions, which are
more efficient than simple code on some architectures
(``include/asm/local.h``).
@ -1289,10 +1289,10 @@ irq handler doesn't use a lock, and all other accesses are done as so::
enable_irq(irq);
spin_unlock(&lock);
The :c:func:`disable_irq()` prevents the irq handler from running
The disable_irq() prevents the irq handler from running
(and waits for it to finish if it's currently running on other CPUs).
The spinlock prevents any other accesses happening at the same time.
Naturally, this is slower than just a :c:func:`spin_lock_irq()`
Naturally, this is slower than just a spin_lock_irq()
call, so it only makes sense if this type of access happens extremely
rarely.
@ -1315,22 +1315,22 @@ from user context, and can sleep.
- Accesses to userspace:
- :c:func:`copy_from_user()`
- copy_from_user()
- :c:func:`copy_to_user()`
- copy_to_user()
- :c:func:`get_user()`
- get_user()
- :c:func:`put_user()`
- put_user()
- :c:func:`kmalloc(GFP_KERNEL) <kmalloc>`
- kmalloc(GFP_KERNEL)
- :c:func:`mutex_lock_interruptible()` and
:c:func:`mutex_lock()`
- mutex_lock_interruptible() and
mutex_lock()
There is a :c:func:`mutex_trylock()` which does not sleep.
There is a mutex_trylock() which does not sleep.
Still, it must not be used inside interrupt context since its
implementation is not safe for that. :c:func:`mutex_unlock()`
implementation is not safe for that. mutex_unlock()
will also never sleep. It cannot be used in interrupt context either
since a mutex must be released by the same task that acquired it.
@ -1340,11 +1340,11 @@ Some Functions Which Don't Sleep
Some functions are safe to call from any context, or holding almost any
lock.
- :c:func:`printk()`
- printk()
- :c:func:`kfree()`
- kfree()
- :c:func:`add_timer()` and :c:func:`del_timer()`
- add_timer() and del_timer()
Mutex API reference
===================
@ -1400,26 +1400,26 @@ preemption
bh
Bottom Half: for historical reasons, functions with '_bh' in them often
now refer to any software interrupt, e.g. :c:func:`spin_lock_bh()`
now refer to any software interrupt, e.g. spin_lock_bh()
blocks any software interrupt on the current CPU. Bottom halves are
deprecated, and will eventually be replaced by tasklets. Only one bottom
half will be running at any time.
Hardware Interrupt / Hardware IRQ
Hardware interrupt request. :c:func:`in_irq()` returns true in a
Hardware interrupt request. in_irq() returns true in a
hardware interrupt handler.
Interrupt Context
Not user context: processing a hardware irq or software irq. Indicated
by the :c:func:`in_interrupt()` macro returning true.
by the in_interrupt() macro returning true.
SMP
Symmetric Multi-Processor: kernels compiled for multiple-CPU machines.
(``CONFIG_SMP=y``).
Software Interrupt / softirq
Software interrupt handler. :c:func:`in_irq()` returns false;
:c:func:`in_softirq()` returns true. Tasklets and softirqs both
Software interrupt handler. in_irq() returns false;
in_softirq() returns true. Tasklets and softirqs both
fall into the category of 'software interrupts'.
Strictly speaking a softirq is one of up to 32 enumerated software


@ -128,6 +128,10 @@ since we already have a valid pointer that we own a refcount for. The
put needs no lock because nothing tries to get the data without
already holding a pointer.
In the above example, kref_put() will be called 2 times in both success
and error paths. This is necessary because the reference count got
incremented 2 times by kref_init() and kref_get().
Note that the "before" in rule 1 is very important. You should never
do something like::


@ -291,8 +291,8 @@ and QUERYMENU. And G/S_CTRL as well as G/TRY/S_EXT_CTRLS are automatically suppo
In practice the basic usage as described above is sufficient for most drivers.
Inheriting Controls
-------------------
Inheriting Sub-device Controls
------------------------------
When a sub-device is registered with a V4L2 driver by calling
v4l2_device_register_subdev() and the ctrl_handler fields of both v4l2_subdev
@ -757,8 +757,8 @@ attempting to find another control from the same handler will deadlock.
It is recommended not to use this function from inside the control ops.
Inheriting Controls
-------------------
Preventing Controls Inheritance
-------------------------------
When one control handler is added to another using v4l2_ctrl_add_handler, then
by default all controls from one are merged to the other. But a subdev might


@ -20,4 +20,5 @@ fit into other categories.
isl29003
lis3lv02d
max6875
mic/index
xilinx_sdfec
