Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Notes on Filesystem Layout
--------------------------
These notes describe what mkcramfs generates. Kernel requirements are
a bit looser, e.g. it doesn't care if the <file_data> items are
swapped around (though it does care that directory entries (inodes) in
a given directory are contiguous, as this is used by readdir).
All data is currently in host-endian format; neither mkcramfs nor the
kernel ever do swabbing. (See section `Block Size' below.)
<filesystem>:
<superblock>
<directory_structure>
<data>
<superblock>: struct cramfs_super (see cramfs_fs.h).
<directory_structure>:
For each file:
struct cramfs_inode (see cramfs_fs.h).
Filename. Not generally null-terminated, but it is
null-padded to a multiple of 4 bytes.
The order of inode traversal is described as "width-first" (not to be
confused with breadth-first); i.e. like depth-first but listing all of
a directory's entries before recursing down its subdirectories: the
same order as `ls -AUR' (but without the /^\..*:$/ directory header
lines); put another way, the same order as `find -type d -exec
ls -AU1 {} \;'.
Beginning in 2.4.7, directory entries are sorted. This optimization
allows cramfs_lookup to return more quickly when a filename does not
exist, speeds up user-space directory sorts, etc.
<data>:
One <file_data> for each file that's either a symlink or a
regular file of non-zero st_size.
<file_data>:
nblocks * <block_pointer>
(where nblocks = (st_size - 1) / blksize + 1)
nblocks * <block>
padding to multiple of 4 bytes
The i'th <block_pointer> for a file stores the byte offset of the
*end* of the i'th <block> (i.e. one past the last byte, which is the
same as the start of the (i+1)'th <block> if there is one). The first
<block> immediately follows the last <block_pointer> for the file.
<block_pointer>s are each 32 bits long.
The order of <file_data>'s is a depth-first descent of the directory
tree, i.e. the same order as `find -size +0 \( -type f -o -type l \)
-print'.
<block>: The i'th <block> is the output of zlib's compress function
applied to the i'th blksize-sized chunk of the input data.
(For the last <block> of the file, the input may of course be smaller.)
Each <block> may be a different size. (See <block_pointer> above.)
<block>s are merely byte-aligned, not generally u32-aligned.
Holes
-----
This kernel supports cramfs holes (i.e. [efficient representation of]
blocks in uncompressed data consisting entirely of NUL bytes), but by
default mkcramfs doesn't test for & create holes, since cramfs in
kernels up to at least 2.3.39 didn't support holes. Run mkcramfs
with -z if you want it to create files that can have holes in them.
Tools
-----
The cramfs user-space tools, including mkcramfs and cramfsck, are
located at <http://sourceforge.net/projects/cramfs/>.
Future Development
==================
Block Size
----------
(Block size in cramfs refers to the size of input data that is
compressed at a time. It's intended to be somewhere around
PAGE_SIZE for cramfs_readpage's convenience.)
The superblock ought to indicate the block size that the fs was
written for, since comments in <linux/pagemap.h> indicate that
PAGE_SIZE may grow in future (if I interpret the comment
correctly).
Currently, mkcramfs #define's PAGE_SIZE as 4096 and uses that
for blksize, whereas Linux-2.3.39 uses its PAGE_SIZE, which in
turn is defined as PAGE_SIZE (which can be as large as 32KB on arm).
This discrepancy is a bug, though it's not clear which should be
changed.
One option is to change mkcramfs to take its PAGE_SIZE from
<asm/page.h>. Personally I don't like this option, but it does
require the least amount of change: just change `#define
PAGE_SIZE (4096)' to `#include <asm/page.h>'. The disadvantage
is that the generated cramfs cannot always be shared between different
kernels, not even necessarily kernels of the same architecture if
PAGE_SIZE is subject to change between kernel versions
(currently possible with arm and ia64).
The remaining options try to make cramfs more sharable.
One part of that is addressing endianness. The two options here are
`always use little-endian' (like ext2fs) or `writer chooses
endianness; kernel adapts at runtime'. Little-endian wins because of
code simplicity and little CPU overhead even on big-endian machines.
The cost of swabbing is changing the code to use the le32_to_cpu
etc. macros as used by ext2fs. We don't need to swab the compressed
data, only the superblock, inodes and block pointers.
The other part of making cramfs more sharable is choosing a block
size. The options are:
1. Always 4096 bytes.
2. Writer chooses blocksize; kernel adapts but rejects blocksize >
PAGE_SIZE.
3. Writer chooses blocksize; kernel adapts even to blocksize >
PAGE_SIZE.
It's easy enough to change the kernel to use a smaller value than
PAGE_SIZE: just make cramfs_readpage read multiple blocks.
The cost of option 1 is that kernels with a larger PAGE_SIZE
value don't get as good compression as they can.
The cost of option 2 relative to option 1 is that the code uses
variables instead of #define'd constants. The gain is that people
with kernels having larger PAGE_SIZE can make use of that if
they don't mind their cramfs being inaccessible to kernels with
smaller PAGE_SIZE values.
Option 3 is easy to implement if we don't mind being CPU-inefficient:
e.g. get readpage to decompress to a buffer of size MAX_BLKSIZE (which
must be no larger than 32KB) and discard what it doesn't need.
Getting readpage to read into all the covered pages is harder.
The main advantage of option 3 over 1, 2, is better compression. The
cost is greater complexity. Probably not worth it, but I hope someone
will disagree. (If it is implemented, then I'll re-use that code in
e2compr.)
Another cost of 2 and 3 over 1 is making mkcramfs use a different
block size, but that just means adding and parsing a -b option.
Inode Size
----------
Given that cramfs will probably be used for CDs etc. as well as just
silicon ROMs, it might make sense to expand the inode a little from
its current 12 bytes. Inodes other than the root inode are followed
by filename, so the expansion doesn't even have to be a multiple of 4
bytes.