diff --git a/locale/en/docs/guides/working-with-different-filesystems.md b/locale/en/docs/guides/working-with-different-filesystems.md new file mode 100644 index 0000000000000..36494b9bb1642 --- /dev/null +++ b/locale/en/docs/guides/working-with-different-filesystems.md @@ -0,0 +1,224 @@ +--- +title: Working with Different Filesystems +layout: docs.hbs +--- + +# Working with Different Filesystems + +Node exposes many features of the filesystem. But not all filesystems are alike. +The following are suggested best practices to keep your code simple and safe +when working with different filesystems. + +## Filesystem Behavior + +Before you can work with a filesystem, you need to know how it behaves. +Different filesystems behave differently and have more or less features than +others: case sensitivity, case insensitivity, case preservation, Unicode form +preservation, timestamp resolution, extended attributes, inodes, Unix +permissions, alternate data streams etc. + +Be wary of inferring filesystem behavior from `process.platform`. For example, +do not assume that because your program is running on Darwin that you are +therefore working on a case-insensitive filesystem (HFS+), as the user may be +using a case-sensitive filesystem (HFSX). Similarly, do not assume that because +your program is running on Linux that you are therefore working on a filesystem +which supports Unix permissions and inodes, as you may be on a particular +external drive, USB or network drive which does not. + +The operating system may not make it easy to infer filesystem behavior, but all +is not lost. Instead of keeping a list of every known filesystem and behavior +(which is always going to be incomplete), you can probe the filesystem to see +how it actually behaves. The presence or absence of certain features which are +easy to probe, are often enough to infer the behavior of other features which +are more difficult to probe. + +Remember that some users may have different filesystems mounted at various paths +in the working tree. + +## Avoid a Lowest Common Denominator Approach + +You might be tempted to make your program act like a lowest common denominator +filesystem, by normalizing all filenames to uppercase, normalizing all filenames +to NFC Unicode form, and normalizing all file timestamps to say 1-second +resolution. This would be the lowest common denominator approach. + +Do not do this. You would only be able to interact safely with a filesystem +which has the exact same lowest common denominator characteristics in every +respect. You would be unable to work with more advanced filesystems in the way +that users expect, and you would run into filename or timestamp collisions. You +would most certainly lose and corrupt user data through a series of complicated +dependent events, and you would create bugs that would be difficult if not +impossible to solve. + +What happens when you later need to support a filesystem that only has 2-second +or 24-hour timestamp resolution? What happens when the Unicode standard advances +to include a slightly different normalization algorithm (as has happened in the +past)? + +A lowest common denominator approach would tend to try to create a portable +program by using only "portable" system calls. This leads to programs that are +leaky and not in fact portable. + +## Adopt a Superset Approach + +Make the best use of each platform you support by adopting a superset approach. +For example, a portable backup program should sync btimes (the created time of a +file or folder) correctly between Windows systems, and should not destroy or +alter btimes, even though btimes are not supported on Linux systems. The same +portable backup program should sync Unix permissions correctly between Linux +systems, and should not destroy or alter Unix permissions, even though Unix +permissions are not supported on Windows systems. + +Handle different filesystems by making your program act like a more advanced +filesystem. Support a superset of all possible features: case-sensitivity, +case-preservation, Unicode form sensitivity, Unicode form preservation, Unix +permissions, high-resolution nanosecond timestamps, extended attributes etc. + +Once you have case-preservation in your program, you can always implement +case-insensitivity if you need to interact with a case-insensitive filesystem. +But if you forego case-preservation in your program, you cannot interact safely +with a case-preserving filesystem. The same is true for Unicode form +preservation and timestamp resolution preservation. + +If a filesystem provides you with a filename in a mix of lowercase and +uppercase, then keep the filename in the exact case given. If a filesystem +provides you with a filename in mixed Unicode form or NFC or NFD (or NFKC or +NFKD), then keep the filename in the exact byte sequence given. If a filesystem +provides you with a millisecond timestamp, then keep the timestamp in +millisecond resolution. + +When you work with a lesser filesystem, you can always downsample appropriately, +with comparison functions as required by the behavior of the filesystem on which +your program is running. If you know that the filesystem does not support Unix +permissions, then you should not expect to read the same Unix permissions you +write. If you know that the filesystem does not preserve case, then you should +be prepared to see `ABC` in a directory listing when your program creates `abc`. +But if you know that the filesystem does preserve case, then you should consider +`ABC` to be a different filename to `abc`, when detecting file renames or if the +filesystem is case-sensitive. + +## Case Preservation + +You may create a directory called `test/abc` and be surprised to see sometimes +that `fs.readdir('test')` returns `['ABC']`. This is not a bug in Node. Node +returns the filename as the filesystem stores it, and not all filesystems +support case-preservation. Some filesystems convert all filenames to uppercase +(or lowercase). + +## Unicode Form Preservation + +*Case preservation and Unicode form preservation are similar concepts. To +understand why Unicode form should be preserved , make sure that you first +understand why case should be preserved. Unicode form preservation is just as +simple when understood correctly.* + +Unicode can encode the same characters using several different byte sequences. +Several strings may look the same, but have different byte sequences. When +working with UTF-8 strings, be careful that your expectations are in line with +how Unicode works. Just as you would not expect all UTF-8 characters to encode +to a single byte, you should not expect several UTF-8 strings that look the same +to the human eye to have the same byte representation. This may be an +expectation that you can have of ASCII, but not of UTF-8. + +You may create a directory called `test/café` (NFC Unicode form with byte +sequence `<63 61 66 c3 a9>` and `string.length === 5`) and be surprised to see +sometimes that `fs.readdir('test')` returns `['café']` (NFD Unicode form with +byte sequence `<63 61 66 65 cc 81>` and `string.length === 6`). This is not a +bug in Node. Node returns the filename as the filesystem stores it, and not all +filesystems support Unicode form preservation. + +HFS+, for example, will normalize all filenames to a form almost always the same +as NFD form. Do not expect HFS+ to behave the same as NTFS or EXT4 and +vice-versa. Do not try to change data permanently through normalization as a +leaky abstraction to paper over Unicode differences between filesystems. This +would create problems without solving any. Rather, preserve Unicode form and use +normalization as a comparison function only. + +## Unicode Form Insensitivity + +Unicode form insensitivity and Unicode form preservation are two different +filesystem behaviors often mistaken for each other. Just as case-insensitivity +has sometimes been incorrectly implemented by permanently normalizing filenames +to uppercase when storing and transmitting filenames, so Unicode form +insensitivity has sometimes been incorrectly implemented by permanently +normalizing filenames to a certain Unicode form (NFD in the case of HFS+) when +storing and transmitting filenames. It is possible and much better to implement +Unicode form insensitivity without sacrificing Unicode form preservation, by +using Unicode normalization for comparison only. + +## Comparing Different Unicode Forms + +Node provides `string.normalize('NFC' / 'NFD')` which you can use to normalize a +UTF-8 string to either NFC or NFD. You should never store the output from this +function but only use it as part of a comparison function to test whether two +UTF-8 strings would look the same to the user. + +You can use `string1.normalize('NFC') === string2.normalize('NFC')` or +`string1.normalize('NFD') === string2.normalize('NFD')` as your comparison +function. Which form you use does not matter. + +Normalization is fast but you may want to use a cache as input to your +comparison function to avoid normalizing the same string many times over. If the +string is not present in the cache then normalize it and cache it. Be careful +not to store or persist the cache, use it only as a cache. + +Note that using `normalize()` requires that your version of Node include ICU +(otherwise `normalize()` will just return the original string). If you download +the latest version of Node from the website then it will include ICU. + +## Timestamp Resolution + +You may set the `mtime` (the modified time) of a file to `1444291759414` +(millisecond resolution) and be surprised to see sometimes that `fs.stat` +returns the new mtime as `1444291759000` (1-second resolution) or +`1444291758000` (2-second resolution). This is not a bug in Node. Node returns +the timestamp as the filesystem stores it, and not all filesystems support +nanosecond, millisecond or 1-second timestamp resolution. Some filesystems even +have very coarse resolution for the atime timestamp in particular, e.g. 24 hours +for some FAT filesystems. + +## Do Not Corrupt Filenames and Timestamps Through Normalization + +Filenames and timestamps are user data. Just as you would never automatically +rewrite user file data to uppercase the data or normalize `CRLF` to `LF` +line-endings, so you should never change, interfere or corrupt filenames or +timestamps through case / Unicode form / timestamp normalization. Normalization +should only ever be used for comparison, never for altering data. + +Normalization is effectively a lossy hash code. You can use it to test for +certain kinds of equivalence (e.g. do several strings look the same even though +they have different byte sequences) but you can never use it as a substitute for +the actual data. Your program should pass on filename and timestamp data as is. + +Your program can create new data in NFC (or in any combination of Unicode form +it prefers) or with a lowercase or uppercase filename, or with a 2-second +resolution timestamp, but your program should not corrupt existing user data by +imposing case / Unicode form / timestamp normalization. Rather, adopt a superset +approach and preserve case, Unicode form and timestamp resolution in your +program. That way, you will be able to interact safely with filesystems which do +the same. + +## Use Normalization Comparison Functions Appropriately + +Make sure that you use case / Unicode form / timestamp comparison functions +appropriately. Do not use a case-insensitive filename comparison function if you +are working on a case-sensitive filesystem. Do not use a Unicode form +insensitive comparison function if you are working on a Unicode form sensitive +filesystem (e.g. NTFS and most Linux filesystems which preserve both NFC and NFD +or mixed Unicode forms). Do not compare timestamps at 2-second resolution if you +are working on a nanosecond timestamp resolution filesystem. + +## Be Prepared for Slight Differences in Comparison Functions + +Be careful that your comparison functions match those of the filesystem (or +probe the filesystem if possible to see how it would actually compare). +Case-insensitivity for example is more complex than a simple `toLowerCase()` +comparison. In fact, `toUpperCase()` is usually better than `toLowerCase()` +(since it handles certain foreign language characters differently). But better +still would be to probe the filesystem since every filesystem has its own case +comparison table baked in. + +As an example, Apple's HFS+ normalizes filenames to NFD form but this NFD form +is actually an older version of the current NFD form and may sometimes be +slightly different from the latest Unicode standard's NFD form. Do not expect +HFS+ NFD to be exactly the same as Unicode NFD all the time.