Author: Vince Darley <[email protected]>
State: Final
Type: Project
Vote: Done
Created: 17-Nov-2000
Post-History:
Tcl-Version: 8.4.0
Abstract
Many of the most exciting recent developments in Tcl have involved putting virtual file systems in a file (e.g. Prowrap, Freewrap, Wrap, TclKit) but these have been largely ad hoc hacks of various internal APIs. This TIP seeks to replace this with a common underlying API that will, in addition, make porting of Tcl to new platforms a simpler task as well.
Overview
There are two current drawbacks to Tcl's filesystem implementation:
virtual filesystems are not properly supported.
it is all string-based, rather than Tcl_Obj-based.
Prowrap (http://sourceforge.net/projects/tclpro\), Freewrap (http://home.nycap.rr.com/dlabelle/freewrap/freewrap.html\), Wrap (http://members1.chello.nl/~j.nijtmans/wrap.html\), TclKit (http://www.equi4.com/jcw/wiki.cgi/19.html\), ... are all attempts to provide an ability to place Tcl scripts and other data inside a single file (or just a small number of files). The best and simplest way to achieve that task (and many other useful tasks) is to let Tcl handle the contents of a single 'wrapped document' as if it were a filesystem: the contents may be opened, sourced, stat'd, copied, globbed, etc. Also note that at the European Tcl/Tk meeting, the (equal) second-ranked request for Tcl was support for standalone executables (http://mini.net/cgi-bin/wikit/837.html\).
This TIP suggests that Tcl's core be modified to allow non-native filesystems to be plugged in to the core, and hence allow perfect virtual filesystems to exist. The implementations provided by all of the above tools are very far from perfect. The most obvious types of virtual filesystem which should be supported are:
wrapped/archived document 'bundles' such as TclKits, .zip files, etc.
remote filesystems (e.g. an FTP site).
but the main point is that all filesystem access should occur through a hookable interface, so that Tcl neither knows nor cares what type of filesystem it is dealing with.
Furthermore this hookable interface should be Tcl_Obj based, providing a new 'Path' object type, which should be designed with two goals in mind:
allow caching of 'native path representations' (all native Tclp... filesystem calls involve various Utf->Native conversions). For example, quick testing for 'file exists' shows that a 20% speed up can be achieved by caching the native representation (Windows 2000sp1).
allow virtual filesystems to operate very efficiently - this will probably require caching of the filesystem to use for a particular file.
If all of these goals are achieved, Tcl will have a new filesystem which is both more efficient and more powerful than the existing implementation.
Technical discussion
Virtual filesystems
An examination of the core shows that a very limited support was added to tclIOUtil.c in June 1998 (presumably by Scriptics to support prowrap) so that TclStat, TclAccess and Tcl_OpenFileChannel commands could be intercepted. (See http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/tcl/generic/tclIOUtil.c?rev=1.2&content-type=text/x-cvsweb-markup&cvsroot=tcl\)
This TIP seeks to provide a complete implementation of virtual file system support, rather than these piecemeal functions.
Fortunately, since Tcl is already abstracted across three different filesystem types (through the Tclp...) functions, it is not that big a task to abstract away to any generic filesystem.
One goal of this TIP is to allow an extension to be written so that one can implement a virtual filesystem entirely in Tcl: i.e. to provide sufficient hooks into Tcl's core so that an extension can capture all filesystem requests and divert them if desired. The goal is not to provide Tcl-level hooks in Tcl's core. Such hooks will only be at the C level, and an extension would be required to expose them to the Tcl level.
Objectified filesystem interface.
Every filesystem access in Tcl's core usually involves several calls to 'access', 'stat', etc.
For example 'file atime $path' requires two calls to 'stat' and one call to 'utime', all with the same $path argument. Each of these requires a conversion from the same Utf path to the same native string representation. No caching is performed, so each of these goes through Tcl_UtfToExternal. Often Tcl code will use the same $path objects for an entire sequence of Tcl 'file' operations. Clearly a representation which cached the native path would speed up all of these operations (except the first).
The second reason why objectification is desirable is that in a pluggable-fs environment we must determine, for each file operation, which filesystem to use (whether native, a mounted .zip file, a remote FTP site, etc.). If this information can be cached for a particular path, again we will not need to recalculate it at every step. A similar technique to that used by Tcl's bytecode compilation will be used: each cached object will have a "filesystemEpoch" counter, so that we can tell with each access whether the filesystem has been modified (and we must discard the cached information). Mounting/unmounting filesystems will obviously modify the filesystemEpoch.
A relatively complete implementation of this TIP, and a sample "vfs" extension now exist, and have been tested through TclKit. On both the "virtual" and "objectification" parts of this tip, the implementation is known to be stable and complete (at least on Windows): TclKit can operate through this new vfs implementation without the need to override a single Tcl core command at the script level, and all reasonable filesystem tests (cmdAH.test, fCmd.test, fileName.test, io.test) pass in a scripted document. Commands which operate on files (image, source, etc.) and extensions like Image, Winico can be made to work in a TclKit automatically! There is still some room for optimisation in some parts of the new objectification code (which wasn't possible in the old string-based API). The current implementation has great efficiency gains for vfs's implemented at the script level, since the same Path objects can be passed through the entire process, without an intermediate conversion (and string duplication which would otherwise be required). The combination of caching and objectification changes the existing list of steps from
Tcl_Obj -> string -> filesystem -> convert-to-native -> native-call
or (with vfs hooked in)
Tcl_Obj -> string -> vfilesystem -> pick-filesystem -> convert-to-native -> native-call
and
Tcl_Obj -> string -> vfilesystem -> pick-filesystem -> Tcl_NewStringObj -> Tcl-vfs-call
to
Tcl_Obj -> vfilesystem -> native-call
and
Tcl_Obj -> vfilesystem -> Tcl-vfs-call
A final side-benefit of this proposal would be that it further modularises the core of Tcl, so that one could, in principle:
remove the native filesystem support entirely from Tcl (perhaps useful for embedded devices etc), since there will be a clean layer separating Tcl from its native filesystem functionality.
use Tcl's filesystem for other purposes (outside of Tcl).
However these final two points are explicitly not the goal of this TIP! I simply want to improve Tcl to add vfs support, and the best way to do that seems (to me) to be along the lines of this TIP.
Proposal
The changes to Tcl's core for virtual filesystem support are actually very minor. Every occurrence of a Tclp-filesystem call must be replaced by a call to a hookable procedure. The current filesystem structure (defined in tclInt.h) and hookable procedure list is as follows (for documentation on this structure, see Documentation section below):
/*
* struct Tcl_Filesystem:
*
* One such structure exists for each type (kind) of filesystem.
* It collects together in one place all the functions that are
* part of the specific filesystem. Tcl always accesses the
* filesystem through one of these structures.
*
* Not all entries need be non-NULL; any which are NULL are simply
* ignored. However, a complete filesystem should provide all of
* these functions.
*/
typedef struct Tcl_Filesystem {
CONST char *typeName; /* The name of the filesystem. */
int structureLength; /* Length of this structure, so future
* binary compatibility can be assured. */
Tcl_FilesystemVersion version;
/* Version of the filesystem type. */
TclfsPathInFilesystem_ *pathInFilesystemProc;
/* Function to check whether a path is in this
* filesystem */
TclfsDupInternalRep_ *dupInternalRepProc;
/* Function to duplicate internal fs rep */
TclfsFreeInternalRep_ *freeInternalRepProc;
/* Function to free internal fs rep */
TclfsInternalToNormalizedProc_ *internalToNormalizedProc_;
/* Function to convert internal representation
* to a normalized path */
TclfsConvertToInternalProc_ *convertToInternalProc_;
/* Function to convert object to an
* internal representation */
TclfsStatProc_ *statProc; /* Function to process a 'Tcl_Stat()' call */
TclfsAccessProc_ *accessProc;
/* Function to process a 'Tcl_Access()' call */
TclfsOpenFileChannelProc_ *openFileChannelProc;
/* Function to process a 'Tcl_OpenFileChannel()' call */
TclfsMatchInDirectoryProc_ *matchInDirectoryProc;
/* Function to process a 'Tcl_MatchInDirectory()' */
TclfsGetCwdProc_ *getCwdProc;
/* Function to process a 'Tcl_GetCwd()' call */
TclfsChdirProc_ *chdirProc;
/* Function to process a 'Tcl_Chdir()' call */
TclfsLstatProc_ *lstatProc;
/* Function to process a 'Tcl_Lstat()' call */
TclfsCopyFileProc_ *copyFileProc;
/* Function to process a 'Tcl_CopyFile()' call */
TclfsDeleteFileProc_ *deleteFileProc;
/* Function to process a 'Tcl_DeleteFile()' call */
TclfsRenameFileProc_ *renameFileProc;
/* Function to process a 'Tcl_RenameFile()' call */
TclfsCreateDirectoryProc_ *createDirectoryProc;
/* Function to process a 'Tcl_CreateDirectory()' call */
TclfsCopyDirectoryProc_ *copyDirectoryProc;
/* Function to process a 'Tcl_CopyDirectory()' call */
TclfsRemoveDirectoryProc_ *removeDirectoryProc;
/* Function to process a 'Tcl_RemoveDirectory()' call */
TclfsLoadFileProc_ *loadFileProc;
/* Function to process a 'Tcl_LoadFile()' call */
TclfsUnloadFileProc_ *unloadFileProc;
/* Function to unload a previously successfully
* loaded file */
TclfsReadlinkProc_ *readlinkProc;
/* Function to process a 'Tcl_Readlink()' call */
TclfsListVolumesProc_ *listVolumesProc;
/* Function to list any filesystem volumes added
* by this filesystem */
TclfsFileAttrStringsProc_ *fileAttrStringsProc;
/* Function to list all attributes strings which
* are valid for this filesystem */
TclfsFileAttrsGetProc_ *fileAttrsGetProc;
/* Function to process a 'Tcl_FileAttrsGet()' call */
TclfsFileAttrsSetProc_ *fileAttrsSetProc;
/* Function to process a 'Tcl_FileAttrsSet()' call */
TclfsUtimeProc_ *utimeProc;
/* Function to process a 'Tcl_Utime()' call */
TclfsNormalizePathProc_ *normalizePathProc;
/* Function to normalize a path */
} Tcl_Filesystem;
Once that is done, almost no more changes need be made to Tcl's core. We must simply add code (to tclIOUtil.c and declarations to tclInt.h) to implement the hookable functions and to provide a simple API by which extensions can hook into the new filesystem support.
This gives us the simplest level of vfs. Most remaining changes are objectifying the way Tcl's core uses filesystems. Many of these changes actually simplify the core, for example, we replace:
case FILE_COPY: {
int result;
char **argv;
argv = StringifyObjects(objc, objv);
result = TclFileCopyCmd(interp, objc, argv);
ckfree((char *) argv);
return result;
}
with
case FILE_COPY: {
return TclFileCopyCmd(interp, objc, objv);
}
and the Unix versions of stat, access, chdir are as simple as:
int TclpObjStat(Tcl_Obj *pathPtr, struct stat *buf) {
return stat(Tclfs_GetNativePath(pathPtr), buf);
}
int TclpObjChdir(Tcl_Obj *pathPtr) {
return chdir(Tclfs_GetNativePath(pathPtr));
}
int TclpObjAccess(Tcl_Obj *pathPtr, int mode) {
return access(Tclfs_GetNativePath(pathPtr));
}
There are a few other small changes required, some which are absolutely necessary, and some which make the implementation of a Tcl-level vfs much simpler and more robust:
Cross-filesystem copy and rename operations will fail. A patch was added so that Tcl can fall back on 'open r/open w/fcopy/file mtime' as a copying method for files, and a new function ::tcl::copyDirectory for directories. These techniques are only used if the source/dest are in different filesystems, or if the filesystem Tcl tries to use returns the EXDEV posix result in 'errno'. This is a natural extension of Tcl's current way of falling back from 'rename' to 'copy and delete'.
Add -tails flag to glob (and internally to TclGlob) to indicate that we only want the tails of the files to be returned.
Add file normalize path subcommand to file, which returns an absolute path in which all "..", "." sequences have been removed, and the file is a platform-normalized path (e.g. the longname is used on windows).
Modify the TclDoGlob implementation so it handles recursion itself, rather than passing it on to the various TclpMatchFilesTypes functions. This simplifies the platform-specific code, and makes vfs support of glob much more robust, easy, complete, etc. This has the side- benefit that TclpMatchFilesTypes need not operate directly on the interpreter's result. The simpler function with a different signature has been named TclpMatchInDirectory.
Modify the implementation of encoding names to use the TCL_GLOBMODE_TAILS flag to TclGlob, simplifying that code.
Add an API to tclIO.c to allow us to Unregister a channel without deleting it. We need this to be able to take a channel created in Tcl (registered and with refcount of 1) and turn it into a "pristine channel with refcount 0" as returned by Tcl_OpenFileChannel. This is called Tcl_DetachChannel.
Add a function to tclInt.decls called 'TclpVerifyInitialEncodings' which is required when all of Tcl is packaged in a virtual filesystem (e.g. TclKit), since Tcl's very early call to TclpSetInitialEncodings fails to achieve anything useful.
the perfect vfs support can have some weird side-effects. For instance, if I embed all of tcltest and tests/ inside a TclKit, and try to source 'all.tcl', I get errors that each file does not exist. This is because the test code tries to pipe each file in turn to a newly created tcl process (open "| tclsh foo.test r"), but the files don't really exist (as far as the OS is concerned). We therefore add an introspection command 'file system $path' which returns a list of two items: the name of the filesystem and the type of the file within that filesystem. For example it might be 'native local', 'native networked', 'vfs ftp', 'vfs zip' etc.
vfs systems on different platforms may require different directory separator characters (different to the native characters), especially on MacOS (in which a valid file might be HD:Tcl:tcl.zip/lib/tcl8.4/history.tcl), and we therefore add 'file separator ?filename?' command to retrieve the correct separator to use.
As mentioned above an implementation of all of this now exists. The modified Tcl core passes all Tcl tests, and works with a new version of TclKit (which can itself pass all reasonable tests when executing the test suite inside a scripted document). It has been tested with a variety of wrapped demos (tclhttpd, bwidgets, widgets, alphatk), and performs very well. The patch is available from the author of this TIP, and a version (possibly not the most recent) can be downloaded from: ftp://ftp.ucsd.edu/pub/alpha/tcl/tclobjvfs.diff
Documentation: vfs-aware extensions
All calls to filesystem functions in Tcl's core and in extensions should preferably be made through the new API defined in tclInt.decls. All these functions have the prefix 'Tclfs_'. Of course extensions which call older string-based APIs (e.g. Tcl_OpenFileChannel) will still work, but will not benefit from the efficiency of the cached object representation. Most of these functions are not commonly used by extensions (e.g. Tclfs_CopyDirectory), so only the most common are listed here:
Tcl_Channel Tclfs_OpenFileChannel(Tcl_Interp *interp, Tcl_Obj *pathPtr,
char *modeString, int permissions)
int Tclfs_EvalFile(Tcl_Interp *interp, Tcl_Obj *fileName)
int Tclfs_Stat(Tcl_Obj *pathPtr, struct stat *buf)
int Tclfs_Access(Tcl_Obj *pathPtr, int mode)
These replace the equivalent string-based APIs.
int Tclfs_ConvertToPathType(Tcl_Interp *interp, Tcl_Obj *pathPtr)
Attempts to convert the given object to a path type. This is a little more than a simple wrapper around Tcl_ConvertToType(interp, pathPtr, &tclFsPathType). If it returns TCL_ERROR, the object is not a valid path. In this sense it is very similar to Tcl_TranslateFileName for the existing string-based API. It should be called before attempting to pass an object to any of the other filesystem APIs (again in much the same way as Tcl_TranslateFileName was used in the core).
int Tclfs_EqualPaths(Tcl_Obj* firstPtr, Tcl_Obj* secondPtr)
Tcl_Obj* Tclfs_SplitPath(Tcl_Obj* pathPtr, int *lenPtr)
Tcl_Obj* Tclfs_JoinPath(Tcl_Obj *listObj, int elements)
Tcl_Obj* Tclfs_JoinToPath(Tcl_Obj *basePtr, int objc, Tcl_Obj *CONST objv[])
These all manipulate paths. They return Tcl_Obj* with refCounts of zero.
Tcl_Obj* Tclfs_GetNormalizedPath(Tcl_Interp *interp, Tcl_Obj* pathObjPtr)
char* Tclfs_GetTranslatedPath(Tcl_Interp *interp, Tcl_Obj* pathPtr)
and finally:
char* Tclfs_GetNativePath(Tcl_Obj* pathObjPtr)
which is used by native filesystems only, and is a shorthand for getting at the cached native representation for MacOS, Windows or Unix (as appropriate). This is always a string based representation, but may really be of type TCHAR* on Windows, for example.
Documentation: writing a new filesystem
The objPtr->internalRep.otherValuePtr field is a pointer to one of these structures, for objects of "path" type.
typedef struct FsPath {
char *translatedPathPtr; /* Name without any ~user sequences */
Tcl_Obj *normPathPtr; /* Normalized absolute path, without .
* or .. sequences, and without ~user
* sequences. */
Tcl_Obj *cwdPtr; /* If null, path is absolute, else
* this points to the cwd object used
* for this path. We have a refCount
* on the object. */
ClientData nativePathPtr; /* Native representation of this path,
* which is filesystem dependent. */
int filesystemEpoch; /* Used to ensure the path representation
* was generated during the correct
* filesystem epoch. The epoch changes
* when filesystem-mounts are changed. */
struct Tcl_FilesystemRecord *fsRecPtr;
/* Pointer to the filesystem record
* entry to use for this path. */
} FsPath;
Path to filesystem mapping:
int TclfsPathInFilesystem_ (Tcl_Obj *pathPtr, ClientData *clientDataPtr)
Is the given path in this filesystem? This function should return either -1, or it should return TCL_OK. If it returns TCL_OK, it may wish to set the clientData parameter to point to a filesystem specific representation of the path. (The native filesystem actually postpones the calculation of the native representation until it is requested, but TclKit's vfs immediately allocates a structure containing an int and Tcl_Obj* which it uses as an internal representation).
Internal representation manipulation:
void TclfsFreeInternalRep_ (ClientData clientData)
ClientData TclfsDupInternalRep_ (ClientData clientData)
These two are called to duplicate and free the clientData field of the FsPath structure. If they are NULL, they are ignored (and on duplication, the new object's clientData field is set to NULL).
Path normalization:
int TclfsNormalizePathProc_ (Tcl_Interp *interp, Tcl_Obj *pathPtr,
int nextCheckpoint)
This function should check the string representation of pathPtr, starting at character index 'nextCheckpoint', and convert it from that point onwards (if possible) to a filesystem-specific unique form. It should return the character index one beyond where is could no longer apply (e.g. pointing to a directory separator or end of string). That index is then passed on to the next filesystem to continue. Most filesystems do not support path ambiguity, in which case the function need not be implemented at all (a NULL entry in the lookup table is acceptable). For example on Windows, the path "c:/PROGRA~1/tcl/tclkit.exe/lib" would be modified to "C:/Program Files/Tcl/tclkit.exe/lib" by the core's normalization procedure, which would return '31', pointing to the '/' in-between .exe and lib.
File manipulation:
For each filesystem function which is implemented, these procs should be declared:
TclfsAccessProc_ *accessProc;
TclfsStatProc_ *statProc;
TclfsOpenFileChannelProc_ *openFileChannelProc;
TclfsMatchFilesTypesProc_ *matchFilesTypesProc;
TclfsLstatProc_ *lstatProc;
TclfsCopyFileProc_ *copyFileProc;
TclfsRenameFileProc_ *renameFileProc;
TclfsCopyDirectoryProc_ *copyDirectoryProc;
TclfsDeleteFileProc_ *deleteFileProc;
TclfsCreateDirectoryProc_ *createDirectoryProc;
TclfsRemoveDirectoryProc_ *removeDirectoryProc;
TclfsLoadFileProc_ *loadFileProc;
TclfsReadlinkProc_ *readlinkProc;
TclfsListVolumesProc_ *listVolumesProc;
TclfsFileAttrStringsProc_ *fileAttrStringsProc;
TclfsFileAttrsGetProc_ *fileAttrsGetProc;
TclfsFileAttrsSetProc_ *fileAttrsSetProc;
TclfsUtimeProc_ *utimeProc;
In fact, copy/rename file need not be implemented, because Tcl will fallback: from rename to copy and delete, and from copy to open-r/open-w/fcopy/mtime when necessary. However a filesystem may well be able to implement these more efficiently than that.
Cd/pwd support:
TclfsChdirProc_ *chdirProc;
TclfsGetCwdProc_ *getCwdProc;
the chdir proc need only return TCL_OK if the path is a valid directory, and TCL_ERROR otherwise. There is no need to remember the path in any way. Native filesystems will of course want to make the appropriate system calls to change the real cwd. Most filesystems will not implement the 'Cwd' proc, since Tcl keeps track of the cwd for you. However, the native filesystem should implement it.
Unload file support:
TclfsUnloadFileProc_ *unloadFileProc;
This function is called automatically by Tcl's core to unload a file, if this filesystem was the one which successfully loaded the file initially.
Philosophy
This TIP is influenced by the thoughts behind the TkGS project (http://sourceforge.net/projects/tkgs/\). Whereas TkGS provides a general and efficient graphics system, the aim of this TIP is to provide a similarly general and efficient filesystem.
Alternatives
Alternatives to adding vfs support
TclKit manages a pretty good job of vfs support. It is limited by the inadequacy of overriding at the Tcl level. Prowrap is limited by the inability to glob, load, cd, pwd, etc.
There are currently no better alternatives: if Tcl's C core calls C functions directly (as it does), or if extensions call C functions directly (as they do), then complete vfs support requires a patch like this to Tcl's core.
Alternatives to objectification
A previous patch added string-based vfs support to Tcl's core, and required very few core changes at all. It could be adopted instead of an objectified filesystem. This would make Tcl's filesystem more complete, but would not make it any more efficient. Also it is much harder to implement complete 'glob' emulation without the newer API.
Objections
Won't all these hooks slow down Tcl's core a lot?
There are actually remarkably few changes required, so the only slowdown would occur if additional filesystems are hooked into the core. This is similar to the impact of the 'stacked channels' implementation. With the objectified filesystem, this does actually speed up Tcl's core (as remarked above, 'file exists' is 20% faster on Windows 2000).
Won't this break backwards compatibility ("The Tcl question")?
Not at all. With the current vfs patch, the entire test suite passes as before, even with an extra 'reporting' filesystem activated. Most reasonable tests now pass even when the test suite is embedded in a wrapped document.
Won't this make Tcl's core more complex??
Adding a Tcl_Obj interface is definitely a bit more complex in some areas than the existing string-based system, but in other areas it cleans things up a lot. Indeed, one result will be that Tcl's filesystem is properly abstracted away, which conceptually simplifies the core (there will be 10-15 functions which are called for all filesystem access, whether it is native or virtual).
Future thoughts
This section contains items which are outside the scope of this TIP, but it was thought useful to raise and have documented for the record.
- Should we remove the native 'Tclpxx' filesystem functions from Tcl's API? Or perhaps require a new #define TCL_PROVIDE_NATIVE_FILESYSTEM to allow an extension to access these calls? They are all inside tclInt.h, so we could easily protect them with such a define.
This patch still places the native filesystem in a preferential position, and it is hard-coded as the tail of the fs-lookup list. There are two changes which could be made in the future:
Move the native-fs support to a static extension which is loaded on startup. This would ensure the layer now separating Tcl from the native FS is not violated, and might let others use Tcl or pieces of Tcl in new ways.
By incorporating some pieces of the 'vfs' extension into the core in the future, and probably making some changes to some of the Tclp native-fs functions, we could make Tcl entirely filesystem-agnostic (e.g. we could do weird things like mount the native filesystem inside a virtual filesystem). Alternatively, if the native filesystem is not loaded at all, that makes for a very good way to ensure a wrapped executable is 'safe', because it cannot even access the local disk.
Also,
Once prowrap is updated to use the new APIs, we should probably remove the primitive vfs hooks it currently uses, this will remove some obsolete stuff from Tcl's core without affecting anything else (I think - any extensions out there use those APIs?). Prowrap simply needs to register a Tcl_Filesystem with the stat, access and openfilechannel fields set to its existing procedures; all other fields can be NULL. (They would also need to be objectified).
file copy can now potentially copy across filesystems, which could be both very slow (across the internet) and may even want different eol conventions on each end. We could add a -command flag to file copy (and perhaps file rename), and we could perhaps add optional ways of specifying the encoding/translation of the transfer? (The main issue is to distinguish between text and binary files, which require automatic and binary -translation respectively).
Copyright
This document has been placed in the public domain.