Tcl Source Code

View Ticket
Login
Ticket UUID: 270f78ca95b642fbed81ed03ad381d64a0d0f7d
Title: file mkdir: race condition if two workers creates same directory and one worker deletes it immediately
Type: Bug Version: >= 8.5
Submitter: anonymous Created on: 2018-07-09 11:38:43
Subsystem: 37. File System Assigned To: sebres
Priority: 5 Medium Severity: Minor
Status: Closed Last Modified: 2018-07-16 17:12:13
Resolution: Fixed Closed By: anonymous
    Closed on: 2018-07-16 17:12:13
Description:
When running in parallel two terminals doing:

 1. while {1} { file mkdir bla }
 2. while {1} { file mkdir bla; file delete bla }

mkdir eventually fails. On ext4 this is not as bad though it happens (and breaks delete sometimes as well), but on nfs mkdir consistently fails when two process make the same dir. This does not happen using 

    exec mkdir -p bla

Related SO question with some tests results of mine:
https://stackoverflow.com/questions/48095178/atomicity-of-mkdir
User Comments: anonymous added on 2018-07-16 17:12:13:
Sorry guys, stupid GMAIL sent mails from this thread to spam:

Definitely fixed, 3 terminals on NFS:

 1. tclsh: while { 1 } { file mkdir bla }
 2. tclsh: while { 1 } { file mkdir bla }
 3. bash : while [ 1 ]; do rm -rf   bla done

running 5 minutes straight and no errors on the tcl shells. The bash shell reports errors, but that's expects as there is a race condition with rm (which propagates to file delete I guess).

Looks really fixed on my side.

sebres added on 2018-07-12 10:35:03: (text/x-fossil-wiki)
I mean in [./artifact?udc=1&ln=259-263&name=ee3766c8dd773316|!S_ISDIR(statBuf.st_mode)]...

P.S. wiki should be really the default remark format.

sebres added on 2018-07-12 10:33:23:
If EEXIST is raised twice, it can only be something (which going possibly deleted inbetween), because otherwise (it's a file) it will exit with error in [./artifact?udc=1&ln=259-263&name=ee3766c8dd773316|!S_ISDIR(statBuf.st_mode)].
Endless cycle is not needed (would be too dangerous).

aspect added on 2018-07-12 10:21:53: (text/x-fossil-wiki)
I'm not convinced by [1c12ee9e45222d6c].  If I'm reading the patch correctly ..

If EEXIST is raised twice, TclFileMakeDirsCmd simply ignores it.  The comment above "goto nextPart" suggests something different, so I suspect this isn't intended.

Removing the call to FSStat means we can report success if the target exists and is a file.  This is clearly wrong.

Removing the Tcl_ResetResult() call rings alarm bells, but it looks like that was redundant in the first place.

I think lines 287-289 (post-patch) should simply be deleted so the success case can pass through Tcl_FSStat success on line 258.

sebres added on 2018-07-11 00:22:24: (text/x-fossil-wiki)
So I assume it works now for you also as regards your NFS (NAS share)?

But what I don't understand, how it can fail without delete at all (especially with EEXIST error code).
Possibly you use something like temporary folders (just with range of the same names)? So it could be deleted (after work done)?
In case like this I think to resolve such time-issues (and they are rather a time- as a race-condition indirectly) you should possibly add some business logic (e. g. with some time-component in the name of the folder) to avoid deleting of still "active" directory (as already said like described in RFE [4f322b9d21]).
Note, that in this case both creater-workers will return successfully, but folder will be not exist (because deleted) so you'll get a consequential error just later.

Also note on NFS (especially under control of Windows) there are some rude antivirus or other IDS-similar software's (could use quarantine, worker suspending and co), that may cause similar errors there.

anonymous added on 2018-07-10 11:29:56:
Thanks for the super fast fix. I'd like to comment that this was originally discovered with **no delete at all** - our daily regressions running in parallel would sometimes create directories at the same place (which doesn't matter for our purposes) but would often fail. I think the fact this is a NFS file system made the race condition prevalent (and the error is different compared to ext4 when removing).

sebres added on 2018-07-09 17:38:03: (text/x-fossil-wiki)
Fixed in [1c12ee9e45222d6c], thus close.

Although the case is very strange (the delete implicitly after create is IMHO not good way in any case, because can cause a lot of other race conditions), but I could imagine situations where it may be expected resp. could occur once (e. g. the three-point delete/create/create).

So this one is fixed now (under assumption the time of creation inside one worker is important and if another worker deletes the folder, each worker has exactly one attempt to repeat the creation. Hereafter it gets off with the success, if it's a tail (despite the directory may not exist), or fails deeper on create of child folders if it was not a tail directory.

Other imaginable race conditions on mkdir are unwanted resp. impossible to fix and should be resolved another way's (like RFE [4f322b9d21]).

sebres added on 2018-07-09 16:03:29: (text/x-fossil-wiki)
Hmmm, you're right.

Corresponding the code this can indeed happen, but this time as dual race condition:

<table border=1>
<tr><th>Worker 1</th><th>Worker 2</th><th>Worker 3</th></tr>
<tr><td>Tcl_FSCreateDirectory (mkdir) succeeds</td><td>Tcl_FSCreateDirectory (mkdir) fails with EEXIST</td><td>-</td></tr>
<tr><td>return TCL_OK</td><td>Tcl_FSStat and check is S_ISDIR</td><td><b>deleted the directory</b></td></tr>
<tr><td>-</td><td><b>not exists, so return TCL_ERROR</b></td><td>-</td></tr>
</table>
This was introduced (even as race condition prevention;) in [./artifact?udc=1&ln=276-277&name=d0f1daac754e8554|d0f1daac754e8554].
But ATM it produces another race condition (as worker 3 deletes the directory and worker 2 doing the stats check)...

I'll fix it, thanks! WiP.

anonymous added on 2018-07-09 15:12:33:
This can reproduced with mkdir alone. On nfs with 3 terminals, 

 1. bash : while [[ 1 ]]; do rm -rf bla; done
 2. tclsh: while {1} {file mkdir bla}

This will already cause bash to throw weird errors (but loop will continue)
 
 rm: cannot remove `bla': Is a directory

even though we used -rf. Now add
 3. tclsh: while {1} {file mkdir bla}

Almost immediately one of the tcl shells will produce:

 can't create directory "bla": file already exists

These things are tcl specific, and won't happen with builtin mkdir -p.

sebres added on 2018-07-09 13:16:37: (text/x-fossil-wiki)
This reace condition was already fixed in [27b6822849], so I'll close it as duplicate.
Please reopen if I'm wrong.