Tcl Source Code

Ticket UUID: 336441ed59c9f49fb2dc5414911f5c90c7acdec3
Title: socket -async stall on windows
Type: Bug
Version: 8.5.15
Submitter: oehhar
Created on: 2014-03-11 10:27:03
Subsystem: 24. Channel Commands
Assigned To: oehhar
Priority: 5 Medium
Severity: Critical
Status: Closed
Last Modified: 2014-05-30 10:42:35
Resolution: Fixed
Closed By: oehhar
Closed on: 2014-05-30 10:42:35
Description: (text/x-fossil-wiki)
In branch [|win-sock-async-connect-race-fix], checkin [521b7229c4], Andreas Kupries reports socket -async stalls caused by the operating system not delivering the FD_CONNECT event.
See the checkin comment for a description of the issue.


   *   gets or puts may never return on a socket with async connection
   *   the connection notification FD_CONNECT is not observed

   *   FD_CONNECT is not sent by the OS

   *   Use FD_WRITE as a fallback. There is already such code in the notifier proc. The proposed fix uses that to exit gets/puts.
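
At the script level, the first symptom can be pictured with a minimal sketch like the following. It is only an illustration, not code from the checkin: the loopback address, the ephemeral port and the `accept` helper are assumptions chosen so that the script is self-contained. The write on the blocking channel has to wait until the asynchronous connect has completed, which is the point where a lost connect notification would make puts/flush hang.

```tcl
# Server side (hypothetical helper): remember the accepted channel.
proc accept {chan addr port} {
    set ::accepted $chan
}
set srv [socket -server accept -myaddr 127.0.0.1 0]
set port [lindex [fconfigure $srv -sockname] 2]

# Client side: -async returns before the connection is established.
set sock [socket -async 127.0.0.1 $port]

# The channel is still in blocking mode, so this write must wait for the
# connect to finish. If the connect notification is lost, it never returns.
puts $sock hello
flush $sock

# Let the event loop run the accept callback, then read the line back.
vwait ::accepted
set line [gets $accepted]
puts "server received: $line"

close $sock
close $accepted
close $srv
```

With a working notifier the script finishes immediately; with the bug described here it would be expected to stall at the flush.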

This ticket is created to discuss the issue.

Reinhard Max and Harald Oehlmann are working on the 8.6.x socket code in the branch [|bug-13d3af3ad5].

There, the same issue was observed. The conclusion was: 
   *   FD_CONNECT is not delivered if the socket connect fails between the connect() call and the WSAAsyncSelect() call
   *   FD_CONNECT is ignored if delivered between the WSAAsyncSelect() call and the insertion of the socket structure in TcpThreadActionProc()

Within the branch [|bug-13d3af3ad5], those issues are fixed.

The workaround of using FD_WRITE instead of FD_CONNECT was removed in [|bug-13d3af3ad5].
The drawback of that workaround is that a possible connect failure is not detected.
It should be reintroduced if the information "FD_CONNECT's are not delivered" turns out to be correct.

This work is in line with what is done on the Unix side.
User Comments:

oehhar added on 2014-05-30 10:42:35: (text/x-fossil-wiki)
Here is Andreas's message about the test result. I had not even dared to hope for it, but it happened:
I have now completed the check and it seems to have done the trick.
Using a stackato client wrapped with the basekit build from the specified revision.
I ran my "testcase" (iterated 'stackato info' against a https target) and saw no hangs for 13 minutes at about 50 iterations/minute, i.e. circa 650 iterations.

When the problem was active I could expect a hang within a minute, two at most.

Thank you very much for the work on this.

Having this in the Tcl 8.5 core branch will make me happy and willing to switch to it again, away from my "win-sock-async-connect-race-fix" branch (which I can/will then close).

So, merged by commit [6ecb583012], bug closed, thank you all,

oehhar added on 2014-05-29 14:58:41: (text/x-fossil-wiki)
Another test version in bug branch: commit [a658836882]:
   *   Don't switch monitoring off while waiting for FD_CONNECT, so that it is not lost

Andreas, I would appreciate it if you could test this.

Thank you,

oehhar added on 2014-04-29 20:00:01:
Andreas has tested the patch on 8.5 and it failed.
Here is his message:

Running our stackato.exe in a loop, simply asking for information from
the target, with https (TLS) active the application hangs after about
14-28 iterations, with about 14 iterations per minute, so within 1 to
2 minutes. Symptom of not accepting ^C is the same as before I should

After activating the --debug-http-log option it no longer hangs within
10 minutes.
As that option only activates more output, i.e. introduces delays, this
looks as if there is still a race condition present, old or new.

This means that I will still have to use my fix and branch of Tcl 8.5
for the stackato client, instead of head.
Sorry about the bad news.

The rough outline of operations done in the client is:

-1- register tls for https, with http
-2- open a https -async socket to a webserver
-3- read some data, via readable fileevent
-4- close the socket
-5- format and print data

Note that the iterations I speak of here are always new stackato
processes, with each doing the above. The iteration does NOT happen in
a single stackato process.

The last time I had to investigate, the hang happened inside of TLS,
during the open of the socket, i.e. step 2. The TLS transform does
sync read/writes to perform the TLS handshake, without using

I suspect that this is true this time as well.

oehhar added on 2014-04-02 10:11:18: (text/x-fossil-wiki)
Test added, which at least works on my machine.
As the test is timing-dependent, it may not show the error on other machines.

Committed to core-8-5-branch by commit [1dfe1390d8].

Bug closed.

oehhar added on 2014-04-01 13:47:39: (text/x-fossil-wiki)
Reinhard has created a test where this bug shows up on my machine:
set sock [socket -async 42424]
after 10000 {set x timeout}
fileevent $sock writable {set x writable}
vwait x
close $sock
puts $x

The bug shows up because <b>socket 42424</b> fails very quickly on my machine with "network is unreachable".
For me, the writable event never fires and the timeout fires instead.

This is on Windows Vista 32 bit with tcl8.5.15.
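
A variant of this test that can be run without an unreachable network might look like the sketch below. It is an assumption-laden illustration, not the test that was committed: the target is a loopback port that was just released (so nothing should be listening there), the `catch` covers versions that report an immediate connect failure from the socket command itself, and `fconfigure -error` is used to ask for the outcome once the writable event has fired.

```tcl
# Pick a local port with no listener: bind an ephemeral port, note it, release it.
set srv [socket -server {apply {{chan addr port} {close $chan}}} -myaddr 127.0.0.1 0]
set port [lindex [fconfigure $srv -sockname] 2]
close $srv

set x pending
set err ""
if {[catch {socket -async 127.0.0.1 $port} sock]} {
    # Some versions report an immediate connect failure from [socket] itself.
    set x failed
    set err $sock
} else {
    set timer [after 10000 {set x timeout}]
    fileevent $sock writable {set x writable}
    vwait x
    after cancel $timer
    # A writable event only says the connect attempt has finished;
    # the outcome has to be queried separately.
    set err [fconfigure $sock -error]
    close $sock
}
puts "result: $x, error: '$err'"
```

If the result is "timeout", the failed connect was never reported to the script, which is the behaviour described in this comment.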

oehhar added on 2014-03-22 16:18:53: (text/x-fossil-wiki)
Proposed solution in checkin [2596fec7bd] in branch "bug-336441ed59", ready for review.

Backported fix from commit [65b320b464] from branch "bug-[13d3af3ad5]".