Tcl Source Code

Ticket UUID: 336441ed59c9f49fb2dc5414911f5c90c7acdec3
Title: socket -async stall on windows
Type: Bug
Version: 8.5.15
Submitter: oehhar
Created on: 2014-03-11 10:27:03
Subsystem: 24. Channel Commands
Assigned To: oehhar
Priority: 5 Medium
Severity: Critical
Status: Closed
Last Modified: 2014-05-30 10:42:35
Resolution: Fixed
Closed By: oehhar
Closed on: 2014-05-30 10:42:35
Description:

In branch win-sock-async-connect-race-fix, checkin [521b7229c4], Andreas Kupries reports that socket -async stalls because the operating system does not deliver the FD_CONNECT event. See the checkin comment for details.

Issues:

  • gets or puts may never return on a socket with async connection
  • the connection notification FD_CONNECT is not observed

Conclusion:

  • FD_CONNECT is not sent by the OS

Action:

  • Use FD_WRITE as a fallback. There is already such code in the notifier proc. The proposed fix uses that to exit gets/puts.

This ticket is created to discuss the issue.

Reinhard Max and Harald Oehlmann are working on the 8.6.x socket code in the branch bug-13d3af3ad5.

There, the same issue was observed. The conclusion was:

  • FD_CONNECT is not delivered if the socket connect fails between the connect() call and the WSAAsyncSelect() call
  • FD_CONNECT is ignored if delivered between the WSAAsyncSelect() call and the insertion of the socket structure in TcpThreadActionProc()

Within the branch bug-13d3af3ad5, those issues are fixed.

The workaround of using FD_WRITE instead of FD_CONNECT was removed in bug-13d3af3ad5. Its drawback is that a connect failure goes undetected. It should be reintroduced if the claim "FD_CONNECTs are not delivered" turns out to be correct.

This work is in accordance with what is done on the Unix side.

User Comments: oehhar added on 2014-05-30 10:42:35:

Here is Andreas's message about the test result. I had not even dared to hope for it, but it happened:

I have now completed the check and it seems to have done the trick.
Using a stackato client wrapped with the basekit build from the specified revision.
I ran my "testcase" (iterated 'stackato info' against a https target) and saw no hangs for 13 minutes at about 50 iterations/minute, i.e. circa 650 iterations.

When the problem was active I could expect a hang within a minute, two at most.

Thank you very much for the work on this.

Having this in the Tcl 8.5 core branch will make me happy and willing to switch to it again, away from my "win-sock-async-connect-race-fix" branch (which I can/will then close).

So, merged by commit [6ecb583012], bug closed, thank you all, Harald


oehhar added on 2014-05-29 14:58:41:

Another test version in bug branch: commit [a658836882]:

  • Don't switch monitoring off while waiting for FD_CONNECT, so that it is not lost

Andreas, I would appreciate it if you could test this.

Thank you, Harald


oehhar added on 2014-04-29 20:00:01:
Andreas has tested the patch on 8.5 and it failed.
Here is his message:

Running our stackato.exe in a loop, simply asking for information from
the target, with https (TLS) active the application hangs after about
14-28 iterations, with about 14 iterations per minute, so within 1 to
2 minutes. Symptom of not accepting ^C is the same as before I should
note.

After activating the --debug-http-log it is not hanging itself within
10 minutes anymore.
As that option only activates more output, i.e. introduces delays this
looks as if there is still a race condition present, old or new.

This means that I will still have to use my fix and branch of Tcl 8.5
for the stackato client, instead of head.
Sorry about the bad news.

The rough outline of operations done in the client is:

-1- register tls for https, with http
-2- open a https -async socket to a webserver
-3- read some data, via readable fileevent
-4- close the socket
-5- format and print data

Note that the iterations I speak of here are always new stackato
processes, with each doing the above. The iteration does NOT happen in
a single stackato process.

The last time I had to investigate, the hang happened inside of TLS,
during the open of the socket, i.e. step 2. The TLS transform does
sync read/writes to perform the TLS handshake, without using
fileevents.

I suspect that this is true this time as well.
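
For reference, steps 2 to 5 of the outline above can be sketched in plain Tcl. This is only an illustration under loud assumptions: it omits step 1 (no tls or http registration, so the synchronous TLS handshake suspected of hanging is not exercised), it replaces the web server with a hypothetical local listener, and the names serve, onReadable, received and done are made up for this sketch.

```tcl
# Stand-in for the web server (hypothetical): send one line, then close.
proc serve {chan addr port} {
    puts $chan "hello from server"
    close $chan
}

# Readable handler: collect data until EOF, then signal completion.
proc onReadable {sock} {
    global received done
    append received [read $sock]
    if {[eof $sock]} {
        fileevent $sock readable {}
        set done 1
    }
}

# Local listener on an OS-assigned port.
set srv  [socket -server serve -myaddr 127.0.0.1 0]
set port [lindex [fconfigure $srv -sockname] 2]

set received ""
set sock [socket -async 127.0.0.1 $port]          ;# step 2 (plain TCP, no TLS)
fconfigure $sock -blocking 0
fileevent $sock readable [list onReadable $sock]  ;# step 3
set timer [after 5000 {set done timeout}]         ;# guard against a stall
vwait done
after cancel $timer
close $sock                                       ;# step 4
close $srv
puts [string trim $received]                      ;# step 5
```

If the connect notification were lost, as described in this ticket, the readable handler would not fire and the 5 second guard timer would end the vwait instead.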

oehhar added on 2014-04-02 10:11:18:

Test added, which at least works on my machine. As the test is timing-dependent, it may not show the error on other machines.

Committed to core-8-5-branch by commit [1dfe1390d8].

Bug closed.


oehhar added on 2014-04-01 13:47:39:

Reinhard has created a test where this bug shows up on my machine:

set sock [socket -async 169.254.0.0 42424]
after 10000 {set x timeout}
fileevent $sock writable {set x writable}
vwait x
close $sock
puts $x

The bug shows up because the connect to 169.254.0.0 port 42424 fails very quickly on my machine with "network is unreachable". For me, the writable event never fires and the timeout fires instead.

This is on Windows Vista 32 bit with tcl8.5.15.
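
A variant of the script above can also report why the socket became writable. The following is a minimal sketch, not the test that was committed: awaitConnect and connectState are hypothetical names, and it assumes that reading the socket's -error option after the writable event is enough to tell a failed connect from a successful one. The usage part connects to a local listener so that it does not depend on how 169.254.0.0 is routed.

```tcl
# Hypothetical helper: wait until an -async socket becomes writable or the
# timeout expires. Returns a two-element list: state and error message.
proc awaitConnect {sock timeoutMs} {
    global connectState
    set connectState($sock) pending
    set timer [after $timeoutMs [list set connectState($sock) timeout]]
    fileevent $sock writable [list set connectState($sock) writable]
    vwait connectState($sock)
    after cancel $timer
    fileevent $sock writable {}
    set state $connectState($sock)
    unset connectState($sock)
    if {$state eq "timeout"} {
        return [list timeout ""]
    }
    # -error yields the pending socket error, or an empty string if none.
    set err [fconfigure $sock -error]
    if {$err ne ""} {
        return [list failed $err]
    }
    return [list connected ""]
}

# Usage against a local listener that simply closes each accepted channel.
set srv  [socket -server {apply {{chan addr port} {close $chan}}} -myaddr 127.0.0.1 0]
set port [lindex [fconfigure $srv -sockname] 2]
set sock [socket -async 127.0.0.1 $port]
set result [awaitConnect $sock 5000]
close $sock
close $srv
puts $result
```

With the behaviour described in this ticket, a quickly failing connect would end in the timeout state rather than the failed state, because the writable event never arrives.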


oehhar added on 2014-03-22 16:18:53:

Proposed solution in checkin [2596fec7bd] in branch "bug-336441ed59", ready for review.

Backported fix from commit [65b320b464] from branch "bug-[13d3af3ad5]".