Segfault under wish at first tls::import, not under tclsh
|User & Date:||anonymous on 2020-09-01 09:44:20|
Okay, I think I've diagnosed this now, though I don't know how to fix it. There doesn't seem to be an error in the TclTLS source code as such, so the fix is probably rather to adjust the build process -- figure out the right options to pass, and it should work. However I suspect TclTLS still needs improving because of this issue, since it does seem to be a more general vulnerability; it should be easy to make a hardened build, even if one does not make it the default. In short, the problem is that under Wish I'm getting two different versions of libcrypto loaded, and the dynamic linker is making TclTLS call some functions from one version and other functions from the other; since they are rather far from being binary compatible, a crash is only to be expected. tclsh avoids the problem by not loading the first version of libcrypto, so that all calls go to the second version against which TclTLS was actually built, *but*: should tclsh load any other library with a dependence upon the first libcrypto version before it loads TclTLS then the problem will surface also there. Right now TclTLS is vulnerable in that it depends on other parties not doing things it hadn't expected, instead of making sure that it gets what it requires. It might be argued that having different versions of a library in the same process is a big no-no, and that the fix is to update everything to use the same most recent version, but that is frankly impossible; the old libcrypto is being brought in by some system library (don't know which one, but the one next to it in the list is called TrustEvaluationAgent, which seems a plausible user) and this ain't linux -- we can't hope to go recompiling those. Moreover it might be remarked that the (long) list of shared libraries Wish loads includes a decade-old libsqlite3.dylib, but that has never been a problem for tclsqlite3, presumably because that links statically with the sqlite3 library. I see no reason why TclTLS couldn't have an option to do the same, but I cannot find anything in the documentation to that effect. Indeed, the documentation for OpenSSL 1.0.x strongly recommends linking statically against the library (for security reasons), though the OpenSSL 1.1.x documentation does not make this point (that I can see). I have however only been able to build TclTLS to link against the dylibs. Much longer, my investigation was as follows: I saw the program crash in pthread_rwlock_wrlock, called from the SSL_CTX_set_tmp_dh(ctx, dh); line in CTX_Init (tls.c around line 1260, but I've done some hacking to debug so my line numbers are off), with a KERN_INVALID_ADDRESS at 0x0000000000000000 So... a NULL pointer dereferencing. It turns out the pointer in question is dh->lock up in tls.c (or it would be, had the DH type not been opaque in tls.c, but it is a field in the struct that dh points to). What SSL_CTX_set_tmp_dh is doing at the time of the crash is that it is adjusting the refcount of those DH parameters, and the lock in question is protecting that refcount. The CRYPTO_UP_REF operation being used does have something like four different implementations, many of which don't need this lock at all (but not the last resort implementation via pthreads I end up with), so it would not be outrageous to suspect a bug in libcrypto where this field never gets initialised, but that is not what is going on. Some hacking to insert statements printing the value of dh->lock reveals that it is indeed NULL when run under Wish, but non-NULL when run under tclsh. Some more hacking in dh_params.h reveals that this is what that struct is like already when DH_new returns it. This could still be a matter of uninitialised memory just happening to have a fortuitous value in tclsh but not Wish, though to be sure one needs to see what DH_new (or rather the underlying DH_new_method, both living in openssl-1.1.1g/crypto/dh/dh_lib.c) is doing. This is where things get weird, because from RTFS that function has an explicit check whether lock is NULL, and will return NULL for the dh pointer if it is. Yet when run, DH_new is returning a non-NULL pointer for dh such that dh->lock is NULL. This should not happen. Might some later instruction in DH_new_method be overwriting the lock with garbage? Only way to tell is to bring out a debugger. That's where things get *really* weird. (In retrospect, working with this issue the first time one uses gdb is probably not ideal, but hey, I've learnt some gdb now! Good for me.) For some reason, gdb refuses to step into the DH_new function, complaining about there not being line number information available. In retrospect this could have been a clue, but I spend a lot of time fiddling with compiler options for debug information, hoping that this will do the trick. It might even have helped a bit, but eventually I'm down to stepping at the machine code level. The gdb list command is not showing anything that makes sense, but the disassemble command is my friend, once I refresh myself on the memnonics. What I see is … plausibly the DH_new_method function, but it looks like it has been optimised to hell and back, and indeed the crucial CRYPTO_THREAD_lock_new call is missing. But this libcrypto was compiled with -O0, so why would it be optimised?? Indeed, when running gdb on just libcrypto.1.1.dylib I get a disassembly looking the way I might expect (even overly roundabout). Even more strange, I get the same listing when I ask the gdb debugging Wish to disassemble DH_new_method So why isn't *that* the code that this process is running?? Then it hits me: the addresses don't match. The good DH_new_method is at 0x0000000114e7bfe0, but the one that gets called is at 0x00007fff8a4e6e90. gdb says both are called DH_new_method, but 0x00007fff8a4e6e90 is in libcrypto.0.9.8.dylib, whereas 0x0000000114e7bfe0 is in libcrypto.1.1.dylib, just like call that invokes the CRYPTO_UP_REF operation. Mystery solved, to some extent. Now I only need to figure out how to make TclTLS bind to the right instances of these symbols. It would help getting some input from people who actually know how all these things are supposed to fit together, instead of having to guess while randomly poking.
- login: "anonymous"
- mimetype: "text/x-fossil-plain"
- priority changed to: "Immediate"
- resolution changed to: "Open"
- username: "lars_h"