TclTLS: Ticket Change Details

Overview
Artifact ID:	bb8c278bf244f123958c3f4901da2fe84976cd8da778b18f8aa54a39d86d649e
Ticket:	e01f02a12180ac37cdb96ec16d0e03f89033cd77 Segfault under wish at first tls::import, not under tclsh
User & Date:	anonymous on 2020-09-01 09:44:20
Changes
icomment:
Okay, I think I've diagnosed this now, though I don't know how to fix it.
There doesn't seem to be an error in the TclTLS source code as such, so
the fix is probably rather to adjust the build process -- figure out the
right options to pass, and it should work. However I suspect TclTLS still 
needs improving because of this issue, since it does seem to be a more 
general vulnerability; it should be easy to make a hardened build, even 
if one does not make it the default.

In short, the problem is that under Wish I'm getting two different 
versions of libcrypto loaded, and the dynamic linker is making TclTLS 
call some functions from one version and other functions from the other; 
since they are rather far from being binary compatible, a crash is only 
to be expected. tclsh avoids the problem by not loading the first version 
of libcrypto, so that all calls go to the second version against which 
TclTLS was actually built, *but*: should tclsh load any other library 
with a dependence upon the first libcrypto version before it loads TclTLS 
then the problem will surface also there. Right now TclTLS is vulnerable 
in that it depends on other parties not doing things it hadn't expected, 
instead of making sure that it gets what it requires.

It might be argued that having different versions of a library in the 
same process is a big no-no, and that the fix is to update everything to 
use the same most recent version, but that is frankly impossible; the 
old libcrypto is being brought in by some system library (don't know 
which one, but the one next to it in the list is called TrustEvaluationAgent,
which seems a plausible user) and this ain't linux -- we can't hope to 
go recompiling those. Moreover it might be remarked that the (long) list 
of shared libraries Wish loads includes a decade-old libsqlite3.dylib, 
but that has never been a problem for tclsqlite3, presumably because that 
links statically with the sqlite3 library. I see no reason why TclTLS 
couldn't have an option to do the same, but I cannot find anything in 
the documentation to that effect.

Indeed, the documentation for OpenSSL 1.0.x strongly recommends linking 
statically against the library (for security reasons), though the 
OpenSSL 1.1.x documentation does not make this point (that I can see). 
I have however only been able to build TclTLS to link against the dylibs.

Much longer, my investigation was as follows:

I saw the program crash in pthread_rwlock_wrlock, called from the 
	SSL_CTX_set_tmp_dh(ctx, dh);
line in CTX_Init (tls.c around line 1260, but I've done some hacking 
to debug so my line numbers are off), with a 
  KERN_INVALID_ADDRESS at 0x0000000000000000
So... a NULL pointer dereferencing. It turns out the pointer in 
question is dh->lock up in tls.c (or it would be, had the DH type not 
been opaque in tls.c, but it is a field in the struct that dh points to). 
What SSL_CTX_set_tmp_dh is doing at the time of the crash is that it 
is adjusting the refcount of those DH parameters, and the lock in 
question is protecting that refcount.

The CRYPTO_UP_REF operation being used does have something like four 
different implementations, many of which don't need this lock at all 
(but not the last resort implementation via pthreads I end up with), 
so it would not be outrageous to suspect a bug in libcrypto where this 
field never gets initialised, but that is not what is going on.

Some hacking to insert statements printing the value of dh->lock 
reveals that it is indeed NULL when run under Wish, but non-NULL 
when run under tclsh. Some more hacking in dh_params.h reveals that 
this is what that struct is like already when DH_new returns it. 
This could still be a matter of uninitialised memory just happening 
to have a fortuitous value in tclsh but not Wish, though to be sure 
one needs to see what DH_new (or rather the underlying DH_new_method, 
both living in openssl-1.1.1g/crypto/dh/dh_lib.c) is doing.

This is where things get weird, because from RTFS that function has 
an explicit check whether lock is NULL, and will return NULL for the 
dh pointer if it is. Yet when run, DH_new is returning a non-NULL 
pointer for dh such that dh->lock is NULL. This should not happen. 
Might some later instruction in DH_new_method be overwriting the lock 
with garbage? Only way to tell is to bring out a debugger. That's 
where things get *really* weird.

(In retrospect, working with this issue the first time one uses gdb 
is probably not ideal, but hey, I've learnt some gdb now! Good for me.)

For some reason, gdb refuses to step into the DH_new function, 
complaining about there not being line number information available. 
In retrospect this could have been a clue, but I spend a lot of time 
fiddling with compiler options for debug information, hoping that this 
will do the trick. It might even have helped a bit, but eventually I'm 
down to stepping at the machine code level. The gdb list command is not 
showing anything that makes sense, but the disassemble command is my 
friend, once I refresh myself on the memnonics. What I see is … 
plausibly the DH_new_method function, but it looks like it has been 
optimised to hell and back, and indeed the crucial CRYPTO_THREAD_lock_new 
call is missing. But this libcrypto was compiled with -O0, so why would 
it be optimised??

Indeed, when running gdb on just libcrypto.1.1.dylib I get a disassembly 
looking the way I might expect (even overly roundabout). Even more 
strange, I get the same listing when I ask the gdb debugging Wish to 
  disassemble DH_new_method
So why isn't *that* the code that this process is running?? Then it hits 
me: the addresses don't match. The good DH_new_method is at 
0x0000000114e7bfe0, but the one that gets called is at 0x00007fff8a4e6e90. 
gdb says both are called DH_new_method, but 0x00007fff8a4e6e90 is in 
libcrypto.0.9.8.dylib, whereas 0x0000000114e7bfe0 is in 
libcrypto.1.1.dylib, just like call that invokes the CRYPTO_UP_REF 
operation. Mystery solved, to some extent.

Now I only need to figure out how to make TclTLS bind to the right 
instances of these symbols. It would help getting some input from 
people who actually know how all these things are supposed to fit 
together, instead of having to guess while randomly poking.
login: "anonymous"
mimetype: "text/x-fossil-plain"
priority changed to: "Immediate"
resolution changed to: "Open"
username: "lars_h"