Ticket Change Details
Overview

Artifact ID: bb8c278bf244f123958c3f4901da2fe84976cd8da778b18f8aa54a39d86d649e
Ticket: e01f02a12180ac37cdb96ec16d0e03f89033cd77
Segfault under wish at first tls::import, not under tclsh
User & Date: anonymous on 2020-09-01 09:44:20
Changes

  1. icomment:
    Okay, I think I've diagnosed this now, though I don't know how to fix it.
    There doesn't seem to be an error in the TclTLS source code as such, so
    the fix is probably rather to adjust the build process -- figure out the
    right options to pass, and it should work. However I suspect TclTLS still 
    needs improving because of this issue, since it does seem to be a more 
    general vulnerability; it should be easy to make a hardened build, even 
    if one does not make it the default.
    
    In short, the problem is that under Wish I'm getting two different 
    versions of libcrypto loaded, and the dynamic linker is making TclTLS 
    call some functions from one version and other functions from the other; 
    since they are rather far from being binary compatible, a crash is only 
    to be expected. tclsh avoids the problem by not loading the first version 
    of libcrypto, so that all calls go to the second version against which 
    TclTLS was actually built, *but*: should tclsh load any other library 
    with a dependence upon the first libcrypto version before it loads TclTLS 
    then the problem will surface also there. Right now TclTLS is vulnerable 
    in that it depends on other parties not doing things it hadn't expected, 
    instead of making sure that it gets what it requires.
    
    It might be argued that having different versions of a library in the 
    same process is a big no-no, and that the fix is to update everything to 
    use the same most recent version, but that is frankly impossible; the 
    old libcrypto is being brought in by some system library (don't know 
    which one, but the one next to it in the list is called TrustEvaluationAgent,
    which seems a plausible user) and this ain't linux -- we can't hope to 
    go recompiling those. Moreover it might be remarked that the (long) list 
    of shared libraries Wish loads includes a decade-old libsqlite3.dylib, 
    but that has never been a problem for tclsqlite3, presumably because that 
    links statically with the sqlite3 library. I see no reason why TclTLS 
    couldn't have an option to do the same, but I cannot find anything in 
    the documentation to that effect.
    
    Indeed, the documentation for OpenSSL 1.0.x strongly recommends linking 
    statically against the library (for security reasons), though the 
    OpenSSL 1.1.x documentation does not make this point (that I can see). 
    I have however only been able to build TclTLS to link against the dylibs.
    
    Much longer, my investigation was as follows:
    
    I saw the program crash in pthread_rwlock_wrlock, called from the 
    	SSL_CTX_set_tmp_dh(ctx, dh);
    line in CTX_Init (tls.c around line 1260, but I've done some hacking 
    to debug so my line numbers are off), with a 
      KERN_INVALID_ADDRESS at 0x0000000000000000
    So... a NULL pointer dereferencing. It turns out the pointer in 
    question is dh->lock up in tls.c (or it would be, had the DH type not 
    been opaque in tls.c, but it is a field in the struct that dh points to). 
    What SSL_CTX_set_tmp_dh is doing at the time of the crash is that it 
    is adjusting the refcount of those DH parameters, and the lock in 
    question is protecting that refcount.
    
    The CRYPTO_UP_REF operation being used does have something like four 
    different implementations, many of which don't need this lock at all 
    (but not the last resort implementation via pthreads I end up with), 
    so it would not be outrageous to suspect a bug in libcrypto where this 
    field never gets initialised, but that is not what is going on.
    
    Some hacking to insert statements printing the value of dh->lock 
    reveals that it is indeed NULL when run under Wish, but non-NULL 
    when run under tclsh. Some more hacking in dh_params.h reveals that 
    this is what that struct is like already when DH_new returns it. 
    This could still be a matter of uninitialised memory just happening 
    to have a fortuitous value in tclsh but not Wish, though to be sure 
    one needs to see what DH_new (or rather the underlying DH_new_method, 
    both living in openssl-1.1.1g/crypto/dh/dh_lib.c) is doing.
    
    This is where things get weird, because from RTFS that function has 
    an explicit check whether lock is NULL, and will return NULL for the 
    dh pointer if it is. Yet when run, DH_new is returning a non-NULL 
    pointer for dh such that dh->lock is NULL. This should not happen. 
    Might some later instruction in DH_new_method be overwriting the lock 
    with garbage? Only way to tell is to bring out a debugger. That's 
    where things get *really* weird.
    
    (In retrospect, working with this issue the first time one uses gdb 
    is probably not ideal, but hey, I've learnt some gdb now! Good for me.)
    
    For some reason, gdb refuses to step into the DH_new function, 
    complaining about there not being line number information available. 
    In retrospect this could have been a clue, but I spend a lot of time 
    fiddling with compiler options for debug information, hoping that this 
    will do the trick. It might even have helped a bit, but eventually I'm 
    down to stepping at the machine code level. The gdb list command is not 
    showing anything that makes sense, but the disassemble command is my 
    friend, once I refresh myself on the memnonics. What I see is … 
    plausibly the DH_new_method function, but it looks like it has been 
    optimised to hell and back, and indeed the crucial CRYPTO_THREAD_lock_new 
    call is missing. But this libcrypto was compiled with -O0, so why would 
    it be optimised??
    
    Indeed, when running gdb on just libcrypto.1.1.dylib I get a disassembly 
    looking the way I might expect (even overly roundabout). Even more 
    strange, I get the same listing when I ask the gdb debugging Wish to 
      disassemble DH_new_method
    So why isn't *that* the code that this process is running?? Then it hits 
    me: the addresses don't match. The good DH_new_method is at 
    0x0000000114e7bfe0, but the one that gets called is at 0x00007fff8a4e6e90. 
    gdb says both are called DH_new_method, but 0x00007fff8a4e6e90 is in 
    libcrypto.0.9.8.dylib, whereas 0x0000000114e7bfe0 is in 
    libcrypto.1.1.dylib, just like call that invokes the CRYPTO_UP_REF 
    operation. Mystery solved, to some extent.
    
    Now I only need to figure out how to make TclTLS bind to the right 
    instances of these symbols. It would help getting some input from 
    people who actually know how all these things are supposed to fit 
    together, instead of having to guess while randomly poking.
    
  2. login: "anonymous"
  3. mimetype: "text/x-fossil-plain"
  4. priority changed to: "Immediate"
  5. resolution changed to: "Open"
  6. username: "lars_h"