Tk Library Source Code

View Ticket
Login
Ticket UUID: 2941841
Title: htmlparse crashes on really badly formed html
Type: Bug Version: None
Submitter: dvrsn Created on: 2010-01-28 21:38:57
Subsystem: htmlparse Assigned To: andreas_kupries
Priority: 5 Medium Severity:
Status: Closed Last Modified: 2012-08-03 05:22:48
Resolution: Fixed Closed By: andreas_kupries
    Closed on: 2012-08-02 22:22:48
Description:
% htmlparse::parse  "<a<a/>>" 
extra characters after close-brace

The string being parsed here is nonsense.  I'm not sure what a sane way to parse it would be.  Perhaps as a singleton tag named "a<a" followed by a naked ">".

Erroring out is undesirable, because it doesn't attempt to do any partial parsing.
User Comments: andreas_kupries added on 2012-08-03 05:22:48:

allow_comments - 1

Applied patch in CVS head, package version is 1.2.1.

andreas_kupries added on 2012-08-03 01:16:02:
Thanks. Today doesn't seem to be my day. Second time I missed stuff like that.

dvrsn added on 2012-08-03 01:06:48:
The patch is already attached.

andreas_kupries added on 2012-08-03 00:51:21:
Re: " It can be changed to actually rewrite those tags to be an opening tag followed by a closing tag"

Do you have a patch for this change, per chance ?

dvrsn added on 2012-03-14 23:56:20:

File Added - 438255: htmlparse.patch

dvrsn added on 2012-03-14 23:54:33:
The cause behind this bug and duplicate 3504018 is that the trnasformation into a command stream is done by a pair of regexps, and when the param value (i.e., tag attributes) of a null end tag construct looks like html, it gets reparsed by the second regexp, making things break in interesting ways.

The first regexp handles NET cases, resulting in commands that make it look like a opening tag followed immediately by a closing tag.  It can be changed to actually rewrite those tags to be an opening tag followed by a closing tag, and then have the normal tag regexp reparse it intentionally instead of accidentally.  The effect is the same, but the mechanism prevents already formed commands from being changed.

tomk added on 2010-01-29 07:08:42:
OK, I agree with your comments. After reading the docs for the command it's it looks like your interpretation for the string is the only one that will work with the defined interface. Including a "/" in the tag doesn't make sense because it should be detected and returned as a flag in the callback. Thus, you really only have one chose.

dvrsn added on 2010-01-29 06:47:35:
There are two reasons why I said error-ing is undesirable.  

First, because the documentation doesn't suggest that it can throw an error due to badly formed html.  (It raises errors for incomplete html, but that is an intentionally raised error).  If an error like this was detectable and raised as "insane html" or something like that, and this was documented, that might be ok.  Unbalanced braces or incorrect number of arguments gives no clue as to the nature of the error.

Second, an error being raised in this case would allow partial parsing, but only up to the point of the error.  This malformed html is roughly comparable to line noise, where there's a glitch somewhere in the middle but after that things can continue.  But raising an error doesn't let things continue, even if the caller wanted to do so.  A document can be made completely unparsable by having this malformed string at the beginning.

tomk added on 2010-01-29 06:05:26:
That is a really nasty case. Your proposal makes sense in terms of continuing the parse but I'm a little confused by the "Erroring out is undesirable...". It seems to me that you should error out. If the using proc needs to test the html then it can capture the error with a catch and decide what to do.
tomk

Attachments: