Tk Library Source Code

View Ticket
Ticket UUID: 3064747
Title: Failure to match { and } in parser tools
Type: Bug Version: None
Submitter: dargosch Created on: 2010-09-12 12:37:58
Subsystem: pt (parsetools) Assigned To: andreas_kupries
Priority: 5 Medium Severity:
Status: Closed Last Modified: 2010-09-14 01:00:38
Resolution: Invalid Closed By: andreas_kupries
    Closed on: 2010-09-13 18:00:38
Description:
I'm trying to match any character in the entire Unicode charset (including special chars like {} () and such) up to a specific sequence of chars.
With the supplied PEG grammar I am able to parse Unicode strings fine, but grouping chars still seem to be a problem.

I am using this test snippet:


------------- BEGIN ---------------
proc putss {p in } {
global currparsing
set currparsing $in
set inast [$p parset $in ]
puts $inast
puts [pt::ast print $inast ]
}
  
puts "Simple case"
putss $p {Structure=adasd} 
puts "Simple case, wierd chars"
putss $p {Structure='ƈɠħç§'}  
puts "Simple case, escaped special grouping chars"
putss $p {Structure='\}'}  
------------- END ---------------

and what I get is:

------------- BEGIN ---------------
Simple case
expression 0 14 {labeltypeexpression 0 14 {labeltype 0 8} {stringcompareoperator 9 9} {labelsequence 10 14 {label 10 14 {asciistring 10 14}}}}
<expression> :: 0 14
    <labeltypeexpression> :: 0 14
        <labeltype> :: 0 8
        <stringcompareoperator> :: 9 9
        <labelsequence> :: 10 14
            <label> :: 10 14
                <asciistring> :: 10 14
Simple case, wierd chars
expression 0 16 {labeltypeexpression 0 16 {labeltype 0 8} {stringcompareoperator 9 9} {labelsequence 10 16 {label 10 16 {unicodestring 11 11} {unicodestring 12 12} {unicodestring 13 13} {unicodestring 14 14} {unicodestring 15 15}}}}
<expression> :: 0 16
    <labeltypeexpression> :: 0 16
        <labeltype> :: 0 8
        <stringcompareoperator> :: 9 9
        <labelsequence> :: 10 16
            <label> :: 10 16
                <unicodestring> :: 11 11
                <unicodestring> :: 12 12
                <unicodestring> :: 13 13
                <unicodestring> :: 14 14
                <unicodestring> :: 15 15
Simple case, escaped special grouping chars
pt::rde 10 {{n WHITESPACE} {n labelsequence}}
    while executing
"$myparser complete"
    (procedure "::emuql_parser::Snit_methodparset" line 9)
    invoked from within
"$p parset $in "
    (procedure "putss" line 4)
    invoked from within
"putss $p {Structure='\}'}  "
    (file "Test_emuql_parser.tcl" line 145)
------------- END ---------------


This is on a Mac (Snow Leopard) with Tcl 8.6.

/Fredrik
User Comments: andreas_kupries added on 2010-09-14 01:00:38:

Ok. It was me misunderstanding the quoting rules of Tcl for a moment.
Look at the following 5 commands and their output (after running them), and you should be able to see the problem and the fix:

(1)    #puts {Structure='}'}
(2)    puts {Structure='{}'}
(3)    puts {Structure='\}'}
(4)    puts {Structure='\{\}'}
(5)    puts "Structure='\}'"

The first one you already know, and I explained why it is wrong. The second one fixes the issue, by balancing the nested braces. If that always happens in your case you are now fine. If you allow for unbalanced braces I said to use (3) in my post to comp.lang.tcl. I was wrong. The \ before } protects the Tcl parser from miscounting, true.

What I thought was that it would also collapse the \} sequence into a single } internally. It doesn't; re-reading rule [6], the Tcl interpreter is right and I was wrong.

So your parser sees \} and that is a bad backslash sequence per your grammar, and so it bails out with an error.

Item (4) in the list is the same as (3).
Item (5) shows the proper quoting. Use "...", and the \} is indeed collapsed into a single }, which your parser then accepts. "..." quoting does backslash substitution. {...} doesn't. My fault.
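
The difference between items (3) and (5) can be checked directly in a Tcl shell. The following is only a minimal sketch; the variable names braced and quoted are chosen for illustration and do not come from the original test script:

```tcl
# Brace quoting: no backslash substitution happens, so the
# backslash stays in the string as a literal character.
set braced {Structure='\}'}

# Double-quote quoting: backslash substitution applies, so the
# backslash-brace pair collapses to a single close brace.
set quoted "Structure='\}'"

puts [string length $braced]   ;# 14 characters, backslash included
puts [string length $quoted]   ;# 13 characters, backslash gone
puts $braced
puts $quoted
```

The one-character difference in length is the stray backslash that the parser rejected.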

Note that all of this applies _only_ to the parsing of examples you specify directly in Tcl code. If you read your input from a file into a Tcl variable then all this quoting of braces is not necessary at all.
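
As a rough illustration of that point, the sketch below writes the test input to a file once and then reads it back without any brace quoting on the reading side. The file name emuql_input_example.txt is hypothetical, and the final call into the parser is shown only as a comment because the parser object $p comes from the original test script:

```tcl
# Hypothetical scratch file in the current directory.
set path [file join [pwd] emuql_input_example.txt]

# Writing the sample from Tcl code still needs Tcl quoting.
set out [open $path w]
puts -nonewline $out "Structure='\}'"
close $out

# Reading it back needs none: the file content is taken verbatim.
set in [open $path r]
set data [read $in]
close $in
file delete $path

puts $data
puts [string length $data]   ;# 13, no backslash in the data

# In the original test script this would then be passed on as:
#   putss $p $data
```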

Final conclusions:
* ParseTools work correctly.
* Your parser works.
* My advice on comp.lang.tcl was faulty. And later in the evening I will post a correction there as well.

Closing this now as 'invalid' == 'no bug' (except in my brain).

andreas_kupries added on 2010-09-14 00:10:26:
Ok. I have started looking into this.

Can you give me the exact Unicode code points you used for the weird chars, i.e. in \uXXXX notation, where XXXX is a 4-digit hex value?

They don't seem to have survived the copying out of the bug report. (Which is another reason we ask that such things (scripts, input files, result files) are attached to the report: the chances of getting them back unmangled are significantly better.)

Running with Tcl 8.5.7 I also get an error back for the last example, however the location is two characters further in (12 instead of 10), and a different set of symbols is expected. Alternatively this might be because I used the PEG interpreter instead of converting the PEG into a hardwired snit-based parser.

I will now compare to a snit parser and also try to thin the PEG, i.e. remove parts not needed for the example, as part of figuring my way through it.

dargosch added on 2010-09-12 19:38:00:

File Added - 386404: EmuQL.peg
