Marpa: Timeline

50 check-ins using file unicode/cc_objtype.tcl version 44a66ba0c7

2018-03-20
05:45		rtC's mixed use of both `char` and `unsigned char` in various APIs and code, plus the interaction with `int`, breaks lexing when attempting to go beyond ASCII, even when restricted to the BMP, because bytes > 127 show up as negative. Fixed by changing all uses of `char` to `unsigned char`. Further changed the extraction of semantic values: - Lexeme length is now counted in characters, not bytes. - Similarly, lexeme end is now characters from start. - Input is now byte- and character-counted, for proper lexeme start. Character counting in C strings pulled from tclUtf.c. Error messages now use the new char offsets, plus byte offsets for partially read characters. check-in: b0d7fa6f75 user: aku tags: trunk
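
The signedness problem and the character counting described in the entry above can be sketched as follows. This is a minimal illustration, not the code taken from tclUtf.c; the names `byte_class` and `utf8_char_count` are hypothetical, and the counter assumes well-formed UTF-8 input.

```c
#include <stddef.h>

/* Table lookup indexed by an input byte. With a plain `char` parameter on a
 * platform where `char` is signed, bytes above 127 would arrive as negative
 * values and index in front of the table. `unsigned char` keeps the index
 * in 0..255. (Hypothetical helper, for illustration only.) */
static int byte_class(const int table[256], unsigned char b)
{
    return table[b];
}

/* Count characters, not bytes, in a NUL-terminated UTF-8 string by skipping
 * continuation bytes (bit pattern 10xxxxxx). Assumes well-formed input. */
static size_t utf8_char_count(const unsigned char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if ((*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

For example, the six bytes `61 C3 A9 E2 82 AC` hold three characters, which is the difference between a byte-counted and a character-counted lexeme length.
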
2018-03-16
23:44		Specials done, make them available to the main line. check-in: 754111fe8e user: aku tags: trunk
23:43		Implemented special semantic action `::first`. Added support in the tcl and rtc runtimes, and the generators for these runtimes. Plus test. Closed-Leaf check-in: ef22cbb99b user: aku tags: specials
23:24		Updated work on specials with bugfixes for issues found with it. check-in: 8918fc0839 user: aku tags: specials
23:22		Added test triggering rtC code path where a new lexer starts out with the parser exhausted, i.e. nothing acceptable. Fixed missing closing of the earleme for that case, and missing handling of the `lexer exhausted` error from libmarpa. The missing closing operation also caused miscommunication between lexer and gate, ultimately crashing the latter with a symbol id outside of the byte range. The gate change is only a tweak to get better tracing, i.e. print the acceptables before working with them, not after. check-in: 64775966bd user: aku tags: trunk
06:13		Added test triggering the code path for the generation of semantic actions for grammars which have multiple actions across their G1 symbols. Fixed variable name typo in that code path. check-in: 2f455e21d2 user: aku tags: trunk
2018-03-15
07:04		Added description of the various files found in the test grammars, and their relationships. check-in: eaefc0b630 user: aku tags: specials
2018-03-14
23:00		Implemented special semantic action `::array`. This one was trivial, maps to array action `values`. Mapping is done in the semantics, the backends do not see the special. Started on special semantic action `::first`. Tests, no implementation yet. check-in: 7626d0194b user: aku tags: specials
2018-03-02
06:23		unicode. Fixed bad condition for handling the final element in negate-class, after the main loop. Triggered by the next-to-final element ending just before the UNI_MAX. Test added, demonstrating bug and fix. check-in: 546018b243 user: aku tags: trunk
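
A sketch of class negation over code point ranges, showing the kind of check needed for the final element after the main loop. The type `cp_range`, the function `negate_class`, and the value of `UNI_MAX` (taken here as 0x10FFFF) are assumptions for illustration; the actual implementation in the repository may be organized differently.

```c
#include <stddef.h>

#define UNI_MAX 0x10FFFF  /* assumed: highest Unicode code point */

typedef struct { long lo, hi; } cp_range;  /* inclusive range, hypothetical */

/* Complement a sorted, non-overlapping list of ranges within [0, UNI_MAX].
 * `out` must have room for n + 1 ranges. Returns the number written. */
static size_t negate_class(const cp_range *in, size_t n, cp_range *out)
{
    size_t k = 0;
    long next = 0;  /* first code point not yet covered by the input */

    for (size_t i = 0; i < n; i++) {
        if (in[i].lo > next) {          /* gap in front of this range */
            out[k].lo = next;
            out[k].hi = in[i].lo - 1;
            k++;
        }
        next = in[i].hi + 1;
    }
    /* Final element, after the main loop: emit it only if code points
     * remain. An input ending exactly at UNI_MAX leaves nothing to add. */
    if (next <= UNI_MAX) {
        out[k].lo = next;
        out[k].hi = UNI_MAX;
        k++;
    }
    return k;
}
```

Inputs whose last range ends at or just before `UNI_MAX` are the boundary cases worth testing for this condition.
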
2018-02-20
18:40		Added more literals outside of the BMP to test cases. check-in: 8d53644d07 user: aku tags: reunification
07:35		Created a Marpa parser for the string and CC lexemes. This parser handles all the various forms for character escapes. Plus a semantics backend which generates the internal literal representation directly from the AST. The above replaced the entire existing literal processor (parse, decode, unescape, type, tags). This was needed because Tcl (especially `subst` as the core of the old `unescape`) was/is not able to handle the full set of Unicode code points (at this time). Updated the bootstrap slif parser to handle the extended escapes too (\u hex x5/x6, \U hex x8). Updated tests, although not all. With this commit the entire input side is now able to handle the full set of unicode, with suitable escape sequences for characters outside of the BMP as well. check-in: ac7dc5acdc user: aku tags: reunification
2018-02-19
05:57		Merge C code generator fix into unicode work. check-in: eac4dc279c user: aku tags: reunification
05:55		marpa::gen::runtime::c Fixed mishandling of zero-length chunks. Generates bad C syntax. Triggered by grammar without any `:discard` clauses. The fix prevents insertion of discard chunks if there are no such. Furthermore now also errors out in the low level ChunkedArray code for zero-length chunks, to catch possible future problems. Reviewed all uses, made notes that none are zero-length now. Added a test demonstrating the possibility. check-in: 14698e1f84 user: aku tags: trunk
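
One plausible way a zero-length chunk turns into bad C is an empty array initializer, which is not valid before C23. The sketch below shows a generator-side guard of the kind the entry above describes. The function `emit_chunk` and its output format are hypothetical and are not the ChunkedArray code itself.

```c
#include <stdio.h>

/* Write `static const int NAME[] = { v0, v1, ... };` into `out`.
 * Returns 0 on success, -1 for a zero-length chunk or a too-small buffer.
 * (Hypothetical generator helper, for illustration only.) */
static int emit_chunk(char *out, size_t outlen, const char *name,
                      const int *vals, size_t n)
{
    size_t used;
    int w;

    if (n == 0) {
        return -1;  /* would generate `{ }`, an invalid initializer pre-C23 */
    }
    w = snprintf(out, outlen, "static const int %s[] = {", name);
    if (w < 0 || (size_t)w >= outlen) return -1;
    used = (size_t)w;

    for (size_t i = 0; i < n; i++) {
        w = snprintf(out + used, outlen - used, "%s %d",
                     i ? "," : "", vals[i]);
        if (w < 0 || (size_t)w >= outlen - used) return -1;
        used += (size_t)w;
    }
    w = snprintf(out + used, outlen - used, " };");
    if (w < 0 || (size_t)w >= outlen - used) return -1;
    return 0;
}
```

Failing early in the low-level emitter means a caller that forgets to skip an empty section gets an error at generation time instead of a compile failure later.
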
2018-02-18
10:27		Continued rework of the unicode layer. See first commit in the branch for the plan. Reworked the big tables of test cases in `literal.test` and moved their setup into separate files (`tests/cases/...`). check-in: a69c32992a user: aku tags: reunification
09:03		Continued rework of the unicode layer. See first commit in the branch for the plan. Created a wrapper around the foreach loops to make the tables of test cases look a bit nicer, and more semantic like. check-in: 60f9856faa user: aku tags: reunification
05:05		Continued rework of the unicode layer. See first commit in the branch for the plan. Removed full vs bmp from the `unidata` tool and generated tables. The tool now always generates tables covering the full unicode range. Updated some tests, but not all. The known test failures are in the various generators and the middleware, due to the CC differences coming from full coverage. Fixing it now does not make sense, because we will have to clean it up again after the introduction of MUTF-8 and CESU-8 support into the middle layers. We clean these up after that is all done. check-in: 9db9661dae user: aku tags: reunification
03:38		Continued rework of the unicode layer. See first commit in the branch for the plan. Dropped ASBR and grammar generation for the named classes from the `unidata` tool. With ASBR creation in the C level of the main Marpa package it is fast enough to not require caching. This also removes the cache of byte ranges shared among the classes. Remember also that the C code generator backends do their own automatic sharing of byte ranges and refactoring for sharing. check-in: 604b92a271 user: aku tags: reunification
02:37		Start a rework of the unicode layer. The overall plan is to remove the distinction between bmp and full in this layer, and move it into the generators, with some support in the middle, i.e. in literal handling and the transform from codepoints to byte sequences. Optimization: Moved the commands `2utf`, `mode`, and `max` into the C layer. check-in: b481c06f7c user: aku tags: reunification
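
The transform from code points to byte sequences mentioned above can be sketched for plain UTF-8 as follows. The function name `cp_to_utf8` is hypothetical, and this is not the repository's `2utf` command; it does not treat surrogates specially and does not cover the MUTF-8 or CESU-8 variants.

```c
#include <stddef.h>

/* Encode one code point as UTF-8 into `out`.
 * Returns the number of bytes written, or 0 if cp is beyond 0x10FFFF.
 * (Illustrative sketch only; surrogates are encoded like any other value.) */
static size_t cp_to_utf8(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {  /* rest of the BMP */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {  /* beyond the BMP */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```
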
2018-02-03
00:51		Merged fix for the issue of RT-C mishandling the `proper`-flag into the branch where it was found. Updated test results to match. Marked a number of tests touching on unicode/utf handling as known bugs. Address them once the general utf handling trouble is closer to solved. Still to address: All the `i_` tests. check-in: 9ed3031891 user: aku tags: language-json
00:28		Fixed mismatch between Tcl and C runtimes. Issue in the C runtime. Forgot to properly convert a boolean `proper` into the flag taken by `marpa_g_sequence_new()`. Conversion added, test cases added. check-in: b86ffae080 user: aku tags: trunk
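
A sketch of the boolean-to-flag conversion described above. libmarpa takes sequence options as a bit mask in the `flags` argument of `marpa_g_sequence_new()`; the constants below are local stand-ins for its `MARPA_KEEP_SEPARATION` and `MARPA_PROPER_SEPARATION` macros, with bit values that are assumptions of this sketch. Real code would include `marpa.h` and use the macros directly.

```c
/* Stand-ins for libmarpa's sequence flags. The numeric values are
 * assumptions made for this sketch; use the macros from marpa.h in
 * real code. */
enum {
    SKETCH_KEEP_SEPARATION   = 0x1,
    SKETCH_PROPER_SEPARATION = 0x2
};

/* Convert a boolean `proper` into the flag mask for the sequence call.
 * Passing the boolean through unchanged would hand over the value 1,
 * which is not the "proper separation" bit under the values above. */
static int sequence_flags(int proper)
{
    return proper ? SKETCH_PROPER_SEPARATION : 0;
}
```
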
00:27		Re-enable full set of lang/result checks check-in: 94232b7386 user: aku tags: trunk
2018-02-02
23:19		Fix bad phrasing in comment check-in: 90ce04b75a user: aku tags: trunk
19:50		Added a script to run a fixed demo, from grammar to parser to its use. Plus example json files to use as input. check-in: f82c10dea8 user: aku tags: language-json
18:23		Continued testsuite work. Fixed definition of JSON `whitespace`. Regenerated parsers. Updated parse failure results to match. check-in: eb17740e4d user: aku tags: language-json
2018-02-01
23:56		Continued testsuite work - Added the n_* cases (must reject), and 1st round of results. Reorganized the input/ and result/ directories to separate the various groups better (y|n|i, c|tcl, ...) - A number of failures to reject input. - 4x grammar error: \f is not whitespace. - 10x input accepted which should not be (c (bad) vs tcl (ok)) - process vs process-file differences in rt-c (encoding differences?) check-in: 20cb6ea243 user: aku tags: language-json
20:37		Pull rt-c bug fix into the branch which exposed the issue. check-in: 2945ca6cd1 user: aku tags: language-json
20:36		Testsuite work - Clean up of the support code, removed unused procedures. - Ensure that files are read with the proper encoding before being fed into the string 'process' (See `fgetc` decl and use). - Allow setting of constraints, runtime-specific. - Set __known bug__ constraint for eight y_* tests where rt-c currently diverges from rt-tcl (1). (1) These are all in the unicode/utf-8 handling, which differs between the available runtimes. * rt-tcl operates on chars and defers to Tcl's parsing of utf-8 sequences. * rt-c OTOH operates on bytes, does its own utf-8 parsing, and is more strict (invalid sequences are a parse error). I have to see if I can define a char class (:invalid:) to contain the invalid sequences. Using that would allow me to either accept or discard them (depending on context). Similarly I might have to allow the class of surrogates (:Cs:), as acceptable characters, and as sequences for the characters past BMP. That would allow such characters even in Marpa limited to BMP. These are all things in the MarpaTcl core however, and not something specific to JSON. JSON just exposed the issues here. check-in: 7ab2e4bf63 user: aku tags: language-json
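
To illustrate what "more strict" byte-level UTF-8 parsing can mean, here is a sketch of a decoder that rejects invalid sequences. The function `utf8_decode_strict` is hypothetical, and the exact set of checks (overlong forms, surrogates, values past 0x10FFFF) is an assumption; the rt-c lexer may draw the line differently.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence from s (n bytes available).
 * Returns the number of bytes consumed and stores the code point in *cp,
 * or returns 0 for an invalid or truncated sequence. */
static size_t utf8_decode_strict(const unsigned char *s, size_t n,
                                 unsigned long *cp)
{
    size_t len, i;
    unsigned long c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or bad lead byte */

    if (n < len) return 0;              /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min) return 0;                      /* overlong encoding */
    if (c > 0x10FFFF) return 0;                 /* beyond Unicode */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;   /* surrogate */
    *cp = c;
    return len;
}
```

A runtime that defers to a more lenient decoder would accept some of the inputs rejected here, which is one way the two runtimes can diverge on the same test file.
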
20:17		Added tool similar to `od`, to decode and display utf8 sequences in the input (file, stdin). check-in: 669660f659 user: aku tags: trunk
20:15		Changed gate to lexer flush signaling from in-band `(byte) -1` to a separate function. This removes any possibility that a `(byte) -1` from actual input causes a bogus flush. Added debug function allowing INBOUND to properly print a batch of input bytes. Fixed a crash of the RT-C where the loop searching for the end of the lexeme tried to pop a byte from the empty lexeme, triggering an underflow assert. This may happen when `lexer_complete` is called for an empty-valued lexeme. I.e. when the GATE rejects the first byte after the end of a lexeme as invalid and signals a flush before any byte was entered into the lexer at all. Note that this does not necessarily indicate a mismatch. The current set of acceptable lexemes may contain some which allow an empty value. We have to keep recognizing them. And after that the new context may have caused the invalid byte to be valid. So we only skip the attempt of making an empty value even emptier. The deeper issue is that for LATM-mode symbols the earley-set id does not match the length of the lexeme due to the zero-width ACS guards in front; causing an additional round through the loop before it can declare mismatch. The concrete examples which triggered the issue are the `string` and `lstring` symbols in the JSON grammar, for input `[""]`. check-in: 0de21b2314 user: aku tags: trunk
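
The difference between in-band and out-of-band flush signaling can be sketched like this. All names (`lexer`, `gate_inband`, `gate_enter`, `gate_flush`) are hypothetical stand-ins, not the runtime's actual API. The point is only that `(byte) -1` equals 0xFF, which can also occur as a real input byte.

```c
#include <stddef.h>

typedef unsigned char byte;

/* Hypothetical lexer stand-in that only counts what it is told. */
typedef struct { size_t entered; size_t flushes; } lexer;

static void lexer_enter(lexer *l, byte b) { (void)b; l->entered++; }
static void lexer_flush(lexer *l)         { l->flushes++; }

/* In-band signaling: the value (byte) -1, i.e. 0xFF, doubles as the flush
 * marker, so an input byte 0xFF triggers a bogus flush. */
static void gate_inband(lexer *l, byte b)
{
    if (b == (byte) -1) {
        lexer_flush(l);
    } else {
        lexer_enter(l, b);
    }
}

/* Out-of-band signaling: every byte value is data, and the flush is a
 * separate call. */
static void gate_enter(lexer *l, byte b) { lexer_enter(l, b); }
static void gate_flush(lexer *l)         { lexer_flush(l); }
```

With the separate function no byte value is reserved, so the gate never has to guess whether a given 0xFF is data or a signal.
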
2018-01-31
21:16		Pull the Tcl lexer fix over into the branch where the issue was found. check-in: d411dda199 user: aku tags: language-json
21:09		Fixed typo in the spec of escaped characters in strings. Fixed definition of `control` characters for JSON. Updated the results to match the tweaked grammar. For the Tcl runtime all tests pass except a few showing mishandling of numeric lexemes. A fix for that is waiting on trunk. RT-C still crashing. check-in: 6dfddb13e8 user: aku tags: language-json
20:58		Fix mishandling of lexemes interpretable as Tcl number by the Tcl runtime (lexer component). By going through `expr` a lexeme which looks like a number can be shimmered and may change its string rep when printed. Example: For JSON the lexeme `1E-2` became `0.01`. check-in: 2a442c3255 user: aku tags: trunk
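
The effect is not specific to Tcl: any round trip of a lexeme through a numeric representation may change its text. The sketch below is a C analogue and not the Tcl runtime's code; `through_number` is a hypothetical helper, and the `%g` format is just one possible canonical rendering.

```c
#include <stdio.h>
#include <stdlib.h>

/* Re-render a lexeme by converting it to a double and printing it again,
 * mimicking a string that passes through a numeric representation.
 * (Hypothetical helper, for illustration only.) */
static void through_number(const char *lexeme, char *out, size_t outlen)
{
    double v = strtod(lexeme, NULL);
    snprintf(out, outlen, "%g", v);
}
```

A lexer that must report the input exactly as written has to keep the original string and avoid any such conversion on the lexeme value.
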
00:51		The json testsuite is becoming more functional. Of the must-accept inputs only 10 failures over 95 inputs. Some unexpected parse failures with bogus inputs. These are in part - Possibly due to reading of input with the wrong encoding (Need utf-8?). - Unexpected numeric reformatting reaching the AST (1E-2 vs 0.01). One crash in the RT-C to investigate. Tweaked the grammar a bit to have proper symbols for the constants, and to separate G1 and L0 better. check-in: dda6670b00 user: aku tags: language-json
2018-01-30
23:25		Pulled fix for Tcl code generator issue into the branch where it was discovered. check-in: 5f8cb41c75 user: aku tags: language-json
23:19		Fix issue in the core code generator for parsers and lexers using the Tcl-based runtime. A bug in package `char` (See `char quote tcl`) caused the generation of bogus Tcl charclass regexes from the internal data, when non-ASCII characters in [:control:] are involved. The generator now works around the issue. check-in: 65b1517840 user: aku tags: trunk
21:02		Added the first larger grammar example outside of the SLIF meta grammar: JSON. Known issues at this point: * Due to apparent trouble with Kettle (`build.tcl test` seems to ignore `--include-dir`) the testsuite is not yet functional. A basic test via `tools/trial` however works. * The generated Tcl parser is bogus. The main character class for string characters (`plain`) is bogus, it contains a bad range which is rejected by Tcl's `regexp` during parser construction. The C-based parser is ok however, modulo lurking unknowns. check-in: 5199afa673 user: aku tags: language-json
10:17		Fix oops, forgot to add test output for the slif meta grammar. check-in: 466c1ebc4d user: aku tags: trunk
10:16		Added formatter producing a SLIF grammar from a grammar container. Note, this is not fully round-trip at the moment (The special @LEX symbols can not be read back, violating identifier syntax). It is also sub-optimal with regard to LATM flags, g1 actions, etc. These are shown as attributes of each rule instead of making use of defaults to reduce duplication. It should be good enough however to serve as debugging aid. check-in: 3bfc0de63c user: aku tags: trunk
2018-01-29
19:28		Extended the set of formatters producing code initializing a grammar container (GC). Renamed the existing GC formatter to `gc-compact`. Added two formatters to generate non-compact human-readable code, using reduction rules for Tcl and C. check-in: 8d77fed34b user: aku tags: trunk
2017-10-17
16:30		README tweaks check-in: d2d1b00d53 user: aku tags: trunk
16:22		Updated the README to match the current organization of the (code in the) repository. check-in: f45f21924c user: aku tags: trunk
03:18		Merged fixes on flush behaviour to mainline. check-in: 62d99b6274 user: aku tags: trunk
03:13		Fixed demo grammar (wrong start symbol), then shown fix vs not in Tcl vs C runtimes. Then fixed C runtime flush behaviour. Further fixed mishandling of lexeme value and length in the presence of redo. Closed-Leaf check-in: a78dda3a4d user: aku tags: flush-fix
2017-10-16
23:17		Demonstrate the multi-flush bug. Fixed RT-C issue with actual lastchar lost/overwritten by redo, messing up the error message generated. check-in: 886eb6bb40 user: aku tags: trunk
22:24		And back check-in: ce762c6d5a user: aku tags: trunk
22:20		Pull trunk. Closed-Leaf check-in: f32641a83d user: aku tags: runtime-tests
2017-10-15
16:55		Use OSX fixes. They were done as separate branches to remember to check behaviour when back on linux. check-in: 97bbaff3f9 user: aku tags: trunk
16:54		Silence compiler complaint on OSX. Leaf check-in: 09b264fb4a user: aku tags: osx-complaints
16:53		Added return after assert to silence compiler complaint (OSX). check-in: 12ad722f66 user: aku tags: osx-complaints
16:50		Fixed problems in the handling of charclass as set of code-points and -ranges. Range validation was incomplete, allowing bad input to crash. Fixed, and tests added. Tracing as well, plus more notes when certain code paths will be reached. check-in: ac18987fd3 user: aku tags: trunk
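
A sketch of the kind of range validation a charclass given as code points and ranges needs before it is used. The type `cp_range`, the function `charclass_valid`, and the bound `UNI_MAX` (taken as 0x10FFFF) are assumptions for illustration. Which checks were actually missing is not stated in the entry above.

```c
#include <stddef.h>

#define UNI_MAX 0x10FFFF  /* assumed: highest Unicode code point */

typedef struct { long lo, hi; } cp_range;  /* inclusive range, hypothetical */

/* Return 1 when every element is a well-formed range inside [0, UNI_MAX],
 * otherwise 0. A single code point is a range with lo == hi. */
static int charclass_valid(const cp_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (r[i].lo < 0 || r[i].hi > UNI_MAX || r[i].lo > r[i].hi) {
            return 0;
        }
    }
    return 1;
}
```
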