Tcl Library Source Code

Changes On Branch new-module-string

Changes In Branch new-module-string Excluding Merge-Ins

This is equivalent to a diff from d2ffcb187b to c975000d03

2013-02-15
20:47
Merged string::token packages. Fixed last-minute doc breakage. Regenerated embedded documentation. check-in: 169c53e4cc user: andreask tags: trunk
20:34
Module integration into overall Tcllib Closed-Leaf check-in: c975000d03 user: andreask tags: new-module-string
20:08
Documentation for shell tokenization, fixed errors in unquoted word definition, extended testsuite check-in: 01e0367f9c user: andreask tags: new-module-string
01:38
New module, string/text utilities, 8.5+, string namespace/ensemble check-in: 2b763d88de user: andreask tags: new-module-string
2013-02-14
18:48
The packages fileutil::decode and zipfile::*, while residing in the AS/Perforce repository, had versions 0.1 and 0.2, with the change number attached later on installation, making them actually version 0.1.nnn, etc. During the move of the sources to Tcllib the numbers should have been incremented to properly distinguish the code in the new location from these old revisions. This did not happen, making the Tcllib variant look like older/lesser revisions. This change fixes the issue now and bumps the version numbers, giving the Tcllib sources precedence. check-in: d2ffcb187b user: andreask tags: trunk
03:34
[Bug 3604129] Accepted patch by Gerhard Reithofer. Missing chan parameter added to all imaptotcl* procs. Bumped version to 0.5.1. check-in: b3a3d726b1 user: aku tags: trunk

Added modules/string/ChangeLog.

2013-02-15  Andreas Kupries  <[email protected]>

	* New module 'string'. String/text utilities, 8.5+.
	  First packages:
	  - string::token        - regex based lexing.
	  - string::token::shell - parsing basic shell command line syntax.


Added modules/string/pkgIndex.tcl.

if {![package vsatisfies [package provide Tcl] 8.5]} {
    # FRINK: nocheck
    return
}
package ifneeded string::token        1 [list source [file join $dir token.tcl]]
package ifneeded string::token::shell 1 [list source [file join $dir token_shell.tcl]]

Added modules/string/token.man.

[manpage_begin string::token n 1]
[moddesc   {Text and string utilities}]
[titledesc {Regex based iterative lexing}]
[category  {Text processing}]
[keywords string tokenization lexing regex]
[require Tcl 8.5]
[require string::token [opt 1]]
[require fileutil]
[description]

This package provides commands for regular expression based lexing
(tokenization) of strings.

[para]

The complete set of procedures is described below.

[list_begin definitions]

[comment {- - -- --- ----- -------- ------------- ---------------------}]
[call [cmd {::string token text}] [arg lex] [arg string]]

This command takes an ordered dictionary [arg lex] mapping regular
expressions to labels, and tokenizes the [arg string] according to
this dictionary.

[para] The result of the command is a list of tokens, where each token
is a 3-element list of label, start- and end-index in the [arg string].

[para] The command will throw an error if it is not able to tokenize
the whole string.

[comment {- - -- --- ----- -------- ------------- ---------------------}]
[call [cmd {::string token file}] [arg lex] [arg path]]

This command is a convenience wrapper around
[cmd {::string token text}] above, and [cmd {fileutil::cat}], enabling
the easy tokenization of whole files.

[emph Note] that this command loads the file wholly into memory before
starting to process it.

[para] If the file is too large for this mode of operation, a command
directly based on [cmd {::string token chomp}] below will be
necessary.

[comment {- - -- --- ----- -------- ------------- ---------------------}]
[call [cmd {::string token chomp}] [arg lex] [arg startvar] [arg string] [arg resultvar]]

This command is the work horse underlying [cmd {::string token text}]
above. It is exposed to enable users to write their own lexers, which
may, for example, apply different lexing dictionaries according to
some internal state.

[para] The command takes an ordered dictionary [arg lex] mapping
regular expressions to labels, a variable [arg startvar] which
indicates where to start lexing in the input [arg string], and a
result variable [arg resultvar] to extend.

[para] The result of the command is a tri-state numeric code
indicating one of
[list_begin]
[def [const 0]] No token found.
[def [const 1]] Token found.
[def [const 2]] End of string reached.
[list_end]

Note that recognition of a token from [arg lex] is started at the
character index in [arg startvar].

[para] If a token was recognized (status [const 1]) the command will
update the index in [arg startvar] to point to the first character of
the [arg string] past the recognized token, and it will further extend
the [arg resultvar] with a 3-element list containing the label
associated with the regular expression of the token, and the start-
and end-character-indices of the token in [arg string].

[para] Neither [arg startvar] nor [arg resultvar] will be updated if
no token is recognized at all.

[para] Note that the regular expressions are applied (tested) in the
order they are specified in [arg lex], and the first matching pattern
stops the process. Because of this it is recommended to order the
patterns from the most specific to the most general.

[para] Further note that all regex patterns are implicitly prefixed
with the constraint escape [const \A] to ensure that a match starts
exactly at the character index found in [arg startvar].

[list_end]

[section {BUGS, IDEAS, FEEDBACK}]

This document, and the package it describes, will undoubtedly contain
bugs and other problems.

Please report such in the category [emph textutil] of the
[uri {http://sourceforge.net/tracker/?group_id=12883} {Tcllib SF Trackers}].

Please also report any ideas for enhancements you may have for either
package and/or documentation.

[manpage_end]
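As a usage illustration (not part of the manpage sources above), here is a minimal sketch of driving the lexer with a small, made-up dictionary. It assumes the package is installed and findable on the auto_path.

```tcl
# Sketch: tokenize a string with an ordered lex dictionary.
# Patterns are tried in order, most specific first.
package require string::token

set lex {
    {\d+}    NUMBER
    {[a-z]+} WORD
    {\s+}    SPACE
}

# Each token is a 3-element list: label, start index, end index.
puts [string token text $lex "abc 12"]
# → {WORD 0 2} {SPACE 3 3} {NUMBER 4 5}
```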

Added modules/string/token.tcl.

# # ## ### ##### ######## ############# #####################
## Copyright (c) 2013 Andreas Kupries, BSD licensed

# # ## ### ##### ######## ############# #####################
## Requisites

package require Tcl 8.5
package require fileutil ;# cat

# # ## ### ##### ######## ############# #####################
## API setup
#

namespace eval ::string::token {
    namespace export chomp file text
    namespace ensemble create
}

## NOTE: We are placing the 'token' ensemble command into the Tcl
##       core's builtin 'string' ensemble.

apply {{} {
    set map [namespace ensemble configure ::string -map]
    dict set map token ::string::token
    namespace ensemble configure ::string -map $map
    return
}}

# # ## ### ##### ######## ############# #####################
## API

proc ::string::token::file {map path args} {
    return [text $map [fileutil::cat {*}$args $path]]
}

proc ::string::token::text {map text} {
    # map = dict (regex -> label)
    #   note! order is important, most specific to most general.

    # result = list (token)
    # where
    #   token = list(label start-index end-index)

    set start  0
    set result {}

    # status values:
    #  0: no token found, abort
    #  1: token found, continue
    #  2: no token found, end of string reached, stop, ok.
    set status 1
    while {$status == 1} {
	set status [chomp $map start $text result]
    }
    if {$status == 0} {
	return -code error \
	    -errorcode {STRING TOKEN BAD CHARACTER} \
	    "Unexpected character '[string index $text $start]' at offset $start"
    }
    return $result
}

# # ## ### ##### ######## ############# #####################
## Internal, helpers.

proc ::string::token::chomp {map sv text rv} {
    upvar 1 $sv start $rv result

    # Stop when trying to match after the end of the string.
    if {$text eq {}} {return 2}
    if {$start >= [string length $text]} {return 2}

    #puts |$start||[string range $text $start end]||$result|

    foreach {pattern label} $map {
	if {![regexp -start $start -indices -- \\A($pattern) $text -> range]} continue

	lappend result [list $label {*}$range]
	lassign $range a e

	#puts MATCH|$pattern|[string range $text $a $e]|

	set start $e
	incr start
	return 1
    }
    return 0
}

# # ## ### ##### ######## ############# #####################
## Ready

package provide string::token 1
return
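To illustrate the chomp-based mode of operation the documentation alludes to, here is a minimal sketch of a custom lexer loop that stops at the first unrecognized character instead of throwing an error. The dictionary and input are made up for illustration.

```tcl
package require string::token

set lex    {{\d+} NUM {\s+} WS}
set start  0
set result {}

# Drive chomp directly; status 1 means a token was found, keep going.
# Status 0 (no match) or 2 (end of string) ends the loop.
while {[string token chomp $lex start "42 7x" result] == 1} {}

# result now holds {NUM 0 1} {WS 2 2} {NUM 3 3}; start is 4, the
# offset of the first character no pattern matched ("x").
```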

Added modules/string/token_shell.man.

[manpage_begin string::token::shell n 1]
[moddesc   {Text and string utilities}]
[titledesc {Parsing of shell command line}]
[category  {Text processing}]
[keywords string shell bash tokenization parsing lexing]
[require Tcl 8.5]
[require string::token::shell [opt 1]]
[require string::token [opt 1]]
[require fileutil]
[description]

This package provides a command which parses a line of text using
basic [syscmd sh]-syntax into a list of words.

[para]

The complete set of procedures is described below.

[list_begin definitions]

[call [cmd {::string token shell}] [arg string]]

This command parses the input [arg string] under the assumption of it
following basic [syscmd sh]-syntax.

The result of the command is a list of words in the [arg string].

An error is thrown if the input does not follow the allowed syntax.

[para] The basic shell syntax accepted here are unquoted, single- and
double-quoted words, separated by whitespace. Leading and trailing
whitespace are possible too, and stripped.

Shell variables in their various forms are [emph not] recognized, nor
are sub-shells.

As for the recognized forms of words, see below for the detailed
specification.

[list_begin definitions]

[def [const {single-quoted word}]]

A single-quoted word begins with a single-quote character, i.e.
[const '] (ASCII 39), followed by zero or more unicode characters
which are not a single-quote, and is then closed by a single-quote.

[para] The word must be followed by either the end of the string, or
whitespace. Another word cannot directly follow it.

[def [const {double-quoted word}]]

A double-quoted word begins with a double-quote character, i.e.
[const {"}] (ASCII 34), followed by zero or more unicode characters
which are not a double-quote, and is then closed by a double-quote.

[para] Contrary to single-quoted words, a double-quote character can
be embedded into the word by escaping it, i.e. quoting it with a
backslash character [const \\] (ASCII 92). Similarly, a backslash
character must be quoted with itself to be inserted literally.

[def [const {unquoted word}]]

Unquoted words are not delimited by quotes and thus cannot contain
whitespace or single-quote characters. Double-quote and backslash
characters can be put into unquoted words by quoting them, as for
double-quoted words.

[def [const whitespace]]

Whitespace is any unicode space character.
This is equivalent to [cmd {string is space}], or the regular
expression \\s.

[para] Whitespace may occur before the first word, or after the last word. Whitespace must occur between adjacent words.

[list_end]
[list_end]

[section {BUGS, IDEAS, FEEDBACK}]

This document, and the package it describes, will undoubtedly contain
bugs and other problems.

Please report such in the category [emph textutil] of the
[uri {http://sourceforge.net/tracker/?group_id=12883} {Tcllib SF Trackers}].

Please also report any ideas for enhancements you may have for either
package and/or documentation.

[manpage_end]
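For illustration (not part of the manpage sources above), a short sketch of the command applied to a made-up command line, showing how quoting delimits words:

```tcl
package require string::token::shell

# Single and double quotes delimit words; the quotes themselves
# are stripped, and inter-word whitespace is discarded.
puts [string token shell {ls -l "my file" 'tmp'}]
# → ls -l {my file} tmp
```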

Added modules/string/token_shell.tcl.

# # ## ### ##### ######## ############# #####################
## Copyright (c) 2013 Andreas Kupries, BSD licensed

# # ## ### ##### ######## ############# #####################
## Requisites

package require Tcl 8.5
package require string::token

# # ## ### ##### ######## ############# #####################
## API setup

namespace eval ::string::token {
    # Note: string::token claims the "text" and "file" commands.
    namespace export shell
    namespace ensemble create
}

proc ::string::token::shell {text} {
    # result = list (word)

    set space    \\s
    set     lexer {}
    lappend lexer ${space}+                                  WSPACE
    lappend lexer {'[^']*'}                                  S:QUOTED
    lappend lexer "\"(\[^\"\]|(\\\\\")|(\\\\\\\\))*\""       D:QUOTED
    lappend lexer "((\[^ $space'\"\])|(\\\\\")|(\\\\\\\\))+" PLAIN
    lappend lexer {.*}                                       ERROR

    set dequote [list \\" \" \\\\ \\ ] ; #"

    set result {}

    # Parsing of a shell line is a simple grammar, RE-equivalent
    # actually, thus tractable with a plain finite state machine.
    #
    # States:
    # - WS-WORD : Expected whitespace or word.
    # - WS      : Expected whitespace
    # - WORD    : Expected word.

    # We may have leading whitespace.
    set state WS-WORD
    foreach token [text $lexer $text] {
	lassign $token type start end

	#puts "[format %7s $state] + ($token) = <<[string range $text $start $end]>>"

	switch -glob -- ${type}/$state {
	    ERROR/* {
		return -code error \
		    -errorcode {STRING TOKEN SHELL BAD SYNTAX CHAR} \
		    "Unexpected character '[string index $text $start]' at offset $start"
	    }
	    WSPACE/WORD {
		# Impossible
		return -code error \
		    -errorcode {STRING TOKEN SHELL BAD SYNTAX WHITESPACE} \
		    "Expected start of word, got whitespace at offset $start."
	    }
	    PLAIN/WS -
	    *:QUOTED*/WS {
		return -code error \
		    -errorcode {STRING TOKEN SHELL BAD SYNTAX WORD} \
		    "Expected whitespace, got start of word at offset $start"
	    }
            WSPACE/WS* {
		# Ignore leading, inter-word, and trailing whitespace
		# Must be followed by a word
		set state WORD
	    }
	    S:QUOTED/*WORD {
		# Quoted word, single, extract it, ignore delimiters.
		# Must be followed by whitespace.
		incr start
		incr end -1
		lappend result [string range $text $start $end]
		set state WS
	    }
	    D:QUOTED/*WORD {
		# Quoted word, double, extract it, ignore delimiters.
		# Have to check for and reduce escaped double quotes and backslashes.
		# Must be followed by whitespace.
		incr start
		incr end -1
		lappend result [string map $dequote [string range $text $start $end]]
		set state WS
	    }
	    PLAIN/*WORD {
		# Unquoted word, extract it.
		# Have to check for and reduce escaped double quotes and backslashes.
		# Must be followed by whitespace.
		lappend result [string map $dequote [string range $text $start $end]]
		set state WS
	    }
	    * {
		return -code error \
		    -errorcode {STRING TOKEN SHELL INTERNAL} \
		    "Illegal token/state combination $type/$state"
	    }
        }
    }
    return $result
}

# # ## ### ##### ######## ############# #####################
## Ready

package provide string::token::shell 1
return

Added modules/string/token_shell.test.

# -*- tcl -*-
# Testsuite string::token::shell
#
# Copyright (c) 2013 by Andreas Kupries <[email protected]>
# All rights reserved.

# -------------------------------------------------------------------------

source [file join \
	[file dirname [file dirname [file join [pwd] [info script]]]] \
	devtools testutilities.tcl]

testsNeedTcl     8.5
testsNeedTcltest 2.0

support {
    use      fileutil/fileutil.tcl  fileutil
    useLocal token.tcl              string::token
}
testing {
    useLocal token_shell.tcl string::token::shell
}

# -------------------------------------------------------------------------

test string-token-shell-1.0 {string token shell, wrong#args} -body {
    string token shell
} -returnCodes error -result {wrong # args: should be "string token shell text"}

test string-token-shell-1.1 {string token shell, wrong#args} -body {
    string token shell T X
} -returnCodes error -result {wrong # args: should be "string token shell text"}

# -------------------------------------------------------------------------

foreach {n label line tokens} {
    0  empty               {}        {}
    1  leading-whitespace  {  }      {}
    2  plain-words         {a}       {a}
    3  plain-words         {a b}     {a b}
    4  trailing-whitespace {a b c  } {a b c}
    5  inter-whitespace    {a   b}   {a b}
    6  single-quoted-words {'a' b}   {a b}
    7  single-quoted-words {a 'b'}   {a b}
    8  single-quoted-words {a 'b' c} {a b c}
    9  single-quoted-words {a '' c}  {a {} c}
    10 double-quoted-words {"a" b}   {a b}
    11 double-quoted-words {a "b"}   {a b}
    12 double-quoted-words {a "b" c} {a b c}
    13 double-quoted-words {a "\"" c} {a {"} c}
    14 mixed-quoted-words  {a "\"" ''} {a {"} {}}
    15 double-quoted-words {a "" c}  {a {} c}
    16 mixed               {a 'b' "c" d "e\"\"f" } {a b c d e\"\"f}
    17 backslashes         {a "\\" c}   {a \\ c}
    18 backslashes         {a "\"\\" c} {a \"\\ c}
    19 escaping-plain      {a \\ c}     {a \\ c}
    20 escaping-plain      {a \" c}     {a {"} c}
    21 escaping-plain      {a \"b c}    {a {"b} c}
} {
    test string-token-shell-2.$n "string token shell, $label" -body {
	string token shell $line
    } -result $tokens
}

foreach {n label line tokens} {
    0 words-without-whitespace {'a'"b"} {Expected whitespace, got start of word at offset 3}
    1 words-without-whitespace {"a"'b'} {Expected whitespace, got start of word at offset 3}
    2 words-without-whitespace {'a''b'} {Expected whitespace, got start of word at offset 3}
    3 words-without-whitespace {"a""b"} {Expected whitespace, got start of word at offset 3}
    4 words-without-whitespace {a"b"}   {Expected whitespace, got start of word at offset 1}
    5 words-without-whitespace {a'b'}   {Expected whitespace, got start of word at offset 1}
    6 incomplete-word-at-end   {a '}    {Unexpected character ''' at offset 2}
    7 incomplete-word-at-end   {a "}    {Unexpected character '"' at offset 2}
    8 incomplete-word-at-end   {a'}     {Unexpected character ''' at offset 1}
    9 incomplete-word-at-end   {a"}     {Unexpected character '"' at offset 1}
} {
    test string-token-shell-3.$n "string token shell, $label" -body {
	string token shell $line
    } -returnCodes error -result $tokens
}

#----------------------------------------------------------------------
testsuiteCleanup
return

Changes to support/installation/modules.tcl.

Module  sasl        _tcl  _man  _exa
Module  sha1        _tcl  _man  _null
Module  simulation  _tcl  _man  _null
Module  smtpd       _tcl  _man _exa
Module  snit        _tcl  _man  _null
Module  soundex     _tcl  _man  _null
Module  stooop      _tcl  _man  _null
Module  string      _tcl  _man  _null
Module  stringprep  _tcl  _man  _null
Module  struct      _tcl  _man _exa
Module  tar         _tcl  _man  _null
Module  tepam       _tcl  _man  _exa
Module  term         _tcr _man _exa
Module  textutil     _tex _man  _null
Module  tie         _tcl  _man  _exa