Tk Library Source Code

Ticket UUID: 606141
Title: csv bug: empty field in middle of row
Type: Bug Version: None
Submitter: todolson Created on: 2002-09-07 21:39:21
Subsystem: csv Assigned To: andreas_kupries
Priority: 8 Severity:
Status: Closed Last Modified: 2003-04-26 01:12:48
Resolution: Fixed Closed By: andreas_kupries
    Closed on: 2003-04-25 18:12:48
Description:
csv has a bug parsing an empty field in the middle of a
row.

In the attached tarball, see sample.data (taken from MS
Access export file).  The fourth field in the last
three rows is empty, but csv parses it as containing a
single double-quote character.  The fix is a single
line added to  csv.tcl, see csv.diff in the tarball, or
csv-fixed.tcl.  Run the sample program  csv-test.tcl to
see the difference in action.

I'd be grateful if you could include this fix in the
next release, as I am relying on the csv package in one
of my projects.

-Tod Olson <[email protected]>
User Comments: andreas_kupries added on 2003-04-24 07:15:23:

Did a complete rewrite of the parser for the alternate syntax. 
The one I committed last was a derivative of the original parser 
and simply could not handle the nested "". The new parser 
just splits into the primary tokens (", sepchar, remainder) and 
then converts the token sequence through a tcl-coded DFA 
(state-machine). This is able to detect an embedded "" 
sequence correctly.

Committed now. No known bugs in the extended testsuite. 
Full pass.
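
The approach described above (split the line into primary tokens, then run the token sequence through a state machine) can be sketched roughly as follows. This is an illustrative Python sketch only, not the Tcl code that was committed to tcllib; the function name split_csv_line, the state names, and the lenient handling of stray quotes are all assumptions made for the example.

```python
import re


def split_csv_line(line, sep=","):
    """Hypothetical sketch: tokenize, then walk a small state machine."""
    # Primary tokens: a quote, the separator, or a run of anything else.
    pattern = '("|' + re.escape(sep) + ")"
    tokens = [t for t in re.split(pattern, line) if t]

    fields = []
    buf = []
    state = "start"  # start | plain | quoted | quote_seen

    for tok in tokens:
        if state == "start":
            if tok == '"':
                state = "quoted"
            elif tok == sep:
                fields.append("")  # empty unquoted field
            else:
                buf.append(tok)
                state = "plain"
        elif state == "plain":
            if tok == sep:
                fields.append("".join(buf))
                buf = []
                state = "start"
            else:
                buf.append(tok)  # assumption: stray quote kept literally
        elif state == "quoted":
            if tok == '"':
                state = "quote_seen"  # closing quote, or first half of ""
            else:
                buf.append(tok)  # a separator inside quotes is data
        else:  # quote_seen
            if tok == '"':
                buf.append('"')  # embedded "" yields one literal quote
                state = "quoted"
            elif tok == sep:
                fields.append("".join(buf))  # "" alone yields empty field
                buf = []
                state = "start"
            else:
                buf.append(tok)  # assumption: text after a closing quote
                state = "plain"

    if state == "quoted":
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields
```

With this sketch, a quoted empty field in the middle of a row comes back as an empty string, and a doubled quote inside a quoted value comes back as a single quote character.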

andreas_kupries added on 2003-04-24 06:19:51:

Committed changes to head. Please test. The testsuite has 
one of the new cases marked as knownBug, i.e. this code is 
not completely correct, but handles the majority of cases. It 
dislikes "" inside of a value and handles that incorrectly.

andreas_kupries added on 2003-04-24 00:39:39:

Actually there is a sampledata file in the attached tarball. I 
am looking into this now.

todolson added on 2003-04-23 20:56:25:

The package certainly works as advertised.  However, there
are many applications that generate the ill-defined CSV
format. The most common CSV files that I see are exported
from Microsoft Access, such as what was provided in the
original bug report. The utility of this package would be
greatly improved if it could parse files exported from these
programs. Otherwise, those of us who deal with such data
have to roll our own CSV parsers, as some of my colleagues
do, or patch every release of tcllib.

I could easily provide a small number of test files which
include the awkward cases.

lvirden added on 2003-04-23 19:57:37:

The man page I am seeing says this:
 
FORMAT
     Each record of a csv file (comma-separated values, as exported
     e.g. by Excel) is a set of ASCII values separated by ",". For
     other languages it may be ";" however, although this is not
     important for this case (The functions provided here allow any
     separator character).
 
     If a value contains itself the separator ",", then it (the value)
     is put between "".
 
     If a value contains ", it is replaced by "".
---- 

1. The format is a bit off in the man page - it probably should be
(the functions ...)
 ^

2. The "it" in the third point is a bit vague - it probably should say
something like "If a value needs to contain the " character, the
character must be represented as ""."

3. I don't see anything here that references missing data, and that
was the point of at least two bug reports.
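
The quoting rules quoted from the man page can be illustrated with a small writer. This is a Python sketch only, not the tcllib implementation: the names quote_field and join_row are hypothetical. Wrapping values that contain a quote character in quotes, and writing a missing value as nothing between two separators, are assumptions that the quoted man page text does not state.

```python
def quote_field(value, sep=","):
    """Hypothetical sketch of the man page's quoting rules."""
    # Assumption: a value containing a quote is wrapped too, not only
    # a value containing the separator.
    needs_quotes = sep in value or '"' in value
    # A " inside the value is replaced by "".
    escaped = value.replace('"', '""')
    return '"' + escaped + '"' if needs_quotes else escaped


def join_row(values, sep=","):
    # Assumption: an empty value is written as nothing at all.
    return sep.join(quote_field(v, sep) for v in values)
```

For example, join_row(["a", "", "b,c"]) gives a,,"b,c" under these assumptions.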

todolson added on 2002-09-08 04:39:21:

File Added - 30645: csvpatch.tar.gz

Attachments: