Ticket UUID: | 606141 | |||
Title: | csv bug: empty field in middle of row | |||
Type: | Bug | Version: | None | |
Submitter: | todolson | Created on: | 2002-09-07 21:39:21 | |
Subsystem: | csv | Assigned To: | andreas_kupries | |
Priority: | 8 | Severity: | ||
Status: | Closed | Last Modified: | 2003-04-26 01:12:48 | |
Resolution: | Fixed | Closed By: | andreas_kupries | |
Closed on: | 2003-04-25 18:12:48 | |||
Description: |
csv has a bug parsing an empty field in the middle of a row. In the attached tarball, see sample.data (taken from an MS Access export file). The fourth field in the last three rows is empty, but csv parses it as containing a single double-quote character. The fix is a single line added to csv.tcl; see csv.diff in the tarball, or csv-fixed.tcl. Run the sample program csv-test.tcl to see the difference in action. I'd be grateful if you could include this fix in the next release, as I am relying on the csv package in one of my projects. -Tod Olson <[email protected]> | |||
User Comments: |
andreas_kupries added on 2003-04-24 07:15:23:
Logged In: YES user_id=75003
Did a complete rewrite of the parser for the alternate syntax. The one I committed last was a derivative of the original parser and simply could not handle the nested "". The new parser just splits the input into the primary tokens (", sepchar, remainder) and then converts the token sequence through a Tcl-coded DFA (state machine). This is able to detect an embedded "" sequence correctly. Committed now. No known bugs in the extended testsuite. Full pass.

andreas_kupries added on 2003-04-24 06:19:51:
Logged In: YES user_id=75003
Committed changes to head. Please test. The testsuite has one of the new cases marked as knownBug, i.e. this code is not completely correct, but handles the majority of cases. It dislikes "" inside of a value and handles that incorrectly.

andreas_kupries added on 2003-04-24 00:39:39:
Logged In: YES user_id=75003
Actually there is a sample data file in the attached tarball. I am looking into this now.

todolson added on 2003-04-23 20:56:25:
Logged In: YES user_id=450877
The package certainly works as advertised. However, there are many applications that generate the ill-defined CSV format. The most common CSV files that I see are exported from Microsoft Access, such as what was provided in the original bug report. The utility of this package would be greatly improved if it could parse files exported from these programs. Otherwise, those of us who deal with such data have to roll our own CSV parsers, as some of my colleagues do, or patch every release of tcllib. I could easily provide a small number of test files which include the awkward cases.

lvirden added on 2003-04-23 19:57:37:
Logged In: YES user_id=15949
The man page I am seeing says this:

FORMAT
Each record of a csv file (comma-separated values, as exported e.g. by Excel) is a set of ASCII values separated by ",". For other languages it may be ";" however, although this is not important for this case (The functions provided here allow any separator character).
If a value contains itself the separator ",", then it (the value) is put between "".
If a value contains ", it is replaced by "".
----
1. The format is a bit off in the man page - it probably should be
   (the functions ...)
    ^
2. The "it" in the third point is a bit vague - probably should say something like "If a value needs to contain the " character, the character must be represented as "".
3. I don't see anything here that references missing data. And that was the point of at least 2 or more bug reports.

todolson added on 2002-09-08 04:39:21:
File Added - 30645: csvpatch.tar.gz |
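
The approach described in the 2003-04-24 07:15:23 comment (split the line into the primary tokens, then run the token sequence through a state machine) can be sketched roughly as follows. This is only an illustration written in Python, not the Tcl code that was committed to tcllib; the function names `tokenize` and `split_alternate`, the state names, and the error handling are all assumptions made for this sketch. It treats a `""` that stands alone as a field as an empty string, and a `""` inside a quoted value as one literal double quote.

```python
import re

def tokenize(line, sep=","):
    # Hypothetical helper: split a line into primary tokens, each of which is
    # a double quote, the separator character, or a run of other characters.
    cls = re.escape(sep)
    return re.findall('"|' + cls + '|[^"' + cls + ']+', line)

def split_alternate(line, sep=","):
    # Hypothetical DFA over the token stream (not tcllib's implementation).
    # States: START (at the beginning of a field), UNQUOTED, QUOTED, and
    # QUOTE_SEEN (a quote was seen inside a quoted field; it either ends the
    # field or is the first half of an embedded "").
    fields, buf, state = [], [], "START"
    for tok in tokenize(line, sep):
        if state == "START":
            if tok == '"':
                state = "QUOTED"
            elif tok == sep:
                fields.append("")          # nothing between two separators
            else:
                buf.append(tok)
                state = "UNQUOTED"
        elif state == "UNQUOTED":
            if tok == sep:
                fields.append("".join(buf))
                buf, state = [], "START"
            else:
                buf.append(tok)            # a stray quote is kept literally
        elif state == "QUOTED":
            if tok == '"':
                state = "QUOTE_SEEN"
            else:
                buf.append(tok)            # a separator inside quotes is data
        else:  # QUOTE_SEEN
            if tok == '"':
                buf.append('"')            # embedded "" yields one quote
                state = "QUOTED"
            elif tok == sep:
                fields.append("".join(buf))  # closing quote ended the field
                buf, state = [], "START"
            else:
                raise ValueError("unexpected text after closing quote")
    if state == "QUOTED":
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))            # flush the last field
    return fields

print(split_alternate('a,"",b'))                    # ['a', '', 'b']
print(split_alternate('"1","say ""hi""","","x,y"'))  # ['1', 'say "hi"', '', 'x,y']
```

Because the quote character is its own token, the decision about what a `""` pair means is made by the state the machine is in when the second quote arrives, which is what lets an empty quoted field and an embedded quote be told apart.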
Attachments:
- csvpatch.tar.gz [download] added by todolson on 2002-09-08 04:39:21. [details]