Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Add a procedure for estimating probability density functions by means of the kernel density estimation method |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
bfc68668cd5da876e3629480017441d4 |
User & Date: | markus 2014-01-18 12:13:30.497 |
Context
2014-01-18
| ||
14:20 | Added three test cases for the kernel density estimation. Resulted in a few small corrections (deal with missing values) check-in: 2456e0d413 user: markus tags: trunk | |
12:13 | Add a procedure for estimating probability density functions by means of the kernel density estimation method check-in: bfc68668cd user: markus tags: trunk | |
2014-01-10
| ||
00:03 | Fix installer breaking on empty doc directories of excluded packages. check-in: e889e36583 user: andreask tags: trunk | |
Changes
Changes to modules/math/ChangeLog.
1 2 3 4 5 6 7 | 2013-12-20 Arjen Markus <[email protected]> * interpolate.tcl: [Ticket 843c2257d2] Added special case for points coincident with the data points * interpolate.test: [Ticket 843c2257d2] Added test case for coincident points 2013-12-17 Andreas Kupries <[email protected]> * decimal.man: Fixed missing requirement of the package | > > > > > > > | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 | 2014-01-18 Arjen Markus <[email protected]> * statistics.tcl: Added stat_kernel.tcl * stat_kernel.tcl: Implements a straightforward kernel density estimation procedure * statistics.man: Describe the kernel denstity estimation procedure, moved the description of several tests to the general section * pkgIndex.tcl: Bumped version of statistics package to 0.9 2013-12-20 Arjen Markus <[email protected]> * interpolate.tcl: [Ticket 843c2257d2] Added special case for points coincident with the data points * interpolate.test: [Ticket 843c2257d2] Added test case for coincident points 2013-12-17 Andreas Kupries <[email protected]> * decimal.man: Fixed missing requirement of the package |
︙ | ︙ |
Changes to modules/math/pkgIndex.tcl.
︙ | ︙ | |||
10 11 12 13 14 15 16 | package ifneeded math::fourier 1.0.2 [list source [file join $dir fourier.tcl]] if {![package vsatisfies [package provide Tcl] 8.3]} {return} package ifneeded math::roman 1.0 [list source [file join $dir romannumerals.tcl]] if {![package vsatisfies [package provide Tcl] 8.4]} {return} # statistics depends on linearalgebra (for multi-variate linear regression). | | | 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | package ifneeded math::fourier 1.0.2 [list source [file join $dir fourier.tcl]] if {![package vsatisfies [package provide Tcl] 8.3]} {return} package ifneeded math::roman 1.0 [list source [file join $dir romannumerals.tcl]] if {![package vsatisfies [package provide Tcl] 8.4]} {return} # statistics depends on linearalgebra (for multi-variate linear regression). package ifneeded math::statistics 0.9 [list source [file join $dir statistics.tcl]] package ifneeded math::optimize 1.0 [list source [file join $dir optimize.tcl]] package ifneeded math::calculus 0.7.2 [list source [file join $dir calculus.tcl]] package ifneeded math::interpolate 1.1 [list source [file join $dir interpolate.tcl]] package ifneeded math::linearalgebra 1.1.4 [list source [file join $dir linalg.tcl]] package ifneeded math::bignum 3.1.1 [list source [file join $dir bignum.tcl]] package ifneeded math::bigfloat 1.2.2 [list source [file join $dir bigfloat.tcl]] package ifneeded math::machineparameters 0.1 [list source [file join $dir machineparameters.tcl]] |
︙ | ︙ |
Added modules/math/stat_kernel.tcl.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 | # stat_kernel.tcl -- # # Part of the statistics package for basic statistical analysis # Based on http://en.wikipedia.org/wiki/Kernel_(statistics) and # http://en.wikipedia.org/wiki/Kernel_density_estimation # # version 0.1: initial implementation, january 2014 # kernel-density -- # Estimate the probability density using the kernel density # estimation method # # Arguments: # data List of univariate data # args List of options in the form of keyword-value pairs: # -weights weights: per data point the weight # -bandwidth value: bandwidth to be used for the estimation # -number value: number of bins to be returned # -interval {begin end}: begin and end of the interval for # which the density is returned # -kernel function: kernel to be used (gaussian, cosine, # epanechnikov, uniform, triangular, biweight, # logistic) # For all options more or less sensible defaults are # provided. # # Result: # A list of the bin centres, a list of the corresponding density # estimates and a list containing several computational parameters: # begin and end of the interval, mean, standard deviation and bandwidth # # Note: # The conditions for the kernel function are fairly weak: # - It should integrate to 1 # - It should be symmetric around 0 # # As for the implementation in Tcl: it should be reachable in the # ::math::statistics namespace. As a consequence, you can define # your own kernel function too. Hence there is no check. # proc ::math::statistics::kernel-density {data args} { # # Determine the basic statistics # set basicStats [BasicStats all $data] set mean [lindex $basicStats 0] set ndata [lindex $basicStats 3] set stdev [lindex $basicStats 4] if { $ndata < 1 } { return -code error -errorcode ARG -errorinfo "Too few actual data" } # # Get the options (providing defaults as needed) # set opt(-weights) {} set opt(-number) 100 set opt(-kernel) gaussian # # The default bandwidth is set via a simple expression, which # is supposed to be optimal for the Gaussian kernel. # Perhaps a more sophisticated method should be provided as well # set opt(-bandwidth) [expr {1.06 * $stdev / pow($ndata,0.2)}] # # The default interval is derived from the mean and the # standard deviation # set opt(-interval) [list [expr {$mean - 3.0 * $stdev}] [expr {$mean + 3.0 * $stdev}]] # # Retrieve the given options from $args # if { [llength $args] % 2 != 0 } { return -code error -errorcode ARG -errorinfo "The options must all have a value" } array set opt $args # # Elementary checks # if { $opt(-bandwidth) <= 0.0 } { return -code error -errorcode ARG -errorinfo "The bandwidth must be positive: $opt(-bandwidth)" } if { $opt(-number) <= 0.0 } { return -code error -errorcode ARG -errorinfo "The number of bins must be positive: $opt(-number)" } if { [llength [info proc $opt(-kernel)]] == 0 } { return -code error -errorcode ARG -errorinfo "Unknown kernel function: $opt(-kernel)" } # # Construct the weights # if { [llength $opt(-weights)] > 0 } { if { [llength $data] != [llength $opt(-weights)] } { return -code error -errorcode ARG -errorinfo "The list of weights must match the data" } set sum 0.0 foreach d $data w $opt(-weights) { if { $d != {} } { set sum [expr {$sum + $w}] } } set scale [expr {1.0/$sum/$ndata}] set weight {} foreach w $opt(-weights) { if { $d != {} } { lappend weight [expr {$w / $scale}] } else { lappend weight {} } } } else { set weight [lrepeat $ndata [expr {1.0/$ndata}]] } # # Construct the centres of the bins # set xbegin [lindex $opt(-interval) 0] set xend [lindex $opt(-interval) 1] set dx [expr {($xend - $xbegin) / $opt(-number)}] set xb [expr {$xbegin + 0.5 * $dx}] set xvalue {} for {set i 0} {$i < $opt(-number)} {incr i} { lappend xvalue [expr {$xb + $i * $dx}] } # # Construct the density function # set density {} set scale [expr {$opt(-bandwidth)}] foreach x $xvalue { set sum 0.0 foreach d $data w $weight { if { $d != {} } { set kvalue [$opt(-kernel) [expr {$scale * ($x-$d)}]] set sum [expr {$sum + $w * $kvalue}] } } lappend density [expr {$sum * $scale}] } # # Return the result # return [list $xvalue $density [list $xbegin $xend $mean $stdev $scale]] } # gaussian, uniform, triangular, epanechnikov, biweight, cosine, logistic -- # The Gaussian kernel # # Arguments: # x (Scaled) argument # # Result: # Value of the kernel # # Note: # The standard deviation is 1. # proc ::math::statistics::gaussian {x} { return [expr {exp(-0.5*$x*$x) / sqrt(2.0*acos(-1.0))}] } proc ::math::statistics::uniform {x} { if { abs($x) < 1.0 } { return 0.5 } else { return 0.0 } } proc ::math::statistics::triangular {x} { if { abs($x) < 1.0 } { return [expr {1.0 - abs($x)}] } else { return 0.0 } } proc ::math::statistics::epanechnikov {x} { if { abs($x) < 1.0 } { return [expr {0.75 * (1.0 - abs($x)*abs($x))}] } else { return 0.0 } } proc ::math::statistics::biweight {x} { if { abs($x) < 1.0 } { return [expr {0.9375 * pow((1.0 - abs($x)*abs($x)),2)}] } else { return 0.0 } } proc ::math::statistics::cosine {x} { if { abs($x) < 1.0 } { return [expr {0.25 * acos(-1.0) * cos(0.5 * acos(-1.0) * $x)}] } else { return 0.0 } } proc ::math::statistics::logistic {x} { return [expr {1.0 / (exp($x) + 2.0 + exp(-$x))}] } |
Changes to modules/math/statistics.man.
|
| | | | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | [manpage_begin math::statistics n 0.9] [keywords {data analysis}] [keywords mathematics] [keywords statistics] [moddesc {Tcl Math Library}] [titledesc {Basic statistical functions and procedures}] [category Mathematics] [require Tcl 8.4] [require math::statistics 0.9] [description] [para] The [package math::statistics] package contains functions and procedures for basic statistical data analysis, such as: [list_begin itemized] |
︙ | ︙ | |||
413 414 415 416 417 418 419 420 421 422 423 424 425 426 | Returns a list of subsamples (their indices) that indeed violate the limits. [list_begin arguments] [arg_def list control] - Control limits as returned by the "control-Rchart" procedure [arg_def list data] - List of observed data [list_end] [para] [list_end] [section "MULTIVARIATE LINEAR REGRESSION"] Besides the linear regression with a single independent variable, the statistics package provides two procedures for doing ordinary | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 | Returns a list of subsamples (their indices) that indeed violate the limits. [list_begin arguments] [arg_def list control] - Control limits as returned by the "control-Rchart" procedure [arg_def list data] - List of observed data [list_end] [para] [call [cmd ::math::statistics::test-Kruskal-Wallis] [arg confidence] [arg args]] Check if the population medians of two or more groups are equal with a given confidence level, using the Kruskal-Wallis test. [list_begin arguments] [arg_def float confidence] - Confidence level to be used (0-1) [arg_def list args] - Two or more lists of data [list_end] [para] [call [cmd ::math::statistics::analyse-Kruskal-Wallis] [arg args]] Compute the statistical parameters for the Kruskal-Wallis test. Returns the Kruskal-Wallis statistic and the probability that that value would occur assuming the medians of the populations are equal. [list_begin arguments] [arg_def list args] - Two or more lists of data [list_end] [para] [call [cmd ::math::statistics::group-rank] [arg args]] Rank the groups of data with respect to the complete set. Returns a list consisting of the group ID, the value and the rank (possibly a rational number, in case of ties) for each data item. [list_begin arguments] [arg_def list args] - Two or more lists of data [list_end] [para] [call [cmd ::math::statistics::test-Wilcoxon] [arg sample_a] [arg sample_b]] Compute the Wilcoxon test statistic to determine if two samples have the same median or not. (The statistic can be regarded as standard normal, if the sample sizes are both larger than 10. Returns the value of this statistic. [list_begin arguments] [arg_def list sample_a] - List of data comprising the first sample [arg_def list sample_b] - List of data comprising the second sample [list_end] [para] [call [cmd ::math::statistics::spearman-rank] [arg sample_a] [arg sample_b]] Return the Spearman rank correlation as an alternative to the ordinary (Pearson's) correlation coefficient. The two samples should have the same number of data. [list_begin arguments] [arg_def list sample_a] - First list of data [arg_def list sample_b] - Second list of data [list_end] [para] [call [cmd ::math::statistics::spearman-rank-extended] [arg sample_a] [arg sample_b]] Return the Spearman rank correlation as an alternative to the ordinary (Pearson's) correlation coefficient as well as additional data. The two samples should have the same number of data. The procedure returns the correlation coefficient, the number of data pairs used and the z-score, an approximately standard normal statistic, indicating the significance of the correlation. [list_begin arguments] [arg_def list sample_a] - First list of data [arg_def list sample_b] - Second list of data [list_end] [call [cmd ::math::statistics::kernel-density] [arg data] opt [arg "-option value"] ...]] Return the density function based on kernel density estimation. The procedure is controlled by a small set of options, each of which is given a reasonable default. [nl] The return value consists of three lists: the centres of the bins, the associated probability density and a list of computational parameters (begin and end of the interval, mean and standard deviation and the used bandwidth). The computational parameters can be used for further analysis. [list_begin arguments] [arg_def list data] - The data to be examined [arg_def list args] - Option-value pairs: [list_begin definitions] [def "[option -weights] [arg weights]"] Per data point the weight (default: 1 for all data) [def "[option -bandwidth] [arg value]"] Bandwidth to be used for the estimation (default: determined from standard deviation) [def "[option -number] [arg value]"] Number of bins to be returned (default: 100) [def "[option -interval] [arg "{begin end}"]"] Begin and end of the interval for which the density is returned (default: mean +/- 3*standard deviation) [def "[option -kernel] [arg function]"] Kernel to be used (One of: gaussian, cosine, epanechnikov, uniform, triangular, biweight, logistic; default: gaussian) [list_end] [list_end] [list_end] [section "MULTIVARIATE LINEAR REGRESSION"] Besides the linear regression with a single independent variable, the statistics package provides two procedures for doing ordinary |
︙ | ︙ | |||
907 908 909 910 911 912 913 | [list_end] [para] [call [cmd ::math::statistics::subdivide]] Routine [emph PM] - not implemented yet [para] | < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < | 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 | [list_end] [para] [call [cmd ::math::statistics::subdivide]] Routine [emph PM] - not implemented yet [para] [list_end] [section "PLOT PROCEDURES"] The following simple plotting procedures are available: [list_begin definitions] [call [cmd ::math::statistics::plot-scale] [arg canvas] \ |
︙ | ︙ |
Changes to modules/math/statistics.tcl.
︙ | ︙ | |||
12 13 14 15 16 17 18 19 | # Eric Kemp-Benedict, february 2007 # version 0.5: added the population standard deviation and variance, # as suggested by Dimitrios Zachariadis # version 0.6: added pdf and cdf procedures for various distributions # (provided by Eric Kemp-Benedict) # version 0.7: added Kruskal-Wallis test (by Torsten Berg) # version 0.8: added Wilcoxon test and Spearman rank correlation | > | | 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | # Eric Kemp-Benedict, february 2007 # version 0.5: added the population standard deviation and variance, # as suggested by Dimitrios Zachariadis # version 0.6: added pdf and cdf procedures for various distributions # (provided by Eric Kemp-Benedict) # version 0.7: added Kruskal-Wallis test (by Torsten Berg) # version 0.8: added Wilcoxon test and Spearman rank correlation # version 0.9: added kernel density estimation package provide math::statistics 0.9 package require math # ::math::statistics -- # Namespace holding the procedures and variables # namespace eval ::math::statistics { |
︙ | ︙ | |||
1293 1294 1295 1296 1297 1298 1299 | if { $range < $rlower } { lappend result $i } if { $range > $rupper } { lappend result $i } } return $result } | < < < < > | 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 | if { $range < $rlower } { lappend result $i } if { $range > $rupper } { lappend result $i } } return $result } # # Load the auxiliary scripts # source [file join [file dirname [info script]] pdf_stat.tcl] source [file join [file dirname [info script]] plotstat.tcl] source [file join [file dirname [info script]] liststat.tcl] source [file join [file dirname [info script]] mvlinreg.tcl] source [file join [file dirname [info script]] kruskal.tcl] source [file join [file dirname [info script]] wilcoxon.tcl] source [file join [file dirname [info script]] stat_kernel.tcl] # # Define the tables # namespace eval ::math::statistics { variable student_t_table |
︙ | ︙ |