An idea I had for 'gesture recognition', for simple mouse movement.
It came to me while walking to the office this morning: how to implement the 'mouse movement' recognition, assuming that OpenKinect does not provide such a thing by itself.
Mail to save the braindump before I forget. CC you for redundant storage in 2nd brain.
<brain-dump>
The idea assumes that a person interacting with the Kinect/display will stand in front of it with their hand stretched forward, towards the display.
Use the 'Depth' image.
Find the pixel whose value indicates that it is nearest to the display. Depending on the depth encoding, this will be either the maximum or the minimum pixel value.
<< This should at least be the user's hand. Depending on resolution and posture it may actually be the tip of a finger. It could also be the whole body if the user is not yet pointing to the display. That is actually a good thing, as this would track persons as they pass by, giving us the effect we want, automatically. >>
(Possibly find all pixels with that value and compute their centroid).
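
A minimal sketch of that step in Python/numpy. It assumes the depth frame arrives as a 2D numpy array of raw Kinect depth values where a smaller value means nearer and 2047 means 'no reading'; both of these are assumptions about the encoding, and the function name nearest_point is just illustrative. Flip to a maximum search if the encoding turns out to be the other way around.

    import numpy as np

    def nearest_point(depth, invalid=2047):
        # Centroid of all pixels at the nearest measured depth.
        # 'invalid' is the assumed 'no reading' value; adjust to whatever
        # the driver actually delivers.
        valid = depth != invalid
        if not valid.any():
            return None
        z = depth[valid].min()                     # nearest depth value
        ys, xs = np.nonzero(valid & (depth == z))  # all pixels at that depth
        return xs.mean(), ys.mean(), float(z)      # x/y centroid + z channel
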
We now have an x/y location, and we can track it as new depth images are sent to us by the device.
And the actual depth (i.e. max/min) gives us a z-channel too.
(Possibly run a windowed average over the stream of coordinates to smooth things out).
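
The tracking loop with the windowed average could then look roughly like this. get_depth_frame() is a stand-in for however the frames actually reach us (the libfreenect Python bindings can deliver depth frames as numpy arrays), and it reuses the nearest_point sketch from above.

    from collections import deque
    import numpy as np

    def track_pointer(get_depth_frame, window=5):
        # Smooth the (x, y, z) stream with a simple moving average.
        history = deque(maxlen=window)
        while True:
            p = nearest_point(get_depth_frame())   # from the sketch above
            if p is None:
                continue                           # nothing in view yet
            history.append(p)
            yield tuple(np.mean(history, axis=0))  # smoothed pointer position
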
Advanced idea:
Depending on the depth resolution, the Kinect may be able to separate fingertips from the hand. They might all be in the same plane, or in different but nearby planes.
Get the bounding box containing all of the nearest pixels. Tracking the size of this box over time gives a sense of 'scale', i.e. how open (flat, wide) vs closed (small clump) the hand is. This could be bound to 'zoom'. => What is the tablet gesture called where two fingers pinch or move apart? Pinch-to-zoom, I think. Same idea, using the whole hand.
(We might have to compensate for size changes due to depth changes, i.e. normalize to a fixed depth)
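
A sketch of that 'zoom' channel, under the same encoding assumptions as above. The depth band around the nearest point and the normalization by depth are guesses to be tuned, not measured values, and hand_scale is again an illustrative name.

    import numpy as np

    def hand_scale(depth, band=30, invalid=2047):
        # Bounding box of all pixels within 'band' depth units of the
        # nearest point, i.e. roughly the hand/finger region.
        valid = depth != invalid
        if not valid.any():
            return None
        z = depth[valid].min()
        ys, xs = np.nonzero(valid & (depth <= z + band))
        width = xs.max() - xs.min() + 1
        height = ys.max() - ys.min() + 1
        # Apparent size shrinks with distance, so multiplying by the depth
        # value roughly normalizes the box back to a fixed reference depth.
        return max(width, height) * z
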
More advanced:
Having the (normalized) bounding box with the hand/fingers, we can use known techniques on that region to determine the rotation of its content over time (*) and get another channel of movement.
This might even work if the finger tips are not separated from the hand, if the shape of the whole hand itself is distinctive enough to latch on.
This would actually also be another way of getting scale changes, albeit a more computationally expensive one.
(*) See Crimp GSoC, Affine Registration: transform the images to compare into a log-polar representation, run an FFT over them, and perform a phase correlation.
The resulting shift values in the log-polar domain translate into scale and rotation values in the Cartesian domain.
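
A rough sketch of that registration step using OpenCV's log-polar warp and phase correlation, taking two grayscale frames of the (normalized) hand region as 2D numpy arrays. Working on the FFT magnitude spectra first makes the result insensitive to translation between the frames. The log-polar scale factor is an assumed choice and the whole thing is untested illustration, not the Crimp code itself.

    import cv2
    import numpy as np

    def rotation_and_scale(img_a, img_b):
        # Magnitude spectra, so translation between the frames drops out.
        fa = np.float32(np.abs(np.fft.fftshift(np.fft.fft2(img_a))))
        fb = np.float32(np.abs(np.fft.fftshift(np.fft.fft2(img_b))))
        h, w = fa.shape
        m = w / np.log(w / 2.0)          # log-polar scale factor (assumed)
        lp_a = cv2.logPolar(fa, (w / 2.0, h / 2.0), m, cv2.INTER_LINEAR)
        lp_b = cv2.logPolar(fb, (w / 2.0, h / 2.0), m, cv2.INTER_LINEAR)
        (dx, dy), _ = cv2.phaseCorrelate(lp_a, lp_b)
        rotation = 360.0 * dy / h        # vertical shift   -> rotation angle
        scale = np.exp(dx / m)           # horizontal shift -> scale change
        return rotation, scale
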
</brain-dump>