Thursday, May 1, 2008

3D Visual Detection of Correct NGT Sign Production

[Summary]


In this paper, the authors build a system that helps people learn Dutch Sign Language (NGT, Nederlandse Gebarentaal). The recognition system is vision-based. Two calibrated video cameras are placed on top of a table at which users perform the signs. The user's head and hands are tracked by following skin-colored segments of the image from frame to frame. The head serves as a stationary reference point.


The adaptive chrominance model copes with different lighting conditions and backgrounds. Skin color is modeled by a 2D Gaussian perpendicular to the main direction of the distribution of the positive skin samples in RGB space. The hands and head are tracked separately in each camera by following their respective blobs over consecutive frames or, when the hand blobs cannot be separated (due to occlusion), by performing a template search over skin areas using the gray image of the hand from the previous frame. The hand and head locations are (re)initialized from the positions of the three largest skin blobs in the image and then tracked by finding the nearest blob or the best template match in the next frame.
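The initialization and nearest-blob steps can be sketched roughly as follows. This is only an illustration, not the authors' code: the function names, the `(area, (x, y))` blob format, and the rule that the topmost of the three blobs is the head are my own assumptions. The left/right assignment by image position follows the description of the system; the template search used during occlusion is left out.

```python
from math import hypot

def init_parts(blobs):
    """Initialize head/hand positions from the three largest skin blobs.

    blobs: list of (area, (x, y)) tuples, a hypothetical format.
    """
    largest = sorted(blobs, key=lambda b: b[0], reverse=True)[:3]
    centroids = [c for _, c in largest]
    # Assumption: the head is the topmost blob (smallest y in image coordinates).
    head = min(centroids, key=lambda c: c[1])
    centroids.remove(head)
    # The remaining two blobs are labeled by horizontal position in the image.
    left, right = sorted(centroids, key=lambda c: c[0])
    return {"head": head, "left": left, "right": right}

def track_step(parts, centroids):
    """Move each tracked part to the nearest blob centroid in the next frame."""
    return {
        name: min(centroids, key=lambda c: hypot(c[0] - p[0], c[1] - p[1]))
        for name, p in parts.items()
    }

# Example: four skin-colored blobs, the smallest one is noise.
frame0 = [(500, (160, 40)), (300, (100, 200)), (280, (220, 210)), (20, (10, 10))]
parts = init_parts(frame0)
parts = track_step(parts, [(162, 42), (110, 190), (215, 205)])
```

Note that nothing here prevents two parts from snapping to the same blob, which is exactly the situation where a template search would be needed.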


For classification, fifty properties related to the 2D/3D location and movement of the hands are derived and measured in each frame. A separate classifier is trained for each property.
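To make the "one classifier per property" idea concrete, here is a toy sketch. It is not the classifier used in the paper: I assume each property gets a simple mean/standard-deviation model learned from correct productions, and that the per-property decisions are combined by a majority-style vote. The names `train_per_feature` and `classify` and the thresholds `z_max` and `min_agree` are made up for illustration.

```python
from statistics import mean, stdev

def train_per_feature(samples):
    """Fit one independent (mean, std) model per property.

    samples: list of feature vectors taken from correct productions of one sign.
    """
    columns = list(zip(*samples))
    return [(mean(col), stdev(col)) for col in columns]

def classify(models, x, z_max=2.0, min_agree=0.8):
    """Accept x if enough per-property classifiers accept their own value."""
    votes = [abs(v - m) <= z_max * s for (m, s), v in zip(models, x)]
    return sum(votes) / len(votes) >= min_agree

# Two hypothetical properties measured on three training examples.
models = train_per_feature([[1.0, 10.0], [1.2, 11.0], [0.8, 9.0]])
accepted = classify(models, [1.1, 10.5])
rejected = classify(models, [3.0, 10.0])
```

Each model only ever sees its own column, so any correlation between properties is ignored by construction.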


A set of 120 different NGT signs performed by 70 individuals is used to test the sign classification. The authors also perform cross-validation, and the overall recognition accuracy is 95%. They compare their results against linear time warping and dynamic time warping.
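For readers unfamiliar with the dynamic time warping baseline, a minimal textbook version for 1D sequences looks like this. It is the generic algorithm with an absolute-difference cost, not the exact variant or feature set used in the paper's comparison.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b stays on the same sample
                cost[i][j - 1],      # b advances, a stays on the same sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# The same trajectory performed at a different speed aligns with zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

Linear time warping, by contrast, only stretches one sequence uniformly to the length of the other before comparing sample by sample.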


[Discussion]


One problem with this vision-based recognition system is that it cannot correctly recognize gestures in which the hands cross, since it simply treats the left blob as the left hand and the right blob as the right hand. Another problem, common to vision-based tracking in general, is occlusion. Because the two video cameras are placed very close to each other (15 cm) and both point at the hands from a similar direction, it is hard to avoid some parts being occluded in the video.


I also really don't think training each feature separately is a good idea, since different features are related rather than independent. In addition, 50 features may be too many. Why didn't the authors consider using some feature selection technique?
