
[IEEE CCIS 2011 - 2011 IEEE International Conference on Cloud Computing and Intelligence Systems, Beijing, China, 2011.09.15-2011.09.17]

Proceedings of IEEE CCIS2011

DETECT AND TRACK THE DYNAMIC DEFORMATION HUMAN BODY WITH THE ACTIVE SHAPE MODEL MODIFIED BY MOTION VECTORS

Jia Ma1, Fuji Ren1,2

1 Faculty of Engineering, University of Tokushima, Japan 2 AIM, Hefei University of Technology, China

[email protected], [email protected]

Abstract At present, many systems are used to detect the human body in video, especially the moving human body. These systems usually recognize human body features: learning body features by training, recognizing skin color or body shape, or identifying specific movements of the human body. These methods have some problems. The training-based methods require a large amount of training data but can only identify specific targets. The methods that identify body shape or skin color are greatly influenced by the video quality and content. The methods that detect motion features are often complex and rely heavily on 3D data. In this study, we propose an Active Shape Model (ASM) modified by motion vectors to detect and track the human body as it deforms during action. The active shape model is used to identify particular shapes; it is also a training-based method. Instead of a trained result, we use an original human body model modified by motion vectors, so that the active shape model can detect the human body after deformation and the body is neither lost nor detected repeatedly. In this method, we use the optical-flow method to obtain the motion vectors. The effectiveness of the method is confirmed by experiment, and the detection success rate for the moving human body is 94%. The experimental results and a discussion are also presented.

Keywords: Haartraining and Adaboost; motion vector; Active Shape Model (ASM); optical flow

1 Introduction Currently, research on artificial perception and advanced intelligence is receiving wide attention. Our laboratory takes this study as one of its main research directions [1, 2]. In our study, we pay attention to human understanding and the recognition of human emotions. Building on our work on language processing [3-7], image processing [8-11] and voice processing [12], we hope to identify and analyse human emotions, so as to lay a solid foundation for establishing an artificial perception and advanced intelligence system. We are trying to build an automatic robot system that can deal with human language, images and voice. This system should be able to communicate with people freely and in a friendly manner. In preparation for this study, the robot should be able to automatically discover the humans around it and engage them as communication objects. Therefore, we need to develop a system for human identification. In our study, we used a camera-based system.

At present, camera-based automatic human body detection systems are widely used. These systems usually detect the human body shape or skin color, or special human actions, as in [13]. However, these systems have some problems. The training-based methods use a large amount of training data to make the system understand certain human features, such as Haartraining and Adaboost [14], ASM [15], and the active appearance model (AAM) [16], but such methods can only be used for specific targets. The methods for identifying skin color and human faces are relatively mature, such as [17], but because of clothing, these methods can only confirm some parts of the human body. Moreover, the face is not always easy to detect, because it is not always clearly visible in the video. Other researchers [18, 19] use multiple cameras or cameras with special sensors to detect the human body. These researchers used specialized equipment to obtain 3D videos, so that they could identify the body or its actions. These methods work well; however, setting up such a system and analysing the 3D data are very complicated.

To solve the above problems, a simple, fast and general human body detection method is required. In our method, we use a gradual structure to ensure that the moving human body is detected with a high success rate.
Therefore, the whole method combines several algorithms. To get the original human body ___________________________________

978-1-61284-204-2/11/$26.00 ©2011 IEEE


shape, Haartraining and Adaboost [14] are used. Optical flow [20] is used to obtain the motion vectors, and the active shape model (ASM) [15, 21] modified by the motion vectors is used to detect and track the human body shape in the following frames. The rest of this paper is organized as follows. Section 2 introduces the main concepts and algorithms that make up our method. Section 3 describes the experiments performed to test the method and presents the experimental results and related discussion. In Section 4, we present our conclusions.

2 Description of the method The video containing the moving human body is divided into a sequence of images at 20 frames/s. A human body in a specific pose is detected in every frame, and the detected target is then detected and tracked with the modified ASM even when the human body is deformed by the person's actions. As shown in Figure 1, three steps are used to detect and track the target people. The output is the restored video, integrated from all the processed frames.
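As a rough sketch of this three-step, per-frame flow (the step functions below are hypothetical placeholders standing in for the real optical-flow, Haartraining/Adaboost, and modified-ASM components, not the paper's implementation):

```python
import numpy as np

def estimate_motion(prev, frame):
    # Step 1 placeholder: a real system would run an optical-flow
    # method here; a plain frame difference stands in for it.
    return frame.astype(np.int16) - prev.astype(np.int16)

def detect_body(frame):
    # Step 2 placeholder: Haartraining + Adaboost detection of an
    # upright, front-facing body; here we just return a dummy box.
    return {"box": (0, 0, frame.shape[1], frame.shape[0])}

def update_shape(model, vectors):
    # Step 3 placeholder: the modified-ASM update driven by the
    # motion vectors of step 1.
    return model

def detect_and_track(frames):
    """Per-frame loop over the three steps described above."""
    results, prev, model = [], None, None
    for frame in frames:
        vectors = estimate_motion(prev, frame) if prev is not None else None
        if model is None:
            model = detect_body(frame)            # first detection
        elif vectors is not None:
            model = update_shape(model, vectors)  # track under deformation
        results.append((frame, model))
        prev = frame
    return results
```

The point of the structure is that detection (step 2) runs only until a target is found, after which tracking (step 3) takes over, guided by the motion vectors from step 1.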

Figure 1 The flow of our method

The optical-flow method describes moving objects according to a basic principle: velocity vectors are assigned to all pixels to form a motion framework of the frame. These velocity vectors are called motion vectors. Based on the motion vectors of the pixels, the frame features can be analysed dynamically. If there are no moving objects, the motion vectors change continuously across the entire frame. Otherwise, when moving objects exist, relative motion between the targets and the background appears, and the motion vectors of the moving objects differ from those of the background. With this method, the global motion vectors of the background and the local motion vectors of the moving objects can be acquired from the frame. By extracting all the regions whose local motion vectors differ from the global motion vectors, all

moving objects can be detected. The location data and motion vectors of the pixels on the contours of the moving objects are recorded and passed to step two.

To detect moving people, it is not realistic to use only the optical-flow method, since it describes only the motion, not the features of the object. Therefore, the Haartraining and Adaboost method, which is efficient, is used in step two to detect the original human body shape model. To detect the human bodies in the frames, we use a model corresponding to the front view of an upright human body. This model was built with Haartraining [22], and the detection process is completed with the Adaboost method [23]. The detected human body is recorded as the original shape model and sent to step three. In our study, we used the method based on detecting weak features of grey-level images proposed by Paul Viola and Michael Jones in 2001 [14]. Training the Haar cascade classifier requires four steps: 1) preparing the positive and negative samples; 2) constructing the sample collection; 3) training with the Haartraining program to obtain the final classifier model; 4) testing the classifier. In our study, the positive samples are frames that include human bodies. These videos were taken with an omnidirectional camera, introduced in Section 3, to obtain a larger viewing angle and more complex images. The negative samples are frames that do not include human bodies, and their sizes are larger than those of the positive samples. Objects in the video taken with the omnidirectional camera are deformed; therefore, the standard models were not used in our study. We used OpenCV [24] to train new models that correspond to this deformation. When a person is moving or performing different actions, the shape and location of the human body differ from frame to frame.
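A minimal illustration of separating local from global motion vectors, using exhaustive block matching in place of the paper's optical-flow method (the block size, search range, and the use of the median as the "global vector" are our own assumptions for the sketch):

```python
import numpy as np

def block_motion_vectors(prev, cur, block=8, search=3):
    """Per-block motion vectors via exhaustive block matching.
    The vector (dy, dx) points from a block's position in the
    current frame back to its best match in the previous frame."""
    h, w = prev.shape
    vecs = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            ref = cur[y:y + block, x:x + block].astype(int)
            # Start from the zero offset so static blocks stay (0, 0).
            best = np.abs(ref - prev[y:y + block, x:x + block].astype(int)).sum()
            best_v = (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if (dy, dx) == (0, 0) or yy < 0 or xx < 0 \
                            or yy + block > h or xx + block > w:
                        continue
                    cand = prev[yy:yy + block, xx:xx + block].astype(int)
                    cost = np.abs(ref - cand).sum()
                    if cost < best:
                        best, best_v = cost, (dy, dx)
            vecs[by, bx] = best_v
    return vecs

def moving_blocks(vecs):
    """Flag blocks whose local vector differs from the global
    (median) vector of the frame, i.e. the moving regions."""
    global_v = np.median(vecs.reshape(-1, 2), axis=0)
    return np.any(vecs != global_v, axis=2)
```

On a static background, the median vector is (0, 0), so only the blocks covering a displaced object are flagged; with camera motion, the median absorbs the global translation instead.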
To identify any human body in each frame and to track the detected target into the next frame, we must ensure that the detected human body is neither lost nor detected multiple times because of the deformation caused by the person's actions. In our study, ASM [15] is used not only to detect but also to track human bodies, because ASM is designed to detect specified objects. By learning the object's shape, ASM can detect the target. However, the training data, i.e. the original shape, must be prepared. To obtain shape data that is as accurate as possible, the original shape is usually acquired manually. However, the shape of



the human body is not fixed in our study. Therefore, the contour of the detected body is recorded in place of a manually specified shape model as the original shape, which allows the original training data to be obtained automatically. The ASM method determines the shape variations using principal component analysis (PCA), which enables the possible contours of the target to be recognized. However, the human body in our video is moving or acting, and the body shape is being deformed; therefore, the original ASM model does not fit our requirements. Based on the motion vectors recorded in step one for a particular frame, the pixels recorded as the original shape in step two that have the same or similar coordinates in that frame are shifted for the next frame. The shifted pixels are connected to form a new contour, which is treated as the new training data for ASM. The system thus modifies ASM to fit the deformation of the targets. The reason for not using the active appearance model (AAM) [16], an extension of ASM, is as follows. Although AAM can determine changes in the size and direction of the target and fit them with the

appearance of the pixels, this fitting depends on the contour of the target. In our study, however, the contour of the target is deforming; therefore, the active appearance may cause confusion.
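The core update, shifting the recorded contour by the motion vectors and then re-learning the shape statistics with PCA, might be sketched as follows (the per-pixel vector field layout and the landmark format are assumptions of ours, not the paper's data structures):

```python
import numpy as np

def update_contour(contour, vector_field):
    """Shift each recorded contour point (y, x) by the motion vector
    stored at its position, yielding the new ASM training shape."""
    h, w = vector_field.shape[:2]
    new_pts = []
    for y, x in contour:
        dy, dx = vector_field[min(y, h - 1), min(x, w - 1)]
        new_pts.append((y + int(dy), x + int(dx)))
    return new_pts

def shape_modes(shapes, k=1):
    """PCA over flattened landmark shapes, as ASM uses to model the
    allowed contour variation: returns the mean shape and the first
    k variation modes."""
    X = np.array([np.ravel(s) for s in shapes], dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]
```

Each shifted contour is appended to the shape set, so the PCA modes keep following the target's current deformation rather than a fixed, manually annotated shape.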

3 An experiment confirming the method and a discussion of the experimental results To obtain a larger viewing angle and more complex videos, an omnidirectional camera with a 360° horizontal viewing angle is used in our experiment. The camera is a VS-C15N, developed by VSTONE. Its output is a 24-bit RGB 320 × 240 NTSC video. As shown in Figure 2, the images in each column show the results of the different steps of the system. The first image shows the result of detecting the moving object using the optical-flow method. The second image shows the result of detecting the human body in the original images using the Haartraining and Adaboost method. The third image shows the result of detecting and tracking the moving people using the modified ASM. The fourth image shows the output.

Figure 2 Sample image for the experimental result


In our study, people performed different motions and had different poses and locations in the videos. In the experiment, 20 videos, each 30 seconds long, were taken with the omnidirectional camera. The videos contained a total of 60 target persons; 10 of them remained stationary, while the other 50 were continuously moving or acting. The experimental results are as follows. In total, 52 of the 60 target people in the videos were detected, a success rate of about 87%. Sixteen of these people were not detected at the beginning of the videos because they were not standing upright and facing the camera at that time; depending on their actions and movement, they were detected once they stood upright and faced the camera. Eight target people failed to be detected, for the following reasons. First, five of the stationary target people, that is, half of the stationary targets, stood too close to the camera. As shown in the first image of Figure 2, such as the person wearing a half-sleeved shirt, these target people could not be completely included in the video, so the model of the whole human body could not fit in the image. Second, the other three undetected targets were continuously moving or acting in the videos but never stood upright and showed the front of their torsos in any frame; thus, of the 50 moving target people, 47 (94%) were successfully detected. To solve these problems, in our future work we may use a distance sensor to detect persons in close proximity to the camera, or develop other models for parts of the human body or for other postures.
Another possible option is to increase the frame capture rate to obtain more information.

4 Conclusions and future work In this study, we attempted to establish a new method for detecting and tracking moving people. In our method, the optical-flow method was chosen to describe the moving objects and obtain the motion vectors, while the Haartraining and Adaboost method was used to detect the original human body shape as the original training data. The Haartraining models for detecting the human body were retrained to adapt to the deformation of the video caused by the omnidirectional camera. The ASM method

modified by the motion vectors is used to detect and track the moving human body. The effectiveness of this method was verified experimentally. Based on the experimental results, we found some problems that should be solved in our future work. First, new models for detecting the human body in particular positions or postures should be developed. Second, other sensors may be used to detect people in particular positions. In our future work, we will strengthen the human detection system and begin to identify human expressions with the camera on the robot.

Acknowledgements This research has been partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (A), 22240021.

References

[1] Fuji Ren, Affective Information Processing and Recognizing Human Emotion, Electronic Notes in Theoretical Computer Science, Vol.225, pp.39-50, 2009

[2] Fuji Ren, From Cloud Computing to Language Engineering, Affective Computing and Advanced Intelligence, International Journal of Advanced Intelligence, Vol.2, No.1, pp.1-14, 2010

[3] Changqin Quan, Fuji Ren, A blog emotion corpus for emotional expression analysis in Chinese, Computer Speech and Language, Vol.24, No.1, pp.726-749, 2010

[4] Ai Hakamata, Fuji Ren and Seiji Tsuchiya, Human Emotion Model Based on Discourse Sentence for Expression Generation of Conversation Agent, International Journal of Innovative Computing, Information and Control, Vol.6, No.3(B), pp.1537-1548, 2010

[5] Changqin Quan, Fuji Ren, Sentence Emotion Analysis and Recognition Based on Emotion Words Using Ren-CECps, International Journal of Advanced Intelligence, Vol.2, No.1, pp.105-117, 2010

[6] Kazuyuki Matsumoto, Fuji Ren, Estimation of Word Emotions Based on Part of Speech and Positional Information, Computers in Human Behavior, Elsevier Ltd., Vol.27, pp.1553-1564, 2011

[7] Changqin Quan, Fuji Ren, Recognition of Word Emotion State in Sentences, IEEJ Transactions on Electrical and Electronic Engineering, Vol.6, No.S1, pp.34-41, 2011


[8] Jia Ma, Motoyuki Suzuki and Fuji Ren, Detect the Possible Spokesperson with an Omni-directional Camera in a Robot-human Communication System, Proc. IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp.159-163, Dalian, Sep. 2009

[9] Peilin Jiang, Fuji Ren and Nanning Zheng, Advanced Emotion Categorization and Tagging, IEEE NLP-KE 2008, pp.249-254, Beijing, Oct. 2008

[10] Jiang Peilin, Ma Jia, Minamoto Yusuke, Seiji Tsuchiya, Ryosuke Sumitomo and Fuji Ren, Orient video database for facial expression analysis, Proceedings of the 10th IASTED International Conference on Intelligent Systems and Control, pp.211-214, Massachusetts, Nov. 2007

[11] Jia Ma, Fuji Ren, Shingo Kuroiwa, Picture Processing in Expression Recognition, Proceedings of International Symposium on Artificial Intelligence and Affective Computing 2006, pp.147-157, Tokushima, Nov. 2006

[12] Adachi Masashi, Seiji Tsuchiya and Fuji Ren, Emotion Inference Method Based on Word's Meaning and Utterance Features, ICAI 2008, pp.138-141, Beijing, Oct. 2008

[13] JETRO, Digital Video Recorder, PC based DVR (4/8/16ch), http://www.jetro.go.jp/ttppoas/anken/0001091000/1091683_e.html

[14] Paul Viola, Michael Jones. Robust Real-time Object Detection, Second international workshop on statistical and computational theories of vision - Modeling, Learning, Computing, and Sampling Vancouver, Canada, July 13, 2001.

[15] T. Cootes, C. Taylor, D. Cooper, J. Graham. Active Shape Models - Their Training and Application. Computer Vision and Image Understanding, Volume 61, Issue 1, pp. 38-59, 1995.

[16] T. Cootes, G. Edwards, C. Taylor. Active Appearance Models, 5th European Conference on Computer Vision, Volume 2, pp. 484-498, Springer, Freiburg, Germany, 1998.

[17] Krystian Mikolajczyk, Cordelia Schmid, Andrew Zisserman. Human Detection Based on a Probabilistic Assembly of Robust Part Detectors, Computer Vision - ECCV 2004, Lecture Notes in Computer Science, Volume 3021/2004, pp. 69-82, 2004

[18] Q. Cai, J.K. Aggarwal. Tracking human motion using multiple cameras, Proceedings of the 13th International Conference on Pattern Recognition, Volume 3, pp. 68-72, 1996.

[19] Ankur Agarwal, Bill Triggs. 3D Human Pose from Silhouettes by Relevance Vector Regression, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), Volume 2, pp. 882-888, 2004.

[20] R. Cutler, M. Turk. View-based interpretation of real-time optical flow for gesture recognition, Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 416-421, 1998.

[21] B. van Ginneken, A.F. Frangi, J.J. Staal, B. ter Haar Romeny. Active shape model segmentation with optimal features. IEEE Transactions on Medical Imaging, 21:924-933, 2002.

[22] Rainer Lienhart, Alexander Kuranov, Vadim Pisarevsky. Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection, Pattern Recognition, Lecture Notes in Computer Science, Volume 2781/2003, pp. 297-304, 2003.

[23] Fengjun Lv, Ramakant Nevatia. Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost, Computer Vision -ECCV 2006, Lecture Notes in Computer Science, Volume 3954/2006, pp. 359-372, 2006.

[24] Gary R. Bradski, Vadim Pisarevsky, "Intel's Computer Vision Library: Applications in Calibration, Stereo, Segmentation, Tracking, Gesture, Face and Object Recognition," Proceedings of the 2000 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'00), Volume 2, p. 2796, 2000.