Nadia Magnenat Thalmann1, Daniel Thalmann2
1 MIRALAB - CUI
University of Geneva
24 rue du Général-Dufour
CH1211 Geneva 4, Switzerland
{Igor.Pandzic,Nadia.Thalmann}@cui.unige.ch
http://miralabwww.unige.ch/
2 Computer Graphics Laboratory
Swiss Federal Institute of Technology (EPFL)
CH1015 Lausanne, Switzerland
{capin, thalmann}@lig.di.epfl.ch
http://ligwww.epfl.ch/
Keywords: Networked Collaborative Virtual Environments, Virtual Humans, Communication, Facial Communication, Gestural Communication, Multimedia.
Although Networked Collaborative Virtual Environments have been a topic of research for quite some time, in most existing systems the communication between participants is restricted to text messages and/or audio communication [Barrus96][Macedonia94][Singh95]. The DIVE system [Carlsson93] includes a means of gestural communication through a choice of predefined gestures. Natural human communication is richer than this: facial expressions, lip movement, body postures and gestures all play an important role in our everyday communication. Ideally, all these means of communication should be incorporated seamlessly into the Virtual Environment, preferably in a non-intrusive way. Ohya et al. [Ohya95] recognize this need and present a system where facial expressions are tracked using tape markers while the body and hands carry magnetic trackers, allowing both face and body to be synthesized. In this paper we discuss several ways of integrating natural means of communication into a Networked Collaborative Virtual Environment (NCVE). Our approach is based on realistically modeled and animated Virtual Humans. The described methods are integrated in our Virtual Life Network (VLNET) system. We first give a brief introduction to the VLNET system and to the Virtual Humans used within it. We then discuss different means of communication within the NCVE: audio, facial and gestural. Our work does not concentrate particularly on audio communication, but we describe several ways to implement facial and gestural communication. Finally, we give conclusions and ideas for future work.
The VLNET core consists of four logical units, called engines, each with a particular task and an interface to external applications (drivers).
The Object Behavior Engine takes care of the predefined object behaviors, like rotation or falling, and has an interface allowing different behaviors to be programmed through external drivers.
The Navigation and Object Manipulation Engine takes care of the basic user input: navigation, picking and displacement of objects. It provides an interface for the navigation driver. If no navigation driver is activated, standard mouse navigation is provided internally. Navigation drivers exist for the SpaceBall and for the FOB/Cyberglove combination, and new drivers can easily be programmed for any device.
The Body Representation Engine is responsible for the deformation of the body. For any given body posture (defined by a set of joint angles) this Engine provides a deformed body ready to be rendered, and it exposes the interface for changing the body posture. A standard Body Posture Driver is provided; it connects to the navigation interface to obtain the navigation information, then uses the Walking Motor and the Arm Motor [Boulic90][Pandzic96] to generate natural body movement based on the navigation. Alternatively, this Body Posture Driver can be replaced by a simpler one coupled directly to a set of Flock of Birds sensors on the user's body, providing direct posture control.
The Facial Representation Engine allows the expression and the texture of the synthetic faces to be changed. This is done through the Facial Expression Interface, which can be used to animate a set of parameters defining the facial expression.
All the engines in the VLNET core process are coupled to the main shared memory and to the message queue. Cull and Draw processes access the main shared memory and perform the functions that their names suggest. These processes are standard SGI Performer [Rohlf94] processes.
The Communication Process receives messages from the network (more precisely, from the VLNET server; VLNET is a client/server system) and puts them into the Message Queue. All the engines read from the queue and react to the messages that concern them (e.g. the Navigation Engine reacts to a Move message but ignores a Facial Expression message, which is handled by the Facial Representation Engine). All the Engines can write into the outgoing Message Queue, and the Communication Process sends out all the messages. All messages in VLNET use a standard message packet. The packet has a standard header determining the sender and the message type; the message body content depends on the message type but is always of the same size: 74 bytes.
The data coming to any Engine through its external Interface is packed into a message packet and put into the Message Queue by the Engine. The Communication process sends out the packet, and the Communication Processes of other participants receive it and put it into the Message Queue. The appropriate Engine reads it from the Message Queue and processes it. In this way the data input from any Driver comes to the appropriate Engine at each participating site.
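As an illustration, the fixed-size message packet described above can be sketched as follows. The 74-byte body size is stated in the text; the header field widths (sender id, message type) are assumptions made for this sketch, since the exact layout is not specified.

```python
import struct

BODY_SIZE = 74  # fixed message body size stated in the text

# Hypothetical header layout: the text mentions sender and message type,
# but their widths here (16-bit sender id, 8-bit type) are assumptions.
HEADER_FMT = "!HB"
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def pack_message(sender_id: int, msg_type: int, body: bytes) -> bytes:
    """Build a VLNET-style packet: fixed header + fixed 74-byte body."""
    if len(body) > BODY_SIZE:
        raise ValueError("body exceeds the fixed packet body size")
    body = body.ljust(BODY_SIZE, b"\x00")  # pad body to the fixed size
    return struct.pack(HEADER_FMT, sender_id, msg_type) + body

def unpack_message(packet: bytes):
    """Split a packet back into sender id, message type and body."""
    sender_id, msg_type = struct.unpack_from(HEADER_FMT, packet)
    return sender_id, msg_type, packet[HEADER_SIZE:]

pkt = pack_message(1, 7, b"joint angles ...")
sid, mtype, body = unpack_message(pkt)
```

A fixed packet size keeps the receiving engines simple: every engine can read whole packets from the Message Queue without per-type length negotiation.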
The Data Base Process takes care of the off-line loading of objects and user representations. It reacts to messages from the Message Queue demanding such operations.
Figure 1: Virtual Life Network system overview
The body representation uses a Metaball structure attached to a skeleton model [Boulic95] in order to produce a deformable body. The skeleton model is anatomically modeled. It consists of 74 degrees of freedom without the hands, with an additional 30 degrees of freedom for each hand. The skeleton is represented by a 3D articulated hierarchy of joints, each with realistic maximum and minimum limits. A Metaball structure is attached to the skeleton to simulate the muscle structure, and the final triangle mesh representing the skin is calculated based on the positions of the metaballs when the skeleton moves [Thalmann96]. To ensure the calculation of the skin deformations in real time, most of the skin is precomputed in the neutral position of the skeleton, and only the parts susceptible to frequent and strong deformations (e.g. around the joints) are recalculated in each frame.
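The real-time optimization described above, precomputing the neutral skin and recalculating only the regions near joints each frame, can be sketched as follows; the mesh, the mask and the deformation function are purely illustrative.

```python
import numpy as np

def deform_skin(rest_vertices, near_joint_mask, deform_fn):
    """Per-frame skin update: start from the precomputed neutral skin
    and recompute only the vertices flagged as lying near deformable
    joints, leaving the rest of the mesh untouched."""
    skin = rest_vertices.copy()               # precomputed neutral skin
    idx = np.flatnonzero(near_joint_mask)     # vertices needing update
    skin[idx] = deform_fn(rest_vertices[idx]) # deform only those
    return skin

# Toy usage: displace the 'near-joint' vertices of a 6-vertex mesh.
rest = np.zeros((6, 3))
mask = np.array([0, 0, 1, 1, 0, 0], dtype=bool)
out = deform_skin(rest, mask, lambda v: v + np.array([0.0, 1.0, 0.0]))
```

The saving comes from the mask: the expensive metaball evaluation runs only on the small subset of vertices near joints rather than on the whole skin mesh.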
Figure 2: Virtual Humans in VLNET
The face is a polygon mesh model with defined regions and Free Form Deformations modeling the muscle actions [Kalra92]. It can be controlled on several levels. On the lowest level, an extensive set of Minimal Perceptible Actions (MPAs), closely related to muscle actions and similar to FACS Action Units, can be directly controlled. There are 65 MPAs, and together they can completely describe the facial expression. On a higher level, phonemes and/or facial expressions can be controlled spatially and temporally. On the highest level, complete animation scripts can be input, defining speech and emotion over time. Algorithms exist to map a texture onto such a facial model.
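The lowest control level can be pictured as a vector of 65 MPA intensities; a minimal sketch, in which the specific MPA indices and the linear blending scheme are illustrative assumptions not taken from the text:

```python
import numpy as np

N_MPAS = 65  # number of Minimal Perceptible Actions stated in the text

# Hypothetical MPA indices -- the real MPA numbering is not given here.
RAISE_EYEBROW, OPEN_JAW = 3, 10

def expression(active: dict) -> np.ndarray:
    """Build a facial expression as a vector of MPA intensities in [0, 1]."""
    v = np.zeros(N_MPAS)
    for mpa, intensity in active.items():
        v[mpa] = np.clip(intensity, 0.0, 1.0)
    return v

def blend(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Temporal interpolation between two expressions (0 <= t <= 1),
    a simple way to realise higher-level control over time."""
    return (1.0 - t) * a + t * b

surprise = expression({RAISE_EYEBROW: 1.0, OPEN_JAW: 0.6})
neutral = expression({})
half = blend(neutral, surprise, 0.5)  # halfway towards surprise
```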
Figure 2 shows two Virtual Humans in a VLNET environment, from the perspective of the third participant whose hand is visible.
Obviously, the way to a complete system as described above is long and paved with problems. Capturing facial expressions or gestures non-intrusively and with enough precision is an extremely complicated task. The synthesis of realistic-looking human bodies and faces, and their animation in real time, is also very demanding. Communication protocols must ensure that the multi-modal data is transmitted to all the participants, and in the final synthesis the multi-modal outputs have to be synchronized.
We are trying to solve some of these problems within the Virtual Life Network system and to provide solutions leading to the complete communication described above.
So far, our work has not concentrated particularly on audio (speech) communication. We use the public-domain audio conferencing tool VAT to integrate this capability into the VLNET system. Therefore, audio communication is not discussed in this paper.
The next two sections present several solutions for facial communication, as well as some solutions for gestural communication with the body.
We discuss four methods of integrating facial expressions in a Networked Collaborative Virtual Environment: video-texturing of the face, model-based coding of facial expressions, lip movement synthesis from speech and predefined expressions or animations.
Each facial image in the video sequence is compressed by the Driver using the SGI Compression Library, and the compressed images are passed to the Facial Representation Engine of VLNET, then redirected to the Communication Process. Obviously, the color images of 120 x 80 pixels, even compressed, do not fit in the standard VLNET message packets used by the Communication Process. Therefore, special data channels are opened for this video communication.
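A rough sketch of why a dedicated channel is needed: even a compressed 120 x 80 RGB frame is far larger than the 74-byte message body. Here zlib merely stands in for the SGI Compression Library; it is not the codec actually used in the system.

```python
import zlib
import numpy as np

# Illustrative 120 x 80 RGB frame (random pixels stand in for video).
frame = np.random.default_rng(0).integers(0, 256, (80, 120, 3),
                                          dtype=np.uint8)

# zlib is only a stand-in for the SGI Compression Library here.
compressed = zlib.compress(frame.tobytes())

# Even compressed, a frame does not come close to fitting the
# 74-byte message body, hence the dedicated video channels.
fits_standard_packet = len(compressed) <= 74
```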
On the receiving end, the images are received by the Communication Process, decompressed by the Data Base Process and texture-mapped onto the face of the virtual human representing the user. Currently we use a simple frontal projection for texture mapping, together with a simplified head model with attenuated features, which tolerates less precise texture mapping. If a head model with all the facial features is used, any misalignment between the topological features of the 3D model and the features in the texture produces quite unnatural artifacts. The only way to avoid this is to obtain the coordinates of characteristic feature points in the image, which can then be used to calculate the texture coordinates in such a way that the features in the image are aligned with the topology. This is called texture fitting. However, our current texture fitting algorithm does not work in real time.
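The simple frontal projection mentioned above can be sketched as a normalization of the head vertices' x and y coordinates into texture space; the vertex data here is purely illustrative.

```python
import numpy as np

def frontal_projection_uv(vertices):
    """Frontal-projection texture coordinates: project the head vertices
    onto the XY plane and normalise into [0, 1] x [0, 1]. Feature-aligned
    texture fitting would instead warp these coordinates so that image
    features coincide with the mesh topology."""
    v = np.asarray(vertices, dtype=float)
    lo = v[:, :2].min(axis=0)   # bounding box of the frontal projection
    hi = v[:, :2].max(axis=0)
    return (v[:, :2] - lo) / (hi - lo)

# Illustrative head vertices (x, y, z); z is ignored by the projection.
head = [(-1.0, -1.0, 0.2), (1.0, 1.0, 0.1), (0.0, 0.0, 0.9)]
uv = frontal_projection_uv(head)
```

Because depth is discarded, features that protrude (nose, chin) share texture space with the regions behind them, which is why the simplified head model with attenuated features is more forgiving.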
Figure 3 illustrates the video texturing of the face, showing the original images of the user and the corresponding images of the Virtual Human representation.
Figure 3: Video texturing of the face
The recognition method relies on a "soft mask": a set of points adjusted interactively by the user on the image of the face. Using the mask, various characteristic measures of the face are calculated at initialization time. Color samples of the skin, background, hair etc. are also registered. Recognition of the facial features is primarily based on color sample identification and edge detection. Based on the characteristics of the human face, variations of these methods are used in order to find the optimal adaptation for each particular facial feature. Special care is taken to make the recognition of one frame independent of the recognition of the previous one, in order to avoid the accumulation of error. The data extracted from the previous frame is used only for features that are relatively easy to track (e.g. the neck edges), keeping the risk of error accumulation low. A reliability test is performed and the data is reinitialized if necessary. This makes the recognition very robust.
The set of extracted parameters includes:
* vertical head rotation (nod)
* horizontal head rotation (turn)
* head inclination (roll)
* aperture of the eyes
* horizontal position of the iris
* eyebrow elevation
* distance between the eyebrows (eyebrow squeeze)
* jaw rotation
* mouth aperture
* mouth stretch/squeeze
The analysis is performed by a special Facial Expression Driver. The extracted parameters are easily translated into Minimal Perceptible Actions, which are passed to the Facial Representation Engine, then to the Communication process, where they are packed into a standard VLNET message packet and transmitted.
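The translation from extracted parameters to MPAs can be sketched as a direct mapping; the MPA names and the exact correspondences below are illustrative assumptions, as the text only states that the translation is easy.

```python
def params_to_mpas(p: dict) -> dict:
    """Translate tracked facial parameters (normalised values) into MPA
    intensities. The mapping and the MPA names are illustrative."""
    return {
        "raise_eyebrow":   p.get("eyebrow_elevation", 0.0),
        "squeeze_eyebrow": p.get("eyebrow_squeeze", 0.0),
        "open_jaw":        p.get("jaw_rotation", 0.0),
        "open_mouth":      p.get("mouth_aperture", 0.0),
        "stretch_mouth":   p.get("mouth_stretch", 0.0),
        # A fully open eye (aperture 1.0) means no eye-closing action.
        "close_eyes":      1.0 - p.get("eye_aperture", 1.0),
        "turn_head":       p.get("head_turn", 0.0),
        "nod_head":        p.get("head_nod", 0.0),
        "roll_head":       p.get("head_roll", 0.0),
    }

mpas = params_to_mpas({"eyebrow_elevation": 0.8, "eye_aperture": 0.25})
```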
On the receiving end, the Facial Representation Engine receives messages containing facial expressions described by MPAs and performs the facial animation accordingly. Figure 4 illustrates this method with a sequence of original images of the user (with overlaid recognition indicators) and the corresponding images of the synthesized face.
Figure 4: Model-based coding of the face - original and synthetic face
This method can be used in combination with texture mapping. The model needs an initial image of the face together with a set of parameters describing the position of the facial features within the texture image in order to fit the texture to the face. Once this is done, the texture is fixed with respect to the face and does not change, but it is deformed together with the face, in contrast with the previous approach where the face was static and the texture was changing. Some texture-mapped faces with expressions are shown in figure 5.
Figure 5 shows some examples of predefined facial expressions.
Figure 5: Predefined facial expressions - surprise, sleep, boredom
* instantaneous gestures: Most of the time, often even unconsciously, we accompany our speech with gestures. They stress the speech and emphasize particular words, and they very often carry meaning in themselves. The whole body posture also conveys information about the person's state and possibly emotions; for example, from the posture it can be determined whether the person is tired, tense or relaxed.
* gesture commands: these are gestures that the user makes to specify some action. For example, the sign 'come here' can be specified by raising the arm. These movements can differ from one person or culture to another; therefore there is no well-defined set of rules for their meanings.
* rule-based sign language: these are gestures, such as those used by deaf people, that follow well-defined rules to specify words or sounds. The signs typically work as a metaphor for defining other objects or language. Such gestures can also be used by the software to trigger special tasks (e.g. showing the forward direction to initiate a walk).
All these types of gestures can be controlled by two different methods: direct tracking and predefined postures or gestures. One type of control may be better suited to a specific type of gesture, but a combination of the two can be used for different tasks.
Figure 6: Direct gestural communication using magnetic trackers
The main approaches to this problem are inverse kinematics using constraints, closed-form solutions, and motor functions. The inverse kinematics approach is based on an iterative algorithm, where an end-effector coordinate frame (for example the hand) tries to reach a goal (the reach position) coordinate frame using the set of joints which control the end effector. The advantage of this approach is that any number of sensors can be attached to any body part, and multiple constraints can be combined by assigning weights. However, this might slow down the simulation significantly, as it requires extensive computation. The closed-form solution [Semwal96] avoids this problem by using 10 sensors attached to the body and solving for the joint angles analytically. The human skeleton is divided into smaller chains, and each joint angle is computed within the chain it belongs to. For example, the joint angle for the elbow is computed using the sensors attached to the upper arm and the lower arm, by computing the angle between the sensor coordinate frames.
The raw data coming from the trackers has to be filtered and processed to obtain a usable structure. The software developed at the Swiss Federal Institute of Technology [Molet96] converts the raw tracker data into joint angle data for all 75 joints in the standard HUMANOID skeleton used within VLNET [Capin95], with an additional 25 joints for each hand. As shown in Figure 1, this software is viewed as a Body Posture Driver by the VLNET system, and VLNET communicates with it through the Body Posture Interface. The VLNET Body Representation Engine obtains the joint table from the Body Posture Interface and uses it to produce deformed bodies ready for rendering. The posture data in the form of joint angles fits into a VLNET message packet by reducing each angle to 8 bits, with a maximal error of 1.4 degrees in 360 degrees. This precision is sufficient to produce body postures visually similar to the real body. By coupling the Flock of Birds driver with VLNET we can obtain full gestural communication in a very direct, though intrusive, way. Figure 6 illustrates this approach by showing the user wearing the trackers and his Virtual Human representation.
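The 8-bit angle reduction can be sketched as follows; with 256 steps over 360 degrees, the worst-case reconstruction error is 360/256 = 1.40625 degrees, consistent with the 1.4-degree figure above.

```python
import numpy as np

def quantize(angles_deg):
    """Reduce each joint angle to 8 bits: 256 uniform steps over
    360 degrees, i.e. a step (and worst-case error) of 1.40625 deg."""
    a = np.mod(angles_deg, 360.0)
    return np.floor(a / 360.0 * 256.0).astype(np.uint8)

def dequantize(codes):
    """Reconstruct approximate angles from the 8-bit codes."""
    return codes.astype(float) * 360.0 / 256.0

angles = np.array([0.0, 90.0, 123.4, 359.9])
err = np.abs(dequantize(quantize(angles)) - angles)
```

One byte per angle is what lets a full set of joint angles fit into the fixed-size VLNET message packet.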
The main difference between direct control and predefined postures/gestures is that direct control corresponds more closely to the real posture of the participant, and is therefore expected to provide a more immersive feeling. However, predefined postures can enhance communication among participants in networked environments when not enough trackers are available. These two types of control can be combined to animate the participant's body for different types of gestures. There is a need to investigate and define a set of tools that provide sufficient proprioceptive information for instantaneous gestures, while providing easy and natural control for rule-based signs and gesture commands.
Figure 7: Some examples of predefined postures: tiredness, surprise, tension
We intend to work further on the facial communication algorithms, improving the texture mapping with feature alignment for the video-texturing approach. The presented approaches for facial communication can also be combined, especially the model-based approach and the speech analysis approach, combining visual and audio analysis in order to obtain better results. Gesture recognition algorithms might be used to achieve gestural communication in a non-intrusive manner. Further integration and synchronization of facial and gestural communication is needed.
[Boulic90] Boulic R., Magnenat-Thalmann N., Thalmann D., "A Global Human Walking Model with Real Time Kinematic Personification", The Visual Computer, Vol. 6(6), 1990.
[Boulic95] Boulic R., Capin T., Huang Z., Kalra P., Lintermann B., Magnenat-Thalmann N., Moccozet L., Molet T., Pandzic I., Saar K., Schmitt A., Shen J., Thalmann D., "The Humanoid Environment for Interactive Animation of Multiple Deformable Human Characters", Proceedings of Eurographics '95, 1995.
[Capin95] Capin T.K., Pandzic I.S., Magnenat-Thalmann N., Thalmann D., "Virtual Humans for Representing Participants in Immersive Virtual Environments", Proceedings of FIVE '95, London, 1995.
[Carlsson93] Carlsson C., Hagsand O., "DIVE - a Multi-User Virtual Reality System", Proceedings of IEEE VRAIS '93, Seattle, Washington, 1993.
[Kalra92] Kalra P., Mangili A., Magnenat Thalmann N., Thalmann D., "Simulation of Facial Muscle Actions Based on Rational Free Form Deformations", Proc. Eurographics '92, pp.59-69., 1992.
[Lavagetto95] F.Lavagetto, "Converting Speech into Lip Movements: A Multimedia Telephone for Hard of Hearing People", IEEE Trans. on Rehabilitation Engineering, Vol.3, N.1, pp.90-102, 1995.
[Macedonia94] Macedonia M.R., Zyda M.J., Pratt D.R., Barham P.T., Zestwitz, "NPSNET: A Network Software Architecture for Large-Scale Virtual Environments", Presence: Teleoperators and Virtual Environments, Vol. 3, No. 4, 1994.
[Molet96] Molet T., Boulic R., Thalmann D., "A Real Time Anatomical Converter for Human Motion Capture", Proc. of Eurographics Workshop on Computer Animation and Simulation, 1996.
[Ohya95] Ohya J., Kitamura Y., Kishino F., Terashima N., "Virtual Space Teleconferencing: Real-Time Reproduction of 3D Human Images", Journal of Visual Communication and Image Representation, Vol. 6, No. 1, pp. 1-25, 1995.
[Pandzic94] Pandzic I.S., Kalra P., Magnenat-Thalmann N., Thalmann D., "Real-Time Facial Interaction", Displays, Vol. 15, No. 3, 1994.
[Pandzic96] I.S. Pandzic, T.K. Capin, N. Magnenat Thalmann, D. Thalmann, "Motor functions in the VLNET Body-Centered Networked Virtual Environment", Proc. of 3rd Eurographics Workshop on Virtual Environments, Monte Carlo, 1996.
[Rohlf94] Rohlf J., Helman J., "IRIS Performer: A High Performance Multiprocessing Toolkit for Real-Time 3D Graphics", Proc. SIGGRAPH'94, 1994.
[Semwal96] S. K. Semwal, R. Hightower, S. Stansfield, "Closed Form and Geometric Algorithms for Real-Time Control of an Avatar", Proc. VRAIS 96. pp. 177-184.
[Singh95] Singh G., Serra L., Png W., Wong A., Ng H., "BrickNet: Sharing Object Behaviors on the Net", Proceedings of IEEE VRAIS '95, 1995.
[Thalmann95] D. Thalmann, T. K. Capin, N. Magnenat Thalmann, I. S. Pandzic, "Participant, User-Guided, Autonomous Actors in the Virtual Life Network VLNET", Proc. ICAT/VRST '95, pp. 3-11.
[Thalmann96] D. Thalmann, J. Shen, E. Chauvineau, "Fast Realistic Human Body Deformations for Animation and VR Applications", Proc. Computer Graphics International '96, Pohang, Korea, 1996.