Towards Natural Communication in Networked Collaborative Virtual Environments

Igor Sunday Pandzic (1), Tolga K. Capin (2),
Nadia Magnenat Thalmann (1), Daniel Thalmann (2)

(1) MIRALAB - CUI
University of Geneva
24 rue du Général-Dufour
CH-1211 Geneva 4, Switzerland
{Igor.Pandzic,Nadia.Thalmann}@cui.unige.ch
http://miralabwww.unige.ch/

(2) Computer Graphics Laboratory
Swiss Federal Institute of Technology (EPFL)
CH-1015 Lausanne, Switzerland
{capin, thalmann}@lig.di.epfl.ch
http://ligwww.epfl.ch/

Abstract: Networked Collaborative Virtual Environments (NCVE) have been a topic of active research for some time now. However, most existing NCVE systems restrict the communication between participants to text messages or audio. The natural means of human communication are richer than this: facial expressions, lip movements, body postures and gestures all play an important role in our everyday communication. Part of our research effort in the field of Networked Collaborative Virtual Environments strives to incorporate such natural means of communication in a Virtual Environment. This effort is largely based on the use of realistically modeled and animated Virtual Humans. This paper discusses several ways to use virtual human bodies for facial and gestural communication within a Virtual Environment.

Keywords: Networked Collaborative Virtual Environments, Virtual Humans, Communication, Facial Communication, Gestural Communication, Multimedia.

Introduction

Networked Collaborative Virtual Environments (NCVE) are often described as systems that allow users to feel as if they were together in a shared Virtual Environment. Indeed, the feeling of "being together" is extremely important for collaboration. Probably the most important aspect of being together with someone, either in the real or a virtual world, is the ability to communicate. Most of us have felt the frustration of being with someone who does not speak the same language. In a Networked Collaborative Virtual Environment even worse situations can occur: for example, sharing the VE with someone and seeing his/her representation move within the virtual world, but having no possibility to talk or even write to each other.

Although Networked Collaborative Virtual Environments have been a topic of research for quite some time, in most of the existing systems the communication between participants is restricted to text messages and/or audio [Barrus96][Macedonia94][Singh95]. The DIVE system [Carlsson93] includes a means of gestural communication through a choice of predefined gestures. Natural human communication is richer than this: facial expressions, lip movements, body postures and gestures all play an important role in our everyday communication. Ideally, all these means of communication should be incorporated seamlessly in the Virtual Environment, preferably in a non-intrusive way. Ohya et al. [Ohya95] recognize this need and present a system where facial expressions are tracked using tape markers while the body and hands carry magnetic trackers, allowing both face and body to be synthesized. In this paper we discuss several different ways of integrating natural means of communication in an NCVE. Our approach is based on the use of realistically modeled and animated Virtual Humans. The described methods are integrated in our Virtual Life Network (VLNET) system. We first give a brief introduction to the VLNET system and to the Virtual Humans used within it. After that we discuss different means of communication within the NCVE: audio, facial and gestural. Our work does not particularly concentrate on audio communication, but we describe several ways to implement facial and gestural communication. Finally we give conclusions and ideas for future work.

Virtual Life Network

Virtual Life Network (VLNET) is a Networked Collaborative Virtual Environment system using highly realistic Virtual Humans for the participant representation. Figure 1 presents an overview of the system. The design is highly modular, with functionalities split into a number of processes. VLNET has an open architecture, with a set of interfaces allowing a user with some programming knowledge to access the system core and change or extend the system by plugging custom-made modules, called drivers, into the VLNET interfaces. These drivers only have to use a defined API to connect to VLNET. They can run on the local host, or on a remote host.

 The VLNET core consists of four logical units, called engines, each with a particular task and an interface to external applications (drivers).

The Object Behavior Engine takes care of predefined object behaviors, like rotation or falling, and has an interface that allows different behaviors to be programmed using external drivers.

The Navigation and Object Manipulation Engine takes care of the basic user input: navigation, picking and displacement of objects. It provides an interface for the navigation driver. If no navigation driver is activated, standard mouse navigation is provided internally. Navigation drivers exist for the SpaceBall and for the Flock of Birds/Cyberglove combination. New drivers can easily be programmed for any device.

The Body Representation Engine is responsible for the deformation of the body. For any given body posture (defined by a set of joint angles) this engine provides a deformed body ready to be rendered, and it exposes the interface for changing the body posture. A standard Body Posture Driver is provided that also connects to the navigation interface to obtain the navigation information, then uses the Walking Motor and the Arm Motor [Boulic90][Pandzic96] to generate natural body movement based on the navigation. Another possibility is to replace this Body Posture Driver by a simpler one that is directly coupled to a set of Flock of Birds sensors on the user's body, providing direct posture control.

The Facial Representation Engine handles the synthetic face, providing the possibility to change facial expressions and the facial texture. This is done through the Facial Expression Interface, which can be used to animate a set of parameters defining the facial expression.

 All the engines in the VLNET core process are coupled to the main shared memory and to the message queue. Cull and Draw processes access the main shared memory and perform the functions that their names suggest. These processes are standard SGI Performer [Rohlf94] processes.

The Communication Process receives messages from the network (more precisely from the VLNET server; VLNET is a client/server system) and puts them into the Message Queue. All the engines read from the queue and react to the messages that concern them (e.g. the Navigation Engine reacts to a Move message, but ignores a Facial Expression message, which is handled by the Facial Representation Engine). All the engines can also write into the outgoing Message Queue, and the Communication Process sends out all the messages. All messages in VLNET use a standard message packet. The packet has a standard header identifying the sender and the message type; the body content depends on the message type but always has the same size of 74 bytes.

 The data coming to any Engine through its external Interface is packed into a message packet and put into the Message Queue by the Engine. The Communication process sends out the packet, and the Communication Processes of other participants receive it and put it into the Message Queue. The appropriate Engine reads it from the Message Queue and processes it. In this way the data input from any Driver comes to the appropriate Engine at each participating site.
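To make this flow concrete, the following is a minimal sketch of a fixed-size packet and of how an engine might queue data received from its driver. The actual VLNET packet layout and queue API are not given in the paper, so the field names, types and the use of a plain std::queue are assumptions for illustration only.

```cpp
// Hypothetical sketch of a VLNET-style fixed-size message packet:
// a standard header (sender and message type) plus a 74-byte body.
#include <cstddef>
#include <cstring>
#include <queue>

enum MessageType { MSG_MOVE = 0, MSG_FACIAL_EXPRESSION, MSG_BODY_POSTURE };

struct VlnetPacket {
    int  senderId;   // standard header: identifies the sender
    int  type;       // standard header: which engine the message concerns
    char body[74];   // fixed-size body, interpreted according to 'type'
};

// An engine packs data received through its external interface and puts it
// into the outgoing queue; the Communication Process then sends it out.
void queueDriverData(std::queue<VlnetPacket>& outQueue, int senderId, int type,
                     const void* data, std::size_t size)
{
    VlnetPacket packet;
    packet.senderId = senderId;
    packet.type = type;
    std::memset(packet.body, 0, sizeof(packet.body));
    std::memcpy(packet.body, data,
                size < sizeof(packet.body) ? size : sizeof(packet.body));
    outQueue.push(packet);
}
```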

The Data Base Process takes care of the off-line loading of objects and user representations. It reacts to messages from the Message Queue requesting such operations.
 
 


Figure 1: Virtual Life Network system overview

Virtual Humans in VLNET

The Virtual Humans for participant representation are one of the main features of the VLNET system [Capin95][Thalmann95]. They are based on the HUMANOID software [Boulic95] and adapted for real-time usage.

The body representation uses a metaball structure attached to a skeleton model [Boulic95] in order to produce a deformable body. The skeleton is anatomically based: it consists of 74 degrees of freedom without the hands, with an additional 30 degrees of freedom for each hand, and is represented by a 3D articulated hierarchy of joints, each with realistic minimum and maximum limits. A metaball structure attached to the skeleton simulates the muscle structure, and the final triangle mesh representing the skin is calculated from the positions of the metaballs as the skeleton moves [Thalmann96]. To ensure that the skin deformations can be calculated in real time, most of the skin is precomputed in the neutral position of the skeleton, and only the parts susceptible to frequent and strong deformations (e.g. around the joints) are recalculated in each frame.
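The kind of data structure such a posture implies can be sketched as a table of joint angles, each clamped to its anatomical limits before the body is deformed. The joint record layout, the limits and the fixed array size below are illustrative assumptions, not the actual HUMANOID data.

```cpp
// Minimal sketch of a posture as a table of clamped joint angles.
#include <algorithm>

const int NUM_BODY_DOF = 74;   // degrees of freedom without the hands

struct Joint {
    float angle;      // current value in degrees
    float minLimit;   // realistic minimum
    float maxLimit;   // realistic maximum
};

struct BodyPosture {
    Joint joints[NUM_BODY_DOF];

    // Clamp a requested angle to the joint's anatomical range.
    void setJointAngle(int jointIndex, float degrees) {
        Joint& j = joints[jointIndex];
        j.angle = std::min(std::max(degrees, j.minLimit), j.maxLimit);
    }
};
```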
 
 

Figure 2: Virtual Humans in VLNET

The face is a polygon mesh model with defined regions and Free Form Deformations modeling the muscle actions [Kalra92]. It can be controlled on several levels. On the lowest level, an extensive set of Minimal Perceptible Actions (MPAs), closely related to muscle actions and similar to FACS Action Units, can be directly controlled. There are 65 MPAs, and together they can completely describe a facial expression. On a higher level, phonemes and/or facial expressions can be controlled spatially and temporally. On the highest level, complete animation scripts can be input, defining speech and emotion over time. Algorithms exist to map a texture onto such a facial model.
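The layered-control idea can be illustrated with a small sketch in which a facial expression is nothing more than a set of MPA intensities, and higher-level control blends such sets over time. The MPA names used here are hypothetical labels chosen for readability, not the actual set defined in [Kalra92].

```cpp
// Illustration of expressions as MPA intensity sets, with a temporal blend.
#include <map>
#include <string>

typedef std::map<std::string, float> Expression;   // MPA name -> intensity

// A higher-level "smile" expression composed of low-level MPAs (hypothetical names).
Expression makeSmile(float intensity) {
    Expression e;
    e["raise_mouth_corners"] = intensity;
    e["open_jaw"]            = 0.1f * intensity;
    e["squeeze_eyebrows"]    = 0.0f;
    return e;
}

// Linear blend between two expressions, e.g. for temporal interpolation
// between key expressions on the higher control level.
Expression blend(const Expression& a, const Expression& b, float t) {
    Expression out;
    for (Expression::const_iterator it = a.begin(); it != a.end(); ++it)
        out[it->first] += it->second * (1.0f - t);
    for (Expression::const_iterator it = b.begin(); it != b.end(); ++it)
        out[it->first] += it->second * t;
    return out;
}
```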

Figure 2 shows two Virtual Humans in a VLNET environment, seen from the perspective of a third participant whose hand is visible.

Communication in NCVE

Natural human communication is based on speech, facial expressions and gestures. Ideally, all these means of communication should also be supported within a Networked Collaborative Virtual Environment. This means that the user's speech, facial expressions and hand/body gestures should be captured, transmitted through the network and faithfully reproduced for the other participants at their sites. The capturing should be done in a non-intrusive way in order to preserve the user's comfort.

Obviously, the way to a complete system as described above is long and paved with problems. Capturing facial expressions or gestures non-intrusively and with enough precision is an extremely complicated task. The synthesis of realistic-looking human bodies and faces, and their animation in real time, is also very demanding. Communication protocols must ensure that the multi-modal data is transmitted to all the participants, and in the final synthesis the multi-modal outputs have to be synchronized.

We are trying to solve some of these problems within the Virtual Life Network system and to provide solutions leading towards the complete communication described above.

So far our work has not particularly concentrated on audio (speech) communication. We use a public-domain audio conferencing tool (VAT) to integrate this capability into the VLNET system. Therefore, audio communication is not discussed further in this paper.

The next two sections present several solutions for facial communication, as well as some solutions for gestural communication with the body.

Facial Communication

Facial expressions play an important role in human communication. They can express the speaker's emotions and subtly change the meaning of what was said. At the same time, lip movement is an important aid to the understanding of speech, especially if the audio conditions are not perfect or in the case of a hearing-impaired listener.

 We discuss four methods of integrating facial expressions in a Networked Collaborative Virtual Environment: video-texturing of the face, model-based coding of facial expressions, lip movement synthesis from speech and predefined expressions or animations.

Video-texturing of the face

In this approach the video sequence of the user's face is continuously texture mapped onto the face of the virtual human. The user must be in front of the camera, in such a position that the camera captures his head and shoulders. A simple and fast image analysis algorithm is used to find the bounding box of the user's face within the image. The algorithm requires that a head-and-shoulders view is provided and that the background is static (though not necessarily uniform). The algorithm therefore primarily consists of comparing each image with the original image of the background. Since the background is static, any change in the image is caused by the presence of the user, so it is fairly easy to detect his/her position. This allows the user reasonably free movement in front of the camera without the facial image being lost. The video capture and analysis is performed by a special Facial Expression Driver.
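A minimal sketch of this background-difference detection is shown below, assuming a stored reference image of the static background and a simple per-pixel difference threshold. The greyscale image layout and the threshold value are assumptions; the driver's actual implementation is not detailed in the paper.

```cpp
// Find the bounding box of the user by differencing against the background.
#include <cstdlib>

struct Box { int minX, minY, maxX, maxY; bool found; };

// 'frame' and 'background' are greyscale images of size width*height.
Box findUserBoundingBox(const unsigned char* frame, const unsigned char* background,
                        int width, int height, int threshold = 30)
{
    Box box = { width, height, -1, -1, false };
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int diff = std::abs(int(frame[y * width + x]) - int(background[y * width + x]));
            if (diff > threshold) {              // pixel changed => part of the user
                if (x < box.minX) box.minX = x;
                if (x > box.maxX) box.maxX = x;
                if (y < box.minY) box.minY = y;
                if (y > box.maxY) box.maxY = y;
                box.found = true;
            }
        }
    }
    return box;   // the face region can then be cropped from the top of the box
}
```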

Each facial image in the video sequence is compressed by the driver using the SGI Compression Library, and the compressed images are passed to the Facial Representation Engine of VLNET, then redirected to the Communication Process. Obviously, the color images of 120 x 80 pixels, even compressed, do not fit in the standard VLNET message packets used by the Communication Process. Therefore, special data channels are opened for this video communication.

On the receiving end, the images are received by the Communication Process, decompressed by the Data Base Process and texture-mapped onto the face of the virtual human representing the user. Currently we use a simple frontal projection for texture mapping, together with a simplified head model with attenuated features, which tolerates less precise texture mapping. If a head model with all the facial features is used, any misalignment between the topological features of the 3D model and the features in the texture produces quite unnatural artifacts. The only way to avoid this is to have the coordinates of characteristic feature points in the image, which can be used to calculate the texture coordinates in such a way that the features in the image are aligned with the topology. This is called texture fitting. However, our current texture fitting algorithm does not work in real time.

 Figure 3 illustrates the video texturing of the face, showing the original images of the user and the corresponding images of the Virtual Human representation.
 
 

Figure 3: Video texturing of the face

Model-based coding of facial expressions

Instead of transmitting whole facial images as in the previous approach, in this approach the images are analyzed and a set of parameters describing the facial expression is extracted [Pandzic94]. As in the previous approach, the user has to be in front of the camera, which digitizes head-and-shoulders video images. Accurate recognition and analysis of facial expressions from a video sequence requires detailed measurements of facial features, which are currently too computationally expensive to perform precisely in real time. As our primary concern has been to extract the features in real time, we have focused our attention on the recognition and analysis of only a few facial features.

The recognition method relies on a "soft mask", which is a set of points adjusted interactively by the user on the image of the face. Using the mask, various characteristic measures of the face are calculated at initialization time. Color samples of the skin, background, hair, etc. are also registered. Recognition of the facial features is primarily based on color sample identification and edge detection. Based on the characteristics of the human face, variations of these methods are used in order to find the optimal adaptation for the particular case of each facial feature. Special care is taken to make the recognition of one frame independent from the recognition of the previous one, in order to avoid the accumulation of error. The data extracted from the previous frame is used only for features that are relatively easy to track (e.g. the neck edges), keeping the risk of error accumulation low. A reliability test is performed and the data is reinitialized if necessary. This makes the recognition very robust. The set of extracted parameters includes:
* vertical head rotation (nod)
* horizontal head rotation (turn)
* head inclination (roll)
* aperture of the eyes
* horizontal position of the iris
* eyebrow elevation
* distance between the eyebrows (eyebrow squeeze)
* jaw rotation
* mouth aperture
* mouth stretch/squeeze

 The analysis is performed by a special Facial Expression Driver. The extracted parameters are easily translated into Minimal Perceptible Actions, which are passed to the Facial Representation Engine, then to the Communication process, where they are packed into a standard VLNET message packet and transmitted.
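The kind of mapping the driver performs can be sketched as follows: a raw measurement from the image (e.g. mouth aperture in pixels) is normalized against the neutral value registered at initialization with the soft mask and emitted as an MPA intensity. The normalization scheme and value range are assumptions for illustration, not the actual conversion used in [Pandzic94].

```cpp
// Hedged sketch: converting a tracked facial measurement into an MPA value.
struct TrackedFeature {
    float neutralValue;   // measured at initialization with the "soft mask"
    float maxDeviation;   // largest expected deviation from neutral
};

// Convert a raw image measurement to a normalized MPA value in [-1, 1].
float toMpaValue(const TrackedFeature& f, float measuredValue)
{
    float v = (measuredValue - f.neutralValue) / f.maxDeviation;
    if (v > 1.0f)  v = 1.0f;
    if (v < -1.0f) v = -1.0f;
    return v;
}
```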

 On the receiving end, the Facial Representation Engine receives messages containing facial expressions described by MPAs and performs the facial animation accordingly. Figure 4 illustrates this method with a sequence of original images of the user (with overlaid recognition indicators) and the corresponding images of the synthesized face.
 
 

Figure 4: Model-based coding of the face - original and synthetic face

This method can be used in combination with texture mapping. The model needs an initial image of the face together with a set of parameters describing the position of the facial features within the texture image in order to fit the texture to the face. Once this is done, the texture is fixed with respect to the face and does not change, but it is deformed together with the face, in contrast with the previous approach where the face was static and the texture was changing. Some texture-mapped faces with expressions are shown in figure 5.

Lip movement synthesis from speech

It might not always be practical for the user to be in front of the camera (e.g. if he does not have one, or if he wants to use an HMD). Nevertheless, facial communication does not have to be abandoned. Lavagetto [Lavagetto95] shows that it is possible to extract visual parameters of the lip movement by analyzing the audio signal of the speech. An application performing such recognition and generating MPAs to control the face can be hooked to VLNET as the Facial Expression Driver, and the Facial Representation Engine will synthesize the face with the appropriate lip movement. An extremely primitive version of such a system would simply open and close the mouth whenever there is speech, allowing the participants to know who is speaking. A more sophisticated system would be able to synthesize realistic lip movement, which is an important aid to speech understanding.
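The "extremely primitive" variant can be sketched as an energy gate on the audio signal: open the mouth whenever the short-term energy of the current audio frame exceeds a threshold. The threshold and frame handling are assumptions; a real system would estimate visemes from the speech signal as in [Lavagetto95].

```cpp
// Open/close the mouth based on short-term audio energy (illustrative only).
#include <cstddef>

// Returns a mouth-opening MPA value in [0, 1] for one frame of audio samples.
float mouthOpeningFromAudio(const short* samples, std::size_t count,
                            double energyThreshold = 1.0e6)
{
    double energy = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        energy += double(samples[i]) * double(samples[i]);
    energy /= (count > 0 ? double(count) : 1.0);
    return energy > energyThreshold ? 1.0f : 0.0f;   // binary open/closed mouth
}
```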

Predefined expressions or animations

In this approach the user simply chooses from a set of predefined facial expressions or movements (animations). The choice can be made from the keyboard through a set of "smileys" similar to the ones used in e-mail messages. The Facial Expression Driver in this case stores a set of predefined expressions and animations and feeds them to the Facial Representation Engine as the user selects them.
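Such a driver reduces to a simple lookup table from smileys to stored expressions, as in the sketch below. The smiley strings and expression names are illustrative assumptions, not the actual set used in VLNET.

```cpp
// Map a typed "smiley" to the name of a predefined expression.
#include <map>
#include <string>

std::string expressionForSmiley(const std::string& smiley)
{
    static std::map<std::string, std::string> table;
    if (table.empty()) {
        table[":-)"] = "smile";
        table[":-o"] = "surprise";
        table["|-)"] = "sleep";
        table[":-|"] = "boredom";
    }
    std::map<std::string, std::string>::const_iterator it = table.find(smiley);
    return it != table.end() ? it->second : "neutral";
}
```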

 Figure 5 shows some examples of predefined facial expressions.
 
 


Figure 5: Predefined facial expressions - surprise, sleep, boredom

Gestural Communication

Gestures play an important role in human communication, and many messages can be conveyed using the body. Body movements can be roughly divided into three groups:

* instantaneous gestures: most of the time, often even unconsciously, we accompany our speech with gestures. They stress the speech and emphasize particular words, and they very often have a meaning in themselves. The whole body posture also conveys information about the person's state and possibly emotions; for example, from the posture it can be determined whether the person is tired, tense or relaxed.

* gesture commands: these are gestures that the user makes to specify some action. For example, the sign 'come here' can be specified by raising the arm. These movements can vary from one person or culture to another, so there is no well-defined set of rules for their meanings.

* rule-based sign language: these are gestures, used for example by deaf people, that follow well-defined rules to specify words or sounds. The signs typically work as a metaphor for other objects or for language. Such gestures can also be used by the software to trigger special tasks (e.g. pointing in the forward direction to initiate a walk).

All these types of gestures can be controlled by two different methods: direct tracking and predefined postures or gestures. Each type of control may be better suited to a specific type of gesture, but a combination of them can be used for different tasks.
 


Figure 6: Direct gestural communication using magnetic trackers

Direct tracking

For immersive interaction, a complete representation of the participant's body should reproduce the movements of the real body. This is best achieved by using a large number of sensors to track every degree of freedom of the real body. Molet et al. [Molet96] discuss that a minimum of 14 sensors is required to manage a biomechanically correct posture, and Semwal et al. [Semwal96] present a closed-form algorithm to approximate the body using up to 10 sensors. However, this is generally not possible due to limitations in the number and technology of the sensing devices: it is either too expensive to have this many sensors, or it is too difficult for the participants to move with so many attached sensors. Therefore, the limited tracked information should be combined with behavioral human animation knowledge and different motion generators in order to "interpolate" the joints of the body which are not tracked.

The main approaches to this problem are inverse kinematics with constraints, closed-form solutions, and motor functions. The inverse kinematics approach is based on an iterative algorithm in which an end-effector coordinate frame (for example the hand) tries to reach a goal coordinate frame (the reach position) using the set of joints that control the end effector. The advantage of this approach is that any number of sensors can be attached to any body part, and multiple constraints can be combined by assigning weights. However, this might slow down the simulation significantly, as it requires considerable computation. The closed-form solution avoids this problem by using 10 sensors attached to the body and solving for the joint angles analytically. The human skeleton is divided into smaller chains, and each joint angle is computed within the chain it belongs to. For example, the elbow joint angle is computed from the sensors attached to the upper arm and lower arm, by computing the angle between the two sensor coordinate frames.
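The geometric idea behind the elbow example can be sketched as follows: estimate the flexion angle from the bone directions given by the upper-arm and lower-arm sensor frames, expressed in a common frame. This only illustrates the angle computation mentioned above, not the full algorithm of [Semwal96].

```cpp
// Elbow flexion as the angle between upper-arm and lower-arm directions.
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static float length(const Vec3& a) { return std::sqrt(dot(a, a)); }

// 'upperArmAxis' and 'lowerArmAxis' are bone directions taken from the two
// sensor coordinate frames, expressed in a common (world) frame.
float elbowFlexionDegrees(const Vec3& upperArmAxis, const Vec3& lowerArmAxis)
{
    float c = dot(upperArmAxis, lowerArmAxis) / (length(upperArmAxis) * length(lowerArmAxis));
    if (c > 1.0f)  c = 1.0f;      // guard against numerical drift
    if (c < -1.0f) c = -1.0f;
    return std::acos(c) * 180.0f / 3.14159265f;
}
```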

The raw data coming from the trackers has to be filtered and processed to obtain a usable structure. The software developed at the Swiss Federal Institute of Technology [Molet96] converts the raw tracker data into joint angle data for all the 75 joints of the standard HUMANOID skeleton used within VLNET [Capin95], with an additional 25 joints for each hand. As shown in Figure 1, this software is viewed as a Body Posture Driver by the VLNET system, and VLNET communicates with it through the Body Posture Interface. The VLNET Body Representation Engine obtains this joint table from the Body Posture Interface and uses it to produce deformed bodies ready for rendering. The posture data in the form of joint angles fits into a VLNET message packet by reducing each angle to 8 bits, with a maximal error of 1.4 degrees out of 360 degrees. This precision is sufficient to provide body postures that are visually similar to the real ones. By coupling the Flock of Birds driver with VLNET we obtain full gestural communication in a very direct, though intrusive, way. Figure 6 illustrates this approach by showing the user wearing the trackers and his Virtual Human representation.
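The 8-bit packing can be sketched as a uniform quantization of an angle in [0, 360) degrees to one byte, giving a step of 360/256 ≈ 1.4 degrees. The exact encoding used by VLNET is not given in the paper; this is simply the obvious uniform scheme consistent with the stated error bound.

```cpp
// Uniform 8-bit quantization of a joint angle (illustrative sketch).
typedef unsigned char QuantizedAngle;

QuantizedAngle packAngle(float degrees)
{
    // Wrap into [0, 360) before scaling to the 0..255 range.
    while (degrees < 0.0f)    degrees += 360.0f;
    while (degrees >= 360.0f) degrees -= 360.0f;
    return QuantizedAngle(degrees * 256.0f / 360.0f);
}

float unpackAngle(QuantizedAngle q)
{
    return float(q) * 360.0f / 256.0f;
}
```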

Predefined postures or body gestures

In a similar fashion as for the facial expressions, body postures or gestures can also be predefined and chosen through a metaphor. For example, the smileys normally used within e-mails can be used to set, from the keyboard, a subset of the joints of the current body. The Body Posture Driver in this case simply stores the predefined postures and gestures (i.e. animated postures) and feeds them to the Body Representation Engine as the user selects them. Figure 7 shows some examples of predefined postures.

The main difference between direct control and predefined postures/gestures is that direct control provides closer correspondence to the real posture of the participant and is therefore expected to provide a more immersive feeling. However, predefined postures can increase communication among participants in networked environments when not enough trackers are available. These two types of control can be combined to animate the participant's body for different types of gestures. There is a need to investigate and define a set of tools that provide sufficient proprioceptive information for instantaneous gestures, while providing easy and natural control for rule-based signs and gesture commands.
 
 

Figure 7: Some examples of predefined postures: tiredness, surprise, tension

Conclusions & Future Work

This paper has presented ongoing work in the field of natural communication in Networked Collaborative Virtual Environments. Obviously a lot of work remains to be done before we can claim to have truly natural communication, but some of the results are very encouraging, especially in the field of facial communication. We have also presented a structure, the Virtual Life Network, that permits the integration of various methods within a single system.

We intend to work further on the facial communication algorithms, improving the texture mapping with feature alignment for the video-texturing approach. The presented approaches to facial communication can also be combined, in particular the model-based approach and the speech analysis approach, combining visual and audio analysis in order to obtain better results. Gesture recognition algorithms might be used to achieve gestural communication in a non-intrusive manner. Further integration and synchronization of facial and gestural communication is needed.

References

[Barrus96] Barrus J. W., Waters R. C., Anderson D. B., "Locales and Beacons: Efficient and Precise Support For Large Multi-User Virtual Environments", Proceedings of IEEE VRAIS, 1996.

[Boulic90] Boulic R., Magnenat-Thalmann N., Thalmann D., "A Global Human Walking Model with Real Time Kinematic Personification", The Visual Computer, Vol. 6(6), 1990.

[Boulic95] Boulic R., Capin T., Huang Z., Kalra P., Lintermann B., Magnenat-Thalmann N., Moccozet L., Molet T., Pandzic I., Saar K., Schmitt A., Shen J., Thalmann D., "The Humanoid Environment for Interactive Animation of Multiple Deformable Human Characters", Proceedings of Eurographics '95, 1995.

[Capin95] Capin T.K., Pandzic I.S., Magnenat-Thalmann N., Thalmann D., "Virtual Humans for Representing Participants in Immersive Virtual Environments", Proceedings of FIVE '95, London, 1995.

 [Carlsson93] Carlsson C., Hagsand O., "DIVE - a Multi-User Virtual Reality System", Proceedings of IEEE VRAIS '93, Seattle, Washington, 1993.

 [Kalra92] Kalra P., Mangili A., Magnenat Thalmann N., Thalmann D., "Simulation of Facial Muscle Actions Based on Rational Free Form Deformations", Proc. Eurographics '92, pp.59-69., 1992.

 [Lavagetto95] F.Lavagetto, "Converting Speech into Lip Movements: A Multimedia Telephone for Hard of Hearing People", IEEE Trans. on Rehabilitation Engineering, Vol.3, N.1, pp.90-102, 1995.

[Macedonia94] Macedonia M.R., Zyda M.J., Pratt D.R., Barham P.T., Zeswitz S., "NPSNET: A Network Software Architecture for Large-Scale Virtual Environments", Presence: Teleoperators and Virtual Environments, Vol. 3, No. 4, 1994.

 [Molet96] Molet T., Boulic R., Thalmann D., "A Real Time Anatomical Converter for Human Motion Capture", Proc. of Eurographics Workshop on Computer Animation and Simulation, 1996.

 [Ohya95] Ohya J., Kitamura Y., Kishino F., Terashima N., "Virtual Space Teleconferencing: Real-Time Reproduction of 3D Human Images", Journal of Visual Communication and Image Representation, Vol. 6, No. 1, pp. 1-25, 1995.

[Pandzic94] Pandzic I.S., Kalra P., Magnenat-Thalmann N., Thalmann D., "Real-Time Facial Interaction", Displays, Vol. 15, No. 3, 1994.

 [Pandzic96] I.S. Pandzic, T.K. Capin, N. Magnenat Thalmann, D. Thalmann, "Motor functions in the VLNET Body-Centered Networked Virtual Environment", Proc. of 3rd Eurographics Workshop on Virtual Environments, Monte Carlo, 1996.

 [Rohlf94] Rohlf J., Helman J., "IRIS Performer: A High Performance Multiprocessing Toolkit for Real-Time 3D Graphics", Proc. SIGGRAPH'94, 1994.

 [Semwal96] S. K. Semwal, R. Hightower, S. Stansfield, "Closed Form and Geometric Algorithms for Real-Time Control of an Avatar", Proc. VRAIS 96. pp. 177-184.

 [Singh95] Singh G., Serra L., Png W., Wong A., Ng H., "BrickNet: Sharing Object Behaviors on the Net", Proceedings of IEEE VRAIS '95, 1995.

 [Thalmann95] D. Thalmann, T. K. Capin, N. Magnenat Thalmann, I. S. Pandzic, "Participant, User-Guided, Autonomous Actors in the Virtual Life Network VLNET", Proc. ICAT/VRST '95, pp. 3-11.

 [Thalmann96] D. Thalmann, J. Shen, E. Chauvineau, "Fast Realistic Human Body Deformations for Animation and VR Applications", Proc. Computer Graphics International '96, Pohang, Korea,1996.