This paper presents a model for estimating visual focus of attention (VFOA) and conversational events in meetings from audio-visual perceptual cues. Rather than recognizing the VFOA of each participant independently from that participant's own head pose, we propose to recognize the participants' VFOA jointly, which lets us introduce context-dependent interaction models that relate to group activity and the social dynamics of communication. To this end, we designed a dynamic Bayesian network (DBN) whose hidden states are the joint VFOA of all participants and the meeting conversational events. The observations used to infer the hidden states are the participants' head poses and speaking status. Interaction models are introduced in the DBN through the conversational events and the projection screen activity, contextual cues that affect the temporal evolution of the joint VFOA sequence. This allows us to model group dynamics that account for people's tendency to share the same focus or to have their VFOA driven by these contextual cues. The model is rigorously evaluated on a publicly available dataset of 4 real meetings with a total duration of 1 hour 30 minutes.