This paper addresses the recognition of single-object behavior from monocular image sequences. Most existing approaches perform behavior recognition as a separate step, after an initial phase of feature/attribute extraction. We propose a framework in which behavior recognition is performed jointly with attribute extraction, so that each task can improve the results of the other. To this end, we express the joint recognition/extraction problem as a probabilistic temporal model, which we solve with a variant of the Viterbi decoding algorithm adapted to our model. Within the derivation of the algorithm, we translate probabilistic attribute extraction into a variational segmentation scheme. We demonstrate the viability of the proposed framework through a particular implementation for finger-spelling recognition. The results obtained show that our collaborative model outperforms the traditional approach, in which attribute extraction and behavior recognition are performed sequentially.
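For orientation, the textbook Viterbi recursion on a plain hidden Markov model can be sketched as follows. This is a minimal illustration of the standard algorithm only, not the adapted variant proposed in this paper: it assumes a finite set of discrete states and precomputed log-probabilities, and the names `viterbi`, `log_init`, `log_trans`, and `log_emit` are hypothetical, chosen for this sketch.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Most probable state path of a discrete HMM (standard Viterbi, log domain).

    Assumed, illustrative inputs (not taken from the paper):
      log_init[s]      log prior of state s at the first frame
      log_trans[i][j]  log probability of moving from state i to state j
      log_emit[t][s]   log likelihood of the observation at frame t under state s
    Returns (path, log_prob) for the best path.
    """
    n_states = len(log_init)
    # Best log score of any path ending in each state at the first frame.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptrs = []  # backptrs[t - 1][j] = best predecessor of state j at frame t

    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for j in range(n_states):
            # Pick the predecessor state that maximizes the path score into j.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
        score = new_score
        backptrs.append(ptr)

    # Backtrack from the best final state to recover the full path.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]
```

In a sequential pipeline, `log_emit` would be computed once from attributes extracted beforehand; in the joint formulation described above, attribute extraction is instead coupled to the decoding itself, which this sketch does not attempt to reproduce.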