The ubiquity of social media in daily life, the intensity of user participation, and the explosion of multimedia content have generated extraordinary interest among computer and social scientists in investigating the traces left by users to understand human behavior online. From this perspective, YouTube can be seen as the largest collection of audiovisual human behavioral data, among which conversational video blogs (vlogs) are one of the basic formats.

Conversational vlogs have evolved from the initial "chat from your bedroom" format into a rich form of expression and communication that is expanding to innovative applications where a more natural and engaging way of reaching audiences is either necessary or beneficial. This video genre, available online in huge quantities, offers a unique scenario for the study and characterization of complex human behavior in social media, one which, unlike social networks, text blogs, and microblogs, has remained unexplored so far. The automatic behavioral understanding of conversational vlogs is thus a new domain for multimedia research.

In short, the goal of our research is to understand the processes involved in this social media type, based not only on the verbal channel – what is said – but also on the nonverbal channel – how it is said. The nonverbal channel includes prosody, gaze, facial expression, posture, and gesture, and has been studied in depth in the field of nonverbal communication. While the study of vlogging contributes to user behavior research in social media, it also adds to a larger research agenda in social computing by enabling the analysis of behavioral data at scales not previously achievable in other scenarios. This type of analysis poses important challenges regarding the development and integration of methods for robust and tractable audiovisual processing. In this thesis, we address the problem of mining user behavior in conversational videos along three main aspects.
First, we integrate state-of-the-art audio processing and computer vision techniques to analyze conversational social video. While the initial focus of the thesis is the nonverbal aspect of vlogger behavior, we also investigate the verbal content. Second, we study some of the interpersonal and social processes that link vlogger behavior and vlog consumption on social media platforms such as YouTube. In this context, we examine the phenomenon of social attention in vlogs, and we investigate the use of crowdsourcing as a scalable method to annotate large multimodal corpora with interpersonal perception impressions. Finally, we propose a computational framework to predict interpersonal impressions automatically using multimedia analysis, crowdsourced impressions, and machine learning techniques. We anticipate that the work presented in this dissertation will motivate future research in the social and behavioral sciences, media analysis, natural language processing, and affective and social computing applied to the large-scale analysis of human interaction in social video.