Visual information occupies a predominant place in our society. With the advent of digital technologies, it will find new applications in domains ranging from communication and commerce to entertainment. New functionalities will be required that permit extended interaction with the visual information. This is particularly the case for digital television and the related problem of video sequence compression.
The extent of possible interaction depends on the manner in which the visual information is represented. Up to now, the canonical representation has been used. Also referred to as the waveform representation, it is a technical artifact of the image capture procedure. This representation severely restricts the functionalities available to the end-user, who is unable to freely manipulate or customize the received visual information.
In this dissertation, we propose to describe the visual information through a semantically meaningful representation. This representation derives directly from the scene content, which is decomposed into its constituent objects. Consequently, the viewing process is fully decoupled from the image capture procedure. This permits full interactivity with the visual information, leading to enhanced functionalities for the end-user.
The essence of this dissertation is to automatically define the objects forming the scene and to automatically track them through the video sequence. Two main issues are identified: the initial segmentation of the objects and their tracking.
The first issue deals with segmenting the scene into its constituent objects. This has to be performed on the basis of the information available in two consecutive frames. To solve this problem, a split-and-merge approach is used. First, the image is segmented into small, spatio-temporally homogeneous regions. This is achieved through a top-down approach in which spatial, temporal and change information are combined. These regions then serve as a starting point for defining the objects forming the scene: a bottom-up approach iteratively merges the regions. The propensity of regions to form an object is assessed in terms of both spatial and temporal information.
The second issue deals with segmenting and tracking the objects in the successive frames. The coherence between successive segmentations is ensured by using past and current information. In addition, the different objects composing the scene are identified throughout the video sequence. This identification relies on temporal, spatial and spatio-temporal features of the objects.
The proposed representation of the visual information finds a natural application in video coding. A second-generation video coding scheme is presented which combines compression efficiency with extended functionalities. In particular, scalable coding is achieved in terms of both scene content and object coding quality.
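As a rough illustration of the split-and-merge principle described above, the following sketch operates on a single grayscale frame and uses intensity alone as the homogeneity criterion. This is a deliberate simplification: the approach in this dissertation combines spatial, temporal and change information over two consecutive frames, none of which is modelled here. The function names, the quadtree-style split and the thresholds are all assumptions made for illustration.

```python
# Illustrative split-and-merge on one grayscale frame (intensity only).
# All names and thresholds are hypothetical, not taken from the dissertation.

def block_range(img, y, x, h, w):
    """Spread of intensities inside a block, used as the homogeneity measure."""
    vals = [img[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    return max(vals) - min(vals)


def split(img, y, x, h, w, thresh, out):
    """Top-down step: recursively quarter a block until it is homogeneous."""
    if (h <= 1 and w <= 1) or block_range(img, y, x, h, w) <= thresh:
        out.append((y, x, h, w))
        return
    h2, w2 = max(h // 2, 1), max(w // 2, 1)
    for yy, hh in ((y, h2), (y + h2, h - h2)):
        for xx, ww in ((x, w2), (x + w2, w - w2)):
            if hh > 0 and ww > 0:
                split(img, yy, xx, hh, ww, thresh, out)


def merge(img, blocks, thresh):
    """Bottom-up step: iteratively merge adjacent regions with similar means."""
    height, width = len(img), len(img[0])
    label = [[0] * width for _ in range(height)]
    total, count = [], []
    for i, (y, x, h, w) in enumerate(blocks):
        s = 0
        for r in range(y, y + h):
            for c in range(x, x + w):
                label[r][c] = i
                s += img[r][c]
        total.append(s)
        count.append(h * w)
    parent = list(range(len(blocks)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    changed = True
    while changed:  # repeat until no pair of neighbours can be merged
        changed = False
        for r in range(height):
            for c in range(width):
                for rr, cc in ((r + 1, c), (r, c + 1)):
                    if rr < height and cc < width:
                        a, b = find(label[r][c]), find(label[rr][cc])
                        if a != b and abs(total[a] / count[a]
                                          - total[b] / count[b]) <= thresh:
                            parent[b] = a
                            total[a] += total[b]
                            count[a] += count[b]
                            changed = True
    return [[find(label[r][c]) for c in range(width)] for r in range(height)]


def split_and_merge(img, split_thresh=10, merge_thresh=10):
    """Return a label map with one integer label per merged region."""
    blocks = []
    split(img, 0, 0, len(img), len(img[0]), split_thresh, blocks)
    return merge(img, blocks, merge_thresh)
```

Calling `split_and_merge(frame)` on a list-of-lists image returns a label map of the same size. Extending the homogeneity and merge criteria with motion or change-detection cues would bring the sketch closer to the spatio-temporal formulation outlined in the text.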