While a first generation of video coding techniques proposed to remove the redundancies in and between image frames to get smaller bitstreams, second generation schemes like MPEG-4 and MPEG-7 aim at doing content-based coding and interactivity. To reach this goal, tools for the extraction and description of semantic objects need to be developed. In this work, we propose an algorithm for the extraction and tracking of semantic objects and an MPEG-7 compliant descriptor set for generic objects; together, they can be seen like a smart camera for automatic scene description. Some parts of the proposed system will be tested by software. The tracking algorithm has been laid out so as to follow generic objects in scenes including partial occlusions and merging. To do this, we first localize each moving object of the scene using a change-detection mask. Then, a certain number of representative points called centroids is given to the objects by a fuzzy C-means algorithm. For each centroid of some current frame, we try to find the closest centroid in the previous frame. Once we found these pairs, each object can be labelled according to its corresponding previous centroids. The description structure is a subset of the DDL language used in MPEG-7. The main concern was to find a simple, but flexible descriptor set for generic objects. A corresponding C-structure for software implementations is also proposed and partially tested.