We address the detection, tracking, and relative localization of the agents of a drone swarm from a human perspective using a headset equipped with a single camera and an Inertial Measurement Unit (IMU). We train and deploy a deep neural network detector on image data to detect the drones. A joint probabilistic data association filter resolves the detection problems and couples this information with the headset IMU data to track the agents. In order to estimate the drones’ relative poses in 3D space with respect to the human, we use an additional deep neural network that processes image regions of the drones provided by the tracker. Finally, to speed up the deep neural networks’ training, we introduce an automated labeling process relying on a motion capture system. Several experimental results validate the effectiveness of the proposed approach. The approach is real-time, does not rely on any communication between the human and the drones, and can scale to a large number of agents, often called swarms. It can be used to spatially task a swarm of drones and also employed without a headset for formation control and coordination of terrestrial vehicles.