User-level Internet tomography and overlay routing

In this thesis, we study methods to detect, localize and overcome performance problems experienced by end-users in communication networks. These problems are extremely difficult to solve when the only available information are end-to-end data. We first consider the simple problem of identifying poorly performing links in a network. Even though end-to-end data do not provide sufficient information to calculate link loss rates exactly, we show that in many settings they provide enough information to identify the lossy links. We introduce an inference technique based on the maximum likelihood inference principle, which handles well noisy measurements and routing changes. We apply our technique to locate lossy links in sensor networks. Compared to wired networks, sensor networks pose two additional challenges for monitoring functions: they support much less probing traffic, and they change their routing topologies much more frequently. A Boolean performance model that requires a small number of probes is therefore particularly suited to sensor networks. We evaluate the performance of our inference algorithm in simulation and on real-network traces. We find that the proposed technique achieves high detection and low false positive rates. In IP networks where the network topology is stable and the number of end-to-end probes is large, we consider a second approach to the Boolean tomography problem where we take multiple measurements of the network to learn the probability that a link is going to be congested. We show that these probabilities can be uniquely identified from a small set of measurements by using properties of Boolean algebra. We can then use the learnt probabilities as priors to find rapidly the congested links at any time, with an order-of-magnitude gain in accuracy over existing algorithms. We validate our results both by simulation and real implementation in the PlanetLab network. Taking multiple snapshots of the network also allows us to calculate link loss rates from end-to-end measurements exactly. This operation is possible despite that the system of equations relating link loss rates with end-to-end loss rates has multiple solutions. The main reasons are that losses due to congestion occur in bursts, hence the loss rates of congested links have high variances. On the contrary, most links on the Internet are un-congested, hence the averages and variances of their loss rates are virtually zero. We first prove that the variances of link loss rates can be uniquely calculated from the covariances of the measured end-to-end loss rates in any realistic topology. After calculating the link variances, we remove the un-congested links with small variances from the first-order moment equations in order to obtain a full-rank linear system of equations, from which we can calculate precisely the loss rates of the remaining congested links. Our proposed solution uses only regular unicast probes and thus is applicable in today's Internet. It is accurate and scalable, as shown in our simulations and experiments on PlanetLab. After locating the performance bottlenecks, we study methods that mitigate these performance problems using relay nodes, end-hosts that act as intermediary points to bridge connections. Efficiently selecting a relay node is not a trivial problem, especially in a large-scale unstructured overlay system where end-hosts are heterogeneous and do not have the full knowledge of all available relays. Despite these facts, good relay selection algorithms should effectively balance the aggregate load across the set of relay nodes. We address these problems using algorithms based on the two random choices method. In the presence of the relay heterogeneity, we first prove that the classic load-based algorithm can effectively balance the load, even when relays are heterogeneous, and that its performance depends directly on relay heterogeneity. Second, we propose an utilization-based random choice algorithm to distribute load in order to balance relay utilization. Numerical evaluations through simulations illustrate the effectiveness of this algorithm, indicating that it might also yield provable performance (which we conjecture). Finally, we support our theoretical findings through simulations of various large-scale scenarios, with realistic relay heterogeneity. When end-hosts have only a limited knowledge of available relays, we prove that even if each end-host knows only a small fraction of the number of relays, the two-random-choice algorithm still guarantees good load balancing performance. However, when the number of known relays for each end-host does not grow fast enough with the total number of relays, a modified two-choice algorithm is needed to guarantee good load balancing. This algorithm consists in first selecting two relay sets and then sampling two relays, one from each of them.


Related material