Back To The Source: An Online Approach for Sensor Placement and Source Localization

Source localization, the act of finding the originator of a disease or rumor in a network, has become an important problem in sociology and epidemiology. The localization is done using the infection state and time of infection of a few designated sensor nodes; however, maintaining sensors can be very costly in practice. We propose the first online approach to source localization: We deploy a priori only a small number of sensors (which reveal if they are reached by an infection) and then iteratively choose the best location to place a new sensor in order to localize the source. This approach allows for source localization with a very small number of sensors; moreover, the source can be found while the epidemic is still ongoing. Our method applies to a general network topology and performs well even with random transmission delays.


INTRODUCTION
Computer worms, or rumors spreading on social networks, often trigger the question of how to identify the source of an epidemic. This problem also arises in epidemiology, when health authorities investigate the origin of a disease outbreak. The problem of source localization has received considerable attention in the past few years; because of its combinatorial nature, it is inherently difficult: the infection of a few nodes can be explained by multiple and possibly very different epidemic propagations. Researchers have considered various models and algorithms that differ in the epidemic spreading model and in the information that is available for source localization. Such models are often not realistic, either because they rely on some strong assumptions about the epidemic features (tree networks, deterministic transmission delays, etc.) or because they require an overwhelming amount of information to localize the source. The costs of retrieving information for source localization cannot be disregarded. Data collection is never free; moreover, due to privacy concerns, individuals are becoming aware of the value of their data and reluctant to share it for free [9]. In the case of infectious diseases, performing the necessary medical exams and the subsequent data analysis on many suspected households or communities can be exorbitantly expensive, whereas the efficient allocation of resources can lead to enormous savings [29].
Driven by the demand for general models for source localization and by practical resource-allocation constraints, we adopt a very general setting in terms of the epidemic model and prior information available, and we focus on designing a resource-efficient algorithm for information collection and source localization.
Our model. We model the connections across which an epidemic can spread with an undirected network G = (V, E) of size N = |V|. Each edge uv ∈ E is given a weight wuv ∈ R+ that is the expected time required for an infection to spread from u to v. The edge weights induce a distance metric d on G: d(u, v) is the length of the shortest path from u to v.
An epidemic spreads on G starting from a single source v* at an unknown time t*. The unknown source v* is drawn from a prior distribution π on V. At any time, a node can be in one of two states: susceptible or infected. If u becomes infected at time tu, a susceptible neighbor v of u will become infected at time tu + θuv, where θuv is a continuous random variable with expected value wuv.
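The spreading model above can be sketched in a few lines (an illustrative simulation, not code from the paper; the adjacency-list format and the uniform noise model with parameter eps are assumptions that anticipate Section 3.2). Because a node is first infected along the fastest path reaching it, sampling one delay per edge and running Dijkstra over the sampled delays yields the first-infection times:

```python
import heapq
import random

def simulate_epidemic(adj, source, t0=0.0, eps=0.2, rng=None):
    """First-infection times of an epidemic started at `source` at time t0.

    adj: dict u -> list of (v, w_uv), with w_uv the *expected* transmission
    delay; the realized delay is uniform on [w(1-eps), w(1+eps)].
    Returns a dict node -> infection time: a Dijkstra pass over the sampled
    delays, since a node is first infected along the fastest path to it.
    """
    rng = rng or random.Random(0)
    # Sample one delay per undirected edge so both directions agree.
    delay = {}
    for u in adj:
        for v, w in adj[u]:
            e = (min(u, v), max(u, v))
            if e not in delay:
                delay[e] = w * rng.uniform(1 - eps, 1 + eps)
    t = {source: t0}
    pq = [(t0, source)]
    while pq:
        tu, u = heapq.heappop(pq)
        if tu > t.get(u, float("inf")):
            continue  # stale queue entry
        for v, _ in adj[u]:
            tv = tu + delay[(min(u, v), max(u, v))]
            if tv < t.get(v, float("inf")):
                t[v] = tv
                heapq.heappush(pq, (tv, v))
    return t
```

With eps = 0 the delays are deterministic and t[v] equals t0 + d(source, v).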
When a node is chosen as a sensor, it can reveal its infection state and, if it is infected, its infection time. We have two different types of sensors: static sensors S and dynamic sensors D. Static sensors are placed a priori in the network. They serve the purpose of detecting any ongoing epidemic and of triggering the source search process. When a static sensor s0 ∈ S gets infected, the epidemic is detected and the online placement of the dynamic sensors starts.
Our results. Most source-localization approaches assume that all available sensors are chosen a priori, independently of any particular epidemic instance, and, commonly, the source can be localized only after the epidemic spreads across the entire network. Instead, we propose a novel approach where we start the source-localization process as soon as an epidemic is detected and we place dynamic sensors actively while the epidemic spreads.
We approach the problem of source-localization asking the following question: Who is the most informative individual, given our current knowledge about the ongoing epidemic? Indeed, depending on the particular epidemic instance, the infection time or the state of some individuals might be more informative than that of others, hence we want to observe them, i.e., to choose them as sensors.
Our methods are practical because they apply to general graphs and both deterministic and non-deterministic settings. We validate our results with extensive experiments on synthetic and real-world networks. We experimentally show that, when we have a limited budget for the dynamic sensors, we dramatically outperform a static strategy with the same budget, improving the success rate of finding the source from ∼5% to ∼75% of the time. Moreover, when we are unconstrained by a budget, we can localize the source with few sensors: Many purely-static approaches to sensor placement require a large fraction of the nodes to be sensors (e.g., > 30%, see the discussion in Section 5), while our dynamic placement uses ∼3% on all topologies (see Figure 2). Intuitively, the reason for these dramatic improvements is the dual approach of using static and dynamic sensors: Once a static sensor is infected, it effectively cuts down the network to a region of size N/|S| that contains the source. Then, the |D| dynamic sensors only need to localize the source in this smaller network. Proving this formally would be an interesting direction for future work.
We focus on studying source localization and dynamic sensor placement, assuming that a set of static sensors is given. We consider two objectives: first, under budget-constraints for the number of sensors, we are interested in minimizing the uncertainty on the identity of the source (i.e., the number of nodes that, given the available observations, have a positive probability of being the source); second, when the budget for sensors is not limited, we want to minimize the number of sensors needed to exactly identify the source.

Model Assumptions
What we assume. We make the following assumptions.
(A.1) We assume that the network topology is known. This is a common assumption when studying source localization (see, e.g., [28,1,26,27,22]).
(A.2) We assume that, when a node is chosen as a dynamic sensor, it reveals its state (healthy or infected). If it is infected, it also reveals the time at which it became infected. This is not a strong assumption because, by interviewing social-network users (or, in the case of a disease, patients), the infection time of an individual can be retrieved [38].
What we do not assume. In order to obtain a tractable setting, much prior work has made assumptions which are not always feasible in practice and which we do not make. In particular, we do not make the following assumptions.
(B.1) Knowledge of the state of all the nodes at a given point in time. This might be prohibitively expensive, as one would need to maintain a very large number of monitoring systems [40]. Instead, we detect the source based on the infection time of a very small set of nodes.
(B.2) Knowledge of the time at which the epidemic starts. This information is in most practical cases not available [15,26]. Hence, we make no assumptions about the starting time of the epidemic.
(B.3) Observation of multiple epidemics. Observing multiple epidemics started by the same source certainly helps in its localization [26,11]. In this work, we consider a single epidemic because we are interested in localizing the source while the epidemic spreads.
(B.4) A specific class of network topologies. A large part of the literature assumes tree topologies: having a unique path between any two nodes makes source localization much easier [15]. Instead, our methods work on arbitrary graphs.
(B.5) Deterministic or discretized transmission delays. When the transmission delays are deterministic, the epidemic is fully determined by the position of the source; hence, tracking back its position becomes much easier [30]. Also, assuming that infection times are discrete is limiting and may result in a loss of important information [4]. We assume transmission delays to be randomly drawn from continuous distributions with bounded support, which include deterministic delays as a particular case and can, in practice, approximate unimodal distributions with unbounded support (e.g., Gaussians).
(B.6) A specific epidemic model. Our method only uses the time of first infection of the sensors (no assumption on recovery or re-infection dynamics is made). Hence, it can be applied to most epidemic models, including the well-known SIS and SIR models (provided that nodes do not recover before infecting their neighbors).

Model Description and Notation
Sensor Placement. The set of static sensors is denoted by S, with |S| = Ks. Let τ0 ∈ R be the first time at which a subset of static sensors S0 ⊆ S is infected. At this time, the placement of dynamic sensors starts. A new dynamic sensor is placed at each time τi = τ0 + iδ, i ∈ N+, where δ > 0 is called the placement delay.
The i-th dynamic sensor, i.e., the one placed at time τi, is denoted by di. The set of dynamic sensors deployed in the network before or at step i is denoted by Di. The number of dynamic sensors is limited by a budget Kd, hence the maximum total number of sensors is Ks + Kd. If we do not have a limited budget for dynamic sensors, we trivially set Kd = ∞. We stop adding dynamic sensors when the source is localized or when the number of dynamic sensors reaches the budget Kd. The set of all static and dynamic sensors is denoted by U. Its cardinality, |U|, is the total number of sensors used in the localization process and is our metric for the cost of localization.
Positive and Negative Observations. A sensor gives information in two possible ways: If it is infected, it reveals its infection time; otherwise it reveals that it is susceptible. In the first (respectively, second) case we say that the sensor gives a positive (respectively, negative) observation. We will see that an observation contributes to the localization process even if it is negative. We represent each observation ω as a tuple (uω, tω), where uω ∈ V denotes the sensor and tω ∈ R is its infection time if the observation is positive, whereas tω = ∅ if the observation is negative. For every step i of the localization process, we denote by Oi the set of all observations (positive or negative) collected before or at time τi. We summarize our notation below:
  τ0             time at which the epidemic is detected
  τi, i ∈ N+     time at which the i-th dynamic sensor is placed
  δ              placement delay, τi − τi−1 = δ ∀i ∈ N+
  Di, i ∈ N+     set of dynamic sensors at time τi
  Oi, i ∈ N      set of observations at time τi
  ω = (uω, tω)   observation of node uω
  Bi, i ∈ N      set of candidate sources given Oi
  Ci, i ∈ N+     set of candidate dynamic sensors at τi
Candidate Dynamic Sensors. The set of nodes among which we can choose a dynamic sensor at time τi is called Ci. Clearly, C1 = V\S and, for i ≥ 2, Ci = V\(S ∪ Di−1).
Candidate Sources. At step i, v is a candidate source if, conditioned on Oi, it has a non-zero probability of being the source. Bi is the set of candidate sources at step i, i.e.,
  Bi = {v ∈ V : P(v* = v | Oi) > 0}.   (1)
In particular, the initial set of candidate sources is B0, computed from the first observation set O0 (see Proposition 1).
Double Metric Dimension. Finally, we recall the definition of Double Resolving Set (DRS) and Double Metric Dimension (DMD) of a network [3], which will be useful in the following sections.
A set Z ⊆ V, with |Z| ≥ 2, doubly resolves G if, for every pair of distinct nodes v1, v2 ∈ V, there exist z1, z2 ∈ Z such that d(v1, z1) − d(v1, z2) ≠ d(v2, z1) − d(v2, z2), i.e., v1, v2 can be distinguished based on their distances to z1, z2. We will use the following lemma [6].
Lemma 1. For every pair of distinct nodes v1, v2 ∈ V and for every z1 ∈ V, there exists z2 ∈ V such that {z1, z2} doubly resolves {v1, v2}.
When an epidemic spreads on G and the transmission delays are deterministic, the infection times of a DRS suffice for distinguishing between any two possible sources [6]. The minimum size of a DRS of G is called the DMD of G. Computing the DMD of a network is NP-hard [6]. Finding the set U of k nodes that maximize the number of nodes that are distinguished by any two nodes in U is also a NP-hard problem to which we refer as k-DRS [30]. An approximate solution of k-DRS can be found with a natural greedy heuristic [30] (see the extended version [31] for details). With a slight abuse of notation we denote by k-DRS the set Z, such that |Z| = k, obtained via the latter heuristic.
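The greedy heuristic for k-DRS can be sketched as follows (an illustrative reconstruction, not the code of [30]: we assume the objective is the number of node pairs doubly resolved by the chosen set, and we repeatedly add the node with the largest marginal gain):

```python
from itertools import combinations

def pair_resolved(d, v1, v2, Z):
    # Z doubly resolves (v1, v2) iff d(v1, z) - d(v2, z) is not constant
    # over z in Z (i.e., some pair z1, z2 in Z distinguishes v1 from v2).
    diffs = {d[v1][z] - d[v2][z] for z in Z}
    return len(diffs) > 1

def greedy_k_drs(d, nodes, k):
    """d: precomputed all-pairs distances, d[u][v]. Returns k nodes chosen
    greedily to maximize the number of doubly-resolved node pairs."""
    Z = []
    pairs = list(combinations(nodes, 2))
    for _ in range(k):
        best, best_gain = None, -1
        for z in nodes:
            if z in Z:
                continue
            gain = sum(1 for v1, v2 in pairs
                       if pair_resolved(d, v1, v2, Z + [z]))
            if gain > best_gain:
                best, best_gain = z, gain
        Z.append(best)
    return Z
```

On a path 0-1-2-3 with unit weights, the heuristic ends up selecting the two endpoints' worth of resolving power: the second pick is the node that, together with the first, distinguishes every pair.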

Deterministic Transmission Delays
For ease of exposition, we first present our algorithm in the case of deterministic transmission delays, i.e., θuv = wuv. In Section 3.2 we will show that our results naturally extend to random delays.
The following lemma formalizes that, when epidemics spread deterministically, the only source of randomness is the position of the source v*.
Lemma 2. Let i ∈ N+ and let Oi be the set of observations collected before or at τi. Then, P(Oi | v* = v) ∈ {0, 1} for every v ∈ V.
Since the starting time t* of the epidemic is unknown, no single observation taken in isolation is informative about the position of the source (see Assumption (B.2)). Instead, two (or more) observations can become informative (which explains the importance of DMD and DRS for source localization). For this reason, we only consider the probability of two or more observations together. Let ω0 = (s0, τ0) be the reference observation given by the first infected static sensor. For an observation ω = (uω, tω) ∈ Oi, we define
  B_i^ω = {v ∈ V : d(v, uω) − d(v, s0) = tω − τ0}   if tω ≠ ∅,
  B_i^ω = {v ∈ V : d(v, uω) − d(v, s0) > τi − τ0}   if tω = ∅.   (2)
We have the following lemma, which immediately follows from the definitions above.
Algorithm description. The key idea is to iteratively choose the most informative node as a dynamic sensor. At every step i, we first select as new dynamic sensor di the node that maximizes the expected improvement (gain) in the localization process; then, we compute Bi using the information given by the dynamic sensor di and by the nodes in S ∪ Di−1 that became infected in (τi−1, τi]. The pseudocode for our algorithm is given in Algorithm 1. The running time of Algorithm 1 depends on the definition of Gain and will be discussed at the end of this section. We describe the functions InitializeCandSources, Update and Gain in the following subsections. Initial Candidate-Sources Set B0. Based on the first observation available (i.e., the infection time τ0 of the first infected static sensors S0 ⊆ S), the initial set of candidate sources B0 contains all nodes that are closer to S0 than to S\S0.
Proposition 1. Let S0 be the set of the first infected static sensors and O0 be the first observation set. For every v ∈ V, let S0^v be the set of the static sensors that are at minimum distance from v, i.e., S0^v = argmin_{s ∈ S} d(v, s). Then B0 = {v ∈ V : S0^v = S0}.
Proof sketch. In the deterministic setting, any O0 collected from a given epidemic has non-zero probability, hence P(O0) > 0. Moreover, given v* = v, the static sensors infected first are exactly those at minimum distance from v, which yields the claim.
Using Lemma 4, at step i, we compute the set of candidate sources Bi based on Bi−1 and on Oi\Oi−1. More specifically, in Update we compute Bi by applying Proposition 2.
(B) The proof follows similarly to (A).
(C) If v ∈ B_i^ω for all ω ∈ Oi\Oi−1, then, by (2), we have that P({ω, ω0} | v* = v) = 1 for all ω ∈ Oi\Oi−1. By a reasoning similar to (A)(ii), this implies that P(Oi | v* = v) = 1, hence v ∈ Bi.
Correctness of Algorithm 1. We are now ready to prove the correctness of Algorithm 1, which, in fact, does not depend on the definition of Gain: As we will see in Section 4, Gain has an effect on the convergence speed of Algorithm 1 but not on the localization of the source. Proof. From Prop. 1, it follows that v* ∈ B0. Moreover, from Prop. 2, it follows that v* ∈ Bi at every step i of the algorithm. Thus, it only remains to prove that we make progress, i.e., that for any v ∈ B0\{v*}, there is a step i such that v ∉ Bi. By Lemma 1, for any v ∈ B0\{v*} and s0 ∈ S0, there exists w ∈ V such that d(v, w) − d(v, s0) ≠ d(v*, w) − d(v*, s0). Let i ∈ N+ be the first step such that the infection time tw of w satisfies tw ≤ τi. Then, if w ∈ S, v ∉ B_i^(w,tw) (by (2)) and hence v ∉ Bi. If w ∉ S, let j ∈ N+ be the iteration step at which we choose w as a sensor. Then, for ℓ = max(i, j), v ∉ B_ℓ^(w,tw), and hence v ∉ B_ℓ.
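As an illustration, the initialization of Proposition 1 is straightforward to implement (a sketch; `dist` is assumed to be a precomputed all-pairs shortest-path table, and ties are handled by requiring the first-infected sensors to coincide exactly with the sensors closest to v):

```python
def initial_candidate_sources(dist, static_sensors, first_infected):
    """B0: nodes whose closest static sensors are exactly the ones
    infected first (Proposition 1, deterministic delays).

    dist: dict of dicts, dist[v][s] = shortest-path distance.
    """
    B0 = set()
    for v in dist:
        dmin = min(dist[v][s] for s in static_sensors)
        closest = {s for s in static_sensors if dist[v][s] == dmin}
        if closest == set(first_infected):
            B0.add(v)
    return B0
```

On a path 0-1-2-3-4 with static sensors {0, 4}, detecting the epidemic first at sensor 0 leaves {0, 1} as candidates; detecting it at both sensors simultaneously leaves only the midpoint.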
We know from Prop. 2 that every new observation potentially reduces the number of candidate sources and makes the localization progress. At each step of Algorithm 1, Gain evaluates the expected progress in localization for all candidate sensors, and we choose as dynamic sensor the node that yields the maximum value. We consider three possible Gain functions: Size-Gain, DRS-Gain and RC-Gain. It is not a priori clear which version of Gain leads to a faster convergence. Hence, we experiment with all of them.
Size-Gain. Perhaps the most natural Gain function is the one that computes the expected reduction in the number of candidate sources: for each candidate sensor c ∈ Ci, we consider the set of possible infection times of c by step i and compute the expected number of candidate sources that would remain after observing c.
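One plausible instantiation of Size-Gain in the deterministic setting is sketched below (an assumption-laden reconstruction, not the paper's exact formula: we assume a uniform posterior over Bi−1, take the infection time that c would exhibit under source v to be τ0 + d(v, c) − d(v, s0), and ignore negative observations):

```python
from collections import Counter

def size_gain(dist, B_prev, c, s0, tau0):
    """Expected reduction in the number of candidate sources if node c
    is chosen as the next dynamic sensor (deterministic delays).

    Given source v, c would be observed infected at tau0 + d(v,c) - d(v,s0);
    candidates producing the same observation remain indistinguishable.
    """
    groups = Counter(tau0 + dist[v][c] - dist[v][s0] for v in B_prev)
    n = len(B_prev)
    # Expected |B_i| under a uniform prior on the candidates: a group of
    # size g is the residual candidate set with probability g / n.
    expected_size = sum(g * g for g in groups.values()) / n
    return n - expected_size
```

A sensor far from s0 typically splits the candidates into many singleton groups (large gain), while a sensor at s0 itself distinguishes nothing (zero gain).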
Note that both Size-Gain and DRS-Gain account only for the benefit of adding the dynamic sensor c: For tractability, we ignore all observations ω ∈ Oi\Oi−1 such that uω = c.

RC-Gain. RC-Gain (Random-Candidate-Gain) assigns gain 1 to all candidate sources and gain 0 to all nodes that are not candidate sources: At step i, for c ∈ Ci we set g_RC(c) = 1 if c ∈ Bi−1, and g_RC(c) = 0 otherwise. In other words, we choose the dynamic sensors uniformly at random among the candidate sources. Note that if the infection time of at least one node in Bi−1 has already been observed, adding a sensor at any other node in Bi−1 implies |Bi| ≤ |Bi−1|. Hence, this very simple Gain ensures that the source localization makes progress at each step.
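Putting the pieces together, the online loop of Algorithm 1 with RC-Gain can be sketched as follows (a simplified illustration under deterministic delays: `infection_time` is a hypothetical oracle standing in for querying a placed sensor, initialization is reduced to a single reference sensor s0, and only positive observations are used to filter candidates):

```python
import random

def localize(dist, static_sensors, infection_time, tau0,
             delay=1.0, budget=50, rng=None):
    """Sketch of the online loop with RC-Gain: repeatedly place a sensor
    at a current candidate source, observe it, and keep only candidates
    consistent with every pair of positive observations."""
    rng = rng or random.Random(0)
    s0 = min(static_sensors, key=lambda s: infection_time[s])
    obs = [(s0, tau0)]
    # Simplified B0: nodes for which s0 is among the closest static sensors.
    B = {v for v in dist
         if dist[v][s0] == min(dist[v][s] for s in static_sensors)}
    placed = set(static_sensors)
    for i in range(1, budget + 1):
        if len(B) <= 1:
            break
        tau_i = tau0 + i * delay
        cand = sorted(c for c in B if c not in placed)  # RC-Gain: any candidate
        if not cand:
            break
        c = rng.choice(cand)
        placed.add(c)
        tc = infection_time[c]
        obs.append((c, tc) if tc <= tau_i else (c, None))
        # With deterministic delays, t_u - t_w must equal d(v,u) - d(v,w)
        # for the true source v; filter candidates accordingly.
        pos = [o for o in obs if o[1] is not None]
        B = {v for v in B
             if all(t1 - t2 == dist[v][u1] - dist[v][u2]
                    for u1, t1 in pos for u2, t2 in pos)}
    return B
```

Note that the filter only uses differences of infection times, so the unknown start time t* cancels out, in line with Assumption (B.2).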
Running time. In the worst case, the while loop of Algorithm 1 is entered N times. At step i, Update takes O(|Bi|) steps, and computing any of the proposed Gain functions takes O(|Bi|) steps per candidate sensor. Hence, with the proposed definitions of Gain, the i-th iteration takes O(|Ci| · |Bi|) ⊆ O(N^2). Although the running time can potentially reach Θ(N^3), our experiments show that, in many practical cases, |Bi| is sublinear.

Non-Deterministic Transmission Delays
In this section, we assume that the transmission delays are independent continuous random variables such that, for every uv ∈ E, the support of the transmission delay θuv is bounded and symmetric with respect to wuv, i.e., it is [wuv(1−ε), wuv(1+ε)] with ε ∈ [0, 1]. We refer to ε as the noise parameter. For ε > 0, the transmission delay over an edge of weight w can deviate by up to εw from its expected value; ε = 0 corresponds to the deterministic model of Section 3.1.
The structure of the algorithm for sensor placement and source localization is identical to that of Algorithm 1, the only changes are in InitializeCandSources and Update.
The following proposition characterizes the candidate sources at step i through necessary conditions. It is used in InitializeCandSources and in Update to discard, at step i, the nodes v such that P(v = v |Oi) = 0.

Proposition 4. Let i ∈ N+ and let v ∈ Bi. Let ω1, ω2 ∈ Oi with tω1 ≠ ∅ and tω2 ≠ ∅. Then
  (1 − ε)d(v, uω1) − (1 + ε)d(v, uω2) ≤ tω1 − tω2 ≤ (1 + ε)d(v, uω1) − (1 − ε)d(v, uω2).   (7)
If instead tω1 ≠ ∅ and tω2 = ∅, then
  τi − tω1 < (1 + ε)d(v, uω2) − (1 − ε)d(v, uω1).   (8)
Prop. 4 is similar in spirit to Prop. 2. Note, in particular, that by setting ε = 0 in (7) and (8) we recover, for two arbitrary observations ω1, ω2 ∈ Oi, the respective conditions on the infection times used to define B_i^ω in (2). However, differently from Prop. 2, when ε > 0, we cannot give necessary and sufficient conditions for a node to be the source by simply comparing all observations with a reference observation. Hence, when ε > 0, at step i the function Update keeps in Bi only the nodes that satisfy both (7) and (8) for all pairs of observations.
Proof. The proof follows the structure of that of Theorem 1. First note that nodes are removed from the set of candidate sources if and only if they do not satisfy some of the necessary conditions expressed by inequalities (7) and (8). Hence, because of Proposition 4, the source v* is never removed from the set of candidates. Next, we want to prove that, for every node v ≠ v*, there exists a node w ∈ V such that, when the infection time of w is observed, v is removed from the set of candidate sources. Take w = v* and suppose that its infection time tw is observed. Let v ≠ w be another node for which the infection time tv is also observed. As w = v*, we have tv > tw. Note that inequality (7) cannot hold for v and w: Indeed, we would have 0 < (1 − ε)d(v, w) ≤ tw − tv < 0, which gives a contradiction. Let i ∈ N+ be such that w, v ∈ S ∪ Di and such that tv is smaller than τi. Then, v ∉ Bi.
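A pairwise feasibility check in the spirit of inequalities (7) and (8) can be sketched as follows (restricted, for brevity, to positive observations: the test below only encodes the bound obtained by eliminating the unknown start time t* between two observed infection times; the negative-observation condition (8) is omitted):

```python
def consistent(dist, v, obs, eps):
    """Can candidate v explain all positive observations?

    obs: list of (node, infection_time). Under bounded delays, if v is the
    source then (1-eps)*d(v,u) <= t_u - t* <= (1+eps)*d(v,u); eliminating
    t* between any two observations (u1,t1), (u2,t2) gives
        t1 - t2 <= (1+eps)*d(v,u1) - (1-eps)*d(v,u2)
    (and the symmetric bound, checked by iterating over ordered pairs).
    """
    for u1, t1 in obs:
        for u2, t2 in obs:
            if t1 - t2 > (1 + eps) * dist[v][u1] - (1 - eps) * dist[v][u2]:
                return False
    return True

def update_candidates(dist, B_prev, obs, eps):
    """Keep only the candidates that satisfy every pairwise bound."""
    return {v for v in B_prev if consistent(dist, v, obs, eps)}
```

With eps = 0 the two-sided bound collapses to the deterministic equality t1 − t2 = d(v, u1) − d(v, u2); larger eps keeps more candidates, matching the intuition behind the tolerance constant C discussed below.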
Gain. Building on the deterministic case, we can compute an approximate version of the Size-Gain value g_i^SIZE(c) for the case in which ε > 0. For the details of this computation see the extended version [31]. DRS-Gain and RC-Gain do not depend on the epidemic model, hence they remain unchanged with respect to Section 3.1.
Approximate Source Localization. When Kd < ∞ and the convergence of the algorithm is not guaranteed, we could consider substituting ε with ε̃ = Cε, 0 < C ≤ 1, in inequalities (7) and (8). Here, C plays the role of a tolerance constant. Intuitively, when C is small, we quickly narrow the candidate-sources set, but the probability that the correct source is not identified by the algorithm is high; when C is large, the probability that the algorithm identifies the real source as a candidate source is high, but we may have many false positives. The setting C < 1 can be interesting for the case in which the transmission delays θuv are not uniform, e.g., when the delays are more concentrated around their expected values. A study of this extension is left for future work.

Experimental Setup
In our experiments, the transmission delays are uniformly distributed. The uniform distribution is, among the unimodal distributions on a bounded support, the one that maximizes the variance [13]. Hence, uniform delays are a very challenging setting for source localization.
The choice of static sensors is inspired by the work of Spinelli et al. [30], where static sensor placement is extensively studied. We let S = k-DRS with k = Ks (see Section 2), so that the number of nodes that are distinguished by the static sensors is maximized. We do not evaluate the impact of the budget Ks; rather, we are concerned with decreasing the total number of sensors |U|. We set Ks = 0.02 · N.
A study of different static placement strategies and of the trade-off between Ks and the timeliness of source localization is left for future work.
We evaluate the performance of the different approaches in terms of the (relative) cost of the sensor placement, i.e., the fraction |U|/N of the sensors used for localization. All results are averaged over at least 100 simulations in which the position of the source is chosen uniformly at random.
The placement delay δ, unless otherwise specified, is δ = 1. This means that the epidemic and the localization process have approximately the same speed, which we believe is a realistic assumption in many applications. Moreover, in Section 4.3 we present an experiment that evaluates the effect of this parameter and in which δ = 1 emerges as a good tradeoff between the cost of the algorithm and the time needed for detection (see Figure 3).

Algorithms & Baselines.
We study the performance of Algorithm 1 for Size-Gain, DRS-Gain and RC-Gain (see Section 3.1).
As recalled in Section 2, with a static sensor placement (i.e., Kd = 0), the minimum number of sensors required to localize the source when the transmission delays are deterministic is the DMD of the network [6]. Hence, we use the DMD as one natural benchmark for the cost of our algorithm.
Moreover, we compare with the following baselines: Random. We run Algorithm 1 but, at each step i, we select di at random from V\(S ∪ Di−1). AllStatic. When Kd < N, we compare the performance of Algorithm 1 (with Ks static and Kd dynamic sensors) with an entirely static version of Algorithm 1 where the budget for static sensors is K's = Ks + Kd and the budget for dynamic sensors is K'd = 0.

Network Topologies
We consider both synthetic and real-world networks; the network properties and statistics are reported in Table 1.
Synthetic networks. We generated synthetic networks from the following classes: Erdös-Rényi networks (ER) [10], Barabási-Albert networks (BA) [2], Random Geometric Graph on the sphere (RGG) [25], regular trees of degree 3 (RT) and trees with power-law distributed node degree (PLT). For each network class, 10 connected instances of size 250 with unit edge weights were generated.
Real-world networks. Facebook Egonets (FB). This dataset is a subset of the Facebook network, consisting of 3732 nodes. It was obtained from the union of 10 Facebook egonet networks [23] after removing the ego nodes and taking the largest connected component. We set all weights to w = 1, as there is no straightforward method for deriving realistic edge weights for this network. World Airline Network (WAN). This network is obtained from a publicly available dataset [24] that provides the aircraft type used for every daily connection between over three thousand airports. Using this data, we can derive the number of seats available on each route daily. We preprocess the network by removing the connections on which fewer than 20 seats per day are available and by assigning to each connection (u, v) the average between the number of seats available from u to v and from v to u. Also, we iteratively remove leaf nodes (for which we believe connections are not well represented in the dataset), obtaining a network of 2258 nodes. The definition of the edge weights is inspired by a work by Colizza et al. [7]. An edge (u, v) is weighted with an integer approximation of the expected time between the infection of city u and the arrival of an infected individual at city v (see the extended version [31] for details).

Results
Different Gain functions. We study the effect of Gain on the performance of Algorithm 1. For each variant, i.e., Size-Gain, DRS-Gain, RC-Gain, and for the Random heuristic, we report the relative cost. We let Kd = ∞; hence, by Theorems 1 and 2, Algorithm 1 always localizes the source. We consider both a deterministic setting (ε = 0) and a non-deterministic setting with ε = 0.2, which means that the transmission delays can deviate up to 20% from their average value. The results are depicted in Figures 1(a)-1(b). We observe that, for the real networks and ε = 0, all proposed Gain functions have similar performance. For FB and U-WAN, this is true also when ε > 0. These are also the cases where our algorithm has the smallest cost, hence we conclude that, when source localization is less challenging, Gain does not have a strong impact. In all other cases, Size-Gain consistently gives the best performance. The improvement with respect to DRS-Gain is most noticeable when ε > 0; indeed, in this setting DRS-Gain is outperformed by the simple RC-Gain. We attribute this to the fact that, when there is high variance in the transmission delays, splitting the candidate sources into subsets of nodes with different average infection times (see the definition of DRS-Gain in Eq. (6)) does not guarantee that we can distinguish them based on the observed infection times [30]. Instead, as mentioned in Section 3.1, RC-Gain enforces continuous progress in shrinking the set of candidate sources. Since Size-Gain emerges as the best Gain among those we consider, we use it in the remaining experiments (unless otherwise specified).
DMD vs. Cost of Algorithm 1. We now focus on the deterministic case (ε = 0) with Kd = ∞, and compare |U|/N with the (approximate) DMD. We recall (see Section 2) that the DMD is the size of the optimal offline sensor placement for this setting. The results are depicted in Figure 2. For all topologies, |U|/N is much smaller than DMD/N.
The improvement is particularly significant for trees where, on the one hand, DMD is very large (equal to the number of leaves [6]) and, on the other hand, the topology makes it easy for our algorithm to rapidly narrow the search for the source to a small set of candidates.
AllStatic vs. Algorithm 1. We look at the performance of Algorithm 1 when the budget for dynamic sensors is limited to a small fraction of nodes; we let Kd = 0.02 · N = Ks. We compare Algorithm 1 with different Gain (Size-Gain, DRS-Gain and RC-Gain) against the AllStatic baseline with K'd = 0 and K's = Ks + Kd = 0.04 · N (see Section 4.1). As Kd < ∞, it is no longer guaranteed that we localize the source; instead, we evaluate the success of an algorithm with the metric 1/|B_Kd|, where B_Kd is the set of candidate sources at the last iteration step. Hence, the success is 1 when the source is localized (since |B_Kd| = 1), and it decreases with the size of B_Kd. Note that |U| ≤ 0.04 · N and, in particular, |U| < 0.04 · N only if the source was localized with fewer than Kd dynamic sensors. The results are presented in Figure 4. We observe that our approach outperforms the static sensor placement in terms of the budget used by the algorithm. Furthermore, for both ε = 0 and ε > 0, our algorithm achieves a much higher success in source localization than AllStatic. Among the Gain functions tested, Size-Gain is again the best one, giving both the highest success and the lowest cost.
Placement delay. An important parameter used by Algorithm 1 is the placement delay δ, i.e., the time between two consecutive placements of a dynamic sensor. On the one hand, the larger δ is, the smaller we expect the cost of our algorithm to be; on the other hand, the smaller δ is, the less time we expect to need for localizing the source, hence the fewer individuals are infected before we do so. We vary δ and look at the number |D| of dynamic sensors used, the fraction µ of infected individuals at the time of localization, and the time T between the beginning of the epidemic and the localization of the source (see Figure 3). We observe a trade-off between |D| and both T and µ.
Cost of localization and size of |Bi| for real networks. Finally, we evaluate the cost of localization in the practical setting of real networks with random delays. Moreover, to estimate how the running time varies for different values of the noise parameter and for the different topologies considered, we look at how the cardinality of the candidate set Bi defined by Eq. (1) decreases along the successive steps. We note beforehand that the approximate DMD is 303 (around 0.08 · N) for the FB network, 751 (around 0.3 · N) for WAN and 484 for U-WAN. Hence, source localization is more challenging on the WAN network. This is confirmed by the results shown in Figure 5. On the FB network, with noise parameter ε = 0.3, the correct localization of the source is achieved with a total cost of |U| ≈ 0.025 · N sensors. The average number of sensors needed is slightly larger for the U-WAN network (|U| ≈ 0.03 · N). We attribute this effect to the presence of bottleneck edges, i.e., edges that appear on many different shortest paths and make it difficult to estimate the source based on its distance to the sensors. This effect becomes even stronger with the weighted version of the WAN network (where the total cost needed is around |U| ≈ 0.085 · N). This last result highlights that high variability among the edge weights makes source localization substantially more difficult, especially for ε > 0 (see Figure 1 for a comparison of the cost between deterministic and non-deterministic delays). Given the high noise regime we consider and the small percentage of sensors deployed, we conclude that our algorithm outperforms most other approaches to source localization, which either need more sensors or tolerate smaller amounts of noise.

RELATED WORK
We briefly review some important contributions to source localization (see [15] for an in-depth discussion).
Complete observation. The first source-estimator was proposed by Shah and Zaman [28] in 2009. This work, and many others that followed, rely on what is often called a complete observation of the epidemic (see Assumption (B.1) in Section 1) [37,27,32]. In these models, the source is estimated by maximum likelihood estimation (MLE).
The results of [28] have been extended in many ways, e.g., to the case of multiple sources [21] or to obtain a local source estimator [8]. An alternate line of work, which also uses Assumption (B.1), allows the observed states to be noisy, i.e., potentially inaccurate. For example, a model in which it is not possible to distinguish between susceptible and recovered nodes was studied by Zhu et al. [39].
Partial observation. Follow-up work considers a partial observation setting where a randomly-selected fraction of nodes reveal their state [18,40,22,33]. These works do not assume that the infection times are known (see Assumption (A.2)), hence they need a large fraction of the nodes to be sensors (typically more than 30%) to localize the source.
Static sensor placement. Other works address the problem of strategically selecting sensor nodes a-priori, i.e., finding a static sensor placement. In the deterministic setting (see Assumption (B.5)) some works considered the problem of minimizing the budget required for detecting the source. This question is similar to the one we address, except that we allow random transmission delays and, most importantly, we propose an online solution. On trees, under (B.2) and (B.5), the minimization of the number of sensors has been studied [34]. Without (B.2) and (B.4), but with (B.5), approximation algorithms have been developed by Chen et al. [6].
Budgeted sensor placement. In a network of N nodes, the minimal budget required for source-localization can go up to N − 1, in which case the result of Chen et al. is not practical. Hence, researchers have looked into a budgeted version of the problem, i.e., how to place sensors given that only a limited number of them is available. In this direction, "common sense" approaches, e.g., using high-degree vertices, or centrality measures were first evaluated [26,20]. Later, the budgeted optimization problem was solved on trees [5] (B.4). Without (B.4), a heuristic approach, based on the definition of Double Resolving Set of a graph (see Section 2), has been shown to outperform all previous heuristics [30].
Due to budget restrictions, none of the works mentioned above can guarantee exact source localization.
Sequential sensor placement. Working under (B.5) and (B.2), Zejnilovic et al. [35] proposed an algorithm that sequentially places sensors in order to localize the source after the epidemic has spread through the entire network. Adopting very different techniques, we propose a solution that selects the sensors while the epidemic evolves, enhancing both cost- and time-efficiency. Moreover, our approach works without (B.5) and (B.2).
Other related work. Two-stage resource allocation is also studied in the context of robust optimization where, to reach some objective, we allocate a-priori only a part of the resources and another part is deployed, at a higher cost, when more information is available [14]. Another related line of work in the Artificial Intelligence field is that of active learning which studies how one can, based on sparse data, adaptively take a sequence of decisions in order to optimize a given objective [12].