Infoscience

conference paper

Scrub: Online TroubleShooting for Large Mission-Critical Applications

Satish, Arjun
•
Shiou, Thomas
•
Zhang, Chuck
April 23, 2018
EuroSys '18: Proceedings of the Thirteenth EuroSys Conference
Eurosys '18

Scrub is a troubleshooting tool for distributed applications that operate under the strict SLOs common in production environments. It allows users to formulate queries on events occurring during execution in order to assess the correctness of the application’s operation. Scrub has been in use for two years at Turn, where developers and users have relied on it to resolve numerous issues in its online advertisement bidding platform. This platform spans thousands of machines across the globe, serving several million bid requests per second, and dispensing many millions of dollars in advertising budgets. Troubleshooting distributed applications is notoriously hard, and its difficulty is exacerbated by the presence of strict SLOs, which require the troubleshooting tool to have only minimal impact on the hosts running the application. Furthermore, with large amounts of money at stake, users expect to be able to run frequent diagnostics and demand quick evaluation and remediation of any problems. These constraints have led to a number of design and implementation decisions that run counter to conventional wisdom. In particular, Scrub supports only a restricted form of joins. Its query execution strategy eschews imposing any overhead on the application hosts. Specifically, joins, group-by operations and aggregations are sent to a dedicated centralized facility. In terms of implementation, Scrub avoids the overhead and security concerns of dynamic instrumentation. Finally, at all levels of the system, accuracy is traded for minimal impact on the hosts. We present the design and implementation of Scrub and contrast its choices with those made in earlier systems. We illustrate its power by describing a number of use cases, and we demonstrate its negligible overhead on the underlying application. On average, we observe a maximum CPU overhead of up to 2.5% on application hosts and a 1% increase in request latency. These overheads allow the advertisement bidding platform to operate well within its SLOs.
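
The split described in the abstract, with only cheap work on application hosts and group-by and aggregation on a central facility, might be sketched roughly as follows. This is an illustrative sketch, not Scrub's query language or API: the `Event` fields, the `host_side` and `central_side` functions, and the use of random sampling as the way accuracy is traded for low host impact are all assumptions made for the example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event schema; the real event format is not given in the abstract.
    host: str
    request_id: int
    kind: str
    latency_ms: float


def host_side(events, predicate, fields, sample_rate, rng):
    """Cheap per-host work only: sample, filter, project. No joins or aggregation."""
    for ev in events:
        # Assumed mechanism: drop events at random to bound host overhead.
        if rng.random() >= sample_rate:
            continue
        if predicate(ev):
            yield tuple(getattr(ev, name) for name in fields)


def central_side(rows, key_index, value_index):
    """Group-by and aggregation, run away from the application hosts."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
    for row in rows:
        acc = totals[row[key_index]]
        acc[0] += 1
        acc[1] += row[value_index]
    return {key: (n, total / n) for key, (n, total) in totals.items()}


events_by_host = {
    "h1": [
        Event("h1", 1, "bid", 4.0),
        Event("h1", 2, "bid", 6.0),
        Event("h1", 3, "impression", 1.0),
    ],
    "h2": [Event("h2", 4, "bid", 10.0)],
}

rng = random.Random(0)
rows = [
    row
    for events in events_by_host.values()
    # sample_rate=1.0 keeps every event, so the result below is exact.
    for row in host_side(events, lambda e: e.kind == "bid",
                         ("host", "latency_ms"), 1.0, rng)
]

# Per-host (count, mean latency) for bid events, computed centrally.
summary = central_side(rows, key_index=0, value_index=1)
```

Lowering `sample_rate` in this sketch reduces the work done per host at the cost of approximate counts and means, which is one way to picture the accuracy-for-overhead trade the abstract mentions.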

Type
conference paper
DOI
10.1145/3190508.3190513
Author(s)
Satish, Arjun
Shiou, Thomas
Zhang, Chuck
Elmeleegy, Khaled
Zwaenepoel, Willy  
Date Issued

2018-04-23

Published in
EuroSys '18: Proceedings of the Thirteenth EuroSys Conference
Total of pages

15

Start page

5

Subjects
Scrub • Advertising • Mission Critical • Big Data • Query Processing • Troubleshooting • Debugging • Distributed Systems

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LABOS  
Event name
Eurosys '18

Event place
Porto, Portugal

Event date
April 23-26, 2018

Available on Infoscience
March 13, 2018
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/145518