Canopy: An End-to-End Performance Tracing And Analysis System
Jonathan Kaldor, Jonathan Mace, Joe O’Neill, Kian Win Ong, Vinod Venkataraman, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Bill Schaller, Pingjia Shan, Brendan Viscomi, Kaushik Veeraraghavan, Yee Jiun Song
- What is the problem? Analyzing performance in large-scale distributed systems is challenging, especially when the same system is used in many different ways.
- Why is it important? Understanding and resolving performance issues is critical as services grow increasingly complex.
- What is the approach? Provide an end-to-end performance tracing infrastructure that can trace causally related performance data from client application to backend service.
- What is the result? Canopy is in production at Facebook, where it processes over 1 billion traces per day.
- Three main challenges
- Performance data is heterogeneous: how to collect, how to consume?
- Granularity mismatch between raw data and analysis: how to conduct high-level analyses on immense amounts of raw data?
- Tracing infrastructure must serve many users
- What happens if the sub-system causing the problem hasn’t been instrumented for Canopy yet?
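
To make the approach above more concrete (tracing causally related performance data from client application to backend service), here is a minimal sketch of how a trace context might travel with a request. It is not Canopy's actual API: `TraceContext`, `Event`, `record`, `client_request`, `backend_handle`, and `causal_chain` are hypothetical names invented for illustration, and the in-memory `COLLECTED` list stands in for whatever collection pipeline a real system would use.

```python
import itertools
import uuid
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical in-memory sink; a real system would ship events to a backend.
COLLECTED: List["Event"] = []
_event_ids = itertools.count(1)


@dataclass
class Event:
    trace_id: str             # shared by every event of one request
    event_id: int             # unique per event
    parent_id: Optional[int]  # the event that causally precedes this one
    component: str
    label: str


@dataclass
class TraceContext:
    trace_id: str
    last_event_id: Optional[int] = None


def record(ctx: TraceContext, component: str, label: str) -> TraceContext:
    """Log one event and return a context pointing at it as the new parent."""
    event = Event(ctx.trace_id, next(_event_ids), ctx.last_event_id, component, label)
    COLLECTED.append(event)
    return TraceContext(ctx.trace_id, event.event_id)


def backend_handle(ctx: TraceContext) -> TraceContext:
    # The backend receives the context with the request and keeps extending it.
    ctx = record(ctx, "backend", "request_received")
    ctx = record(ctx, "backend", "response_sent")
    return ctx


def client_request() -> str:
    ctx = TraceContext(trace_id=uuid.uuid4().hex)
    ctx = record(ctx, "client", "request_sent")
    ctx = backend_handle(ctx)  # context is propagated across the boundary
    record(ctx, "client", "response_received")
    return ctx.trace_id


def causal_chain(trace_id: str) -> List[Tuple[str, str]]:
    """Rebuild the ordered chain of events for one trace from parent links."""
    events = [e for e in COLLECTED if e.trace_id == trace_id]
    by_parent = {e.parent_id: e for e in events}
    chain = []
    current = by_parent.get(None)  # the root event has no parent
    while current is not None:
        chain.append((current.component, current.label))
        current = by_parent.get(current.event_id)
    return chain
```

Calling `client_request()` and passing the returned ID to `causal_chain` yields the client and backend events in causal order. In this sketch, a component that never calls `record` simply leaves no events in the trace, which is one way to picture the uninstrumented sub-system question raised above.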