Daniel Suo

Scientific progress goes 'boink'

Ph.D. Candidate
Princeton University
Department of Computer Science

Resilient Distributed Datasets (Spark)

A fault-tolerant abstraction for in-memory cluster computing

Overview

Enables

  • Interactive data manipulation
  • Iterative computations

By

  • Keeping working sets in memory across the cluster for performance (a restricted form of shared memory, not general distributed shared memory)
  • Using coarse-grained transformations for fault-tolerance
    • Each dataset records its lineage (the transformations used to build it), so a lost in-memory partition can be recomputed rather than restored from a replica
  • Offering a richer set of operators than MapReduce (e.g., filter, join, and groupByKey in addition to map and reduce)
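
The lineage idea above can be sketched in a few lines of plain Python. This is a toy, single-process model and not the Spark API: the class ToyRDD and its lose_cache method are hypothetical names made up for illustration, and real RDDs are partitioned across machines. The sketch only shows that a dataset which remembers its parent and the coarse-grained function applied to it can rebuild a lost in-memory copy without replication.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data, standing in for stable storage
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # coarse-grained function applied to the whole parent
        self.cache = None     # in-memory copy, which may be lost on failure

    def map(self, f):
        # Record the transformation instead of running it immediately.
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Compute on demand; if the cache is missing, replay the lineage.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.fn(self.parent.collect())
        return list(self.cache)

    def lose_cache(self):
        # Simulate a node failure that wipes the in-memory copy.
        self.cache = None


base = ToyRDD(source=range(10))
squares_of_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = squares_of_evens.collect()      # computed and kept in memory
squares_of_evens.lose_cache()           # simulated failure
recovered = squares_of_evens.collect()  # rebuilt from lineage, no replica needed
```

Because each transformation applies to the whole dataset, the lineage stays small (one recorded function per step) no matter how many records the dataset holds.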

Limitations

  • Memory-intensive
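  • Communication among nodes is difficult to express outside the provided abstractions
  • Poor fit for applications that make asynchronous, fine-grained updates to shared state (e.g., the storage system for a web application or an incremental web crawler)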
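
To see why fine-grained updates fit poorly, consider what changing a single record looks like when datasets are immutable and only whole-dataset transformations are available. The following is a rough sketch in plain Python, not Spark code; update_one and the page-rank style records are hypothetical and exist only for this illustration.

```python
def update_one(records, key, new_value):
    """Return a NEW dataset with one key changed, plus how many records were visited."""
    visited = 0
    out = []
    for k, v in records:
        visited += 1  # a coarse-grained transformation touches every record
        out.append((k, new_value if k == key else v))
    return tuple(out), visited


# An immutable dataset of 1000 (key, value) records.
ranks = tuple((f"page{i}", 1.0) for i in range(1000))

# Changing one value still means a pass over all records and a new dataset.
new_ranks, visited = update_one(ranks, "page7", 2.5)
```

A system built around mutable, fine-grained state would write the one record in place; under this model every such update costs a full pass, which is why workloads dominated by small asynchronous writes are better served by other systems.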