Important Things about Apache Spark
- “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an open-source, distributed computing engine that is productive for data analysis because it processes data in memory and ships with libraries for SQL, streaming, machine learning, and graph processing.
- Why in-memory data processing systems?
  - Real-time data processing: because data held in memory can be read quickly, in-memory processing suits cases where results are needed immediately.
  - Random access to data: data stored in RAM can be read directly at any position, without scanning the storage sequentially.
  - Iterative and interactive operations: intermediate results stay in memory instead of being written to disk, so later computations can reuse them directly.
- The Spark architecture consists of a driver node and worker nodes: the driver coordinates the application and the workers execute its tasks. A cluster manager allocates resources to each of these components.
- RDDs (Resilient Distributed Datasets) are the core data structure in Spark.
- The DataFrame and Dataset APIs are useful for handling structured data.
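
The point about iterative operations can be illustrated with a small toy model. This is a pure-Python sketch, not Spark's API: `ToyDataset`, `cache`, `collect`, and `iterate` are hypothetical names invented for illustration, and the `load` callable only simulates an expensive read from disk. It counts how often the source is scanned when an iterative job runs with and without the data kept in memory.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for a dataset; this is NOT Spark's API."""

    def __init__(self, load: Callable[[], Iterable[int]]):
        self._load = load  # simulates an expensive read from disk
        self._cached: Optional[List[int]] = None
        self.loads = 0  # number of times the source was scanned

    def _read(self) -> List[int]:
        # Every call here represents one full scan of the source.
        self.loads += 1
        return list(self._load())

    def cache(self) -> "ToyDataset":
        # Materialize the data once and keep it in memory.
        if self._cached is None:
            self._cached = self._read()
        return self

    def collect(self) -> List[int]:
        # Serve from memory when cached, otherwise scan the source again.
        return self._cached if self._cached is not None else self._read()


def iterate(ds: ToyDataset, rounds: int) -> int:
    """Run an iterative job that needs the whole dataset in every round."""
    total = 0
    for step in range(1, rounds + 1):
        total += sum(x * step for x in ds.collect())
    return total


uncached = ToyDataset(lambda: range(5))
cached = ToyDataset(lambda: range(5)).cache()

result_uncached = iterate(uncached, rounds=3)
result_cached = iterate(cached, rounds=3)

print(result_uncached, uncached.loads)  # same result, one scan per round
print(result_cached, cached.loads)      # same result, a single scan
```

Both runs compute the same total, but the uncached dataset is scanned once per round while the cached one is scanned only once. That difference is what makes keeping intermediate results in memory attractive for iterative workloads.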