Important Things about Apache Spark

  • “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an open-source, distributed computing engine. Its in-memory execution speed and its support for a wide range of libraries make it a productive environment for data analysis.
  • Why use in-memory data processing systems?
      • Real-time data processing: data held in memory can be read quickly, so in-memory processing suits cases where results are needed immediately.
      • Random access to data: because the data sits in RAM, any part of it can be reached directly, without scanning through the whole of storage.
      • Iterative and interactive operations: intermediate results are kept in memory rather than written to disk, so later computations can reuse them without reading them back.
  • The Spark architecture consists of a driver node and worker nodes. A cluster manager allocates resources to each of these components.
  • RDDs (Resilient Distributed Datasets) are the core data structure in Spark.
  • The DataFrame and Dataset APIs are useful for handling structured data.
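
The point about iterative operations can be made concrete with a small sketch. This is plain Python, not Spark's API: `ToyDataset` and its `evaluations` counter are hypothetical names invented for illustration, and only the method names `map`, `cache`, and `collect` are borrowed from the RDD API. It shows a lazily evaluated collection whose intermediate result is either recomputed on every use or kept in memory and reused.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD: lazy, and optionally cached in memory."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute                    # recipe for producing the data
        self._cached: Optional[List[int]] = None   # in-memory copy, if any
        self._use_cache = False
        self.evaluations = 0                       # times the recipe actually ran

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Transformation: returns a new lazy dataset; nothing is computed yet.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self) -> "ToyDataset":
        # Ask for the result to be kept in memory after the first evaluation.
        self._use_cache = True
        return self

    def collect(self) -> List[int]:
        # Action: triggers evaluation, reusing the in-memory copy when present.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


source = ToyDataset(lambda: list(range(5)))

# Cached: two actions, but the squaring step runs only once.
squared = source.map(lambda x: x * x).cache()
total = sum(squared.collect())
largest = max(squared.collect())

# Not cached: the same two actions recompute the squaring step each time.
uncached = source.map(lambda x: x * x)
uncached_total = sum(uncached.collect())
uncached_largest = max(uncached.collect())
```

After this runs, `squared.evaluations` is 1 while `uncached.evaluations` is 2. In a real cluster, the recomputation that caching avoids can mean re-reading input from disk or over the network, which is where in-memory reuse pays off for iterative workloads.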

Tech enthusiastic, life explorer, single, motivator, blogger, writer, software engineer

TechGuy
