Spark APIs

  • Unstructured data is free-form content, most often text, that lacks a schema (a schema defines how the data is organised).
  • Examples of such data include text files, log files, images, videos, etc.
  • To deal with unstructured data, Spark provides a low-level, unstructured API: the Resilient Distributed Dataset (RDD).
  • The RDD is Spark's core abstraction. It holds records without an imposed schema, which makes it a natural fit for unstructured data.
  • Structured data has a schema.
  • The data can be stored in a columnar or a row-based format.
  • Examples of structured data include ORC files, Parquet files, SQL tables, and DataFrames in languages such as Python.
  • To deal with this type of data, Spark provides multiple structured APIs, including Spark SQL, DataFrames and Datasets.
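
The difference between the two API styles can be sketched without a Spark installation. The snippet below is a plain-Python analogy, not Spark code: the sample log lines, the `LogRow` type and its field names are all made up for illustration. The first half treats each record as an opaque string that the program must parse itself, which is roughly how RDD code works. The second half declares a schema up front and refers to fields by name, which is roughly how DataFrame code works.

```python
from collections import Counter
from typing import NamedTuple

# Unstructured input: raw log lines with no schema (hypothetical sample data).
raw_lines = [
    "2024-01-01 ERROR disk full",
    "2024-01-01 INFO job started",
    "2024-01-02 ERROR timeout",
]

# RDD style: each record is an opaque string, so the program parses it itself.
# In PySpark this would be roughly a filter(...) followed by a map(...) on an RDD.
error_lines = filter(lambda line: " ERROR " in line, raw_lines)
error_dates = list(map(lambda line: line.split(" ", 1)[0], error_lines))


# Structured input: every record follows a declared schema (assumed field names).
class LogRow(NamedTuple):
    date: str
    level: str
    message: str


rows = [LogRow(*line.split(" ", 2)) for line in raw_lines]

# DataFrame style: operations refer to columns by name instead of parsing text.
# In PySpark this would be roughly a filter on a column, then groupBy("date").count().
errors_per_date = Counter(row.date for row in rows if row.level == "ERROR")

print(error_dates)
print(dict(errors_per_date))
```

In real Spark the same split applies: with an RDD the parsing logic lives in your own functions, while with a DataFrame or Spark SQL the schema lets Spark know the column names and types ahead of time.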

TechGuy
