Spark APIs

  • Unstructured data is data that lacks a schema (a definition of how the data is organised); much of it is free-form text.
  • Examples of such data include text files, log files, images, videos, etc.
  • To work with unstructured data, Spark provides a low-level, unstructured API: the Resilient Distributed Dataset (RDD).
  • The RDD is Spark's core data abstraction: a distributed collection of records that carries no schema.
  • Structured data includes a schema.
  • It can be stored in a columnar or a row-oriented format.
  • Examples include ORC files, Parquet files, SQL tables, and dataframes in languages such as Python.
  • To work with structured data, Spark provides several structured APIs, including Spark SQL, DataFrame and Dataset.



