Spark provides two kinds of APIs, one for unstructured data and one for structured data:
- Unstructured data lacks a schema (a definition of how the data is organised); it is often free-form text.
- Examples of such data include text files, log files, images, videos, etc.
- To deal with unstructured data, Spark uses an unstructured API in the form of a Resilient Distributed Dataset (RDD).
- The RDD is Spark's core data abstraction, and the higher-level structured APIs are built on top of it.
- Structured data comes with a schema that defines its fields and their types.
- The data can be stored in a columnar or a row-based format.
- Examples of structured data include ORC files, Parquet files, SQL tables, and dataframes in languages such as Python.
- To deal with this type of data, Spark provides several structured APIs: Spark SQL, DataFrame and Dataset.
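
The difference a schema makes can be illustrated with a minimal sketch in plain Python. It deliberately does not call Spark, so it runs without a cluster; the log lines, the `LogRow` class and the `parse` helper are hypothetical names invented for this illustration. The first half treats each record as an opaque string, roughly the style of processing an RDD allows, and the second half works on rows with named, typed fields, roughly the style the structured APIs allow.

```python
from dataclasses import dataclass

# Unstructured input: raw log lines with no declared schema (made-up sample data).
raw_lines = [
    "2024-01-01 ERROR disk full",
    "2024-01-01 INFO job started",
    "2024-01-02 ERROR timeout",
]

# RDD-style processing: each record is an opaque value, so the program
# has to supply its own logic for interpreting it.
error_lines = [line for line in raw_lines if " ERROR " in line]
error_count = len(error_lines)


# Structured input: every row follows a declared schema (field names and types).
@dataclass(frozen=True)
class LogRow:
    day: str
    level: str
    message: str


def parse(line: str) -> LogRow:
    """Split a raw line into the three fields of the hypothetical schema."""
    day, level, message = line.split(" ", 2)
    return LogRow(day, level, message)


rows = [parse(line) for line in raw_lines]

# DataFrame-style processing: the query refers to fields by name.
errors_per_day = {}
for row in rows:
    if row.level == "ERROR":
        errors_per_day[row.day] = errors_per_day.get(row.day, 0) + 1

print(error_count)      # 2
print(errors_per_day)   # {'2024-01-01': 1, '2024-01-02': 1}
```

In Spark itself, the schema plays the same role at a larger scale: because the engine knows the field names and types, queries can be written against columns instead of hand-written parsing functions.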