Types of Data and File Formats

  • Structured data:
  • This is organised data that is generally stored in a database.
  • It can be stored, entered, queried and analysed efficiently using Structured Query Language (SQL).
  • It can be read easily by machines.
  • Examples: Financial data, user identification data, etc.
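To illustrate why structured data is easy to query, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and values are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (user_id, amount) VALUES (?, ?)",
    [(1, 120.0), (1, 30.5), (2, 99.0)],
)

# Every row follows the same schema, so an aggregate is a single SQL query.
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM transactions "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(totals)
conn.close()
```

Because the schema fixes the columns and their types up front, the database can group and sum the rows without any custom parsing code.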
  • Unstructured data:
  • This is the opposite of structured data. It cannot be stored and organised easily in traditional relational databases.
  • NoSQL databases can be used for storing and processing this type of data.
  • By a commonly cited estimate, around 80% of the data being produced today is unstructured in nature.
  • Examples: Images, audio, videos, chat messages, etc.
  • Semi-structured data:
  • This data does not have a predefined schema, unlike structured data.
  • It may have an internal structure and markings that help identify separate data elements. However, unlike the schema in an RDBMS, this structure does not constrain the data.
  • Examples: XML and JSON files. Emails are also a good example of semi-structured data.
  • Emails contain various fields, such as ‘From’, ‘To’, ‘Subject’ and ‘Body’. These tags and markings identify separate data elements; this enables information grouping and helps in creating hierarchies among the elements. However, this structure does not constrain the data as an RDBMS schema would, e.g., the ‘Subject’ and ‘Body’ fields may contain text of any size, which makes them difficult for machines to interpret.
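The email example can be sketched with Python's standard email module. The addresses and message text below are made up for illustration; the point is that the header fields are easy to pull out, while the body remains free text.

```python
from email import message_from_string

# A made-up message; the addresses and text are purely illustrative.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Bob,\nThe report is attached.\n"
)

msg = message_from_string(raw)

# The header markings identify separate data elements...
fields = {name: msg[name] for name in ("From", "To", "Subject")}

# ...but the body is unconstrained text of any length.
body = msg.get_payload()

print(fields)
print(len(body))
```

The headers give the message an internal structure, yet nothing limits what the ‘Subject’ or body may contain, which is what makes the data semi-structured rather than structured.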
  • Text/CSV:
  • CSV refers to comma-separated values.
  • This is one of the most commonly used file formats for exchanging large data sets between Hadoop and external systems.
  • It has limited support for schema evolution.
  • It does not support block compression.
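The limited support for schema evolution can be seen in a small sketch with Python's csv module. The column names and the `read_rows` helper are hypothetical. A CSV file carries only a header row, so when a column is added later, the reading code has to supply the missing value for older files itself.

```python
import csv
import io

# Two versions of the "same" data set; the newer one gained a 'city' column.
old_file = "id,name\n1,Asha\n"
new_file = "id,name,city\n2,Ravi,Pune\n"

def read_rows(text):
    """Read CSV text into dicts, patching in the column older files lack."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # CSV cannot declare a default value, so the reader must add one.
        row.setdefault("city", None)
        rows.append(dict(row))
    return rows

print(read_rows(old_file))
print(read_rows(new_file))
```

Note also that every value comes back as a string: the file stores no type information, so any conversion to numbers or dates is left to the reader.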
  • XML and JSON:
  • XML stands for Extensible Markup Language, and JSON stands for JavaScript Object Notation.
  • XML defines a set of rules that can be used to encode documents in a machine- and a human-readable format.
  • JSON is an open-standard file format consisting of key-value pairs.
  • Being text files, they do not support block compression and are not compact.
  • They are difficult to split and cannot easily be processed in parallel, as no built-in InputFormat is available in Hadoop.
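As a rough illustration of the two formats, the sketch below encodes the same record as JSON and as XML using only Python's standard library, then reads both back. The record, its field names and the `user` element name are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical record; values are kept as strings so both formats match.
record = {"id": "7", "name": "Asha"}

# JSON: key-value pairs.
json_text = json.dumps(record)

# XML: one child element per field under a 'user' root element.
root = ET.Element("user")
for key, value in record.items():
    ET.SubElement(root, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

# Both encodings are plain text and can be parsed back into the same fields.
from_json = json.loads(json_text)
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}

print(json_text)
print(xml_text)
```

Both outputs are readable by people and machines alike, but each record repeats its field names as text, which is one reason these formats are not compact.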
  • Sequence file:
  • It stores data as key-value pairs in a binary format.
  • It is more compact than text files.
  • It supports block compression and can be easily processed in parallel.
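The idea of binary key-value records can be sketched in a few lines of Python. This is only an illustration of length-prefixed binary records, not the actual Hadoop SequenceFile layout; the helper names `write_records` and `read_records` are hypothetical.

```python
import struct

def write_records(pairs):
    """Serialise (key, value) byte pairs as length-prefixed binary records."""
    out = bytearray()
    for key, value in pairs:
        # Two 4-byte big-endian lengths, followed by the raw key and value.
        out += struct.pack(">II", len(key), len(value))
        out += key + value
    return bytes(out)

def read_records(blob):
    """Read back the (key, value) pairs produced by write_records."""
    pairs, offset = [], 0
    while offset < len(blob):
        key_len, value_len = struct.unpack_from(">II", blob, offset)
        offset += 8
        key = blob[offset:offset + key_len]
        offset += key_len
        value = blob[offset:offset + value_len]
        offset += value_len
        pairs.append((key, value))
    return pairs

# Keys are 8-byte integers and values are 8-byte floats, stored as raw bytes.
pairs = [(struct.pack(">q", i), struct.pack(">d", i * 1.5)) for i in range(3)]
blob = write_records(pairs)
restored = read_records(blob)
print(len(blob), restored == pairs)
```

Since each record states its own length, a reader can skip from record to record without scanning for delimiters, which is part of what makes binary record formats convenient to process.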
  • Avro:
  • This is a language-neutral data serialisation system developed within the Apache Hadoop project.
  • Avro files can be easily read after creation, even by a program written in a language different from the one that was used to write the file.
  • Avro files are binary and are therefore compact.
  • They are self-describing (the schema is stored with the data), compressible and splittable, which makes this file format suitable for MapReduce jobs.
  • It supports schema evolution and block compression.
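Schema evolution can be illustrated with a small plain-Python sketch. It does not use the Avro library or its API; the dict-based schema, the `REQUIRED` marker and the `resolve` function are hypothetical stand-ins that mimic the general idea: a newer reader schema supplies defaults for fields that older records lack and ignores fields it does not know.

```python
# Marker for fields that have no default (hypothetical, not an Avro construct).
REQUIRED = object()

# Version 2 of a reader schema: 'city' was added later, with a default value.
reader_schema_v2 = {"id": REQUIRED, "name": REQUIRED, "city": "unknown"}

def resolve(record, reader_schema):
    """Shape a record to the reader schema, filling defaults where needed."""
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is REQUIRED:
            raise ValueError(f"missing field with no default: {field}")
        else:
            resolved[field] = default
    # Fields present in the record but absent from the schema are dropped.
    return resolved

old_record = {"id": 1, "name": "Asha"}                            # written before 'city'
newer_record = {"id": 2, "name": "Ravi", "city": "Pune", "age": 30}

print(resolve(old_record, reader_schema_v2))
print(resolve(newer_record, reader_schema_v2))
```

With this approach, data written under an older schema stays readable after the schema changes, which is much harder to achieve with a plain CSV file.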

TechGuy