Types of Data and File Formats
Choosing a particular tool for data ingestion depends on the type of data to be ingested. Even if we are aware of the type of data, each type can be stored in many different file formats, and each format has its advantages and disadvantages.
There are different types of data, which can be categorised as follows:
- Structured data:
- This is organised data that is generally stored in a database.
- It can be stored, entered, queried and analysed efficiently using Structured Query Language (SQL).
- It can be read easily by machines.
- Examples: Financial data, user identification data, etc.
- Unstructured data:
- This is the opposite of structured data. It cannot be stored and organised easily in databases.
- NoSQL databases can be used for storing and processing this type of data.
- By common industry estimates, around 80% of the data being produced today is unstructured.
- Examples: Images, audio, videos, chat messages, etc.
- Semi-structured data:
- This data does not have a predefined schema, unlike structured data.
- It may have an internal structure and markings that help identify separate data elements. However, this structure does not constrain the data the way a schema does in an RDBMS.
- Examples: XML files, JSON files and emails.
- Emails contain various fields, such as ‘From’, ‘To’, ‘Subject’ and ‘Body’. These act as internal tags and markings that identify separate data elements; this enables information grouping and helps in creating hierarchies among the elements. However, these fields do not constrain the data as a schema does in an RDBMS, e.g., the ‘Subject’ and ‘Body’ fields may contain text of any size, which makes it difficult for machines to interpret them.
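The difference between structured and semi-structured data can be illustrated with a small sketch. It uses only the Python standard library; the table name, field names and sample values are made up for illustration and are not taken from any particular system.

```python
import json
import sqlite3
from email import message_from_string

# Structured: the database enforces a fixed schema, and SQL is used to query it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "Asha"), (2, "Ben")],
)
names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]

# Semi-structured: each JSON record carries its own keys,
# and nothing forces two records to have the same ones.
raw_records = ['{"id": 1, "name": "Asha"}', '{"id": 2, "tags": ["new"]}']
records = [json.loads(s) for s in raw_records]
keys_per_record = [sorted(r.keys()) for r in records]

# Semi-structured: an email has tagged header fields,
# but the body is free text of arbitrary length.
raw_email = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Any amount of free text can go here.\n"
)
msg = message_from_string(raw_email)
subject = msg["Subject"]
body = msg.get_payload()

print(names)            # values constrained by the table schema
print(keys_per_record)  # keys differ from record to record
print(subject, len(body))
```

The SQL table rejects rows that do not fit its columns, whereas the JSON records and the email are accepted whatever keys or body length they carry.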
Data is stored in files, and each file has a format. A file format represents the way in which the information is stored or encoded in a computer.
Choosing an appropriate file format is important for efficiency, because the format directly affects the processing load on the system ingesting the data, the network bandwidth needed to carry the data and the storage required for the ingested data.
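As a rough illustration of how the format influences storage and bandwidth, the following sketch writes the same records as CSV and as line-delimited JSON and compares the sizes, with and without gzip compression. The records are synthetic and the exact numbers will vary with the data; only the relative comparison is of interest.

```python
import csv
import gzip
import io
import json

# Synthetic records, made up for this comparison.
rows = [{"id": i, "name": f"user{i}", "balance": i * 10} for i in range(1000)]

# CSV: the column names are written once, in the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "balance"])
writer.writeheader()
writer.writerows(rows)
csv_bytes = buf.getvalue().encode("utf-8")

# Line-delimited JSON: every record repeats its key names.
json_bytes = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

sizes = {
    "csv": len(csv_bytes),
    "json": len(json_bytes),
    "csv_gzip": len(gzip.compress(csv_bytes)),
    "json_gzip": len(gzip.compress(json_bytes)),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

For this data the JSON output is larger than the CSV output because the keys are repeated in every record, and compression reduces both considerably.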
Some of the commonly used file formats that we deal with in the process of data ingestion are as follows:
- CSV (comma-separated values):
- This is the most commonly used file format for exchanging large data sets between Hadoop and external systems.
- It has limited support for schema evolution.
- It does not support block compression.
- XML and JSON:
- XML defines a set of rules that can be used to encode documents in a machine- and a human-readable format.
- JSON is an open-standard file format consisting of key-value pairs.
- Both are essentially text files, so they do not support block compression and are not compact.
- They are difficult to split and cannot easily be processed in parallel, as Hadoop provides no built-in InputFormat for them.
- Sequence file:
- It stores data as key-value pairs in a binary format.
- It is more compact than text files.
- It supports block compression and can be easily processed in parallel.
- Avro:
- This is a language-neutral data serialisation system developed within the Apache Hadoop project.
- It can be easily read after creation, even with a language different from the one that was used to write the file.
- It is a binary format and is therefore compact.
- It is self-describing, compressible and splittable, which makes it suitable for MapReduce jobs.
- It supports schema evolution and block compression.
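To see why block compression goes hand in hand with parallel processing, consider the toy sketch below. It is not the actual on-disk layout of a sequence file or of any other Hadoop format; the functions `write_blocks` and `read_block` are hypothetical helpers written for this illustration. The idea it shows is that records are compressed in independent groups, so that any one group can be decompressed without reading the others.

```python
import zlib


def write_blocks(records, block_size):
    """Compress records in independent blocks of up to `block_size` records."""
    blocks = []
    for start in range(0, len(records), block_size):
        chunk = "\n".join(records[start:start + block_size]).encode("utf-8")
        blocks.append(zlib.compress(chunk))
    return blocks


def read_block(block):
    """Decompress a single block without touching any other block."""
    return zlib.decompress(block).decode("utf-8").split("\n")


# Synthetic records, made up for this illustration.
records = [f"{i},user{i}" for i in range(10)]
blocks = write_blocks(records, block_size=4)

# Each block could be handed to a different worker.
# Here only the last block is read, independently of the rest.
last = read_block(blocks[-1])

# Reading every block in order recovers the original records.
all_records = [r for b in blocks for r in read_block(b)]

print(len(blocks), last)
```

If the whole file were compressed as a single stream instead, a worker would generally have to decompress from the beginning to reach its share of the records, which is why formats without block compression are harder to process in parallel.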