Tools for Data Ingestion

There are multiple sources of data at the industry level: structured data coming from RDBMSs, multi-structured data coming from social media, and streaming, real-time data coming in from back-end servers and weblogs.
As we discussed in the previous segments, data is being produced and consumed at an extremely high rate and will continue to grow exponentially as more companies become data-driven.
To ingest huge volumes of data from all the different types of sources, you can use the following commands and tools:
- File transfer using commands (see the command sketch after this list):
  - ‘distcp’: This command helps you copy large data sets between two Hadoop clusters.
  - ‘put’ and ‘get’: The ‘put’ command helps you copy files from the local file system to HDFS, whereas the ‘get’ command helps you perform the opposite operation.
- Apache Sqoop: Sqoop is short for ‘SQL to Hadoop’. This tool is used for importing data from an RDBMS into Hadoop-based stores (HDFS, Hive, HBase, etc.) and for exporting the processed data back to the RDBMS; an example invocation is sketched after this list.
- Apache Flume: This is a distributed service for collecting, aggregating and transporting large volumes of real-time data (such as log events) from various sources to a centralised store where it can be processed; a sample agent configuration appears after this list.
- Apache Kafka: Kafka is a fast, scalable, distributed publish-subscribe messaging system that can handle large volumes of data; producers publish streams of messages to topics, and consumers read them from those topics (sample console commands appear after this list).
- Apache Gobblin: Gobblin is an open-source data ingestion framework for extracting, transforming and loading large volumes of data from different data sources. It supports both streaming and batch data ecosystems.
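
To make the file-transfer commands above concrete, here is a minimal sketch; the NameNode addresses (nn1, nn2) and all paths are hypothetical placeholders, not part of any real cluster.

```
# Copy a data set between two Hadoop clusters with distcp
# (it runs as a MapReduce job across the cluster).
hadoop distcp hdfs://nn1:8020/data/logs hdfs://nn2:8020/backup/logs

# Copy a local file into HDFS with 'put' ...
hdfs dfs -put /home/analyst/sales.csv /user/hadoop/sales/

# ... and copy a file from HDFS back to the local file system with 'get'.
hdfs dfs -get /user/hadoop/sales/sales.csv /home/analyst/downloads/
```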
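
The sketch below shows how Sqoop import and export jobs are typically invoked on the command line; the JDBC URL, credentials, table names and HDFS directories are illustrative assumptions only.

```
# Import an RDBMS table into HDFS; Sqoop runs parallel map tasks to copy the rows.
sqoop import \
  --connect jdbc:mysql://dbserver:3306/retail \
  --username sqoop_user -P \
  --table customers \
  --target-dir /user/hadoop/customers \
  --num-mappers 4

# Export processed results from HDFS back into an RDBMS table.
sqoop export \
  --connect jdbc:mysql://dbserver:3306/retail \
  --username sqoop_user -P \
  --table customer_summary \
  --export-dir /user/hadoop/customer_summary
```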
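
A Flume agent is described declaratively as a set of sources, channels and sinks. The configuration below is a minimal sketch, assuming a web server whose access log we want to land in HDFS; the agent and component names (a1, r1, c1, k1) and all paths are placeholders.

```
# Write a minimal agent configuration: one source, one channel, one sink.
cat > weblog-agent.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: follow the web server's access log as it grows.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between the source and the sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: deliver the buffered events to a directory in HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/hadoop/weblogs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
EOF

# Start the agent defined above.
flume-ng agent --conf conf --conf-file weblog-agent.conf --name a1
```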
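
The console tools that ship with Kafka give a quick feel for the producer-consumer flow. The sketch below assumes a broker running on localhost:9092 and a hypothetical topic named weblogs; exact flag names vary a little between Kafka versions.

```
# Create a topic to hold incoming weblog events.
kafka-topics.sh --create --topic weblogs \
  --bootstrap-server localhost:9092 \
  --partitions 3 --replication-factor 1

# Publish messages to the topic (type lines on stdin, one message per line).
kafka-console-producer.sh --topic weblogs --bootstrap-server localhost:9092

# Read the messages back from the beginning of the topic.
kafka-console-consumer.sh --topic weblogs --from-beginning \
  --bootstrap-server localhost:9092
```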
Apart from the tools that you have learnt about so far, Apache Storm, Apache Chukwa, Apache Spark and Apache Flink are some other tools that you can use for ingesting data, depending on your requirements.