Tools for Hadoop

The Hadoop ecosystem also includes many tools, each suited to a different use case. Some of these are as follows:
- Data Ingestion Tools:
  - Flume: Apache Flume is a system for collecting, aggregating and transporting large volumes of streaming data, such as log files and events, from various sources into HDFS.
  - Sqoop: Sqoop gets its name from "SQL-to-Hadoop" and is a command-line application used for transferring data between relational databases (RDBMSs), such as Oracle and MySQL, and Hadoop (a sketch of a typical import follows this list).
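
To make the ingestion workflow concrete, here is a minimal sketch that shells out to the `sqoop import` command from Python. The host, database, credentials and paths are hypothetical placeholders, but the flags shown (`--connect`, `--table`, `--target-dir`, and so on) are standard Sqoop import options.

```python
import subprocess

# Hypothetical connection details; replace with your own RDBMS host,
# database, credentials and table.
sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl_user/.sqoop_pw",  # avoids a plaintext --password
    "--table", "orders",
    "--target-dir", "/data/raw/orders",  # HDFS destination directory
    "--num-mappers", "4",                # parallel import tasks
]

# Run the import and raise if Sqoop exits with a non-zero status.
subprocess.run(sqoop_cmd, check=True)
```
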
- NoSQL Databases:
  - HBase: Apache HBase is a popular and highly efficient column-oriented NoSQL database built on top of HDFS. It scales horizontally and provides real-time read/write access to large data sets.
  - Cassandra: Apache Cassandra is an open-source, distributed, wide-column NoSQL database management system. It is designed to handle large amounts of data across many commodity machines while providing high availability and fault tolerance, with no single point of failure in the system (see the client sketch below).
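
As an illustration of the NoSQL side, below is a minimal sketch using the DataStax `cassandra-driver` Python client against a hypothetical single local node; the keyspace, table and column names are illustrative only. In production you would list several contact points so the client itself has no single point of failure.

```python
from cassandra.cluster import Cluster

# Connect to a hypothetical local Cassandra node.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Create an illustrative keyspace and a wide-column table.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        device_id text, ts timestamp, reading double,
        PRIMARY KEY (device_id, ts)
    )
""")

# Insert a row, then read it back.
session.execute(
    "INSERT INTO demo.events (device_id, ts, reading) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)
for row in session.execute(
        "SELECT * FROM demo.events WHERE device_id = %s", ("sensor-1",)):
    print(row.device_id, row.ts, row.reading)

cluster.shutdown()
```
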
- High-Level Languages:
  - Pig: Apache Pig is a high-level data-flow platform for executing Hadoop MapReduce programs. Its scripting language, Pig Latin, lets you express MapReduce jobs in a fraction of the code required in standard Java.
  - Hive: Hive is an ETL and data warehousing tool built on top of HDFS. It provides an SQL-like interface for functions such as data summarisation and analysis, as well as for querying databases and file systems in Hadoop.
  - SparkSQL: Spark SQL is a Spark module for structured data processing that brings native SQL support to Spark (a query sketch follows this list).
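
Since Hive and Spark SQL expose the same SQL-like interface, a single PySpark sketch can illustrate both. This minimal example assumes a Hive-registered table (the hypothetical `sales.orders` below) and runs an aggregate query against it:

```python
from pyspark.sql import SparkSession

# Build a session with Hive support so SQL can run against Hive tables.
spark = (SparkSession.builder
         .appName("hive-sql-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table; any table registered in the Hive metastore works here.
spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales.orders
    GROUP BY region
    ORDER BY total_sales DESC
""").show()

spark.stop()
```
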
- Predictive Analysis:
  - Mahout: Apache Mahout is a distributed linear algebra framework primarily focussed on providing scalable, distributed implementations of machine learning (ML) models such as linear regression and clustering.
  - Spark ML: Spark ML is an extension of the core Spark API that provides a uniform set of high-level APIs for creating and tuning practical ML pipelines (a pipeline sketch follows this list).
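
To show what such a pipeline looks like in practice, here is a minimal Spark ML sketch that chains a tokeniser, a term-frequency hasher and a logistic-regression classifier; the tiny training set and its labels are made up purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

# Toy in-memory training data (text and labels are illustrative).
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("hadoop disk io", 0.0),
     ("spark ml pipelines", 1.0), ("mapreduce batch job", 0.0)],
    ["text", "label"],
)

# A practical pipeline: tokenise -> hash term frequencies -> fit a classifier.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)

# Score unseen text with the fitted pipeline.
test = spark.createDataFrame([("spark streaming pipelines",)], ["text"])
model.transform(test).select("text", "prediction").show()

spark.stop()
```
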
- Real-Time Analysis:
  - Spark Streaming: Spark Streaming is an extension of the core Spark API that enables scalable, fault-tolerant stream processing of live data (a word-count sketch follows this list).
  - Flink: Apache Flink is an open-source, distributed stream-processing framework for stateful computations over both bounded and unbounded data streams.
  - Kafka: Apache Kafka is an open-source distributed event-streaming platform, originally developed at LinkedIn and later open-sourced through the Apache Software Foundation.
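
The classic real-time example is a streaming word count. Below is a minimal sketch using Spark Streaming's DStream API (the original API; newer applications typically use Structured Streaming instead), assuming a hypothetical text source on a local socket, e.g. one started with `nc -lk 9999`:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Process the live stream in one-second micro-batches.
sc = SparkContext(appName="streaming-wordcount-sketch")
ssc = StreamingContext(sc, batchDuration=1)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts to the driver log

ssc.start()
ssc.awaitTermination()
```
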
- Scheduling:
  - Oozie and Airflow: Oozie and Airflow are scalable, reliable, server-based workflow schedulers for managing Hadoop jobs. They can pipeline all sorts of programs in the desired order, and those programs can also be scheduled to run at particular times (a minimal DAG sketch follows).
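
As a scheduling illustration, here is a minimal Airflow DAG sketch (assuming Airflow 2.4+, where the `schedule` parameter is available) that chains a hypothetical Sqoop import and a Hive transformation; the connection string, paths and task names are all placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical daily pipeline: ingest with Sqoop, then transform with Hive.
with DAG(
    dag_id="daily_hadoop_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # requires Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="sqoop_import",
        bash_command="sqoop import --connect jdbc:mysql://db.example.com/sales "
                     "--table orders --target-dir /data/raw/orders",
    )
    transform = BashOperator(
        task_id="hive_aggregate",
        bash_command="hive -f /opt/etl/aggregate_orders.hql",
    )

    ingest >> transform  # run the transform only after ingestion succeeds
```
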