Saturday, January 07, 2017

Hadoop - Data Ingestion Using Flume & Sqoop

Had some fun today experimenting with Flume and Sqoop to ingest data into HDFS. Here's a brief summary of my approach -



Flume is an optional service in the Hadoop ecosystem (unlike the core HDFS & YARN components), used to ingest unstructured data (logs, social media feeds) into HDFS.

I wanted to leverage Flume's spooling directory (spooldir) source to copy data from a staging area on my local file system to a directory within HDFS.

My sandbox - A CentOS 6.5 based VM provided by Edureka that came with Hadoop, Flume and Sqoop pre-installed. 

Methodology -

Step 1. Create a sample text file in the /home directory that would be copied to HDFS. 
Useful URLs -
http://www.thegeekstuff.com/2010/09/linux-file-system-structure/?utm_source=tuicool
http://www.howtogeek.com/199687/how-to-quickly-create-a-text-file-using-the-command-line-in-linux/
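
For Step 1, something along these lines works - the staging folder path is just a placeholder for whatever local directory your sample.conf points Flume at, and the file name matches the one Flume later renamed:
mkdir -p /home/edureka/flume_spool
echo "hello from the flume spooldir test" > /home/edureka/flume_spool/sample_flume.txt
cat /home/edureka/flume_spool/sample_flume.txt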

Step 2. Create a directory within HDFS to store the data ingested by Flume -
hadoop fs -mkdir /data/flume
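
A quick listing confirms the directory is in place (if the parent /data folder does not exist yet, Hadoop 2.x needs -mkdir -p to create both levels in one go):
hadoop fs -ls /data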

Step 3. Obtain the standard flume configuration file format - 
https://flume.apache.org/FlumeUserGuide.html

Step 4. Use the above template to create a sample.conf file -
https://drive.google.com/open?id=0B1xeTI1i_SxtTndqTUZfNEhMUm8
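
I won't paste my exact file here, but a minimal spooldir-to-HDFS agent config follows the shape below. The agent name a1 matches the --name flag in Step 6 and the HDFS path matches the directory from Step 2; the spoolDir value is a placeholder for your local staging folder:

# Name the source, sink and channel for agent a1
a1.sources = src1
a1.sinks = snk1
a1.channels = ch1

# Spooling directory source - watches a local folder for new files
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /home/edureka/flume_spool

# HDFS sink - writes the events into the directory created in Step 2
a1.sinks.snk1.type = hdfs
a1.sinks.snk1.hdfs.path = /data/flume
a1.sinks.snk1.hdfs.fileType = DataStream

# In-memory channel buffering events between source and sink
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 1000
a1.channels.ch1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.src1.channels = ch1
a1.sinks.snk1.channel = ch1

Without the fileType = DataStream line the HDFS sink writes SequenceFiles by default, which look garbled if you cat the output expecting plain text.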

Step 5. Navigate to the Flume directory -
cd /usr/lib/flume-ng/apache-flume-1.4.0-bin/bin

Step 6. Execute the Flume Agent (--conf-file points at the sample.conf from Step 4, --name must match the agent name used inside that file, and the -D option sends Flume's own logs to the console) -
./flume-ng agent --conf conf --conf-file /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/sample.conf --name a1 -Dflume.root.logger=INFO,console

Step 7. After the logs indicate that the file has been copied to HDFS, do a Ctrl + C to cancel the Flume process.

My text file was renamed to 'sample_flume.txt.COMPLETED' - the suffix is how the spooldir source marks files it has finished processing.
I could also see the ingested file in HDFS using -
hadoop fs -ls /data/flume/FlumeData.148730432550
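
To look at the contents rather than just the listing, something like this should do the trick (the FlumeData file name is a timestamp, so it will differ on every run):
hadoop fs -cat /data/flume/FlumeData.*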





Sqoop is also an optional service in the Hadoop ecosystem. It is used to transfer structured data between HDFS and relational databases (RDBMS), and it works in both directions - imports into HDFS as well as exports back out to the database.

My plan was to transfer a table from a MySQL database on my sandbox to a Hive table sitting on HDFS.

Methodology -

Step 1. Install MySQL on my host & create a sample table.
Useful URLs -
https://support.rackspace.com/how-to/installing-mysql-server-on-centos/
https://www.linode.com/docs/databases/mysql/how-to-install-mysql-on-centos-6

SQL Commands -
show databases;
use test;
show tables;
create table family(Name VARCHAR(100), Age INT);
insert into family values('Anand',35);
insert into family values('Kalpana',40);
select * from family;
quit;
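
One note - the commands above assume a database named test already exists (older MySQL installs ship with one); if yours doesn't, create it first:
create database test;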

Step 2. Navigate to the Sqoop binary directory -
cd /usr/lib/sqoop-1.4.4/bin

Step 3. Run the import (--hive-import both creates the family_demo table in Hive and loads the data into it, while -m 1 forces a single mapper - necessary here because the family table has no primary key for Sqoop to split on) -
sqoop import --connect jdbc:mysql://localhost:3306/test --username=root --password= --table=family --hive-import --hive-table=family_demo --target-dir=/data/sqoop -m 1

Step 4. Check the results by reviewing the /data/sqoop directory and by querying Hive -
hadoop fs -ls /data/sqoop

Hive queries (run from the hive shell) -
show tables;
select * from family_demo;
quit;
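
Since Sqoop also works in the other direction, pushing the data from HDFS back into MySQL would look roughly like the command below. This is just a sketch I haven't run - family_copy stands in for an empty MySQL table created to receive the rows, and --export-dir should point at wherever the delimited files actually sit in HDFS:
sqoop export --connect jdbc:mysql://localhost:3306/test --username=root --password= --table=family_copy --export-dir=/data/sqoop -m 1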
