Thursday, January 26, 2017

Watching Moneyball

In keeping with my quest to explore what data is capable of, I felt like watching the movie Moneyball today.

I found a streaming URL for it on Putlockerr. However, my Internet connection started acting up: the video kept buffering every few minutes, which made the whole viewing experience jerky and inconsistent. So I decided to download the movie first and then watch it in peace. Here's how I went about it -

1) Installed the extension 'Flash Video Downloader' in Google Chrome. It detected the media file on the Putlockerr page and started downloading it.

2) This is where I ran into my next problem - the download was too slow. I installed 'Download Accelerator Plus', fed it the movie download URL from the extension and, lo and behold, had the entire movie downloaded in a fraction of the time it would otherwise have taken.

By the time I was done, it was too late to watch the movie, as I have to get up early and take my guys for a walk. However, now that the movie is safely stored on my laptop, I will be able to watch it over breakfast tomorrow :)

Saturday, January 07, 2017

Hadoop - Data Ingestion Using Flume & Sqoop

Had some fun today experimenting with Flume and Sqoop to ingest data into HDFS. Here's a brief summary of my approach -

Flume is an optional service in the Hadoop ecosystem (unlike HDFS & YARN, which are core components). It is used to ingest unstructured data (logs, social media data) into HDFS.

I wanted to use Flume's spooling directory (spooldir) source to copy data from a staging area on my local file system to a directory in HDFS.

My sandbox - A CentOS 6.5 based VM provided by Edureka that came with Hadoop, Flume and Sqoop pre-installed. 

Methodology -

Step 1. Create a sample text file in the /home directory that would be copied to HDFS. 

Step 2. Create a directory within HDFS to store the data ingested by Flume -
hadoop fs -mkdir /data/flume

Step 3. Obtain the standard Flume configuration file format.

Step 4. Use that format as a template to create a sample.conf file in Flume's conf directory.

Step 5. Navigate to the Flume directory -
cd /usr/lib/flume-ng/apache-flume-1.4.0-bin/bin

Step 6. Execute the Flume Agent -
./flume-ng agent --conf conf --conf-file /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/sample.conf --name a1 -Dflume.root.logger=INFO,console

Step 7. Once the logs indicate that the file has been copied to HDFS, press Ctrl + C to stop the Flume agent.

In the staging area, my text file was renamed to 'sample_flume.txt.COMPLETED', which is how Flume's spooldir source marks a file as fully ingested.
I could also see the ingested data in HDFS using -
hadoop fs -ls /data/flume/FlumeData.148730432550
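
For reference, Steps 1 and 4 above could be sketched roughly as the shell snippet below. It is a minimal illustration under assumptions, not necessarily the exact file used here: SPOOL_DIR, CONF_FILE and the component names r1, c1 and k1 are placeholders, and both paths default to the current directory so the snippet can be tried safely. Only the agent name a1 (from --name a1) and the HDFS path /data/flume come from the steps above.

```shell
#!/bin/sh
# Sketch only: SPOOL_DIR, CONF_FILE, r1, c1 and k1 are illustrative names.
# The agent name a1 and the HDFS path /data/flume come from the steps above.
SPOOL_DIR="${SPOOL_DIR:-$PWD/flume_spool}"
CONF_FILE="${CONF_FILE:-$PWD/sample.conf}"

# Step 1: stage a small text file for Flume to pick up.
mkdir -p "$SPOOL_DIR"
printf 'hello flume\nsecond line\n' > "$SPOOL_DIR/sample_flume.txt"

# Step 4: write an agent definition with one source, one channel, one sink.
cat > "$CONF_FILE" <<EOF
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Spooling directory source: ingests files dropped into spoolDir
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = $SPOOL_DIR
a1.sources.r1.channels = c1

# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# HDFS sink: writes events as plain text under /data/flume
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/flume
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.channel = c1
EOF

echo "wrote $CONF_FILE"
```

To match the steps above, point CONF_FILE at the sample.conf path passed to --conf-file in Step 6 and SPOOL_DIR at the staging directory. With the HDFS sink's default file prefix, the output files show up as FlumeData.<timestamp>, which is consistent with the listing above.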

Sqoop is also an optional service in the Hadoop ecosystem. It transfers structured data between an RDBMS and HDFS, and it works in both directions (import and export).

My plan was to transfer a table from a MySQL database on my sandbox to a Hive table stored on HDFS.

Methodology -

Step 1. Install MySQL on the sandbox & create a sample table.

SQL Commands -
show databases;
use test;
show tables;
create table family(Name VARCHAR(100), Age INT);
insert into family values('Anand',35);
insert into family values('Kalpana',40);
select * from family;

Step 2. Navigate to the Sqoop binary directory -
cd /usr/lib/sqoop-1.4.4/bin

Step 3. Run the command - 
sqoop import --connect jdbc:mysql://localhost:3306/test --username=root --password= --table=family --hive-import --hive-table=family_demo --target-dir=/data/sqoop -m 1

Step 4. Check the results by listing the /data/sqoop directory in HDFS and by querying the table from the Hive shell -
hadoop fs -ls /data/sqoop

Hive Commands -
show tables;
select * from family_demo;
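
If the import needs to be repeated, Steps 3 and 4 could be wrapped in a small script along the lines below. This is a sketch under assumptions: DRY_RUN and the run() helper are illustrative and not part of Sqoop or Hive, and by default the script only prints the commands so that it can be reviewed before anything is executed. The Sqoop options are the same as in Step 3.

```shell
#!/bin/sh
# Sketch: wraps Steps 3 and 4 in one script. DRY_RUN and run() are
# illustrative helpers, not part of Sqoop or Hive.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 3: import the MySQL table into Hive (same options as above)
run sqoop import \
    --connect jdbc:mysql://localhost:3306/test \
    --username=root --password= \
    --table=family \
    --hive-import --hive-table=family_demo \
    --target-dir=/data/sqoop -m 1

# Step 4: inspect the results in HDFS and in Hive
run hadoop fs -ls /data/sqoop
run hive -e 'select * from family_demo;'
```

Running it with DRY_RUN=0 from the sandbox executes the three commands in order. hive -e runs a single query without opening the interactive Hive shell.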