Sunday 27 January 2013

In this blog, I give a brief introduction to Hadoop and R, followed by the steps to install Hadoop and R on an Ubuntu machine. I then briefly introduce three approaches to solving BigData analytics problems.


     In today's world of data, Hadoop is the new 'mantra' for processing huge amounts of data quickly using parallel processing. It achieves this with MapReduce (distributed computation) and HDFS (distributed storage). Today this framework is well proven and is widely seen as the next step beyond traditional RDBMS systems for large-scale data. The introduction of Pig (a high-level programming language for Hadoop computations) and Hive (a data warehouse with SQL-like access) has improved programmability and made Hadoop easier to use. On top of this, ZooKeeper and Oozie support Hadoop with coordination and workflow management.

     Similarly, in the world of analytics, R is the widely used open-source software environment for statistical computing and graphics. Its wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, and clustering, makes it popular among statisticians and data miners. Since R is easily extensible through functions and packages, it can be used together with MapReduce to process huge amounts of data (structured, semi-structured, and unstructured) quickly and then analyse it for better business decisions.

     Having said that, it is now time for some hands-on work. To start, we first need both Hadoop and R installed on our machine. I preferred Ubuntu as it is open source. Below I have provided step-by-step installation instructions for both Hadoop and R.

Installation of Hadoop and R

Steps to Install Hadoop in an Ubuntu Machine:

To install Hadoop on an Ubuntu machine, check my blog, which has detailed installation steps for Hadoop (CDH3).

Steps to Install R:

Step 1>
Go to the link http://www.nic.funet.fi/pub/mirrors/fedora.redhat.com/pub/epel and search for the suitable repository. Check whether your PC is 32-bit or 64-bit. Mine is a 64-bit CentOS 5 machine, so I used the script below to add the repository:

$ sudo rpm -Uvh http://www.nic.funet.fi/pub/mirrors/fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

Step 2>
Install R using this script:

$ sudo yum install R  

Note: a> It will take some time to install R, as the total download size is close to 85 MB (19 files). Be patient...:)
         b> The rpm/yum commands above are for CentOS/RHEL machines; on an Ubuntu machine you will have to use apt-get instead (for example, sudo apt-get install r-base). For more information read http://centos.org/docs/4/html/yum/sn-managing-packages.html

Step 3> 
Testing the installation:

Launch R by typing "R" in a terminal (excluding quotes).
You will see the prompt below:
>
> print("Kunal Welcomes you to R in Ubuntu")


                                                    Fig 1: R installed in Ubuntu Terminal

To use RStudio you can simply follow the steps in the YouTube video below:

http://www.youtube.com/watch?v=P8wx4HY9me0  ("Implementing Rstudio and R on Ubuntu" by Riccardo Klinger)


                                                    Fig 2: Running the first plot using R
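A first plot like the one in Fig 2 takes only a few lines in the R console; the data below is just an illustrative sample:

```r
# Generate some sample data: a noisy linear trend
x <- seq(1, 100)
y <- x + rnorm(100, sd = 10)

# Draw a scatter plot with titles and axis labels
plot(x, y,
     main = "My first plot in R",
     xlab = "x", ylab = "y",
     col  = "blue")

# Overlay a fitted regression line
abline(lm(y ~ x), col = "red")
```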


Introduction to three approaches for solving BigData analytics problems to help with your business decisions.

* R+Streaming -> This approach executes R scripts in the map and reduce phases using Hadoop Streaming. It provides strong control over MapReduce features such as partitioning and sorting, but on the other hand it is hard to invoke directly from existing R scripts. Since R is not integrated with the client in R+Streaming, you need to use the Hadoop command line to launch a streaming job, specifying the map-side and reduce-side R scripts as arguments.
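As a sketch of the R+Streaming approach, the map side is an ordinary R script that reads lines from stdin and writes tab-separated key/value pairs to stdout; the script name here is my own illustrative choice:

```r
#!/usr/bin/env Rscript
# mapper.R -- an illustrative word-count mapper for Hadoop Streaming
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  # Split the input line into words on whitespace
  words <- unlist(strsplit(line, "[[:space:]]+"))
  for (w in words[words != ""]) {
    cat(w, "\t", 1, "\n", sep = "")   # emit "word<TAB>1"
  }
}
close(con)
```

The job is then launched from the Hadoop command line, along the lines of `hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.R -reducer reducer.R -file mapper.R -file reducer.R`, where the jar name and HDFS paths depend on your installation.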

* RHadoop -> Use this when you need R as a wrapper around MapReduce so that the two can be integrated on the client side. It is preferable when you want to work with existing MapReduce input and output format classes from R. On the downside, this approach can consume a large amount of memory, because values are not streamed to the reduce function. RHadoop is a simple, thin wrapper on top of Hadoop and Streaming: it has no proprietary MapReduce code, just a simple wrapper R script that is called from Streaming and in turn calls the user's map and reduce R functions. RHadoop offers high client-side integration with R and is itself an R library, in which users define their map and reduce functions in R. R must be installed on each data node.
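With RHadoop the same idea is expressed entirely from the R client. A minimal sketch using RHadoop's rmr2 package might look like the following (it assumes rmr2 is installed and Hadoop is configured, which this blog does not cover):

```r
library(rmr2)

# Write some sample numbers to HDFS
ints <- to.dfs(1:100)

# Group the numbers by their remainder mod 10 and sum each group
result <- mapreduce(
  input  = ints,
  map    = function(k, v) keyval(v %% 10, v),    # key = remainder
  reduce = function(k, vv) keyval(k, sum(vv)))   # sum per key

# Read the key/value results back into the R session
from.dfs(result)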

* RHIPE -> Like RHadoop, RHIPE integrates MapReduce with R on the client side, but unlike RHadoop it requires proprietary input and output formats to work with Protocol Buffers encoded data. Instead of using Streaming, RHIPE uses its own map and reduce Java functions, which stream the map and reduce inputs in Protocol Buffers encoded form to a RHIPE C executable; that executable uses embedded R to invoke the user's map and reduce R functions.
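RHIPE jobs are also written from R, but map and reduce are supplied as R expressions that emit results via rhcollect(). The sketch below assumes RHIPE's rhwatch interface and uses illustrative HDFS paths of my own choosing:

```r
library(Rhipe)
rhinit()

# Map: emit (word, 1) for every word in each input line
map <- expression({
  lapply(map.values, function(line) {
    for (w in unlist(strsplit(line, " "))) rhcollect(w, 1)
  })
})

# Reduce: sum the counts for each word
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

job <- rhwatch(map = map, reduce = reduce,
               input = rhfmt("/data/in", type = "text"),
               output = "/data/out")
```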

In my next blog, I will describe some real-world scenarios where these approaches apply and show a small POC.

Note: The contents of this blog are purely for learning purposes. The blog was written with beginners in mind and provides links to easily direct new users.