
In the last post we saw how to load and query data using Hive; this time we are going to use the .NET SDK for Hadoop for the same purpose.

The SDK allows us to use MapReduce jobs to query data in a distributed file system that can be composed of hundreds or thousands of nodes. MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of computers.

In a typical MapReduce program we find two classes: the mapper and the reducer.

The Mapper: this is the data collection phase. In this phase the mapper breaks large pieces of work into smaller ones and then takes action on each piece.

The Reducer: this is the processing phase. The reducer combines the many results from the map step into a single output.
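
To make the two phases concrete, here is a minimal in-memory sketch in Python. It is not the .NET SDK; it only imitates the model. The sample lines, the `category,price` layout and the names `map_line`, `reduce_values` and `run_job` are assumptions made up for illustration. On a real cluster many mappers run in parallel and Hadoop groups the keys between the two phases.

```python
from collections import defaultdict

def map_line(line):
    """Map step: turn one input line into zero or more (key, value) pairs."""
    # Assumed layout for this sketch: "category,price"
    category, price = line.split(",")
    yield category.strip(), float(price)

def reduce_values(key, values):
    """Reduce step: combine every value emitted for one key into one result."""
    return key, sum(values)

def run_job(lines):
    # Group the mapper output by key (Hadoop does this between the phases).
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    # Call the reducer once per key.
    return dict(reduce_values(k, v) for k, v in grouped.items())

# Made-up sample data, not the real Products.csv
sample = ["Bikes,100.0", "Helmets,25.5", "Bikes,50.0"]
totals = run_job(sample)
print(totals)
```

The mapper only ever sees one line at a time and the reducer only sees the values for one key, which is what lets the work be spread across many nodes.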

[Image: MapReduce diagram]

The first thing we will do is load the data. In this case I am going to use the same Products.csv file from my last post and load it into the Hadoop file system.

To do this, open the Hadoop command line and type:

>hadoop fs -copyFromLocal C:\BigData\Products.csv input/Products/Products.csv

We can see the file we loaded by opening the Hadoop NameNode status page and browsing the file system.

[Image: Products.csv]

Now we need to create a Class Library project using Visual Studio 2012. Once the project has been created, we need to add references to the Microsoft Hadoop DLLs. For this we are going to use NuGet packages: right-click the project and you will find the option Manage NuGet Packages (if it is not there, you need to install this Visual Studio add-in).

[Image: NuGet]

When Manage NuGet Packages opens, type hadoop in the search box and install all the packages.

[Image: NuGet Hadoop packages]

The project has three classes:

MyMapReduceAPP: derives from the SDK's HadoopJob base class; it is the entry point of the job and indicates the mapper and reducer classes.

ProductMapper: collects and processes the data.

Reducer: aggregates the data.
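
The way the three classes fit together can be sketched as follows. This is a Python imitation of the layout, not the C# code of the project and not the SDK's API. The column positions, the sample rows, the output folder, the `ProductReducer` name and the `run_locally` helper are all hypothetical; only the input path comes from the command used earlier in this post.

```python
import csv
from itertools import groupby

class ProductMapper:
    """Collects the data: emits one (key, value) pair per CSV row."""
    def map(self, line):
        row = next(csv.reader([line]))
        # Assumption: the second column holds a category name.
        yield row[1], "1"

class ProductReducer:
    """Aggregates the data: counts the values received for one key."""
    def reduce(self, key, values):
        yield key, str(sum(int(v) for v in values))

class MyMapReduceApp:
    """Entry point: names the mapper, the reducer and the paths."""
    mapper = ProductMapper
    reducer = ProductReducer
    input_path = "input/Products/Products.csv"  # path used earlier in the post
    output_folder = "output/Products"           # hypothetical

def run_locally(job, lines):
    """Hypothetical stand-in for the cluster: map, sort by key, reduce."""
    mapper, reducer = job.mapper(), job.reducer()
    pairs = sorted(p for line in lines for p in mapper.map(line))
    out = []
    for key, group in groupby(pairs, key=lambda p: p[0]):
        out.extend(reducer.reduce(key, [v for _, v in group]))
    return out

# Made-up rows standing in for Products.csv
rows = ["1,Bikes,Road-150", "2,Helmets,Sport-100", "3,Bikes,Mountain-200"]
result = run_locally(MyMapReduceApp, rows)
print(result)
```

The job class holds no processing logic of its own. It only tells the runner which mapper and reducer to use and where to read and write, which matches the role described for MyMapReduceAPP.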

[Image: MapReduce job]

To test our DLL we need to execute it using mrrunner.exe. In the Hadoop command line, type:

>cd C:\Users\Administrator.SHAREPOINT2013L\Documents\visual studio 2012\Projects\MapReduceAPP\MapReduceAPP\mrlib

>mrrunner -dll "C:\Users\Administrator.SHAREPOINT2013L\Documents\visual studio 2012\Projects\MapReduceAPP\MapReduceAPP\bin\Debug\MapReduceAPP.dll"

We can see the result by opening the Hadoop NameNode status page and browsing the file system.

[Image: job result]

As you can see, it is the same result we got before using Hive.