In the last post we saw how to load and query data using Hive; this time we are going to use the .NET SDK for Hadoop for the same purpose.
The SDK lets us write MapReduce jobs that query data in a distributed file system environment that can be composed of hundreds or thousands of nodes. MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of computers.
In a typical MapReduce program we find two classes: the mapper and the reducer.
The mapper: this is the data collection phase. The mapper breaks up large pieces of work into smaller ones and then takes action on each piece.
The reducer: this is the processing phase. It combines the many results from the map step into a single output.
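To make the two phases concrete, here is a sketch of a classic word-count job written against the .NET SDK for Hadoop. The base classes (`MapperBase`, `ReducerCombinerBase`), the `EmitKeyValue` method, and the job-submission calls follow the Microsoft.Hadoop.MapReduce package as I understand it; treat the exact names and the input/output paths as assumptions to verify against the SDK version you install from NuGet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Mapper: the data collection phase. Each input line is split into
// words, and a (word, "1") pair is emitted for every word found.
public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        var words = inputLine.Split(new[] { ' ' },
                                    StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            context.EmitKeyValue(word.ToLowerInvariant(), "1");
        }
    }
}

// Reducer: the processing phase. All counts emitted for the same word
// arrive together, and are summed into a single output value per word.
public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values,
                                ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(v => long.Parse(v)).ToString());
    }
}
```

Submitting the job would then look roughly like the following (the paths are placeholders for your own HDFS locations):

```csharp
var hadoop = Hadoop.Connect();
hadoop.MapReduceJob.Execute<WordCountMapper, WordCountReducer>(
    new HadoopJobConfiguration
    {
        InputPath = "/demo/input",
        OutputFolder = "/demo/output"
    });
```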