Big data

When we talk about Big Data, we are talking about solutions that need to analyze a huge volume of data. Big data is not the answer for every scenario, but it is the right fit for the following ones:

  • Scenarios that need to offer analytical capabilities over petabytes or zettabytes of information
  • Scenarios where a lot of information is generated every second
  • Scenarios where data is generated in huge volumes (thousands of MB or GB)

What every big data solution has in common is the need to load, process and analyze an enormous amount of data.
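To get a feeling for why this volume of data calls for many machines, here is a minimal back-of-the-envelope sketch in Python. The 100 MB/s disk throughput and the 1,000-node cluster are assumptions chosen only for illustration, and the function name `scan_time_seconds` is hypothetical. It ignores network, scheduling and replication overhead.

```python
def scan_time_seconds(total_bytes: int, bytes_per_second: int, nodes: int = 1) -> float:
    """Time to read total_bytes once when the data is split evenly across nodes."""
    if nodes < 1 or bytes_per_second <= 0:
        raise ValueError("nodes must be >= 1 and bytes_per_second must be > 0")
    return total_bytes / (bytes_per_second * nodes)


PETABYTE = 10**15                       # bytes
ASSUMED_DISK_THROUGHPUT = 100 * 10**6   # 100 MB/s per node, an assumed figure

single_node = scan_time_seconds(PETABYTE, ASSUMED_DISK_THROUGHPUT)
cluster = scan_time_seconds(PETABYTE, ASSUMED_DISK_THROUGHPUT, nodes=1000)

print(f"1 node:     {single_node / 86400:.1f} days")   # 115.7 days
print(f"1000 nodes: {cluster / 3600:.1f} hours")       # 2.8 hours
```

Under these assumptions a single machine needs months just to read one petabyte once, while a cluster that splits the work brings it down to hours.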

We can find several examples of big data in social network applications like Facebook, Twitter or Foursquare, where millions of photos are uploaded every minute and millions of comments, shares, tweets and check-ins are generated every second.

Social networking sites

But we can also find a lot of big data technology in areas like sensor networks, internet search indexing, biogeochemistry, astronomy, large-scale e-commerce and many other businesses where data is crucial.

Some examples of companies that are using big data are:

  • Amazon: to process millions of sessions daily for analytics and to build Amazon’s product search indices.
  • AOL: for tasks ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting.
  • eBay: for search optimization and research.
  • Fox: for log analysis, data mining and machine learning.
  • for charts calculation, royalty reporting, log analysis
  • LinkedIn: for discovering People You May Know.
  • Spotify: uses Hadoop for content generation, data aggregation, reporting and analysis.
  • E:Circle: for marketing data handling.
  • Optivo: uses Hadoop to aggregate and analyse email campaigns and user interactions.
  • Performable: uses Hadoop to process web clickstream, marketing, CRM and email data in order to create multi-channel analytic reports.

You can find other companies using big data here.

Now let’s talk about HDInsight. For those who don’t know it yet, HDInsight is the implementation of Hadoop developed by Microsoft so that companies can run Big Data solutions on Windows Server.

Hadoop is known as an alternative for the development of databases under the NoSQL philosophy. The NoSQL movement was born in San Francisco, in the hands of a group of people who were unhappy with the performance offered by relational databases. They proposed more effective and less costly alternatives that would let the new Web 2.0 handle millions of simultaneous transactions in real time.

This is how alternatives to traditional relational databases (Oracle, MySQL, SQL Server, etc.) began to emerge. One of them is Cassandra, developed by Facebook. According to Facebook engineers, on a data set of more than 50 GB Cassandra completes a write in about 0.12 milliseconds, where MySQL needs about 300 milliseconds, which makes it roughly 2,500 times faster. Other alternatives developed by big internet companies followed, like Amazon Dynamo, Bigtable by Google, Hadoop, Voldemort and Dynomite.

According to the Hadoop web page, Hadoop is “a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
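The “simple programming models” in that definition refer mainly to MapReduce. The following is a minimal sketch of the idea in plain Python, assuming the classic word count example. It runs in a single process and does not use any Hadoop API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are made up for illustration. In a real cluster the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values of one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce one after another in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big clusters", "hadoop processes big data"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, the work can be spread over as many machines as the data requires, which is what the definition above means by scaling from single servers to thousands of machines.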

HDInsight is available either on Windows Server or as a Windows Azure service. When you use HDInsight as a Windows Azure service, you don’t need to make a huge investment in hardware to run all the Hadoop nodes you may need. You also don’t have to maintain servers, worry about redundancy or make backups, and you save a lot of money on hardware and energy.

To use HDInsight as a Windows Azure service, you need to go to the Windows Azure Portal and create an account.

On the other hand, you can create your own on-premises deployment of Hadoop by downloading HDInsight Server and installing it in your own environment.

To install HDInsight Server, you first need to download the Microsoft Web Platform Installer.


Run the Microsoft Web Platform Installer and search for hadoop.

Hadoop Installs

Click the Add button to select the Microsoft HDInsight for Windows Server Community Technology Preview item to install and then click the Install button.

Once it is installed, the following shortcuts are created on the desktop:

Hadoop HDInsight

  • Hadoop Command Line: this is where we can use Hive, Pig or Hadoop commands to create tables, load data or run queries.
  • Hadoop MapReduce Status Web site: here you can see the status of the MapReduce jobs that are running on your server.

mapreduce administration

  • Hadoop Name node Status Web site: here you can see the status of the nodes and browse the Hadoop file system.


  • Microsoft HDInsight Dashboard: here you have access to the web console, where you can use Hive and JavaScript and create new jobs from jar files.

hdinsight dashboard

In the next post we will see how to load data using Hive.