Wednesday, September 16, 2015

Apache Storm: how it works

Hello friends, today we are going to look at an emerging Apache technology, developed at Twitter and later adopted by Apache, that gives us a way of doing real-time processing in a step-wise manner.

You might be thinking: step-wise manner? What the hell is a step-wise manner?

Now let me give you a very basic example of a restaurant. If you go into a restaurant, you will find that one person opens the door, one person takes your order, one person cooks it, and one person serves it. And when everything is over, you go to yet another person to pay the bill.

This is a very simple and basic example where you can see that the overall system, the restaurant, has multiple actors (doorman, waiter, cook, receptionist, etc.), and these actors do their work in sync as parts of one system.

In the same way, Storm gives us the ability to process data with the help of multiple actors (running on the same or different machines, under a common roof; more on that in a bit).

In the Storm world, we have specific terms to describe the parts of the system.
  • Spout
    • A Spout is the starting point of the ecosystem. In our example, you, the customer, are the Spout. This is because unless and until you enter the restaurant, the process doesn't start; all the actors become active only once you enter.
  • Bolt
    • A Bolt is the part of the Storm ecosystem that does the work; in other words, bolts are the doers, actors, or executors.
  • Topology
    • The Topology binds the overall ecosystem together. It is what tells Storm which actor should communicate with which actor, after which, and in what manner. Or I can say it is the hub of all the connections between the actors. (We will see all three of these in code shortly.)
In Production Mode, Storm needs Zookeeper so the actors of its team can interact. In simple words, if you have created a full-fledged system with n actors, you can run them on different machines so that the whole job gets done by the power of n processors, simultaneously or sequentially; to manage these machines (actors), you need Zookeeper.

Storm runs in 2 modes:
  1. Development Mode
  2. Production Mode
Development Mode:
          In this mode, the code you write with the Storm API doesn't need any other machine, or Zookeeper, to be tested. It can be executed on a single machine and thus needs no Zookeeper at all.

Production Mode:
          In this mode, we have to install a full-fledged Zookeeper ensemble across multiple machines and start the Storm server with these Zookeeper nodes in its config, so that the Storm server can interact with all the bolts of the topology. [I know it's a bit technical, but by now you know what is what. :-) ]
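To give you a flavor of that config, here is a minimal sketch of the Zookeeper-related entries in storm.yaml, using the Storm 1.x key names; the hostnames are placeholders of my own:

# storm.yaml on every Storm node: where to find the Zookeeper ensemble
storm.zookeeper.servers:
    - "zk1.example.com"
    - "zk2.example.com"
    - "zk3.example.com"
storm.zookeeper.port: 2181
# where to find the Nimbus (master) node
nimbus.seeds: ["nimbus1.example.com"]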

Now let's jump to a basic example of Storm; you can find this example here.

I am going to post the code here and then explain each part of it one by one.

Before starting the basic example of Storm, let me explain the parts of the example:
  1. Spout: This is a class that acts as the creator of the messages. For this example, I am just picking a random sentence from a list of predefined sentences and emitting it.
  2. Bolt: This is the class that is attached to the spout and reads the sentences emitted by the spout (LearningStormSpout).
  3. Topology: This class contains the main method that wires all the classes together.
  4. pom.xml: This file contains the dependencies used to build the code.
You can find the full code here.
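For reference, here is a minimal sketch of what the four parts can look like. Treat it as a reconstruction rather than the exact code at the link: the package names assume Storm 1.x (the 0.9.x releases current at the time of writing used backtype.storm instead of org.apache.storm), and the sentence list and topology ids are placeholders of my own.

LearningStormSpout.java:

import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class LearningStormSpout extends BaseRichSpout {

    private static final String[] SENTENCES = {
        "storm is a real time processing engine",
        "the restaurant has many actors",
        "spouts feed bolts with tuples"
    };

    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // pick a random predefined sentence and emit it
        String sentence = SENTENCES[random.nextInt(SENTENCES.length)];
        collector.emit(new Values(sentence));
        Utils.sleep(100); // don't busy-spin between emissions
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // every tuple we emit carries a single field named "site"
        declarer.declare(new Fields("site"));
    }
}

LearningStormBolt.java:

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class LearningStormBolt extends BaseRichBolt {

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // nothing to set up for this simple bolt
    }

    @Override
    public void execute(Tuple input) {
        // read the field named "site" that the spout declared
        String sentence = input.getStringByField("site");
        System.out.println("Received: " + sentence);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt is the end of the line; it emits nothing further
    }
}

LearningStormTopology.java:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.utils.Utils;

public class LearningStormTopology {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // register the spout under the id "LearningStormSpout"
        builder.setSpout("LearningStormSpout", new LearningStormSpout());
        // attach the bolt to the spout using shuffle grouping on the spout's id
        builder.setBolt("LearningStormBolt", new LearningStormBolt())
               .shuffleGrouping("LearningStormSpout");

        // development mode: run the whole topology inside this JVM
        Config conf = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("learning-storm", conf, builder.createTopology());

        Utils.sleep(10000); // let it run for a while, then tear it down
        cluster.shutdown();
    }
}

And the key entry in pom.xml (the version here is an assumption; use whichever release you are running):

<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>1.2.3</version>
</dependency>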

Explanation:
  1. LearningStormSpout.java: This class does the job of publishing data on an output field named "site". If you look at the declareOutputFields method, you will see that we declare an output field named "site". The data itself comes out of nextTuple(), where we just pick a random sentence and emit it; every emitted sentence carries the field named site.
  2. LearningStormBolt.java: This is the class whose job is to read the data coming in from the spout. If you look at the code inside execute(), you will find that I read the field named site (the getStringByField("site") call). This is because our data generator is pushing the data out under the field named site.
  3. LearningStormTopology.java: This class is the structure of your overall multiprocessing ecosystem. If you look at the TopologyBuilder calls, you can see that we create a topology with LearningStormSpout as the spout, attach LearningStormBolt as its first bolt, and keep the grouping as shuffle grouping on the id that is the same as the spout's id.
  4. At last, we start the topology in development mode by submitting it to a local cluster, so that we can test it and see whether or not we get the correct results.
There is another way of starting the topology, in a distributed environment, which I will cover in another tutorial. For your information, in the distributed environment we have to install Zookeeper as a cluster, install the Storm server on top of these Zookeeper nodes, and then deploy the jar of our topology to this Storm server.
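For a quick preview, here is a hedged sketch of what that production-mode submission can look like. The only real change from the local version is that StormSubmitter replaces LocalCluster; the class name and worker count below are my own placeholders, and the jar is uploaded to the cluster with the storm jar command.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class LearningStormClusterTopology {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("LearningStormSpout", new LearningStormSpout());
        builder.setBolt("LearningStormBolt", new LearningStormBolt())
               .shuffleGrouping("LearningStormSpout");

        Config conf = new Config();
        conf.setNumWorkers(2); // how many worker JVMs to request from the cluster

        // Instead of LocalCluster, hand the topology over to the cluster's Nimbus.
        // The jar holding these classes is uploaded with:
        //   storm jar learning-storm.jar LearningStormClusterTopology
        StormSubmitter.submitTopology("learning-storm", conf, builder.createTopology());
    }
}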