Sunday, 19 April 2015

Sqoop Installation Problem:

Error: /usr/lib/hadoop does not exist! Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.

I got the above error when I was configuring Sqoop in my Hadoop environment.

I solved the problem with the following steps.


We need to configure the sqoop-env.sh file.

In the sqoop/conf directory, copy the sqoop-env-template.sh file to a new sqoop-env.sh file by running the command below.

[root@charan/sqoop/conf]# cp sqoop-env-template.sh sqoop-env.sh

Modify the newly created sqoop-env.sh configuration file: 

>export HADOOP_COMMON_HOME=/home/hadoop/hadoop/hadoop-2.3.0

#Set path to where hadoop-*-core.jar is available

>export HADOOP_MAPRED_HOME=/home/hadoop/hadoop/hadoop-2.3.0

#set the path to where bin/hbase is available
>export HBASE_HOME=

#Set the path to where bin/hive is available
>export HIVE_HOME=

#Set the path for where zookeper config dir is
>export ZOOCFGDIR=


If you don't set the HBASE_HOME and ZOOCFGDIR paths, don't worry: Sqoop will only print warnings.

Then test whether Sqoop is working:

charan@charan-Latitude-E5440:~$ sqoop

Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: $HADOOP_HOME is deprecated.





That's it.





Saturday, 18 April 2015

MapReduce Program to Display User Information Ordered by Name

Let's start with a basic program.
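
Below is a minimal sketch of such a job, written against the Hadoop 2.x MapReduce API. It assumes each input line holds comma-separated user fields in a hypothetical id,name,email format; the mapper emits the name as the key, so Hadoop's shuffle phase sorts the records by name before they reach the reducer.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserSortByName {

    public static class UserMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each record is assumed to look like: id,name,email
            String[] fields = value.toString().split(",");
            if (fields.length >= 3) {
                // Emit the name as the key; the shuffle phase sorts by key
                context.write(new Text(fields[1].trim()), value);
            }
        }
    }

    public static class UserReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text name, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            // Keys arrive sorted by name; write each record straight out
            for (Text record : records) {
                context.write(name, record);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "user sort by name");
        job.setJarByClass(UserSortByName.class);
        job.setMapperClass(UserMapper.class);
        job.setReducerClass(UserReducer.class);
        // A single reducer gives one globally name-sorted output file
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package it into a jar and run it with something like this (the jar name and HDFS paths are illustrative):

]$ hadoop jar usersort.jar UserSortByName /user/hadoop/users /user/hadoop/users-sorted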

Wednesday, 15 April 2015

FAQs on Hadoop




1. What is HDFS?

A. HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to that information. Files are stored redundantly across multiple machines to ensure durability in the face of failure and high availability to highly parallel applications.

2. What are the Hadoop configuration files?

1. hdfs-site.xml

2. core-site.xml

3. mapred-site.xml
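
As a minimal sketch of what these files contain, core-site.xml tells clients where the default filesystem lives (the NameNode address below is illustrative):

<!-- core-site.xml: point clients at the NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>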


3. How does the NameNode handle DataNode failures?

The NameNode periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.

When the NameNode notices that it has not received a heartbeat message from a DataNode for a certain amount of time, it marks that DataNode as dead. Since the blocks it held are now under-replicated, the NameNode begins replicating the blocks that were stored on the dead DataNode.

The NameNode takes responsibility for replicating the data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes; the data never passes through the NameNode.
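
A quick way to observe this from the command line (assuming a configured HDFS client) is the dfsadmin report, which lists the live and dead DataNodes along with their capacity:

]$ hadoop dfsadmin -report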

4. What is MapReduce in Hadoop?

Hadoop MapReduce is a framework designed for distributed processing of large data sets on clusters of commodity hardware. The framework itself takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

5. What is the responsibility of the NameNode in HDFS?

The NameNode is the master daemon that maintains the metadata for the blocks stored on DataNodes. Every DataNode sends a heartbeat and block report to the NameNode. If the NameNode does not receive a heartbeat from a DataNode, it marks that DataNode as dead. The NameNode is a single point of failure: if it goes down, the HDFS cluster becomes inaccessible.

6. What is the responsibility of the SecondaryNameNode in HDFS?

The SecondaryNameNode is the daemon that performs housekeeping work for the NameNode. The SecondaryNameNode is not a hot standby for the NameNode, but it does keep a backup of the NameNode's metadata by periodically merging the edit log into the filesystem image.

7. What is the DataNode in HDFS?

The DataNode is the slave daemon of the NameNode and stores the actual data blocks. Each DataNode stores a number of blocks, 64 MB each by default.
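
To see how a file is split into blocks and which DataNodes hold them, you can run fsck (the path below is a hypothetical example):

]$ hadoop fsck /user/hadoop/users.txt -files -blocks -locations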

8. What is the JobTracker in Hadoop?

The JobTracker is a master daemon that assigns tasks to TaskTrackers on the DataNodes where it can find the data blocks for the input file.

9. How can we list all jobs running in a cluster?

]$ hadoop job -list

10. How can we kill a job?

]$ hadoop job -kill <job-id>

11. What's the default port that the JobTracker web UI listens on?

http://localhost:50030

12. What's the default port where the HDFS NameNode web UI listens?

http://localhost:50070

13. What is Hadoop Streaming?

A. Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.
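
As a minimal sketch (the jar path is typical for Hadoop 2.x installations but may differ on yours), a streaming job can use ordinary Unix tools as the mapper and reducer:

]$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
       -input /user/hadoop/input \
       -output /user/hadoop/output \
       -mapper /bin/cat \
       -reducer /usr/bin/wc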

Tuesday, 14 April 2015

Hadoop BigData






Hi friends, from here on we will discuss Hadoop technology.


Wednesday, 31 July 2013

Load Balancing with Apache Server

1. Introduction:

Clustering allows us to run an application on several parallel servers (a.k.a. cluster nodes). The load is distributed across the servers, and even if one of them fails, the application remains accessible via the other cluster nodes. Clustering is crucial for scalable enterprise applications, as you can improve performance by simply adding more nodes to the cluster.

2. Why Clustering?

Clustering solutions usually provide:
  • Scalability
  • High Availability
  • Load Balancing

Scalability:

The key question here is, if it takes time T to fulfill a request, how much time will it take to fulfill N concurrent requests? The goal is to bring that time as close to T as possible by increasing the computing resources as the load increases. Ideally, the solution should allow for scaling both vertically (by increasing computing resources on the server) and horizontally (increasing the number of servers) and the scaling should be linear.

High Availability:

The objective here is to provide failover, so that if one server in the cluster goes down, then other servers in the cluster should be able to take over -- as transparently to the end user as possible.
In the servlet engine case, there are two levels of failover capabilities typically provided by clustering solutions:
  • Request-level failover
  • Session-level failover

Request-level Failover:

If one of the servers in the cluster goes down, all subsequent requests should get redirected to the remaining servers in the cluster. In a typical clustering solution, this usually involves using a heartbeat mechanism to keep track of the server status and avoiding sending requests to the servers that are not responding.

Session-level Failover:

Since an HTTP client can have a session that is maintained by the HTTP server, in session level failover, if one of the servers in the cluster goes down, then some other server in the cluster should be able to carry on with the sessions that were being handled by it, with minimal loss of continuity. In a typical clustering solution, this involves replicating the session data across the cluster (to one other machine in the cluster, at the least).

Load Balancing:

The objective here is that the solution should distribute the load among the servers in the cluster to provide the best possible response time to the end user.
In a typical clustering solution, this involves use of a load distribution algorithm, like a simple round robin algorithm or more sophisticated algorithms, that distributes requests to the servers in the cluster by keeping track of the load and available resources on the servers.
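
To make the round-robin idea concrete, here is a minimal, generic Java sketch (not tied to Apache or Tomcat, and leaving out the health checks and load tracking a real balancer adds on top):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal round-robin selector: each call hands back the next server in turn.
public class RoundRobinBalancer {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger(0);

    public RoundRobinBalancer(List<String> servers) {
        this.servers = servers;
    }

    public String pickServer() {
        // Cycle through the list; the atomic counter keeps this thread-safe.
        int index = Math.abs(next.getAndIncrement() % servers.size());
        return servers.get(index);
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(
                Arrays.asList("tomcat1:8009", "tomcat2:8019"));
        for (int i = 0; i < 4; i++) {
            // Prints tomcat1:8009, tomcat2:8019, tomcat1:8009, tomcat2:8019
            System.out.println(lb.pickServer());
        }
    }
}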

3. Architecture:

[Architecture diagram of Tomcat load balancing]

4. How to implement the solution using Apache Web Server and mod_jk:

So, this is what you need to download:
  1. Apache HTTP server 2.2.4 from The Apache HTTP Server Project
  2. Apache Tomcat 5.5.23 from Apache Tomcat downloads
  3. Mod JK Tomcat connector from the Tomcat connector downloads. Please note: you want to download the binary; click on JK 1.2 Binary Releases --> win32 --> jk-1.2.21 --> mod_jk-apache-2.2.4.so
Now let's start by installing Tomcat first.
  1. Extract the Tomcat zip. Hereafter, the directory you extracted to will be referred to as TOMCAT_HOME
  2. Test Tomcat to see that it works. Go to TOMCAT_HOME\bin and run startup.bat. You may need to add an environment variable called CATALINA_HOME, set to TOMCAT_HOME, in case Tomcat fails to start.
  3. Open up your browser and access http://localhost:8080/
    If you see the default page, then Tomcat Instance 1 is working fine. Shut down Tomcat.
That's all for the first Tomcat instance. Now for the second:
  1. Make a directory called Tomcat2
  2. Copy all the directories from the TOMCAT_HOME directory into the tomcat2 directory.
  3. Open up tomcat2\conf\server.xml in a text editor. We've got to change the port numbers so that they don't conflict with the first instance.
    <Server port="8005" shutdown="SHUTDOWN"> to 
    <Server port="8025" shutdown="SHUTDOWN">
    
    <Connector port="8080" maxHttpHeaderSize="8192"... to
    <Connector port="8010" maxHttpHeaderSize="8192"...
    
    <Connector port="8009" enableLookups="false" 
     redirectPort="8443" protocol="AJP/1.3" /> to
    <Connector port="8019" enableLookups="false" 
    redirectPort="8443" protocol="AJP/1.3" />
    
Go to the bin directory of tomcat2 and start the second Tomcat using startup.bat. Test it by pointing your browser to http://localhost:8010/
Your second tomcat instance is now ready to be used.
Next, let's set up the Apache HTTP Server. It's pretty simple...
  1. Run the installer you downloaded. The standard install will do.
  2. Open the Apache Server Monitor and start the web server if it's not already running.
  3. Point your browser to http://localhost/ to verify that Apache is running on port 80.
  4. Stop Apache.
Finally, we reach mod_jk. Let's set it up first just to delegate requests to the two Tomcat instances, and we'll load balance it a bit later.
  1. Copy the mod_jk-apache-2.2.4.so to the modules directory in your Apache installation.
  2. Open up httpd.conf in the conf directory of your Apache installation in a text editor, and add the following line at the end of the set of LoadModule statements: LoadModule jk_module modules/mod_jk-apache-2.2.4.so
  3. Create a file called workers.properties in the conf directory. Add these lines to it:
    workers.tomcat_home=C:/apache-tomcat-5.5.23
    workers.java_home=C:/jdk1.5.0_07
    worker.list=worker1,worker2
    worker.worker1.port=8009
    worker.worker1.host=localhost
    worker.worker1.type=ajp13
    worker.worker2.port=8019
    worker.worker2.host=localhost 
    #you can also specify other 
    #machine address or name
    worker.worker2.type=ajp13
    
    This file defines which workers Apache can delegate to. We've listed worker1 and worker2 to correspond to our two Tomcat instances. Remember to set tomcat_home and java_home as well.
  4. Specify the worker properties in httpd.conf:
    Add these lines just after the LoadModule definitions-
    # Path to workers.properties
    JkWorkersFile c:/apache2.2/conf/workers.properties
    # Path to jk logs
    JkLogFile c:/apache2.2/mod_jk.log
    # Jk log level [debug/error/info]
    JkLogLevel info
    # Jk log format
    JkLogStampFormat "[%a %b %d %H:%M:%S %Y] "
    # JkOptions for forwarding
    JkOptions +ForwardKeySize +ForwardURICompat -ForwardDirectories
    # JkRequestLogFormat set the request format
    JkRequestLogFormat "%w %V %T"
    JkMount /jsp-examples worker1
    JkMount /jsp-examples/* worker1
    JkMount /tomcat-docs worker2
    JkMount /tomcat-docs/* worker2
    
    Defining these tells Apache where to look for definitions of its workers, and tells it that any requests for the jsp-examples context should be handed off to the Tomcat instance represented by worker1, and any requests for the tomcat-docs context should be handed off to Tomcat Instance 2, represented by worker2.
    Edit the server.xml for Tomcat 1 and Tomcat 2 and add a jvmRoute attribute to the Engine element:
    <Engine name="Catalina" defaultHost="localhost"
                               jvmRoute="worker1">
    for the first instance and
    <Engine name="Catalina" defaultHost="localhost" 
                               jvmRoute="worker2">
    for the second.
    

  5. Start Tomcat 1 and 2. Start up the Apache web server. Point your browser to http://localhost/jsp-examples/ and then to http://localhost/tomcat-docs. You should see the respective pages load. To distinguish which Tomcat is serving you the page, the easiest thing to do is to edit the index pages in the tomcat-docs and jsp-examples contexts of Tomcat 2 and change the titles, for example. Then you can verify that tomcat-docs is being served only by the second instance.
Apache is now delegating requests to both Tomcats. But we still need to set up load balancing and failover, so that if Tomcat 1 crashes for whatever reason, Apache automatically keeps delegating to Tomcat 2 and your application remains accessible.
Load balancing is a simple configuration. First shut down your Tomcat instances and Apache as well.
  1. Open workers.properties
  2. Edit it so it looks like this (the changed lines are the worker.list entry, the new lbfactor entries, and the balancer definition at the end):
    workers.tomcat_home=C:/apache-tomcat-5.5.23
    workers.java_home=C:/jdk1.5.0_07
    #worker.list=worker1,worker2
    worker.list=balancer
    worker.worker1.port=8009
    worker.worker1.host=localhost
    worker.worker1.type=ajp13
    worker.worker1.lbfactor=1
    worker.worker2.port=8019
    worker.worker2.host=localhost
    worker.worker2.type=ajp13
    worker.worker2.lbfactor=1
    worker.balancer.type=lb
    worker.balancer.balance_workers=worker1,worker2
    worker.balancer.method=B
    # Specifies whether requests with SESSION IDs
    # should be routed back to the same Tomcat worker.
    worker.balancer.sticky_session=True
    
    We've changed the worker list to a single worker called balancer, and specified that the worker type of balancer is 'lb', or load balancer. The workers it manages are worker1 and worker2 (these do not need to appear in the worker list). And finally, we set the balance method to 'B', or balance by busy factor: Apache will delegate the next request to the Tomcat instance which is least busy. Please note that there are a couple of options for method; consult the Apache/Tomcat documentation, which lists the options for worker properties, to help you decide the best method for your type of application.
    If you want to use session stickiness, you must set different jvmRoute attributes in the Engine element in each Tomcat's server.xml. Furthermore, the names of the workers managed by the balancer have to be equal to the jvmRoute of the Tomcat instance they connect to.
  3. Open httpd.conf and comment out the previous JkMount directives. Replace them with these:
    JkMount /jsp-examples balancer
    JkMount /jsp-examples/* balancer
    
    We've just pointed Apache at a single worker: the balancer.
  4. Start up both Tomcats and Apache. Access http://localhost/jsp-examples. You will be served by either Tomcat 1 or Tomcat 2. To prove that both are capable of serving, shut down the first instance and refresh your browser. You should be served by instance two.

Conclusions:

This solution provides high scalability, high availability, and good load balancing capabilities that are comparable with any other software solution.

Wednesday, 17 July 2013

tomcat mod_jk tutorial for beginners

mod_jk is a replacement for the older mod_jserv. It is a plug-in that handles the communication between Tomcat and Apache.

In this tutorial, we assume that a stable Apache Web Server 2.X has been installed on your host. The next step in the checklist is downloading the latest stable release of Tomcat mod_jk, available at

http://www.apache.org/dist/tomcat/tomcat-connectors/jk/binaries/

Once downloaded, the module mod_jk.so should be copied into your Apache modules directory (usually located at APACHE_ROOT/modules). Check your Apache documentation if you cannot locate it.

Windows users are encouraged to rename the binary file to mod_jk.dll if the downloaded Windows module bears the .so extension. This way you will not confuse this library with a compiled library for Unix.
The configuration of mod_jk can be included into the Apache httpd.conf file or held in an external file, which is a good practice:


# Load mod_jk module
LoadModule jk_module modules/mod_jk.so # UNIX
# LoadModule jk_module modules/mod_jk.dll # WINDOWS
# Where to find workers.properties
JkWorkersFile /etc/httpd/conf/workers.properties
# Where to put jk shared memory
JkShmFile /var/log/httpd/mod_jk.shm

# Where to put jk logs
JkLogFile /var/log/httpd/mod_jk.log
# Set the jk log level [debug/error/info]
JkLogLevel info
# Select the timestamp log format
JkLogStampFormat "[%a %b %d %H:%M:%S %Y] "
# Send everything for context /yourApplication to mod_jk loadbalancer
JkMount /yourApplication/* loadbalancer

The module is loaded in memory by the LoadModule directive; the configuration of the single nodes is contained in a separate file named workers.properties, which will be examined in a moment.

The JkMount directive tells Apache which URLs it should forward to the mod_jk module. Supposing we have deployed a web application reachable at the web context yourApplication, with the above JkMount directive all requests with URL path /yourApplication/* are sent to the mod_jk load balancer. This way, you can split requests between Apache itself (static content) and the load balancer (Java applications).

So, if you want your web application served directly by Tomcat Web Server you would need to point the browser to this location: http://localhost:8080/yourApplication

The same web context, proxied by Apache Web server can be reached at:
http://localhost/yourApplication


Additionally, you can use the JkMountFile directive that allows dynamic updates of mount points at runtime. When the mount file is changed, mod_jk will reload its content.

# Load mount points
JkMountFile conf/uriworkermap.properties

The format of the file is /url=worker_name. To get things started, paste the following example into the file you created:

# Mount the Servlet context to the ajp13 worker
/yourApplication=loadbalancer
/yourApplication/*=loadbalancer

This configures mod_jk to forward requests for /yourApplication to the loadbalancer worker.
Next, you need to configure the workers file conf/workers.properties. A worker is a process that defines a communication link between Apache and the Tomcat container.

This file specifies where the different nodes are located and how to balance the calls between the hosts. The configuration file is made up of global directives (that are generic for all nodes) and the individual worker's configuration. This is a sample two-node configuration:

# Define list of workers that will be used
worker.list=loadbalancer,status
# Define Node1
worker.node1.port=8009
worker.node1.host=192.168.10.1
worker.node1.type=ajp13
worker.node1.lbfactor=1
worker.node1.cachesize=10
# Define Node2
worker.node2.port=8009
worker.node2.host=192.168.10.2
worker.node2.type=ajp13
worker.node2.lbfactor=1
worker.node2.cachesize=10
# Load-balancing behaviour
worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=node1,node2
worker.loadbalancer.sticky_session=1
# Status worker for managing load balancer
worker.status.type=status

In this file, each node is defined using the worker.XXX naming convention where XXX represents an arbitrary name you choose for each of the target servlet containers. For each worker, you must specify the host name (or IP address) and the port number of the AJP13 connector running in the servlet container.

balance_workers is a comma-separated list of workers that the load balancer needs to manage.
sticky_session specifies whether requests with SESSION IDs should be routed back to the same Tomcat worker. Set it to true or 1 to make sessions sticky, or to false or 0 to disable stickiness. (The default is true.)

Finally, we must configure the web instances on all clustered nodes so that they can expect requests forwarded from the mod_jk load balancer. Edit the server.xml file; locate the <Engine> element and add a jvmRoute attribute that matches the worker name defined in workers.properties:

<Engine name="Catalina" defaultHost="localhost" jvmRoute="node1">
    ... ...
</Engine>

The same attribute is required on node2:

<Engine name="Catalina" defaultHost="localhost" jvmRoute="node2">
    ... ...
</Engine>

You also need to be sure the AJP Connector definition is uncommented. By default, it is enabled:

<Connector port="8009" emptySessionPath="true" enableLookups="false" redirectPort="8443" protocol="AJP/1.3" />

Tuesday, 25 June 2013