
Why Learn Big Data and Hadoop?


In my experience, people who work on things in their career that they are excited about and passionate about go farther and faster, carried by that self-motivation, than people who do something they don’t like but feel they need to do for other reasons. You are already showing great initiative in your career by doing your research, including visiting my blog.


This current wave of “big data” presents tremendous opportunities.  The deluge of big data is likely to persist into the future.  Tools to handle big data will eventually become mainstream and commonplace, at which point almost everyone will be working with big data.  However, enterprising folks can still get ahead of the mainstream today by investing in skills and career development.  I realize this may sound like hyperbole, but this is the historical pattern we have seen in how technology gets adopted and the resulting shifts in the workforce (e.g. the printing press, radio, television, computers, the internet, etc.).

Big Data! A Worldwide Problem:
“Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” In simpler terms, Big Data is the term given to the large volumes of data that organizations store and process. However, it is becoming very difficult for companies to store, retrieve and process this ever-increasing data. If a company manages its data well, nothing can stop it from becoming the next BIG success!

The problem lies in using traditional systems to store enormous amounts of data. Though these systems were a success a few years ago, they are quickly becoming obsolete as the amount and complexity of data increase. The good news is Hadoop, which is nothing less than a panacea for companies working with Big Data in a variety of applications, and it has become integral to storing, handling, evaluating and retrieving huge volumes of data, up to hundreds of terabytes or even petabytes.

Apache Hadoop! A Solution for Big Data:
Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license, and it is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google about its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project, built and used by a global community of contributors. Hadoop was created by Doug Cutting and Michael J. Cafarella. And don’t overlook the charming yellow elephant logo, which is named after Doug’s son’s toy elephant!

Some of the top companies using Hadoop:
The importance of Hadoop is evident from the fact that many global companies use Hadoop and consider it an integral part of their operations, including Yahoo and Facebook! On February 19, 2008, Yahoo! Inc. launched what was then the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and generates data that is used in every Yahoo! Web search query.

Facebook, a $5.1 billion company, had over 1 billion active users in 2012, according to Wikipedia. Storing and managing data of that magnitude could have been a problem, even for a company like Facebook. But thanks to Apache Hadoop, it isn’t! Facebook uses Hadoop to keep track of every profile it hosts, as well as all the related data such as images, posts, comments, videos, etc.

Opportunities for Hadoopers:
Opportunities for Hadoopers are endless, from Hadoop Admin and Developer to Hadoop Tester or Hadoop Architect, and so on. If cracking and managing Big Data is your passion in life, then think no more: join the EconITService Hadoop course and carve a niche for yourself! Happy Hadooping!

 

Datanode doesn’t start with error “java.net.BindException: Address already in use”

In many real-world scenarios we have seen the error “java.net.BindException: Address already in use” when starting the DataNode.

You can observe the following during this issue:

1. The DataNode doesn’t start, with an error saying “address already in use”.
2. “netstat -anp | grep 50010” shows no result.

ROOT CAUSE:
The DataNode needs three ports when it starts, and each port produces a different error message when its address is already in use.

1. Port 50010 is already in use
2016-12-02 00:01:14,056 ERROR datanode.DataNode (DataNode.java:secureMain(2630)) – Exception in secureMain
java.net.BindException: Problem binding to [0.0.0.0:50010] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException

2. Port 50075 is already in use
2016-12-01 23:57:57,298 ERROR datanode.DataNode (DataNode.java:secureMain(2630)) – Exception in secureMain
java.net.BindException: Address already in use

3. Port 8010 is already in use
2016-12-02 00:09:40,422 ERROR datanode.DataNode (DataNode.java:secureMain(2630)) – Exception in secureMain
java.net.BindException: Problem binding to [0.0.0.0:8010] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException

Note that there is no port information in datanode.log when port 50075 is already in use. If you need to run the DataNode service on different ports, review the following properties:

dfs.datanode.address : default 50010
dfs.datanode.http.address : default 50075
dfs.datanode.ipc.address : default 8010

RESOLUTION:
Stop or kill the process that is using port 50010, 50075 or 8010, then start the DataNode again.

Please feel free to give your valuable feedback.

http://www.hadoopadmin.co.in/ambari/datanode-doesnt-start-with-error-java-net-bindexception-address-already-in-use/

 

Top Hadoop Interview Questions

1. What are the Side Data Distribution Techniques?

Side data refers to the extra, static, small data required by a MapReduce job. The main challenge is making that side data available on the node where the map task will be executed. Hadoop provides two side data distribution techniques: the job configuration and the distributed cache (covered in question 4).

Using Job Configuration

An arbitrary key-value pair can be set in the job configuration by the driver and read back inside the map or reduce tasks, as the sketch below shows.
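
A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API; the property name my.side.data.region and its value are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataExample {

    public static class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String region;

        @Override
        protected void setup(Context context) {
            // Read the side data back from the job configuration on the task node.
            region = context.getConfiguration().get("my.side.data.region", "UNKNOWN");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tag every record with the side data value.
            context.write(new Text(region), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set an arbitrary key-value pair as side data in the driver.
        conf.set("my.side.data.region", "EMEA");
        Job job = Job.getInstance(conf, "side-data-demo");
        job.setJarByClass(SideDataExample.class);
        job.setMapperClass(SideDataMapper.class);
        // ... input/output paths, formats and the rest of the job setup omitted
    }
}

This approach only suits small amounts of side data; anything larger belongs in the distributed cache.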

2. What is shuffling in MapReduce?

Once the map tasks start to complete, communication from the reducers begins: the map output is sent to the reducers, which are waiting for data to process, while the nodes are still processing other tasks. This transfer of the mappers’ output to the reducers is known as shuffling.

3. What is partitioning?

Partitioning is the process of identifying which reducer instance will receive a given piece of mapper output. Before the mapper emits a (key, value) pair to a reducer, the partitioner identifies the reducer that will be the recipient of that output. All values for a key, no matter which mapper generated them, must end up at the same reducer. A sketch of a custom partitioner follows.
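
As an illustration only (this class is not part of the original answer), a custom partitioner can be written the same way Hadoop’s default HashPartitioner works, routing each key by its hash code:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer by its hash, like the default HashPartitioner.
public class HashBasedPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative,
        // then take the remainder by the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It is wired into a job with job.setPartitionerClass(HashBasedPartitioner.class) together with job.setNumReduceTasks(n), which guarantees that every occurrence of the same key lands on the same reducer.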

4. What is Distributed Cache in mapreduce framework?

The distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives and JARs, which the application can use to improve performance. The application gives the details of the file to cache to the job configuration, and the MapReduce framework copies the cached files to the worker nodes before any task runs there, so tasks can read them locally, as the sketch below illustrates.
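
A minimal sketch, assuming the newer Job API (job.addCacheFile / context.getCacheFiles); the HDFS path and the lookup.txt name are made up for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheExample {

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // The framework has already copied the cached file to this node.
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                // "lookup.txt" is the local symlink name taken from the URI fragment below.
                try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // ... load the lookup data into an in-memory map
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(DistributedCacheExample.class);
        job.setMapperClass(LookupMapper.class);
        // Register a small HDFS file with the distributed cache;
        // the "#lookup.txt" fragment sets the local symlink name in the task directory.
        job.addCacheFile(new URI("/user/demo/lookup.txt#lookup.txt"));
        // ... input/output paths, formats and the rest of the job setup omitted
    }
}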

5. What is a job tracker?

The JobTracker is a daemon, typically running on a master node, that is responsible for submitting and tracking jobs; a job in Hadoop terminology refers to a MapReduce job. The JobTracker breaks the job up into tasks, which are deployed to the DataNodes holding the required data. In a Hadoop cluster the JobTracker is the master and the tasks act as its children: they run, do the work, and report their progress back to the JobTracker through heartbeats.

6. How do you set which framework will be used to run a MapReduce program?

The property mapreduce.framework.name determines this. It can be set to one of the following values (see the sketch after this list):

  1. local
  2. classic
  3. yarn
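
A minimal sketch of setting it from the driver (in practice it is usually set cluster-wide in mapred-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FrameworkNameExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "yarn" runs on a YARN cluster, "local" runs in a single local JVM,
        // and "classic" uses the old MapReduce 1 (JobTracker) runtime.
        conf.set("mapreduce.framework.name", "yarn");
        Job job = Job.getInstance(conf, "framework-name-demo");
        // ... the rest of the job setup
    }
}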

7. What is replication factor for Job’s JAR?

The job’s JAR is one of the most critical resources, fetched repeatedly by the nodes running its tasks, so it is stored with a replication factor of 10 by default.

8. mapred.job.tracker property is used for?

The mapred.job.tracker property is used by the job runner to determine the job tracker mode. If it is set to local, the runner submits the job to a local job tracker running in a single JVM; otherwise, the job is sent to the host and port given in the property.

9. Difference between Job.submit() and waitForCompletion()?

Job.submit() internally creates a submitter instance and submits the job without waiting for it to finish, while waitForCompletion() submits the job and then polls its progress at a regular interval of about one second. If the job executes successfully, it reports success on the console; otherwise it reports a relevant error message. The driver sketch below contrasts the two calls.
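
A minimal driver sketch; the class name and the use of args[0]/args[1] for input and output paths are just for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitVsWait {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit-vs-wait");
        job.setJarByClass(SubmitVsWait.class);
        // ... mapper/reducer classes and output types omitted for brevity
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Option 1: fire and forget -- submits the job and returns immediately.
        // job.submit();

        // Option 2: submit, then poll progress about once per second and print it;
        // returns true only if the job succeeded.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}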

 

10. What are the types of tables in Hive?

There are two types of tables.

  1. Managed tables.
  2. External tables.

The main difference shows up with the DROP TABLE command: dropping a managed table removes its data as well, while dropping an external table leaves the data in place. Otherwise, both types of tables are very similar.

11. Does Hive support record level Insert, delete or update?

Hive does not provide record-level update, insert, or delete. Hence, Hive does not provide transactions either. However, users can use CASE statements and Hive’s built-in functions to approximate these DML operations. Thus, a complex update query in an RDBMS may need many lines of code in Hive.

12. What kind of datawarehouse application is suitable for Hive?

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where

1) Relatively static data is analyzed,

2) Fast response times are not required, and

3) The data is not changing rapidly.

Hive doesn’t provide the crucial features required for OLTP (Online Transaction Processing). It is closer to being an OLAP (Online Analytical Processing) tool. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

13. How can the columns of a table in hive be written to a file?

By using the awk command in the shell, the output of the HiveQL DESCRIBE command can be written to a file.

hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output

14. CONCAT function in Hive with Example?

The CONCAT function concatenates the input strings. You can specify any number of strings, separated by commas.

Example:

CONCAT('Hive','-','performs','-','good','-','in','-','Hadoop');

Output:

Hive-performs-good-in-Hadoop

Here you have to insert the delimiter '-' between every pair of strings. If the delimiter is common to all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first.

CONCAT_WS('-','Hive','performs','good','in','Hadoop');

Output: Hive-performs-good-in-Hadoop

15. REPEAT function in Hive with example?

The REPEAT function repeats the input string the number of times (n) specified in the command.

Example:

REPEAT('Hadoop',3);

Output:

HadoopHadoopHadoop

Note: You can include a space in the input string as well.

16. How Pig integrate with Mapreduce to process data?

Pig is easier to use. When a programmer writes a script to analyze data sets, the Pig compiler converts the program into a form MapReduce understands, and the Pig engine executes the query as MapReduce jobs. MapReduce processes the data and generates the output; it does not return the output to Pig but stores it directly in HDFS.

17. What is the difference between logical and physical plan?

Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

18. How many ways we can run Pig programs?

Pig programs or commands can be executed in three ways.

  • Script – Batch Method
  • Grunt Shell – Interactive Method
  • Embedded mode

All these ways can be applied to both Local and Mapreduce modes of execution.

19. What is Grunt in Pig?

Grunt is an Interactive Shell in Pig, and below are its major features:

  • The Ctrl-E key combination moves the cursor to the end of the line.
  • Grunt remembers command history and can recall lines in the history buffer using the up or down cursor keys.
  • Grunt supports an auto-completion mechanism, which will try to complete Pig Latin keywords and functions when you press the Tab key.

20. What are the modes of Pig Execution?

Local Mode:

Local execution in a single JVM; all files are installed and run using the local host and the local file system.

Mapreduce Mode:

Distributed execution on a Hadoop cluster; it is the default mode.

21. What are the main difference between local mode and MapReduce mode?

Local mode:

There is no need to install or start Hadoop; the Pig scripts run on the local system, and by default Pig stores data in the local file system. The commands are 100% the same in MapReduce and local mode, so nothing needs to change.

MapReduce Mode:

It is mandatory to start Hadoop; Pig scripts run against data stored in HDFS. In both modes, a Java and Pig installation is mandatory.

22. Can we process vast amount of data in local mode? Why?

No. A single system has a limited, fixed amount of storage, whereas Hadoop can handle vast amounts of data. So pig -x mapreduce mode is the best choice for processing vast amounts of data.

23. Does Pig support multi-line commands?

Yes

24. Hive doesn’t support multi-line comments; what about Pig?

Pig supports both single-line and multi-line comments.

Single line comments:

Dump B; -- It executes the data but does not store it in the file system.

Multiple Line comments:

Store B into '/output'; /* It stores/persists the data in HDFS or the local file system. In production, the Store command is the one most often used. */

25. Difference Between Pig and SQL ?

  • Pig is procedural; SQL is declarative.
  • Pig has a nested relational data model; SQL has a flat relational model.
  • In Pig the schema is optional; in SQL a schema is required.
  • Pig works for OLAP; SQL supports OLAP+OLTP workloads.
  • Pig offers limited query optimization; SQL offers significant opportunity for query optimization.

HDFS disk space vs NameNode heap size

In HDFS, data and metadata are decoupled. Data files are split into block files that are stored, and replicated on DataNodes across the cluster. The filesystem namespace tree and associated metadata are stored on the NameNode.

Namespace objects are file inodes and blocks that point to block files on the DataNodes. These namespace objects are stored as a file system image (fsimage) in the NameNode’s memory and also persist locally. Updates to the metadata are written to an edit log. When the NameNode starts, or when a checkpoint is taken, the edits are applied, the log is cleared, and a new fsimage is created.

On DataNodes, data files are measured by disk space consumed—the actual data length—and not necessarily the full block size.

For example, a file that is 192 MB consumes 192 MB of disk space and not some integral multiple of the block size. Using the default block size of 128 MB, a file of 192 MB is split into two block files, one 128 MB file and one 64 MB file. On the NameNode, namespace objects are measured by the number of files and blocks. The same 192 MB file is represented by three namespace objects (1 file inode + 2 blocks) and consumes approximately 450 bytes of memory.

Large files split into fewer blocks generally consume less memory than small files that generate many blocks. One data file of 128 MB is represented by two namespace objects on the NameNode (1 file inode + 1 block) and consumes approximately 300 bytes of memory. By contrast, 128 files of 1 MB each are represented by 256 namespace objects (128 file inodes + 128 blocks) and consume approximately 38,400 bytes. The optimal split size, then, is some integral multiple of the block size, for memory management as well as data locality optimization.

How much memory you actually need depends on your workload, especially on the number of files, directories, and blocks generated in each namespace. If all of your files are split at the block size, you could allocate 1 GB for every million files. But given the historical average of 1.5 blocks per file (2 block objects), a more conservative estimate is 1 GB of memory for every million blocks.

Example 1: Estimating NameNode Heap Memory Used
Alice, Bob, and Carl each have 1 GB (1024 MB) of data on disk, but sliced into differently sized files. Alice and Bob have files that are some integral multiple of the block size and require the least memory. Carl does not, and fills the heap with unnecessary namespace objects.

Alice: 1 x 1024 MB file
  • 1 file inode
  • 8 blocks (1024 MB / 128 MB)
  • Total = 9 objects * 150 bytes = 1,350 bytes of heap memory

Bob: 8 x 128 MB files
  • 8 file inodes
  • 8 blocks
  • Total = 16 objects * 150 bytes = 2,400 bytes of heap memory

Carl: 1,024 x 1 MB files
  • 1,024 file inodes
  • 1,024 blocks
  • Total = 2,048 objects * 150 bytes = 307,200 bytes of heap memory

Example 2: Estimating NameNode Heap Memory Needed
In this example, memory is estimated by considering the capacity of a cluster. Values are rounded. Both clusters physically store 4800 TB, or approximately 36 million block files (at the default block size). Replication determines how many namespace blocks represent these block files.

Cluster A: 200 hosts of 24 TB each = 4800 TB
  • Blocksize = 128 MB, Replication = 1
  • Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
  • Disk space needed per block: 128 MB per block * 1 = 128 MB of storage per block
  • Cluster capacity in blocks: 4,800,000,000 MB / 128 MB = 36,000,000 blocks

At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster A needs 36 GB of maximum heap space.

Cluster B: 200 hosts of 24 TB each = 4800 TB
  • Blocksize = 128 MB, Replication = 3
  • Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
  • Disk space needed per block: 128 MB per block * 3 = 384 MB of storage per block
  • Cluster capacity in blocks: 4,800,000,000 MB / 384 MB = 12,000,000 blocks

At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster B needs 12 GB of maximum heap space.

Both Cluster A and Cluster B store the same number of block files. In Cluster A, however, each block file is unique and represented by one block on the NameNode; in Cluster B, only one-third are unique and two-thirds are replicas.

What is the role of intuition in the era of big data? Have machines and data supplanted the human mind?

Contrary to what some people believe, intuition is as important as ever. When looking at massive, unprecedented datasets, you need someplace to start. It has been argued that intuition is more important than ever precisely because there is so much data now. We are entering an era in which more and more things can be tested.

Big data has not replaced intuition — at least not yet; the latter merely complements the former. The relationship between the two is a continuum, not a binary.

Ranger User sync does not work due to ERROR UserGroupSync [UnixUserSyncThread]

If we have enabled AD/LDAP user sync in Ranger and we get the error below, then we need to follow the steps given here to resolve it.

LdapUserGroupBuilder [UnixUserSyncThread] – Updating user count: 148, userName:, groupList: [test, groups]
09 Jun 2016 09:04:34 ERROR UserGroupSync [UnixUserSyncThread] – Failed to initialize UserGroup source/sink. Will retry after 3600000 milliseconds. Error details:
javax.naming.PartialResultException: Unprocessed Continuation Reference(s); remaining name ‘dc=companyName,dc=com’
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2866)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2840)
at com.sun.jndi.ldap.LdapNamingEnumeration.getNextBatch(LdapNamingEnumeration.java:147)
at com.sun.jndi.ldap.LdapNamingEnumeration.hasMoreImpl(LdapNamingEnumeration.java:216)
at com.sun.jndi.ldap.LdapNamingEnumeration.hasMore(LdapNamingEnumeration.java:189)
at org.apache.ranger.ldapusersync.process.LdapUserGroupBuilder.updateSink(LdapUserGroupBuilder.java:318)
at org.apache.ranger.usergroupsync.UserGroupSync.run(UserGroupSync.java:58)
at java.lang.Thread.run(Thread.java:745)

Root Cause: When Ranger usersync is configured with ranger.usersync.ldap.referral = ignore, the LDAP search fails prematurely when it encounters additional referrals.

Resolution:

  1. Change the search base DN to dc=companyName,dc=com from cn=Users,dc=companyName,dc=com
  2. Also change ranger.usersync.ldap.referral from ignore to follow

Following these steps resolves the issue above. I hope this helps you solve your issue easily.

Please feel free to give feedback or suggestions for any improvements.

Error: java.io.IOException: java.lang.RuntimeException: serious problem (state=,code=0)

If you run a Hive query on ORC tables in HDP 2.3.4, you may encounter this issue. It is caused by ORC split generation running on a global thread pool while doAs is not propagated to that thread pool. Threads in the pool are created on demand at execution time and thus execute as whichever random users were active at that time.

It is a known issue, tracked and fixed in https://issues.apache.org/jira/browse/HIVE-13120.

Intermittently, ODBC users get an error saying that another user doesn’t have permission on the table; it seems HiveServer2 is checking the wrong user. For example, say you run a job as user ‘user1’; then the error message you get is something like:

WARN [HiveServer2-Handler-Pool: Thread-587]: thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(681)) – Error fetching results:
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: serious problem
Caused by: java.io.IOException: java.lang.RuntimeException: serious problem

Caused by: java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1059)

Caused by: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException: Permission denied: user=haha, access=READ_EXECUTE, inode=”/apps/hive/warehouse/xixitb”:xixi:hdfs:drwx——
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)

Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=haha, access=READ_EXECUTE, inode=”/apps/hive/warehouse/xixitb”:xixi:hdfs:drwx——
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
Note that user ‘haha’ is not querying on ‘xixitb’ at all.

Resolution:
For this issue we need to set the following property at run time as a workaround, which turns off the local fetch task for HiveServer2.

set hive.fetch.task.conversion=none

0: jdbc:hive2://localhost:8443/default> set hive.fetch.task.conversion=none;

No rows affected (0.033 seconds)

0: jdbc:hive2://localhost:8443/default> select * from database1.table1 where lct_nbr=2413 and ivo_nbr in (17469,18630);

INFO  : Tez session hasn’t been created yet. Opening session

INFO  : Dag name: select * from ldatabase1.table1…(17469,18630)(Stage-1)

INFO  :

INFO  : Status: Running (Executing on YARN cluster with App id application_1462173172032_65644)

INFO  : Map 1: -/-

INFO  : Map 1: 0/44

INFO  : Map 1: 0(+1)/44

Please feel free to give any feedback or suggestions for any improvements.