
Why Learn Big Data and Hadoop?


In my experience, people who work on things in their career that they are excited about and passionate about go farther and faster, carried by that self-motivation, than people who do something they don’t like but feel they need to do for other reasons. You are already showing great initiative in your career by doing your research, including visiting my blog.


This current wave of “big data” presents tremendous opportunities.  The deluge of big data is likely to persist into the future.  Tools to handle big data will eventually become mainstream and commonplace, at which point almost everyone will be working with big data.  However, enterprising folks can still get ahead of the mainstream today by investing in skills and career development.  I realize this may sound like hyperbole, but this is the historical pattern we have seen in how technology gets adopted and the resulting shifts in the workforce (e.g. the printing press, radio, television, computers, the internet, etc.).

Big Data! A Worldwide Problem:
“Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” In simpler terms, Big Data is the term given to the large volumes of data that organizations store and process. However, it is becoming very difficult for companies to store, retrieve and process this ever-increasing data. If a company manages its data well, nothing can stop it from becoming the next BIG success!

The problem lies in using traditional systems to store enormous amounts of data. Though these systems were a success a few years ago, they are quickly becoming obsolete as the amount and complexity of data increase. The good news is Hadoop, which is nothing less than a panacea for companies working with Big Data in a variety of applications, and it has become integral to storing, handling, evaluating and retrieving huge volumes of data, up to hundreds of terabytes or even petabytes.

Apache Hadoop! A Solution for Big Data:
Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license, and it is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google about its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project, built and used by a global community of contributors. Hadoop was created by Doug Cutting and Michael J. Cafarella. And don’t overlook the charming yellow elephant logo, which is named after Doug’s son’s toy elephant!

Some of the top companies using Hadoop:
The importance of Hadoop is evident from the fact that many global companies use Hadoop and consider it an integral part of their operations, including Yahoo and Facebook! On February 19, 2008, Yahoo! Inc. launched what was then the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and generates data that is used in every Yahoo! Web search query.

Facebook, a $5.1 billion company, had over 1 billion active users in 2012, according to Wikipedia. Storing and managing data of that magnitude could have been a problem, even for a company like Facebook. But thanks to Apache Hadoop, it isn’t! Facebook uses Hadoop to keep track of every profile it hosts, as well as all the related data such as images, posts, comments, videos, etc.

Opportunities for Hadoopers:
Opportunities for Hadoopers are endless, from Hadoop Admin and Developer to Hadoop Tester or Hadoop Architect, and so on. If cracking and managing Big Data is your passion in life, then think no more: join the EconITService Hadoop course and carve a niche for yourself! Happy Hadooping!

 

Datanode doesn’t start with error “java.net.BindException: Address already in use”

In many real-world scenarios we have seen the error “java.net.BindException: Address already in use” when starting the DataNode.

You can observe the following during this issue:

1. The DataNode doesn’t start, with an error saying “address already in use”.
2. “netstat -anp | grep 50010” shows no result.

ROOT CAUSE:
The DataNode needs three ports when it starts, and each port produces a different error message when its address is already in use.

1. Port 50010 is already in use
2016-12-02 00:01:14,056 ERROR datanode.DataNode (DataNode.java:secureMain(2630)) – Exception in secureMain
java.net.BindException: Problem binding to [0.0.0.0:50010] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException

2. Port 50075 is already in use
2016-12-01 23:57:57,298 ERROR datanode.DataNode (DataNode.java:secureMain(2630)) – Exception in secureMain
java.net.BindException: Address already in use

3. Port 8010 is already in use
2016-12-02 00:09:40,422 ERROR datanode.DataNode (DataNode.java:secureMain(2630)) – Exception in secureMain
java.net.BindException: Problem binding to [0.0.0.0:8010] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException

Note that there is no port information in datanode.log when port 50075 is already in use. If you need to run the DataNode service on different ports, review the following properties:

dfs.datanode.address : default 50010
dfs.datanode.http.address : default 50075
dfs.datanode.ipc.address : default 8010

RESOLUTION:
Stop or kill the process that is using port 50010, 50075 or 8010, then start the DataNode again.

Please feel free to give your valuable feedback.

http://www.hadoopadmin.co.in/ambari/datanode-doesnt-start-with-error-java-net-bindexception-address-already-in-use/

 

Top Hadoop Interview Questions

1. What are the Side Data Distribution Techniques?

Side data refers to the extra, static, small data required by a MapReduce job. The main challenge is making that side data available on the node where the map task will be executed. Hadoop provides two side data distribution techniques: the job configuration and the distributed cache (covered in question 4).

Using Job Configuration

An arbitrary key-value pair can be set in the job configuration by the driver and read back inside the map or reduce tasks, as the sketch below shows.
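
A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API; the property name my.side.data.region and its value are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataExample {

    public static class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String region;

        @Override
        protected void setup(Context context) {
            // Read the side data back from the job configuration on the task node.
            region = context.getConfiguration().get("my.side.data.region", "UNKNOWN");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tag every record with the side data value.
            context.write(new Text(region), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set an arbitrary key-value pair as side data in the driver.
        conf.set("my.side.data.region", "EMEA");
        Job job = Job.getInstance(conf, "side-data-demo");
        job.setJarByClass(SideDataExample.class);
        job.setMapperClass(SideDataMapper.class);
        // ... input/output paths, formats and the rest of the job setup omitted
    }
}

This approach only suits small amounts of side data; anything larger belongs in the distributed cache.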

2. What is shuffling in MapReduce?

Once the map tasks start to complete, communication from the reducers begins: the map output is sent to the reducers, which are waiting for data to process, while the nodes are still processing other tasks. This transfer of the mappers’ output to the reducers is known as shuffling.

3. What is partitioning?

Partitioning is the process of identifying which reducer instance will receive a given piece of mapper output. Before the mapper emits a (key, value) pair to a reducer, the partitioner identifies the reducer that will be the recipient of that output. All values for a key, no matter which mapper generated them, must end up at the same reducer. A sketch of a custom partitioner follows.
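
As an illustration only (this class is not part of the original answer), a custom partitioner can be written the same way Hadoop’s default HashPartitioner works, routing each key by its hash code:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer by its hash, like the default HashPartitioner.
public class HashBasedPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative,
        // then take the remainder by the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It is wired into a job with job.setPartitionerClass(HashBasedPartitioner.class) together with job.setNumReduceTasks(n), which guarantees that every occurrence of the same key lands on the same reducer.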

4. What is Distributed Cache in mapreduce framework?

The distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives and JARs, which the application can use to improve performance. The application gives the details of the file to cache to the job configuration, and the MapReduce framework copies the cached files to the worker nodes before any task runs there, so tasks can read them locally, as the sketch below illustrates.
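
A minimal sketch, assuming the newer Job API (job.addCacheFile / context.getCacheFiles); the HDFS path and the lookup.txt name are made up for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheExample {

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // The framework has already copied the cached file to this node.
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                // "lookup.txt" is the local symlink name taken from the URI fragment below.
                try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // ... load the lookup data into an in-memory map
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(DistributedCacheExample.class);
        job.setMapperClass(LookupMapper.class);
        // Register a small HDFS file with the distributed cache;
        // the "#lookup.txt" fragment sets the local symlink name in the task directory.
        job.addCacheFile(new URI("/user/demo/lookup.txt#lookup.txt"));
        // ... input/output paths, formats and the rest of the job setup omitted
    }
}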

5. What is a job tracker?

The JobTracker is a daemon, typically running on a master node, that is responsible for submitting and tracking jobs; a job in Hadoop terminology refers to a MapReduce job. The JobTracker breaks the job up into tasks, which are deployed to the DataNodes holding the required data. In a Hadoop cluster the JobTracker is the master and the tasks act as its children: they run, do the work, and report their progress back to the JobTracker through heartbeats.

6. How do you set which framework will be used to run a MapReduce program?

The property mapreduce.framework.name determines this. It can be set to one of the following values (see the sketch after this list):

  1. local
  2. classic
  3. yarn
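
A minimal sketch of setting it from the driver (in practice it is usually set cluster-wide in mapred-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FrameworkNameExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "yarn" runs on a YARN cluster, "local" runs in a single local JVM,
        // and "classic" uses the old MapReduce 1 (JobTracker) runtime.
        conf.set("mapreduce.framework.name", "yarn");
        Job job = Job.getInstance(conf, "framework-name-demo");
        // ... the rest of the job setup
    }
}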

7. What is replication factor for Job’s JAR?

The job’s JAR is one of the most critical resources, fetched repeatedly by the nodes running its tasks, so it is stored with a replication factor of 10 by default.

8. mapred.job.tracker property is used for?

The mapred.job.tracker property is used by the job runner to determine the job tracker mode. If it is set to local, the runner submits the job to a local job tracker running in a single JVM; otherwise, the job is sent to the host and port given in the property.

9. Difference between Job.submit() and waitForCompletion()?

Job.submit() internally creates a submitter instance and submits the job without waiting for it to finish, while waitForCompletion() submits the job and then polls its progress at a regular interval of about one second. If the job executes successfully, it reports success on the console; otherwise it reports a relevant error message. The driver sketch below contrasts the two calls.
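
A minimal driver sketch; the class name and the use of args[0]/args[1] for input and output paths are just for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitVsWait {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit-vs-wait");
        job.setJarByClass(SubmitVsWait.class);
        // ... mapper/reducer classes and output types omitted for brevity
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Option 1: fire and forget -- submits the job and returns immediately.
        // job.submit();

        // Option 2: submit, then poll progress about once per second and print it;
        // returns true only if the job succeeded.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}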

 

10. What are the types of tables in Hive?

There are two types of tables.

  1. Managed tables.
  2. External tables.

The main difference shows up with the DROP TABLE command: dropping a managed table removes its data as well, while dropping an external table leaves the data in place. Otherwise, both types of tables are very similar.

11. Does Hive support record level Insert, delete or update?

Hive does not provide record-level update, insert, or delete. Hence, Hive does not provide transactions either. However, users can use CASE statements and Hive’s built-in functions to approximate these DML operations. Thus, a complex update query in an RDBMS may need many lines of code in Hive.

12. What kind of datawarehouse application is suitable for Hive?

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where

1) Relatively static data is analyzed,

2) Fast response times are not required, and

3) The data is not changing rapidly.

Hive doesn’t provide the crucial features required for OLTP (Online Transaction Processing). It is closer to being an OLAP (Online Analytical Processing) tool. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

13. How can the columns of a table in hive be written to a file?

By using the awk command in the shell, the output of the HiveQL DESCRIBE command can be written to a file.

hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output

14. CONCAT function in Hive with Example?

The CONCAT function concatenates the input strings. You can specify any number of strings, separated by commas.

Example:

CONCAT('Hive','-','performs','-','good','-','in','-','Hadoop');

Output:

Hive-performs-good-in-Hadoop

Here you have to insert the delimiter '-' between every pair of strings. If the delimiter is common to all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first.

CONCAT_WS('-','Hive','performs','good','in','Hadoop');

Output: Hive-performs-good-in-Hadoop

15. REPEAT function in Hive with example?

The REPEAT function repeats the input string the number of times (n) specified in the command.

Example:

REPEAT('Hadoop',3);

Output:

HadoopHadoopHadoop

Note: You can include a space in the input string as well.

16. How Pig integrate with Mapreduce to process data?

Pig is easier to use. When a programmer writes a script to analyze data sets, the Pig compiler converts the program into a form MapReduce understands, and the Pig engine executes the query as MapReduce jobs. MapReduce processes the data and generates the output; it does not return the output to Pig but stores it directly in HDFS.

17. What is the difference between logical and physical plan?

Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

18. How many ways we can run Pig programs?

Pig programs or commands can be executed in three ways.

  • Script – Batch Method
  • Grunt Shell – Interactive Method
  • Embedded mode

All these ways can be applied to both Local and Mapreduce modes of execution.

19. What is Grunt in Pig?

Grunt is an Interactive Shell in Pig, and below are its major features:

  • The Ctrl-E key combination moves the cursor to the end of the line.
  • Grunt remembers command history and can recall lines in the history buffer using the up or down cursor keys.
  • Grunt supports an auto-completion mechanism, which will try to complete Pig Latin keywords and functions when you press the Tab key.

20. What are the modes of Pig Execution?

Local Mode:

Local execution in a single JVM; all files are installed and run using the local host and the local file system.

Mapreduce Mode:

Distributed execution on a Hadoop cluster; it is the default mode.

21. What are the main difference between local mode and MapReduce mode?

Local mode:

There is no need to install or start Hadoop; the Pig scripts run on the local system, and by default Pig stores data in the local file system. The commands are 100% the same in MapReduce and local mode, so nothing needs to change.

MapReduce Mode:

It is mandatory to start Hadoop; Pig scripts run against data stored in HDFS. In both modes, a Java and Pig installation is mandatory.

22. Can we process vast amount of data in local mode? Why?

No. A single system has a limited, fixed amount of storage, whereas Hadoop can handle vast amounts of data. So pig -x mapreduce mode is the best choice for processing vast amounts of data.

23. Does Pig support multi-line commands?

Yes

24. Hive doesn’t support multi-line comments; what about Pig?

Pig supports both single-line and multi-line comments.

Single line comments:

Dump B; -- It executes the data but does not store it in the file system.

Multiple Line comments:

Store B into '/output'; /* It stores/persists the data in HDFS or the local file system. In production, the Store command is the one most often used. */

25. Difference Between Pig and SQL ?

  • Pig is procedural; SQL is declarative.
  • Pig has a nested relational data model; SQL has a flat relational model.
  • In Pig the schema is optional; in SQL a schema is required.
  • Pig works for OLAP; SQL supports OLAP+OLTP workloads.
  • Pig offers limited query optimization; SQL offers significant opportunity for query optimization.

HDFS disk space vs NameNode heap size

In HDFS, data and metadata are decoupled. Data files are split into block files that are stored, and replicated on DataNodes across the cluster. The filesystem namespace tree and associated metadata are stored on the NameNode.

Namespace objects are file inodes and blocks that point to block files on the DataNodes. These namespace objects are stored as a file system image (fsimage) in the NameNode’s memory and also persist locally. Updates to the metadata are written to an edit log. When the NameNode starts, or when a checkpoint is taken, the edits are applied, the log is cleared, and a new fsimage is created.

On DataNodes, data files are measured by disk space consumed—the actual data length—and not necessarily the full block size.

For example, a file that is 192 MB consumes 192 MB of disk space and not some integral multiple of the block size. Using the default block size of 128 MB, a file of 192 MB is split into two block files, one 128 MB file and one 64 MB file. On the NameNode, namespace objects are measured by the number of files and blocks. The same 192 MB file is represented by three namespace objects (1 file inode + 2 blocks) and consumes approximately 450 bytes of memory.

Large files split into fewer blocks generally consume less memory than small files that generate many blocks. One data file of 128 MB is represented by two namespace objects on the NameNode (1 file inode + 1 block) and consumes approximately 300 bytes of memory. By contrast, 128 files of 1 MB each are represented by 256 namespace objects (128 file inodes + 128 blocks) and consume approximately 38,400 bytes. The optimal split size, then, is some integral multiple of the block size, for memory management as well as data locality optimization.

How much memory you actually need depends on your workload, especially on the number of files, directories, and blocks generated in each namespace. If all of your files are split at the block size, you could allocate 1 GB for every million files. But given the historical average of 1.5 blocks per file (2 block objects), a more conservative estimate is 1 GB of memory for every million blocks.

Example 1: Estimating NameNode Heap Memory Used
Alice, Bob, and Carl each have 1 GB (1024 MB) of data on disk, but sliced into differently sized files. Alice and Bob have files that are some integral multiple of the block size and require the least memory. Carl does not, and fills the heap with unnecessary namespace objects.

Alice: 1 x 1024 MB file
  • 1 file inode
  • 8 blocks (1024 MB / 128 MB)
  • Total = 9 objects * 150 bytes = 1,350 bytes of heap memory

Bob: 8 x 128 MB files
  • 8 file inodes
  • 8 blocks
  • Total = 16 objects * 150 bytes = 2,400 bytes of heap memory

Carl: 1,024 x 1 MB files
  • 1,024 file inodes
  • 1,024 blocks
  • Total = 2,048 objects * 150 bytes = 307,200 bytes of heap memory

Example 2: Estimating NameNode Heap Memory Needed
In this example, memory is estimated by considering the capacity of a cluster. Values are rounded. Both clusters physically store 4800 TB, or approximately 36 million block files (at the default block size). Replication determines how many namespace blocks represent these block files.

Cluster A: 200 hosts of 24 TB each = 4800 TB
  • Blocksize = 128 MB, Replication = 1
  • Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
  • Disk space needed per block: 128 MB per block * 1 = 128 MB of storage per block
  • Cluster capacity in blocks: 4,800,000,000 MB / 128 MB = 36,000,000 blocks

At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster A needs 36 GB of maximum heap space.

Cluster B: 200 hosts of 24 TB each = 4800 TB
  • Blocksize = 128 MB, Replication = 3
  • Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
  • Disk space needed per block: 128 MB per block * 3 = 384 MB of storage per block
  • Cluster capacity in blocks: 4,800,000,000 MB / 384 MB = 12,000,000 blocks

At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster B needs 12 GB of maximum heap space.

Both Cluster A and Cluster B store the same number of block files. In Cluster A, however, each block file is unique and represented by one block on the NameNode; in Cluster B, only one-third are unique and two-thirds are replicas.

What is the role of intuition in the era of big data? Have machines and data supplanted the human mind?

Contrary to what some people believe, intuition is as important as ever. When looking at massive, unprecedented datasets, you need someplace to start. It has been argued that intuition is more important than ever precisely because there is so much data now. We are entering an era in which more and more things can be tested.

Big data has not replaced intuition — at least not yet; the latter merely complements the former. The relationship between the two is a continuum, not a binary.

Ranger User sync does not work due to ERROR UserGroupSync [UnixUserSyncThread]

If we have enabled AD/LDAP user sync in Ranger and we get the error below, then we need to follow the steps given here to resolve it.

LdapUserGroupBuilder [UnixUserSyncThread] – Updating user count: 148, userName:, groupList: [test, groups]
09 Jun 2016 09:04:34 ERROR UserGroupSync [UnixUserSyncThread] – Failed to initialize UserGroup source/sink. Will retry after 3600000 milliseconds. Error details:
javax.naming.PartialResultException: Unprocessed Continuation Reference(s); remaining name ‘dc=companyName,dc=com’
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2866)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2840)
at com.sun.jndi.ldap.LdapNamingEnumeration.getNextBatch(LdapNamingEnumeration.java:147)
at com.sun.jndi.ldap.LdapNamingEnumeration.hasMoreImpl(LdapNamingEnumeration.java:216)
at com.sun.jndi.ldap.LdapNamingEnumeration.hasMore(LdapNamingEnumeration.java:189)
at org.apache.ranger.ldapusersync.process.LdapUserGroupBuilder.updateSink(LdapUserGroupBuilder.java:318)
at org.apache.ranger.usergroupsync.UserGroupSync.run(UserGroupSync.java:58)
at java.lang.Thread.run(Thread.java:745)

Root Cause: When Ranger usersync is configured with ranger.usersync.ldap.referral = ignore, the LDAP search fails prematurely when it encounters additional referrals.

Resolution:

  1. Change the search base DN to dc=companyName,dc=com from cn=Users,dc=companyName,dc=com
  2. Also change ranger.usersync.ldap.referral from ignore to follow

Following these steps resolves the issue above. I hope this helps you solve your issue easily.

Please feel free to give feedback or suggestions for any improvements.

Error: java.io.IOException: java.lang.RuntimeException: serious problem (state=,code=0)

If you run a Hive query on ORC tables in HDP 2.3.4, you may encounter this issue. It is caused by ORC split generation running on a global thread pool while doAs is not propagated to that thread pool. Threads in the pool are created on demand at execution time and thus execute as whichever random users were active at that time.

It is a known issue, tracked and fixed in https://issues.apache.org/jira/browse/HIVE-13120.

Intermittently, ODBC users get an error saying that another user doesn’t have permission on the table; it seems HiveServer2 is checking the wrong user. For example, say you run a job as user ‘user1’; then the error message you get is something like:

WARN [HiveServer2-Handler-Pool: Thread-587]: thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(681)) – Error fetching results:
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: serious problem
Caused by: java.io.IOException: java.lang.RuntimeException: serious problem

Caused by: java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1059)

Caused by: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException: Permission denied: user=haha, access=READ_EXECUTE, inode=”/apps/hive/warehouse/xixitb”:xixi:hdfs:drwx——
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)

Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=haha, access=READ_EXECUTE, inode=”/apps/hive/warehouse/xixitb”:xixi:hdfs:drwx——
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
Note that user ‘haha’ is not querying on ‘xixitb’ at all.

Resolution:
For this issue we need to set the following property at run time as a workaround, which turns off the local fetch task for HiveServer2.

set hive.fetch.task.conversion=none

0: jdbc:hive2://localhost:8443/default> set hive.fetch.task.conversion=none;

No rows affected (0.033 seconds)

0: jdbc:hive2://localhost:8443/default> select * from database1.table1 where lct_nbr=2413 and ivo_nbr in (17469,18630);

INFO  : Tez session hasn’t been created yet. Opening session

INFO  : Dag name: select * from ldatabase1.table1…(17469,18630)(Stage-1)

INFO  :

INFO  : Status: Running (Executing on YARN cluster with App id application_1462173172032_65644)

INFO  : Map 1: -/-

INFO  : Map 1: 0/44

INFO  : Map 1: 0(+1)/44

Please feel free to give any feedback or suggestions for any improvements.