
Oozie null pointer exception when submitting jobs

cloudera,oozie,oozie-coordinator

Would like to add to the answer of this question. When the Oozie client gets a NullPointerException, it usually means the request caused a server-side thread failure. If you want to find out the "true" reason, you should look at the server log, such as /var/log/oozie/oozie.log for CDH. There you will find...

Hive query executing differently in Hive client and JDBC

hadoop,jdbc,hive,cloudera

A few suggestions which can help you debug: check the properties set when running from the beeline client and through JDBC. I suspect the property hive.auto.convert.join is causing this behavior. Basically, from beeline it is able to read all the smaller tables (two to six) into the memory of each mapper and...

Not able to drop hive table

mysql,hadoop,hive,cloudera,cloudera-cdh

I had exactly the same issue. I tried to do what was suggested in the reference link, but I still had the same problem (even when I fixed all the errors I could see when running the scripts). Finally, I looked for that BIG_DECIMAL_HIGH_VALUE in the Hive metastore scripts, and...

SparkException: local class incompatible

java,hadoop,apache-spark,cloudera,cloudera-manager

After spending hours, we have solved the problem. Our problem's root cause is that we downloaded apache-spark from the official site and built it ourselves, so some jars are not compatible with the Cloudera distribution. We learned today that the Spark Cloudera distribution is available on GitHub (https://github.com/cloudera/spark/tree/cdh5-1.2.0_5.3.2), and after building it we have...

Pig - How to Join and Define Schema in One Step

hadoop,apache-pig,bigdata,cloudera

According to the documentation you can't define a schema when joining relations. Note: Syntactically you can nest commands to have the feeling that you saved some steps like: D = foreach (join (LOAD 'a.txt' USING PigStorage('\\u001') AS (foo:int ,bar:chararray)) by foo, (LOAD 'b.txt' USING PigStorage('\\u001') AS (foo:int ,baz:long)) by foo...

What happens if impala Query runs out of memory?

hadoop,cloudera,impala,mpp

It depends on the version of Impala and how it's configured. In general, Impala will kill queries when they run out of memory. There is a process-wide memory limit at which point any query that requests memory will be killed. There is also another optional, per-query memory limit. Impala 2.0...

External Authentication - Cloudera Manager 5 and OpenLDAP

ldap,cloudera,openldap,cloudera-manager

After some days on it, I found that SRCH base="" is not correct; the base must be provided in Cloudera Manager, even if a user pattern is already filled in. I added the base pattern "dc=example,dc=com" and it worked. Felt stupid....

How to Specify ToDate during Schema Definition?

hadoop,apache-pig,cloudera

UDFs can't be applied in the schema definition. You may write your own loader instead.

hadoop - map reduce task and static variable

java,hadoop,cloudera

In a distributed Hadoop cluster, each Map/Reduce task runs in its own separate JVM. So there's no way to share a static variable between class instances running in different JVMs (and even on different nodes). But if you want to share some immutable data between tasks, you can use Configuration...
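The same isolation can be sketched with plain processes, which is not Hadoop-specific but illustrates the point: a child process gets a copy of its parent's state and cannot push changes back, just as one task JVM cannot see another's statics.

```shell
# Each Map/Reduce task runs in its own JVM, analogous to a separate process:
# a subshell runs in its own process, so mutating a variable there is
# invisible to the parent. No shared mutable state across processes.
COUNTER=0
( COUNTER=42; echo "child sees COUNTER=$COUNTER" )   # subshell = separate process
echo "parent still sees COUNTER=$COUNTER"            # unchanged: no sharing
```

Running this prints `child sees COUNTER=42` followed by `parent still sees COUNTER=0`, which is why Hadoop tasks must pass shared read-only data through Configuration (or the distributed cache) rather than statics.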

users other than root cannot access Hadoop

hadoop,cloudera,cloudera-cdh

The problem was caused by different environment settings for the "root" and "my-user" accounts. During the process, I set $HADOOP_HOME in my .bashrc, which was forcing the "my-user" account to use an obsolete path. Adjusting it to match the root account's setting solved the problem.

Search of integer is slow when compared to string in solr

search,solr,cloudera

Both of your search examples are text as far as Solr is concerned. So, they should be treated identically. So, either you missed something from your description of the situation or there is something very funny about particular records. Have you tried searching for string and "integer" values that supposed...

Broken packages error while installing zookeeper-server

hadoop,cloudera,zookeeper,apt,cloudera-cdh

On reading this blog, I came to know this was not so much an issue as broken packages. To resolve this broken package error: Uninstall the previous Zookeeper: $ sudo apt-get remove zookeeper $ sudo apt-get purge zookeeper Fresh install Zookeeper: $ sudo apt-get install zookeeper Install Zookeeper Server: $ sudo apt-get install...

Hive Query Language return only values where NOT LIKE a value in another table

hadoop,hive,cloudera,hiveql,impala

If your Hive version is 0.13 or newer, then you could use a subquery in the WHERE clause to filter the rows from the hosts table. The following is a more generalized approach that would not require you to enumerate all of the top-level domains you might find in your...

How to install impala on an already running hadoop cluster

hadoop,cloudera,impala

Here you go: Cloudera Non-CM Impala Installation

Cloudera manager is not starting

cloudera,cloudera-manager

I got the solution to this problem. One of the other applications I am running updated the hosts file, so it had two entries for localhost (a broken hosts file). After fixing that, the problem was resolved....
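A quick way to spot this kind of broken hosts file is to count how many lines define localhost; more than one is the breakage described above. A sketch (the sample file contents below are made up for illustration):

```shell
# Count entries whose name field is "localhost". On a real machine you
# would read /etc/hosts instead of this made-up sample.
hosts_sample='127.0.0.1 localhost
10.0.0.5 localhost
10.0.0.5 myhost.example.com'
dupes=$(printf '%s\n' "$hosts_sample" | awk '$2 == "localhost" {n++} END {print n+0}')
echo "lines defining localhost: $dupes"   # 2 here: one entry too many
```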

Cannot run the job on hadoop cluster. only runs using LocalJobRunner

hadoop,cloudera,yarn,hadoop2,cloudera-cdh

Apparently, you can only submit a Hadoop job from the node designated as the gateway node. Everything worked once I submitted the job from the gateway node.

hbase - is there any “explain” keyword?

hadoop,hbase,cloudera

You don't have an "explain" in HBase because it is not an RDBMS. It is a key-value store, and all of your operations are get (get value given a key), set (set value for a given key), delete (delete a given key) and scans (the whole table or a specific key range) -...

How to resolve load main class MahoutDriver error on Twenty Newsgroups Classification Example

hadoop,machine-learning,mahout,cloudera,cloudera-cdh

This error was due to an incorrect Apache Maven path and POM file. I simply put the pom.xml file (including the mahout-master full source code) under apache-maven-3.2.1 and ran the command again; this method solved the problem.

Why does automatic failover break when running both HA HDFS and MR1?

hadoop,cloudera,cloudera-cdh

The namenode and jobtracker share a similar HA implementation, to the degree that they both extend the same base class. They both use a backing zookeeper cluster to decide which available node is active. The location used in zookeeper is constructed by appending the failover group name (i.e. the values...

Hive impersonation not working with jdbc

java,jdbc,hive,cloudera

It was actually down to the hive server2 not being started in Cloudera and me trying to use the driver for hive server 2 Class.forName("org.apache.hive.jdbc.HiveDriver"); instead of the hive server driver Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");.

What happened to the AvroRecord class in CDH 5?

hadoop,cloudera,avro

This took me a minute to figure out. It's "not there" upstream too: https://github.com/apache/hadoop/tree/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro But this is because it's a generated class. The definition is here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/test/avro/avroRecord.avsc https://github.com/cloudera/hadoop-common/blob/cdh5-2.5.0_5.2.1/hadoop-common-project/hadoop-common/src/test/avro/avroRecord.avsc You're not finding it...

a weird error when trying to write to HDFS using CDH-5.2.0

hdfs,apache-spark,cloudera

It has nothing to do with 64 bits, or CDH. The error suggests you have included two different incompatible versions of Hadoop in your app. Maybe one of your dependencies is bringing in other versions or you have added another accidentally at runtime

where is the conf directory of Cloudera hadoop located?

apache,hadoop,cloudera

It is in the usual place, /etc/hadoop/conf. You will find this is actually a symlink that uses alternatives, but you can just go to this directory to find/edit config if needed. However it's much easier to manage the packages and config via Cloudera Manager. I really wouldn't bother with editing...

Put one local file into multiple HDFS directories

hadoop,hdfs,cloudera

In order to speed up the copy, some kind of parallelism is needed. It would be easy to run a multi-threaded program that submits dozens of hdfs copy commands at a time in Java. With a shell script, you may do something like: m=10 for (( i = 0; i < 100;...
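The truncated loop above might be fleshed out as follows. This is a runnable sketch, not the original author's script: cp stands in for hdfs dfs -put so it runs without a cluster, and the temp-directory target paths are made up.

```shell
# Run up to $m copies concurrently, waiting after each batch of m background
# jobs. With a real cluster, replace cp with: hdfs dfs -put "$src" "/path/dir_$i" &
m=10
work=$(mktemp -d)
src="$work/localfile"
echo "payload" > "$src"
i=0
while [ "$i" -lt 100 ]; do
  cp "$src" "$work/dir_$i" &        # stand-in for the hdfs copy command
  i=$((i + 1))
  if [ $((i % m)) -eq 0 ]; then wait; fi   # throttle to m concurrent copies
done
wait
ls "$work" | grep -c '^dir_'        # 100 copies made
```

Batching with `wait` keeps at most m copies in flight, which avoids swamping the client while still overlapping the transfers.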

Number of HBase region servers vs data nodes

hadoop,hbase,cloudera

You can use any ratio you want, but the rule of thumb is 1:1. The fewer regions an RS has the better; more RSs mean fewer regions per server and fewer regions to reassign if a node fails, which will improve the recovery time (by a lot, although there has...

Hive ORC compression

hadoop,compression,hive,cloudera,snappy

I believe this is due to a known bug in 0.12. Have a look at this Jira: HIVE-6083

Get country from tweet with certain keywords

java,twitter4j,cloudera,flume

On analysis you'll find that most tweets don't have a location attached to them. Also, even if a location is attached, the city, state or country may not be available or correct. I've also found tweets where such country names literally don't exist. So, you'll have to map city...

Oozie date time start

cloudera,hue,restfb,flume-ng,oozie-coordinator

What about creating a java action and setting up a workflow property that uses the coordinator's current time? <property> <name>myStart</name> <value>${coord:current(0)}</value> </property> Then use this property in your action as a parameter....

Fail to start Hive queries(MapReduce)

hadoop,cloudera,yarn

I have understood the problem with select count(*) from tweets; The problem was that I placed my serde.jar in the wrong directories on some node hosts, so I got errors with the query in the Hive CLI/Hue: CDH 4.* throws "Class not found Exception" and CDH 5.* Error code 2. But the problem with the jobTracker (Yarn) is...

CDH autodeployment via API does not set the CDH version for the hosts

python,api,hadoop,cloudera

I also posted my question to the Cloudera Community forum, and the insight offered there helped me find a solution. I hope it helps others: http://community.cloudera.com/t5/Cloudera-Manager-Installation/Autodeployment-with-CM-Manager-5-3-1-issue/m-p/24422/highlight/false...

Copy files from Remote unix and windows servers into HDFS without intermediate staging

hadoop,hdfs,cloudera,biginsights,hortonworks

There is no standard command which achieves this. Good workarounds are given here and here. Hope this helps.

Cloudera Hadoop quick Start VM Impala Error

virtual-machine,cloudera,impala

impalad isn't running on that machine, so it can't connect; you need to start the Impala service. On the Cloudera Quick Start VM, it is easiest via Cloudera Manager (one of the two web pages you saw upon startup). If you wish to confirm that impalad isn't running,...

How to strip HTML content in flume morphline.conf file using Xquery

solr,cloudera,flume

I found the following solution to my problem, and hence I wanted to share it with you: 2) After the XQuery command block I wrote the following code to convert the date into the required format, and it worked perfectly fine. { convertTimestamp { field : createDate inputFormats : ["E MMM dd...

Cloudera Twiiter Hive Query failure

twitter,cloudera,hadoop-streaming

Resolved! Don't download the prebuilt SerDe jar; it may be outdated. Compile it yourself!...

how to download source code for a specific cloudera distribution?

hadoop,hdfs,cloudera

You have two options for downloading Cloudera-version-specific source code. Option 1: From the Maven repo https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-hdfs/2.5.0-cdh5.3.0/hadoop-hdfs-2.5.0-cdh5.3.0-sources.jar https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-common/2.5.0-cdh5.3.0/hadoop-common-2.5.0-cdh5.3.0-sources.jar (Change the version and hadoop component name appropriately) Option 2: From the tarball repo. Cloudera provides Hadoop releases in the form of tarballs...

Flume-ng hdfs sink .tmp file refresh rate control proprty

cloudera,flume,hortonworks-data-platform,flume-ng,flume-twitter

Consider decreasing your channel's capacity and transactionCapacity settings: capacity 100 The maximum number of events stored in the channel transactionCapacity 100 The maximum number of events the channel will take from a source or give to a sink per transaction These settings are responsible for controlling how many events get...

Can apache drill work with cloudera hadoop?

cloudera,apache-drill

I got this working with the Cloudera Hadoop distribution. I already had a Cloudera cluster installed with all services running. Perform the following steps: Install Apache Drill on all nodes of the cluster. Run drill/bin/drillbit.sh on each node. Configure the storage plugin for dfs using the Apache Drill web interface at host:8047. Update the HDFS configurations there....

command usage:when to use hadoop fs and hdfs dfs

hadoop,hdfs,cloudera

The following are three commands which appear the same but have minute differences: hadoop fs {args} hadoop dfs {args} hdfs dfs {args} hadoop fs <args>: FS relates to a generic file system which can point to any file system, like local, HDFS etc. So this can be used when you are...

Where is Spark's log if run on Yarn?

hadoop,logging,apache-spark,cloudera,yarn

A good article for this question: Running Spark on YARN; see the section "Debugging your Application" for a decent explanation with all the required examples. The only thing you need to do to get a correctly working history server for Spark is to close your Spark context in your application. Otherwise the application history server does...

How to get all mapreduce jobs' status through REST API?

rest,hadoop,cloudera

The REST API of the ResourceManager has to be used for getting the list of all applications/jobs running in the cluster. Here is the API for the same.

Reverting the Load statement of Impala?

hadoop,cloudera,impala

Run impala-shell in a server where you have installed impala and type the command: DESCRIBE FORMATTED table_name; It will show the location of the table in hdfs. ...

NameError: uninitialized constant SingleColumnValueFilter

hbase,cloudera

hbase(main):009:0> import org.apache.hadoop.hbase.util.Bytes; hbase(main):009:0> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter; hbase(main):009:0> import org.apache.hadoop.hbase.filter.BinaryComparator; hbase(main):009:0> import org.apache.hadoop.hbase.filter.CompareFilter; hbase(main):009:0> import org.apache.hadoop.hbase.filter.Filter; ...

Change IP address of a Hadoop HDFS data node server and avoid Block pool errors

hadoop,hdfs,cloudera,cloudera-manager

Turns out it's better to: Change the IP of the server such that it successfully resolves its hostname and host FQDN to the same IP (in my case I changed the IP on my DNS server). Delete the HDFS datanode role from the server. Add the HDFS datanode role back...

hbase.master.port overridden programatically?

hadoop,hbase,cloudera

Answering my own question :( as I just figured out that HBase standalone mode does not take hbase.master.port into account: https://github.com/cloudera/hbase/blob/cdh4.5.0-release/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java#L141 standalone mode: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_hbase_standalone_start.html The only way to assign a port is to set up at least a Pseudo-Distributed Mode; see this:...

json SerDe hive 0.13.1

json,hive,cloudera

I managed to solve it by getting the version 2 of the serde library from code.google.com/p/hive-json-serde/wiki/GettingStarted

what does “Encountered: after : ”“ ” mean using pig

hadoop,apache-pig,cloudera

You need to fix three issues in your code to make it work: 1. The STORE statement is not properly ended with a semicolon. 2. The STORE statement's output file is not properly enclosed in single quotes. 3. The Counts and Results statement logic needs slight modification. Modified script: Lines = LOAD '/user/hue/pig/examples/data/midsummer.txt' as (line:CHARARRAY);...

What does $@ mean?

bash,shell,hadoop,cloudera

$@ holds the command line parameters of the program. If you call a program named test.sh in this way: test.sh 1 2 3 then $@ contains 1 2 3...
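A runnable illustration of this (the script is written to a temp path here just for the demo):

```shell
# "$@" expands to all positional parameters, each preserved as its own word.
cat > /tmp/test.sh <<'EOF'
#!/bin/sh
echo "got $# args: $@"
for arg in "$@"; do echo "arg: $arg"; done
EOF
sh /tmp/test.sh 1 2 3
# prints:
# got 3 args: 1 2 3
# arg: 1
# arg: 2
# arg: 3
```

Note the quoting: unquoted $@ splits arguments containing spaces, while "$@" keeps each original argument intact.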

Email alerts from Cloudera Manager

email,hadoop,alert,cloudera,cloudera-manager

Did you try looking in the Cloudera Manager under Administration->Alerts? There's quite a bit of built-in monitoring and alert configuration options there. With regards to your example, there is a specific setting for "DataNode health" under the HDFS alerts.

Switch a disk containing cloudera hadoop / hdfs / hbase data

hadoop,hbase,database-migration,cloudera,disk-partitioning

I've resolved it in this way: stop all services but HDFS, then export the data out of HDFS. In my case the interesting part was in hbase: su - hdfs hdfs dfs -ls / The command showed me the following data: drwxr-xr-x - hbase hbase 0 2015-02-26 20:40 /hbase drwxr-xr-x - hdfs supergroup...

How to write query to avoid single reducer in select distinct and size collect_set hive queries?

hadoop,hive,query-optimization,cloudera,hiveql

Using two queries works for count(distinct var): SELECT count(1) FROM ( SELECT DISTINCT locations as unique_locations from my_table ) t; Same goes for size collect_set I think: SELECT size(unique_locations) FROM ( SELECT collect_set(locations) as unique_locations from my_table ) t; ...

how to determine the cloudera minor release in the one click install debian package ? (i.e., 5.1 ? 5.2 ?)

hadoop,installation,cloudera

The one-click install repo currently points to the latest Cloudera version, which is 5.3.0 as of earlier this week. To check the version you installed, just list the package name. There should be some version number like '5.2.x' appended to the package name. An example command: dpkg -l | grep...
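The version-extraction step can be sketched as follows; the dpkg output line below is made up for illustration, and on a real node you would pipe `dpkg -l | grep hadoop` instead:

```shell
# Pull the "cdhX.Y.Z" suffix out of a dpkg listing line. The sample line
# is fabricated; real output comes from: dpkg -l | grep hadoop
sample='ii  hadoop  2.5.0-cdh5.2.0  all  Hadoop distribution'
version=$(printf '%s\n' "$sample" | grep -o 'cdh[0-9][0-9.]*')
echo "$version"   # cdh5.2.0
```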

Cloudera installation failed to detect root privileges on CentOS

linux,ssh,centos,cloudera,cloudera-manager

You need root privileges without a password, so your /etc/sudoers line will look something like this: cloudera ALL=(ALL) NOPASSWD: ALL ...

Hadoop Nodes and Roles

hadoop,mapreduce,hdfs,cloudera

Does the TaskTracker on Node1 sit idle since there is no DataNode service on that node? Correct: if the data node is disabled then the task tracker will not be able to process the data, as the data will not be available; it will be idle. 2. or Does...

HDFS live node showing as decommissioned

hadoop,hdfs,cloudera

Well, that is strange. Although Cloudera Manager thinks they're okay, I thought I'd try decommissioning the datanodes and then recommissioning them as a last resort. It seems to have fixed the nodes reporting as decommissioned. In addition, it fixed my under-replication issue as well....

Spark executor logs on YARN

apache-spark,cloudera,yarn,cloudera-manager

Check the NodeManager's yarn.nodemanager.log-dir property. That is the log location while a Spark executor container is running. Note that when the application finishes, the NodeManager may remove the files (log aggregation). Check this document for detail: http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/...

Trouble installing Cloudera

cloudera

It ended up being a proxy problem, so never mind :)

Deploying hdfs core-site.xml with cloudera manager

hadoop,cloudera,cloudera-manager

There is a MapReduce Client Environment Safety Valve, also known as 'MapReduce Service Advanced Configuration Snippet (Safety Valve)', found in the GUI under MapReduce's Configuration -> Service-Wide -> Advanced; it will allow you to add any value that doesn't fit elsewhere. (There is one for core-site.xml as well.) Having said that,...

java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/BenchmarkThroughput

hadoop,cloudera

This class is part of the HDFS test code, not the main HDFS library. You will not find it available automatically on the classpath in a cluster. The test code is published as compiled artifacts with Hadoop, but just different artifacts. CDH is no different. For CDH 4.4, you can...

Is it possible to tell the number of mappers / reducers used based on number of files?

hadoop,mapreduce,cloudera

The number of mappers depends on the number of splits; however, if the files are smaller than the split size then each file will correspond to one mapper. That is the reason a large number of small files is not recommended. The properties that decide the split size and their default values are...
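The arithmetic behind this can be sketched in shell. Each file contributes roughly ceil(size / split) mappers, and a file smaller than one split still costs a whole mapper; 128 MB is used here as a common default split size, but it is configurable, so treat it as an assumption:

```shell
# Estimate mapper count: ceil(size / split) per file, using integer math.
# split size of 128 MB is an assumption (it depends on configuration).
split=$((128 * 1024 * 1024))
mappers=0
for size in 1048576 1048576 1048576 536870912; do   # three 1 MB files + one 512 MB file
  mappers=$(( mappers + (size + split - 1) / split ))
done
echo "estimated mappers: $mappers"   # 1+1+1 for the small files, 4 for 512 MB: 7
```

This shows why many small files hurt: three 1 MB files need three mappers, even though together they are far smaller than one split.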

restart a CDH5 cluster on EC2 saved as AMI

amazon-ec2,cloudera,cloudera-cdh,cloudera-manager

The reason this seems to fail is the host names. When assigning a static private IP to an instance, the hostname changes to ip-xx-xx-xx-xx. Editing the hosts file didn't seem to help, so I ditched the Cloudera Manager installation as a whole. Installing CDH (5.3) from packages seems to...

Where does Cloudera Manager store its configuration?

hadoop,cloudera,alerts,cloudera-manager

(Moving from comment.) It's stored in a database like MySQL or postgresql by default. You can configure it to use a different DB but otherwise it runs one locally.

What is the benefit of using CDH (cloudera)? [closed]

hadoop,bigdata,apache-spark,cloudera,cloudera-cdh

Well, CDH is a "Hadoop distribution". For me, it is "a simple way of installing Hadoop" and having a nice web interface for administration. So you can't really use CDH instead of Hadoop. (Just as you can't use Red Hat instead of Linux.) Spark can also run as a stand-alone...

YARN UNHEALTHY nodes

hadoop,distributed-computing,cloudera,yarn,cloudera-cdh

Try adding the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage to yarn-site.xml. This property specifies the maximum percentage of disk space utilization allowed, after which a disk is marked as bad. Values can range from 0.0 to 100.0. yarn-default.xml...
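In yarn-site.xml the property might look like the following; 90.0 is an illustrative threshold, not a recommended value:

```xml
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>
```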

Cloudera Certification for Hadoop developer

hadoop,cloudera

If you know the content of the following books well, then you should be decently prepared to attempt the exam: Hadoop: The Definitive Guide Hadoop Operations...

Convert Json Data into specific table format using Pig

json,hadoop,apache-pig,bigdata,cloudera

Can you try this Custom UDF? Sample input1: input.json {"Properties2":[{"K":"A","T":"String","V":"M "}, {"K":"B","T":"String","V":"N"}, {"K":"D","T":"String","V":"O"}]} {"Properties2":[{"K":"A","T":"String","V":"W"},{"K":"B","T":"String","V":"X"},{"K":"C","T":"String","V":"Y"},{"K":"D","T":"String","V":"Z"}]} PigScript: REGISTER jsonparse.jar A= LOAD 'input.json' Using JsonLoader('Properties2:{(K:chararray,T:chararray,V:chararray)}'); B= FOREACH A GENERATE...

Why Cloudera Manager reports that disks are full?

linux,hadoop,cloudera,cloudera-manager

I solved the problem. The property dfs.datanode.du.reserved was set to 100GB, so Hadoop could not use that amount of space (on each volume) for storing new HDFS blocks.