
Use of core-site.xml in mapreduce program

Tag: hadoop,mapreduce,bigdata

I have seen MapReduce programs using/adding core-site.xml as a resource in the program. What is core-site.xml, and how can it be used in MapReduce programs?

Best How To:

From the documentation: unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath:

  • core-default.xml: read-only defaults for Hadoop
  • core-site.xml: site-specific configuration for a given Hadoop installation

Applications may add additional resources, which are loaded after (and override) the defaults:

Configuration config = new Configuration();
config.addResource(new Path("/user/hadoop/core-site.xml")); 
config.addResource(new Path("/user/hadoop/hdfs-site.xml")); 
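
A minimal self-contained sketch along the same lines, showing that values from the added site files then drive the job; the property key read here (fs.defaultFS) and the class name are just illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // core-default.xml and core-site.xml on the classpath are loaded automatically;
        // addResource() layers extra files on top, with later files overriding earlier ones.
        config.addResource(new Path("/user/hadoop/core-site.xml"));
        config.addResource(new Path("/user/hadoop/hdfs-site.xml"));

        // Read a site-specific value; the second argument is the fallback default.
        System.out.println("fs.defaultFS = " + config.get("fs.defaultFS", "file:///"));

        // The same Configuration is handed to the job, so tasks see these settings too.
        Job job = Job.getInstance(config, "config-demo");
        System.out.println("Job configured: " + job.getJobName());
    }
}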

From Hadoop logs, how can I find intermediate output byte sizes & reduce output byte sizes?

hadoop

You can get this information by using the FileSystemCounters counter group. Details of the terms used in this group are given below: FILE_BYTES_READ is the number of bytes read by the local file system. Assuming all the map input data comes from HDFS, FILE_BYTES_READ should be zero in the map phase. On the other...
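
A hedged sketch of pulling such counters from a finished job in the driver; the string group/counter names below are the legacy "FileSystemCounters" spellings and may need adjusting for your Hadoop release, and the class name is a placeholder:

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReport {
    // Call this after job.waitForCompletion(true) has returned.
    public static void printSizes(Job job) throws Exception {
        Counters counters = job.getCounters();

        // Intermediate (map output) bytes, from the built-in task counters.
        long mapOutputBytes = counters.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();

        // File system counters, looked up by group and counter name.
        long fileBytesRead = counters.findCounter("FileSystemCounters", "FILE_BYTES_READ").getValue();
        long hdfsBytesWritten = counters.findCounter("FileSystemCounters", "HDFS_BYTES_WRITTEN").getValue();

        System.out.println("map output bytes:    " + mapOutputBytes);
        System.out.println("local FS bytes read: " + fileBytesRead);
        System.out.println("HDFS bytes written:  " + hdfsBytesWritten);
    }
}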

hadoop complains about attempting to overwrite nonempty destination directory

hadoop,hdfs

This is a bug in Hadoop 2.6.0. It's been marked as fixed but it still happens occasionally (see: https://issues.apache.org/jira/browse/YARN-2624). Clearing out the appcache directory and restarting the YARN daemons should most likely fix this. ...

Create an external Hive table from an existing external table

csv,hadoop,hive

I am presuming you want to select distinct data from the "uncleaned" table and insert it into the "cleaned" table.

CREATE EXTERNAL TABLE `uncleaned`(
  `a` int,
  `b` string,
  `c` string,
  `d` string,
  `e` bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/external/uncleaned'

create another...

Best way to store relational data in hdfs

sql,hadoop,hdfs

Typically, to build a data warehouse in Hadoop, you have to ingest all the tables. In your example you need to have all 3 tables in HDFS and then do the ETL/aggregation; for example, Joiners_weekly can have an ETL job which runs select * from PersonCompany pc join Person p on...

HIVE: apply delimiter until a specified column

hadoop,datatable,hive,delimiter

Use a regular expression (see https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData); that way you can define when a space is used as a delimiter and when it is part of the data.

Save flume output to hive table with Hive Sink

hadoop,hive,flume

Adding these 2 lines to my config solved my problem, but I still get errors when reading the table from Hive. I can read the table and it returns the correct result, but with errors:

agent1.sinks.sink1.hive.txnsPerBatchAsk = 2
agent1.sinks.sink1.batchSize = 10

...

issue monitoring hadoop response

hadoop,cluster-computing,ganglia,gmetad

I found out the issue. It was related to the Hadoop metrics properties: Ganglia was configured in the wrong hadoop-metrics properties file, and once I set up the correct metrics config file, Ganglia reported the correct metrics.

Oozie on YARN - oozie is not allowed to impersonate hadoop

hadoop,yarn,oozie,ambari

Please update core-site.xml:

<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>

Also, the jobTracker address should be the ResourceManager address, which may not currently be the case. Once you update the core-site.xml file it will work....

What are the different ways to check if the mapreduce program ran successfully

hadoop,mapreduce,bigdata

Just like any other command in Linux, you can check the exit status of a hadoop jar command using the built-in variable $?. You can run echo $? after executing the hadoop jar command to check its status. The exit status value ranges from 0 to 255. An exit...
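
For echo $? to reflect the job result, the driver has to propagate it to the JVM exit code; a minimal sketch of the usual pattern (class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job");
        // ... set jar, mapper, reducer, input/output paths here ...

        // waitForCompletion returns true on success; exit with 0 so that
        // `echo $?` after `hadoop jar` reports success, non-zero otherwise.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}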

In MapReduce, how to send an ArrayList as value from mapper to reducer [duplicate]

java,hadoop,arraylist,mapreduce

To pass an ArrayList from mapper to reducer, it's clear that the objects must implement the Writable interface. Why don't you try this library?

<dependency>
  <groupId>org.apache.giraph</groupId>
  <artifactId>giraph-core</artifactId>
  <version>1.1.0-hadoop2</version>
</dependency>

It has an abstract class:

public abstract class ArrayListWritable<M extends org.apache.hadoop.io.Writable>
    extends ArrayList<M>
    implements org.apache.hadoop.io.Writable, org.apache.hadoop.conf.Configurable

You...
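
If you would rather not pull in Giraph, a minimal hand-rolled alternative is to serialize the list length and then each element yourself. The class below (a hypothetical IntArrayWritable, not part of any library) is only a sketch of that idea:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

// Hypothetical value type: an ArrayList of ints usable between mapper and reducer.
public class IntArrayWritable implements Writable {
    private final ArrayList<IntWritable> values = new ArrayList<>();

    public ArrayList<IntWritable> get() { return values; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(values.size());           // length prefix
        for (IntWritable v : values) {
            v.write(out);                       // each element serializes itself
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        values.clear();                         // Writables are re-used, so reset first
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            IntWritable v = new IntWritable();
            v.readFields(in);
            values.add(v);
        }
    }
}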

Hadoop append data to hdfs file and ignore duplicate entries

java,hadoop,mapreduce,hive,hdfs

Since HDFS is designed for write-once, read-many access, we cannot change the contents of an HDFS file. You are trying to append data to a file which is already in HDFS. Copy your file into HDFS and then you can use the -getmerge utility: hadoop fs -getmerge <src> <localdst> [addnl]. One other solution...

Flink error - org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

maven,hadoop,flink

Have you tried the Hadoop-2 build of Flink? Have a look at the downloads page. There is a build called flink-0.9.0-milestone-1-bin-hadoop2.tgz that should work with Hadoop 2.

jets3t cannot upload file to s3

hadoop,amazon-s3,jets3t

Alright, I'm answering this for posterity. The issue was actually Maven: it seems I was using incompatible versions of the two frameworks. Of course, Maven being Maven, it cannot detect this....

Hadoop Basic - error while creating directory

hadoop,hdfs

Before creating a directory, you should make sure that your Hadoop installation is correct by running the jps command and looking for any missing process. In your case, the NameNode isn't up. If you look at the logs, it appears that some folders weren't created. Do this: mkdir...

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanSetDropBehind issue in Eclipse

maven,hadoop,apache-spark,word-count

When you run in Eclipse, the referenced jars are the only source for your program to run. So the hadoop-core jar (that's where CanSetDropBehind is present) is not being added properly to your Eclipse build path from the local repository for some reason. You need to identify whether it is a proxy issue, or...

Sqoop Export with Missing Data

sql,postgresql,shell,hadoop,sqoop

I solved the problem by changing my reduce function so that if there was not the correct number of fields it would output a certain value, and then I was able to use --input-null-non-string with that value and it worked.

Apache Spark: Error while starting PySpark

python,hadoop,apache-spark,pyspark

From the logs it looks like PySpark is unable to resolve the host localhost. Please check your /etc/hosts file; if localhost is not present, add an entry, which should resolve this issue, e.g.: [Ip] [Hostname] localhost. In case you are not able to change the host entry of the server, edit...

Importtsv command gives: Container exited with a non-zero exit code 1 error

hadoop,hbase,classpath,yarn

java.lang.ClassNotFoundException: Class org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer not found

The above line clearly indicates that you are missing the web proxy jar. You need to add the following jar to your HBase lib folder: hadoop-yarn-server-web-proxy-2.6.0.jar ...

ERROR jdbc.HiveConnection: Error opening session Hive

java,hadoop,jdbc,hive

org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{use:database=default})
  at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)

This error mostly occurs if you have a version mismatch between your Hive and hive-jdbc. Please check that both versions match. Please refer to this for more information....

Different ways of hadoop installation

hadoop,installation

Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development. Windows is only supported as a development platform, and additionally requires Cygwin to run. If you have Linux OS,...

Add PARTITION after creating TABLE in hive

hadoop,hive,partition

First create the table in such a way that you don't have the partition column in the table's column list: create external table Student(col1 string, col2 string) partitioned by (dept string) location 'ANY_RANDOM_LOCATION'; Once you are done with the creation of the table, alter the table to add the partition department...

Datanode and Nodemanager on slave machine are not able to connect to NameNode and ResourceManager on master machine

java,apache,sockets,hadoop,tcp

I figured out the issue. TCP connections were being blocked by iptables rules. I flushed the iptables rules using the command below and the issue was resolved: sudo iptables -F

Hadoop map reduce Extract specific columns from csv file in csv format

java,hadoop,file-io,mapreduce,bigdata

You can use Apache Pig to do the filtering and to validate the date format as well. Follow the steps below: copy your file into HDFS; load the file using the LOAD command and PigStorage(); select the 20 columns using a FOREACH statement (you can just give column names/numbers like $0, $3, $5, etc.); write a UDF to validate the date format...

CouchDB - Why is my rereduce always coming back as false? I am not able to reduce anything properly

mapreduce,couchdb,couchdb-futon

What you got was the sum of values per title. What you wanted was the sum of values in general. Change the grouping drop-down list to none. Check CouchDB's wiki for more details on grouping....

SQL Server 2012 & Polybase - 'Hadoop Connectivity' configuration option missing

sql-server,hadoop,sql-server-2012

Are you sure PolyBase is installed and enabled? You should have installed it during the SQL Server installation process and enabled the corresponding services....

What is the equivalent of BlobstoreLineInputReader for targeting Google Cloud Storage?

python,google-app-engine,mapreduce,pipeline

At first, I attempted thinkjson's CloudStorageLineInputReader but had no success. Then I found this pull request...which led me to rbruyere's fork. Despite some linting issues (like the spelling of GoolgeCloudStorageLineInputReader), at the bottom of the pull request it is mentioned that it works fine, and it asks whether the project...

How to run a hadoop application automatically?

hadoop

Your saviour is oozie. Happy Learning

Input of the reduce phase is not what I expect in Hadoop (Java)

java,hadoop,mapreduce,reduce,emit

This is a typical problem for people beginning with Hadoop MapReduce. The problem is in your reducer: when looping through the given Iterator<IntWritable>, each IntWritable instance is re-used, so the framework only keeps one instance around at a given time. That means when you call iterator.next(), your first saved IntWritable instance...
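
A hedged sketch of the usual fix, assuming the reducer wants to keep all values for a key (class and type parameters are placeholders): copy each value before storing it, instead of holding a reference to the re-used instance.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CollectValuesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        List<IntWritable> kept = new ArrayList<>();
        for (IntWritable value : values) {
            // Wrong: kept.add(value); the framework re-uses this object,
            // so every list entry would end up pointing at the last value.
            // Right: copy the primitive out into a fresh instance.
            kept.add(new IntWritable(value.get()));
        }
        for (IntWritable v : kept) {
            context.write(key, v);
        }
    }
}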

How to insert and Update simultaneously to PostgreSQL with sqoop command

postgresql,hadoop,hive,sqoop

According to my internet search, it is not possible to perform both insert and update directly to a PostgreSQL DB. Instead you can create a stored procedure/function in PostgreSQL and send the data there: sqoop export --connect <url> --call <upsert proc> --export-dir /results/bar_data The stored procedure/function should perform both the update and the insert....

Aggregating heterogeneous documents in MongoDB

mongodb,mapreduce,aggregation-framework

Currently, to use the aggregation framework, you have to know the field names to write the query. There are some proposals to remove that limitation. Right now, given your design, to use the aggregation framework you probably have to consolidate your data at regular intervals using map-reduce into something more...

hadoop large file does not split

performance,hadoop,split,mapreduce

dfs.block.size is not the only factor at play, and it's recommended not to change it because it applies globally to HDFS. The split size in MapReduce is calculated by this formula: max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size)) So you can set these properties in the driver class as conf.setLong("mapred.max.split.size", maxSplitSize); conf.setLong("mapred.min.split.size", minSplitSize); Or in the config file...
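
A minimal sketch of where those calls sit in a driver; the class name, sizes, and paths are placeholders, and the property names above are the older spellings (newer releases also accept mapreduce.input.fileinputformat.split.minsize/maxsize, which you should verify against your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Keep splits between 64 MB and 128 MB regardless of the HDFS block size.
        long minSplitSize = 64L * 1024 * 1024;
        long maxSplitSize = 128L * 1024 * 1024;
        conf.setLong("mapred.min.split.size", minSplitSize);
        conf.setLong("mapred.max.split.size", maxSplitSize);

        Job job = Job.getInstance(conf, "split-size-demo");
        // ... set jar, mapper, reducer, and key/value classes here ...
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}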

Hive external table not reading entirety of string from CSV source

csv,hadoop,hive,hiveql

What version of Hive are you using? On Amazon EMR, with Hive version 0.13.1, I ran your code and got the following:

hive> CREATE EXTERNAL TABLE BG (
    >   `Id` string,
    >   `Someint` int
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/tmp/example'
    > TBLPROPERTIES ("skip.header.line.count"="1");
OK...

Vertica: Input record 1 has been rejected (Too few columns found)

hadoop,vertica

We found the problem: one node in Hadoop was "shaky", therefore each time Vertica accessed this node the file was empty. After stopping this node the problem was solved. We found 2 issues: 1. the node was "shaky" but still responded to ping, therefore the system "thinks" it is alive; 2. Vertica failed to read...

how to drop partition metadata from hive, when a partition is dropped by using the alter drop command

hadoop,apache-hive

Partitioning is defined when the table is created. By running ALTER TABLE ... DROP PARTITION ... you are only deleting the data and metadata for the matching partitions, not the partitioning of the table itself. Your best bet at this point will be to recreate the table without the partitioning....

Incorrect response to mapReduce query in mongo-db

mongodb,mapreduce

Your problem here is that you have missed one of the core concepts of how mapReduce works. The relevant documentation that explains this is found here: MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for...

Spark on yarn jar upload problems

java,hadoop,mapreduce,apache-spark

The problem was solved by copying spark-assembly.jar into a directory on HDFS for each node and then passing it to spark-submit via the spark.yarn.jar configuration parameter. The commands are listed below:

hdfs dfs -copyFromLocal /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar /user/spark/spark-assembly.jar

/var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar simplemr.jar

...

JMH Benchmark on Hadoop YARN

java,hadoop,yarn,microbenchmark,jmh

In my opinion JMH does not work on a Hadoop cluster, because on each node of the cluster the benchmark would want to start its own JVM. That does not work, since the nodes communicate for the parallelization. Instead, I first measure the time for the execution of the program and repeat this; at the end I...

Merging two columns into a single column and formatting the content to form an accurate date-time format in Hive?

sql,regex,hadoop,hive,datetime-format

I'm going to assume that the 12 is the month and the 3 is the day, since you didn't specify. Also, you said you want HH:MM:SS, but there are no seconds in your example, so I don't know how you're going to get them in there. I also changed 8:37pm to 8:37am...

How to get a “fieldcount” (like wordcount) on CouchDB/Cloudant?

javascript,mapreduce,couchdb,word-count,cloudant

To get a count of each key in your document, create a "map" function like this:

function (doc) {
  var keys = Object.keys(doc);
  for (var i in keys) {
    emit(keys[i], null);
  }
}

and use the built-in _count reducer. You can then retrieve the grouped answer by accessing the view...

Why are we configuring mapred.job.tracker in YARN?

hadoop,mapreduce,yarn

This is just a guess, but either those tutorials talking about configuring the JobTracker in YARN are written by people who don't know what YARN is, or they set it in case you decide to stop working with YARN someday. You are right: the JobTracker and TaskTracker do not exist...