Maybe your Parquet file only takes up one HDFS block. Create a big Parquet file that spans many HDFS blocks and load it with val k = sqlContext.parquetFile("the-big-table.parquet"), then check k.partitions.length. You'll see the same number of partitions as HDFS blocks. This worked fine for me (spark-1.1.0)...
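The same check written out as a minimal sketch (Spark 1.1-style API, assuming `sc` is an existing SparkContext and that the file really does span several HDFS blocks):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Each HDFS block of the Parquet file becomes one input split, hence one partition.
    val k = sqlContext.parquetFile("the-big-table.parquet")
    println(k.partitions.length)   // should roughly match the number of HDFS blocks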
hadoop,cloudera-cdh,impala,parquet
Unfortunately it is not possible to read from a custom binary format in Impala. You should convert your files to CSV, then create an external table over the existing CSV files as a temporary table, and finally insert into a final Parquet table, reading from the temp CSV table. The...
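A hypothetical sketch of that CSV-to-Parquet hop. The statements are what matter; you would normally run them in impala-shell (or Hive), but they are shown here through a Spark HiveContext to keep the examples in Scala. All table, column and path names are made up:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Temporary external table over the CSV files produced from the binary data.
    hiveContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS csv_staging (id BIGINT, payload STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/data/staging/csv'
    """)

    // Final Parquet table.
    hiveContext.sql("CREATE TABLE IF NOT EXISTS events_parquet (id BIGINT, payload STRING) STORED AS PARQUET")

    // Copy the rows across; the temporary CSV table can be dropped afterwards.
    hiveContext.sql("INSERT INTO TABLE events_parquet SELECT id, payload FROM csv_staging")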
It turns out this is a known bug, fixed in Spark 1.3.0 and 1.2.1: https://issues.apache.org/jira/browse/SPARK-5049...
hive,apache-spark,hiveql,apache-spark-sql,parquet
You need to explicitly enumerate the columns in both the source and target lists: in this case select * will not suffice.

    insert overwrite table logs_parquet PARTITION(create_date) (col2, col3..)
    select col2, col3, .. col1 from logs

Yes, it is more work to write the query - but partitioning queries do require...
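Filled in with hypothetical column names (col1 standing in for the date column), the statement might look like the sketch below; it is run through a Spark HiveContext here, but the SQL is the same in Hive itself. The key points are the explicit column list and the partition column coming last in the SELECT:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SET hive.exec.dynamic.partition=true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Hypothetical columns: the last SELECT column (col1) feeds the create_date partition.
    hiveContext.sql("""
      INSERT OVERWRITE TABLE logs_parquet PARTITION (create_date)
      SELECT col2, col3, col1
      FROM logs
    """)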
See this issue on the Spark JIRA. It is supported from 1.4 onwards. Without upgrading to 1.4, you could either point at the top-level directory: sqlContext.parquetFile("/path/to/dir/"), which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want, and pass...
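A hedged sketch of the second workaround: list the files with the HDFS API, load each one, and union the results (the path and filename filter are only examples):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val wanted = fs.listStatus(new Path("/path/to/dir/"))
      .map(_.getPath.toString)
      .filter(_.endsWith(".parquet"))          // keep only the files you actually want

    // Load each file separately and union them into a single result.
    val combined = wanted.map(p => sqlContext.parquetFile(p)).reduce(_ unionAll _)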
sql,hadoop,apache-spark,apache-spark-sql,parquet
SELECT id, max(date1), max(date2), max(date3) FROM parquetFile GROUP BY id ...
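One way to run that query over a Parquet file from Spark SQL (a sketch assuming a Spark 1.3-style API; the file path and the registered table name "parquetFile" are assumptions):

    // Register the Parquet file as a temporary table, then run the aggregation.
    val parquetData = sqlContext.parquetFile("/path/to/data.parquet")
    parquetData.registerTempTable("parquetFile")

    val maxDates = sqlContext.sql(
      "SELECT id, max(date1), max(date2), max(date3) FROM parquetFile GROUP BY id")
    maxDates.show()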
apache-spark,rdd,apache-spark-sql,parquet
I think it's possible by calling foreachPartition(f: Iterator[T] => Unit) on the RDD you want to save. In the function you provide to foreachPartition:
- prepare the path hdfs://localhost:9000/parquet_data/year=x/week=y and a ParquetWriter
- exhaust the Iterator by inserting each line into the recordWriter
- clean up ...
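A hedged sketch of those steps, assuming an RDD[String] called rdd and using parquet-avro's AvroParquetWriter as the concrete ParquetWriter; the Avro schema, field name and the year=x/week=y values are placeholders:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.fs.Path
    import parquet.avro.AvroParquetWriter
    import java.util.UUID

    val schemaJson =
      """{"type":"record","name":"Line","fields":[{"name":"value","type":"string"}]}"""

    rdd.foreachPartition { lines =>
      // 1. Prepare the target path (one file per partition to avoid collisions).
      val schema = new Schema.Parser().parse(schemaJson)
      val path = new Path(
        s"hdfs://localhost:9000/parquet_data/year=x/week=y/part-${UUID.randomUUID()}.parquet")

      // 2. Open a ParquetWriter for that path.
      val writer = new AvroParquetWriter[GenericRecord](path, schema)

      // 3. Exhaust the iterator, writing each element as one record.
      try {
        lines.foreach { line =>
          val record = new GenericData.Record(schema)
          record.put("value", line)
          writer.write(record)
        }
      } finally {
        // 4. Clean up.
        writer.close()
      }
    }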
scala,hadoop,thrift,data-warehouse,parquet
I think you need to implement a ParquetWriteSupport class to write your custom class.
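For reference, a very rough sketch of what such a write-support class can look like with the old parquet-mr API (parquet.hadoop.api.WriteSupport), for a hypothetical case class Person(name, age); treat it as an outline rather than a drop-in implementation:

    import java.util.{HashMap => JHashMap}
    import org.apache.hadoop.conf.Configuration
    import parquet.hadoop.api.WriteSupport
    import parquet.hadoop.api.WriteSupport.WriteContext
    import parquet.io.api.{Binary, RecordConsumer}
    import parquet.schema.MessageTypeParser

    case class Person(name: String, age: Int)

    class PersonWriteSupport extends WriteSupport[Person] {
      // Parquet schema matching the fields of Person.
      private val schema = MessageTypeParser.parseMessageType(
        "message person { required binary name (UTF8); required int32 age; }")
      private var consumer: RecordConsumer = _

      override def init(configuration: Configuration): WriteContext =
        new WriteContext(schema, new JHashMap[String, String]())

      override def prepareForWrite(recordConsumer: RecordConsumer): Unit =
        consumer = recordConsumer

      override def write(record: Person): Unit = {
        consumer.startMessage()
        consumer.startField("name", 0)
        consumer.addBinary(Binary.fromString(record.name))
        consumer.endField("name", 0)
        consumer.startField("age", 1)
        consumer.addInteger(record.age)
        consumer.endField("age", 1)
        consumer.endMessage()
      }
    }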
hadoop,hive,apache-spark,parquet
There is no magic in the case of nested collections. Spark will handle an RDD[(String, String)] and an RDD[(String, Seq[String])] the same way. Reading such a nested collection back from Parquet files can be tricky, though. Let's take an example from the spark-shell (1.3.1):

    scala> import sqlContext.implicits._
    import sqlContext.implicits._

    scala> case class...
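A rough, self-contained version of where that example is headed (Spark 1.3.1 API; the Doc case class and paths are hypothetical): write a nested Seq column out and read it back, noting that it comes back as a Row field rather than a typed case-class member:

    import sqlContext.implicits._

    case class Doc(id: String, tags: Seq[String])

    val df = sc.parallelize(Seq(Doc("a", Seq("x", "y")), Doc("b", Seq("z")))).toDF()
    df.saveAsParquetFile("/tmp/docs.parquet")

    // Reading it back: the nested column has ArrayType, and each Row exposes it
    // as a collection you pull out with getAs rather than as a plain Scala field.
    val back = sqlContext.parquetFile("/tmp/docs.parquet")
    back.printSchema()
    back.collect().foreach(row => println(row.getAs[Seq[String]](1)))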
I can actually reproduce this problem with Spark 1.3.1 on EMR, when saving to S3. However, saving to HDFS works fine. You could save to HDFS first, and then use e.g. s3distcp to move the files to S3....
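A short sketch of that workaround (df, the HDFS path and the bucket are assumptions; s3-dist-cp runs outside Spark, for example on the EMR master node):

    // 1. Write the Parquet output to HDFS, where the save works reliably.
    df.saveAsParquetFile("hdfs:///tmp/my_table.parquet")

    // 2. Then copy it to S3 out-of-band, for example with s3distcp:
    //    s3-dist-cp --src hdfs:///tmp/my_table.parquet --dest s3://my-bucket/my_table.parquet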
hadoop,apache-pig,cloudera-cdh,parquet
It appears that, by coincidence, the byte sequence that Pig uses for line splitting in its intermediate storage also occurs in one of the byte arrays returned by the custom UDFs. This causes Pig to break up the line somewhere in the middle and start looking for a datatype...
I did a bit of research and got the answer, so here it is for anyone else who gets stuck with this: ParquetSerDe currently has no support for any kind of table definition except pure DDL, where you must explicitly specify each column. There is a JIRA ticket that tracks...
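For completeness, the kind of definition that does work is plain column-by-column DDL, e.g. (hypothetical table, columns and location; on older CDH/Hive builds you may need the long ROW FORMAT SERDE / INPUTFORMAT form instead of STORED AS PARQUET), issued here through a HiveContext to stay in Scala:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("""
      CREATE EXTERNAL TABLE events (id BIGINT, name STRING, created_at TIMESTAMP)
      STORED AS PARQUET
      LOCATION '/data/events_parquet'
    """)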