
How to split parquet files into many partitions in Spark?

scala,apache-spark,parquet

Maybe your parquet file only takes up one HDFS block. Create a big parquet file that spans many HDFS blocks and load it:

    val k = sqlContext.parquetFile("the-big-table.parquet")
    k.partitions.length

You'll see the same number of partitions as HDFS blocks. This worked fine for me (spark-1.1.0)...
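
A minimal, self-contained version of that check, assuming a spark-shell 1.1.x session where sc is the usual SparkContext; the HDFS path is hypothetical:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load a parquet file that spans several HDFS blocks.
    val table = sqlContext.parquetFile("hdfs:///data/the-big-table.parquet")

    // One partition per HDFS block backing the file (SchemaRDD is an RDD in 1.1;
    // on 1.3+ the result is a DataFrame, so use table.rdd.partitions.length).
    println(table.partitions.length)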

Is it possible to load parquet table directly from file?

hadoop,cloudera-cdh,impala,parquet

Unfortunately it is not possible to read from a custom binary format in Impala. You should convert your files to CSV, then create an external table over the existing CSV files as a temporary table, and finally insert into the final parquet table by selecting from the temporary CSV table. The...

Reading partitioned parquet file into Spark results in fields in incorrect order

hive,apache-spark,parquet

It turns out this is a known bug, fixed in Spark 1.3.0 and 1.2.1: https://issues.apache.org/jira/browse/SPARK-5049...

Spark: Hive Query

hive,apache-spark,hiveql,apache-spark-sql,parquet

You need to explicitly enumerate the columns in both the source and target list: in this case select * will not suffice.

    insert overwrite table logs_parquet PARTITION(create_date) (col2, col3..)
    select col2,col3, .. col1 from logs

Yes, it is more work to write the query, but partitioning queries do require...
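
A minimal sketch of the same pattern from Spark, assuming a Hive-enabled build; the table and column names are taken from the snippet above, and the SET statements are the usual Hive dynamic-partition settings:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    hiveContext.sql("SET hive.exec.dynamic.partition = true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Enumerate the columns explicitly; the value feeding the partition
    // column (create_date) goes last in the SELECT list.
    hiveContext.sql(
      """INSERT OVERWRITE TABLE logs_parquet PARTITION (create_date)
        |SELECT col2, col3, col1 FROM logs""".stripMargin)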

Read few parquet files at the same time in Spark

apache-spark,parquet

See this issue on the Spark JIRA; it is supported from 1.4 onwards. Without upgrading to 1.4, you could either point at the top-level directory:

    sqlContext.parquetFile('/path/to/dir/')

which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want, and pass...
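
A minimal Scala sketch of both options; the paths and file names are hypothetical:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Pre-1.4: point at the parent directory and let Spark pick up every file under it.
    val all = sqlContext.parquetFile("/path/to/dir/")

    // 1.4+: pass exactly the files you want.
    val some = sqlContext.read.parquet(
      "/path/to/dir/part-00000.parquet",
      "/path/to/dir/part-00001.parquet")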

Scala: Spark sqlContext query

sql,hadoop,apache-spark,apache-spark-sql,parquet

SELECT id, max(date1), max(date2), max(date3) FROM parquetFile GROUP BY id ...
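
In Scala on Spark 1.3.x, assuming the parquet file is registered as a temp table named parquetFile (the path is hypothetical), that query could be run like this:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    val df = sqlContext.parquetFile("hdfs:///data/events.parquet")
    df.registerTempTable("parquetFile")

    val latest = sqlContext.sql(
      "SELECT id, max(date1), max(date2), max(date3) FROM parquetFile GROUP BY id")
    latest.show()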

Writing RDD partitions to individual parquet files in its own directory

apache-spark,rdd,apache-spark-sql,parquet

I think it's possible by calling foreachPartition(f: Iterator[T] => Unit) on the RDD you want to save. In the function you pass to foreachPartition: prepare the path hdfs://localhost:9000/parquet_data/year=x/week=y, open a ParquetWriter, exhaust the Iterator by inserting each line into the writer, and clean up. ...
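
A minimal sketch of those steps, assuming parquet-avro (pre-Apache parquet.avro packages) on the classpath, a hypothetical LogLine record type and schema, and partitions that each hold a single (year, week); AvroParquetWriter stands in for "a ParquetWriter":

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.fs.Path
    import parquet.avro.AvroParquetWriter

    case class LogLine(year: Int, week: Int, message: String)

    val schemaJson =
      """{"type": "record", "name": "LogLine", "fields": [
        |  {"name": "year", "type": "int"},
        |  {"name": "week", "type": "int"},
        |  {"name": "message", "type": "string"}
        |]}""".stripMargin

    // Hypothetical input; in practice each partition should hold a single (year, week).
    val rdd = sc.parallelize(Seq(LogLine(2015, 23, "a"), LogLine(2015, 24, "b")), 2)

    rdd.foreachPartition { lines =>
      val buffered = lines.buffered
      if (buffered.hasNext) {
        // Parse the schema on the executor (avoids shipping a non-serializable Schema).
        val schema = new Schema.Parser().parse(schemaJson)
        val first = buffered.head
        // 1. Prepare the per-partition path.
        val path = new Path(
          s"hdfs://localhost:9000/parquet_data/year=${first.year}/week=${first.week}/part.parquet")
        // 2. Open a ParquetWriter for that path.
        val writer = new AvroParquetWriter[GenericRecord](path, schema)
        // 3. Exhaust the Iterator, writing each line.
        buffered.foreach { line =>
          val record = new GenericData.Record(schema)
          record.put("year", Int.box(line.year))
          record.put("week", Int.box(line.week))
          record.put("message", line.message)
          writer.write(record)
        }
        // 4. Clean up.
        writer.close()
      }
    }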

What is the best way to write a Scala object to Parquet?

scala,hadoop,thrift,data-warehouse,parquet

I think you need to implement a Parquet WriteSupport class to write your custom class.
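
A minimal sketch of such a WriteSupport, using the pre-Apache parquet-mr packages (parquet.hadoop.api); the Person class, its schema, and the output path are hypothetical:

    import java.util.HashMap

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import parquet.hadoop.ParquetWriter
    import parquet.hadoop.api.WriteSupport
    import parquet.hadoop.api.WriteSupport.WriteContext
    import parquet.io.api.{Binary, RecordConsumer}
    import parquet.schema.{MessageType, MessageTypeParser}

    case class Person(name: String, age: Int)

    class PersonWriteSupport extends WriteSupport[Person] {
      private val schema: MessageType = MessageTypeParser.parseMessageType(
        """message person {
          |  required binary name (UTF8);
          |  required int32 age;
          |}""".stripMargin)

      private var consumer: RecordConsumer = _

      override def init(configuration: Configuration): WriteContext =
        new WriteContext(schema, new HashMap[String, String]())

      override def prepareForWrite(recordConsumer: RecordConsumer): Unit =
        consumer = recordConsumer

      // Emit one Parquet record per Person.
      override def write(record: Person): Unit = {
        consumer.startMessage()
        consumer.startField("name", 0)
        consumer.addBinary(Binary.fromString(record.name))
        consumer.endField("name", 0)
        consumer.startField("age", 1)
        consumer.addInteger(record.age)
        consumer.endField("age", 1)
        consumer.endMessage()
      }
    }

    // Usage: hand the write support to a ParquetWriter.
    val writer = new ParquetWriter[Person](
      new Path("hdfs:///tmp/people.parquet"), new PersonWriteSupport())
    writer.write(Person("alice", 30))
    writer.close()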

How to read a nested collection in Spark

hadoop,hive,apache-spark,parquet

There is no magic in the case of a nested collection. Spark will handle an RDD[(String, String)] and an RDD[(String, Seq[String])] the same way. Reading such a nested collection from Parquet files can be tricky, though. Let's take an example from the spark-shell (1.3.1):

    scala> import sqlContext.implicits._
    import sqlContext.implicits._

    scala> case class...
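
A minimal sketch in the same spirit for spark-shell 1.3.x (the case class, values, and path are hypothetical), showing one way to get at the nested Seq after reading the file back:

    import sqlContext.implicits._

    case class Event(user: String, tags: Seq[String])

    val events = sc.parallelize(Seq(
      Event("alice", Seq("a", "b")),
      Event("bob", Seq("c")))).toDF()

    events.saveAsParquetFile("hdfs:///tmp/events.parquet")

    // Read it back; each Row carries the nested collection as a Seq-compatible value.
    val loaded = sqlContext.parquetFile("hdfs:///tmp/events.parquet")
    loaded.map(row => (row.getString(0), row.getAs[Seq[String]](1))).collect()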

Parquet error when saving from Spark

apache-spark,parquet

I can actually reproduce this problem with Spark 1.3.1 on EMR, when saving to S3. However, saving to HDFS works fine. You could save to HDFS first, and then use e.g. s3distcp to move the files to S3....
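
A minimal sketch of that workaround (the paths, bucket, and file names are hypothetical; the copy step runs outside Spark):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.parquetFile("hdfs:///data/input.parquet")

    // Write to HDFS first, where the save works fine...
    df.saveAsParquetFile("hdfs:///tmp/output.parquet")

    // ...then move the files to S3 outside Spark, for example with s3distcp on EMR:
    //   s3-dist-cp --src hdfs:///tmp/output.parquet --dest s3://my-bucket/output.parquet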

Pig cannot read its own intermediate data

hadoop,apache-pig,cloudera-cdh,parquet

It appears that, by coincidence, the byte sequence used for line splitting in Pig's intermediate storage also occurs in one of the byte arrays returned by the custom UDFs. This causes Pig to break up the line somewhere in the middle and start looking for a datatype...

How to specify schema for parquet data in hive 0.13+

hive,avro,parquet

I did a bit of research and got the answer, so here it is for anyone else who gets stuck with this: ParquetSerDe currently has no support for any kind of table definition except pure DDL, where you must explicitly specify each column. There is a JIRA ticket that tracks...