
Retrieving TriangleCount

scala,apache-spark,spark-graphx

triangleCount counts the number of triangles per vertex and returns a Graph[Int,Int], so you have to extract the vertices:
scala> graph.triangleCount().vertices.collect()
res0: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((1,1), (3,1), (2,1)) ...
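
A minimal self-contained sketch of the same steps (the toy edges and the SparkContext sc are assumed for illustration):
  import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1)))
  // triangleCount expects a partitioned graph with edges in canonical orientation (srcId < dstId)
  val graph = Graph.fromEdges(edges, defaultValue = 1).partitionBy(PartitionStrategy.RandomVertexCut)
  val perVertex = graph.triangleCount().vertices   // RDD[(VertexId, Int)]
  perVertex.collect()                              // e.g. Array((1,1), (2,1), (3,1))
  perVertex.map(_._2).sum() / 3                    // each triangle is counted once per vertex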

Converting a GUID of type String to a VertexId of type Long using Piggybank HashFNV in Pig

apache-pig,spark-graphx

The answer is the following Pig script:
REGISTER piggybank.jar;
A = LOAD '/user/hue/guidfile.txt' AS (guid:chararray, name:chararray, label:chararray);
B = FOREACH A GENERATE (guid, name, label, org.apache.pig.piggybank.evaluation.string.HashFNV(guid));
store B INTO '/user/hue/guidlongfile.txt';
The result includes an additional field of type Long. The name and label fields are mentioned to indicate a Vertex...
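
On the Spark side, a hedged sketch of how the hashed Long could then serve as the vertex id (a SparkContext sc is assumed, and the output is assumed to be tab-separated; adjust the delimiter to match the Pig STORE format):
  import org.apache.spark.graphx.VertexId
  val vertices = sc.textFile("/user/hue/guidlongfile.txt").map { line =>
    val Array(guid, name, label, hash) = line.split("\t")
    (hash.toLong: VertexId, (guid, name, label))   // the FNV hash of the GUID becomes the Long vertex id
  }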

Spark - GraphX: mapReduceTriplets vs aggregateMessages

mapreduce,apache-spark,spark-graphx

Probably you need something like this:
val oldestFollower: VertexRDD[(String, Int)] = userGraph.aggregateMessages[(String, Int)](
  // For each edge send a message to the destination vertex with the attribute of the source vertex
  sendMsg = { triplet => triplet.sendToDst((triplet.srcAttr.name, triplet.srcAttr.age)) },
  // To combine messages take the message for the older...
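
A complete hedged sketch of the same pattern on a toy userGraph (the User class and the data are made up; a SparkContext sc is assumed):
  import org.apache.spark.graphx.{Edge, Graph, VertexRDD}
  case class User(name: String, age: Int)
  val users = sc.parallelize(Seq((1L, User("alice", 30)), (2L, User("bob", 45)), (3L, User("carol", 27))))
  val follows = sc.parallelize(Seq(Edge(2L, 1L, 1), Edge(3L, 1L, 1)))   // 2 and 3 follow 1
  val userGraph: Graph[User, Int] = Graph(users, follows)
  val oldestFollower: VertexRDD[(String, Int)] = userGraph.aggregateMessages[(String, Int)](
    sendMsg = ctx => ctx.sendToDst((ctx.srcAttr.name, ctx.srcAttr.age)),   // send the follower's (name, age)
    mergeMsg = (a, b) => if (a._2 > b._2) a else b                         // keep the older follower
  )
  oldestFollower.collect()   // e.g. Array((1,(bob,45)))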

Scala, get sum of multidimensional array

scala,spark-graphx

scala> val res = Array((1,1), (3,1), (2,1))
res: Array[(Int, Int)] = Array((1,1), (3,1), (2,1))
scala> res.map(_._2).sum
res7: Int = 3
or in one operation:
scala> res.foldLeft(0){ case (acc, (k, v)) => acc + v }
res8: Int = 3
...

Subtracting an RDD from another RDD doesn't work correctly

scala,apache-spark,spark-graphx

Performing set operations like subtract with mutable types (Array in this example) is usually unsupported, or at least not recommended. Try using an immutable type instead. I believe WrappedArray is the relevant container for storing arrays in sets, but I'm not sure....
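
A hedged sketch of the issue with toy RDDs (a SparkContext sc is assumed): Array uses reference equality and identity hash codes, so subtract cannot match two "equal" arrays, while an immutable Vector compares structurally.
  val a = sc.parallelize(Seq(Array(1, 2), Array(3, 4)))
  val b = sc.parallelize(Seq(Array(1, 2)))
  a.subtract(b).count()                                   // 2: the two Array(1, 2) instances never match
  a.map(_.toVector).subtract(b.map(_.toVector)).count()   // 1: Vector(1, 2) is removed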

Passing a function for each key of an Array

scala,apache-spark,scala-collections,spark-graphx

You're looking for the groupBy function followed by mapValues to process each group.
pairs groupBy {_._1} mapValues { groupOfPairs => doSomething(groupOfPairs) }
...
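
A tiny self-contained illustration of that pattern (the pairs and doSomething are made up):
  val pairs = Seq((1, "a"), (1, "b"), (2, "c"))
  def doSomething(groupOfPairs: Seq[(Int, String)]): Int = groupOfPairs.size
  pairs.groupBy(_._1).mapValues(doSomething)   // 1 -> 2, 2 -> 1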

Graphx EdgeRDD count taking a long time to compute

apache-spark,spark-graphx

After going through the log, I figured out that my task size is too big and it takes time to schedule it. Spark itself warns about this:
15/06/17 02:32:47 WARN TaskSetManager: Stage 1 contains a task of very large size (140947 KB). The maximum recommended task size is 100 KB.
That led me...
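
An oversized task is usually a sign that a large local object is being captured in the task closure; a hedged sketch of the standard broadcast-variable remedy (the lookup map is made up, and a SparkContext sc is assumed):
  // capturing bigLookup directly in a closure ships a copy with every task and inflates task size
  val bigLookup: Map[Long, String] = (1L to 100000L).map(i => i -> ("value" + i)).toMap
  val bcast = sc.broadcast(bigLookup)   // shipped to each executor once instead of per task
  sc.parallelize(1L to 10L).map(id => bcast.value.getOrElse(id, "missing")).collect()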

Simple path queries on large graphs

data-mining,networkx,large-data,jung,spark-graphx

So this is a way to do it in networkx. It's roughly based on the solution I gave here. I'm assuming that a->b and a<-b are two distinct paths you want. I'm going to return this as a list of lists. Each sublist is the (ordered) edges of a path....

How to create a VertexId in Apache Spark GraphX using a Long data type?

scala,apache-spark,spark-graphx

Not exactly. If you take a look at the signature of the apply method of the Graph object you'll see something like this (for the full signature see the API docs):
apply[VD, ED](
  vertices: RDD[(VertexId, VD)],
  edges: RDD[Edge[ED]],
  defaultVertexAttr: VD)
As you can read in the description: Construct a graph from...
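
A minimal hedged sketch of using it with plain Long ids (VertexId is just a type alias for Long; toy data, SparkContext sc assumed):
  import org.apache.spark.graphx.{Edge, Graph, VertexId}
  val vertices = sc.parallelize(Seq((1L: VertexId, "alice"), (2L, "bob")))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
  val graph = Graph(vertices, edges, defaultVertexAttr = "missing")
  graph.vertices.count()   // 2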

Applying a function to graph data using mapReduceTriplets in Spark and GraphX

scala,network-programming,apache-spark,graph-algorithm,spark-graphx

I think you want to use GraphOps.collectNeighbors instead of either mapReduceTriplets or aggregateMessages. collectNeighbors will give you an RDD with, for every VertexId in your graph, the connected nodes as an array. Just reduce the Array based on your needs. Something like:
val countsRdd = graph.collectNeighbors(EdgeDirection.Either)
  .join(graph.vertices)
  .map{ case (vid, t)...
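
A complete hedged sketch of that approach on a toy graph (the "count neighbours that share my attribute" reduction is only an example; a SparkContext sc is assumed):
  import org.apache.spark.graphx.{Edge, EdgeDirection, Graph}
  val vertices = sc.parallelize(Seq((1L, "red"), (2L, "red"), (3L, "blue")))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1)))
  val graph = Graph(vertices, edges)
  val countsRdd = graph.collectNeighbors(EdgeDirection.Either)
    .join(graph.vertices)
    .map { case (vid, (neighbours, attr)) =>
      (vid, neighbours.count(_._2 == attr))   // how many neighbours carry the same attribute
    }
  countsRdd.collect()   // e.g. Array((1,1), (2,1), (3,0))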

Vertex Property Inheritance - Graphx Scala Spark

scala,apache-spark,spark-graphx

This is a Scala question: just convert the extended type to the abstract type using asInstanceOf, for example:
val variable1: RDD[UserProperty] = {..your code..}
val variable2: RDD[ProductProperty] = {..your code..}
val result: RDD[VertexProperty] = sc.union(
  variable1.asInstanceOf[RDD[VertexProperty]],
  variable2.asInstanceOf[RDD[VertexProperty]])
The same goes for the edge property, use val edge: EdgeProperty = Edge(srcID,...
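
A self-contained hedged sketch of the same idea (the property classes follow the question, the data is made up, and a SparkContext sc is assumed); note that the cast has to target RDD[VertexProperty], not VertexProperty:
  import org.apache.spark.rdd.RDD
  sealed trait VertexProperty
  case class UserProperty(name: String) extends VertexProperty
  case class ProductProperty(title: String, price: Double) extends VertexProperty
  val variable1: RDD[UserProperty] = sc.parallelize(Seq(UserProperty("alice")))
  val variable2: RDD[ProductProperty] = sc.parallelize(Seq(ProductProperty("book", 9.99)))
  val result: RDD[VertexProperty] = sc.union(
    variable1.asInstanceOf[RDD[VertexProperty]],
    variable2.asInstanceOf[RDD[VertexProperty]])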

Clojure: Scala/Java interop issues for Spark Graphx

java,scala,clojure,spark-graphx

Finally got it right (I think). Here is the code that appears to be working:
(ns spark-tests.core
  (:require [flambo.conf :as conf]
            [flambo.api :as f]
            [flambo.tuple :as ft])
  (:import (org.apache.spark.graphx Edge Graph)
           (org.apache.spark.api.java JavaRDD StorageLevels)
           (scala.reflect ClassTag$)))
(defonce c (-> (conf/spark-conf)
               (conf/master "local")
               (conf/app-name "flame_princess")))
(defonce sc (f/spark-context c))
(def users...

When joining vertices, am I forced to use MEMORY_ONLY caching?

apache-spark,spark-graphx

Well, I think it's actually not a bug. Looking at the code for VertexRDD, it overrides the cache method and uses the original StorageLevel that was used to create this VertexRDD:
override def cache(): this.type = {
  partitionsRDD.persist(targetStorageLevel)
  this
}
...
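
If a different level is needed, it can be chosen when the graph is built; a hedged sketch, assuming Spark 1.1 or later where Graph.apply exposes the storage-level parameters (toy data, SparkContext sc assumed):
  import org.apache.spark.graphx.{Edge, Graph}
  import org.apache.spark.storage.StorageLevel
  val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1)))
  val graph = Graph(vertices, edges, defaultVertexAttr = "missing",
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
  graph.vertices.cache()   // reuses MEMORY_AND_DISK rather than forcing MEMORY_ONLY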

Find mutually Edges with Spark and GraphX

graph,apache-spark,vertices,edges,spark-graphx

Try:
edges.intersection(edges.map(e => Edge(e.dstId, e.srcId, e.attr)))
Note that this compares the Edge.attr values as well. If you want to ignore attr values, then do this:
edges.map(e => (e.srcId, e.dstId)).intersection(edges.map(e => (e.dstId, e.srcId)))
...

Is there any Spark GraphX constructor with merge function for duplicate Vertices

apache-spark,spark-graphx

You've already answered much of your own question; however, if you are looking for a way to just control the merge and otherwise still use the existing constructor, you could do:
val vertices: RDD[(VertexId, Long)] ...
val edges: RDD[Edge[Long]] ...
val mergedVertices = VertexRDD(vertices, edges, default, mergeFun)
val graph =...
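
A fuller hedged sketch with toy data (a SparkContext sc is assumed; the merge keeps the larger value, and the default fills vertices that only appear in edges):
  import org.apache.spark.graphx.{Edge, EdgeRDD, Graph, VertexRDD}
  val vertices = sc.parallelize(Seq((1L, 10L), (1L, 32L), (2L, 5L)))   // vertex 1 is duplicated
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1L)))
  val mergeFun = (a: Long, b: Long) => math.max(a, b)
  val mergedVertices = VertexRDD(vertices, EdgeRDD.fromEdges[Long, Long](edges), 0L, mergeFun)
  val graph = Graph(mergedVertices, edges)
  graph.vertices.collect()   // e.g. Array((1,32), (2,5))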