scala,apache-spark,spark-graphx
triangleCount counts the number of triangles per vertex and returns Graph[Int,Int], so you have to extract the vertices:
scala> graph.triangleCount().vertices.collect()
res0: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((1,1), (3,1), (2,1))
...
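As a minimal, self-contained sketch of where an array like that comes from (toy graph, assuming an available SparkContext sc): three canonical edges forming one triangle give every vertex a count of 1.
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

// Three edges forming one triangle (canonical srcId < dstId orientation).
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1)))
val graph: Graph[Int, Int] = Graph.fromEdges(edges, defaultValue = 0)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val counts: Array[(VertexId, Int)] = graph.triangleCount().vertices.collect()
// e.g. Array((1,1), (2,1), (3,1)) -- each vertex sits in the single triangle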
The answer is the following Pig script:
REGISTER piggybank.jar;
A = LOAD '/user/hue/guidfile.txt' AS (guid:chararray, name:chararray, label:chararray);
B = FOREACH A GENERATE guid, name, label, org.apache.pig.piggybank.evaluation.string.HashFNV(guid);
STORE B INTO '/user/hue/guidlongfile.txt';
The result includes an additional field of type Long. The name and label fields are mentioned to indicate a Vertex...
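If the goal is Long vertex ids for GraphX, a rough Scala counterpart is to hash the GUID yourself. The sketch below uses a hand-rolled 64-bit FNV-1a hash; the exact FNV variant and parameters used by the piggybank HashFNV UDF may differ.
// Hypothetical helper: 64-bit FNV-1a hash of a string, usable as a VertexId.
def fnv1a64(s: String): Long = {
  val prime = 1099511628211L
  var hash = -3750763034362895579L // 64-bit FNV offset basis as a signed Long
  for (b <- s.getBytes("UTF-8")) {
    hash ^= (b & 0xffL)
    hash *= prime
  }
  hash
}

// (guid, name, label) records become (hashedId, (name, label)) vertex tuples.
val vertices = Seq(("guid-1", "name-1", "label-1")).map {
  case (guid, name, label) => (fnv1a64(guid), (name, label))
}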
mapreduce,apache-spark,spark-graphx
Probably you need something like this:
val oldestFollower: VertexRDD[(String, Int)] = userGraph.aggregateMessages[(String, Int)](
  // For each edge send a message to the destination vertex with the attribute of the source vertex
  sendMsg = { triplet => triplet.sendToDst(triplet.srcAttr.name, triplet.srcAttr.age) },
  // To combine messages take the message for the older...
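Since the snippet above is cut off, here is a hedged, complete version of the same aggregateMessages pattern with a made-up User(name, age) vertex type and a SparkContext sc assumed; the mergeMsg function keeps the older follower.
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}
import org.apache.spark.rdd.RDD

case class User(name: String, age: Int)

val users: RDD[(Long, User)] = sc.parallelize(Seq(
  (1L, User("alice", 34)), (2L, User("bob", 51)), (3L, User("carol", 27))))
val follows: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(2L, 1L, 1), Edge(3L, 1L, 1)))
val userGraph: Graph[User, Int] = Graph(users, follows)

val oldestFollower: VertexRDD[(String, Int)] = userGraph.aggregateMessages[(String, Int)](
  // send the follower's name and age to the vertex being followed
  sendMsg = triplet => triplet.sendToDst((triplet.srcAttr.name, triplet.srcAttr.age)),
  // keep whichever follower is older
  mergeMsg = (a, b) => if (a._2 > b._2) a else b)
// oldestFollower.collect() would contain (1, ("bob", 51))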
scala> val res = Array((1,1), (3,1), (2,1))
res: Array[(Int, Int)] = Array((1,1), (3,1), (2,1))

scala> res.map(_._2).sum
res7: Int = 3

or in one operation:

scala> res.foldLeft(0){ case (acc, (k, v)) => acc + v }
res8: Int = 3
...
scala,apache-spark,spark-graphx
Performing set operations like subtract with mutable types (Array in this example) is usually unsupported, or at least not recommended. Try using an immutable type instead. I believe WrappedArray is the relevant container for storing arrays in sets, but I'm not sure....
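To illustrate (a hedged sketch with toy RDDs, assuming a SparkContext sc): Array compares by reference, so value-equal arrays are not recognised by subtract; converting to an immutable collection with structural equality, such as Vector, behaves as expected.
val a = sc.parallelize(Seq(Array(1, 2), Array(3, 4)))
val b = sc.parallelize(Seq(Array(1, 2)))

// Arrays hash and compare by identity, so Array(1, 2) may survive the subtract
// (or the operation may fail outright, depending on the Spark version).
val raw = a.subtract(b)

// Structural equality: the Vector(1, 2) element is removed as intended.
val fixed = a.map(_.toVector).subtract(b.map(_.toVector))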
scala,apache-spark,scala-collections,spark-graphx
You're looking for the groupBy function followed by mapValues to process each group.
pairs groupBy {_._1} mapValues { groupOfPairs => doSomething(groupOfPairs) }
...
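A small concrete example of that shape (toy data; the string concatenation stands in for whatever doSomething does):
val pairs = Seq((1, "a"), (1, "b"), (2, "c"))

// groupBy keys by the first element; mapValues then processes each group.
val grouped: Map[Int, Seq[(Int, String)]] = pairs.groupBy(_._1)
val result = grouped.mapValues(group => group.map(_._2).mkString(",")).toMap
// Map(1 -> "a,b", 2 -> "c")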
After going through the log, I figured out that my task size is large and it takes time to schedule it. Spark itself warns about this:
15/06/17 02:32:47 WARN TaskSetManager: Stage 1 contains a task of very large size (140947 KB). The maximum recommended task size is 100 KB.
That led me...
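One common source of that warning (not necessarily what happened here, since the answer is truncated) is a large local collection captured in a task closure; a hedged sketch of the usual broadcast fix, with a hypothetical loadBigTable helper:
// Hypothetical large lookup table and input RDD.
val bigLookup: Map[String, Int] = loadBigTable()   // assumed helper
val rdd = sc.parallelize(Seq("a", "b", "c"))

// Captured directly, the map is serialised into every task -> large task size.
val slow = rdd.map(x => bigLookup.getOrElse(x, 0))

// Broadcast once per executor instead, keeping each task small.
val bcast = sc.broadcast(bigLookup)
val fast = rdd.map(x => bcast.value.getOrElse(x, 0))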
data-mining,networkx,large-data,jung,spark-graphx
So this is a way to do it in networkx. It's roughly based on the solution I gave here. I'm assuming that a->b and a<-b are two distinct paths you want. I'm going to return this as a list of lists. Each sublist is the (ordered) edges of a path....
scala,apache-spark,spark-graphx
Not exactly. If you take a look at the signature of the apply method of the Graph object you'll see something like this (for the full signature see the API docs):
apply[VD, ED](
    vertices: RDD[(VertexId, VD)],
    edges: RDD[Edge[ED]],
    defaultVertexAttr: VD)
As you can read in the description: Construct a graph from...
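A quick sketch of that behaviour (toy data, assuming a SparkContext sc): vertex 3 appears only in the edge RDD, so it receives the default attribute.
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges, defaultVertexAttr = "unknown")
graph.vertices.collect()   // includes (3L, "unknown")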
scala,network-programming,apache-spark,graph-algorithm,spark-graphx
I think you want to use GraphOps.collectNeighbors instead of either mapReduceTriplets or aggregateMessages. collectNeighbors will give you an RDD with, for every VertexId in your graph, the connected nodes as an array. Just reduce the Array based on your needs. Something like:
val countsRdd = graph.collectNeighbors(EdgeDirection.Either)
  .join(graph.vertices)
  .map{ case (vid,t)...
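Because the snippet is truncated, here is a hedged, complete variant (assuming a graph: Graph[VD, ED] from the question) where the map body simply counts neighbours; the original may have done something else with the joined vertex attribute.
import org.apache.spark.graphx.EdgeDirection

val countsRdd = graph
  .collectNeighbors(EdgeDirection.Either)      // VertexRDD[Array[(VertexId, VD)]]
  .join(graph.vertices)
  .map { case (vid, (neighbors, attr)) => (vid, neighbors.length) }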
scala,apache-spark,spark-graphx
It is a Scala question: just convert the extended type to the abstract type using asInstanceOf, for example:
val variable1: RDD[UserProperty] = {..your code..}
val variable2: RDD[ProductProperty] = {..your code..}
val result: RDD[VertexProperty] = sc.union(
  variable1.asInstanceOf[RDD[VertexProperty]],
  variable2.asInstanceOf[RDD[VertexProperty]])
The same goes for the edge property; use val edge: EdgeProperty = Edge(srcID,...
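For a self-contained picture, a hedged sketch using the VertexProperty hierarchy from the GraphX docs (the concrete values and a SparkContext sc are assumed): RDD is invariant, so the element type is cast up before taking the union.
import org.apache.spark.rdd.RDD

sealed trait VertexProperty
case class UserProperty(name: String) extends VertexProperty
case class ProductProperty(name: String, price: Double) extends VertexProperty

val users: RDD[UserProperty] = sc.parallelize(Seq(UserProperty("alice")))
val products: RDD[ProductProperty] = sc.parallelize(Seq(ProductProperty("book", 9.99)))

// Cast the element type up, then union the RDDs.
val all: RDD[VertexProperty] = sc.union(
  users.asInstanceOf[RDD[VertexProperty]],
  products.asInstanceOf[RDD[VertexProperty]])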
java,scala,clojure,spark-graphx
Finally got it right (I think). Here is the code that appears to be working:
(ns spark-tests.core
  (:require [flambo.conf :as conf]
            [flambo.api :as f]
            [flambo.tuple :as ft])
  (:import (org.apache.spark.graphx Edge Graph)
           (org.apache.spark.api.java JavaRDD StorageLevels)
           (scala.reflect ClassTag$)))

(defonce c (-> (conf/spark-conf)
               (conf/master "local")
               (conf/app-name "flame_princess")))

(defonce sc (f/spark-context c))

(def users ...
Well, I think it's actually not a bug. Looking at the code for VertexRDD, it overrides the cache method and uses the original StorageLevel that the VertexRDD was created with.
override def cache(): this.type = {
  partitionsRDD.persist(targetStorageLevel)
  this
}
...
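In other words, the storage level has to be chosen when the VertexRDD is built. A hedged sketch (assuming existing vertices and edges RDDs with Int vertex attributes) using the storage-level parameters of Graph.apply:
import org.apache.spark.storage.StorageLevel

val graph = Graph(vertices, edges, defaultVertexAttr = 0,
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

graph.vertices.cache()   // now persists at MEMORY_AND_DISK rather than MEMORY_ONLY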
graph,apache-spark,vertices,edges,spark-graphx
Try:
edges.intersection(edges.map(e => Edge(e.dstId, e.srcId, e.attr)))
Note that this compares the Edge.attr values as well. If you want to ignore the attr values, then do this:
edges.map(e => (e.srcId, e.dstId)).intersection(edges.map(e => (e.dstId, e.srcId)))
...
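A tiny end-to-end example of the attr-ignoring variant (toy edges, assuming a SparkContext sc): only the 1 and 2 pair is reciprocal.
import org.apache.spark.graphx.Edge

val edges = sc.parallelize(Seq(Edge(1L, 2L, "a"), Edge(2L, 1L, "b"), Edge(2L, 3L, "c")))

val bidirectional = edges.map(e => (e.srcId, e.dstId))
  .intersection(edges.map(e => (e.dstId, e.srcId)))
// collect() yields (1,2) and (2,1)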
You've already answered much of your own question; however, if you are looking for a way to just control the merge and otherwise still use the existing constructor, you could do:
val vertices: RDD[(VertexId, Long)] ...
val edges: RDD[Edge[Long]] ...
val mergedVertices = VertexRDD(vertices, edges, default, mergeFun)
val graph =...
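Filled out into a hedged, runnable sketch (the data, default and mergeFun are placeholders; note that this VertexRDD constructor takes an EdgeRDD, so a plain RDD[Edge[Long]] is wrapped with EdgeRDD.fromEdges first):
import org.apache.spark.graphx.{Edge, EdgeRDD, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, Long)] = sc.parallelize(Seq((1L, 10L), (1L, 20L), (2L, 5L)))
val edges: RDD[Edge[Long]] = sc.parallelize(Seq(Edge(1L, 2L, 0L)))

val default = 0L
val mergeFun = (a: Long, b: Long) => a + b   // e.g. sum duplicate vertex attributes

val mergedVertices: VertexRDD[Long] =
  VertexRDD(vertices, EdgeRDD.fromEdges[Long, Long](edges), default, mergeFun)
val graph: Graph[Long, Long] = Graph(mergedVertices, edges, default)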