Monday, June 13, 2016

Why so many tasks in my Spark job?

I have a Spark job that reads a file with 8 records from HDFS, does a simple aggregation, and saves the result back to HDFS. I notice there are hundreds of tasks when I do this.

I am also not sure why there are multiple jobs for this. I thought a job was created when an action happened. I can speculate as to why, but my understanding was that this code should be a single job broken down into stages, not multiple jobs. Why doesn't it just break the work down into stages instead of breaking it into jobs?
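To make my mental model concrete, here is a minimal sketch (the input path is the same one used in the code below; the output path is just a placeholder): my assumption was that each action call triggers its own job, while transformations only build up the lineage.

    /* sketch of my assumption: one action call = one job */
    val lines = sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt")  // lazy: no job yet
    val upper = lines.map(_.toUpperCase)          // transformation: still no job
    upper.count()                                 // action: launches a job
    upper.saveAsTextFile("/tmp/ex_upper_out")     // action: launches another job (placeholder path)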

As for the 200-plus tasks: since the amount of data and the number of nodes is minuscule, it doesn't make sense that there are roughly 25 tasks for each row of data when there is only one aggregation and a couple of filters. Why wouldn't it just have one task per partition per atomic operation?
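In case it is relevant, here is a small sketch of how I understand the partition counts could be inspected (sc and sqlContext are the ones created in the code below; spark.sql.shuffle.partitions is the standard Spark SQL setting for the number of post-shuffle partitions, which I believe defaults to 200):

    /* sketch: where do the task counts come from? */
    val input = sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt")
    println(input.partitions.length)                             // tasks in the file-scan stage

    /* partitions (and therefore tasks) produced by the groupBy shuffle */
    println(sqlContext.getConf("spark.sql.shuffle.partitions"))  // "200" unless overridden
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")      // e.g. shrink it for tiny data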

Here is the relevant Scala code:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object TestProj {
  def main(args: Array[String]) {

    /* set the application name in the SparkConf object */
    val appConf = new SparkConf().setAppName("Test Proj")

    /* env settings that I don't need to set in the REPL */
    val sc = new SparkContext(appConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd1 = sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt")

     /* the below rdd will have the schema defined in the Record class */
     val rddCase =  sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt")
                      .map(x => x.split(" "))    // split each record into an array of strings on spaces
                      .map(x=>Record(
                                     x(0).toInt,
                                     x(1).asInstanceOf[String],
                                     x(2).asInstanceOf[String],
                                     x(3).toInt
                                    ))


    /* the below dataframe groups on the first letter of the first name and counts it */
    val aggDF = rddCase.toDF()
                       .groupBy($"firstName".substr(1,1).alias("firstLetter"))
                       .count
                       .orderBy($"firstLetter")

    /* save to hdfs*/
    aggDF.write.format("parquet").mode("append").save("/raw/miscellaneous/ex_out_agg")

  }

    case class Record(id: Int
                    , firstName: String
                    , lastName: String
                    , quantity:Int)

}

Below is the screenshot after clicking on the application. [screenshot]

Below are the stages shown when viewing the specific "job" with id 0. [screenshot]

Below is the first part of the screen when clicking on the stage with over 200 tasks. [screenshot]

This is the second part of the screen inside the stage. [screenshot]

Below is the screen after clicking on the "executors" tab. [screenshot]

As requested, here are the stages for Job ID 1. [screenshot]

Here are the details for the stage in Job ID 1 with 200 tasks. [screenshot]
