Open-source big data framework
Pre-flight check
Run the spark shell
sc -> the SparkContext, which the shell creates for you
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD).
scala> val data = 1 to 10000
scala> val disData = sc.parallelize(data)
scala> disData.filter(_ < 10).collect()
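Outside the shell, `sc` is not predefined. The sketch below is one way the same filter could run as a standalone program. The object name `FilterSketch`, the app name, and the `local[2]` master are illustrative assumptions, not part of the original session.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone version of the shell session above.
object FilterSketch {
  // Builds a local SparkContext, runs the filter, and returns the matches.
  def smallValues(): Array[Int] = {
    // Assumption: local mode with 2 worker threads, for experimentation only.
    val conf = new SparkConf().setAppName("filter-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = 1 to 10000
      val disData = sc.parallelize(data)  // distribute the range as an RDD
      disData.filter(_ < 10).collect()    // action: bring the matches back to the driver
    } finally {
      sc.stop()                           // release the context even if the job fails
    }
  }

  def main(args: Array[String]): Unit =
    println(smallValues().mkString(", ")) // prints 1, 2, 3, 4, 5, 6, 7, 8, 9
}
```

`collect()` returns the whole result to the driver, so it is only appropriate when the filtered data is small, as it is here.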
Spark Deconstructed
MapReduce - simplifies data processing on large clusters.
Fast data sharing - Spark takes advantage of larger memory to keep intermediate data in RAM between steps instead of writing it to disk.
Generalised DAGs (Directed Acyclic Graphs) - Spark evaluates lazily: it first builds the graph of operations and then works out how to optimise it,
whereas Hadoop MapReduce runs job step by job step, with a synchronization barrier between steps.
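The lazy evaluation and in-memory sharing described above can be sketched roughly as follows. This is a minimal illustration under assumptions: the object name `LazyDagSketch`, the sample pipeline, and the `local[2]` master are all made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: transformations build a graph, actions execute it.
object LazyDagSketch {
  // Returns (number of elements, their sum) for a small sample pipeline.
  def run(): (Long, Int) = {
    // Assumption: local mode with 2 worker threads.
    val conf = new SparkConf().setAppName("lazy-dag-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Transformations only describe the graph; no job has run yet.
      val multiplesOfFour = sc.parallelize(1 to 10000).map(_ * 2).filter(_ % 4 == 0)

      // Print the lineage Spark has recorded so far.
      println(multiplesOfFour.toDebugString)

      // Ask Spark to keep the computed partitions in memory (this call is lazy too).
      multiplesOfFour.cache()

      val n = multiplesOfFour.count()            // first action: runs the graph, fills the cache
      val total = multiplesOfFour.reduce(_ + _)  // second action: can reuse the cached partitions
      (n, total)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (n, total) = run()
    println(s"count=$n sum=$total") // prints count=5000 sum=50010000
  }
}
```

Because `map` and `filter` are recorded rather than executed, Spark can run them together in one pass over each partition when `count()` finally triggers the job.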
Key Distinctions for Spark vs. MapReduce
Example: word count
scala> val f = sc.textFile("")
f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:14
scala> wc.saveAsTextFile("wc_out")  // writes one part file per partition
spark-training/wc_out$ ls
part-00000 part-00001 _SUCCESS
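For experimenting without an input file, the same word count can be sketched as a standalone program over a few in-memory lines. The object name `WordCountSketch`, the sample sentences, and the `local[2]` master are assumptions for illustration. The sketch also splits on runs of whitespace and drops empty tokens, which is a small variation on the `split(" ")` used in the session, and it assumes Spark 1.3 or later so that `reduceByKey` is available without an extra implicit import.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone word count over in-memory lines.
object WordCountSketch {
  // Counts word occurrences and returns them to the driver as a Map.
  def count(lines: Seq[String]): Map[String, Int] = {
    // Assumption: local mode with 2 worker threads.
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(lines)
        .flatMap(_.split("\\s+"))   // split each line on runs of whitespace
        .filter(_.nonEmpty)         // drop empty tokens from leading whitespace
        .map(word => (word, 1))     // pair each word with a count of 1
        .reduceByKey(_ + _)         // sum the counts per word (this step shuffles)
        .collect()                  // small result, safe to bring to the driver
        .toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample input.
    val counts = count(Seq("to be or not to be", "to do is to be"))
    counts.toSeq.sortBy(_._1).foreach { case (w, c) => println(s"$w\t$c") }
  }
}
```

Returning a `Map` suits small vocabularies only; for real data, `saveAsTextFile` as in the session keeps the result distributed.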