Friday 2 October 2015

Introduction to Apache Spark

Open-source Big Data framework

https://databricks.com/spark/developer-resources

http://spark.apache.org/docs/latest/quick-start.html

Pre-flight check

Run the Spark shell:
./bin/spark-shell

sc is the SparkContext created by the shell
sc.master shows the master URL the shell is connected to
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD).

scala> val data = 1 to 10000

scala> val disData = sc.parallelize(data)

scala> disData.filter(_ < 10).collect()
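
The same RDD can also be aggregated with an action; a minimal sketch reusing the disData value from above (reduce sums all the elements, so the expected result for 1 to 10000 is 50005000):

scala> disData.reduce(_ + _)   // action: sums the elements across all partitions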


Spark Deconstructed



MapReduce - simplifies data processing on large clusters


Fast data sharing - takes advantage of the larger memory available, keeping intermediate data in RAM rather than writing it to disk between steps.
Generalised DAGs (Directed Acyclic Graphs) - support lazy evaluation: Spark builds the whole graph of operations first and then works out how it can be optimised,
whereas Hadoop runs job step by job step, with a synchronisation barrier between steps.
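
A minimal sketch of lazy evaluation in the shell, reusing the README.md file that also appears in the word-count example below:

scala> val lines = sc.textFile("README.md")                // transformation: only records lineage, no file is read yet
scala> val sparkLines = lines.filter(_.contains("Spark"))  // still lazy, just extends the DAG
scala> sparkLines.count()                                  // action: now the file is read and the job actually runs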



Key Distinctions for Spark vs. MapReduce


Ex:- Word count
scala> val f = sc.textFile("README.md")
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12

scala>  val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:14

scala> wc.saveAsTextFile("wc_out") - this writes one output file per partition
spark-training/wc_out$ ls
part-00000  part-00001  _SUCCESS
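
To look at the result directly in the shell instead of writing files, the standard sortBy and take RDD operations can show the most frequent words (a minimal sketch; the output depends on the README contents):

scala> wc.sortBy(_._2, ascending = false).take(10)   // ten most frequent (word, count) pairs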

Clusters

RDD

Transformations
Actions
Persistence
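
A minimal sketch tying transformations, actions and persistence together; the logs.txt file name and the ERROR filter are illustrative assumptions only:

scala> val logs = sc.textFile("logs.txt")             // hypothetical input file
scala> val errors = logs.filter(_.contains("ERROR"))  // transformation: lazy, nothing runs yet
scala> errors.cache()                                 // persistence: keep the filtered RDD in memory once computed
scala> errors.count()                                 // action: runs the job and fills the cache
scala> errors.take(5)                                 // action: served from the cached data, the file is not re-read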


