Friday, 2 October 2015

Introduction to Apache Spark

Open-source big data framework

https://databricks.com/spark/developer-resources

http://spark.apache.org/docs/latest/quick-start.html

Pre-flight check

Run the Spark shell:
./bin/spark-shell

sc -> the SparkContext, created for you by the shell
sc.master -> the master URL the shell is connected to
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD).

scala> val data = 1 to 10000

scala> val disData = sc.parallelize(data)

scala> disData.filter(_ < 10).collect()


Spark Deconstructed



MapReduce - simplifies data processing on large clusters


Fast data sharing - takes advantage of larger memory by keeping data in memory between steps.
Generalised DAGs [Directed Acyclic Graphs] - support lazy evaluation: Spark builds the graph of operations first and then works out how it can be optimised.
Hadoop MapReduce, by contrast, runs job step by job step, with a synchronization barrier between steps.
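
The lazy evaluation described above can be seen directly in the shell. This is a minimal sketch, not taken from the original material: the names numbers, evens, squares and total are made up for illustration, and the local[*] master is an assumption that only applies when no context exists yet (inside spark-shell the existing sc is reused).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if present, otherwise start a local one.
val conf = new SparkConf().setAppName("lazy-eval-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val numbers = sc.parallelize(1 to 10000)

// Transformations: these only record steps in the DAG; no job runs yet.
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Print the lineage Spark has recorded so far.
println(squares.toDebugString)

// Action: only now does Spark schedule and run a job.
val total = squares.reduce(_ + _)
println(total)
```

Nothing is computed until reduce is called, so Spark sees the whole chain (filter, then map) before it decides how to execute it.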



Key Distinctions for Spark vs. MapReduce


Ex:- Word count
scala> val f = sc.textFile("README.md")
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12

scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:14

scala> wc.saveAsTextFile("wc_out")

This writes a wc_out directory containing one part file per partition, plus a _SUCCESS marker:
spark-training/wc_out$ ls
part-00000  part-00001  _SUCCESS
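
To look at the counts without opening the part files, the pairs can be sorted and the first few taken. The sketch below is an illustration only: it uses a small in-memory sample instead of README.md so that it runs anywhere, and the names lines, counts and top are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if present, otherwise start a local one.
val conf = new SparkConf().setAppName("wordcount-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Stand-in for sc.textFile("README.md"): two sample lines.
val lines = sc.parallelize(Seq("to be or not to be", "that is the question"))

// Same pipeline as the word count above, with empty tokens dropped.
val counts = lines
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Sort by count, highest first, and bring three pairs back to the driver.
val top = counts.sortBy(_._2, ascending = false).take(3)
top.foreach { case (word, n) => println(s"$word: $n") }
```

take returns only the requested pairs to the driver, which is safer than collect() when the input is large.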

Ex:-

Clusters

RDD


Transformations

Actions
Persistence
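
The three ideas named above (transformations, actions, persistence) can be tied together in a few lines. This is a rough sketch under assumptions: the data and the names words, pairs, distinctWords and totalPairs are invented for illustration, and MEMORY_ONLY is just one possible storage level.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: reuse the shell's context if present, otherwise start a local one.
val conf = new SparkConf().setAppName("persistence-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val words = sc.parallelize(Seq("spark", "rdd", "spark", "dag"))

// Transformation: lazily describes a new RDD.
val pairs = words.map(w => (w, 1))

// Persistence: mark the RDD to be kept in memory once it has been computed.
pairs.persist(StorageLevel.MEMORY_ONLY)

// Action 1: computes pairs and stores its partitions in memory.
val distinctWords = pairs.keys.distinct().count()

// Action 2: can reuse the stored partitions instead of recomputing the map.
val totalPairs = pairs.count()

println(s"distinct=$distinctWords total=$totalPairs")

// Release the cached partitions when they are no longer needed.
pairs.unpersist()
```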


