Apache Spark and PySpark
Some reading material:
- Spark: official Spark documentation
- PySpark: PySpark API documentation
- Books: "Learning Spark", "Advanced Analytics with Spark"
Big Data:
Data volumes in TBs, PBs and beyond.
Hadoop is an older system used for processing big data. Hadoop relies on a distributed file system (HDFS) and disk-based processing.
Characteristics:
- Volume - size of data (bytes < KB < MB < GB < TB < PB ...)
  - Ex: digital payments, social media data, e-commerce data, etc.
- Velocity - the speed at which data is generated and moves.
- Variety - Excel, RDBMS, txt, JSON, XML, HTML, documents, images, audio, video, geo maps, etc.
- Veracity - quality and accuracy of data; whether the data is trustworthy or not.
- Value - actionable information: any data that informs business decisions or can be considered useful.
Company requirements:
- Data storage
- Data processing speed
- Scalability
Hadoop: HDFS (file system) + MapReduce (computation/programming framework)
- Handles large amounts of data.
- Compare a personal computer: CPU (processing unit), hard disk (storage), RAM (fast READ/WRITE of data).
- In Hadoop, data resides on hard disks; disk read/write is expensive, and Hadoop uses the hard disk for all of its intermediate data on the way to the end result.
Spark (introduced in 2009): its language APIs expose the same concepts across languages.
- Speed: in-memory computation.
- Flexibility: supports multiple languages - Python, Java, R, Scala, SQL.
- Scalability: works with very large amounts of data.
Hadoop vs Spark:
- Hadoop MapReduce is disk-bound and stage-by-stage sequential, while Spark is faster and more parallel thanks to its in-memory design.
- Hadoop distributes data across hard disks in the cluster; Spark distributes data in RAM.
- MapReduce coding is harder than Spark RDD and DataFrame coding.
Apache Spark:
A unified analytics engine for large-scale data processing. It distributes data across a large cluster (or clusters) and processes it there; it is a distributed system. Spark uses in-memory processing and is scalable.
A cluster is a group of machines that together have the capacity to process data.
Clusters enable parallel processing and help save processing time.
Different facilities of Spark:
- PySpark
- Spark SQL
- MLlib
- GraphX
- Structured Streaming
Spark Ecosystem (Components or Modules):
- Spark SQL: good for structured data, since you can analyze data represented in table format.
- MLlib: machine learning library for prediction and automation.
- GraphX: used for data derived from social media and similar sources, since it works with network/graph data.
- Streaming: used for real-time data processing.
Apache Spark Architecture (Processing Flow):
- Driver: acts as the controller and coordinates execution of tasks across the cluster. Real-life analogy: the driver of a car.
- Executors: the workers that actually perform tasks and do the computations. E.g., construction workers who build a building.
- Cluster Manager: handles resource management and task scheduling. E.g., an HR team that recruits and provides the workforce for a company's tasks.
- Cluster modes:
  - Local mode - standalone, single machine
  - Cluster mode - distributed systems
- SparkContext: the gateway that allows the driver to interact with the cluster manager for resource management, and to collect executor and task status.
- RDD (Resilient Distributed Dataset): provides fault tolerance, helping recover data even after a failure. Lost partitions can be recomputed from their lineage, saving a long reprocessing run.
Spark Architecture Advantages:
- Fast processing with in-memory computing in place of disk computing.
- Fault tolerance through RDDs.
- Supports multiple programming languages: Java, Python, R, Scala.
- Provides advanced analytics along with machine learning capabilities.
- Graph processing to analyze network data and time-lined data.
PySpark (Python API for the Spark framework):
The Python API for Spark, allowing Python developers to leverage Spark's capabilities.
- Combines Python's simplicity with Spark's ability to handle large-scale data.
- Ex: analyzing customer data.
PySpark Basics:
- SparkContext - entry point for using any Spark functionality.
- RDD (Resilient Distributed Dataset) - immutable distributed collections of objects.
- Actions and Transformations:
  - Transformations - lazy operations on RDDs (map, filter).
  - Actions - eager operations that trigger execution and return values (collect, count, etc.).
- DataFrames:
  - A distributed collection of data organized into named columns; data is represented in tabular format.
  - They are like relational database tables.
- We can use Google Colab to practice.
Other supported languages are: Java, Scala and R.
Spark Dataframes and SQL:
- Dataframes: A distributed collection of data organized into named columns. Data is represented in tabular format.
- SQL: Language to query structured data in Spark.
Spark Framework:
Details:
- Programming - languages and tools supported by Spark for development.
- Library - useful libraries that make coding easier.
- Engine - the main brain or processing unit of Spark.
- Management - cluster manager/resource manager; provides resources for storage and computation.
- Storage - the kind of storage depends on the nature of the data, e.g. NoSQL for semi-structured data.
Spark Components:
- Spark Core:
  - Parallel and distributed data processing, responsible for:
    - Memory management and fault tolerance.
    - Scheduling and monitoring jobs on the cluster.
    - Interacting with storage systems.
- RDD (Resilient Distributed Dataset):
  - Immutable, fault-tolerant, distributed collections - all operated on in parallel.
  - Transformations:
    - Operators such as map, filter, join, union, etc.
    - They create a new RDD and are lazy operators.
    - They require an action to execute.
    - They are recomputed every time an action is called (unless the RDD is cached).
  - Actions:
    - Operators that return results - count, first, and so on.
    - Used to see the data or result; they are eager operators.
- Spark SQL:
  - Query facility for data analysis; also supports HQL (Hive Query Language) from Apache Hive.
  - Used for structured data.
- Spark Streaming:
  - Real-time processing of streaming data (e.g., web server log files).
  - Social media sector - Twitter, Facebook, etc.
  - It divides the stream into small batches and stores results at the target location.
- MLlib:
  - Machine learning library - focused on predicting scenario-based solutions from historical data.
- GraphX:
  - A uniform tool for ETL on graph data.
  - Created for network graphs.
  - Exploratory data analysis and interactive graph computation.
  - Graph algorithms: PageRank.
Spark Use Cases:
- Earthquake detection.
- Targeted advertising.
- E-commerce, real-time transactions.
- Finance and banking: fraud detection.
PySpark basic code example:
How to create and display a DataFrame in PySpark. DataFrames allow efficient manipulation and querying of structured data.
Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Basic usage").getOrCreate()
data = [("Alice", 1), ("Bob", 2), ("John", 3)]
df = spark.createDataFrame(data, ["name", "id"])
df.show()
Sample 2:
Code:
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext('local', 'Sample app')
# Creating an RDD: converting a list of data into an RDD.
data = sc.parallelize([1, 2, 3, 4, 5, 6])
# Squaring the list of data - a transformation is called.
squares = data.map(lambda x: x * x)
# An action is called to check the result.
print(squares.collect())
# [1, 4, 9, 16, 25, 36] - result of the action command.

# Doing the same kind of task with the help of a DataFrame.
spark = SparkSession.builder.appName('Simple App').getOrCreate()
df = spark.createDataFrame([(1, 'Ravi', 25), (2, 'Rupesh', 38), (3, 'Kishore', 23)], ["ID", "Name", "Age"])
df.show()
Machine Learning in Spark:
- MLlib: Spark's scalable machine learning library, which provides algorithms and utilities for classification, regression, clustering and more.
- ML concepts consist of:
  - Supervised machine learning - classification (categorical prediction) and regression (numerical prediction).
  - Unsupervised machine learning - clustering and others.
Code:
from pyspark.ml.classification import LogisticRegression
Graph processing with Spark:
- GraphX: Spark's API for graph processing.
- Allows analysis and manipulation of graph structures.
- Examples: LinkedIn, Facebook and other social networks.



