
Big Data Analytics with Apache Spark and Scala

Hands-on practical training with real-world business use cases, taught by experienced Big Data professionals.

Pairview Training


£2,856 inc VAT
Study method
Online with live classes
4 days, Full-time
No formal qualification
Additional info
  • Tutor is available to students
  • Certificate of completion available and is included in the price



About This Course

This course will enable delegates to build complete, unified big data applications that combine batch, streaming, and interactive analytics across different data types. Delegates will learn how to write sophisticated parallel applications that support faster, better decisions and real-time actions, applicable to a wide variety of use cases, architectures, and industries.

Benefits to Learners

At the end of this course, you will be able to:

  • Build unified big data applications combining batch, streaming, and interactive analytics.
  • Understand how to write sophisticated parallel applications.
  • Understand the different APIs that Spark offers, such as Spark Streaming, MLlib, Spark SQL, and GraphX.


Course Curriculum

Part 1: Scala (Day 1 & 2)

1. Introduction

  • Introduction to Scala
  • Why Scala is used
  • Advantages of Scala for data science
  • Installing Scala

2. Functional programming vs. object-oriented programming

  • What is functional programming
  • What is object oriented programming
  • Advantages of functional programming

3. Basic object-oriented programming

  • Classes
  • Objects
  • Methods
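
Sketched minimally, these three ideas look like this (the `Greeter` class and its names are illustrative, not taken from the course materials):

```scala
// A class with a constructor parameter promoted to a field via `val`
class Greeter(val name: String) {
  // A method: takes no arguments and returns a String
  def greet(): String = s"Hello, $name!"
}

object Main {
  def main(args: Array[String]): Unit = {
    val g = new Greeter("Scala")   // instantiate the class
    println(g.greet())
  }
}
```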

4. Scala basics

  • Key concepts of Scala
  • Scala syntax
  • Comments in Scala
  • Keywords in Scala
  • Scala identifiers
  • Scala packages
  • Scala worksheets
  • Scala REPL sessions

5. Programming in Scala

  • Creating variables
  • Data types
  • Type inference
  • Operators
  • Conditional statements
  • Decision statements
  • Loops
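
As a taste of these basics, a short sketch covering variables, type inference, an `if` expression, and a `for` loop (the values are illustrative):

```scala
// `val` declares an immutable binding, `var` a mutable one;
// types are inferred where not written explicitly
val limit = 10        // inferred as Int
var total = 0

// In Scala, if/else is an expression that yields a value
val parity = if (limit % 2 == 0) "even" else "odd"

// A for loop over an inclusive range
for (i <- 1 to limit) {
  total += i
}
println(s"Sum of 1..$limit is $total; $limit is $parity")
```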

6. Functions in Scala

  • Defining functions
  • Higher-order functions
  • Using special functions
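
A minimal sketch of a named function and a higher-order function (one that takes another function as a parameter); the names here are illustrative:

```scala
// A named function with an explicit return type
def square(x: Int): Int = x * x

// A higher-order function: `f` is itself a function from Int to Int
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

// Pass a named function, or an anonymous (lambda) function
val fourth  = applyTwice(square, 3)    // square(square(3))
val doubled = applyTwice(x => x * 2, 5)
```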

7. Collections

  • Sequence – Indexed and Linear sequence
  • Set – Hash set, tree set, list set
  • Maps – Hash map, tree map, list map
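
The three collection families above can be sketched as follows (the contents are illustrative):

```scala
// Indexed sequences give fast random access; linear sequences
// give fast access to the head and tail
val indexed: Vector[Int] = Vector(10, 20, 30)
val linear:  List[Int]   = List(10, 20, 30)

// Sets keep only unique elements
val unique = Set(1, 2, 2, 3)   // duplicates collapse

// Maps associate keys with values
val ages = Map("ann" -> 34, "bob" -> 27)
val bobsAge = ages("bob")
```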

8. Idiomatic Scala

  • Expressions
  • Pattern matching
  • Handling exceptions
    • Try and Catch
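
Pattern matching and `scala.util.Try` (a functional alternative to try/catch blocks) can be sketched like this; the examples are illustrative:

```scala
import scala.util.{Failure, Success, Try}

// Pattern matching is an expression that returns a value
def describe(x: Any): String = x match {
  case 0         => "zero"
  case n: Int    => s"an Int: $n"
  case s: String => s"a String: $s"
  case _         => "something else"
}

// Try wraps a computation that may throw an exception
val parsed = Try("42".toInt) match {
  case Success(n) => n
  case Failure(_) => -1
}
```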

9. Object-oriented Scala

  • Classes, fields, and methods
  • Singleton objects
  • Case classes
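
Case classes and singleton objects together cover much of everyday object-oriented Scala; a minimal illustrative sketch:

```scala
// A case class gets structural equality, toString, and copy for free
case class Point(x: Int, y: Int)

// A singleton object: exactly one instance, handy for shared helpers
object Origin {
  def distanceTo(p: Point): Double =
    math.sqrt(p.x * p.x + p.y * p.y)
}

val p = Point(3, 4)
val d = Origin.distanceTo(p)   // distance from (0, 0)
val onAxis = p.copy(y = 0)     // a modified copy; the original is unchanged
```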

10. File I/O

  • Reading files in Scala
  • Writing to files in Scala
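
One common approach (of several) uses `java.io.PrintWriter` for writing and `scala.io.Source` for reading; the file name here is illustrative:

```scala
import java.io.PrintWriter
import scala.io.Source

// Write two lines to a file
val writer = new PrintWriter("example.txt")
writer.println("line one")
writer.println("line two")
writer.close()

// Read the lines back; close the handle even if reading fails
val source = Source.fromFile("example.txt")
val lines  = try source.getLines().toList finally source.close()
```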

11. Parallel Processing in Scala

  • Parallel processing in Scala
  • Advantages of parallel collections
  • Mapping functions over parallel collections
  • Filtering parallel collections
  • When to use parallel collections
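
Calling `.par` converts a collection to its parallel counterpart, after which `map` and `filter` run across multiple threads. Note that on Scala 2.13 and later the parallel collections live in the separate `scala-parallel-collections` module; on 2.12 `.par` is built in. A minimal sketch:

```scala
val nums = (1 to 1000).toVector

// These run across a thread pool rather than sequentially
val evens   = nums.par.filter(_ % 2 == 0)
val squares = nums.par.map(n => n * n)

// Results match the sequential versions; parallelism pays off
// mainly for large collections with expensive per-element work
val total = evens.sum
```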

Part 2: Apache Spark (Day 3 & 4)

1. Introduction to Spark

  • Defining Big Data and Big Computation
  • What is Spark?
  • What is its purpose?
  • What are the benefits of Spark?
  • Components of the Spark unified stack
  • Resilient Distributed Dataset (RDD)
  • Downloading and installing Spark standalone

2. Resilient Distributed Dataset

  • Creating RDD
  • RDD Transformations
  • RDD Actions
  • Programming with RDD
  • Numeric RDD operations
  • Pair RDDs
  • Transformations and Actions on Pair RDDs
  • Joins and Reduction Operations
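
A word-count sketch ties these pieces together: transformations build a pair RDD lazily, and the `collect` action triggers execution. This assumes a local Spark installation; the input lines are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc   = new SparkContext(conf)

val lines = sc.parallelize(Seq("spark makes big data simple",
                               "big data with spark"))

// Transformations are lazy: nothing runs yet
val counts = lines
  .flatMap(_.split(" "))     // split each line into words
  .map(word => (word, 1))    // build a pair RDD of (word, 1)
  .reduceByKey(_ + _)        // sum the counts per word

// `collect` is an action: it triggers the whole computation
counts.collect().foreach(println)
sc.stop()
```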

3. Structured data: SQL, DataFrame, and Datasets

  • Spark SQL
  • DataFrame and datasets
  • Using Datasets instead of RDDs
  • Combining RDDs with the automatic optimizations behind Spark SQL
  • Connecting to databases with JDBC
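
The same aggregation can be expressed through the DataFrame API and through SQL; a sketch against a local `SparkSession` (the `Sale` data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql-demo").master("local[*]").getOrCreate()
import spark.implicits._

// A typed Dataset built from a case class
case class Sale(region: String, amount: Double)
val sales = Seq(Sale("north", 100.0), Sale("south", 250.0),
                Sale("north", 50.0)).toDS()

// DataFrame API: optimized automatically, just like the SQL below
sales.groupBy("region").sum("amount").show()

// Register a view and run the equivalent SQL query
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()
spark.stop()
```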

4. Defining the Spark architecture

  • Partitioning data across the cluster using Resilient Distributed Datasets (RDD) and DataFrames
  • Apportioning task execution across multiple nodes
  • Running applications with the Spark execution model
  • Creating resilient and fault-tolerant clusters
  • Achieving scalable distributed storage
  • Monitoring and administering Spark applications

5. Performing machine learning with Spark

  • MLlib: Machine Learning Library
  • Predicting outcomes with supervised learning
  • Building a decision tree classifier
  • Grouping data using unsupervised learning
  • Clustering with the k-means method
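
A k-means clustering sketch using the DataFrame-based `spark.ml` API (part of MLlib); the points and parameters are made up:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kmeans-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Two obvious clusters: points near (0, 0) and points near (9, 9)
val points = Seq((0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9))
  .toDF("x", "y")

// spark.ml expects the inputs gathered into one `features` vector
val features = new VectorAssembler()
  .setInputCols(Array("x", "y")).setOutputCol("features")
  .transform(points)

// Fit k-means with k = 2 and label each point with its cluster
val model = new KMeans().setK(2).setSeed(1L).fit(features)
model.transform(features).show()   // adds a `prediction` column
spark.stop()
```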

6. Streaming Data in Spark

  • Implementing sliding window operations
  • Determining state from continuous data
  • Processing simultaneous streams
  • Improving performance and reliability
  • Streaming from built-in sources (e.g., log files, sockets, Twitter, Kinesis, Kafka)
  • Processing with the streaming API and Spark SQL
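
A sliding-window sketch using the DStream API: words arriving on a socket are counted over the last 30 seconds, recomputed every 10 seconds. The socket source (`localhost:9999`) is illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the receiver, at least one for processing
val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second batches

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))

// Window of 30 seconds, sliding every 10 seconds
val counts = words.map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
counts.print()

ssc.start()
ssc.awaitTermination()
```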

7. GraphX Library

  • Introduction to Graphs
  • Imports
  • Building the graph
  • Creating GraphFrames
  • Standard Graph Algorithms
    • Breadth First Search
    • Connected Components
    • Strongly connected components
    • PageRank
    • Shortest Paths
  • Basic graph and DataFrame queries
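
Building a small graph and running PageRank with GraphX can be sketched as follows (the vertices and edges are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val conf = new SparkConf().setAppName("graphx-demo").setMaster("local[*]")
val sc   = new SparkContext(conf)

// Vertices are (id, attribute); edges are (srcId, dstId, attribute)
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"),
                                  Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)

// Iterate PageRank until it converges within a 0.001 tolerance
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)
sc.stop()
```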

Who is this course for?

This course is suitable for data architects, database developers, and administrators who are looking to advance their careers into Big Data engineering.

Delegates without such experience of database design and development should meet the following entry requirements:

  • A completed graduate degree with a minimum of a 2:2 is essential.
  • An academic background related to technology, IT, programming, engineering, or technical science is advisable.
  • An enthusiasm for technical processes and development, and an aspiration to build a successful career in the technical, infrastructural side of Business Analytics.
  • Knowledge of a programming language (e.g. Java).

Career path

Big data engineering

Data architecture

Big data solution development
