Learning PySpark: build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0

Bibliographic Details
Main Author: Drabas, Tomasz
Other Authors: Lee, Denny; Karau, Holden
Format: eBook
Language: English
Published: Birmingham, UK: Packt Publishing, 2017.
Collection: EBSCO Academic eBook Collection Complete.
Online Access: Connect to the electronic version
View at Universidad de Navarra: https://innopac.unav.es/record=b39294195*spi
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • Foreword
  • About the Authors
  • About the Reviewer
  • www.PacktPub.com
  • Customer Feedback
  • Table of Contents
  • Preface
  • Chapter 1: Understanding Spark
  • What is Apache Spark?
  • Spark Jobs and APIs
  • Execution process
  • Resilient Distributed Dataset
  • DataFrames
  • Datasets
  • Catalyst Optimizer
  • Project Tungsten
  • Spark 2.0 architecture
  • Unifying Datasets and DataFrames
  • Introducing SparkSession
  • Tungsten phase 2
  • Structured streaming
  • Continuous applications
  • Summary
  • Chapter 2: Resilient Distributed Datasets
  • Internal workings of an RDD
  • Creating RDDs
  • Schema
  • Reading from files
  • Lambda expressions
  • Global versus local scope
  • Transformations
  • The .map(...) transformation
  • The .filter(...) transformation
  • The .flatMap(...) transformation
  • The .distinct(...) transformation
  • The .sample(...) transformation
  • The .leftOuterJoin(...) transformation
  • The .repartition(...) transformation
  • Actions
  • The .take(...) method
  • The .collect(...) method
  • The .reduce(...) method
  • The .count(...) method
  • The .saveAsTextFile(...) method
  • The .foreach(...) method
  • Summary
  • Chapter 3: DataFrames
  • Python to RDD communications
  • Catalyst Optimizer refresh
  • Speeding up PySpark with DataFrames
  • Creating DataFrames
  • Generating our own JSON data
  • Creating a DataFrame
  • Creating a temporary table
  • Simple DataFrame queries
  • DataFrame API query
  • SQL query
  • Interoperating with RDDs
  • Inferring the schema using reflection
  • Programmatically specifying the schema
  • Querying with the DataFrame API
  • Number of rows
  • Running filter statements
  • Querying with SQL
  • Number of rows
  • Running filter statements using the where clauses
  • DataFrame scenario – on-time flight performance
  • Preparing the source datasets
  • Joining flight performance and airports
  • Visualizing our flight-performance data
  • Spark Dataset API
  • Summary
  • Chapter 4: Prepare Data for Modeling
  • Checking for duplicates, missing observations, and outliers
  • Duplicates
  • Missing observations
  • Outliers
  • Getting familiar with your data
  • Descriptive statistics
  • Correlations
  • Visualization
  • Histograms
  • Interactions between features
  • Summary
  • Chapter 5: Introducing MLlib
  • Overview of the package
  • Loading and transforming the data
  • Getting to know your data
  • Descriptive statistics
  • Correlations
  • Statistical testing
  • Creating the final dataset
  • Creating an RDD of LabeledPoints
  • Splitting into training and testing
  • Predicting infant survival
  • Logistic regression in MLlib
  • Selecting only the most predictable features
  • Random forest in MLlib
  • Summary
  • Chapter 6: Introducing the ML Package
  • Overview of the package
  • Transformer
  • Estimators
  • Classification
  • Regression
  • Clustering
  • Pipeline
  • Predicting the chances of infant survival with ML
  • Loading the data
  • Creating transformers
  • Creating an estimator
  • Creating a pipeline
  • Fitting the model
  • Evaluating the performance of the model
  • Saving the model
  • Parameter hyper-tuning
  • Grid search
  • Train-validation splitting
  • Other features of PySpark ML in action
  • Feature extraction
  • NLP – related feature extractors
  • Discretizing continuous variables
  • Standardizing continuous variables
  • Classification
  • Clustering
  • Finding clusters in the births dataset
  • Topic mining
  • Regression
  • Summary
  • Chapter 7: GraphFrames
  • Introducing GraphFrames
  • Installing GraphFrames
  • Creating a library
  • Preparing your flights dataset
  • Building the graph
  • Executing simple queries
  • Determining the number of airports and trips
  • Determining the longest delay in this dataset
  • Determining the number of delayed versus on-time/early flights
  • What flights departing Seattle are most likely to have significant delays?
  • What states tend to have significant delays departing from Seattle?
  • Understanding vertex degrees
  • Determining the top transfer airports
  • Understanding motifs
  • Determining airport ranking using PageRank
  • Determining the most popular non-stop flights
  • Using Breadth-First Search
  • Visualizing flights using D3
  • Summary
  • Chapter 8: TensorFrames
  • What is Deep Learning?
  • The need for neural networks and Deep Learning
  • What is feature engineering?
  • Bridging the data and algorithm
  • What is TensorFlow?
  • Installing Pip
  • Installing TensorFlow
  • Matrix multiplication using constants
  • Matrix multiplication using placeholders
  • Running the model
  • Running another model
  • Discussion
  • Introducing TensorFrames
  • TensorFrames – quick start
  • Configuration and setup
  • Launching a Spark cluster
  • Creating a TensorFrames library
  • Installing TensorFlow on your cluster
  • Using TensorFlow to add a constant to an existing column
  • Executing the Tensor graph
  • Blockwise reducing operations example
  • Building a DataFrame of vectors
  • Analysing the DataFrame
  • Computing elementwise sum and min of all vectors
  • Summary
  • Chapter 9: Polyglot Persistence with Blaze
  • Installing Blaze
  • Polyglot persistence
  • Abstracting data
  • Working with NumPy arrays
  • Working with pandas' DataFrame
  • Working with files
  • Working with databases
  • Interacting with relational databases
  • Interacting with the MongoDB database
  • Data operations
  • Accessing columns
  • Symbolic transformations
  • Operations on columns
  • Reducing data
  • Joins
  • Summary
  • Chapter 10: Structured Streaming
  • What is Spark Streaming?
  • Why do we need Spark Streaming?
  • What is the Spark Streaming application data flow?
  • Simple streaming application using DStreams
  • A quick primer on global aggregations
  • Introducing Structured Streaming
  • Summary
  • Chapter 11: Packaging Spark Applications
  • The spark-submit command
  • Command line parameters
  • Deploying the app programmatically
  • Configuring your SparkSession
  • Creating SparkSession
  • Modularizing code
  • Structure of the module
  • Calculating the distance between two points
  • Converting distance units
  • Building an egg
  • User defined functions in Spark
  • Submitting a job
  • Monitoring execution
  • Databricks Jobs
  • Summary
  • Index