Intel Xeon Phi coprocessor high-performance programming: Knights Landing edition

Intel Xeon Phi Processor High Performance Programming is an all-in-one source of information for programming the second-generation Intel Xeon Phi product family, also called Knights Landing. The authors provide detailed and timely Knights Landing-specific details, programming advice, and real-world ex...


Bibliographic Details
Other Authors:	Jeffers, Jim (author), Reinders, James (author), Sodani, Avinash (author)
Format:	E-book
Language:	English
Published:	Amsterdam, [Netherlands] : Morgan Kaufmann 2016.
Edition:	Knights Landing edition
Subjects:	High performance computing. High performance processors. Coprocessors. Computer programming.
View in Biblioteca Universitat Ramon Llull:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630312306719

Table of Contents:

Front Cover
Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition
Copyright
Contents
Acknowledgments
Foreword
Extending the Sports Car Analogy to Higher Performance
What Exactly Is The Unfair Advantage?
Peak Performance Versus Drivable/Usable Performance
How Does The Unfair Advantage Relate to This Book?
Closing Comments
Preface
Sports Car Tutorial: Introduction for Many-Core Is Online
Parallelism Pearls: Inspired by Many Cores
Organization
Structured Parallel Programming
What's New?
lotsofcores.com
Section I: Knights Landing
Chapter 1: Introduction
Introduction to Many-Core Programming
Trend: More Parallelism
Why Intel® Xeon Phi™ Processors Are Needed
Processors Versus Coprocessor
Measuring Readiness for Highly Parallel Execution
What About GPUs?
Enjoy the Lack of Porting Needed but Still Tune!
Transformation for Performance
Hyper-Threading Versus Multithreading
Programming Models
Why We Could Skip To Section II Now
For More Information
Chapter 2: Knights Landing overview
Overview
Instruction Set
Architecture Overview
Tile
Mesh: On-Die Interconnect
Cluster modes
MCDRAM (High-Bandwidth Memory) and DDR (DDR4)
I/O (PCIe Gen3)
Motivation: Our Vision and Purpose
Performance
Summary
For More Information
Chapter 3: Programming MCDRAM and Cluster modes
Use Cache Mode and Default Cluster Mode (at First)
Programming for Cluster Modes
Programming for Memory Modes
Memory Usage Models
What Is the memkind Library (and hbwmalloc)?
Maximizing Performance With Memory Usage Models
Critical review for our Hello MCDRAM
NUMACTL -H
Learning NUMA Node Numbering
Ways to Observe MCDRAM Allocations
Numactl: Move All Allocations to MCDRAM
Oversubscription of MCDRAM: A Killer or an Opportunity?
Autohbw: Move Selected Allocations to MCDRAM
Memkind/FASTMEM: Explicit Usage of MCDRAM
Explicit memory usage in C/C++: memkind
C++ notes
Explicit memory usage in Fortran: FASTMEM
Fortran FASTMEM failure modes
ALLOCATE prefers MCDRAM
ALLOCATE requires MCDRAM
Query Memory Mode and MCDRAM Available
SNC Performance Implications of Allocation and Threading
Allocation With SNC
How to Not Hard Code the NUMA Node Numbers
Approaches to Determining What to Put in MCDRAM
Approach 1: Observing or Emulating MCDRAM Effects
Stage 1: Code modification
Stage 2: Manual Execution
Stage 3: Autotuning Configuration (Optional)
Approach 2: Using Intel VTune to Determine MCDRAM Candidate Data Structures
Stage 1: Profiling Data Collection
Stage 2: Profiling Data Analysis
Stage 3: Code Modification
Results Analysis of the Two Approaches
Summary of Two Approaches to "What Goes in MCDRAM"
Why Rebooting Is Required to Change Modes
BIOS
Save/Restore/Change All BIOS Settings
Summary
For More Information
Chapter 4: Knights Landing architecture
Tile architecture
Core and VPU
Front-end unit
Allocation unit
Integer execution unit
Memory execution unit
Vector processing unit
Threading
L2 Architecture
Cluster modes
All-to-All Cluster Mode
Quadrant Cluster Mode
SNC-4 Mode
Hemisphere Cluster and SNC-2 Modes
Cluster Mode Summary
Memory interleaving
Memory modes
Cache Mode
Flat Mode
Hybrid Mode
Capacity, Bandwidth, Latency
Interactions of cluster and memory modes
Summary
For More Information
Chapter 5: Intel Omni-Path Fabric
Overview
Host Fabric Interface
Intel OPA Switches
Intel OPA Management
Performance and Scalability
Extreme Message Rates
Low Latency
Addressing
Multicast
Transport Layer APIs
OFA Open Fabric Interface
Performance-Scaled Messaging
Open Fabrics Verbs and Compatibility
Quality of Service
Service Levels
Traffic Flow Optimization and Packet Interleaving
Credit-Based Flow Control
Security
Partition-Based Security
Management Security
Virtual Fabrics
Unicast Address Resolution
Typical Flow for Well-Behaved Applications
Out of Band Mechanisms
Multicast Address Resolution
Typical Flow for Well-Behaved Applications
Summary
For More Information
Chapter 6: μarch optimization advice
Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3
Hyperthreading: Do Not Turn It Off
Memory subsystem
Caches
MCDRAM and DDR
Advice: Large Pages Can Be Good (2M/1G)
μarch nuances (tile)
Instruction Cache, Decode, and Branch Predictors
Integer
Vector
Memory Accesses and Prefetch Options
Code Examples
Direct mapped MCDRAM cache
Advice: use AVX-512
Advice: Upgrade to AVX-512 From AVX/AVX2 and IMCI
Scalar Versus Vector Code
Instruction Latency Tables
Advice: Use AVX-512 Extensions for Knights Landing
Advice: Use AVX-512ER
IMCI to AVX-512: Reciprocal and Exponentials
Advice: Use AVX-512CD
Advice: Use AVX-512PF
IMCI to AVX-512: Software Prefetching
Advice: Gather and Scatter Instructions Only When Irregular
IMCI to AVX-512: Gathers/Scatters
IMCI to AVX-512: Swizzle Instructions
IMCI to AVX-512: Unaligned Loads/Stores
IMCI to AVX-512: Data Conversion Instructions
IMCI to AVX-512: Nontemporal Stores/Cache Line Evicts
Summary
For more information
Section II: Parallel programming
Chapter 7: Programming overview for Knights Landing
To Refactor, or Not to Refactor, That Is the Question
Evolutionary Optimization of Applications
Revolutionary Optimization of Applications
Know When to Hold'em and When to Fold'em
For More Information
Chapter 8: Tasks and threads
OpenMP, Fortran 2008, Intel TBB, Intel MKL
Importance of Thread Pools
OpenMP
Parallel Processing Model
Directives
Significant Controls Over OpenMP
OpenMP Nesting: Use Hot Teams
Fortran 2008
DO CONCURRENT
DO CONCURRENT and DATA RACES
DO CONCURRENT Definition
DO CONCURRENT Versus FOR ALL
DO CONCURRENT Versus OpenMP "Parallel"
Intel TBB
Why TBB?
Using TBB
parallel_for
blocked_range
Partitioners
parallel_reduce
parallel_invoke
TBB Flow Graph
TBB Memory Allocation, memkind, and MCDRAM
hStreams
Summary
For More Information
Chapter 9: Vectorization
Why Vectorize?
How to Vectorize
Three Approaches to Achieving Vectorization
Six-Step Vectorization Methodology
Step 1. Measure Baseline Release Build Performance
Step 2. Determine Hotspots Using Intel VTune™ Amplifier
Step 3. Determine Loop Candidates Using Intel Compiler Vec-Report
Step 4. Get Advice Using Intel Advisor
Step 5. Implement Vectorization Recommendations
Step 6: Repeat!
Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on
Why Data Layout Affects Vectorization Performance
Data Alignment
Prefetching
Compiler prefetches
Compiler prefetch controls (prefetching via directives/pragmas)
Manual prefetches
Streaming Stores
When streaming stores will be generated for Knights Landing
Nontemporal: compiler generation of nontemporal stores
Compiler Tips
Avoid Manual Loop Unrolling
Requirements for a Loop to Vectorize (Intel Compiler)
Importance of Inlining, Interference With Simple Profiling
Compiler Options
Memory Disambiguation Inside Vector-Loops
Compiler Directives
SIMD Directives
Requirements to vectorize with SIMD directives
SIMD directive clauses
Use SIMD directives with care
The Vector and Novector Directives
Use vector directives with care
The ivdep Directive
ivdep example in Fortran
ivdep examples in C
Random Number Function Vectorization
Data Alignment to Assist Vectorization
Step 1: aligning the data
How to define aligned STATIC arrays
Step 2: inform the compiler of the alignment
How to tell the compiler all memory references are nicely aligned for the target
Use Array Sections to Encourage Vectorization
Fortran Array Sections
Subscript triplets
Vector subscripts
Implications for array copies, efficiency issues
Look at What the Compiler Created: Assembly Code Inspection
How to Find the Assembly Code
Numerical Result Variations With Vectorization
Summary
For More Information
Chapter 10: Vectorization advisor
Getting Started With Intel Advisor for Knights Landing
Enabling and Improving AVX-512 Code With the Survey Report
Preparing Your Application
Running a Survey Analysis With Trip Counts
One-Stop-Shop Performance Overview in the Survey Report
Enabling AVX-512 Speedups Via Recommendations
Fixing ineffective AVX-512 peeled/remainder loop issues
Speedups with approximate reciprocal, reciprocal square root, and exponent/mantissa extraction
Inefficient memory access in assumed-shape array and AVX-512 gather/scatter
Making Expert Users Happy: Knights Landing-Specific Traits and ISA Analysis
Compress/Expand Trait
Gather/Scatter Traits
Conflict(-free) subset detection Trait
Memory Access Pattern Report
AVX-512 Gather/Scatter Profiler
Mask Utilization and FLOPs Profiler
Advisor Roofline Report
Explore AVX-512 Code Characteristics Without AVX-512 Hardware
Example - Analysis of a Computational Chemistry Code
