Intel Xeon Phi processor high performance programming: Knights Landing edition

Intel Xeon Phi Processor High Performance Programming is an all-in-one source of information for programming the Second-Generation Intel Xeon Phi product family, also called Knights Landing. The authors provide detailed and timely Knights Landing-specific details, programming advice, and real-world ex...

Bibliographic Details
Other Authors: Jeffers, Jim, author; Reinders, James, author; Sodani, Avinash, author
Format: Electronic book
Language: English
Published: Amsterdam, [Netherlands] : Morgan Kaufmann, 2016.
Edition: Knights Landing edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630312306719
Table of Contents:
  • Front Cover
  • Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition
  • Copyright
  • Contents
  • Acknowledgments
  • Foreword
  • Extending the Sports Car Analogy to Higher Performance
  • What Exactly Is The Unfair Advantage?
  • Peak Performance Versus Drivable/Usable Performance
  • How Does The Unfair Advantage Relate to This Book?
  • Closing Comments
  • Preface
  • Sports Car Tutorial: Introduction for Many-Core Is Online
  • Parallelism Pearls: Inspired by Many Cores
  • Organization
  • Structured Parallel Programming
  • What's New?
  • lotsofcores.com
  • Section I: Knights Landing
  • Chapter 1: Introduction
  • Introduction to Many-Core Programming
  • Trend: More Parallelism
  • Why Intel® Xeon Phi™ Processors Are Needed
  • Processors Versus Coprocessor
  • Measuring Readiness for Highly Parallel Execution
  • What About GPUs?
  • Enjoy the Lack of Porting Needed but Still Tune!
  • Transformation for Performance
  • Hyper-Threading Versus Multithreading
  • Programming Models
  • Why We Could Skip To Section II Now
  • For More Information
  • Chapter 2: Knights Landing overview
  • Overview
  • Instruction Set
  • Architecture Overview
  • Tile
  • Mesh: On-Die Interconnect
  • Cluster modes
  • MCDRAM (High-Bandwidth Memory) and DDR (DDR4)
  • I/O (PCIe Gen3)
  • Motivation: Our Vision and Purpose
  • Performance
  • Summary
  • For More Information
  • Chapter 3: Programming MCDRAM and Cluster modes
  • Use Cache Mode and Default Cluster Mode (at First)
  • Programming for Cluster Modes
  • Programming for Memory Modes
  • Memory Usage Models
  • What Is the memkind Library (and hbwmalloc)?
  • Maximizing Performance With Memory Usage Models
  • Critical review for our Hello MCDRAM
  • NUMACTL -H
  • Learning NUMA Node Numbering
  • Ways to Observe MCDRAM Allocations
  • Numactl: Move All Allocations to MCDRAM
  • Oversubscription of MCDRAM: A Killer or an Opportunity?
  • Autohbw: Move Selected Allocations to MCDRAM
  • Memkind/FASTMEM: Explicit Usage of MCDRAM
  • Explicit memory usage in C/C++: memkind
  • C++ notes
  • Explicit memory usage in Fortran: FASTMEM
  • Fortran FASTMEM failure modes
  • ALLOCATE prefers MCDRAM
  • ALLOCATE requires MCDRAM
  • Query Memory Mode and MCDRAM Available
  • SNC Performance Implications of Allocation and Threading
  • Allocation With SNC
  • How to Not Hard Code the NUMA Node Numbers
  • Approaches to Determining What to Put in MCDRAM
  • Approach 1: Observing or Emulating MCDRAM Effects
  • Stage 1: Code modification
  • Stage 2: Manual Execution
  • Stage 3: Autotuning Configuration (Optional)
  • Approach 2: Using Intel VTune to Determine MCDRAM Candidate Data Structures
  • Stage 1: Profiling Data Collection
  • Stage 2: Profiling Data Analysis
  • Stage 3: Code Modification
  • Results Analysis of the Two Approaches
  • Summary of Two Approaches to "What Goes in MCDRAM"
  • Why Rebooting Is Required to Change Modes
  • BIOS
  • Save/Restore/Change All BIOS Settings
  • Summary
  • For More Information
  • Chapter 4: Knights Landing architecture
  • Tile architecture
  • Core and VPU
  • Front-end unit
  • Allocation unit
  • Integer execution unit
  • Memory execution unit
  • Vector processing unit
  • Threading
  • L2 Architecture
  • Cluster modes
  • All-to-All Cluster Mode
  • Quadrant Cluster Mode
  • SNC-4 Mode
  • Hemisphere Cluster and SNC-2 Modes
  • Cluster Mode Summary
  • Memory interleaving
  • Memory modes
  • Cache Mode
  • Flat Mode
  • Hybrid Mode
  • Capacity, Bandwidth, Latency
  • Interactions of cluster and memory modes
  • Summary
  • For More Information
  • Chapter 5: Intel Omni-Path Fabric
  • Overview
  • Host Fabric Interface
  • Intel OPA Switches
  • Intel OPA Management
  • Performance and Scalability
  • Extreme Message Rates
  • Low Latency
  • Addressing
  • Multicast
  • Transport Layer APIs
  • OFA Open Fabric Interface
  • Performance-Scaled Messaging
  • Open Fabrics Verbs and Compatibility
  • Quality of Service
  • Service Levels
  • Traffic Flow Optimization and Packet Interleaving
  • Credit-Based Flow Control
  • Security
  • Partition-Based Security
  • Management Security
  • Virtual Fabrics
  • Unicast Address Resolution
  • Typical Flow for Well-Behaved Applications
  • Out of Band Mechanisms
  • Multicast Address Resolution
  • Typical Flow for Well-Behaved Applications
  • Summary
  • For More Information
  • Chapter 6: μarch optimization advice
  • Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3
  • Hyperthreading: Do Not Turn It Off
  • Memory subsystem
  • Caches
  • MCDRAM and DDR
  • Advice: Large Pages Can Be Good (2M/1G)
  • μarch nuances (tile)
  • Instruction Cache, Decode, and Branch Predictors
  • Integer
  • Vector
  • Memory Accesses and Prefetch Options
  • Code Examples
  • Direct mapped MCDRAM cache
  • Advice: use AVX-512
  • Advice: Upgrade to AVX-512 From AVX/AVX2 and IMCI
  • Scalar Versus Vector Code
  • Instruction Latency Tables
  • Advice: Use AVX-512 Extensions for Knights Landing
  • Advice: Use AVX-512ER
  • IMCI to AVX-512: Reciprocal and Exponentials
  • Advice: Use AVX-512CD
  • Advice: Use AVX-512PF
  • IMCI to AVX-512: Software Prefetching
  • Advice: Gather and Scatter Instructions Only When Irregular
  • IMCI to AVX-512: Gathers/Scatters
  • IMCI to AVX-512: Swizzle Instructions
  • IMCI to AVX-512: Unaligned Loads/Stores
  • IMCI to AVX-512: Data Conversion Instructions
  • IMCI to AVX-512: Nontemporal Stores/Cache Line Evicts
  • Summary
  • For more information
  • Section II: Parallel programming
  • Chapter 7: Programming overview for Knights Landing
  • To Refactor, or Not to Refactor, That Is the Question
  • Evolutionary Optimization of Applications
  • Revolutionary Optimization of Applications
  • Know When to Hold 'em and When to Fold 'em
  • For More Information
  • Chapter 8: Tasks and threads
  • OpenMP, Fortran 2008, Intel TBB, Intel MKL
  • Importance of Thread Pools
  • OpenMP
  • Parallel Processing Model
  • Directives
  • Significant Controls Over OpenMP
  • OpenMP Nesting: Use Hot Teams
  • Fortran 2008
  • DO CONCURRENT
  • DO CONCURRENT and DATA RACES
  • DO CONCURRENT Definition
  • DO CONCURRENT Versus FOR ALL
  • DO CONCURRENT Versus OpenMP "Parallel"
  • Intel TBB
  • Why TBB?
  • Using TBB
  • parallel_for
  • blocked_range
  • Partitioners
  • parallel_reduce
  • parallel_invoke
  • TBB Flow Graph
  • TBB Memory Allocation, memkind, and MCDRAM
  • hStreams
  • Summary
  • For More Information
  • Chapter 9: Vectorization
  • Why Vectorize?
  • How to Vectorize
  • Three Approaches to Achieving Vectorization
  • Six-Step Vectorization Methodology
  • Step 1. Measure Baseline Release Build Performance
  • Step 2. Determine Hotspots Using Intel VTune™ Amplifier
  • Step 3. Determine Loop Candidates Using Intel Compiler Vec-Report
  • Step 4. Get Advice Using Intel Advisor
  • Step 5. Implement Vectorization Recommendations
  • Step 6: Repeat!
  • Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on
  • Why Data Layout Affects Vectorization Performance
  • Data Alignment
  • Prefetching
  • Compiler prefetches
  • Compiler prefetch controls (prefetching via directives/pragmas)
  • Manual prefetches
  • Streaming Stores
  • When streaming stores will be generated for Knights Landing
  • Nontemporal: compiler generation of nontemporal stores
  • Compiler Tips
  • Avoid Manual Loop Unrolling
  • Requirements for a Loop to Vectorize (Intel Compiler)
  • Importance of Inlining, Interference With Simple Profiling
  • Compiler Options
  • Memory Disambiguation Inside Vector-Loops
  • Compiler Directives
  • SIMD Directives
  • Requirements to vectorize with SIMD directives
  • SIMD directive clauses
  • Use SIMD directives with care
  • The Vector and Novector Directives
  • Use vector directives with care
  • The ivdep Directive
  • ivdep example in Fortran
  • ivdep examples in C
  • Random Number Function Vectorization
  • Data Alignment to Assist Vectorization
  • Step 1: aligning the data
  • How to define aligned STATIC arrays
  • Step 2: inform the compiler of the alignment
  • How to tell the compiler all memory references are nicely aligned for the target
  • Use Array Sections to Encourage Vectorization
  • Fortran Array Sections
  • Subscript triplets
  • Vector subscripts
  • Implications for array copies, efficiency issues
  • Look at What the Compiler Created: Assembly Code Inspection
  • How to Find the Assembly Code
  • Numerical Result Variations With Vectorization
  • Summary
  • For More Information
  • Chapter 10: Vectorization advisor
  • Getting Started With Intel Advisor for Knights Landing
  • Enabling and Improving AVX-512 Code With the Survey Report
  • Preparing Your Application
  • Running a Survey Analysis With Trip Counts
  • One-Stop-Shop Performance Overview in the Survey Report
  • Enabling AVX-512 Speedups Via Recommendations
  • Fixing ineffective AVX-512 peeled/remainder loop issues
  • Speedups with approximate reciprocal, reciprocal square root, and exponent/mantissa extraction
  • Inefficient memory access in assumed-shape array and AVX-512 gather/scatter
  • Making Expert Users Happy: Knights Landing-Specific Traits and ISA Analysis
  • Compress/Expand Trait
  • Gather/Scatter Traits
  • Conflict(-free) subset detection Trait
  • Memory Access Pattern Report
  • AVX-512 Gather/Scatter Profiler
  • Mask Utilization and FLOPs Profiler
  • Advisor Roofline Report
  • Explore AVX-512 Code Characteristics Without AVX-512 Hardware
  • Example: Analysis of a Computational Chemistry Code