Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition
Intel Xeon Phi Processor High Performance Programming is an all-in-one source of information for programming the second-generation Intel Xeon Phi product family, also called Knights Landing. The authors provide detailed and timely Knights Landing-specific details, programming advice, and real-world ex...
Other Authors: | , , |
---|---|
Format: | Electronic book |
Language: | English |
Published: | Amsterdam, [Netherlands] : Morgan Kaufmann, 2016 |
Edition: | Knights Landing edition |
Subjects: | |
View at Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630312306719 |
Table of Contents:
- Front Cover
- Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition
- Copyright
- Contents
- Acknowledgments
- Foreword
- Extending the Sports Car Analogy to Higher Performance
- What Exactly Is The Unfair Advantage?
- Peak Performance Versus Drivable/Usable Performance
- How Does The Unfair Advantage Relate to This Book?
- Closing Comments
- Preface
- Sports Car Tutorial: Introduction for Many-Core Is Online
- Parallelism Pearls: Inspired by Many Cores
- Organization
- Structured Parallel Programming
- What's New?
- lotsofcores.com
- Section I: Knights Landing
- Chapter 1: Introduction
- Introduction to Many-Core Programming
- Trend: More Parallelism
- Why Intel® Xeon Phi™ Processors Are Needed
- Processors Versus Coprocessor
- Measuring Readiness for Highly Parallel Execution
- What About GPUs?
- Enjoy the Lack of Porting Needed but Still Tune!
- Transformation for Performance
- Hyper-Threading Versus Multithreading
- Programming Models
- Why We Could Skip To Section II Now
- For More Information
- Chapter 2: Knights Landing overview
- Overview
- Instruction Set
- Architecture Overview
- Tile
- Mesh: On-Die Interconnect
- Cluster modes
- MCDRAM (High-Bandwidth Memory) and DDR (DDR4)
- I/O (PCIe Gen3)
- Motivation: Our Vision and Purpose
- Performance
- Summary
- For More Information
- Chapter 3: Programming MCDRAM and Cluster modes
- Use Cache Mode and Default Cluster Mode (at First)
- Programming for Cluster Modes
- Programming for Memory Modes
- Memory Usage Models
- What Is the memkind Library (and hbwmalloc)?
- Maximizing Performance With Memory Usage Models
- Critical review for our Hello MCDRAM
- NUMACTL -H
- Learning NUMA Node Numbering
- Ways to Observe MCDRAM Allocations
- Numactl: Move All Allocations to MCDRAM
- Oversubscription of MCDRAM: A Killer or an Opportunity?
- Autohbw: Move Selected Allocations to MCDRAM
- Memkind/FASTMEM: Explicit Usage of MCDRAM
- Explicit memory usage in C/C++: memkind
- C++ notes
- Explicit memory usage in Fortran: FASTMEM
- Fortran FASTMEM failure modes
- ALLOCATE prefers MCDRAM
- ALLOCATE requires MCDRAM
- Query Memory Mode and MCDRAM Available
- SNC Performance Implications of Allocation and Threading
- Allocation With SNC
- How to Not Hard Code the NUMA Node Numbers
- Approaches to Determining What to Put in MCDRAM
- Approach 1: Observing or Emulating MCDRAM Effects
- Stage 1: Code modification
- Stage 2: Manual Execution
- Stage 3: Autotuning Configuration (Optional)
- Approach 2: Using Intel VTune to Determine MCDRAM Candidate Data Structures
- Stage 1: Profiling Data Collection
- Stage 2: Profiling Data Analysis
- Stage 3: Code Modification
- Results Analysis of the Two Approaches
- Summary of Two Approaches to "What Goes in MCDRAM"
- Why Rebooting Is Required to Change Modes
- BIOS
- Save/Restore/Change All BIOS Settings
- Summary
- For More Information
- Chapter 4: Knights Landing architecture
- Tile architecture
- Core and VPU
- Front-end unit
- Allocation unit
- Integer execution unit
- Memory execution unit
- Vector processing unit
- Threading
- L2 Architecture
- Cluster modes
- All-to-All Cluster Mode
- Quadrant Cluster Mode
- SNC-4 Mode
- Hemisphere Cluster and SNC-2 Modes
- Cluster Mode Summary
- Memory interleaving
- Memory modes
- Cache Mode
- Flat Mode
- Hybrid Mode
- Capacity, Bandwidth, Latency
- Interactions of cluster and memory modes
- Summary
- For More Information
- Chapter 5: Intel Omni-Path Fabric
- Overview
- Host Fabric Interface
- Intel OPA Switches
- Intel OPA Management
- Performance and Scalability
- Extreme Message Rates
- Low Latency
- Addressing
- Multicast
- Transport Layer APIs
- OFA Open Fabric Interface
- Performance-Scaled Messaging
- Open Fabrics Verbs and Compatibility
- Quality of Service
- Service Levels
- Traffic Flow Optimization and Packet Interleaving
- Credit-Based Flow Control
- Security
- Partition-Based Security
- Management Security
- Virtual Fabrics
- Unicast Address Resolution
- Typical Flow for Well-Behaved Applications
- Out of Band Mechanisms
- Multicast Address Resolution
- Typical Flow for Well-Behaved Applications
- Summary
- For More Information
- Chapter 6: μarch optimization advice
- Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3
- Hyperthreading: Do Not Turn It Off
- Memory subsystem
- Caches
- MCDRAM and DDR
- Advice: Large Pages Can Be Good (2M/1G)
- μarch nuances (tile)
- Instruction Cache, Decode, and Branch Predictors
- Integer
- Vector
- Memory Accesses and Prefetch Options
- Code Examples
- Direct mapped MCDRAM cache
- Advice: use AVX-512
- Advice: Upgrade to AVX-512 From AVX/AVX2 and IMCI
- Scalar Versus Vector Code
- Instruction Latency Tables
- Advice: Use AVX-512 Extensions for Knights Landing
- Advice: Use AVX-512ER
- IMCI to AVX-512: Reciprocal and Exponentials
- Advice: Use AVX-512CD
- Advice: Use AVX-512PF
- IMCI to AVX-512: Software Prefetching
- Advice: Gather and Scatter Instructions Only When Irregular
- IMCI to AVX-512: Gathers/Scatters
- IMCI to AVX-512: Swizzle Instructions
- IMCI to AVX-512: Unaligned Loads/Stores
- IMCI to AVX-512: Data Conversion Instructions
- IMCI to AVX-512: Nontemporal Stores/Cache Line Evicts
- Summary
- For more information
- Section II: Parallel programming
- Chapter 7: Programming overview for Knights Landing
- To Refactor, or Not to Refactor, That Is the Question
- Evolutionary Optimization of Applications
- Revolutionary Optimization of Applications
- Know When to Hold'em and When to Fold'em
- For More Information
- Chapter 8: Tasks and threads
- OpenMP, Fortran 2008, Intel TBB, Intel MKL
- Importance of Thread Pools
- OpenMP
- Parallel Processing Model
- Directives
- Significant Controls Over OpenMP
- OpenMP Nesting: Use Hot Teams
- Fortran 2008
- DO CONCURRENT
- DO CONCURRENT and DATA RACES
- DO CONCURRENT Definition
- DO CONCURRENT Versus FORALL
- DO CONCURRENT Versus OpenMP "Parallel"
- Intel TBB
- Why TBB?
- Using TBB
- parallel_for
- blocked_range
- Partitioners
- parallel_reduce
- parallel_invoke
- TBB Flow Graph
- TBB Memory Allocation, memkind, and MCDRAM
- hStreams
- Summary
- For More Information
- Chapter 9: Vectorization
- Why Vectorize?
- How to Vectorize
- Three Approaches to Achieving Vectorization
- Six-Step Vectorization Methodology
- Step 1. Measure Baseline Release Build Performance
- Step 2. Determine Hotspots Using Intel VTune™ Amplifier
- Step 3. Determine Loop Candidates Using Intel Compiler Vec-Report
- Step 4. Get Advice Using Intel Advisor
- Step 5. Implement Vectorization Recommendations
- Step 6: Repeat!
- Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on
- Why Data Layout Affects Vectorization Performance
- Data Alignment
- Prefetching
- Compiler prefetches
- Compiler prefetch controls (prefetching via directives/pragmas)
- Manual prefetches
- Streaming Stores
- When streaming stores will be generated for Knights Landing
- Nontemporal: compiler generation of nontemporal stores
- Compiler Tips
- Avoid Manual Loop Unrolling
- Requirements for a Loop to Vectorize (Intel Compiler)
- Importance of Inlining, Interference With Simple Profiling
- Compiler Options
- Memory Disambiguation Inside Vector-Loops
- Compiler Directives
- SIMD Directives
- Requirements to vectorize with SIMD directives
- SIMD directive clauses
- Use SIMD directives with care
- The Vector and Novector Directives
- Use vector directives with care
- The ivdep Directive
- ivdep example in Fortran
- ivdep examples in C
- Random Number Function Vectorization
- Data Alignment to Assist Vectorization
- Step 1: aligning the data
- How to define aligned STATIC arrays
- Step 2: inform the compiler of the alignment
- How to tell the compiler all memory references are nicely aligned for the target
- Use Array Sections to Encourage Vectorization
- Fortran Array Sections
- Subscript triplets
- Vector subscripts
- Implications for array copies, efficiency issues
- Look at What the Compiler Created: Assembly Code Inspection
- How to Find the Assembly Code
- Numerical Result Variations With Vectorization
- Summary
- For More Information
- Chapter 10: Vectorization advisor
- Getting Started With Intel Advisor for Knights Landing
- Enabling and Improving AVX-512 Code With the Survey Report
- Preparing Your Application
- Running a Survey Analysis With Trip Counts
- One-Stop-Shop Performance Overview in the Survey Report
- Enabling AVX-512 Speedups Via Recommendations
- Fixing ineffective AVX-512 peeled/remainder loop issues
- Speedups with approximate reciprocal, reciprocal square root, and exponent/mantissa extraction
- Inefficient memory access in assumed-shape array and AVX-512 gather/scatter
- Making Expert Users Happy: Knights Landing-Specific Traits and ISA Analysis
- Compress/Expand Trait
- Gather/Scatter Traits
- Conflict(-free) subset detection Trait
- Memory Access Pattern Report
- AVX-512 Gather/Scatter Profiler
- Mask Utilization and FLOPs Profiler
- Advisor Roofline Report
- Explore AVX-512 Code Characteristics Without AVX-512 Hardware
- Example: Analysis of a Computational Chemistry Code