Programming Pig

This guide is an ideal learning tool and reference for Apache Pig, the open source engine for executing parallel data flows on Hadoop. With Pig, you can batch-process data without having to create a full-fledged application, which makes it easy to experiment with new datasets. Programming Pig in...

Bibliographic Details
Main Author: Gates, Alan
Other Authors: Loukides, Michael Kosta (illustrator), Blanchette, Meghan, Romano, Robert (illustrator)
Format: E-book
Language: English
Published: Sebastopol, CA : O'Reilly, 2011.
Edition: First edition
Subjects:
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009628000606719
Table of Contents:
  • Table of Contents; Preface; Data Addiction; Who Should Read This Book; Conventions Used in This Book; Code Examples in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments; Chapter 1. Introduction; What Is Pig?; Pig on Hadoop; MapReduce's hello world; Pig Latin, a Parallel Dataflow Language; Comparing query and dataflow languages; How Pig differs from MapReduce; What Is Pig Useful For?; Pig Philosophy; Pig's History; Chapter 2. Installing and Running Pig; Downloading and Installing Pig; Downloading the Pig Package from Apache; Downloading Pig from Cloudera
  • Downloading Pig Artifacts from Maven; Downloading the Source; Running Pig; Running Pig Locally on Your Machine; Running Pig on Your Hadoop Cluster; Running Pig in the Cloud; Command-Line and Configuration Options; Return Codes; Chapter 3. Grunt; Entering Pig Latin Scripts in Grunt; HDFS Commands in Grunt; Controlling Pig from Grunt; Chapter 4. Pig's Data Model; Types; Scalar Types; Complex Types; Map; Tuple; Bag; Nulls; Schemas; Casts; Chapter 5. Introduction to Pig Latin; Preliminary Matters; Case Sensitivity; Comments; Input and Output; Load; Store; Dump; Relational Operations; foreach
  • Expressions in foreach; UDFs in foreach; Naming fields in foreach; Filter; Group; Order by; Distinct; Join; Limit; Sample; Parallel; User Defined Functions; Registering UDFs; Registering Python UDFs; define and UDFs; Calling Static Java Functions; Chapter 6. Advanced Pig Latin; Advanced Relational Operations; Advanced Features of foreach; flatten; Nested foreach; Using Different Join Implementations; Joining small to large data; Joining skewed data; Joining sorted data; cogroup; union; cross; Integrating Pig with Legacy Code and MapReduce; stream; mapreduce; Nonlinear Data Flows
  • Controlling Execution; set; Setting the Partitioner; Pig Latin Preprocessor; Parameter Substitution; Macros; Including Other Pig Latin Scripts; Chapter 7. Developing and Testing Pig Latin Scripts; Development Tools; Syntax Highlighting and Checking; describe; explain; illustrate; Pig Statistics; MapReduce Job Status; Debugging Tips; Testing Your Scripts with PigUnit; Chapter 8. Making Pig Fly; Writing Your Scripts to Perform Well; Filter Early and Often; Project Early and Often; Set Up Your Joins Properly; Use Multiquery When Possible; Choose the Right Data Type
  • Select the Right Level of Parallelism; Writing Your UDF to Perform; Tune Pig and Hadoop for Your Job; Using Compression in Intermediate Results; Data Layout Optimization; Bad Record Handling; Chapter 9. Embedding Pig Latin in Python; Compile; Bind; Binding Multiple Sets of Variables; Run; Running Multiple Bindings; Utility Methods; Chapter 10. Writing Evaluation and Filter Functions; Writing an Evaluation Function in Java; Where Your UDF Will Run; Evaluation Function Basics; Interacting with Pig values; Input and Output Schemas; Error Handling and Progress Reporting
  • Constructors and Passing Data from Frontend to Backend