Web scraping with Python: scrape data from any website with the power of Python

Successfully scrape data from any website with the power of Python. About This Book: A hands-on guide to web scraping with real-life problems and solutions; techniques to download and extract data from complex websites; create a number of different web scrapers to extract information. Who This Book Is Fo...

Bibliographic Details
Other Authors: Lawson, Richard (author)
Format: E-book
Language: English
Published: Birmingham [United Kingdom] : Packt Publishing, 2015.
Edition: 1st edition
Series: Community experience distilled.
View in the Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629845206719
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • About the Author
  • About the Reviewers
  • www.PacktPub.com
  • Table of Contents
  • Preface
  • Chapter 1: Introduction to Web Scraping
  • When is web scraping useful?
  • Is web scraping legal?
  • Background research
  • Checking robots.txt
  • Examining the Sitemap
  • Estimating the size of a website
  • Identifying the technology used by a website
  • Finding the owner of a website
  • Crawling your first website
  • Downloading a web page
  • Retrying downloads
  • Setting a user agent
  • Sitemap crawler
  • ID iteration crawler
  • Link crawler
  • Advanced features
  • Summary
  • Chapter 2: Scraping the Data
  • Analyzing a web page
  • Three approaches to scrape a web page
  • Regular expressions
  • Beautiful Soup
  • Lxml
  • CSS selectors
  • Comparing performance
  • Scraping results
  • Overview
  • Adding a scrape callback to the link crawler
  • Summary
  • Chapter 3: Caching Downloads
  • Adding cache support to the link crawler
  • Disk cache
  • Implementation
  • Testing the cache
  • Saving disk space
  • Expiring stale data
  • Drawbacks
  • Database cache
  • What is NoSQL?
  • Installing MongoDB
  • Overview of MongoDB
  • MongoDB cache implementation
  • Compression
  • Testing the cache
  • Summary
  • Chapter 4: Concurrent Downloading
  • One million web pages
  • Parsing the Alexa list
  • Sequential crawler
  • Threaded crawler
  • How threads and processes work
  • Implementation
  • Cross-process crawler
  • Performance
  • Summary
  • Chapter 5: Dynamic Content
  • An example dynamic web page
  • Reverse engineering a dynamic web page
  • Edge cases
  • Rendering a dynamic web page
  • PyQt or PySide
  • Executing JavaScript
  • Website interaction with WebKit
  • Waiting for results
  • The Render class
  • Selenium
  • Summary
  • Chapter 6: Interacting with Forms
  • The Login form
  • Loading cookies from the web browser
  • Extending the login script to update content
  • Automating forms with the Mechanize module
  • Summary
  • Chapter 7: Solving CAPTCHA
  • Registering an account
  • Loading the CAPTCHA image
  • Optical Character Recognition
  • Further improvements
  • Solving complex CAPTCHAs
  • Using a CAPTCHA solving service
  • Getting started with 9kw
  • 9kw CAPTCHA API
  • Integrating with registration
  • Summary
  • Chapter 8: Scrapy
  • Installation
  • Starting a project
  • Defining a model
  • Creating a spider
  • Tuning settings
  • Testing the spider
  • Scraping with the shell command
  • Checking results
  • Interrupting and resuming a crawl
  • Visual scraping with Portia
  • Installation
  • Annotation
  • Tuning a spider
  • Checking results
  • Automated scraping with Scrapely
  • Summary
  • Chapter 9: Overview
  • Google search engine
  • Facebook
  • The website
  • The API
  • Gap
  • BMW
  • Summary
  • Index.