Python web scraping: fetching data from the web

Successfully scrape data from any website with the power of Python 3.x. About This Book: A hands-on guide to web scraping using Python with solutions to real-world problems. Create a number of different web scrapers in Python to extract information. This book includes practical examples on using the pop...

Bibliographic Details
Other Authors: Jarmul, Katharine (author); Lawson, Richard (author)
Format: E-book
Language: English
Published: Birmingham : Packt 2017.
Edition: Second edition
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630208406719
Table of Contents:
  • Cover
  • Credits
  • Copyright
  • About the Authors
  • About the Reviewers
  • www.PacktPub.com
  • Customer Feedback
  • Table of Contents
  • Preface
  • Chapter 1: Introduction to Web Scraping
  • When is web scraping useful?
  • Is web scraping legal?
  • Python 3
  • Background research
  • Checking robots.txt
  • Examining the Sitemap
  • Estimating the size of a website
  • Identifying the technology used by a website
  • Finding the owner of a website
  • Crawling your first website
  • Scraping versus crawling
  • Downloading a web page
  • Retrying downloads
  • Setting a user agent
  • Sitemap crawler
  • ID iteration crawler
  • Link crawlers
  • Advanced features
  • Parsing robots.txt
  • Supporting proxies
  • Throttling downloads
  • Avoiding spider traps
  • Final version
  • Using the requests library
  • Summary
  • Chapter 2: Scraping the Data
  • Analyzing a web page
  • Three approaches to scrape a web page
  • Regular expressions
  • Beautiful Soup
  • Lxml
  • CSS selectors and your Browser Console
  • XPath Selectors
  • LXML and Family Trees
  • Comparing performance
  • Scraping results
  • Overview of Scraping
  • Adding a scrape callback to the link crawler
  • Summary
  • Chapter 3: Caching Downloads
  • When to use caching?
  • Adding cache support to the link crawler
  • Disk Cache
  • Implementing DiskCache
  • Testing the cache
  • Saving disk space
  • Expiring stale data
  • Drawbacks of DiskCache
  • Key-value storage cache
  • What is key-value storage?
  • Installing Redis
  • Overview of Redis
  • Redis cache implementation
  • Compression
  • Testing the cache
  • Exploring requests-cache
  • Summary
  • Chapter 4: Concurrent Downloading
  • One million web pages
  • Parsing the Alexa list
  • Sequential crawler
  • Threaded crawler
  • How threads and processes work
  • Implementing a multithreaded crawler
  • Multiprocessing crawler
  • Performance
  • Python multiprocessing and the GIL
  • Summary
  • Chapter 5: Dynamic Content
  • An example dynamic web page
  • Reverse engineering a dynamic web page
  • Edge cases
  • Rendering a dynamic web page
  • PyQt or PySide
  • Debugging with Qt
  • Executing JavaScript
  • Website interaction with WebKit
  • Waiting for results
  • The Render class
  • Selenium
  • Selenium and Headless Browsers
  • Summary
  • Chapter 6: Interacting with Forms
  • The Login form
  • Loading cookies from the web browser
  • Extending the login script to update content
  • Automating forms with Selenium
  • Summary
  • Chapter 7: Solving CAPTCHA
  • Registering an account
  • Loading the CAPTCHA image
  • Optical character recognition
  • Further improvements
  • Solving complex CAPTCHAs
  • Using a CAPTCHA solving service
  • Getting started with 9kw
  • The 9kw CAPTCHA API
  • Reporting errors
  • Integrating with registration
  • CAPTCHAs and machine learning
  • Summary
  • Chapter 8: Scrapy
  • Installing Scrapy
  • Starting a project
  • Defining a model
  • Creating a spider
  • Tuning settings
  • Testing the spider
  • Different Spider Types
  • Scraping with the shell command
  • Checking results
  • Interrupting and resuming a crawl
  • Scrapy Performance Tuning
  • Visual scraping with Portia
  • Installation
  • Annotation
  • Running the Spider
  • Checking results
  • Automated scraping with Scrapely
  • Summary
  • Chapter 9: Putting It All Together
  • Google search engine
  • Facebook
  • The website
  • Facebook API
  • Gap
  • BMW
  • Summary
  • Index