Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL

Paperback
Author: Michael Schrenk

ISBN-10: 1593271204

ISBN-13: 9781593271206

Category: General & Miscellaneous Software

This text first outlines the deficiencies of browsers, and then explains how these deficiencies can be exploited in the design and deployment of task-specific webbots. Readers will learn how to write stealthy webbots that read email, emulate online forms, auto-authenticate, manage cookies, and handle encryption.

Dedication; ACKNOWLEDGMENTS; Introduction; Old-School Client-Server Technology; The Problem with Browsers; What to Expect from This Book; About the Website; About the Code; Requirements; A Disclaimer (This Is Important)

FUNDAMENTAL CONCEPTS AND TECHNIQUES
Chapter 1: WHAT'S IN IT FOR YOU?; 1.1 Uncovering the Internet's True Potential; 1.2 What's in It for Developers?; 1.3 What's in It for Business Leaders?; 1.4 Final Thoughts
Chapter 2: IDEAS FOR WEBBOT PROJECTS; 2.1 Inspiration from Browser Limitations; 2.2 A Few Crazy Ideas to Get You Started; 2.3 Final Thoughts
Chapter 3: DOWNLOADING WEB PAGES; 3.1 Think About Files, Not Web Pages; 3.2 Downloading Files with PHP's Built-in Functions; 3.3 Introducing PHP/CURL; 3.4 Installing PHP/CURL; 3.5 LIB_http; 3.6 Final Thoughts
Chapter 4: PARSING TECHNIQUES; 4.1 Parsing Poorly Written HTML; 4.2 Standard Parse Routines; 4.3 Using LIB_parse; 4.4 Useful PHP Functions; 4.5 Final Thoughts
Chapter 5: AUTOMATING FORM SUBMISSION; 5.1 Reverse Engineering Form Interfaces; 5.2 Form Handlers, Data Fields, Methods, and Event Triggers; 5.3 Unpredictable Forms; 5.4 Analyzing a Form; 5.5 Final Thoughts
Chapter 6: MANAGING LARGE AMOUNTS OF DATA; 6.1 Organizing Data; 6.2 Making Data Smaller; 6.3 Thumbnailing Images; 6.4 Final Thoughts

PROJECTS
Chapter 7: PRICE-MONITORING WEBBOTS; 7.1 The Target; 7.2 Designing the Parsing Script; 7.3 Initialization and Downloading the Target; 7.4 Further Exploration
Chapter 8: IMAGE-CAPTURING WEBBOTS; 8.1 Example Image-Capturing Webbot; 8.2 Creating the Image-Capturing Webbot; 8.3 Further Exploration; 8.4 Final Thoughts
Chapter 9: LINK-VERIFICATION WEBBOTS; 9.1 Creating the Link-Verification Webbot; 9.2 Running the Webbot; 9.3 Further Exploration
Chapter 10: ANONYMOUS BROWSING WEBBOTS; 10.1 Anonymity with Proxies; 10.2 The Anonymizer Project; 10.3 Final Thoughts
Chapter 11: SEARCH-RANKING WEBBOTS; 11.1 Description of a Search Result Page; 11.2 What the Search-Ranking Webbot Does; 11.3 Running the Search-Ranking Webbot; 11.4 How the Search-Ranking Webbot Works; 11.5 The Search-Ranking Webbot Script; 11.6 Final Thoughts; 11.7 Further Exploration
Chapter 12: AGGREGATION WEBBOTS; 12.1 Choosing Data Sources for Webbots; 12.2 Example Aggregation Webbot; 12.3 Adding Filtering to Your Aggregation Webbot; 12.4 Further Exploration
Chapter 13: FTP WEBBOTS; 13.1 Example FTP Webbot; 13.2 PHP and FTP; 13.3 Further Exploration
Chapter 14: NNTP NEWS WEBBOTS; 14.1 NNTP Use and History; 14.2 Webbots and Newsgroups; 14.3 Further Exploration
Chapter 15: WEBBOTS THAT READ EMAIL; 15.1 The POP3 Protocol; 15.2 Executing POP3 Commands with a Webbot; 15.3 Further Exploration
Chapter 16: WEBBOTS THAT SEND EMAIL; 16.1 Email, Webbots, and Spam; 16.2 Sending Mail with SMTP and PHP; 16.3 Writing a Webbot That Sends Email Notifications; 16.4 Further Exploration
Chapter 17: CONVERTING A WEBSITE INTO A FUNCTION; 17.1 Writing a Function Interface; 17.2 Final Thoughts

ADVANCED TECHNICAL CONSIDERATIONS
Chapter 18: SPIDERS; 18.1 How Spiders Work; 18.2 Example Spider; 18.3 LIB_simple_spider; 18.4 Experimenting with the Spider; 18.5 Adding the Payload; 18.6 Further Exploration
Chapter 19: PROCUREMENT WEBBOTS AND SNIPERS; 19.1 Procurement Webbot Theory; 19.2 Sniper Theory; 19.3 Testing Your Own Webbots and Snipers; 19.4 Further Exploration; 19.5 Final Thoughts
Chapter 20: WEBBOTS AND CRYPTOGRAPHY; 20.1 Designing Webbots That Use Encryption; 20.2 A Quick Overview of Web Encryption; 20.3 Local Certificates; 20.4 Final Thoughts
Chapter 21: AUTHENTICATION; 21.1 What Is Authentication?; 21.2 Example Scripts and Practice Pages; 21.3 Basic Authentication; 21.4 Session Authentication; 21.5 Final Thoughts
Chapter 22: ADVANCED COOKIE MANAGEMENT; 22.1 How Cookies Work; 22.2 PHP/CURL and Cookies; 22.3 How Cookies Challenge Webbot Design; 22.4 Further Exploration
Chapter 23: SCHEDULING WEBBOTS AND SPIDERS; 23.1 The Windows Task Scheduler; 23.2 Complex Schedules; 23.3 Non-Calendar-Based Triggers; 23.4 Final Thoughts

LARGER CONSIDERATIONS
Chapter 24: DESIGNING STEALTHY WEBBOTS AND SPIDERS; 24.1 Why Design a Stealthy Webbot?; 24.2 Stealth Means Simulating Human Patterns; 24.3 Final Thoughts
Chapter 25: WRITING FAULT-TOLERANT WEBBOTS; 25.1 Types of Webbot Fault Tolerance; 25.2 Error Handlers
Chapter 26: DESIGNING WEBBOT-FRIENDLY WEBSITES; 26.1 Optimizing Web Pages for Search Engine Spiders; 26.2 Web Design Techniques That Hinder Search Engine Spiders; 26.3 Designing Data-Only Interfaces
Chapter 27: KILLING SPIDERS; 27.1 Asking Nicely; 27.2 Building Speed Bumps; 27.3 Setting Traps; 27.4 Final Thoughts
Chapter 28: KEEPING WEBBOTS OUT OF TROUBLE; 28.1 It's All About Respect; 28.2 Copyright; 28.3 Trespass to Chattels; 28.4 Internet Law; 28.5 Final Thoughts

PHP/CURL REFERENCE; Creating a Minimal PHP/CURL Session; Initiating PHP/CURL Sessions; Setting PHP/CURL Options; Executing the PHP/CURL Command; Closing PHP/CURL Sessions
STATUS CODES; HTTP Codes; NNTP Codes
SMS EMAIL ADDRESSES
Colophon

Michael Schrenk uses webbots and data-driven web applications to create competitive advantages for businesses. He has written for Computerworld and Web Techniques magazines and has taught courses on Web usability and Internet marketing. He has also given presentations on intelligent Web agents and online corporate intelligence at the DEFCON hacker's convention.