Introduction: Why Choose Java for Web Crawler Development?

In today's data-driven era, web crawlers have become an essential tool for collecting data from the internet. Java, as a mature, stable, and powerful programming language, has distinct advantages for crawler development. First, Java has a rich ecosystem, including excellent networking and HTML-parsing libraries such as Jsoup, HttpClient, and Selenium. Second, Java's threading model makes large-scale concurrent crawling straightforward and efficient. Most importantly, Java's cross-platform nature and robust exception handling let crawler programs run reliably in messy network environments.

Compared with Python, Java crawlers tend to perform better on large-scale data collection, especially in long-running enterprise applications, where Java's memory management and garbage collection help avoid memory leaks. In addition, Java's type safety catches many potential errors at compile time, improving code quality.

Chapter 1: Setting Up the Java Crawler Environment

1.1 Preparing the Development Environment

Before starting Java crawler development, prepare the following:

JDK installation and configuration: Install JDK 17 or later (several examples in this tutorial use text blocks, which require at least JDK 15). Verify the installation with:

java -version
javac -version

IDE choice: IntelliJ IDEA or Eclipse are both recommended. IntelliJ IDEA has stronger support for Java web development, especially for debugging HTTP requests.

Build tool: Use Maven or Gradle to manage project dependencies. This tutorial uses Maven; add the necessary dependencies to pom.xml:

<dependencies>
    <!-- Jsoup - HTML parsing library -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.15.4</version>
    </dependency>
    
    <!-- HttpClient - HTTP client -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.14</version>
    </dependency>
    
    <!-- Gson - JSON library -->
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.10.1</version>
    </dependency>
    
    <!-- Selenium - browser automation -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.11.0</version>
    </dependency>
</dependencies>

1.2 How Web Crawlers Work

A crawler's basic workflow involves five components (a minimal sketch wiring them together follows this list):

  1. URL manager: maintains the sets of pending and visited URLs
  2. Downloader: sends HTTP requests and fetches page content
  3. Parser: extracts target data from the HTML
  4. Data processor: cleans and stores the extracted data
  5. Scheduler: coordinates the components and paces the crawl
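
To make the division of labor concrete, here is a minimal single-threaded sketch that wires the five components together. This is a sketch, not a production crawler: it uses Jsoup (introduced in Chapter 3) as the downloader, and the start URL and the 10-page cap are placeholder values.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class CrawlerSkeleton {
    public static void main(String[] args) throws Exception {
        Queue<String> pending = new ArrayDeque<>();   // URL manager: to-crawl queue
        Set<String> visited = new HashSet<>();        // URL manager: visited set
        pending.add("https://example.com");

        while (!pending.isEmpty() && visited.size() < 10) { // scheduler: simple page cap
            String url = pending.poll();
            if (!visited.add(url)) continue;

            // Downloader: Jsoup issues the HTTP request and parses the response
            Document doc = Jsoup.connect(url).timeout(5000).get();

            // Parser: extract data and new links
            System.out.println(doc.title());          // data processor would clean/store this
            for (Element link : doc.select("a[href]")) {
                String next = link.attr("abs:href");
                if (next.startsWith("http") && !visited.contains(next)) {
                    pending.add(next);
                }
            }
            Thread.sleep(1000);                       // scheduler: crawl politely
        }
    }
}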

Chapter 2: Sending Network Requests in Java

2.1 Sending Requests with HttpURLConnection

HttpURLConnection is the HTTP client built into the Java standard library; it needs no extra dependencies:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SimpleHttpClient {
    public static String sendGetRequest(String targetURL) {
        HttpURLConnection connection = null;
        try {
            URL url = new URL(targetURL);
            connection = (HttpURLConnection) url.openConnection();
            
            // Set the request method
            connection.setRequestMethod("GET");
            
            // Set request headers
            connection.setRequestProperty("User-Agent", 
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
            connection.setRequestProperty("Accept", 
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            
            // Set connect and read timeouts
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            
            // Check the response code
            int responseCode = connection.getResponseCode();
            System.out.println("Response Code: " + responseCode);
            
            if (responseCode == HttpURLConnection.HTTP_OK) {
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
                String inputLine;
                StringBuilder response = new StringBuilder();
                
                while ((inputLine = in.readLine()) != null) {
                    response.append(inputLine);
                }
                in.close();
                return response.toString();
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
        return null;
    }
    
    public static void main(String[] args) {
        String html = sendGetRequest("https://example.com");
        System.out.println(html);
    }
}

2.2 Sending Requests with Apache HttpClient

Apache HttpClient offers more powerful features, including connection pooling and cookie management (a pooled-client sketch follows this example):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.nio.charset.StandardCharsets;

public class HttpClientExample {
    public static String sendGetWithHttpClient(String targetURL) {
        // Create an HttpClient instance
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Build the GET request
            HttpGet httpGet = new HttpGet(targetURL);
            
            // Set request headers
            httpGet.setHeader("User-Agent", 
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
            httpGet.setHeader("Accept", 
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            httpGet.setHeader("Accept-Language", "zh-CN,zh;q=0.9");
            
            // Execute the request
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // Read the response entity
                String html = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
                return html;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
    
    public static void main(String[] args) {
        String html = sendGetWithHttpClient("https://example.com");
        if (html != null) {
            JsoupParser.parseHtml(html);
        }
    }
}
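
The text above mentions connection pooling, but the example does not show it. Here is a sketch of a pooled client using PoolingHttpClientConnectionManager and RequestConfig; the pool sizes and timeout values are illustrative, not recommendations:

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledHttpClientFactory {
    public static CloseableHttpClient createPooledClient() {
        // Connection pool: reuse TCP connections across requests
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(100);            // total connections across all routes
        cm.setDefaultMaxPerRoute(10);   // connections per target host

        // Default timeouts applied to every request from this client
        RequestConfig config = RequestConfig.custom()
            .setConnectTimeout(5000)
            .setSocketTimeout(5000)
            .setConnectionRequestTimeout(2000) // max wait for a pooled connection
            .build();

        return HttpClients.custom()
            .setConnectionManager(cm)
            .setDefaultRequestConfig(config)
            .build();
    }
}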

2.3 Handling HTTPS and SSL Certificate Issues

When crawling HTTPS sites, you may run into SSL certificate validation failures (for example, with self-signed certificates):

import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;
import javax.net.ssl.SSLContext;

public class SSLHttpClient {
    public static CloseableHttpClient createSSLClient() {
        try {
            // Build an SSL context that trusts every certificate (insecure; test use only)
            SSLContext sslContext = SSLContexts.custom()
                .loadTrustMaterial((chain, authType) -> true)
                .build();
            
            // Create the SSL connection socket factory (hostname verification disabled)
            SSLConnectionSocketFactory socketFactory = 
                new SSLConnectionSocketFactory(sslContext, 
                    NoopHostnameVerifier.INSTANCE);
            
            return HttpClients.custom()
                .setSSLSocketFactory(socketFactory)
                .build();
        } catch (Exception e) {
            e.printStackTrace();
            return HttpClients.createDefault();
        }
    }
}
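
A quick usage example follows; the target URL is a placeholder. Note that disabling certificate and hostname verification exposes you to man-in-the-middle attacks, so limit this to test environments:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

public class SSLClientUsage {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = SSLHttpClient.createSSLClient();
             CloseableHttpResponse response = client.execute(
                 new HttpGet("https://self-signed.example.com"))) {
            // Reads the body even when the site presents a self-signed certificate
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}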

Chapter 3: HTML Parsing and Data Extraction

3.1 Jsoup Basics

Jsoup is the most popular HTML parsing library for Java and offers jQuery-like selector syntax (it can also fetch pages directly, as shown at the end of this subsection):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupParser {
    public static void parseHtml(String html) {
        // Parse the HTML document
        Document doc = Jsoup.parse(html);
        
        // Look up an element by ID
        Element titleElement = doc.getElementById("title");
        if (titleElement != null) {
            System.out.println("Title: " + titleElement.text());
        }
        
        // Select elements with a CSS selector
        Elements newsItems = doc.select("div.news-item");
        for (Element item : newsItems) {
            // Extract text content from the element
            String title = item.select("h2").text();
            String link = item.select("a").attr("href");
            String summary = item.select("p.summary").text();
            
            System.out.println("标题: " + title);
            System.out.println("链接: " + link);
            System例: " + summary);
            System.out.println("-------------------");
        }
        
        // Collect all links
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String url = link.attr("abs:href"); // 获取绝对URL
            String text = link.text();
            System.out.println(text + " -> " + url);
        }
    }
}
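
Jsoup can also fetch pages by itself, without a separate HTTP client. A minimal sketch (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class JsoupFetchExample {
    public static void main(String[] args) {
        try {
            // Jsoup issues the GET request itself and parses the response
            Document doc = Jsoup.connect("https://example.com")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                .timeout(5000)
                .get();
            System.out.println(doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}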

3.2 Advanced Selector Usage

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AdvancedJsoupExample {
    public static void parseComplexHtml(String html) {
        Document doc = Jsoup.parse(html);
        
        // Combined selectors
        Elements items = doc.select("div.container > div.row > div.col-md-8");
        
        // Attribute selectors
        Elements images = doc.select("img[src$=.jpg]"); // src ending in .jpg
        Elements inputs = doc.select("input[type=text]"); // inputs with type=text
        
        // Pseudo-class selectors
        Elements firstChild = doc.select("div.item:first-child");
        Elements lastChild = doc.select("div.item:last-child");
        Elements nthChild = doc.select("div.item:nth-child(2n+1)"); // 奇数项
        
        // Text-contains selector
        Elements containsText = doc.select("p:contains(Java)");
        
        // Regular-expression matching
        Elements regexMatch = doc.select("div[class~=^news-\\d+$]"); // class names like news-<number>
        
        // Extract data and build objects
        for (Element item : doc.select("div.product")) {
            Product product = new Product();
            product.setName(item.select("h3").text());
            product.setPrice(item.select(".price").text());
            product.setUrl(item.select("a").attr("abs:href"));
            product.setImage(item.select("img").attr("abs:src"));
            
            // Read data attributes
            Element specs = item.select("div.specs").first();
            if (specs != null) {
                product.setBrand(specs.attr("data-brand"));
                product.setCategory(specs.attr("data-category"));
            }
            
            // Store or further process the data
            saveProduct(product);
        }
    }
    
    private static void saveProduct(Product product) {
        // Implement storage logic here
        System.out.println("Saving product: " + product);
    }
}

class Product {
    private String name;
    private String price;
    private String url;
    private String image;
    private String brand;
    private String category;
    
    // getters and setters
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public String getImage() { return image; }
    public void setImage(String image) { this.image = image; }
    public String getBrand() { return brand; }
    public void setBrand(String brand) { this.brand = brand; }
    public String getCategory() { return category; }
    public void setCategory(String category) { this.category = category; }
    
    @Override
    public String toString() {
        return "Product{name='" + name + "', price='" + price + "', brand='" + brand + "'}";
    }
}

3.3 Handling Dynamically Loaded Content

For content rendered dynamically by JavaScript, use Selenium:

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;
import java.time.Duration;
import java.util.List;

public class SeleniumCrawler {
    private WebDriver driver;
    private WebDriverWait wait;

    public SeleniumCrawler() {
        // Point Selenium at the ChromeDriver binary (Selenium 4.6+ can locate it automatically)
        System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");
        
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // 无头模式
        options.addArguments("--disable-gpu");
        options.addArguments("--no-sandbox");
        options.addArguments("--disable-dev-shm-usage");
        
        this.driver = new ChromeDriver(options);
        this.wait = new WebDriverWait(driver, Duration.ofSeconds(10));
    }

    public void crawlDynamicContent(String url) {
        try {
            driver.get(url);
            
            // Wait for a specific element to load
            wait.until(ExpectedConditions.presenceOfElementLocated(
                By.cssSelector("div.loaded-content")));
            
            // Read the dynamically loaded content
            List<WebElement> items = driver.findElements(By.cssSelector("div.item"));
            for (WebElement item : items) {
                String title = item.findElement(By.cssSelector("h3")).getText();
                String price = item.findElement(By.cssSelector(".price")).getText();
                System.out.println(title + " - " + price);
            }
            
            // Simulate scrolling to trigger lazy loading
            ((JavascriptExecutor) driver).executeScript("window.scrollTo(0, document.body.scrollHeight);");
            Thread.sleep(2000); // wait for new content to load
            
            // Dismiss pop-ups if present
            if (isElementPresent(By.cssSelector("div.modal"))) {
                driver.findElement(By.cssSelector("button.close")).click();
            }
            
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            driver.quit();
        }
    }

    private boolean isElementPresent(By locator) {
        try {
            driver.findElement(locator);
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}

Chapter 4: Data Storage and Management

4.1 File Storage

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import java.io.*;
import java.util.List;

public class FileStorage {
    // Save as CSV
    public static void saveToCSV(List<Product> products, String filePath) {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
            // Write the header row
            writer.write("name,price,url,image,brand,category\n");
            
            for (Product product : products) {
                // CSV format: comma-separated; quote fields and escape embedded quotes
                String line = String.format("\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n",
                    escapeCsv(product.getName()),
                    escapeCsv(product.getPrice()),
                    escapeCsv(product.getUrl()),
                    escapeCsv(product.getImage()),
                    escapeCsv(product.getBrand()),
                    escapeCsv(product.getCategory()));
                writer.write(line);
            }
            System.out.println("数据已保存到: " + filePath);
        } JSON存储
    public static void saveToJson(List<Product> products, String filePath) {
        try (FileWriter writer = new FileWriter(filePath)) {
            Gson gson = new GsonBuilder().setPrettyPrinting().create();
            gson.toJson(products, writer);
            System.out.println("JSON数据已保存到: " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Save as JSONL (one JSON object per line)
    public static void saveToJsonl(List<Product> products, String filePath) {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
            Gson gson = new Gson();
            for (Product product : products) {
                writer.write(gson.toJson(product) + "\n");
            }
            System.out.println("JSONL数据已保存到: " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Helper: escape CSV special characters
    private static String escapeCsv(String field) {
        if (field == null) return "";
        return field.replace("\"", "\"\"");
    }
}

4.2 Database Storage

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class DatabaseStorage {
    private static final String DB_URL = "jdbc:mysql://localhost:3306/crawler_db?useSSL=false&serverTimezone=UTC";
    private static final String USER = "root";
    private static final String PASSWORD = "password";

    // Create the database table (requires MySQL Connector/J on the classpath)
    public static void createTable() {
        String sql = """
            CREATE TABLE IF NOT EXISTS products (
                id INT AUTO_INCREMENT PRIMARY KEY,
                name VARCHAR(255) NOT NULL,
                price VARCHAR(50),
                url VARCHAR(500),
                image VARCHAR(500),
                brand VARCHAR(100),
                category VARCHAR(100),
                crawl_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                UNIQUE KEY unique_url (url)
            )
            """;
        
        try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
             Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
            System.out.println("表创建成功");
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    // Batch-insert rows
    public static void batchInsert(List<Product> products) {
        String sql = "INSERT INTO products (name, price, url, image, brand, category) " +
                    "VALUES (?, ?, ?, ?, ?, ?) " +
                    "ON DUPLICATE KEY UPDATE " +
                    "name=VALUES(name), price=VALUES(price), image=VALUES(image), " +
                    "brand=VALUES(brand), category=VALUES(category)";

        try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
             PreparedStatement pstmt = conn.prepareStatement(sql)) {
            
            int batchSize = 0;
            for (Product product : products) {
                pstmt.setString(1, product.getName());
                pstmt.setString(2, product.getPrice());
                pstmt.setString(3, product.getUrl());
                pstmt.setString(4, product.getImage());
                pstmt.setString(5, product.getBrand());
                pstmt.setString(6, product.getCategory());
                pstmt.addBatch();
                batchSize++;

                // Flush the batch every 100 rows
                if (batchSize % 100 == 0) {
                    pstmt.executeBatch();
                }
            }
            // Execute any remaining batched statements
            pstmt.executeBatch();
            System.out.println("批量插入完成,共插入 " + products.size() + " 条数据");
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    // Query rows
    public static List<Product> queryProducts(String brand) {
        String sql = "SELECT * FROM products WHERE brand = ?";
        List<Product> products = new ArrayList<>();
        
        try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
             PreparedStatement pstmt = conn.prepareStatement(sql)) {
            pstmt.setString(1, brand);
            ResultSet rs = pstmt.executeQuery();
            
            while (rs.next()) {
                Product product = new Product();
                product.setName(rs.getString("name"));
                product.setPrice(rs.getString("price"));
                product.setUrl(rs.getString("url"));
                product.setImage(rs.getString("image"));
                product.setBrand(rs.getString("brand"));
                product.setCategory(rs.getString("category"));
                products.add(product);
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return products;
    }
}

Chapter 5: Advanced Crawler Techniques

5.1 Multithreaded Crawling

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class MultiThreadedCrawler {
    private final ExecutorService executor;
    private final BlockingQueue<String> urlQueue;
    private final ConcurrentHashMap<String, Boolean> visitedUrls;
    private final AtomicInteger counter;
    private final int maxThreads;

    public MultiThreadedCrawler(int maxThreads) {
        this.maxThreads = maxThreads;
        this.executor = Executors.newFixedThreadPool(maxThreads);
        this.urlQueue = new LinkedBlockingQueue<>();
        this.visitedUrls = new ConcurrentHashMap<>();
        this.counter = new AtomicInteger(0);
    }

    public void startCrawling(String startUrl) {
        // Seed the queue with the start URL
        urlQueue.offer(startUrl);
        
        // Start the monitor thread
        executor.submit(this::monitorQueue);
        
        // Start worker threads
        for (int i = 0; i < maxThreads; i++) {
            executor.submit(this::crawlWorker);
        }
        
        // Simplified: wait a fixed time, then shut down
        try {
            Thread.sleep(5000); // wait 5 seconds
            shutdown();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void crawlWorker() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                String url = urlQueue.poll(1, TimeUnit.SECONDS);
                if (url == null) continue;

                // Skip URLs we have already visited
                if (visitedUrls.putIfAbsent(url, true) != null) {
                    continue;
                }

                // Fetch the page
                String html = sendRequest(url);
                if (html != null) {
                    // Parse out new URLs
                    List<String> newUrls = parseUrls(html);
                    newUrls.forEach(newUrl -> {
                        if (!visitedUrls.containsKey(newUrl)) {
                            urlQueue.offer(newUrl);
                        }
                    });

                    // Extract data
                    List<Product> products = parseProducts(html);
                    // Persist data
                    saveProducts(products);

                    int count = counter.incrementAndGet();
                    System.out.println("已爬取页面: " + count + " - " + url);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private void monitorQueue() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(10000); // print status every 10 seconds
                System.out.println("Queue size: " + urlQueue.size() +
                                 ", visited: " + visitedUrls.size() +
                                 ", counted: " + counter.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    private String sendRequest(String url) {
        // Send the HTTP request (reusing the Chapter 2 helper)
        return HttpClientExample.sendGetWithHttpClient(url);
    }

    private List<String> parseUrls(String html) {
        // Parse URLs out of the page
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("a[href]");
        List<String> urls = new ArrayList<>();
        for (Element link : links) {
            String absUrl = link.attr("abs:href");
            if (absUrl.startsWith("http")) {
                urls.add(absUrl);
            }
        }
        return urls;
    }

    private List<Product> parseProducts(String html) {
        // Parse products from the page
        return new ArrayList<>(); // simplified stub
    }

    private void saveProducts(List<Product> products) {
        // Persist the data
        if (!products.isEmpty()) {
            // e.g., batch-save to a database or file
        }
    }

    public void shutdown() {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
        System.out.println("爬虫已停止");
    }
}

5.2 Proxy IP Rotation

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class ProxyManager {
    private final List<Proxy> proxyList;
    private final Map<String, Integer> failureCount;
    private final int maxFailures;

    public ProxyManager(int maxFailures) {
        // CopyOnWriteArrayList: the list is read and mutated from multiple threads
        this.proxyList = new CopyOnWriteArrayList<>();
        this.failureCount = new ConcurrentHashMap<>();
        this.maxFailures = maxFailures;
    }

    public void addProxy(String host, int port) {
        proxyList.add(new Proxy(host, port));
    }

    public Proxy getProxy() {
        if (proxyList.isEmpty()) return null;

        // Simple round-robin scan; skip proxies that have failed too often
        int start = (int) (System.currentTimeMillis() % proxyList.size());
        for (int i = 0; i < proxyList.size(); i++) {
            Proxy proxy = proxyList.get((start + i) % proxyList.size());
            if (failureCount.getOrDefault(proxy.toString(), 0) < maxFailures) {
                return proxy;
            }
        }
        return null; // every proxy is currently marked as failed
    }

    public void markSuccess(String proxyKey) {
        // Reset the failure count
        failureCount.remove(proxyKey);
    }

    public void markFailure(String proxyKey) {
        // Increment the failure count
        failureCount.merge(proxyKey, 1, Integer::sum);
        
        // Remove the proxy once failures reach the threshold
        if (failureCount.get(proxyKey) >= maxFailures) {
            proxyList.removeIf(p -> p.toString().equals(proxyKey));
            System.out.println("移除失效代理: " + proxyKey);
        }
    }

    public static class Proxy {
        public final String host;
        public final int port;

        public Proxy(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }
}
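
ProxyManager only selects proxies; it does not route traffic. Here is a sketch of actually sending a request through the selected proxy with Apache HttpClient, via HttpHost and a per-request RequestConfig (the URL and proxy are whatever the caller supplies):

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ProxyRequestExample {
    public static String fetchViaProxy(String url, ProxyManager.Proxy proxy) {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet httpGet = new HttpGet(url);
            if (proxy != null) {
                // Route this particular request through the selected proxy
                RequestConfig config = RequestConfig.custom()
                    .setProxy(new HttpHost(proxy.host, proxy.port))
                    .build();
                httpGet.setConfig(config);
            }
            try (CloseableHttpResponse response = client.execute(httpGet)) {
                return EntityUtils.toString(response.getEntity());
            }
        } catch (Exception e) {
            return null; // the caller can mark the proxy as failed here
        }
    }
}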

5.3 Request Header Management

import org.apache.http.client.methods.HttpGet;
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;

public class HeaderManager {
    private static final List<String> USER_AGENTS = Arrays.asList(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15"
    );

    private static final List<String> ACCEPT_LANGUAGES = Arrays.asList(
        "zh-CN,zh;q=0.9,en;q=0.8",
        "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
        "ja-JP,ja;q=0.9,en;q=0.8"
    );

    public static Map<String, String> getRandomHeaders() {
        Map<String, String> headers = new HashMap<>();
        Random random = ThreadLocalRandom.current();
        
        headers.put("User-Agent", USER_AGENTS.get(random.nextInt(USER_AGENTS.size())));
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", ACCEPT_LANGUAGES.get(random.nextInt(ACCEPT_LANGUAGES.size())));
        headers.put("Accept-Encoding", "gzip, deflate, br");
        headers.put("Connection", "keep-alive");
        headers.put("Upgrade-Insecure-Requests", "1");
        
        return headers;
    }

    public static void applyHeaders(HttpGet httpGet) {
        Map<String, String> headers = getRandomHeaders();
        headers.forEach(httpGet::setHeader);
    }
}

Chapter 6: Anti-Crawling Countermeasures

6.1 Recognizing Anti-Crawling Mechanisms

Common anti-crawling mechanisms include (a retry-with-backoff sketch for rate limiting follows this list):

  • Rate limiting: caps on the number of requests per unit of time
  • User-Agent detection: flagging non-browser requests
  • IP bans: blocking IPs with abnormal traffic patterns
  • CAPTCHAs: requiring human verification
  • JavaScript obfuscation: content generated dynamically on the client
  • Request header validation: checking Referer, Cookie, and similar headers
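
When a site rate-limits you, it often answers with HTTP 429 or 503. Here is a sketch of retrying with exponential backoff; the status codes handled and the delay schedule are illustrative choices, not a standard:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class BackoffFetcher {
    public static String fetchWithBackoff(String url, int maxRetries) throws InterruptedException {
        long delayMillis = 1000;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try (CloseableHttpClient client = HttpClients.createDefault();
                 CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                int status = response.getStatusLine().getStatusCode();
                if (status == 200) {
                    return EntityUtils.toString(response.getEntity());
                }
                if (status != 429 && status != 503) {
                    return null; // not a rate-limit response; don't retry
                }
            } catch (Exception e) {
                // network error: fall through and retry
            }
            Thread.sleep(delayMillis); // wait before retrying
            delayMillis *= 2;          // exponential backoff: 1s, 2s, 4s, ...
        }
        return null;
    }
}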

6.2 Basic Countermeasures

import java.util.Random;
import java.util.concurrent.TimeUnit;

public class AntiCrawlerStrategy {
    private final Random random = new Random();
    private final int minDelay;
    private final int maxDelay;

    public AntiCrawlerStrategy(int minDelay, int maxDelay) {
        this.minDelay = minDelay;
        this.maxDelay = maxDelay;
    }

    // Random delay between requests
    public void randomDelay() {
        int delay = minDelay + random.nextInt(maxDelay - minDelay + 1);
        try {
            TimeUnit.SECONDS.sleep(delay);
            System.out.println("延迟 " + delay + " 秒");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Simulate human behavior
    public void simulateHumanBehavior() {
        // random mouse movement (Selenium only)
        // random page scrolling
        // random pauses
        randomDelay();
    }

    // Handle CAPTCHAs
    public void handleCaptcha(String captchaUrl) {
        // 1. Screenshot the CAPTCHA
        // 2. Send it to a third-party solving service (e.g., Chaojiying, Yundama)
        // 3. Or solve it manually
        System.out.println("CAPTCHA requires handling: " + captchaUrl);
    }

    // IP rotation strategy
    public String rotateProxy(ProxyManager proxyManager) {
        ProxyManager.Proxy proxy = proxyManager.getProxy();
        if (proxy != null) {
            return proxy.toString();
        }
        return null;
    }
}

6.3 Advanced Countermeasures

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;

public class AdvancedAntiCrawler {
    // Handle obfuscated JavaScript
    public static String deobfuscateJavaScript(String obfuscatedCode) {
        // Implement deobfuscation logic here,
        // e.g., decoding eval() payloads or reassembling concatenated strings
        return obfuscatedCode; // simplified stub
    }

    // Extract a dynamic token
    public static String extractDynamicToken(WebDriver driver) {
        // Read the token from a JavaScript variable or the DOM
        JavascriptExecutor js = (JavascriptExecutor) driver;
        String token = (String) js.executeScript(
            "return window.dynamicToken || document.querySelector('meta[name=\"token\"]').content;");
        return token;
    }

    // Capture WebSocket data
    public static void handleWebSocketData(WebDriver driver) {
        // Intercept WebSocket messages by wrapping the constructor (Chrome DevTools Protocol is an alternative)
        JavascriptExecutor js = (JavascriptExecutor) driver;
        js.executeScript("""
            const originalWebSocket = window.WebSocket;
            window.WebSocket = function(url, protocols) {
                const ws = new originalWebSocket(url, protocols);
                ws.addEventListener('message', function(event) {
                    console.log('WebSocket message:', event.data);
                    // stash the data in a global so Java can read it
                    window.lastWebSocketMessage = event.data;
                });
                return ws;
            };
        """);
    }

    // Spoof the canvas fingerprint
    public static void spoofCanvasFingerprint(WebDriver driver) {
        // Perturb canvas output to avoid fingerprint-based bot detection
        JavascriptExecutor js = (JavascriptExecutor) driver;
        js.executeScript("""
            const originalGetContext = HTMLCanvasElement.prototype.getContext;
            HTMLCanvasElement.prototype.getContext = function(type) {
                const context = originalGetContext.call(this, type);
                if (type === '2d') {
                    const originalFillText = context.fillText;
                    context.fillText = function(text, x, y) {
                        // add slight random noise
                        const noise = Math.random() * 0.1;
                        return originalFillText.call(this, text, x + noise, y + noise);
                    };
                }
                return context;
            };
        """);
    }
}

Chapter 7: Hands-On Project: An E-commerce Product Crawler

7.1 Requirements Analysis

Goal: crawl product information from an e-commerce site, including name, price, review count, and shop name, with support for multithreading, proxy rotation, and persistent storage.

7.2 Complete Implementation

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Product data model
class ProductInfo {
    private String name;
    private String price;
    private String reviewCount;
    private String shopName;
    private String productUrl;
    private String imageUrl;

    public ProductInfo(String name, String price, String reviewCount, String shopName, String productUrl, String imageUrl) {
        this.name = name;
        this.price = price;
        this.reviewCount = reviewCount;
        this.shopName = shopName;
        this.productUrl = productUrl;
        this.imageUrl = imageUrl;
    }

    public String getName() { return name; }
    public String getPrice() { return price; }
    public String getReviewCount() { return reviewCount; }
    public String getShopName() { return shopName; }
    public String getProductUrl() { return productUrl; }
    public String getImageUrl() { return imageUrl; }

    @Override
    public String toString() {
        return "ProductInfo{" +
                "name='" + name + '\'' +
                ", price='" + price + '\'' +
                ", reviewCount='" + reviewCount + '\'' +
                ", shopName='" + shopName + '\'' +
                ", productUrl='" + productUrl + '\'' +
                ", imageUrl='" + imageUrl + '\'' +
                '}';
    }
}

// Main crawler class
public class EcommerceCrawler {
    private final int maxThreads;
    private final ExecutorService executor;
    private final BlockingQueue<String> urlQueue;
    private final ConcurrentHashMap<String, Boolean> visitedUrls;
    private final AtomicInteger counter;
    private final List<ProductInfo> collectedProducts;
    private final ProxyManager proxyManager;
    private final AntiCrawlerStrategy antiCrawler;

    public EcommerceCrawler(int maxThreads) {
        this.maxThreads = maxThreads;
        this.executor = Executors.newFixedThreadPool(maxThreads);
        this.urlQueue = new LinkedBlockingQueue<>();
        this.visitedUrls = new ConcurrentHashMap<>();
        this.counter = new AtomicInteger(0);
        this.collectedProducts = new ArrayList<>();
        this.proxyManager = new ProxyManager(3);
        this.antiCrawler = new AntiCrawlerStrategy(2, 5); // 2-5 second delays

        // Add placeholder proxies (replace with real, working ones)
        proxyManager.addProxy("192.168.1.100", 8080);
        proxyManager.addProxy("192.168.1.101", 8080);
    }

    public void start(String startUrl, int maxPages) {
        urlQueue.offer(startUrl);
        
        // Start worker threads
        for (int i = 0; i < maxThreads; i++) {
            executor.submit(new CrawlTask(maxPages));
        }

        // Monitoring thread
        executor.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(10000);
                    System.out.printf("进度: %d/%d, 队列: %d, 收集: %d%n",
                            counter.get(), maxPages, urlQueue.size(), collectedProducts.size());
                    if (counter.get() >= maxPages) {
                        shutdown();
                        break;
                    }
                } catch (InterruptedException e) {
                    break;
                }
            }
        });
    }

    private class CrawlTask implements Runnable {
        private final int maxPages;

        public CrawlTask(int maxPages) {
            this.maxPages = maxPages;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted() && counter.get() < maxPages) {
                try {
                    String url = urlQueue.poll(1, TimeUnit.SECONDS);
                    if (url == null || visitedUrls.putIfAbsent(url, true) != null) {
                        continue;
                    }

                    // Random delay
                    antiCrawler.randomDelay();

                    // Send the request
                    String html = sendRequest(url);
                    if (html == null) continue;

                    // Parse product information
                    ProductInfo product = parseProduct(html, url);
                    if (product != null) {
                        synchronized (collectedProducts) {
                            collectedProducts.add(product);
                        }
                    }

                    // Extract new links
                    List<String> newUrls = extractLinks(html);
                    for (String newUrl : newUrls) {
                        if (!visitedUrls.containsKey(newUrl) && urlQueue.size() < 1000) {
                            urlQueue.offer(newUrl);
                        }
                    }

                    int count = counter.incrementAndGet();
                    System.out.println("完成: " + count + " - " + url);

                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        private String sendRequest(String url) {
            ProxyManager.Proxy proxy = proxyManager.getProxy();
            // Simplified: the selected proxy is not actually applied here (see section 5.2)
            try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
                HttpGet httpGet = new HttpGet(url);
                HeaderManager.applyHeaders(httpGet);
                
                try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                    if (response.getStatusLine().getStatusCode() == 200) {
                        return EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
                    }
                }
            } catch (Exception e) {
                if (proxy != null) {
                    proxyManager.markFailure(proxy.toString());
                }
                e.printStackTrace();
            }
            return null;
        }

        private ProductInfo parseProduct(String html, String url) {
            try {
                Document doc = Jsoup.parse(html);
                
                // Adjust these selectors to the target site's actual markup
                Element nameElement = doc.selectFirst("h1.product-title");
                Element priceElement = doc.selectFirst("span.price");
                Element reviewElement = doc.selectFirst("div.reviews-count");
                Element shopElement = doc.selectFirst("div.shop-name");
                Element imageElement = doc.selectFirst("img.product-image");

                if (nameElement == null || priceElement == null) return null;

                String name = nameElement.text();
                String price = priceElement.text();
                String reviewCount = reviewElement != null ? reviewElement.text() : "0";
                String shopName = shopElement != null ? shopElement.text() : "";
                String imageUrl = imageElement != null ? imageElement.attr("abs:src") : "";

                return new ProductInfo(name, price, reviewCount, shopName, url, imageUrl);
            } catch (Exception e) {
                return null;
            }
        }

        private List<String> extractLinks(String html) {
            List<String> links = new ArrayList<>();
            try {
                Document doc = Jsoup.parse(html);
                Elements elements = doc.select("a[href]");
                for (Element element : elements) {
                    String absUrl = element.attr("abs:href");
                    if (absUrl.contains("/product/") || absUrl.contains("/item/")) {
                        links.add(absUrl);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            return links;
        }
    }

    public void shutdown() {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
        }

        // Persist collected data
        saveData();
        System.out.println("Crawl finished; collected " + collectedProducts.size() + " products");
    }

    private void saveData() {
        List<Product> rows = collectedProducts.stream()
            .map(EcommerceCrawler::toProduct)
            .collect(Collectors.toList());
        FileStorage.saveToCSV(rows, "products.csv");   // save as CSV
        FileStorage.saveToJson(rows, "products.json"); // save as JSON
    }

    // Map the crawl-specific model onto the storage model from Chapter 4
    private static Product toProduct(ProductInfo p) {
        Product product = new Product();
        product.setName(p.getName());
        product.setPrice(p.getPrice());
        product.setUrl(p.getProductUrl());
        product.setImage(p.getImageUrl());
        product.setBrand(p.getShopName());
        product.setCategory("ecommerce");
        return product;
    }

    public static void main(String[] args) {
        EcommerceCrawler crawler = new EcommerceCrawler(5); // 5 worker threads
        crawler.start("https://example-ecommerce.com/products", 100); // crawl up to 100 pages
    }
}

Chapter 8: Legal and Ethical Considerations

8.1 Legal Risks of Crawling

The robots.txt protocol: crawlers should honor a site's robots.txt file, which declares which paths may or may not be crawled. The checker below is deliberately simplified; production code should follow the full robots.txt matching rules:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsTxtChecker {
    public static boolean isAllowed(String targetUrl, String userAgent) {
        try {
            URL url = new URL(targetUrl);
            String robotsUrl = url.getProtocol() + "://" + url.getHost() + "/robots.txt";
            
            HttpURLConnection connection = (HttpURLConnection) new URL(robotsUrl).openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            
            if (connection.getResponseCode() == 200) {
                BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()));
                String line;
                boolean inUserAgentBlock = false;
                boolean allowed = true;
                
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.startsWith("User-agent:")) {
                        String agent = line.substring(11).trim();
                        inUserAgentBlock = agent.equals("*") || agent.equals(userAgent);
                    } else if (inUserAgentBlock && line.startsWith("Disallow:")) {
                        String path = line.substring(9).trim();
                        if (url.getPath().startsWith(path)) {
                            allowed = false;
                        }
                    } else if (inUserAgentBlock && line.startsWith("Allow:")) {
                        String path = line.substring(6).trim();
                        if (url.getPath().startsWith(path)) {
                            allowed = true;
                        }
                    }
                }
                reader.close();
                return allowed;
            }
        } catch (Exception e) {
            // If robots.txt cannot be fetched, default to allowed
            return true;
        }
        return true;
    }

    public static void main(String[] args) {
        boolean allowed = RobotsTxtChecker.isAllowed(
            "https://example.com/admin", 
            "MyCrawler/1.0"
        );
        System.out.println("是否允许爬取: " + allowed);
    }
}

8.2 Ethical Crawling Guidelines

  1. Throttle your crawl rate: avoid placing excessive load on the target site
  2. Respect copyright: do not harvest copyrighted content for commercial use
  3. Protect privacy: do not collect personal information
  4. Be transparent: include contact information in your User-Agent (see the sketch after this list)
  5. Use data responsibly: be clear about the data's purpose and avoid misuse
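
As an example of points 1 and 4, here is a sketch of an identifiable, rate-limited request. The crawler name, homepage, and contact address are placeholders you should replace with your own:

import org.apache.http.client.methods.HttpGet;
import java.util.concurrent.TimeUnit;

public class PoliteCrawler {
    // Identify the crawler and give site operators a way to reach you
    private static final String USER_AGENT =
        "MyCrawler/1.0 (+https://example.com/crawler; contact: crawler@example.com)";

    public static HttpGet politeGet(String url) throws InterruptedException {
        TimeUnit.SECONDS.sleep(2); // fixed delay between requests to limit load
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", USER_AGENT);
        return httpGet;
    }
}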

Chapter 9: Performance Optimization and Monitoring

9.1 Performance Monitoring

import java.util.concurrent.atomic.AtomicLong;

public class CrawlerMonitor {
    private final AtomicLong totalRequests = new AtomicLong(0);
    private final AtomicLong successfulRequests = new AtomicLong(0);
    private final AtomicLong failedRequests = new AtomicLong(0);
    private final AtomicLong totalBytes = new AtomicLong(0);
    private final long startTime = System.currentTimeMillis();

    public void recordRequest(boolean success, long bytes) {
        totalRequests.incrementAndGet();
        if (success) {
            successfulRequests.incrementAndGet();
        } else {
            failedRequests.incrementAndGet();
        }
        totalBytes.addAndGet(bytes);
    }

    public void printStats() {
        long elapsed = Math.max((System.currentTimeMillis() - startTime) / 1000, 1);
        long total = totalRequests.get();
        double successRate = total == 0 ? 0 : (double) successfulRequests.get() / total * 100;
        double speed = total / (double) elapsed;
        
        System.out.println("========== Crawler Stats ==========");
        System.out.println("Elapsed: " + elapsed + " s");
        System.out.println("Total requests: " + total);
        System.out.println("Succeeded: " + successfulRequests.get());
        System.out.println("Failed: " + failedRequests.get());
        System.out.printf("Success rate: %.2f%%%n", successRate);
        System.out.printf("Average speed: %.2f req/s%n", speed);
        System.out.printf("Total data: %.2f MB%n", totalBytes.get() / 1024.0 / 1024.0);
        System.out.println("===================================");
    }
}
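
A sketch of wiring the monitor into a crawler: record each request's outcome as it completes, and print statistics on a fixed schedule. The 30-second interval is arbitrary, and a real crawler would also shut the scheduler down on exit:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MonitorWiring {
    public static void main(String[] args) {
        CrawlerMonitor monitor = new CrawlerMonitor();

        // Print statistics every 30 seconds in the background
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(monitor::printStats, 30, 30, TimeUnit.SECONDS);

        // In the crawl loop, record each request's outcome and payload size
        String html = HttpClientExample.sendGetWithHttpClient("https://example.com");
        monitor.recordRequest(html != null, html != null ? html.length() : 0);
    }
}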

9.2 Memory Optimization

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
import java.lang.ref.WeakReference;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class MemoryOptimizedCrawler {
    // Weak references let cached entries be garbage-collected under memory pressure
    private final Queue<WeakReference<String>> urlCache = new ConcurrentLinkedQueue<>();

    public void processLargeData(List<String> items) {
        // Stream the items instead of materializing intermediate collections
        items.stream()
            .filter(item -> item != null)
            .map(item -> processItem(item))
            .filter(result -> result != null)
            .forEach(this::saveResult);
    }

    private String processItem(String item) {
        // Process a single item
        return item.toUpperCase();
    }

    private void saveResult(String result) {
        // Persist the result
        System.out.println("Saved: " + result);
    }

    // try-with-resources guarantees the client and response are closed
    public void crawlWithResourceManagement(String url) {
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet httpGet = new HttpGet(url);
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // Handle the response
                String html = EntityUtils.toString(response.getEntity());
                // process the html...
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Chapter 10: Summary and Further Learning

10.1 Key Takeaways

  1. Environment: JDK + Maven + Jsoup/HttpClient
  2. Network requests: HttpURLConnection, HttpClient, Selenium
  3. Data parsing: Jsoup selectors, XPath, JSON parsing
  4. Data storage: files and databases
  5. Advanced techniques: multithreading, proxy rotation, request header management
  6. Anti-crawling countermeasures: random delays, IP rotation, User-Agent rotation
  7. Hands-on project: a complete e-commerce crawler
  8. Law and ethics: honor robots.txt and throttle request rates

10.2 Directions for Further Study

  1. Distributed crawling: use Redis and Kafka for distributed scheduling
  2. Machine learning: apply ML to CAPTCHA recognition and anti-crawling strategies
  3. Browser automation: Playwright, Puppeteer
  4. API reverse engineering: analyze WebSocket and GraphQL endpoints
  5. Data cleaning: regular expressions and NLP techniques
  6. Monitoring dashboards: Grafana and Prometheus integration

10.3 Recommended Resources

  • Official docs: Jsoup, Apache HttpClient, and Selenium documentation
  • Open-source projects: WebMagic, Crawler4j, Apache Nutch
  • Books: 《Java网络爬虫实战》 (Java Web Crawling in Action), 《数据抓取的艺术》 (The Art of Data Scraping)
  • Communities: GitHub, Stack Overflow, CSDN

10.4 Continuous Improvement

Crawler technology keeps evolving. It pays to:

  • Keep dependency versions up to date
  • Watch for technical changes on your target sites
  • Study new anti-crawling techniques and their countermeasures
  • Contribute to open-source projects
  • Build your own crawler framework

With this tutorial, you should now have both the fundamentals and the practical techniques of Java crawler development. Remember that crawler development is an ongoing process of learning and refinement; only through continued practice and reflection can you become a true expert. Good luck on your crawling journey!