Introduction: Why Choose Java for Web Crawler Development?

In today's data-driven era, web crawlers have become an essential tool for collecting data from the internet. Java, as a mature, stable, and powerful programming language, has distinct advantages for crawler development. First, Java has a rich ecosystem, including excellent networking and HTML-parsing libraries such as Jsoup, HttpClient, and Selenium. Second, Java's threading model makes large-scale concurrent crawling straightforward and efficient. Most importantly, Java's cross-platform nature and robust exception handling let crawler programs run reliably in messy network environments.

Compared with Python, Java crawlers tend to perform better on large-scale data collection, especially in long-running enterprise applications, where Java's memory management and garbage collection help avoid memory leaks. In addition, Java's type safety catches many potential errors at compile time, improving code quality.

Chapter 1: Setting Up the Java Crawler Environment

1.1 Preparing the Development Environment

Before starting Java crawler development, prepare the following:

JDK installation and configuration: Install JDK 17 or later (several examples in this tutorial use text blocks, which require at least JDK 15). Verify the installation with:

java -version
javac -version

IDE choice: IntelliJ IDEA or Eclipse are both recommended. IntelliJ IDEA has stronger support for Java web development, especially for debugging HTTP requests.

Build tool: Use Maven or Gradle to manage project dependencies. This tutorial uses Maven; add the necessary dependencies to pom.xml:

<dependencies>
    <!-- Jsoup - HTML parsing library -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.15.4</version>
    </dependency>
    
    <!-- HttpClient - HTTP client -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.14</version>
    </dependency>
    
    <!-- Gson - JSON library -->
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.10.1</version>
    </dependency>
    
    <!-- Selenium - browser automation -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.11.0</version>
    </dependency>
</dependencies>

1.2 How Web Crawlers Work

A crawler's basic workflow involves five components (a minimal sketch wiring them together follows this list):

  1. URL manager: maintains the sets of pending and visited URLs
  2. Downloader: sends HTTP requests and fetches page content
  3. Parser: extracts target data from the HTML
  4. Data processor: cleans and stores the extracted data
  5. Scheduler: coordinates the components and paces the crawl
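
To make the division of labor concrete, here is a minimal single-threaded sketch that wires the five components together. This is a sketch, not a production crawler: it uses Jsoup (introduced in Chapter 3) as the downloader, and the start URL and the 10-page cap are placeholder values.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class CrawlerSkeleton {
    public static void main(String[] args) throws Exception {
        Queue<String> pending = new ArrayDeque<>();   // URL manager: to-crawl queue
        Set<String> visited = new HashSet<>();        // URL manager: visited set
        pending.add("https://example.com");

        while (!pending.isEmpty() && visited.size() < 10) { // scheduler: simple page cap
            String url = pending.poll();
            if (!visited.add(url)) continue;

            // Downloader: Jsoup issues the HTTP request and parses the response
            Document doc = Jsoup.connect(url).timeout(5000).get();

            // Parser: extract data and new links
            System.out.println(doc.title());          // data processor would clean/store this
            for (Element link : doc.select("a[href]")) {
                String next = link.attr("abs:href");
                if (next.startsWith("http") && !visited.contains(next)) {
                    pending.add(next);
                }
            }
            Thread.sleep(1000);                       // scheduler: crawl politely
        }
    }
}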

Chapter 2: Sending Network Requests in Java

2.1 Sending Requests with HttpURLConnection

HttpURLConnection is the HTTP client built into the Java standard library; it needs no extra dependencies:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SimpleHttpClient {
    public static String sendGetRequest(String targetURL) {
        HttpURLConnection connection = null;
        try {
            URL url = new URL(targetURL);
            connection = (HttpURLConnection) url.openConnection();
            
            // Set the request method
            connection.setRequestMethod("GET");
            
            // Set request headers
            connection.setRequestProperty("User-Agent", 
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
            connection.setRequestProperty("Accept", 
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            
            // Set connect and read timeouts
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            
            // Check the response code
            int responseCode = connection.getResponseCode();
            System.out.println("Response Code: " + responseCode);
            
            if (responseCode == HttpURLConnection.HTTP_OK) {
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
                String inputLine;
                StringBuilder response = new StringBuilder();
                
                while ((inputLine = in.readLine()) != null) {
                    response.append(inputLine);
                }
                in.close();
                return response.toString();
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
        return null;
    }
    
    public static void main(String[] args) {
        String html = sendGetRequest("https://example.com");
        System.out.println(html);
    }
}

2.2 Sending Requests with Apache HttpClient

Apache HttpClient offers more powerful features, including connection pooling and cookie management (a pooled-client sketch follows this example):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.nio.charset.StandardCharsets;

public class HttpClientExample {
    public static String sendGetWithHttpClient(String targetURL) {
        // Create an HttpClient instance
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Build the GET request
            HttpGet httpGet = new HttpGet(targetURL);
            
            // Set request headers
            httpGet.setHeader("User-Agent", 
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
            httpGet.setHeader("Accept", 
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            httpGet.setHeader("Accept-Language", "zh-CN,zh;q=0.9");
            
            // Execute the request
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // Read the response entity
                String html = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
                return html;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
    
    public static void main(String[] args) {
        String html = sendGetWithHttpClient("https://example.com");
        if (html != null) {
            JsoupParser.parseHtml(html);
        }
    }
}
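
The text above mentions connection pooling, but the example does not show it. Here is a sketch of a pooled client using PoolingHttpClientConnectionManager and RequestConfig; the pool sizes and timeout values are illustrative, not recommendations:

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledHttpClientFactory {
    public static CloseableHttpClient createPooledClient() {
        // Connection pool: reuse TCP connections across requests
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(100);            // total connections across all routes
        cm.setDefaultMaxPerRoute(10);   // connections per target host

        // Default timeouts applied to every request from this client
        RequestConfig config = RequestConfig.custom()
            .setConnectTimeout(5000)
            .setSocketTimeout(5000)
            .setConnectionRequestTimeout(2000) // max wait for a pooled connection
            .build();

        return HttpClients.custom()
            .setConnectionManager(cm)
            .setDefaultRequestConfig(config)
            .build();
    }
}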

2.3 Handling HTTPS and SSL Certificate Issues

When crawling HTTPS sites, you may run into SSL certificate validation failures (for example, with self-signed certificates):

import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;
import javax.net.ssl.SSLContext;

public class SSLHttpClient {
    public static CloseableHttpClient createSSLClient() {
        try {
            // Build an SSL context that trusts every certificate (insecure; test use only)
            SSLContext sslContext = SSLContexts.custom()
                .loadTrustMaterial((chain, authType) -> true)
                .build();
            
            // Create the SSL connection socket factory (hostname verification disabled)
            SSLConnectionSocketFactory socketFactory = 
                new SSLConnectionSocketFactory(sslContext, 
                    NoopHostnameVerifier.INSTANCE);
            
            return HttpClients.custom()
                .setSSLSocketFactory(socketFactory)
                .build();
        } catch (Exception e) {
            e.printStackTrace();
            return HttpClients.createDefault();
        }
    }
}
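
A quick usage example follows; the target URL is a placeholder. Note that disabling certificate and hostname verification exposes you to man-in-the-middle attacks, so limit this to test environments:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

public class SSLClientUsage {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = SSLHttpClient.createSSLClient();
             CloseableHttpResponse response = client.execute(
                 new HttpGet("https://self-signed.example.com"))) {
            // Reads the body even when the site presents a self-signed certificate
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}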

Chapter 3: HTML Parsing and Data Extraction

3.1 Jsoup Basics

Jsoup is the most popular HTML parsing library for Java and offers jQuery-like selector syntax (it can also fetch pages directly, as shown at the end of this subsection):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupParser {
    public static void parseHtml(String html) {
        // Parse the HTML document
        Document doc = Jsoup.parse(html);
        
        // Look up an element by ID
        Element titleElement = doc.getElementById("title");
        if (titleElement != null) {
            System.out.println("Title: " + titleElement.text());
        }
        
        // Select elements with a CSS selector
        Elements newsItems = doc.select("div.news-item");
        for (Element item : newsItems) {
            // Extract text content from the element
            String title = item.select("h2").text();
            String link = item.select("a").attr("href");
            String summary = item.select("p.summary").text();
            
            System.out.println("标题: " + title);
            System.out.println("链接: " + link);
            System例: " + summary);
            System.out.println("-------------------");
        }
        
        // Collect all links
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String url = link.attr("abs:href"); // 获取绝对URL
            String text = link.text();
            System.out.println(text + " -> " + url);
        }
    }
}
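
Jsoup can also fetch pages by itself, without a separate HTTP client. A minimal sketch (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class JsoupFetchExample {
    public static void main(String[] args) {
        try {
            // Jsoup issues the GET request itself and parses the response
            Document doc = Jsoup.connect("https://example.com")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                .timeout(5000)
                .get();
            System.out.println(doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}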

3.2 Advanced Selector Usage

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AdvancedJsoupExample {
    public static void parseComplexHtml(String html) {
        Document doc = Jsoup.parse(html);
        
        // Combined selectors
        Elements items = doc.select("div.container > div.row > div.col-md-8");
        
        // Attribute selectors
        Elements images = doc.select("img[src$=.jpg]"); // src ending in .jpg
        Elements inputs = doc.select("input[type=text]"); // inputs with type=text
        
        // Pseudo-class selectors
        Elements firstChild = doc.select("div.item:first-child");
        Elements lastChild = doc.select("div.item:last-child");
        Elements nthChild = doc.select("div.item:nth-child(2n+1)"); // 奇数项
        
        // Text-contains selector
        Elements containsText = doc.select("p:contains(Java)");
        
        // Regular-expression matching
        Elements regexMatch = doc.select("div[class~=^news-\\d+$]"); // class names like news-<number>
        
        // Extract data and build objects
        for (Element item : doc.select("div.product")) {
            Product product = new Product();
            product.setName(item.select("h3").text());
            product.setPrice(item.select(".price").text());
            product.setUrl(item.select("a").attr("abs:href"));
            product.setImage(item.select("img").attr("abs:src"));
            
            // Read data attributes
            Element specs = item.select("div.specs").first();
            if (specs != null) {
                product.setBrand(specs.attr("data-brand"));
                product.setCategory(specs.attr("data-category"));
            }
            
            // Store or further process the data
            saveProduct(product);
        }
    }
    
    private static void saveProduct(Product product) {
        // Implement storage logic here
        System.out.println("Saving product: " + product);
    }
}

class Product {
    private String name;
    private String price;
    private String url;
    private String image;
    private String brand;
    private String category;
    
    // getters and setters
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public String getImage() { return image; }
    public void setImage(String image) { this.image = image; }
    public String getBrand() { return brand; }
    public void setBrand(String brand) { this.brand = brand; }
    public String getCategory() { return category; }
    public void setCategory(String category) { this.category = category; }
    
    @Override
    public String toString() {
        return "Product{name='" + name + "', price='" + price + "', brand='" + brand + "'}";
    }
}

3.3 Handling Dynamically Loaded Content

For content rendered dynamically by JavaScript, use Selenium:

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;
import java.time.Duration;
import java.util.List;

public class SeleniumCrawler {
    private WebDriver driver;
    private WebDriverWait wait;

    public SeleniumCrawler() {
        // Point Selenium at the ChromeDriver binary (Selenium 4.6+ can locate it automatically)
        System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");
        
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // 无头模式
        options.addArguments("--disable-gpu");
        options.addArguments("--no-sandbox");
        options.addArguments("--disable-dev-shm-usage");
        
        this.driver = new ChromeDriver(options);
        this.wait = new WebDriverWait(driver, Duration.ofSeconds(10));
    }

    public void crawlDynamicContent(String url) {
        try {
            driver.get(url);
            
            // Wait for a specific element to load
            wait.until(ExpectedConditions.presenceOfElementLocated(
                By.cssSelector("div.loaded-content")));
            
            // Read the dynamically loaded content
            List<WebElement> items = driver.findElements(By.cssSelector("div.item"));
            for (WebElement item : items) {
                String title = item.findElement(By.cssSelector("h3")).getText();
                String price = item.findElement(By.cssSelector(".price")).getText();
                System.out.println(title + " - " + price);
            }
            
            // Simulate scrolling to trigger lazy loading
            ((JavascriptExecutor) driver).executeScript("window.scrollTo(0, document.body.scrollHeight);");
            Thread.sleep(2000); // wait for new content to load
            
            // Dismiss pop-ups if present
            if (isElementPresent(By.cssSelector("div.modal"))) {
                driver.findElement(By.cssSelector("button.close")).click();
            }
            
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            driver.quit();
        }
    }

    private boolean isElementPresent(By locator) {
        try {
            driver.findElement(locator);
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}

Chapter 4: Data Storage and Management

4.1 File Storage

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import java.io.*;
import java.util.List;

public class FileStorage {
    // Save as CSV
    public static void saveToCSV(List<Product> products, String filePath) {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
            // Write the header row
            writer.write("name,price,url,image,brand,category\n");
            
            for (Product product : products) {
                // CSV format: comma-separated; quote fields and escape embedded quotes
                String line = String.format("\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n",
                    escapeCsv(product.getName()),
                    escapeCsv(product.getPrice()),
                    escapeCsv(product.getUrl()),
                    escapeCsv(product.getImage()),
                    escapeCsv(product.getBrand()),
                    escapeCsv(product.getCategory()));
                writer.write(line);
            }
            System.out.println("数据已保存到: " + filePath);
        } JSON存储
    public static void saveToJson(List<Product> products, String filePath) {
        try (FileWriter writer = new FileWriter(filePath)) {
            Gson gson = new GsonBuilder().setPrettyPrinting().create();
            gson.toJson(products, writer);
            System.out.println("JSON数据已保存到: " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Save as JSONL (one JSON object per line)
    public static void saveToJsonl(List<Product> products, String filePath) {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
            Gson gson = new Gson();
            for (Product product : products) {
                writer.write(gson.toJson(product) + "\n");
            }
            System.out.println("JSONL数据已保存到: " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Helper: escape CSV special characters
    private static String escapeCsv(String field) {
        if (field == null) return "";
        return field.replace("\"", "\"\"");
    }
}

4.2 Database Storage

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class DatabaseStorage {
    private static final String DB_URL = "jdbc:mysql://localhost:3306/crawler_db?useSSL=false&serverTimezone=UTC";
    private static final String USER = "root";
    private static final String PASSWORD = "password";

    // Create the database table (requires MySQL Connector/J on the classpath)
    public static void createTable() {
        String sql = """
            CREATE TABLE IF NOT EXISTS products (
                id INT AUTO_INCREMENT PRIMARY KEY,
                name VARCHAR(255) NOT NULL,
                price VARCHAR(50),
                url VARCHAR(500),
                image VARCHAR(500),
                brand VARCHAR(100),
                category VARCHAR(100),
                crawl_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                UNIQUE KEY unique_url (url)
            )
            """;
        
        try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
             Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
            System.out.println("表创建成功");
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    // Batch-insert rows
    public static void batchInsert(List<Product> products) {
        String sql = "INSERT INTO products (name, price, url, image, brand, category) " +
                    "VALUES (?, ?, ?, ?, ?, ?) " +
                    "ON DUPLICATE KEY UPDATE " +
                    "name=VALUES(name), price=VALUES(price), image=VALUES(image), " +
                    "brand=VALUES(brand), category=VALUES(category)";

        try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
             PreparedStatement pstmt = conn.prepareStatement(sql)) {
            
            int batchSize = 0;
            for (Product product : products) {
                pstmt.setString(1, product.getName());
                pstmt.setString(2, product.getPrice());
                pstmt.setString(3, product.getUrl());
                pstmt.setString(4, product.getImage());
                pstmt.setString(5, product.getBrand());
                pstmt.setString(6, product.getCategory());
                pstmt.addBatch();
                batchSize++;

                // Flush the batch every 100 rows
                if (batchSize % 100 == 0) {
                    pstmt.executeBatch();
                }
            }
            // Execute any remaining batched statements
            pstmt.executeBatch();
            System.out.println("批量插入完成,共插入 " + products.size() + " 条数据");
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    // Query rows
    public static List<Product> queryProducts(String brand) {
        String sql = "SELECT * FROM products WHERE brand = ?";
        List<Product> products = new ArrayList<>();
        
        try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
             PreparedStatement pstmt = conn.prepareStatement(sql)) {
            pstmt.setString(1, brand);
            ResultSet rs = pstmt.executeQuery();
            
            while (rs.next()) {
                Product product = new Product();
                product.setName(rs.getString("name"));
                product.setPrice(rs.getString("price"));
                product.setUrl(rs.getString("url"));
                product.setImage(rs.getString("image"));
                product.setBrand(rs.getString("brand"));
                product.setCategory(rs.getString("category"));
                products.add(product);
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return products;
    }
}

Chapter 5: Advanced Crawler Techniques

5.1 Multithreaded Crawling

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class MultiThreadedCrawler {
    private final ExecutorService executor;
    private final BlockingQueue<String> urlQueue;
    private final ConcurrentHashMap<String, Boolean> visitedUrls;
    private final AtomicInteger counter;
    private final int maxThreads;

    public MultiThreadedCrawler(int maxThreads) {
        this.maxThreads = maxThreads;
        this.executor = Executors.newFixedThreadPool(maxThreads);
        this.urlQueue = new LinkedBlockingQueue<>();
        this.visitedUrls = new ConcurrentHashMap<>();
        this.counter = new AtomicInteger(0);
    }

    public void startCrawling(String startUrl) {
        // Seed the queue with the start URL
        urlQueue.offer(startUrl);
        
        // Start the monitor thread
        executor.submit(this::monitorQueue);
        
        // Start worker threads
        for (int i = 0; i < maxThreads; i++) {
            executor.submit(this::crawlWorker);
        }
        
        // Simplified: wait a fixed time, then shut down
        try {
            Thread.sleep(5000); // wait 5 seconds
            shutdown();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void crawlWorker() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                String url = urlQueue.poll(1, TimeUnit.SECONDS);
                if (url == null) continue;

                // Skip URLs we have already visited
                if (visitedUrls.putIfAbsent(url, true) != null) {
                    continue;
                }

                // Fetch the page
                String html = sendRequest(url);
                if (html != null) {
                    // Parse out new URLs
                    List<String> newUrls = parseUrls(html);
                    newUrls.forEach(newUrl -> {
                        if (!visitedUrls.containsKey(newUrl)) {
                            urlQueue.offer(newUrl);
                        }
                    });

                    // Extract data
                    List<Product> products = parseProducts(html);
                    // Persist data
                    saveProducts(products);

                    int count = counter.incrementAndGet();
                    System.out.println("已爬取页面: " + count + " - " + url);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private void monitorQueue() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(10000); // print status every 10 seconds
                System.out.println("Queue size: " + urlQueue.size() +
                                 ", visited: " + visitedUrls.size() +
                                 ", counted: " + counter.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    private String sendRequest(String url) {
        // Send the HTTP request (reusing the Chapter 2 helper)
        return HttpClientExample.sendGetWithHttpClient(url);
    }

    private List<String> parseUrls(String html) {
        // Parse URLs out of the page
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("a[href]");
        List<String> urls = new ArrayList<>();
        for (Element link : links) {
            String absUrl = link.attr("abs:href");
            if (absUrl.startsWith("http")) {
                urls.add(absUrl);
            }
        }
        return urls;
    }

    private List<Product> parseProducts(String html) {
        // Parse products from the page
        return new ArrayList<>(); // simplified stub
    }

    private void saveProducts(List<Product> products) {
        // Persist the data
        if (!products.isEmpty()) {
            // e.g., batch-save to a database or file
        }
    }

    public void shutdown() {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
        System.out.println("爬虫已停止");
    }
}

5.2 Proxy IP Rotation

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class ProxyManager {
    private final List<Proxy> proxyList;
    private final Map<String, Integer> failureCount;
    private final int maxFailures;

    public ProxyManager(int maxFailures) {
        // CopyOnWriteArrayList: the list is read and mutated from multiple threads
        this.proxyList = new CopyOnWriteArrayList<>();
        this.failureCount = new ConcurrentHashMap<>();
        this.maxFailures = maxFailures;
    }

    public void addProxy(String host, int port) {
        proxyList.add(new Proxy(host, port));
    }

    public Proxy getProxy() {
        if (proxyList.isEmpty()) return null;

        // Simple round-robin scan; skip proxies that have failed too often
        int start = (int) (System.currentTimeMillis() % proxyList.size());
        for (int i = 0; i < proxyList.size(); i++) {
            Proxy proxy = proxyList.get((start + i) % proxyList.size());
            if (failureCount.getOrDefault(proxy.toString(), 0) < maxFailures) {
                return proxy;
            }
        }
        return null; // every proxy is currently marked as failed
    }

    public void markSuccess(String proxyKey) {
        // Reset the failure count
        failureCount.remove(proxyKey);
    }

    public void markFailure(String proxyKey) {
        // Increment the failure count
        failureCount.merge(proxyKey, 1, Integer::sum);
        
        // Remove the proxy once failures reach the threshold
        if (failureCount.get(proxyKey) >= maxFailures) {
            proxyList.removeIf(p -> p.toString().equals(proxyKey));
            System.out.println("移除失效代理: " + proxyKey);
        }
    }

    public static class Proxy {
        public final String host;
        public final int port;

        public Proxy(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }
}
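
ProxyManager only selects proxies; it does not route traffic. Here is a sketch of actually sending a request through the selected proxy with Apache HttpClient, via HttpHost and a per-request RequestConfig (the URL and proxy are whatever the caller supplies):

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ProxyRequestExample {
    public static String fetchViaProxy(String url, ProxyManager.Proxy proxy) {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet httpGet = new HttpGet(url);
            if (proxy != null) {
                // Route this particular request through the selected proxy
                RequestConfig config = RequestConfig.custom()
                    .setProxy(new HttpHost(proxy.host, proxy.port))
                    .build();
                httpGet.setConfig(config);
            }
            try (CloseableHttpResponse response = client.execute(httpGet)) {
                return EntityUtils.toString(response.getEntity());
            }
        } catch (Exception e) {
            return null; // the caller can mark the proxy as failed here
        }
    }
}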

5.3 Request Header Management

import org.apache.http.client.methods.HttpGet;
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;

public class HeaderManager {
    private static final List<String> USER_AGENTS = Arrays.asList(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15"
    );

    private static final List<String> ACCEPT_LANGUAGES = Arrays.asList(
        "zh-CN,zh;q=0.9,en;q=0.8",
        "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
        "ja-JP,ja;q=0.9,en;q=0.8"
    );

    public static Map<String, String> getRandomHeaders() {
        Map<String, String> headers = new HashMap<>();
        Random random = ThreadLocalRandom.current();
        
        headers.put("User-Agent", USER_AGENTS.get(random.nextInt(USER_AGENTS.size())));
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", ACCEPT_LANGUAGES.get(random.nextInt(ACCEPT_LANGUAGES.size())));
        headers.put("Accept-Encoding", "gzip, deflate, br");
        headers.put("Connection", "keep-alive");
        headers.put("Upgrade-Insecure-Requests", "1");
        
        return headers;
    }

    public static void applyHeaders(HttpGet httpGet) {
        Map<String, String> headers = getRandomHeaders();
        headers.forEach(httpGet::setHeader);
    }
}

Chapter 6: Anti-Crawling Countermeasures

6.1 Recognizing Anti-Crawling Mechanisms

Common anti-crawling mechanisms include (a retry-with-backoff sketch for rate limiting follows this list):

  • Rate limiting: caps on the number of requests per unit of time
  • User-Agent detection: flagging non-browser requests
  • IP bans: blocking IPs with abnormal traffic patterns
  • CAPTCHAs: requiring human verification
  • JavaScript obfuscation: content generated dynamically on the client
  • Request header validation: checking Referer, Cookie, and similar headers
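
When a site rate-limits you, it often answers with HTTP 429 or 503. Here is a sketch of retrying with exponential backoff; the status codes handled and the delay schedule are illustrative choices, not a standard:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class BackoffFetcher {
    public static String fetchWithBackoff(String url, int maxRetries) throws InterruptedException {
        long delayMillis = 1000;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try (CloseableHttpClient client = HttpClients.createDefault();
                 CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                int status = response.getStatusLine().getStatusCode();
                if (status == 200) {
                    return EntityUtils.toString(response.getEntity());
                }
                if (status != 429 && status != 503) {
                    return null; // not a rate-limit response; don't retry
                }
            } catch (Exception e) {
                // network error: fall through and retry
            }
            Thread.sleep(delayMillis); // wait before retrying
            delayMillis *= 2;          // exponential backoff: 1s, 2s, 4s, ...
        }
        return null;
    }
}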

6.2 Basic Countermeasures

import java.util.Random;
import java.util.concurrent.TimeUnit;

public class AntiCrawlerStrategy {
    private final Random random = new Random();
    private final int minDelay;
    private final int maxDelay;

    public AntiCrawlerStrategy(int minDelay, int maxDelay) {
        this.minDelay = minDelay;
        this.maxDelay = maxDelay;
    }

    // Random delay between requests
    public void randomDelay() {
        int delay = minDelay + random.nextInt(maxDelay - minDelay + 1);
        try {
            TimeUnit.SECONDS.sleep(delay);
            System.out.println("延迟 " + delay + " 秒");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Simulate human behavior
    public void simulateHumanBehavior() {
        // random mouse movement (Selenium only)
        // random page scrolling
        // random pauses
        randomDelay();
    }

    // Handle CAPTCHAs
    public void handleCaptcha(String captchaUrl) {
        // 1. Screenshot the CAPTCHA
        // 2. Send it to a third-party solving service (e.g., Chaojiying, Yundama)
        // 3. Or solve it manually
        System.out.println("CAPTCHA requires handling: " + captchaUrl);
    }

    // IP rotation strategy
    public String rotateProxy(ProxyManager proxyManager) {
        ProxyManager.Proxy proxy = proxyManager.getProxy();
        if (proxy != null) {
            return proxy.toString();
        }
        return null;
    }
}

6.3 Advanced Countermeasures

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;

public class AdvancedAntiCrawler {
    // Handle obfuscated JavaScript
    public static String deobfuscateJavaScript(String obfuscatedCode) {
        // Implement deobfuscation logic here,
        // e.g., decoding eval() payloads or reassembling concatenated strings
        return obfuscatedCode; // simplified stub
    }

    // Extract a dynamic token
    public static String extractDynamicToken(WebDriver driver) {
        // Read the token from a JavaScript variable or the DOM
        JavascriptExecutor js = (JavascriptExecutor) driver;
        String token = (String) js.executeScript(
            "return window.dynamicToken || document.querySelector('meta[name=\"token\"]').content;");
        return token;
    }

    // Capture WebSocket data
    public static void handleWebSocketData(WebDriver driver) {
        // Intercept WebSocket messages by wrapping the constructor (Chrome DevTools Protocol is an alternative)
        JavascriptExecutor js = (JavascriptExecutor) driver;
        js.executeScript("""
            const originalWebSocket = window.WebSocket;
            window.WebSocket = function(url, protocols) {
                const ws = new originalWebSocket(url, protocols);
                ws.addEventListener('message', function(event) {
                    console.log('WebSocket message:', event.data);
                    // stash the data in a global so Java can read it
                    window.lastWebSocketMessage = event.data;
                });
                return ws;
            };
        """);
    }

    // Spoof the canvas fingerprint
    public static void spoofCanvasFingerprint(WebDriver driver) {
        // Perturb canvas output to avoid fingerprint-based bot detection
        JavascriptExecutor js = (JavascriptExecutor) driver;
        js.executeScript("""
            const originalGetContext = HTMLCanvasElement.prototype.getContext;
            HTMLCanvasElement.prototype.getContext = function(type) {
                const context = originalGetContext.call(this, type);
                if (type === '2d') {
                    const originalFillText = context.fillText;
                    context.fillText = function(text, x, y) {
                        // add slight random noise
                        const noise = Math.random() * 0.1;
                        return originalFillText.call(this, text, x + noise, y + noise);
                    };
                }
                return context;
            };
        """);
    }
}

Chapter 7: Hands-On Project: An E-commerce Product Crawler

7.1 Requirements Analysis

Goal: crawl product information from an e-commerce site, including name, price, review count, and shop name, with support for multithreading, proxy rotation, and persistent storage.

7.2 Complete Implementation

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Product data model
class ProductInfo {
    private String name;
    private String price;
    private String reviewCount;
    private String shopName;
    private String productUrl;
    private String imageUrl;

    public ProductInfo(String name, String price, String reviewCount, String shopName, String productUrl, String imageUrl) {
        this.name = name;
        this.price = price;
        this.reviewCount = reviewCount;
        this.shopName = shopName;
        this.productUrl = productUrl;
        this.imageUrl = imageUrl;
    }

    public String getName() { return name; }
    public String getPrice() { return price; }
    public String getReviewCount() { return reviewCount; }
    public String getShopName() { return shopName; }
    public String getProductUrl() { return productUrl; }
    public String getImageUrl() { return imageUrl; }

    @Override
    public String toString() {
        return "ProductInfo{" +
                "name='" + name + '\'' +
                ", price='" + price + '\'' +
                ", reviewCount='" + reviewCount + '\'' +
                ", shopName='" + shopName + '\'' +
                ", productUrl='" + productUrl + '\'' +
                ", imageUrl='" + imageUrl + '\'' +
                '}';
    }
}

// Main crawler class
public class EcommerceCrawler {
    private final int maxThreads;
    private final ExecutorService executor;
    private final BlockingQueue<String> urlQueue;
    private final ConcurrentHashMap<String, Boolean> visitedUrls;
    private final AtomicInteger counter;
    private final List<ProductInfo> collectedProducts;
    private final ProxyManager proxyManager;
    private final AntiCrawlerStrategy antiCrawler;

    public EcommerceCrawler(int maxThreads) {
        this.maxThreads = maxThreads;
        this.executor = Executors.newFixedThreadPool(maxThreads);
        this.urlQueue = new LinkedBlockingQueue<>();
        this.visitedUrls = new ConcurrentHashMap<>();
        this.counter = new AtomicInteger(0);
        this.collectedProducts = new ArrayList<>();
        this.proxyManager = new ProxyManager(3);
        this.antiCrawler = new AntiCrawlerStrategy(2, 5); // 2-5 second delays

        // Add placeholder proxies (replace with real, working ones)
        proxyManager.addProxy("192.168.1.100", 8080);
        proxyManager.addProxy("192.168.1.101", 8080);
    }

    public void start(String startUrl, int maxPages) {
        urlQueue.offer(startUrl);
        
        // Start worker threads
        for (int i = 0; i < maxThreads; i++) {
            executor.submit(new CrawlTask(maxPages));
        }

        // Monitoring thread
        executor.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(10000);
                    System.out.printf("进度: %d/%d, 队列: %d, 收集: %d%n",
                            counter.get(), maxPages, urlQueue.size(), collectedProducts.size());
                    if (counter.get() >= maxPages) {
                        shutdown();
                        break;
                    }
                } catch (InterruptedException e) {
                    break;
                }
            }
        });
    }

    private class CrawlTask implements Runnable {
        private final int maxPages;

        public CrawlTask(int maxPages) {
            this.maxPages = maxPages;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted() && counter.get() < maxPages) {
                try {
                    String url = urlQueue.poll(1, TimeUnit.SECONDS);
                    if (url == null || visitedUrls.putIfAbsent(url, true) != null) {
                        continue;
                    }

                    // Random delay
                    antiCrawler.randomDelay();

                    // Send the request
                    String html = sendRequest(url);
                    if (html == null) continue;

                    // Parse product information
                    ProductInfo product = parseProduct(html, url);
                    if (product != null) {
                        synchronized (collectedProducts) {
                            collectedProducts.add(product);
                        }
                    }

                    // Extract new links
                    List<String> newUrls = extractLinks(html);
                    for (String newUrl : newUrls) {
                        if (!visitedUrls.containsKey(newUrl) && urlQueue.size() < 1000) {
                            urlQueue.offer(newUrl);
                        }
                    }

                    int count = counter.incrementAndGet();
                    System.out.println("完成: " + count + " - " + url);

                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        private String sendRequest(String url) {
            ProxyManager.Proxy proxy = proxyManager.getProxy();
            // Simplified: the selected proxy is not actually applied here (see section 5.2)
            try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
                HttpGet httpGet = new HttpGet(url);
                HeaderManager.applyHeaders(httpGet);
                
                try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                    if (response.getStatusLine().getStatusCode() == 200) {
                        return EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
                    }
                }
            } catch (Exception e) {
                if (proxy != null) {
                    proxyManager.markFailure(proxy.toString());
                }
                e.printStackTrace();
            }
            return null;
        }

        private ProductInfo parseProduct(String html, String url) {
            try {
                Document doc = Jsoup.parse(html);
                
                // Adjust these selectors to the target site's actual markup
                Element nameElement = doc.selectFirst("h1.product-title");
                Element priceElement = doc.selectFirst("span.price");
                Element reviewElement = doc.selectFirst("div.reviews-count");
                Element shopElement = doc.selectFirst("div.shop-name");
                Element imageElement = doc.selectFirst("img.product-image");

                if (nameElement == null || priceElement == null) return null;

                String name = nameElement.text();
                String price = priceElement.text();
                String reviewCount = reviewElement != null ? reviewElement.text() : "0";
                String shopName = shopElement != null ? shopElement.text() : "";
                String imageUrl = imageElement != null ? imageElement.attr("abs:src") : "";

                return new ProductInfo(name, price, reviewCount, shopName, url, imageUrl);
            } catch (Exception e) {
                return null;
            }
        }

        private List<String> extractLinks(String html) {
            List<String> links = new ArrayList<>();
            try {
                Document doc = Jsoup.parse(html);
                Elements elements = doc.select("a[href]");
                for (Element element : elements) {
                    String absUrl = element.attr("abs:href");
                    if (absUrl.contains("/product/") || absUrl.contains("/item/")) {
                        links.add(absUrl);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            return links;
        }
    }

    public void shutdown() {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
        }

        // Persist collected data
        saveData();
        System.out.println("Crawl finished; collected " + collectedProducts.size() + " products");
    }

    private void saveData() {
        List<Product> rows = collectedProducts.stream()
            .map(EcommerceCrawler::toProduct)
            .collect(Collectors.toList());
        FileStorage.saveToCSV(rows, "products.csv");   // save as CSV
        FileStorage.saveToJson(rows, "products.json"); // save as JSON
    }

    // Map the crawl-specific model onto the storage model from Chapter 4
    private static Product toProduct(ProductInfo p) {
        Product product = new Product();
        product.setName(p.getName());
        product.setPrice(p.getPrice());
        product.setUrl(p.getProductUrl());
        product.setImage(p.getImageUrl());
        product.setBrand(p.getShopName());
        product.setCategory("ecommerce");
        return product;
    }

    public static void main(String[] args) {
        EcommerceCrawler crawler = new EcommerceCrawler(5); // 5 worker threads
        crawler.start("https://example-ecommerce.com/products", 100); // crawl up to 100 pages
    }
}

Chapter 8: Legal and Ethical Considerations

8.1 Legal Risks of Crawling

The robots.txt protocol: crawlers should honor a site's robots.txt file, which declares which paths may or may not be crawled. The checker below is deliberately simplified; production code should follow the full robots.txt matching rules:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsTxtChecker {
    public static boolean isAllowed(String targetUrl, String userAgent) {
        try {
            URL url = new URL(targetUrl);
            String robotsUrl = url.getProtocol() + "://" + url.getHost() + "/robots.txt";
            
            HttpURLConnection connection = (HttpURLConnection) new URL(robotsUrl).openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            
            if (connection.getResponseCode() == 200) {
                BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()));
                String line;
                boolean inUserAgentBlock = false;
                boolean allowed = true;
                
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.startsWith("User-agent:")) {
                        String agent = line.substring(11).trim();
                        inUserAgentBlock = agent.equals("*") || agent.equals(userAgent);
                    } else if (inUserAgentBlock && line.startsWith("Disallow:")) {
                        String path = line.substring(9).trim();
                        if (url.getPath().startsWith(path)) {
                            allowed = false;
                        }
                    } else if (inUserAgentBlock && line.startsWith("Allow:")) {
                        String path = line.substring(6).trim();
                        if (url.getPath().startsWith(path)) {
                            allowed = true;
                        }
                    }
                }
                reader.close();
                return allowed;
            }
        } catch (Exception e) {
            // If robots.txt cannot be fetched, default to allowed
            return true;
        }
        return true;
    }

    public static void main(String[] args) {
        boolean allowed = RobotsTxtChecker.isAllowed(
            "https://example.com/admin", 
            "MyCrawler/1.0"
        );
        System.out.println("是否允许爬取: " + allowed);
    }
}

8.2 Ethical Crawling Guidelines

  1. Throttle your crawl rate: avoid placing excessive load on the target site
  2. Respect copyright: do not harvest copyrighted content for commercial use
  3. Protect privacy: do not collect personal information
  4. Be transparent: include contact information in your User-Agent (see the sketch after this list)
  5. Use data responsibly: be clear about the data's purpose and avoid misuse
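
As an example of points 1 and 4, here is a sketch of an identifiable, rate-limited request. The crawler name, homepage, and contact address are placeholders you should replace with your own:

import org.apache.http.client.methods.HttpGet;
import java.util.concurrent.TimeUnit;

public class PoliteCrawler {
    // Identify the crawler and give site operators a way to reach you
    private static final String USER_AGENT =
        "MyCrawler/1.0 (+https://example.com/crawler; contact: crawler@example.com)";

    public static HttpGet politeGet(String url) throws InterruptedException {
        TimeUnit.SECONDS.sleep(2); // fixed delay between requests to limit load
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", USER_AGENT);
        return httpGet;
    }
}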

Chapter 9: Performance Optimization and Monitoring

9.1 Performance Monitoring

import java.util.concurrent.atomic.AtomicLong;

public class CrawlerMonitor {
    private final AtomicLong totalRequests = new AtomicLong(0);
    private final AtomicLong successfulRequests = new AtomicLong(0);
    private final AtomicLong failedRequests = new AtomicLong(0);
    private final AtomicLong totalBytes = new AtomicLong(0);
    private final long startTime = System.currentTimeMillis();

    public void recordRequest(boolean success, long bytes) {
        totalRequests.incrementAndGet();
        if (success) {
            successfulRequests.incrementAndGet();
        } else {
            failedRequests.incrementAndGet();
        }
        totalBytes.addAndGet(bytes);
    }

    public void printStats() {
        long elapsed = Math.max((System.currentTimeMillis() - startTime) / 1000, 1);
        long total = totalRequests.get();
        double successRate = total == 0 ? 0 : (double) successfulRequests.get() / total * 100;
        double speed = total / (double) elapsed;
        
        System.out.println("========== Crawler Stats ==========");
        System.out.println("Elapsed: " + elapsed + " s");
        System.out.println("Total requests: " + total);
        System.out.println("Succeeded: " + successfulRequests.get());
        System.out.println("Failed: " + failedRequests.get());
        System.out.printf("Success rate: %.2f%%%n", successRate);
        System.out.printf("Average speed: %.2f req/s%n", speed);
        System.out.printf("Total data: %.2f MB%n", totalBytes.get() / 1024.0 / 1024.0);
        System.out.println("===================================");
    }
}
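
A sketch of wiring the monitor into a crawler: record each request's outcome as it completes, and print statistics on a fixed schedule. The 30-second interval is arbitrary, and a real crawler would also shut the scheduler down on exit:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MonitorWiring {
    public static void main(String[] args) {
        CrawlerMonitor monitor = new CrawlerMonitor();

        // Print statistics every 30 seconds in the background
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(monitor::printStats, 30, 30, TimeUnit.SECONDS);

        // In the crawl loop, record each request's outcome and payload size
        String html = HttpClientExample.sendGetWithHttpClient("https://example.com");
        monitor.recordRequest(html != null, html != null ? html.length() : 0);
    }
}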

9.2 Memory Optimization

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
import java.lang.ref.WeakReference;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class MemoryOptimizedCrawler {
    // Weak references let cached entries be garbage-collected under memory pressure
    private final Queue<WeakReference<String>> urlCache = new ConcurrentLinkedQueue<>();

    public void processLargeData(List<String> items) {
        // Stream the items instead of materializing intermediate collections
        items.stream()
            .filter(item -> item != null)
            .map(item -> processItem(item))
            .filter(result -> result != null)
            .forEach(this::saveResult);
    }

    private String processItem(String item) {
        // Process a single item
        return item.toUpperCase();
    }

    private void saveResult(String result) {
        // Persist the result
        System.out.println("Saved: " + result);
    }

    // try-with-resources guarantees the client and response are closed
    public void crawlWithResourceManagement(String url) {
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet httpGet = new HttpGet(url);
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // Handle the response
                String html = EntityUtils.toString(response.getEntity());
                // process the html...
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Chapter 10: Summary and Further Learning

10.1 Key Takeaways

  1. Environment: JDK + Maven + Jsoup/HttpClient
  2. Network requests: HttpURLConnection, HttpClient, Selenium
  3. Data parsing: Jsoup selectors, XPath, JSON parsing
  4. Data storage: files and databases
  5. Advanced techniques: multithreading, proxy rotation, request header management
  6. Anti-crawling countermeasures: random delays, IP rotation, User-Agent rotation
  7. Hands-on project: a complete e-commerce crawler
  8. Law and ethics: honor robots.txt and throttle request rates

10.2 Directions for Further Study

  1. Distributed crawling: use Redis and Kafka for distributed scheduling
  2. Machine learning: apply ML to CAPTCHA recognition and anti-crawling strategies
  3. Browser automation: Playwright, Puppeteer
  4. API reverse engineering: analyze WebSocket and GraphQL endpoints
  5. Data cleaning: regular expressions and NLP techniques
  6. Monitoring dashboards: Grafana and Prometheus integration

10.3 Recommended Resources

  • Official docs: Jsoup, Apache HttpClient, and Selenium documentation
  • Open-source projects: WebMagic, Crawler4j, Apache Nutch
  • Books: 《Java网络爬虫实战》 (Java Web Crawling in Action), 《数据抓取的艺术》 (The Art of Data Scraping)
  • Communities: GitHub, Stack Overflow, CSDN

10.4 Continuous Improvement

Crawler technology keeps evolving. It pays to:

  • Keep dependency versions up to date
  • Watch for technical changes on your target sites
  • Study new anti-crawling techniques and their countermeasures
  • Contribute to open-source projects
  • Build your own crawler framework

With this tutorial, you should now have both the fundamentals and the practical techniques of Java crawler development. Remember that crawler development is an ongoing process of learning and refinement; only through continued practice and reflection can you become a true expert. Good luck on your crawling journey!