Introduction: Why Choose Java for Web Crawler Development?
In today's data-driven era, web crawlers have become an essential tool for collecting data from the internet. Java, as a mature, stable, and powerful programming language, offers distinct advantages for crawler development. First, Java has a rich ecosystem, including excellent HTTP and HTML-parsing libraries such as Jsoup, HttpClient, and Selenium. Second, Java's threading model makes large-scale concurrent crawling straightforward and efficient. Most importantly, Java's cross-platform nature and robust exception handling allow crawlers to run reliably in complex network environments.
Compared with Python, Java crawlers tend to perform better for large-scale data collection, especially in long-running enterprise applications, where Java's memory management and garbage collection help avoid memory leaks. In addition, Java's static type checking catches many potential errors at compile time, improving code quality.
Chapter 1: Setting Up the Java Crawler Environment
1.1 Preparing the Development Environment
Before starting Java crawler development, prepare the following:
JDK installation and configuration: Install JDK 8 or later (note that some examples in later chapters use text blocks, which require JDK 15+). Verify the installation with:
java -version
javac -version
IDE choice: IntelliJ IDEA and Eclipse are both solid options. IntelliJ IDEA has stronger support for Java web development, especially for debugging HTTP requests.
Build tool: Use Maven or Gradle to manage project dependencies. This guide uses Maven; add the required dependencies to pom.xml:
<dependencies>
<!-- Jsoup - HTML parsing library -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.15.4</version>
</dependency>
<!-- HttpClient - HTTP client -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.14</version>
</dependency>
<!-- Gson - JSON parsing library -->
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.10.1</version>
</dependency>
<!-- Selenium - browser automation -->
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>4.11.0</version>
</dependency>
</dependencies>
1.2 How Web Crawlers Work
A crawler's basic workflow involves the following components (a minimal skeleton follows the list):
- URL manager: maintains the sets of pending and visited URLs
- Downloader: sends HTTP requests and fetches page content
- Parser: extracts the target data from the HTML
- Data processor: cleans and stores the extracted data
- Scheduler: coordinates the components and controls the crawl rate
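The sketch below shows how these five components fit together in a minimal single-threaded loop. All names here (CrawlerPipeline, download, extractLinks, and so on) are invented for illustration; later chapters fill in real implementations for each stub.
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
// Illustrative pipeline: the queue and set act as the URL manager,
// the loop plays scheduler, and the stubs stand in for the other components.
public class CrawlerPipeline {
    private final Queue<String> pending = new ArrayDeque<>(); // URLs waiting to be crawled
    private final Set<String> visited = new HashSet<>();      // URLs already crawled
    public void crawl(String seedUrl, int maxPages) throws InterruptedException {
        pending.offer(seedUrl);
        while (!pending.isEmpty() && visited.size() < maxPages) {
            String url = pending.poll();
            if (!visited.add(url)) continue;          // skip already-visited URLs
            String html = download(url);              // downloader
            if (html == null) continue;
            for (String link : extractLinks(html)) {  // parser feeds the URL manager
                if (!visited.contains(link)) pending.offer(link);
            }
            store(extractData(html));                 // data processor
            Thread.sleep(1000);                       // scheduler: crude politeness delay
        }
    }
    // Stubs; see Chapters 2-4 for real implementations.
    private String download(String url) { return null; }
    private List<String> extractLinks(String html) { return List.of(); }
    private String extractData(String html) { return html; }
    private void store(String data) { }
}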
Chapter 2: Sending Network Requests in Java
2.1 Sending Requests with HttpURLConnection
HttpURLConnection is the HTTP client built into the Java standard library and requires no extra dependencies:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
public class SimpleHttpClient {
public static String sendGetRequest(String targetURL) {
HttpURLConnection connection = null;
try {
URL url = new URL(targetURL);
connection = (HttpURLConnection) url.openConnection();
// Set the request method
connection.setRequestMethod("GET");
// Set the request headers
connection.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
connection.setRequestProperty("Accept",
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
// Set the timeouts
connection.setConnectTimeout(5000);
connection.setReadTimeout(5000);
// Read the response code
int responseCode = connection.getResponseCode();
System.out.println("Response Code: " + responseCode);
if (responseCode == HttpURLConnection.HTTP_OK) {
BufferedReader in = new BufferedReader(
new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
String inputLine;
StringBuilder response = new StringBuilder();
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
in.close();
return response.toString();
}
} catch (Exception e) {
e.printStackTrace();
} finally {
if (connection != null) {
connection.disconnect();
}
}
return null;
}
public static void main(String[] args) {
String html = sendGetRequest("https://example.com");
System.out.println(html);
}
}
2.2 Sending Requests with Apache HttpClient
Apache HttpClient offers more powerful features, including connection pooling and cookie management (a pooling sketch follows this example):
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.nio.charset.StandardCharsets;
public class HttpClientExample {
public static String sendGetWithHttpClient(String targetURL) {
// Create an HttpClient instance
try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
// Create the GET request
HttpGet httpGet = new HttpGet(targetURL);
// Set the request headers
httpGet.setHeader("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
httpGet.setHeader("Accept",
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
httpGet.setHeader("Accept-Language", "zh-CN,zh;q=0.9");
// Execute the request
try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
// Read the response entity
String html = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
return html;
}
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
public static void main(String[] args) {
String html = sendGetWithHttpClient("https://example.com");
JsoupParser.parseHtml(html); // JsoupParser is defined in Chapter 3
}
}
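The connection pool mentioned above is configured through PoolingHttpClientConnectionManager. A minimal sketch follows; the pool sizes are arbitrary example values, not recommendations.
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
public class PooledHttpClientFactory {
    public static CloseableHttpClient createPooledClient() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(100);          // total connections across all routes
        cm.setDefaultMaxPerRoute(10); // connections per target host
        return HttpClients.custom()
                .setConnectionManager(cm)
                .build();
    }
}
A pooled client can then replace HttpClients.createDefault() in the examples above when many requests target the same hosts.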
2.3 Handling HTTPS and SSL Certificate Issues
When crawling HTTPS sites, you may run into SSL certificate validation problems:
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;
import javax.net.ssl.SSLContext;
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
public class SSLHttpClient {
public static CloseableHttpClient createSSLClient() {
try {
// Build an SSL context that trusts all certificates (acceptable for testing only)
SSLContext sslContext = SSLContexts.custom()
.loadTrustMaterial((chain, authType) -> true)
.build();
// Create the SSL connection socket factory
SSLConnectionSocketFactory socketFactory =
new SSLConnectionSocketFactory(sslContext,
NoopHostnameVerifier.INSTANCE);
return HttpClients.custom()
.setSSLSocketFactory(socketFactory)
.build();
} catch (Exception e) {
e.printStackTrace();
return HttpClients.createDefault();
}
}
}
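A hypothetical usage of createSSLClient() follows; the target URL is a placeholder. Keep in mind that trusting all certificates is acceptable only against test servers, never in production.
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;
import java.nio.charset.StandardCharsets;
public class SSLHttpClientDemo {
    public static void main(String[] args) {
        // The URL stands in for a site with a self-signed certificate.
        try (CloseableHttpClient client = SSLHttpClient.createSSLClient()) {
            HttpGet get = new HttpGet("https://self-signed.example.com");
            try (CloseableHttpResponse response = client.execute(get)) {
                System.out.println(EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}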
Chapter 3: HTML Parsing and Data Extraction
3.1 Jsoup Basics
Jsoup is the most popular HTML parsing library for Java, offering jQuery-like selector syntax:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupParser {
public static void parseHtml(String html) {
// Parse the HTML document
Document doc = Jsoup.parse(html);
// Get an element by ID
Element titleElement = doc.getElementById("title");
if (titleElement != null) {
System.out.println("Title: " + titleElement.text());
}
// Select elements with a CSS selector
Elements newsItems = doc.select("div.news-item");
for (Element item : newsItems) {
// Read the element's text content
String title = item.select("h2").text();
String link = item.select("a").attr("href");
String summary = item.select("p.summary").text();
System.out.println("Title: " + title);
System.out.println("Link: " + link);
System.out.println("Summary: " + summary);
System.out.println("-------------------");
}
// Get all links
Elements links = doc.select("a[href]");
for (Element link : links) {
String url = link.attr("abs:href"); // resolve to an absolute URL
String text = link.text();
System.out.println(text + " -> " + url);
}
}
}
3.2 Advanced Selector Usage
public class AdvancedJsoupExample {
public static void parseComplexHtml(String html) {
Document doc = Jsoup.parse(html);
// Combinator selectors
Elements items = doc.select("div.container > div.row > div.col-md-8");
// Attribute selectors
Elements images = doc.select("img[src$=.jpg]"); // src ending in .jpg
Elements inputs = doc.select("input[type=text]"); // inputs with type=text
// Pseudo-class selectors
Elements firstChild = doc.select("div.item:first-child");
Elements lastChild = doc.select("div.item:last-child");
Elements nthChild = doc.select("div.item:nth-child(2n+1)"); // odd-numbered items
// Text-matching selector
Elements containsText = doc.select("p:contains(Java)");
// Regular-expression attribute matching
Elements regexMatch = doc.select("div[class~=^news-\\d+$]"); // class matching news-<digits>
// Extract data and build objects
for (Element item : doc.select("div.product")) {
Product product = new Product();
product.setName(item.select("h3").text());
product.setPrice(item.select(".price").text());
product.setUrl(item.select("a").attr("abs:href"));
product.setImage(item.select("img").attr("abs:src"));
// Read the data-* attributes
Element specs = item.select("div.specs").first();
if (specs != null) {
product.setBrand(specs.attr("data-brand"));
product.setCategory(specs.attr("data-category"));
}
// Store or further process the data
saveProduct(product);
}
}
private static void saveProduct(Product product) {
// Implement the storage logic here
System.out.println("Saving product: " + product);
}
}
class Product {
private String name;
private String price;
private String url;
private String image;
private String brand;
private String category;
// getters and setters
public String getName() { return name; }
public void setName(String name) { this.name = name; }
public String getPrice() { return price; }
public void setPrice(String price) { this.price = price; }
public String getUrl() { return url; }
public void setUrl(String url) { this.url = url; }
public String getImage() { return image; }
public void setImage(String image) { this.image = image; }
public String getBrand() { return brand; }
public void setBrand(String brand) { this.brand = brand; }
public String getCategory() { return category; }
public void setCategory(String category) { this.category = category; }
@Override
public String toString() {
return "Product{name='" + name + "', price='" + price + "', brand='" + brand + "'}";
}
}
3.3 Handling Dynamically Loaded Content
For content rendered by JavaScript, use Selenium:
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;
import java.time.Duration;
import java.util.List;
public class SeleniumCrawler {
private WebDriver driver;
private WebDriverWait wait;
public SeleniumCrawler() {
// Point to your local ChromeDriver binary
System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless"); // headless mode
options.addArguments("--disable-gpu");
options.addArguments("--no-sandbox");
options.addArguments("--disable-dev-shm-usage");
this.driver = new ChromeDriver(options);
this.wait = new WebDriverWait(driver, Duration.ofSeconds(10));
}
public void crawlDynamicContent(String url) {
try {
driver.get(url);
// Wait for a specific element to appear
wait.until(ExpectedConditions.presenceOfElementLocated(
By.cssSelector("div.loaded-content")));
// Read the dynamically loaded content
List<WebElement> items = driver.findElements(By.cssSelector("div.item"));
for (WebElement item : items) {
String title = item.findElement(By.cssSelector("h3")).getText();
String price = item.findElement(By.cssSelector(".price")).getText();
System.out.println(title + " - " + price);
}
// Scroll to the bottom to trigger lazy loading (the cast is needed because WebDriver itself has no executeScript)
((JavascriptExecutor) driver).executeScript("window.scrollTo(0, document.body.scrollHeight);");
Thread.sleep(2000); // wait for the new content to load
// Dismiss pop-ups if present
if (isElementPresent(By.cssSelector("div.modal"))) {
driver.findElement(By.cssSelector("button.close")).click();
}
} catch (Exception e) {
e.printStackTrace();
} finally {
driver.quit();
}
}
private boolean isElementPresent(By locator) {
try {
driver.findElement(locator);
return true;
} catch (Exception e) {
return false;
}
}
}
Chapter 4: Data Storage and Management
4.1 File Storage
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import java.io.*;
import java.util.List;
public class FileStorage {
// Save as CSV
public static void saveToCSV(List<Product> products, String filePath) {
try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
// Write the header row
writer.write("Name,Price,URL,Image,Brand,Category\n");
for (Product product : products) {
// CSV format: comma-separated fields, quoted so embedded commas survive
String line = String.format("\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n",
escapeCsv(product.getName()),
escapeCsv(product.getPrice()),
escapeCsv(product.getUrl()),
escapeCsv(product.getImage()),
escapeCsv(product.getBrand()),
escapeCsv(product.getCategory()));
writer.write(line);
}
System.out.println("数据已保存到: " + filePath);
} JSON存储
public static void saveToJson(List<Product> products, String filePath) {
try (FileWriter writer = new FileWriter(filePath)) {
Gson gson = new GsonBuilder().setPrettyPrinting().create();
gson.toJson(products, writer);
System.out.println("JSON数据已保存到: " + filePath);
} catch (IOException e) {
e.printStackTrace();
}
}
// Save as JSONL (one JSON object per line)
public static void saveToJsonl(List<Product> products, String filePath) {
try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
Gson gson = new Gson();
for (Product product : products) {
writer.write(gson.toJson(product) + "\n");
}
System.out.println("JSONL数据已保存到: " + filePath);
} catch (IOException e) {
e.printStackTrace();
}
}
// Helper: escape embedded double quotes for CSV
private static String escapeCsv(String field) {
if (field == null) return "";
return field.replace("\"", "\"\"");
}
}
4.2 Database Storage
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
public class DatabaseStorage {
private static final String DB_URL = "jdbc:mysql://localhost:3306/crawler_db?useSSL=false&serverTimezone=UTC";
private static final String USER = "root";
private static final String PASSWORD = "password";
// Create the database table
public static void createTable() {
String sql = """
CREATE TABLE IF NOT EXISTS products (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(255) NOT NULL,
price VARCHAR(50),
url VARCHAR(500),
image VARCHAR(500),
brand VARCHAR(100),
category VARCHAR(100),
crawl_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE KEY unique_url (url)
)
""";
try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
Statement stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("表创建成功");
} catch (SQLException e) {
e.printStackTrace();
}
}
// Batch-insert data (upsert on duplicate URL)
public static void batchInsert(List<Product> products) {
String sql = "INSERT INTO products (name, price, url, image, brand, category) " +
"VALUES (?, ?, ?, ?, ?, ?) " +
"ON DUPLICATE KEY UPDATE " +
"name=VALUES(name), price=VALUES(price), image=VALUES(image), " +
"brand=VALUES(brand), category=VALUES(category)";
try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
PreparedStatement pstmt = conn.prepareStatement(sql)) {
int batchSize = 0;
for (Product product : products) {
pstmt.setString(1, product.getName());
pstmt.setString(2, product.getPrice());
pstmt.setString(3, product.getUrl());
pstmt.setString(4, product.getImage());
pstmt.setString(5, product.getBrand());
pstmt.setString(6, product.getCategory());
pstmt.addBatch();
batchSize++;
// Execute the batch every 100 records
if (batchSize % 100 == 0) {
pstmt.executeBatch();
}
}
// Flush the remaining batch
pstmt.executeBatch();
System.out.println("Batch insert finished; " + products.size() + " records processed");
} catch (SQLException e) {
e.printStackTrace();
}
}
// Query data
public static List<Product> queryProducts(String brand) {
String sql = "SELECT * FROM products WHERE brand = ?";
List<Product> products = new ArrayList<>();
try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
PreparedStatement pstmt = conn.prepareStatement(sql)) {
pstmt.setString(1, brand);
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
Product product = new Product();
product.setName(rs.getString("name"));
product.setPrice(rs.getString("price"));
product.setUrl(rs.getString("url"));
product.setImage(rs.getString("image"));
product.setBrand(rs.getString("brand"));
product.setCategory(rs.getString("category"));
products.add(product);
}
} catch (SQLException e) {
e.printStackTrace();
}
return products;
}
}
Chapter 5: Advanced Crawler Techniques
5.1 Multithreaded Crawler
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
public class MultiThreadedCrawler {
private final ExecutorService executor;
private final BlockingQueue<String> urlQueue;
private final ConcurrentHashMap<String, Boolean> visitedUrls;
private final AtomicInteger counter;
private final int maxThreads;
public MultiThreadedCrawler(int maxThreads) {
this.maxThreads = maxThreads;
this.executor = Executors.newFixedThreadPool(maxThreads + 1); // +1 slot so the monitor thread does not starve a worker
this.urlQueue = new LinkedBlockingQueue<>();
this.visitedUrls = new ConcurrentHashMap<>();
this.counter = new AtomicInteger(0);
}
public void startCrawling(String startUrl) {
// Seed the queue with the start URL
urlQueue.offer(startUrl);
// Start the monitor thread
executor.submit(this::monitorQueue);
// Start the worker threads
for (int i = 0; i < maxThreads; i++) {
executor.submit(this::crawlWorker);
}
// Let the crawl run for a while, then stop (simplified; real code should track completion)
try {
Thread.sleep(5000); // wait 5 seconds
shutdown();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
private void crawlWorker() {
while (!Thread.currentThread().isInterrupted()) {
try {
String url = urlQueue.poll(1, TimeUnit.SECONDS);
if (url == null) continue;
// Skip URLs we have already visited
if (visitedUrls.putIfAbsent(url, true) != null) {
continue;
}
// Fetch the page
String html = sendRequest(url);
if (html != null) {
// Parse out newly discovered URLs
List<String> newUrls = parseUrls(html);
newUrls.forEach(newUrl -> {
if (!visitedUrls.containsKey(newUrl)) {
urlQueue.offer(newUrl);
}
});
// Extract data
List<Product> products = parseProducts(html);
// Save the data
saveProducts(products);
int count = counter.incrementAndGet();
System.out.println("Pages crawled: " + count + " - " + url);
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
} catch (Exception e) {
e.printStackTrace();
}
}
}
private void monitorQueue() {
while (!Thread.currentThread().isInterrupted()) {
try {
Thread.sleep(10000); // print status every 10 seconds
System.out.println("Queue size: " + urlQueue.size() +
", visited: " + visitedUrls.size() +
", count: " + counter.get());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
}
}
}
private String sendRequest(String url) {
// Send the HTTP request
return HttpClientExample.sendGetWithHttpClient(url);
}
private List<String> parseUrls(String html) {
// Extract URLs from the page
Document doc = Jsoup.parse(html);
Elements links = doc.select("a[href]");
List<String> urls = new ArrayList<>();
for (Element link : links) {
String absUrl = link.attr("abs:href");
if (absUrl.startsWith("http")) {
urls.add(absUrl);
}
}
return urls;
}
private List<Product> parseProducts(String html) {
// Parse products from the page (placeholder)
return new ArrayList<>(); // simplified implementation
}
private void saveProducts(List<Product> products) {
// Persist the data
if (!products.isEmpty()) {
// e.g., batch-save to a database or file
}
}
public void shutdown() {
executor.shutdown();
try {
if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
executor.shutdownNow();
}
} catch (InterruptedException e) {
executor.shutdownNow();
Thread.currentThread().interrupt();
}
System.out.println("爬虫已停止");
}
}
5.2 Rotating Proxy IPs
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
public class ProxyManager {
private final List<Proxy> proxyList;
private final Map<String, Proxy> proxyUsage;
private final Map<String, Integer> failureCount;
private final int maxFailures;
public ProxyManager(int maxFailures) {
this.proxyList = new ArrayList<>();
this.proxyUsage = new ConcurrentHashMap<>();
this.failureCount = new ConcurrentHashMap<>();
this.maxFailures = maxFailures;
}
public void addProxy(String host, int port) {
proxyList.add(new Proxy(host, port));
}
public Proxy getProxy() {
if (proxyList.isEmpty()) return null;
// Simple round-robin selection
int index = (int) (System.currentTimeMillis() % proxyList.size());
Proxy proxy = proxyList.get(index);
// Check whether the proxy is still considered healthy
if (failureCount.getOrDefault(proxy.toString(), 0) >= maxFailures) {
// Skip unavailable proxies (note: assumes at least one healthy proxy remains)
return getProxy();
}
return proxy;
}
public void markSuccess(String proxyKey) {
// Reset the failure count
failureCount.remove(proxyKey);
}
public void markFailure(String proxyKey) {
// Increment the failure count
failureCount.merge(proxyKey, 1, Integer::sum);
// Remove the proxy once it crosses the threshold
if (failureCount.get(proxyKey) >= maxFailures) {
proxyList.removeIf(p -> p.toString().equals(proxyKey));
System.out.println("Removed dead proxy: " + proxyKey);
}
}
public static class Proxy {
public final String host;
public final int port;
public Proxy(String host, int port) {
this.host = host;
this.port = port;
}
@Override
public String toString() {
return host + ":" + port;
}
}
}
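ProxyManager only tracks which proxies are healthy; to actually route a request through one, attach it to the request via RequestConfig. A minimal sketch, assuming plain HTTP proxies; the timeout values are arbitrary examples.
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
public class ProxyRequestHelper {
    // Attach a proxy from the ProxyManager to an individual request.
    public static void applyProxy(HttpGet httpGet, ProxyManager proxyManager) {
        ProxyManager.Proxy proxy = proxyManager.getProxy();
        if (proxy == null) return; // no usable proxy; send the request directly
        RequestConfig config = RequestConfig.custom()
                .setProxy(new HttpHost(proxy.host, proxy.port))
                .setConnectTimeout(5000)
                .setSocketTimeout(5000)
                .build();
        httpGet.setConfig(config);
    }
}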
5.3 Managing Request Headers
import org.apache.http.client.methods.HttpGet;
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;
public class HeaderManager {
private static final List<String> USER_AGENTS = Arrays.asList(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15"
);
private static final List<String> ACCEPT_LANGUAGES = Arrays.asList(
"zh-CN,zh;q=0.9,en;q=0.8",
"en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
"ja-JP,ja;q=0.9,en;q=0.8"
);
public static Map<String, String> getRandomHeaders() {
Map<String, String> headers = new HashMap<>();
Random random = ThreadLocalRandom.current();
headers.put("User-Agent", USER_AGENTS.get(random.nextInt(USER_AGENTS.size())));
headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
headers.put("Accept-Language", ACCEPT_LANGUAGES.get(random.nextInt(ACCEPT_LANGUAGES.size())));
headers.put("Accept-Encoding", "gzip, deflate, br");
headers.put("Connection", "keep-alive");
headers.put("Upgrade-Insecure-Requests", "1");
return headers;
}
public static void applyHeaders(HttpGet httpGet) {
Map<String, String> headers = getRandomHeaders();
headers.forEach(httpGet::setHeader);
}
}
Chapter 6: Countering Anti-Crawler Measures
6.1 Recognizing Anti-Crawler Mechanisms
Common anti-crawler mechanisms include (a response-detection sketch follows the list):
- Rate limiting: caps on the number of requests per unit of time
- User-Agent detection: flagging requests that do not look like a browser
- IP bans: blocking IPs with anomalous traffic patterns
- CAPTCHAs: requiring human verification
- JavaScript obfuscation: generating content dynamically in the browser
- Request header validation: checking Referer, Cookie, and similar headers
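Rate limiting typically surfaces as HTTP 429, 403, or 503 responses. The sketch below inspects a response and suggests a back-off; the 60-second fallback is an arbitrary choice, not a standard.
import org.apache.http.Header;
import org.apache.http.HttpResponse;
public class RateLimitDetector {
    // Returns a suggested wait in seconds, or 0 if the response looks normal.
    public static long suggestedBackoffSeconds(HttpResponse response) {
        int status = response.getStatusLine().getStatusCode();
        if (status == 429 || status == 403 || status == 503) {
            // Honor the Retry-After header when the server provides one.
            Header retryAfter = response.getFirstHeader("Retry-After");
            if (retryAfter != null) {
                try {
                    return Long.parseLong(retryAfter.getValue());
                } catch (NumberFormatException ignored) {
                    // Retry-After may also be an HTTP date; fall through to the default.
                }
            }
            return 60; // arbitrary default back-off
        }
        return 0;
    }
}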
6.2 Basic Countermeasures
import java.util.Random;
import java.util.concurrent.TimeUnit;
public class AntiCrawlerStrategy {
private final Random random = new Random();
private final int minDelay;
private final int maxDelay;
public AntiCrawlerStrategy(int minDelay, int maxDelay) {
this.minDelay = minDelay;
this.maxDelay = maxDelay;
}
// Random delay
public void randomDelay() {
int delay = minDelay + random.nextInt(maxDelay - minDelay + 1);
try {
TimeUnit.SECONDS.sleep(delay);
System.out.println("Slept " + delay + " seconds");
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
// Simulate human behavior
public void simulateHumanBehavior() {
// Random mouse movement (Selenium only)
// Random page scrolling
// Random pauses
randomDelay();
}
// Handle CAPTCHAs
public void handleCaptcha(String captchaUrl) {
// 1. Screenshot the CAPTCHA
// 2. Send it to a third-party solving service (e.g., 超级鹰 or 云打码)
// 3. Or solve it manually
System.out.println("CAPTCHA needs handling: " + captchaUrl);
}
// Proxy (IP) rotation strategy
public String rotateProxy(ProxyManager proxyManager) {
ProxyManager.Proxy proxy = proxyManager.getProxy();
if (proxy != null) {
return proxy.toString();
}
return null;
}
}
6.3 Advanced Countermeasures
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
public class AdvancedAntiCrawler {
// Dealing with obfuscated JavaScript
public static String deobfuscateJavaScript(String obfuscatedCode) {
// Implement the deobfuscation logic here,
// e.g., decoding eval() wrappers or reassembling concatenated strings
return obfuscatedCode; // simplified placeholder
}
// Extracting a dynamic token
public static String extractDynamicToken(WebDriver driver) {
// Read the token from a JavaScript variable or the DOM
JavascriptExecutor js = (JavascriptExecutor) driver;
String token = (String) js.executeScript(
"return window.dynamicToken || document.querySelector('meta[name=\"token\"]').content;");
return token;
}
// Capturing WebSocket data
public static void handleWebSocketData(WebDriver driver) {
// Monkey-patch window.WebSocket via injected JS (the Chrome DevTools Protocol is an alternative)
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("""
const originalWebSocket = window.WebSocket;
window.WebSocket = function(url, protocols) {
const ws = new originalWebSocket(url, protocols);
ws.addEventListener('message', function(event) {
console.log('WebSocket message:', event.data);
// Stash the data in a global variable so Java can read it
window.lastWebSocketMessage = event.data;
});
return ws;
};
""");
}
// Spoofing the canvas fingerprint
public static void spoofCanvasFingerprint(WebDriver driver) {
// Perturb canvas output so fingerprint checks cannot reliably flag the automation
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("""
const originalGetContext = HTMLCanvasElement.prototype.getContext;
HTMLCanvasElement.prototype.getContext = function(type) {
const context = originalGetContext.call(this, type);
if (type === '2d') {
const originalFillText = context.fillText;
context.fillText = function(text, x, y) {
// Add slight random noise
const noise = Math.random() * 0.1;
return originalFillText.call(this, text, x + noise, y + noise);
};
}
return context;
};
""");
}
}
Chapter 7: Hands-On Project: An E-Commerce Product Crawler
7.1 Requirements
Goal: crawl product information from an e-commerce site, including name, price, review count, and shop name, with support for multithreading, proxy rotation, and data storage.
7.2 Full Implementation
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
// Product data model
class ProductInfo {
private String name;
private String price;
private String reviewCount;
private String shopName;
private String productUrl;
private String imageUrl;
// Constructor, getters, toString
public ProductInfo(String name, String price, String reviewCount, String shopName, String productUrl, String imageUrl) {
this.name = name;
this.price = price;
this.reviewCount = reviewCount;
this.shopName = shopName;
this.productUrl = productUrl;
this.imageUrl = imageUrl;
}
public String getName() { return name; }
public String getPrice() { return price; }
public String getReviewCount() { return reviewCount; }
public String getShopName() { return shopName; }
public String getProductUrl() { return productUrl; }
public String getImageUrl() { return imageUrl; }
@Override
public String toString() {
return "ProductInfo{" +
"name='" + name + '\'' +
", price='" + price + '\'' +
", reviewCount='" + reviewCount + '\'' +
", shopName='" + shopName + '\'' +
", productUrl='" + productUrl + '\'' +
", imageUrl='" + imageUrl + '\'' +
'}';
}
}
// Crawler main class
public class EcommerceCrawler {
private final int maxThreads;
private final ExecutorService executor;
private final BlockingQueue<String> urlQueue;
private final ConcurrentHashMap<String, Boolean> visitedUrls;
private final AtomicInteger counter;
private final List<ProductInfo> collectedProducts;
private final ProxyManager proxyManager;
private final AntiCrawlerStrategy antiCrawler;
public EcommerceCrawler(int maxThreads) {
this.maxThreads = maxThreads;
this.executor = Executors.newFixedThreadPool(maxThreads + 1); // +1 slot so the monitor thread does not starve a worker
this.urlQueue = new LinkedBlockingQueue<>();
this.visitedUrls = new ConcurrentHashMap<>();
this.counter = new AtomicInteger(0);
this.collectedProducts = new ArrayList<>();
this.proxyManager = new ProxyManager(3);
this.antiCrawler = new AntiCrawlerStrategy(2, 5); // 2-5 second delays
// Add a couple of test proxies (placeholder addresses)
proxyManager.addProxy("192.168.1.100", 8080);
proxyManager.addProxy("192.168.1.101", 8080);
}
public void start(String startUrl, int maxPages) {
urlQueue.offer(startUrl);
// Start the worker threads
for (int i = 0; i < maxThreads; i++) {
executor.submit(new CrawlTask(maxPages));
}
// Monitor thread
executor.submit(() -> {
while (!Thread.currentThread().isInterrupted()) {
try {
Thread.sleep(10000);
System.out.printf("进度: %d/%d, 队列: %d, 收集: %d%n",
counter.get(), maxPages, urlQueue.size(), collectedProducts.size());
if (counter.get() >= maxPages) {
shutdown();
break;
}
} catch (InterruptedException e) {
break;
}
}
});
}
private class CrawlTask implements Runnable {
private final int maxPages;
public CrawlTask(int maxPages) {
this.maxPages = maxPages;
}
@Override
public void run() {
while (!Thread.currentThread().isInterrupted() && counter.get() < maxPages) {
try {
String url = urlQueue.poll(1, TimeUnit.SECONDS);
if (url == null || visitedUrls.putIfAbsent(url, true) != null) {
continue;
}
// Random delay between requests
antiCrawler.randomDelay();
// Send the request
String html = sendRequest(url);
if (html == null) continue;
// Parse the product information
ProductInfo product = parseProduct(html, url);
if (product != null) {
synchronized (collectedProducts) {
collectedProducts.add(product);
}
}
// Extract new links
List<String> newUrls = extractLinks(html);
for (String newUrl : newUrls) {
if (!visitedUrls.containsKey(newUrl) && urlQueue.size() < 1000) {
urlQueue.offer(newUrl);
}
}
int count = counter.incrementAndGet();
System.out.println("完成: " + count + " - " + url);
} catch (Exception e) {
e.printStackTrace();
}
}
}
private String sendRequest(String url) {
ProxyManager.Proxy proxy = proxyManager.getProxy();
// Simplified: a real crawler would route the request through the proxy (see section 5.2)
try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
HttpGet httpGet = new HttpGet(url);
HeaderManager.applyHeaders(httpGet);
try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
if (response.getStatusLine().getStatusCode() == 200) {
return EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
}
}
} catch (Exception e) {
if (proxy != null) {
proxyManager.markFailure(proxy.toString());
}
e.printStackTrace();
}
return null;
}
private ProductInfo parseProduct(String html, String url) {
try {
Document doc = Jsoup.parse(html);
// Adjust these selectors to the target site's actual structure
Element nameElement = doc.selectFirst("h1.product-title");
Element priceElement = doc.selectFirst("span.price");
Element reviewElement = doc.selectFirst("div.reviews-count");
Element shopElement = doc.selectFirst("div.shop-name");
Element imageElement = doc.selectFirst("img.product-image");
if (nameElement == null || priceElement == null) return null;
String name = nameElement.text();
String price = priceElement.text();
String reviewCount = reviewElement != null ? reviewElement.text() : "0";
String shopName = shopElement != null ? shopElement.text() : "";
String imageUrl = imageElement != null ? imageElement.attr("abs:src") : "";
return new ProductInfo(name, price, reviewCount, shopName, url, imageUrl);
} catch (Exception e) {
return null;
}
}
private List<String> extractLinks(String html) {
List<String> links = new ArrayList<>();
try {
Document doc = Jsoup.parse(html);
Elements elements = doc.select("a[href]");
for (Element element : elements) {
String absUrl = element.attr("abs:href");
if (absUrl.contains("/product/") || absUrl.contains("/item/")) {
links.add(absUrl);
}
}
} catch (Exception e) {
e.printStackTrace();
}
return links;
}
}
public void shutdown() {
executor.shutdown();
try {
if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
executor.shutdownNow();
}
} catch (InterruptedException e) {
executor.shutdownNow();
}
// Save the collected data
saveData();
System.out.println("Crawl finished; collected " + collectedProducts.size() + " products");
}
private void saveData() {
// Convert the collected items and save them as both CSV and JSON
List<Product> products = collectedProducts.stream()
.map(EcommerceCrawler::toProduct)
.collect(Collectors.toList());
FileStorage.saveToCSV(products, "products.csv");
FileStorage.saveToJson(products, "products.json");
}
private static Product toProduct(ProductInfo p) {
Product product = new Product();
product.setName(p.getName());
product.setPrice(p.getPrice());
product.setUrl(p.getProductUrl());
product.setImage(p.getImageUrl());
product.setBrand(p.getShopName());
product.setCategory("ecommerce"); // fixed category label for this crawl
return product;
}
public static void main(String[] args) {
EcommerceCrawler crawler = new EcommerceCrawler(5); // 5 threads
crawler.start("https://example-ecommerce.com/products", 100); // crawl 100 pages
}
}
Chapter 8: Legal and Ethical Considerations
8.1 Legal Risks of Crawling
The robots.txt protocol: crawlers should honor a site's robots.txt file, which declares which paths may or may not be crawled.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
public class RobotsTxtChecker {
public static boolean isAllowed(String targetUrl, String userAgent) {
try {
URL url = new URL(targetUrl);
String robotsUrl = url.getProtocol() + "://" + url.getHost() + "/robots.txt";
HttpURLConnection connection = (HttpURLConnection) new URL(robotsUrl).openConnection();
connection.setRequestMethod("GET");
connection.setConnectTimeout(5000);
connection.setReadTimeout(5000);
if (connection.getResponseCode() == 200) {
BufferedReader reader = new BufferedReader(
new InputStreamReader(connection.getInputStream()));
String line;
boolean inUserAgentBlock = false;
boolean allowed = true;
while ((line = reader.readLine()) != null) {
line = line.trim();
if (line.startsWith("User-agent:")) {
String agent = line.substring(11).trim();
inUserAgentBlock = agent.equals("*") || agent.equals(userAgent);
} else if (inUserAgentBlock && line.startsWith("Disallow:")) {
String path = line.substring(9).trim();
if (url.getPath().startsWith(path)) {
allowed = false;
}
} else if (inUserAgentBlock && line.startsWith("Allow:")) {
String path = line.substring(6).trim();
if (url.getPath().startsWith(path)) {
allowed = true;
}
}
}
reader.close();
return allowed;
}
} catch (Exception e) {
// If robots.txt cannot be fetched, default to allowing the crawl
return true;
}
return true;
}
public static void main(String[] args) {
boolean allowed = RobotsTxtChecker.isAllowed(
"https://example.com/admin",
"MyCrawler/1.0"
);
System.out.println("是否允许爬取: " + allowed);
}
}
8.2 Ethical Crawling Guidelines
- Throttle your crawl rate: avoid putting excessive load on the target site
- Respect copyright: do not harvest copyrighted content for commercial use
- Protect privacy: do not collect personal information
- Be transparent: include contact details in your User-Agent (see the sketch after this list)
- Use data responsibly: be clear about its purpose and avoid misuse
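For the transparency guideline, a self-identifying User-Agent might look like the following sketch; the crawler name, URL, and contact address are placeholders to replace with your own.
import org.apache.http.client.methods.HttpGet;
public class PoliteCrawlerHeaders {
    // A User-Agent that lets site operators identify and contact you.
    private static final String POLITE_USER_AGENT =
            "MyCrawler/1.0 (+https://example.com/crawler; contact@example.com)";
    public static void applyPoliteHeaders(HttpGet httpGet) {
        httpGet.setHeader("User-Agent", POLITE_USER_AGENT);
    }
}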
Chapter 9: Performance Optimization and Monitoring
9.1 Performance Monitoring
import java.util.concurrent.atomic.AtomicLong;
public class CrawlerMonitor {
private final AtomicLong totalRequests = new AtomicLong(0);
private final AtomicLong successfulRequests = new AtomicLong(0);
private final AtomicLong failedRequests = new AtomicLong(0);
private final AtomicLong totalBytes = new AtomicLong(0);
private final long startTime = System.currentTimeMillis();
public void recordRequest(boolean success, long bytes) {
totalRequests.incrementAndGet();
if (success) {
successfulRequests.incrementAndGet();
} else {
failedRequests.incrementAndGet();
}
totalBytes.addAndGet(bytes);
}
public void printStats() {
long elapsed = Math.max(1, (System.currentTimeMillis() - startTime) / 1000);
long total = Math.max(1, totalRequests.get()); // guard against division by zero
double successRate = (double) successfulRequests.get() / total * 100;
double speed = totalRequests.get() / (double) elapsed;
System.out.println("========== Crawler statistics ==========");
System.out.println("Elapsed: " + elapsed + " s");
System.out.println("Total requests: " + totalRequests.get());
System.out.println("Successful: " + successfulRequests.get());
System.out.println("Failed: " + failedRequests.get());
System.out.printf("Success rate: %.2f%%%n", successRate);
System.out.printf("Average speed: %.2f requests/s%n", speed);
System.out.printf("Total data: %.2f MB%n", totalBytes.get() / 1024.0 / 1024.0);
System.out.println("========================================");
}
}
9.2 Memory Optimization
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
import java.lang.ref.WeakReference;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
public class MemoryOptimizedCrawler {
// Use weak references so cached URLs can be reclaimed under memory pressure
private final Queue<WeakReference<String>> urlCache = new ConcurrentLinkedQueue<>();
public void processLargeData(List<String> items) {
// Stream the items instead of materializing all results at once
items.stream()
.filter(item -> item != null)
.map(item -> processItem(item))
.filter(result -> result != null)
.forEach(this::saveResult);
}
private String processItem(String item) {
// Process a single item
return item.toUpperCase();
}
private void saveResult(String result) {
// Persist the result
System.out.println("Saved: " + result);
}
// Use try-with-resources to guarantee the client and response are closed
public void crawlWithResourceManagement(String url) {
try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
HttpGet httpGet = new HttpGet(url);
try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
// Handle the response
String html = EntityUtils.toString(response.getEntity());
// process the html here...
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Chapter 10: Summary and Further Learning
10.1 Key Takeaways
- Environment: JDK + Maven + Jsoup/HttpClient
- Network requests: HttpURLConnection, HttpClient, Selenium
- Data parsing: Jsoup selectors, XPath, JSON parsing
- Data storage: files and databases
- Advanced techniques: multithreading, proxy rotation, request header management
- Anti-crawler countermeasures: random delays, IP rotation, User-Agent rotation
- Hands-on project: a complete e-commerce crawler
- Law and ethics: respect robots.txt and throttle your requests
10.2 Where to Go Next
- Distributed crawling: use Redis and Kafka for distributed scheduling
- Machine learning: CAPTCHA recognition and adaptive anti-crawler strategies
- Browser automation: Playwright, Puppeteer
- API reverse engineering: analyzing WebSocket and GraphQL endpoints
- Data cleaning: regular expressions and NLP techniques
- Monitoring dashboards: Grafana and Prometheus integration
10.3 Recommended Resources
- Official documentation: Jsoup, Apache HttpClient, Selenium
- Open-source projects: WebMagic, Crawler4j, Apache Nutch
- Books: 《Java网络爬虫实战》, 《数据抓取的艺术》
- Communities: GitHub, Stack Overflow, CSDN
10.4 Continuous Improvement
Crawler technology evolves constantly. We recommend that you:
- Keep dependency versions up to date
- Watch for technical changes on your target sites
- Study new anti-crawler techniques and how to respond to them
- Contribute to open-source projects
- Build your own crawler framework
Having worked through this tutorial, you should have a solid grasp of the fundamentals and practical techniques of Java crawling. Remember that crawler development is a process of continuous learning and iteration; steady practice and reflection are what make a true expert. Good luck!
