Introduction

Data mining is the process of extracting implicit, previously unknown, and potentially useful information and knowledge from large volumes of data. MATLAB, a powerful platform for scientific computing and engineering simulation, has become an important tool in data mining thanks to its rich toolboxes, intuitive programming environment, and strong visualization capabilities. This guide systematically introduces MATLAB for data mining, from foundational theory to advanced practice, helping readers master the complete workflow from data preprocessing to model evaluation.

1. Data Mining Fundamentals and the MATLAB Environment

1.1 Core Concepts of Data Mining

Data mining typically covers the following main tasks:

  • Classification: predicting discrete class labels (e.g., spam detection)
  • Regression: predicting continuous values (e.g., house-price prediction)
  • Clustering: grouping data into similar clusters (e.g., customer segmentation)
  • Association rule mining: discovering relationships between items (e.g., market-basket analysis)
  • Anomaly detection: identifying data points that deviate from normal patterns

1.2 MATLAB Toolboxes for Data Mining

MATLAB offers several toolboxes relevant to data mining:

  • Statistics and Machine Learning Toolbox: classical machine learning algorithms
  • Deep Learning Toolbox: neural networks and deep learning
  • Text Analytics Toolbox: text data processing
  • Database Toolbox: database connectivity

1.3 Setting Up the MATLAB Environment

% List installed toolboxes
ver

% Set the working directory
cd 'C:\MATLAB_Data_Mining'

% Load an example dataset
load fisheriris  % built-in Fisher iris dataset
disp('Dataset loaded');

2. Data Preprocessing Techniques

2.1 Data Cleaning

Data cleaning is the first step of data mining: it handles missing values, outliers, and duplicate records.
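
The subsections below cover missing values and outliers but do not revisit duplicates, so here is a minimal sketch of duplicate-row removal; the example matrix is an illustrative assumption, not data from the text:

% Remove duplicate rows while keeping the first occurrence in original order
data = [1 2; 3 4; 1 2; 5 6];                   % rows 1 and 3 are duplicates
[dedup, ia] = unique(data, 'rows', 'stable');  % ia indexes the rows that were kept
fprintf('Removed %d duplicate row(s)\n', size(data,1) - size(dedup,1));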

2.1.1 Handling Missing Values

% Example data with missing values
data = [1, 2, NaN; 4, 5, 6; 7, NaN, 9; 10, 11, 12];

% Method 1: drop rows that contain missing values
clean_data1 = rmmissing(data);

% Method 2: fill missing values with the column mean
% (work on a copy so the original data stays intact for method 3)
filled = data;
col_means = mean(data, 'omitnan');
for i = 1:size(data,2)
    missing_idx = isnan(filled(:,i));
    filled(missing_idx,i) = col_means(i);
end
clean_data2 = filled;

% Method 3: linear interpolation along each column
clean_data3 = fillmissing(data, 'linear');

disp('Missing-value handling complete');

2.1.2 Outlier Detection

% Detect outliers with the Z-score method
data = [10, 20, 30, 40, 50, 1000];  % data containing an outlier
z_scores = (data - mean(data)) / std(data);
outliers = find(abs(z_scores) > 3);  % |z| > 3 is treated as an outlier

% Box-plot method
figure;
boxplot(data);
title('Outlier detection with a box plot');

% Isolation forest (iforest; Statistics and Machine Learning Toolbox, R2021b+)
X = randn(100,2);            % generate normal data
X = [X; 10, 10; -10, -10];   % add outliers
[forest, tf, scores] = iforest(X, 'NumLearners', 100, ...
                               'ContaminationFraction', 0.02);
outlier_idx = find(tf);      % tf flags anomalies; higher scores are more anomalous

2.2 Standardization and Normalization

% Standardization (Z-scores)
data = [1, 2, 3; 4, 5, 6; 7, 8, 9];
z_score = (data - mean(data)) ./ std(data);

% Min-max normalization to the [0,1] range
normalized = (data - min(data)) ./ (max(data) - min(data));

% Using built-in functions
data = [10, 20, 30; 40, 50, 60; 70, 80, 90];
zscore_data = zscore(data);
minmax_data = normalize(data, 'range');

2.3 Feature Engineering

Feature engineering is a key step for improving model performance.

2.3.1 Feature Selection

% Select features by their correlation with the label
load fisheriris;
X = meas;     % feature matrix
Y = species;  % labels

% Correlation between each feature and the numeric-coded label
corr_matrix = corrcoef([X, grp2idx(Y)]);
feature_corr = corr_matrix(1:4,5);

% Feature weighting via neighborhood component analysis (fscnca)
nca = fscnca(X, grp2idx(Y), 'Verbose', 1);
feature_weights = nca.FeatureWeights;  % larger weight = more relevant feature

% Dimensionality reduction with PCA
[coeff, score, latent] = pca(X);
cumulative_variance = cumsum(latent) / sum(latent);
retain_components = find(cumulative_variance >= 0.95, 1);
X_pca = score(:,1:retain_components);

2.3.2 Feature Construction

% Constructing date features
dates = {'2023-01-01', '2023-01-02', '2023-01-03'};
date_vec = datetime(dates);

% Extract year, month, day, and day of week
% (short names avoid shadowing the years/months/days duration functions)
yr = year(date_vec);
mo = month(date_vec);
dy = day(date_vec);
dow = weekday(date_vec);

% Create an interaction feature
data = [1, 2; 3, 4; 5, 6];
interaction = data(:,1) .* data(:,2);  % product feature

3. Classification Algorithms in Practice

3.1 K-Nearest Neighbors (KNN)

% Load the dataset
load fisheriris;
X = meas;
Y = species;

% Split into training and test sets
cv = cvpartition(Y, 'HoldOut', 0.3);
train_idx = training(cv);
test_idx = test(cv);

X_train = X(train_idx,:);
Y_train = Y(train_idx);
X_test = X(test_idx,:);
Y_test = Y(test_idx);

% Train a KNN model
k = 5;
mdl = fitcknn(X_train, Y_train, 'NumNeighbors', k);

% Predict
Y_pred = predict(mdl, X_test);

% Evaluate (labels are cell arrays of char, so compare with strcmp)
accuracy = mean(strcmp(Y_pred, Y_test));
fprintf('KNN classification accuracy: %.2f%%\n', accuracy*100);

% Confusion matrix
figure;
confusionchart(Y_test, Y_pred);
title('KNN confusion matrix');

3.2 Decision Trees and Random Forests

% Decision-tree classification
tree_mdl = fitctree(X_train, Y_train);
Y_pred_tree = predict(tree_mdl, X_test);
accuracy_tree = mean(strcmp(Y_pred_tree, Y_test));

% Visualize the tree
view(tree_mdl, 'Mode', 'graph');

% Random forest (enable OOB importance so we can inspect it below)
rf_mdl = TreeBagger(50, X_train, Y_train, 'Method', 'classification', ...
                    'OOBPredictorImportance', 'on');
Y_pred_rf = predict(rf_mdl, X_test);  % returns a cell array of char

% Evaluate the random forest
accuracy_rf = mean(strcmp(Y_pred_rf, Y_test));
fprintf('Random forest accuracy: %.2f%%\n', accuracy_rf*100);

% Feature importance (OOB permuted predictor importance)
imp = rf_mdl.OOBPermutedPredictorDeltaError;
figure;
bar(imp);
title('Random forest feature importance');
xlabel('Feature');
ylabel('Importance score');

3.3 Support Vector Machines (SVM)

% SVM classification. fitcsvm is binary-only, so for the three-class iris
% problem wrap an SVM template in an error-correcting output code model.
t = templateSVM('KernelFunction', 'rbf');
svm_mdl = fitcecoc(X_train, Y_train, 'Learners', t);
Y_pred_svm = predict(svm_mdl, X_test);
accuracy_svm = mean(strcmp(Y_pred_svm, Y_test));

% Grid search over hyperparameters
% (for brevity this tunes on the test set; in practice use a separate
% validation set or cross-validation)
C = [0.1, 1, 10];
sigma = [0.1, 1, 10];
best_acc = 0;
best_C = 0;
best_sigma = 0;

for i = 1:length(C)
    for j = 1:length(sigma)
        t = templateSVM('KernelFunction', 'rbf', ...
                        'BoxConstraint', C(i), 'KernelScale', sigma(j));
        mdl = fitcecoc(X_train, Y_train, 'Learners', t);
        Y_pred = predict(mdl, X_test);
        acc = mean(strcmp(Y_pred, Y_test));
        if acc > best_acc
            best_acc = acc;
            best_C = C(i);
            best_sigma = sigma(j);
        end
    end
end

fprintf('Best parameters: C=%.2f, sigma=%.2f, accuracy=%.2f%%\n', ...
        best_C, best_sigma, best_acc*100);

4. Regression Algorithms in Practice

4.1 Linear Regression

% Create example data
rng(1);  % fix the random seed
X = 1:100;
Y = 2*X + 3 + 5*randn(1,100);  % linear relationship plus noise

% Split the data
train_idx = 1:70;
test_idx = 71:100;
X_train = X(train_idx)';
Y_train = Y(train_idx)';
X_test = X(test_idx)';
Y_test = Y(test_idx)';

% Fit a linear regression model
mdl = fitlm(X_train, Y_train);
disp(mdl);

% Predict
Y_pred = predict(mdl, X_test);

% Evaluate
mse = mean((Y_test - Y_pred).^2);
rmse = sqrt(mse);
r2 = 1 - sum((Y_test - Y_pred).^2) / sum((Y_test - mean(Y_test)).^2);

fprintf('Linear regression performance:\n');
fprintf('MSE: %.2f\n', mse);
fprintf('RMSE: %.2f\n', rmse);
fprintf('R^2: %.2f\n', r2);

% Visualize
figure;
scatter(X_train, Y_train, 'b', 'filled');
hold on;
scatter(X_test, Y_test, 'r', 'filled');
plot(X_test, Y_pred, 'g-', 'LineWidth', 2);
legend('Training data', 'Test data', 'Predictions');
title('Linear regression predictions');
xlabel('X');
ylabel('Y');

4.2 Decision-Tree Regression

% Decision-tree regression
tree_reg = fitrtree(X_train, Y_train);
Y_pred_tree = predict(tree_reg, X_test);

% Evaluate
mse_tree = mean((Y_test - Y_pred_tree).^2);
rmse_tree = sqrt(mse_tree);
r2_tree = 1 - sum((Y_test - Y_pred_tree).^2) / sum((Y_test - mean(Y_test)).^2);

fprintf('Decision-tree regression performance:\n');
fprintf('MSE: %.2f\n', mse_tree);
fprintf('RMSE: %.2f\n', rmse_tree);
fprintf('R^2: %.2f\n', r2_tree);

% Pruning
pruned_tree = prune(tree_reg, 'Level', 5);
Y_pred_pruned = predict(pruned_tree, X_test);
mse_pruned = mean((Y_test - Y_pred_pruned).^2);
fprintf('MSE after pruning: %.2f\n', mse_pruned);

4.3 Ensemble Regression

% Random-forest regression
rf_reg = TreeBagger(100, X_train, Y_train, 'Method', 'regression');
Y_pred_rf = predict(rf_reg, X_test);  % regression forests return a numeric vector

% Evaluate
mse_rf = mean((Y_test - Y_pred_rf).^2);
rmse_rf = sqrt(mse_rf);
r2_rf = 1 - sum((Y_test - Y_pred_rf).^2) / sum((Y_test - mean(Y_test)).^2);

fprintf('Random-forest regression performance:\n');
fprintf('MSE: %.2f\n', mse_rf);
fprintf('RMSE: %.2f\n', rmse_rf);
fprintf('R^2: %.2f\n', r2_rf);

% Gradient-boosted regression
gbm = fitrensemble(X_train, Y_train, 'Method', 'LSBoost', ...
                   'NumLearningCycles', 100);
Y_pred_gbm = predict(gbm, X_test);
mse_gbm = mean((Y_test - Y_pred_gbm).^2);
fprintf('Gradient boosting MSE: %.2f\n', mse_gbm);

5. Clustering Algorithms in Practice

5.1 K-means Clustering

% Generate example data: three Gaussian clusters
rng(1);
data = [randn(100,2)*0.5 + 1;
        randn(100,2)*0.5 + 3;
        randn(100,2)*0.5 + [3, 0]];  % implicit expansion shifts each column

% Choose K with the elbow method
max_k = 10;
inertia = zeros(max_k,1);
for k = 1:max_k
    [~, ~, sumd] = kmeans(data, k, 'Replicates', 5);
    inertia(k) = sum(sumd);
end

figure;
plot(1:max_k, inertia, '-o');
title('Elbow method for choosing K');
xlabel('K');
ylabel('Within-cluster sum of squares');

% Cluster with K = 3
k = 3;
[idx, centroids] = kmeans(data, k, 'Replicates', 10);

% Visualize
figure;
gscatter(data(:,1), data(:,2), idx);
hold on;
plot(centroids(:,1), centroids(:,2), 'kx', 'MarkerSize', 15, 'LineWidth', 3);
title('K-means clustering');
legend('Cluster 1', 'Cluster 2', 'Cluster 3', 'Centroids');

5.2 Hierarchical Clustering

% Compute pairwise distances and the linkage
dist_matrix = pdist(data);
linkage_matrix = linkage(dist_matrix, 'ward');

% Inspect the dendrogram to choose the number of clusters
figure;
dendrogram(linkage_matrix);
title('Hierarchical clustering dendrogram');
xlabel('Sample index');
ylabel('Distance');

% Cut the tree into k clusters
k = 3;
idx_hierarchical = cluster(linkage_matrix, 'maxclust', k);

% Visualize
figure;
gscatter(data(:,1), data(:,2), idx_hierarchical);
title('Hierarchical clustering');

5.3 DBSCAN Density-Based Clustering

% dbscan requires the Statistics and Machine Learning Toolbox (R2019a+)
% Generate non-spherical data with noise
rng(1);
data = [randn(100,2)*0.5;
        2*randn(100,2) + [5,5];
        2*randn(100,2) + [5,-5];
        2*randn(10,2)];  % noise points around the origin

% Run DBSCAN
epsilon = 1.5;
minpts = 5;
idx = dbscan(data, epsilon, minpts);  % noise points are labeled -1

% Visualize
figure;
gscatter(data(:,1), data(:,2), idx);
title('DBSCAN clustering');
xlabel('Feature 1');
ylabel('Feature 2');

6. Association Rule Mining

6.1 Implementing the Apriori Algorithm

% Build a transaction dataset
transactions = {
    {'milk', 'bread', 'butter'}
    {'milk', 'bread'}
    {'milk', 'eggs'}
    {'bread', 'butter', 'eggs'}
    {'milk', 'bread', 'butter', 'eggs'}
};

% Convert the transactions to a binary transaction-by-item matrix
items = unique([transactions{:}]);
num_items = length(items);
num_trans = length(transactions);

binary_matrix = zeros(num_trans, num_items);
for i = 1:num_trans
    for j = 1:num_items
        if any(strcmp(transactions{i}, items{j}))
            binary_matrix(i,j) = 1;
        end
    end
end

% A from-scratch Apriori implementation. Note that in MATLAB a function
% must live in its own file (apriori.m) or at the end of a script (R2016b+).
function [rules, supports] = apriori(transactions, min_support, min_confidence)
    % transactions: binary transaction-by-item matrix
    % min_support: minimum support
    % min_confidence: minimum confidence

    [num_trans, num_items] = size(transactions);

    % Frequent 1-itemsets (as a column vector so the 'rows'
    % comparisons in the prune step work for k = 2)
    item_counts = sum(transactions);
    support_1 = item_counts / num_trans;
    frequent_1 = find(support_1 >= min_support)';

    % Grow frequent k-itemsets
    k = 2;
    frequent_k = frequent_1;
    all_frequent = {};

    while ~isempty(frequent_k)
        % Generate candidate k-itemsets
        if k == 2
            candidates = nchoosek(frequent_1, k);
        else
            % Join step: merge itemsets sharing the same (k-2)-prefix
            candidates = [];
            for i = 1:size(frequent_k,1)
                for j = i+1:size(frequent_k,1)
                    if all(frequent_k(i,1:k-2) == frequent_k(j,1:k-2))
                        new_candidate = [frequent_k(i,:), frequent_k(j,k-1)];
                        candidates = [candidates; new_candidate];
                    end
                end
            end
        end

        % Prune step and support counting
        valid_candidates = [];
        for i = 1:size(candidates,1)
            candidate = candidates(i,:);
            % All (k-1)-subsets must themselves be frequent
            is_frequent = true;
            for j = 1:k
                subset = candidate;
                subset(j) = [];
                if ~ismember(subset, frequent_k, 'rows')
                    is_frequent = false;
                    break;
                end
            end

            if is_frequent
                % Count support
                support = sum(all(transactions(:,candidate),2)) / num_trans;
                if support >= min_support
                    valid_candidates = [valid_candidates; candidate];
                    all_frequent{end+1} = {candidate, support};
                end
            end
        end

        frequent_k = valid_candidates;
        k = k + 1;
    end

    % Generate association rules
    % (for simplicity, only prefix => suffix splits of each itemset
    % are enumerated, not every possible subset)
    rules = {};
    supports = [];

    for i = 1:length(all_frequent)
        itemset = all_frequent{i}{1};
        support = all_frequent{i}{2};

        if length(itemset) > 1
            for j = 1:length(itemset)-1
                antecedent = itemset(1:j);
                consequent = itemset(j+1:end);

                % Confidence = support(itemset) / support(antecedent)
                antecedent_support = sum(all(transactions(:,antecedent),2)) / num_trans;
                confidence = support / antecedent_support;

                if confidence >= min_confidence
                    rules{end+1} = {antecedent, consequent};
                    supports(end+1) = support;
                end
            end
        end
    end
end

% Run Apriori
min_support = 0.4;
min_confidence = 0.7;
[rules, supports] = apriori(binary_matrix, min_support, min_confidence);

% Display the results
fprintf('Association rules found:\n');
for i = 1:length(rules)
    antecedent = items(rules{i}{1});
    consequent = items(rules{i}{2});
    fprintf('Rule: {%s} => {%s}, support: %.2f\n', ...
            strjoin(antecedent, ','), strjoin(consequent, ','), supports(i));
end

7. Anomaly Detection

7.1 Statistical Anomaly Detection

% Generate normal data plus anomalies
rng(1);
normal_data = randn(1000,2) * 0.5;
anomalies = [5*randn(10,2) + [10,10]; 5*randn(10,2) + [-10,-10]];
data = [normal_data; anomalies];

% Mahalanobis distance of every point relative to the normal sample
% (mahal returns *squared* distances, so a chi-square threshold applies)
mahal_dist = mahal(data, data(1:1000,:));

% Threshold at the 95th percentile of the chi-square(2) distribution
threshold = chi2inv(0.95, 2);
outliers = find(mahal_dist > threshold);

% Visualize
figure;
scatter(data(:,1), data(:,2), 20, 'b', 'filled');
hold on;
scatter(data(outliers,1), data(outliers,2), 50, 'r', 'filled');
title('Anomaly detection with Mahalanobis distance');
legend('Normal points', 'Anomalies');

7.2 Isolation-Forest Anomaly Detection

% Isolation forest (iforest; Statistics and Machine Learning Toolbox, R2021b+)
X = [normal_data; anomalies];
[forest, tf, scores] = iforest(X, 'NumLearners', 100, ...
                               'ContaminationFraction', 0.1);

% tf flags the anomalies; higher scores are more anomalous
outlier_idx = find(tf);

% Visualize
figure;
scatter(X(:,1), X(:,2), 20, 'b', 'filled');
hold on;
scatter(X(outlier_idx,1), X(outlier_idx,2), 50, 'r', 'filled');
title('Anomaly detection with an isolation forest');

8. Model Evaluation and Optimization

8.1 Cross-Validation

% Evaluate a model with 5-fold cross-validation
% (X and Y are the iris features and labels loaded in Section 3)
cv = cvpartition(Y, 'KFold', 5);
accuracies = zeros(5,1);

for i = 1:5
    train_idx = training(cv,i);
    test_idx = test(cv,i);

    X_train = X(train_idx,:);
    Y_train = Y(train_idx);
    X_test = X(test_idx,:);
    Y_test = Y(test_idx);

    % Train the model
    mdl = fitcknn(X_train, Y_train, 'NumNeighbors', 5);
    Y_pred = predict(mdl, X_test);

    % Fold accuracy (labels are cellstr, so compare with strcmp)
    accuracies(i) = mean(strcmp(Y_pred, Y_test));
end

fprintf('5-fold CV accuracy: %.2f%% ± %.2f%%\n', ...
        mean(accuracies)*100, std(accuracies)*100);

8.2 Hyperparameter Tuning

% Hyperparameter search with Bayesian optimization.
% bayesopt minimizes its objective and passes the parameters as a
% one-row table, so the objective returns the misclassification rate.
% (As with apriori above, the function belongs at the end of the script
% or in its own file.)
function err = objective_function(params, X_train, Y_train, X_test, Y_test)
    mdl = fitcknn(X_train, Y_train, 'NumNeighbors', params.k, ...
                  'Distance', char(params.distance_metric));
    Y_pred = predict(mdl, X_test);
    err = 1 - mean(strcmp(Y_pred, Y_test));
end

% Define the search space
vars = [optimizableVariable('k', [1, 20], 'Type', 'integer');
        optimizableVariable('distance_metric', ...
                            {'euclidean', 'cityblock'}, 'Type', 'categorical')];

% Run the optimization
results = bayesopt(@(p) objective_function(p, X_train, Y_train, X_test, Y_test), ...
                   vars, ...
                   'AcquisitionFunctionName', 'expected-improvement-plus', ...
                   'MaxObjectiveEvaluations', 30, ...
                   'Verbose', 0);

% Best parameters (XAtMinObjective is a one-row table)
best_params = results.XAtMinObjective;
fprintf('Best parameters: k=%d, distance=%s\n', ...
        best_params.k, char(best_params.distance_metric));

8.3 Model Ensembles

% Build an ensemble model
ensemble_mdl = fitcensemble(X_train, Y_train, 'Method', 'AdaBoostM2', ...
                            'NumLearningCycles', 100);

% Predict
Y_pred_ensemble = predict(ensemble_mdl, X_test);
accuracy_ensemble = mean(strcmp(Y_pred_ensemble, Y_test));

% Compare ensemble methods that support multiclass problems
% (AdaBoostM1, LogitBoost and GentleBoost are binary-only)
methods = {'Bag', 'AdaBoostM2', 'RUSBoost'};
accuracies_ensemble = zeros(length(methods),1);

for i = 1:length(methods)
    mdl = fitcensemble(X_train, Y_train, 'Method', methods{i}, ...
                       'NumLearningCycles', 100);
    Y_pred = predict(mdl, X_test);
    accuracies_ensemble(i) = mean(strcmp(Y_pred, Y_test));
end

% Visual comparison
figure;
bar(accuracies_ensemble);
set(gca, 'XTickLabel', methods);
title('Accuracy of different ensemble methods');
ylabel('Accuracy');

9. Case Study: Customer Churn Prediction

9.1 Problem Definition

The task is to predict whether a telecom customer will churn next month, which is a binary classification problem.

9.2 Data Preparation

% Simulate customer data
rng(1);
num_customers = 1000;

% Features
age = 18 + randi([0, 60], num_customers, 1);
tenure = randi([1, 72], num_customers, 1);  % months with the company
monthly_charge = 50 + 50*rand(num_customers, 1);
total_charges = monthly_charge .* tenure;
contract_type = randi([1, 3], num_customers, 1);  % 1: monthly, 2: yearly, 3: two-year
payment_method = randi([1, 3], num_customers, 1);
paperless_billing = randi([0, 1], num_customers, 1);
tech_support = randi([0, 1], num_customers, 1);

% Target: churn (1: churned, 0: retained),
% generated from the features to mimic realistic behavior
churn_prob = 0.1 + 0.002*(72-tenure) + 0.001*(monthly_charge-50) + ...
             0.1*(contract_type==1) + 0.05*(payment_method==3) + ...
             0.1*(paperless_billing==1) - 0.15*(tech_support==1);
churn_prob = min(max(churn_prob, 0), 1);
churn = rand(num_customers,1) < churn_prob;

% Assemble a table
data = table(age, tenure, monthly_charge, total_charges, ...
             contract_type, payment_method, paperless_billing, ...
             tech_support, churn, ...
             'VariableNames', {'Age', 'Tenure', 'MonthlyCharge', ...
                               'TotalCharges', 'ContractType', ...
                               'PaymentMethod', 'PaperlessBilling', ...
                               'TechSupport', 'Churn'});

% Split into training and test sets
cv = cvpartition(data.Churn, 'HoldOut', 0.3);
train_data = data(training(cv),:);
test_data = data(test(cv),:);

% Features and labels
X_train = train_data(:,1:end-1);
Y_train = train_data.Churn;
X_test = test_data(:,1:end-1);
Y_test = test_data.Churn;

9.3 Model Training and Evaluation

% Train a random forest
rf_mdl = TreeBagger(200, X_train, Y_train, 'Method', 'classification', ...
                    'OOBPrediction', 'on', 'OOBPredictorImportance', 'on');

% Predict (TreeBagger returns class names as a cell array of char)
Y_pred = predict(rf_mdl, X_test);
Y_pred = strcmp(Y_pred, '1');  % convert back to logical

% Evaluate
accuracy = mean(Y_pred == Y_test);
fprintf('Random forest accuracy: %.2f%%\n', accuracy*100);

% Confusion matrix
figure;
confusionchart(Y_test, Y_pred);
title('Churn prediction confusion matrix');

% Feature importance (OOB permuted predictor importance)
imp = rf_mdl.OOBPermutedPredictorDeltaError;
figure;
bar(imp);
set(gca, 'XTickLabel', X_train.Properties.VariableNames);
title('Feature importance');
ylabel('Importance score');

% ROC curve and AUC (the second score column corresponds to class '1')
[~, Y_pred_scores] = predict(rf_mdl, X_test);
[fpr, tpr, ~, auc] = perfcurve(Y_test, Y_pred_scores(:,2), true);
fprintf('AUC: %.3f\n', auc);

% Plot the ROC curve
figure;
plot(fpr, tpr, 'LineWidth', 2);
xlabel('False positive rate');
ylabel('True positive rate');
title(sprintf('ROC curve (AUC = %.3f)', auc));

9.4 Model Deployment and Prediction

% Save the model
save('customer_churn_model.mat', 'rf_mdl');

% Load the model and score a new customer
load('customer_churn_model.mat');

% New customer data (predictors only, so no Churn column)
new_customer = table(45, 24, 75, 1800, 2, 1, 1, 1, ...
                     'VariableNames', {'Age', 'Tenure', 'MonthlyCharge', ...
                                       'TotalCharges', 'ContractType', ...
                                       'PaymentMethod', 'PaperlessBilling', ...
                                       'TechSupport'});

% Predict
[new_pred, new_scores] = predict(rf_mdl, new_customer);
fprintf('New customer prediction: %s\n', char(new_pred));
fprintf('Churn probability: %.2f%%\n', new_scores(2)*100);

10. Advanced Topic: Deep Learning for Data Mining

10.1 Neural-Network Classification

% Prepare the data
load fisheriris;
X = meas;
Y = grp2idx(species);  % numeric class labels

% Split the data
cv = cvpartition(Y, 'HoldOut', 0.3);
train_idx = training(cv);
test_idx = test(cv);

X_train = X(train_idx,:);
Y_train = Y(train_idx);
X_test = X(test_idx,:);
Y_test = Y(test_idx);

% Define the network (Deep Learning Toolbox; featureInputLayer needs R2020b+)
layers = [
    featureInputLayer(4)
    fullyConnectedLayer(10)
    reluLayer
    fullyConnectedLayer(3)   % one output per class (iris has 3 classes)
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', ...
    'MaxEpochs', 100, ...
    'MiniBatchSize', 32, ...
    'ValidationData', {X_test, categorical(Y_test)}, ...
    'Plots', 'training-progress');

% Train the network
net = trainNetwork(X_train, categorical(Y_train), layers, options);

% Predict
Y_pred = classify(net, X_test);
accuracy = mean(Y_pred == categorical(Y_test));
fprintf('Neural network accuracy: %.2f%%\n', accuracy*100);

10.2 Autoencoders for Anomaly Detection

% Generate data
rng(1);
normal_data = randn(1000,10) * 0.5;
anomalies = 5*randn(50,10) + 10;
X = [normal_data; anomalies];

% Define the autoencoder (a 5-unit bottleneck reconstructs 10 inputs)
layers = [
    featureInputLayer(10)
    fullyConnectedLayer(5)
    reluLayer
    fullyConnectedLayer(10)
    regressionLayer];

options = trainingOptions('adam', ...
    'MaxEpochs', 50, ...
    'MiniBatchSize', 32, ...
    'Plots', 'training-progress');

% Train the autoencoder to reproduce its input
autoencoder = trainNetwork(X, X, layers, options);

% Reconstruction error per sample
X_reconstructed = predict(autoencoder, X);
reconstruction_error = sum((X - X_reconstructed).^2, 2);

% Threshold at the 95th percentile of the normal samples' error
threshold = prctile(reconstruction_error(1:1000), 95);
anomaly_idx = find(reconstruction_error > threshold);

% Visualize
figure;
scatter(1:length(reconstruction_error), reconstruction_error, 20, 'b', 'filled');
hold on;
scatter(anomaly_idx, reconstruction_error(anomaly_idx), 50, 'r', 'filled');
yline(threshold, 'r--', 'LineWidth', 2);
title('Autoencoder-based anomaly detection');
xlabel('Sample index');
ylabel('Reconstruction error');
legend('Normal points', 'Anomalies', 'Threshold');

11. Best Practices for Data Mining Projects

11.1 Project Workflow

  1. Problem definition: clarify the business goal and evaluation metrics
  2. Data collection: gather the relevant data sources
  3. Data exploration: exploratory data analysis (EDA; see the sketch after this list)
  4. Data preprocessing: cleaning, transformation, feature engineering
  5. Model selection: choose algorithms suited to the problem
  6. Model training: train and tune
  7. Model evaluation: use cross-validation and multiple metrics
  8. Model deployment: integrate into the production environment
  9. Monitoring and maintenance: track model performance over time
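
EDA (step 3) is the one stage the earlier sections do not demonstrate. A minimal sketch, assuming the simulated churn table data from Section 9 (a real project would substitute its own table):

% Quick exploratory data analysis on a table
summary(data)  % per-variable types, ranges, and missing counts

% Distribution of a single feature
figure;
histogram(data.Tenure);
title('Tenure distribution');

% Pairwise relationships between numeric features
figure;
[~, ~, bigAx] = plotmatrix(data{:, {'Age', 'Tenure', 'MonthlyCharge', 'TotalCharges'}});
title(bigAx, 'Pairwise scatter plots of numeric features');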

11.2 Code Organization and Documentation

% Example project layout
% ├── data/
% │   ├── raw/          % raw data
% │   ├── processed/    % processed data
% │   └── external/     % external data
% ├── src/
% │   ├── preprocessing.m
% │   ├── feature_engineering.m
% │   ├── modeling.m
% │   └── evaluation.m
% ├── models/
% │   └── saved_models/
% ├── results/
% │   ├── figures/
% │   └── reports/
% └── main.m

% Example main program (the helper functions are project-specific stubs)
function main()
    % 1. Load data
    data = load_data();

    % 2. Preprocess
    [X, Y] = preprocess_data(data);

    % 3. Feature engineering
    X = feature_engineering(X);

    % 4. Train the model
    model = train_model(X, Y);

    % 5. Evaluate
    results = evaluate_model(model, X, Y);

    % 6. Save the results
    save_results(results);
end

11.3 Performance Optimization Tips

% Speed up with parallel computing (requires Parallel Computing Toolbox)
parpool;  % start a parallel pool if one is not already running

% Parallelize a loop (preallocate so parfor can slice the output)
result = zeros(100,1);
parfor i = 1:100
    % expensive computation
    result(i) = complex_calculation(data(i,:));
end

% GPU acceleration (if a GPU is available)
if gpuDeviceCount > 0
    X_gpu = gpuArray(X);
    % compute on the GPU, then bring the result back to the CPU
    result = gather(complex_calculation(X_gpu));
end

% Prefer built-in, vectorized operations over explicit loops
% Bad: elementwise double loop
for i = 1:size(data,1)
    for j = 1:size(data,2)
        data(i,j) = data(i,j) * 2;
    end
end

% Good: vectorized
data = data * 2;

12. Summary and Outlook

MATLAB provides a comprehensive toolset and environment for data mining, covering the complete workflow from data preprocessing to model deployment. Through this guide, readers should now have:

  1. Core skills: data cleaning, feature engineering, model training and evaluation
  2. Core algorithms: classification, regression, clustering, association rules, anomaly detection
  3. Advanced techniques: ensemble learning, deep learning, model optimization
  4. Practical abilities: end-to-end project workflow, code organization, performance optimization

Future Directions

  • Automated machine learning (AutoML): MATLAB's AutoML tooling will further simplify model selection and tuning
  • Explainable AI: improving model interpretability to meet regulatory and business needs
  • Edge computing: deploying models to edge devices for real-time prediction
  • Multimodal data mining: combining text, images, time series, and other data sources

Learning Resources

With continued learning and practice, readers can apply MATLAB data mining techniques to a wide range of real-world scenarios, solve complex business problems, and create greater value.