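Introduction: Why Does Data Science Require Programming Skills?

Data science now underpins recommendation systems, medical diagnosis, financial risk control, and autonomous driving. Doing this work in practice requires programming skills. Python and R are the two most widely used languages in the field, and each has its own strengths and ecosystem.

The Core Role of Programming in Data Science

  • Data acquisition and cleaning: pull data from different sources and handle missing values and outliers
  • Data exploration and visualization: find patterns and relationships in the data
  • Statistical analysis and modeling: build predictive models and perform statistical inference
  • Machine learning and deep learning: implement more complex algorithms and models
  • Reporting and automation: generate reports and automate workflows

This article starts from the basics and walks through the core Python and R skills used in data science, with complete code examples and practical case studies built on simulated data.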

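Part 1: Python Data Science Fundamentals

1.1 Python Environment Setup and Basic Syntax

Environment Setup

First, install Python and the required data science libraries. The Anaconda distribution is recommended because it bundles most of the commonly used libraries.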

# 安装Anaconda(推荐)
# 下载地址:https://www.anaconda.com/products/distribution

# 创建新的conda环境
conda create -n datascience python=3.9
conda activate datascience

# 安装核心数据科学库
conda install numpy pandas matplotlib seaborn scikit-learn
pip install jupyter

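Python Basic Syntax Review

Python is one of the most popular languages in data science; its concise syntax and extensive libraries make it a common first choice.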

# 基础数据结构
numbers = [1, 2, 3, 4, 5]  # 列表
dictionary = {'name': 'Alice', 'age': 25}  # 字典
tuple_data = (1, 2, 3)  # 元组

# 控制流
for i in range(5):
    if i % 2 == 0:
        print(f"{i} is even")
    else:
        print(f"{i} is odd")

# 函数定义
def calculate_mean(numbers):
    """计算列表的平均值"""
    return sum(numbers) / len(numbers)

# 使用示例
data = [10, 20, 30, 40, 50]
mean_value = calculate_mean(data)
print(f"平均值: {mean_value}")

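1.2 NumPy: The Foundation of Scientific Computing

NumPy is the foundational library for scientific computing in Python. It provides efficient array operations and mathematical functions.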

import numpy as np

# 创建数组
arr = np.array([1, 2, 3, 4, 5])
print(f"数组: {arr}")
print(f"形状: {arr.shape}")
print(f"数据类型: {arr.dtype}")

# 基本运算
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(f"加法: {arr1 + arr2}")
print(f"乘法: {arr1 * arr2}")
print(f"平方: {arr1 ** 2}")

# 统计函数
data = np.random.randn(100)  # 生成100个标准正态分布的随机数
print(f"均值: {data.mean():.2f}")
print(f"标准差: {data.std():.2f}")
print(f"中位数: {np.median(data):.2f}")

# 数组操作
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"矩阵形状: {matrix.shape}")
print(f"转置:\n{matrix.T}")
print(f"矩阵乘法:\n{matrix @ matrix.T}")

# 广播机制
arr = np.array([1, 2, 3])
broadcast_result = arr + 10  # 每个元素都加10
print(f"广播结果: {broadcast_result}")
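
The broadcasting example above only adds a scalar to a 1-D array. As a minimal sketch of the same rule applied to 2-D data, the snippet below standardizes each column of a small matrix and combines a column vector with a row vector. The array values and the names `X`, `row_offsets`, and `grid` are illustrative assumptions, not part of the original example.

```python
import numpy as np

# 3 samples x 2 features (values chosen only for illustration)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_mean = X.mean(axis=0)  # shape (2,): one mean per column
col_std = X.std(axis=0)    # shape (2,): one standard deviation per column

# (3, 2) combined with (2,): the 1-D arrays are stretched across all rows
X_standardized = (X - col_mean) / col_std

# (3, 1) combined with (2,): both operands are stretched to shape (3, 2)
row_offsets = np.array([[100.0], [200.0], [300.0]])
grid = row_offsets + np.array([1.0, 2.0])

print(f"Standardized column means: {X_standardized.mean(axis=0)}")
print(f"Grid shape: {grid.shape}")
```

NumPy compares shapes from the trailing dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.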

1.3 Pandas: Data Processing and Analysis

Pandas is the core Python library for data processing and analysis. Its two main data structures are DataFrame and Series.

import pandas as pd
import numpy as np

# 创建DataFrame
data = {
    '姓名': ['张三', '李四', '王五', '赵六'],
    '年龄': [25, 30, 35, 28],
    '城市': ['北京', '上海', '广州', '深圳'],
    '薪资': [10000, 15000, 12000, 18000]
}
df = pd.DataFrame(data)
print("原始DataFrame:")
print(df)

# 数据查看
print(f"\n前3行:\n{df.head(3)}")
print(f"\n数据形状: {df.shape}")
print(f"\n数据类型:\n{df.dtypes}")
print(f"\n基本统计:\n{df.describe()}")

# 数据选择与过滤
print(f"\n选择特定列:\n{df['姓名']}")
print(f"\n选择多列:\n{df[['姓名', '薪资']]}")
print(f"\n条件过滤:\n{df[df['年龄'] > 28]}")
print(f"\n复合条件:\n{df[(df['年龄'] > 25) & (df['薪资'] > 12000)]}")

# 数据处理
# 添加新列
df['薪资等级'] = df['薪资'].apply(lambda x: '高' if x > 13000 else '中')
print(f"\n添加新列:\n{df}")

# 处理缺失值
df_missing = df.copy()
df_missing.loc[1, '年龄'] = np.nan  # 引入缺失值
print(f"\n有缺失值的DataFrame:\n{df_missing}")
print(f"\n缺失值统计:\n{df_missing.isnull().sum()}")
df_filled = df_missing.fillna({'年龄': df_missing['年龄'].mean()})
print(f"\n填充缺失值后:\n{df_filled}")

# 数据分组与聚合
df_grouped = df.groupby('城市')['薪资'].agg(['mean', 'count', 'max'])
print(f"\n按城市分组统计:\n{df_grouped}")

# 数据排序
df_sorted = df.sort_values('薪资', ascending=False)
print(f"\n按薪资降序排列:\n{df_sorted}")

# 数据合并
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})
merged = pd.merge(df1, df2, on='key', how='inner')
print(f"\n合并结果:\n{merged}")
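
The merge above uses `how='inner'`, which keeps only keys present in both tables. As a small sketch, reusing the same two toy frames, the snippet below compares inner, left, and outer joins; the `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# inner: only keys found in both frames (B, C)
inner = pd.merge(df1, df2, on='key', how='inner')

# left: every key from df1; value2 is NaN where df2 has no match (A)
left = pd.merge(df1, df2, on='key', how='left')

# outer: union of keys; _merge shows left_only / right_only / both
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```

Checking the `_merge` column after an outer join is a quick way to spot keys that fail to match before committing to an inner join.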

1.4 Matplotlib and Seaborn: Data Visualization

Visualization is a key tool in data science for understanding data and presenting results.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# 设置中文字体(解决中文显示问题)
plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号

# 创建示例数据
np.random.seed(42)
data = {
    '年龄': np.random.randint(20, 60, 100),
    '薪资': np.random.randint(5000, 25000, 100),
    '城市': np.random.choice(['北京', '上海', '广州', '深圳'], 100),
    '工作年限': np.random.randint(1, 20, 100)
}
df = pd.DataFrame(data)

# 1. 散点图
plt.figure(figsize=(10, 6))
plt.scatter(df['工作年限'], df['薪资'], alpha=0.6)
plt.title('工作年限与薪资关系')
plt.xlabel('工作年限')
plt.ylabel('薪资')
plt.grid(True, alpha=0.3)
plt.show()

# 2. 直方图
plt.figure(figsize=(10, 6))
plt.hist(df['薪资'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
plt.title('薪资分布直方图')
plt.xlabel('薪资')
plt.ylabel('频数')
plt.show()

# 3. 箱线图(使用Seaborn)
plt.figure(figsize=(10, 6))
sns.boxplot(x='城市', y='薪资', data=df)
plt.title('不同城市薪资分布')
plt.show()

# 4. 热力图(相关性矩阵)
plt.figure(figsize=(8, 6))
correlation_matrix = df[['年龄', '薪资', '工作年限']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('特征相关性热力图')
plt.show()

# 5. 多子图
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 子图1:散点图
axes[0, 0].scatter(df['年龄'], df['薪资'], alpha=0.6, color='red')
axes[0, 0].set_title('年龄 vs 薪资')
# 子图2:直方图
axes[0, 1].hist(df['年龄'], bins=15, alpha=0.7, color='green')
axes[0, 1].set_title('年龄分布')
# 子图3:箱线图
sns.boxplot(x='城市', y='薪资', data=df, ax=axes[1, 0])
axes[1, 0].set_title('城市 vs 薪资')
# 子图4:折线图
df_sorted = df.sort_values('工作年限')
axes[1, 1].plot(df_sorted['工作年限'], df_sorted['薪资'], marker='o', linestyle='-')
axes[1, 1].set_title('工作年限 vs 薪资趋势')
plt.tight_layout()
plt.show()

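1.5 Scikit-learn: Getting Started with Machine Learning

Scikit-learn is one of the most popular machine learning libraries in Python and provides a wide range of algorithms and utilities.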

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import pandas as pd

# 示例1:线性回归预测薪资
print("=== 线性回归示例 ===")
# 创建数据
np.random.seed(42)
X = np.random.rand(100, 2)  # 100个样本,2个特征
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(100) * 0.1  # 线性关系加噪声

# 分割数据集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 训练模型
model = LinearRegression()
model.fit(X_train, y_train)

# 预测与评估
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"模型系数: {model.coef_}")
print(f"截距: {model.intercept_}")
print(f"均方误差: {mse:.4f}")
print(f"R²分数: {r2:.4f}")

# 示例2:分类问题
print("\n=== 分类问题示例 ===")
# 创建分类数据
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, 
                          n_redundant=2, random_state=42)

# 分割数据
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 训练随机森林分类器
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# 预测与评估
y_pred = clf.predict(X_test)
print(f"分类准确率: {clf.score(X_test, y_test):.4f}")
print("\n分类报告:")
print(classification_report(y_test, y_pred))
print("\n混淆矩阵:")
print(confusion_matrix(y_test, y_pred))

# 特征重要性
feature_importance = clf.feature_importances_
print(f"\n特征重要性: {feature_importance}")
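
The classification example above reports accuracy on a single train/test split, which can vary with the split. As a hedged sketch, the snippet below estimates accuracy with 5-fold cross-validation on the same kind of synthetic data; the choice of 5 folds and accuracy as the metric is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Same synthetic setup as the classification example
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Each of the 5 folds is held out once while the model trains on the rest
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

The spread across folds gives a rough sense of how much a single-split score could move.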

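1.6 Case Study: House Price Prediction

Let's combine the techniques above to build an end-to-end house price prediction example on simulated data.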

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

# 1. 创建模拟数据集
def create_housing_data(n_samples=1000):
    """创建模拟房价数据集"""
    np.random.seed(42)
    
    data = {
        '面积': np.random.randint(50, 200, n_samples),
        '房间数': np.random.randint(1, 6, n_samples),
        '楼层': np.random.randint(1, 30, n_samples),
        '建造年份': np.random.randint(1980, 2022, n_samples),
        '区域': np.random.choice(['市中心', '郊区', '开发区'], n_samples),
        '地铁距离': np.random.uniform(0, 5, n_samples).round(2),
        '装修': np.random.choice(['毛坯', '简装', '精装'], n_samples)
    }
    
    df = pd.DataFrame(data)
    
    # 基础价格
    base_price = 5000
    
    # 根据特征计算价格(模拟真实关系)
    df['价格'] = (
        base_price +
        df['面积'] * 300 +
        df['房间数'] * 5000 +
        df['楼层'] * 100 +
        (2022 - df['建造年份']) * 50 +
        (df['区域'] == '市中心') * 100000 +
        (df['区域'] == '郊区') * 50000 +
        df['地铁距离'] * (-20000) +
        (df['装修'] == '精装') * 30000 +
        (df['装修'] == '简装') * 15000 +
        np.random.normal(0, 20000, n_samples)  # 噪声
    )
    
    return df

# 2. 数据预处理和建模
def build_housing_model():
    """构建房价预测模型"""
    
    # 创建数据
    df = create_housing_data(1000)
    print("数据集预览:")
    print(df.head())
    print(f"\n数据集形状: {df.shape}")
    
    # 特征工程
    X = df.drop('价格', axis=1)
    y = df['价格']
    
    # 定义数值和分类特征
    numeric_features = ['面积', '房间数', '楼层', '建造年份', '地铁距离']
    categorical_features = ['区域', '装修']
    
    # 创建预处理管道
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(drop='first'), categorical_features)
        ])
    
    # 创建完整管道
    model = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
    ])
    
    # 分割数据
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # 训练模型
    print("\n开始训练模型...")
    model.fit(X_train, y_train)
    
    # 预测与评估
    y_pred = model.predict(X_test)
    
    print("\n=== 模型评估 ===")
    print(f"平均绝对误差 (MAE): ${mean_absolute_error(y_test, y_pred):,.2f}")
    print(f"均方根误差 (RMSE): ${np.sqrt(mean_squared_error(y_test, y_pred)):,.2f}")
    print(f"R² 分数: {r2_score(y_test, y_pred):.4f}")
    
    # 保存模型
    joblib.dump(model, 'housing_model.pkl')
    print("\n模型已保存为 'housing_model.pkl'")
    
    return model, X_test, y_test, y_pred

# 3. 模型预测示例
def predict_new_house(model, house_features):
    """预测新房子的价格"""
    prediction = model.predict(house_features)
    return prediction[0]

# 运行完整示例
if __name__ == "__main__":
    # 构建模型
    model, X_test, y_test, y_pred = build_housing_model()
    
    # 预测新房子
    new_house = pd.DataFrame({
        '面积': [120],
        '房间数': [3],
        '楼层': [15],
        '建造年份': [2015],
        '区域': ['市中心'],
        '地铁距离': [0.5],
        '装修': ['精装']
    })
    
    predicted_price = predict_new_house(model, new_house)
    print(f"\n=== 新房预测 ===")
    print(f"输入特征:\n{new_house}")
    print(f"预测价格: ${predicted_price:,.2f}")
    
    # 显示特征重要性(需要从pipeline中提取)
    # 注意:对于pipeline,需要特殊处理来获取特征重要性
    print("\n=== 特征重要性分析 ===")
    print("由于使用了Pipeline和OneHotEncoder,特征重要性需要特殊处理。")
    print("这里展示如何提取:")
    
    # 获取预处理后的特征名称
    preprocessor = model.named_steps['preprocessor']
    regressor = model.named_steps['regressor']
    
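    # Read the column lists back from the fitted ColumnTransformer; the lists
    # defined inside build_housing_model() are local and not in scope here
    num_feature_names = preprocessor.transformers_[0][2]
    categorical_features = preprocessor.transformers_[1][2]
    
    # Categorical feature names after OneHotEncoder
    cat_encoder = preprocessor.named_transformers_['cat']
    cat_feature_names = cat_encoder.get_feature_names_out(categorical_features)
    
    # Combine all feature names
    all_feature_names = np.concatenate([num_feature_names, cat_feature_names])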
    
    # 获取特征重要性
    importances = regressor.feature_importances_
    
    # 创建DataFrame显示
    feature_importance_df = pd.DataFrame({
        '特征': all_feature_names,
        '重要性': importances
    }).sort_values('重要性', ascending=False)
    
    print(feature_importance_df)
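
The case study saves the fitted pipeline with `joblib.dump` but never loads it back. The sketch below shows the round trip on a much smaller stand-in pipeline: fit, save, reload, and confirm that the reloaded model gives the same prediction. The two-column dataset, the temporary directory, and the names `pipeline`, `loaded`, and `model_path` are assumptions for illustration; only `joblib.dump` and the file name come from the example above.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in dataset (hypothetical values, same column style as the case study)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    '面积': rng.integers(50, 200, n),
    '区域': rng.choice(['市中心', '郊区', '开发区'], n),
})
y = df['面积'] * 300 + (df['区域'] == '市中心') * 100000 + rng.normal(0, 5000, n)

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), ['面积']),
        ('cat', OneHotEncoder(drop='first'), ['区域']),
    ])),
    ('regressor', RandomForestRegressor(n_estimators=20, random_state=42)),
])
pipeline.fit(df, y)

# Save to a temporary directory, then load the whole pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'housing_model.pkl')
    joblib.dump(pipeline, model_path)
    loaded = joblib.load(model_path)

# The reloaded pipeline applies the same preprocessing and model
new_house = pd.DataFrame({'面积': [120], '区域': ['市中心']})
original_pred = pipeline.predict(new_house)
reloaded_pred = loaded.predict(new_house)
print(original_pred, reloaded_pred)
```

Because preprocessing and model are stored together, the caller passes raw columns and does not need to repeat the scaling or encoding steps. Pickle-based files should only be loaded from trusted sources, and generally with the same library versions used to save them.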

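Part 2: R Data Science Fundamentals

2.1 R Environment Setup and Basic Syntax

R is a programming language designed for statistical analysis and data visualization, and it is widely used in both academia and industry.

Environment Setup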

# 安装R(推荐使用RStudio作为IDE)
# 下载地址:https://cran.r-project.org/
# RStudio下载:https://www.rstudio.com/products/rstudio/download/

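# Install core packages
install.packages(c("tidyverse", "ggplot2", "dplyr", "readr", "lubridate"))
install.packages(c("caret", "randomForest", "xgboost"))
install.packages(c("shiny", "rmarkdown"))
# Packages loaded by later examples in this article
install.packages(c("patchwork", "reshape2", "tidymodels", "pROC"))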

# 加载包
library(tidyverse)
library(ggplot2)
library(dplyr)

R Basic Syntax

# 基础数据结构
# 向量
numbers <- c(1, 2, 3, 4, 5)
names(numbers) <- c("a", "b", "c", "d", "e")

# 矩阵
matrix_data <- matrix(1:12, nrow=3, ncol=4)

# 数据框
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  salary = c(5000, 7000, 6000)
)

# 列表
my_list <- list(
  numbers = numbers,
  matrix = matrix_data,
  df = df
)

# 控制流
for (i in 1:5) {
  if (i %% 2 == 0) {
    print(paste(i, "是偶数"))
  } else {
    print(paste(i, "是奇数"))
  }
}

# 函数定义
calculate_mean <- function(x) {
  return(sum(x) / length(x))
}

# 使用函数
data <- c(10, 20, 30, 40, 50)
mean_value <- calculate_mean(data)
print(paste("平均值:", mean_value))

2.2 Tidyverse: Modern Data Science in R

The Tidyverse is a collection of R packages for data science that share a consistent syntax and design.

library(tidyverse)

# 1. 创建数据框
df <- tibble(
  name = c("张三", "李四", "王五", "赵六"),
  age = c(25, 30, 35, 28),
  city = c("北京", "上海", "广州", "深圳"),
  salary = c(10000, 15000, 12000, 18000)
)

# 2. 数据查看
print("原始数据:")
print(df)
print(paste("数据形状:", nrow(df), "行", ncol(df), "列"))
summary(df)

# 3. 数据操作(dplyr)
# 选择列
df %>% select(name, salary)

# 过滤行
df %>% filter(age > 28)

# 排序
df %>% arrange(desc(salary))

# 添加新列
df %>% mutate(salary_level = ifelse(salary > 13000, "高", "中"))

# 分组聚合
df %>% 
  group_by(city) %>% 
  summarise(
    avg_salary = mean(salary),
    count = n(),
    max_salary = max(salary)
  )

# 管道操作组合
result <- df %>%
  filter(age > 25) %>%
  mutate(salary_k = salary / 1000) %>%
  arrange(desc(salary_k)) %>%
  select(name, city, salary_k)

print("管道操作结果:")
print(result)

# 4. 数据读取与写入
# 读取CSV
# df <- read_csv("data.csv")

# 写入CSV
# write_csv(df, "output.csv")

# 5. 处理缺失值
df_missing <- df
df_missing$age[2] <- NA
print("有缺失值的数据:")
print(df_missing)

# 检查缺失值
print("缺失值统计:")
print(colSums(is.na(df_missing)))

# 填充缺失值
df_filled <- df_missing %>%
  mutate(age = ifelse(is.na(age), mean(age, na.rm = TRUE), age))

print("填充缺失值后:")
print(df_filled)

2.3 ggplot2: Data Visualization

ggplot2 is one of the most widely used visualization libraries in R and is built on the grammar of graphics.

library(ggplot2)
library(dplyr)

# 创建示例数据
set.seed(42)
df <- data.frame(
  age = sample(20:60, 100, replace = TRUE),
  salary = sample(5000:25000, 100, replace = TRUE),
  city = sample(c("北京", "上海", "广州", "深圳"), 100, replace = TRUE),
  work_years = sample(1:20, 100, replace = TRUE)
)

# 1. 散点图
p1 <- ggplot(df, aes(x = work_years, y = salary)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "工作年限与薪资关系",
       x = "工作年限",
       y = "薪资") +
  theme_minimal()

print(p1)

# 2. 直方图
p2 <- ggplot(df, aes(x = salary)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "薪资分布直方图",
       x = "薪资",
       y = "频数") +
  theme_minimal()

print(p2)

# 3. 箱线图
p3 <- ggplot(df, aes(x = city, y = salary, fill = city)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "不同城市薪资分布",
       x = "城市",
       y = "薪资") +
  theme_minimal() +
  theme(legend.position = "none")

print(p3)

# 4. 多图组合
library(patchwork)

# 创建多个图
p1 <- ggplot(df, aes(x = age, y = salary)) + geom_point() + labs(title = "年龄 vs 薪资")
p2 <- ggplot(df, aes(x = city, y = salary)) + geom_boxplot() + labs(title = "城市 vs 薪资")
p3 <- ggplot(df, aes(x = work_years, y = salary)) + geom_line() + labs(title = "工作年限 vs 薪资")
p4 <- ggplot(df, aes(x = salary)) + geom_histogram() + labs(title = "薪资分布")

# 组合显示
combined <- (p1 + p2) / (p3 + p4)
print(combined)

# 5. 相关性热力图
library(reshape2)
cor_matrix <- cor(df[, c("age", "salary", "work_years")])
cor_melted <- melt(cor_matrix)

p5 <- ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  geom_text(aes(label = round(value, 2)), color = "black", size = 4) +
  labs(title = "特征相关性热力图",
       fill = "相关系数") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p5)

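2.4 Machine Learning in R: The caret Package

The caret package provides a unified interface for machine learning in R and simplifies model training and evaluation.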

library(caret)
library(randomForest)
library(dplyr)

# 示例1:线性回归
print("=== 线性回归示例 ===")

# 创建数据
set.seed(42)
n <- 100
X <- data.frame(
  feature1 = runif(n),
  feature2 = runif(n)
)
y <- 3 * X$feature1 + 2 * X$feature2 + rnorm(n, 0, 0.1)

# 合并数据
data <- cbind(X, y)

# 分割数据集
set.seed(42)
trainIndex <- createDataPartition(data$y, p = 0.8, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]

# 训练模型
model_lm <- train(
  y ~ .,
  data = train_data,
  method = "lm",
  trControl = trainControl(method = "cv", number = 5)
)

# 预测
predictions <- predict(model_lm, test_data)

# 评估
mse <- mean((test_data$y - predictions)^2)
r2 <- cor(test_data$y, predictions)^2

print(paste("均方误差:", round(mse, 4)))
print(paste("R²分数:", round(r2, 4)))
print(summary(model_lm))

# 示例2:分类问题
print("\n=== 分类问题示例 ===")

# 创建分类数据
set.seed(42)
n <- 1000
class_data <- data.frame(
  feature1 = rnorm(n),
  feature2 = rnorm(n),
  feature3 = rnorm(n),
  class = as.factor(sample(c(0, 1), n, replace = TRUE))
)

# 添加一些相关性
class_data$feature1 <- class_data$feature1 + ifelse(class_data$class == 1, 1, 0)

# 分割数据
set.seed(42)
trainIndex <- createDataPartition(class_data$class, p = 0.7, list = FALSE)
train_class <- class_data[trainIndex, ]
test_class <- class_data[-trainIndex, ]

# 训练随机森林
model_rf <- train(
  class ~ .,
  data = train_class,
  method = "rf",
  trControl = trainControl(method = "cv", number = 5),
  ntree = 100
)

# 预测
pred_class <- predict(model_rf, test_class)

# 评估
conf_matrix <- confusionMatrix(pred_class, test_class$class)
print(conf_matrix)

# 特征重要性
importance <- varImp(model_rf)
print("特征重要性:")
print(importance)

# 示例3:使用tidymodels(现代R机器学习)
library(tidymodels)

print("\n=== tidymodels示例 ===")

# 创建数据
data <- tibble(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y = 3 * x1 + 2 * x2 + rnorm(100, 0, 0.1)
)

# 数据分割
set.seed(42)
data_split <- initial_split(data, prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)

# 创建模型规范
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# 创建工作流
wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_formula(y ~ .)

# 训练模型
model_fit <- wf %>% fit(data = train_data)

# 预测
predictions <- predict(model_fit, test_data)

# 评估
results <- test_data %>%
  bind_cols(predictions) %>%
  metrics(truth = y, estimate = .pred)

print("模型评估指标:")
print(results)

2.5 Case Study: Customer Churn Prediction

library(tidyverse)
library(caret)
library(randomForest)
library(pROC)

# 1. 创建模拟客户数据
set.seed(42)
n <- 2000

customer_data <- tibble(
  customer_id = 1:n,
  tenure = sample(1:72, n, replace = TRUE),  # 在网时长(月)
  monthly_charges = runif(n, 20, 120),       # 月费用
  total_charges = tenure * monthly_charges,   # 总费用
  contract = sample(c("Month-to-month", "One year", "Two year"), n, 
                    prob = c(0.5, 0.3, 0.2), replace = TRUE),
  payment_method = sample(c("Electronic check", "Mailed check", 
                           "Bank transfer", "Credit card"), n, replace = TRUE),
  internet_service = sample(c("DSL", "Fiber optic", "No"), n, 
                           prob = c(0.3, 0.4, 0.3), replace = TRUE),
  tech_support = sample(c("Yes", "No"), n, prob = c(0.3, 0.7), replace = TRUE)
)

# 创建流失标签(模拟真实情况)
customer_data <- customer_data %>%
  mutate(
    churn_prob = case_when(
      contract == "Month-to-month" ~ 0.3,
      contract == "One year" ~ 0.1,
      contract == "Two year" ~ 0.05,
      TRUE ~ 0.2
    ) + 
    ifelse(monthly_charges > 80, 0.1, 0) +
    ifelse(tenure < 6, 0.15, 0) +
    ifelse(tech_support == "No", 0.05, 0),
    # Draw one Bernoulli outcome per customer; sample() cannot take a per-row probability vector
    churn = factor(rbinom(n(), 1, churn_prob), levels = c(0, 1))
  ) %>%
  select(-churn_prob)

print("客户数据预览:")
print(head(customer_data))

# 2. 数据探索
print("\n=== 数据探索 ===")
print(paste("总样本数:", nrow(customer_data)))
print(paste("流失客户数:", sum(customer_data$churn == 1)))
print(paste("流失率:", round(mean(customer_data$churn == 1) * 100, 2), "%"))

# 按合同类型统计流失率
churn_by_contract <- customer_data %>%
  group_by(contract) %>%
  summarise(
    total = n(),
    churn_rate = mean(churn == 1)
  )
print("\n按合同类型的流失率:")
print(churn_by_contract)

# 3. 数据预处理和特征工程
# 将分类变量转换为数值
customer_encoded <- customer_data %>%
  mutate(
    contract = factor(contract, levels = c("Month-to-month", "One year", "Two year")),
    payment_method = factor(payment_method),
    internet_service = factor(internet_service),
    tech_support = factor(tech_support)
  ) %>%
  select(-customer_id)  # 移除ID列

# 4. 数据分割
set.seed(42)
trainIndex <- createDataPartition(customer_encoded$churn, p = 0.7, list = FALSE)
train_data <- customer_encoded[trainIndex, ]
test_data <- customer_encoded[-trainIndex, ]

# 5. 模型训练
print("\n=== 模型训练 ===")

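# Define training control (cross-validation)
# Note: classProbs = TRUE and twoClassSummary require class levels that are
# valid R variable names. The levels here are "0" and "1", so resampling uses
# the default accuracy-based summary; AUC is computed on the test set below.
ctrl <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE
)

# Train the random forest
rf_model <- train(
  churn ~ .,
  data = train_data,
  method = "rf",
  trControl = ctrl,
  ntree = 200,
  metric = "Accuracy"
)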

print("随机森林模型训练完成")
print(rf_model)

# 6. 模型评估
print("\n=== 模型评估 ===")

# 预测测试集
predictions <- predict(rf_model, test_data)
pred_probs <- predict(rf_model, test_data, type = "prob")

# 混淆矩阵
conf_matrix <- confusionMatrix(predictions, test_data$churn, positive = "1")
print("混淆矩阵:")
print(conf_matrix)

# ROC曲线
roc_curve <- roc(test_data$churn, pred_probs[, "1"])
print(paste("AUC:", round(auc(roc_curve), 4)))

# 绘制ROC曲线
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red")

# 7. 特征重要性
importance <- varImp(rf_model)
print("\n特征重要性:")
print(importance)

# 8. 模型应用:预测新客户
new_customers <- tibble(
  tenure = c(3, 24, 48),
  monthly_charges = c(70.5, 85.2, 95.8),
  total_charges = c(211.5, 2044.8, 4598.4),
  contract = factor(c("Month-to-month", "One year", "Two year"), 
                    levels = c("Month-to-month", "One year", "Two year")),
  payment_method = factor(c("Electronic check", "Credit card", "Bank transfer")),
  internet_service = factor(c("Fiber optic", "DSL", "Fiber optic")),
  tech_support = factor(c("No", "Yes", "Yes"))
)

new_predictions <- predict(rf_model, new_customers)
new_probs <- predict(rf_model, new_customers, type = "prob")

print("\n=== 新客户流失预测 ===")
print(cbind(new_customers, churn_prediction = new_predictions, churn_probability = new_probs[, "1"]))

# 9. 保存模型
saveRDS(rf_model, "customer_churn_model.rds")
print("\n模型已保存为 'customer_churn_model.rds'")

# 10. 加载并使用保存的模型
loaded_model <- readRDS("customer_churn_model.rds")
test_prediction <- predict(loaded_model, new_customers[1, ])
print(paste("加载模型预测结果:", test_prediction))

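Part 3: Comparing and Choosing Between Python and R

3.1 Language Feature Comparison

Feature              Python                                   R
Learning curve       Gentle; concise syntax                   Steeper; distinctive statistical syntax
Data processing      Pandas (powerful, but takes learning)    Tidyverse (intuitive and natural)
Visualization        Matplotlib/Seaborn (flexible)            ggplot2 (grammar of graphics)
Machine learning     Scikit-learn (industry standard)         caret/tidymodels (statistics-oriented)
Deep learning        TensorFlow/PyTorch (mainstream)          Limited; mainly via reticulate
Production deploy    Excellent (Flask/FastAPI)                Limited (Shiny/plumber)
Community size       Large, cross-domain                      Focused on statistics
Performance          Excellent (C extensions)                 Good (vectorized operations)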

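3.2 Recommendations

When to choose Python:

  • You need to collaborate with engineering teams
  • The work involves deep learning or complex AI
  • The model must be deployed to production
  • You handle large-scale data or need high-performance computing
  • You need cross-domain use (web development, automation, and so on)

When to choose R:

  • Statistical analysis and hypothesis testing
  • Academic research and paper publication
  • Exploratory data analysis
  • High-quality statistical graphics
  • Traditional statistical modeling (such as time series analysis)

3.3 Combining Both Languages

In practice, the two languages can be combined to use the strengths of each: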

# Python中调用R代码
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

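# Enable automatic pandas <-> R conversion
# (recent rpy2 releases deprecate activate() in favor of a local conversion
# context; check the rpy2 documentation for the version you have installed)
pandas2ri.activate()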

# 在Python中使用R的统计函数
r_code = '''
library(forecast)
auto_arima_forecast <- function(data) {
  model <- auto.arima(data)
  forecast(model, h=10)
}
'''
robjects.r(r_code)

# Calling Python code from R
library(reticulate)
py_run_string("
import pandas as pd
import numpy as np
# Python代码
")

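Part 4: Integrated Project

4.1 Project Overview: E-commerce User Behavior Analysis

We will build an end-to-end e-commerce user behavior analysis system covering data processing, user profiling, behavior prediction, and a summary report.

Python Implementation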

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

# 1. 创建模拟电商数据
def create_ecommerce_data(n_users=5000):
    """创建模拟电商用户行为数据"""
    np.random.seed(42)
    
    # 用户基本信息
    users = pd.DataFrame({
        'user_id': range(1, n_users + 1),
        'age': np.random.randint(18, 70, n_users),
        'gender': np.random.choice(['M', 'F'], n_users, p=[0.48, 0.52]),
        'registration_date': pd.to_datetime('2022-01-01') + 
                            pd.to_timedelta(np.random.randint(0, 730, n_users), unit='D'),
        'city_tier': np.random.choice(['Tier1', 'Tier2', 'Tier3'], n_users, 
                                     p=[0.3, 0.4, 0.3])
    })
    
    # 用户行为数据
    behavior_data = []
    for user_id in range(1, n_users + 1):
        # 每个用户的会话次数
        n_sessions = np.random.poisson(5)
        
        for session in range(n_sessions):
            # 每次会话的行为
            session_date = users.loc[users.user_id == user_id, 'registration_date'].iloc[0] + \
                          pd.to_timedelta(np.random.randint(0, 730), unit='D')
            
            # 浏览、加购、购买行为
            n_views = np.random.poisson(8)
            n_add_to_cart = np.random.poisson(2)
            n_purchases = np.random.binomial(n_add_to_cart, 0.6)  # 加购后60%概率购买
            
            # 购买金额
            purchase_amount = 0
            if n_purchases > 0:
                purchase_amount = np.random.gamma(2, 50, n_purchases).sum()
            
            behavior_data.append({
                'user_id': user_id,
                'session_date': session_date,
                'n_views': n_views,
                'n_add_to_cart': n_add_to_cart,
                'n_purchases': n_purchases,
                'purchase_amount': purchase_amount,
                'session_duration': np.random.exponential(300)  # 会话时长(秒)
            })
    
    behavior_df = pd.DataFrame(behavior_data)
    
    # 合并用户和行为数据
    full_data = pd.merge(users, behavior_df, on='user_id')
    
    return full_data, users

# 2. 用户画像分析
def user_segmentation_analysis(df, users):
    """用户分群分析"""
    print("=== 用户分群分析 ===")
    
    # 计算用户聚合特征
    user_features = df.groupby('user_id').agg({
        'n_views': 'sum',
        'n_add_to_cart': 'sum',
        'n_purchases': 'sum',
        'purchase_amount': 'sum',
        'session_duration': 'sum',
        'session_date': ['min', 'max']
    }).reset_index()
    
    # 扁平化列名
    user_features.columns = ['user_id', 'total_views', 'total_cart', 'total_purchases', 
                            'total_amount', 'total_duration', 'first_session', 'last_session']
    
    # 计算RFM特征
    current_date = df['session_date'].max()
    user_features['recency'] = (current_date - user_features['last_session']).dt.days
    user_features['frequency'] = user_features['total_purchases']
    user_features['monetary'] = user_features['total_amount']
    
    # 合并用户基本信息
    user_features = pd.merge(users, user_features, on='user_id')
    
    # 选择聚类特征
    cluster_features = ['recency', 'frequency', 'monetary', 'total_views', 'total_duration']
    X_cluster = user_features[cluster_features].fillna(0)
    
    # 标准化
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_cluster)
    
    # K-means聚类(4个群组)
    kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
    user_features['cluster'] = kmeans.fit_predict(X_scaled)
    
    # 分析每个群组
    cluster_analysis = user_features.groupby('cluster').agg({
        'recency': 'mean',
        'frequency': 'mean',
        'monetary': 'mean',
        'total_views': 'mean',
        'total_duration': 'mean',
        'user_id': 'count'
    }).round(2)
    
    print("\n用户群组特征:")
    print(cluster_analysis)
    
    # 可视化
    plt.figure(figsize=(12, 8))
    
    plt.subplot(2, 2, 1)
    sns.scatterplot(data=user_features, x='recency', y='monetary', hue='cluster', palette='viridis')
    plt.title('Recency vs Monetary')
    
    plt.subplot(2, 2, 2)
    sns.scatterplot(data=user_features, x='frequency', y='monetary', hue='cluster', palette='viridis')
    plt.title('Frequency vs Monetary')
    
    plt.subplot(2, 2, 3)
    user_features['cluster'].value_counts().plot(kind='bar')
    plt.title('用户群组分布')
    plt.xlabel('Cluster')
    plt.ylabel('Count')
    
    plt.subplot(2, 2, 4)
    cluster_analysis['monetary'].plot(kind='bar')
    plt.title('各群组平均消费金额')
    plt.xlabel('Cluster')
    plt.ylabel('Avg Monetary')
    
    plt.tight_layout()
    plt.show()
    
    return user_features, cluster_analysis

# 3. 购买预测模型
def purchase_prediction_model(df, user_features):
    """构建购买预测模型"""
    print("\n=== 购买预测模型 ===")
    
    # 准备训练数据:预测用户下次会话是否会购买
    # 特征:历史行为统计
    feature_data = user_features.copy()
    
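    # Create the target variable: whether the next session includes a purchase (simulated)
    # Here historical purchase frequency stands in for future behavior.
    # Caveat: the label is derived from 'frequency', which is also a model feature,
    # so the reported metrics are optimistic and only illustrate the workflow.
    np.random.seed(42)
    feature_data['next_purchase'] = (feature_data['frequency'] > 0).astype(int)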
    
    # 添加一些噪声
    noise = np.random.binomial(1, 0.1, len(feature_data))
    feature_data['next_purchase'] = feature_data['next_purchase'] ^ noise
    
    # 选择特征
    features = ['recency', 'frequency', 'monetary', 'total_views', 'total_duration', 'age']
    X = feature_data[features].fillna(0)
    y = feature_data['next_purchase']
    
    # 分割数据
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # 训练模型
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    # 预测与评估
    y_pred = model.predict(X_test)
    
    print("\n模型性能:")
    print(classification_report(y_test, y_pred))
    
    # 特征重要性
    importance_df = pd.DataFrame({
        'feature': features,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("\n特征重要性:")
    print(importance_df)
    
    # 可视化特征重要性
    plt.figure(figsize=(10, 6))
    sns.barplot(data=importance_df, x='importance', y='feature')
    plt.title('特征重要性')
    plt.tight_layout()
    plt.show()
    
    return model, importance_df

# 4. 生成分析报告
def generate_report(user_features, cluster_analysis, model_importance):
    """生成分析报告"""
    print("\n" + "="*60)
    print("电商用户行为分析报告")
    print("="*60)
    
    print(f"\n1. 数据概览")
    print(f"   - 总用户数: {len(user_features)}")
    print(f"   - 平均用户价值: ${user_features['monetary'].mean():.2f}")
    print(f"   - 付费用户比例: {user_features['frequency'].gt(0).mean():.1%}")
    
    print(f"\n2. 用户分群洞察")
    for cluster_id in cluster_analysis.index:
        cluster_data = cluster_analysis.loc[cluster_id]
        print(f"   群组 {cluster_id}:")
        print(f"     - 用户数: {int(cluster_data['user_id'])}")
        print(f"     - 平均消费: ${cluster_data['monetary']:.2f}")
        print(f"     - 平均频次: {cluster_data['frequency']:.2f}")
    
    print(f"\n3. 关键发现")
    top_features = model_importance.head(3)
    print(f"   - 影响购买的最重要因素: {', '.join(top_features['feature'].tolist())}")
    
    print(f"\n4. 业务建议")
    print("   - 高价值用户(群组 X): 提供VIP服务和专属优惠")
    print("   - 潜在流失用户(高Recency): 推送召回活动和优惠券")
    print("   - 新用户(低Frequency): 优化首次购买体验,提供新手礼包")
    
    print("\n" + "="*60)

# 主函数
def main():
    """主执行函数"""
    print("开始生成电商数据...")
    df, users = create_ecommerce_data(5000)
    
    print(f"数据生成完成: {len(df)} 条记录")
    print("\n数据预览:")
    print(df.head())
    
    # 用户分群分析
    user_features, cluster_analysis = user_segmentation_analysis(df, users)
    
    # 购买预测模型
    model, importance_df = purchase_prediction_model(df, user_features)
    
    # 生成报告
    generate_report(user_features, cluster_analysis, importance_df)
    
    print("\n分析完成!")

if __name__ == "__main__":
    main()
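
The segmentation step above fixes the number of clusters at 4 without justifying it. One common way to sanity-check that choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs from `make_blobs`, which stand in for the scaled RFM features; the data, the range of k, and the names `scores` and `best_k` are illustrative assumptions, not part of the original pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled user features (assumption for illustration)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette at k={best_k}")
```

In the project code, the same loop could be run on `X_scaled` inside `user_segmentation_analysis` before settling on `n_clusters`. The score is a guide rather than a rule, since business interpretability of the segments also matters.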

R Implementation

library(tidyverse)
library(cluster)
library(randomForest)
library(caret)
library(ggplot2)
library(lubridate)

# 1. 创建模拟电商数据
create_ecommerce_data <- function(n_users = 5000) {
  set.seed(42)
  
  # 用户基本信息
  users <- tibble(
    user_id = 1:n_users,
    age = sample(18:70, n_users, replace = TRUE),
    gender = sample(c("M", "F"), n_users, prob = c(0.48, 0.52), replace = TRUE),
    registration_date = as.Date("2022-01-01") + sample(0:730, n_users, replace = TRUE),
    city_tier = sample(c("Tier1", "Tier2", "Tier3"), n_users, prob = c(0.3, 0.4, 0.3), replace = TRUE)
  )
  
  # 用户行为数据
  behavior_data <- list()
  
  for (user_id in 1:n_users) {
    n_sessions <- rpois(1, 5)
    
    # seq_len() yields no iterations when n_sessions is 0 (1:0 would run twice)
    for (session in seq_len(n_sessions)) {
      session_date <- users$registration_date[user_id] + sample(0:730, 1)
      
      n_views <- rpois(1, 8)
      n_add_to_cart <- rpois(1, 2)
      n_purchases <- rbinom(1, n_add_to_cart, 0.6)
      
      purchase_amount <- 0
      if (n_purchases > 0) {
        purchase_amount <- sum(rgamma(n_purchases, 2, 0.02))  # scale = 1/0.02 = 50
      }
      
      behavior_data[[length(behavior_data) + 1]] <- tibble(
        user_id = user_id,
        session_date = session_date,
        n_views = n_views,
        n_add_to_cart = n_add_to_cart,
        n_purchases = n_purchases,
        purchase_amount = purchase_amount,
        session_duration = rexp(1, 1/300)
      )
    }
  }
  
  behavior_df <- bind_rows(behavior_data)
  # Inner join drops users without sessions, matching pd.merge in the Python version
  full_data <- inner_join(users, behavior_df, by = "user_id")
  
  return(list(full_data = full_data, users = users))
}

# 2. User segmentation analysis
user_segmentation_analysis <- function(df, users) {
  cat("=== User segmentation analysis ===\n")
  
  # Aggregate features per user
  user_features <- df %>%
    group_by(user_id) %>%
    summarise(
      total_views = sum(n_views, na.rm = TRUE),
      total_cart = sum(n_add_to_cart, na.rm = TRUE),
      total_purchases = sum(n_purchases, na.rm = TRUE),
      total_amount = sum(purchase_amount, na.rm = TRUE),
      total_duration = sum(session_duration, na.rm = TRUE),
      first_session = min(session_date, na.rm = TRUE),
      last_session = max(session_date, na.rm = TRUE)
    )
  
  # Compute RFM (recency, frequency, monetary) features
  current_date <- max(df$session_date, na.rm = TRUE)
  user_features <- user_features %>%
    mutate(
      recency = as.numeric(difftime(current_date, last_session, units = "days")),
      frequency = total_purchases,
      monetary = total_amount
    )
  
  # Merge in basic user information
  user_features <- left_join(users, user_features, by = "user_id")
  
  # Select clustering features
  cluster_features <- c("recency", "frequency", "monetary", "total_views", "total_duration")
  X_cluster <- user_features[, cluster_features]
  X_cluster[is.na(X_cluster)] <- 0
  
  # Standardize to zero mean and unit variance
  scaler <- preProcess(X_cluster, method = c("center", "scale"))
  X_scaled <- predict(scaler, X_cluster)
  
  # K-means clustering
  set.seed(42)
  kmeans_result <- kmeans(X_scaled, centers = 4, nstart = 10)
  user_features$cluster <- kmeans_result$cluster
  
  # Summarize each cluster
  cluster_analysis <- user_features %>%
    group_by(cluster) %>%
    summarise(
      recency = mean(recency, na.rm = TRUE),
      frequency = mean(frequency, na.rm = TRUE),
      monetary = mean(monetary, na.rm = TRUE),
      total_views = mean(total_views, na.rm = TRUE),
      total_duration = mean(total_duration, na.rm = TRUE),
      user_count = n()
    ) %>%
    # Passing extra arguments through across() is deprecated in dplyr 1.1.0+; use a lambda instead
    mutate(across(where(is.numeric), ~ round(.x, 2)))
  
  print(cluster_analysis)
  
  # Visualization
  p1 <- ggplot(user_features, aes(x = recency, y = monetary, color = factor(cluster))) +
    geom_point(alpha = 0.6) +
    labs(title = "Recency vs Monetary", color = "Cluster") +
    theme_minimal()
  
  p2 <- ggplot(user_features, aes(x = frequency, y = monetary, color = factor(cluster))) +
    geom_point(alpha = 0.6) +
    labs(title = "Frequency vs Monetary", color = "Cluster") +
    theme_minimal()
  
  p3 <- ggplot(user_features, aes(x = factor(cluster))) +
    geom_bar(fill = "steelblue") +
    labs(title = "User cluster distribution", x = "Cluster", y = "Count") +
    theme_minimal()
  
  p4 <- ggplot(cluster_analysis, aes(x = factor(cluster), y = monetary)) +
    geom_col(fill = "darkgreen") +
    labs(title = "Average spend per cluster", x = "Cluster", y = "Avg Monetary") +
    theme_minimal()
  
  library(patchwork)
  print((p1 + p2) / (p3 + p4))
  
  return(list(user_features = user_features, cluster_analysis = cluster_analysis))
}

# 3. Purchase prediction model
purchase_prediction_model <- function(df, user_features) {
  cat("\n=== Purchase prediction model ===\n")
  
  # Prepare training data
  feature_data <- user_features %>%
    mutate(
      next_purchase = ifelse(frequency > 0, 1, 0),
      next_purchase = next_purchase + rbinom(n(), 1, 0.1)  # add label noise
    ) %>%
    mutate(next_purchase = ifelse(next_purchase > 1, 1, next_purchase))
  
  # Select features
  features <- c("recency", "frequency", "monetary", "total_views", "total_duration", "age")
  X <- feature_data[, features]
  X[is.na(X)] <- 0
  y <- factor(feature_data$next_purchase)
  
  # Split into training and test sets
  set.seed(42)
  trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
  X_train <- X[trainIndex, ]
  X_test <- X[-trainIndex, ]
  y_train <- y[trainIndex]
  y_test <- y[-trainIndex]
  
  # Train the model
  model <- randomForest(x = X_train, y = y_train, ntree = 100, importance = TRUE)
  
  # Predict and evaluate
  y_pred <- predict(model, X_test)
  
  cat("\nModel performance:\n")
  print(confusionMatrix(y_pred, y_test))
  
  # Feature importance
  importance_df <- as.data.frame(importance(model)) %>%
    rownames_to_column("feature") %>%
    arrange(desc(MeanDecreaseGini)) %>%
    select(feature, importance = MeanDecreaseGini)
  
  cat("\nFeature importance:\n")
  print(importance_df)
  
  # Visualization
  p <- ggplot(importance_df, aes(x = reorder(feature, importance), y = importance)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(title = "Feature importance", x = "Feature", y = "Importance") +
    theme_minimal()
  
  print(p)
  
  return(list(model = model, importance_df = importance_df))
}

# 4. Generate the analysis report
generate_report <- function(user_features, cluster_analysis, importance_df) {
  cat("\n", paste(rep("=", 60), collapse = ""), "\n")
  cat("E-commerce user behavior analysis report\n")
  cat(paste(rep("=", 60), collapse = ""), "\n")
  
  cat("\n1. Data overview\n")
  cat(sprintf("   - Total users: %d\n", nrow(user_features)))
  cat(sprintf("   - Average user value: $%.2f\n", mean(user_features$monetary, na.rm = TRUE)))
  cat(sprintf("   - Share of paying users: %.1f%%\n", 
              mean(user_features$frequency > 0, na.rm = TRUE) * 100))
  
  cat("\n2. User segment insights\n")
  for (i in seq_len(nrow(cluster_analysis))) {
    cat(sprintf("   Cluster %d:\n", cluster_analysis$cluster[i]))
    cat(sprintf("     - Users: %d\n", cluster_analysis$user_count[i]))
    cat(sprintf("     - Average spend: $%.2f\n", cluster_analysis$monetary[i]))
    cat(sprintf("     - Average frequency: %.2f\n", cluster_analysis$frequency[i]))
  }
  
  cat("\n3. Key findings\n")
  top_features <- head(importance_df, 3)
  cat(sprintf("   - Most important factors influencing purchase: %s\n", 
              paste(top_features$feature, collapse = ", ")))
  
  cat("\n4. Business recommendations\n")
  cat("   - High-value users: offer VIP service and exclusive discounts\n")
  cat("   - Users at risk of churn: send win-back campaigns and coupons\n")
  cat("   - New users: improve the first-purchase experience and offer a welcome package\n")
  
  cat("\n", paste(rep("=", 60), collapse = ""), "\n")
}

# Main function
main <- function() {
  cat("Generating e-commerce data...\n")
  data <- create_ecommerce_data(5000)
  df <- data$full_data
  users <- data$users
  
  cat(sprintf("Data generated: %d records\n", nrow(df)))
  cat("\nData preview:\n")
  print(head(df))
  
  # User segmentation analysis
  segmentation_result <- user_segmentation_analysis(df, users)
  user_features <- segmentation_result$user_features
  cluster_analysis <- segmentation_result$cluster_analysis
  
  # Purchase prediction model
  model_result <- purchase_prediction_model(df, user_features)
  
  # Generate the report
  generate_report(user_features, cluster_analysis, model_result$importance_df)
  
  cat("\nAnalysis complete!\n")
}

# Run
main()
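
One caveat about the prediction step: the label next_purchase is derived from frequency, and frequency is also one of the model's features, so the reported accuracy mostly reflects that built-in relationship rather than real predictive power. The sketch below uses only the Python standard library to simulate the same labeling rule and shows that a trivial rule, "predict 1 when frequency > 0", already scores very high. The frequency distribution (uniform integers from 0 to 5) is an assumption for illustration; only the labeling rule mirrors the code above.

```python
import random

rng = random.Random(42)
n_users = 5000

# Assumed purchase counts: uniform integers 0..5 (illustrative, not the article's distribution)
frequency = [rng.randint(0, 5) for _ in range(n_users)]

# Same labeling rule as above: 1 if the user ever purchased, plus ~10% noise, capped at 1
labels = [min(int(f > 0) + int(rng.random() < 0.1), 1) for f in frequency]

# A trivial "model" that simply reads the leaked feature
predictions = [int(f > 0) for f in frequency]

accuracy = sum(p == y for p, y in zip(predictions, labels)) / n_users
print(f"Accuracy of the trivial rule: {accuracy:.3f}")
```

In a real project, the label would normally come from a later time window than the features, so that the model cannot read the answer from its inputs.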

Part 5: Best Practices and Next Steps

5.1 Code Organization and Project Structure

Python project structure:

project/
├── data/                 # data files
├── notebooks/            # Jupyter notebooks
├── src/                  # source code
│   ├── __init__.py
│   ├── data_processing.py
│   ├── modeling.py
│   └── visualization.py
├── requirements.txt      # dependencies
├── README.md
└── main.py
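
A layout like this can be created by hand, but a short script keeps new projects consistent. The following is a minimal sketch that builds the Python skeleton above with pathlib. The scaffold function and the DIRS and FILES constants are hypothetical names chosen for this example, and the demo writes into a temporary directory so nothing touches the working tree.

```python
from pathlib import Path
import tempfile

# Folders and files taken from the Python layout above
DIRS = ["data", "notebooks", "src"]
FILES = [
    "src/__init__.py",
    "src/data_processing.py",
    "src/modeling.py",
    "src/visualization.py",
    "requirements.txt",
    "README.md",
    "main.py",
]


def scaffold(root):
    """Create the project skeleton under root and return the created file paths."""
    root = Path(root)
    for d in DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    created = []
    for f in FILES:
        path = root / f
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # create an empty placeholder file
        created.append(path)
    return created


# Demo in a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp) / "project"
    created = scaffold(project_root)
    created_names = sorted(p.relative_to(project_root).as_posix() for p in created)
    dirs_exist = all((project_root / d).is_dir() for d in DIRS)
    print("\n".join(created_names))
```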

R project structure:

project/
├── data/                 # data files
├── R/                    # R scripts
│   ├── data_processing.R
│   ├── modeling.R
│   └── visualization.R
├── scripts/              # analysis scripts
├── output/               # output results
├── README.md
└── project.Rproj

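5.2 Version Control and Collaboration

# Basic Git commands
git init
git add .
git commit -m "Initial commit"
git branch -M main    # rename the current branch to main; plain "git branch main" would not switch to it
git remote add origin <repository-url>
git push -u origin main

# Example .gitignore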
# Python
__pycache__/
*.py[cod]
*$py.class
.env
*.pkl
*.model

# R
.Rhistory
.RData
.Rproj.user
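
To get a feel for what the glob patterns in this .gitignore cover, the sketch below checks a few file names against them using Python's fnmatch module. This is only an approximation: Git's own matching rules differ from fnmatch in several ways (directory patterns, path separators, negation), so the helper is_ignored is a hypothetical teaching aid and not a substitute for Git.

```python
from fnmatch import fnmatchcase

# Patterns copied from the .gitignore example above
patterns = [
    "__pycache__/", "*.py[cod]", "*$py.class", ".env", "*.pkl", "*.model",
    ".Rhistory", ".RData", ".Rproj.user",
]


def is_ignored(filename, patterns):
    """Approximate check: does a bare file name match any file pattern?"""
    # Directory patterns (trailing slash) are skipped; fnmatch cannot express them.
    file_patterns = [p for p in patterns if not p.endswith("/")]
    return any(fnmatchcase(filename, p) for p in file_patterns)


for name in ["model.pkl", "utils.pyc", "analysis.py", ".RData", "report.Rmd"]:
    print(f"{name:12s} ignored={is_ignored(name, patterns)}")
```

Inside a real repository, git check-ignore -v <path> reports which rule, if any, applies to a path.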

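5.3 Performance Optimization Tips

Python optimization:

import numpy as np
import pandas as pd

# Example inputs (assumed for illustration; any NumPy array or DataFrame column works)
data = np.arange(1_000_000)
df = pd.DataFrame({'col': data})

# 1. Prefer vectorized operations
# Slow: Python-level loop
result = []
for i in range(len(data)):
    result.append(data[i] * 2)

# Fast: vectorized; data must be a NumPy array, because on a plain list data * 2 repeats the list
result = data * 2

# 2. Replace apply with column arithmetic
# Slow
df['new_col'] = df['col'].apply(lambda x: x**2)

# Fast
df['new_col'] = df['col'] ** 2

# 3. Use Dask for data that does not fit in memory
import dask.dataframe as dd
ddf = dd.read_csv('large_file.csv')
result = ddf.groupby('category')['value'].mean().compute()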
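
The Dask example computes a per-category mean without loading the whole file at once. The core idea, keeping a running sum and count per group while streaming rows, can be sketched with the standard library alone. The column names category and value come from the example above; the in-memory sample data and the helper name streaming_group_mean are assumptions for illustration, and this is not a description of how Dask works internally.

```python
import csv
import io
from collections import defaultdict


def streaming_group_mean(rows, key="category", value="value"):
    """Compute the mean of `value` per `key`, reading one row at a time."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Small in-memory stand-in for large_file.csv
sample = io.StringIO("category,value\na,1\nb,10\na,3\nb,20\n")
means = streaming_group_mean(csv.DictReader(sample))
print(means)
```

For a file on disk, the same function can be fed csv.DictReader(f) inside a with open('large_file.csv', newline='') as f: block, so memory use depends on the number of categories rather than the number of rows.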

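R optimization:

# 1. Prefer vectorized operations
# Slow: explicit loop that grows the vector one element at a time
result <- c()
for (i in seq_along(data)) {
  result[i] <- data[i] * 2
}

# Fast
result <- data * 2

# 2. Use data.table for large data
library(data.table)
dt <- fread('large_file.csv')
result <- dt[, .(mean_value = mean(value)), by = category]

# 3. Use parallel for parallel computation
# (data_list and my_function are placeholders for your own inputs and function)
library(parallel)
cl <- makeCluster(4)
results <- parLapply(cl, data_list, my_function)
stopCluster(cl)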

5.4 Debugging and Testing

Python debugging:

import pdb

def complex_function(x, y):
    # Set a breakpoint (on Python 3.7+, the built-in breakpoint() does the same)
    pdb.set_trace()
    result = x / y
    return result ** 2

# Use logging
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def process_data(data):
    logger.info(f"Processing {len(data)} records")
    # Processing logic goes here
    logger.info("Processing complete")
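
This section's title mentions testing, so a minimal unit test may be a useful companion to the debugging tools. The sketch below uses the standard-library unittest module and restates the calculate_mean function from section 1.1; the test class name and the chosen cases are assumptions for illustration.

```python
import unittest


def calculate_mean(numbers):
    """Return the arithmetic mean of a non-empty list."""
    return sum(numbers) / len(numbers)


class TestCalculateMean(unittest.TestCase):
    def test_typical_values(self):
        self.assertEqual(calculate_mean([10, 20, 30, 40, 50]), 30)

    def test_empty_list_raises(self):
        # Dividing by len([]) == 0 raises ZeroDivisionError
        with self.assertRaises(ZeroDivisionError):
            calculate_mean([])


# Run the tests programmatically so the snippet also works inside a notebook
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCalculateMean)
test_result = unittest.TextTestRunner(verbosity=0).run(suite)
print(f"Tests run: {test_result.testsRun}, failures: {len(test_result.failures)}")
```

In a project laid out as in section 5.1, tests like these would usually live in their own files and be run from the command line, for example with python -m unittest.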

R debugging:

# Use browser()
my_function <- function(x) {
  browser()  # set a breakpoint
  result <- x * 2
  return(result)
}

# Use tryCatch to handle errors
result <- tryCatch({
  risky_operation()
}, error = function(e) {
  message("Error occurred: ", e$message)
  return(NA)
})
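
The same fall-back pattern can be written in Python with try/except. In this sketch, risky_operation is a hypothetical stand-in that always fails, safe_call is a hypothetical wrapper, and returning float("nan") plays the role of NA; all three are assumptions for illustration.

```python
import logging
import math

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def risky_operation():
    """Hypothetical operation that fails, standing in for real work."""
    raise ValueError("input file is malformed")


def safe_call(func):
    """Call func(); on failure, log the error and return NaN instead of raising."""
    try:
        return func()
    except Exception as exc:
        logger.error("Error occurred: %s", exc)
        return float("nan")


result = safe_call(risky_operation)
print(math.isnan(result))
```

Catching the broad Exception class keeps the sketch short; in practice it is usually better to catch only the specific errors you expect.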

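5.5 Resources for Continued Learning

Python resources:

  • Official documentation: docs.python.org
  • Online courses on DataCamp and Coursera
  • Kaggle competitions and datasets
  • Open-source projects on GitHub

R resources: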

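Conclusion

Data science is a fast-moving field, and core Python and R skills are an essential step toward becoming a data scientist. This article has covered how both languages are used in data science, from basic syntax to advanced applications and from data processing to machine learning.

Key takeaways:

  1. Python: well suited to engineering work, production deployment, and deep learning
  2. R: well suited to statistical analysis, academic research, and data visualization
  3. Both together: play to each language's strengths to work more efficiently

Suggested next steps:

  1. Pick one language to learn in depth (Python is a good place to start)
  2. Practice on real projects
  3. Take part in open-source communities and competitions
  4. Keep learning new techniques and tools

Remember that programming is only a tool; the real value lies in using data to solve real problems. Stay curious and keep learning, and you will be well placed to succeed in data science.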


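This document is an introductory guide to Python and R for data science, with many runnable code examples. Readers are encouraged to work through it in order and apply what they learn in real projects.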