搬运工 posted on 2022-8-13 22:00:04

Data Analytics with Hadoop (Hadoop数据分析)



Preface
Part I: Introduction to Distributed Computing
Chapter 1: The Age of the Data Product
1.1 What Is a Data Product
1.2 Building Data Products at Scale with Hadoop
1.2.1 Leveraging Large Datasets
1.2.2 Hadoop for Data Products
1.3 The Data Science Pipeline and the Hadoop Ecosystem
1.4 Summary
Chapter 2: An Operating System for Big Data
2.1 Basic Concepts
2.2 Hadoop Architecture
2.2.1 Hadoop Clusters
2.2.2 HDFS
2.2.3 YARN
2.3 Working with a Distributed File System
2.3.1 Basic File System Operations
2.3.2 HDFS File Permissions
2.3.3 Other HDFS Interfaces
2.4 Working with Distributed Computation
2.4.1 MapReduce: A Functional Programming Model
2.4.2 MapReduce: Implementation on a Cluster
2.4.3 Beyond a Single MapReduce: Job Chaining
2.5 Submitting a MapReduce Job to YARN
2.6 Summary
Chapter 3: A Framework for Python and Hadoop Streaming
3.1 Hadoop Streaming
3.1.1 Computing on CSV Data with Streaming
3.1.2 Executing Streaming Jobs
3.2 A MapReduce Framework for Python
3.2.1 Phrase Counting
3.2.2 Other Frameworks
3.3 Advanced MapReduce
3.3.1 combiner
3.3.2 partitioner
3.3.3 Job Chaining
3.4 Summary
Chapter 4: In-Memory Computing with Spark
4.1 Spark Basics
4.1.1 The Spark Stack
4.1.2 RDD
4.1.3 Programming with RDDs
4.2 Interactive Spark with PySpark
4.3 Writing Spark Applications
4.4 Summary
Chapter 5: Distributed Analysis and Patterns
5.1 Computing with Keys
5.1.1 Compound Keys
5.1.2 Keyspace Patterns
5.1.3 pair versus stripe
5.2 Design Patterns
5.2.1 Summarization
5.2.2 Indexing
5.2.3 Filtering
5.3 Toward Last-Mile Analytics
5.3.1 Model Fitting
5.3.2 Model Validation
5.4 Summary
Part II: Workflows and Tools for Big Data Science
Chapter 6: Data Mining and Data Warehousing
6.1 Structured Data Queries with Hive
6.1.1 The Hive Command-Line Interface (CLI)
6.1.2 The Hive Query Language
6.1.3 Data Analysis with Hive
6.2 HBase
6.2.1 NoSQL and Column-Oriented Databases
6.2.2 Real-Time Analytics with HBase
6.3 Summary
Chapter 7: Data Ingestion
7.1 Importing Relational Data with Sqoop
7.1.1 Importing from MySQL to HDFS
7.1.2 Importing from MySQL to Hive
7.1.3 Importing from MySQL to HBase
7.2 Ingesting Streaming Data with Flume
7.2.1 Flume Data Flows
7.2.2 Ingesting Product Impression Data with Flume
7.3 Summary
Chapter 8: Analytics with Higher-Level APIs
8.1 Pig
8.1.1 Pig Latin
8.1.2 Data Types
8.1.3 Relational Operators
8.1.4 User-Defined Functions
8.1.5 Pig Wrap-Up
8.2 Spark's Higher-Level APIs
8.2.1 Spark SQL
8.2.2 DataFrame
8.3 Summary
Chapter 9: Machine Learning
9.1 Scalable Machine Learning with Spark
9.1.1 Collaborative Filtering
9.1.2 Classification
9.1.3 Clustering
9.2 Summary
Chapter 10: Summary: Doing Distributed Data Science
10.1 The Data Product Lifecycle
10.1.1 Data Lakes
10.1.2 Data Ingestion
10.1.3 Computational Data Stores
10.2 The Machine Learning Lifecycle
10.3 Summary
Appendix A: Creating a Hadoop Pseudo-Distributed Development Environment
Appendix B: Installing Hadoop Ecosystem Products
Glossary
About the Authors
About the Cover


李才哥 posted on 2022-8-14 00:14:02

Nothing more to say, thanks to the OP for sharing!

xiaod posted on 2022-8-14 00:22:46

Nothing more to say, thanks to the OP for sharing!

taipingyang2021 posted on 2022-8-14 03:15:07

Nothing more to say, thanks to the OP for sharing!

hgzhou6 posted on 2022-8-14 07:39:26

Nothing more to say, thanks to the OP for sharing!

neun posted on 2022-8-14 13:35:07

Nothing more to say, thanks to the OP for sharing!

17770767379 posted on 2022-8-14 20:48:06

Nothing more to say, thanks to the OP for sharing!

wuya_scnu posted on 2022-8-16 09:35:43

Just what I needed, thanks for posting this!

牛聪聪345 posted the day before yesterday at 10:29

Nothing more to say, thanks to the OP for sharing!