SparkSQL分析

YMLiang

2019-02-03

sparkSql

SparkSql作为Spark的结构化数据处理模块，提供了非常强大的API，让分析人员用一次，就会为之倾倒，为之着迷，为之至死不渝。在内部，SparkSQL使用额外结构信息来执行额外的优化。在外部，可以使用SQL和DataSet 的API与之交互。

简单排名函数的使用

昨天和今天都登录了的用户id
昨天登录了但是今天没有登录的用户id

--ds1 中的数据代表昨天登录的用户id
--ds2 中的数据代表今天登录的用户id
--准备数据

val ds1 = spark.range(0,20)
val ds2 = spark.range(10,30)
ds1.createOrReplaceTempView("t1")
ds2.createOrReplaceTempView("t2")

--求：
--1. 昨天和今天都登录了的用户id
--2. 昨天登录了但是今天没有登录的用户id

--1. 这里用的是简单的表连接求交集,比较简单,如果看不懂可以把sql语句拆开看效果
sql("""
select id1 from(
select t1.id as id1,if(isnull(t2.id),'未出现',t2.id) as id2 from t1 left join t2 on t1.id = t2.id
)where id1 = id2
""").show
+---+
|id1|
+---+
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+

--2. 在1的基础上稍微变一下,因为求第一天出现第二天没出线时只需要满足第二天的=未出现即可,

sql("""
select id1 from(
select t1.id as id1,if(isnull(t2.id),'未出现',t2.id) as id2 from t1 left join t2 on t1.id = t2.id
)where id2 = '未出现'
""").show

+---+     
|id1|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

--3. 如果是求第二天出现第一天未出现的话t1 t2换个位置就好了

sql("""
select id2 from(
select if(isnull(t1.id),'未出现',t1.id) as id1,t2.id as id2 from t1 right join t2 on t1.id = t2.id
)where id1 = '未出现'
""").show

+---+
|id2|
+---+
| 20|
| 21|
| 22|
| 23|
| 24|
| 25|
| 26|
| 27|
| 28|
| 29|
+---+

这里如果对isnull函数不理解的话可以参考我写的,逻辑就是如果(if)字段为空isnull(字段),就替换为’未出现’ 否则还是原有字段,好像还有其他写法,大家自行百度吧,达到需求就可以了

if(isnull(t1.id),'未出现',t1.id)

每个部门工资最高的前3名

--这个数据表头是csv文件格式便于sparksql进行自动类型匹配
EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,SMITH,CLERK,7902,2001-01-02 22:12:13,800,,20
7499,ALLEN,SALESMAN,7698,2002-01-02 22:12:13,1600,300,30
7521,WARD,SALESMAN,7698,2003-01-02 22:12:13,1250,500,30
7566,JONES,MANAGER,7839,2004-01-02 22:12:13,2975,,20
7654,MARTIN,SALESMAN,7698,2005-01-02 22:12:13,1250,1400,30
7698,BLAKE,MANAGER,7839,2005-04-02 22:12:13,2850,,30
7782,CLARK,MANAGER,7839,2006-03-02 22:12:13,2450,,10
7788,SCOTT,ANALYST,7566,2007-03-02 22:12:13,3000,,20
7839,KING,PRESIDENT,,2006-03-02 22:12:13,5000,,10
7844,TURNER,SALESMAN,7698,2009-07-02 22:12:13,1500,0,30
7876,ADAMS,CLERK,7788,2010-05-02 22:12:13,1100,,20
7900,JAMES,CLERK,7698,2011-06-02 22:12:13,950,,30
7902,FORD,ANALYST,7566,2011-07-02 22:12:13,3000,,20
7934,MILLER,CLERK,7782,2012-11-02 22:12:13,1300,,10

--最后达到的要求
+------+----+----+----+
|deptno|sal1|sal2|sal3|
+------+----+----+----+
|    10|5000|2450|1300|
|    20|3000|3000|2975|
|    30|2850|1600|1500|
+------+----+----+----+

--("header", true)表示如果这个是数据本身是有列名的就按第一行的列名去建表,列名就是这个文件的第一行的每个单词
--("inferschema", true)表示自动类型推断,我们可以用df.printSchema来看spark给我们匹配的类型是什么
--.csv("hdfs://master:9000/data/a.log")

val df = spark.read.option("header",true).option("inferschema",true).csv("hdfs://master:9000/data/a.log")

--建临时表
df.createOrReplaceTempView("temp")

row_number() over() –> 排名函数的用法

--这里我都是分布来算的,便于自己理清思路,同时用多层子查询时这样写思路会很清晰

sql("""
select deptno,sal,row_number() over(partition by deptno order by sal desc) as salnum from temp having salnum < 4
""").show

+------+----+------+                                                            
|deptno| sal|salnum|
+------+----+------+
|    20| 800|     1|
|    20|1100|     2|
|    20|2975|     3|
|    10|1300|     1|
|    10|2450|     2|
|    10|5000|     3|
|    30| 950|     1|
|    30|1250|     2|
|    30|1250|     3|
+------+----+------+

collect_list(sal) 列转行配合group by来使用

sql("""
select deptno,collect_list(sal) as sal from(
select deptno,sal,row_number() over(partition by deptno order by sal desc) as salnum from temp having salnum < 4)
group by deptno
order by deptno
""").show

+------+------------------+                                                     
|deptno|               sal|
+------+------------------+
|    10|[5000, 2450, 1300]|
|    20|[3000, 3000, 2975]|
|    30|[2850, 1600, 1500]|
+------+------------------+

--sal[0],sal[1],sal[2]实质上类似于数组的取值,可以把sal看作是数组,这一步与求结果无关只是展示一下怎么取值

sql("""
select deptno,sal[0],sal[1],sal[2] from(
select deptno,collect_list(sal) as sal from(
select deptno,sal,row_number() over(partition by deptno order by sal desc) as salnum from temp having salnum < 4)
group by deptno)
order by deptno
""").show
+------+------+------+------+                                                   
|deptno|sal[0]|sal[1]|sal[2]|
+------+------+------+------+
|    10|  5000|  2450|  1300|
|    20|  3000|  3000|  2975|
|    30|  2850|  1600|  1500|
+------+------+------+------+


--求每个部门的最高工资
--first_value(sal)  sal中的第一个值

sql("""
select distinct deptno,first from (
select deptno,sal,first_value(sal) over(partition by deptno order by sal desc) as first from temp
)order by deptno
""").show

+------+-----+                                                                  
|deptno|first|
+------+-----+
|    10| 5000|
|    20| 3000|
|    30| 2850|
+------+-----+

--求每个部门的最低工资
--求的时候order by 别加desc就好了
sql("""
select distinct deptno,first from (
select deptno,sal,first_value(sal) over(partition by deptno order by sal) as first from temp
)order by deptno
""").show

+------+-----+                                                                  
|deptno|first|
+------+-----+
|    10| 1300|
|    20|  800|
|    30|  950|
+------+-----+

求NBA球队连冠记录是在哪年到哪年

--这个数据与上一题的数据格式一样的
team,y
活塞,1990
公牛,1991
公牛,1992
公牛,1993
火箭,1994
火箭,1995
公牛,1996
公牛,1997
公牛,1998
马刺,1999
湖人,2000
湖人,2001
湖人,2002
马刺,2003
活塞,2004
马刺,2005
热火,2006
马刺,2007
凯尔特人,2008
湖人,2009
湖人,2010

--最终数据
+----+------+------+
|team|min(y)|max(y)|
+----+------+------+
|公牛|  1991|  1993|
|火箭|  1994|  1995|
|公牛|  1996|  1998|
|湖人|  2000|  2002|
|湖人|  2009|  2010|
+----+------+------+

--做这个题的时候我们要清除题的要求,是求连冠的球队,连冠指的是中间不能有其他球队夺冠,必须>=2才可以,而且年份要挨着

val df2 = spark.read.
option("header", true).
option("inferschema", true).
csv("hdfs://master:9000/data/team.log")
df2.createOrReplaceTempView("t2")

--这里我也是分步骤来求,便于理清思路
--这里用了y-row_number(),为什么这么用,在于如果是连冠,row_number()加排名函数,按球队名称做partition by,用夺冠年份减去row_number得到的数就是相同的,这里比较难理解,如果搞懂这里题也就做完了

-这里的a是分组排名的序号 aa也就是 y-a的值

sql("""
select team,y,a,aa from (
select team,y,row_number() over(partition by team order by y)as a,y-row_number() over(partition by team order by y)as aa from t2)
""").show


+----+----+---+----+  
|team|   y|  a|  aa|
+----+----+---+----+
|  热火|2006|  1|2005|
|  活塞|1990|  1|1989|
|  活塞|2004|  2|2002|
|  火箭|1994|  1|1993|
|  火箭|1995|  2|1993|
|凯尔特人|2008|  1|2007|
|  湖人|2000|  1|1999|
|  湖人|2001|  2|1999|
|  湖人|2002|  3|1999|
|  湖人|2009|  4|2005|
|  湖人|2010|  5|2005|
|  公牛|1991|  1|1990|
|  公牛|1992|  2|1990|
|  公牛|1993|  3|1990|
|  公牛|1996|  4|1992|
|  公牛|1997|  5|1992|
|  公牛|1998|  6|1992|
|  马刺|1999|  1|1998|
|  马刺|2003|  2|2001|
|  马刺|2005|  3|2002|
+----+----+---+----+

--只有重复出现的连续team,之后减去row_number()后才会有相同的aa列字段 此相同的aa字段代表在相同的字段下，球队是处于连冠状态

--按球队和aa分组，如果不按aa分组 那么公牛的连冠记录中间隔的几年也会被算进去,这样处理完的数据就不算是连冠的数据

sql("""
select team,min(y) miny,max(y) maxy from(
select team,y,y-row_number() over(partition by team order by y)as aa from t2)
group by team,aa
""").show

+----+----+----+
|team|miny|maxy|
+----+----+----+
|  热火|2006|2006|
|  活塞|1990|1990|
|  活塞|2004|2004|
|  火箭|1994|1995|
|凯尔特人|2008|2008|
|  湖人|2000|2002|
|  湖人|2009|2010|
|  公牛|1991|1993|
|  公牛|1996|1998|
|  马刺|1999|1999|
|  马刺|2003|2003|
|  马刺|2005|2005|
|  马刺|2007|2007|
+----+----+----+


--判断最小值是否小于最大值，目的是为了排除单独一年获得冠军的球队即最小值=最大值，取出连冠过的球队
--最终要尽可能的精简代码,这样会给面试官一个好印象(代码洁癖什么的???),也会显得自己sql写的很好

sql("""
select team,min(y) miny,max(y) maxy from(
select team,y,y-row_number() over(partition by team order by y)as aa from t2)
group by team,aa
having min(y)<max(y)
order by miny
""").show

--  大功告成
+----+----+----+
|team|miny|maxy|
+----+----+----+
|  公牛|1991|1993|
|  火箭|1994|1995|
|  公牛|1996|1998|
|  湖人|2000|2002|
|  湖人|2009|2010|
+----+----+----+

好啦大功告成！！！，有问题可以联系我我们一起交流
616581760 微信QQ同号

YMLiang

SparkSQL分析

简单排名函数的使用

每个部门工资最高的前3名

求NBA球队连冠记录是在哪年到哪年

上一页

浅谈项目中用到redis的地方

redis

下一页

SQL求K线波峰波谷

SparkSql

评论