一、SQL 书写技巧

1.1 常用函数

统计学相关

分位数
1. percentile：percentile(col, p) col 是要计算的列（值必须为 int 类型），p 的取值为 0-1，若为 0.5，那么就是 2 分位数，即中位数。
2. percentile_approx：percentile_approx(col, p)。列为数值类型都可以。percentile_approx 还有一种形式 percentile_approx(col, p，B)，参数 B 控制内存消耗的近似精度，B 越大，结果的精度越高。默认值为 10000。当 col 字段中的 distinct 值的个数小于 B 时，结果就为准确的百分位数。

1.2 技巧

SELECT 查询排除一列或多列

set hive.support.quoted.identifiers=none;
select `(ds)?+.+`  from `test`.`table` where ds='20231023';

行列转换

行转列

使用 LATERAL VIEW 可以和 split 和 explode 一起等 UDTF 用于讲一行才分为多行

select  id,filed2 from test5_text a  lateral view explode(collect_set(split(a.msg,','))) table1 as filed2  limit 10;

-- 输入
a 1,2,3、
-- 结果
a 1
a 2
a 3

列转行
使用 concat_ws() 函数可以实现多列转一行数据实现，collect_set 函数为去重函数
```
SELECT f1,concat_ws(',',collect_set(f3)) from a group by f1;

-- 输入
a 1
a 2
a 3
-- 结果
a 1,2,3
```

导出 CSV 文件

在导出 csv 后，使用 Excel 大家文件的时候，会遇到这样的问题，文件的存在少量串行的数据，虽然不影响整体数据，但是有的时候遇到高精度计算的过程中，还是难免有些尴尬?

导出的时候进行转义
导出的时候使用其他格式，在 Excel 中使用数据-> 文件-> 自文本导入使用自定义格式导入
使用 Hive 序列化导入,会自动规整格式

CREATE TABLE  wys.wys_output
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar"  = "\"",
"escapeChar"  = "\""
)
STORED AS TEXTFILE
as
select distinct a,b
from xxx;

Excel 中 csv 格式中会出现\n 使用 rtrim 和 ltrim 可以去除,每一列数据都是以 " 分割数据,但是在数据出现 " 号的话就会导致数据串行,此时我们可以使用 "" 来代表那是数据中双引号

MySQL 转 MaxCompute DDL

-- 本节点类型适用于MySQL数据源
-- 防止数据concat不全
SET SESSION group_concat_max_len = 1000000;

-- 查询每个字段的最大值，28是格式化的最小值，该指应该设置在如下关键字的加上_coll的长度，因为RPAD如果超过该长度会进行截取
-- ('sort','status','comment','count','from','end','desc','lifecycle','index','table') then concat(`COLUMN_NAME`,'_coll')
-- select 
--   `TABLE_SCHEMA`,
--   `TABLE_NAME`,
--   case 
--        when max(length(COLUMN_NAME)) > 28 then max(length(COLUMN_NAME)) 
--        else 28
--   end as lc
-- from `information_schema`.`COLUMNS`
-- where `TABLE_SCHEMA` not in ('information_schema','mysql') and `TABLE_NAME` not in ('migrations')
-- group by `TABLE_SCHEMA`,`TABLE_NAME`

select concat('-- ', tl.d_name, '.', tl.t_name, '\n', 'create table ods_', replace(tl.d_name, '-', '_'), '_',
              replace(tl.t_name, '-', '_'), '_df(\n', tr.fs,
              '\n)partitioned by (ds string comment \'时间分区\')\nSTORED AS ALIORC TBLPROPERTIES (\'comment\'=\'',
              tl.t_comment, '\');') as create_sql
from (select `TABLE_SCHEMA`  as `d_name`,
             `TABLE_NAME`    as `t_name`,
             `TABLE_COMMENT` as `t_comment`
      from `information_schema`.`TABLES`
      where `TABLE_SCHEMA` not in ('information_schema', 'mysql')
        and `TABLE_TYPE` = 'BASE TABLE') tl
         inner join
     (select `db_name`, `t_name`, GROUP_CONCAT(concat('  ', f) SEPARATOR ",\n") as fs
      from (select db_name,
                   `t_name`,
                   concat(RPAD(col_name, tt1.lc, ' '), ' ', RPAD(type_name, 15, ' '), ' comment \'',
                          replace(col_comment, '\n', ''), '\'') as f
            from (select `TABLE_SCHEMA`     as `db_name`,
                         `TABLE_NAME`       as `t_name`,
                         -- ODPS 关键字字段转换
                         case
                             when `COLUMN_NAME` in
                                  ('sort', 'status', 'comment', 'count', 'from', 'end', 'desc', 'lifecycle', 'index',
                                   'table') then concat(`COLUMN_NAME`, '_coll')
                             else `COLUMN_NAME`
                             end            as `col_name`,
                         -- 类型转换
                         case
                             when `DATA_TYPE` in ('timestamp', 'datetime') then 'datetime'
                             when `DATA_TYPE` in
                                  ('date', 'json', 'varchar', 'text', 'longtext', 'enum', 'date', 'time', 'char',
                                   'mediumtext') then 'string'
                             when `DATA_TYPE` in ('tinyint') then 'boolean'
                             when `DATA_TYPE` in ('int', 'bigint', 'smallint', 'mediumint') and `NUMERIC_PRECISION` > 4
                                 then 'bigint'
                             when `DATA_TYPE` in ('int', 'bigint', 'smallint', 'mediumint') and `NUMERIC_PRECISION` <= 4
                                 then 'int'
                             when `DATA_TYPE` in ('decimal') then 'decimal'
                             when `DATA_TYPE` in ('double') then 'double'
                             else 'string'
                             end            as type_name,
                         case
                             when (`COLUMN_COMMENT` = '' or `COLUMN_COMMENT` is null) and `COLUMN_NAME` = 'id'
                                 then '主键ID'
                             when (`COLUMN_COMMENT` = '' or `COLUMN_COMMENT` is null) and `COLUMN_NAME` = 'created_at'
                                 then '创建时间'
                             when (`COLUMN_COMMENT` = '' or `COLUMN_COMMENT` is null) and `COLUMN_NAME` = 'updated_at'
                                 then '修改时间'
                             when (`COLUMN_COMMENT` = '' or `COLUMN_COMMENT` is null) and `COLUMN_NAME` = 'deleted_at'
                                 then '删除时间'
                             else `COLUMN_COMMENT`
                             end            as `col_comment`,
                         `ORDINAL_POSITION` as `position`
                  from `information_schema`.`COLUMNS`
                  where `TABLE_SCHEMA` not in ('information_schema', 'mysql')
                    and `TABLE_NAME` not in ('migrations')) t1
                     inner join (select `TABLE_SCHEMA`,
                                        `TABLE_NAME`,
                                        case
                                            when max(length(COLUMN_NAME)) > 28 then max(length(COLUMN_NAME))
                                            else 28 end as lc
                                 from `information_schema`.`COLUMNS`
                                 where `TABLE_SCHEMA` not in ('information_schema', 'mysql')
                                   and `TABLE_NAME` not in ('migrations')
                                 group by `TABLE_SCHEMA`, `TABLE_NAME`) tt1
                                on t1.db_name = tt1.`TABLE_SCHEMA` and t1.t_name = tt1.`TABLE_NAME`
            order by db_name, t_name, position asc) t2
      group by `db_name`, `t_name`) tr on tl.d_name = tr.db_name and tl.t_name = tr.t_name;

二、编程技巧

Scala

Java和Scala集合转换

Scala 2.11
```
import collection.JavaConverters._
```

Scala 2.13

import scala.jdk.CollectionConverters._

可变参数

Scala Array传Java
```
arrayParm: _*
```
Scala 构建可变参数函数
```
@varargs def process(args: String*)
```

范型装箱/拆箱编译优化

class Test[@specialized(Int,Long,Double) A]

目录CONTENT

持续更新 | 数工常用片段