Kylin SQL快速参考


Apache Kylin使用Apache Calcite做SQL解析和优化。
作为一款OLAP引擎, Kylin支持SELECT 操作,而不支持其它操作例如 INSERTUPDATEDELETE,因此Kylin的SQL语法是Apache Calcite 支持语法的一个子集。
本文列举了Kylin支持的SQL语法函数以及数据类型,但可能并不完整。您可以查看Calcite SQL reference 以了解更多内容。

语法

QUERY SYNTAX
SELECT
STATEMENT
EXPRESSION
SUBQUERY
JOIN
INNER JOIN
LEFT JOIN
UNION
UNION ALL

函数

COUNT
COUNT(COLUMN)
COUNT(*)
COUNT_DISTINCT
MAX
MIN
PERCENTILE
SUM
TOP_N

WINDOW
ROW_NUMBER
AVG
RANK
DENSE_RANK
FIRST_VALUE
LAST_VALUE
LAG
LEAD
NTILE
CASE WHEN
CAST

SUSTRING
COALESCE
STDDEV_SUM
INTERSECT_COUNT
INTERSECT_VALUE

Sql 执行计划

你可以通过在Sql的前面添加explain plan for来获取Sql的执行计划,比如:

explain plan for select count(*) from KYLIN_SALES

前端展示的是一个一行的执行计划的字符串,最好通过结果导出功能将执行计划导出后查看。

数据类型

ANY CHAR VARCHAR STRING BOOLEAN
BYTE BINARY INT SHORT LONG
INTEGER TINYINT SMALLINT BIGINT TIMESTAMP
FLOAT REAL DOUBLE DECIMAL DATETIME
NUMERIC DATE TIME

查询语法

statement:
|  query

query:
      values
  |  WITH withItem [ , withItem ]* query
  |   {
          select
      |  selectWithoutFrom
      |  query UNION [ ALL | DISTINCT ] query
      |  query INTERSECT [ ALL | DISTINCT ] query
      }
      [ ORDER BY orderItem [, orderItem ]* ]
      [ LIMIT { count | ALL } ]
      [ OFFSET start { ROW | ROWS } ]
      [ FETCH { FIRST | NEXT } [ count ] { ROW| ROWS } ]

withItem:
      name
      ['(' column [, column ]* ')' ]
      AS '(' query ')'

orderItem:
      expression [ ASC | DESC ][ NULLS FIRST |NULLS LAST ]

select:
      SELECT [ ALL | DISTINCT]
          { * | projectItem [, projectItem ]* }
      FROM tableExpression
      [ WHERE booleanExpression ]
      [ GROUP BY { groupItem [, groupItem ]* }]
      [ HAVING booleanExpression ]
      [ WINDOW windowName AS windowSpec [,windowName AS windowSpec ]* ]

selectWithoutFrom:
      SELECT [ ALL | DISTINCT ]
          { * | projectItem [, projectItem ]* }

projectItem:
      expression [ [ AS ] columnAlias ]
  |  tableAlias . *

tableExpression:
      tableReference [, tableReference ]*
  |  tableExpression [ NATURAL ][( LEFT | RIGHT | FULL ) [ OUTER ] ] JOIN tableExpression [ joinCondition ]

joinCondition:
      ON booleanExpression
  |  USING '(' column [, column ]* ')'

tableReference:
      tablePrimary
      [ matchRecognize ]
      [ [ AS ] alias [ '(' columnAlias [,columnAlias ]* ')' ] ]

tablePrimary:
      [ [ catalogName . ] schemaName . ] tableName
      '(' TABLE [ [ catalogName . ] schemaName. ] tableName ')'
  |   [ LATERAL ] '(' query ')'
  |  UNNEST '(' expression ')' [ WITH ORDINALITY ]
  |   [ LATERAL ] TABLE '(' [ SPECIFIC ] functionName '(' expression [, expression ]*')' ')'

values:
      VALUES expression [, expression ]*

groupItem:
      expression
  |   '('')'
  |   '('expression [, expression ]* ')'
  |  GROUPING SETS '(' groupItem [, groupItem ]* ')'

windowRef:
      windowName
  |  windowSpec

windowSpec:
      [windowName ]
      '('
      [ ORDER BY orderItem [, orderItem ]* ]
      [ PARTITION BY expression [, expression]* ]
      [
          RANGE numericOrIntervalExpression { PRECEDING | FOLLOWING }
      |  ROWS numericExpression { PRECEDING | FOLLOWING }
      ]
    ')'

SELECT

STATEMENT

SELECT 用于从表中选取数据。COUNT 用于统计数据。DISTINCT 过滤掉重复的结果。AS 用于给表或列起别名。FROM 指定要查询的表。JOIN 用于连接两个表以获取所需的数据。WHERE 用于规定选择的标准。LIKE 用于在 WHERE 子句中搜索列中的指定模式。BETWEEN ... AND 选取介于两个值之间的数据范围。AND 和 OR 用于基于一个以上的条件对记录进行过滤。GROUP BY 按给定表达式对结果进行分组。HAVING 用于分组后过滤行。ORDER BY 用于对结果集进行排序,通常和 TOPN 一起使用。LIMIT 用来限制查询返回的行数。

例子:

SELECT COUNT(*) FROM kylin_sales;

SELECT COUNT(DISTINCT seller_id) FROM kylin_sales;

SELECT seller_id, COUNT(1) FROM kylin_sales GROUP BY seller_id;

SELECT lstg_format_name, SUM(price) AS gmv, COUNT(DISTINCT seller_id) AS dist_seller FROM kylin_sales WHERE lstg_format_name='FP-GTC' GROUP BY lstg_format_name HAVING COUNT(DISTINCT seller_id) > 50;

SELECT lstg_format_name FROM kylin_sales INNER JOIN kylin_cal_dt AS kylin_cal_dt ON kylin_sales.part_dt = kylin_cal_dt.cal_dt WHERE NOT(lstg_format_name NOT LIKE '%ab%') GROUP BY lstg_format_name;

SELECT kylin_cal_dt.cal_dt FROM kylin_sales INNER JOIN kylin_cal_dt AS kylin_cal_dt ON kylin_sales.part_dt = kylin_cal_dt.cal_dt WHERE kylin_cal_dt.cal_dt BETWEEN DATE '2013-01-01' AND DATE '2013-06-04';

SELECT kylin_sales.lstg_format_name, SUM(kylin_sales.price) AS gmv, COUNT(*) AS trans_cnt FROM kylin_sales WHERE kylin_sales.lstg_format_name IS NULL GROUP BY kylin_sales.lstg_format_name HAVING SUM(price)>5000 AND COUNT(*)>72; 

SELECT kylin_sales.lstg_format_name, SUM(kylin_sales.price) AS gmv, COUNT(*) AS trans_cnt FROM kylin_sales WHERE kylin_sales.lstg_format_name IS NOT NULL GROUP BY kylin_sales.lstg_format_name HAVING SUM(price)>5000 OR COUNT(*)>20;

SELECT lstg_format_name, SUM(price) AS gmv, COUNT(1) AS trans_cnt FROM kylin_sales WHERE lstg_format_name='FP-GTC' GROUP BY lstg_format_name ORDER BY lstg_format_name LIMIT 10;

EXPRESSION

在 SELECT 语句中的表达式。 可以使用 * 选择表中的所有列。
例子:
1. *
2. 将 ID 作为值
3. 值 + 1

SUBQUERY

例子:

SELECT cal_dt ,sum(price) AS sum_price FROM (SELECT kylin_cal_dt.cal_dt, kylin_sales.price FROM kylin_sales INNER JOIN kylin_cal_dt AS kylin_cal_dt ON kylin_sales.part_dt = kylin_cal_dt.cal_dt INNER JOIN kylin_category_groupings ON kylin_sales.leaf_categ_id = kylin_category_groupings.leaf_categ_id AND kylin_sales.lstg_site_id = kylin_category_groupings.site_id) t GROUP BY cal_dt;

JOIN

INNER JOIN

在表中存在至少一个匹配时,INNER JOIN 关键字返回行。
例子:

SELECT kylin_cal_dt.cal_dt, kylin_sales.price FROM kylin_sales INNER JOIN kylin_cal_dt AS kylin_cal_dt ON kylin_sales.part_dt = kylin_cal_dt.cal_dt;

LEFT JOIN

使用 LEFT JOIN 关键字会从左表 (kylin_sales) 那里返回所有的行,即使在右表 (kylin_category_groupings) 中没有匹配的行。
例子:

SELECT seller_id FROM kylin_sales LEFT JOIN kylin_category_groupings AS kylin_category_groupings ON kylin_sales.leaf_categ_id = kylin_category_groupings.leaf_categ_id AND kylin_sales.lstg_site_id = kylin_category_groupings.site_id WHERE lstg_format_name='FP-GTC' GROUP BY seller_id LIMIT 20;

UNION

UNION 操作符用于合并两个或多个 SELECT 语句的结果集。
注意 UNION 内部的 SELECT 语句必须拥有相同数量的列。列也必须拥有相似的数据类型。同时,每条 SELECT 语句中的列的顺序必须相同。
默认地,UNION 操作符选取不同的值。如果允许重复的值,请使用 UNION ALL

例子:

SELECT SUM(price) AS x, seller_id, COUNT(*) AS y FROM kylin_sales WHERE part_dt < DATE '2012-08-01' GROUP BY seller_id UNION (SELECT SUM(price) AS x, seller_id, COUNT(*) AS y FROM kylin_sales WHERE part_dt > DATE '2012-12-01' GROUP BY seller_id);

UNION ALL

UNION ALL 命令和 UNION 命令几乎是等效的,不过 UNION ALL 命令会列出所有的值。

例子:

SELECT COUNT(trans_id) AS trans_id FROM kylin_sales AS test_a WHERE trans_id <> 1 UNION ALL SELECT COUNT(trans_id) AS trans_id FROM kylin_sales AS test_b;

COUNT

用于返回与指定条件匹配的行数。

COUNT(COLUMN)

例子:

SELECT COUNT(seller_id) FROM kylin_sales;

COUNT(*)

例子:

SELECT COUNT(*) FROM kylin_sales;

COUNT_DISTINCT

例子:

SELECT COUNT(DISTINCT seller_id) AS DIST_SELLER FROM kylin_sales;

MAX

返回一列中的最大值。NULL 值不包括在计算中。
例子:

SELECT MAX(lstg_site_id) FROM kylin_sales;

MIN

返回一列中的最小值。NULL 值不包括在计算中。
例子:

SELECT MIN(lstg_site_id) FROM kylin_sales;

PERCENTILE

例子:

SELECT seller_id, PERCENTILE(price, 0.5) FROM kylin_sales GROUP BY seller_id;

SELECT seller_id, PERCENTILE_APPROX(price, 0.5) FROM kylin_sales GROUP BY seller_id;

SUM

返回数值列的总数。
例子:

SELECT SUM(price) FROM kylin_sales;

TOP_N

例子:

SELECT SUM(price) AS gmv
 FROM kylin_sales 
INNER JOIN kylin_cal_dt AS kylin_cal_dt
 ON kylin_sales.part_dt = kylin_cal_dt.cal_dt
 INNER JOIN kylin_category_groupings
 ON kylin_sales.leaf_categ_id = kylin_category_groupings.leaf_categ_id AND kylin_sales.lstg_site_id = kylin_category_groupings.site_id
 WHERE kylin_cal_dt.cal_dt between DATE '2013-09-01' AND DATE '2013-10-01' AND (lstg_format_name='FP-GTC' OR 'a' = 'b')
 GROUP BY kylin_cal_dt.cal_dt;
 
SELECT kylin_sales.part_dt, seller_id
FROM kylin_sales
INNER JOIN kylin_cal_dt AS kylin_cal_dt
ON kylin_sales.part_dt = kylin_cal_dt.cal_dt
INNER JOIN kylin_category_groupings
ON kylin_sales.leaf_categ_id = kylin_category_groupings.leaf_categ_id
AND kylin_sales.lstg_site_id = kylin_category_groupings.site_id 
GROUP BY 
kylin_sales.part_dt, kylin_sales.seller_id ORDER BY SUM(kylin_sales.price) DESC LIMIT 20;

WINDOW

WINDOW 函数在和当前行相关的一组表行上执行计算。
注意WINDOW 函数中必须有 OVER 子句

ROW_NUMBER

例子:

SELECT lstg_format_name, SUM(price) AS gmv, ROW_NUMBER() OVER() FROM kylin_sales GROUP BY lstg_format_name;

AVG

返回数值列的平均值。NULL 值不包括在计算中。
例子:

SELECT lstg_format_name, AVG(SUM(price)) OVER(PARTITION BY lstg_format_name) FROM kylin_sales GROUP BY part_dt, lstg_format_name;

RANK

例子:

SELECT part_dt, lstg_format_name, SUM(price), RANK() OVER(PARTITION BY lstg_format_name ORDER BY part_dt) AS "rank" FROM kylin_sales GROUP BY part_dt, lstg_format_name;

DENSE_RANK

例子:

SELECT part_dt, lstg_format_name, SUM(price), DENSE_RANK() OVER(PARTITION BY lstg_format_name ORDER BY part_dt) AS "dense_rank" FROM kylin_sales GROUP BY part_dt, lstg_format_name;

FIRST_VALUE

例子:

SELECT part_dt, lstg_format_name, SUM(price) AS gmv, FIRST_VALUE(SUM(price)) OVER(PARTITION BY lstg_format_name ORDER BY part_dt) AS "first" FROM kylin_sales WHERE part_dt < '2012-02-01' GROUP BY part_dt, lstg_format_name;

LAST_VALUE

例子:

SELECT part_dt, lstg_format_name, SUM(price) AS gmv, LAST_VALUE(SUM(price)) OVER(PARTITION BY lstg_format_name ORDER BY part_dt) AS "current" FROM kylin_sales WHERE part_dt < '2012-02-01' GROUP BY part_dt, lstg_format_name;

LAG

例子:

SELECT part_dt, lstg_format_name, SUM(price) AS gmv, LAG(SUM(price), 1, 0.0) OVER(PARTITION BY lstg_format_name ORDER BY part_dt) AS "prev" FROM kylin_sales WHERE part_dt < '2012-02-01' GROUP BY part_dt, lstg_format_name;

LEAD

例子:

SELECT part_dt, lstg_format_name, SUM(price) AS gmv, LEAD(SUM(price), 1, 0.0) OVER(PARTITION BY lstg_format_name ORDER BY part_dt) AS "next" FROM kylin_sales WHERE part_dt < '2012-02-01' GROUP BY part_dt, lstg_format_name;

NTILE

例子:

SELECT part_dt, lstg_format_name, SUM(price) AS gmv, NTILE(4) OVER (PARTITION BY lstg_format_name ORDER BY part_dt) AS "quarter" FROM kylin_sales WHERE part_dt < '2012-02-01' GROUP BY part_dt, lstg_format_name;

CASE WHEN

例子:

SELECT part_dt, lstg_format_name, SUM(price) AS gmv, (CASE LAG(SUM(price), 1, 0.0) OVER(PARTITION BY lstg_format_name ORDER BY part_dt) WHEN 0.0 THEN 0 ELSE SUM(price)/LAG(SUM(price), 1, 0.0) OVER(PARTITION BY lstg_format_name ORDER BY part_dt) END) AS "prev" FROM kylin_sales WHERE part_dt < '2012-02-01' GROUP BY part_dt, lstg_format_name;

CAST

RANGEINTERVAL 关键字指明了范围。PRECEDING 表示前几天(秒/分/时/月/年)。FOLLOWING 表示后几天(秒/分/时/月/年)。
例子:

SELECT part_dt, lstg_format_name, SUM(price) AS gmv, FIRST_VALUE(SUM(price)) OVER (PARTITION BY lstg_format_name ORDER BY CAST(part_dt AS timestamp) RANGE INTERVAL '3' DAY PRECEDING) AS "prev 3 days", LAST_VALUE(SUM(price)) OVER (PARTITION BY lstg_format_name ORDER BY CAST(part_dt AS timestamp) RANGE INTERVAL '3' DAY FOLLOWING) AS "next 3 days" FROM kylin_sales WHERE part_dt < '2012-02-01' GROUP BY part_dt, lstg_format_name;

SUBSTRING

例子:

SELECT SUBSTRING(lstg_format_name, 1) FROM kylin_sales;

COALESCE

例子:

SELECT COALESCE(lstg_format_name, '888888888888') FROM kylin_sales;

STDDEV_SUM

例子: 第一个查询和第二个查询是对等的,stddev_sum 是一个在 KYLIN-3361 引入的 UDAF。

select A, stddev_sum(sampling_dim, m)
from T
group by A
select A, stddev(SUM_M)
from (
      select A, sampling_dim, sum(m) as SUM_M
      from T
      group by A, sampling_dim
) a
group by A

INTERSECT_COUNT

INTERSECT_COUNT函数用于计算留存率,计算留存率的measure必须经过count_distinct精确去重的预计算。
例子1:

select city, version,
intersect_count(uuid, dt, array['20161014']) as first_day,
intersect_count(uuid, dt, array['20161015']) as second_day,
intersect_count(uuid, dt, array['20161016']) as third_day,
intersect_count(uuid, dt, array['20161014', '20161015']) as retention_oneday,
intersect_count(uuid, dt, array['20161014', '20161015', '20161016']) as retention_twoday
from visit_log
where dt in ('2016104', '20161015', '20161016')
group by city, version

例子2: 参考KYLIN-4314

select city, version,
intersect_count(uuid, dt, array['20161014']) as first_day,
intersect_count(uuid, dt, array['20161015']) as second_day,
intersect_count(uuid, dt, array['20161016']) as third_day,
intersect_count(uuid, dt, array['20161014', '20161015']) as retention_oneday,
intersect_count(uuid, dt, array['20161014|20161015', '20161016']) as retention_twoday
from visit_log
where dt in ('2016104', '20161015', '20161016')
group by city, version

INTERSECT_VALUE

INTERSECT_COUNT函数用于返回留存值的bitmap明细,使用它之前必须经过count_distinct精确去重的预计算。
例子:

select city, version,
intersect_value(uuid, dt, array['20161014']) as first_day,
intersect_value(uuid, dt, array['20161015']) as second_day,
intersect_value(uuid, dt, array['20161016']) as third_day,
intersect_value(uuid, dt, array['20161014', '20161015']) as retention_oneday,
intersect_value(uuid, dt, array['20161014|20161015', '20161016']) as retention_twoday
from visit_log
where dt in ('2016104', '20161015', '20161016')
group by city, version

评论区(0)

评论