Using the SequenceFile File Format with Impala

Impala supports using SequenceFile data files.

File Type	Format	Compression Codecs	Impala Can CREATE?	Impala Can INSERT?
SequenceFile	Structured	Snappy, gzip, deflate, bzip2	Yes.	No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.
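
The LOAD DATA route from the table above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the table name seq_import_demo and the HDFS path /user/etl/seq_staging are hypothetical, and the files under that path are assumed to already be valid SequenceFile data matching the table's columns.

```sql
-- Run in impala-shell. Table name and HDFS path are hypothetical.
CREATE TABLE seq_import_demo (x INT) STORED AS SEQUENCEFILE;

-- Moves the existing SequenceFile data files into the table's directory.
-- Impala does not convert the files; they must already be in the right format.
LOAD DATA INPATH '/user/etl/seq_staging' INTO TABLE seq_import_demo;

-- Confirm that the rows are visible to Impala.
SELECT COUNT(*) FROM seq_import_demo;
```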

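Creating SequenceFile Tables and Loading Data

If you do not have an existing data file to use, begin by creating one in the appropriate format.

To create a SequenceFile table:

In the impala-shell interpreter, issue a command similar to: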

create table sequencefile_table (column_specs) stored as sequencefile;

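Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of certain file formats, you might use the Hive shell to load the data.

After loading data into a table through Hive or another mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, so that Impala recognizes the new data.

For example, here is how you might create some SequenceFile tables in Impala (by specifying the columns explicitly, or by cloning the structure of another table), load data through Hive, and query them through Impala: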

$ impala-shell -i localhost
[localhost:21000] > create table seqfile_table (x int) stored as sequencefile;
[localhost:21000] > create table seqfile_clone like some_other_table stored as sequencefile;
[localhost:21000] > quit;

$ hive
hive> insert into table seqfile_table select x from some_other_table;
3 Rows loaded to seqfile_table
Time taken: 19.047 seconds
hive> quit;

$ impala-shell -i localhost
[localhost:21000] > select * from seqfile_table;
Returned 0 row(s) in 0.23s
[localhost:21000] > -- Make Impala recognize the data loaded through Hive;
[localhost:21000] > refresh seqfile_table;
[localhost:21000] > select * from seqfile_table;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.23s

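Complex type considerations: Although you can create tables in this file format using the complex types (ARRAY, STRUCT, and MAP) available in Impala 2.3 and higher, currently Impala can query these types only in Parquet tables. The one exception to this rule is COUNT(*) queries on RCFile tables that include complex types. Such queries are allowed in Impala 2.6 and higher.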

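Enabling Compression for SequenceFile Tables

You may want to enable compression on existing tables. Enabling compression provides performance gains in most cases and is supported for SequenceFile tables. For example, to enable Snappy compression, you would specify the following additional settings when loading data through the Hive shell: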

hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> insert overwrite table new_table select * from old_table;

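If you are converting partitioned tables, you must complete additional steps. In such a case, specify additional settings similar to the following: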

hive> create table new_table (your_cols) partitioned by (partition_cols) stored as new_format;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table new_table partition(comma_separated_partition_cols) select * from old_table;

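Remember that Hive does not require that you specify a source format for it. Consider the case of converting a table with two partition columns called year and month to a Snappy-compressed SequenceFile table.

Combining the components outlined previously to complete this table conversion, you would specify settings similar to the following: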

hive> CREATE TABLE tbl_seq (int_col INT, string_col STRING) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq SELECT * FROM tbl;

To complete a similar process for a table that includes partitions, you would specify settings similar to the following:

hive> CREATE TABLE tbl_seq (int_col INT, string_col STRING) PARTITIONED BY (year INT) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq PARTITION(year) SELECT * FROM tbl;
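
For the two-column case mentioned earlier (partition columns year and month), the same pattern might look like the sketch below. The target table name tbl_seq_ym is hypothetical, and the source table tbl is assumed to contain the columns int_col, string_col, year, and month. The partition columns are listed explicitly rather than relying on SELECT * so that they arrive last and in the same order as the PARTITION clause, which is what Hive dynamic partitioning expects.

```sql
-- Run in the Hive shell. tbl_seq_ym is a hypothetical name; tbl is assumed
-- to have the columns int_col, string_col, year, and month.
CREATE TABLE tbl_seq_ym (int_col INT, string_col STRING)
  PARTITIONED BY (year INT, month INT)
  STORED AS SEQUENCEFILE;

SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;

-- Dynamic partition columns go last in the SELECT list,
-- in the same order as in the PARTITION clause.
INSERT OVERWRITE TABLE tbl_seq_ym PARTITION (year, month)
  SELECT int_col, string_col, year, month FROM tbl;
```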

Note:

The compression type is specified in the following command:

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

You could elect to specify alternative codecs such as GzipCodec here.
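
As an illustration of swapping the codec, the sketch below repeats the earlier unpartitioned conversion with gzip instead of Snappy. It assumes the tables tbl_seq and tbl from the preceding examples; only the codec class name changes.

```sql
-- Run in the Hive shell before the INSERT that writes the data.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
-- GzipCodec replaces SnappyCodec; the other settings are unchanged.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
INSERT OVERWRITE TABLE tbl_seq SELECT * FROM tbl;
```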

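Query Performance for Impala SequenceFile Tables

In general, expect query performance with SequenceFile tables to be faster than with tables using text data, but slower than with Parquet tables.

In Impala 2.6 and higher, Impala queries are optimized for files stored in Amazon S3. For Impala tables that use the file formats Parquet, ORC, RCFile, SequenceFile, Avro, and uncompressed text, the setting fs.s3a.block.size in the core-site.xml configuration file determines how Impala divides the I/O work of reading the data files. This configuration setting is specified in bytes. By default, this value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access Parquet files written by MapReduce or Hive, increase fs.s3a.block.size to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve Parquet files written by Impala, increase fs.s3a.block.size to 268435456 (256 MB) to match the row group size produced by Impala.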
