DataX Installation and Deployment


This tutorial covers installing DataX (a download link for the installation package is provided) and deploying it in a local environment.

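Method 1

Download the DataX toolkit directly:

DataX download URL: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

After downloading, extract the archive to a local directory; that completes the DataX installation.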
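
The download-and-extract step can also be scripted. The following is a minimal sketch, assuming Python 3 and only the standard library. The function names download_datax and extract_datax are hypothetical helpers, not part of DataX; only the URL comes from the text above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the tutorial text; everything else here is illustrative.
DATAX_URL = "http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz"


def download_datax(dest, url=DATAX_URL):
    """Download the archive to `dest` and return the destination path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_datax(archive, target_dir):
    """Extract a .tar.gz archive into `target_dir`.

    Returns the sorted names of the top-level entries in `target_dir`.
    Only extract archives from a source you trust.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "r:gz") as tar:
        tar.extractall(str(target_dir))
    return sorted(p.name for p in target_dir.iterdir())


if __name__ == "__main__":
    # Hypothetical local paths; adjust to your environment.
    archive_path = download_datax("downloads/datax.tar.gz")
    print(extract_datax(archive_path, "installed"))
```

The network call sits behind the __main__ guard so the helpers can be imported and reused without triggering a download.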

Go to the bin directory to run a synchronization job:

$ cd  {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}

Self-check script: python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
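
The same invocation can be wrapped in a small script, for example to run the self-check from automation. This is a sketch under assumptions: the helper names are hypothetical, the interpreter name is a parameter because the Python version that datax.py expects may differ from the script's own, and treating a non-zero exit code as failure is an assumption about datax.py rather than something stated above.

```python
import subprocess
from pathlib import Path


def build_datax_command(datax_home, job_json, python_exe="python"):
    """Return the argument list for: python {DATAX_HOME}/bin/datax.py {JOB}."""
    launcher = Path(datax_home) / "bin" / "datax.py"
    return [python_exe, str(launcher), str(job_json)]


def run_self_check(datax_home, python_exe="python"):
    """Run the bundled job/job.json and return the process exit code."""
    job = Path(datax_home) / "job" / "job.json"
    cmd = build_datax_command(datax_home, job, python_exe)
    # Output is not captured, so the DataX log still streams to the console.
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # Hypothetical install location, matching the path used later in the text.
    code = run_self_check("/home/installed/datax")
    print("self-check exit code:", code)
```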

Method 2

Download the DataX source code and build it yourself: https://github.com/alibaba/DataX

1. Download the DataX source code:

$ git clone git@github.com:alibaba/DataX.git

2. Package with Maven:

$ cd {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true

When packaging succeeds, the log ends as follows:

[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------
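
If the build runs unattended, the marker line shown above can be checked programmatically. This is a small sketch, assuming the Maven output was saved as plain text without color codes; the function name is hypothetical. The exit code of mvn remains the more reliable signal, and this check only complements it.

```python
def maven_build_succeeded(log_text):
    """Return True if the log contains the '[INFO] BUILD SUCCESS' line."""
    return any(
        line.strip() == "[INFO] BUILD SUCCESS"
        for line in log_text.splitlines()
    )


if __name__ == "__main__":
    sample = (
        "[INFO] BUILD SUCCESS\n"
        "[INFO] Total time: 08:12 min\n"
    )
    print(maven_build_succeeded(sample))
```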

After a successful build, the DataX package is located at {DataX_source_code_home}/target/datax/datax/, with the following structure:

$ cd  {DataX_source_code_home}
$ ls ./target/datax/datax/
bin		conf		job		lib		log		log_perf	plugin		script		tmp
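
A quick way to confirm that a build or an extracted archive has this layout is to compare the directory against the listing. The sketch below copies the entry names from the ls output above. The helper name is hypothetical, and whether every entry is strictly required is not stated in the text, so the result is best read as a hint.

```python
from pathlib import Path

# Entry names copied from the `ls ./target/datax/datax/` output above.
EXPECTED_ENTRIES = (
    "bin", "conf", "job", "lib", "log", "log_perf", "plugin", "script", "tmp",
)


def missing_entries(datax_home, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under `datax_home`."""
    home = Path(datax_home)
    return [name for name in expected if not (home / name).exists()]


if __name__ == "__main__":
    # Hypothetical path; replace it with your own build output directory.
    print(missing_entries("target/datax/datax"))
```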

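Deployment and configuration example

Read data from a stream and print it to the console

1) Create the job configuration file (JSON format)

A configuration template can be viewed with the command: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}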

[root@hadoop1 bin]# pwd
/home/installed/datax/bin
[root@hadoop1 bin]# python datax.py -r streamreader -w streamwriter

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


Please refer to the streamreader document:
     https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md 

Please refer to the streamwriter document:
     https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md 
 
Please save the following configuration as a json file and  use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json 
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader", 
                    "parameter": {
                        "column": [], 
                        "sliceRecordCount": ""
                    }
                }, 
                "writer": {
                    "name": "streamwriter", 
                    "parameter": {
                        "encoding": "", 
                        "print": true
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
[root@hadoop1 bin]#

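Based on the template, write the job configuration as follows and save it as stream2stream.json (the file name is not part of the JSON content):
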
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,你好,世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
      }
    }
  }
}
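
The configuration above can also be generated from a script, which avoids hand-editing mistakes such as a stray comment line or a trailing comma. This is a minimal sketch that reproduces the same fields; the function names build_stream_job and write_job are hypothetical and not part of DataX.

```python
import json


def build_stream_job(slice_record_count=10, channel=5):
    """Build the streamreader -> streamwriter job shown above as a dict."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": slice_record_count,
                            "column": [
                                {"type": "long", "value": "10"},
                                {"type": "string",
                                 "value": "hello,你好,世界-DataX"},
                            ],
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channel}},
        }
    }


def write_job(path, job):
    """Write the job as UTF-8 JSON, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(job, fh, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    write_job("stream2stream.json", build_stream_job())
```

Because json.dump serializes a Python dict, the resulting file is always syntactically valid JSON.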

2) Start DataX

[root@hadoop3 datax]# cd /home/installed/datax/bin/
[root@hadoop3 bin]# python datax.py /home/test/dataxtest/stream2stream.json

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2019-09-09 16:14:17.345 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2019-09-09 16:14:17.356 [main] INFO  Engine - the machine info  => 

	osInfo:	Oracle Corporation 1.8 25.161-b12
	jvmInfo:	Linux amd64 3.10.0-693.el7.x86_64
	cpu num:	4

	totalPhysicalMemory:	-0.00G
	freePhysicalMemory:	-0.00G
	maxFileDescriptorCount:	-1
	currentOpenFileDescriptorCount:	-1

	GC Names	[PS MarkSweep, PS Scavenge]

	MEMORY_NAME                    | allocation_size                | init_size                      
	PS Eden Space                  | 256.00MB                       | 256.00MB                       
	Code Cache                     | 240.00MB                       | 2.44MB                         
	Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
	PS Survivor Space              | 42.50MB                        | 42.50MB                        
	PS Old Gen                     | 683.00MB                       | 683.00MB                       
	Metaspace                      | -0.00MB                        | 0.00MB                         


2019-09-09 16:14:17.375 [main] INFO  Engine - 
{
	"content":[
		{
			"reader":{
				"name":"streamreader",
				"parameter":{
					"column":[
						{
							"type":"long",
							"value":"10"
						},
						{
							"type":"string",
							"value":"hello,你好,世界-DataX"
						}
					],
					"sliceRecordCount":10
				}
			},
			"writer":{
				"name":"streamwriter",
				"parameter":{
					"encoding":"UTF-8",
					"print":true
				}
			}
		}
	],
	"setting":{
		"speed":{
			"channel":5
		}
	}
}

2019-09-09 16:14:17.404 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2019-09-09 16:14:17.406 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2019-09-09 16:14:17.406 [main] INFO  JobContainer - DataX jobContainer starts job.
2019-09-09 16:14:17.409 [main] INFO  JobContainer - Set jobId = 0
2019-09-09 16:14:17.431 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2019-09-09 16:14:17.432 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do prepare work .
2019-09-09 16:14:17.432 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2019-09-09 16:14:17.433 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2019-09-09 16:14:17.433 [job-0] INFO  JobContainer - Job set Channel-Number to 5 channels.
2019-09-09 16:14:17.434 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2019-09-09 16:14:17.435 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2019-09-09 16:14:17.467 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2019-09-09 16:14:17.485 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2019-09-09 16:14:17.488 [job-0] INFO  JobContainer - Running by standalone Mode.
2019-09-09 16:14:17.507 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2019-09-09 16:14:17.513 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2019-09-09 16:14:17.513 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2019-09-09 16:14:17.545 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
2019-09-09 16:14:17.558 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
2019-09-09 16:14:17.580 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
2019-09-09 16:14:17.598 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
2019-09-09 16:14:17.619 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
2019-09-09 16:14:17.731 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[112]ms
2019-09-09 16:14:17.731 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[163]ms
2019-09-09 16:14:17.731 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[202]ms
2019-09-09 16:14:17.731 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[177]ms
2019-09-09 16:14:17.732 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[136]ms
2019-09-09 16:14:17.733 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2019-09-09 16:14:27.511 [job-0] INFO  StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2019-09-09 16:14:27.511 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2019-09-09 16:14:27.511 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do post work.
2019-09-09 16:14:27.512 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do post work.
2019-09-09 16:14:27.512 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2019-09-09 16:14:27.513 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /home/installed/datax/hook
2019-09-09 16:14:27.515 [job-0] INFO  JobContainer - 
	 [total cpu info] => 
		averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
		-1.00%                         | -1.00%                         | -1.00%
                        

	 [total gc info] => 
		 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
		 PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
		 PS Scavenge          | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             

2019-09-09 16:14:27.516 [job-0] INFO  JobContainer - PerfTrace not enable!
2019-09-09 16:14:27.516 [job-0] INFO  StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2019-09-09 16:14:27.517 [job-0] INFO  JobContainer - 
任务启动时刻                    : 2019-09-09 16:14:17
任务结束时刻                    : 2019-09-09 16:14:27
任务总计耗时                    :                 10s
任务平均流量                    :               95B/s
记录写入速度                    :              5rec/s
读出记录总数                    :                  50
读写失败总数                    :                   0

[root@hadoop3 bin]#
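
The totals in the log can be cross-checked against the configuration. In this run the reader was split into 5 tasks, each emitting sliceRecordCount = 10 records, which gives the 50 records reported. Over the 10 s run, 950 bytes and 50 records correspond to 95 B/s and 5 records/s. The sketch below extracts those totals from the StandAloneJobContainerCommunicator summary line. The function name and the regular expressions are illustrative assumptions based only on the log format shown above.

```python
import re


def parse_summary(line):
    """Extract record, byte and error counts from a DataX summary line.

    Returns a dict, or None if the line does not look like a summary line.
    """
    total = re.search(r"Total (\d+) records, (\d+) bytes", line)
    error = re.search(r"Error (\d+) records", line)
    if total is None or error is None:
        return None
    return {
        "records": int(total.group(1)),
        "bytes": int(total.group(2)),
        "errors": int(error.group(1)),
    }


if __name__ == "__main__":
    line = (
        "Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | "
        "Error 0 records, 0 bytes"
    )
    summary = parse_summary(line)
    # Values from the job above: sliceRecordCount = 10, 5 tasks.
    print(summary, summary["records"] == 10 * 5)
```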