This tutorial covers how to install DataX (a download link for the install package is provided) and how to deploy and verify DataX in a local environment. There are two ways to deploy it: download the pre-built package, or build from source.
Method 1: Download the pre-built DataX package directly:
http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
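A minimal sketch of fetching and unpacking the package; the /home/installed path is an assumption chosen to match the install location used later in this tutorial:
$ wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
$ mkdir -p /home/installed
$ tar -zxvf datax.tar.gz -C /home/installed/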
After unpacking, jobs are launched from the bin directory (the stock datax.py script targets Python 2.x):
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}
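To smoke-test the deployment, the standard distribution ships a self-check job at job/job.json (the job directory is visible in the package listing further below); running it should end with a success summary:
$ python datax.py ../job/job.json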
Method 2: Build DataX from source. The code is hosted on GitHub (building requires JDK 1.8+ and Maven 3.x):
https://github.com/alibaba/DataX
$ git clone git@github.com:alibaba/DataX.git
$ cd {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true
On success, Maven ends with output like:
[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------
After a successful build, the packaged DataX is generated under target/datax/datax/ in the source tree:
$ cd {DataX_source_code_home}
$ ls ./target/datax/datax/
bin conf job lib log log_perf plugin script tmp
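The built directory can then be copied to the final install location; a minimal sketch, assuming the /home/installed path used in the sessions below:
$ mkdir -p /home/installed
$ cp -r ./target/datax/datax /home/installed/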
1) Create a job configuration file (JSON format)

DataX can print a configuration template for any reader/writer pair via the -r (reader) and -w (writer) flags; here streamreader and streamwriter are used:
[root@hadoop1 bin]# pwd
/home/installed/datax/bin
[root@hadoop1 bin]# python datax.py -r streamreader -w streamwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document:
https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
[root@hadoop1 bin]#
Fill in the template and save it as stream2stream.json; in this example the file is placed under /home/test/dataxtest/.

# stream2stream.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,你好,世界-DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
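Before launching, it can help to syntax-check the file; a quick sketch, assuming a standard python on the PATH:
$ python -m json.tool /home/test/dataxtest/stream2stream.json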
2) Launch DataX

With channel set to 5 and sliceRecordCount set to 10, the reader splits into 5 tasks that each emit 10 records, so the job should print 5 × 10 = 50 records in total:
[root@hadoop3 datax]# cd /home/installed/datax/bin/
[root@hadoop3 bin]# python datax.py /home/test/dataxtest/stream2stream.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2019-09-09 16:14:17.345 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2019-09-09 16:14:17.356 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.161-b12
jvmInfo: Linux amd64 3.10.0-693.el7.x86_64
cpu num: 4
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2019-09-09 16:14:17.375 [main] INFO Engine -
{
    "content":[
        {
            "reader":{
                "name":"streamreader",
                "parameter":{
                    "column":[
                        {
                            "type":"long",
                            "value":"10"
                        },
                        {
                            "type":"string",
                            "value":"hello,你好,世界-DataX"
                        }
                    ],
                    "sliceRecordCount":10
                }
            },
            "writer":{
                "name":"streamwriter",
                "parameter":{
                    "encoding":"UTF-8",
                    "print":true
                }
            }
        }
    ],
    "setting":{
        "speed":{
            "channel":5
        }
    }
}
2019-09-09 16:14:17.404 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2019-09-09 16:14:17.406 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2019-09-09 16:14:17.406 [main] INFO JobContainer - DataX jobContainer starts job.
2019-09-09 16:14:17.409 [main] INFO JobContainer - Set jobId = 0
2019-09-09 16:14:17.431 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2019-09-09 16:14:17.432 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2019-09-09 16:14:17.432 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2019-09-09 16:14:17.433 [job-0] INFO JobContainer - jobContainer starts to do split ...
2019-09-09 16:14:17.433 [job-0] INFO JobContainer - Job set Channel-Number to 5 channels.
2019-09-09 16:14:17.434 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2019-09-09 16:14:17.435 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2019-09-09 16:14:17.467 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2019-09-09 16:14:17.485 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2019-09-09 16:14:17.488 [job-0] INFO JobContainer - Running by standalone Mode.
2019-09-09 16:14:17.507 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2019-09-09 16:14:17.513 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2019-09-09 16:14:17.513 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2019-09-09 16:14:17.545 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
2019-09-09 16:14:17.558 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-09-09 16:14:17.580 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-09-09 16:14:17.598 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-09-09 16:14:17.619 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-09-09 16:14:17.731 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[112]ms
2019-09-09 16:14:17.731 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[163]ms
2019-09-09 16:14:17.731 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[202]ms
2019-09-09 16:14:17.731 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[177]ms
2019-09-09 16:14:17.732 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[136]ms
2019-09-09 16:14:17.733 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2019-09-09 16:14:27.511 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2019-09-09 16:14:27.511 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2019-09-09 16:14:27.511 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2019-09-09 16:14:27.512 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2019-09-09 16:14:27.512 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2019-09-09 16:14:27.513 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /home/installed/datax/hook
2019-09-09 16:14:27.515 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2019-09-09 16:14:27.516 [job-0] INFO JobContainer - PerfTrace not enable!
2019-09-09 16:14:27.516 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2019-09-09 16:14:27.517 [job-0] INFO JobContainer -
Job start time              : 2019-09-09 16:14:17
Job end time                : 2019-09-09 16:14:27
Total job elapsed time      : 10s
Average job traffic         : 95B/s
Record write speed          : 5rec/s
Total records read          : 50
Total read/write failures   : 0
[root@hadoop3 bin]#