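Creating a workflow with Oozie
For the commands used to run a workflow, see this blog post: https://www.jianshu.com/p/6cb3a4b78556. You can also run oozie help to view the built-in help.
Configuring an Oozie workflow by hand
The job.properties file holds the parameters that workflow.xml may reference.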
job.properties
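# Keep variable names free of special characters; otherwise Spark may fail to resolve them
# oozie.wf.application.path must point to a path on HDFS, because the whole cluster needs to access it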
nameNode=hdfs://txz-data0:9820
resourceManager=txz-data0:8032
oozie.use.system.libpath=true
oozie.libpath=${nameNode}/share/lib/spark2/jars/,${nameNode}/share/lib/spark2/python/lib/,${nameNode}/share/lib/spark2/hive-site.xml
oozie.wf.application.path=${nameNode}/workflow/data-factory/download_report_voice_and_upload/Workflow
oozie.action.sharelib.for.spark=spark2
archive=${nameNode}/envs/py3.tar.gz#py
# If dryrun is true, the current workflow is only tested and no job is actually recorded
dryrun=false
sparkMaster=yarn-cluster
sparkMode=cluster
scriptRoot=/workflow/data-factory/download_report_voice_and_upload/Python
sparkScriptBasename=download_parquet_from_data0_upload_online.py
sparkScript=${scriptRoot}/${sparkScriptBasename}
pysparkPath=py/py3/bin/python3
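Several values above are built from other properties through ${...} placeholders. The following is a minimal Python sketch of how such placeholders expand. It only imitates the substitution for illustration and is not Oozie's own implementation; the helper names parse_properties and resolve are hypothetical, and the sample text is a subset of the file above.

```python
import re

# Matches ${name} placeholders.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve(props, max_passes=10):
    """Expand ${name} placeholders repeatedly until nothing changes.

    Unknown names are left untouched. max_passes guards against cycles.
    """
    resolved = dict(props)
    for _ in range(max_passes):
        changed = False
        for key, value in list(resolved.items()):
            new_value = _PLACEHOLDER.sub(
                lambda m: resolved.get(m.group(1), m.group(0)), value
            )
            if new_value != value:
                resolved[key] = new_value
                changed = True
        if not changed:
            break
    return resolved


# A subset of the job.properties shown above.
SAMPLE = """
nameNode=hdfs://txz-data0:9820
scriptRoot=/workflow/data-factory/download_report_voice_and_upload/Python
sparkScriptBasename=download_parquet_from_data0_upload_online.py
sparkScript=${scriptRoot}/${sparkScriptBasename}
archive=${nameNode}/envs/py3.tar.gz#py
"""

RESOLVED = resolve(parse_properties(SAMPLE))
print(RESOLVED["sparkScript"])
print(RESOLVED["archive"])
```

Printing the resolved values before submitting is a cheap way to catch a misspelled variable name, since an unknown placeholder stays in the output as a literal ${...}.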
The workflow.xml file
<!--
This file defines the Oozie workflow. The variables it references come from job.properties by default.
-->
<workflow-app xmlns='uri:oozie:workflow:1.0' name='download_parquet_from_data0_upload_online'>
    <global>
        <resource-manager>${resourceManager}</resource-manager>
        <name-node>${nameNode}</name-node>
    </global>
    <start to='spark-node' />
    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:1.0">
            <master>${sparkMaster}</master>
            <mode>${sparkMode}</mode>
            <name>report_voice_download_pyspark</name>
            <jar>${sparkScriptBasename}</jar>
            <spark-opts>
                --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=${pysparkPath}
            </spark-opts>
            <file>${sparkScript}#${sparkScriptBasename}</file>
            <archive>${archive}</archive>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>
            Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name='end' />
</workflow-app>
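The workflow is a small graph: start points to spark-node, whose ok and error transitions point to end and fail. A mistyped node name in a to attribute is an easy mistake to make when writing the file by hand. The sketch below, using only Python's standard library, lists transitions whose target is not a defined node. The function name check_transitions and the shortened sample document are assumptions for illustration; this is not a replacement for Oozie's own schema validation.

```python
import xml.etree.ElementTree as ET

# Namespace of the workflow schema used in the file above.
WF_NS = "{uri:oozie:workflow:1.0}"

# Workflow elements that carry a name and can be a transition target.
NAMED_NODES = {"action", "kill", "end", "decision", "fork", "join"}


def local_name(tag):
    """Return the tag name without the workflow namespace, or None."""
    if tag.startswith(WF_NS):
        return tag[len(WF_NS):]
    return None


def check_transitions(xml_text):
    """Return (source, target) pairs whose target node is not defined.

    Only start, ok and error transitions are checked in this sketch.
    """
    root = ET.fromstring(xml_text)
    names = set()
    transitions = []
    for child in root:
        kind = local_name(child.tag)
        if kind == "start":
            transitions.append(("start", child.get("to")))
        elif kind in NAMED_NODES:
            names.add(child.get("name"))
            for sub in child:
                if local_name(sub.tag) in ("ok", "error"):
                    transitions.append((child.get("name"), sub.get("to")))
    return [(src, dst) for src, dst in transitions if dst not in names]


# A shortened version of the workflow above, for demonstration only.
SAMPLE_WF = """<workflow-app xmlns='uri:oozie:workflow:1.0' name='demo'>
    <start to='spark-node' />
    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:1.0">
            <master>${sparkMaster}</master>
            <jar>${sparkScriptBasename}</jar>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name='end' />
</workflow-app>"""

DANGLING = check_transitions(SAMPLE_WF)
# Simulate a typo in the error transition.
BROKEN = check_transitions(SAMPLE_WF.replace('to="fail"', 'to="fial"'))
print(DANGLING, BROKEN)
```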
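Place both files on the local disk, for example in the folder /home/workflow/. Because oozie.wf.application.path points to HDFS, workflow.xml must also be present at that HDFS path. Then run the command
oozie job -oozie http://txz-data0:11000/oozie -config /home/workflow/job.properties -run
to start the workflow.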
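If the submission needs to be scripted, the same command can be assembled and launched from Python. This is only a sketch: the function names are hypothetical, the upload step assumes the hdfs client is on the PATH and that workflow.xml still has to be copied to the application directory, and the submit function is defined but not called here because it needs a reachable Oozie server.

```python
import subprocess


def build_upload_command(local_workflow_xml, hdfs_app_dir):
    """Command that copies workflow.xml to HDFS, overwriting an old copy."""
    return ["hdfs", "dfs", "-put", "-f", local_workflow_xml, hdfs_app_dir]


def build_submit_command(oozie_url, properties_path):
    """Command equivalent to the oozie job ... -run line above."""
    return ["oozie", "job", "-oozie", oozie_url,
            "-config", properties_path, "-run"]


def submit(oozie_url, properties_path):
    """Run the submit command and return whatever the CLI prints.

    Raises subprocess.CalledProcessError if the CLI exits with an error.
    """
    result = subprocess.run(
        build_submit_command(oozie_url, properties_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Values taken from the text above.
UPLOAD_CMD = build_upload_command(
    "/home/workflow/workflow.xml",
    "/workflow/data-factory/download_report_voice_and_upload/Workflow",
)
SUBMIT_CMD = build_submit_command(
    "http://txz-data0:11000/oozie", "/home/workflow/job.properties"
)
print(" ".join(UPLOAD_CMD))
print(" ".join(SUBMIT_CMD))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if a path ever contains spaces.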
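A workflow configured by hand like this is not visible in Hue, so from here on the workflows are configured in Hue, and a Schedule is configured afterwards. For the detailed configuration, see this blog post: https://blog.csdn.net/qq_22918243/article/details/89204111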