How to get a custom log4j.properties to take effect for the Spark driver and executors on an AWS EMR cluster?

Problem description

I have an AWS CLI cluster-creation command that I am trying to modify so that my driver and executors pick up a customized log4j.properties file. On Spark stand-alone clusters I have successfully used the --files <log4j.file> switch together with -Dlog4j.configuration=<log4j.file>, set via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.
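
For reference, the stand-alone invocation described above might look roughly like the sketch below. The master URL and the location of log4j.properties are placeholders (assumptions, not taken from a real setup); the class and jar names are borrowed from the command later in this question. The script only assembles and prints the arguments, so it can be inspected without a cluster.

```shell
# Sketch of the stand-alone approach: ship the file with --files and point
# both the driver and executor JVMs at it through -Dlog4j.configuration.
log4j_file="log4j.properties"           # assumed to sit in the current directory
master_url="spark://master-host:7077"   # hypothetical stand-alone master URL

# Collect the command as positional parameters so each argument stays intact.
set -- spark-submit \
  --master "$master_url" \
  --files "$log4j_file" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=$log4j_file" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=$log4j_file" \
  --class com.acme.SparkFoo \
  spark.jar

# Print one argument per line instead of running the command.
printf '%s\n' "$@"
```

Replacing the final printf with "$@" would actually run the command on a machine where spark-submit is installed.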

I have tried many permutations and variations, but have yet to get this working with the Spark job that I am running on an AWS EMR cluster.

I use the AWS CLI's create-cluster command with an intermediate step that downloads my Spark jar and unzips it to extract the log4j.properties packaged inside. I then copy that log4j.properties to the /tmp folder on HDFS and attempt to distribute it via --files.

Note: I have also tried this without HDFS (specifying --files log4j.properties instead of --files hdfs:///tmp/log4j.properties), and that didn't work either.

My latest non-working version of this command (using HDFS) is given below. I'm wondering if anyone can share a recipe that actually works. The driver output when I run this version is:

log4j: Trying to find [log4j.properties] using context classloader sun.misc.Launcher$AppClassLoader@1e67b872.
log4j: Using URL [file:/etc/spark/conf.dist/log4j.properties] for automatic log4j configuration.
log4j: Reading configuration from URL file:/etc/spark/conf.dist/log4j.properties
log4j: Parsing for [root] with value=[WARN,stdout].

From the above I can see that my log4j.properties file is not being picked up; the default one at /etc/spark/conf.dist/log4j.properties is used instead. In addition to -Dlog4j.configuration=log4j.properties, I also tried -Dlog4j.configuration=classpath:log4j.properties, and that failed as well.

Any guidance would be much appreciated!

AWS COMMAND

jarPath=s3://com-acme/deployments/spark.jar
class=com.acme.SparkFoo


log4jConfigExtractCmd="aws s3 cp $jarPath /tmp/spark.jar ; cd /home/hadoop ; unzip /tmp/spark.jar log4j.properties ;  hdfs dfs -put log4j.properties /tmp/log4j.properties  "


aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark \
--tags 'Project=mouse' \
      'Owner=SwarmAnalytics'\
       'DatadogMonitoring=True'\
       'StreamMonitorRedshift=False'\
       'DeployRedshiftLoader=False'\
       'Environment=dev'\
       'DeploySpark=False'\
       'StreamMonitorS3=False'\
       'Name=CCPASixCore' \
--ec2-attributes '{"KeyName":"mouse-spark-2021","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-07039960","EmrManagedSlaveSecurityGroup":"sg-09c806ca38fd32353","EmrManagedMasterSecurityGroup":"sg-092288bbc8812371a"}' \
--release-label emr-5.27.0 \
--log-uri 's3n://log-foo' \
--steps '[{"Args":["bash","-c", "$log4jConfigExtractCmd"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"downloadSparkJar"},{"Args":["spark-submit","--files", "hdfs:///tmp/log4j.properties","--deploy-mode","client","--class","$class","--driver-memory","24G","--conf","spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256    -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256    -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.yarn.executor.memoryOverhead=10g","--conf","spark.yarn.driver.memoryOverhead=10g","$jarPath"],"Type":"CUSTOM_JAR","ActionOnFailure":"CANCEL_AND_WAIT","Jar":"command-runner.jar","Properties":"","Name":"SparkFoo"}]'\
 --instance-groups '[{"InstanceCount":6,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"CORE","InstanceType":"r5d.4xlarge","Name":"Core - 6"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"MASTER","InstanceType":"m5.2xlarge","Name":"Master - 1"}]' \
--configurations '[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"INFO","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR","log4j.logger.org.apache.hadoop":"ERROR","log4j.appender.stdout":"org.apache.log4j.ConsoleAppender","log4j.logger.io.netty":"ERROR","log4j.logger.org.apache.spark.scheduler.cluster":"ERROR","log4j.rootLogger":"WARN,stdout","log4j.appender.stdout.layout.ConversionPattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p/%c{1}:%L - %m%n","log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler":"INFO"}},{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]'\
 --auto-terminate --ebs-root-volume-size 10 --service-role EMR_DefaultRole \
--security-configuration 'CCPA_dev_security_configuration_2' --enable-debugging --name 'SparkFoo' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1 --profile sandbox

Tags: amazon-web-services, apache-spark, log4j, amazon-emr

Solution


Here is how to change the logging. The best way I have found on AWS EMR is to NOT fiddle with

spark.driver.extraJavaOptions or
spark.executor.extraJavaOptions

Instead, take advantage of the spark-log4j configuration block passed to --configurations, which looks like this:

[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"INFO","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR",

Now suppose you want to set the log level for all classes under com.foo and its descendants to TRACE. You would change the block above to look like this:

[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"TRACE","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR",
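
Since the blocks above are cut off, here is a minimal sketch of a complete, well-formed spark-log4j classification written to a file. The property names and values are a subset of those in the --configurations argument from the question, with only log4j.logger.com.foo raised to TRACE. The file name is a hypothetical choice. Keeping the JSON in a file sidesteps shell quoting problems, and the AWS CLI accepts a file:// reference for --configurations.

```shell
# Hypothetical location for the configuration file; any writable path works.
config_file="${TMPDIR:-/tmp}/emr-configurations.json"

# Quoted heredoc delimiter ('EOF') keeps the shell from touching the JSON.
cat > "$config_file" <<'EOF'
[
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.rootLogger": "WARN,stdout",
      "log4j.appender.stdout": "org.apache.log4j.ConsoleAppender",
      "log4j.appender.stdout.layout": "org.apache.log4j.PatternLayout",
      "log4j.appender.stdout.layout.ConversionPattern": "%d{yyyy-MM-dd HH:mm:ss,SSS} %p/%c{1}:%L - %m%n",
      "log4j.logger.org.apache.spark": "ERROR",
      "log4j.logger.org.apache.hadoop": "ERROR",
      "log4j.logger.com.foo": "TRACE"
    }
  }
]
EOF

# The cluster command would then reference the file rather than inline JSON:
#   aws emr create-cluster ... --configurations "file://$config_file"
echo "Wrote $config_file"
```

Any other classifications from the original command (hive-site, spark-hive-site) would need to be added to the same JSON array.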
