
What is AWSRequestMetricsFullSupport, and how do I turn it off?

I'm trying to save some data from a Spark DataFrame to an S3 bucket, which is straightforward:

dataframe.saveAsParquetFile("s3://kirk/my_file.parquet") 

The data is saved successfully, but the UI stays busy for a very long time, and I get thousands of lines like these:

2015-09-04 20:48:19,591 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[5C3211750F4FF5AB], ServiceEndpoint=[https://kirk.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[63.827], HttpRequestTime=[62.919], HttpClientReceiveResponseTime=[61.678], RequestSigningTime=[0.05], ResponseProcessingTime=[0.812], HttpClientSendRequestTime=[0.038], 
2015-09-04 20:48:19,610 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[204], ServiceName=[Amazon S3], AWSRequestID=[709DA41540539FE0], ServiceEndpoint=[https://kirk.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[18.064], HttpRequestTime=[17.959], HttpClientReceiveResponseTime=[16.703], RequestSigningTime=[0.06], ResponseProcessingTime=[0.003], HttpClientSendRequestTime=[0.046], 
2015-09-04 20:48:19,664 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[204], ServiceName=[Amazon S3], AWSRequestID=[1B1EB812E7982C7A], ServiceEndpoint=[https://kirk.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[54.36], HttpRequestTime=[54.26], HttpClientReceiveResponseTime=[53.006], RequestSigningTime=[0.057], ResponseProcessingTime=[0.002], HttpClientSendRequestTime=[0.034], 
2015-09-04 20:48:19,675 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: AF6F960F3B2BF3AB), S3 Extended Request ID: CLs9xY8HAxbEAKEJC4LS1SgpqDcnHeaGocAbdsmYKwGttS64oVjFXJOe314vmb9q], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[AF6F960F3B2BF3AB], ServiceEndpoint=[https://kirk.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[10.111], HttpRequestTime=[10.009], HttpClientReceiveResponseTime=[8.758], RequestSigningTime=[0.043], HttpClientSendRequestTime=[0.044], 
2015-09-04 20:48:19,685 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: F2198ACEB4B2CE72), S3 Extended Request ID: J9oWD8ncn6WgfUhHA1yqrBfzFC+N533oD/DK90eiSvQrpGH4OJUc3riG2R4oS1NU], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[F2198ACEB4B2CE72], ServiceEndpoint=[https://kirk.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[9.879], HttpRequestTime=[9.776], HttpClientReceiveResponseTime=[8.537], RequestSigningTime=[0.05], HttpClientSendRequestTime=[0.033], 

I can understand that some users might be interested in logging the latency of S3 operations, but is there any way to disable all of this monitoring and AWSRequestMetricsFullSupport logging?

When I check the Spark UI, it tells me the job completed relatively quickly, but the console is flooded with these messages for a long time afterwards.

For context, I'm saving a DataFrame of 1m rows and 500 columns. It takes about 20 seconds to save, but the latency messages keep appearing in my console for more than 20 minutes.

Answers


The relevant AWS SDK for Java source comment reads:

/** 
* Start an event which will be timed. [...] 
* 
* This feature is enabled if the system property 
* "com.amazonaws.sdk.enableRuntimeProfiling" is set, or if a 
* {@link RequestMetricCollector} is in use either at the request, web service 
* client, or AWS SDK level. 
* 
* @param eventName 
*   - The name of the event to start 
* 
* @see AwsSdkMetrics 
*/ 

As further outlined in the referenced AwsSdkMetrics Java docs, you might be able to disable it via a system property:

The default metric collection of the AWS SDK for Java is disabled by default. To enable it, simply specify the system property "com.amazonaws.sdk.enableDefaultMetrics" when starting up the JVM. When the system property is specified, a default metric collector will be started at the AWS SDK level. The default implementation uploads the request/response metrics captured to Amazon CloudWatch using AWS credentials obtained via the DefaultAWSCredentialsProviderChain.
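Since that collector is opt-in, one way to sanity-check a running JVM is to ask the SDK directly. A minimal sketch, assuming the AWS SDK for Java 1.x AwsSdkMetrics API (the class name MetricsCheck is just for illustration):

import com.amazonaws.metrics.AwsSdkMetrics;

public class MetricsCheck {
    public static void main(String[] args) {
        // Prints true only if "com.amazonaws.sdk.enableDefaultMetrics"
        // was passed as a system property at JVM startup
        System.out.println("Default metrics enabled: "
            + AwsSdkMetrics.isDefaultMetricsEnabled());
    }
}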

This can seemingly be overridden by a RequestMetricCollector hard-wired at the request, web service client, or AWS SDK level, which would presumably require adjustments within the client/framework in use (Spark in this case):

Clients who need to fully customize the metric collection can implement the SPI MetricCollector, and then replace the default AWS SDK implementation of the collector via setMetricCollector(MetricCollector).
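As an illustration of that idea, a client can be wired with the SDK's built-in no-op collector. This is a sketch only, assuming the 1.x AmazonS3Client constructor that accepts a RequestMetricCollector; with Spark, the S3 client is created inside the Hadoop filesystem layer, so this is not a drop-in fix there:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.metrics.RequestMetricCollector;
import com.amazonaws.services.s3.AmazonS3Client;

public class NoMetricsS3Client {
    public static void main(String[] args) {
        // RequestMetricCollector.NONE is a no-op collector, so no
        // per-request metrics are gathered for this client instance
        AmazonS3Client s3 = new AmazonS3Client(
            new DefaultAWSCredentialsProviderChain(),
            new ClientConfiguration(),
            RequestMetricCollector.NONE);
        System.out.println(s3.listBuckets());
    }
}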

The documentation of these features seems a bit sparse; so far there are only two related blog posts on the topic that I'm aware of.

Thanks Steffen. I found the same documentation ('AwsSdkMetrics'), which, as you posted, says this should be off by default. I suppose that is older documentation, and turning it off doesn't seem to make a difference. I'll follow up on the blog posts you referenced.


The best solution I've found is to configure the Java logging (i.e., switch these messages off) by passing a log4j configuration file to the Spark context:

--driver-java-options "-Dlog4j.configuration=/home/user/log4j.properties" 

where log4j.properties is a log4j configuration file that disables INFO-level messages.
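As a minimal sketch of such a file (the com.amazonaws.latency logger name is taken from the log lines above; the rest is a common Spark log4j baseline, adjust as needed):

# Default everything to WARN so INFO chatter is suppressed
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Silence the AWS latency logger seen in the output above
log4j.logger.com.amazonaws.latency=ERROR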


Eliminating these logs on EMR release labels proved to be quite a challenge. An issue "with Spark Log4j-based logging in YARN containers" was fixed in release emr-4.7.2. A working solution is to add these JSONs as configuration:

[
  {
    "Classification": "hadoop-log4j",
    "Properties": {
      "log4j.logger.com.amazon.ws.emr.hadoop.fs": "ERROR",
      "log4j.logger.com.amazonaws.latency": "ERROR"
    },
    "Configurations": []
  }
]

and, on releases before emr-4.7.2, also this JSON, which drops the "buggy" Spark log4j configuration that is the default:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraJavaOptions": "-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=512M -XX:OnOutOfMemoryError='kill -9 %p'"
    },
    "Configurations": []
  }
]
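For reference, such classification blocks are applied when the cluster is created. A sketch with the AWS CLI, where configurations.json is assumed to hold the JSON arrays above and the instance settings are placeholders:

aws emr create-cluster \
    --name "quiet-spark" \
    --release-label emr-4.7.2 \
    --applications Name=Spark \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations file://./configurations.json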