2017-10-18 47 views
4

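FabricTransientException during load test: "Could not ping any of the provided Service Fabric gateway endpoints."

We have a class that broadcasts messages to a Service Fabric stateless service. This stateless service has a single partition but many replicas. The message should be sent to all replicas in the system, so we query the FabricClient for the single partition and for all replicas of that partition.

We use standard HTTP communication (the stateless service has a communication listener with a self-hosted OWIN listener, using WebListener/HttpSys) and a shared HttpClient instance.

During a load test we get many errors while sending the messages. Note that we have other services in the same application that also communicate (WebListener/HttpSys, ServiceProxy and ActorProxy).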

The code in which we see the exceptions is the following (the stack trace follows the code sample):

private async Task SendMessageToReplicas(string actionName, string message)
{
    var fabricClient = new FabricClient();
    var eventNotificationHandlerServiceUri = new Uri(ServiceFabricSettings.EventNotificationHandlerServiceName);

    var promises = new List<Task>();
    // There is only one partition of this service, but there are many replicas
    Partition partition = (await fabricClient.QueryManager.GetPartitionListAsync(eventNotificationHandlerServiceUri).ConfigureAwait(false)).First();

    string continuationToken = null;
    do
    {
        var replicas = await fabricClient.QueryManager.GetReplicaListAsync(partition.PartitionInformation.Id, continuationToken).ConfigureAwait(false);
        foreach (Replica replica in replicas)
        {
            promises.Add(SendMessageToReplica(replica, actionName, message));
        }

        continuationToken = replicas.ContinuationToken;
    } while (continuationToken != null);

    await Task.WhenAll(promises).ConfigureAwait(false);
}


private async Task SendMessageToReplica(Replica replica, string actionName, string message)
{
    if (replica.TryGetEndpoint(out Uri replicaUrl))
    {
        Uri requestUri = UriUtility.Combine(replicaUrl, actionName);
        using (var response = await _httpClient.PostAsync(requestUri, message == null ? null : new JsonContent(message)).ConfigureAwait(false))
        {
            string responseContent = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
            if (!response.IsSuccessStatusCode)
            {
                throw new Exception();
            }
        }
    }
    else
    {
        throw new Exception();
    }
}

The following exception is thrown:

System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071C49 
at System.Fabric.Interop.NativeClient.IFabricQueryClient9.EndGetPartitionList2(IFabricAsyncOperationContext context) 
at System.Fabric.FabricClient.QueryClient.GetPartitionListAsyncEndWrapper(IFabricAsyncOperationContext context) 
at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously) 
--- End of inner exception stack trace --- 
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() 
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) 
at Company.ServiceFabric.ServiceFabricEventNotifier.<SendMessageToReplicas>d__7.MoveNext() in c:\work\ServiceFabricEventNotifier.cs:line 138 

During the same period we also see this exception being thrown:

System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.) ---> System.ComponentModel.Win32Exception (0x80004005): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full 
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection) 
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection) 
at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection) 
at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions) 
at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry) 
at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry) 
at System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken) 

The event logs on the machines in the cluster show these warnings:

Event ID: 4231 
Source: Tcpip 
Level: Warning 
A request to allocate an ephemeral port number from the global TCP port space has failed due to all such ports being in use. 

Event ID: 4227 
Source: Tcpip 
Level: Warning 
TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint. 

Finally, the Microsoft-Service-Fabric admin log shows hundreds of warnings similar to:

Event 4121 
Source Microsoft-Service-Fabric 
Level: Warning 
client-02VM4.company.nl:19000/192.168.10.36:19000: error = 2147942452, failureCount=160522. Filter by (type~Transport.St && ~"(?i)02VM4.company.nl:19000") to get listener lifecycle. Connect failure is expected if listener was never started, or listener/its process was stopped before/during connecting. 

Event 4097 
Source Microsoft-Service-Fabric 
Level: Warning 
client-02VM4.company.nl:19000 : connect failed, having tried all addresses 

After a while, the warnings turn into errors:

Event 4096 
Source Microsoft-Service-Fabric 
Level: Error 
client-02VM4.company.nl:19000 failed to bind to local port for connecting: 0x80072747 

Can anyone tell us why this is happening and what we can do to fix it? Are we doing something wrong?

Answers

1

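We (I work with the OP) have been testing this, and it turned out to be caused by the FabricClient, as Esben Bach suggested.

The FabricClient documentation also states:

It is highly recommended that you share FabricClients as much as possible. This is because the FabricClient has multiple optimizations such as caching and batching that you would not be able to utilize fully otherwise.

It looks like FabricClient behaves like the HttpClient class: you should share the instance, and when you don't, you run into the same problem, port exhaustion.

The "Working with FabricClient common exceptions" documentation, however, also mentions that when a FabricObjectClosedException occurs, you should:

Dispose of the FabricClient object you are using and instantiate a new FabricClient object.

Sharing the FabricClient fixes the port exhaustion issue.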
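
To illustrate the two pieces of advice above (share one client, but replace it after it has been closed), here is a minimal sketch. `SharedClient<T>` is a hypothetical helper, not part of the Service Fabric SDK. It is written generically so that it compiles without the SDK; the assumption is that in a service `T` would be `FabricClient` and the factory would be `() => new FabricClient()`.

```csharp
using System;

// Hypothetical helper (not an SDK type): holds one shared client instance
// and lets callers replace it after a fatal "object closed" error.
public sealed class SharedClient<T> where T : class, IDisposable
{
    private readonly Func<T> _factory;
    private readonly object _gate = new object();
    private T _current;

    public SharedClient(Func<T> factory)
    {
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
    }

    // Returns the shared instance, creating it on first use.
    public T Get()
    {
        lock (_gate)
        {
            return _current ?? (_current = _factory());
        }
    }

    // Disposes and forgets the given instance, but only if it is still the
    // current one, so a stale caller cannot throw away a fresh client.
    public void Reset(T failed)
    {
        lock (_gate)
        {
            if (failed != null && ReferenceEquals(_current, failed))
            {
                _current.Dispose();
                _current = null;
            }
        }
    }
}
```

In the question's class, a static `SharedClient<FabricClient>` field could replace the `new FabricClient()` call in `SendMessageToReplicas`: take the instance with `Get()`, and in a `catch (FabricObjectClosedException)` pass that same instance to `Reset` before retrying or rethrowing.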

1

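It looks like you have a port exhaustion problem. Assuming that is the case, you either have to figure out how to reuse your connections, or you have to implement some kind of throttling mechanism so that you don't use up all the available ports.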
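
As a rough sketch of the throttling option, the helper below caps the number of operations in flight with a `SemaphoreSlim`. The `Throttle` class and its `ForEachAsync` method are made-up names for illustration. Limiting concurrency only bounds how many connections are open at once; it does not by itself make connections reusable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: bounded-concurrency fan-out.
public static class Throttle
{
    // Runs `action` for every item with at most `maxConcurrency` in flight.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> action)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                // Wait for a free slot before starting the operation.
                await gate.WaitAsync().ConfigureAwait(false);
                try
                {
                    await action(item).ConfigureAwait(false);
                }
                finally
                {
                    gate.Release();
                }
            }).ToList(); // materialize so each task is started exactly once

            // WhenAll completes only after every task has finished, so the
            // semaphore is no longer in use when it is disposed.
            await Task.WhenAll(tasks).ConfigureAwait(false);
        }
    }
}
```

With this in place, `SendMessageToReplicas` could collect the replicas first and then call something like `Throttle.ForEachAsync(replicas, 20, r => SendMessageToReplica(r, actionName, message))`. The limit of 20 is an arbitrary placeholder that would need tuning under load.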

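I don't know how the fabric client behaves. It may be responsible for the exhaustion, or it may be the SQL Server part, for which we can't see the code (but since you posted that among the logs, I assume it is probably unrelated to your ping test).

Looking at the reference source for HttpWebResponse (https://github.com/Microsoft/referencesource/blob/master/System/net/System/Net/HttpWebResponse.cs), it could also be that disposing the response (i.e. the using statement around your PostAsync call) closes the HttpClient's connection. That would mean you are not reusing connections but always opening new ones.

I guess it would be fairly easy to test a variant that does not dispose your HttpWebResponse.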

+0

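Indeed, we think this is a port exhaustion problem. Not disposing the HttpWebResponses seems strange. We will try some other variants to determine whether the problem lies in the Service Fabric client or in the HttpClient usage, as you suggest.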

+0

Agreed, it seems strange, but looking at the reference source it appears to close the connection group during close: ConnectStream connectStream = m_ConnectStream as ConnectStream; if (connectStream != null && connectStream.Connection != null) { connectStream.Connection.ServicePoint.CloseConnectionGroup(ConnectionGroupName); } so maybe

1

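What is the reason for calling every existing service instance?

Normally you should call only one service instance, provided by the SF runtime (it will try to pick one in the same node/process, or on another node if this node is overloaded).

If you need to signal some kind of state change or event to all service instances, this should probably be done inside the service implementation, so that each instance checks for the state change (possibly in a stateful service) or a pub-sub event queue whenever it needs this information (see https://github.com/loekd/ServiceFabric.PubSubActors).

Another idea is to send many messages to a service instance at once, through a separate operation that supports bulk data.

If you have to send individual messages from a single source at a high frequency, then keeping the connection open, as described in the previous answer, is a good solution.

Also, the caller should implement connection resiliency; see for example https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-communication#communicating-with-a-service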
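
One way to picture connection resiliency on the caller side is a retry with exponential backoff around calls that can fail transiently. The sketch below is a generic illustration, not the Service Fabric communication client API that the linked page describes; `Retry.WithBackoffAsync` and its parameters are names invented here.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: retries an operation when the failure is transient.
public static class Retry
{
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan initialDelay)
    {
        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            // Retry only transient failures, and only while attempts remain;
            // anything else propagates to the caller unchanged.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                await Task.Delay(delay).ConfigureAwait(false);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // double the wait
            }
        }
    }
}
```

For the query in the question, the predicate could be `ex => ex is FabricTransientException`, the exception type shown in the stack trace. Retrying will not help while the machine has no free ports, so this complements sharing the client rather than replacing it.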

+0

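We are looking into implementing this communication with Actor Events. Still, even though the current approach may not be the best solution, we expected that it should work technically.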