HDFS-17878. Reduce frequency of getDatanodeListForReport calls for metrics #8220
kokonguyen191 wants to merge 1 commit into apache:trunk
Conversation
💔 -1 overall

This message was automatically generated.
Force-pushed from 8dd731f to a9bf648
```java
public boolean isSufficientlyReplicated(BlockInfo b) {
  // Compare against the lesser of the minReplication and number of live DNs.
  final int liveReplicas = countNodes(b).liveReplicas();
  if (liveReplicas == 0) {
```
Perhaps we can make this PR more focused: simply provide a caching mechanism for the list of DataNodes, without making any other modifications.
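A caching mechanism of the kind suggested could look roughly like this minimal JDK-only sketch (the name `CachedListSupplier` and its shape are hypothetical, not from the patch): the expensive list computation is hidden behind a loader that is re-invoked only after a TTL elapses, with a manual invalidation hook for membership changes.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Minimal sketch: caches an expensive list computation for a fixed TTL. */
class CachedListSupplier<T> {
  private final Supplier<List<T>> loader;
  private final long ttlNanos;
  private List<T> cached;       // guarded by this
  private long loadedAtNanos;   // guarded by this

  CachedListSupplier(Supplier<List<T>> loader, long ttlMs) {
    this.loader = loader;
    this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMs);
  }

  /** Returns the cached list, reloading only after the TTL has elapsed. */
  synchronized List<T> get() {
    long now = System.nanoTime();
    if (cached == null || now - loadedAtNanos > ttlNanos) {
      cached = loader.get();
      loadedAtNanos = now;
    }
    return cached;
  }

  /** Force-wipe, e.g. when a datanode is added or removed. */
  synchronized void invalidate() {
    cached = null;
  }
}
```

Metrics readers that can tolerate stale data would call `get()`; code paths that change DN membership would call `invalidate()`.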
```java
private final boolean randomNodeOrderEnabled;

/** Cached map of DatanodeReportType -> list of DatanodeDescriptor for metrics purposes. */
private volatile Cache<DatanodeReportType, List<DatanodeDescriptor>> datanodeListSnapshots = null;
```
```java
synchronized (this) {
  host2DatanodeMap.remove(datanodeMap.put(node.getDatanodeUuid(), node));
}
Cache<DatanodeReportType, List<DatanodeDescriptor>> tmpDatanodeListSnapshots =
```
How about removing this invalidateAll?
If we want to remove a dead datanode from the cache promptly, we would also need to update the decommission status, right? So we can keep it simpler and just refresh the cache at a fixed interval.
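The fixed-interval refresh suggested here could be sketched as follows (JDK-only; `PeriodicSnapshot` is an illustrative name, not the patch's class): a single background task rebuilds the snapshot on a schedule, so readers never trigger a rebuild or contend on a lock.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Sketch: refresh the snapshot at a fixed interval; readers never block. */
class PeriodicSnapshot<T> {
  private final AtomicReference<List<T>> snapshot = new AtomicReference<>(List.of());
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  PeriodicSnapshot(Supplier<List<T>> loader, long intervalMs) {
    // Load immediately, then rebuild every intervalMs in the background.
    scheduler.scheduleAtFixedRate(
        () -> snapshot.set(loader.get()), 0, intervalMs, TimeUnit.MILLISECONDS);
  }

  /** Lock-free read of the latest snapshot (possibly slightly stale). */
  List<T> get() {
    return snapshot.get();
  }

  void shutdown() {
    scheduler.shutdownNow();
  }
}
```

The trade-off versus invalidation-on-change is simplicity: dead or decommissioned nodes linger in the snapshot for at most one interval, which is acceptable for metrics.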
```java
  return dnId;
}

public void refreshDatanodeListSnapshot(long newExpirationMs) {
```
It's better to support this hot-reload operation in a new PR if you think it's necessary.
```java
/**
 * Low impact version of {@link #getDatanodeListForReport} with possible stale
 * data for low impact usage (metrics).
 */
public List<DatanodeDescriptor> getDatanodeListSnapshotForReport(
```
getDatanodeListForReportWithCache
```diff
 /** Fetch live and dead datanodes. */
 public void fetchDatanodes(final List<DatanodeDescriptor> live,
-    final List<DatanodeDescriptor> dead, final boolean removeDecommissionNode) {
+    final List<DatanodeDescriptor> dead, final boolean removeDecommissionNode, boolean useCache) {
```
Please add an overload so existing callers need no changes:

```java
public void fetchDatanodes(final List<DatanodeDescriptor> live,
    final List<DatanodeDescriptor> dead, final boolean removeDecommissionNode) {
  fetchDatanodes(live, dead, removeDecommissionNode, false);
}
```
Force-pushed from a9bf648 to cc06d61
Description of PR

getDatanodeListForReport is called by many metrics methods while holding a synchronized lock, interfering with more critical operations like datanodeReport, even though the data it serves (metrics) is not critical. It is best to reduce the frequency of calls to this method. This patch adds a configurable cache that is force-wiped when the set of DNs changes, and otherwise expires after the configured expiration period.

How was this patch tested?

Production cluster + local UT benchmark
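The behavior described above (entries keyed by report type, force-wiped on DN membership changes, otherwise expiring after a configured period) can be sketched with plain JDK types. The actual patch uses a Guava `Cache`; the names below (`ReportSnapshotCache` and its members) are illustrative only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Sketch: per-key expiring cache with a force-wipe hook, approximating the
 * patch's Guava-Cache-based design for datanode report snapshots.
 */
class ReportSnapshotCache<K, V> {
  private static final class Entry<V> {
    final V value;
    final long loadedAtMs;
    Entry(V value, long loadedAtMs) {
      this.value = value;
      this.loadedAtMs = loadedAtMs;
    }
  }

  private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
  private final long expirationMs;

  ReportSnapshotCache(long expirationMs) {
    this.expirationMs = expirationMs;
  }

  /** Returns the cached value, invoking the loader only on miss or expiry. */
  V get(K key, Function<K, V> loader) {
    final long now = System.currentTimeMillis();
    Entry<V> e = cache.compute(key, (k, old) ->
        (old == null || now - old.loadedAtMs > expirationMs)
            ? new Entry<>(loader.apply(k), now)
            : old);
    return e.value;
  }

  /** Force-wipe on any change in the set of datanodes. */
  void invalidateAll() {
    cache.clear();
  }
}
```

In the patch's terms, the key would be a `DatanodeReportType` and the loader would delegate to the real `getDatanodeListForReport` under the lock, so the lock is taken at most once per key per expiration period.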