Automigrate 20210618#1
Open
wenhuizhang wants to merge 10 commits into
Open
Conversation
Prepare for the kernel to auto-migrate pages to other memory nodes with a user defined node migration table. This allows creating single migration target for each NUMA node to enable the kernel to do NUMA page migrations instead of simply reclaiming colder pages. A node with no target is a "terminal node", so reclaim acts normally there. The migration target does not fundamentally _need_ to be a single node, but this implementation starts there to limit complexity. If you consider the migration path as a graph, cycles (loops) in the graph are disallowed. This avoids wasting resources by constantly migrating (A->B, B->A, A->B ...). The expectation is that cycles will never be allowed. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: osalvador <osalvador@suse.de> -- changes since 20200122: * Make node_demotion[] __read_mostly changes in July 2020: - Remove loop from next_demotion_node() and get_online_mems(). This means that the node returned by next_demotion_node() might now be offline, but the worst case is that the allocation fails. That's fine since it is transient.
When memory fills up on a node, memory contents can be automatically migrated to another node. The biggest problems are knowing when to migrate and to where the migration should be targeted. The most straightforward way to generate the "to where" list would be to follow the page allocator fallback lists. Those lists already tell us if memory is full where to look next. It would also be logical to move memory in that order. But, the allocator fallback lists have a fatal flaw: most nodes appear in all the lists. This would potentially lead to migration cycles (A->B, B->A, A->B, ...). Instead of using the allocator fallback lists directly, keep a separate node migration ordering. But, reuse the same data used to generate page allocator fallback in the first place: find_next_best_node(). This means that the firmware data used to populate node distances essentially dictates the ordering for now. It should also be architecture-neutral since all NUMA architectures have a working find_next_best_node(). The protocol for node_demotion[] access and writing is not standard. It has no specific locking and is intended to be read locklessly. Readers must take care to avoid observing changes that appear incoherent. This was done so that node_demotion[] locking has no chance of becoming a bottleneck on large systems with lots of CPUs in direct reclaim. This code is unused for now. It will be called later in the series. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: osalvador <osalvador@suse.de> -- Changes from 20200122: * Add big node_demotion[] comment Changes from 20210302: * Fix typo in node_demotion[] comment
Reclaim-based migration is attempting to optimize data placement in
memory based on the system topology. If the system changes, so must
the migration ordering.
The implementation is conceptually simple and entirely unoptimized.
On any memory or CPU hotplug events, assume that a node was added or
removed and recalculate all migration targets. This ensures that the
node_demotion[] array is always ready to be used in case the new
reclaim mode is enabled.
This recalculation is far from optimal, most glaringly that it does
not even attempt to figure out the hotplug event would have some
*actual* effect on the demotion order. But, given the expected
paucity of hotplug events, this should be fine.
=== What does RCU provide? ===
Imaginge a simple loop which walks down the demotion path looking
for the last node:
terminal_node = start_node;
while (node_demotion[terminal_node] != NUMA_NO_NODE) {
terminal_node = node_demotion[terminal_node];
}
The initial values are:
node_demotion[0] = 1;
node_demotion[1] = NUMA_NO_NODE;
and are updated to:
node_demotion[0] = NUMA_NO_NODE;
node_demotion[1] = 0;
What guarantees that the loop did not observe:
node_demotion[0] = 1;
node_demotion[1] = 0;
and would loop forever?
With RCU, a rcu_read_lock/unlock() can be placed around the
loop. Since the write side does a synchronize_rcu(), the loop
that observed the old contents is known to be complete after the
synchronize_rcu() has completed.
RCU, combined with disable_all_migrate_targets(), ensures that
the old migration state is not visible by the time
__set_migration_target_nodes() is called.
=== What does READ_ONCE() provide? ===
READ_ONCE() forbids the compiler from merging or reordering
successive reads of node_demotion[]. This ensures that any
updates are *eventually* observed.
Consider the above loop again. The compiler could theoretically
read the entirety of node_demotion[] into local storage
(registers) and never go back to memory, and *permanently*
observe bad values for node_demotion[].
Note: RCU does not provide any universal compiler-ordering
guarantees:
https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>
--
Changes since 20210302:
* remove duplicate synchronize_rcu()
The migrate_pages() returns the number of pages that were not migrated, or an error code. When returning an error code, there is no way to know how many pages were migrated or not migrated. In the following patch, migrate_pages() is used to demote pages to PMEM node, we need account how many pages are reclaimed (demoted) since page reclaim behavior depends on this. Add *nr_succeeded parameter to make migrate_pages() return how many pages are demoted successfully for all cases. Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: osalvador <osalvador@suse.de> -- Note: Yang Shi originally wrote the patch, thus the SoB. There was also a Reviewed-by provided since there were some modifications made to this after the original work. Changes since 20210302: * Fix definition of CONFIG_MIGRATION=n stub migrate_pages(). Its parameters were wrong, but oddly enough did not generate any compile errors. Changes since 20200122: * Fix migrate_pages() to manipulate nr_succeeded *value* rather than the pointer.
This is mostly derived from a patch from Yang Shi: https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/ Add code to the reclaim path (shrink_page_list()) to "demote" data to another NUMA node instead of discarding the data. This always avoids the cost of I/O needed to read the page back in and sometimes avoids the writeout cost when the pagee is dirty. A second pass through shrink_page_list() will be made if any demotions fail. This essentally falls back to normal reclaim behavior in the case that demotions fail. Previous versions of this patch may have simply failed to reclaim pages which were eligible for demotion but were unable to be demoted in practice. Note: This just adds the start of infratructure for migration. It is actually disabled next to the FIXME in migrate_demote_page_ok(). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: osalvador <osalvador@suse.de> -- changes from 20210122: * move from GFP_HIGHUSER -> GFP_HIGHUSER_MOVABLE (Ying) changes from 202010: * add MR_NUMA_MISPLACED to trace MIGRATE_REASON define * make migrate_demote_page_ok() static, remove 'sc' arg until later patch * remove unnecessary alloc_demote_page() hugetlb warning * Simplify alloc_demote_page() gfp mask. Depend on __GFP_NORETRY to make it lightweight instead of fancier stuff like leaving out __GFP_IO/FS. * Allocate migration page with alloc_migration_target() instead of allocating directly. changes from 20200730: * Add another pass through shrink_page_list() when demotion fails. changes from 20210302: * Use __GFP_THISNODE and revise the comment explaining the GFP mask constructionn
Account the number of demoted pages.
Add pgdemote_kswapd and pgdemote_direct VM counters showed in
/proc/vmstat.
[ daveh:
- __count_vm_events() a bit, and made them look at the THP
size directly rather than getting data from migrate_pages()
]
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>
--
Changes since 202010:
* remove unused scan-control 'demoted' field
Anonymous pages are kept on their own LRU(s). These lists could theoretically always be scanned and maintained. But, without swap, there is currently nothing the kernel can *do* with the results of a scanned, sorted LRU for anonymous pages. A check for '!total_swap_pages' currently serves as a valid check as to whether anonymous LRUs should be maintained. However, another method will be added shortly: page demotion. Abstract out the 'total_swap_pages' checks into a helper, give it a logically significant name, and check for the possibility of page demotion. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: osalvador <osalvador@suse.de>
Reclaim anonymous pages if a migration path is available now that demotion provides a non-swap recourse for reclaiming anon pages. Note that this check is subtly different from the anon_should_be_aged() checks. This mechanism checks whether a specific page in a specific context *can* actually be reclaimed, given current swap space and cgroup limits anon_should_be_aged() is a much simpler and more preliminary check which just says whether there is a possibility of future reclaim. Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: osalvador <osalvador@suse.de> -- Changes from Dave 10/2020: * remove 'total_swap_pages' modification Changes from Dave 06/2020: * rename reclaim_anon_pages()->can_reclaim_anon_pages() Note: Keith's Intel SoB is commented out because he is no longer at Intel and his @intel.com mail will bounce.
Global reclaim aims to reduce the amount of memory used on a given node or set of nodes. Migrating pages to another node serves this purpose. memcg reclaim is different. Its goal is to reduce the total memory consumption of the entire memcg, across all nodes. Migration does not assist memcg reclaim because it just moves page contents between nodes rather than actually reducing memory consumption. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: osalvador <osalvador@suse.de>
Some method is obviously needed to enable reclaim-based migration. Just like traditional autonuma, there will be some workloads that will benefit like workloads with more "static" configurations where hot pages stay hot and cold pages stay cold. If pages come and go from the hot and cold sets, the benefits of this approach will be more limited. The benefits are truly workload-based and *not* hardware-based. We do not believe that there is a viable threshold where certain hardware configurations should have this mechanism enabled while others do not. To be conservative, earlier work defaulted to disable reclaim- based migration and did not include a mechanism to enable it. This proposes add a new sysfs file /sys/kernel/mm/numa/demotion_enabled as a method to enable it. We are open to any alternative that allows end users to enable this mechanism or disable it if workload harm is detected (just like traditional autonuma). Once this is enabled page demotion may move data to a NUMA node that does not fall into the cpuset of the allocating process. This could be construed to violate the guarantees of cpusets. However, since this is an opt-in mechanism, the assumption is that anyone enabling it is content to relax the guarantees. Originally-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: osalvador <osalvador@suse.de> Changes since 20200122: * Changelog material about relaxing cpuset constraints Changes since 20210304: * Add Documentation/ material about relaxing cpuset constraints Changes since 20210331: * Use sysfs interface separated from the zone_reclaim sysctl.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.