Clarify details about incremental copy. #2777
rodenkew wants to merge 1 commit into MicrosoftDocs:main from
A customer expressed confusion that rows with a NULL value in the incremental column were not being copied during a subsequent load. This indicates that the documentation could use a bit of clarification. This PR includes some further clarification.
@rodenkew: Thanks for your contribution! The author(s) and reviewer(s) have been notified to review your proposed change.
Learn Build status updates of commit ef8abb9: ✅ Validation status: passed
For more details, please refer to the build report.
#sign-off
Invalid command: '#sign-off'. Only the assigned author of one or more files in this PR can sign off. @whhender
Pull request overview
Clarifies Copy job incremental copy behavior—especially around watermark-based subsequent loads—and reorganizes content so incremental-copy reset guidance sits with other incremental copy details.
Changes:
- Expanded the “Incremental copy (CDC, Watermark)” section to describe how subsequent loads determine what to copy for watermark, CDC, and files.
- Added an explicit note explaining why rows with `NULL` in the incremental (watermark) column aren’t copied in subsequent loads.
- Moved the “Reset incremental copy” subsection to sit directly under the incremental copy content.
| Typically, an incremental column holds a date/time value or an increasing number. | ||
| If your database has CDC enabled, you don’t need to choose an incremental column — Copy job automatically detects the changes. | ||
| Note that if you are using a watermark to copy incrementally from a database, subsequent loads do not copy any rows with a "null" value in that column, because the "null" value is considered _less_ than any other value. |
The explanation for why rows with `NULL` values aren't copied is technically incorrect: in most databases `NULL` isn't "less" than other values. A comparison like `watermark > lastValue` evaluates to UNKNOWN when `watermark` is `NULL`, and a filter keeps only rows for which the comparison is true, so those rows don't match the incremental filter. Consider rephrasing to explain that subsequent loads filter on values greater than the last recorded watermark, and rows with `NULL` in the incremental column don't satisfy that filter (also format `NULL` as code, not quoted text).
| Note that if you are using a watermark to copy incrementally from a database, subsequent loads do not copy any rows with a "null" value in that column, because the "null" value is considered _less_ than any other value. | |
| Note that if you use a watermark column to copy incrementally from a database, subsequent loads don't copy rows where that column is `NULL`, because those rows don't satisfy the filter that selects values greater than the last recorded watermark. |
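To illustrate the point about `NULL` comparisons, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `src`, the column `modified_at`, and the helper `rows_after_watermark` are hypothetical, and the query is only an assumed shape for a watermark filter, not the query Copy job actually issues.

```python
import sqlite3


def rows_after_watermark(rows, last_watermark):
    """Return ids of rows whose modified_at is greater than last_watermark."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, modified_at INTEGER)")
    conn.executemany("INSERT INTO src (id, modified_at) VALUES (?, ?)", rows)
    # Assumed shape of an incremental filter: strictly greater than the stored value.
    cur = conn.execute(
        "SELECT id FROM src WHERE modified_at > ? ORDER BY id", (last_watermark,)
    )
    ids = [r[0] for r in cur.fetchall()]
    conn.close()
    return ids


# Row 3 has NULL in the watermark column.
source_rows = [(1, 100), (2, 250), (3, None), (4, 300)]
selected = rows_after_watermark(source_rows, 200)

# The comparison itself yields NULL (Python None), which is neither true nor false.
conn = sqlite3.connect(":memory:")
null_comparison = conn.execute("SELECT NULL > 200").fetchone()[0]
conn.close()

print(selected)         # [2, 4]
print(null_comparison)  # None
```

Row 3 is skipped not because `NULL` sorts below 200, but because the comparison never evaluates to true, whatever the stored watermark is.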
| - Databases: Only new or updated rows are copied. If Change Data Capture (CDC) is enabled, inserted, updated, and deleted rows are included. | ||
| - Storage: Only files with a newer LastModifiedTime are copied. | ||
| In incremental copy, every run after the initial full copy (called a "subsequent load") transfers only certain changes. Copy job automatically tracks and manages the state of the last successful run, so it knows what data to copy next. | ||
| - When copying from a database using an incremental column ("watermark column"), each subsequent load copies only rows with a value in that column larger than any row previously copied. |
This description of watermark behavior is a bit misleading: Copy job tracks the last successful run’s stored watermark value, so the rule is typically "greater than the last recorded watermark" (not "larger than any row previously copied"). Rewording will better match how the state is tracked and reduce confusion about what value is being compared against.
| - When copying from a database using an incremental column ("watermark column"), each subsequent load copies only rows with a value in that column larger than any row previously copied. | |
| - When copying from a database using an incremental column ("watermark column"), Copy job stores the watermark value from the last successful run and, for each subsequent load, copies only rows with a value in that column greater than this stored watermark. |
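The difference between "larger than any row previously copied" and "greater than the stored watermark" can be sketched in a few lines. Everything here is an assumption for illustration: the `run_load` helper, the `state` dictionary, the `modified` column, and the choice to store the maximum value seen in the last run are hypothetical and not taken from the Copy job implementation.

```python
from typing import Optional


def run_load(source: list, destination: list, state: dict) -> list:
    """Run one load and return the rows copied in this run."""
    last: Optional[int] = state.get("watermark")
    if last is None:
        # No stored watermark yet: initial full copy (assumed to include every row).
        batch = list(source)
    else:
        # Subsequent load: only rows strictly greater than the stored watermark.
        batch = [
            row for row in source
            if row["modified"] is not None and row["modified"] > last
        ]
    destination.extend(batch)
    seen = [row["modified"] for row in batch if row["modified"] is not None]
    if seen:
        # Assumption: the new stored watermark is the largest value copied.
        state["watermark"] = max(seen)
    return batch


source = [{"id": 1, "modified": 10}, {"id": 2, "modified": 20}]
destination, state = [], {}

first = run_load(source, destination, state)   # initial full copy, stores 20

source.append({"id": 3, "modified": 30})
source.append({"id": 4, "modified": 15})       # arrives late, below the stored value
second = run_load(source, destination, state)  # copies only id 3, stores 30
```

In this sketch the comparison is always against the single stored value, so the row with `modified` 15 is not picked up by the second run even though it was never copied.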
| See more details for [Change data capture (CDC) in Copy Job](/fabric/data-factory/cdc-copy-job). | ||
| When a copy job fails, you don’t need to worry about data loss. Copy job always resumes from the state of the last successful run. A failure does not change the state managed by copy job. | ||
| If a copy job fails, you don’t need to worry about data loss. Copy job always resumes from the end of the last successful run. A failure does not change the state managed by Copy job. |
The sentence says the job "resumes from the end of the last successful run", but earlier/elsewhere the concept is the stored state of the last successful run (watermark/CDC checkpoint). "End" can read like row-level continuation and could be inaccurate. Suggest aligning the wording to "resumes from the state/checkpoint of the last successful run" and also keeping product naming consistent ("Copy job" vs "Copy Job").
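One way to picture "resumes from the state of the last successful run" is a checkpoint that is committed only after a run succeeds. The sketch below is an assumption about the general pattern, not Copy job's implementation; `run_with_checkpoint`, `failing_step`, and `succeeding_step` are hypothetical names.

```python
def run_with_checkpoint(state: dict, copy_step) -> bool:
    """Run copy_step from the stored checkpoint; commit the new one only on success."""
    try:
        new_checkpoint = copy_step(state["checkpoint"])
    except Exception:
        # A failed run leaves the stored checkpoint untouched.
        return False
    state["checkpoint"] = new_checkpoint
    return True


def failing_step(checkpoint):
    # Stands in for a run that breaks partway through.
    raise RuntimeError("simulated source outage")


def succeeding_step(checkpoint):
    # Stands in for a run that copies data and reports a new checkpoint.
    return checkpoint + 5


state = {"checkpoint": 100}

failed_ok = run_with_checkpoint(state, failing_step)
checkpoint_after_failure = state["checkpoint"]    # still 100

succeeded_ok = run_with_checkpoint(state, succeeding_step)
checkpoint_after_success = state["checkpoint"]    # now 105
```

The next run after a failure starts from the same stored value as the failed run did, which is why "state" or "checkpoint" describes the behavior more precisely than "end".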
| ### Reset incremental copy | ||
| You have the flexibility in managing incremental copy, including the ability to reset it back to a full copy on the next run. This is incredibly useful when there’s a data discrepancy between your source and destination—you can simply let Copy Job perform a full copy in the next run to resolve the issue, then continue with incremental updates afterward. |
Minor style/consistency issues in this paragraph: add spaces around the em dash (the file uses spaced em dashes elsewhere), and use consistent product casing ("Copy job" vs "Copy Job"). Also consider softening promotional phrasing like "incredibly useful" to keep the tone more neutral and instructional.
@rodenkew did you intend to open this PR in the private repo? It looks like you have access to the private repo, and if so, you should use that for content updates. I'll leave this PR open in case @whhender wants to review it here. If you decide to open a new PR in the private repo, please close this PR. Thanks.
Can you review the proposed changes? IMPORTANT: When the changes are ready for publication, adding a #label:"aq-pr-triaged"
This pull request has been inactive for 14 days, and an
This pull request has been inactive for 28 days, and an
This PR also moves a paragraph to keep the Incremental Copy information "together."
Thank you for contributing to Microsoft Fabric documentation
Fill out these items before submitting your pull request:
If you are working internally at Microsoft:
Who is your primary Skilling team contact? @mention them individually to tag them, and let them review the PR before signing off.
For internal Microsoft contributors, check off these quality control items as you go
1. Check the Acrolinx report: Make sure your Acrolinx Total score is above 80 minimum (higher is better) and with no spelling issues. Acrolinx ensures we are providing consistent terminology and using an appropriate voice and tone, and helps with localization.
2. Successful build with no warnings or suggestions: Review the build status to make sure all files are green (Succeeded).
3. Preview the pages: Click each Preview URL link to view the rendered HTML pages on the review.learn.microsoft.com site to check the formatting and alignment of the page. Scan the page for overall formatting, and look at the parts you edited in detail.
4. Check the Table of Contents: If you are adding a new markdown file, make sure it is linked from the table of contents.
5. #sign-off to request PR review and merge: Once the pull request is finalized and ready to be merged, indicate so by typing `#sign-off` in a new comment in the pull request. If you need to cancel that sign-off, type `#hold-off` instead. Signing off means the document can be published at any time. Note, this is a formatting and standards review, not a technical review.
Merge and publish
After you `#sign-off`, there is a separate PR Review team that will review the PR and describe any necessary feedback before merging.
After addressing any feedback, `#sign-off` again. The PR Review team reviews and merges the pull request into the specified branch (usually the main branch or a release- branch).