Commit 65d2ebe

Restructure re.sub docs, clarify aspects of repl notation
- `flags` are only relevant when `pattern` is a string (followup to #119960).
- Extended "beans and spam" example to demonstrate both string & re.compile flags usage, `\1` templating, and moved it close to start.
- Discuss all how-we-match parameters before what-we-do-with-matches.
  TODO: Is important info close enough to start?
- Explain callback before backslash notation because it's shorter but also to promote it. IMHO, people fear it as a "last-resort escape hatch" while it's actually *simpler* than backslashes (#128138 is one example).
  TODO: Will this order make sense after #135992?
- Consolidated `repl` notation from two far-away paragraphs to one place.
  - Starting from `\1` and `\g` which are the whole purpose of dealing with backslashes!
  - Briefly mention `\octal` wart, 99 limit and `\g<100>` avoiding them.
  - Draw attention to `\\` for getting a literal backslash.
  - Clarify that *most* escapes are supported but `\x\u\U\N` aren't.
  - Move "Unknown escapes of ASCII letters" *after* listing all the known ones.
- Added a note promoting raw string notation for `repl` too.
1 parent 936d60d commit 65d2ebe
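
As a rough illustration of the commit message's point that `flags` only matter when `pattern` is a string, the sketch below runs the same substitution with a string pattern and with a precompiled one. The variable names (`text`, `from_string`, `from_compiled`, `flags_error`) are made up for this example, and the `ValueError` branch reflects how CPython's `re` module behaves to my knowledge; it is not something this commit states.

```python
import re

text = 'Beans AND Spam'

# String pattern: flags are applied when re.sub compiles the pattern.
from_string = re.sub(r'and', '&', text, flags=re.IGNORECASE)

# Compiled pattern: flags were fixed at re.compile time, so none are passed here.
compiled = re.compile(r'and', flags=re.IGNORECASE)
from_compiled = re.sub(compiled, '&', text)

# Assumption: CPython rejects non-zero flags combined with a compiled pattern.
try:
    re.sub(compiled, '&', text, flags=re.IGNORECASE)
except ValueError as exc:
    flags_error = str(exc)
else:
    flags_error = None

print(from_string, from_compiled, flags_error, sep='\n')
```

Both successful calls should produce the same string, which is why the extended "beans and spam" example shows the two spellings side by side.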

1 file changed, 56 additions and 41 deletions

Doc/library/re.rst
@@ -1064,34 +1064,19 @@ Functions
 
 Return the string obtained by replacing the leftmost non-overlapping occurrences
 of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
-*string* is returned unchanged. *repl* can be a string or a function; if it is
-a string, any backslash escapes in it are processed. That is, ``\n`` is
-converted to a single newline character, ``\r`` is converted to a carriage return, and
-so forth. Unknown escapes of ASCII letters are reserved for future use and
-treated as errors. Other unknown escapes such as ``\&`` are left alone.
-Backreferences, such
-as ``\6``, are replaced with the substring matched by group 6 in the pattern.
-For example::
-
->>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
-... r'static PyObject*\npy_\1(void)\n{',
-... 'def myfunc():')
-'static PyObject*\npy_myfunc(void)\n{'
-
-If *repl* is a function, it is called for every non-overlapping occurrence of
-*pattern*. The function takes a single :class:`~re.Match` argument, and returns
-the replacement string. For example::
+*string* is returned unchanged.
+The pattern may be a string or a :class:`~re.Pattern`.
+A string pattern's behaviour may be modified by specifying a *flags* value,
+which can be any of the `flags`_ variables, combined using bitwise OR
+(the ``|`` operator).
 
->>> def dashrepl(matchobj):
-...     if matchobj.group(0) == '-': return ' '
-...     else: return '-'
-...
->>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
-'pro--gram files'
->>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
-'Baked Beans & Spam'
+>>> re.sub(r'(and)', r'*\1*', 'Contraband Andalusian Beans AND Spam',
+... flags=re.IGNORECASE)
+'Contrab*and* *And*alusian Beans *AND* Spam'
 
-The pattern may be a string or a :class:`~re.Pattern`.
+>>> pattern = re.compile(r'(and)', flags=re.IGNORECASE)
+>>> re.sub(pattern, r'*\1*', 'Contraband Andalusian Beans AND Spam')
+'Contrab*and* *And*alusian Beans *AND* Spam'
 
 The optional argument *count* is the maximum number of pattern occurrences to be
 replaced; *count* must be a non-negative integer. If omitted or zero, all
@@ -1102,21 +1087,51 @@ Functions
 As a result, ``sub('x*', '-', 'abxd')`` returns ``'-a-b--d-'``
 instead of ``'-a-b-d-'``.
 
-.. index:: single: \g; in regular expressions
-
-In string-type *repl* arguments, in addition to the character escapes and
-backreferences described above,
-``\g<name>`` will use the substring matched by the group named ``name``, as
-defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
-group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
-in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
-reference to group 20, not a reference to group 2 followed by the literal
-character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
-substring matched by the RE.
-
-The expression's behaviour can be modified by specifying a *flags* value.
-Values can be any of the `flags`_ variables, combined using bitwise OR
-(the ``|`` operator).
+*repl* can be a string template or a function:
+
+* If it is callable, it is called for every non-overlapping occurrence of
+*pattern*. The function takes a single :class:`~re.Match` argument, and
+returns the replacement string. For example::
+
+>>> def dashrepl(matchobj):
+...     if matchobj.group(0) == '-': return ' '
+...     else: return '-'
+...
+>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
+'pro--gram files'
+
+* If *repl* is a string, it's processed as a template based on backslash escapes:
+
+.. index:: single: \g; in regular expressions
+
+- ``\1`` .. ``\99`` are replaced by the substring matched by corresponding
+``(...)`` groups in the pattern.
+- However other ``\numbers`` get interpreted as *octal* character literals.
+- ``\g<name>`` are replaced by the substring matched by named ``(?P<name>...)``
+groups.
+- ``\g<number>`` is another way to refer to numbered groups.
+``\g<2>0`` inserts group 2 followed by the literal character ``'0'``,
+whereas ``\20`` can only express a reference to group 20. ``\g<100>`` etc.
+can refer to groups higher than 99, and the backreference ``\g<0>``
+substitutes in the entire substring matched by the RE.
+- ``\\`` is converted to a single backslash.
+- Basic escapes ``\n\r\t\v\f\a\b`` work like in Python string literals.
+That is, ``\n`` is converted to a single newline character, and so forth.
+- Unknown escapes of ASCII letters are reserved for future use and
+treated as errors. This includes ``\x..``, ``\u...``, ``\U...`` and
+``\N{...}`` which are not presently supported.
+- Other unknown escapes such as ``\&`` are left alone.
+
+For example::
+
+>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
+... r'static PyObject*\npy_\1(void)\n{',
+... 'def myfunc():')
+'static PyObject*\npy_myfunc(void)\n{'
+
+(Note the use of raw string notation for *repl* as well. Otherwise you'd have
+to write ``'\\1'`` for Python to parse it into ``\1`` to be replaced by
+``myfunc`` at substitution time...)
 
 .. versionchanged:: 3.1
 Added the optional flags argument.
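
For anyone reviewing the new *repl* template list, the following sketch exercises a few of the listed escapes next to the callable alternative. It is only an illustration: the sample strings and names (`padded`, `bracketed`, `backslashed`, `bracket`, `via_callable`, `template_error`) are invented here, and the `re.error` case assumes a Python version (3.7 or later, as far as I know) where unknown ASCII-letter escapes in *repl* are errors.

```python
import re

# \g<1>0 means "group 1, then a literal '0'"; a bare \10 would ask for group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# \g<0> inserts the entire match.
bracketed = re.sub(r'\d+', r'[\g<0>]', 'x12y345')

# \\ in the template produces one literal backslash per match.
backslashed = re.sub(r'/', r'\\', 'a/b/c')

# A callable avoids template escapes entirely: its return value is used as-is.
def bracket(match):
    return '[' + match.group(0) + ']'

via_callable = re.sub(r'\d+', bracket, 'x12y345')

# \x.. is not a supported template escape, so this is expected to raise re.error.
try:
    re.sub(r'a', r'\x41', 'a')
except re.error as exc:
    template_error = str(exc)
else:
    template_error = None

print(padded, bracketed, backslashed, via_callable, template_error, sep='\n')
```

The callable and the `\g<0>` template should give the same result here, which is in line with the commit message's argument that the callback form is often the simpler of the two.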
