Proposal to fix some surprising regex behaviors
## Summary

CMake regular expressions have several surprising behaviors. Here I describe them, along with the proposed changes. Only one change seems disruptive enough to require a policy.

**Updated Jan 27, 2025**: changed the "Empty Matches", added the "Non-existent matches" section, removed the "CMAKE_MATCH_* variables" section.

**Updated Jan 28, 2025**: removed the "Non-existing group references" section. Implemented the remaining sections.

## The `^` anchor matching

The current implementation of `REGEX MATCHALL` and `REGEX REPLACE` matches the `^` anchor at the start of every successive match (including in the middle of the input string). For example, `"^x"` matches the whole `"xxxx"` instead of only the first `"x"`.

This contradicts the meaning of `^` and should be fixed. Code in the wild probably depends on the old behavior, either because its authors had to work around it or because it worked for them accidentally.

Issue: #16899, #18690 (in the comments). Fix: !10221. Policy: Yes.

## Unmatched group references

When a regex replacement string refers to a group that didn't match anything (i.e. it was in a branch not taken), CMake reports an error. Other languages treat unmatched group references as empty strings. I propose to implement the common behavior.

Some useful cases that currently trigger an error:

* Extracting possibly optional parts: `"(required)(optional)?(required)(optional)?" -> "\1\4"`.
* Combining multiple extractions into one expression: `"extract(this)|or(that)" -> "\1\2"`.

This change only affects previously disallowed behavior, so I think a policy is unnecessary.

Issue: #19012. Fix: !10251. Policy: No.

## Empty Regex Matches

`string(REGEX ...)` commands produce an error whenever they encounter a zero-length match. The error does make sense when looking for multiple matches, because the search position needs to advance somehow. But there are cases when an empty match poses no problem:

1. `REGEX MATCH` performs a single find operation, so there is no advance to worry about. Other single-match contexts, like `if(MATCH)` and `list(TRANSFORM REGEX)`, do accept empty matches.
2. Anchored matches. They match at most once, and their location is fixed.
3. End of string matches. If a previous match has already reached the end of input, reporting an error for the empty match at the end doesn't make sense.

What's more, *all* popular regex implementations (Python, Perl, PHP, Java, C#, JS, ...) allow empty matches *everywhere*. They advance over a zero-length match by trying other, possibly non-empty, branches first and force an advance by 1 only as a last resort. This results in consistent, theoretically sound, but somewhat unintuitive behavior. For example, searching for `"a*"` in `"a"` produces `["a", ""]`, and searching for `"|a"` in `"ab"` produces `["", "a", "", ""]`.

I propose to allow empty matches the same way other languages do.

This change only affects previously disallowed behavior, so I think a policy is unnecessary.

Issue: #13790, #13792, #18690. Fix: !10251. Policy: No.

## Conclusion

Please discuss and (dis)?approve.
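
For anyone who wants to check the behavior attributed to other languages in the sections above, here is a minimal sketch in Python (3.7 or later assumed, since empty-match handling changed in that release). It is not CMake code and says nothing about how the fixes are implemented; the input strings and variable names are made up for illustration.

```python
import re

# "^" anchors only at the start of the input, so a find-all search
# yields a single match instead of one match per character.
anchor_matches = re.findall(r"^x", "xxxx")

# A reference to a group that did not take part in the match expands
# to an empty string instead of raising an error.
optional_part = re.sub(
    r"(required)(optional)?(required)(optional)?", r"\1\4", "requiredrequired"
)
either_branch = re.sub(r"extract(this)|or(that)", r"\1\2", "extractthis orthat")

# Empty matches are allowed everywhere; the engine tries non-empty
# alternatives at the same position before forcing an advance.
star_matches = re.findall(r"a*", "a")
alt_matches = re.findall(r"|a", "ab")

print(anchor_matches)   # ['x']
print(optional_part)    # required
print(either_branch)    # this that
print(star_matches)     # ['a', '']
print(alt_matches)      # ['', 'a', '', '']
```

The last two results are the same lists quoted in the "Empty Regex Matches" section.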