Regular Expression help (is a line end token being matched in the middle of a string?)

Hi all,
I've hit a gap in my understanding of regular expressions and I'm hoping someone can help.
If I start with a simple enough string:
str = 'Serial Number: NEO1B39100092'
I first check for the existence of my keyword:
keyword = '(\W|^)serial number:*(\W|$)+'
and then try to find its value with:
expr = ['(?<=', keyword, ')[^\s]+'] % N.B. using [^\s] as a template for '[^', delimiters ']'
For comparison, I'll use the keyword without the end-of-line anchor:
keyword2 = '(\W|^)serial number:*(\W)+'
expr2 = ['(?<=', keyword2, ')[^\s]+']
Now if I use regexpi to test the two expressions
value = regexpi(str, expr, 'match', 'once')
value2 = regexpi(str, expr2, 'match', 'once')
I see that
value = ':',
while
value2 = 'NEO1B39100092'
My take on this is that the end anchor '$' from the original keyword is somehow being matched right after the letter 'r'. For expr to return ':', whatever comes immediately before the ':' must satisfy (\W|$)+, and it can't be the \W because expr2 gives the expected result.
Can anyone shed some light on this for me?
Thanks for any help, Andrew

Answers (1)

Prateekshya on 24 Oct 2024
Hello Andrew,
It looks like you are encountering an issue with how you are using regular expressions, specifically with the use of anchors and non-word character matching. Let us break down what is happening and how you can adjust your expressions to get the desired result.
Understanding the Regular Expression
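  • Metacharacters and anchors:
\W matches any non-word character (anything other than a-z, A-Z, 0-9, and underscore). ^ matches the start of the string, and $ matches the end of the string. Unlike \W, the anchors ^ and $ match a position and consume no characters.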
  • Your Pattern:
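keyword = '(\W|^)serial number:*(\W|$)+' is intended to match "Serial Number:" preceded by a non-word character or the start of the string, and followed by one or more non-word characters or the end of the string. Two details interact badly once this is wrapped in (?<=...). First, :* allows zero colons, so the keyword can stop right after "Number". Second, (\W|$)+ must match at least once, but the $ alternative consumes no characters. For the result to be ':', the lookbehind has to succeed at the position immediately before the colon, with nothing between "Number" and that position. \W cannot do that because it needs a character to consume, so the only alternative that can succeed there is $. Your reading is therefore on the right track: the reported output suggests that, inside the lookbehind, $ is being satisfied at the point where the lookbehind ends rather than at the true end of the string. Since regexpi with 'once' returns the leftmost match, [^\s]+ then picks up the colon before the engine ever reaches the serial number.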
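One way to check this reading is to ask regexpi where each match starts as well as what it matches. The snippet below is a minimal diagnostic sketch, assuming the same str, keyword, and keyword2 as in the question; the variable names m1, s1, m2, and s2 are only illustrative.

```matlab
str = 'Serial Number: NEO1B39100092';

% Patterns copied from the question
keyword  = '(\W|^)serial number:*(\W|$)+';   % with the end anchor
keyword2 = '(\W|^)serial number:*(\W)+';     % without the end anchor
expr  = ['(?<=', keyword,  ')[^\s]+'];
expr2 = ['(?<=', keyword2, ')[^\s]+'];

% Request both the matched text and its 1-based start index;
% outputs are returned in the order the options are listed
[m1, s1] = regexpi(str, expr,  'match', 'start', 'once');
[m2, s2] = regexpi(str, expr2, 'match', 'start', 'once');

fprintf('expr : ''%s'' starting at index %d\n', m1, s1);
fprintf('expr2: ''%s'' starting at index %d\n', m2, s2);
```

If the first match starts at the colon, the lookbehind succeeded with nothing between "Serial Number" and the match, which is the case that only $ can satisfy.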
Solution
To extract "NEO1B39100092" correctly, you need to refine your regular expressions:
  • Define the Keyword:
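Remove the $ alternative: the value always follows the keyword, so the end of the string is never a useful place for the keyword to stop. Require the colon (":" rather than ":*") and use \s* to allow optional whitespace after it.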
  • Expression for Extraction:
Keep the positive lookbehind (?<=...) so that the keyword must precede the value but is not included in the returned match.
Revised Code
str = 'Serial Number: NEO1B39100092';
% Define a more precise keyword pattern: required colon, optional spaces after it.
% The (?i) flag is redundant with regexpi (already case-insensitive) but harmless.
keyword = '(?i)(\W|^)serial number:\s*';
% Expression to extract the serial number
expr = ['(?<=', keyword, ')[^\s]+'];
% Use regexpi to extract the value
value = regexpi(str, expr, 'match', 'once');
disp(['Extracted Value: ', value]);
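As an alternative to lookbehind, a capture group can pull out the value directly, which sidesteps the question of how anchors behave inside (?<=...). This is a minimal sketch rather than part of the revised code above; it assumes the value is a run of non-whitespace characters, and the names pattern, tok, and serial are illustrative.

```matlab
str = 'Serial Number: NEO1B39100092';

% (?:\W|^) : keyword starts the string or follows a non-word character
% :\s*     : a required colon, then optional whitespace
% (\S+)    : capture the value as a run of non-whitespace characters
pattern = '(?:\W|^)serial number:\s*(\S+)';

% With 'tokens' and 'once', regexpi returns a cell array holding the
% captured text of the first match (empty if there is no match)
tok = regexpi(str, pattern, 'tokens', 'once');

if isempty(tok)
    serial = '';
else
    serial = tok{1};
end
disp(['Extracted Value: ', serial]);
```

Because the keyword is matched as ordinary pattern text here, the anchors ^ and $ keep their usual meaning relative to the whole string.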
I hope this helps!
