Need to remove punctuation from a string that was pulled from a cell array
Conor Koch
on 21 Jan 2018
Commented: Walter Roberson
on 21 Jan 2018
I have a text file, and I am grabbing each line and entering that line into a cell array named 'input'. I am then applying a regex to remove punctuation from each cell.
The following code works, just as a test:
test = 'I''m testing to remove punctuation... Did it work?'
punct = regexp(test, '[^.,''!?]');
test = test(punct)
However, the code below is my actual code, and it does not work:
punct = regexp(input, '[^.,''!?]');
for i = (1:size(input))
temp = char(input{i})
input{i} = temp(punct)
end
I receive the following error on the line 'input{i} = temp(punct)'
Function 'subsindex' is not defined for values of class 'cell'.
Temp appears to be a character array (string) no different than test from above. However, Matlab seems to think it is still a cell, despite using curly braces to index it from input AND using char() to convert it (just to be sure).
Why is temp not acting like a proper string? How can I get this to work?
Thank you for your help.
1 Comment
Stephen23
on 21 Jan 2018
Edited: Stephen23
on 21 Jan 2018
"Why is temp not acting like a proper string?"
There is nothing wrong with temp, and temp is not the problem. You forgot to look at the index punct, which is exactly what the error message is telling you about: using a cell array as an index is not defined.
Accepted Answer
More Answers (2)
Stephen23
on 21 Jan 2018
Edited: Stephen23
on 21 Jan 2018
out = regexprep(inp, '[.,''!?]', '');
BTW, the basic problem with the code in your question is that you did not read the regexp help, which states for the start indices that "If either str or expression is a cell array of character vectors or a string array, and the other is a character vector or a string scalar, the output is a cell array of row vectors". In your test case both inputs are char vectors, which means that the output index is a simple numeric array. However, in the second example the variable input (you should change this variable name, because it shadows the built-in function input) is a cell array, and so the output punct is a cell array of numeric vectors, exactly as the help describes. Clearly a cell array of numeric vectors cannot be used as an index like this:
temp(punct)
because a cell array cannot be used as an index, just as that error message tells you. Indices can only be numeric or logical arrays.
You could have debugged this yourself by simply looking at punct and observing that it is a cell array. The first step of debugging is to look at what the code is actually doing, rather than relying on what you imagine/want/assume/believe/hope it to be doing.
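As a minimal sketch of that debugging step, the snippet below compares what regexp returns for a char vector and for a cell array, then indexes each cell with its own index vector. The variable names (txt, idxChar, idxCell, cleaned) and the sample text are made up for illustration and are not from the original question.

```matlab
% Hypothetical sample data standing in for the lines read from the text file.
txt = {'Hello, world!'; 'Did it work?'};

% char vector in -> numeric row vector of start indices out
idxChar = regexp('Hello, world!', '[^.,''!?]');
disp(class(idxChar))   % double

% cell array in -> cell array out, holding one index vector per cell
idxCell = regexp(txt, '[^.,''!?]');
disp(class(idxCell))   % cell

% Index each character vector with ITS OWN index vector, not the whole cell array.
cleaned = txt;
for k = 1:numel(txt)
    cleaned{k} = txt{k}(idxCell{k});
end
```

The loop uses numel rather than size, since size returns a vector of dimensions and 1:size(x) only uses its first element.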
In any case, regexp is not needed here at all: regexprep does the job more simply, as I showed above.
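For illustration, here is how the regexprep call behaves on a small cell array. The contents of inp are invented sample data; the call itself is the same one-liner as above, applied to every cell without a loop.

```matlab
% Hypothetical cell array of character vectors, e.g. lines read from a file.
inp = {'I''m testing to remove punctuation... Did it work?'; 'Yes, it did!'};

% Replace every listed punctuation character with nothing.
% regexprep accepts a cell array and returns a cell array of the same size.
out = regexprep(inp, '[.,''!?]', '');

disp(out{1})   % Im testing to remove punctuation Did it work
disp(out{2})   % Yes it did
```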
Walter Roberson
on 21 Jan 2018
The problem is not your input. The problem is that regexp is returning a cell array and you are attempting to index by the cell array.
You are misusing regexp. It does not return a logical vector indicating whether each character matches or not -- not in any of the possible outputs.
You should use ismember() to find the punctuation and use the logical negation to select the places that are not punctuation.
Your list of punctuation is not correct. Apostrophe is not punctuation when used to indicate a contraction, a possessive, or a glottal stop. Period is not punctuation when used to indicate an abbreviation or in numeric values. Exclamation mark is not punctuation when used to indicate a click, such as in some African names.
You have left out en dash and em dash (be careful, though, because hyphen is not punctuation except when it is a substitute for one of those). You have left out double quotes in the straight version and in the left and right versions, you have left out brackets and braces of all kinds, and you have left out ‹ and › and the doubled versions of those, such as are used in French to quote material. You have also left out the punctuation from Japanese, Chinese, Korean, and other languages.
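A rough sketch of the ismember() approach follows, using the same limited character list as the question (which, as discussed here, is not a complete or context-aware definition of punctuation). The names str, punctChars, isPunct, txt, and stripPunct are illustrative only.

```matlab
% Character list taken from the question; an assumption, not a full punctuation set.
punctChars = '.,''!?';

% Single character vector: build a logical mask and keep the non-matching characters.
str = 'I''m testing to remove punctuation... Did it work?';
isPunct = ismember(str, punctChars);   % logical vector, same size as str
cleanedStr = str(~isPunct);

% Cell array of character vectors: apply the same masking to each cell.
txt = {'Hello, world!'; 'Did it work?'};
stripPunct = @(s) s(~ismember(s, punctChars));
cleanedTxt = cellfun(stripPunct, txt, 'UniformOutput', false);
```

Unlike the cell array returned by regexp, the logical mask from ismember can be used directly as an index into the character vector it was computed from.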
1 Comment
Walter Roberson
on 21 Jan 2018
To emphasize: some characters are only punctuation in some circumstances and you need to do semantic parsing to decide for any given character. You need to know a fair bit about English to decide properly.
For example, suppose you are not at the beginning of a sentence and you encounter a word starting with capital M followed by a letter followed by a period. Is the period punctuation or an abbreviation marker? If the contents of the discussion are technical then probably it is punctuation. But if not then the combination might be the title abbreviation Mr. or Ms. in which case the period is not punctuation. How about MD? MD is an abbreviation but by convention no period is used for it. Are there other abbreviations that match the pattern? Yes, Mz. is sometimes used as an alternative to Ms.. Others? Yes: although it is not common, Mx. is one of the gender-neutral or non-binary title abbreviations. But for example, Mv. is not used as a title abbreviation so the period would be punctuation. Except notice that in that context where I was giving an example following the pattern, the period was being "mentioned" rather than being "used" so in this context the Mv. does not have the period end the sentence, which leads to arguments about whether it is acting as punctuation there or not, whereas for the Mr. case it is certainly not acting as punctuation.
There is clearly no way you can possibly implement your requirements using character by character analysis. English cannot be parsed by a true regular expression, or even by anything in the LALR family of parsers.