Removing unrecognised characters and superfluous data from a string

I have a set of data that I imported as a string. When it is imported, the blank rows are filled with arrow characters, as shown below:
NC(N)=N→59.0717→C→"5N3→-1.71840001642704→0.022000011056661606→75.89000129699707"→→→→→→→→→→→→"
"→→→→→→→→→→→→→"
"OC(=O)CCC(O)=O→118.08764→C4→"6O4→-0.6665999889373779→-0.4739999994635582→74.5999984741211"→→→→→→→→→→→→"
"→→→→→→→→→→→→→"
"CC(=O)NC1=CC=C(O)C=C1→151.16446→C8→"9NO2→1.0175999999046325→-1.6619999147951603→49.32999897003174"→→→→→→→→→→→→"
"→→→→→→→→→→→→→"
"NC(N)=N→59.0717→C→"5N3→-1.71840001642704→0.022000011056661606→75.89000129699707"→→→→→→→→→→→→"
"→→→→→→→→→→→→→"
I can't remove these by the method I would normally use below:
Data(Data == "→→→→→→→→→→→→→") = []
In the original file they appear to be blank lines, in between the actual lines of text.
Is there a way to remove these?

11 Comments

@Duncan: Remember that just like with numeric data, how text data is displayed is not the same thing as the data saved in a file! Those arrows look like a perfectly normal representation of a horizontal tab character (commonly called "tab") that is commonly used by text editors. It is unlikely that your file contains "arrow" characters, but we cannot check this unless you upload a sample file.
"In the original file they appear to be blank lines, in between the actual lines of text."
That would be consistent with them being horizontal tab characters.
"Is there a way to remove these?"
Most likely yes, but without information about the file format, how you import that data, a sample file, etc., you didn't give us much to work with.
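One way to check what the "arrows" really are is to look at the numeric character codes rather than at how the text is displayed. The following is a minimal sketch, assuming a hypothetical sample line (sampleLine) built to mimic the rows in the question; it is not taken from the actual file.

```matlab
% Hypothetical sample: 13 tab characters wrapped in double quotes,
% mimicking the "arrow" rows shown in the question.
tab = sprintf('\t');                          % horizontal tab, char(9)
sampleLine = string(['"', repmat(tab, 1, 13), '"']);

codes = double(char(sampleLine));             % numeric character codes
disp(unique(codes))                           % 9 is tab, 34 is the double quote

hasTab = contains(sampleLine, tab);           % true if any tab is present
nTabs  = count(sampleLine, tab);              % number of tabs in the line
```

If code 9 shows up in the result, the displayed arrows are tabs and not literal arrow characters.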
Where do you see the arrows? Some text editors display TABs as arrows.
"I can't remove these by the method I would normally use below:" - Why not? Please explain what happens and what you expect instead.
@Stephen Cobeldick, The file format is csv, I have imported it using:
Data = readmatrix('CrystalDataSample', 'OutputType', 'string');
I have attached a sample of the file here.
I realise they are not arrow characters, but that they represent something else. In the csv they appear to be spaces " " so I tried the method:
Data(Data == " ") = []
to remove them but it didn't work. I also tried to remove tabs and returns but that didn't work either.
@Jan I don't know why that didn't work. When I tried it, it just gave me back the original data matrix. When I have tried to remove specific characters or strings like that in the past, it has removed them fine.
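A likely reason the comparison returns the original data: == on a string array compares each whole element, so Data == " " only matches elements that are exactly one space. The sketch below uses a small hypothetical stand-in for the imported array (the real import may also leave double quotes in each element) and builds the row mask by erasing tabs first.

```matlab
tab = sprintf('\t');

% Hypothetical stand-in for the imported column of strings
Data = [ "NC(N)=N" + tab + "59.0717"
         string(repmat(tab, 1, 13))
         "OC(=O)CCC(O)=O" + tab + "118.08764" ];

isSpace = Data == " ";                        % whole-element comparison, all false here

% An element is "tab-only" if nothing is left after erasing the tabs
isTabOnly = strlength(erase(Data, tab)) == 0;

Data(isTabOnly) = [];                         % drops the tab-only row
```

Because the mask is computed per element, the rows that carry data keep their tabs and therefore their field separation.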
@Duncan: the file appears to be a (rather broken) Tab-Separated Values text file, so all of the data fields are also separated by tabs. If you remove all of the tab characters, then your data values will not be separated by any delimiter. Is that really what you want to do?
"In the original file they appear to be blank lines, in between the actual lines of text."
Not when I look at the file: every single line is contained within one pair of double quotes. Double quotes are not "blank".
To be honest the format of that file is a mess: there are escaped double quotes spread randomly around the file, every line is contained within double quotes, and those lines of tab characters have no obvious purpose... can you fix the SW that created these awful files?
It may have the suffix .csv, but it is not a csv file -- the "arrows" are tabs -- here's a portion of the input file as shown by a file dump utility--
0000 0000 ef bb bf 43 6f 6c 75 6d 6e 31 0d 0a 22 53 74 72 Column1.."Str
0000 0010 75 63 74 75 72 65 09 6d 77 09 4d 46 09 6c 6f 67 ucture.mw.MF.log
0000 0020 50 09 6c 6f 67 53 09 50 6f 6c 61 72 20 73 75 72 P.logS.Polar sur
0000 0030 66 61 63 65 20 61 72 65 61 09 09 09 09 09 09 09 face area.......
0000 0040 09 09 09 09 09 09 22 0d 0a 22 09 09 09 09 09 09 ......".."......
0000 0050 09 09 09 09 09 09 09 22 0d 0a 22 4e 5b 43 40 40 .......".."N[C@@
0000 0060 09 22 22 5d 28 43 43 43 28 4e 29 3d 4f 29 43 28 .""](CCC(N)=O)C(
0000 0070 4f 29 3d 4f 09 31 34 36 2e 31 34 35 34 09 43 35 O)=O.146.1454.C5
0000 0080 22 22 09 22 22 31 30 4e 32 4f 33 09 2d 33 2e 37 "".""10N2O3.-3.7
0000 0090 35 36 34 30 30 30 37 38 35 33 35 30 38 09 2d 30 5640007853508.-0
0000 00a0 2e 34 39 36 39 39 39 39 39 39 35 30 38 32 36 31 .496999999508261
0000 00b0 37 09 31 30 36 2e 34 30 39 39 39 39 38 34 37 34 7.106.4099998474
0000 00c0 31 32 31 31 22 22 09 09 09 09 09 09 09 09 09 09 1211""..........
the 09 bytes are tabs. There are some non-ASCII characters at the beginning of the file "ef bb bf" -- not sure what those are about in what otherwise does appear to be a text file.
@Stephen Cobeldick: Yeah, that's what I'd thought initially, so I just opened the file in Excel and tried to separate it by tabs, but it didn't help.
No, I don't want to remove all the tab characters. I'm just trying to extract one of the numbers on each line, but because of the messiness of the file it's proving much harder than I'd first thought.
I have managed to remove the lines between the information I want by removing every other row using:
str(n:n:end,:) = [];
But yes, because the file is so messy I am struggling to get at the bits I want.
It's not my software and I don't understand enough about it to try to fix it.
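For pulling one number out of each line, splitting on the tab delimiter may be easier than deleting characters. This is only a sketch under assumptions: rowText is a hypothetical line modelled on the sample in the question, and the position of the wanted value (fifth field here) is a guess that would need checking against the real file.

```matlab
tab = sprintf('\t');

% Hypothetical line modelled on the sample rows, including the stray quotes
rowText = "NC(N)=N" + tab + "59.0717" + tab + "C" + tab + """5N3" + tab + ...
          "-1.71840001642704" + tab + "0.022000011056661606" + tab + ...
          "75.89000129699707""";

rowText = erase(rowText, """");       % drop the stray double quotes
fields  = split(rowText, tab);        % column string array, one field per element
value   = str2double(fields(5));      % fifth field; the index is an assumption
```

The same split could be applied in a loop over the rows that remain after the tab-only rows have been removed.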
"There are some non-ASCII characters at the beginning of the file "ef bb bf" -- not sure what those are about in what otherwise does appear to be a text file."
@dpb: That is the BOM of a UTF-8 text file: https://en.wikipedia.org/wiki/Byte_order_mark
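A quick way to test for that BOM is to read the first three bytes of the file and compare them with EF BB BF (239 187 191 in decimal). The sketch below writes its own small temporary file so that it is self-contained; the file name and contents are hypothetical.

```matlab
% Write a small hypothetical file that starts with the UTF-8 BOM
fname = [tempname, '.txt'];
fid = fopen(fname, 'w');
fwrite(fid, [239 187 191, double('Column1')], 'uint8');
fclose(fid);

% Read the first three bytes back as uint8
fid = fopen(fname, 'r');
firstBytes = fread(fid, 3, 'uint8=>uint8')';
fclose(fid);
delete(fname);

% EF BB BF in hexadecimal is 239 187 191 in decimal
hasBOM = isequal(firstBytes, uint8([239 187 191]));
```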
@Duncan: please upload the original file, before you made any changes at all, i.e. the raw file that was output by some badly-written third-party SW. We will see what we can do...
@Stephen -- I figured something of the sort but not aware-enough of the specific header bytes to recognize from whence they came.
@Stephen Cobeldick: I have attached the data. It is just copied and pasted from the website into an Excel file.
@Duncan: the original file is an .xlsx file? If you actually copied that data from a website, can you please just give a link?
@Stephen Cobeldick: I used this website to generate data from a set of data which I have also attached:

 Accepted Answer

Those are not unrecognized characters; they are tabs. Use sprintf('\t') to generate a tab character.
Please run:
replace(CrystalData(37),sprintf('\t'),'')
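To illustrate what that call does, here is a small sketch on a hypothetical element (sample stands in for one element of CrystalData). It also shows an alternative that strips only the trailing tabs with regexprep, in case the tab between fields should be kept as a delimiter.

```matlab
tab = sprintf('\t');

% Hypothetical element: two fields followed by trailing tabs
sample = "CC(=O)NC1=CC=C(O)C=C1" + tab + "151.16446" + tab + tab + tab;

% Removing every tab also removes the separator between the fields
noTabs = replace(sample, tab, '');

% Removing only the run of tabs at the end keeps the fields separated
trimmed = regexprep(sample, '\t+$', '');
```

With every tab removed the two fields run together, which is the concern raised in the comments about losing the delimiter.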

Release

R2019a

Asked:

on 16 Jul 2019

Commented:

dpb
on 17 Jul 2019
