extract part of a string with an extension

Hi, I have a long string and I want to extract only the file names that have "hdf" as an extension.
I want to get just "MOD11C1.A2013001.005.2013015221704.hdf".
My string is:
U.S. GOVERNMENT COMPUTER
This US Government computer is for authorized users only. By accessing this
system you are consenting to complete monitoring with no expectation of privacy.
Unauthorized access or use may subject you to disciplinary action and criminal
prosecution.
********************************************************************************
</pre>
<pre><img src="/icons/blank.gif" alt="Icon "> Name Last modified Size Description<hr><img src="/icons/back.gif" alt="[DIR]"> Parent Directory -
<img src="/icons/image2.gif" alt="[IMG]"> BROWSE.MOD11C1.A2013001.005.2013015221704.1.jpg 15-Jan-2013 16:29 3.2M
<img src="/icons/image2.gif" alt="[IMG]"> BROWSE.MOD11C1.A2013001.005.2013015221704.2.jpg 15-Jan-2013 16:29 3.3M
<img src="/icons/unknown.gif" alt="[ ]"> MOD11C1.A2013001.005.2013015221704.hdf 15-Jan-2013 16:29 46M
<img src="/icons/unknown.gif" alt="[ ]"> MOD11C1.A2013001.005.2013015221704.hdf.xml 16-Jan-2013 02:15 32K
<hr></pre>
</body></html>
Thanks,
Zeinab

3 Comments

Stephen23 on 3 Dec 2014
Edited: Stephen23 on 3 Dec 2014
In order to locate this substring, you will have to give us a bit more information, particularly:
  • How can we identify the start of the substring: does it always consist of exactly the same letters (eg MOD), or is it always preceded by some recognizable pattern of characters?
  • How can we identify the end of the substring: is it always exactly the same file extension that you need to locate?
Andrea on 3 Dec 2014
Edited: Andrea on 3 Dec 2014
Thanks. It always has exactly the same extension, "hdf", and it always starts with MOD. As you can see, the name I am interested in is MOD11C1.A2013001.005.2013015221704.hdf, but it will change in other loop iterations according to the date. For instance, MOD11C1.A2013001.005.2013015221704.hdf will become MOD11C1.A2013001.005.2013015221705.hdf.
The reason I need it is that I want to read the contents of a web address (which changes within a loop) with urlread, which returns the content as a string. I then need to use urlwrite to save the files I want according to their file names (those which have the hdf extension).
Please see this: str=urlread(path1);
Many thanks, I have really spent more than 6 hours on this so far!
farz


 Accepted Answer

Here is a solution(?) based on regexp
>> cac = cssm;
>> cac{:}
ans =
MOD11C1.A2013001.005.2013015221704.hdf
ans =
MOD11C1.A2013001.005.2013015221704.hdf
>>
where
function cac = cssm()
str = fileread( 'cssm.txt' );
name_xpr = '[\w\.]+\.hdf';
cac = regexp( str, name_xpr, 'match' );
end
and cssm.txt contains the text of your question. Getting two identical names seems to be correct. You might want to apply unique to the result.
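A minimal sketch of the unique step, assuming the input is reduced to the two relevant listing lines from the question (the variable names are illustrative only). It shows why the first expression returns the same name twice: the ".hdf.xml" entry also contains a substring ending in ".hdf".

```matlab
% Two listing entries copied from the question: the .hdf file and its .hdf.xml companion.
str = sprintf('%s\n', ...
    'MOD11C1.A2013001.005.2013015221704.hdf 15-Jan-2013 16:29 46M', ...
    'MOD11C1.A2013001.005.2013015221704.hdf.xml 16-Jan-2013 02:15 32K');

% The pattern also matches the ".hdf" inside ".hdf.xml", so both lines contribute a match.
cac = regexp(str, '[\w\.]+\.hdf', 'match');

% unique collapses the identical names into a single cell.
names = unique(cac);
```

Here cac holds two identical names and names holds one.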
In response to comments:
My mistake illustrates a problem with regular expressions: they often match unexpected strings. I missed the case where ".hdf" is part of the base name rather than the extension. I have now added the requirement that ".hdf" be followed by "\s, Any white-space character; equivalent to [ \f\n\r\t\v]". However, because it is tested with the lookahead (?=\s), that white-space is not included in the output.
>> cssm
ans =
'MOD11C1.A2013001.005.2013015221704.hdf'
function cac = cssm()
str = fileread( 'cssm.txt' );
name_xpr = '[\w\.]+\.hdf(?=\s)'; % <<<<<<< modified
cac = regexp( str, name_xpr, 'match' );
end
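To illustrate the point about the white-space not being part of the output, here is a small comparison, assuming the same two listing lines from the question as input. A plain \s is consumed and ends up in the match, while the lookahead (?=\s) only checks for it.

```matlab
% Two listing entries copied from the question.
str = sprintf('%s\n', ...
    'MOD11C1.A2013001.005.2013015221704.hdf 15-Jan-2013 16:29 46M', ...
    'MOD11C1.A2013001.005.2013015221704.hdf.xml 16-Jan-2013 02:15 32K');

% Plain \s: the white-space character becomes part of the matched text.
consumed = regexp(str, '[\w\.]+\.hdf\s', 'match');

% Lookahead (?=\s): the white-space must be there but is not returned.
lookahead = regexp(str, '[\w\.]+\.hdf(?=\s)', 'match');

% Both variants skip the .hdf.xml entry; they differ only in the trailing character.
sameName = strcmp(strtrim(consumed{1}), lookahead{1});
```

With the plain \s version, strtrim is needed before the result can be used as a file name.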
Stephen Cobeldick already proposed this modification to the expression. I like Stephen's list, which helps to pinpoint the unique characteristics of the string. It triggers thinking: does the filename always start with "MOD"? Could "MOD" appear in the middle of the name? It's risky to deduce rules from small samples. If the name always starts with "MOD", then
name_xpr = '(?<=\s)MOD[\w\.]+\.hdf(?=\s)';
is a better expression.
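A short sketch of what the lookbehind buys. The first input line uses a hypothetical name, BROWSE.MOD11C1.A2013001.hdf, which does not appear in the question; it is only there to show "MOD" occurring in the middle of a name.

```matlab
% First line: hypothetical name with "MOD" in the middle. Second line: name from the question.
str = sprintf('%s\n', ...
    ' BROWSE.MOD11C1.A2013001.hdf 3.2M', ...
    ' MOD11C1.A2013001.005.2013015221704.hdf 46M');

% Without the lookbehind, the tail of the BROWSE name is returned as if it were a file name.
loose = regexp(str, 'MOD[\w\.]+\.hdf(?=\s)', 'match');

% With (?<=\s), "MOD" must directly follow white-space, so only the real name is returned.
strict = regexp(str, '(?<=\s)MOD[\w\.]+\.hdf(?=\s)', 'match');
```

Here loose contains two entries, the first being the fragment 'MOD11C1.A2013001.hdf', while strict contains only the full name.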

4 Comments

Thanks a lot! Finally I can continue from here. It really solved my problem.
I think this will fix per's code so that it doesn't also pick up the xml files:
name_xpr = '[\w\.]+\.hdf\s' % Must have white space after hdf
Otherwise it will return 'MOD11C1.A2013001.005.2013015221704.hdf' when it was examining 'MOD11C1.A2013001.005.2013015221704.hdf.xml' which is an xml file, not an hdf file.
Thank you. I tried the one with "\s" as you suggested, but it did not work. The previous one worked fine for me, although it gave me all the files with the hdf extension, which was not a big problem. The one you suggested should give me a unique answer, but it isn't working: it returns an empty cell.
I've added to my answer.


More Answers (1)

Stephen23 on 3 Dec 2014
Edited: Stephen23 on 3 Dec 2014
Why not all on one line?
str = fileread('temp.txt');
C = regexp(str,'MOD[\w\.]+\.hdf(?=\s)','match');
C =
'MOD11C1.A2013001.005.2013015221704.hdf'
This matches all substrings that meet the following conditions:
  • starts with 'MOD'
  • ends with '.hdf'
  • contains any combination of alphanumeric characters, underscores, and periods
  • is followed by a whitespace character (i.e. excludes '....hdf.xml')
As suggested by per isakson, you might also want to apply unique to the output.
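Putting the pieces together for the download loop described in the question, a rough sketch might look like the following. The base URL is hypothetical, a fixed sample stands in for the urlread output so the code runs offline, and the doDownload flag is an illustrative switch rather than anything from the thread. If the raw page wraps names in link tags, a name may be followed by a quote or "<" instead of white-space, in which case the lookahead would need adjusting; that is an assumption about the page, not something shown here.

```matlab
% Hypothetical directory address; in the real loop this would change with the date.
baseUrl = 'http://example.com/data/2013.01.01/';

% In practice: str = urlread(baseUrl);  A fixed sample from the question is used instead.
str = sprintf('%s\n', ...
    '<img src="/icons/unknown.gif" alt="[ ]"> MOD11C1.A2013001.005.2013015221704.hdf 15-Jan-2013 16:29 46M', ...
    '<img src="/icons/unknown.gif" alt="[ ]"> MOD11C1.A2013001.005.2013015221704.hdf.xml 16-Jan-2013 02:15 32K');

% Extract the .hdf names and drop duplicates.
names = unique(regexp(str, 'MOD[\w\.]+\.hdf(?=\s)', 'match'));

% Prepend the directory address to each name to form the download URLs.
fileUrls = strcat(baseUrl, names);

doDownload = false;   % illustrative switch: set to true to actually fetch the files
if doDownload
    for k = 1:numel(names)
        urlwrite(fileUrls{k}, names{k});   % save each file under its own name
    end
end
```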

Asked: 3 Dec 2014
Edited: 4 Dec 2014
