Downloading files from a website with conditions on file names

alpedhuez on 9 Feb 2022
Commented: Walter Roberson on 12 Mar 2022
Question: I am working with files hosted at https://www.somecompany.com/xml/.
This directory contains files whose names start with the letter "A" or "B".
The filenames in the directory look like this:
A_20080403.xml
A_20080403_1.xml
A_20080403_2.xml
A_20080404_1.xml
B_20080403_1.xml
That is:
  • Filenames have the form "capital letter"+"_"+"date"+"_"+"number".xml or "capital letter"+"_"+"date".xml
  • Some dates have no corresponding files
I would like to download all the files whose names start with the letter "A".
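The naming rule above can be written as a regular expression, which is handy for checking candidate names. This is only a sketch: it assumes the date is always eight digits and the optional trailing number is one or more digits, which is what the sample names suggest but the question does not state outright.

```matlab
% Sample names taken from the listing in the question
names = ["A_20080403.xml", "A_20080403_1.xml", "A_20080403_2.xml", ...
         "A_20080404_1.xml", "B_20080403_1.xml"];

% "A", underscore, 8-digit date, optional "_<digits>", then ".xml"
% (assumed pattern, inferred from the sample names)
pattern = '^A_\d{8}(_\d+)?\.xml$';

% With 'once', regexp returns one start index per name, or [] if no match
hits = regexp(cellstr(names), pattern, 'once');
isWanted = ~cellfun(@isempty, hits);

wanted = names(isWanted);
disp(wanted)
```

Anchoring the pattern with ^ and $ keeps names such as B_20080403_1.xml out of the result.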
What has been tried:
(a) I was able to save a single file using the "websave" command, and tried to extend that to a loop:
for k = 20080401:20100101
filename = sprintf('A%d.xml', k);
url = ['https://www.somecompany.com/xml/' filename];
outfilename = websave(filename,url);
end
Problems with the above code: it does not work because
  • It assumes filenames of the form "capital letter"+"date".xml rather than the forms described above
  • It throws an error at the first date that has no corresponding file, and the loop stops there
How can the above code be improved?

Answers (1)

Walter Roberson on 9 Feb 2022
It would be more robust and faster if the site provided a way to list the available files, instead of having to probe file names by trial and error.
baseurl = "https://www.somecompany.com/xml/";
% 'Format' controls how string(Day) renders, so it must match the file names
datelimits = datetime({'20080401', '20100101'}, 'InputFormat', 'yyyyMMdd', 'Format', 'yyyyMMdd');
subfile_limit = 5; % no more than _5 -- adjust as appropriate
subfile_modifier = ["", "_" + (1:subfile_limit)] + ".xml";
for Day = datelimits(1):datelimits(2)
    daystr = string(Day);
    for Sub = subfile_modifier
        filename = "A_" + daystr + Sub;
        url = baseurl + filename;
        try
            outfilename = websave(filename, url);
            fprintf('fetched %s\n', filename);
        catch
            % the unsuffixed file can be absent even when _1 exists (see A_20080404_1.xml)
            if Sub ~= ".xml"
                break; % skip remaining subfiles for this date once a numbered one fails
            end
        end
    end
end
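If the server does expose a directory listing, the trial-and-error loop can be replaced by parsing that listing. The following is a minimal sketch under assumptions that are not confirmed anywhere in this thread: that a GET on the directory URL returns an HTML index page, and that the file names appear in it verbatim. The sample HTML and the useLiveSite flag are hypothetical and only there so that the parsing step can be tried offline.

```matlab
baseurl = "https://www.somecompany.com/xml/";   % from the question
useLiveSite = false;   % hypothetical switch: set true to query the real server

if useLiveSite
    % Assumption: the directory URL returns an HTML index page
    html = webread(baseurl, weboptions('ContentType', 'text'));
else
    % Hypothetical index page, for illustration only
    html = ['<a href="A_20080403.xml">A_20080403.xml</a>' ...
            '<a href="A_20080404_1.xml">A_20080404_1.xml</a>' ...
            '<a href="B_20080403_1.xml">B_20080403_1.xml</a>'];
end

% Names of the form A_<8 digits>[_<digits>].xml; unique drops repeats,
% since each name appears in both the href and the link text
names = unique(string(regexp(html, 'A_\d{8}(?:_\d+)?\.xml', 'match')));
disp(names)

if useLiveSite
    for name = names
        try
            websave(name, baseurl + name);
            fprintf('fetched %s\n', name);
        catch err
            % report the failure and carry on with the next file
            fprintf('failed  %s: %s\n', name, err.message);
        end
    end
end
```

With a listing available, only files that actually exist are requested, so no guess at subfile_limit is needed.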
  2 Comments
Walter Roberson on 12 Mar 2022
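Note that datelimits needs 'Format' as well as 'InputFormat'; otherwise string(Day) uses the default display format instead of yyyyMMdd:
datelimits = datetime({'20080401', '20100101'}, 'InputFormat', 'yyyyMMdd', 'Format', 'yyyyMMdd');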
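The effect of 'Format' can be checked in isolation. This short sketch uses only the dates from the thread; the variable names d1, d2 and span are made up for illustration.

```matlab
% 'InputFormat' only controls parsing; string() follows the 'Format' property
d1 = datetime('20080401', 'InputFormat', 'yyyyMMdd');   % default display format
d2 = datetime('20080401', 'InputFormat', 'yyyyMMdd', 'Format', 'yyyyMMdd');

disp(string(d1))                  % locale-dependent, not usable in a file name
disp(string(d2))                  % digits only, as the file names require
disp("A_" + string(d2) + ".xml")

% A datetime range steps one calendar day at a time, so it visits only
% real dates (an integer range 20080401:20100101 also hits 20080432 etc.)
span = d2 : datetime('20080405', 'InputFormat', 'yyyyMMdd');
disp(string(span))
```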

Release

R2021b
