Matlab - Parallel file search with afterEach
I'm trying to implement a MATLAB function that searches for files in parallel to speed up the process. I've successfully implemented that in the following function:
function matches = searchfordata(starting_path, search_depth, checkFunction)
    arguments
        starting_path {isfolder}
        search_depth int64
        checkFunction function_handle
    end
    tic;
    folders = struct('name', '' , 'folder', starting_path);
    dataMap = containers.Map('KeyType', 'double', 'ValueType', 'any');
    next_folders = [];
    matches = [];
    no_folders = height(folders);
    current_depth = 0;
    total_folders = 0;
    if search_depth < 0
        search_depth = 9001;
    end
    while current_depth <= search_depth && no_folders > 0
        total_folders = total_folders + no_folders;
        parfor n = 1:no_folders
            path = strcat(folders(n).folder, filesep, folders(n).name);
            [files, cfolders] = filesandfolders(path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            matches = [matches;check];
            next_folders = [next_folders; cfolders];
            %matches(end+1) = check;
            %next_folders(end+1) = cfolders;
        end
        if height(matches) > 0
            dataMap(current_depth) = matches;
            matches = [];
        end
        folders = next_folders;
        no_folders = height(folders);
        next_folders = [];
        current_depth = current_depth + 1;
    end
    matches = dataMap;
    toc;
end
Relevant other functions/classes for this:
function [files, folders] = filesandfolders(path)
    %FILESANDFOLDERS List the files and subfolders of a directory separately
    %   Wraps dir and drops the '.' and '..' entries from the folder list.
    directory_contents = dir(path);
    files = directory_contents(~[directory_contents.isdir]);
    folders = directory_contents([directory_contents.isdir]);
    folders = folders(~ismember({folders.name}, {'.', '..'}));
end
This is basically a dir call that splits the result into files and folders and removes the "." and ".." entries from the folder results.
function boolobject = checkFiles(files)
    %CHECKFILES Checks given files for powercycler files
    %   Returns a FolderData object if any file name matches, otherwise [].
    %cyclingregex = 'cycling_parameters\.xml';
    %transientregex = '\.(pol|par|raw)$';
    cyclingregex = '\.txt$';
    transientregex = '\.txt$';
    matching_cycling = regexpi({files.name}, cyclingregex, 'Match');
    matching_transient = regexpi({files.name}, transientregex, 'Match');
    cycling_indices = ~cellfun(@isempty, matching_cycling);
    transient_indices = ~cellfun(@isempty, matching_transient);
    boolobject = FolderData(files(1).folder);
    boolobject.cyclingData = any(cycling_indices);
    boolobject.rthData = any(transient_indices);
    if boolobject.cyclingData || boolobject.rthData
        return
    else
        boolobject = [];
    end
end
This function takes the list of files from filesandfolders as input and filters for the files I am searching for. I changed the patterns to .txt for better reproducibility. The output of this function is an instance of this class:
classdef FolderData < handle
    %FOLDERDATA Records which kinds of data files a folder contains
    properties
        folder %folderpath
        rthData %bool, true if this folder contains rth-data files
        cyclingData %bool, true if this folder contains cycling-data files
    end
    methods
        function this = FolderData(path)
            this.rthData = false;
            this.cyclingData = false;
            this.folder = path;
        end
    end
end
It just records which kinds of files were found and in which folder.
The actual search function at the top takes 8-30 seconds on my drives and works. I thought I could speed it up a little more with afterEach. The basic idea is that if the folders processed in the parfor loop differ greatly in how much they contain, a single folder holds back the whole process, because it has to finish before the function can resume its work after the parfor loop.
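The waiting effect described above can be reproduced with a small sketch. The durations are made-up values standing in for folders of different sizes, and pause stands in for the actual directory scan; neither comes from the real search. The parfor loop returns only once its slowest iteration has finished.

```matlab
% Hypothetical per-folder scan times in seconds (illustrative values only).
durations = [0.1 0.1 0.1 2.0];

tic;
parfor n = 1:numel(durations)
    pause(durations(n));   % stand-in for scanning one folder
end
elapsed = toc;

% elapsed is bounded below by the longest single iteration, because
% the code after the loop cannot start until every iteration is done.
fprintf('parfor took %.2f s, longest iteration %.2f s\n', elapsed, max(durations));
```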
For that I have created the following script:
clc;
clear all;
path = 'D:\';
if isempty(gcp('nocreate'))
    parpool(4);
end
fun = @checkFiles;
%output = searchparforae(path, fun);
%output = searchforae(path, fun);
output = searchfordata(path, 100, fun);
function matches = searchparforae(starting_path, checkFunction)
    tic;
    folder_que = parallel.pool.DataQueue;
    matches = [];
    listener = afterEach(folder_que, @search_folder);
    starting_path = struct('folder', starting_path, 'name', '');
    search_folder(starting_path);
    function search_folder(input)
        parfor n = 1:height(input)
            folder_path = strcat(input(n).folder, filesep, input(n).name);
            fprintf(1, folder_path);
            fprintf(1, '\n');
            [files, folders] = filesandfolders(folder_path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            matches = [matches;check];
            send(folder_que, folders);
        end
    end
    toc;
end
function matches = searchforae(starting_path, checkFunction)
    tic;
    folder_que = parallel.pool.DataQueue;
    matches = [];
    listener = afterEach(folder_que, @search_folder);
    starting_path = struct('folder', starting_path, 'name', '');
    search_folder(starting_path);
    function search_folder(input)
        for n = 1:height(input)
            folder_path = strcat(input(n).folder, filesep, input(n).name);
            fprintf(1, folder_path);
            fprintf(1, '\n');
            [files, folders] = filesandfolders(folder_path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            matches = [matches;check];
            send(folder_que, folders);
        end
    end
    toc;
end
The two functions "searchforae" and "searchparforae" are exactly the same except for the loop. As the names suggest, "searchforae" uses a for loop, while "searchparforae" uses a parfor loop.
Now searchforae is not working at all. The print output shows that searchforae only processes the initially given directory and the directories directly below it. Print output:
D:\
D:\$RECYCLE.BIN
D:\Downloads
D:\OneDriveTemp
D:\Programme
D:\Repositories
D:\Sonstiges
D:\Spiele
D:\System Volume Information
D:\Uni
D:\Uni2
D:\Users
D:\Zwischenablage
The searchparforae function, in contrast, works just as well as the searchfordata function at the top, but instead of 8-30 seconds it takes 5-10 minutes. Am I using afterEach wrong? Why is it taking that long? And why isn't the searchforae function working correctly, even though the only difference from searchparforae is a for loop instead of a parfor loop?
1 Comment
Alvaro
on 2 Dec 2022
At first glance, the line
afterEach(folder_que, @search_folder);
calls the search_folder function every time you send data to the queue. Moreover, you are sending data to the queue from within search_folder, which might be causing some unwanted recursion.
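For comparison, here is a minimal sketch of the one-directional pattern, in which workers only send values and the queue callback only records them on the client, without sending anything back to the queue. The names received and recordValue are hypothetical and only serve the illustration.

```matlab
% Handle object, so the callback can store into it (hypothetical name).
received = containers.Map('KeyType', 'double', 'ValueType', 'double');

q = parallel.pool.DataQueue;
afterEach(q, @(value) recordValue(received, value));

parfor n = 1:4
    send(q, n);            % workers only send, they never start new work
end
pause(0.5);                % lets the client run callbacks that may still be pending

fprintf('callback ran for %d values\n', received.Count);

function recordValue(map, value)
    % Runs on the client once per value sent to the queue.
    map(value) = value;
end
```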
Is this what you originally intended? Also, did you try making the changes you suggested in your top function? It would be helpful if you could explain the purpose of the additional script.
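If the goal is only to avoid waiting for the slowest folder of each level, one option that does not involve a DataQueue is to submit every folder as its own task with parfeval and collect results with fetchNext as they finish. The following is a rough sketch under that assumption, not a tested replacement for the functions above: searchwithfutures and scanFolder are hypothetical names, and the dir logic is inlined so that the block stands on its own. One task per folder adds scheduling overhead, so whether it beats the level-by-level parfor version would need to be measured.

```matlab
function matches = searchwithfutures(starting_path, checkFunction)
    % Hypothetical sketch: one parfeval task per folder, no per-level barrier.
    pool = gcp();
    futures = parfeval(pool, @scanFolder, 2, starting_path, checkFunction);
    pending = 1;               % number of submitted tasks not yet fetched
    matches = [];
    while pending > 0
        % Blocks until any one of the unread futures has finished.
        [~, check, subfolders] = fetchNext(futures);
        pending = pending - 1;
        matches = [matches; check];
        % Queue every subfolder right away instead of waiting for siblings.
        for k = 1:numel(subfolders)
            child = fullfile(subfolders(k).folder, subfolders(k).name);
            futures(end+1) = parfeval(pool, @scanFolder, 2, child, checkFunction);
            pending = pending + 1;
        end
    end
end

function [check, subfolders] = scanFolder(folder_path, checkFunction)
    % Runs on a worker: lists one folder and applies the check to its files.
    listing = dir(folder_path);
    listing = listing(~ismember({listing.name}, {'.', '..'}));
    files = listing(~[listing.isdir]);
    subfolders = listing([listing.isdir]);
    if isempty(files)
        check = [];
    else
        check = checkFunction(files);
    end
end
```

A call would then look like output = searchwithfutures('D:\', @checkFiles); with the pool already started as in the script.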
Answers (0)