Searching NCBI by accession number

Elissa Moller on 8 Jun 2020
I am trying to search the NCBI database using accession numbers instead of GI numbers. The example function, which maps GI numbers to taxids, is:
function mapTaxoFile(taxonomyFilenameIn, taxonomyFilenameOut, blockSize)
%MAPTAXOFILE Helper function for METAGENOMICDEMO
% Copyright 2007-2016 The MathWorks, Inc.
fid1 = fopen(which(taxonomyFilenameIn),'rt'); % from NCBI TAXONOMY FTP site
if fid1<0
    error('bioinfo:mapTaxoFile:invalidFile','Cannot open input file.')
end
fid2 = fopen(taxonomyFilenameOut, 'w'); % binary file used for mapping
%=== create a map between gi numbers and taxids
curr = 1; % current gi to consider
while(~feof(fid1))
    data = textscan(fid1, '%d %d', blockSize);
    gi = data{1};
    taxa = data{2};
    gap = gi(1) - curr;
    %=== missing gi numbers between blocks are assigned a taxid = -1
    if gap
        D = -1 * ones(gap, 1);
        fwrite(fid2, D, 'int32');
    end
    %=== populate array D such that D(gi) = taxid of gi
    curr = gi(end) + 1; % current gi position in the final list
    offset = min(gi) - 1; % starting gi in the current block
    N = max(gi) - offset; % number of gi's to consider
    D = -1 * ones(N,1);
    D(gi - offset) = taxa;
    %=== write array D into binary file
    fwrite(fid2, D, 'int32');
end
fclose all;
I successfully got the data-reading section to run with the following code:
data = textscan(fid1, '%s %s %s %s', blockSize, 'HeaderLines', 1);
accession = data{1,2};
taxa = data{1,3};
However, when I get to the part that populates the array, the input is a mixture of numbers and letters, so functions like max and min will not work. Is there another way to do this? I want to make sure each call reads a block of blockSize rows starting from the correct point, and eventually save the result in a memory map. The file is massive, so I don't want to load it all at once.
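
One possible direction, offered as a minimal sketch rather than a definitive answer: because accession strings cannot serve as array indices, the D(gi) = taxid trick does not carry over directly. If only a known set of accessions needs to be looked up, each block can instead be filtered with ismember and the hits stored in a containers.Map. The sketch below assumes a tab-delimited file with one header line and four columns (accession, accession.version, taxid, gi). The sample file, the accession values, and the variable names wanted and acc2tax are all made up for illustration. textscan called with a file identifier resumes at the current file position, so successive calls read consecutive blocks. The header is skipped once with fgetl before the loop, since passing 'HeaderLines', 1 on every call inside the loop would likely skip a data line per block.

```matlab
% Minimal sketch: block-wise lookup of taxids by accession.version.
% Assumption: tab-delimited input with one header line and four columns
% (accession, accession.version, taxid, gi). The sample data is invented.
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, 'accession\taccession.version\ttaxid\tgi\n');
fprintf(fid, 'AAA00001\tAAA00001.1\t9606\t101\n');
fprintf(fid, 'AAA00002\tAAA00002.2\t10090\t102\n');
fprintf(fid, 'AAA00003\tAAA00003.1\t562\t103\n');
fclose(fid);

wanted = {'AAA00003.1', 'AAA00001.1'};  % hypothetical query accessions
blockSize = 2;                           % rows read per textscan call

% Map from accession.version (char) to taxid (int32)
acc2tax = containers.Map('KeyType', 'char', 'ValueType', 'int32');

fid1 = fopen(sampleFile, 'rt');
if fid1 < 0
    error('Cannot open input file.');
end
fgetl(fid1);                             % skip the header line exactly once
while ~feof(fid1)
    % Each call continues from where the previous call stopped and
    % reads at most blockSize rows.
    data = textscan(fid1, '%s %s %d %d', blockSize, 'Delimiter', '\t');
    accession = data{2};                 % accession.version column
    taxa = data{3};                      % taxid column
    % Keep only the rows whose accession is in the query list
    hit = find(ismember(accession, wanted));
    for k = hit(:)'
        acc2tax(accession{k}) = taxa(k);
    end
end
fclose(fid1);
delete(sampleFile);

disp(acc2tax('AAA00003.1'))
```

Only the matching rows are kept in memory, so the size of the map depends on the query list rather than on the file. If a persistent, memory-mapped table of every accession is needed instead, one option might be to write fixed-width records to a binary file and read them with memmapfile, but that is a different design from this sketch.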

Answers (0)

Release

R2020a
