Extracting Data field of a Series in HTML file
In an HTML file, there is a section like this:
series: [{
name: 'Numbers',
color: '#33CCFF',
lineWidth: 5,
data: [45,78,84,91,111,125,178,231,274,283,303,333] }],
How can I extract the 'data' field into an array in MATLAB code?
There are many such series in the same HTML file, each with a different 'name' field, for example name: 'Total Value', 'Log Scale', 'Base Value', etc.
4 Comments
Mohammad Sami
on 7 Apr 2020
Are you parsing the HTML in MATLAB as a char array? regexp works on string, cellstr or char data.
You can easily change the pattern to name: \'Numbers\'
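The pattern this comment refers to is not shown in the thread, so the following is only a minimal sketch of the idea: anchor the regular expression on name: 'Numbers' and capture the first data list that follows it. The sample text is copied from the question; the variable names chr, pat, tok and data are illustrative assumptions.

```matlab
% Sample text copied from the question (stands in for the HTML file contents).
chr = [ 'series: [{' newline ...
        'name: ''Numbers'',' newline ...
        'color: ''#33CCFF'',' newline ...
        'lineWidth: 5,' newline ...
        'data: [45,78,84,91,111,125,178,231,274,283,303,333] }],' ];

% Lazily skip from the name field to the first following data list
% and capture the text between the square brackets.
pat = 'name:\s*''Numbers''.*?data:\s*\[([^\]]*)\]';
tok = regexp( chr, pat, 'tokens', 'once' );

% Convert the comma-separated text to a numeric row vector.
data = sscanf( tok{1}, '%f,' ).';
```

This relies on data: appearing after name: within the same block, as in the question's sample. If a block could list the fields in another order, splitting the file into blocks first is safer.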
Accepted Answer
per isakson
on 7 Apr 2020
Edited: per isakson
on 9 May 2020
I misunderstood your question. This is a bit of overkill.
Assumptions
- the string, series:, always indicates the start of a block of interest
I created a sample file, cssm.txt, which I uploaded. (MATLAB Answers doesn't allow the extension .html.)
This script reads all blocks:
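%% Read the file and split it into blocks, one per "series:" section
chr = fileread('cssm.txt');
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
%% Parse the fields of each block into a struct array
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
for jj = 1 : len
    % name: text after "name:" up to the next comma, quotes removed
    txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    % makeValidName also removes blanks, e.g. 'Total Value' becomes 'TotalValue'
    series(jj).name = matlab.lang.makeValidName( txt );
    % color: text after "color:" up to the next comma, quotes removed
    txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).color = txt;
    % lineWidth: a single number
    txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
    series(jj).lineWidth = str2double( txt );
    % data: the bracketed list, evaluated as a MATLAB row vector
    txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
    series(jj).data = str2num( txt ); %#ok<ST2NM>
end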
Then, to extract only the series whose name is 'Numbers' and not the other series:
>> series(strcmp({series.name},'Numbers')).data
ans =
45 78 84 91 111 125 178 231 274 283 303 333
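The block-splitting step of the script can be tried without a file. The sketch below applies the same two patterns to a small in-memory sample. The second series ('Total Value' with data [1,2,3]) is made up for illustration and is not from cssm.txt.

```matlab
% Two series blocks in one char array; the second block is a made-up example.
chr = [ 'series: [{' newline 'name: ''Numbers'','     newline 'data: [45,78,84] }],' newline ...
        'series: [{' newline 'name: ''Total Value'',' newline 'data: [1,2,3] }],' ];

% Same pattern as in the script: everything after "series:" up to and including "}],"
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );

% regexp accepts a cell array of char vectors and returns one match per block.
names = regexp( cac, '(?<=name\:)[^,]+', 'match', 'once' );

% Remove the single quotes and the surrounding blanks.
names = strtrim( strrep( names, '''', '' ) );
```

Here cac holds one char vector per block and names holds the two names with their blanks intact. The script itself passes the name through matlab.lang.makeValidName, which removes blanks, so 'Total Value' would be stored as 'TotalValue' and has to be looked up under that name.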
In response to comment below
Assumptions
- the string, series:, always indicates the start of a block of interest
- the string, }], indicates the end of a block of interest
- all html-files of interest are named index.html
- all files named index.html are of interest
- all html-files of interest are in subfolders under a root-folder, ...\finCase
- every html-file, index.html, contains exactly one block that has a specific value of the field name:, e.g. Numbers
The overkill is still there. However, reading and parsing four html-files (copies of cssm.txt) takes less than 10 ms.
Try
>> client_data = read_client_data( 'd:\m\cssm\finCase', 'index.html', 'Numbers' )
client_data =
4×2 cell array
{'anderson' } {1×9 double}
{'kim-j-clijsters'} {1×10 double}
{'paul-judd' } {1×11 double}
{'simmi' } {1×12 double}
>>
where (in one m-file)
function client_data = read_client_data( root, file, name )
    % Find every file named <file> in root and all of its subfolders
    sad = dir( fullfile( root, '**', file ) );
    len = length( sad );
    client_data = cell( len, 2 );
    for jj = 1 : len
        % The client is the name of the folder that contains the file
        cac = strsplit( sad(jj).folder, filesep );
        client = cac{end};
        series = read_one_file_( fullfile( sad(jj).folder, sad(jj).name ) );
        % Assumes exactly one series in the file has the requested name
        client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
    end
end
function series = read_one_file_( file )
    % Split the file into blocks, one per "series:" section
    chr = fileread( file );
    cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
    len = length( cac );
    series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
    for jj = 1 : len
        % name and color: text up to the next comma, quotes removed
        txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).name = strtrim( txt );
        txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).color = txt;
        % lineWidth: a single number
        txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
        series(jj).lineWidth = str2double( txt );
        % data: the bracketed list, evaluated as a MATLAB row vector
        txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
        series(jj).data = str2num( txt ); %#ok<ST2NM>
    end
end
TODO: add error handling and comments
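As one possible starting point for that error handling, the name lookup in read_client_data could check the "exactly one block" assumption before indexing. This is a sketch only. The error identifier, the message text and the sample struct array are hypothetical and not part of the original answer.

```matlab
% Example input shaped like the struct array returned by read_one_file_
% (values made up for illustration).
series = struct( 'name', {'Numbers','Total Value'}, ...
                 'data', {[45 78 84],[1 2 3]} );
name = 'Numbers';

% Logical mask of the series whose name matches exactly.
is_match = strcmp( {series.name}, name );

% Fail early with a descriptive message instead of an obscure indexing error.
if nnz( is_match ) ~= 1
    error( 'read_client_data:seriesCount', ...
           'Expected exactly one series named "%s", found %d.', ...
           name, nnz( is_match ) );
end
data = series(is_match).data;
```

Whether a missing series should raise an error or simply leave an empty cell in client_data is a design choice that depends on how the result is used.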
10 Comments
per isakson
on 9 May 2020
A nice thing with standards is that there are so many to choose between. Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database [Wikipedia]. However, MATLAB doesn't honor null.
Replace the statement
series(jj).data = str2num( txt ); %#ok<ST2NM>
by
out = textscan( txt , '%f' ...
, 'CollectOutput' , true ...
, 'Delimiter' , ',' ...
, 'EmptyValue' , 0 ...
, 'TreatAsEmpty' , 'null' ...
, 'Whitespace' , ' \t[]' );
series(jj).data = reshape( out{:}, 1,[] );
and read about textscan in the documentation.
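To see the effect of these options in isolation, the same textscan call can be run on a short string. The sample text below, with one null entry, is made up for illustration.

```matlab
% A data list as captured by the '(?<=data\:)[^}]+' pattern, with a null entry
% (made-up sample).
txt = ' [45,78,null,91] ';

% Same call as above: brackets count as white space, and "null" is treated as
% an empty field, which is then filled with the EmptyValue (0).
out = textscan( txt , '%f' ...
    , 'CollectOutput' , true ...
    , 'Delimiter'     , ',' ...
    , 'EmptyValue'    , 0 ...
    , 'TreatAsEmpty'  , 'null' ...
    , 'Whitespace'    , ' \t[]' );

% textscan returns a column vector in a cell; reshape it to a row.
data = reshape( out{:}, 1,[] );
```

With 'EmptyValue' set to 0, a missing value cannot be told apart from a genuine zero. Setting 'EmptyValue' to NaN instead would keep that distinction, if the later processing can handle NaN.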