convert html content to table
22 views (last 30 days)
Show older comments
Hi everyone! I have a file with this content :
<html xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas- microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40"> <head><META http-equiv="Content-Type" content="text/html; charset=UTF-8"> </head> <body> <table> <tr> <th>DATE</th><th>OPENING</th><th>MAXIMUM</th><th>MINIMUM</th><th>CLOSURE</th><th>VAR % CLOSE</th> <th>VOLUME</th> </tr> <tr> <td>10/24/22</td><td>12,900</td><td>13,042</td><td>12,761</td><td>12,964</td><td>0.00 </td><td>80421</td> </tr> <tr>.
this is a sample line of content. i would like to convert the file content into a table. thanks
7 Comments
Rik
on 23 Apr 2023
If you have trouble with Matlab basics you may consider doing the Onramp tutorial (which is provided for free by Mathworks).
Accepted Answer
Star Strider
on 22 Apr 2023
Edited: Star Strider
on 23 Apr 2023
It appears to be an Excel file in a website. Perhaps using readtable would work. If not, then perhaps first using websave might.
EDIT — (22 Apr 2023 at 18:01)
Use the URL of the website as the argument to readtable.
Example from a recent post:
T1 = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1245967/PPG_sensor.csv', 'VariableNamingRule','preserve')
EDIT — (23 Apr 2023 at 14:20)
This is not the best or most efficient code I ever wrote, however it has the virtue of working —
str = fileread('volm.xls');
expression = '(\d+\/\d+\/\d+)|(\d*\.\d*)|\d*';
[tokens,matches] = regexp(str,expression,'tokens','match');
vmatches = matches(5:end);
arlim = numel(vmatches)
v = reshape(vmatches, 8, []).';
dnv = str2double(v(:,2:end));
dtv = datetime(v(:,1), 'InputFormat','dd/MM/yy');% + timeofday(datetime(v(:,2), 'InputFormat','HH:mm:ss'));
T1 = array2table(dnv);
T1 = addvars(T1, dtv, 'Before',1);
T1.Properties.VariableNames = {'DATA','APERTURA','MASSIMO','MINIMO','CHISURA','VAR','% CLOSE','VOLUME'}
VN = T1.Properties.VariableNames;
figure
plot(T1{:,1}, T1{:,[2 3 4]})
hold on
patch([T1{:,1}; flip(T1{:,1})], [T1{:,3}; flip(T1{:,4})], [1 1 1]*0.25, 'EdgeColor','none', 'FaceAlpha',0.25)
hold off
grid
axis('padded')
xlabel(VN{1})
ylabel('Value')
legend(VN{[2 3 4]}, 'Location','best')
EDIT — (23 Apr 2023 at 14:52)
Added plot.
EDIT — (23 Apr 2023 at 17:48)
Changed code to use reshape on the the original cell array ‘vmatches’ to eliminate the loop and make the code more efficient.
.
7 Comments
Star Strider
on 24 Apr 2023
I learned something new, that being that this file (and presumably the others) have a ‘,’ decimal separator and a ‘.’ thousands separator. I was not aware of that previously, and did not suspect it. So the ‘str1’ - ‘str4’ conversion steps will likely have to be added to all the previous code as well.
This version also detects the variable names and the edits them to fit the final table (and intermediate operations, specifically combining the dates and times into a single datetime array, although I cannot determine if the other combinations of variable names to make them fit the column-size of the numeric data are correct).
Unfortunatley what I previous believed was reasonably robust code is not as robust as I thought it would be, and gives me new respect for the MathWorks folks who had to code readtable and readmatrix and the rest!
The latest version —
str = fileread('russfut.xls');
str2 = strrep(str,',','#');
str3 = strrep(str2,'.','');
str4 = strrep(str3,'#','.');
vars = regexp(str4,'([A-Z]\w*)','match');
VN1 = vars(:,7:end);
str = str4;
expression = '(\d+\/\d+\/\d+)|(\d+\:\d+\:\d+)|(\d*\.\d*)|\d*';
[tokens,matches] = regexp(str,expression,'tokens','match');
% matches
vmatches = matches(4:end);
arlim = numel(vmatches)
cols = numel(VN1)-1;
v = reshape(vmatches, cols, []).';
dnv = str2double(v(:,3:end));
dtv = datetime(v(:,1), 'InputFormat','dd/MM/yy') + timeofday(datetime(v(:,2), 'InputFormat','HH:mm:ss'));
T1 = array2table(dnv);
T1 = addvars(T1, dtv, 'Before',1);
VN1{1} = [VN1{1} ' ' VN1{2}];
VN1{8} = [VN1{7} ' ' VN1{8}];
VN1([2 7]) = [];
T1.Properties.VariableNames = VN1
VN = T1.Properties.VariableNames;
figure
plot(T1{:,1}, T1{:,[2 3 4]})
hold on
patch([T1{:,1}; flip(T1{:,1})], [T1{:,3}; flip(T1{:,4})], [1 1 1]*0.25, 'EdgeColor','none', 'FaceAlpha',0.25)
hold off
grid
axis('padded')
xlabel(VN{1})
ylabel('Value €')
legend(VN{[2 3 4]}, 'Location','best')
There could be endless variations on the formats of these files, and I cannot envision a robust code to handle all of them. It would be quite helpful to find a way for them to be made compatible with readmatrix. That, unfortunately, is beyond me.
.
Stephen23
on 24 Apr 2023
Edited: Stephen23
on 24 Apr 2023
"I learned something new, that being that this file (and presumably the others) have a ‘,’ decimal separator and a ‘.’ thousands separator."
It is a simple XML file (not a XLS file, regardless of its file extension). Everything in an XML is text.
More Answers (0)
See Also
Categories
Find more on String Parsing in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!