convert html content to table

22 views (last 30 days)
roberto
roberto on 22 Apr 2023
Edited: Stephen23 on 24 Apr 2023
Hi everyone! I have a file with this content :
<html xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas- microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40"> <head><META http-equiv="Content-Type" content="text/html; charset=UTF-8"> </head> <body> <table> <tr> <th>DATE</th><th>OPENING</th><th>MAXIMUM</th><th>MINIMUM</th><th>CLOSURE</th><th>VAR % CLOSE</th> <th>VOLUME</th> </tr> <tr> <td>10/24/22</td><td>12,900</td><td>13,042</td><td>12,761</td><td>12,964</td><td>0.00 </td><td>80421</td> </tr> <tr>.
this is a sample line of content. i would like to convert the file content into a table. thanks
  7 Comments
Rik
Rik on 23 Apr 2023
If you have trouble with Matlab basics you may consider doing the Onramp tutorial (which is provided for free by Mathworks).
roberto
roberto on 23 Apr 2023
Moved: Star Strider on 23 Apr 2023
if you open the file directly in matlab you will get an error message. instead opening it first in excel and saving it as xlsx you will be able to use it in matlab. when opening in excel there will be a warning pop up, but you can safely open the file is safe. if opened as txt, you can see the html content.

Sign in to comment.

Accepted Answer

Star Strider
Star Strider on 22 Apr 2023
Edited: Star Strider on 23 Apr 2023
It appears to be an Excel file in a website. Perhaps using readtable would work. If not, then perhaps first using websave might.
EDIT — (22 Apr 2023 at 18:01)
Use the URL of the website as the argument to readtable.
Example from a recent post:
T1 = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1245967/PPG_sensor.csv', 'VariableNamingRule','preserve')
T1 = 5187×1 table
Wave ____ 40 40 40 40 40 40 40 40 39 39 39 39 38 38 38 38
EDIT — (23 Apr 2023 at 14:20)
This is not the best or most efficient code I ever wrote, however it has the virtue of working —
str = fileread('volm.xls');
expression = '(\d+\/\d+\/\d+)|(\d*\.\d*)|\d*';
[tokens,matches] = regexp(str,expression,'tokens','match');
vmatches = matches(5:end);
arlim = numel(vmatches)
arlim = 504
v = reshape(vmatches, 8, []).';
dnv = str2double(v(:,2:end));
dtv = datetime(v(:,1), 'InputFormat','dd/MM/yy');% + timeofday(datetime(v(:,2), 'InputFormat','HH:mm:ss'));
T1 = array2table(dnv);
T1 = addvars(T1, dtv, 'Before',1);
T1.Properties.VariableNames = {'DATA','APERTURA','MASSIMO','MINIMO','CHISURA','VAR','% CLOSE','VOLUME'}
T1 = 63×8 table
DATA APERTURA MASSIMO MINIMO CHISURA VAR % CLOSE VOLUME ___________ ________ _______ ______ _______ ___ _______ ______ 23-Jan-2023 15.141 15.209 15.071 15.16 0 0 43887 24-Jan-2023 15.177 15.215 15.073 15.154 0 4 45015 25-Jan-2023 15.143 15.2 15.018 15.134 0 13 56891 26-Jan-2023 15.207 15.274 15.119 15.177 0 28 60531 27-Jan-2023 15.202 15.239 15.124 15.195 0 12 53538 30-Jan-2023 15.204 15.212 15.033 15.183 0 8 67499 31-Jan-2023 15.164 15.243 15.038 15.174 0 6 62442 01-Feb-2023 15.207 15.366 15.151 15.219 0 30 68697 02-Feb-2023 15.307 15.596 15.302 15.565 2 27 95973 03-Feb-2023 15.551 15.556 15.392 15.509 0 36 70726 06-Feb-2023 15.427 15.451 15.32 15.395 0 74 68405 07-Feb-2023 15.43 15.483 15.317 15.355 0 26 74440 08-Feb-2023 15.449 15.537 15.411 15.451 0 63 72839 09-Feb-2023 15.486 15.705 15.445 15.568 0 76 88021 10-Feb-2023 15.484 15.523 15.272 15.331 1 52 84766 13-Feb-2023 15.327 15.468 15.279 15.441 0 72 52535
VN = T1.Properties.VariableNames;
figure
plot(T1{:,1}, T1{:,[2 3 4]})
hold on
patch([T1{:,1}; flip(T1{:,1})], [T1{:,3}; flip(T1{:,4})], [1 1 1]*0.25, 'EdgeColor','none', 'FaceAlpha',0.25)
hold off
grid
axis('padded')
xlabel(VN{1})
ylabel('Value')
legend(VN{[2 3 4]}, 'Location','best')
The ‘DATA’ are a datetime array, the rest are double values.
EDIT — (23 Apr 2023 at 14:52)
Added plot.
EDIT — (23 Apr 2023 at 17:48)
Changed code to use reshape on the the original cell array ‘vmatches’ to eliminate the loop and make the code more efficient.
.
  7 Comments
Star Strider
Star Strider on 24 Apr 2023
I learned something new, that being that this file (and presumably the others) have a ‘,’ decimal separator and a ‘.’ thousands separator. I was not aware of that previously, and did not suspect it. So the ‘str1’ - ‘str4’ conversion steps will likely have to be added to all the previous code as well.
This version also detects the variable names and the edits them to fit the final table (and intermediate operations, specifically combining the dates and times into a single datetime array, although I cannot determine if the other combinations of variable names to make them fit the column-size of the numeric data are correct).
Unfortunatley what I previous believed was reasonably robust code is not as robust as I thought it would be, and gives me new respect for the MathWorks folks who had to code readtable and readmatrix and the rest!
The latest version —
str = fileread('russfut.xls');
str2 = strrep(str,',','#');
str3 = strrep(str2,'.','');
str4 = strrep(str3,'#','.');
vars = regexp(str4,'([A-Z]\w*)','match');
VN1 = vars(:,7:end);
str = str4;
expression = '(\d+\/\d+\/\d+)|(\d+\:\d+\:\d+)|(\d*\.\d*)|\d*';
[tokens,matches] = regexp(str,expression,'tokens','match');
% matches
vmatches = matches(4:end);
arlim = numel(vmatches)
arlim = 69776
cols = numel(VN1)-1;
v = reshape(vmatches, cols, []).';
dnv = str2double(v(:,3:end));
dtv = datetime(v(:,1), 'InputFormat','dd/MM/yy') + timeofday(datetime(v(:,2), 'InputFormat','HH:mm:ss'));
T1 = array2table(dnv);
T1 = addvars(T1, dtv, 'Before',1);
VN1{1} = [VN1{1} ' ' VN1{2}];
VN1{8} = [VN1{7} ' ' VN1{8}];
VN1([2 7]) = [];
T1.Properties.VariableNames = VN1
T1 = 8722×7 table
DATA ORA APERTURA MASSIMO MINIMO CHIUSURA VAR CLOSE VOLUME ____________________ ________ _______ ______ ________ _________ ______ 24-Apr-2023 00:00:00 13068 13068 13068 13068 0 0 24-Apr-2023 00:00:01 13068 13068 13068 13068 0 1 24-Apr-2023 00:00:02 13068 13068 13068 13068 0 8 24-Apr-2023 00:00:03 13069 13069 13069 13069 0.01 8 24-Apr-2023 00:00:04 13069 13069 13069 13069 0 8 24-Apr-2023 00:00:05 13071 13071 13071 13071 0.01 1 24-Apr-2023 00:00:06 13070 13070 13070 13070 0 8 24-Apr-2023 00:00:08 13070 13070 13070 13070 0 14 24-Apr-2023 00:00:09 13069 13069 13069 13069 0.01 6 24-Apr-2023 00:00:10 13070 13070 13070 13070 0.01 6 24-Apr-2023 00:00:11 13069 13069 13069 13069 0.01 3 24-Apr-2023 00:00:12 13070 13070 13070 13070 0.01 2 24-Apr-2023 00:00:13 13069 13069 13069 13069 0.01 4 24-Apr-2023 00:00:14 13069 13069 13069 13069 0 4 24-Apr-2023 00:00:15 13068 13068 13068 13068 0.01 11 24-Apr-2023 00:00:16 13068 13068 13068 13068 0.01 4
VN = T1.Properties.VariableNames;
figure
plot(T1{:,1}, T1{:,[2 3 4]})
hold on
patch([T1{:,1}; flip(T1{:,1})], [T1{:,3}; flip(T1{:,4})], [1 1 1]*0.25, 'EdgeColor','none', 'FaceAlpha',0.25)
hold off
grid
axis('padded')
xlabel(VN{1})
ylabel('Value €')
legend(VN{[2 3 4]}, 'Location','best')
There could be endless variations on the formats of these files, and I cannot envision a robust code to handle all of them. It would be quite helpful to find a way for them to be made compatible with readmatrix. That, unfortunately, is beyond me.
.
Stephen23
Stephen23 on 24 Apr 2023
Edited: Stephen23 on 24 Apr 2023
"I learned something new, that being that this file (and presumably the others) have a ‘,’ decimal separator and a ‘.’ thousands separator."
It is a simple XML file (not a XLS file, regardless of its file extension). Everything in an XML is text.

Sign in to comment.

More Answers (0)

Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!