Pull webpage from MATLAB site using MATLAB (but with login)

Hello there
I have recently been working on code that pulls information from a webpage and stores it in a file.
webread() isn't very hard to use.
However, I have gotten to the point where I want to pull pages that can only be seen when logged in.
I am using a MATLAB webpage (only visible when logged in) to work on my solution, but I can't quite figure it out.
for example,
pageLink = 'https://www.mathworks.com/matlabcentral/cody/groups/345/problems/15-find-the-longest-sequence-of-1-s-in-a-binary-sequence/solutions/new';
options = weboptions;
options.Username = 'myEmail@email.com';
options.Password = 'myPassw0rd';
pageRead = webread(pageLink, options);
(obviously with real information)
This does not work; it always returns the 'You must log in' page.
I have also tried using webwrite to send my credentials, passing them as named parameters, such as...
userPage = 'https://www.mathworks.com/login?uri=https%3A%2F%2Fwww.mathworks.com%2Fproducts%2Fmatlab.html';
userId = 'myEmail@email.com';
password = 'myPassw0rd';
webwrite(userPage, 'userId', userId, 'password', password)
and various other combinations of webwrite, webread, options, and named parameters,
but it won't return the page as if I were logged in.
Could someone direct me along the right path? Is it just MATLAB and should I have tried with a different website or can this be done?
Thanks,
H
Highphi on 22 Jul 2020
update:
tried using...
system(['wget --auth-no-challenge --user=', userId, ' --password=', password, ' ', pageLink])
which started to feel like a step in the right direction... but I get:
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\Gow/etc/wgetrc
--2020-07-22 13:00:05-- https://www.mathworks.com/matlabcentral/cody/groups/345/problems/15-find-the-longest-sequence-of-1-s-in-a-binary-sequence/solutions/new
Resolving www.mathworks.com... 00.00.00.000
Connecting to www.mathworks.com|00.00.00.00|:443... connected.
ERROR: cannot verify www.mathworks.com's certificate, issued by `/C=US/O=DigiCert Inc/CN=DigiCert SHA2 Secure Server CA':
Unable to locally verify the issuer's authority.
To connect to www.mathworks.com insecurely, use `--no-check-certificate'.
Unable to establish SSL connection.
where 00.00.00.000 is (potentially) an IP address that I censored since I'm not sure what its significance is


Accepted Answer

Highphi on 22 Jul 2020
Figured it out...
By myself ...............
No worries. Here's how I did it for future reference:
1. Fix your default web browser preferences
Option 1: MANUALLY
A. Under the 'Home' tab, click 'Preferences'
Option 2: From the COMMAND WINDOW
A. CODE:
preferences Web
B. In the 'Preferences' window, go to the 'Web' subsection and make sure the box next to "Use system browser when opening links to external sites (recommended)." is unchecked, so that web() opens pages in MATLAB's built-in browser and can return a handle. Then click Apply.
(Please forgive my handwriting, as I wrote it in Snipping Tool with my mouse lol)
2. THE REST IS HISTORY
A. Use the following code to open your window:
[a,h] = web(pageLink);
It will pop up a window with the link you told it to go to.
B. IF prompted to login to the desired page, do so and try to click 'Remember Me' if it is an option.
Otherwise, do this step at the beginning of every script and leave one browser window open. I will explain in a second.
C. Use the following code to pull your HTML and then close the browser:
[a, h2] = web(pageLink);
pageHTML = get(h2, 'HtmlText');
close(h2);
Notice I used the handle 'h2' in the second part. This is so that you don't close 'h', if necessary. Closing h2 will ONLY close h2, allowing you to remain logged in.
D. Rinse and repeat.
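Since the original goal was to store the page in a file, here is a minimal sketch of step C followed by a save. It assumes you have already logged in through a MATLAB browser window that is still open (steps A and B). The URL, the output file name, the fixed pause, and the '-new' flag (used here so the second window is separate from the logged-in one) are my assumptions, not part of the steps above.

```matlab
pageLink = 'https://www.mathworks.com/matlabcentral/cody';  % example URL, swap in your own
outFile  = 'page.html';                                     % hypothetical output file name

% Open the page in a separate MATLAB browser window so the logged-in
% window from steps A and B stays untouched.
[~, h2] = web(pageLink, '-new');
pause(3)                          % crude wait for the page to load
pageHTML = get(h2, 'HtmlText');   % same call as in step C
close(h2);

% Write the HTML to disk.
fid = fopen(outFile, 'w');
if fid == -1
    error('Could not open %s for writing.', outFile);
end
fprintf(fid, '%s', pageHTML);
fclose(fid);
```

If the saved file still shows the login page, the session was probably not picked up, so check that the first window is still open and logged in.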
Highphi on 5 Jan 2021
You will have to parse it.
I use this set of functions sooo much now, so here's an updated solution and some hints & tips:
1) You don't need to close(h2); reopening the window takes significantly longer if you're doing multiple pages. One thing you can do is throw a loop in there that checks whether the page is loaded and then breaks, i.e.
[~, h2] = web(pageLink);
pause(3) % give the page a moment to start loading
maxTries = 30; % give up after roughly 30 seconds instead of looping forever
for k = 1:maxTries
    pageHTML = get(h2, 'HtmlText');
    f1 = strfind(pageHTML, 'footer'); % look for footer (is loaded)
    if ~isempty(f1)
        break
    end
    pause(1)
end
2) In order to parse the page, you may want to open the desired page in a browser (such as Chrome) and hit F12. This will open developer tools. If, say, you want to find text within a certain area, find the specific HTML surrounding it. i.e.
Then...
f1 = strfind(pageHTML, '<div class="comment "');
pageHTML = pageHTML(f1(1):end);
f2 = strfind(pageHTML, 'class="add-comment'); % match on the class attribute, whatever the tag name
pageHTML = pageHTML(1:f2(1)-1);
% this will give you the code within the desired div, apply this however you need
Hopefully that helps
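Here is a slightly more defensive sketch of the same strfind slicing. The HTML string is a made-up stand-in so it runs without a browser, and the end marker is a hypothetical one; in practice pageHTML would come from get(h2, 'HtmlText') and the markers from whatever you find with F12.

```matlab
% Stand-in HTML for illustration only.
pageHTML = ['<html><body><div class="comment ">First comment</div>' ...
            '<div class="add-comment">Add yours</div></body></html>'];

startTag = '<div class="comment "';    % marker that opens the region of interest
endTag   = '<div class="add-comment';  % hypothetical marker that follows it

f1 = strfind(pageHTML, startTag);
if isempty(f1)
    error('Start marker not found; the page layout may have changed.');
end
tail = pageHTML(f1(1):end);

f2 = strfind(tail, endTag);
if isempty(f2)
    section = tail;                    % no end marker: keep everything after the start
else
    section = tail(1:f2(1)-1);
end

% Strip the remaining tags to leave only the visible text.
plainText = strtrim(regexprep(section, '<[^>]*>', ''));
disp(plainText)
```

The isempty checks matter because indexing f1(1) or f2(1) on an empty result throws an error as soon as the site changes its markup.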


More Answers (1)

Pascal Geschwill on 30 Apr 2021
Hi,
While this approach seems to work for now, it looks like this is deprecated functionality. At least with R2020a, I am getting a warning:
Warning: [STAT,H] = WEB(___) does not return a handle for pages that open in the system browser. Use STAT = WEB(___) instead.
> In web>displayWarningMessage (line 432)
In web (line 96)
In my case, the solution described in this thread worked just as well. I am pulling build histories from our CI server via its REST API and then parsing them in MATLAB.
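For a REST API that accepts HTTP basic authentication, a rough sketch with webread could look like the following. The host, endpoint, and credentials are placeholders I made up, and whether a given CI server accepts basic authentication or needs a token header instead is an assumption to check against its own documentation.

```matlab
% Hypothetical CI endpoint and credentials, for illustration only.
apiUrl  = 'https://ci.example.com/api/builds';
options = weboptions('Username', 'myUser', ...
                     'Password', 'myApiToken', ...
                     'ContentType', 'json', ...
                     'Timeout', 15);
try
    builds = webread(apiUrl, options);   % JSON is decoded into MATLAB data
    disp(builds)
catch err
    % Expected here, because ci.example.com is a placeholder host.
    fprintf('Request failed: %s\n', err.message);
end
```

The Username and Password options are sent as HTTP authentication, not as a web form login, which is why this works for many APIs but not for a page behind a login form.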

Release

R2020a
