Regular expression for Arabic text in MATLAB
I used OCR in MATLAB to read Arabic text from an image. Now I want to write a regular expression that matches a word in the Arabic text, but it does not work.
19 Comments
Guillaume
on 21 Dec 2017
"it does not work"
As Stephen said, this is a useless statement if you don't even tell us what the "it" is. How can we know whether you've made a mistake with the "it", whether you're using the "it" incorrectly, or whether the "it" simply does not support Arabic?
So show us the "it", that is the exact code you're using and ideally an example input where "it" doesn't work.
Guillaume
on 21 Dec 2017
Seems to work for me (R2017b):
>> Pattern = '(فاتورة عدد)';
>> Lines = {Pattern(2:end-1); [Pattern(2:end-1), '2015/02 ']; Pattern(4:5)}
>> P = regexp(Lines,Pattern,'match');
>> P = [P{:}]
Lines =
3×1 cell array
{'فاتورة عدد'}
{'فاتورة عدد2015/02 '}
{ 'تو'}
P =
1×2 cell array
{'فاتورة عدد'} {'فاتورة عدد'}
Walter Roberson
on 21 Dec 2017
One thing to note is that if your operating system is set to English, MATLAB might not store .m files with UTF-8 encoding, so when you save the .m file, close it, and open it again, any Arabic characters you had in the file might be gone. With newer versions there is apparently a way to force MATLAB to permit UTF-8 for .m files, but it involves editing an obscure configuration file.
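One way to sidestep the file-encoding issue described above is to build the Arabic pattern from Unicode code points, so the .m file itself contains only ASCII. The following is a minimal sketch, assuming the word to match is the one from the question (U+0639, U+062F, U+062F); the variable names and the sample text are illustrative, not taken from the original post.

```matlab
% Build the Arabic word from code points so the source file stays pure ASCII.
word   = char([1593 1583 1583]);     % U+0639 U+062F U+062F

% Hypothetical sample text standing in for one line of OCR output.
sample = [word, ' 2015/02'];

% Capture the first run of non-space characters that follows the word.
tok = regexp(sample, [word, '\s*(\S+)'], 'tokens', 'once');
disp(tok{1})
```

Because the pattern is assembled at run time, it does not matter which encoding the editor uses when saving the file.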
Walter Roberson
on 21 Dec 2017
Just to be sure we are all referring to the same thing:
It is not possible to use regexp() on an image, only on character vectors or cell array of character vectors or on string() arrays.
Walter Roberson
on 21 Dec 2017
Please attach a .mat containing the cell array and also containing the pattern you are trying to search for.
N Rh
on 21 Dec 2017
Edited: Walter Roberson
on 21 Dec 2017
This is the code I used. You can run it with the image in the attached file.
clear all; close all; clc;
!tesseract -l eng+ara fac.jpg output
slCharacterEncoding('UTF-8')
fid = fopen('output.txt');
b = fread(fid,'uint8')';
fclose(fid);
a = dec2bin(b);
c = dec2hex(b);
str = native2unicode(b,'UTF-8');
disp(str);
C = textscan(str,'%s');
data = cellstr(C{1});
for i=1:length(data)
    if strfind(char(data(i)), 'عدد')==1
        fprintf('Numero de la facture : %s\n',char(str(i+1)))
    end
end
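One detail in the loop above may matter independently of the OCR step: `str` is a character vector, so `str(i+1)` is a single character, whereas the token that follows the matched word is `data{i+1}`. The loop also runs to the last element, where `i+1` would be out of range. Below is a minimal sketch of the same search with those two points changed. It assumes the OCR text has already been decoded into `str`; here a hypothetical one-line stand-in is used, the word is built from code points, and `regexp` with `'split'` replaces `textscan` purely for brevity.

```matlab
% The word searched for in the post, built from code points (U+0639 U+062F U+062F).
word = char([1593 1583 1583]);

% Hypothetical stand-in for the decoded contents of output.txt.
str  = [word, ' 2015/02'];

% Split on whitespace into a cell array of tokens.
data = regexp(strtrim(str), '\s+', 'split');

invoice = '';
for i = 1:numel(data)-1                        % stop early so data{i+1} exists
    if strncmp(data{i}, word, numel(word))     % token starts with the word
        invoice = data{i+1};                   % next token, not str(i+1)
        fprintf('Numero de la facture : %s\n', invoice);
    end
end
```

This only helps if output.txt actually contains the Arabic word; if the OCR step emits no Arabic text, no token will match.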
Walter Roberson
on 21 Dec 2017
I had to hunt around for the Arabic training files for Tesseract; perhaps I did not find the right ones. I also got a whole bunch of messages like
Cube ERROR (ConvNetCharClassifier::RunNets): NeuralNet is NULL
The output.txt file contained only English for me.
Answers (0)