How to extract Arabic language words from list of different language

AlaaJ

New Member
Joined
Jun 21, 2019
Messages
4
Hi

I have a sheet with a column of list of words in different languages I want to get only Arabic words and put them in different row
For example:
input
<style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 10.0px 'Helvetica Neue'; color: #000000}span.s1 {font-kerning: none}span.s2 {font: 10.0px 'Geeza Pro'; font-kerning: none}</style>1- Sharifah,Sharīfah,shryft,شريفة
<style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 10.0px 'Helvetica Neue'; color: #000000}span.s1 {font-kerning: none}span.s2 {font: 10.0px 'Geeza Pro'; font-kerning: none}</style>2- Umm al `Ulaymat,Umm al ‘Ulaymāt,`Awayid,am almymat,أم الميمات,عوايد,‘Awāyid
<style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 10.0px 'Helvetica Neue'; color: #000000}span.s1 {font-kerning: none}span.s2 {font: 10.0px 'Geeza Pro'; font-kerning: none}</style>3- Al Jamaliyah,[FONT=&quot]جمالية[/FONT],Al Jamālīyah,Jamaliyah,Jamālīyah,aljmalyt,jmalyt,الجمالية

output:
row 1: [FONT=&quot]شريفة
row 2: [/FONT]أم الميمات[FONT=&quot]
row 3: [/FONT][FONT=&quot]عوايد
[/FONT][FONT=&quot]row 4: [/FONT][FONT=&quot]الجمالية[/FONT][FONT=&quot]
row 5: [/FONT][FONT=&quot]جمالية

[/FONT][FONT=&quot]Thank you in advance

[/FONT]
 

Excel Facts

Last used cell?
Press Ctrl+End to move to what Excel thinks is the last used cell.
Hi

A simple solution is to write a vba snippet and use the unicode codes to get the Arabic characters.
A quick question in the google tells me that the Arabic alphabet is characters hex 600 to 6FF and so you just loop through the characters and if they are in that range you extract them.
If you know Arabic you'll know better than I if these characters are enough or if you need some others.


The only difficulty I see is what you have in row 2. In all other rows you have extracted just one word but in row 2 you extracted 2 words. I'm sure that that will make sense to someone knowing Arabic but you'll have to include that kind of exceptions in the code.
Or maybe it's just because the text you want to extract is always between commas?
 
Upvote 0
This is a quick test to extract Arabic text between 2 commas, like in your example.

Place the text in your example A2:A4 and the code extracts the Arabic text to B2, down.

Code:
Sub GetArabic()
Dim regexMatches As Object
Dim r As Range
Dim s As String
Dim j As Long

Set r = Range("A2:A4")
s = "," & Join(Application.Transpose(r), ",") & ","

With CreateObject("VBScript.RegExp")
    .Pattern = ",[ \u0600-\u06FF]+(?=,)"
    .Global = True
    Set regexMatches = .Execute(s)
End With

' write the result
For j = 0 To regexMatches.Count - 1
    r(1).Offset(j, 1) = Mid(regexMatches(j).Value, 2)
Next j

End Sub
 
Upvote 0
another way:

screenshot-84.png


PowerQuery aka Get&Transform

Code:
[SIZE=1]// Table1
let
    C2R = List.Transform({0..300}, each Character.FromNumber(_)),
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    Split = Table.ExpandListColumn(Table.TransformColumns(Source, {{"raw", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "raw"),
    Replace = Table.ReplaceValue(Split,"‘","",Replacer.ReplaceText,{"raw"}),
    Arabic = Table.AddColumn(Replace, "Arabic", each Text.Trim([raw],C2R)),
    Filter = Table.SelectRows(Arabic, each ([Arabic] <> "")),
    ROC = Table.SelectColumns(Filter,{"Arabic"})
in
    ROC[/SIZE]
 
Last edited:
Upvote 0
Hi sandy666

Great to see an M solution. Always good to have different approaches.

Just a remark

I have a sheet with a column of list of words in different languages I want to get only Arabic words


If I understand your code you assumed that the text uses the latin alphabet.
Maybe the OP has texts written with other alphabets or with ideograms.
I tried with the more usual ones like Russian and Greek and it did not work.
Even with the Latin alphabet it will sometimes not work, for ex. try the Polish word może (maybe). The letter ż will not be correctly processed because you assumed only until 300.
I did not try with Chinese or others but I guess the problem will be the same.

Maybe the OP will not need it, but I wanted to give you a heads up, you might want to tweak the code.
 
Upvote 0
in the past I did everything with predicting "what if" but it turned out to be a bad method, so now I only do what I see in the first post, if OP shows a representative example, I will do it, if not I don't.

all depends on the definition of C2R

edit:
and no, there is not Latin only :)
 
Last edited:
Upvote 0
but it works with Polish, Russian and Greek text

screenshot-86.png
 
Last edited:
Upvote 0
Seems great.

Looking at your code didn't seem that just by expanding C2R to 1568, like you said in post #8 , would be enough to catch languages like Chinese or Japanese.
I'll give it a go when I have the time.
 
Upvote 0

Forum statistics

Threads
1,213,497
Messages
6,113,999
Members
448,543
Latest member
MartinLarkin

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top