Arabic Stop words: list and routines
Description
Determining stop words is not easy, and the right set differs from one use case to another. For this purpose, we propose a classified list that developers can parameterize.
The word list contains only words in their common forms; all other forms are generated by a script.
It can be used as a library (see the section "Arabic Stopwords Library").
Files
data/: stop word data
data/classified/stopwords.ods: data in LibreOffice format with more detailed information and classified stop words
docs: documentation files
scripts: scripts used to generate all forms and file formats
## Data Structure
All forms data .ODS/CSV file
- 1st field: unvocalised word (في)
- 2nd field: unvocalised stemmed word with "-" between affixes, e.g. ف-ب-خمسين-ي

Minimal classified data .ODS/CSV file
- 1st field: unvocalised word (في)
- 2nd field: type of the word, e.g. حرف
- 3rd field: class of the word, e.g. preposition
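As a rough illustration of the two layouts above, the sketch below parses one row of each kind. The tab delimiter, the function names, and the sample rows are assumptions made for this example (the first field of the all-forms row is simply the segmented example with its hyphens removed); check the files in data/ for the real delimiter and contents.

```python
import csv
import io

def parse_all_forms_row(row):
    """Split an all-forms row into the word and its hyphen-separated segments."""
    return {"word": row[0], "segments": row[1].split("-")}

def parse_classified_row(row):
    """Map a minimal classified row to named fields."""
    return {"word": row[0], "type": row[1], "class": row[2]}

# Hypothetical sample rows built from the examples in this section.
segmented_example = "ف-ب-خمسين-ي"
surface_example = segmented_example.replace("-", "")  # assumption, for illustration only
all_forms_text = surface_example + "\t" + segmented_example + "\n"
classified_text = "في\tحرف\tpreposition\n"

# The tab delimiter is an assumption; adjust it to match the real export.
all_forms = [
    parse_all_forms_row(r)
    for r in csv.reader(io.StringIO(all_forms_text), delimiter="\t")
]
classified = [
    parse_classified_row(r)
    for r in csv.reader(io.StringIO(classified_text), delimiter="\t")
]

print(all_forms[0]["segments"])
print(classified[0])
```

Keeping the segments as a list leaves it to the caller to decide which segment is the stem, since the section does not say how prefixes and suffixes are distinguished.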
How to customize stop word list
check the minimal form data file (stopwords.csv)
comment out with "#" every word you don't need
run
make
collect the script output from the releases folder.
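To make the commenting step concrete, here is a minimal sketch of how lines marked with "#" could be skipped when reading the data file. It is not the project's generation script; the function name `active_lines` and the sample rows are hypothetical, and the real scripts may treat comments differently.

```python
def active_lines(text):
    """Return the lines that are neither blank nor commented out with '#'."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            kept.append(stripped)
    return kept

# Hypothetical file content: the second row is commented out, so it is skipped.
sample = "في\tحرف\tpreposition\n#على\tحرف\tpreposition\n"
kept_rows = active_lines(sample)
print(len(kept_rows))
```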
How to update data
check that the word doesn't already exist in the minimal form data file (classified/stopwords.ods)
add affixation information
run
make
collect the script output from the releases folder.
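The existence check in the first step can be sketched as below, assuming the classified data has been exported to delimited text. The function name `word_exists`, the tab delimiter, and the sample rows are assumptions for illustration, not part of the project.

```python
import csv
import io

def word_exists(csv_text, word, delimiter="\t"):
    """Return True if `word` is the first field of any non-comment row."""
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        first = row[0].strip()
        if first.startswith("#"):
            continue  # skip rows commented out with '#'
        if first == word:
            return True
    return False

# Hypothetical exported content, using the field examples from this document.
exported = "في\tحرف\tpreposition\n"
print(word_exists(exported, "في"))
```

This sketch treats commented rows as absent; whether a commented word should count as already present is a choice left to the maintainer.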
Arabic Stopwords Library
install
pip install arabicstopwords
usage
test if a word is a stop word
>>> import arabicstopwords.arabicstopwords as stp
>>> # test if a word is a stop word
>>> stp.is_stop(u'ممكن')
False
>>> stp.is_stop(u'منكم')
True
stem a stopword
```python
>>> word = u"لعلهم"
>>> stp.stop_stem(word)
u'لعل'
```
list all stop words
```python
>>> stp.stopwords_list()
......
>>> len(stp.stopwords_list())
13629
>>> len(stp.classed_stopwords_list())
507
```
give all forms of a stopword
>>> stp.stopword_forms(u"على")
....
>>> len(stp.stopword_forms(u"على"))
144
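A common use of `is_stop` is to filter tokens. The sketch below is one possible way to do that; `remove_stopwords` is a hypothetical helper, not part of the library, and it takes the predicate as a parameter so the example runs without the package installed. The stand-in predicate mirrors the `is_stop` results shown above.

```python
def remove_stopwords(tokens, is_stop):
    """Return the tokens for which the `is_stop` predicate is false."""
    return [tok for tok in tokens if not is_stop(tok)]

# Stand-in predicate (assumption) so the sketch is self-contained.
# With the package installed, pass stp.is_stop instead.
demo_stop_set = {"منكم"}

def demo_is_stop(word):
    return word in demo_stop_set

tokens = ["ممكن", "منكم"]
filtered = remove_stopwords(tokens, demo_is_stop)
print(filtered)
```

With the library available, the call would be `remove_stopwords(tokens, stp.is_stop)`, using the same import as in the usage examples.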
Hashes for Arabic_Stopwords-0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 942952615a4027aaa81914d9d678032f32f721c6989eb1d292aa1200720f3aa6
MD5 | f2210204ef2d9aa39dcfcd75cc951e12
BLAKE2b-256 | 7c9e40ee9b10f98b23b32bb7ca3f229ae78ae4379ebcb03cbb7b9e7399686ad8
Hashes for Arabic_Stopwords-0.3-py2-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7ca967624abd0fff99048d6f32abdae7c884b8999bf316a3fc05f92bdb7b492c
MD5 | c3f202d99dbb6e887769b68c676bafa0
BLAKE2b-256 | 66f9a85a93b0e43804c8f598556a718a5ba57e43d044eea9e3ea5f39b963db85