Arabic sentiment-analysis and sequence-modeling workflow
This project studies Arabic tweet representation and sentiment classification. It compares averaged pretrained word embeddings with a sequence-aware model to understand how representation choices affect Arabic sentiment prediction.
Challenge
- Arabic social text requires language-aware preprocessing and representation choices.
- Averaging word vectors can lose word-order information that matters for sentiment.
- Comparing an order-insensitive baseline with a sequence-aware model shows how much word order and context contribute to sentiment prediction.
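The order-loss point can be illustrated with a minimal sketch. It assumes toy 3-dimensional vectors and English stand-in tokens (both hypothetical, not taken from AraVec or the project data), and the helper name `average_vector` is introduced only for this illustration. Two token sequences with opposite meanings but the same words produce the same averaged vector.

```python
import math

# Hypothetical 3-dimensional vectors standing in for pretrained embeddings.
# English tokens are used only to keep the illustration readable.
toy_vectors = {
    "not": [-0.5, 0.25, 0.0],
    "good": [1.0, 0.5, 0.25],
    "but": [0.0, 0.25, 0.5],
    "bad": [-1.0, -0.5, 0.25],
}


def average_vector(tokens, vectors, dim=3):
    """Return the mean vector of the known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Same words, different order, opposite sentiment.
negative_like = average_vector(["not", "good", "but", "bad"], toy_vectors)
positive_like = average_vector(["not", "bad", "but", "good"], toy_vectors)

same = all(math.isclose(x, y) for x, y in zip(negative_like, positive_like))
print(same)  # True: both orderings collapse to the same averaged vector
```

Any classifier fed only the averaged vector receives identical input for both sequences, which is the limitation a sequence-aware model is meant to address.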
System architecture
Data and inputs
- 5,000 Arabic tweets, balanced across sentiment: 2,500 positive and 2,500 negative examples.
- Train/test split of 4,000 / 1,000 tweets (80% / 20%).
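A split with these proportions could be produced as in the sketch below. It assumes the split is stratified by label and seeded for reproducibility, which the write-up does not state; the function name `stratified_split` and the placeholder corpus are hypothetical.

```python
import random


def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps the same share in both parts."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))

    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])

    # Shuffle again so the classes are interleaved rather than grouped.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder corpus with the same shape as the dataset: 2,500 per class.
corpus = [(f"tweet {i}", "positive") for i in range(2500)]
corpus += [(f"tweet {i}", "negative") for i in range(2500, 5000)]

train, test = stratified_split(corpus)
print(len(train), len(test))  # 4000 1000
```

Stratifying keeps the test set at 500 examples per class, so accuracy on it is not skewed by class imbalance.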
Technical approach
- Load AraVec Twitter CBOW embeddings for Arabic word representation.
- Build tweet-level vectors through averaged embeddings for the ANN baseline.
- Train an LSTM model on token sequences, with an embedding layer learned during training and every tweet padded or truncated to a fixed sequence length.
- Compare the two models on accuracy and per-class F1-scores, then inspect semantic similarity examples and an embedding visualization.
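The fixed-length input for the LSTM step can be sketched as follows. This is an illustration under assumptions, not the project's code: the reserved ids `PAD_ID` and `UNK_ID`, the helper names, the example tokens, and the choice to pad at the end of the sequence are all hypothetical, and the actual sequence length is not given in the write-up.

```python
PAD_ID = 0  # assumed id reserved for padding
UNK_ID = 1  # assumed id reserved for out-of-vocabulary tokens


def build_vocab(tokenized_tweets):
    """Map each distinct training token to an integer id, starting after the reserved ids."""
    tokens = sorted({tok for tweet in tokenized_tweets for tok in tweet})
    return {tok: idx + 2 for idx, tok in enumerate(tokens)}


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, then pad on the right with PAD_ID."""
    ids = [vocab.get(tok, UNK_ID) for tok in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical pre-tokenized tweets, used only to exercise the functions.
train_tweets = [
    ["الخدمة", "ممتازة", "جدا"],
    ["الخدمة", "سيئة"],
]
vocab = build_vocab(train_tweets)

MAX_LEN = 4  # illustrative value only
short = encode(["الخدمة", "سيئة"], vocab, MAX_LEN)
long_ = encode(["الخدمة", "ممتازة", "جدا", "جدا", "جدا"], vocab, MAX_LEN)
unseen = encode(["رائع"], vocab, MAX_LEN)

print(len(short), len(long_), len(unseen))  # 4 4 4
```

Unlike an averaged vector, the id sequence preserves token order, which the embedding layer and the LSTM can then exploit. Whether padding goes before or after the tokens is a design choice that should match the training framework's convention.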
Evaluation and results
- The LSTM model reached 83% accuracy in the reported evaluation.
- The ANN baseline using averaged AraVec vectors reached about 68% accuracy.
- The gap of roughly 15 percentage points suggests that, on this dataset, sequence-aware modeling captures sentiment better when word order and context matter.
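For reference, accuracy and per-class F1 can be computed as in the sketch below. The labels and predictions are toy values chosen for illustration, not the project's outputs, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """Harmonic mean of precision and recall, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels and predictions, not taken from the reported evaluation.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]

acc = accuracy(y_true, y_pred)
f1_pos = f1_for_class(y_true, y_pred, "pos")
f1_neg = f1_for_class(y_true, y_pred, "neg")
print(round(acc, 3), round(f1_pos, 3), round(f1_neg, 3))
```

Reporting F1 per class alongside accuracy shows whether a model's errors are concentrated in one sentiment class, which a single accuracy figure can hide even on a balanced test set.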
Implementation and code
Implementation focus
The implementation runs as a single workflow from data preparation through modeling, evaluation, and interpretation, so each technical decision can be traced to the stage where it is made.
Source code
The code is available for exploring the implementation details and extending the experiment when needed.
Scope and responsible use
The project focuses on language-data modeling and evaluation. Broader use would require domain-specific validation, edge-case assessment, monitoring, and testing on fresh data.
Future development
- Evaluate modern Arabic transformer models as stronger baselines.
- Expand testing across dialects, domains, and informal text styles.
- Add explainability tools for influential tokens and sentiment cues.
Technical contribution
The project demonstrates Arabic-language modeling practice: representing text with pretrained embeddings, comparing baseline and sequence-aware models, and interpreting why sequence structure matters for sentiment analysis.