Customer Intelligence & Text Analytics · May 4, 2025 · Published project
Arabic Sentiment AraVec LSTM

Arabic sentiment-analysis and sequence-modeling workflow

This project studies Arabic tweet representation and sentiment classification. It compares averaged pretrained word embeddings with a sequence-aware model to understand how representation choices affect Arabic sentiment prediction.

Python · Arabic NLP · AraVec · ANN · LSTM · Embeddings · PCA

Challenge

  • Arabic social text requires language-aware preprocessing and representation choices.
  • Averaging word vectors can lose word-order information that matters for sentiment.
  • Comparing a baseline model with a sequence-aware model helps clarify the value of contextual ordering.
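
The word-order point above can be illustrated with a small sketch. It is not the project's code: the 2-dimensional vectors are made-up values standing in for AraVec embeddings, the English tokens are placeholders for Arabic ones, and `average_vector` is a hypothetical helper name. Two token sequences with opposite meanings but the same words end up with the same averaged vector.

```python
# Hypothetical 2-dimensional vectors standing in for pretrained embeddings.
# English placeholder tokens are used for readability; real inputs would be Arabic tokens.
TOY_VECTORS = {
    "not": [-1.0, 0.0],
    "good": [1.0, 1.0],
    "bad": [-1.0, -1.0],
    "but": [0.0, 0.5],
}


def average_vector(tokens, vectors, dim):
    """Return the element-wise mean of the vectors of known tokens.

    Tokens missing from `vectors` are skipped; if none are known,
    a zero vector of length `dim` is returned.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Same words, different order, opposite sentiment.
tweet_a = ["not", "good", "but", "bad"]
tweet_b = ["not", "bad", "but", "good"]

vec_a = average_vector(tweet_a, TOY_VECTORS, dim=2)
vec_b = average_vector(tweet_b, TOY_VECTORS, dim=2)

print(vec_a == vec_b)      # True: averaging cannot tell the two tweets apart
print(tweet_a == tweet_b)  # False: the token sequences themselves differ
```

A model that consumes the token sequence directly still sees the difference between the two inputs, which is the motivation for adding a sequence-aware model to the comparison.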

System architecture

Arabic tweets: positive + negative
Representation: AraVec embeddings
Models: ANN + LSTM
Evaluation: accuracy + F1

Data and inputs

  • 5,000 Arabic tweets balanced across positive and negative sentiment.
  • 2,500 positive and 2,500 negative examples.
  • Train/test split of 4,000 / 1,000 tweets.
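
A 4,000 / 1,000 split of a balanced 5,000-tweet set could be produced as in the sketch below. The source does not say whether the split was stratified or which seed was used, so the per-class sampling, the 20% test fraction expressed as a parameter, the seed value, and the function name `stratified_split` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with class proportions preserved.

    Assumption: the split is stratified by label. The seed is arbitrary.
    """
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# 2,500 positive (1) and 2,500 negative (0) labels, matching the dataset sizes above.
labels = [1] * 2500 + [0] * 2500
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))       # 4000 1000
print(sum(labels[i] for i in test_idx))    # 500 positive tweets in the test set
```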

Technical approach

  • Load AraVec Twitter CBOW embeddings for Arabic word representation.
  • Build tweet-level vectors through averaged embeddings for the ANN baseline.
  • Train an LSTM model using learned sequence embeddings and a fixed sequence length.
  • Review model comparison, class F1-scores, semantic similarity examples, and embedding visualization.
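
The fixed sequence length used for the LSTM input can be sketched as follows. This is a minimal illustration, not the project's implementation: the reserved ids, the choice to pad and truncate at the end of the sequence, the example sentences, and the helper names `build_vocab` and `encode_and_pad` are assumptions. The source does not state the actual sequence length or padding side.

```python
PAD_ID = 0  # reserved id for padding positions (assumption)
OOV_ID = 1  # reserved id for tokens not seen in training (assumption)


def build_vocab(tokenized_tweets):
    """Map each distinct training token to an integer id, starting at 2."""
    vocab = {}
    for tokens in tokenized_tweets:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 2
    return vocab


def encode_and_pad(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, then right-pad with PAD_ID."""
    ids = [vocab.get(tok, OOV_ID) for tok in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Illustrative tokenized tweets: "the service is very excellent", "the service is bad".
train_tweets = [
    ["الخدمة", "ممتازة", "جدا"],
    ["الخدمة", "سيئة"],
]
vocab = build_vocab(train_tweets)

short = encode_and_pad(["الخدمة", "سيئة"], vocab, max_len=4)
unseen = encode_and_pad(["الخدمة", "رائعة", "جدا"], vocab, max_len=4)

print(short)   # [2, 5, 0, 0]: two known tokens followed by padding
print(unseen)  # [2, 1, 4, 0]: the unseen middle token maps to OOV_ID
```

Every tweet becomes an integer sequence of the same length, which is the shape an embedding layer followed by an LSTM typically expects. Unlike the averaged representation, token order is preserved.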

Evaluation and results

Key indicators

  • 5,000 Arabic tweets
  • 2,500 positive / 2,500 negative
  • LSTM accuracy: 83%
  • ANN baseline accuracy: about 68%

  • The LSTM model reached 83% accuracy in the reported evaluation.
  • The ANN baseline using averaged AraVec vectors reached about 68% accuracy.
  • The gap of roughly 15 percentage points suggests that sequence-aware modeling better captures sentiment when word order and context matter.
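
Accuracy and per-class F1, the two metrics named in this project, can be computed as in the sketch below. The label lists are toy values chosen for illustration only and are not the project's predictions; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy labels (1 = positive, 0 = negative); not the project's data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

print(accuracy(y_true, y_pred))                   # 0.625
print(round(f1_for_class(y_true, y_pred, 1), 3))  # 0.667
print(round(f1_for_class(y_true, y_pred, 0), 3))  # 0.571
```

Reporting F1 per class alongside accuracy shows whether a model's errors are concentrated in one sentiment, which accuracy alone does not reveal.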

Implementation and code

Implementation focus

The implementation connects data preparation, modeling, evaluation, and interpretation in a structured workflow that makes the technical decisions clear.

Source code

The code is available for exploring the implementation details and extending the experiment when needed.


Scope and responsible use

The project focuses on language-data modeling and evaluation. Broader use would require domain-specific validation, edge-case assessment, monitoring, and testing on fresh data.

Future development

  • Evaluate modern Arabic transformer models as stronger baselines.
  • Expand testing across dialects, domains, and informal text styles.
  • Add explainability tools for influential tokens and sentiment cues.

Technical contribution

The project demonstrates Arabic-language modeling practice: representing text with pretrained embeddings, comparing baseline and sequence-aware models, and interpreting why sequence structure matters for sentiment analysis.