Customer Intelligence & Text Analytics | May 25, 2025 | Published project
Android Galaxy Sentiment Analysis

Customer sentiment intelligence workflow

This project analyzes Android Galaxy-related Reddit comments as a customer-intelligence workflow. It compares preprocessing choices, text representations, and classification models to understand sentiment signals and model tradeoffs on product-related discussions.

Python, Scikit-learn, NLTK, TF-IDF, Word2Vec, BiLSTM

Challenge

  • Product feedback in online discussions is noisy, imbalanced, and often short or informal.
  • Sentiment analysis requires careful preprocessing, label handling, and evaluation beyond a single model score.
  • Comparing traditional and neural approaches helps clarify which modeling choices work best for the available data.

System architecture

  • Comments: product discussions
  • Preprocessing: cleaning and balancing
  • Model comparison: traditional + neural
  • Evaluation: accuracy and F1

Data and inputs

  • 1,000 raw Reddit comments related to Android Galaxy / Samsung Galaxy S25 discussions.
  • The original labels include positive, negative, and neutral comments.
  • Neutral comments are removed to form a binary task, and the positive class is upsampled to match the negative class, giving a balanced 752/752 dataset.
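
The filtering and balancing step can be sketched as follows. This is a minimal illustration, not the project's code: the `balance_binary` helper, the row layout with `text` and `label` keys, and the toy rows are all assumptions, and only the Python standard library is used.

```python
import random
from collections import Counter


def balance_binary(rows, minority="positive", seed=42):
    """Drop neutral rows, then upsample the minority class with replacement."""
    rng = random.Random(seed)
    binary = [r for r in rows if r["label"] != "neutral"]
    minority_rows = [r for r in binary if r["label"] == minority]
    majority_rows = [r for r in binary if r["label"] != minority]
    # Draw extra minority rows (with replacement) until both classes match.
    shortfall = max(0, len(majority_rows) - len(minority_rows))
    extra = rng.choices(minority_rows, k=shortfall)
    balanced = majority_rows + minority_rows + extra
    rng.shuffle(balanced)
    return balanced


# Toy data, made up for illustration: 6 negative, 2 positive, 2 neutral.
rows = (
    [{"text": f"neg {i}", "label": "negative"} for i in range(6)]
    + [{"text": f"pos {i}", "label": "positive"} for i in range(2)]
    + [{"text": f"neu {i}", "label": "neutral"} for i in range(2)]
)
balanced = balance_binary(rows)
counts = Counter(r["label"] for r in balanced)
print(counts)
```

The page does not say whether upsampling happened before or after the train/test split. Because upsampling duplicates rows, applying it only to the training split is one way to keep copies of the same comment out of the test set.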

Technical approach

  • Clean text by removing HTML, lowercasing, tokenizing, removing stopwords, and lemmatizing.
  • Compare Bag of Words, TF-IDF, Word2Vec, and a neural embedding approach.
  • Train Logistic Regression, Support Vector Classifier, Random Forest, and Bidirectional LSTM models.
  • Review confusion matrices and class-level metrics to understand model behavior.
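
The cleaning steps listed above can be sketched with the standard library alone. The project uses NLTK; here a tiny hand-written stopword set stands in for NLTK's list, and lemmatization is an optional hook so the sketch runs without downloading WordNet data. The `clean_comment` name and the sample comment are assumptions made for illustration.

```python
import html
import re

# Tiny stand-in stopword list; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "or", "to", "of", "i", "my"}

TAG_RE = re.compile(r"<[^>]+>")        # matches HTML tags such as <p> or </b>
TOKEN_RE = re.compile(r"[a-z0-9']+")   # simple word tokenizer for lowercased text


def clean_comment(text, lemmatize=None):
    """Strip HTML, lowercase, tokenize, drop stopwords, optionally lemmatize."""
    text = TAG_RE.sub(" ", text)       # remove tags
    text = html.unescape(text)         # decode entities such as &amp;
    tokens = TOKEN_RE.findall(text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    if lemmatize is not None:
        # Hook for a real lemmatizer, e.g. NLTK's WordNetLemmatizer().lemmatize
        tokens = [lemmatize(t) for t in tokens]
    return tokens


print(clean_comment("<p>The S25 camera is <b>AMAZING</b> &amp; fast</p>"))
# ['s25', 'camera', 'amazing', 'fast']
```

Keeping lemmatization as a callable makes it easy to compare runs with and without it, which fits the project's goal of comparing preprocessing choices.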

Evaluation and results

Key indicators

  • 1,000 raw Reddit comments
  • 752 positive / 752 negative after balancing
  • SVC + TF-IDF F1-score: 97.07%

  • SVC with TF-IDF achieved the strongest reported traditional result with 97.12% accuracy and 97.07% F1-score.
  • Random Forest with Word2Vec performed strongly with a 94.92% F1-score.
  • The BiLSTM was competitive but did not outperform the best traditional approach on this dataset.
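
A setup like the SVC with TF-IDF configuration could look roughly like the sketch below, using scikit-learn. The linear kernel, the 75/25 stratified split, the default vectorizer settings, and the toy comments are all assumptions, since the page does not state them. The scores this toy run produces are unrelated to the reported 97.12% accuracy and 97.07% F1-score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy comments; 1 = positive, 0 = negative.
texts = [
    "love the new camera", "battery life is great", "screen looks amazing",
    "fast and smooth phone", "great value upgrade", "really happy with it",
    "best galaxy so far", "awesome display quality",
    "battery drains too fast", "camera is terrible", "phone keeps overheating",
    "worst update ever", "screen cracked easily", "very disappointed with it",
    "too expensive and slow", "awful software bugs",
]
labels = [1] * 8 + [0] * 8

# Hold out a stratified test split so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# TF-IDF features feed a support vector classifier (kernel is an assumption).
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, zero_division=0)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true, cols: predicted

print(f"accuracy={accuracy:.2f} f1={f1:.2f}")
print(cm)
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is fit on the training split only, and swapping in Logistic Regression or Random Forest for comparison changes a single line.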

Implementation and code

Implementation focus

The implementation links data preparation, modeling, evaluation, and interpretation in one structured workflow, so each technical decision can be traced from input to result.

Source code

The code is available for exploring the implementation details and extending the experiment when needed.


Scope and responsible use

The project focuses on language-data modeling and evaluation. Broader use would require domain-specific validation, edge-case assessment, monitoring, and testing on fresh data.

Future development

  • Evaluate transformer-based sentiment models.
  • Extend the workflow to multilingual or cross-product sentiment analysis.
  • Add explainability tools for reviewing influential text signals.

Technical contribution

The project demonstrates customer-intelligence modeling: turning informal user comments into structured sentiment signals, comparing representation strategies, and evaluating tradeoffs across model families.