Problem
A typical website contains many images, such as banners, icons, illustrations, adverts, and favicons, which makes it difficult to identify the actual company logo automatically and consistently.
Solution
I built a modular Django application that retrieves website HTML, parses image elements, extracts logo candidates, filters irrelevant images, scores candidates using deterministic heuristics, ranks the results, and stores discovery history for later review.
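The staged workflow described above can be illustrated with a minimal sketch. It is not the application's actual code: the `Candidate` class, the stage function names, the blocked-keyword list, and the toy scoring rule are all assumptions made for illustration. To stay dependency-free it uses the standard library's `html.parser` in place of BeautifulSoup, and it starts from already-fetched HTML rather than calling Requests.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class Candidate:
    """One image found on the page (hypothetical structure)."""
    url: str
    alt: str = ""
    score: float = 0.0


class _ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def extract(html, base_url):
    """Parse the HTML and turn each <img> that has a src into a Candidate."""
    parser = _ImgCollector()
    parser.feed(html)
    return [
        Candidate(url=urljoin(base_url, img["src"]), alt=img.get("alt") or "")
        for img in parser.images
        if img.get("src")
    ]


def filter_irrelevant(candidates):
    """Drop obvious non-logos; the keyword list is an illustrative assumption."""
    blocked = ("banner", "advert", "favicon")
    return [c for c in candidates if not any(k in c.url.lower() for k in blocked)]


def score(candidates):
    """Toy deterministic score: reward 'logo' in the URL or the alt text."""
    for c in candidates:
        c.score = 2.0 * ("logo" in c.url.lower()) + 1.0 * ("logo" in c.alt.lower())
    return candidates


def rank(candidates):
    """Highest score first; the sort is stable, so ties keep page order."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


def discover(html, base_url):
    """Run the stages in order on already-fetched HTML."""
    candidates = extract(html, base_url)
    for stage in (filter_irrelevant, score, rank):
        candidates = stage(candidates)
    return candidates
```

Because each stage is a plain function that takes and returns a list of candidates, a stage can be tested on its own or swapped out without touching the others, which is the property the pipeline design relies on.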
Technologies Used
- Python
- Django
- BeautifulSoup
- Requests
- Pytest
- SQLite
- Pipeline Architecture
- Heuristic Scoring
- Service Layer
Engineering Highlights
- Pipeline-based architecture with independently testable stages
- End-to-end workflow covering website fetching, HTML parsing, image extraction, filtering, scoring, and ranking
- Heuristic logo scoring using technical signals, filename signals, SVG preference, dimensions, aspect ratio, alt text, and token similarity
- Django models for storing discovery requests, image candidates, and candidate scores
- Discovery history with saved result views and candidate comparison
- Pytest test suite covering pipeline components, scoring logic, service layer, Django views, and integration workflows
- Designed for maintainability, extensibility, and future AI-assisted verification
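As a rough illustration of the heuristic scoring mentioned in the highlights, the sketch below combines several of the listed signals (filename, SVG preference, alt text, dimensions, aspect ratio, and token similarity) into one weighted sum. The weights, thresholds, the `ImageInfo` class, and the choice of Jaccard similarity are all hypothetical; the project's real values and similarity measure are not stated in this write-up.

```python
import re
from dataclasses import dataclass

# Hypothetical weights; the project's real values are not given here.
WEIGHTS = {
    "filename": 3.0,
    "svg": 1.5,
    "alt": 2.0,
    "dimensions": 1.0,
    "aspect": 1.0,
    "similarity": 2.5,
}


@dataclass
class ImageInfo:
    """Illustrative candidate record; width/height of 0 mean 'unknown'."""
    url: str
    alt: str = ""
    width: int = 0
    height: int = 0


def tokens(text):
    """Lowercase alphanumeric tokens, e.g. 'Acme-Logo.svg' -> {'acme', 'logo', 'svg'}."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_similarity(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_candidate(image, company_name):
    """Return a weighted sum of simple signals, each in the range 0..1."""
    filename = image.url.rsplit("/", 1)[-1]
    signals = {
        "filename": 1.0 if "logo" in filename.lower() else 0.0,
        "svg": 1.0 if filename.lower().endswith(".svg") else 0.0,
        "alt": 1.0 if "logo" in image.alt.lower() else 0.0,
        "dimensions": 0.0,
        "aspect": 0.0,
        "similarity": token_similarity(tokens(filename), tokens(company_name)),
    }
    if image.width and image.height:
        # Assumed thresholds: logos are rarely tiny icons or full-width banners.
        signals["dimensions"] = 1.0 if 32 <= image.width <= 600 else 0.0
        ratio = image.width / image.height
        signals["aspect"] = 1.0 if 0.5 <= ratio <= 5.0 else 0.0
    return sum(WEIGHTS[name] * value for name, value in signals.items())
```

Since every signal is a pure function of the candidate's attributes, the same input always yields the same score, which keeps rankings reproducible and makes each signal straightforward to unit test.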
What I Learned
- Practised modular backend design using a deterministic data-processing pipeline
- Improved my understanding of how to separate retrieval, parsing, extraction, filtering, scoring, ranking, and orchestration into distinct responsibilities
- Used Django models and views to persist and present structured discovery results
- Strengthened pytest testing discipline across unit, service-level, view, and integration tests
- Learned how heuristic scoring can solve practical discovery problems without immediately resorting to heavy AI models
- Created a foundation for future extensions such as human review, caching, asynchronous crawling, multi-page discovery, and AI-assisted logo validation