
mfscpayload-690/commons-depicts-analyzer


Commons Depicts Analyzer


Analyze Wikimedia Commons categories for depicts (P180) metadata coverage.



Overview

Structured metadata is critical for the discoverability and reusability of media on Wikimedia Commons. The Commons Depicts Analyzer audits the files within a given category and identifies those that lack "depicts" (P180) statements. It pairs a Flask backend for data retrieval and analysis with an interactive frontend for visualization and reporting.
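As a rough illustration of the audit described above, the sketch below checks already-fetched entity data for P180 statements. It assumes each file's structured data arrives as a dictionary with a "statements" mapping keyed by property ID; the function names and the sample data are hypothetical and are not taken from this project's code.

```python
from typing import Any, Dict, List

DEPICTS = "P180"  # Wikidata property ID for "depicts"


def has_depicts(entity: Dict[str, Any]) -> bool:
    """Return True if the entity carries at least one P180 statement."""
    statements = entity.get("statements") or {}
    # Assumption: an entity with no statements may serialize them as an
    # empty list instead of a mapping, so anything that is not a dict
    # is treated as "no statements".
    if not isinstance(statements, dict):
        return False
    return len(statements.get(DEPICTS, [])) > 0


def files_missing_depicts(entities: Dict[str, Dict[str, Any]]) -> List[str]:
    """List the entity IDs that have no depicts statement, sorted by ID."""
    return sorted(eid for eid, ent in entities.items() if not has_depicts(ent))


# Hypothetical sample data shaped like the assumption described above.
sample = {
    "M1": {"statements": {"P180": [{"mainsnak": {"datavalue": {"value": {"id": "Q146"}}}}]}},
    "M2": {"statements": []},
    "M3": {"statements": {"P170": [{"mainsnak": {}}]}},
}

missing = files_missing_depicts(sample)
```

Here `missing` holds the IDs of the two sample entities that have no P180 entry.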

The application is built with a focus on data integrity, user privacy, and security: it authenticates through OAuth 2.0 and keeps session data on the server rather than in client-side cookies.


Features

  • Categorical Analysis: Systematically fetches and audits all files within a specified Commons category.
  • Coverage Visualization: Real-time statistical analysis of metadata coverage with interactive charts.
  • OAuth 2.0 Authentication: Secure integration with Wikimedia accounts for authenticated operations.
  • Suggestions Engine: Suggests relevant Wikidata items for files based on title analysis and context.
  • Interactive Dashboard: Sortable and filterable results interfaces with pagination support.
  • Progress Tracking: Real-time progress monitoring with job cancellation support for long-running analyses.
  • Export Capabilities: Export analysis results in CSV or JSON formats for further processing.
  • Customizable Interface: Adjustable text size, width, and color themes (including a beta dark mode).
  • Responsive Design: Mobile-friendly interface optimized for various screen sizes.
  • Batch Operations: Add depicts statements to multiple files efficiently.
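The coverage and export features listed above can be pictured with a small sketch. It assumes each analyzed file is a dictionary with a "title" and a list of "depicts" QIDs; that record shape, the column names, and the function names are illustrative assumptions, not the project's actual schema.

```python
import csv
import io
from typing import Dict, List


def coverage_summary(files: List[dict]) -> Dict[str, float]:
    """Count files with and without depicts and compute the coverage percentage."""
    total = len(files)
    with_depicts = sum(1 for f in files if f["depicts"])
    pct = round(100.0 * with_depicts / total, 1) if total else 0.0
    return {
        "total": total,
        "with_depicts": with_depicts,
        "missing": total - with_depicts,
        "coverage_pct": pct,
    }


def to_csv(files: List[dict]) -> str:
    """Serialize the records to CSV, joining multiple QIDs with ';'."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["title", "depicts"])
    for f in files:
        writer.writerow([f["title"], ";".join(f["depicts"])])
    return buf.getvalue()


# Hypothetical records for illustration only.
rows = [
    {"title": "File:A.jpg", "depicts": ["Q146"]},
    {"title": "File:B.jpg", "depicts": []},
    {"title": "File:C.jpg", "depicts": ["Q144", "Q146"]},
]

summary = coverage_summary(rows)
csv_text = to_csv(rows)
```

A JSON export of the same records would only need `json.dumps(rows)`; the CSV path is shown because multi-valued depicts lists need a joining convention.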

Architecture

The application follows a modular architecture:

Backend

  • Core: Python 3.8+ with Flask.
  • Security: Server-side session management (Flask-Session), CSRF protection, and strictly enforced rate limiting (Flask-Limiter).
  • Database: SQLite for lightweight, reliable data persistence with connection pooling for improved performance.
  • Performance: Multi-threaded batch processing using ThreadPoolExecutor for efficient parallel operations.
  • API Integration: Direct interaction with MediaWiki and Wikidata APIs.
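The multi-threaded batch processing mentioned above might look roughly like the following. This is a minimal sketch using `ThreadPoolExecutor`; the batch size, worker count, and the `fetch_batch` callable (stubbed here by `fake_fetch` so the example runs offline) are assumptions, not the project's actual values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(
    titles: List[str],
    fetch_batch: Callable[[List[str]], Dict[str, bool]],
    batch_size: int = 50,
    max_workers: int = 4,
) -> Dict[str, bool]:
    """Run fetch_batch over each chunk in parallel and merge the results."""
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises worker exceptions here.
        for partial in pool.map(fetch_batch, chunked(titles, batch_size)):
            results.update(partial)
    return results


def fake_fetch(batch: List[str]) -> Dict[str, bool]:
    """Stand-in for a real API call: pretend only 'File:1.jpg' has depicts."""
    return {title: title.endswith("1.jpg") for title in batch}


titles = [f"File:{i}.jpg" for i in range(7)]
results = process_in_batches(titles, fake_fetch, batch_size=3)
```

Merging results in the calling thread keeps the shared dictionary free of concurrent writes, so no lock is needed.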

Frontend

  • Framework: Semantic HTML5 and CSS3 (custom design system).
  • Interactivity: Vanilla JavaScript (ES6+) for performant client-side logic.
  • Visualization: Chart.js for interactive statistical charts and graphs.
  • Design: Wikipedia-inspired aesthetic with high contrast and accessibility focus.

Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)
  • A Wikimedia account (for OAuth configuration)

Quick Setup

  1. Clone the repository

    git clone https://github.com/mfscpayload-690/commons-depicts-analyzer.git
    cd commons-depicts-analyzer

  2. Install dependencies

    pip install -r requirements.txt

  3. Configure OAuth (Required for editing features)

    Option A: Automated Setup (Recommended)

    python setup_oauth.py

    The script will guide you through:

    • Registering an OAuth application with Wikimedia
    • Setting your Client ID and Secret
    • Generating a secure Flask secret key
    • Creating your .env file automatically

    Option B: Manual Setup

    a. Register an OAuth application:

    b. Create a .env file in the project root:

    cp .env.example .env

    c. Edit .env and add your credentials:

    OAUTH_CLIENT_ID=your_client_id_here
    OAUTH_CLIENT_SECRET=your_client_secret_here
    OAUTH_CALLBACK_URL=http://localhost:5000/auth/callback
    FLASK_SECRET_KEY=your_generated_secret_key_here

    Generate a Flask secret key:

    python -c "import secrets; print(secrets.token_hex(32))"

    Note: In development mode, FLASK_SECRET_KEY is generated automatically on first run and saved to a local .dev_secret file (accessible only by its owner, excluded from git). Set it manually in .env only for production deployments or when configuring OAuth.

  4. Run the application

    python backend/main.py

The application will be accessible at http://localhost:5000.

Note: OAuth is only required if you want to add depicts statements through the UI. The analysis features work without OAuth.


Security

This project adheres to strict security standards to protect user data and maintain service integrity.

Authentication & Sessions

  • Server-Side Sessions: User sessions are stored securely on the server filesystem (directory restricted to 0o700), not in client-side cookies.
  • OAuth 2.0: Standard authorization code flow for secure third-party authentication with Wikimedia.
  • Token Handling: Access tokens are stored server-side and never exposed to browser JavaScript.
  • Session Persistence: In development, the Flask secret key is persisted to .dev_secret (chmod 600) so sessions survive server restarts.
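The session persistence behavior described above could be implemented along these lines. This is a sketch under stated assumptions: the function name is hypothetical, and the real application may create and read .dev_secret differently.

```python
import os
import secrets
import tempfile
from pathlib import Path


def load_or_create_secret(path: Path) -> str:
    """Return the persisted secret key, generating it on first use."""
    if path.exists():
        return path.read_text(encoding="utf-8").strip()
    key = secrets.token_hex(32)  # 32 random bytes, hex-encoded to 64 characters
    # Request mode 0o600 at creation time so the file is never briefly
    # readable by other users; O_EXCL refuses to overwrite an existing file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(key)
    return key


# Demonstration in a throwaway directory so nothing touches a real project.
with tempfile.TemporaryDirectory() as tmp:
    secret_path = Path(tmp) / ".dev_secret"
    first = load_or_create_secret(secret_path)
    second = load_or_create_secret(secret_path)
```

Because the second call reads the file written by the first, both calls return the same key, which is what lets sessions survive a restart.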

Protection Measures

  • Strict CSP: The Content Security Policy disallows unsafe-inline in script-src, so injected scripts are blocked by the browser even if they reach the page.
  • Event Delegation: All frontend interactivity uses data-action attributes with delegated listeners instead of inline onclick= handlers, compatible with the strict CSP.
  • Rate Limiting: API endpoints are protected against abuse with per-IP rate limits (login: 5/min, write ops: 30/min, default: 200/hour).
  • CSRF Protection: State-changing requests require a cryptographic CSRF token in the X-CSRF-Token header. OAuth flow uses a random state parameter with constant-time comparison.
  • Input Validation: All user inputs (category names, file titles, QIDs, language codes) are validated against strict whitelists before use.
  • Request Size Limit: Request bodies over 5 MB are rejected immediately with a 413 error.
  • HTTPS Enforcement: In production, all HTTP traffic is permanently redirected to HTTPS.
  • Security Headers: Every response includes CSP, HSTS, X-Frame-Options: DENY, X-Content-Type-Options: nosniff, and Referrer-Policy.
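Two of the measures above, whitelist validation and constant-time token comparison, can be sketched with the standard library alone. The QID pattern and the function names below are assumptions for illustration; the project's actual validation rules may be stricter or differ in detail.

```python
import hmac
import re
import secrets

# Assumed whitelist pattern: "Q" followed by a positive integer, no leading zero.
QID_RE = re.compile(r"Q[1-9][0-9]*")


def is_valid_qid(value: str) -> bool:
    """Accept only strings that match the QID pattern in full."""
    return QID_RE.fullmatch(value) is not None


def tokens_match(expected: str, supplied: str) -> bool:
    """Compare two tokens in constant time to avoid leaking timing information."""
    return hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8"))


# A random token such as one issued for CSRF protection or the OAuth state.
token = secrets.token_hex(16)

ok = tokens_match(token, token)
tampered = tokens_match(token, token[:-1] + "x")  # "x" is never a hex digit
```

Using `fullmatch` rather than `match` matters here: it rejects inputs such as "Q42; DROP" that merely start with a valid QID.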

Database

  • WAL Mode: SQLite runs in Write-Ahead Logging mode so readers and writers do not block each other under concurrent load.
  • Busy Timeout: A 5-second busy timeout prevents immediate failures under write contention.
  • Absolute Path: The database path is resolved to an absolute path on startup, regardless of the working directory.
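The three database settings above map onto a few lines of `sqlite3` setup. The sketch below is one way to apply them; the `open_db` helper and the analysis.db filename are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_db(path: str) -> sqlite3.Connection:
    """Open a SQLite database with WAL mode and a 5-second busy timeout."""
    abs_path = os.path.abspath(path)  # independent of later working-directory changes
    conn = sqlite3.connect(abs_path)
    conn.execute("PRAGMA journal_mode=WAL")   # readers and writers do not block each other
    conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5000 ms on a locked database
    return conn


# Demonstration against a throwaway file (WAL needs an on-disk database).
with tempfile.TemporaryDirectory() as tmp:
    conn = open_db(os.path.join(tmp, "analysis.db"))
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    timeout_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]
    conn.close()
```

The journal mode is stored in the database file, so it persists across connections, whereas the busy timeout is a per-connection setting and has to be applied every time a connection is opened.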

API Reference

The backend exposes a RESTful API for automation and integration.

Core Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/analyze | Initiates analysis for a specific category. |
| GET | /api/results/<category> | Retrieves cached analysis results. |
| GET | /api/history | Lists all previously analyzed categories. |
| POST | /api/add-depicts | (Auth Required) Adds a P180 statement to a file. |
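A client for the core endpoints might build its requests as follows. This sketch only constructs the requests and sends nothing. The JSON field name "category", the base URL constant, and the example category name are assumptions, since the request schema is not specified here; the server may also expect the X-CSRF-Token header on POST requests.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # default local address from the setup steps


def analyze_request(category: str) -> Request:
    """Build (but do not send) a POST request that starts an analysis."""
    # Assumption: the endpoint accepts a JSON body with a "category" field.
    body = json.dumps({"category": category}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def results_url(category: str) -> str:
    """Build the URL for cached results, percent-encoding the category name."""
    return f"{BASE_URL}/api/results/{quote(category, safe='')}"


req = analyze_request("Cats of Kerala")  # hypothetical category name
url = results_url("Cats of Kerala")
```

Passing `safe=''` to `quote` also encodes slashes, which keeps a category name containing "/" from being split into extra path segments.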

Progress & Job Management

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/progress/<job_id> | Get progress status of a background analysis job. |
| POST | /api/cancel/<job_id> | Cancel a currently running analysis job. |
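Polling the progress endpoint could follow a loop like the one below. The response shape (a "status" field with the values shown) is an assumption, and `fetch_progress` is a caller-supplied function, stubbed here so the sketch runs without a server.

```python
import time
from typing import Callable, Dict

# Assumed status values; the real API may use different names.
TERMINAL_STATES = {"completed", "cancelled", "failed"}


def wait_for_job(
    job_id: str,
    fetch_progress: Callable[[str], Dict],
    interval: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal state or the poll budget runs out."""
    for _ in range(max_polls):
        state = fetch_progress(job_id)
        if state.get("status") in TERMINAL_STATES:
            return state
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Stubbed progress responses standing in for GET /api/progress/<job_id>.
states = iter([
    {"status": "running", "processed": 1},
    {"status": "running", "processed": 2},
    {"status": "completed", "processed": 3},
])

final = wait_for_job("job-1", lambda job_id: next(states), sleep=lambda s: None)
```

Bounding the number of polls keeps a client from waiting forever on a job that was lost; a real client would typically call the cancel endpoint before giving up.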

Category & File Operations

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/suggest | Suggest Commons categories by prefix (autocomplete). |
| GET | /api/verify/<category> | Verify if a category exists on Wikimedia Commons. |
| DELETE | /api/category/<category> | Delete cached analysis data for a category. |
| GET | /api/export/<category> | Export analysis results in CSV or JSON format (`?format=csv |
| GET | /api/fileinfo/<file_title> | Get detailed information about a specific file. |
| GET | /api/suggests/<file_title> | Get Wikidata item suggestions for a specific file. |

Authentication Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /auth/status | Returns current authentication state and user context. |
| GET | /auth/login | Initiates the OAuth handshake. |
| GET | /auth/logout | Terminates the session and revokes tokens. |

Contributing

Contributions are welcome. Please make sure your pull request passes the security test suite before submitting it.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/SecureFeature)
  3. Commit your changes (git commit -m 'feat: Add SecureFeature')
  4. Push to the branch (git push origin feature/SecureFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

Developed for the Wikimedia Technical Workshop at THARANG 2K26.

Development Team

| Name | Role | GitHub |
|------|------|--------|
| Aravind Lal | Core Developer | @mfscpayload-690 |
| Abhishek H | Core Developer | @unknownguyoffline |

Documentation

| Name | Role | GitHub |
|------|------|--------|
| Aaromal V | Documentation | @Aaromal665 |
| Sreeram S Nair | Documentation | @SreeramSNair-7 |
