Analyze Wikimedia Commons categories for depicts (P180) metadata coverage.
Overview • Features • Installation • Security • API Reference
Structured metadata is critical for the discoverability and reusability of media on Wikimedia Commons. The Commons Depicts Analyzer is a specialized tool designed to audit files within a specific category, identifying those that lack "depicts" (P180) statements. It provides a robust backend for data retrieval and analysis, coupled with an interactive frontend for visualization and reporting.
This application is built with a focus on data integrity, user privacy, and security, employing production-grade authentication and session management standards.
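The core audit described above can be pictured as a check on each file's structured-data entity. The following is a minimal sketch, not the project's actual code: it assumes entities shaped like the MediaInfo JSON returned by the Commons `wbgetentities` API, and the function names `has_depicts` and `coverage` are hypothetical.

```python
from typing import Any, Dict, Iterable

DEPICTS_PROPERTY = "P180"


def has_depicts(entity: Dict[str, Any]) -> bool:
    """Return True if the entity carries at least one P180 statement."""
    statements = entity.get("statements") or {}
    # An entity without statements may expose an empty list instead of a
    # mapping, so guard the type before looking up the property.
    if not isinstance(statements, dict):
        return False
    return bool(statements.get(DEPICTS_PROPERTY))


def coverage(entities: Iterable[Dict[str, Any]]) -> float:
    """Percentage of entities with a depicts statement (0.0 for no input)."""
    flags = [has_depicts(entity) for entity in entities]
    return 100.0 * sum(flags) / len(flags) if flags else 0.0
```

Files for which `has_depicts` returns `False` are the ones such an audit would report as lacking depicts metadata.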
- Categorical Analysis: Systematically fetches and audits all files within a specified Commons category.
- Coverage Visualization: Real-time statistical analysis of metadata coverage with interactive charts.
- OAuth 2.0 Authentication: Secure integration with Wikimedia accounts for authenticated operations.
- Suggestions Engine: Suggests relevant Wikidata items for files based on title analysis and context.
- Interactive Dashboard: Sortable and filterable results interfaces with pagination support.
- Progress Tracking: Real-time progress monitoring with job cancellation support for long-running analyses.
- Export Capabilities: Export analysis results in CSV or JSON formats for further processing.
- Customizable Interface: Adjustable text size, width, and color themes (including a beta dark mode).
- Responsive Design: Mobile-friendly interface optimized for various screen sizes.
- Batch Operations: Add depicts statements to multiple files efficiently.
The application follows a modular architecture:
- Core: Python 3.8+ with Flask.
- Security: Server-side session management (`Flask-Session`), CSRF protection, and strictly enforced rate limiting (Flask-Limiter).
- Database: SQLite for lightweight, reliable data persistence with connection pooling for improved performance.
- Performance: Multi-threaded batch processing using `ThreadPoolExecutor` for efficient parallel operations.
- API Integration: Direct interaction with MediaWiki and Wikidata APIs.
- Framework: Semantic HTML5 and CSS3 (custom design system).
- Interactivity: Vanilla JavaScript (ES6+) for performant client-side logic.
- Visualization: Chart.js for interactive statistical charts and graphs.
- Design: Wikipedia-inspired aesthetic with high contrast and accessibility focus.
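The multi-threaded batch processing listed above could be sketched as follows. This is an illustration under assumptions rather than the project's implementation: `chunked`, `process_in_batches`, the batch size, and the worker count are all hypothetical, and `worker` stands in for whatever function queries the APIs for one batch of files.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(
    items: Sequence[T],
    worker: Callable[[Sequence[T]], List[R]],
    batch_size: int = 50,   # assumed default, not taken from the project
    max_workers: int = 4,   # assumed default, not taken from the project
) -> List[R]:
    """Run `worker` on each batch in parallel and flatten the results."""
    batches = list(chunked(items, batch_size))
    results: List[R] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in submission order, so the output
        # lines up with the input even though batches run concurrently.
        for batch_result in pool.map(worker, batches):
            results.extend(batch_result)
    return results
```

Threads suit this workload because each batch spends most of its time waiting on network responses rather than on computation.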
- Python 3.8 or higher
- pip (Python package manager)
- A Wikimedia account (for OAuth configuration)
- Clone the repository
  `git clone https://github.com/mfscpayload-690/commons-depicts-analyzer.git`
  `cd commons-depicts-analyzer`
- Install dependencies
  `pip install -r requirements.txt`
- Configure OAuth (Required for editing features)
Option A: Automated Setup (Recommended)
`python setup_oauth.py`
The script will guide you through:
- Registering an OAuth application with Wikimedia
- Setting your Client ID and Secret
- Generating a secure Flask secret key
- Creating your `.env` file automatically
Option B: Manual Setup
a. Register an OAuth application:
- Visit: https://meta.wikimedia.org/wiki/Special:OAuthConsumerRegistration/propose/oauth2
- Application name: `Commons Depicts Analyzer (Development)`
- Callback URL: `http://localhost:5000/auth/callback`
- Grants: Check "Basic rights" and "Edit structured data"
- Copy your Client ID and Client Secret
b. Create a `.env` file in the project root: `cp .env.example .env`
c. Edit `.env` and add your credentials:
- `OAUTH_CLIENT_ID=your_client_id_here`
- `OAUTH_CLIENT_SECRET=your_client_secret_here`
- `OAUTH_CALLBACK_URL=http://localhost:5000/auth/callback`
- `FLASK_SECRET_KEY=your_generated_secret_key_here`
Generate a Flask secret key: `python -c "import secrets; print(secrets.token_hex(32))"`
Note: In development mode, `FLASK_SECRET_KEY` is automatically generated on first run and saved to a local `.dev_secret` file (owner-read-only, excluded from git). You only need to set it manually in `.env` for production deployments or when configuring OAuth.
- Run the application
  `python backend/main.py`
The application will be accessible at http://localhost:5000.
Note: OAuth is only required if you want to add depicts statements through the UI. The analysis features work without OAuth.
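The development-mode secret handling mentioned in the installation note can be approximated like this. It is a minimal sketch under assumptions: the function name `load_or_create_dev_secret` is hypothetical, and the project's real startup code may differ in detail.

```python
import os
import secrets
from pathlib import Path


def load_or_create_dev_secret(path: Path) -> str:
    """Return the persisted secret, creating the file on first use."""
    if path.exists():
        return path.read_text(encoding="utf-8").strip()
    secret = secrets.token_hex(32)
    # O_EXCL refuses to overwrite a file created concurrently, and mode
    # 0o600 limits access to the owning user from the moment of creation.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(secret)
    return secret
```

Creating the file with restrictive permissions up front, instead of writing it and calling `chmod` afterwards, avoids a short window in which the secret would be readable by other users.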
This project adheres to strict security standards to protect user data and maintain service integrity.
- Server-Side Sessions: User sessions are stored securely on the server filesystem (directory restricted to `0o700`), not in client-side cookies.
- OAuth 2.0: Standard authorization code flow for secure third-party authentication with Wikimedia.
- Token Handling: Access tokens are stored server-side and never exposed to browser JavaScript.
- Session Persistence: In development, the Flask secret key is persisted to `.dev_secret` (chmod 600) so sessions survive server restarts.
- Strict CSP: Content Security Policy disallows `unsafe-inline` in `script-src`, so injected scripts are blocked by the browser even if they reach the page.
- Event Delegation: All frontend interactivity uses `data-action` attributes with delegated listeners instead of inline `onclick=` handlers, which keeps it compatible with the strict CSP.
- Rate Limiting: API endpoints are protected against abuse with per-IP rate limits (login: 5/min, write ops: 30/min, default: 200/hour).
- CSRF Protection: State-changing requests require a cryptographic CSRF token in the `X-CSRF-Token` header. The OAuth flow uses a random `state` parameter with constant-time comparison.
- Input Validation: All user inputs (category names, file titles, QIDs, language codes) are validated against strict whitelists before use.
- Request Size Limit: Request bodies over 5 MB are rejected immediately with a `413` error.
- HTTPS Enforcement: In production, all HTTP traffic is permanently redirected to HTTPS.
- Security Headers: Every response includes CSP, HSTS, `X-Frame-Options: DENY`, `X-Content-Type-Options: nosniff`, and `Referrer-Policy`.
- WAL Mode: SQLite runs in Write-Ahead Logging mode so readers and writers do not block each other under concurrent load.
- Busy Timeout: A 5-second busy timeout prevents immediate failures under write contention.
- Absolute Path: The database path is resolved to an absolute path on startup, regardless of the working directory.
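The three database settings above map onto a few lines of the standard `sqlite3` module. The sketch below shows one way to apply them; the function name `open_database` is hypothetical and the project's own connection code is not shown in this document.

```python
import sqlite3
from pathlib import Path


def open_database(db_path: str) -> sqlite3.Connection:
    """Open SQLite with an absolute path, WAL mode, and a 5 s busy timeout."""
    # Resolve once so later changes of working directory cannot redirect
    # the application to a different database file.
    absolute = Path(db_path).resolve()
    absolute.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(absolute), timeout=5.0)
    # WAL lets readers proceed while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to 5000 ms for a lock instead of failing immediately.
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```

WAL mode is stored in the database file itself, so it stays in effect for later connections, while the busy timeout is a per-connection setting and has to be applied each time a connection is opened.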
The backend exposes a RESTful API for automation and integration.
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/analyze` | Initiates analysis for a specific category. |
| `GET` | `/api/results/<category>` | Retrieves cached analysis results. |
| `GET` | `/api/history` | Lists all previously analyzed categories. |
| `POST` | `/api/add-depicts` | (Auth Required) Adds a P180 statement to a file. |
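A client calling `/api/add-depicts` has to validate its inputs and send the CSRF token in the `X-CSRF-Token` header described in the security notes. The following sketch only builds the request and does not send it. The JSON field names `file_title` and `qid`, the `BASE_URL` constant, and the function name are assumptions, since the request schema is not documented here.

```python
import json
import re
import urllib.request

BASE_URL = "http://localhost:5000"
# Wikidata item IDs are "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def build_add_depicts_request(
    file_title: str, qid: str, csrf_token: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request that adds a P180 statement."""
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f"not a valid Wikidata QID: {qid!r}")
    # Assumed payload shape; check the backend for the real field names.
    payload = json.dumps({"file_title": file_title, "qid": qid}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/add-depicts",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-CSRF-Token": csrf_token,
        },
    )
```

A real call would also need the session cookie obtained through the OAuth login, because the endpoint requires authentication.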
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/progress/<job_id>` | Get progress status of a background analysis job. |
| `POST` | `/api/cancel/<job_id>` | Cancel a currently running analysis job. |
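A script that starts an analysis would typically poll the progress endpoint until the job finishes. The sketch below shows that loop with the HTTP call injected as a function, so it can be tested without a running server. The `status` field and its values are assumptions, as the response format of `/api/progress/<job_id>` is not documented here.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal states; adjust to whatever the backend actually reports.
FINAL_STATES = ("completed", "failed", "cancelled")


def wait_for_job(
    fetch_progress: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval: float = 1.0,
    max_polls: int = 600,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job reaches a final state or the poll budget runs out."""
    for _ in range(max_polls):
        progress = fetch_progress(job_id)
        if progress.get("status") in FINAL_STATES:
            return progress
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Here `fetch_progress` would wrap a `GET` to `/api/progress/<job_id>` and return the decoded JSON. Keeping the polling interval at a second or more also keeps the client well clear of the rate limits.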
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/suggest` | Suggest Commons categories by prefix (autocomplete). |
| `GET` | `/api/verify/<category>` | Verify if a category exists on Wikimedia Commons. |
| `DELETE` | `/api/category/<category>` | Delete cached analysis data for a category. |
| `GET` | `/api/export/<category>` | Export analysis results in CSV or JSON format (e.g. `?format=csv`). |
| `GET` | `/api/fileinfo/<file_title>` | Get detailed information about a specific file. |
| `GET` | `/api/suggests/<file_title>` | Get Wikidata item suggestions for a specific file. |
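Several of these endpoints take a category name or file title as a path segment, and such names often contain spaces or punctuation that must be percent-encoded. The helper below is a hypothetical sketch of building those URLs. Whether the backend expects the `Category:` prefix in the path is not specified here, so the example uses a bare name.

```python
from urllib.parse import quote, urlencode


def api_url(base_url: str, *segments: str, **params: str) -> str:
    """Join path segments under /api/, percent-encoding each one."""
    # safe="" also encodes "/" so a name cannot spill into extra segments.
    path = "/".join(quote(segment, safe="") for segment in segments)
    url = f"{base_url.rstrip('/')}/api/{path}"
    return f"{url}?{urlencode(params)}" if params else url
```

For example, `api_url("http://localhost:5000", "export", "Cats of Kerala", format="csv")` yields `http://localhost:5000/api/export/Cats%20of%20Kerala?format=csv`.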
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/auth/status` | Returns current authentication state and user context. |
| `GET` | `/auth/login` | Initiates the OAuth handshake. |
| `GET` | `/auth/logout` | Terminates the session and revokes tokens. |
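The OAuth handshake relies on a random `state` value that is checked with a constant-time comparison when Wikimedia redirects back to the callback URL. A minimal sketch of that check follows, with a plain mapping standing in for the server-side session. The key name `oauth_state` and both function names are assumptions.

```python
import hmac
import secrets
from typing import MutableMapping, Optional


def issue_state(session: MutableMapping[str, str]) -> str:
    """Create a random state value and remember it in the session."""
    state = secrets.token_urlsafe(32)
    session["oauth_state"] = state
    return state


def verify_state(
    session: MutableMapping[str, str], returned_state: Optional[str]
) -> bool:
    """Check the callback's state once, in constant time."""
    # pop() makes the stored value single-use, so a replayed callback fails.
    expected = session.pop("oauth_state", None)
    if not expected or not returned_state:
        return False
    # Compare as bytes: compare_digest rejects non-ASCII str arguments.
    return hmac.compare_digest(
        expected.encode("utf-8"), returned_state.encode("utf-8")
    )
```

In this sketch, `issue_state` would run when `/auth/login` builds the authorization redirect, and `verify_state` would run in the callback handler before the authorization code is exchanged for tokens.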
Contributions are welcome. Please make sure that every pull request passes the security test suite before submission.
- Fork the repository
- Create your feature branch (`git checkout -b feature/SecureFeature`)
- Commit your changes (`git commit -m 'feat: Add SecureFeature'`)
- Push to the branch (`git push origin feature/SecureFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Developed for the Wikimedia Technical Workshop at THARANG 2K26.
| Name | Role | GitHub |
|---|---|---|
| Aravind Lal | Core Developer | @mfscpayload-690 |
| Abhishek H | Core Developer | @unknownguyoffline |
| Name | Role | GitHub |
|---|---|---|
| Aaromal V | Documentation | @Aaromal665 |
| Sreeram S Nair | Documentation | @SreeramSNair-7 |