Data Sources

WayHack integrates with 20+ OSINT data sources for URL reconnaissance. This guide describes each source: its purpose, strengths, limitations, configuration options, and best use cases.

Archive Sources

Wayback Machine

Purpose: Historical snapshots of websites from the Internet Archive
Best For: Finding old URLs, tracking changes over time, discovering removed content
Data Coverage: 735+ billion web pages since 1996
Update Frequency: Continuous crawling

Configuration:

wayback:
  enabled: true
  max_results: 10000
  collapse_duplicates: true
  filter_status_codes: [200, 301, 302]
  date_range:
    start: "2020-01-01"
    end: "2023-12-31"
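
To make these options concrete, the sketch below shows one way they could translate into a query against the Internet Archive's public CDX endpoint. The function name build_cdx_query and the mapping from config keys to CDX parameters are assumptions for illustration, not WayHack's actual implementation.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, max_results=10000, collapse_duplicates=True,
                    status_codes=(200, 301, 302),
                    start="2020-01-01", end="2023-12-31"):
    """Build a CDX query URL from wayback-style settings (assumed mapping)."""
    params = [
        ("url", f"{domain}/*"),  # every capture under the domain
        ("output", "json"),
        ("fl", "original,timestamp,statuscode"),
        ("limit", str(max_results)),
        # CDX timestamps are digit strings, so drop the dashes from ISO dates
        ("from", start.replace("-", "")),
        ("to", end.replace("-", "")),
    ]
    if collapse_duplicates:
        # keep one capture per normalized URL key
        params.append(("collapse", "urlkey"))
    if status_codes:
        # the filter value is a regex matched against the statuscode field
        codes = "|".join(str(code) for code in status_codes)
        params.append(("filter", f"statuscode:({codes})"))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_query("example.com"))
```

Collapsing on urlkey and filtering by status code on the server side keeps the response small, which helps when the archive is rate limiting.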

Strengths:

  • Extensive historical coverage

  • No API key required

  • Reliable and stable

  • Excellent for timeline analysis

Limitations:

  • Rate limiting on high-volume requests

  • Some content may be excluded

  • Occasional service downtime

Common Crawl

Purpose: Web crawl archives from large-scale internet crawling
Best For: Comprehensive URL discovery, large-scale data analysis
Data Coverage: Petabytes of web data, monthly crawls
Update Frequency: Monthly crawl releases

Configuration:

commoncrawl:
  enabled: true
  max_results: 50000
  index_collections: ["CC-MAIN-2023-50", "CC-MAIN-2023-40"]
  content_types: ["text/html", "application/json"]
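
As a rough illustration of what index_collections and content_types control, the sketch below builds a Common Crawl index query URL for one collection and filters the newline-delimited JSON records it returns by MIME type. The helper names index_url and filter_cc_records are hypothetical, and how WayHack applies these settings internally is an assumption.

```python
import json


def index_url(collection, domain):
    """Index query URL for one crawl collection, e.g. CC-MAIN-2023-50."""
    return (f"https://index.commoncrawl.org/{collection}-index"
            f"?url=*.{domain}&output=json")


def filter_cc_records(ndjson_text,
                      content_types=("text/html", "application/json"),
                      max_results=50000):
    """Return unique URLs from index records whose MIME type is allowed."""
    seen = set()
    urls = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # one JSON object per line
        if record.get("mime") not in content_types:
            continue
        url = record.get("url")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
            if len(urls) >= max_results:
                break
    return urls
```

Querying only the collections you need and dropping unwanted MIME types early is one way to keep processing time manageable on a dataset this large.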

Strengths:

  • Massive data coverage

  • Free and open access

  • Regular updates

  • Rich metadata

Limitations:

  • Large dataset can be overwhelming

  • Processing time can be significant

  • Limited real-time data

Security Intelligence Sources

URLScan.io

Purpose: Real-time URL scanning and analysis
Best For: Recent URLs, active threat intelligence, malware analysis
Data Coverage: Millions of scanned URLs daily
Update Frequency: Real-time

Configuration:

urlscan:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  include_private: false
  scan_types: ["public", "unlisted"]
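
The sketch below illustrates how scan_types and max_results might be applied to a urlscan.io search response after it has been decoded from JSON. It assumes each result carries a task object (with url and visibility) and a page object (with url); the function name extract_urls is hypothetical and this is not WayHack's actual code.

```python
def extract_urls(search_response, scan_types=("public", "unlisted"),
                 max_results=1000):
    """Collect unique URLs from results whose visibility is allowed."""
    urls = []
    seen = set()
    for result in search_response.get("results", []):
        task = result.get("task", {})
        # skip scans whose visibility is not in the configured list
        if task.get("visibility") not in scan_types:
            continue
        # prefer the final page URL, fall back to the submitted URL
        url = result.get("page", {}).get("url") or task.get("url")
        if not url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
        if len(urls) >= max_results:
            break
    return urls
```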

Strengths:

  • Real-time scanning data

  • Rich technical details

  • Screenshot capabilities

  • Active threat detection

Limitations:

  • API key required for full access

  • Rate limiting

  • Focus on recently scanned URLs

VirusTotal

Purpose: Malware and URL analysis platform
Best For: Security analysis, malware detection, reputation checking
Data Coverage: Billions of files and URLs analyzed
Update Frequency: Real-time

Configuration:

virustotal:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  include_detections: true
  min_positives: 1
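
One plausible reading of min_positives is a threshold on how many antivirus engines flagged a URL. The sketch below applies that reading to a report shaped like a VirusTotal v3 URL object, where attributes.last_analysis_stats holds per-verdict counts. Both the interpretation and the function name passes_min_positives are assumptions.

```python
def passes_min_positives(url_report, min_positives=1):
    """True if at least min_positives engines flagged the URL as malicious."""
    stats = url_report.get("attributes", {}).get("last_analysis_stats", {})
    # a missing report or missing count is treated as zero detections
    return stats.get("malicious", 0) >= min_positives


report = {"attributes": {"last_analysis_stats": {"malicious": 2, "harmless": 60}}}
print(passes_min_positives(report, min_positives=1))  # True
print(passes_min_positives(report, min_positives=3))  # False
```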

Strengths:

  • Comprehensive malware analysis

  • Multiple antivirus engine results

  • Rich metadata and relationships

  • Community contributions

Limitations:

  • API key required

  • Rate limiting on free tier

  • Focus on malicious content

AlienVault OTX

Purpose: Open Threat Exchange intelligence platform
Best For: Threat intelligence, IOC discovery, security research
Data Coverage: Millions of threat indicators
Update Frequency: Real-time community contributions

Configuration:

alienvault:
  enabled: true
  api_key: "your_api_key"
  max_results: 5000
  pulse_types: ["domain", "url", "hostname"]

Strengths:

  • Community-driven intelligence

  • Rich threat context

  • Free access with registration

  • Global threat visibility

Limitations:

  • Quality varies by contributor

  • May include false positives

  • Requires registration

Search Engine Sources

Shodan

Purpose: Internet-connected device search engine
Best For: Finding exposed services, IoT devices, infrastructure mapping
Data Coverage: 500+ million connected devices
Update Frequency: Continuous scanning

Configuration:

shodan:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_facets: ["port", "country", "org"]
  include_banners: true

Strengths:

  • Unique infrastructure visibility

  • Service banner information

  • Geolocation data

  • Historical tracking

Limitations:

  • API key required

  • Limited free tier

  • Focus on exposed services

Censys

Purpose: Internet-wide scanning and analysis
Best For: Certificate analysis, service discovery, infrastructure research
Data Coverage: Internet-wide IPv4 scanning
Update Frequency: Daily scans

Configuration:

censys:
  enabled: true
  api_id: "your_api_id"
  api_secret: "your_api_secret"
  max_results: 1000
  search_types: ["hosts", "certificates"]

Strengths:

  • Comprehensive internet scanning

  • Certificate transparency integration

  • Academic research focus

  • High-quality data

Limitations:

  • API credentials required

  • Rate limiting

  • Academic/research oriented

FOFA

Purpose: Cyberspace search engine
Best For: Asset discovery, vulnerability research, threat hunting
Data Coverage: Global internet assets
Update Frequency: Regular scanning

Configuration:

fofa:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_fields: ["host", "title", "body"]

Strengths:

  • Comprehensive asset coverage

  • Advanced search syntax

  • Rich metadata

  • Global perspective

Limitations:

  • API key required

  • Primarily Chinese interface

  • Rate limiting

ZoomEye

Purpose: Cyberspace search engine
Best For: Device discovery, vulnerability research, network mapping
Data Coverage: Global internet-connected devices
Update Frequency: Regular scanning

Configuration:

zoomeye:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_types: ["host", "web"]

Strengths:

  • Dual search capabilities (host/web)

  • Good coverage of Asian networks

  • Detailed device information

  • Free tier available

Limitations:

  • API key required for full access

  • Rate limiting

  • Interface primarily in Chinese

Certificate Sources

Certificate Transparency (crt.sh)

Purpose: SSL certificate transparency logs
Best For: Subdomain discovery, certificate monitoring, domain validation
Data Coverage: SSL/TLS certificates recorded in public Certificate Transparency logs
Update Frequency: Real-time certificate logging

Configuration:

crtsh:
  enabled: true
  max_results: 5000
  include_expired: false
  wildcard_search: true
  certificate_types: ["DV", "OV", "EV"]
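
For subdomain discovery, the useful part of a crt.sh result is the name_value field, which can hold several newline-separated names per certificate. The sketch below builds a JSON query URL and reduces the decoded results to a sorted list of names under the target domain. Treating wildcard_search as "query %.domain instead of the bare domain" is an assumption, and both function names are hypothetical.

```python
from urllib.parse import urlencode


def crtsh_query_url(domain, wildcard_search=True):
    """crt.sh JSON query; '%' is crt.sh's wildcard character."""
    query = f"%.{domain}" if wildcard_search else domain
    return "https://crt.sh/?" + urlencode({"q": query, "output": "json"})


def subdomains_from_crtsh(entries, domain):
    """Extract unique names under `domain` from decoded crt.sh results."""
    domain = domain.lower()
    suffix = "." + domain
    names = set()
    for entry in entries:
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # "*.dev.example.com" -> "dev.example.com"
            if name == domain or name.endswith(suffix):
                names.add(name)
    return sorted(names)
```

Matching on the ".domain" suffix rather than a plain substring avoids picking up unrelated names such as evil-example.com.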

Strengths:

  • Broad coverage of publicly logged certificates

  • No API key required

  • Real-time updates

  • Excellent for subdomain discovery

Limitations:

  • Only covers domains that have been issued certificates

  • May include test/development certificates

  • Rate limiting on high volume

SecurityTrails

Purpose: DNS and domain intelligence platform
Best For: DNS history, subdomain discovery, domain monitoring
Data Coverage: Billions of DNS records
Update Frequency: Real-time DNS monitoring

Configuration:

securitytrails:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  record_types: ["A", "AAAA", "CNAME", "MX"]
  include_historical: true

Strengths:

  • Comprehensive DNS data

  • Historical DNS records

  • Subdomain enumeration

  • Domain monitoring

Limitations:

  • API key required

  • Rate limiting

  • Paid service for full features

Specialized Sources

IntelX

Purpose: Search engine for leaked data and documents
Best For: Data breach research, leaked information discovery
Data Coverage: Billions of leaked documents and data
Update Frequency: Continuous data ingestion

Configuration:

intelx:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_types: ["domain", "email", "url"]

Strengths:

  • Unique leaked data coverage

  • Comprehensive search capabilities

  • Historical data preservation

  • Multiple data types

Limitations:

  • API key required

  • Sensitive data handling required

  • Rate limiting

LeakIX

Purpose: Public information leak database
Best For: Exposed database discovery, misconfiguration detection
Data Coverage: Publicly exposed databases and services
Update Frequency: Continuous scanning

Configuration:

leakix:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  leak_types: ["database", "config", "backup"]

Strengths:

  • Focus on exposed data

  • Real-time discovery

  • Multiple data types

  • Security-focused

Limitations:

  • API key required

  • Sensitive data handling required

  • Rate limiting

Netlas

Purpose: Internet asset discovery and analysis
Best For: Asset inventory, service discovery, network mapping
Data Coverage: Global internet assets
Update Frequency: Regular scanning

Configuration:

netlas:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_types: ["domain", "ip", "certificate"]

Strengths:

  • Comprehensive asset coverage

  • Multiple search types

  • Rich metadata

  • Good API documentation

Limitations:

  • API key required

  • Rate limiting

  • Newer platform

BuiltWith

Purpose: Technology profiling and website analysis
Best For: Technology stack discovery, competitor analysis
Data Coverage: Millions of websites analyzed
Update Frequency: Regular crawling

Configuration:

builtwith:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  include_historical: true

Strengths:

  • Detailed technology analysis

  • Historical technology tracking

  • Comprehensive coverage

  • Easy integration

Limitations:

  • API key required

  • Rate limiting

  • Focus on technology, not URLs

Hunter.io

Purpose: Email finder and domain search
Best For: Email discovery, contact research, domain profiling
Data Coverage: Millions of email addresses and domains
Update Frequency: Regular crawling

Configuration:

hunter:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  include_sources: true

Strengths:

  • Email discovery capabilities

  • Source attribution

  • Domain profiling

  • Good accuracy

Limitations:

  • API key required

  • Rate limiting

  • Focus on email, limited URL data

Code Repository Sources

GitHub

Purpose: Code repository search
Best For: Source code analysis, configuration discovery, secret detection
Data Coverage: Millions of public repositories
Update Frequency: Real-time

Configuration:

github:
  enabled: true
  api_token: "your_token"
  max_results: 1000
  search_types: ["code", "repositories", "commits"]

Strengths:

  • Massive code repository

  • Advanced search capabilities

  • Real-time updates

  • Rich metadata

Limitations:

  • API token required

  • Rate limiting

  • Public repositories only

GitLab

Purpose: Code repository search
Best For: Source code analysis, project discovery
Data Coverage: Public GitLab repositories
Update Frequency: Real-time

Configuration:

gitlab:
  enabled: true
  api_token: "your_token"
  max_results: 1000
  search_scope: ["projects", "blobs", "commits"]

Strengths:

  • Alternative to GitHub

  • Good search capabilities

  • Open source focus

  • API access

Limitations:

  • API token required

  • Smaller dataset than GitHub

  • Rate limiting

Google Dorking

Dorki

Purpose: Google dorking automation
Best For: Search engine reconnaissance, information gathering
Data Coverage: Google search results
Update Frequency: Real-time search

Configuration:

dorki:
  enabled: true
  max_results: 1000
  dork_types: ["files", "directories", "parameters"]
  custom_dorks: ["site:target.com filetype:pdf"]
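
The sketch below shows how dork_types and custom_dorks could expand into concrete search queries for a target. The templates in DORK_TEMPLATES are illustrative guesses at what each dork type might mean, not the queries Dorki actually sends, and build_dorks is a hypothetical helper.

```python
# Hypothetical templates: one query pattern per dork type.
DORK_TEMPLATES = {
    "files": "site:{domain} filetype:{ext}",
    "directories": 'site:{domain} intitle:"index of"',
    "parameters": 'site:{domain} inurl:"?id="',
}


def build_dorks(domain, dork_types=("files", "directories", "parameters"),
                extensions=("pdf",), custom_dorks=()):
    """Expand dork types into query strings, then append custom dorks."""
    dorks = []
    for dork_type in dork_types:
        template = DORK_TEMPLATES[dork_type]
        if "{ext}" in template:
            # one query per file extension
            dorks.extend(template.format(domain=domain, ext=ext)
                         for ext in extensions)
        else:
            dorks.append(template.format(domain=domain))
    dorks.extend(custom_dorks)
    # drop duplicates while preserving order
    return list(dict.fromkeys(dorks))


for dork in build_dorks("target.com",
                        custom_dorks=["site:target.com filetype:pdf"]):
    print(dork)
```

Deduplicating before sending keeps the number of queries down, which matters given the rate limiting and CAPTCHA challenges noted for this source.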

Strengths:

  • Automated Google dorking

  • Custom dork support

  • No API key required

  • Flexible search patterns

Limitations:

  • Google rate limiting

  • CAPTCHA challenges

  • Search result limitations

Source Selection Strategy

Comprehensive Reconnaissance

For thorough domain analysis, use:

  • Wayback Machine: Historical coverage

  • Common Crawl: Large-scale discovery

  • URLScan.io: Recent activity

  • crt.sh: Subdomain discovery

  • Shodan: Infrastructure mapping

Quick Assessment

For rapid reconnaissance, use:

  • Wayback Machine: Quick historical check

  • URLScan.io: Recent scans

  • crt.sh: Subdomain enumeration

Security Focus

For security-oriented research, use:

  • VirusTotal: Malware analysis

  • AlienVault OTX: Threat intelligence

  • URLScan.io: Security scanning

  • LeakIX: Exposed data

  • IntelX: Leaked information

Infrastructure Analysis

For infrastructure mapping, use:

  • Shodan: Service discovery

  • Censys: Certificate analysis

  • SecurityTrails: DNS intelligence

  • Netlas: Asset discovery
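
The four strategies above can be captured as named profiles that enable only the listed sources. The sketch below is a minimal illustration: the profile names, the sources_config helper, and the idea of generating the enabled flags programmatically are assumptions. The source keys follow the configuration sections in this guide, with the AlienVault OTX key written as alienvault; use whichever key your configuration file actually expects.

```python
# Source keys as used in the configuration sections of this guide.
ALL_SOURCES = [
    "wayback", "commoncrawl", "urlscan", "virustotal", "alienvault",
    "shodan", "censys", "fofa", "zoomeye", "crtsh", "securitytrails",
    "intelx", "leakix", "netlas", "builtwith", "hunter", "github",
    "gitlab", "dorki",
]

# Hypothetical profile names mapped to the sources each strategy lists.
PROFILES = {
    "comprehensive": ["wayback", "commoncrawl", "urlscan", "crtsh", "shodan"],
    "quick": ["wayback", "urlscan", "crtsh"],
    "security": ["virustotal", "alienvault", "urlscan", "leakix", "intelx"],
    "infrastructure": ["shodan", "censys", "securitytrails", "netlas"],
}


def sources_config(profile):
    """Return {source: {"enabled": bool}} with only the profile's sources on."""
    selected = set(PROFILES[profile])
    return {name: {"enabled": name in selected} for name in ALL_SOURCES}


config = sources_config("quick")
print([name for name, opts in config.items() if opts["enabled"]])
```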

Next Steps: