Data Sources
WayHack integrates with 20+ OSINT data sources for URL reconnaissance. This guide describes each source: its purpose, strengths, limitations, configuration options, and best use cases.
Archive Sources
Wayback Machine
Purpose: Historical snapshots of websites from the Internet Archive
Best For: Finding old URLs, tracking changes over time, discovering removed content
Data Coverage: 735+ billion web pages since 1996
Update Frequency: Continuous crawling
Configuration:
wayback:
  enabled: true
  max_results: 10000
  collapse_duplicates: true
  filter_status_codes: [200, 301, 302]
  date_range:
    start: "2020-01-01"
    end: "2023-12-31"
Strengths:
- Extensive historical coverage
- No API key required
- Reliable and stable
- Excellent for timeline analysis
Limitations:
- Rate limiting on high-volume requests
- Some content may be excluded
- Occasional service downtime
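To make the Wayback options above more concrete, here is a minimal Python sketch of how they could map onto a query against the Internet Archive's public CDX endpoint. The mapping is an assumption for illustration, not WayHack's actual implementation; `build_cdx_url` is a hypothetical helper, and the alternation form of the status-code filter reflects one reading of the CDX server's regex filter syntax.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain: str, settings: dict) -> str:
    """Translate a wayback-style settings dict into a CDX query URL (hypothetical mapping)."""
    params = [
        ("url", f"{domain}/*"),  # trailing /* asks for every capture under the domain
        ("output", "json"),
        ("fl", "original,timestamp,statuscode"),
        ("limit", str(settings.get("max_results", 10000))),
    ]
    if settings.get("collapse_duplicates"):
        # Collapse adjacent captures that share the same URL key.
        params.append(("collapse", "urlkey"))
    codes = settings.get("filter_status_codes") or []
    if codes:
        # Assumed filter form: field name followed by a regex of allowed codes.
        params.append(("filter", "statuscode:" + "|".join(str(c) for c in codes)))
    date_range = settings.get("date_range") or {}
    if "start" in date_range:
        # CDX timestamps are digit strings, so "2020-01-01" becomes "20200101".
        params.append(("from", date_range["start"].replace("-", "")))
    if "end" in date_range:
        params.append(("to", date_range["end"].replace("-", "")))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```

Calling `build_cdx_url("example.com", settings)` with the values from the configuration above yields a single URL that can be fetched with any HTTP client; the sketch deliberately stops short of sending the request so that rate limiting stays under the caller's control.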
Common Crawl
Purpose: Web crawl archives from large-scale internet crawling
Best For: Comprehensive URL discovery, large-scale data analysis
Data Coverage: Petabytes of web data, monthly crawls
Update Frequency: Monthly crawl releases
Configuration:
commoncrawl:
  enabled: true
  max_results: 50000
  index_collections: ["CC-MAIN-2023-50", "CC-MAIN-2023-40"]
  content_types: ["text/html", "application/json"]
Strengths:
- Massive data coverage
- Free and open access
- Regular updates
- Rich metadata
Limitations:
- Large dataset can be overwhelming
- Processing time can be significant
- Limited real-time data
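As a rough illustration of what `index_collections` and `content_types` could control, the sketch below builds one query URL per collection against the public Common Crawl index server and filters the newline-delimited JSON it returns by MIME type. Both helper names are hypothetical, and the assumption that WayHack filters on the `mime` field of each index record is mine, not something this guide states.

```python
import json
from urllib.parse import urlencode


def build_index_urls(domain: str, collections: list) -> list:
    """Build one index query URL per crawl collection."""
    query = urlencode({"url": f"{domain}/*", "output": "json"})
    return [f"https://index.commoncrawl.org/{name}-index?{query}" for name in collections]


def filter_records(ndjson_text: str, content_types: list, max_results: int) -> list:
    """Keep URLs whose recorded MIME type is allowed, up to max_results."""
    urls = []
    for line in ndjson_text.splitlines():
        if len(urls) >= max_results:
            break
        if not line.strip():
            continue
        record = json.loads(line)  # the index returns one JSON object per line
        if record.get("mime") in content_types:
            urls.append(record["url"])
    return urls
```

Filtering while streaming, rather than after collecting everything, keeps memory use bounded, which matters given how large a single collection's result set can be.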
Security Intelligence Sources
URLScan.io
Purpose: Real-time URL scanning and analysis
Best For: Recent URLs, active threat intelligence, malware analysis
Data Coverage: Millions of scanned URLs daily
Update Frequency: Real-time
Configuration:
urlscan:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  include_private: false
  scan_types: ["public", "unlisted"]
Strengths:
- Real-time scanning data
- Rich technical details
- Screenshot capabilities
- Active threat detection
Limitations:
- API key required for full access
- Rate limiting
- Focus on recently scanned URLs
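The following sketch shows one plausible way to use these settings with the urlscan.io search API: it prepares an authenticated search request for a domain and extracts URLs from a response, keeping only scans whose visibility is listed in `scan_types`. The helper names are hypothetical, the use of `task.visibility` for filtering is an assumption about how WayHack interprets `scan_types`, and the page size a key may request depends on the account plan.

```python
import json
import urllib.request
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://urlscan.io/api/v1/search/"


def build_search_request(domain: str, api_key: str, size: int = 100) -> urllib.request.Request:
    """Prepare (but do not send) a search request for scans of a domain."""
    query = urlencode({"q": f"domain:{domain}", "size": size})
    return urllib.request.Request(
        f"{SEARCH_ENDPOINT}?{query}",
        headers={"API-Key": api_key},  # key is sent as a request header
    )


def extract_urls(response_text: str, scan_types: list) -> list:
    """Collect submitted and final page URLs from scans with an allowed visibility."""
    urls = []
    for result in json.loads(response_text).get("results", []):
        if result.get("task", {}).get("visibility") not in scan_types:
            continue
        for section in ("task", "page"):
            url = result.get(section, {}).get("url")
            if url and url not in urls:  # preserve order, skip duplicates
                urls.append(url)
    return urls
```

Reading both `task.url` (what was submitted) and `page.url` (where the browser ended up) surfaces redirect targets that a single field would miss.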
VirusTotal
Purpose: Malware and URL analysis platform
Best For: Security analysis, malware detection, reputation checking
Data Coverage: Billions of files and URLs analyzed
Update Frequency: Real-time
Configuration:
virustotal:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  include_detections: true
  min_positives: 1
Strengths:
- Comprehensive malware analysis
- Multiple antivirus engine results
- Rich metadata and relationships
- Community contributions
Limitations:
- API key required
- Rate limiting on free tier
- Focus on malicious content
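A threshold such as `min_positives` can be pictured as a filter over URL objects returned by the VirusTotal v3 API, where each object carries a `last_analysis_stats` summary. The sketch below is an assumption about the option's meaning (minimum number of engines flagging the URL as malicious), and `filter_by_positives` is a hypothetical helper rather than part of WayHack.

```python
def filter_by_positives(items: list, min_positives: int) -> list:
    """Return (url, malicious_count) pairs that meet the detection threshold."""
    kept = []
    for item in items:
        attributes = item.get("attributes", {})
        stats = attributes.get("last_analysis_stats", {})
        positives = stats.get("malicious", 0)  # engines that flagged the URL
        if "url" in attributes and positives >= min_positives:
            kept.append((attributes["url"], positives))
    return kept
```

With `min_positives: 1`, a URL flagged by a single engine is kept; raising the threshold trades recall for fewer false positives.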
AlienVault OTX
Purpose: Open Threat Exchange intelligence platform
Best For: Threat intelligence, IOC discovery, security research
Data Coverage: Millions of threat indicators
Update Frequency: Real-time community contributions
Configuration:
alienvault:
  enabled: true
  api_key: "your_api_key"
  max_results: 5000
  pulse_types: ["domain", "url", "hostname"]
Strengths:
- Community-driven intelligence
- Rich threat context
- Free access with registration
- Global threat visibility
Limitations:
- Quality varies by contributor
- May include false positives
- Requires registration
Search Engine Sources
Shodan
Purpose: Internet-connected device search engine
Best For: Finding exposed services, IoT devices, infrastructure mapping
Data Coverage: 500+ million connected devices
Update Frequency: Continuous scanning
Configuration:
shodan:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_facets: ["port", "country", "org"]
  include_banners: true
Strengths:
- Unique infrastructure visibility
- Service banner information
- Geolocation data
- Historical tracking
Limitations:
- API key required
- Limited free tier
- Focus on exposed services
Censys
Purpose: Internet-wide scanning and analysis
Best For: Certificate analysis, service discovery, infrastructure research
Data Coverage: Internet-wide IPv4 scanning
Update Frequency: Daily scans
Configuration:
censys:
  enabled: true
  api_id: "your_api_id"
  api_secret: "your_api_secret"
  max_results: 1000
  search_types: ["hosts", "certificates"]
Strengths:
- Comprehensive internet scanning
- Certificate transparency integration
- Academic research focus
- High-quality data
Limitations:
- API credentials required
- Rate limiting
- Academic/research oriented
FOFA
Purpose: Cyberspace search engine
Best For: Asset discovery, vulnerability research, threat hunting
Data Coverage: Global internet assets
Update Frequency: Regular scanning
Configuration:
fofa:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_fields: ["host", "title", "body"]
Strengths:
- Comprehensive asset coverage
- Advanced search syntax
- Rich metadata
- Global perspective
Limitations:
- API key required
- Primarily Chinese interface
- Rate limiting
ZoomEye
Purpose: Cyberspace search engine
Best For: Device discovery, vulnerability research, network mapping
Data Coverage: Global internet-connected devices
Update Frequency: Regular scanning
Configuration:
zoomeye:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_types: ["host", "web"]
Strengths:
- Dual search capabilities (host/web)
- Good coverage of Asian networks
- Detailed device information
- Free tier available
Limitations:
- API key required for full access
- Rate limiting
- Interface primarily in Chinese
Certificate Sources
Certificate Transparency (crt.sh)
Purpose: Search over SSL/TLS Certificate Transparency logs
Best For: Subdomain discovery, certificate monitoring, domain validation
Data Coverage: Certificates recorded in public Certificate Transparency logs
Update Frequency: Real-time certificate logging
Configuration:
crtsh:
  enabled: true
  max_results: 5000
  include_expired: false
  wildcard_search: true
  certificate_types: ["DV", "OV", "EV"]
Strengths:
- Complete certificate coverage
- No API key required
- Real-time updates
- Excellent for subdomain discovery
Limitations:
- Only shows domains that have had certificates issued
- May include test/development certificates
- Rate limiting on high volume
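Subdomain discovery through crt.sh usually comes down to querying its JSON output and unpacking the `name_value` field of each certificate entry. The sketch below assumes that `wildcard_search` corresponds to a `%.domain` query and that `include_expired` is applied by comparing each entry's `not_after` date with the current time; both helper names are hypothetical.

```python
import json
from datetime import datetime
from urllib.parse import urlencode


def build_crtsh_url(domain: str, wildcard_search: bool = True) -> str:
    """Build a crt.sh JSON query; '%' is the wildcard character."""
    pattern = f"%.{domain}" if wildcard_search else domain
    return "https://crt.sh/?" + urlencode({"q": pattern, "output": "json"})


def extract_hostnames(response_text: str, include_expired: bool, now: datetime) -> list:
    """Return sorted, de-duplicated hostnames from a crt.sh JSON response."""
    hostnames = set()
    for entry in json.loads(response_text):
        not_after = datetime.fromisoformat(entry["not_after"])
        if not include_expired and not_after < now:
            continue  # skip certificates that are no longer valid
        # name_value may hold several names separated by newlines.
        for name in entry["name_value"].splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # a wildcard entry still reveals its parent name
            if name:
                hostnames.add(name)
    return sorted(hostnames)
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, keeps the expiry check reproducible when replaying saved responses.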
SecurityTrails
Purpose: DNS and domain intelligence platform
Best For: DNS history, subdomain discovery, domain monitoring
Data Coverage: Billions of DNS records
Update Frequency: Real-time DNS monitoring
Configuration:
securitytrails:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  record_types: ["A", "AAAA", "CNAME", "MX"]
  include_historical: true
Strengths:
- Comprehensive DNS data
- Historical DNS records
- Subdomain enumeration
- Domain monitoring
Limitations:
- API key required
- Rate limiting
- Paid service for full features
Specialized Sources
IntelX
Purpose: Search engine for leaked data and documents
Best For: Data breach research, leaked information discovery
Data Coverage: Billions of leaked documents and data
Update Frequency: Continuous data ingestion
Configuration:
intelx:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_types: ["domain", "email", "url"]
Strengths:
- Unique leaked data coverage
- Comprehensive search capabilities
- Historical data preservation
- Multiple data types
Limitations:
- API key required
- Sensitive data handling required
- Rate limiting
LeakIX
Purpose: Public information leak database
Best For: Exposed database discovery, misconfiguration detection
Data Coverage: Publicly exposed databases and services
Update Frequency: Continuous scanning
Configuration:
leakix:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  leak_types: ["database", "config", "backup"]
Strengths:
- Focus on exposed data
- Real-time discovery
- Multiple data types
- Security-focused
Limitations:
- API key required
- Sensitive data handling required
- Rate limiting
Netlas
Purpose: Internet asset discovery and analysis
Best For: Asset inventory, service discovery, network mapping
Data Coverage: Global internet assets
Update Frequency: Regular scanning
Configuration:
netlas:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  search_types: ["domain", "ip", "certificate"]
Strengths:
- Comprehensive asset coverage
- Multiple search types
- Rich metadata
- Good API documentation
Limitations:
- API key required
- Rate limiting
- Newer platform
BuiltWith
Purpose: Technology profiling and website analysis
Best For: Technology stack discovery, competitor analysis
Data Coverage: Millions of websites analyzed
Update Frequency: Regular crawling
Configuration:
builtwith:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  include_historical: true
Strengths:
- Detailed technology analysis
- Historical technology tracking
- Comprehensive coverage
- Easy integration
Limitations:
- API key required
- Rate limiting
- Focus on technology, not URLs
Hunter.io
Purpose: Email finder and domain search
Best For: Email discovery, contact research, domain profiling
Data Coverage: Millions of email addresses and domains
Update Frequency: Regular crawling
Configuration:
hunter:
  enabled: true
  api_key: "your_api_key"
  max_results: 1000
  include_sources: true
Strengths:
- Email discovery capabilities
- Source attribution
- Domain profiling
- Good accuracy
Limitations:
- API key required
- Rate limiting
- Focus on email, limited URL data
Code Repository Sources
GitHub
Purpose: Code repository search
Best For: Source code analysis, configuration discovery, secret detection
Data Coverage: Millions of public repositories
Update Frequency: Real-time
Configuration:
github:
  enabled: true
  api_token: "your_token"
  max_results: 1000
  search_types: ["code", "repositories", "commits"]
Strengths:
- Massive code repository
- Advanced search capabilities
- Real-time updates
- Rich metadata
Limitations:
- API token required
- Rate limiting
- Public repositories only
GitLab
Purpose: Code repository search
Best For: Source code analysis, project discovery
Data Coverage: Public GitLab repositories
Update Frequency: Real-time
Configuration:
gitlab:
  enabled: true
  api_token: "your_token"
  max_results: 1000
  search_scope: ["projects", "blobs", "commits"]
Strengths:
- Alternative to GitHub
- Good search capabilities
- Open source focus
- API access
Limitations:
- API token required
- Smaller dataset than GitHub
- Rate limiting
Google Dorking
Dorki
Purpose: Google dorking automation
Best For: Search engine reconnaissance, information gathering
Data Coverage: Google search results
Update Frequency: Real-time search
Configuration:
dorki:
  enabled: true
  max_results: 1000
  dork_types: ["files", "directories", "parameters"]
  custom_dorks: ["site:target.com filetype:pdf"]
Strengths:
- Automated Google dorking
- Custom dork support
- No API key required
- Flexible search patterns
Limitations:
- Google rate limiting
- CAPTCHA challenges
- Search result limitations
Source Selection Strategy
Comprehensive Reconnaissance
For thorough domain analysis, use:
- Wayback Machine: Historical coverage
- Common Crawl: Large-scale discovery
- URLScan.io: Recent activity
- crt.sh: Subdomain discovery
- Shodan: Infrastructure mapping
Quick Assessment
For rapid reconnaissance, use:
- Wayback Machine: Quick historical check
- URLScan.io: Recent scans
- crt.sh: Subdomain enumeration
Security Focus
For security-oriented research, use:
- VirusTotal: Malware analysis
- AlienVault OTX: Threat intelligence
- URLScan.io: Security scanning
- LeakIX: Exposed data
- IntelX: Leaked information
Infrastructure Analysis
For infrastructure mapping, use:
- Shodan: Service discovery
- Censys: Certificate analysis
- SecurityTrails: DNS intelligence
- Netlas: Asset discovery
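The four strategies above can be captured as named profiles that switch sources on or off. The sketch below is one hypothetical way to generate the `enabled` flags for each profile; the profile names and the `sources_config` helper are illustrative, and the source keys should be checked against the keys your configuration file actually uses.

```python
# Profile name -> sources recommended in the strategy lists above.
PROFILES = {
    "comprehensive": ["wayback", "commoncrawl", "urlscan", "crtsh", "shodan"],
    "quick": ["wayback", "urlscan", "crtsh"],
    "security": ["virustotal", "alienvault", "urlscan", "leakix", "intelx"],
    "infrastructure": ["shodan", "censys", "securitytrails", "netlas"],
}

# Every source mentioned by at least one profile.
ALL_SOURCES = sorted({name for names in PROFILES.values() for name in names})


def sources_config(profile: str) -> dict:
    """Return an {source: {"enabled": bool}} mapping for the chosen profile."""
    selected = set(PROFILES[profile])  # raises KeyError for an unknown profile
    return {name: {"enabled": name in selected} for name in ALL_SOURCES}
```

For example, `sources_config("quick")` enables only `wayback`, `urlscan`, and `crtsh` and explicitly disables the rest, so the result can be merged over an existing configuration without leaving stale sources switched on.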
Next Steps:
- URL Discovery Techniques: Advanced search methods
- Advanced Search Configurations: Complex search setups
- CLI Tool Mastery: Command-line usage