Add OCR support for SiteReview CAPTCHA using tesseract

Add support for input file list of potential domains
Add additional error checking for ExpiredDomains.net parsing
Change -q/--query switch to -k/--keyword to better match its purpose
master
Andrew Chiles 2018-04-11 14:46:15 +02:00
parent 3b6078e9fd
commit 26dd64870d
4 changed files with 181 additions and 86 deletions

4
.gitignore vendored
View File

@ -1 +1,3 @@
*.html
*.html
*.txt
*.jpg

View File

@ -4,38 +4,52 @@ Authors Joe Vest (@joevest) & Andrew Chiles (@andrewchiles)
Domain name selection is an important aspect of preparation for penetration tests and especially Red Team engagements. Commonly, domains that were used previously for benign purposes and were properly categorized can be purchased for only a few dollars. Such domains can allow a team to bypass reputation based web filters and network egress restrictions for phishing and C2 related tasks.
This Python based tool was written to quickly query the Expireddomains.net search engine for expired/available domains with a previous history of use. It then optionally queries for domain reputation against services like Symantec Web Filter (BlueCoat), IBM X-Force, and Cisco Talos. The primary tool output is a timestamped HTML table style report.
This Python based tool was written to quickly query the Expireddomains.net search engine for expired/available domains with a previous history of use. It then optionally queries for domain reputation against services like Symantec WebPulse (BlueCoat), IBM X-Force, and Cisco Talos. The primary tool output is a timestamped HTML table style report.
## Changes
- 9 April 2018
+ Added -t switch for timing control. -t <1-5>
+ Added Google SafeBrowsing and PhishTank reputation checks
+ Fixed bug in IBMXForce response parsing
- 7 April 2018
+ Fixed support for Symantec WebPulse Site Review (formerly Blue Coat WebFilter)
+ Added Cisco Talos Domain Reputation check
+ Added feature to perform a reputation check against a single non-expired domain. This is useful when monitoring reputation for domains used in ongoing campaigns and engagements.
- 6 June 2017
+ Added python 3 support
+ Code cleanup and bug fixes
+ Added Status column (Available, Make Offer, Price, Backorder, etc)
- 11 April 2018
+ Added OCR support for CAPTCHA solving with tesseract. Thanks to t94j0 for the idea in [AIRMASTER](https://github.com/t94j0/AIRMASTER)
+ Added support for input file list of potential domains (-f/--filename)
+ Changed -q/--query switch to -k/--keyword to better match its purpose
+ Added additional error checking for ExpiredDomains.net parsing
- 9 April 2018
+ Added -t switch for timing control. -t <1-5>
+ Added Google SafeBrowsing and PhishTank reputation checks
+ Fixed bug in IBMXForce response parsing
- 7 April 2018
+ Fixed support for Symantec WebPulse Site Review (formerly Blue Coat WebFilter)
+ Added Cisco Talos Domain Reputation check
+ Added feature to perform a reputation check against a single non-expired domain. This is useful when monitoring reputation for domains used in ongoing campaigns and engagements.
- 6 June 2017
+ Added python 3 support
+ Code cleanup and bug fixes
+ Added Status column (Available, Make Offer, Price, Backorder, etc)
## Features
- Retrieve specified number of recently expired and deleted domains (.com, .net, .org primarily) from ExpiredDomains.net
- Retrieve available domains based on keyword search from ExpiredDomains.net
- Perform reputation checks against the Symantec Web Filter (BlueCoat), IBM x-Force, Cisco Talos, Google SafeBrowsing, and PhishTank services
- Perform reputation checks against the Symantec WebPulse Site Review (BlueCoat), IBM x-Force, Cisco Talos, Google SafeBrowsing, and PhishTank services
- Sort results by domain age (if known)
- Text-based table and HTML report output with links to reputation sources and Archive.org entry
## Usage
## Installation
Install Requirements
Install Python requirements
pip3 install -r requirements.txt
or
pip3 install requests texttable beautifulsoup4 lxml
Optional - Install additional OCR support dependencies
- Debian/Ubuntu: `apt-get install tesseract-ocr python3-imaging`
- MAC OSX: `brew install tesseract`
## Usage
List DomainHunter options
@ -48,9 +62,13 @@ List DomainHunter options
optional arguments:
-h, --help show this help message and exit
-q QUERY, --query QUERY
-k KEYWORD, --keyword KEYWORD
Keyword used to refine search results
-c, --check Perform domain reputation checks
-f FILENAME, --filename FILENAME
Specify input file of line delimited domain names to
check
--ocr Perform OCR on CAPTCHAs when present
-r MAXRESULTS, --maxresults MAXRESULTS
Number of results to return when querying latest
expired/deleted domains
@ -63,7 +81,7 @@ List DomainHunter options
Fastest(5) = no delay
-w MAXWIDTH, --maxwidth MAXWIDTH
Width of text table
-v, --version show program's version number and exit
-V, --version show program's version number and exit
Use defaults to check for most recent 100 domains and check reputation
@ -89,9 +107,13 @@ Perform all reputation checks for a single domain
[*] Cisco Talos: mydomain.com
[+] mydomain.com: Web Hosting (Score: Neutral)
Search for available domains with search term of "dog", max results of 100, and check reputation
Perform all reputation checks for a list of domains at max speed with OCR of CAPTCHAs
python3 ./domainhunter.py -f <domainslist.txt> -t 5 --ocr
Search for available domains with keyword term of "dog", max results of 100, and check reputation
python3 ./domainhunter.py -q dog -r 100 -c
python3 ./domainhunter.py -k dog -r 100 -c
____ ___ __ __ _ ___ _ _ _ _ _ _ _ _ _____ _____ ____
| _ \ / _ \| \/ | / \ |_ _| \ | | | | | | | | | \ | |_ _| ____| _ \
| | | | | | | |\/| | / _ \ | || \| | | |_| | | | | \| | | | | _| | |_) |

View File

@ -4,17 +4,17 @@
## Author: @joevest and @andrewchiles
## Description: Checks expired domains, reputation/categorization, and Archive.org history to determine
## good candidates for phishing and C2 domain names
# To-do:
# Code cleanup/optimization
# Add Authenticated "Members-Only" option to download CSV/txt (https://member.expireddomains.net/domains/expiredcom/)
# Add OCR support for BlueCoat/SiteReview CAPTCHA using tesseract
# Add support for input file list of potential domains
# Add additional error checking for ExpiredDomains.net parsing
# Changed -q/--query switch to -k/--keyword to better match its purpose
import time
import random
import argparse
import json
import base64
__version__ = "20180409"
__version__ = "20180411"
## Functions
@ -29,9 +29,7 @@ def doSleep(timing):
time.sleep(random.randrange(10,20))
elif timing == 4:
time.sleep(random.randrange(5,10))
else:
# Maxiumum speed - no delay
pass
# There's no elif timing == 5 here because we don't want to sleep for -t 5
def checkBluecoat(domain):
try:
@ -43,7 +41,6 @@ def checkBluecoat(domain):
print('[*] BlueCoat: {}'.format(domain))
response = s.post(url,headers=headers,json=postData,verify=False)
responseJSON = json.loads(response.text)
if 'errorType' in responseJSON:
@ -51,17 +48,46 @@ def checkBluecoat(domain):
else:
a = responseJSON['categorization'][0]['name']
# # Print notice if CAPTCHAs are blocking accurate results
# if a == 'captcha':
# print('[-] Error: Blue Coat CAPTCHA received. Change your IP or manually solve a CAPTCHA at "https://sitereview.bluecoat.com/sitereview.jsp"')
# #raw_input('[*] Press Enter to continue...')
# Print notice if CAPTCHAs are blocking accurate results and attempt to solve if --ocr
if a == 'captcha':
if ocr:
# This request is performed in a browser, but is not needed for our purposes
#captcharequestURL = 'https://sitereview.bluecoat.com/resource/captcha-request'
#print('[*] Requesting CAPTCHA')
#response = s.get(url=captcharequestURL,headers=headers,cookies=cookies,verify=False)
print('[*] Received CAPTCHA challenge!')
captcha = solveCaptcha('https://sitereview.bluecoat.com/resource/captcha.jpg',s)
if captcha:
b64captcha = base64.b64encode(captcha.encode('utf-8')).decode('utf-8')
# Send CAPTCHA solution via GET since inclusion with the domain categorization request doens't work anymore
captchasolutionURL = 'https://sitereview.bluecoat.com/resource/captcha-request/{0}'.format(b64captcha)
print('[*] Submiting CAPTCHA at {0}'.format(captchasolutionURL))
response = s.get(url=captchasolutionURL,headers=headers,verify=False)
# Try the categorization request again
response = s.post(url,headers=headers,json=postData,verify=False)
responseJSON = json.loads(response.text)
if 'errorType' in responseJSON:
a = responseJSON['errorType']
else:
a = responseJSON['categorization'][0]['name']
else:
print('[-] Error: Failed to solve BlueCoat CAPTCHA with OCR! Manually solve at "https://sitereview.bluecoat.com/sitereview.jsp"')
else:
print('[-] Error: BlueCoat CAPTCHA received. Try --ocr flag or manually solve a CAPTCHA at "https://sitereview.bluecoat.com/sitereview.jsp"')
return a
except:
print('[-] Error retrieving Bluecoat reputation!')
except Exception as e:
print('[-] Error retrieving Bluecoat reputation! {0}'.format(e))
return "-"
def checkIBMxForce(domain):
def checkIBMXForce(domain):
try:
url = 'https://exchange.xforce.ibmcloud.com/url/{}'.format(domain)
headers = {'User-Agent':useragent,
@ -184,7 +210,7 @@ def checkMXToolbox(domain):
def downloadMalwareDomains(malwaredomainsURL):
url = malwaredomainsURL
response = s.get(url,headers=headers,verify=False)
response = s.get(url=url,headers=headers,verify=False)
responseText = response.text
if response.status_code == 200:
return responseText
@ -203,7 +229,7 @@ def checkDomain(domain):
bluecoat = checkBluecoat(domain)
print("[+] {}: {}".format(domain, bluecoat))
ibmxforce = checkIBMxForce(domain)
ibmxforce = checkIBMXForce(domain)
print("[+] {}: {}".format(domain, ibmxforce))
ciscotalos = checkTalos(domain)
@ -211,9 +237,43 @@ def checkDomain(domain):
print("")
return
def solveCaptcha(url,session):
# Downloads CAPTCHA image and saves to current directory for OCR with tesseract
# Returns CAPTCHA string or False if error occured
jpeg = 'captcha.jpg'
try:
response = session.get(url=url,headers=headers,verify=False, stream=True)
if response.status_code == 200:
with open(jpeg, 'wb') as f:
response.raw.decode_content = True
shutil.copyfileobj(response.raw, f)
else:
print('[-] Error downloading CAPTCHA file!')
return False
text = pytesseract.image_to_string(Image.open(jpeg))
text = text.replace(" ", "")
return text
except Exception as e:
print("[-] Error solving CAPTCHA - {0}".format(e))
return False
## MAIN
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Finds expired domains, domain categorization, and Archive.org history to determine good candidates for C2 and phishing domains')
parser.add_argument('-k','--keyword', help='Keyword used to refine search results', required=False, default=False, type=str, dest='keyword')
parser.add_argument('-c','--check', help='Perform domain reputation checks', required=False, default=False, action='store_true', dest='check')
parser.add_argument('-f','--filename', help='Specify input file of line delimited domain names to check', required=False, default=False, type=str, dest='filename')
parser.add_argument('--ocr', help='Perform OCR on CAPTCHAs when present', required=False, default=False, action='store_true')
parser.add_argument('-r','--maxresults', help='Number of results to return when querying latest expired/deleted domains', required=False, default=100, type=int, dest='maxresults')
parser.add_argument('-s','--single', help='Performs detailed reputation checks against a single domain name/IP.', required=False, default=False, dest='single')
parser.add_argument('-t','--timing', help='Modifies request timing to avoid CAPTCHAs. Slowest(0) = 90-120 seconds, Default(3) = 10-20 seconds, Fastest(5) = no delay', required=False, default=3, type=int, choices=range(0,6), dest='timing')
parser.add_argument('-w','--maxwidth', help='Width of text table', required=False, default=400, type=int, dest='maxwidth')
parser.add_argument('-V','--version', action='version',version='%(prog)s {version}'.format(version=__version__))
args = parser.parse_args()
# Load dependent modules
try:
import requests
from bs4 import BeautifulSoup
@ -221,24 +281,30 @@ if __name__ == "__main__":
except Exception as e:
print("Expired Domains Reputation Check")
print("[-] Missing dependencies: {}".format(str(e)))
print("[*] Install required dependencies by running `pip install -r requirements.txt`")
print("[-] Missing basic dependencies: {}".format(str(e)))
print("[*] Install required dependencies by running `pip3 install -r requirements.txt`")
quit(0)
parser = argparse.ArgumentParser(description='Finds expired domains, domain categorization, and Archive.org history to determine good candidates for C2 and phishing domains')
parser.add_argument('-q','--query', help='Keyword used to refine search results', required=False, default=False, type=str, dest='query')
parser.add_argument('-c','--check', help='Perform domain reputation checks', required=False, default=False, action='store_true', dest='check')
parser.add_argument('-r','--maxresults', help='Number of results to return when querying latest expired/deleted domains', required=False, default=100, type=int, dest='maxresults')
parser.add_argument('-s','--single', help='Performs detailed reputation checks against a single domain name/IP.', required=False, default=False, dest='single')
parser.add_argument('-t','--timing', help='Modifies request timing to avoid CAPTCHAs. Slowest(0) = 90-120 seconds, Default(3) = 10-20 seconds, Fastest(5) = no delay', required=False, default=3, type=int, choices=range(0,6), dest='timing')
parser.add_argument('-w','--maxwidth', help='Width of text table', required=False, default=400, type=int, dest='maxwidth')
parser.add_argument('-v','--version', action='version',version='%(prog)s {version}'.format(version=__version__))
args = parser.parse_args()
# Load OCR related modules if --ocr flag is set since these can be difficult to get working
if args.ocr:
try:
import pytesseract
from PIL import Image
import shutil
except Exception as e:
print("Expired Domains Reputation Check")
print("[-] Missing OCR dependencies: {}".format(str(e)))
print("[*] Install required Python dependencies by running `pip3 install -r requirements.txt`")
print("[*] Ubuntu\Debian - Install tesseract by running `apt-get install tesseract-ocr python3-imaging`")
print("[*] MAC OSX - Install tesseract with homebrew by running `brew install tesseract`")
quit(0)
## Variables
query = args.query
keyword = args.keyword
check = args.check
filename = args.filename
maxresults = args.maxresults
@ -251,6 +317,8 @@ if __name__ == "__main__":
malwaredomainsURL = 'http://mirror1.malwaredomains.com/files/justdomains'
expireddomainsqueryURL = 'https://www.expireddomains.net/domain-name-search'
ocr = args.ocr
timestamp = time.strftime("%Y%m%d_%H%M%S")
useragent = 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)'
@ -291,26 +359,21 @@ If you plan to use this content for illegal purpose, don't. Have a nice day :)'
checkDomain(single)
quit(0)
# Calculate estimated runtime based on timing variable if checking domain categorization for all returned domains
if check:
if timing == 0:
seconds = 90
elif timing == 1:
seconds = 60
elif timing == 2:
seconds = 30
elif timing == 3:
seconds = 20
elif timing == 4:
seconds = 10
else:
seconds = 0
runtime = (maxresults * seconds) / 60
print("[*] Peforming Domain Categorization Lookups:")
print("[*] Estimated duration is {} minutes. Modify lookup speed with -t switch.\n".format(int(runtime)))
else:
pass
# Perform detailed domain reputation checks against input file
if filename:
try:
with open(filename, 'r') as domainsList:
for line in domainsList.read().splitlines():
checkDomain(line)
doSleep(timing)
except KeyboardInterrupt:
print('Caught keyboard interrupt. Exiting!')
quit(0)
except Exception as e:
print('[-] {}'.format(e))
quit(1)
quit(0)
# Generic Proxy support
# TODO: add as a parameter
proxies = {
@ -328,15 +391,15 @@ If you plan to use this content for illegal purpose, don't. Have a nice day :)'
urls = []
# Use the keyword string to narrow domain search if provided
if query:
print('[*] Fetching expired or deleted domains containing "{}"'.format(query))
if keyword:
print('[*] Fetching expired or deleted domains containing "{}"'.format(keyword))
for i in range (0,maxresults,25):
if i == 0:
urls.append("{}/?q={}".format(expireddomainsqueryURL,query))
headers['Referer'] ='https://www.expireddomains.net/domain-name-search/?q={}&start=1'.format(query)
urls.append("{}/?q={}".format(expireddomainsqueryURL,keyword))
headers['Referer'] ='https://www.expireddomains.net/domain-name-search/?q={}&start=1'.format(keyword)
else:
urls.append("{}/?start={}&q={}".format(expireddomainsqueryURL,i,query))
headers['Referer'] ='https://www.expireddomains.net/domain-name-search/?start={}&q={}'.format((i-25),query)
urls.append("{}/?start={}&q={}".format(expireddomainsqueryURL,i,keyword))
headers['Referer'] ='https://www.expireddomains.net/domain-name-search/?start={}&q={}'.format((i-25),keyword)
# If no keyword provided, retrieve list of recently expired domains in batches of 25 results.
else:
@ -375,9 +438,9 @@ If you plan to use this content for illegal purpose, don't. Have a nice day :)'
# Turn the HTML into a Beautiful Soup object
soup = BeautifulSoup(domains, 'lxml')
table = soup.find("table")
try:
table = soup.find("table")
for row in table.findAll('tr')[1:]:
# Alternative way to extract domain name
@ -388,7 +451,7 @@ If you plan to use this content for illegal purpose, don't. Have a nice day :)'
if len(cells) >= 1:
output = ""
if query:
if keyword:
c0 = row.find('td').find('a').text # domain
c1 = cells[1].find(text=True) # bl
@ -460,7 +523,7 @@ If you plan to use this content for illegal purpose, don't. Have a nice day :)'
elif check == True:
bluecoat = checkBluecoat(c0)
print("[+] {}: {}".format(c0, bluecoat))
ibmxforce = checkIBMxForce(c0)
ibmxforce = checkIBMXForce(c0)
print("[+] {}: {}".format(c0, ibmxforce))
# Sleep to avoid captchas
doSleep(timing)
@ -470,8 +533,14 @@ If you plan to use this content for illegal purpose, don't. Have a nice day :)'
# Append parsed domain data to list
data.append([c0,c3,c4,available,status,bluecoat,ibmxforce])
except Exception as e:
print(e)
#print(e)
pass
# Check for valid results before continuing
if not(data):
print("[-] No results found for keyword: {0}".format(keyword))
quit(0)
# Sort domain list by column 2 (Birth Year)
sortedData = sorted(data, key=lambda x: x[1], reverse=True)

View File

@ -2,3 +2,5 @@ requests==2.13.0
texttable==0.8.7
beautifulsoup4==4.5.3
lxml
pillow==5.0.0
pytesseract