Add OCR support for SiteReview CAPTCHA using tesseract

Add support for input file list of potential domains
Add additional error checking for ExpiredDomains.net parsing
Change -q/--query switch to -k/--keyword to better match its purpose
2018-04-11 14:46:15 +02:00 · 2018-04-11 14:46:15 +02:00 · 26dd64870d
parent 3b6078e9fd
commit 26dd64870d
4 changed files with 181 additions and 86 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,3 @@
-*.html
+*.html
+*.txt
+*.jpg
--- a/README.md
+++ b/README.md
@@ -4,38 +4,52 @@ Authors Joe Vest (@joevest) & Andrew Chiles (@andrewchiles)

 Domain name selection is an important aspect of preparation for penetration tests and especially Red Team engagements. Commonly, domains that were used previously for benign purposes and were properly categorized can be purchased for only a few dollars. Such domains can allow a team to bypass reputation based web filters and network egress restrictions for phishing and C2 related tasks. 

-This Python based tool was written to quickly query the Expireddomains.net search engine for expired/available domains with a previous history of use. It then optionally queries for domain reputation against services like Symantec Web Filter (BlueCoat), IBM X-Force, and Cisco Talos. The primary tool output is a timestamped HTML table style report.
+This Python based tool was written to quickly query the Expireddomains.net search engine for expired/available domains with a previous history of use. It then optionally queries for domain reputation against services like Symantec WebPulse (BlueCoat), IBM X-Force, and Cisco Talos. The primary tool output is a timestamped HTML table style report.

 ## Changes
-    - 9 April 2018
-        + Added -t switch for timing control. -t <1-5>
-        + Added Google SafeBrowsing and PhishTank reputation checks
-        + Fixed bug in IBMXForce response parsing
-    - 7 April 2018
-        + Fixed support for Symantec WebPulse Site Review (formerly Blue Coat WebFilter)
-        + Added Cisco Talos Domain Reputation check
-        + Added feature to perform a reputation check against a single non-expired domain. This is useful when monitoring reputation for domains used in ongoing campaigns and engagements.

-    - 6 June 2017
-        + Added python 3 support
-        + Code cleanup and bug fixes
-        + Added Status column (Available, Make Offer, Price, Backorder, etc)
+- 11 April 2018
+    + Added OCR support for CAPTCHA solving with tesseract. Thanks to t94j0 for the idea in [AIRMASTER](https://github.com/t94j0/AIRMASTER)  
+    + Added support for input file list of potential domains (-f/--filename)
+    + Changed -q/--query switch to -k/--keyword to better match its purpose
+    + Added additional error checking for ExpiredDomains.net parsing
+
+- 9 April 2018
+    + Added -t switch for timing control. -t <0-5>
+    + Added Google SafeBrowsing and PhishTank reputation checks
+    + Fixed bug in IBMXForce response parsing
+
+- 7 April 2018
+    + Fixed support for Symantec WebPulse Site Review (formerly Blue Coat WebFilter)
+    + Added Cisco Talos Domain Reputation check
+    + Added feature to perform a reputation check against a single non-expired domain. This is useful when monitoring reputation for domains used in ongoing campaigns and engagements.
+
+- 6 June 2017
+    + Added python 3 support
+    + Code cleanup and bug fixes
+    + Added Status column (Available, Make Offer, Price, Backorder, etc)

 ## Features

 - Retrieve specified number of recently expired and deleted domains (.com, .net, .org primarily) from ExpiredDomains.net
 - Retrieve available domains based on keyword search from ExpiredDomains.net
- Perform reputation checks against the Symantec Web Filter (BlueCoat), IBM x-Force, Cisco Talos, Google SafeBrowsing, and PhishTank services
+- Perform reputation checks against the Symantec WebPulse Site Review (BlueCoat), IBM X-Force, Cisco Talos, Google SafeBrowsing, and PhishTank services
 - Sort results by domain age (if known)
 - Text-based table and HTML report output with links to reputation sources and Archive.org entry

-## Usage
+## Installation

-Install Requirements
+Install Python requirements

    pip3 install -r requirements.txt
-    or
-    pip3 install requests texttable beautifulsoup4 lxml
+    
+Optional - Install additional OCR support dependencies
+
+- Debian/Ubuntu: `apt-get install tesseract-ocr python3-imaging`
+
+- macOS: `brew install tesseract`
+
+## Usage

 List DomainHunter options
    
@@ -48,9 +62,13 @@ List DomainHunter options

    optional arguments:
      -h, --help            show this help message and exit
-      -q QUERY, --query QUERY
+      -k KEYWORD, --keyword KEYWORD
                            Keyword used to refine search results
      -c, --check           Perform domain reputation checks
+      -f FILENAME, --filename FILENAME
+                            Specify input file of line delimited domain names to
+                            check
+      --ocr                 Perform OCR on CAPTCHAs when present
      -r MAXRESULTS, --maxresults MAXRESULTS
                            Number of results to return when querying latest
                            expired/deleted domains
@@ -63,7 +81,7 @@ List DomainHunter options
                            Fastest(5) = no delay
      -w MAXWIDTH, --maxwidth MAXWIDTH
                            Width of text table
-      -v, --version         show program's version number and exit
+      -V, --version         show program's version number and exit

 Use defaults to check for most recent 100 domains and check reputation
    
@@ -89,9 +107,13 @@ Perform all reputation checks for a single domain
    [*] Cisco Talos: mydomain.com
    [+] mydomain.com: Web Hosting (Score: Neutral)

-Search for available domains with search term of "dog", max results of 100, and check reputation
+Perform all reputation checks for a list of domains at max speed with OCR of CAPTCHAs
+
+    python3 ./domainhunter.py -f <domainslist.txt> -t 5 --ocr
+
+Search for available domains with keyword term of "dog", max results of 100, and check reputation
    
-    python3 ./domainhunter.py -q dog -r 100 -c
+    python3 ./domainhunter.py -k dog -r 100 -c
     ____   ___  __  __    _    ___ _   _   _   _ _   _ _   _ _____ _____ ____
    |  _ \ / _ \|  \/  |  / \  |_ _| \ | | | | | | | | | \ | |_   _| ____|  _ \
    | | | | | | | |\/| | / _ \  | ||  \| | | |_| | | | |  \| | | | |  _| | |_) |
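Note on the `-f/--filename` option documented above: the following is a minimal sketch of how a line-delimited domain list could be tidied before each entry is checked, assuming that blank lines and surrounding whitespace should be ignored. `iter_domains` is a hypothetical helper and is not part of this commit, which passes each line to `checkDomain` as read.

```python
def iter_domains(lines):
    """Yield cleaned domain names from an iterable of text lines.

    Hypothetical helper (not in the commit): trims surrounding
    whitespace and skips lines that are empty after trimming.
    """
    for line in lines:
        domain = line.strip()
        if domain:
            yield domain


if __name__ == "__main__":
    # An open file object could be passed instead of this sample list.
    sample = ["example.com\n", "\n", "  example.org  \n"]
    for domain in iter_domains(sample):
        print(domain)
```

Because the helper accepts any iterable of strings, it could wrap the file object opened for `-f` without reading the whole file into memory first.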
--- a/domainhunter.py
+++ b/domainhunter.py
@@ -4,17 +4,17 @@
 ## Author:      @joevest and @andrewchiles
 ## Description: Checks expired domains, reputation/categorization, and Archive.org history to determine 
 ##              good candidates for phishing and C2 domain names
-
-# To-do:
-# Code cleanup/optimization
-# Add Authenticated "Members-Only" option to download CSV/txt (https://member.expireddomains.net/domains/expiredcom/)
-
+# Add OCR support for BlueCoat/SiteReview CAPTCHA using tesseract
+# Add support for input file list of potential domains
+# Add additional error checking for ExpiredDomains.net parsing
+# Changed -q/--query switch to -k/--keyword to better match its purpose
 import time 
 import random
 import argparse
 import json
+import base64

-__version__ = "20180409"
+__version__ = "20180411"

 ## Functions

@@ -29,9 +29,7 @@ def doSleep(timing):
        time.sleep(random.randrange(10,20))
    elif timing == 4:
        time.sleep(random.randrange(5,10))
-    else:
-        # Maxiumum speed - no delay
-        pass
+    # There's no elif timing == 5 here because we don't want to sleep for -t 5

 def checkBluecoat(domain):
    try:
@@ -43,7 +41,6 @@ def checkBluecoat(domain):

        print('[*] BlueCoat: {}'.format(domain))
        response = s.post(url,headers=headers,json=postData,verify=False)
-
        responseJSON = json.loads(response.text)

        if 'errorType' in responseJSON:
@@ -51,17 +48,46 @@ def checkBluecoat(domain):
        else:
            a = responseJSON['categorization'][0]['name']
        
-        # # Print notice if CAPTCHAs are blocking accurate results
-        # if a == 'captcha':
-        #     print('[-] Error: Blue Coat CAPTCHA received. Change your IP or manually solve a CAPTCHA at "https://sitereview.bluecoat.com/sitereview.jsp"')
-        #     #raw_input('[*] Press Enter to continue...')
+        # Print notice if CAPTCHAs are blocking accurate results and attempt to solve if --ocr
+        if a == 'captcha':
+            if ocr:
+                # This request is performed in a browser, but is not needed for our purposes
+                #captcharequestURL = 'https://sitereview.bluecoat.com/resource/captcha-request'
+                #print('[*] Requesting CAPTCHA')
+                #response = s.get(url=captcharequestURL,headers=headers,cookies=cookies,verify=False)
+
+                print('[*] Received CAPTCHA challenge!')
+                captcha = solveCaptcha('https://sitereview.bluecoat.com/resource/captcha.jpg',s)
+                
+                if captcha:
+                    b64captcha = base64.b64encode(captcha.encode('utf-8')).decode('utf-8')
+                   
+                    # Send CAPTCHA solution via GET since inclusion with the domain categorization request doesn't work anymore
+                    captchasolutionURL = 'https://sitereview.bluecoat.com/resource/captcha-request/{0}'.format(b64captcha)
+                    print('[*] Submitting CAPTCHA at {0}'.format(captchasolutionURL))
+                    response = s.get(url=captchasolutionURL,headers=headers,verify=False)
+
+                    # Try the categorization request again
+                    response = s.post(url,headers=headers,json=postData,verify=False)
+
+                    responseJSON = json.loads(response.text)
+
+                    if 'errorType' in responseJSON:
+                        a = responseJSON['errorType']
+                    else:
+                        a = responseJSON['categorization'][0]['name']
+                else:
+                    print('[-] Error: Failed to solve BlueCoat CAPTCHA with OCR! Manually solve at "https://sitereview.bluecoat.com/sitereview.jsp"')
+            else:
+                print('[-] Error: BlueCoat CAPTCHA received. Try --ocr flag or manually solve a CAPTCHA at "https://sitereview.bluecoat.com/sitereview.jsp"')

        return a
-    except:
-        print('[-] Error retrieving Bluecoat reputation!')
+
+    except Exception as e:
+        print('[-] Error retrieving Bluecoat reputation! {0}'.format(e))
        return "-"

-def checkIBMxForce(domain):
+def checkIBMXForce(domain):
    try: 
        url = 'https://exchange.xforce.ibmcloud.com/url/{}'.format(domain)
        headers = {'User-Agent':useragent,
@@ -184,7 +210,7 @@ def checkMXToolbox(domain):

 def downloadMalwareDomains(malwaredomainsURL):
    url = malwaredomainsURL
-    response = s.get(url,headers=headers,verify=False)
+    response = s.get(url=url,headers=headers,verify=False)
    responseText = response.text
    if response.status_code == 200:
        return responseText
@@ -203,7 +229,7 @@ def checkDomain(domain):
    bluecoat = checkBluecoat(domain)
    print("[+] {}: {}".format(domain, bluecoat))
    
-    ibmxforce = checkIBMxForce(domain)
+    ibmxforce = checkIBMXForce(domain)
    print("[+] {}: {}".format(domain, ibmxforce))

    ciscotalos = checkTalos(domain)
@@ -211,9 +237,43 @@ def checkDomain(domain):
    print("")
    return

+def solveCaptcha(url,session):  
+    # Downloads CAPTCHA image and saves to current directory for OCR with tesseract
+    # Returns CAPTCHA string or False if error occurred
+    jpeg = 'captcha.jpg'
+    try:
+        response = session.get(url=url,headers=headers,verify=False, stream=True)
+        if response.status_code == 200:
+            with open(jpeg, 'wb') as f:
+                response.raw.decode_content = True
+                shutil.copyfileobj(response.raw, f)
+        else:
+            print('[-] Error downloading CAPTCHA file!')
+            return False
+
+        text = pytesseract.image_to_string(Image.open(jpeg))
+        text = text.replace(" ", "")
+        return text
+    except Exception as e:
+        print("[-] Error solving CAPTCHA - {0}".format(e))
+        return False
+
 ## MAIN
 if __name__ == "__main__":

+    parser = argparse.ArgumentParser(description='Finds expired domains, domain categorization, and Archive.org history to determine good candidates for C2 and phishing domains')
+    parser.add_argument('-k','--keyword', help='Keyword used to refine search results', required=False, default=False, type=str, dest='keyword')
+    parser.add_argument('-c','--check', help='Perform domain reputation checks', required=False, default=False, action='store_true', dest='check')
+    parser.add_argument('-f','--filename', help='Specify input file of line delimited domain names to check', required=False, default=False, type=str, dest='filename')
+    parser.add_argument('--ocr', help='Perform OCR on CAPTCHAs when present', required=False, default=False, action='store_true')
+    parser.add_argument('-r','--maxresults', help='Number of results to return when querying latest expired/deleted domains', required=False, default=100, type=int, dest='maxresults')
+    parser.add_argument('-s','--single', help='Performs detailed reputation checks against a single domain name/IP.', required=False, default=False, dest='single')
+    parser.add_argument('-t','--timing', help='Modifies request timing to avoid CAPTCHAs. Slowest(0) = 90-120 seconds, Default(3) = 10-20 seconds, Fastest(5) = no delay', required=False, default=3, type=int, choices=range(0,6), dest='timing')
+    parser.add_argument('-w','--maxwidth', help='Width of text table', required=False, default=400, type=int, dest='maxwidth')
+    parser.add_argument('-V','--version', action='version',version='%(prog)s {version}'.format(version=__version__))
+    args = parser.parse_args()
+
+    # Load dependent modules
    try:
        import requests
        from bs4 import BeautifulSoup
@@ -221,24 +281,30 @@ if __name__ == "__main__":
        
    except Exception as e:
        print("Expired Domains Reputation Check")
-        print("[-] Missing dependencies: {}".format(str(e)))
-        print("[*] Install required dependencies by running `pip install -r requirements.txt`")
+        print("[-] Missing basic dependencies: {}".format(str(e)))
+        print("[*] Install required dependencies by running `pip3 install -r requirements.txt`")
        quit(0)

-    parser = argparse.ArgumentParser(description='Finds expired domains, domain categorization, and Archive.org history to determine good candidates for C2 and phishing domains')
-    parser.add_argument('-q','--query', help='Keyword used to refine search results', required=False, default=False, type=str, dest='query')
-    parser.add_argument('-c','--check', help='Perform domain reputation checks', required=False, default=False, action='store_true', dest='check')
-    parser.add_argument('-r','--maxresults', help='Number of results to return when querying latest expired/deleted domains', required=False, default=100, type=int, dest='maxresults')
-    parser.add_argument('-s','--single', help='Performs detailed reputation checks against a single domain name/IP.', required=False, default=False, dest='single')
-    parser.add_argument('-t','--timing', help='Modifies request timing to avoid CAPTCHAs. Slowest(0) = 90-120 seconds, Default(3) = 10-20 seconds, Fastest(5) = no delay', required=False, default=3, type=int, choices=range(0,6), dest='timing')
-    parser.add_argument('-w','--maxwidth', help='Width of text table', required=False, default=400, type=int, dest='maxwidth')
-    parser.add_argument('-v','--version', action='version',version='%(prog)s {version}'.format(version=__version__))
-    args = parser.parse_args()
+    # Load OCR related modules if --ocr flag is set since these can be difficult to get working
+    if args.ocr:
+        try:
+            import pytesseract
+            from PIL import Image
+            import shutil
+        except Exception as e:
+            print("Expired Domains Reputation Check")
+            print("[-] Missing OCR dependencies: {}".format(str(e)))
+            print("[*] Install required Python dependencies by running `pip3 install -r requirements.txt`")
+            print("[*] Ubuntu/Debian - Install tesseract by running `apt-get install tesseract-ocr python3-imaging`")
+            print("[*] macOS - Install tesseract with homebrew by running `brew install tesseract`")
+            quit(1)

 ## Variables
-    query = args.query
+    keyword = args.keyword

    check = args.check
+
+    filename = args.filename
    
    maxresults = args.maxresults
    
@@ -251,6 +317,8 @@ if __name__ == "__main__":
    malwaredomainsURL = 'http://mirror1.malwaredomains.com/files/justdomains'
    expireddomainsqueryURL = 'https://www.expireddomains.net/domain-name-search'
    
+    ocr = args.ocr
+
    timestamp = time.strftime("%Y%m%d_%H%M%S")
            
    useragent = 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)'
@@ -291,26 +359,21 @@ If you plan to use this content for illegal purpose, don't.  Have a nice day :)'
        checkDomain(single)
        quit(0)

-    # Calculate estimated runtime based on timing variable if checking domain categorization for all returned domains
-    if check:
-        if timing == 0:
-            seconds = 90
-        elif timing == 1:
-            seconds = 60
-        elif timing == 2:
-            seconds = 30
-        elif timing == 3:
-            seconds = 20
-        elif timing == 4:
-            seconds = 10
-        else:
-            seconds = 0
-        runtime = (maxresults * seconds) / 60
-        print("[*] Peforming Domain Categorization Lookups:")
-        print("[*] Estimated duration is {} minutes. Modify lookup speed with -t switch.\n".format(int(runtime)))
-    else:
-        pass
-      
+    # Perform detailed domain reputation checks against input file
+    if filename:
+        try:
+            with open(filename, 'r') as domainsList:
+                for line in domainsList.read().splitlines():
+                    checkDomain(line)
+                    doSleep(timing)
+        except KeyboardInterrupt:
+            print('Caught keyboard interrupt. Exiting!')
+            quit(0)
+        except Exception as e:
+            print('[-] {}'.format(e))
+            quit(1)
+        quit(0)
+     
    # Generic Proxy support 
    # TODO: add as a parameter 
    proxies = {
@@ -328,15 +391,15 @@ If you plan to use this content for illegal purpose, don't.  Have a nice day :)'
    urls = []

    # Use the keyword string to narrow domain search if provided
-    if query:
-        print('[*] Fetching expired or deleted domains containing "{}"'.format(query))
+    if keyword:
+        print('[*] Fetching expired or deleted domains containing "{}"'.format(keyword))
        for i in range (0,maxresults,25):
            if i == 0:
-                urls.append("{}/?q={}".format(expireddomainsqueryURL,query))
-                headers['Referer'] ='https://www.expireddomains.net/domain-name-search/?q={}&start=1'.format(query)
+                urls.append("{}/?q={}".format(expireddomainsqueryURL,keyword))
+                headers['Referer'] ='https://www.expireddomains.net/domain-name-search/?q={}&start=1'.format(keyword)
            else:
-                urls.append("{}/?start={}&q={}".format(expireddomainsqueryURL,i,query))
-                headers['Referer'] ='https://www.expireddomains.net/domain-name-search/?start={}&q={}'.format((i-25),query)
+                urls.append("{}/?start={}&q={}".format(expireddomainsqueryURL,i,keyword))
+                headers['Referer'] ='https://www.expireddomains.net/domain-name-search/?start={}&q={}'.format((i-25),keyword)
    
    # If no keyword provided, retrieve list of recently expired domains in batches of 25 results.
    else:
@@ -375,9 +438,9 @@ If you plan to use this content for illegal purpose, don't.  Have a nice day :)'

        # Turn the HTML into a Beautiful Soup object
        soup = BeautifulSoup(domains, 'lxml')
-        table = soup.find("table")
-
+        
        try:
+            table = soup.find("table")
            for row in table.findAll('tr')[1:]:

                # Alternative way to extract domain name
@@ -388,7 +451,7 @@ If you plan to use this content for illegal purpose, don't.  Have a nice day :)'
                if len(cells) >= 1:
                    output = ""

-                    if query:
+                    if keyword:

                        c0 = row.find('td').find('a').text   # domain
                        c1 = cells[1].find(text=True)   # bl
@@ -460,7 +523,7 @@ If you plan to use this content for illegal purpose, don't.  Have a nice day :)'
                        elif check == True:
                            bluecoat = checkBluecoat(c0)
                            print("[+] {}: {}".format(c0, bluecoat))
-                            ibmxforce = checkIBMxForce(c0)
+                            ibmxforce = checkIBMXForce(c0)
                            print("[+] {}: {}".format(c0, ibmxforce))
                            # Sleep to avoid captchas
                            doSleep(timing)
@@ -470,8 +533,14 @@ If you plan to use this content for illegal purpose, don't.  Have a nice day :)'
                        # Append parsed domain data to list
                        data.append([c0,c3,c4,available,status,bluecoat,ibmxforce])
        except Exception as e: 
-            print(e) 
-            
+            #print(e)
+            pass
+
+    # Check for valid results before continuing
+    if not data:
+        if keyword:
+            print("[-] No results found for keyword: {0}".format(keyword))
+        else:
+            print("[-] No results found")
+        quit(0)
+
    # Sort domain list by column 2 (Birth Year)
    sortedData = sorted(data, key=lambda x: x[1], reverse=True) 

--- a/requirements.txt
+++ b/requirements.txt
@@ -2,3 +2,5 @@ requests==2.13.0
 texttable==0.8.7
 beautifulsoup4==4.5.3
 lxml
+pillow==5.0.0
+pytesseract
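Note on the new optional requirements: `pillow` and `pytesseract` are only needed when `--ocr` is set, so it may be useful to report exactly which modules are missing. The sketch below uses `importlib.util.find_spec` from the standard library to test for top-level modules without importing them. `missing_modules` is a hypothetical helper; the commit itself wraps the imports in a `try`/`except` block.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found.

    Hypothetical helper: find_spec() returns None when a top-level
    module is not installed, so nothing is imported here.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "PIL" is the import name provided by the pillow package.
    missing = missing_modules(["pytesseract", "PIL"])
    if missing:
        print("[-] Missing OCR dependencies: {}".format(", ".join(missing)))
    else:
        print("[+] OCR dependencies found")
```

This only confirms that the Python packages are present. The `tesseract` binary installed via `apt-get` or `brew` would still need to be checked separately.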