Introducing dirtywords - A Targeted Word List Generator

dirtywords.png

“How do you craft a word list?”

It’s one of the questions I see asked time and time again by beginners to the bug bounty scene. While there are many great options out there, crafting a good custom word list often takes months or even years of effort. Many factors come into play, such as past experiences, recent CVEs, the underlying technologies in use on your target, familiar naming conventions, and much, much more.

While I won’t be going into how to craft a fully custom word list in this post, I do want to highlight a tool I just released that was built for a related task - generating a word list for a specific target.

targetedlist.png

“Wait, why do I even need a word list?”

I know this should be obvious, but in case some of you reading this are unsure of what a word list is used for, I’ll back up a bit and go over the basics.

When approaching a target, whether on a penetration testing assessment or in a bug bounty program, it is usually important to conduct reconnaissance of the in-scope infrastructure. This is often done by conducting open source intelligence (OSINT), subdomain brute-forcing, reviewing certificate transparency records, performing DNS zone transfers, and more. Once targets are discovered, or more specifically once web targets are discovered, the enumeration phase begins.

Aside from manually (or automatically) spidering the application, an attacker can brute-force directory and file names in order to discover files and folders hosted on the target server. Oftentimes directory brute-forcing tools have default word lists built in, but these are usually outdated when it comes to recent CVEs and are not built with a specific target in mind. Repositories such as SecLists supply a more up-to-date and focused set of word lists, but still do not solve the problem of building a target-specific list.
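At its core, directory brute-forcing is just joining each word from a list onto a base URL and checking the server's response. Here is a minimal sketch in Go; the helper name `candidateURLs` and the hardcoded word list are my own for illustration, not taken from any particular tool:

```go
// A minimal directory brute-forcing sketch: build candidate URLs from a
// word list and probe each one, reporting anything that is not a 404.
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

// candidateURLs joins a base URL with each word in the list.
func candidateURLs(base string, words []string) []string {
	base = strings.TrimRight(base, "/")
	out := make([]string, 0, len(words))
	for _, w := range words {
		out = append(out, base+"/"+strings.TrimLeft(w, "/"))
	}
	return out
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: dirbrute <base-url>")
		return
	}
	words := []string{"admin", "backup", "login", "api"} // stand-in for a real word list
	client := &http.Client{Timeout: 5 * time.Second}
	for _, u := range candidateURLs(os.Args[1], words) {
		resp, err := client.Head(u)
		if err != nil {
			continue // host unreachable or timed out; move on
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusNotFound {
			fmt.Println(resp.StatusCode, u) // anything but 404 merits a closer look
		}
	}
}
```

A real tool adds concurrency, recursion into discovered folders, and wildcard-response filtering on top of this skeleton - but the word list is what decides whether it finds anything.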

seclists.PNG

So how do I build a “target-specific” word list?

A target-specific word list is a list of words based on company culture, products, commonly used technologies, connected assets, and other related data. Tools like CeWL, for example, assist in building a target-specific word list by using the words contained in the content of a hosted web application. This works well when the file structure of the application shares its vocabulary with the content hosted on its pages, but oftentimes after using CeWL I have been left searching for more. With this issue at hand and a newfound love of Go, I decided to build a tool of my own.

When an organization has multiple public-facing web applications, it is highly likely that file and folder names are shared, or at least similar, across those applications. Even if the applications have entirely different functionality, web developers may get in the habit of using the same naming schemes across projects (or it may even be a company standard or policy!).

After seeing file and folder names reused multiple times within related environments, I started building a word list from common names I had seen in the past within the same organization; but then it hit me - there is an easier way to do this. As I’m sure many of you are aware, Corben Leo has released an awesome tool named gau, based on waybackurls by tomnomnom. After using both of these tools extensively, I decided to build upon their foundation to generate custom word lists.

dirtywords

dirtywords-github.PNG

After reviewing the aforementioned tools, I developed dirtywords to use functionality similar to gau's. By querying Common Crawl, the Internet Archive’s Wayback Machine, and AlienVault’s Open Threat Exchange for a list of archived URLs, an attacker can identify which words have historically been used for file and folder names within an organization’s public-facing assets. The tool pulls out all of these file and folder paths (or only the ones that meet specified criteria) and writes them to a word list on the local system. The list is then sorted and deduplicated.

dirtywords2.PNG

As shown above, the tool was used to generate a word list for the AT&T bug bounty program containing over 900,000 words (all based on historically valid file/folder paths). I have been playing around with the tool and implementing it into my own workflow, and I decided to release it to the open source community in hopes that it will help others find bugs! I do not expect you to use the tool, but if you do decide to, I would love to hear any success stories or suggestions for improvement! To get in contact, you can reach me on Twitter.

