Cisco offers a daily list of the million most queried domain names from Umbrella (OpenDNS) users. I had some free time this weekend and decided to spend it playing around with the data to see what I could find, so I spun up a Lightsail server and got to work.
Grabbing the file is as simple as:
wget http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
You can retrieve a specific date like this:
wget http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-yyyy-mm-dd.csv.zip
(Looks like 2017-01-20 is the earliest they have online).
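If you want to pull down a run of dates at once, a quick loop over that URL pattern works (a rough sketch, assuming GNU date and that the dated files stay at the same URL):
# Grab every daily file from 2017-01-20 through today.
stop=$(date -I -d tomorrow)
d="2017-01-20"
while [ "$d" != "$stop" ]; do
  wget "http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-$d.csv.zip"
  d=$(date -I -d "$d + 1 day")
done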
Once you get that downloaded and unzipped (unzip top-1m.csv.zip) you can start exploring.
You can pull out the top 10 domains with this command:
head -n 10 top-1m.csv
1,google.com
2,www.google.com
3,microsoft.com
4,facebook.com
5,doubleclick.net
6,g.doubleclick.net
7,clients4.google.com
8,googleads.g.doubleclick.net
9,apple.com
10,fbcdn.net
You can search for keywords with this command:
cat top-1m.csv | grep "opendns"
437,opendns.com
719,hydra.opendns.com
720,sync.hydra.opendns.com
1314,disthost.opendns.com
2756,api.opendns.com
4565,cacerts.opendns.com
5569,ipf.opendns.com
5699,block.opendns.com
7024,updates.opendns.com
8482,bpb.opendns.com
To count the domain levels, use this command:
awk -F, '{count=split($2,a,"."); print count}' top-1m.csv | sort | uniq -c | awk '{print $2,$1}' | sort -k1,1n
1 1086
2 263509
3 469756
4 193802
5 54281
6 13698
7 2952
8 689
9 172
10 16
11 26
12 2
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 1
23 1
(Full Output)
Notice anything strange here? Hint: A domain name requires at least two levels to be valid.
To find the broken DNS names in this list, check the single-label entries against the IANA list of valid TLDs (wget https://data.iana.org/TLD/tlds-alpha-by-domain.txt):
cat top-1m.csv | awk -F, 'BEGIN {file="tlds-alpha-by-domain.txt" ; while ((getline line < file) > 0) {if (line ~ /#/) continue; tld[tolower(line)] = 1}} {foo=split($2,a,"."); if (foo == 1) {if (!(a[1] in tld)) {print $0}}}'
1200,home
1490,local
2082,za
3916,lan
6350,url
10173,belkin
10869,uop
11187,localdomain
12887,localhost
Find domains added to the list today.
I wrote a script to download the last two days of files and compare them for new domains:
https://gist.github.com/jgamblin/184590e2ba64371730e435ab2977e4cf
You can find the output for April 24, 2017 here.
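If you just want the general idea without opening the gist, a minimal sketch of that kind of comparison looks something like this (not necessarily what the gist does; it assumes GNU date and that both dated files are available):
# Download today's and yesterday's lists.
today=$(date -I)
yesterday=$(date -I -d yesterday)
wget "http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-$today.csv.zip"
wget "http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-$yesterday.csv.zip"
# Pull out just the domain column from each file and sort for comm.
unzip -p "top-1m-$today.csv.zip" | cut -d, -f2 | sort > today.txt
unzip -p "top-1m-$yesterday.csv.zip" | cut -d, -f2 | sort > yesterday.txt
# Domains that are in today's list but not yesterday's.
comm -23 today.txt yesterday.txt > new-domains.txt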
Overall I am really impressed with this data and will be using it for more research and to track trends across the internet. Cisco still has some cleanup to do, but it is an amazingly valuable free tool.
Also, I have recently fallen in love with sprunge for pushing data to an ad-free “pastebin” from the command line:
cat file.txt | curl -F 'sprunge=<-' http://sprunge.us
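Any of the pipelines above can be pushed the same way, for example (just a usage sketch):
grep "opendns" top-1m.csv | curl -F 'sprunge=<-' http://sprunge.us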