DNS Data Query Log
Every request to be connected to a website results in a notation in a Data Query Log, part of the Internet's Domain Name System or DNS. These entries provide raw data for creating a "twitteresque" pulse of the city. [Caution is necessary in releasing this data as privacy and security issues are associated with this big data trove.]
Scylla and Charybdis
(Commons image courtesy of Wikimedia)
In James Gillray's 1793 Scylla and Charybdis, William Pitt sails between the rock of democracy and the whirlpool of arbitrary power toward the distant haven of liberty. Security and privacy?
The most widely used DNS software on the Internet by far is BIND. If query logging is activated, its data query log contains one entry for each DNS query made to the name server. Here's a typical entry:
10-Apr-2000 00:01:20.308 XX+/10.4.3.2/host.foo.com/A/IN
The BIND software logs:
- Time of Inquiry: 10-Apr-2000 00:01:20.308
- Whether the query was recursive: XX+ marks a recursive query, XX a non-recursive one
- The IP of Client: 10.4.3.2
- The Name Inquiry: host.foo.com
- The type of query: A
- The class of query: IN (Internet)
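To make the field layout concrete, here is a minimal Python sketch that splits one entry into the fields listed above. It assumes every entry follows exactly the format of the sample line; the names QueryLogEntry and parse_entry are illustrative and are not part of BIND.

```python
from datetime import datetime
from typing import NamedTuple


class QueryLogEntry(NamedTuple):
    """One parsed entry (illustrative structure, not a BIND type)."""
    timestamp: datetime
    recursive: bool
    client_ip: str
    name: str
    qtype: str
    qclass: str


def parse_entry(line: str) -> QueryLogEntry:
    """Parse an entry shaped like the sample line in the text."""
    # The date and time are space-separated; the rest is slash-separated.
    date_part, time_part, rest = line.strip().split(" ", 2)
    timestamp = datetime.strptime(
        f"{date_part} {time_part}", "%d-%b-%Y %H:%M:%S.%f"
    )
    flags, client_ip, name, qtype, qclass = rest.split("/")
    # A trailing "+" on the flag field marks a recursive query.
    return QueryLogEntry(
        timestamp, flags.endswith("+"), client_ip, name.lower(), qtype, qclass
    )


entry = parse_entry("10-Apr-2000 00:01:20.308 XX+/10.4.3.2/host.foo.com/A/IN")
print(entry.name, entry.qtype, entry.recursive)
```

Other BIND versions and configurations write query log entries in different layouts, so the split logic would need adjusting for a real deployment.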
The DNS Data Query Log might be indexed by time to facilitate searches over ranges such as the following:
- Any time
- Past hour
- Past 24 hours
- Past week
- Past month
- Past year
- Custom range...
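One way such time-range searches might work is sketched below in Python. It assumes entries carry timestamps in the sample format shown earlier. The window table, the function names, and the second and third sample lines are invented for illustration, and "month" and "year" are approximated as 30 and 365 days.

```python
from datetime import datetime, timedelta

# Hypothetical window table mirroring the list above.
WINDOWS = {
    "past hour": timedelta(hours=1),
    "past 24 hours": timedelta(hours=24),
    "past week": timedelta(weeks=1),
    "past month": timedelta(days=30),   # approximation
    "past year": timedelta(days=365),   # approximation
}


def entry_time(line: str) -> datetime:
    """Read the timestamp from the first two space-separated fields."""
    date_part, time_part = line.split(" ", 2)[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%d-%b-%Y %H:%M:%S.%f")


def filter_range(lines, start: datetime, end: datetime):
    """Custom range: keep entries with start <= timestamp <= end."""
    return [line for line in lines if start <= entry_time(line) <= end]


def filter_window(lines, window: str, now: datetime):
    """Named window: 'any time' keeps everything, others count back from now."""
    if window == "any time":
        return list(lines)
    return filter_range(lines, now - WINDOWS[window], now)


sample = [
    "10-Apr-2000 00:01:20.308 XX+/10.4.3.2/host.foo.com/A/IN",
    "09-Apr-2000 23:30:00.000 XX+/10.4.3.2/host.foo.com/A/IN",  # invented
    "01-Apr-2000 12:00:00.000 XX/10.4.3.9/other.foo.com/A/IN",  # invented
]
now = datetime(2000, 4, 10, 0, 5, 0)
print(len(filter_window(sample, "past hour", now)))
```

A log of the size discussed on this page would more plausibly be loaded into a database with a timestamp index than scanned line by line; the sketch only shows the filtering logic.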
Thoughts on the DNS Query Log
We discussed with Cloud Registry of Australia the implications and potential of the DNS Query Log. Here are a few points that arose in early conversations:
- it is a great idea to log the queries, for statistics, analysis, and to understand how the namespace is used
- one should not assume automatically that whoever runs the TLD will actually log this information
- doing this in real time is an involved operation, as is bringing the dispersed logs back into a data warehouse, aggregating them, and extracting meaningful information from them
- the dataset is likely to be huge
- caching at the ISP's nameservers makes it challenging to get an accurate view of what people actually type into the browser
- what you get is a subset of what's actually typed, which is good enough for your use case
- the numbers can't be trusted due to caching, but if we're looking for the names, then it's fine
- there are other pieces of information that a DNS server records, but they're less useful
- depending on how a TLD is structured, you might find that it is handled by diverse parties (e.g. of 5 DNS servers handling the TLD, only 3 might belong to the TLD registry, with the rest contracted to third parties, or run by universities).
Operating The Logs
The DNS log files are likely to be huge. How and by whom are they to be hosted? Here's one approach:
- log files should be managed by an independent entity
- this entity should scrub the logs of trademark indices, user footprints, and other sensitive data
- access and use standards should be set
- log maintenance costs should be assessed to users.
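As one illustration of what scrubbing user footprints could involve, the Python sketch below coarsens the client address in each entry before release. This is an assumption about method, not a policy stated here: it keeps only the /24 network of an IPv4 address (or the /48 of an IPv6 address), and the function names are hypothetical. Truncation reduces how precisely a query can be tied to a client but does not by itself make a log anonymous.

```python
import ipaddress


def truncate_ip(ip: str) -> str:
    """Zero the host bits: keep a /24 for IPv4 and a /48 for IPv6."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def scrub_entry(line: str) -> str:
    """Replace the client IP field of a slash-separated entry."""
    # Field layout assumed from the sample entry: flags/client/name/type/class.
    stamp_and_flags, client_ip, remainder = line.split("/", 2)
    return "/".join([stamp_and_flags, truncate_ip(client_ip), remainder])


scrubbed = scrub_entry("10-Apr-2000 00:01:20.308 XX+/10.4.3.2/host.foo.com/A/IN")
print(scrubbed)
```

Queried names themselves can also be sensitive, so a real scrubbing policy would need to cover more than the address field.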
The .nyc TLD's Domain Name Server will have the capacity to record every request for the address of a computer using a .nyc domain name. Two logs will be created from this stream of requests:
- Success Log - This log will show requests made for .nyc domain names that have been assigned or allocated.
- Error Log - The error log records requests for domain names that do not exist. This Error Log can be a "crystal ball" of sorts, providing insight into newly arising issues.
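The split into two logs can be sketched in Python as follows. The entry format shown earlier does not record whether a name exists, so this sketch assumes a separate set of registered .nyc names is available. That set, the function names, and the sample timestamps and client addresses are all invented for illustration.

```python
from collections import Counter


def queried_name(line: str) -> str:
    """Third slash-separated field of an entry is the queried name."""
    return line.split("/")[2].lower().rstrip(".")


def registered_domain(name: str) -> str:
    """Reduce e.g. www.example.nyc to example.nyc (last two labels)."""
    return ".".join(name.split(".")[-2:])


def split_logs(lines, registered):
    """Return (success, error) lists based on a set of registered names."""
    success, error = [], []
    for line in lines:
        domain = registered_domain(queried_name(line))
        (success if domain in registered else error).append(line)
    return success, error


def top_missing(error_lines, n=10):
    """Most frequently requested names that do not exist."""
    counts = Counter(registered_domain(queried_name(l)) for l in error_lines)
    return counts.most_common(n)


sample = [  # invented entries in the sample format
    "21-Jun-2010 09:00:00.000 XX+/192.0.2.1/mayor.nyc/A/IN",
    "21-Jun-2010 09:00:01.000 XX+/192.0.2.2/holeintree.nyc/A/IN",
    "21-Jun-2010 09:00:02.000 XX+/192.0.2.3/www.holeintree.nyc/A/IN",
    "21-Jun-2010 09:00:03.000 XX+/192.0.2.4/spottedbeetles.nyc/A/IN",
]
success_log, error_log = split_logs(sample, registered={"mayor.nyc"})
print(top_missing(error_log))
```

Because of the caching noted in the conversations above, the counts undercount real demand; the list of names that appear is the more reliable signal.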
The following comments on the benefits of thoughtfully developing the Error Log, a small portion of the broader Data Query Log, were delivered by Connecting.nyc Inc. to the New York City Council's Technology Committee on June 21, 2010. They describe how the Asian Longhorned Beetle infestation might have been avoided had the Error Log been available in 1985.
New York City Council's Technology Committee
Open Data and the DNS Query Log
by Thomas Lowenhaupt, Director, Connecting.nyc Inc.
June 21, 2010
Good morning. I’m Tom Lowenhaupt, founding director of Connecting.nyc Inc., a New York State not-for-profit advocating for the development of the .nyc TLD as a public interest resource. My presentation is on the DNS Query Log – a soon-to-arrive database.
Within the next few years the Internet is going to change in a fundamental way - it is going to become more intuitive.
This will happen as ICANN, the entity that issues new Top Level Domains such as .com, .org, and .gov, finalizes its application process. There will initially be hundreds and then thousands of new Top Level Domains (or TLDs for short), with names such as .bank, .sport, and .news.
So the future holds Chase and Citibank moving from Chase.com and citibank.com to Chase.bank and citi.bank. ESPN will move to ESPN.sports and the Wall Street Journal will find advantage in moving to WSJ.news.
With this transition people will come to see the Internet as far more intuitive than today and will begin entering their domain name requests directly. So for example, if you’re looking for a bank you are likely to enter index.bank or directory.bank. Or if you’re looking for news sources you might try categories.news. And information about baseball might be best found from baseball.sports. It’s going to be a different Internet, one where our dependence on search engines will be diminished.
In addition to the aforementioned .sport, .news, and .bank, there will be city TLDs such as .paris, .berlin, .tokyo and my favorite, .nyc.
Getting to today’s topic.
Imagine the .nyc Top Level Domain name is fully functional in 5 years. And people have come to recognize the benefit of directly entering domain names rather than always relying on Google. So people learn that it’s faster and more direct to enter mayor.nyc, citycouncil.nyc, firedepartment.nyc, and police.nyc.
The operator of the .nyc TLD will connect each of these queries to the appropriate website and create an entry in a Query Log. This Query Log will contain valuable information from a marketing, governance, and civic life perspective.
Let me give an example.
Imagine in 1985 we had the intuitive Internet as I’ve described it – baseball.sports, police.nyc... And imagine the residents of Greenpoint, Brooklyn started entering intuitive inquiries into their search boxes such as:
What happens to these queries? If they are for an existing website, people will be directly connected to the site. (And I’ll skip for now the privacy issues associated with that database of successful connections.)
But imagine it’s a time like 1985 when the Asian Longhorned Beetle had just arrived on our shores. And residents of Greenpoint are entering intuitive inquiries seeking information about strange developments downing their trees. And let’s assume that none of these intuitive inquiries had existing websites. What happens to erroneous queries such as Holeintree.nyc, Spottedbeetles.nyc, and Dyingtreesingreenpoint.nyc?
We advocate that this information go to an Error Query Log Database, and be made available to all for inspection. So some clever researcher can begin exploring these entries and create a proper response.
In 1985 that would have been to inform the Parks Department that there are a number of odd things going on with the trees in Greenpoint. And an inspector could have been dispatched to investigate. In reality it took 10 years before that happened and we now face the prospect of 1,200,000,000 trees being lost in America to the Asian Longhorned Beetle.
So what will the Error Query Log show in the future?
I’ve no crystal ball, but it could be the central location for sensing change in our city, in a twitteresque database controlled by the city. This database should be made available to researchers and programmers on a minute-by-minute or, minimally, hourly basis.
Public access to this sensitive database should be prescribed in your legislation.
Thank you for your attention to this matter.
There are multiple benefits that might arise from the thoughtful development of these data logs; however, privacy and security concerns require early consideration in establishing data retention and access policies (hence our Scylla and Charybdis graphic).
A 2013 presentation by MIT's Kevin Slavin provides a good example of how we might benefit from accessing raw data to create a better city.
Because of the sensitive nature of the DNS Data Query Log, we conclude with the following security and privacy links to guide those considering retention and access policies.
- Domain Name Server (DNS) security
- Type 911.gov - Two scientists think that social networks can improve disaster relief
- ICANN Report on February 2007 DNS Attack
- DNSSEC for TLDs Session from Lisboa ICANN meeting - March 2007
- Platform for Internet Content Selection
- The Global Environment for Network Innovations (GENI) by the National Science Foundation
- The .hk Phishing Experience (Hong Kong)
- Internet Privacy - from Wikipedia
- Electronic Frontier Foundation - privacy page
- Computer Professionals for Social Responsibility
- Electronic Privacy Information Center
- Madrid Privacy Declaration
- Wendy Seltzer.privacy
- Carnegie Mellon Data Privacy Lab
- 1973 HEW Report on Privacy - A prescient report on privacy and Automated Data Systems
- TRUSTe Web Privacy Seal
- The "I've Got Nothing To Hide" Fallacy - And Other Misunderstandings of Privacy, by Daniel J. Solove