DATAMINING THE TERROR WEB
INTELLIGENT DEFENSE:
The Internet and the WWW are critical to terrorist/global guerrilla communications. On the 8th of August, I will talk with Dr. Mark Last of Ben-Gurion University of the Negev in Israel. He and his team are working on a project to apply advanced data mining techniques to unearth terrorist activity on the Internet.
Dr. Last co-authored a paper, "Using Data Mining Techniques for Detecting Terror-Related Activities on the Web" (PDF), that provides detail on the approach. The proposed system works in conjunction with an ISP (with or without its permission) to identify potential terrorist activity. If applied to ISPs (and potentially Internet cafes) across the Middle East, it might provide a means of detecting patterns of behavior that could be used to prevent attacks.
Here is Dr. Last's biography and a list of papers he has published.
Here's the audio of the conversation. Please check the Intelligent Defense site for updates on the program.
Interesting, but I have serious doubts that this approach can actually work in practice.
Who exactly decides what constitutes a "terrorist website" and what is free speech?
It is difficult enough tuning an Intrusion Detection System or an Intrusion Prevention System to understand what is "normal" behavior within a small network of real users connected to the internet.
Trying to apply similar techniques at the scale of a whole ISP, of multiple ISPs, or of multiple countries seems to be a hugely expensive and intrusive endeavour. Not even the communist Chinese or the Saudi Arabian national level content filters seem to be able to achieve this in their attempts to censor various well known (fixed) political and news websites.
It is the sheer amount of internet traffic that foils attempts to classify or analyse "suspicious" behavior by a tiny minority, who are highly motivated to obscure their activities.
Even conducting "traffic analysis" of who accesses suspect websites, and when, is fraught with issues of scale and cost.
The UK Government reluctantly modified its original "voluntary" Data Retention scheme, under which it had planned to force ISPs to retain, amongst many other categories of data, web server access logs for 12 months beyond the time for which they needed them for normal business purposes.
After much discussion, this figure was revised to something more reasonable technically and financially: 4 days.
Neither the UK Government nor the UK ISP industry can be accused of being "soft" on terrorist or child porn websites or the people who try to access them, but the task is just too difficult and expensive.
Classifying internet users simply into "terrorists" or "non-terrorists" is wrong, especially if the intention is to build an automatic system which has potentially dire consequences for the False Positives.
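To make the False Positive concern concrete, here is a rough base-rate calculation. Every figure in it (the user population, the number of genuine suspects, the detection rate and the false alarm rate) is a made-up assumption for illustration, not a number from Dr. Last's paper, and `flagged_breakdown` is a hypothetical helper name.

```python
def flagged_breakdown(population, true_suspects, detection_rate, false_alarm_rate):
    """Return (true_positives, false_positives, precision) for a yes/no classifier.

    detection_rate:   fraction of genuine suspects that get flagged
    false_alarm_rate: fraction of innocent users that get flagged anyway
    """
    innocent = population - true_suspects
    true_positives = true_suspects * detection_rate
    false_positives = innocent * false_alarm_rate
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision


# Hypothetical figures: 1,000,000 ISP users, 10 genuine suspects,
# a 99% detection rate and a 1% false alarm rate.
tp, fp, precision = flagged_breakdown(1_000_000, 10, 0.99, 0.01)

print(f"correctly flagged users: {tp:.1f}")
print(f"wrongly flagged users:   {fp:.1f}")
print(f"share of flagged users who are genuine suspects: {precision:.4%}")
```

Under these assumed numbers, roughly ten thousand innocent users are flagged for about ten genuine suspects, so only around one flagged user in a thousand is a real hit. Different assumptions shift the ratio, but a very rare target class always pushes it in this direction.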
Content filtering "nanny ware" is easily fooled, and the same techniques would apply to this datamining approach, without even the difficulty of having a "nanny ware" snooping filter on the client PC being used to access the suspect websites.
What if the terrorists or their supporters use Secure Sockets Layer (SSL) or Transport Layer Security (TLS), which encrypts the clickstream end to end (necessary for credit card e-commerce etc.) and is built into almost every web browser?
Even simple password or cookie registration based access to private areas or discussion forums etc. will prevent accurate cluster analysis by preventing the suspect material from being analysed in the first place.
What if instead of static web pages, the content is dynamically generated by a back end database?
Surely the "Access Vector" hash function of the content of the web pages is trivial to fool, using existing readily available programs which insert random dictionary text into HTML tags or into hidden non-displaying text (e.g. black text on a black background), in order to fool Bayesian email spam filters?
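A minimal sketch of the padding trick follows. It assumes the detector reduces each page to a term-frequency vector and scores it against a stored profile by cosine similarity; that is an assumption made for illustration, not a description of the system in the paper. The words, the `profile` and the `cosine` helper are all hypothetical placeholders.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical profile of "suspect" content (placeholder words only).
profile = Counter("alpha bravo charlie alpha delta bravo alpha".split())

# A page whose visible text closely matches the profile.
page_words = "alpha bravo charlie alpha delta".split()

# The same page padded with unrelated filler, e.g. hidden dictionary text.
filler = [f"filler{i}" for i in range(50)]
padded_words = page_words + filler * 4

plain_score = cosine(Counter(page_words), profile)
padded_score = cosine(Counter(padded_words), profile)

print(f"similarity without padding: {plain_score:.3f}")
print(f"similarity with padding:    {padded_score:.3f}")
```

The filler terms never appear in the profile, so they add nothing to the dot product while inflating the page vector's length, and the score drops. A real system could strip hidden text before scoring, which is the same arms race spam filters already face.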
What if the ISP that is being used to access the alleged "terrorist web sites" uses proxy servers?
What if the people accessing these dubious websites use the well established techniques of chains of proxy servers, or of a list of open proxy servers which changes every few seconds or with every web page access, as used by spammers etc.?
Posted by: Watching Them, Watching Us | Monday, 02 August 2004 at 04:37 PM
Great questions WTWU. Maybe some could be asked during the interview?
Data mining is great for marketing where false positives are a mere nuisance, but here...
Posted by: Valdis K | Monday, 02 August 2004 at 09:27 PM
"What if instead of static web pages, the content is dynamically generated by a back end database?"
I think a few of the AQ sites actually are database driven. And yes, ME users are likely to be familiar with using proxy servers, as they are accustomed to getting around filters for other reasons.
Posted by: praktike | Tuesday, 03 August 2004 at 12:52 PM