Robots.txt
Our initial analysis of robots.txt files from a list of 94,593 hostnames gave some interesting results, although we need a lot more work to build the anti-search engine (one which only lists things denied by robots.txt files). The first crawl started generating errors after connecting to 812 hosts though, which gives us a limited amount of data to analyse (and means we need to rethink the rough and ready scripts).
Of the 812 hosts, we have 732 robots.txt files with a total of 8935 disallow entries. After removing duplicate entries (where a file lists the same path twice) there are 7517 unique disallowed URLs to consider. Let’s have a search for interesting entries.
There are:
- 29 entries including ‘private’, 8 of which are for FrontPage _private
- 8 entries including ‘secret’
- 27 entries for ‘password’, although most seem innocuous
- 148 entries for ‘admin’, although we assume most will be passworded
- 12 entries for ‘secure’
- 11 entries for ‘backup’
- 69 entries for ‘mail’, including webmail, spam bait, mailing list signups and mail logs
- 493 entries for ‘log’, although many of these are blogs and not interesting logs or login pages
- 6 entries for ‘phpmyadmin’
- 30 entries for ‘stats’, not all of which are webstats (‘stat’ gives 157, but again many are false positives)
- 62 have comments as to why they are disallowed
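For anyone wanting to repeat the counting, here is a minimal sketch of the idea in Python. It is not the rough and ready script used for the crawl; the function names and sample hosts are made up for illustration, and it assumes the robots.txt bodies have already been fetched. It also shows why ‘log’ produces so many false positives: a plain substring match counts /blog/ as well as /logs/.

```python
from collections import Counter

def disallowed_paths(robots_txt):
    """Return the unique Disallow paths found in one robots.txt body."""
    paths = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value:                             # an empty Disallow blocks nothing
                paths.add(value)
    return paths

def keyword_counts(entries, keywords):
    """Count (host, path) entries whose path contains each keyword."""
    counts = Counter()
    for _host, path in entries:
        lowered = path.lower()
        for keyword in keywords:
            if keyword in lowered:                # substring match, so 'log' hits 'blog'
                counts[keyword] += 1
    return counts

# Hypothetical sample data standing in for fetched robots.txt files.
SAMPLE = {
    "a.example": "User-agent: *\nDisallow: /admin/\nDisallow: /admin/\n"
                 "Disallow: /blog/private/ # drafts\n",
    "b.example": "User-agent: *\nDisallow: /logs/\nDisallow:\n",
}

entries = {(host, path)
           for host, body in SAMPLE.items()
           for path in disallowed_paths(body)}
counts = keyword_counts(entries, ["admin", "private", "log"])
print(len(entries), dict(counts))
```

Duplicates are removed per file by keeping a set of (host, path) pairs, which matches the way the unique total above is described.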
Don't hack the hackers
Sometimes you need to be extremely careful what you are hacking into. Take the following hypothetical situation:
Walking through town, I was keeping one eye on my frequency counter as I quite often do. Passing through the main shopping street I spotted a transmission in the 1.5 GHz band, although it’s hard to be exact with my cheap frequency counter. Interested, I wandered around trying to find the source and discovered it was strongest on one side of the street between a phone shop and a bank. So I stood around for a while, waiting to see if I could eke out the source or see any likely candidate.
It’s then that I spotted the bank’s cash machine, and as I wandered vaguely towards it the signal got stronger. Cash machines aren’t supposed to transmit anything, and it took me longer than it should have to realise that somebody had attached a card reader and wireless camera in order to steal people’s credit card numbers.
I then smoothly put my counter away and started walking, just before two mounted policemen turned round the corner and started to approach the bank…
01:57 PM | 0 Comments
IBQ 2006 Frequency Finder
Around a decade ago I had an Optoelectronics Scout for finding frequencies, and when new it worked well. It saved frequencies into its internal memory, which could be uploaded to a scanner or PC, and it was sensitive, accurate and speedy. After a while its sensitivity became worse, to the extent that it failed to display anything, and so I gave up on it.
So with my old frequency finder in mind, I’m so far less than excited about the quality of my new IBQ 2006 frequency counter. Admittedly I’ve mostly used it on data rather than voice so far, so this isn’t really a fair comparison, but after a decade you’d expect a cheap Hong Kong knock-off to be better quality than those early Scouts.
Its aerial is a poor quality SMA type, its sensitivity isn’t great, and its accuracy is rather poor (especially on microwave data bands). It does however work, and it’s found me a handful of transmitters I’d previously been unaware of: the data stream from a pollution monitoring station, a couple of mobile phone masts I hadn’t noticed, CSR. It’s also smaller than a Scout, and doesn’t get commented on when I’m walking around with it.
A word on model numbers: the manuals online suggest a different series of model numbers to the manual that got sent with it. From what I can tell, only the ST series differs in functionality (greater coverage), and the others largely differ only in what type of batteries they take (AAA or custom).
So in summary, yeah it’s not the greatest frequency finder (care to give me £1000 so I can test the best?) but for the price and size I’m perfectly happy.
09:29 AM | 0 Comments
Gobuntu
The geeks have been writing about Gobuntu recently, an Ubuntu variation which only installs free software. On the face of it, that’s an awesome prospect – an Ubuntu which by default has nothing installed which is nonfree. However, it’s far from perfect.
The installer is text-only, which is enough to scare new users away. Forget the fact that the XP installer is text-only to begin with and is thus just as scary, or the fact that a text-based installer will run much faster; GUI installers keep people happy.
Once Gobuntu is installed, there is nothing to stop proprietary software from being installed. The sources.list which Ubuntu computers use to fetch their software lists includes multiverse and universe, and there’s nothing to alert you before installing any of this non-free software. Something based upon apt-listbugs would be a wonderful addition.
Gobuntu-desktop is simply an Ubuntu package, and so doesn’t have the free software that we need to replace the commercial software included within Ubuntu. We really need to see IceCat/IceWeasel 3.0 and Icedove appear in the Ubuntu repositories, and it really shouldn’t be hard to pull in the Debian version of IceWeasel or the GNU version of IceCat. Until Ubuntu can solve this, I’m having to run Claws Mail and an IceCat installed from a .deb file downloaded from GNU.
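As a rough illustration of what a check could look at, the sketch below lists the components enabled in a classic one-line-per-entry sources.list. The function name and sample text are hypothetical. The set of components to flag is an assumption based on Ubuntu’s usual layout, where multiverse and restricted hold non-free software and universe is community-maintained free software; adjust it to wherever you draw the line.

```python
# Assumption: which Ubuntu components to treat as carrying non-free software.
FLAGGED_COMPONENTS = {"multiverse", "restricted"}

def enabled_components(sources_list_text):
    """Collect the components named on deb / deb-src lines."""
    components = set()
    for line in sources_list_text.splitlines():
        line = line.split("#", 1)[0].strip()      # ignore comments
        fields = line.split()
        if not fields or fields[0] not in ("deb", "deb-src"):
            continue
        rest = fields[1:]
        if rest and rest[0].startswith("["):      # optional [ options ] block
            while rest and not rest[0].endswith("]"):
                rest = rest[1:]
            rest = rest[1:]
        components.update(rest[2:])               # skip the URI and the suite
    return components

# Hypothetical sources.list contents for illustration only.
SAMPLE_SOURCES = """\
# main archive
deb http://archive.example/ubuntu hardy main universe
deb [arch=amd64] http://archive.example/ubuntu hardy multiverse
# deb http://archive.example/ubuntu hardy restricted
"""

found = enabled_components(SAMPLE_SOURCES)
print(sorted(found & FLAGGED_COMPONENTS))
```

On a real machine the text would come from /etc/apt/sources.list and any files under /etc/apt/sources.list.d.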
So in reality, it makes sense for most users to stick to their favourite Ubuntu variation and then use the vrms package to search for non-free software installed on their machines. Whilst vrms fails to find some non-free packages, and knows nothing about software not installed through package management, it’s stable and easy to use.
kaerast@bennet:~$ vrms
No non-free packages installed on bennet! rms would be proud.
kaerast@bennet:~$
There’s also a wiki page for tracking non-free software in Ubuntu’s main repository and thus not detected as non-free by vrms.
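For the curious, the same sort of check can be approximated by looking at the Section field that dpkg records for each installed package. This is only a sketch of the general idea, not a description of how vrms itself is implemented. The prefix list and function names are assumptions, and it shares the limits mentioned above: it knows nothing about software installed outside package management.

```python
import subprocess

# Assumption: archive areas that mark a package as not fully free.
FLAGGED_PREFIXES = ("non-free/", "contrib/", "multiverse/", "restricted/")

def flagged_packages(dpkg_output):
    """Return names of packages whose Section sits in a flagged archive area."""
    flagged = []
    for line in dpkg_output.splitlines():
        name, sep, section = line.partition("\t")
        if sep and section.strip().startswith(FLAGGED_PREFIXES):
            flagged.append(name)
    return sorted(flagged)

def installed_sections():
    """Ask dpkg for 'package<TAB>section' lines (Debian/Ubuntu only)."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f", "${Package}\t${Section}\n"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Hypothetical dpkg-query output so the sketch runs anywhere.
SAMPLE_OUTPUT = "bash\tshells\nexample-codec\tmultiverse/sound\nexample-driver\trestricted/misc\n"
print(flagged_packages(SAMPLE_OUTPUT))
```

On a Debian or Ubuntu box, `flagged_packages(installed_sections())` would run it against the real package database.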
Of course it’s also up to the individual user where to draw the line between free and non-free, and by how much it’s ok to sell out when something free can’t do the job (I’m very much missing the ability to play Flash videos at the moment). Are patents with source code more or less evil than commercial binary-only software? Do you take issue with the non-commercial or share-alike aspects of Creative Commons? Are privacy and security your main concern, or is freedom to tinker? These questions perhaps belong in a separate post, but are important for anybody considering how free they want their computer to be and are something we’ll all disagree on. And if we all disagree on how free a computer system needs to be, any attempt at a free Linux distro is going to be a kludge of mismatched ideals.
11:46 AM | 0 Comments
Browser Address Bar
How do you copy and paste some text? Edit->Copy? Right-click? Ctrl-C? Just select it? In Linux, all of these are valid choices and how you copy and paste text may say something about how you understand software interfaces. Are the people who haven’t learnt quicker methods than Edit->Copy the ones who are typing URLs into search bars?
Last week, data was released about search market share for dating websites in June. The results are unsurprising: more than 10% of the search terms were for URLs, and almost all of the queries were something that would have worked as an address with .com added. So what’s going on? Are people not smart enough to remember domain names, or not interested in doing so? Do people still not understand what a URL is? Do people get confused between the URL bar and the search bar?
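If you have a list of search terms of your own, a crude way to estimate the share that are really URLs is a pattern match like the one below. The regular expression and the sample queries are my own illustration, not the method behind the released data, and a real analysis would want a proper list of top-level domains.

```python
import re

# Rough pattern: optional scheme and www., hostname labels, a dotted TLD, optional path.
_URL_LIKE = re.compile(
    r"^(https?://)?(www\.)?[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}(/\S*)?$"
)

def looks_like_url(query):
    """True if the whole query reads like a hostname or URL."""
    return bool(_URL_LIKE.match(query.strip().lower()))

def url_share(queries):
    """Fraction of queries that look like URLs (0.0 for an empty list)."""
    if not queries:
        return 0.0
    return sum(looks_like_url(q) for q in queries) / len(queries)

# Hypothetical search terms for illustration only.
SAMPLE_QUERIES = [
    "example.com",
    "online dating",
    "www.example.org",
    "dating tips",
    "examplesite",
]

share = url_share(SAMPLE_QUERIES)
print(f"{share:.0%} of queries look like URLs")
```

Note that a bare brand name such as "examplesite" is not counted here, even though adding .com to it might well have worked.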
In Japan, whether it’s because of this or because of limited availability of domain names, offline advertising for websites has moved away from mentioning domain names and has started telling people to search Google for the brand name. I’m undecided as to whether that is unbelievably smart or unbelievably dumb; when the company’s website remains at number one search position then it works, but as soon as somebody else manages to get the number one spot it’s an epic fail.
Search Google for the domain name of a company and something interesting happens in the results. If the company is big, then they will likely have bought top spot to ensure that people reach their site. If they’re smaller or are actively fighting brand name infringement then paid results will be more limited. The sort of people who will enter a URL into a Google search instead of the address bar are also more likely to click on the paid search positions, and so there’s clearly a large market built out of stupidity.
That’s not to say all search queries for domain names are made out of stupidity though. Sometimes it’s interesting to see which competitors have paid listings, sometimes it’s interesting to see what appears on a website that you aren’t finding (by searching site:example.com), and sometimes you want to search your own domain to see what your stupid users are seeing.
If we taught people to use the internet correctly, to remember URLs and enter them in the correct box, to know the difference between paid and organic search results, to tune out (or block) the advertising, to boycott ISPs who add advertising to users’ browsing, what difference would it make to the economy of the internet? Do we rely on these stupid users to pay our way, or would the internet become so much nicer if the advertising, spam and scams that were targeting them disappeared completely?
12:51 PM | 0 Comments