TechCentral

    New tools shine a light into the dark Web

    By The Conversation, 9 January 2017

    In today’s data-rich world, companies, governments and individuals want to analyse anything and everything they can get their hands on — and the Web has loads of information. At present, the most easily indexed material from the Web is text.

    But an estimated 89% to 96% of the content on the Internet is actually something else: images, video and audio, in all thousands of different kinds of nontextual data types.

    Further, the vast majority of online content isn’t available in a form that’s easily indexed by electronic archiving systems like Google’s. Rather, it requires a user to log in, or it is provided dynamically by a program running when a user visits the page. If we’re going to catalogue online human knowledge, we need to be sure we can get to and recognise all of it, and that we can do so automatically.

    How can we teach computers to recognise, index and search all the different types of material that are available online? Thanks to US federal efforts in the global fight against human trafficking and weapons dealing, my research forms the basis for a new tool that can help with this effort.

    The “deep Web” and the “dark Web” are often discussed in the context of scary news or films like Deep Web, in which young and intelligent criminals get away with illicit activities such as drug dealing and human trafficking — or even worse. But what do these terms mean?

    The “deep Web” has existed ever since businesses and organisations, including universities, put large databases online in ways people could not directly view. Rather than allowing anyone to get students’ phone numbers and e-mail addresses, for example, many universities require people to log in as members of the campus community before searching online directories for contact information. Online services such as Dropbox and Gmail are publicly accessible and part of the Web — but indexing a user’s files and e-mails on these sites does require an individual login, which our project does not get involved with.

    The “surface Web” is the online world we can see: shopping sites, businesses’ information pages, news organisations and so on. The “deep Web” is closely related but less visible, both to human users and, in some ways more importantly, to search engines exploring the Web to catalogue it. I tend to describe the “deep Web” as those parts of the public internet that:

    1. Require a user to first fill out a login form;
    2. Involve dynamic content like Ajax or JavaScript; or
    3. Present images, video and other information in ways that aren’t typically indexed properly by search services.

    What’s dark?

    The “dark Web,” by contrast, is made up of pages — some of which may also have “deep Web” elements — that are hosted by Web servers using the anonymous Web protocol called Tor. Originally developed by US defence department researchers to secure sensitive information, Tor was released into the public domain in 2004.

    Like many secure systems, such as the WhatsApp messaging app, Tor was originally intended for good, but it has also been used by criminals hiding behind the system’s anonymity. Some people run Tor sites handling illicit activity, such as drug trafficking, weapons and human trafficking and even murder for hire.

    The US government has been interested in trying to find ways to use modern computer science to combat these criminal activities. In 2014, the Defence Advanced Research Projects Agency (more commonly known as Darpa), a part of the defence department, launched a programme called Memex to fight human trafficking with these tools.

    Specifically, Memex wanted to create a search index that would help law enforcement identify human trafficking operations online – in particular by mining the deep and dark Web. One of the key systems used by the project’s teams of scholars, government workers and industry experts was one I helped develop, called Apache Tika.

    The ‘digital Babel fish’

    Tika is often referred to as the “digital Babel fish”, a play on a creature called the “Babel fish” in the Hitchhiker’s Guide to the Galaxy book series. Once inserted into a person’s ear, the Babel fish allowed them to understand any language spoken. Tika lets users understand any file and the information contained within it.

    When Tika examines a file, it automatically identifies what kind of file it is, such as a photo, video or audio. It does this with a curated taxonomy of information about files: their name, their extension, a sort of “digital fingerprint”. When it encounters a file whose name ends in .mp4, for example, Tika assumes it’s a video file stored in the Mpeg-4 format. By directly analysing the data in the file, Tika can confirm or refute that assumption, because most video, audio, image and other file formats begin with specific codes saying what format their data is stored in.
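
    The two-step check described above (guess from the file name, then confirm against the leading bytes) can be sketched in a few lines of Python. This is a minimal illustration, not Tika’s actual code: the signature table, the extension table and the function names are all assumptions made for the example, and a real taxonomy covers far more formats.

```python
from pathlib import PurePath

# Hypothetical, heavily abbreviated signature table: (offset, magic bytes, type).
MAGIC_SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (0, b"%PDF-", "application/pdf"),
    # MP4-family files carry "ftyp" at byte offset 4. Other formats built on
    # the same container do too, so "video/mp4" is a simplification here.
    (4, b"ftyp", "video/mp4"),
]

# Hypothetical extension hints, used only as a first guess.
EXTENSION_HINTS = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".pdf": "application/pdf",
    ".mp4": "video/mp4",
}


def guess_from_name(filename):
    """First guess: look only at the file extension."""
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_HINTS.get(suffix)


def sniff_from_bytes(data):
    """Second opinion: compare the leading bytes with known signatures."""
    for offset, magic, media_type in MAGIC_SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return media_type
    return None


def detect(filename, data):
    """Return (media_type, name_hint_confirmed)."""
    hinted = guess_from_name(filename)
    sniffed = sniff_from_bytes(data)
    if sniffed is None:
        # No known signature: fall back to the unconfirmed name-based guess.
        return hinted, False
    return sniffed, sniffed == hinted
```

    In this sketch a file named clip.mp4 whose bytes start with a PDF signature is reported as a PDF, with the second value set to False to flag that the name and the content disagree.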

    Once a file’s type is identified, Tika uses specific tools to extract its content, such as Apache PDFBox for PDF files, or Tesseract for capturing text from images. In addition to content, other forensic information or “metadata” is captured, including the file’s creation date, who edited it last, and what language the file is authored in.
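
    The idea of handing each file type to a specialised extractor, and getting back both content and metadata, can be illustrated with a small registry. This is a sketch under stated assumptions rather than Tika’s design: the registry, the decorator and the two toy parsers below are hypothetical, and they use only Python’s standard library instead of tools like PDFBox or Tesseract.

```python
from html.parser import HTMLParser

# Hypothetical registry: media type -> extraction function.
PARSERS = {}


def parser_for(media_type):
    """Decorator that registers a function as the parser for one media type."""
    def register(func):
        PARSERS[media_type] = func
        return func
    return register


@parser_for("text/plain")
def parse_plain(data):
    text = data.decode("utf-8", errors="replace")
    return {"content": text.strip(), "metadata": {"characters": len(text)}}


class _TextAndTitle(HTMLParser):
    """Collects visible text and, separately, the <title> as metadata."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


@parser_for("text/html")
def parse_html(data):
    collector = _TextAndTitle()
    collector.feed(data.decode("utf-8", errors="replace"))
    collector.close()
    return {
        "content": " ".join(collector.chunks),
        "metadata": {"title": collector.title.strip()},
    }


def extract(media_type, data):
    """Dispatch to the parser registered for this media type."""
    parser = PARSERS.get(media_type)
    if parser is None:
        raise ValueError(f"no parser registered for {media_type}")
    return parser(data)
```

    Adding support for a new format in this sketch means registering one more function. The dispatch step itself does not change.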

    From there, Tika uses advanced techniques like Named Entity Recognition (NER) to further analyse the text. NER identifies proper nouns and sentence structure, and then fits this information to databases of people, places and things, identifying not just whom the text is talking about, but where, and why they are doing it. This technique helped Tika to automatically identify offshore shell corporations (the things); where they were located; and who (people) was storing their money in them as part of the Panama Papers scandal that exposed financial corruption among global political, societal and technical leaders.
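
    A very rough feel for the “fit proper nouns to a database” step can be had from a dictionary lookup. The sketch below is a toy stand-in and not the NER technique Tika uses: it ignores sentence structure entirely, the names in the lookup table are invented for the example, and the pattern only catches simple runs of capitalised words.

```python
import re

# Hypothetical lookup table of known names (all invented for this example).
GAZETTEER = {
    "Acme Holdings": "ORGANISATION",
    "Panama": "PLACE",
    "Jane Doe": "PERSON",
}

# A run of capitalised words, e.g. "Jane Doe" or "Acme Holdings".
CANDIDATE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def tag_entities(text):
    """Return (phrase, label) pairs for candidate phrases found in the table."""
    entities = []
    for match in CANDIDATE.finditer(text):
        phrase = match.group()
        label = GAZETTEER.get(phrase)
        if label is not None:
            entities.append((phrase, label))
    return entities


sample = "Records show Jane Doe moved funds to Acme Holdings in Panama."
print(tag_entities(sample))
```

    A known weakness of this shortcut is that a capitalised word directly before a name, such as “In Panama” at the start of a sentence, is swallowed into one candidate phrase and missed. Statistical NER models exist largely to avoid that kind of brittleness.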

    Improvements to Tika during the Memex project made it even better at handling multimedia and other content found on the deep and dark Web. Now Tika can process and identify images with common human trafficking themes. For example, it can automatically process and analyse text in images, such as a victim alias or an indication about how to contact them, and certain types of image properties, such as camera lighting. In some images and videos, Tika can identify the people, places and things that appear.

    Additional software can help Tika find automatic weapons and identify a weapon’s serial number. That can help to track down whether it is stolen or not.

    Employing Tika to monitor the deep and dark Web continuously could help identify human- and weapons-trafficking situations shortly after the photos are posted online. That could stop a crime from occurring and save lives.

    Memex is not yet powerful enough to handle all of the content that’s out there, nor to comprehensively assist law enforcement, contribute to humanitarian efforts to stop human trafficking or even interact with commercial search engines.

    It will take more work, but we’re making it easier to achieve those goals. Tika and related software packages are part of an open-source software library available on Darpa’s Open Catalogue to anyone (in law enforcement, the intelligence community or the public at large) who wants to shine a light into the deep and the dark.

    • Christian Mattmann is director, Information Retrieval and Data Science Group, and adjunct associate professor, USC, and principal data scientist
    • This article was originally published on The Conversation
    © 2009 - 2026 NewsCentral Media