GSoC '21 Report


Submitted report is here

Project: OWASP Maryam

Proposal: Dark Web Exploration (for Cyber Threat Analysis) And Expansion of Data Sources


Milestones Achieved

  • Designed and implemented a text document clustering module using TF-IDF and K-means, with FP-growth used to assign titles to clusters. (Commit) (Personal repo link)

    • Generate text data with:

      ./maryam.py -e google -q 'Marvel' -l 10 --api --format > test.json

    • Pass it to the cluster module with:

      ./maryam.py -e cluster --json test.json

      Screenshot 2021-08-17 at 12.49.58 PM

      Screenshot 2021-08-17 at 12.50.48 PM

      Screenshot 2021-08-17 at 12.51.08 PM

      Screenshot 2021-08-17 at 12.51.23 PM

      Screenshot 2021-08-17 at 12.51.43 PM

  • Designed and implemented a smart dark web crawler module. A custom TF-IDF text retriever class uses cosine similarity to rank the best pages to crawl in each snowball sampling iteration. (Commit) (Note: explicit results are not shown unless searched for.)

    In progress:

    Screenshot 2021-08-17 at 12.55.42 PM

    Results after reaching target depth:

    Screenshot 2021-08-17 at 12.58.04 PM

  • Implemented search modules over diverse sources:

    • Phone Number Search using NumVerify (PR).

      Screenshot 2021-08-17 at 12.16.23 PM

    • Dictionary module using Google Dictionary (PR).

      Screenshot 2021-08-17 at 12.14.48 PM

    • SanctionSearch (PR)

      Screenshot 2021-08-17 at 12.16.02 PM

    • Gigablast (PR)

      Screenshot 2021-08-17 at 12.17.08 PM

    • Reddit Search (without official API or scraping) (Commit).

      Screenshot 2021-08-17 at 12.18.26 PM

    • Twitter Tweet Search (without official API or scraping) with sentiment analysis (Commit).

      Screenshot 2021-08-17 at 12.20.22 PM

      Screenshot 2021-08-17 at 12.20.52 PM

    • ActiveSearchResults (PR)

      Screenshot 2021-08-17 at 12.21.13 PM

    • PirateBay (PR) (later updated to use an undocumented backend API)

      Screenshot 2021-08-17 at 12.21.55 PM

    • Google Scholar (PR)

      Screenshot 2021-08-17 at 12.22.45 PM

    • ArXiv (PR)

      Screenshot 2021-08-17 at 12.23.27 PM

    • PubMed (PR)

      Screenshot 2021-08-17 at 12.25.07 PM

    • Core.ac.uk Search (PR)

      Screenshot 2021-08-17 at 12.26.27 PM

    • Famous Person Search (Commit)

      Screenshot 2021-08-17 at 12.28.36 PM

    • Article Search (Commit)

      Screenshot 2021-08-17 at 12.29.30 PM

  • Implemented standalone utility classes:

    • Web Page Term Frequency Histogram class. (Commit; brought to Maryam from an extension repo that has since been deleted to reduce clutter.)

      Jen

      Image taken from famous person module output for Jennifer Aniston.

    • Safe Search class, which manages captchas and evades engine-specific errors using rotation. (Commit; previously named CaptchaManager.)

  • Discovered that heavy imports such as matplotlib caused startup lag, and implemented an optimization with cleanup that significantly reduced startup time. (Commit 1, Commit 2)

  • Restructured and cleaned up Maryam’s file tree to make it suitable for packaging and distribution. (PR; closed, but later rechecked and committed manually by mentor saeeddhqan.)

  • Packaged and deployed Maryam to PyPI. (link)

  • Fixed a critical bug affecting OS X on Python 3.8 and 3.9. (Issue)

  • Made numerous other bug fixes, all of which appear in the list of my commits.
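
The page-ranking idea behind the dark web crawler (TF-IDF vectors compared by cosine similarity) can be illustrated with a minimal, standard-library-only sketch. This is not the retriever class from the Maryam commit: the function names, the smoothed IDF formula, the choice to include the query in the corpus statistics, and the example URLs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Relative term frequency times a smoothed IDF (an assumed variant).
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(query, pages, top_k=3):
    """Rank pages (dict: url -> text) by similarity to the query.

    Returns up to top_k (url, score) pairs, best first. A crawler could
    expand only these pages in the next sampling iteration.
    """
    urls = list(pages)
    # Simplification: the query is treated as one more document in the corpus.
    vectors = tfidf_vectors([query] + [pages[url] for url in urls])
    query_vec, page_vecs = vectors[0], vectors[1:]
    scored = [(url, cosine(query_vec, vec)) for url, vec in zip(urls, page_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical pages; the URLs are placeholders, not real sites.
    pages = {
        "http://a.example": "leaked credentials database dump for sale",
        "http://b.example": "cooking recipes and gardening tips",
        "http://c.example": "credentials dump",
    }
    for url, score in rank_pages("leaked credentials dump", pages, top_k=2):
        print(f"{score:.3f}  {url}")
```

Pages that share no terms with the query score zero and fall to the bottom of the ranking, so a crawler following this scheme would spend each iteration on the pages most relevant to the search topic.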

To Continue My Work

  • Implement a frontend for the Web API.
  • Add a way to test module utils (at least engines) without module_api or module_run.
  • Iris is key. The ultimate goal of Maryam is to improve Iris until it can intelligently combine the capabilities of all modules and present its output intuitively.
    • This requires us to classify an input query by the module that (we think) can handle it best.
    • Output could be formatted as an accordion of the most suitable module outputs.
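
As a starting point for the query classification step, here is a minimal rule-based sketch. It is not how Iris works today: the routing table, the module names, the patterns, and the `google` fallback are all hypothetical, and a real classifier would likely need to be learned rather than hand-written.

```python
import re

# Hypothetical routing table: (module name, pattern suggesting that module).
# Order expresses priority; earlier entries are listed first in the result.
ROUTES = [
    ("phone_number_search", re.compile(r"^\+?\d[\d\s\-()]{6,}$")),
    ("arxiv", re.compile(r"\b(arxiv|preprint)\b", re.IGNORECASE)),
    ("pubmed", re.compile(r"\b(pubmed|clinical|disease)\b", re.IGNORECASE)),
    ("dictionary", re.compile(r"^(define|meaning of)\b", re.IGNORECASE)),
]

# Assumed fallback: a general-purpose search engine module.
FALLBACK = "google"


def route(query, max_modules=3):
    """Return the modules that look most suitable for the query, best first.

    Each returned module could feed one section of an accordion-style output.
    """
    query = query.strip()
    matches = [name for name, pattern in ROUTES if pattern.search(query)]
    return matches[:max_modules] or [FALLBACK]


if __name__ == "__main__":
    for example in ["+1 415 555 0100", "define ephemeral", "Marvel"]:
        print(example, "->", route(example))
```

Returning a short ordered list rather than a single module keeps the door open for the accordion layout, where each suggested module's output gets its own collapsible section.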