# Episode #276: Tracking cyber intruders with Jupyter and Python

**[pythonbytes.fm/episodes/show/276/tracking-cyber-intruders-with-jupyter-and-python](https://pythonbytes.fm/episodes/show/276/tracking-cyber-intruders-with-jupyter-and-python)**

Published Wed, Mar 23, 2022, recorded Tue, Mar 22, 2022.

**About the show**

[Sponsored by FusionAuth: pythonbytes.fm/fusionauth](http://pythonbytes.fm/fusionauth)

[Special guest: Ian Hellen](https://twitter.com/ianhellen)

**Brian #1:** **[gensim.parsing.preprocessing](https://radimrehurek.com/gensim/parsing/preprocessing.html)**

- Problem I’m working on
  - Turn a blog title into a possible URL
  - Example: “Twisted and Testing Event Driven / Asynchronous Applications Glyph” would become, perhaps, “twisted-testing-event-driven-asynchronous-applications”
- Sub-problem: remove stop words ← this is the hard part
- [I started with an article called Removing Stop Words from Strings in Python](https://stackabuse.com/removing-stop-words-from-strings-in-python/)
  - It covered how to do this with NLTK, Gensim, and spaCy
- I was most successful with `remove_stopwords()` from Gensim

```
from gensim.parsing.preprocessing import remove_stopwords
```

- It’s part of the `gensim.parsing.preprocessing` package
- I wonder what’s all in there? A treasure trove.
- `gensim.parsing.preprocessing.preprocess_string` is one
  - This function applies filters to a string, with the defaults almost being just what I want:
    - `strip_tags()`
    - `strip_punctuation()`
    - `strip_multiple_whitespaces()`
    - `strip_numeric()`
    - `remove_stopwords()`
    - `strip_short()`
    - `stem_text()` ← I think I want everything except this
      - This one turns “Twisted” into “Twist”, not good.
- There are lots of other text processing goodies in there also.
- Oh, yeah, and Gensim is also cool.
  - Topic modeling for training semantic NLP models
- So, I think I found a really big hammer for my little problem.
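
The filter-chain idea described above can be sketched without installing anything. This is a minimal, standard-library-only illustration, not Gensim's implementation: the helper names, the `slugify` function, and the tiny `STOP_WORDS` set are all assumptions made for the example. With Gensim installed, the closer equivalent would be to pass your own `filters` list to `preprocess_string`, leaving out `stem_text`.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# Gensim, NLTK, and spaCy each ship much larger ones.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with"}
)


def strip_punctuation(text):
    """Replace every punctuation character with a space."""
    return re.sub(f"[{re.escape(string.punctuation)}]", " ", text)


def remove_stop_words(text):
    """Drop words found in the (illustrative) stop-word set."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def strip_short(text, min_len=3):
    """Drop words shorter than min_len characters."""
    return " ".join(word for word in text.split() if len(word) >= min_len)


# Filters run in order; there is no stemming step, so "twisted" stays intact.
FILTERS = (str.lower, strip_punctuation, remove_stop_words, strip_short)


def slugify(title, filters=FILTERS):
    """Run the title through each filter, then join the words with hyphens."""
    for apply_filter in filters:
        title = apply_filter(title)
    return "-".join(title.split())


slug = slugify("Twisted and Testing Event Driven / Asynchronous Applications Glyph")
print(slug)
```

Note that this sketch keeps the trailing author name ("glyph") in the slug; dropping it would need an extra, title-specific step.
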
But I’m good with that.

**Michael #2:** **[DevDocs](https://devdocs.io/)**

- via Loic Thomson
- Gather and search a bunch of technology docs together at once
- For example: Python + Flask + JavaScript + Vue + CSS
- Has an offline mode for laptops / tablets
- Installs as a PWA (sadly not on Firefox)

**Ian #3:** **[MSTICPy](https://msticpy.readthedocs.io/)**

- MSTICPy is a toolset for CyberSecurity investigations and hunting in Jupyter notebooks.
- What is CyberSec hunting/investigating? Responding to security alerts and threat intelligence reports, and trawling through security logs from cloud services and hosts to determine whether something is a real threat or not.
- Why Jupyter notebooks?
  - SOC (Security Ops Center) tools can be excellent but all have limitations
  - You can get data from anywhere
  - Use custom analysis and visualizations
  - Control the workflow, and the workflow is repeatable
- Open source package - created originally to support MS Sentinel Notebooks but now supports lots of providers.
- When I started this 3+ years ago I thought a lot of this would already be on PyPI - but no 😞
- MSTICPy has 4 main functional areas:
  - Data querying - import log data (Sentinel, Splunk, MS Defender, others…working on Elasticsearch)
  - Enrichment - is this IP address or domain known to be malicious?
  - Analysis - extract more info from data, identify anomalies (simple example - spike in logon failures)
  - Visualization - more specialized than traditional graphs - timelines, process trees.
- All components use pandas, with Bokeh for visualizations
- Current focus on usability, discovery of functionality and being able to chain
- Always looking for collaborators and contributors - code, docs, queries, critiques
- [https://github.com/microsoft/msticpy](https://github.com/microsoft/msticpy)
- [https://msticpy.readthedocs.io/](https://msticpy.readthedocs.io/)

**Brian #4:** **[The Right Way To Compare Floats in Python](https://davidamos.dev/the-right-way-to-compare-floats-in-python/)**

- David Amos
- Definitely an easier read than the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic
- What many of us remember
  - floating point numbers aren’t exact due to representation limitations and rounding error
  - errors can accumulate
  - comparison is tricky
- Be careful when comparing floating point numbers, even simple comparisons, like:
  - `0.1 + 0.2 == 0.3` evaluates to `False`
  - `0.1 + 0.2 <= 0.3` evaluates to `False`
- David has a short but nice introduction to the problems of representation and rounding.
- Three reasons for rounding
  - more significant digits than floating point allows
  - irrational numbers
  - rational but non-terminating
- So how do you compare:
  - `math.isclose()`
    - be aware of `rel_tol` and `abs_tol` and when to use each.
  - `numpy.allclose()`, returns a boolean comparing two arrays
  - `numpy.isclose()`, returns an array of booleans
  - `pytest.approx()`, used a bit differently
    - `0.1 + 0.2 == pytest.approx(0.3)`
    - Also allows `rel` and `abs` comparisons
- Discussion of `Decimal` and `Fraction` types
  - And the memory and speed hit you take when using them.

**Michael #5:** **[Pypyr](https://pypyr.io/)**

- Task runner for automation pipelines
- For when your shell scripts get out of hand. Less tricky than a makefile.
- Script sequential task workflow steps in YAML
- Conditional execution, loops, error handling & retries
- Have a look at [the getting started.](https://pypyr.io/docs/getting-started/run-your-first-pipeline/)

**Ian #6:** **[Pygments](https://pygments.org/)**

- Python package that’s useful for anyone who wants to display code
- Jupyter notebook Markdown and GitHub Markdown let you display code with syntax highlighting. (Jupyter uses Pygments behind the scenes to do this.)
- There are tools that convert code to image format (PNG, JPG, etc.) but you lose the ability to copy/paste the code
- Pygments can intelligently render syntax-highlighted code to HTML (and other formats)
- Applications:
  - Documentation (used by Sphinx/Read the Docs) - render code to HTML + CSS
  - Displaying code snippets dynamically in readable form
- Lots (maybe 100s) of code lexers - Python (code, traceback), Bash, C, JS, CSS, HTML, also config and data formats like TOML, JSON, XML
- Easy to use - 3 lines of code - example:

```
from IPython.display import display, HTML
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

code = """
def print_hello(who="World"):
    message = f"Hello {who}"
    print(message)
"""

display(HTML(
    highlight(code, PythonLexer(), HtmlFormatter(full=True, nobackground=True))
))
# use HtmlFormatter(style="stata-dark", full=True, nobackground=True)
# for dark themes
```

- Output to HTML, LaTeX, image formats.
- We use it in MSTICPy for displaying scripts used in attacks.

**Extras**

Brian:

- **[smart-open](https://pypi.org/project/smart-open/)**
  - one of the 3 Gensim dependencies
  - It’s for streaming large files, from really anywhere, and looks just like Python’s `open()`.

Michael:

**Joke:** **[What’s your secret?](https://twitter.com/PR0GRAMMERHUM0R/status/1504231058165882888)**