
EPISODE · Jul 27, 2018 · 19 MIN

Spam Filtering with Naive Bayes

from Data Skeptic

Today's spam filters are advanced, data-driven tools. They rely on a variety of techniques to effectively and often seamlessly separate junk email from good email. Whitelists, blacklists, traffic analysis, network analysis, and a variety of other tools are probably employed by most major players in this area. Naturally, content analysis can be an especially powerful tool for detecting spam. Given the binary nature of the problem (spam or not spam), it's clear that this is a great problem to solve with machine learning.

In order to apply machine learning, you first need a labeled training set. Thankfully, many standard corpora of labeled spam data are readily available. Further, if you're working for a company with a spam filtering problem, asking users to self-moderate or flag things as spam can often be an effective way to generate a large number of labels for "free".

With a labeled dataset in hand, a data scientist working on spam filtering must next do feature engineering. This should be done with consideration of the algorithm that will be used. The Naive Bayesian Classifier has been a popular choice for detecting spam because it tends to perform pretty well on high-dimensional data, unlike a lot of other ML algorithms. It is also very efficient to compute, making it possible to train a per-user classifier if one wished to. While we might do some basic NLP tricks, for the most part we can turn each word in a document (or perhaps each bigram or n-gram in a document) into a feature.

The Naive part of the Naive Bayesian Classifier stems from the naive assumption that all features in one's analysis are independent. If $A$ and $B$ are known to be independent, then $Pr(A \cap B) = Pr(A) \cdot Pr(B)$. In other words, you just multiply the probabilities together. Shh, don't tell anyone, but this assumption is actually wrong! Certainly, if a document contains the word algorithm, it's more likely to contain the word probability than some randomly selected document. Thus, $Pr(\text{algorithm} \cap \text{probability}) > Pr(\text{algorithm}) \cdot Pr(\text{probability})$, violating the assumption.

Despite this "flaw", the Naive Bayesian Classifier works remarkably well on many problems. If one employs the common approach of converting a document into bigrams (pairs of words instead of single words), then one can capture a good deal of this correlation indirectly. In the final leg of the discussion, we explore the question of whether or not a Naive Bayesian Classifier would be a good choice for detecting fake news.
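To make the word-as-feature idea and the "just multiply the probabilities" step concrete, here is a minimal sketch of a Naive Bayes spam classifier in plain Python. It is not taken from the episode: the toy corpus, the function names (`tokenize`, `train`, `log_scores`, `classify`), the "ham" label for good email, and the use of add-one (Laplace) smoothing are all assumptions made for illustration. Probabilities are multiplied by summing their logarithms, which avoids numerical underflow on long documents.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration.
TRAIN = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("claim your free prize", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday", "ham"),
    ("project notes and schedule", "ham"),
]


def tokenize(text):
    """Lowercase and split on whitespace; each word becomes one feature."""
    return text.lower().split()


def train(examples):
    """Count documents per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def log_scores(text, model):
    """Return log Pr(label) + sum of log Pr(word | label) for each label."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing (an assumption) so unseen pairs are not zero.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            # Naive independence: multiply probabilities, i.e. add logs.
            score += math.log(p)
        scores[label] = score
    return scores


def classify(text, model):
    """Pick the label with the highest log score."""
    scores = log_scores(text, model)
    return max(scores, key=scores.get)


model = train(TRAIN)
print(classify("win a free prize", model))
print(classify("notes for the monday meeting", model))
```

To experiment with the bigram approach mentioned above, one could swap `tokenize` for a function that returns adjacent word pairs; the rest of the sketch would stay the same.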


