Extracting email addresses from text is a common task in Python, especially when you are cleaning data or building a list of emails from a text document.
In this tutorial, I will show you how to extract emails from text in Python using regular expressions (RegEx).
Contents
- Extracting Emails from Text using RegEx
- Extracting Emails from a Text File
- Extracting Emails from a Large Text File
- Conclusion
Extracting Emails from Text using RegEx
To extract emails from text in Python using RegEx, we can use the re module, which provides support for regular expressions. Here's a simple example:
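    import re

    def extract_emails(text):
        # Pattern for a typical address: local part, @, domain, dot, top-level domain
        email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
        # findall returns every non-overlapping match as a list of strings
        return re.findall(email_regex, text)

    sample_text = "Please contact me at dminhvu.work@gmail.com or wisecode@gmail.com."
    emails = extract_emails(sample_text)
    print(emails)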
This code snippet defines a function extract_emails with:
- Input: a string, text.
- Output: a list of the email addresses extracted from text based on the RegEx pattern \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b.
You can run this by saving the code as main.py and using the python main.py command in the Terminal (or Command Prompt on Windows). The command and its output look like this:

    python main.py
    ['dminhvu.work@gmail.com', 'wisecode@gmail.com']
To understand how this email_regex pattern is constructed, you can learn more here; the page has an explanation section on the right-hand side.
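To make the pieces of the pattern easier to see, here is a minimal sketch that writes the same pattern with re.VERBOSE so each part carries a comment. The names EMAIL_VERBOSE and STRICT_EMAIL are made up for illustration and are not part of the original code. One quirk worth knowing: inside a character class, | is a literal character rather than "or", so [A-Z|a-z] also accepts a pipe in the top-level domain. The stricter variant below is only a suggestion, assuming you want letters only there.

```python
import re

# Same pattern as email_regex, spread out with re.VERBOSE so it can be annotated.
EMAIL_VERBOSE = re.compile(r"""
    \b                   # word boundary before the address
    [A-Za-z0-9._%+-]+    # local part: letters, digits, and . _ % + -
    @                    # the literal at sign
    [A-Za-z0-9.-]+       # domain name: letters, digits, dots, hyphens
    \.                   # the dot before the top-level domain
    [A-Z|a-z]{2,}        # top-level domain: 2+ letters (also allows a literal |)
    \b                   # word boundary after the address
""", re.VERBOSE)

# Hypothetical stricter variant: drop the | so only letters are accepted.
STRICT_EMAIL = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

sample_text = "Please contact me at dminhvu.work@gmail.com or wisecode@gmail.com."
print(EMAIL_VERBOSE.findall(sample_text))

odd_text = "user@example.c|m"
print(EMAIL_VERBOSE.findall(odd_text))  # the pipe is accepted as part of the match
print(STRICT_EMAIL.findall(odd_text))   # the stricter variant finds nothing
```

Both versions return the same two addresses for sample_text; they differ only on input that contains a pipe after the last dot.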
Extracting Emails from a Text File
To extract emails from a text file, we'll read the file's content into a string using the read() method and then use the same extract_emails function defined earlier.
    import re

    def extract_emails(text):
        email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
        return re.findall(email_regex, text)

    def extract_emails_from_file(file_path):
        # Read the whole file into one string, then search it
        with open(file_path, 'r') as file:
            content = file.read()
        return extract_emails(content)

    file_path = 'example.txt'
    emails = extract_emails_from_file(file_path)
    print(emails)
In this example, extract_emails_from_file reads the entire content of the file located at file_path and then uses the extract_emails function to find all email addresses.
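If you don't have an example.txt at hand, the sketch below creates a small throwaway file first so the whole thing can be run as is. The encoding parameter, the demo_path and demo_emails names, and the sample addresses are assumptions added for illustration; the original function relies on the platform's default encoding.

```python
import os
import re
import tempfile

def extract_emails(text):
    email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
    return re.findall(email_regex, text)

def extract_emails_from_file(file_path, encoding='utf-8'):
    # encoding is an assumed extra parameter, not part of the original function
    with open(file_path, 'r', encoding=encoding) as file:
        return extract_emails(file.read())

# Write a small temporary file so the example runs without an existing example.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'example.txt')
    with open(demo_path, 'w', encoding='utf-8') as file:
        file.write("Sales: sales@example.com\nSupport: support@example.org\n")
    demo_emails = extract_emails_from_file(demo_path)

print(demo_emails)
```

Passing the encoding explicitly is a design choice that keeps the result the same across operating systems; drop it if you prefer the default behaviour.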
Extracting Emails from a Large Text File
When dealing with large text files, reading the entire file into memory might not be feasible. In such cases, we can process the file line by line:
    import re

    def extract_emails(text):
        email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
        return re.findall(email_regex, text)

    def extract_emails_from_large_file(file_path):
        emails = []
        with open(file_path, 'r') as file:
            # Iterating over the file object yields one line at a time
            for line in file:
                emails.extend(extract_emails(line))
        return emails

    file_path = 'large_example.txt'
    emails = extract_emails_from_large_file(file_path)
    print(emails)
This function iterates over each line in the file, extracts emails from that line, and appends them to the emails list. This approach is more memory-efficient for large files, since only one line of the file is held in memory at a time.
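The emails list itself still grows with the number of matches. If that becomes a concern, one possible variation is a generator that yields addresses one at a time. The sketch below is an assumption-laden illustration, not the original method: the names EMAIL_PATTERN, iter_emails_from_file, demo_path, and streamed are hypothetical, as are the encoding parameter and the sample addresses.

```python
import os
import re
import tempfile

# Compile once so the pattern object is reused for every line.
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b')

def iter_emails_from_file(file_path, encoding='utf-8'):
    """Yield email addresses one at a time instead of building a list."""
    with open(file_path, 'r', encoding=encoding) as file:
        for line in file:
            yield from EMAIL_PATTERN.findall(line)

# Small temporary file standing in for large_example.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'large_example.txt')
    with open(demo_path, 'w', encoding='utf-8') as file:
        file.write("first@example.com\n")
        file.write("no address on this line\n")
        file.write("second@example.com, third@example.net\n")
    # A list is built here only to show the result; in real use you could
    # process each address inside a for loop without storing them all.
    streamed = list(iter_emails_from_file(demo_path))

print(streamed)
```

With a generator, the caller decides whether to keep the addresses, write them to another file, or count them, so memory use no longer depends on how many addresses the file contains.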
Conclusion
In this tutorial, we've learned how to extract email addresses from strings and text files using Python.
In general, to extract email addresses from strings in Python:
- Use the re module (for example, re.findall) to extract emails.
- Use the \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b RegEx pattern to match email addresses.
If you find that the pattern does not work, please comment below so I can fix it. Thank you!