advertools 0.16.1

Creator: bradpython12


Announcing Data Science with Python for SEO course: Cohort based course, interactive, live-coding.

advertools: productivity & analysis tools to scale your online marketing

A digital marketer is a data scientist.
Your job is to manage, manipulate, visualize, communicate, understand,
and make decisions based on data.

You might be doing basic stuff, like copying and pasting text in
spreadsheets, you might be running large scale automated platforms with
sophisticated algorithms, or somewhere in between. In any case your job
is all about working with data.
As a data scientist you don’t spend most of your time producing cool
visualizations or finding great insights. The majority of your time is spent
wrangling with URLs, figuring out how to stitch together two tables, hoping
that the dates won’t break without you knowing, or trying to generate the
next 124,538 keywords for an upcoming campaign by the end of the week!
advertools is a Python package that can hopefully make that part of your job a little easier.

Installation
python3 -m pip install advertools


Philosophy/approach
It’s very easy to learn how to use advertools. There are two main reasons for that.
First, it is essentially a set of independent functions that you can easily learn and
use. There are no special data structures, or additional learning that you need. With
basic Python, and an understanding of the tasks that these functions help with, you
should be able to pick it up fairly easily. In other words, if you know how to use an
Excel formula, you can easily use any advertools function.
The second reason is that advertools follows the UNIX philosophy in its design and
approach. Here is one of the various summaries of the UNIX philosophy by Doug McIlroy:

Write programs that do one thing and do it well. Write programs to work together.
Write programs to handle text streams, because that is a universal interface.

Let’s see how advertools follows that:
Do one thing and do it well: Each function in advertools aims for that. There is a
function that just extracts hashtags from a text list, another one to crawl websites,
one to test which URLs are blocked by robots.txt files, and one for downloading XML
sitemaps. Although they are designed to work together as a full pipeline, they can be
run independently in whichever combination or sequence you want.
Write programs to work together: Independence does not mean they are unrelated. The
workflows are designed to aid the online marketing practitioner in various steps for
understanding websites, SEO analysis, creating SEM campaigns and others.
Programs to handle text streams because that is a universal interface: In Data
Science the most used data structure that can be considered “universal” is the
DataFrame. So, most functions return either a DataFrame or a file that can be read into
one. Once you have it, you have the full power of all other tools like pandas for
further manipulating the data, Plotly for visualization, or any machine learning
library that can more easily handle tabular data.
This way it is kept modular as well as flexible and integrated.
As a next step, most of these functions are being converted to no-code
interactive apps, making them available to non-coders as well.


SEM Campaigns
The most important thing to achieve in SEM is a proper mapping between the
three main elements of a search campaign:
Keywords (the intention) -> Ads (your promise) -> Landing Pages (your delivery of the promise)
Once you have this done, you can focus on management and analysis. More importantly,
once you know that you can set this up in an easy way, you know you can focus
on more strategic issues. In practical terms you need two main tables to get started:

Keywords: You can generate keywords (note I didn’t say research) with the
kw_generate function.
Ads: There are two approaches that you can use:

Bottom-up: You can create text ads for a large number of products by simple
replacement of product names, and providing a placeholder in case your text
is too long. Check out the ad_create function for more details.
Top-down: Sometimes you have a long description text that you want to split
into headlines, descriptions and whatever slots you want to split them into.
ad_from_string helps you accomplish that.
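
The two tables described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not the implementation behind kw_generate or ad_create: the helper names generate_keywords and create_ads, the row keys, and the sample products are all assumptions made for the example.

```python
from itertools import product


def generate_keywords(products, words):
    """Pair every product with every modifier word, in both orders."""
    rows = []
    for item, word in product(products, words):
        for keyword in (f"{item} {word}", f"{word} {item}"):
            # One ad group per product keeps keywords and ads mapped together.
            rows.append({"ad_group": item.title(), "keyword": keyword})
    return rows


def create_ads(template, replacements, fallback, max_len):
    """Fill the template with each name, or with the fallback if too long."""
    ads = []
    for name in replacements:
        candidate = template.format(name)
        if len(candidate) <= max_len:
            ads.append(candidate)
        else:
            ads.append(template.format(fallback))
    return ads


keywords = generate_keywords(["honda", "toyota"], ["buy", "price"])
ads = create_ads(
    "Get the latest {}",
    ["iPhone", "Samsung Galaxy S24 Ultra"],
    fallback="phones",
    max_len=25,
)
```

Here the second product name is too long for the 25-character limit, so the fallback text is used for that ad. The real functions accept more options; check their documentation for the exact parameters.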


Tutorials and additional resources

Get started with Data Science for Digital Marketing and SEO/SEM
Setting a full SEM campaign for DataCamp’s website tutorial
Project to practice generating SEM keywords with Python on DataCamp
Setting up SEM campaigns on a large scale tutorial on SEMrush
Visual tool to generate keywords online based on the kw_generate function





SEO
Probably the most comprehensive online marketing area that is both technical
(crawling, indexing, rendering, redirects, etc.) and non-technical (content
creation, link building, outreach, etc.). Here are some tools that can help
with your SEO:

SEO crawler:
A generic SEO crawler that can be customized, built with Scrapy, & with several
features:

Standard SEO elements extracted by default (title, header tags, body text,
status code, response and request headers, etc.)
CSS and XPath selectors: You probably have more specific needs in mind, so
you can easily pass any selectors to be extracted in addition to the
standard elements being extracted
Custom settings: full access to Scrapy’s settings, allowing you to better
control the crawling behavior (set custom headers, user agent, stop spider
after x pages, seconds, megabytes, save crawl logs, run jobs at intervals
where you can stop and resume your crawls, which is ideal for large crawls
or for continuous monitoring, and many more options)
Following links: option to only crawl a set of specified pages or to follow
and discover all pages through links
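
To make "standard SEO elements" concrete, here is a small standard-library sketch that pulls the title, heading tags, and link URLs out of one HTML string. It is only an illustration of the kind of extraction a crawler performs on each page; the class name SEOElementParser and the sample page are hypothetical, and the actual crawler is built on Scrapy and extracts far more.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SEOElementParser(HTMLParser):
    """Collect the title, headings, and link URLs of a single page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []  # list of (tag, text) tuples
        self.links = []
        self._open = None  # the title/heading tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._open = tag
        elif tag in HEADINGS:
            self._open = tag
            # Register the heading right away so empty ones are kept too.
            self.headings.append((tag, ""))
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open == "title":
            self.title += data.strip()
        elif self._open in HEADINGS:
            tag, text = self.headings[-1]
            self.headings[-1] = (tag, text + data.strip())


PAGE = (
    "<html><head><title>Running shoes</title></head><body>"
    "<h1>Shoes for every runner</h1><h2></h2>"
    "<a href='/sale'>Sale</a></body></html>"
)
seo_parser = SEOElementParser()
seo_parser.feed(PAGE)
```

After feeding the page, seo_parser.title, seo_parser.headings, and seo_parser.links hold the extracted values, which is the per-page row a crawl output would contain in a much richer form.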


robots.txt downloader
A simple downloader of robots.txt files in a DataFrame format, so you can
keep track of changes across crawls if any, and check the rules, sitemaps,
etc.
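
As a rough sketch of what "robots.txt in a table" means, the snippet below turns an in-memory robots.txt file into rows and tests two URLs against it with Python's built-in urllib.robotparser. The row keys, the helper name robots_to_rows, and the sample file are assumptions for illustration; this is not how robotstxt_to_df or robotstxt_test are implemented.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def robots_to_rows(text):
    """Split each non-empty, non-comment line into a directive/content pair."""
    rows = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        directive, content = line.split(":", 1)
        rows.append({"directive": directive.strip(), "content": content.strip()})
    return rows


robots_rows = robots_to_rows(ROBOTS_TXT)

# The standard library can answer "is this URL blocked?" for the same rules.
robots_parser = RobotFileParser()
robots_parser.parse(ROBOTS_TXT.splitlines())
blocked = not robots_parser.can_fetch("*", "https://example.com/private/page")
allowed = robots_parser.can_fetch("*", "https://example.com/blog/")
```

Keeping the rules as rows makes it easy to diff two downloads of the same file and spot changes across crawls.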
XML Sitemaps downloader / parser
An essential part of any SEO analysis is to check XML sitemaps. This is a
simple function with which you can download one or more sitemaps (by
providing the URL for a robots.txt file, a sitemap file, or a sitemap index).
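
The core of turning a sitemap into tabular data can be sketched with xml.etree from the standard library. The example parses a sitemap held in a string, so it needs no network access; the helper name sitemap_to_rows and the sample URLs are hypothetical, and the real sitemap_to_df also handles downloading, sitemap indexes, and compressed files.

```python
import xml.etree.ElementTree as ET

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-08-01</lastmod></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>
"""


def sitemap_to_rows(xml_text):
    """Return one dict per <url> entry, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root:
        row = {}
        for child in url:
            # Tags look like "{namespace}loc"; keep only the local name.
            tag = child.tag.split("}", 1)[-1]
            row[tag] = (child.text or "").strip()
        rows.append(row)
    return rows


sitemap_rows = sitemap_to_rows(SITEMAP_XML)
```

Entries that lack an optional tag such as lastmod simply have no value for it, which becomes a missing value once the rows are loaded into a DataFrame.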
SERP importer and parser for Google & YouTube
Connect to Google’s API and get the search data you want. Multiple search
parameters supported, all in one function call, and all results returned in a
DataFrame
Tutorials and additional resources

A visual tool built with the serp_goog function to get SERP rankings on Google
A tutorial on analyzing SERPs on a large scale with Python on SEMrush
SERP datasets on Kaggle for practicing on different industries and use cases
SERP notebooks on Kaggle
some examples on how you might tackle such data
Content Analysis with XML Sitemaps and Python
XML dataset examples: news sites, Turkish news sites,
Bloomberg news





Text & Content Analysis (for SEO & Social Media)
URLs, page titles, tweets, video descriptions, comments, hashtags are some
examples of the types of text we deal with. advertools provides a few
options for text analysis:

Word frequency
Counting words in a text list is one of the most basic and important tasks in
text mining. What is also important is counting those words by taking into
consideration their relative weights in the dataset. word_frequency does
just that.
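
The difference between absolute and weighted frequency is easiest to see on a tiny dataset. The sketch below is a simplified stand-in written with the standard library, not the word_frequency function itself; the helper name weighted_word_counts and the sample page titles and pageview numbers are made up. The column names abs_freq and wtd_freq follow the names used in the change log.

```python
from collections import Counter


def weighted_word_counts(text_list, num_list=None):
    """Count words per document (abs_freq) and weighted by a number (wtd_freq)."""
    if num_list is None:
        num_list = [1] * len(text_list)  # every document counts once
    abs_freq, wtd_freq = Counter(), Counter()
    for text, num in zip(text_list, num_list):
        for word in text.lower().split():
            abs_freq[word] += 1
            wtd_freq[word] += num
    rows = [
        {"word": word, "abs_freq": abs_freq[word], "wtd_freq": wtd_freq[word]}
        for word in abs_freq
    ]
    # Sort by the weighted count, largest first.
    return sorted(rows, key=lambda row: row["wtd_freq"], reverse=True)


titles = ["python tutorial", "python jobs", "excel tutorial"]
pageviews = [100, 50, 1000]
freq_rows = weighted_word_counts(titles, pageviews)
```

By plain counts "python" and "tutorial" tie at two occurrences each, yet once pageviews are taken into account "excel" outranks "python" even though it appears only once.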
URL Analysis
We all have to handle many thousands of URLs in reports, crawls, social media
extracts, XML sitemaps and so on. url_to_df converts your URLs into
easily readable DataFrames.
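
A rough idea of what splitting URLs into components looks like, using only urllib.parse. The helper name url_to_row, the key names (dir_1, query_color, and so on), and the "@@" separator for repeated parameters are assumptions for this sketch; the last_dir key mirrors the column mentioned in the change log. Consult the url_to_df documentation for the actual output columns.

```python
from urllib.parse import parse_qs, urlsplit


def url_to_row(url):
    """Split one URL into its components, directories, and query parameters."""
    parts = urlsplit(url)
    dirs = [d for d in parts.path.split("/") if d]
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    for i, directory in enumerate(dirs, start=1):
        row[f"dir_{i}"] = directory
    row["last_dir"] = dirs[-1] if dirs else None
    for key, values in parse_qs(parts.query).items():
        # Assumed convention: join repeated parameters with "@@".
        row[f"query_{key}"] = "@@".join(values)
    return row


url_row = url_to_row("https://example.com/shop/shoes/?color=red&size=42#reviews")
```

A list of such rows can be passed straight to pandas.DataFrame, after which filtering by directory or query parameter is a one-line operation.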
Emoji
Emoji are produced with one click, extremely expressive, highly diverse
(3k+ emoji), and very popular, so it’s important to capture what people are
trying to communicate with them. You can extract emoji and get their names,
groups, and sub-groups. The full emoji database is also available for
convenience, as well as an emoji_search function in case you want some ideas
for your next social media post or any other kind of communication.
extract_ functions
The text that we deal with contains many elements and entities that have
their own special meaning and usage. There is a group of convenience
functions to help in extracting and getting basic statistics about structured
entities in text: emoji, hashtags, mentions, currency, numbers, URLs, questions
and more. You can also provide a special regex for your own needs.
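
The pattern behind this group of functions (apply a regex to each text, then summarize) can be sketched as follows. The helper name extract_entities, the hashtag regex, and the keys of the returned dictionary are simplified assumptions; the real extract_ functions use more careful patterns and return more statistics.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")  # simplified pattern, for illustration only


def extract_entities(text_list, regex):
    """Find all matches per text and compute a few summary statistics."""
    matches = [regex.findall(text) for text in text_list]
    flat = [match.lower() for per_text in matches for match in per_text]
    return {
        "matches": matches,
        "counts": [len(per_text) for per_text in matches],
        "top": Counter(flat).most_common(),
        "overview": {
            "num_posts": len(text_list),
            "num_matches": len(flat),
            "matches_per_post": len(flat) / len(text_list) if text_list else 0,
        },
    }


posts = ["Loving #Python and #SEO", "no tags here", "#python again"]
hashtag_summary = extract_entities(posts, HASHTAG)
```

Swapping HASHTAG for another compiled pattern (mentions, currency symbols, or your own regex) reuses the same summary logic.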
Stopwords
A list of stopwords in forty different languages to help in text analysis.
Tutorial on DataCamp for creating the word_frequency function and
explaining the importance of the difference between absolute and weighted word frequency
Text Analysis for Online Marketers
An introductory article on SEMrush



Social Media
In addition to the text analysis techniques provided, you can also connect to
the Twitter and YouTube data APIs. The main benefits of using advertools
for this:

Handles pagination and request limits: typically every API has a limited
number of results that it returns. You have to handle pagination when you
need more than the limit per request, which you typically do. This is handled
by default
DataFrame results: APIs send you back data in formats that need to be
parsed and cleaned so you can more easily start your analysis. This is also
handled automatically
Multiple requests: in YouTube’s case you might want to request data for the
same query across several countries, languages, channels, etc. You can
specify them all in one request and get the product of all the requests in
one response
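
"The product of all the requests" means every combination of the supplied parameter values. A minimal sketch of that expansion with itertools.product is shown below; the helper name dict_product and the parameter values are illustrative assumptions, and the code is not the library's internal helper.

```python
from itertools import product


def dict_product(params):
    """Expand a dict of values or lists into one dict per combination."""
    keys = list(params)
    value_lists = [
        value if isinstance(value, (list, tuple)) else [value]
        for value in params.values()
    ]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


requests_to_make = dict_product(
    {"q": ["shoes", "boots"], "regionCode": ["us", "uk", "fr"], "type": "video"}
)
```

Two queries across three regions yield six separate requests, each of which would then be sent, paginated if needed, and merged into a single DataFrame.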
Tutorials and additional resources
A visual tool to check what is trending on Twitter for all available locations
A Twitter data analysis dashboard with many options
How to use the Twitter data API with Python
Extracting entities from social media posts tutorial on Kaggle
Analyzing 131k tweets by European Football clubs tutorial on Kaggle
An overview of the YouTube data API with Python



Conventions
Function names mostly start with the object you are working on, so you can use
autocomplete to discover other options:

kw_: for keywords-related functions
ad_: for ad-related functions
url_: URL tracking and generation
extract_: for extracting entities from social media posts (mentions, hashtags, emoji, etc.)
emoji_: emoji related functions and objects
twitter: a module for querying the Twitter API and getting results in a DataFrame
youtube: a module for querying the YouTube Data API and getting results in a DataFrame
crawlytics: a module for analyzing crawl data (compare, links, redirects, and more)
serp_: get search engine results pages in a DataFrame, currently available: Google and YouTube
crawl: a function you will probably use a lot if you do SEO
*_to_df: a set of convenience functions for converting to DataFrames
(log files, XML sitemaps, robots.txt files, and lists of URLs)


Change Log - advertools



0.16.1 (2024-08-19)


Fixed

Ensure meta crawl data included in URLs crawled by following links.







0.16.0 (2024-08-18)


Added

Enable the meta parameter for the crawl function for: arbitrary metadata,
custom request headers, and 3rd party plugins like playwright.





Changed

Raise an error when supplying a custom log format without supplying fields.







0.15.1 (2024-07-16)


Fixed

Make file path for emoji_df relative to advertools __path__.
Allow the extension .jsonl for crawling.







0.15.0 (2024-07-15)


Added

Enable supplying request headers in sitemap_to_df, contributed by @joejoinerr
New function crawlytics.compare for comparing two crawls.
New function crawlytics.running_crawls for getting data on currently running crawl jobs (*NIX only for now).
New parameter date_format to logs_to_df for custom date formats.





Changed

Removed the relatedSite parameter from serp_goog - deprecated.
Update emoji regex and functionality to v15.1.





Fixed

Use int64 instead of int for YouTube count columns, contributed by @DanielP77







0.14.4 (2024-07-13)


Fixed

Use pd.NA instead of np.nan for empty values in url_to_df.







0.14.3 (2024-06-27)


Changed

Use a different XPath expression for body_text while crawling.







0.14.2 (2024-02-24)


Changed

Allow sitemap_to_df to work on offline sitemaps.







0.14.1 (2024-02-21)


Fixed

Preserve the order of supplied URLs in the output of url_to_df.







0.14.0 (2024-02-18)


Added

New module crawlytics for analyzing crawl DataFrames. Includes functions to
analyze crawl DataFrames (images, redirects, and links), as well as
functions to handle large files (jl_to_parquet, jl_subset, parquet_columns).
New encoding option for logs_to_df.
Option to save the output of url_to_df to a parquet file.





Changed

Remove requirement to delete existing log output and error files if they exist.
The function will now overwrite them if they do.
Autothrottling is enabled by default in crawl_headers to minimize being blocked.





Fixed

Always get absolute path for img src while crawling.
Handle NA src attributes when extracting images.
Change fillna(method="ffill") to ffill for url_to_df.







0.13.5 (2023-08-22)


Added

Initial experimental functionality for crawl_images.





Changed

Enable autothrottling by default for crawl_headers.







0.13.4 (2023-07-26)

Fixed
Make img attributes consistent in length, and support all attributes.



0.13.3 (2023-06-27)


Changed

Allow optional trailing space in log files (contributed by @andypayne)





Fixed

Replace newlines with spaces while parsing JSON-LD which was causing
errors in some cases.







0.13.2 (2022-09-30)


Added

Crawling recipe for how to use the DEFAULT_REQUEST_HEADERS to change
the default headers.





Changed

Split long lists of URLs while crawling regardless of the follow_links
parameter





Fixed

Clarify that while authenticating for Twitter only app_key and
app_secret are required, with the option to provide oauth_token
and oauth_token_secret if/when needed.







0.13.1 (2022-05-11)


Added

Command line interface with most functions
Make documentation interactive for most pages using thebe-sphinx





Changed

Use np.nan wherever there are missing values in url_to_df





Fixed

Don’t remove double quotes from etags when downloading XML sitemaps
Replace instances of pd.DataFrame.append, which is deprecated, with
pd.concat.
Replace empty values with np.nan for the size column in logs_to_df







0.13.0 (2022-02-10)


Added

New function crawl_headers: A crawler that only makes HEAD requests
to a known list of URLs.
New function reverse_dns_lookup: A way to get host information for a
large list of IP addresses concurrently.
New options for crawling: exclude_url_params, include_url_params,
exclude_url_regex, and include_url_regex for controlling which links to
follow while crawling.





Fixed

Any custom_settings options given to the crawl function that were
defined using a dictionary can now be set without issues. There was an
issue if those options were not strings.





Changed

The skip_url_params option was removed and replaced with the more
versatile exclude_url_params, which accepts either True or a list
of URL parameters to exclude while following links.
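
The behavior described for exclude_url_params can be pictured as a simple filter applied to each discovered link. The sketch below is an assumption about the semantics based on the description above (True skips any URL with parameters, a list skips URLs containing any of the listed parameters); the helper name should_follow is hypothetical and this is not the crawler's actual code.

```python
from urllib.parse import parse_qs, urlsplit


def should_follow(url, exclude_url_params=None):
    """Decide whether a discovered link should be followed."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if exclude_url_params is True:
        # Skip every URL that carries any query parameter.
        return not params
    if exclude_url_params:
        # Skip only URLs containing one of the listed parameters.
        return not any(name in params for name in exclude_url_params)
    return True


follow_plain = should_follow("https://example.com/shoes", True)
follow_paged = should_follow("https://example.com/shoes?page=2", True)
follow_sorted = should_follow("https://example.com/shoes?sort=asc&page=2", ["sort"])
follow_page_only = should_follow("https://example.com/shoes?page=2", ["sort"])
```

This kind of filter is useful for avoiding near-infinite spaces of faceted or sorted listing pages during a crawl.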







0.12.3 (2021-11-27)


Fixed

Crawler stops when provided with bad URLs in list mode.







0.12.0, 0.12.1, 0.12.2 (2021-11-27)


Added

New function logs_to_df: Convert a log file of any non-JSON format
into a pandas DataFrame and save it to a parquet file. This also
compresses the file to a much smaller size.
Crawler extracts all available img attributes: ‘alt’, ‘crossorigin’,
‘height’, ‘ismap’, ‘loading’, ‘longdesc’, ‘referrerpolicy’, ‘sizes’,
‘src’, ‘srcset’, ‘usemap’, and ‘width’ (excluding global HTML attributes
like style and draggable).
New parameter for the crawl function skip_url_params: Defaults to
False, consistent with previous behavior, with the ability to not
follow/crawl links containing any URL parameters.
New column for url_to_df “last_dir”: Extract the value in the last
directory for each of the URLs.





Changed

Query parameter columns in url_to_df DataFrame are now sorted by how
full the columns are (the percentage of values that are not NA)







0.11.1 (2021-04-09)


Added

The nofollow attribute for nav, header, and footer links.





Fixed

Timeout error while downloading robots.txt files.
Make extracting nav, header, and footer links consistent with all links.







0.11.0 (2021-03-31)


Added

New parameter recursive for sitemap_to_df to control whether or not
to get all sub sitemaps (default), or to only get the current
(sitemapindex) one.
New columns for sitemap_to_df: sitemap_size_mb
(1 MB = 1,024x1,024 bytes), and sitemap_last_modified and etag
(if available).
Option to request multiple robots.txt files with robotstxt_to_df.
Option to save downloaded robots DataFrame(s) to a file with
robotstxt_to_df using the new parameter output_file.
Two new columns for robotstxt_to_df: robotstxt_last_modified and
etag (if available).
Raise ValueError in crawl if css_selectors or
xpath_selectors contain any of the default crawl column headers
New XPath code recipes for custom extraction.
New function crawllogs_to_df which converts crawl logs to a DataFrame
provided they were saved while using the crawl function.
New columns in crawl: viewport, charset, all h headings
(whichever is available), nav, header and footer links and text, if
available.
Crawl errors don’t stop crawling anymore, and the error message is
included in the output file under a new errors and/or jsonld_errors
column(s).
In case of having JSON-LD errors, errors are reported in their respective
column, and the remainder of the page is scraped.





Changed

Removed column prefix resp_meta_ from columns containing it
Redirect URLs and reasons are separated by ‘@@’ for consistency with
other multiple-value columns
Links extracted while crawling are not unique any more (all links are
extracted).
Emoji data updated with v13.1.
Heading tags are scraped even if they are empty, e.g. <h2></h2>.
Default user agent for crawling is now advertools/VERSION.





Fixed

Handle sitemap index files that contain links to themselves, with an
error message included in the final DataFrame
Error in robots.txt files caused by comments preceded by whitespace
Zipped robots.txt files causing a parsing issue
Crawl issues on some Linux systems when providing a long list of URLs





Removed

Columns from the crawl output: url_redirected_to, links_fragment







0.10.7 (2020-09-18)


Added

New function knowledge_graph for querying Google’s API
Faster sitemap_to_df with threads
New parameter max_workers for sitemap_to_df to determine how fast
it could go
New parameter capitalize_adgroups for kw_generate to determine
whether or not to keep ad groups as is, or set them to title case (the
default)





Fixed

Remove restrictions on the number of URLs provided to crawl,
assuming follow_links is set to False (list mode)
JSON-LD issue breaking crawls when it’s invalid (now skipped)





Removed

Deprecate the youtube.guide_categories_list (no longer supported by
the API)







0.10.6 (2020-06-30)


Added

JSON-LD support in crawling. If available on a page, JSON-LD items will
have special columns, and multiple JSON-LD snippets will be numbered for
easy filtering





Changed

Stricter parsing for rel attributes, making sure they are in link
elements as well
Date column names for robotstxt_to_df and sitemap_to_df unified
as “download_date”
Numbering OG, Twitter, and JSON-LD where multiple elements are present in
the same page, follows a unified approach: no numbering for the first
element, and numbers start with “1” from the second element on. “element”,
“element_1”, “element_2” etc.
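
The numbering rule above is small enough to express in one line. This sketch only restates the convention; the helper name numbered_columns is hypothetical.

```python
def numbered_columns(name, count):
    """First element keeps the bare name, later ones get _1, _2, and so on."""
    return [name if i == 0 else f"{name}_{i}" for i in range(count)]


og_image_columns = numbered_columns("og:image", 3)
```

For a page with three og:image tags this gives the columns og:image, og:image_1, and og:image_2.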







0.10.5 (2020-06-14)


Added


New features for the crawl function:

Extract canonical tags if available
Extract alternate href and hreflang tags if available
Open Graph data “og:title”, “og:type”, “og:image”, etc.
Twitter cards data “twitter:site”, “twitter:title”, etc.









Fixed


Minor fixes to robotstxt_to_df:

Allow whitespace in fields
Allow case-insensitive fields









Changed

crawl now only supports output_file with the extension “.jl”
word_frequency drops wtd_freq and rel_value columns if num_list
is not provided







0.10.4 (2020-06-07)


Added

New function url_to_df, splitting URLs into their components and to a
DataFrame
Slight speed up for robotstxt_test







0.10.3 (2020-06-03)


Added

New function robotstxt_test, testing URLs and whether they can be
fetched by certain user-agents





Changed

Documentation main page relayout, grouping of topics, & sidebar captions
Various documentation clarifications and new tests







0.10.2 (2020-05-25)


Added

User-Agent info to requests getting sitemaps and robotstxt files
CSS/XPath selectors support for the crawl function
Support for custom spider settings with a new parameter custom_settings





Fixed

Update changed supported search operators and values for CSE







0.10.1 (2020-05-23)


Changed

Links are better handled, and new output columns are available:
links_url, links_text, links_fragment, links_nofollow
body_text extraction is improved by containing <p>, <li>, and <span>
elements







0.10.0 (2020-05-21)


Added

New function crawl for crawling and parsing websites
New function robotstxt_to_df downloading robots.txt files into
DataFrames







0.9.1 (2020-05-19)


Added

Ability to specify robots.txt file for sitemap_to_df
Ability to retrieve any kind of sitemap (news, video, or images)
Errors column to the returned DataFrame if any errors occur
A new sitemap_downloaded column showing datetime of getting the
sitemap





Fixed

Logging issue causing sitemap_to_df to log the same action twice
Issue preventing URLs not ending with xml or gz from being retrieved
Correct sitemap URL showing in the sitemap column







0.9.0 (2020-04-03)


Added

New function sitemap_to_df imports an XML sitemap into a
DataFrame







0.8.1 (2020-02-08)


Changed

Column query_time is now named queryTime in the youtube functions
Handle json_normalize import from pandas based on pandas version







0.8.0 (2020-02-02)


Added

New module youtube connecting to all GET requests in API
extract_numbers new function
emoji_search new function
emoji_df new variable containing all emoji as a DataFrame





Changed

Emoji database updated to v13.0
serp_goog with expanded pagemap and metadata





Fixed

serp_goog errors, some parameters not appearing in result
df
extract_numbers issue when providing dash as a separator
in the middle







0.7.3 (2019-04-17)


Added

New function extract_exclamations very similar to
extract_questions
New function extract_urls, also counts top domains and
top TLDs
New keys to extract_emoji; top_emoji_categories
& top_emoji_sub_categories
Groups and sub-groups to emoji db







0.7.2 (2019-03-29)


Changed

Emoji regex updated
Simpler extraction of Spanish questions







0.7.1 (2019-03-26)


Fixed

Missing __init__ imports.







0.7.0 (2019-03-26)


Added

New extract_ functions:

Generic extract used by all others, and takes
arbitrary regex to extract text.
extract_questions to get question mark statistics, as
well as the text of questions asked.
extract_currency shows text that has currency symbols in it, as
well as surrounding text.
extract_intense_words gets statistics about, and extracts, words with
any character repeated three or more times, indicating an intense
feeling (+ve or -ve).


New function word_tokenize:

Used by word_frequency to get tokens of
1,2,3-word phrases (or more).
Split a list of text into tokens of a specified number of words each.


New stop-words from the spaCy package:
current: Arabic, Azerbaijani, Danish, Dutch, English, Finnish,
French, German, Greek, Hungarian, Italian, Kazakh, Nepali, Norwegian,
Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
new: Bengali, Catalan, Chinese, Croatian, Hebrew, Hindi, Indonesian,
Irish, Japanese, Persian, Polish, Sinhala, Tagalog, Tamil, Tatar, Telugu,
Thai, Ukrainian, Urdu, Vietnamese






Changed


word_frequency takes new parameters:

regex defaults to words, but can be changed to anything, for example ‘\S+’
to split words and keep punctuation.
sep is no longer used as an option; the above regex can
be used instead
num_list now optional, and defaults to counts of 1 each if not
provided. Useful for counting abs_freq only if data not
available.
phrase_len the number of words in each split token. Defaults
to 1 and can be set to 2 or higher. This helps in analyzing phrases
as opposed to words.




Parameters supplied to serp_goog appear at the beginning
of the result df
serp_youtube now contains nextPageToken to make
paginating requests easier







0.6.0 (2019-02-11)


New function

extract_words to extract an arbitrary set of words





Minor updates

ad_from_string slots argument reflects new text
ad lengths
hashtag regex improved







0.5.3 (2019-01-31)


Fix minor bugs

Handle Twitter search queries with 0 results in final request







0.5.2 (2018-12-01)


Fix minor bugs

Properly handle requests for >50 items (serp_youtube)
Rewrite test for _dict_product
Fix issue with string printing error msg







0.5.1 (2018-11-06)


Fix minor bugs

_dict_product implemented with lists
Missing keys in some YouTube responses







0.5.0 (2018-11-04)


New function serp_youtube

Query YouTube API for videos, channels, or playlists
Multiple queries (product of parameters) in one function call
Response looping and merging handled, one DataFrame




serp_goog returns Google’s original error messages
For twitter responses that contain entities, the entities are extracted, each
in a separate column



0.4.1 (2018-10-13)


New function serp_goog (based on Google CSE)

Query Google search and get the result in a DataFrame
Make multiple queries / requests in one function call
All responses merged in one DataFrame




twitter.get_place_trends results are ranked by town and country



0.4.0 (2018-10-08)


New Twitter module based on twython

Wraps 20+ functions for getting Twitter API data
Gets data in a pandas DataFrame
Handles looping over requests higher than the defaults




Tested on Python 3.7



0.3.0 (2018-08-14)

Search engine marketing cheat sheet.

New set of extract_ functions with summary stats for each:

extract_hashtags
extract_mentions
extract_emoji




Tests and bug fixes



0.2.0 (2018-07-06)

New set of kw_<match-type> functions.
Full testing and coverage.



0.1.0 (2018-07-02)

First release on PyPI.

Functions available:

ad_create: create a text ad by placing words in placeholders

ad_from_string: split a long string into shorter strings that fit into
given slots



kw_generate: generate keywords from lists of products and words
url_utm_ga: generate a UTM-tagged URL for Google Analytics tracking

word_frequency: measure the absolute and weighted frequency of words in
a collection of documents
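
Splitting a long string into length-limited slots, as described for ad_from_string above, can be sketched with a greedy word-by-word fill. This is one plausible approach under stated assumptions, not the library's implementation: the helper name split_into_slots, the slot lengths, and the choice to return leftover text as a final element are all hypothetical.

```python
def split_into_slots(text, slots):
    """Greedily fill each slot with whole words without exceeding its length."""
    words = text.split()
    parts = []
    for limit in slots:
        current = ""
        while words:
            extra = len(words[0]) + (1 if current else 0)  # +1 for the space
            if len(current) + extra > limit:
                break
            current = f"{current} {words[0]}" if current else words[0]
            words.pop(0)
        parts.append(current)
    # Whatever did not fit is returned as a final, unrestricted element.
    parts.append(" ".join(words))
    return parts


ad_parts = split_into_slots(
    "Buy running shoes online with free shipping today", slots=(18, 20)
)
```

Words are never cut in the middle, so a slot may end up shorter than its limit, and a word longer than a slot is pushed to the next one.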

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
