[](https://pypi.org/project/waybacktweets) [](https://waybacktweets.streamlit.app)
-Retrieves archived tweets CDX data from the Wayback Machine, performs necessary parsing, and saves the data in CSV, JSON, and HTML formats.
+Retrieves archived tweets CDX data from the Wayback Machine, performs several parsing operations to facilitate analysis of archived tweets and tweet types (see [Field Options](https://claromes.github.io/waybacktweets/field_options.html)), and saves the data in CSV, JSON, and HTML formats.
## Installation
.. autoclass:: WaybackTweets
:members:
+.. _parser:
+
Parse
---------
.. autoclass:: JsonParser
:members:
+.. _exporter:
+
Export
---------
:members:
:private-members:
+.. _utils:
+
Utils
-------
Field Options
================
-The package saves in three formats: CSV, JSON, and HTML. The files have the following fields:
+The package performs several parsing operations to facilitate the analysis of archived tweets and tweet types. The fields below are available and can be passed to the :ref:`parser` and :ref:`exporter`; the command-line tool returns all of these fields.
- ``archived_urlkey``: (`str`) A canonical transformation of the URL you supplied, for example, ``org,eserver,tc)/``. Such keys are useful for indexing.
- ``archived_tweet_url``: (`str`) The original archived URL.
-- ``parsed_tweet_url``: (`str`) The original tweet URL after parsing. `Check the utility functions <api.html#module-waybacktweets.utils.utils>`_.
+- ``parsed_tweet_url``: (`str`) The original tweet URL after parsing. Old URLs were archived in a nested manner; the parsing applied here unnests them when necessary. Check the :ref:`utils`.
-- ``parsed_archived_tweet_url``: (`str`) The original archived URL after parsing. `Check the utility functions <api.html#module-waybacktweets.utils.utils>`_.
-
-.. TODO: JSON Issue
-.. - ``parsed_tweet_text_mimetype_json``: (`str`) The tweet text extracted from the archived URL that has mimetype ``application/json``.
+- ``parsed_archived_tweet_url``: (`str`) The original archived URL after parsing. This URL is not guaranteed to be archived; it is provided as a convenience, since the originally archived URL does not always exist due to changes in Twitter's URLs and web services. Check the :ref:`utils`.
- ``available_tweet_text``: (`str`) The tweet text extracted from the URL that is still available on the Twitter account.
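As a hedged illustration of how a list of field options narrows a parsed record, the sketch below filters a sample record down to the requested fields. The record values and the `select_fields` helper are hypothetical, for illustration only; the keys are field names from the list above.

```python
# Hypothetical sketch: keep only the requested field options from a
# parsed tweet record before exporting. The record values below are
# made up; the keys are field names from the list above.
def select_fields(record, field_options):
    return {field: record.get(field) for field in field_options}

record = {
    "archived_urlkey": "org,eserver,tc)/",
    "archived_tweet_url": "https://web.archive.org/web/20150101000000/https://twitter.com/user/status/123",
    "parsed_tweet_url": "https://twitter.com/user/status/123",
    "available_tweet_text": "Hello, world",
}

row = select_fields(record, ["archived_urlkey", "parsed_tweet_url"])
# row now holds only the two requested fields
```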
.. image:: ../assets/waybacktweets.png
:align: center
-Retrieves archived tweets CDX data from the Wayback Machine, performs necessary parsing, and saves the data in CSV, JSON, and HTML formats.
+Retrieves archived tweets CDX data from the Wayback Machine, performs several parsing operations to facilitate analysis of archived tweets and tweet types (see :ref:`field_options`), and saves the data in CSV, JSON, and HTML formats.
.. note::
Intensive queries can lead to rate limiting, resulting in a temporary ban of a few minutes from web.archive.org.
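Since intensive queries can trigger that temporary ban, a caller may want to back off and retry. A minimal sketch, assuming a hypothetical `fetch` callable that raises when web.archive.org rate-limits the request (this helper is not part of the package):

```python
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=1.0):
    # `fetch` is a hypothetical callable that raises an exception on a
    # rate-limit response; retry with exponential backoff between tries.
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last retry
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```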
.. |uncheck| raw:: html

   <input type="checkbox">
-|uncheck| JSON Parser: Create a separate function to handle JSON return, apply JsonParser (``waybacktweets/api/parse.py:111``), and avoid rate limiting (`Planned for v1.2`)
+|uncheck| Review and publish the new version of the Streamlit Web App
-|uncheck| Download images when tweet URL has extensions like JPG or PNG (`Planned for v1.2`)
+|uncheck| Unit Tests
-|uncheck| Develop a scraper to download snapshots from https://archive.today (`Not planned`)
+|uncheck| JSON Parser: Create a separate function to handle JSON return, apply JsonParser (``waybacktweets/api/parse.py:111``), and avoid rate limiting
-|uncheck| Unit Tests (`Planned for v1.1`)
+|uncheck| Download images when tweet URL has extensions like JPG or PNG
-|uncheck| Mapping and parsing of other Twitter-related URLs (`Planned`)
+|uncheck| Implement logging system (remove print statements)
-|uncheck| Review and publish the new version of the Streamlit Web App (`Planned for v1.0.1`)
+|uncheck| Mapping and parsing of other Twitter-related URLs
+|uncheck| Develop a scraper to download snapshots from https://archive.today
from rich.progress import Progress
from waybacktweets.config.config import config
+from waybacktweets.config.field_options import FIELD_OPTIONS
from waybacktweets.exceptions.exceptions import (
ConnectionError,
GetResponseError,
username: str,
field_options: List[str],
):
+ if not all(option in FIELD_OPTIONS for option in field_options):
+ raise ValueError("Some field options are not valid.")
+
self.archived_tweets_response = archived_tweets_response
self.username = username
self.field_options = field_options
# flake8: noqa: F401
from waybacktweets.config.config import config
+from waybacktweets.config.field_options import FIELD_OPTIONS
--- /dev/null
+"""
+List of valid field options that can be used for parsing tweets.
+"""
+
+FIELD_OPTIONS = [
+ "archived_urlkey",
+ "archived_timestamp",
+ "original_tweet_url",
+ "archived_tweet_url",
+ "parsed_tweet_url",
+ "parsed_archived_tweet_url",
+ "available_tweet_text",
+ "available_tweet_is_RT",
+ "available_tweet_info",
+ "archived_mimetype",
+ "archived_statuscode",
+ "archived_digest",
+ "archived_length",
+]
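The constructor's new check against `FIELD_OPTIONS` can be exercised on its own. A minimal sketch, using an abbreviated copy of the list above; `validate_field_options` is a hypothetical standalone helper that mirrors the constructor's check:

```python
# Abbreviated copy of FIELD_OPTIONS, for illustration only.
FIELD_OPTIONS = [
    "archived_urlkey",
    "archived_timestamp",
    "original_tweet_url",
    "archived_tweet_url",
    "parsed_tweet_url",
]

def validate_field_options(field_options):
    # Mirrors the check added to the parser constructor: reject the
    # call if any requested field is not a known option.
    if not all(option in FIELD_OPTIONS for option in field_options):
        raise ValueError("Some field options are not valid.")

validate_field_options(["archived_urlkey", "parsed_tweet_url"])  # passes
```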
def check_pattern_tweet(tweet_url: str) -> str:
"""
- Extracts the tweet ID from a tweet URL.
+ Extracts the URL from a tweet URL with patterns such as:
+
+ - Reply: /status//
+ - Link: /status///
+ - Twimg: /status/https://pbs
Args:
- tweet_url (str): The tweet URL to extract the ID from.
+ tweet_url (str): The tweet URL to extract the URL from.
Returns:
- The extracted tweet ID.
+ The URL extracted from the tweet.
"""
- pattern = re.compile(r'/status/"([^"]+)"')
-
- match = pattern.search(tweet_url)
- if match:
- return match.group(1).lstrip("/")
- else:
- return tweet_url
+    patterns = [
+        re.compile(r'/status/"([^"]+)"'),
+        re.compile(r'/status/%3B([^"]+)%3B'),
+    ]
+
+    # Try each known nesting pattern; fall back to the original URL
+    # only if none of them match.
+    for pattern in patterns:
+        match = pattern.search(tweet_url)
+        if match:
+            return match.group(1).lstrip("/")
+
+    return tweet_url
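The quote-delimited case handled by `check_pattern_tweet` can be seen in isolation. A self-contained sketch; the nested sample URL is hypothetical:

```python
import re

# Some tweet URLs were archived nested inside another URL, with the
# inner URL wrapped in quotes after /status/. Extract the inner URL.
pattern = re.compile(r'/status/"([^"]+)"')

nested = 'https://twitter.com/user/status/"https://twitter.com/other/status/123"'
match = pattern.search(nested)
inner = match.group(1).lstrip("/") if match else nested
# inner is the unnested tweet URL
```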
def delete_tweet_pathnames(tweet_url: str) -> str: