*.csv
*.json
*.html
+*.txt
+
+test.py
waybacktweets/__pycache__
waybacktweets/api/__pycache__
# Wayback Tweets
-[](https://pypi.org/project/waybacktweets) [](https://doi.org/10.5281/zenodo.12528447) [](https://waybacktweets.streamlit.app) [](https://colab.research.google.com/drive/1tnaM3rMWpoSHBZ4P_6iHFPjraWRQ3OGe?usp=sharing)
-
+[](https://pypi.org/project/waybacktweets) [](https://pepy.tech/projects/waybacktweets)
Retrieves archived tweets' CDX data from the Wayback Machine, performs the necessary parsing (see [Field Options](https://claromes.github.io/waybacktweets/field_options.html)), and saves the data in HTML (for easy viewing of the tweets via iframe tags), CSV, and JSON formats.
pip install waybacktweets
```
-## Quickstart
-
-### Using Wayback Tweets as a standalone command line tool
-
-waybacktweets [OPTIONS] USERNAME
+## CLI
```shell
-waybacktweets --from 20150101 --to 20191231 --limit 250 jack
+Usage: waybacktweets [OPTIONS] USERNAME
+
+ USERNAME: The Twitter username without @
+
+Options:
+ -c, --collapse [urlkey|digest|timestamp:XX]
+                                  Collapse results based on a field, or a
+                                  substring of a field. XX in the timestamp
+                                  value ranges from 1 to 14 and compares the
+                                  first XX digits of the timestamp field. A
+                                  value of 4 or higher is recommended, so
+                                  that captures are compared at least by
+                                  year.
+  -f, --from DATE                 Filter results from this date onward.
+                                  Format: YYYYmmdd
+  -t, --to DATE                   Filter results up to this date.
+                                  Format: YYYYmmdd
+  -l, --limit INTEGER             Limit the number of results returned.
+  -rk, --resumption_key TEXT      Key to continue the query from the end of
+                                  the previous one, providing a simple way
+                                  to page through the results.
+ -mt, --matchtype [exact|prefix|host|domain]
+                                  Return results matching a certain prefix,
+                                  a certain host, or all subdomains.
+ -v, --verbose Shows the log.
+ --version Show the version and exit.
+ -h, --help Show this message and exit.
+
+ Examples:
+
+ Retrieve all tweets: waybacktweets jack
+
+ With options and verbose output: waybacktweets --from 20200305 --to 20231231 --limit 300 --verbose jack
+
+ Documentation:
+
+ https://claromes.github.io/waybacktweets/
```
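+
+For example, to keep roughly one capture per year, collapse on the first four digits of the timestamp:
+
+```shell
+waybacktweets --collapse timestamp:4 jack
+```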
-### Using Wayback Tweets as a Web App
-
-[Open the application](https://waybacktweets.streamlit.app), a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
+## Module
-### Using Wayback Tweets as a Python Module
+[](https://colab.research.google.com/drive/1tnaM3rMWpoSHBZ4P_6iHFPjraWRQ3OGe?usp=sharing)
```python
from waybacktweets import WaybackTweets, TweetsParser, TweetsExporter
if archived_tweets:
field_options = [
+ "archived_urlkey",
"archived_timestamp",
- "original_tweet_url",
+ "parsed_archived_timestamp",
"archived_tweet_url",
+ "parsed_archived_tweet_url",
+ "original_tweet_url",
+ "parsed_tweet_url",
+ "available_tweet_text",
+ "available_tweet_is_RT",
+ "available_tweet_info",
+ "archived_mimetype",
"archived_statuscode",
+ "archived_digest",
+ "archived_length",
+ "resumption_key",
]
parser = TweetsParser(archived_tweets, USERNAME, field_options)
exporter = TweetsExporter(parsed_tweets, USERNAME, field_options)
exporter.save_to_csv()
+ exporter.save_to_json()
+ exporter.save_to_html()
```
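+
+For context, a minimal end-to-end sketch (the `WaybackTweets` constructor arguments and the `parser.parse` call are assumptions here; see the [documentation](https://claromes.github.io/waybacktweets/) for the exact signatures):
+
+```python
+from waybacktweets import WaybackTweets, TweetsParser, TweetsExporter
+
+USERNAME = "jack"
+
+# Assumed constructor: username only, filters left at their defaults
+api = WaybackTweets(USERNAME)
+archived_tweets = api.get()
+
+if archived_tweets:
+    field_options = [
+        "archived_timestamp",
+        "original_tweet_url",
+        "archived_tweet_url",
+        "archived_statuscode",
+    ]
+
+    parser = TweetsParser(archived_tweets, USERNAME, field_options)
+    parsed_tweets = parser.parse()  # assumed to return the parsed data
+
+    exporter = TweetsExporter(parsed_tweets, USERNAME, field_options)
+    exporter.save_to_csv()
+    exporter.save_to_json()
+    exporter.save_to_html()
+```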
+## Web App
+
+[](https://waybacktweets.streamlit.app)
+
+A prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
+
+> [!NOTE]
+> Starting from version 1.0, the web app will not receive all updates from the official package. To access all features, use the package from PyPI.
+
## Documentation
- [Wayback Tweets documentation](https://claromes.github.io/waybacktweets)
## Acknowledgements
-- Tristan Lee (Bellingcat's Data Scientist) for the idea of the application.
+- Tristan Lee (Bellingcat's Data Scientist) for the idea.
- Jessica Smith (Snowflake's Community Growth Specialist) and Streamlit/Snowflake team for the additional server resources on Streamlit Cloud.
-- OSINT Community for recommending the application.
+- OSINT Community for recommending the package and the application.
-> [!NOTE]
-> If the Streamlit application is down, please check the [Streamlit Cloud Status](https://www.streamlitstatus.com/).
+## License
+
+[GPL-3.0](LICENSE.md)
project = "Wayback Tweets"
release, version = get_version("waybacktweets")
rst_epilog = f".. |release| replace:: v{release}"
-copyright = f"2023 - {datetime.datetime.now().year}, Claromes · Icon by The Doodle Library · Title font by Google, licensed under the Open Font License · Pre-release: v{release}" # noqa: E501
+copyright = f"2023 - {datetime.datetime.now().year}, Claromes · Icon by The Doodle Library · Title font by Google, licensed under the Open Font License · Release: v{release}" # noqa: E501
author = "Claromes"
# -- General configuration ---------------------------------------------------
- ``archived_digest``: (`str`) The ``SHA1`` hash digest of the content, excluding the headers. It's usually a base-32-encoded string.
- ``archived_length``: (`int`) The compressed byte size of the corresponding WARC record, which includes WARC headers, HTTP headers, and content payload.
+
+- ``resumption_key``: (`str`) Key to continue the query from the end of the previous one, providing a simple way to page through the results.
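+
+For example, a limited query prints a resumption key at the end of its output; passing that key back continues the query from where the previous one stopped:
+
+.. code-block:: shell
+
+    waybacktweets --limit 250 jack
+    waybacktweets --limit 250 --resumption_key <resumption-key> jack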
:target: https://colab.research.google.com/drive/1tnaM3rMWpoSHBZ4P_6iHFPjraWRQ3OGe?usp=sharing
:alt: Open In Colab
-.. raw:: html
cli
-Streamlit Web App
--------------------
+API Reference
+---------------
.. toctree::
:maxdepth: 2
- streamlit
-
+ api
-API Reference
----------------
+Streamlit Web App
+-------------------
.. toctree::
:maxdepth: 2
- api
-
+ streamlit
Additional Information
-----------------------
- ``original_tweet_url``: (`str`) The original tweet URL.
-- ``parsed_tweet_url``: (`str`) The original tweet URL after parsing. Old URLs were archived in a nested manner. The parsing applied here unnests these URLs, when necessary. Check the :ref:`utils`.
+- ``parsed_tweet_url``: (`str`) The original tweet URL after parsing. Old URLs were archived in a nested manner. The parsing applied here unnests these URLs when necessary. Refer to the :ref:`utils` for more details.
Additionally, other fields are displayed.
+.. note::
+
+ The iframes (accordions) are best viewed in Firefox.
+
CSV
--------
waybacktweets --from 20150101 --to 20191231 --limit 250 jack
-Web App
--------------
-
-Using Wayback Tweets as a Streamlit Web App.
-
-`Open the application <https://waybacktweets.streamlit.app>`_, a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
-
Module
-------------
if archived_tweets:
field_options = [
+ "archived_urlkey",
"archived_timestamp",
- "original_tweet_url",
+ "parsed_archived_timestamp",
"archived_tweet_url",
+ "parsed_archived_tweet_url",
+ "original_tweet_url",
+ "parsed_tweet_url",
+ "available_tweet_text",
+ "available_tweet_is_RT",
+ "available_tweet_info",
+ "archived_mimetype",
"archived_statuscode",
+ "archived_digest",
+ "archived_length",
+ "resumption_key",
]
parser = TweetsParser(archived_tweets, USERNAME, field_options)
exporter = TweetsExporter(parsed_tweets, USERNAME, field_options)
exporter.save_to_csv()
+ exporter.save_to_json()
+ exporter.save_to_html()
+
+Web App
+-------------
+
+Using Wayback Tweets as a Streamlit Web App.
+
+`Open the application <https://waybacktweets.streamlit.app>`_, a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
Web App
=========
+.. note::
+
+   Starting from version 1.0, the web app will not receive all updates from the official package. To access all features, use the package from PyPI.
+
The application is a prototype hosted on Streamlit Cloud, serving as an alternative to the command line tool.
`Open the application <https://waybacktweets.streamlit.app>`_.
- Limit: The maximum number of results returned.
-- Resumption Key: Allows for a simple way to scroll through the results. Key to continue the query from the end of the previous query.
-
- Only unique Wayback Machine URLs: Filtering by the collapse option using the ``urlkey`` field and the URL Match Scope ``prefix``.
[tool.poetry]
name = "waybacktweets"
-version = "1.0rc1"
+version = "1.0"
description = "Retrieves archived tweets' CDX data from the Wayback Machine, performs the necessary parsing, and saves the data."
authors = ["Claromes <support@claromes.com>"]
license = "GPLv3"
repository = "https://github.com/claromes/waybacktweets"
keywords = [
"twitter",
+ "X",
"tweet",
"internet-archive",
"wayback-machine",
"command-line",
]
classifiers = [
- "Development Status :: 4 - Beta",
+ "Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Natural Language :: English",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
+ "Programming Language :: Python :: 3.12",
"Topic :: Software Development",
"Topic :: Utilities",
]
matchtype,
)
- print(f"Retrieving the archived tweets of @{username}...")
+ print("Retrieving...")
archived_tweets = api.get()
if archived_tweets:
"""
import datetime
-import os
import re
from typing import Any, Dict, List, Optional
print(f"Saved to {csv_file_path}")
+ def generate_json(self) -> str:
+ """
+ Generates JSON data from the DataFrame (without saving to a file).
+
+ Returns:
+ The JSON-formatted string of the DataFrame.
+ """
+
+ json_data = self.dataframe.to_json(orient="records", lines=False)
+ return json_data
+
def save_to_json(self) -> None:
"""
Saves the DataFrame to a JSON file.
"""
Saves the DataFrame to an HTML file.
"""
- json_path = f"{self.filename}.json"
-
- if not os.path.exists(json_path):
- self.save_to_json()
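+        # Build the JSON in memory rather than reading a previously saved file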
+ json_data = self.generate_json()
html_file_path = f"{self.filename}.html"
- html = HTMLTweetsVisualizer(self.username, json_path, html_file_path)
+ html = HTMLTweetsVisualizer(self.username, json_data, html_file_path)
html_content = html.generate()
html.save(html_content)
if not all(option in FIELD_OPTIONS for option in field_options):
raise ValueError("Some field options are not valid.")
- self.archived_tweets_response = archived_tweets_response
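+        # WaybackTweets.get() returns a (CDX payload, metadata) tuple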
+ self.archived_tweets_response = archived_tweets_response[0]
self.username = username
self.field_options = field_options
self.parsed_tweets = {option: [] for option in self.field_options}
-
- if "resumption_key" not in self.parsed_tweets:
- self.parsed_tweets["resumption_key"] = []
+ self.show_resume_key = archived_tweets_response[1]["show_resume_key"]
self._add_resumption_key()
if not self.archived_tweets_response:
raise ValueError("The list of archived tweet responses is empty.")
- resumption_key = self.archived_tweets_response[-1][0]
- self.parsed_tweets["resumption_key"].append(resumption_key)
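+        # With showResumeKey enabled, the CDX API appends the resume key as
+        # the final row of the response, hence [-1][0]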
+        if self.show_resume_key and "resumption_key" in self.parsed_tweets:
+            self.parsed_tweets["resumption_key"] = [
+                self.archived_tweets_response[-1][0]
+            ]
def _add_field(self, key: str, value: Any) -> None:
"""
if print_progress:
progress.update(task, advance=1)
- rprint(
- f"[blue]Resumption Key: [bold]{self.archived_tweets_response[-1][0]}[/bold]\nUse the Resumption Key (--resumption_key, -rk) option to continue the query from where the previous one ended. This allows you to break a large query into smaller queries more efficiently.[/blue]\n" # noqa: E501
- )
+ if self.show_resume_key:
+ rprint(
+ f'[blue]Resumption Key: [bold]{self.archived_tweets_response[-1][0]}[/bold][/blue]\nUse this Resumption Key option (--resumption_key in the CLI or "resumption_key" in field_options via the API) to continue the query from where the previous one left off. This allows you to split a large query into smaller, more efficient ones.\n' # noqa: E501
+ )
return self.parsed_tweets
""" # noqa: E501
url = "https://web.archive.org/cdx/search/cdx"
- wildcard_pathname = "/*"
- if self.matchtype:
- wildcard_pathname = ""
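+        # A matchtype query scopes the URL itself, so the /* wildcard is
+        # only needed when no matchtype is set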
+ wildcard_pathname = "" if self.matchtype else "/*"
+
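+        # Request a resume key only when a limit is set, i.e. when the
+        # results may be truncated and pagination applies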
+ show_resume_key = bool(self.limit)
params = {
"url": f"https://twitter.com/{self.username}/status{wildcard_pathname}",
- "showResumeKey": "true",
+        # Serialize as lowercase "true"/"false", the form the CDX API documents
+        "showResumeKey": str(show_resume_key).lower(),
"output": "json",
}
try:
response = get_response(url=url, params=params)
- return response.json()
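+        # Return the CDX payload along with request metadata for the parser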
+ return response.json(), {"show_resume_key": show_resume_key}
except ReadTimeoutError:
if config.verbose:
rprint("[red]Connection to web.archive.org timed out.")
html += (
'<meta name="viewport" content="width=device-width, initial-scale=1.0">\n'
)
- html += f"<title>@{self.username}'s archived tweets</title>\n"
+ html += f"<title>Wayback Tweets from @{self.username}</title>\n"
# Adds styling
html += "<style>\n"
html += ".content { color: #000000; }\n"
html += ".source { font-size: 12px; text-align: center; }\n"
html += ".tweet a:hover { text-decoration: underline; }\n"
- html += "h1, h3 { text-align: center; }\n"
+ html += "h1, h3, .note { text-align: center; }\n"
html += "iframe { width: 600px; height: 600px; }\n"
html += "input { position: absolute; opacity: 0; z-index: -1; }\n"
html += ".accordion { margin: 10px; border-radius: 5px; overflow: hidden; box-shadow: 0 4px 4px -2px rgba(0, 0, 0, 0.4); }\n"
html += "</head>\n<body>\n"
- html += f"<h1>@{self.username}'s archived tweets</h1>\n"
+ html += f"<h1>Archived tweets of @{self.username}</h1>\n"
+ html += (
+ '<p class="note">The iframes (accordions) are best viewed in Firefox.</p>\n'
+ )
html += (
'<p id="loading_first_page">Building pagination with JavaScript...</p>\n'
"Parsed Tweet": tweet.get("parsed_tweet_url"),
}
- for key, value in iframe_src.items():
+                for key, value in iframe_src.items():
+                    # Skip fields without a value for this capture
+                    if value is None:
+                        continue
key_cleaned = key.replace(" ", "_")
html += '<div class="accordion">\n'
return False
-def timestamp_parser(timestamp):
+def timestamp_parser(timestamp: str) -> Optional[str]:
"""
Parses a timestamp into a formatted string.
timestamp (str): The timestamp string to parse.
Returns:
- The parsed timestamp in the format "%Y/%m/%d %H:%M:%S", or None if the
- timestamp could not be parsed.
- """
+        The parsed timestamp in the format "%Y/%m/%d %H:%M:%S" (e.g.
+        "20150101" becomes "2015/01/01 00:00:00"), or None if the timestamp
+        could not be parsed.
+    """
formats = [
"%Y",
"%Y%m",
for fmt in formats:
try:
+ if not timestamp:
+ return None
parsed_time = datetime.strptime(timestamp, fmt)
formatted_time = parsed_time.strftime("%Y/%m/%d %H:%M:%S")
return formatted_time
except ValueError:
            continue
return None