### Using Wayback Tweets as a Web App
-[Access the application](https://waybacktweets.streamlit.app), a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
+[Open the application](https://waybacktweets.streamlit.app), a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
## Documentation
import datetime
-import requests
import streamlit as st
import streamlit.components.v1 as components
"About": f"""
[](https://github.com/claromes/waybacktweets/releases) [](https://github.com/claromes/waybacktweets/blob/main/LICENSE.md) [](https://github.com/claromes/waybacktweets)
- Aplication that displays multiple archived tweets on Wayback Machine to avoid opening each link manually.
+ Application that displays multiple archived tweets on Wayback Machine to avoid opening each link manually.
The application is a prototype hosted on Streamlit Cloud, allowing users to apply filters and view tweets that lack the original URL. [Read more](https://claromes.github.io/waybacktweets/streamlit.html).
- © Copyright 2023 - {datetime.datetime.now().year}, [Claromes](https://claromes.com) · Icon by The Doodle Library
+ © 2023 - {datetime.datetime.now().year}, [Claromes](https://claromes.com) · Icon by The Doodle Library · Title font by Google, licensed under the Open Font License
---
""", # noqa: E501
def tweets_count(username, archived_timestamp_filter):
    url = f"https://web.archive.org/cdx/search/cdx?url=https://twitter.com/{username}/status/*&output=json&from={archived_timestamp_filter[0]}&to={archived_timestamp_filter[1]}"  # noqa: E501
-    try:
-        response = get_response(url=url)
-
-        if response.status_code == 200:
-            data = response.json()
-            if data and len(data) > 1:
-                total_tweets = len(data) - 1
-                return total_tweets
-            else:
-                return 0
-    except requests.exceptions.ReadTimeout:
-        st.error("Connection to web.archive.org timed out.")
+    response, error, error_type = get_response(url=url)
+
+    if response and response.status_code == 200:
+        data = response.json()
+        if data and len(data) > 1:
+            total_tweets = len(data) - 1
+            return total_tweets
+        else:
+            return 0
+    elif error and error_type == "ReadTimeout":
+        st.error("Connection to web.archive.org timed out.")
        st.stop()
-    except requests.exceptions.ConnectionError:
+    elif error and error_type == "ConnectionError":
        st.error("Failed to establish a new connection with web.archive.org.")
        st.stop()
-    except Exception as e:
-        st.error(f"{e}")
+    elif error and error_type:
+        st.error(f"{error}")
        st.stop()
project = "Wayback Tweets"
release, version = get_version("waybacktweets")
-copyright = f"2023 - {datetime.datetime.now().year}, Claromes · Icon by The Doodle Library · Title Font by Google, licensed under the Open Font License · Wayback Tweets v{version}" # noqa: E501
+copyright = f"2023 - {datetime.datetime.now().year}, Claromes · Icon by The Doodle Library · Title font by Google, licensed under the Open Font License · Wayback Tweets v{version}" # noqa: E501
author = "Claromes"
# -- General configuration ---------------------------------------------------
+++ /dev/null
-Errors
-================
-
-These are the most common errors and are handled by the ``waybacktweets`` package.
-
-ReadTimeout
-----------------
-
-This error occurs when a request to the web.archive.org server takes too long to respond. The server could be overloaded or there could be network issues.
-
-The output message from the package would be: ``Connection to web.archive.org timed out.``
-
-ConnectionError
-----------------
-
-This error is raised when the package fails to establish a new connection with web.archive.org. This could be due to network issues or the server being down.
-
-The output message from the package would be: ``Failed to establish a new connection with web.archive.org. Max retries exceeded.``
-
-
-This is the error often returned when performing experimental parsing of URLs with the mimetype ``application/json``.
-
-The warning output message from the package would be: ``Connection error with https://web.archive.org/web/<TIMESTAMP>/https://twitter.com/<USERNAME>/status/<TWEET_ID>. Max retries exceeded. Error parsing the JSON, but the CDX data was saved.``
-
-HTTPError
-----------------
-
-This error occurs when the Internet Archive services are temporarily offline. This could be due to maintenance or server issues.
-
-The output message from the package would be: ``Temporarily Offline: Internet Archive services are temporarily offline. Please check Internet Archive Twitter feed (https://twitter.com/internetarchive) for the latest information.``
-
-
--- /dev/null
+Exceptions
+================
+
+These are the most common errors and are handled by the ``waybacktweets`` package.
+
+ReadTimeout
+----------------
+
+This error occurs when a request to the web.archive.org server takes too long to respond. The server could be overloaded or there could be network issues.
+
+The output message from the package would be: ``Connection to web.archive.org timed out.``
+
+ConnectionError
+----------------
+
+This error is raised when the package fails to establish a new connection with web.archive.org. This could be due to network issues or the server being down.
+
+The output message from the package would be: ``Failed to establish a new connection with web.archive.org. Max retries exceeded.``
+
+
+This is the error often returned when performing experimental parsing of URLs with the mimetype ``application/json``.
+
+The warning output message from the package would be: ``Connection error with https://web.archive.org/web/<TIMESTAMP>/https://twitter.com/<USERNAME>/status/<TWEET_ID>. Max retries exceeded. Error parsing the JSON, but the CDX data was saved.``
+
+HTTPError
+----------------
+
+This error occurs when the Internet Archive services are temporarily offline. This could be due to maintenance or server issues.
+
+The output message from the package would be: ``Temporarily Offline: Internet Archive services are temporarily offline. Please check Internet Archive Twitter feed (https://twitter.com/internetarchive) for the latest information.``
+
+
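With the tuple-based ``get_response`` helper described above, callers branch on the returned error type instead of wrapping every request in ``try``/``except``. A minimal sketch of that pattern — the helper below is a simplified stand-in written for illustration, not the packaged implementation:

```python
import requests


def get_response(url, params=None):
    # Simplified stand-in: return (response, error_message, error_type)
    # instead of raising, mirroring the pattern described above.
    try:
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        return response, None, None
    except requests.exceptions.RequestException as e:
        return None, str(e), type(e).__name__


# An unresolvable host triggers the ConnectionError branch without
# depending on web.archive.org being reachable.
response, error, error_type = get_response("http://nonexistent.invalid/")
if response is not None:
    print(response.status_code)
elif error_type == "ReadTimeout":
    print("Connection to web.archive.org timed out.")
elif error_type == "ConnectionError":
    print("Failed to establish a new connection with web.archive.org.")
else:
    print(error)
```

The messages printed here are the ones documented in the sections above; the exception class name (``type(e).__name__``) is what selects the branch.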
   quickstart
   workflow
   result
-   errors
+   exceptions
   contribute
   todo
.. code-block:: shell

-   waybacktweets --from 20150101 --to 20191231 --limit 250 jack`
+   waybacktweets --from 20150101 --to 20191231 --limit 250 jack
Module
Using Wayback Tweets as a Streamlit Web App
-`Access the application <https://waybacktweets.streamlit.app>`_, a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
+`Open the application <https://waybacktweets.streamlit.app>`_, a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
-Aplication that displays multiple archived tweets on Wayback Machine to avoid opening each link manually. The application is a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud, allowing users to apply filters and view tweets that lack the original URL.
+Application that displays multiple archived tweets on Wayback Machine to avoid opening each link manually. The application is a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud, allowing users to apply filters and view tweets that lack the original URL.
+`Open the application <https://waybacktweets.streamlit.app>`_.
+
Filters
----------
C--> |4xx| E[return None]
E--> F{request Archived\nTweet URL}
F--> |4xx| G[return Only CDX data]
- F--> |TODO: 2xx/3xx: application/json| J[return JSON text]
+ F--> |2xx/3xx: application/json| J[return JSON text]
F--> |2xx/3xx: text/html, warc/revisit, unk| K[return HTML iframe tag]
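The archived-tweet branch of the chart can be read as a small dispatch on status code and mimetype. A hedged sketch — `render_archived_tweet` and its parameters are illustrative names, not part of the package:

```python
def render_archived_tweet(archived_url: str, status_code: int, mimetype: str):
    """Illustrative mirror of the flowchart's archived-tweet branch."""
    if 400 <= status_code <= 511:
        return "CDX data only"  # 4xx/5xx: the snapshot is not embeddable
    if mimetype == "application/json":
        return "JSON text"  # the snapshot body is parsed as JSON
    # text/html, warc/revisit, or unknown mimetypes render as an iframe
    return f'<iframe src="{archived_url}"></iframe>'


print(render_archived_tweet("https://web.archive.org/web/0/x", 200, "text/html"))
```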
+"""
+Exports the parsed archived tweets.
+"""
+
import datetime
import os
import re
+"""
+Parses the returned data from the Wayback CDX Server API.
+"""
+
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from contextlib import nullcontext
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import unquote
-from requests import exceptions
from rich import print as rprint
from rich.progress import Progress
availability statuses, and URLs, respectively. If no tweets are available,
returns None.
"""
-        try:
-            url = f"https://publish.twitter.com/oembed?url={self.tweet_url}"
-            response = get_response(url=url)
-
-            if response:
-                json_response = response.json()
-                html = json_response["html"]
-                author_name = json_response["author_name"]
-
-                regex = re.compile(
-                    r'<blockquote class="twitter-tweet"(?: [^>]+)?><p[^>]*>(.*?)<\/p>.*?— (.*?)<\/a>',  # noqa
-                    re.DOTALL,
-                )
-                regex_author = re.compile(r"^(.*?)\s*\(")
-
-                matches_html = regex.findall(html)
-
-                tweet_content = []
-                user_info = []
-                is_RT = []
-
-                for match in matches_html:
-                    tweet_content_match = re.sub(
-                        r"<a[^>]*>|<\/a>", "", match[0].strip()
-                    ).replace("<br>", "\n")
-                    user_info_match = re.sub(
-                        r"<a[^>]*>|<\/a>", "", match[1].strip()
-                    ).replace(")", "), ")
-                    match_author = regex_author.search(user_info_match)
-                    author_tweet = match_author.group(1) if match_author else ""
-
-                    if tweet_content_match:
-                        tweet_content.append(tweet_content_match)
-                    if user_info_match:
-                        user_info.append(user_info_match)
-                    is_RT.append(author_name != author_tweet)
-
-                return tweet_content, is_RT, user_info
-        except exceptions:
+        url = f"https://publish.twitter.com/oembed?url={self.tweet_url}"
+        response, error, error_type = get_response(url=url)
+
+        if response:
+            json_response = response.json()
+            html = json_response["html"]
+            author_name = json_response["author_name"]
+
+            regex = re.compile(
+                r'<blockquote class="twitter-tweet"(?: [^>]+)?><p[^>]*>(.*?)<\/p>.*?— (.*?)<\/a>',  # noqa
+                re.DOTALL,
+            )
+            regex_author = re.compile(r"^(.*?)\s*\(")
+
+            matches_html = regex.findall(html)
+
+            tweet_content = []
+            user_info = []
+            is_RT = []
+
+            for match in matches_html:
+                tweet_content_match = re.sub(
+                    r"<a[^>]*>|<\/a>", "", match[0].strip()
+                ).replace("<br>", "\n")
+                user_info_match = re.sub(
+                    r"<a[^>]*>|<\/a>", "", match[1].strip()
+                ).replace(")", "), ")
+                match_author = regex_author.search(user_info_match)
+                author_tweet = match_author.group(1) if match_author else ""
+
+                if tweet_content_match:
+                    tweet_content.append(tweet_content_match)
+                if user_info_match:
+                    user_info.append(user_info_match)
+                is_RT.append(author_name != author_tweet)
+
+            return tweet_content, is_RT, user_info
+        elif error and error_type == "ConnectionError":
            rprint("[yellow]Error parsing the tweet, but the CDX data was saved.")
+        elif error and error_type == "HTTPError":
+            rprint(
+                f"[yellow]{self.tweet_url} not available on the user's account, but the CDX data was saved."  # noqa: E501
+            )
            return None
-        except Exception as e:
-            rprint(f"[red]{e}")
+        elif error and error_type:
+            rprint(f"[red]{error}")
            return None
:returns: The parsed tweet text.
"""
-        try:
-            response = get_response(url=self.archived_tweet_url)
+        response, error, error_type = get_response(url=self.archived_tweet_url)
-            if response:
-                json_data = response.json()
+        if response:
+            json_data = response.json()
-                if "data" in json_data:
-                    return json_data["data"].get("text", json_data["data"])
+            if "data" in json_data:
+                return json_data["data"].get("text", json_data["data"])
-                if "retweeted_status" in json_data:
-                    return json_data["retweeted_status"].get(
-                        "text", json_data["retweeted_status"]
-                    )
+            if "retweeted_status" in json_data:
+                return json_data["retweeted_status"].get(
+                    "text", json_data["retweeted_status"]
+                )
-            return json_data.get("text", json_data)
-        except exceptions.ConnectionError:
+            return json_data.get("text", json_data)
+        elif error and error_type == "ConnectionError":
            rprint(
                f"[yellow]Connection error with {self.archived_tweet_url}. Max retries exceeded. Error parsing the JSON, but the CDX data was saved."  # noqa: E501
            )
-            return ""
-        except exceptions:
-            rprint("[yellow]Error parsing the JSON, but the CDX data was saved.")
-
-            return ""
-        except Exception as e:
-            rprint(f"[red]{e}")
-            return ""
+            return None
+        elif error and error_type:
+            rprint(f"[red]{error}")
+            return None
class TweetsParser:
+"""
+Requests data from the Wayback Machine API.
+"""
+
from typing import Any, Dict, Optional
-from requests import exceptions
from rich import print as rprint
from waybacktweets.utils.utils import get_response
        if self.matchtype:
            params["matchType"] = self.matchtype
-        try:
-            response = get_response(url=url, params=params)
+        response, error, error_type = get_response(url=url, params=params)
-            if response:
-                return response.json()
-        except exceptions.ReadTimeout:
+        if response:
+            return response.json()
+        elif error and error_type == "ReadTimeout":
            rprint("[red]Connection to web.archive.org timed out.")
-        except exceptions.ConnectionError:
+        elif error and error_type == "ConnectionError":
            rprint(
                "[red]Failed to establish a new connection with web.archive.org. Max retries exceeded. Please wait a few minutes and try again."  # noqa: E501
            )
-        except exceptions.HTTPError:
+        elif error and error_type == "HTTPError":
            rprint(
                "[red]Temporarily Offline: Internet Archive services are temporarily offline. Please check Internet Archive Twitter feed (https://twitter.com/internetarchive) for the latest information."  # noqa: E501
            )
-        except Exception as e:
-            rprint(f"[red]{e}")
+        elif error and error_type:
+            rprint(f"[red]{error}")
# flake8: noqa: E501
+"""
+Generates an HTML file to visualize the parsed data.
+"""
+
import json
from typing import Any, Dict, List
from typing import Any, Optional
import click
-from requests import exceptions
from rich import print as rprint
from waybacktweets.api.export_tweets import TweetsExporter
        exporter.save_to_csv()
        exporter.save_to_json()
        exporter.save_to_html()
-    except exceptions as e:
+    except Exception as e:
        rprint(f"[red]{e}")
    finally:
        rprint(
"""
-Module containing utility functions for handling HTTP requests and manipulating URLs.
+Utility functions for handling HTTP requests and manipulating URLs.
"""
import re
-from typing import Optional
+from typing import Optional, Tuple
import requests
from requests.adapters import HTTPAdapter
def get_response(
    url: str, params: Optional[dict] = None
-) -> Optional[requests.Response]:
+) -> Tuple[Optional[requests.Response], Optional[str], Optional[str]]:
    """
-    Sends a GET request to the specified URL and returns the response.
+    Sends a GET request to the specified URL and returns the response,
+    an error message if any, and the type of exception if any.

    :param url: The URL to send the GET request to.
    :param params: The parameters to include in the GET request.

-    :returns: The response from the server,
-        if the status code is not in the 400-511 range.
-        If the status code is in the 400-511 range.
+    :returns: A tuple containing the response from the server or None,
+        an error message or None, and the type of exception or None.
    """
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.3)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
-    response = session.get(url, params=params, headers=headers)
-
-    if 400 <= response.status_code <= 511:
-        return None
-
-    return response
+    try:
+        response = session.get(url, params=params, headers=headers)
+        response.raise_for_status()
+
+        if not response.text.strip() or response.text.strip() == "[]":
+            return None, "No data was saved due to an empty response.", None
+        return response, None, None
+    except requests.exceptions.RequestException as e:
+        return None, str(e), type(e).__name__
+    except Exception as e:
+        return None, str(e), type(e).__name__
def clean_tweet_url(tweet_url: str, username: str) -> str: