From: Claromes Date: Sat, 4 Nov 2023 12:35:05 +0000 (-0300) Subject: update readme and var saved_at X-Git-Url: https://git.claromes.com/?a=commitdiff_plain;h=482b805fe6891d61d2c36efdf1a6d942a8275b0e;p=waybacktweets.git update readme and var saved_at --- diff --git a/.gitignore b/.gitignore index eba74f4..0cafc1c 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1 @@ -venv/ \ No newline at end of file +.venv/ \ No newline at end of file diff --git a/README.md b/README.md index 5002568..3d2432a 100644 --- a/README.md +++ b/README.md @@ -1,26 +1,17 @@ -> [!IMPORTANT] -> If the application is down, please check the [Streamlit Cloud Status](https://www.streamlitstatus.com/). - -
- # 🏛️ Wayback Tweets [![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://waybacktweets.streamlit.app) [![GitHub release (latest by date including pre-releases)](https://img.shields.io/github/v/release/claromes/waybacktweets?include_prereleases)](https://github.com/claromes/waybacktweets/releases) [![License](https://img.shields.io/github/license/claromes/waybacktweets)](https://github.com/claromes/waybacktweets/blob/main/LICENSE.md) -Tool that displays multiple archived tweets on Wayback Machine to avoid opening each link manually. Via [Wayback CDX Server API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server). - -

- -

+Tool that displays multiple archived tweets on the Wayback Machine, via the [Wayback CDX Server API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server), so you don't have to open each link manually.

The app is a prototype written in Python with Streamlit and hosted on Streamlit Cloud.

*Thanks to Tristan Lee for the idea.*

## Features

- Tweets per page defined by user
-- Filtering by saved date
-- Filtering by deleted tweets
+- Filter by year range
+- Filter to show only deleted tweets

## Development
@@ -43,13 +34,35 @@ Streamlit will be served at http://localhost:8501

## Bugs

- [ ] "web.archive.org took too long to respond."
+- [ ] Pagination: set session variable on first click
+- [ ] Timeout error
- [x] `only_deleted` checkbox selected for handles without deleted tweets
-- [x] Pagination: set session variable on first click
- [x] Pagination: scroll to top
- [x] `IndexError`
-- [ ] Timeout error

## Docs

- [Roadmap](docs/ROADMAP.md)
- [Changelog](docs/CHANGELOG.md)
+
+## Testimonials
+
+>"Original way to find deleted tweets." — [Henk Van Ess](https://twitter.com/henkvaness/status/1693298101765701676)
+
+>"This is an excellent tool to use now that most Twitter API-based tools have gone down with changes to the pricing structure over at X." — [The OSINT Newsletter - Issue #22](https://osintnewsletter.com/p/22#%C2%A7osint-community)
+
+>"One of the keys to using the Wayback Machine effectively is knowing what it can and can’t archive. It can, and has, archived many, many Twitter accounts... Utilize fun tools such as Wayback Tweets to do so more effectively." — [Ari Ben Am](https://memeticwarfareweekly.substack.com/p/mww-paradise-by-the-telegram-dashboard)
+
+>"Want to see archived tweets on Wayback Machine in bulk? You can use Wayback Tweets." — [Daily OSINT](https://twitter.com/DailyOsint/status/1695065018662855102)
+
+>"Untuk mempermudah penelusuran arsip, gunakan Wayback Tweets." ("To make searching the archives easier, use Wayback Tweets.") — [GIJN Indonesia](https://twitter.com/gijnIndonesia/status/1685912219408805888)
+
+>"A tool to quickly view tweets saved on archive.org." — [Irina_Tech_Tips Newsletter #3](https://irinatechtips.substack.com/p/irina_tech_tips-newsletter-3-2023#%C2%A7wayback-tweets)
+
+
+## Contributing
+
+PRs are welcome. Please check the Bugs list above, the [roadmap](docs/ROADMAP.md), or propose a new feature.
+
+> [!NOTE]
+> If the application is down, please check the [Streamlit Cloud Status](https://www.streamlitstatus.com/).
\ No newline at end of file
diff --git a/app.py b/app.py
index 2347fd6..a284774 100644
--- a/app.py
+++ b/app.py
@@ -18,13 +18,13 @@ st.set_page_config(
[![GitHub release (latest by date including pre-releases)](https://img.shields.io/github/v/release/claromes/waybacktweets?include_prereleases)](https://github.com/claromes/waybacktweets/releases)
[![License](https://img.shields.io/github/license/claromes/waybacktweets)](https://github.com/claromes/waybacktweets/blob/main/LICENSE.md)

- Tool that displays multiple archived tweets on Wayback Machine to avoid opening each link manually. Via Wayback CDX Server API.
+ Tool that displays multiple archived tweets on the Wayback Machine, via the Wayback CDX Server API, so you don't have to open each link manually.

- Tweets per page defined by user
- - Filtering by saved date
- - Filtering by deleted tweets
+ - Filter by year range
+ - Filter to show only deleted tweets

- This tool is experimental, please feel free to send your [feedbacks](https://github.com/claromes/waybacktweets/issues).
+ This tool is a prototype; please feel free to send [feedback](https://github.com/claromes/waybacktweets/issues).

Created and maintained by [@claromes](https://github.com/claromes). 
------- ''', @@ -42,6 +42,9 @@ hide_streamlit_style = ''' background-color: #dddddd; border-radius: 0.5rem; } + div[data-testid="InputInstructions"] { + visibility: hidden; + } ''' @@ -71,8 +74,8 @@ if 'update_component' not in st.session_state: if 'offset' not in st.session_state: st.session_state.offset = 0 -if 'date_created' not in st.session_state: - st.session_state.date_created = (2006, year) +if 'saved_at' not in st.session_state: + st.session_state.saved_at = (2006, year) if 'count' not in st.session_state: st.session_state.count = False @@ -134,8 +137,8 @@ def embed(tweet): st.error('Connection to publish.twitter.com timed out.') @st.cache_data(ttl=1800, show_spinner=False) -def tweets_count(handle, date_created): - url = f'https://web.archive.org/cdx/search/cdx?url=https://twitter.com/{handle}/status/*&output=json&from={date_created[0]}&to={date_created[1]}' +def tweets_count(handle, saved_at): + url = f'https://web.archive.org/cdx/search/cdx?url=https://twitter.com/{handle}/status/*&output=json&from={saved_at[0]}&to={saved_at[1]}' try: response = requests.get(url) @@ -148,14 +151,15 @@ def tweets_count(handle, date_created): return 0 except requests.exceptions.Timeout: st.error('Connection to web.archive.org timed out.') + st.stop() @st.cache_data(ttl=1800, show_spinner=False) -def query_api(handle, limit, offset, date_created): +def query_api(handle, limit, offset, saved_at): if not handle: st.warning('username, please!') st.stop() - url = f'https://web.archive.org/cdx/search/cdx?url=https://twitter.com/{handle}/status/*&output=json&limit={limit}&offset={offset}&from={date_created[0]}&to={date_created[1]}' + url = f'https://web.archive.org/cdx/search/cdx?url=https://twitter.com/{handle}/status/*&output=json&limit={limit}&offset={offset}&from={saved_at[0]}&to={saved_at[1]}' try: response = requests.get(url) response.raise_for_status() @@ -185,7 +189,7 @@ def parse_links(links): return parsed_links, tweet_links, parsed_mimetype, timestamp def attr(i): - 
st.markdown(f'{i+1 + st.session_state.offset}. **Wayback Machine:** [link]({link}) · **MIME Type:** {mimetype[i]} · **Saved at:** {datetime.datetime.strptime(timestamp[i], "%Y%m%d%H%M%S")} · **Tweet:** [link]({tweet_links[i]})') + st.markdown(f'{i+1 + st.session_state.offset}. [**web.archive.org**]({link}) · **MIME Type:** {mimetype[i]} · **Saved at:** {datetime.datetime.strptime(timestamp[i], "%Y%m%d%H%M%S")} · [**tweet**]({tweet_links[i]})') # UI st.title('Wayback Tweets [![Star](https://img.shields.io/github/stars/claromes/waybacktweets?style=social)](https://github.com/claromes/waybacktweets)', anchor=False) @@ -193,16 +197,14 @@ st.write('Display multiple archived tweets on Wayback Machine and avoid opening handle = st.text_input('Username', placeholder='jack') -st.session_state.date_created = st.slider('Tweets saved between', 2006, year, (2006, year)) +st.session_state.saved_at = st.slider('Tweets saved between', 2006, year, (2006, year)) -tweets_per_page = st.slider('Tweets per page', 25, 1000, 25, 25) +tweets_per_page = st.slider('Tweets per page', 25, 250, 25, 25) only_deleted = st.checkbox('Only deleted tweets') query = st.button('Query', type='primary', use_container_width=True) -bar = st.empty() - if query or st.session_state.count: if handle != st.session_state.current_handle: st.session_state.offset = 0 @@ -210,17 +212,17 @@ if query or st.session_state.count: if query != st.session_state.current_query: st.session_state.offset = 0 - st.session_state.count = tweets_count(handle, st.session_state.date_created) + st.session_state.count = tweets_count(handle, st.session_state.saved_at) st.write(f'**{st.session_state.count} URLs have been captured**') - if tweets_per_page > st.session_state.count: - tweets_per_page = st.session_state.count + if st.session_state.count: + if tweets_per_page > st.session_state.count: + tweets_per_page = st.session_state.count try: - bar.progress(0) progress = st.empty() - links = query_api(handle, tweets_per_page, 
st.session_state.offset, st.session_state.date_created) + links = query_api(handle, tweets_per_page, st.session_state.offset, st.session_state.saved_at) parse = parse_links(links) parsed_links = parse[0] @@ -290,56 +292,54 @@ if query or st.session_state.count: start_index = st.session_state.offset end_index = min(st.session_state.count, start_index + tweets_per_page) - for i in range(tweets_per_page): - try: - bar.progress((i*3) + 13) + with st.spinner('Fetching...'): + for i in range(tweets_per_page): + try: + link = parsed_links[i] + tweet = embed(tweet_links[i]) - link = parsed_links[i] - tweet = embed(tweet_links[i]) + if not only_deleted: + attr(i) - if not only_deleted: - attr(i) + if tweet: + status_code = tweet[0] + tweet_content = tweet[1] + user_info = tweet[2] + is_RT = tweet[3] - if tweet: - status_code = tweet[0] - tweet_content = tweet[1] - user_info = tweet[2] - is_RT = tweet[3] + if mimetype[i] == 'application/json': + display_tweet() - if mimetype[i] == 'application/json': - display_tweet() + if mimetype[i] == 'text/html': + display_tweet() + elif not tweet: + display_not_tweet() - if mimetype[i] == 'text/html': - display_tweet() - elif not tweet: - display_not_tweet() + if only_deleted: + if not tweet: + return_none_count += 1 + attr(i) - if only_deleted: - if not tweet: - return_none_count += 1 - attr(i) + display_not_tweet() - display_not_tweet() + progress.write(f'{return_none_count} URLs have been captured in the range {start_index}-{end_index}') - progress.write(f'{return_none_count} URLs have been captured in the range {start_index}-{end_index}') + if start_index <= 0: + st.session_state.prev_disabled = True + else: + st.session_state.prev_disabled = False - if start_index <= 0: - st.session_state.prev_disabled = True - else: - st.session_state.prev_disabled = False + if i + 1 == st.session_state.count: + st.session_state.next_disabled = True + else: + st.session_state.next_disabled = False + except IndexError: + if start_index <= 0: + 
st.session_state.prev_disabled = True + else: + st.session_state.prev_disabled = False - if i + 1 == st.session_state.count: st.session_state.next_disabled = True - else: - st.session_state.next_disabled = False - # TODO - except IndexError: - if start_index <= 0: - st.session_state.prev_disabled = True - else: - st.session_state.prev_disabled = False - - st.session_state.next_disabled = True prev, _ , next = st.columns([3, 4, 3]) @@ -350,7 +350,7 @@ if query or st.session_state.count: st.error('Unable to query the Wayback Machine API.') except TypeError as e: st.error(f''' - {f}. Refresh this page and try again. + {e}. Refresh this page and try again. If the problem persists [open an issue](https://github.com/claromes/waybacktweets/issues). ''') diff --git a/assets/wbt-0.2.gif b/assets/wbt-0.2.gif deleted file mode 100644 index 1cf478c..0000000 Binary files a/assets/wbt-0.2.gif and /dev/null differ diff --git a/docs/ROADMAP.md b/docs/ROADMAP.md index e6ae500..04a30e2 100644 --- a/docs/ROADMAP.md +++ b/docs/ROADMAP.md @@ -10,9 +10,8 @@ - [ ] Prevent duplicate URLs - [x] Range size defined by user - [ ] `parse_links` exception -- [ ] Add current page to page title - [ ] Parse MIME type `warc/revisit` - [ ] Parse MIME type `text/plain` - [x] Filter by period/datetime - [ ] Apply filters by API endpoints -- [ ] Add contributing guidelines \ No newline at end of file +- [x] Add contributing guidelines \ No newline at end of file
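
For reviewers: the `date_created` → `saved_at` rename touches the CDX query string in both `tweets_count` and `query_api`, which format the same URL inline. A minimal standalone sketch of that construction (the helper name `build_cdx_url` is illustrative only and does not exist in the app):

```python
def build_cdx_url(handle, saved_at, limit=None, offset=None):
    """Build a Wayback CDX Server query URL for a handle's tweet statuses.

    saved_at is the renamed (from_year, to_year) tuple,
    e.g. st.session_state.saved_at == (2006, 2023).
    """
    url = (
        'https://web.archive.org/cdx/search/cdx'
        f'?url=https://twitter.com/{handle}/status/*'
        f'&output=json&from={saved_at[0]}&to={saved_at[1]}'
    )
    if limit is not None:
        # query_api additionally pages through results with limit/offset
        url += f'&limit={limit}&offset={offset or 0}'
    return url
```

Extracting the format string this way is one option for keeping the two call sites in sync; the committed code keeps the f-strings inline.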