diff --git a/frontend/docs/docs/user-guide/browser-profiles/browser-profiles-overview.md b/frontend/docs/docs/user-guide/browser-profiles/browser-profiles-overview.md
index b7a459589d..6092a1fbd7 100644
--- a/frontend/docs/docs/user-guide/browser-profiles/browser-profiles-overview.md
+++ b/frontend/docs/docs/user-guide/browser-profiles/browser-profiles-overview.md
@@ -8,22 +8,21 @@ Browser profiles are saved instances of a web browsing session that can be used
Pre-configure a social media site to be logged in so that the crawler can access content that can only be viewed by logged-in users.
-!!! tip "Best Practices: Use an account created specifically for archiving a website"
-
- Websites may require user registration to view content at URLs that are otherwise public. This practice is sometimes referred to as a login wall. Login walls are commonly used by social media and publishing platforms.
-
- We recommend creating dedicated accounts when archiving any website that requires a user account. Although dedicated accounts are not required to benefit from browser profiles, they can address the following potential issues:
+### Hide Popup Prompts
- - While usernames and passwords are never saved by Browsertrix, the private tokens that enable access to logged in content _are_ stored. Thus, anyone with access to your Browsertrix account, intentional or malicious, may be able to access the logged in content.
+Websites may prompt users before displaying the rest of a page for a number of reasons, such as age verification, informed consent requirements, or geographical location. Configure a browser profile to accept, dismiss, or otherwise hide these dialogs so that the content behind them is visible to the crawler.
- - Some websites may rate limit or lock accounts for reasons they deem to be suspicious, such as logging in from a new geographical location or if the site determines crawls to be robot activity.
+## Best Practices
- - Personalized data such as cookies, location, etc. may be included in the resulting crawl.
+### Use logins dedicated to web archiving
- - The logged in interface may display unwanted personally identifiable information such as a username or profile picture.
+Websites may require user registration to view content at URLs that are otherwise public. This practice is sometimes referred to as a login wall. Login walls are commonly used by social media and publishing platforms.
- An exception to this practice is if your goal is to archive personalized or private content accessible only from designated accounts. In these instances we recommend changing the account's password after crawling is complete.
+We strongly recommend against using your personal accounts when logging into websites during the profile creation process. Instead, sign up for a new account dedicated to archiving and use it in your browser profile. Although dedicated accounts are not necessary to benefit from browser profiles, they can address the following potential issues:
-### Hide Popup Prompts
+- While usernames and passwords are never saved by Browsertrix, the private tokens that enable access to logged-in content _are_ stored in WACZ files. Thus, anyone with access to your Browsertrix account or WACZ files, whether well-intentioned or malicious, may be able to view and use the tokens to log in to your account.
+- Some websites may rate limit or lock accounts for activity they deem suspicious, such as logging in from a new geographical location or crawling that the site flags as robot activity.
+- Personalized data such as cookies and location may be included in the resulting crawl.
+- The logged-in interface may display unwanted personally identifiable information, such as a username or profile picture.
-Websites may prompt users for a number of reasons before displaying the rest of the page, such as for age verification, informed consent requirements, or geographical location. Configure a browser profile to accept, dismiss, or otherwise hide these dialogs so that the content behind them is visible to the crawler.
+An exception to this practice is if your goal is to archive personalized or private content accessible only from designated accounts. In these instances we recommend changing the account's password or logging out of all active sessions after crawling is complete.
diff --git a/frontend/docs/docs/user-guide/workflow-setup.md b/frontend/docs/docs/user-guide/workflow-setup.md
index 14c2fcc967..bc9d86e655 100644
--- a/frontend/docs/docs/user-guide/workflow-setup.md
+++ b/frontend/docs/docs/user-guide/workflow-setup.md
@@ -287,12 +287,24 @@ Configure the browser used to visit URLs during the crawl.
Sets the [_Browser Profile_](browser-profiles/browser-profiles-overview.md) to be used for this crawl.
+!!! Tip "Best Practices: Use login profiles dedicated to crawling"
+ We strongly recommend against using your personal accounts when logging into websites during the profile creation process. Crawling with a browser profile tied to a personal account may expose you to risks such as compromised private tokens and unwanted sharing of user preferences. Although accounts dedicated to crawling are not necessary to benefit from browser profiles, they can address these and other potential issues. [Continue reading about dedicated accounts](browser-profiles/browser-profiles-overview.md#use-logins-dedicated-to-web-archiving)
+
### Fail Crawl if Not Logged In
When enabled, the crawl will fail if a [page behavior](#page-behavior) detects the presence or absence of content on supported pages indicating that the browser is not logged in.
For details about which websites are supported and how to add this functionality to your own [custom behaviors](#use-custom-behaviors), see the [Browsertrix Crawler documentation for Fail on Content Check](https://crawler.docs.browsertrix.com/user-guide/behaviors/#fail-on-content-check).
+### Include Browser Storage Data
+
+When enabled, instructs the crawler to save the browser's `localStorage` and `sessionStorage` data for each page in the web archive as part of the `WARC-JSON-Metadata` field. Enabling this option is recommended for properly archiving and replaying certain websites, provided you have reviewed the privacy and security implications.
+
+!!! Warning "Privacy & security implications when used with browser profiles"
+ Websites can use browser storage to store arbitrary data. During the browser profile creation process, some websites may save sensitive data such as login information and user-identifying preferences in browser storage. Since every website can implement browser storage differently, Browsertrix does not attempt to detect whether the information stored is potentially sensitive.
+
+ Use caution when sharing WACZ files created with this option enabled, especially if you’re crawling pages that require login. To mitigate the risk of compromised login information, we always recommend creating dedicated website logins used only for crawling.
+
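As an illustrative aside (not part of the Browsertrix docs or codebase), the sketch below shows the kind of key/value data websites keep in browser storage, which this option captures per page into the `WARC-JSON-Metadata` field. A minimal in-memory class stands in for the browser's `Storage` API so the example runs outside a browser; the key names are hypothetical.

```typescript
// Minimal in-memory stand-in for the browser's Storage API, so this
// sketch runs outside a browser. Key names below are hypothetical.
class MemoryStorage {
  private data = new Map<string, string>();
  setItem(key: string, value: string): void {
    this.data.set(key, String(value));
  }
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
}

const localStorageLike = new MemoryStorage();

// A site might persist a login token here, which is sensitive if the
// resulting WACZ file is shared:
localStorageLike.setItem("session_token", "opaque-token-value");
// ...or only a harmless UI preference:
localStorageLike.setItem("theme", "dark");
```

Since every site decides for itself what goes into storage, the crawler cannot tell the sensitive entry from the harmless one, which is exactly why the warning above recommends dedicated crawling accounts.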
### Crawler Proxy Server
!!! Info "This setting will be shown if the organization supports multiple proxies."
@@ -320,10 +332,6 @@ Sets the release channel of [Browsertrix Crawler](https://github.com/webrecorder
Will prevent any content from the domains listed in [Steven Black's Unified Hosts file](https://github.com/StevenBlack/hosts) (ads & malware) from being captured by the crawler.
-### Save Local and Session Storage
-
-When enabled, instructs the crawler to save the browser's `localStorage` and `sessionStorage` data for each page in the web archive as part of the `WARC-JSON-Metadata` field. This option may be necessary to properly archive and replay certain websites. Use caution when sharing WACZ files created with this option enabled, as the saved browser storage may contain sensitive information.
-
### User Agent
Sets the browser's [user agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) in outgoing requests to the specified value. If left blank, the crawler will use the Brave browser's default user agent. For a list of common user agents see [useragents.me](https://www.useragents.me/).
diff --git a/frontend/src/components/ui/config-details.ts b/frontend/src/components/ui/config-details.ts
index 0a9e97ac77..1cd856cc3e 100644
--- a/frontend/src/components/ui/config-details.ts
+++ b/frontend/src/components/ui/config-details.ts
@@ -292,7 +292,7 @@ export class ConfigDetails extends BtrixElement {
seedsConfig?.blockAds,
)}
${this.renderSetting(
- msg("Save Local and Session Storage"),
+ labelFor["saveStorage"],
seedsConfig?.saveStorage,
)}
${this.renderSetting(
diff --git a/frontend/src/features/crawl-workflows/workflow-editor.ts b/frontend/src/features/crawl-workflows/workflow-editor.ts
index 25468e40f3..32723ccef9 100644
--- a/frontend/src/features/crawl-workflows/workflow-editor.ts
+++ b/frontend/src/features/crawl-workflows/workflow-editor.ts
@@ -2061,11 +2061,21 @@ https://archiveweb.page/images/${"logo.svg"}`}
this.updateFormState({
browserProfile: profile ?? null,
proxyId: profile?.proxyId ?? null,
+ saveStorage:
+ profile && this.formState.saveStorage === undefined
+ ? true
+ : this.formState.saveStorage,
});
}}
>
`)}
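The ternary in the `updateFormState` call above encodes a small default rule. As a hedged sketch (the helper name is mine, not from the codebase), the same rule pulled into a standalone function:

```typescript
// Sketch of the default rule above (helper name is hypothetical):
// when a browser profile is selected and the user has not yet set
// "Include browser storage data", default it to true; otherwise
// preserve whatever the user explicitly chose.
type Profile = { id: string };

function defaultSaveStorage(
  profile: Profile | null,
  current: boolean | undefined,
): boolean | undefined {
  return profile && current === undefined ? true : current;
}
```

Keeping `undefined` distinct from `false` is what lets selecting a profile turn the option on without overriding a user who has deliberately unchecked it.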
- ${this.renderHelpTextCol(infoTextFor["browserProfile"])}
+ ${this.renderHelpTextCol(html`
+ ${infoTextFor["browserProfile"]}
+ ${this.renderUserGuideLink({
+ hash: "browser-profile",
+ content: msg("More details"),
+ })}
+ `)}
${when(
this.formState.browserProfile,
() => html`
@@ -2088,6 +2098,19 @@ https://archiveweb.page/images/${"logo.svg"}`}
)}
`,
)}
+ ${inputCol(html`
+   <sl-checkbox name="saveStorage" ?checked=${this.formState.saveStorage}>
+     ${labelFor["saveStorage"]}
+   </sl-checkbox>
+ `)}
+ ${this.renderHelpTextCol(
+ html`${infoTextFor["saveStorage"]}
+ ${this.renderUserGuideLink({
+ hash: "include-browser-storage-data",
+ content: msg("More details"),
+ })}.`,
+ false,
+ )}
${proxies?.servers.length
? [
inputCol(html`
@@ -2164,19 +2187,6 @@ https://archiveweb.page/images/${"logo.svg"}`}
`)}
${this.renderHelpTextCol(infoTextFor["blockAds"], false)}
- ${inputCol(html`
-   <sl-checkbox name="saveStorage" ?checked=${this.formState.saveStorage}>
-     ${msg("Save local and session storage")}
-   </sl-checkbox>
- `)}
- ${this.renderHelpTextCol(
- html`${infoTextFor["saveStorage"]}
- ${this.renderUserGuideLink({
- hash: "save-local-and-session-storage",
- content: msg("Implications for shared archives"),
- })}.`,
- false,
- )}
${inputCol(html`
+ ${msg(
+ "For websites that require login, we always recommend using a profile that's logged-in to an account created specifically for crawling.",
+ )} `,
crawlerChannel: msg(
`Choose a Browsertrix Crawler release channel. If available, other versions may provide new or experimental crawling features.`,
),
@@ -79,12 +81,19 @@ export const infoTextFor = {
customBehavior: msg(
`Enable custom page actions with behavior scripts. You can specify any publicly accessible URL or public Git repository.`,
),
- failOnContentCheck: msg(
- `Fail the crawl if a page behavior detects the browser is not logged in on supported pages.`,
- ),
- saveStorage: msg(
- `Include data from the browser's local and session storage in the web archive.`,
- ),
+ failOnContentCheck: html`${msg(
+ "Fail the crawl if a page behavior detects the browser is not logged in on supported pages.",
+ )}
+ ${msg("Note that websites may log profiles out after a period of time.")}`,
+ saveStorage: html`${msg(
+ "During a crawl, websites may store data in the browser itself, e.g. to persist logins.",
+ )}
+ ${msg(
+ "Checking this will include data from the browser’s local and session storage in the archive.",
+ )}
+ ${msg(
+ "This can improve replay quality, but may come with security implications.",
+ )}`,
useRobots: msg(
`Check for a robots.txt file for each host and skip any disallowed pages.`,
),
diff --git a/frontend/src/strings/crawl-workflows/labels.ts b/frontend/src/strings/crawl-workflows/labels.ts
index 50ca071464..2fca97e7e1 100644
--- a/frontend/src/strings/crawl-workflows/labels.ts
+++ b/frontend/src/strings/crawl-workflows/labels.ts
@@ -13,4 +13,5 @@ export const labelFor = {
selectLinks: msg("Link Selectors"),
clickSelector: msg("Click Selector"),
dedupeType: msg("Crawl Deduplication"),
+ saveStorage: msg("Include Browser Storage Data"),
} as const satisfies Partial>;