[ruby-core:111263] [Ruby master Bug#19196] The string saved to Tempfile from URI.open escapes "&" characters

Issue #19196 has been reported by westoque (William Estoque).

----------------------------------------
Bug #19196: The string saved to Tempfile from URI.open escapes "&" characters
https://bugs.ruby-lang.org/issues/19196

* Author: westoque (William Estoque)
* Status: Open
* Priority: Normal
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
When I am reading the string response from a URI.open, the response is not equivalent to the response body. URI.open escapes the response body string.

How to reproduce:

```
url = "https://www.podcastone.com/podcast?categoryID2=1237"
handle = URI.open(url)
=> #<Tempfile:/path/to/tempfile>
puts handle.read
....
https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&#38;awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&#38;adwNewID3=true&#38;awNetwork=309...
```

The string in the browser says "https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&adwNewID3=true&awNetwork=309"

Notice the characters "&#38;"

My initial research is that it's because the Tempfile that gets created is in ascii-8bit, in which the ampersand is a "38". We should create a way to force the encoding of the Tempfile to UTF8 so that this character is not escaped.

-- 
https://bugs.ruby-lang.org/

Issue #19196 has been updated by ufuk (Ufuk Kayserilioglu).

The content you are reading is XML, and the `&#38;` characters are there because of [XML-escaping](https://www.liquid-technologies.com/XML/EscapingData.aspx). They are not related to any kind of file encoding, `ASCII-8BIT` or `UTF-8`.

Moreover, they are there in the response from the server, which you can see by looking at the output of `curl` for the same resource:

```shell
$ curl -s "https://www.podcastone.com/podcast?categoryID2=1237" | grep "aw.noxsolutions.com/launchpod/adswizz/1237/762-"
...
<enclosure length="74614442" type="audio/mpeg" url="https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&#38;awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&#38;adwNewID3=true&#38;awNetwork=309"></enclosure>
...
```

So, this is not a Ruby problem at all. On the contrary, Ruby can help you unescape these characters:

```ruby
require "cgi"
CGI.unescapeHTML("foo&#38;bar") # => "foo&bar"
```

----------------------------------------
Bug #19196: The string saved to Tempfile from URI.open escapes "&" character
https://bugs.ruby-lang.org/issues/19196#change-100578

* Author: westoque (William Estoque)
* Status: Open
* Priority: Normal
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
When I am reading the string response from a URI.open, the response is not equivalent to the response body.

How to reproduce:

```
url = "https://www.podcastone.com/podcast?categoryID2=1237"
handle = URI.open(url)
=> #<Tempfile:/path/to/tempfile>
puts handle.read
....
https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&#38;awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&#38;adwNewID3=true&#38;awNetwork=309...
```

In the browser, the actual string reads:

```
https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&adwNewID3=true&awNetwork=309
```

Notice the characters `&#38;`

My initial research is that it's because the Tempfile that gets created is in ascii-8bit, and in ascii-8bit, the ampersand is a "38". I propose that we should have a way to force the encoding of the Tempfile to UTF8 so that this character is not escaped and the string encoding is preserved.

-- 
https://bugs.ruby-lang.org/
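The point made in the reply above, that the character reference is part of the body bytes and has nothing to do with the string's encoding, can be checked with a minimal sketch. The sample string is a hypothetical, shortened stand-in for one attribute value from the feed, not the real response; `String#b`, `String#force_encoding`, and `CGI.unescapeHTML` are standard Ruby calls.

```ruby
require "cgi"

# Hypothetical sample: a shortened stand-in for an escaped URL in the feed.
# String#b returns a copy tagged ASCII-8BIT (binary), like a binary-mode read.
raw = "track.mp3?awCollectionId=1237&#38;awEpisodeId=abc".b

# Relabeling the string as UTF-8 changes only the encoding tag, not the bytes.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

same_bytes    = raw.bytes == as_utf8.bytes   # bytes are identical
still_escaped = as_utf8.include?("&#38;")    # the reference is still present

# Decoding the character reference is a separate, explicit step.
decoded = CGI.unescapeHTML(as_utf8)
```

Under these assumptions, forcing UTF-8 leaves `&#38;` in place, and only the explicit unescape step turns it back into `&`.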

Issue #19196 has been updated by westoque (William Estoque).

@ufuk thank you for that explanation. I may have jumped to conclusions when checking that response in the browser (Chrome) vs curl which unescaped the characters.

----------------------------------------
Bug #19196: The string saved to Tempfile from URI.open escapes "&" character
https://bugs.ruby-lang.org/issues/19196#change-100628

* Author: westoque (William Estoque)
* Status: Rejected
* Priority: Normal
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------

-- 
https://bugs.ruby-lang.org/
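One way to see why a rendered view and the raw body can differ is to run the markup through an XML parser, which decodes character references when it hands back an attribute value. The following is only a sketch: the `<enclosure>` element is a hypothetical, shortened stand-in for the one in the feed (the `example.com` URL is made up), and it assumes the `rexml` gem is available, as it is in a typical Ruby installation.

```ruby
require "rexml/document"

# Hypothetical, shortened stand-in for the <enclosure> element in the feed.
xml = '<enclosure type="audio/mpeg" url="https://example.com/ep.mp3?a=1&#38;b=2"></enclosure>'

# The raw markup contains the escaped form...
raw_has_entity = xml.include?("&#38;")

# ...while the parsed attribute value has the reference decoded to "&".
url = REXML::Document.new(xml).root.attributes["url"]
```

Parsing the feed as XML, rather than unescaping substrings by hand, may be the more robust choice when the goal is to extract the enclosure URLs.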
participants (2)
- ufuk (Ufuk Kayserilioglu)
- westoque (William Estoque)