Let's say, for example, that the image URLs for certain images on a website look something like this:
> www.example.com/store/productimages/details/87540_item_a.jpg
> www.example.com/store/productimages/details/48395_item_b.jpg
> www.example.com/store/productimages/details/75435_item_c.jpg
The site is constantly being updated, and while those image links can still be pulled up, there is no longer any way to find a given image without its exact link.
In other words, once the page for www.example.com/store/productimages/details/87540_item_a.jpg
is taken down and no longer indexed by Google, the only way to find it again is to have copied that particular link down somewhere and to enter it manually to pull it up.
However, since web archives exist, I'm able to locate a few of those links, though obviously only the ones that have been archived. The thing is, those links aren't broken: the images are still hosted on www.example.com itself, independently of the web archive site. (For example, I can take a six-year-old image link from the archive, paste it into the address bar, and it loads as usual.) This leads me to believe the old images are still on the website's server; the problem is that there is no publicly accessible index of them. I've tried the site's sitemap.xml, but it only reflects the current state of the site and doesn't contain any of the older image links, even though those older links follow the same pattern as the newer ones and are presumably under the same directory, mixed in with them.
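(As an aside: the Wayback Machine exposes a CDX API that can list every URL it has captured under a given prefix, which automates that manual lookup. A minimal sketch in Python, using the hypothetical path prefix from the example links above:)

```python
import requests

# The Wayback Machine's CDX API lists every URL it has captured under a
# prefix. The prefix below is the hypothetical one from the example links.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/store/productimages/details/",
        "matchType": "prefix",   # match everything under this path
        "fl": "original",        # return only the original-URL column
        "collapse": "urlkey",    # one row per unique URL
        "output": "json",
    },
    timeout=30,
)
rows = resp.json()
# The first row is the column header; the rest are archived image URLs.
for (url,) in rows[1:]:
    print(url)
```

(This still only surfaces URLs the archive happened to capture, which is exactly the limitation described above.)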
My question is: how would I go about finding all of those older image links, given that tools like HTTrack and wget only crawl what is currently linked on the site? Or am I just using them wrong? I've tried telling HTTrack and wget to crawl all .jpgs under the www.example.com/store/productimages/details/
directory, but they come up with nothing, presumably because directory listing is forbidden. Is there another way to find those old image links?
Since in your example the image names all start with numbers, you could simply try every number, one after another. That is, have your browser/wget/curl
try to GET files named 00000.jpg up through, in principle, infinity.jpg,
and the server will serve up each file that is present.
Or, if the names are not purely numeric, have your program run through all the possible combinations that match the filename patterns you already know about.
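A minimal sketch of that approach in Python, assuming the 5-digit IDs and "_item_x" suffixes from the example links (both are guesses; widen the range and suffix list to match whatever patterns you actually know about):

```python
import time
import requests

# Brute-force sketch of the "try every number" idea. Everything here is an
# assumption based on the example links: 5-digit numeric IDs combined with
# a small set of known suffixes.
BASE = "https://www.example.com/store/productimages/details/"
SUFFIXES = ["item_a", "item_b", "item_c"]  # hypothetical name patterns

for num in range(100000):               # 00000 .. 99999
    for suffix in SUFFIXES:
        url = f"{BASE}{num:05d}_{suffix}.jpg"
        # A HEAD request tests existence without downloading the image.
        r = requests.head(url, timeout=10)
        if r.status_code == 200:
            print(url)
        time.sleep(0.5)                 # be polite; don't hammer the server
```

Note that this is up to 300,000 requests as written, so expect it to take a long time, and keep the point below about the site owners' wishes in mind before running it.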
On the whole, it sounds like you are going against the intentions of the site owners if you are not accessing the data through a mechanism the site offers to its ordinary users. That is not a bad thing in and of itself: interesting results can come from unintended uses that the originators never thought of and that aren't necessarily against their wishes. (You might ask them whether they have any objections.)