Mars Web Scraping Analysis


Summary

The task is to write a program that scrapes NASA's website, and several others, to retrieve Mars news, images, weather, and facts. Doing this requires Splinter, BeautifulSoup, and pandas.


Splinter is a library that works with a program called "chromedriver" to drive an instance of Google Chrome through commands run from the console or a Jupyter Notebook. To run Splinter, chromedriver must be installed on the computer, and we must be able to reference the folder that contains it. In the bootcamp, we were instructed to copy the chromedriver file into the folder of each assignment or exercise, but I placed it in a global folder much higher up so that I can reference it easily at any time.
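
As a minimal sketch of the "global folder" approach, the helper below builds the keyword arguments passed to Splinter's Browser from one shared location. The names `CHROMEDRIVER_DIR` and `chromedriver_kwargs` are assumptions made for illustration; the folder mirrors the path used in the Solution. The raw string keeps the backslashes in the Windows path from being read as escape sequences.

```python
from pathlib import PureWindowsPath

# Hypothetical shared location; adjust to wherever chromedriver.exe lives.
CHROMEDRIVER_DIR = r"C:\ChromeSafe"


def chromedriver_kwargs(folder=CHROMEDRIVER_DIR):
    """Return keyword arguments in the form used by the Solution's Browser call."""
    driver_path = PureWindowsPath(folder) / "chromedriver.exe"
    return {"executable_path": str(driver_path)}


kwargs = chromedriver_kwargs()
print(kwargs)

# Usage, as in the Solution (needs Splinter, Chrome, and chromedriver installed):
# from splinter import Browser
# browser = Browser("chrome", **kwargs, headless=False)
```

Keeping the path in one place means every notebook can import the same helper rather than carrying its own copy of the driver.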


We set the URLs of the websites of interest that we want to scrape live. Then we use BeautifulSoup to isolate the "src=..." attribute and so extract the source of the image we want to use.
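
The idea of isolating "src=..." can be tried offline on a small piece of HTML. This is only a sketch: the markup below is a made-up stand-in for a scraped page, and only the selector and the src value mirror the Solution.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a scraped page; only the selector and the src value
# mirror the Solution.
html = """
<figure class="lede">
  <a href="#">
    <img src="/spaceimages/images/largesize/PIA17793_hires.jpg" alt="">
  </a>
</figure>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first matching tag, or None when nothing matches.
img = soup.select_one("figure.lede a img")
partial_url = img.get("src") if img is not None else None
print(partial_url)
```

Checking for None before calling get avoids an AttributeError if the page layout changes and the selector stops matching.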


pandas makes it especially easy to scrape HTML tables from websites we find interesting, so it was used instead of BeautifulSoup wherever possible.
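
A small offline sketch of table scraping with pandas follows. The markup is a made-up stand-in that reuses two rows from the table scraped in the Solution, and it assumes an HTML parser library such as lxml is installed for read_html to use.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in markup; the two rows are copied from the Solution's table.
html = """
<table>
  <tr><td>Equatorial Diameter:</td><td>6,792 km</td></tr>
  <tr><td>Polar Diameter:</td><td>6,752 km</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
table_df = tables[0]
table_df.columns = ["Parameter", "Value"]
print(table_df)
```

Wrapping the markup in StringIO lets the same call work on text already in memory, for example the browser.html string captured by Splinter.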


A loop was used together with chromedriver to retrieve all of the hemisphere images. The loop selects on the h3 tag, which the results page uses only for the categories of interest.
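
The reasoning behind the h3 selector can be illustrated offline with BeautifulSoup. This is a sketch and not the Splinter loop itself: the titles come from the Solution's output, while the surrounding markup and hrefs are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the search results page. The titles come from the
# Solution's output; the surrounding markup and hrefs are hypothetical.
html = """
<div class="results">
  <a class="product-item" href="#cerberus"><h3>Cerberus Hemisphere Enhanced</h3></a>
  <a class="product-item" href="#schiaparelli"><h3>Schiaparelli Hemisphere Enhanced</h3></a>
  <a class="other" href="#about"><span>About</span></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# In this stand-in, h3 tags appear only inside the result links, so the
# selector yields exactly one element per category.
titles = [h3.get_text(strip=True) for h3 in soup.select("a.product-item h3")]
print(titles)
```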


Solution



In [42]:
from splinter import Browser
from splinter.exceptions import ElementDoesNotExist
from bs4 import BeautifulSoup
import pandas as pd
In [3]:
# Raw string so the backslashes in the Windows path are not treated as escapes
executable_path = {'executable_path': r'C:\ChromeSafe\chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)
In [4]:
url = 'https://mars.nasa.gov/news/'
browser.visit(url)
In [5]:
html = browser.html
soup = BeautifulSoup(html, 'html.parser')
In [6]:
soup_chunk_1 = soup.select_one('ul.item_list li.slide')
first_news_title = soup_chunk_1.find("div", class_='content_title').get_text()
first_news_body = soup_chunk_1.find("div", class_='article_teaser_body').get_text()
In [7]:
print(first_news_title)
print(first_news_body)
Why This Martian Full Moon Looks Like Candy
For the first time, NASA's Mars Odyssey orbiter has caught the Martian moon Phobos during a full moon phase. Each color in this new image represents a temperature range detected by Odyssey's infrared camera.
In [8]:
url2 = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'
browser.visit(url2)
In [9]:
first_click = browser.find_by_id('full_image')
first_click.click()
In [10]:
second_click = browser.find_link_by_partial_text('more info')
second_click.click()
In [11]:
html2 = browser.html
soup2 = BeautifulSoup(html2, 'html.parser')
In [13]:
partial_url = soup2.select_one('figure.lede a img').get('src')
print(partial_url)
/spaceimages/images/largesize/PIA17793_hires.jpg
In [14]:
full_url = f'https://www.jpl.nasa.gov{partial_url}'
In [15]:
print(full_url)
https://www.jpl.nasa.gov/spaceimages/images/largesize/PIA17793_hires.jpg
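
As an alternative to building the address with an f-string, the standard library's urljoin can resolve the relative path against the page it was scraped from. This is a sketch using the same two values as the cells above.

```python
from urllib.parse import urljoin

# Page that was visited and the relative src scraped from it (both from above)
base_url = "https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars"
partial_url = "/spaceimages/images/largesize/PIA17793_hires.jpg"

# urljoin keeps the scheme and host of base_url and swaps in the new path
full_url = urljoin(base_url, partial_url)
print(full_url)
```

urljoin also handles the case where the scraped src is already an absolute URL, which plain string concatenation would break.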
In [16]:
url3 = 'https://twitter.com/marswxreport?lang=en'
browser.visit(url3)
html3 = browser.html
soup3 = BeautifulSoup(html3, 'html.parser')
In [17]:
mars_weather_tweet = soup3.find('div', attrs={"class": "tweet", "data-name": "Mars Weather"})

mars_weather_tweet
Out[17]:
<div class="tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable dismissible-content original-tweet js-original-tweet has-cards has-content" data-conversation-id="1128070234047418368" data-disclosure-type="" data-follows-you="false" data-has-cards="true" data-item-id="1128070234047418368" data-name="Mars Weather" data-permalink-path="/MarsWxReport/status/1128070234047418368" data-reply-to-users-json='[{"id_str":"786939553","screen_name":"MarsWxReport","name":"Mars Weather","emojified_name":{"text":"Mars Weather","emojified_text_as_html":"Mars Weather"}}]' data-screen-name="MarsWxReport" data-tweet-id="1128070234047418368" data-tweet-nonce="1128070234047418368-d84c406a-1ba7-4d61-882e-254018b91fc3" data-tweet-stat-initialized="true" data-user-id="786939553" data-you-block="false" data-you-follow="false">
<div class="context">
</div>
<div class="content">
<div class="stream-item-header">
<a class="account-group js-account-group js-action-profile js-user-profile-link js-nav" data-user-id="786939553" href="/MarsWxReport">
<img alt="" class="avatar js-action-profile-avatar" src="https://pbs.twimg.com/profile_images/2552209293/220px-Mars_atmosphere_bigger.jpg"/>
<span class="FullNameGroup">
<strong class="fullname show-popup-with-id u-textTruncate" data-aria-label-part="">Mars Weather</strong><span>‏</span><span class="UserBadges"></span><span class="UserNameBreak"> </span></span><span class="username u-dir u-textTruncate" data-aria-label-part="" dir="ltr">@<b>MarsWxReport</b></span></a>
<small class="time">
<a class="tweet-timestamp js-permalink js-nav js-tooltip" data-conversation-id="1128070234047418368" href="/MarsWxReport/status/1128070234047418368" title="3:51 PM - 13 May 2019"><span class="_timestamp js-short-timestamp" data-aria-label-part="last" data-long-form="true" data-time="1557787876" data-time-ms="1557787876000">May 13</span></a>
</small>
<div class="ProfileTweet-action ProfileTweet-action--more js-more-ProfileTweet-actions">
<div class="dropdown">
<button aria-haspopup="true" class="ProfileTweet-actionButton u-textUserColorHover dropdown-toggle js-dropdown-toggle" type="button">
<div class="IconContainer js-tooltip" title="More">
<span class="Icon Icon--caretDownLight Icon--small"></span>
<span class="u-hiddenVisually">More</span>
</div>
</button>
<div class="dropdown-menu is-autoCentered">
<div class="dropdown-caret">
<div class="caret-outer"></div>
<div class="caret-inner"></div>
</div>
<ul>
<li class="copy-link-to-tweet js-actionCopyLinkToTweet">
<button class="dropdown-link" type="button">Copy link to Tweet</button>
</li>
<li class="embed-link js-actionEmbedTweet" data-nav="embed_tweet">
<button class="dropdown-link" type="button">Embed Tweet</button>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="js-tweet-text-container">
<p class="TweetTextSize TweetTextSize--normal js-tweet-text tweet-text" data-aria-label-part="0" lang="en">InSight sol 163 (2019-05-13) low -99.9ºC (-147.7ºF) high -17.7ºC (0.2ºF)
winds from the SW at 4.3 m/s (9.7 mph) gusting to 15.2 m/s (34.0 mph)
pressure at 7.50 hPa<a class="twitter-timeline-link u-hidden" data-pre-embedded="true" dir="ltr" href="https://t.co/qtElTnSRJj">pic.twitter.com/qtElTnSRJj</a></p>
</div>
<div class="AdaptiveMediaOuterContainer">
<div class="AdaptiveMedia is-square">
<div class="AdaptiveMedia-container">
<div class="AdaptiveMedia-singlePhoto" style="padding-top: calc(0.5625 * 100% - 0.5px);">
<div class="AdaptiveMedia-photoContainer js-adaptive-photo" data-dominant-color="[51,52,64]" data-element-context="platform_photo_card" data-image-url="https://pbs.twimg.com/media/D6e15hvWwAEaB8e.jpg" style="background-color:rgba(51,52,64,1.0);">
<img alt="" data-aria-label-part="" src="https://pbs.twimg.com/media/D6e15hvWwAEaB8e.jpg" style="width: 100%; top: -0px;"/>
</div>
</div>
</div>
</div>
</div>
<div class="stream-item-footer">
<div class="ProfileTweet-actionCountList u-hiddenVisually">
<span class="ProfileTweet-action--reply u-hiddenVisually">
<span class="ProfileTweet-actionCount" data-tweet-stat-count="1">
<span class="ProfileTweet-actionCountForAria" data-aria-label-part="" id="profile-tweet-action-reply-count-aria-1128070234047418368">1 reply</span>
</span>
</span>
<span class="ProfileTweet-action--retweet u-hiddenVisually">
<span class="ProfileTweet-actionCount" data-tweet-stat-count="6">
<span class="ProfileTweet-actionCountForAria" data-aria-label-part="" id="profile-tweet-action-retweet-count-aria-1128070234047418368">6 retweets</span>
</span>
</span>
<span class="ProfileTweet-action--favorite u-hiddenVisually">
<span class="ProfileTweet-actionCount" data-tweet-stat-count="14">
<span class="ProfileTweet-actionCountForAria" data-aria-label-part="" id="profile-tweet-action-favorite-count-aria-1128070234047418368">14 likes</span>
</span>
</span>
</div>
<div aria-label="Tweet actions" class="ProfileTweet-actionList js-actions" role="group">
<div class="ProfileTweet-action ProfileTweet-action--reply">
<button aria-describedby="profile-tweet-action-reply-count-aria-1128070234047418368" class="ProfileTweet-actionButton js-actionButton js-actionReply" data-modal="ProfileTweet-reply" type="button">
<div class="IconContainer js-tooltip" title="Reply">
<span class="Icon Icon--medium Icon--reply"></span>
<span class="u-hiddenVisually">Reply</span>
</div>
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">1</span>
</span>
</button>
</div>
<div class="ProfileTweet-action ProfileTweet-action--retweet js-toggleState js-toggleRt">
<button aria-describedby="profile-tweet-action-retweet-count-aria-1128070234047418368" class="ProfileTweet-actionButton js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Retweet">
<span class="Icon Icon--medium Icon--retweet"></span>
<span class="u-hiddenVisually">Retweet</span>
</div>
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">6</span>
</span>
</button><button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--medium Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">6</span>
</span>
</button>
</div>
<div class="ProfileTweet-action ProfileTweet-action--favorite js-toggleState">
<button aria-describedby="profile-tweet-action-favorite-count-aria-1128070234047418368" class="ProfileTweet-actionButton js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Like">
<span class="Icon Icon--heart Icon--medium" role="presentation"></span>
<div class="HeartAnimation"></div>
<span class="u-hiddenVisually">Like</span>
</div>
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">14</span>
</span>
</button><button class="ProfileTweet-actionButtonUndo ProfileTweet-action--unfavorite u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<span class="Icon Icon--heart Icon--medium" role="presentation"></span>
<div class="HeartAnimation"></div>
<span class="u-hiddenVisually">Liked</span>
</div>
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">14</span>
</span>
</button>
</div>
</div>
</div>
</div>
</div>
In [21]:
mars_weather_tweet_text = mars_weather_tweet.find('p', 'tweet-text').get_text()
print(mars_weather_tweet_text)
InSight sol 163 (2019-05-13) low -99.9ºC (-147.7ºF) high -17.7ºC (0.2ºF)
winds from the SW at 4.3 m/s (9.7 mph) gusting to 15.2 m/s (34.0 mph)
pressure at 7.50 hPapic.twitter.com/qtElTnSRJj
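
The tweet text printed above ends with a pic.twitter.com link fused onto the pressure reading. One possible way to remove it, sketched here with a regular expression, is shown below; the helper name `strip_pic_link` is an assumption for illustration, and the sample string is copied from the output above.

```python
import re

# Tweet text exactly as printed above
raw_tweet = (
    "InSight sol 163 (2019-05-13) low -99.9ºC (-147.7ºF) high -17.7ºC (0.2ºF)\n"
    "winds from the SW at 4.3 m/s (9.7 mph) gusting to 15.2 m/s (34.0 mph)\n"
    "pressure at 7.50 hPapic.twitter.com/qtElTnSRJj"
)


def strip_pic_link(text):
    """Remove a trailing pic.twitter.com/... link from tweet text."""
    return re.sub(r"pic\.twitter\.com/\S+\s*$", "", text).rstrip()


mars_weather = strip_pic_link(raw_tweet)
print(mars_weather)
```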
In [31]:
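def tweets_not_timeline_link(tag):
    # Match tags that have a class attribute but no id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

# Collect every tag inside the tweet that passes the filter above
try1 = mars_weather_tweet.find_all(tweets_not_timeline_link)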
In [51]:
tables = pd.read_html('http://space-facts.com/mars/')
tables
Out[51]:
[                      0                              1
0  Equatorial Diameter:                       6,792 km
1       Polar Diameter:                       6,752 km
2                 Mass:  6.42 x 10^23 kg (10.7% Earth)
3                Moons:            2 (Phobos & Deimos)
4       Orbit Distance:       227,943,824 km (1.52 AU)
5         Orbit Period:           687 days (1.9 years)
6  Surface Temperature:                  -153 to 20 °C
7         First Record:              2nd millennium BC
8          Recorded By:           Egyptian astronomers]
In [55]:
table_df = tables[0]
table_df.columns = ['Parameter','Value']
table_df
Out[55]:
              Parameter                          Value
0  Equatorial Diameter:                       6,792 km
1       Polar Diameter:                       6,752 km
2                 Mass:  6.42 x 10^23 kg (10.7% Earth)
3                Moons:            2 (Phobos & Deimos)
4       Orbit Distance:       227,943,824 km (1.52 AU)
5         Orbit Period:           687 days (1.9 years)
6  Surface Temperature:                  -153 to 20 °C
7         First Record:              2nd millennium BC
8          Recorded By:           Egyptian astronomers
In [56]:
table_df.to_html()
Out[56]:
'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Parameter</th>\n      <th>Value</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Equatorial Diameter:</td>\n      <td>6,792 km</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Polar Diameter:</td>\n      <td>6,752 km</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Mass:</td>\n      <td>6.42 x 10^23 kg (10.7% Earth)</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Moons:</td>\n      <td>2 (Phobos &amp; Deimos)</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Orbit Distance:</td>\n      <td>227,943,824 km (1.52 AU)</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>Orbit Period:</td>\n      <td>687 days (1.9 years)</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>Surface Temperature:</td>\n      <td>-153 to 20 °C</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>First Record:</td>\n      <td>2nd millennium BC</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>Recorded By:</td>\n      <td>Egyptian astronomers</td>\n    </tr>\n  </tbody>\n</table>'
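
The string above includes the numeric index column and literal "\n" characters. If those are not wanted when the table is embedded in a web page, one option is sketched below; the two-row DataFrame is a stand-in built from the first rows of the table above.

```python
import pandas as pd

# Stand-in for table_df, using the first two rows shown above
table_df = pd.DataFrame(
    {
        "Parameter": ["Equatorial Diameter:", "Polar Diameter:"],
        "Value": ["6,792 km", "6,752 km"],
    }
)

# index=False drops the 0, 1, 2, ... row labels;
# replacing "\n" collapses the markup onto a single line
html_table = table_df.to_html(index=False).replace("\n", "")
print(html_table)
```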
In [35]:
url4 = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'
browser.visit(url4)
html4 = browser.html
soup4 = BeautifulSoup(html4, 'html.parser')
In [36]:
# First, get a list of all of the hemispheres
links = browser.find_by_css("a.product-item h3")
In [37]:
print(links)
[<splinter.driver.webdriver.WebDriverElement object at 0x0000020E54E6F278>, <splinter.driver.webdriver.WebDriverElement object at 0x0000020E553DB2E8>, <splinter.driver.webdriver.WebDriverElement object at 0x0000020E553DB630>, <splinter.driver.webdriver.WebDriverElement object at 0x0000020E553DB5C0>]
In [39]:
hemisphere_image_urls = []

# First, get a list of all of the hemispheres
links = browser.find_by_css("a.product-item h3")

# Next, loop through those links, click the link, find the sample anchor, return the href
for i in range(len(links)):
    hemisphere = {}

    # We have to find the elements on each loop to avoid a stale element exception
    browser.find_by_css("a.product-item h3")[i].click()

    # Next, we find the Sample image anchor tag and extract the href
    sample_elem = browser.find_link_by_text('Sample').first
    hemisphere['img_url'] = sample_elem['href']

    # Get Hemisphere title
    hemisphere['title'] = browser.find_by_css("h2.title").text

    # Append hemisphere object to list
    hemisphere_image_urls.append(hemisphere)

    # Finally, we navigate backwards
    browser.back()
In [40]:
hemisphere_image_urls
Out[40]:
[{'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg',
'title': 'Cerberus Hemisphere Enhanced'},
{'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg',
'title': 'Schiaparelli Hemisphere Enhanced'},
{'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif/full.jpg',
'title': 'Syrtis Major Hemisphere Enhanced'},
{'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_marineris_enhanced.tif/full.jpg',
'title': 'Valles Marineris Hemisphere Enhanced'}]