
Is DeepSeek AI Really THAT Good? I Tested It vs Claude and o1


But we all know how the internet works. Whenever some shiny new AI toy is released, there’s always a lot of chatter and excitement. Sometimes it lives up to the hype. Other times it fizzles out and becomes a historical artifact.

Where will DeepSeek end up?

Only time will tell. But if you’re curious about it and want to get an overview of what it’s capable of without testing it yourself, then you’re in the right place. After reading about ten articles on how amazing it is, I decided to take it for a spin. Keep reading to learn more about the experiments I ran and what I discovered about DeepSeek’s strengths and limitations.

Oh, and there was that “censorship feature” as well. 😱

My testing setup and methodology

I spent a few hours of quality time with DeepSeek this past Sunday and ran it through a battery of tests. For some of the more technical ones, I asked Claude 3.5 Sonnet to generate a prompt for me and fed that prompt to both DeepSeek and GPT-o1. In addition, I asked Claude to produce an answer to its own prompt.

I then read the individual responses, and for an even deeper insight, I cross-referenced them by giving each model the answers of the other two. Reading through all the analyses gave me a strong understanding of each model’s capabilities.

As for the tests, I tried to make each one a different type so we could see how DeepSeek handles knowledge across subjects and task types. These were the tests:

  • Solving a complex mathematical problem that requires a high level of reasoning capability.
  • Generating Python code to create a program that analyzes a string of text and builds a word ladder.
  • Identifying common scientific misconceptions and explaining them to a middle schooler.
  • Answering questions about political leaders, historical figures, and atrocities.

⚠️ A QUICK DISCLAIMER BEFORE WE GET INTO IT

My testing, while relatively thorough for one person on a Sunday afternoon tinkering with AI, is still exactly that. It’s one person spending a few hours running tests. AI systems are complex and frequently updated, so your experience might differ from what I observed. Nonetheless, I think you’ll find the results interesting – particularly because they don’t fully line up with the current media hype.

Mathematical reasoning

To test DeepSeek’s mathematical reasoning abilities, I asked Claude 3.5 Sonnet to create a challenging optimization puzzle. The prompt needed to be complex enough to require careful analysis, but straightforward enough that we could easily verify the correct answer.

Claude designed a scenario about a toy factory with three production lines and a limited pool of workers.

The results

GPT-o1 wrote the most comprehensive solution, methodically explaining multiple valid ways to reach the 1,080-toy maximum. It spotted that Lines A and C produced 60 toys per worker-hour, while Line B lagged at 50 – a crucial insight that DeepSeek missed entirely.

GPT-o1's solution to the toy factory problem.

Claude’s solution, while reaching the same correct number, took a more direct route. It identified the most efficient lines and allocated workers accordingly, but it didn’t explore alternative ways to arrive at 1,080 like GPT did.

Claude Sonnet's solution to the toy factory problem.

DeepSeek moved fast but arrived at a suboptimal answer of 900 toys per hour. It did excel in presentation, though: its response came formatted with clean headers and precise mathematical notation. Unfortunately, neat formatting couldn’t compensate for the fundamental flaw in its reasoning.

DeepSeek's solution to the toy factory problem.
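
By the way, you don’t need an AI to double-check an answer like this. Claude’s full puzzle text isn’t reproduced here, so the numbers below are my own placeholder assumptions chosen to line up with the figures above (18 workers in total, per-worker rates of 60, 50, and 60 toys per hour for Lines A, B, and C, and no per-line limits). Under those assumptions, a brute-force check takes only a few lines of Python:

from itertools import product

# Assumed parameters (not the exact constraints from Claude's prompt):
# 18 workers total; Lines A and C produce 60 toys per worker-hour, Line B 50.
RATES = {"A": 60, "B": 50, "C": 60}
TOTAL_WORKERS = 18

best_output = 0
best_allocations = []

# Try every possible split of the workers across the three lines.
for a, b in product(range(TOTAL_WORKERS + 1), repeat=2):
    c = TOTAL_WORKERS - a - b
    if c < 0:
        continue
    output = a * RATES["A"] + b * RATES["B"] + c * RATES["C"]
    if output > best_output:
        best_output = output
        best_allocations = [(a, b, c)]
    elif output == best_output:
        best_allocations.append((a, b, c))

print(f"Maximum output: {best_output} toys per hour")
print(f"Optimal allocations found: {len(best_allocations)}")

With those assumed numbers, every allocation that keeps workers off the slower Line B reaches the same 1,080-toy maximum, which mirrors GPT-o1’s observation that multiple allocations hit the ceiling.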

Coding aptitude

For the next test, I once again turned to Claude for assistance in generating a coding challenge. It created a classic computer science problem – building a word ladder. The challenge required finding the shortest chain of words connecting two four-letter words, changing only one letter at a time. For example, turning “COLD” into “WARM” through valid intermediate words.

Since there was more than one way to arrive at the solution, it made for some interesting results. Each of the three tools tackled it in its own way, but the winner was once again…

The results

GPT-o1’s code stood out as the most polished of the three. By generating word connections on-demand rather than preprocessing everything, it struck a balance between efficiency and memory usage. The extensive documentation and clean organization made it feel like something you’d find in a professional codebase.

Here’s GPT-o1’s code:

from collections import deque


VALID_WORDS = {
    "COLD", "CORD", "CORE", "CARE", "CARD", "WARD", "WARM", "WORM",
    "WORD", "WOOD", "WOOL", "POOL", "POLL", "POLE", "PALE", "SALE",
    "SAFE", "SAGE", "WAGE", "WADE", "WARE", "WORE", "PORE", "PORT",
    "PART", "CART", "CAST", "COST", "COAT", "CHAT", "THAT", "THAN"
}

def differ_by_one_letter(word1, word2):
    """
    Checks if two words differ by exactly one letter.
    Returns True if they differ by one letter, otherwise False.
    """
    differences = 0
    for c1, c2 in zip(word1, word2):
        if c1 != c2:
            differences += 1
            if differences > 1:
                return False
    return differences == 1

def get_neighbors(word):
    """
    Given a word, return all valid dictionary words that
    differ by exactly one letter.
    """
    neighbors = []
    for candidate in VALID_WORDS:
        if differ_by_one_letter(word, candidate):
            neighbors.append(candidate)
    return neighbors

def find_shortest_ladder(start_word, end_word):
    """
    Uses BFS to find the shortest ladder from start_word to end_word.
    If no ladder is possible, returns an empty list.
    """
    
    if start_word not in VALID_WORDS or end_word not in VALID_WORDS:
        return []

    
    queue = deque([[start_word]])
    visited = set([start_word])

    while queue:
        path = queue.popleft()
        current_word = path[-1]

        
        if current_word == end_word:
            return path

        
        for neighbor in get_neighbors(current_word):
            if neighbor not in visited:
                visited.add(neighbor)
                new_path = path + [neighbor]
                queue.append(new_path)

    
    return []

def main():
    """
    Main function to read user input, find the shortest ladder,
    and print the result.
    """
    start = input("Enter a 4-letter start word: ").strip().upper()
    end = input("Enter a 4-letter end word: ").strip().upper()

    ladder = find_shortest_ladder(start, end)

    if ladder:
        print("Shortest ladder from", start, "to", end, "is:")
        print(" → ".join(ladder))
    else:
        print(f"No valid path from {start} to {end} in the dictionary.")

if __name__ == "__main__":
    main()

Claude’s solution preprocessed the entire word graph before searching. It prioritized speed for multiple queries at the cost of initial memory usage. The approach was akin to reading an instruction manual before beginning a project, taking more time upfront, but making the process faster once started.

Here’s Claude’s code:

from collections import deque, defaultdict

def build_word_graph(word_list):
    """
    Creates a graph where each word is connected to other words that differ by one letter.
    Returns a dictionary where keys are words and values are lists of neighboring words.
    """
    
    graph = defaultdict(list)
    
    
    for word1 in word_list:
        for word2 in word_list:
            if word1 != word2:
                
                differences = sum(1 for c1, c2 in zip(word1, word2) if c1 != c2)
                
                if differences == 1:
                    graph[word1].append(word2)
    
    return graph

def find_word_ladder(start, end, word_list):
    """
    Finds the shortest path from start word to end word using BFS.
    Returns the path as a list of words if found, None if no path exists.
    """
    
    if start not in word_list or end not in word_list:
        return None
    
    
    graph = build_word_graph(word_list)
    
    
    queue = deque([(start, [start])])
    
    visited = {start}
    
    while queue:
        current_word, path = queue.popleft()
        
        
        for next_word in graph[current_word]:
            if next_word == end:
                
                return path + [end]
            
            if next_word not in visited:
                visited.add(next_word)
                queue.append((next_word, path + [next_word]))
    
    
    return None

def main():
    
    word_list = {
        "COLD", "CORD", "CORE", "CARE", "CARD", "WARD", "WARM", "WORM", 
        "WORD", "WOOD", "WOOL", "POOL", "POLL", "POLE", "PALE", "SALE", 
        "SAFE", "SAGE", "WAGE", "WADE", "WARE", "WORE", "PORE", "PORT", 
        "PART", "CART", "CAST", "COST", "COAT", "CHAT", "THAT", "THAN"
    }
    
    
    test_cases = [
        ("COLD", "WARM"),
        ("POOL", "SAGE"),
        ("WOOL", "WAGE")
    ]
    
    for start, end in test_cases:
        path = find_word_ladder(start, end, word_list)
        if path:
            print(f"Path from {start} to {end}:")
            print(" → ".join(path))
        else:
            print(f"No path exists from {start} to {end}")
        print()

if __name__ == "__main__":
    main()

Finally, DeepSeek’s approach, while functional, lacked the sophistication of the other two. It took a more direct path to solving the problem but missed opportunities for optimization and error handling.

Here’s DeepSeek’s code:

from collections import deque


valid_words = {
    "COLD", "CORD", "CORE", "CARE", "CARD", "WARD", "WARM", "WORM", "WORD", "WOOD", 
    "WOOL", "POOL", "POLL", "POLE", "PALE", "SALE", "SAFE", "SAGE", "WAGE", "WADE", 
    "WARE", "WORE", "PORE", "PORT", "PART", "CART", "CAST", "COST", "COAT", "CHAT", 
    "THAT", "THAN"
}

def is_one_letter_diff(word1, word2):
    """Check if two words differ by exactly one letter."""
    diff_count = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            diff_count += 1
            if diff_count > 1:
                return False
    return diff_count == 1

def find_word_ladder(start, end):
    """Find the shortest word ladder from start to end using BFS."""
    if start not in valid_words or end not in valid_words:
        return None  
    
    queue = deque()
    queue.append([start])  
    
    visited = set()
    visited.add(start)  
    
    while queue:
        path = queue.popleft()
        current_word = path[-1]
        
        
        if current_word == end:
            return path
        
        
        for word in valid_words:
            if word not in visited and is_one_letter_diff(current_word, word):
                visited.add(word)
                new_path = list(path)
                new_path.append(word)
                queue.append(new_path)
    
    return None  


start_word = "COLD"
end_word = "WARM"
ladder = find_word_ladder(start_word, end_word)

if ladder:
    print(" → ".join(ladder))
else:
    print(f"No valid ladder found from {start_word} to {end_word}.")Code language: PHP (php)

Scientific understanding

To test DeepSeek’s ability to explain complex concepts clearly, I gave all three AIs eight common scientific misconceptions and asked them to correct them in language a middle school student could understand. The topics ranged from basic physics (why metal feels colder than wood) to astronomy (what causes Earth’s seasons).

Giving DeepSeek a prompt about science.

Calling a clear winner on this test was challenging because even though there were objective facts at play, there were also subjective nuances. Obviously science is science, so if something is not factually sound, that’s one thing. But if you look at the prompt, I set a target audience here – middle school students. This means that I wasn’t only looking for accuracy, but also delivery. And therein lies the subjective element.

The results

Overall, all three models excelled in their own way; rather than one being clearly better than the others, each had its own strengths and weaknesses. There were also some mild disagreements.

GPT-o1 delivered a rapid, well-structured response. Each explanation flowed logically from identifying the error to providing the correct science, using relevant examples like comparing heat energy in a hot cup versus a cool swimming pool.

It mostly maintained middle-school-appropriate language. However, there were a few terms that I’m not sure every middle schooler would understand (e.g., thermal equilibrium, thermal conductor). Then again, it’s been a very long time since I was that age, so what do I know?! 🤷‍♂️

GPT-o1's response to a scientific question.

Claude matched GPT-o1’s scientific accuracy but took a more systematic approach. Rather than providing standalone explanations, it built connections between concepts. For example, it illustrated how understanding thermal conductivity helps explain both why metal feels cold and how heat moves through different materials. The only minor drawback I found was the same as with GPT-o1: I wasn’t entirely convinced that all of the explanations were written at a middle school level.

Claude's response to a science question.

DeepSeek’s responses were scientifically accurate but sometimes repetitive without adding depth. When explaining warm air rising, for instance, it restated the same basic concept three times instead of building toward deeper understanding. On the plus side, it did excel at keeping technical language simple and accessible. I felt that it came the closest to that middle school level that both GPT-o1 and Claude seemed to overshoot.

DeepSeek's response to a science question.
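
Since both GPT-o1 and Claude leaned on the term thermal conductivity for the metal-versus-wood question, it’s worth putting rough numbers on it. These are approximate room-temperature values I’m quoting from memory, so treat them as ballpark figures:

# Approximate room-temperature thermal conductivities in W/(m*K).
CONDUCTIVITY = {
    "copper": 400,
    "stainless steel": 16,
    "glass": 1.0,
    "pine wood": 0.12,
}

# Both surfaces sit at the same room temperature; the one that pulls heat
# out of your hand faster simply feels colder.
for material, k in CONDUCTIVITY.items():
    ratio = k / CONDUCTIVITY["pine wood"]
    print(f"{material:15s} moves heat ~{ratio:,.0f}x faster than pine")

That hundredfold-plus gap is the whole story: the metal and the wood are at the same temperature, but the metal drains heat from your skin far faster, so it feels colder.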

Nuances in conclusions

You may have noticed that I deliberately chose to use the same response for all three screenshots above. It’s because this particular one had the most “disagreement.” GPT and Claude said similar things but drew opposite conclusions, while DeepSeek didn’t even mention certain elements that the other two did.

GPT said the statement was correct because pressure and salt can make water boil at higher temperatures, which leads to faster cooking. Claude declared it incorrect. Even though it noted that pressure cookers can achieve higher cooking temperatures, it considered pressure as an external factor and not applicable to the original statement.

DeepSeek’s explanation, while not incorrect, missed these practical details entirely. It simply stated that cooking speed depends on water temperature without exploring any of the factors that could actually change that temperature. Nonetheless, I still think that DeepSeek had a strong showing on this test. Certainly stronger than on the previous tests.
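
As a side note on the salt part of that disagreement, the effect is easy to quantify. Boiling-point elevation follows the formula delta_T = i * Kb * m, and for water Kb is roughly 0.512 °C·kg/mol. The quick estimate below assumes about a tablespoon of salt (roughly 18 g) in a liter of water, which is my own assumption rather than anything from the prompt:

# Rough boiling-point elevation for salted cooking water: dT = i * Kb * m
KB_WATER = 0.512         # ebullioscopic constant of water, in °C·kg/mol
MOLAR_MASS_NACL = 58.44  # g/mol
VANT_HOFF_NACL = 2       # NaCl dissociates into two ions

salt_grams = 18.0        # about a tablespoon (assumed amount)
water_kg = 1.0

molality = (salt_grams / MOLAR_MASS_NACL) / water_kg
delta_t = VANT_HOFF_NACL * KB_WATER * molality
print(f"Boiling point rises by about {delta_t:.2f} °C")  # roughly 0.3 °C

So salt really does raise the boiling temperature, just not by enough to notice on the stove, which may be part of why the models could look at the same facts and land on different verdicts.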

DeepSeek’s political boundaries

The final test I ran was more of a “just for fun” experiment to see how DeepSeek would react when asked about Chinese history and Chinese leaders versus those outside of China. I didn’t compare these responses with GPT-o1 or Claude. It was more to see how censorship works on the platform.

Bill Clinton versus Chinese leaders

I began by giving DeepSeek a very neutral, informative query: Tell me about Bill Clinton.

And it responded with an equally neutral, very balanced take on the former U.S. president:

DeepSeek responding to an informative query about Bill Clinton.

All things considered, it was a standard AI response. I followed up by giving it the exact same prompt, but swapping Bill Clinton out for Jiang Zemin – who was president of China during roughly the same period that Clinton served. I immediately got back the following message: Sorry, that’s beyond my current scope. Let’s talk about something else.

I wanted to see how many presidents back this censorship would go, so I repeated the same prompt for the president before Jiang Zemin.

And then the one before him, and so on:

Asking DeepSeek about Jiang Zemin, Zhao Ziyang, Hu Yaobang and Chen Yun.

I was convinced the pattern would continue ad infinitum. But then all of a sudden, when I asked about Chen Yun, it gave me a complete answer – just like it did for Bill Clinton.

At that point I was surprised but also curious to see what it would do for the next preceding Chinese leader – Deng Xiaoping:

Asking DeepSeek about Deng Xiaoping.

And it was right back to censorship mode. I played around with a few more Chinese figures from recent history, mixing those who had more of an economic impact on China with those who can be considered more political. The results continued to surprise me as I couldn’t find a clear pattern or possible criteria that DeepSeek might be using to decide which people to censor and which to allow.

Historical atrocities

For this test, I wanted to see if DeepSeek handled historical events differently than political figures. I started with an event outside China’s borders – the My Lai massacre committed by American troops during the Vietnam War. DeepSeek provided a thorough, balanced account similar to its Bill Clinton biography.

Next, I asked about the Tiananmen Square massacre. As expected, DeepSeek immediately gave its standard deflection: Sorry, that’s beyond my current scope. Let’s talk about something else.

But then a funny thing happened.

I followed up my Tiananmen query by asking about Mao Zedong’s Cultural Revolution. DeepSeek initially produced an entire full-length response about the decade-long sociopolitical movement. Then, a few seconds later, the text vanished, replaced by the same deflection message from before:

DeepSeek censoring itself in real time.

This behavior suggests at least two distinct filtering systems, which I’ve sketched in a rough toy snippet after this list:

  • An immediate block for certain keywords or phrases.
  • A secondary review that catches potentially sensitive content even after it’s been generated.
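
To be clear, I have no visibility into how DeepSeek actually implements any of this. The snippet below is a purely hypothetical toy sketch of the two-layer pattern the behavior suggests: a cheap keyword check before generation, plus a slower review pass that can retract text even after it has been streamed to the user.

REFUSAL = "Sorry, that's beyond my current scope. Let's talk about something else."
BLOCKED_KEYWORDS = {"example blocked phrase"}  # hypothetical pre-generation list

def pre_filter(prompt):
    """Layer 1: cheap keyword match on the prompt, before anything is generated."""
    return any(keyword in prompt.lower() for keyword in BLOCKED_KEYWORDS)

def post_review(answer):
    """Layer 2: slower review of the finished text (in practice, likely a classifier)."""
    return "sensitive topic" in answer.lower()  # placeholder stand-in check

def respond(prompt, generate):
    if pre_filter(prompt):
        return REFUSAL             # immediate block, nothing is generated
    answer = generate(prompt)      # in a real chat UI this text may already be streaming
    if post_review(answer):
        return REFUSAL             # retract the answer after the fact
    return answer

A setup like that would explain both behaviors I saw: some prompts are refused instantly, while others produce a full answer that disappears a few seconds later.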

I continued by trying several variations of historically sensitive topics, alternating between Chinese and non-Chinese events. The pattern held: immediate blocks for some topics, delayed censorship for others, and full responses for non-Chinese historical events.

Final thoughts

In my humble opinion, DeepSeek is not the GPT killer that it was made out to be all last week – at least not yet. Don’t get me wrong, it’s impressive, especially considering the vast funding gap and limited resources they worked with to create it.

But as you saw for yourself, GPT-o1 dominated the tests I ran, and Claude did well too. Though the Claude results were also influenced by the fact that it created most of the tests, so it’s not exactly a fair comparison in that sense.

I’m also not naive; I realize that my tiny little experiment here isn’t the equivalent of the rigorous benchmarks these AI models are officially put through. On the flip side, it does give you a sense of the “average user” experience. And that has its own inherent value.

Have you tried DeepSeek yet? What do you think? Is it over-hyped or is it really that good? Let me know in the comments. I’ll see you there.


